Crawl4AI is a web scraping library that extracts web data in an efficient, structured way. It was designed as a modern alternative to popular tools like Scrapy, BeautifulSoup, and Selenium, offering features such as:
✅ Ease of use – Provides a simple API to configure and run scrapers without complications.
✅ JavaScript support – Can render dynamic pages, essential for websites that load content via AJAX.
✅ Anti-blocking mechanisms – Supports proxies, rotating user-agents and automatic delays to avoid detection and blocks.
✅ LLM usage – Allows you to process the collected data with LLMs.
But does Crawl4AI deliver all of this in practice?
To answer this question, I ran a test: I created a scraper to collect information about a Pokémon from an online Pokédex. Let's check the results!
Test Scraper
The challenge was to build a simple scraper to extract detailed information about a Pokémon directly from the Pokédex page. The data I tried to collect was:
id → Pokémon ID in the Pokédex
name → Pokémon name
height → Height
weight → Weight
category → Pokémon category
abillities → Pokémon abilities
types → Pokémon types
weakness → Pokémon weaknesses
image_src → Pokémon image URL
The complete code is available in the repository: https://github.com/subipranuvem/crawl4ai-test.
In the next section, I will detail the positive and negative points I found while using Crawl4AI in this test.
Positive Points
Installation
Installation is really easy and gave me no problems; just follow the documentation:
pip install crawl4ai && \
crawl4ai-setup && \
crawl4ai-doctor
It's quite simple, but remember to execute these steps if you have to build a Docker image with Crawl4AI.
Markdown Output
If you need a structured result in Markdown, then this library will do magic for you!
This simple piece of code already outputs Markdown with all the information we need about the Pokémon:
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://sg.portal-pokemon.com/play/pokedex/0981",
            cache_mode=CacheMode.DISABLED,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.DISABLED,
                simulate_user=True,
                magic=True,
            ),
        )
        if result.markdown_v2:
            print(result.markdown_v2)

if __name__ == "__main__":
    asyncio.run(main())
Result:
...
[  ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/0980>) 0980 Clodsire
0981 <---------------- id
Farigiraf <---------------- name
[  ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/0982>) Dudunsparce 0982
  
Type
[ Normal ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/normal#result>) <---------------- type
[ Psychic ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/psychic#result>) <---------------- type
Weakness
[ Bug ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/bug#result>) <---------------- weakness
[ Dark ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/dark#result>) <---------------- weakness
Height 3.2 m <---------------- height
Category Long Neck Pokémon <---------------- category
Weight 160.0 kg <---------------- weight
Gender  / 
Ability Cud Chew  Armor Tail  <---------------- abillities
Versions
...
But the positive points, unfortunately, end here.
Negative Points
During the development of this project, one of the biggest challenges was dealing with the Crawl4AI documentation. Many links were broken and several examples simply didn't work, which made learning and implementation more time-consuming than expected.
Below, I highlight the main problems I encountered while using this library.
Cache Configuration
An unexpected problem arose when running the scraper: even after the program finished and I ran it again, the results stayed the same. This happened because of the default cache configuration, which kept serving the stored data.
To solve this, I had to explicitly disable the cache by adding cache_mode=CacheMode.DISABLED to the crawler configuration:
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    cache_mode=CacheMode.DISABLED,  # <---------- HERE
    config=CrawlerRunConfig(
        cache_mode=CacheMode.DISABLED,
        ...
    ),
    ...,
)
The Crawl4AI documentation suggests using CacheMode.BYPASS, which also avoids this problem.
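A minimal sketch of that variant (same idea, but letting the crawler bypass the cache instead of disabling it entirely):

from crawl4ai import CacheMode, CrawlerRunConfig

# Skip the cache for this run and always fetch fresh data.
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)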
Here, I admit it was a mix of lack of attention and ignorance about how the library's cache works. I didn't imagine that, even after closing and reopening the terminal, the old data would still be returned. This probably happens due to the browser used by the crawler.
This peculiarity only became evident when I removed the abillities_info property from the schema, but it continued to appear in the extracted data, indicating that the response was being reused from the cache.
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonXPathExtractionStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        schema = {
            "name": "Pokémon information",
            "baseSelector": "//div[@class='pokemon-detail']",
            "fields": [
                {
                    "name": "id",
                    "selector": ".//p[@class='pokemon-slider__main-no size-28']",
                    "type": "text",
                },
                ...,
                # {
                #     "name": "abillities_info",
                #     "selector": ".//span[@class='pokemon-info__value pokemon-info__value--body size-14']/span",
                #     "type": "list",
                #     "fields": [
                #         {
                #             "name": "item",
                #             "type": "text",
                #         },
                #     ],
                # },
            ],
        }
        config = CrawlerRunConfig(
            extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
            disable_cache=True,
        )
        result = await crawler.arun(
            url="https://sg.portal-pokemon.com/play/pokedex/0981",
            config=config,
        )
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

if __name__ == "__main__":
    asyncio.run(main())
# Output
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Time: 0.01s
[COMPLETE] ● https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Total: 0.01s
[
  {
    "id": "0981",
    "name": "Farigiraf",
    "height": "3.2 m",
    "weight": "160.0 kg",
    "category": "Long Neck Pok\u00e9mon",
    "abillities_info": [],  # <---------- This is not set in schema
    "image_src": ""
  }
]
Non-intuitive Models
Data extraction with XPath in Crawl4AI requires some "tricks" that complicate the creation of a cleaner and more intuitive data model.
An example of this is the extraction of lists. Instead of allowing a simple list of strings, the library requires the data to be structured as a list of objects, making the model more verbose and unnecessarily complex.
See the problem in practice:
❌ Model required by Crawl4AI (less elegant)
# Schema
schema = {
    "fields": [
        {
            "name": "types",
            "selector": ".//div[contains(@class,'pokemon-type__type')]//span",
            "type": "list",
            "fields": [  # Unnecessary...
                {
                    "name": "item",
                    "type": "text",
                },
            ],
        },
    ]
}
🔍 Result obtained
{
  "types": [
    {
      "item": "Normal"
    },
    {
      "item": "Psychic"
    }
  ]
}
✅ Expected result (cleaner and more intuitive)
{
  "types": [
    "Normal",
    "Psychic"
  ]
}
This behavior forces the developer (you and me) to post-process the data to obtain a more suitable format, which could be avoided with a more flexible approach by the library.
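The post-processing itself is trivial, but it's still an extra step. A minimal sketch of what I mean (the helper name is mine, not part of the library):

# Flatten Crawl4AI's list-of-objects output into a plain list of strings.
def flatten_items(items: list[dict]) -> list[str]:
    return [entry["item"] for entry in items]

raw = {"types": [{"item": "Normal"}, {"item": "Psychic"}]}
clean = {"types": flatten_items(raw["types"])}
print(clean)  # {'types': ['Normal', 'Psychic']}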
Inconsistent Examples
Throughout the scraper development, I encountered small but frustrating challenges due to inconsistencies in the Crawl4AI documentation. Some examples worked perfectly, while others simply didn't run, without any clear explanation as to why.
An example of this is the configuration of the XPath extraction strategy. Depending on where the extraction_strategy parameter is declared, the code can work or fail silently.
❌ Code that DOES NOT work
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
    extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
)
✅ Code that works
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    config=CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
    ),
)
And no, it wasn't a lack of attention. The documentation itself contains contradictory examples, as you can see on the page below:
Example with the extraction_strategy parameter passed alongside CSS extraction.
Example with the extraction_strategy parameter with CSS extraction, but with the configuration inside CrawlerRunConfig.
Unfortunately, the scraper was configured with the example that, for some reason, prevents the correct collection of structured data.
Scraper Obtains Items Only Using Rendered HTML
Now you must be wondering: if the scraper can capture the data directly from the rendered HTML, shouldn't this be a positive point?
Not necessarily. Especially when some information is only available in the raw version of the page, before JavaScript manipulation.
An example of this was when I tried to extract the descriptions of Pokémon abilities. This information was present in the raw HTML, before the JavaScript loaded. However, since the Crawl4AI parser operates on the already rendered HTML, the data had already been removed from the page before it was even processed.
The image below clearly shows that the information exists in the raw HTML:
This data was inside <transition> elements, which are removed as soon as JavaScript kicks in. As a result, it became inaccessible without directly interacting with the rendered version of the page.
My attempt to work around this problem was to modify the DOM via JavaScript, inserting new <p> elements in the hope of making the data visible again and thus being able to extract it. But it didn't work.
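For illustration, that kind of workaround looks roughly like this: CrawlerRunConfig accepts a js_code snippet that runs in the page before extraction. The selector and class name below are placeholders, not the exact code I used:

from crawl4ai import CacheMode, CrawlerRunConfig

# Hypothetical DOM patch: copy the text out of the <transition> nodes into plain
# <p> elements before they disappear, hoping the extractor can still see them.
patch_dom = """
document.querySelectorAll('transition').forEach((node) => {
    const p = document.createElement('p');
    p.className = 'ability-description';  // placeholder class name
    p.textContent = node.textContent;
    document.body.appendChild(p);
});
"""

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code=[patch_dom],
)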
The most frustrating part? If I had just used the requests library combined with lxml, I could have captured this data without any headache.
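A minimal sketch of that alternative, working on the raw HTML before any JavaScript runs. The XPath targeting the <transition> content is illustrative and would need to match the real markup:

import requests
from lxml import html

# Fetch the raw, pre-JavaScript HTML directly.
response = requests.get("https://sg.portal-pokemon.com/play/pokedex/0981", timeout=30)
response.raise_for_status()

tree = html.fromstring(response.content)

# Illustrative XPath: the ability descriptions sat inside <transition> elements
# in the raw HTML, so the exact expression depends on the page structure.
descriptions = [text.strip() for text in tree.xpath("//transition//text()") if text.strip()]
print(descriptions)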
Data Extraction Using LLM (The Magic of Crawl4AI)
The idea was to use this functionality with Google Gemini, since it's an LLM model with free API access.
(More information about request quotas here / How to get your API key here.)
Model Definition
The first step was to define a model using Pydantic:
from pydantic import BaseModel

class Pokemon(BaseModel):
    id: str
    name: str
    height: str
    weight: str
    category: str
    abillities: list[str]
    types: list[str]
    weakness: list[str]
    image_src: str
The Problem with the Documentation
When consulting the Crawl4AI documentation, there was no indication that Gemini was supported.
What it does say, however, is that the framework supports models through LiteLLM.
Great, right? Wrong.
The link provided led to a non-existent page.
I had to resort to Google and find the LiteLLM documentation on my own. There I found a list of supported models... but Gemini was not among them.
A Shot in the Dark
Without knowing if the configuration was correct or not, I proceeded with the following setup:
from crawl4ai import LLMExtractionStrategy

llm_strategy = LLMExtractionStrategy(
    provider="google/gemini-2.0-flash",
    api_token="<your_token>",
    schema=Pokemon.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the informations about the pokémon",
    chunk_token_threshold=1400,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True,
)
The result? Error.
But, ironically, it was the best thing that could have happened.
The error message brought an extremely useful link: docs.litellm.ai/docs/providers, where I finally found support for Google Gemini: docs.litellm.ai/docs/providers/gemini.
Now tell me: wouldn't it have been easier to put this link directly in the documentation? 🤔
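If I read the LiteLLM Gemini page correctly, the fix boils down to the provider prefix. Something along these lines should be the right shape, reusing the Pokemon model and LLMExtractionStrategy from above (double-check the prefix and model name against the current LiteLLM docs):

llm_strategy = LLMExtractionStrategy(
    provider="gemini/gemini-2.0-flash",  # LiteLLM's Gemini prefix, per its provider docs
    api_token="<your_gemini_api_key>",
    schema=Pokemon.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the informations about the pokémon",
)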
Results of Collection with LLM
The collection was painfully slow and inefficient:
Total time: 148 seconds (2 minutes and 28 seconds)
High token consumption: if you pay per token, your wallet will feel it.
The Crawl4AI documentation itself warns about the reasons not to use an LLM for this type of extraction:
✅ High cost
✅ Low speed
✅ Not scalable for thousands/millions of requests
And you know what's most curious?
The data obtained via LLM was practically identical to that extracted by the scraper with XPath – only with a slightly more coherent data model.
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Time: 24.20s
[SCRAPE].. ◆ Processed https://sg.portal-pokemon.com/play/pokedex/0981... | Time: 24ms
[LOG] Call LLM for https://sg.portal-pokemon.com/play/pokedex/0981 - block index: 0
[LOG] Call LLM for https://sg.portal-pokemon.com/play/pokedex/0981 - block index: 1
[LOG] Extracted 1 blocks from URL: https://sg.portal-pokemon.com/play/pokedex/0981 block index: 0
[LOG] Extracted 1 blocks from URL: https://sg.portal-pokemon.com/play/pokedex/0981 block index: 1
[EXTRACT]. ■ Completed for https://sg.portal-pokemon.com/play/pokedex/0981... | Time: 124.10343073799959s
[COMPLETE] ● https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Total: 148.33s
=== Token Usage Summary ===
Type           Count
------------------------------
Completion       368
Prompt        15,632
Total         16,000

=== Usage History ===
Request #   Completion   Prompt    Total
------------------------------------------------
1           155          837       992
2           213          14,795    15,008
[
  {
    "id": "N/A",
    "name": "N/A",
    "height": "N/A",
    "weight": "N/A",
    "category": "N/A",
    "abillities": [],
    "types": [],
    "weakness": [],
    "image_src": "N/A",
    "error": false
  },
  # <---- HERE, almost the same data as XPath Strategy
  {
    "id": "0981",
    "name": "Farigiraf",
    "height": "3.2 m",
    "weight": "160.0 kg",
    "category": "Long Neck Pok\u00e9mon",
    "abillities": [
      "Cud Chew",
      "Armor Tail"
    ],
    "types": [
      "Normal",
      "Psychic"
    ],
    "weakness": [
      "Bug",
      "Dark"
    ],
    "image_src": "https://sg.portal-pokemon.com/play/resources/pokedex/img/pm/566c8ecfa9f9fddf539ca05a7ae8c86ac3465f5b.png",
    "error": false
  }
]
My recommendation? Avoid using LLM for this type of task.
Other Issues
There are still some minor issues I encountered while running other tests, including:
1 - Outdated Docker Image
The current version of the Crawl4AI Docker image is outdated, which can lead to inconsistencies.
The good news is that, according to the official repository, a fix is already on the way.
2 - REST API without XPath Support
The REST API does not support structured data collection via XPath; it is only possible through code. When attempting to query the API passing the json_xpath strategy for extraction, the following error was returned:
{
  "detail": [
    {
      "type": "enum",
      "loc": [
        "body",
        "extraction_config",
        "type"
      ],
      "msg": "Input should be 'basic', 'llm', 'cosine' or 'json_css'",
      "input": "xpath",
      "ctx": {
        "expected": "'basic', 'llm', 'cosine' or 'json_css'"
      }
    }
  ]
}
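For context, the request shape looks roughly like this; the /crawl endpoint, default port 11235, and payload layout are assumptions about the Dockerized REST API and may differ between versions, so adjust them to your setup:

import requests

# Minimal illustrative schema; the full one is in the repository.
schema = {
    "name": "Pokémon information",
    "baseSelector": "//div[@class='pokemon-detail']",
    "fields": [
        {"name": "id", "selector": ".//p[@class='pokemon-slider__main-no size-28']", "type": "text"},
    ],
}

payload = {
    "urls": "https://sg.portal-pokemon.com/play/pokedex/0981",
    "extraction_config": {
        "type": "xpath",  # rejected: only 'basic', 'llm', 'cosine' or 'json_css' are accepted
        "params": {"schema": schema},
    },
}

response = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
print(response.json())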
Final Considerations
Unfortunately, my experience with Crawl4AI was not the best. The documentation is confusing, full of inconsistent examples and external references that simply don't work.
Remember that dilemma: read the documentation for 5 minutes, or spend 5 hours on Google hunting for the error? Well, it doesn't apply here…
Of all the scraping tools I've tested, this was by far the most frustrating. Simple and trivial tasks end up becoming a real puzzle.
But does that mean the library is bad?
Not necessarily. For those who need to extract specific information from a page and process it with an LLM, Crawl4AI can be useful. However, if the goal is to handle large volumes of data, the strategy of using LLMs becomes impractical due to the cost and execution time.
Furthermore, the library is still under development, which means you might waste a lot of time trying to solve problems that other tools, like requests and lxml, would solve without any hassle.
And if the idea is to use an LLM locally, alternatives like Ollama can be even slower, especially if your hardware isn't powerful enough. This only worsens the overall experience.
In its current state, Crawl4AI is not a reliable option for production projects. Methods, APIs, and functionalities are still changing rapidly, which can lead to instability.
My recommendation? Keep an eye on the updates and changelog. Who knows, in the future, with more robust documentation and usability improvements, Crawl4AI might actually be worth it.