Crawl4AI is a web scraping library that extracts web data in an efficient, structured way. It was designed as a modern alternative to popular tools like Scrapy, BeautifulSoup, and Selenium, offering features such as:
✅ Ease of use – Provides a simple API to configure and run scrapers without complications.
✅ JavaScript support – Can render dynamic pages, essential for websites that load content via AJAX.
✅ Anti-blocking mechanisms – Supports proxies, rotating user-agents and automatic delays to avoid detection and blocks.
✅ LLM usage – Allows you to process the collected data with LLMs.
But does Crawl4AI deliver all of this in practice?
To answer this question, I ran a test: I created a scraper to collect information about a Pokémon from an online Pokédex. Let's check the results!
Test Scraper
The challenge was to build a simple scraper to extract detailed information about a Pokémon directly from the Pokédex page. The data I tried to collect was:
id → Pokémon ID in the Pokédex
name → Pokémon name
height → Height
weight → Weight
category → Pokémon category
abillities → Pokémon abilities
types → Pokémon types
weakness → Pokémon weaknesses
image_src → Pokémon image URL
The complete code is available in the repository: https://github.com/subipranuvem/crawl4ai-test.
In the next section, I will detail the positive and negative points I found while using Crawl4AI in this test.
Positive Points
Installation
Installation is really easy and gave me no problems; just follow the documentation:
pip install crawl4ai && \
crawl4ai-setup && \
crawl4ai-doctor
It's quite simple, but remember to execute these steps if you have to build a Docker image with Crawl4AI.
Markdown Output
If you need a structured result in Markdown, then this library will do magic for you!
This simple piece of code already outputs Markdown with all the information we need about the Pokémon:
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://sg.portal-pokemon.com/play/pokedex/0981",
            cache_mode=CacheMode.DISABLED,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.DISABLED,
                simulate_user=True,
                magic=True,
            ),
        )
        if result.markdown_v2:
            print(result.markdown_v2)

if __name__ == "__main__":
    asyncio.run(main())
Result:
...
[  ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/0980>) 0980 Clodsire
0981 <---------------- id
Farigiraf <---------------- name
[  ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/0982>) Dudunsparce 0982
  
Type
[ Normal ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/normal#result>) <---------------- type
[ Psychic ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/psychic#result>) <---------------- type
Weakness
[ Bug ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/bug#result>) <---------------- weakness
[ Dark ](https://sg.portal-pokemon.com/play/pokedex/</play/pokedex/dark#result>) <---------------- weakness
Height 3.2 m <---------------- height
Category Long Neck Pokémon <---------------- category
Weight 160.0 kg <---------------- weight
Gender  / 
Ability Cud Chew  Armor Tail  <---------------- abillities
Versions
...
But the positive points, unfortunately, end here.
Negative Points
During the development of this project, one of the biggest challenges was dealing with the Crawl4AI documentation. Many links were broken and several examples simply didn't work, which made learning and implementation more time-consuming than expected.
Below, I highlight the main problems I encountered while using this library.
Cache Configuration
An unexpected problem arose when running the scraper: even after the program finished and I ran it again, the results stayed the same. This happened because of the default cache configuration, which kept serving the stored data.
To solve this, I had to explicitly disable the cache by adding cache_mode=CacheMode.DISABLED to the crawler configuration:
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    cache_mode=CacheMode.DISABLED,  # <---------- HERE
    config=CrawlerRunConfig(
        cache_mode=CacheMode.DISABLED,
        ...
    ),
    ...,
)
The Crawl4AI documentation suggests using CacheMode.BYPASS, which also avoids this problem.
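A minimal sketch of that variant (same idea, but letting the crawler bypass the cache instead of disabling it entirely):

from crawl4ai import CacheMode, CrawlerRunConfig

# Skip the cache for this run and always fetch fresh data.
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)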
Here, I admit it was a mix of lack of attention and ignorance about how the library's cache works. I didn't imagine that, even after closing and reopening the terminal, the old data would still be returned. This probably happens due to the browser used by the crawler.
This peculiarity only became evident when I removed the abillities_info property from the schema, but it continued to appear in the extracted data, indicating that the response was being reused from the cache.
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonXPathExtractionStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        schema = {
            "name": "Pokémon information",
            "baseSelector": "//div[@class='pokemon-detail']",
            "fields": [
                {
                    "name": "id",
                    "selector": ".//p[@class='pokemon-slider__main-no size-28']",
                    "type": "text",
                },
                ...,
                # {
                #     "name": "abillities_info",
                #     "selector": ".//span[@class='pokemon-info__value pokemon-info__value--body size-14']/span",
                #     "type": "list",
                #     "fields": [
                #         {
                #             "name": "item",
                #             "type": "text",
                #         },
                #     ],
                # },
            ],
        }
        config = CrawlerRunConfig(
            extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
            disable_cache=True,
        )
        result = await crawler.arun(
            url="https://sg.portal-pokemon.com/play/pokedex/0981",
            config=config,
        )
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

if __name__ == "__main__":
    asyncio.run(main())
# Output
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Time: 0.01s
[COMPLETE] ● https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Total: 0.01s
[
  {
    "id": "0981",
    "name": "Farigiraf",
    "height": "3.2 m",
    "weight": "160.0 kg",
    "category": "Long Neck Pok\u00e9mon",
    "abillities_info": [],  # <---------- This is not set in schema
    "image_src": ""
  }
]
Non-intuitive Models
Data extraction with XPath in Crawl4AI requires some "tricks" that complicate the creation of a cleaner and more intuitive data model.
An example of this is the extraction of lists. Instead of allowing a simple list of strings, the library requires the data to be structured as a list of objects, making the model more verbose and unnecessarily complex.
See the problem in practice:
❌ Model required by Crawl4AI (less elegant)
# Schema
schema = {
    "fields": [
        {
            "name": "types",
            "selector": ".//div[contains(@class,'pokemon-type__type')]//span",
            "type": "list",
            "fields": [  # Unnecessary...
                {
                    "name": "item",
                    "type": "text",
                },
            ],
        },
    ]
}
🔍 Result obtained
{
  "types": [
    {
      "item": "Normal"
    },
    {
      "item": "Psychic"
    }
  ]
}
✅ Expected result (cleaner and more intuitive)
{
  "types": [
    "Normal",
    "Psychic"
  ]
}
This behavior forces the developer (you and me) to post-process the data to obtain a more suitable format, which could be avoided with a more flexible approach by the library.
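The post-processing itself is trivial, but it's still an extra step. A minimal sketch of what I mean (the helper name is mine, not part of the library):

# Flatten Crawl4AI's list-of-objects output into a plain list of strings.
def flatten_items(items: list[dict]) -> list[str]:
    return [entry["item"] for entry in items]

raw = {"types": [{"item": "Normal"}, {"item": "Psychic"}]}
clean = {"types": flatten_items(raw["types"])}
print(clean)  # {'types': ['Normal', 'Psychic']}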
Inconsistent Examples
Throughout the scraper development, I encountered small but frustrating challenges due to inconsistencies in the Crawl4AI documentation. Some examples worked perfectly, while others simply didn't run, without any clear explanation as to why.
An example of this is the configuration of the XPath extraction strategy. Depending on where the extraction_strategy parameter is declared, the code can work or fail silently.
❌ Code that DOES NOT work
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
    extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
)
✅ Code that works
result = await crawler.arun(
    url="https://sg.portal-pokemon.com/play/pokedex/0981",
    config=CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True),
    ),
)
And no, it wasn't a lack of attention. The documentation itself contains contradictory examples, as you can see on the page below:
Example with the extraction_strategy parameter passed alongside CSS extraction.
Example with the extraction_strategy parameter with CSS extraction, but with the configuration inside CrawlerRunConfig.
Unfortunately, the scraper was configured with the example that, for some reason, prevents the correct collection of structured data.
Scraper Obtains Items Only Using Rendered HTML
Now you must be wondering: if the scraper can capture the data directly from the rendered HTML, shouldn't this be a positive point?
Not necessarily. Especially when some information is only available in the raw version of the page, before JavaScript manipulation.
An example of this was when I tried to extract the descriptions of Pokémon abilities. This information was present in the raw HTML, before the JavaScript loaded. However, since the Crawl4AI parser operates on the already rendered HTML, the data had already been removed from the page before it was even processed.
The image below clearly shows that the information exists in the raw HTML:
This data was inside <transition> elements, which are removed as soon as JavaScript kicks in. As a result, it became inaccessible without directly interacting with the rendered version of the page.
My attempt to work around this problem was to modify the DOM via JavaScript, inserting new <p> elements in the hope of making the data visible again and thus being able to extract it. But it didn't work.
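For illustration, that kind of workaround looks roughly like this: CrawlerRunConfig accepts a js_code snippet that runs in the page before extraction. The selector and class name below are placeholders, not the exact code I used:

from crawl4ai import CacheMode, CrawlerRunConfig

# Hypothetical DOM patch: copy the text out of the <transition> nodes into plain
# <p> elements before they disappear, hoping the extractor can still see them.
patch_dom = """
document.querySelectorAll('transition').forEach((node) => {
    const p = document.createElement('p');
    p.className = 'ability-description';  // placeholder class name
    p.textContent = node.textContent;
    document.body.appendChild(p);
});
"""

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code=[patch_dom],
)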
The most frustrating part? If I had just used the requests library combined with lxml, I could have captured this data without any headache.
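A minimal sketch of that alternative, working on the raw HTML before any JavaScript runs. The XPath targeting the <transition> content is illustrative and would need to match the real markup:

import requests
from lxml import html

# Fetch the raw, pre-JavaScript HTML directly.
response = requests.get("https://sg.portal-pokemon.com/play/pokedex/0981", timeout=30)
response.raise_for_status()

tree = html.fromstring(response.content)

# Illustrative XPath: the ability descriptions sat inside <transition> elements
# in the raw HTML, so the exact expression depends on the page structure.
descriptions = [text.strip() for text in tree.xpath("//transition//text()") if text.strip()]
print(descriptions)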
Data Extraction Using LLM (The Magic of Crawl4AI)
The idea was to use this functionality with Google Gemini, since it's an LLM model with free API access.
(More information about request quotas here / How to get your API key here.)
Model Definition
The first step was to define a model using Pydantic:
from pydantic import BaseModel

class Pokemon(BaseModel):
    id: str
    name: str
    height: str
    weight: str
    category: str
    abillities: list[str]
    types: list[str]
    weakness: list[str]
    image_src: str
The Problem with the Documentation
When consulting the Crawl4AI documentation, there was no indication that Gemini was supported.
What it does say, however, is that the framework supports models through LiteLLM.
Great, right? Wrong.
The link provided led to a non-existent page.
I had to resort to Google and find the LiteLLM documentation on my own. There I found a list of supported models... but Gemini was not among them.
A Shot in the Dark
Without knowing if the configuration was correct or not, I proceeded with the following setup:
from crawl4ai import LLMExtractionStrategy

llm_strategy = LLMExtractionStrategy(
    provider="google/gemini-2.0-flash",
    api_token="<your_token>",
    schema=Pokemon.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the informations about the pokémon",
    chunk_token_threshold=1400,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True,
)
The result? Error.
But, ironically, it was the best thing that could have happened.
The error message brought an extremely useful link: docs.litellm.ai/docs/providers, where I finally found support for Google Gemini: docs.litellm.ai/docs/providers/gemini.
Now tell me: wouldn't it have been easier to put this link directly in the documentation? 🤔
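If I read the LiteLLM Gemini page correctly, the fix boils down to the provider prefix. Something along these lines should be the right shape, reusing the Pokemon model and LLMExtractionStrategy from above (double-check the prefix and model name against the current LiteLLM docs):

llm_strategy = LLMExtractionStrategy(
    provider="gemini/gemini-2.0-flash",  # LiteLLM's Gemini prefix, per its provider docs
    api_token="<your_gemini_api_key>",
    schema=Pokemon.model_json_schema(),
    extraction_type="schema",
    instruction="Extract the informations about the pokémon",
)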
Results of Collection with LLM
The collection was painfully slow and inefficient:
Total time: 148 seconds (2 minutes and 28 seconds)
High token consumption: if you pay per token, your wallet will feel it.
The Crawl4AI documentation itself warns about the reasons not to use an LLM for this type of extraction:
✅ High cost
✅ Low speed
✅ Not scalable for thousands/millions of requests
And you know what's most curious?
The data obtained via LLM was practically identical to that extracted by the scraper with XPath – only with a slightly more coherent data model.
[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Time: 24.20s
[SCRAPE].. ◆ Processed https://sg.portal-pokemon.com/play/pokedex/0981... | Time: 24ms
[LOG] Call LLM for https://sg.portal-pokemon.com/play/pokedex/0981 - block index: 0
[LOG] Call LLM for https://sg.portal-pokemon.com/play/pokedex/0981 - block index: 1
[LOG] Extracted 1 blocks from URL: https://sg.portal-pokemon.com/play/pokedex/0981 block index: 0
[LOG] Extracted 1 blocks from URL: https://sg.portal-pokemon.com/play/pokedex/0981 block index: 1
[EXTRACT]. ■ Completed for https://sg.portal-pokemon.com/play/pokedex/0981... | Time: 124.10343073799959s
[COMPLETE] ● https://sg.portal-pokemon.com/play/pokedex/0981... | Status: True | Total: 148.33s
=== Token Usage Summary ===
Type           Count
------------------------------
Completion       368
Prompt        15,632
Total         16,000

=== Usage History ===
Request #   Completion   Prompt    Total
------------------------------------------------
1           155          837       992
2           213          14,795    15,008
[
  {
    "id": "N/A",
    "name": "N/A",
    "height": "N/A",
    "weight": "N/A",
    "category": "N/A",
    "abillities": [],
    "types": [],
    "weakness": [],
    "image_src": "N/A",
    "error": false
  },
  # <---- HERE, almost the same data as XPath Strategy
  {
    "id": "0981",
    "name": "Farigiraf",
    "height": "3.2 m",
    "weight": "160.0 kg",
    "category": "Long Neck Pok\u00e9mon",
    "abillities": [
      "Cud Chew",
      "Armor Tail"
    ],
    "types": [
      "Normal",
      "Psychic"
    ],
    "weakness": [
      "Bug",
      "Dark"
    ],
    "image_src": "https://sg.portal-pokemon.com/play/resources/pokedex/img/pm/566c8ecfa9f9fddf539ca05a7ae8c86ac3465f5b.png",
    "error": false
  }
]
My recommendation? Avoid using LLM for this type of task.
Other Issues
There are still some minor issues I encountered while running other tests, including:
1 - Outdated Docker Image
The current version of the Crawl4AI Docker image is outdated, which can lead to inconsistencies.
The good news is that, according to the official repository, a fix is already on the way.
2 - REST API without XPath Support
The REST API does not support structured data collection via XPath; it is only possible through code. When attempting to query the API passing the json_xpath strategy for extraction, the following error was returned:
{
  "detail": [
    {
      "type": "enum",
      "loc": [
        "body",
        "extraction_config",
        "type"
      ],
      "msg": "Input should be 'basic', 'llm', 'cosine' or 'json_css'",
      "input": "xpath",
      "ctx": {
        "expected": "'basic', 'llm', 'cosine' or 'json_css'"
      }
    }
  ]
}
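For context, the request shape looks roughly like this; the /crawl endpoint, default port 11235, and payload layout are assumptions about the Dockerized REST API and may differ between versions, so adjust them to your setup:

import requests

# Minimal illustrative schema; the full one is in the repository.
schema = {
    "name": "Pokémon information",
    "baseSelector": "//div[@class='pokemon-detail']",
    "fields": [
        {"name": "id", "selector": ".//p[@class='pokemon-slider__main-no size-28']", "type": "text"},
    ],
}

payload = {
    "urls": "https://sg.portal-pokemon.com/play/pokedex/0981",
    "extraction_config": {
        "type": "xpath",  # rejected: only 'basic', 'llm', 'cosine' or 'json_css' are accepted
        "params": {"schema": schema},
    },
}

response = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
print(response.json())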
Final Considerations
Unfortunately, my experience with Crawl4AI was not the best. The documentation is confusing, full of inconsistent examples and external references that simply don't work.
Remember that dilemma: read the documentation for 5 minutes, or spend 5 hours on Google hunting for the error? Well, it doesn't apply here…
Of all the scraping tools I've tested, this was by far the most frustrating. Simple and trivial tasks end up becoming a real puzzle.
But does that mean the library is bad?
Not necessarily. For those who need to extract specific information from a page and process it with an LLM, Crawl4AI can be useful. However, if the goal is to handle large volumes of data, the strategy of using LLMs becomes impractical due to the cost and execution time.
Furthermore, the library is still under development, which means you might waste a lot of time trying to solve problems that other tools, like requests and lxml, would solve without any hassle.
And if the idea is to use an LLM locally, alternatives like Ollama can be even slower, especially if your hardware isn't powerful enough. This only worsens the overall experience.
In its current state, Crawl4AI is not a reliable option for production projects. Methods, APIs, and functionalities are still changing rapidly, which can lead to instability.
My recommendation? Keep an eye on the updates and changelog. Who knows, in the future, with more robust documentation and usability improvements, Crawl4AI might actually be worth it.