Batch Scrape
Scrape multiple URLs in a single call. The sync client runs sequentially, while the async client runs concurrently with configurable parallelism. For the async client, batch_scrape_iter() yields results as they complete for streaming processing.
Sync Batch Scrape
from datablue import DataBlue

urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://stackoverflow.com",
]

with DataBlue(api_key="wh_your_api_key") as client:
    # The sync client accepts concurrency but still scrapes one URL at a time.
    results = client.batch_scrape(urls, concurrency=5)
    for result in results:
        if result.success:
            print(f"{result.data.url}: {result.data.metadata.word_count} words")
        else:
            print(f"FAILED: {result.error}")
Sync with Scrape Options
# Reuses client and urls from the sync example above.
results = client.batch_scrape(
    urls,
    concurrency=3,
    scrape_options={
        "formats": ["markdown", "links"],
        "only_main_content": True,
        "timeout": 45000,
    },
)
Async Batch (Collect All)
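import asyncio

from datablue import AsyncDataBlue

urls = ["https://example.com", "https://github.com"]

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        results = await client.batch_scrape(
            urls,
            concurrency=10,
            scrape_options={"formats": ["markdown"]},
        )
        print(f"Scraped {len(results)} pages")
        for r in results:
            # Check success before reading r.data.
            if r.success:
                print(f"  {r.data.url}: ok")
            else:
                print(f"  FAILED: {r.error}")

asyncio.run(main())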
Async Streaming (Recommended for Large Batches)
Use batch_scrape_iter() to process each result as soon as it arrives, without waiting for the whole batch to finish.
import asyncio

from datablue import AsyncDataBlue

urls = ["https://example.com/page/" + str(i) for i in range(100)]

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        completed = 0
        async for result in client.batch_scrape_iter(urls, concurrency=10):
            completed += 1
            if result.success:
                print(f"[{completed}/{len(urls)}] {result.data.url}: {result.data.metadata.word_count} words")
            else:
                print(f"[{completed}/{len(urls)}] FAILED: {result.error}")

asyncio.run(main())
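One common use of streaming is to persist each result as soon as it arrives, so that a crash late in a large batch does not lose earlier work. The sketch below shows that pattern writing one JSON line per result. It does not call the DataBlue API: `FakeResult` and `fake_batch_iter` are hypothetical stand-ins with flattened fields (the real result nests `url` and `word_count` under `result.data`), and `stream_to_jsonl` is an illustrative helper, not part of the SDK.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass
class FakeResult:
    """Hypothetical, flattened stand-in for ScrapeResult."""
    url: str
    success: bool
    word_count: Optional[int] = None
    error: Optional[str] = None


async def fake_batch_iter(urls: list[str]) -> AsyncIterator[FakeResult]:
    """Stand-in for client.batch_scrape_iter(); every third URL fails."""
    for i, url in enumerate(urls):
        await asyncio.sleep(0)  # give control back to the event loop, as real I/O would
        if i % 3 == 2:
            yield FakeResult(url=url, success=False, error="sample error")
        else:
            yield FakeResult(url=url, success=True, word_count=100 + i)


async def stream_to_jsonl(results: AsyncIterator[FakeResult], path: str) -> tuple[int, int]:
    """Write each result to `path` as one JSON line the moment it is yielded."""
    ok = failed = 0
    with open(path, "w", encoding="utf-8") as fh:
        async for r in results:
            if r.success:
                ok += 1
                record = {"url": r.url, "word_count": r.word_count}
            else:
                failed += 1
                record = {"url": r.url, "error": r.error}
            fh.write(json.dumps(record) + "\n")
    return ok, failed
```

With the real client, the same loop body would presumably sit inside `async for result in client.batch_scrape_iter(urls, concurrency=10)`, reading the nested `result.data` fields only after checking `result.success`.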
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | list[str] | required | List of URLs to scrape |
| concurrency | int | 5 | Maximum concurrent requests (async only; sync runs sequentially) |
| scrape_options | dict | None | Options passed to each scrape: formats, only_main_content, timeout, etc. |
Methods Summary
| Method | Client | Returns | Behavior |
|---|---|---|---|
| batch_scrape() | DataBlue | list[ScrapeResult] | Sequential, blocks until all done |
| batch_scrape() | AsyncDataBlue | list[ScrapeResult] | Concurrent, blocks until all done |
| batch_scrape_iter() | AsyncDataBlue | AsyncIterator[ScrapeResult] | Concurrent, yields as completed |
Note: The sync client runs batch scrape sequentially regardless of the concurrency parameter. For parallel execution, use the async client's batch_scrape() or batch_scrape_iter().
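To make "concurrent, yields as completed" concrete, here is a minimal sketch of how a concurrency cap and completion-order streaming can be combined using only the standard library. This is not DataBlue's implementation, which this page does not describe; `bounded_iter`, `fake_fetch`, and `demo` are hypothetical names, and the sketch uses `asyncio.Semaphore` for the cap and `asyncio.as_completed` for ordering.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def bounded_iter(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> AsyncIterator[str]:
    """Run fetch(url) for every URL, at most `concurrency` at a time,
    yielding each return value as soon as it is ready."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> str:
        async with sem:  # waits while `concurrency` fetches are already in flight
            return await fetch(url)

    tasks = [asyncio.create_task(guarded(u)) for u in urls]
    try:
        for finished in asyncio.as_completed(tasks):
            yield await finished
    finally:
        for t in tasks:
            t.cancel()  # stops leftover work if the consumer exits early


async def demo() -> tuple[list[str], int]:
    """Drive bounded_iter with a fake fetch and record peak parallelism."""
    in_flight = peak = 0

    async def fake_fetch(url: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        in_flight -= 1
        return url

    urls = [f"https://example.com/page/{i}" for i in range(10)]
    seen = [u async for u in bounded_iter(urls, fake_fetch, concurrency=3)]
    return seen, peak
```

Running `asyncio.run(demo())` returns all ten URLs, and the recorded peak never exceeds the cap of 3. A sequential client corresponds to the same loop with a cap of 1.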
Error resilience: Batch methods never raise on individual page failures. Failed pages return a ScrapeResult with success=False and the error message in the error field. Always check result.success before accessing result.data.
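Because failures arrive as ordinary results rather than exceptions, a batch is usually post-processed by splitting it into successes and failures. The sketch below shows one way to do that, assuming only the documented `success`, `data`, and `error` fields. `FakeData`, `FakeResult`, and `summarize` are hypothetical names used so the example runs without the SDK, and the error strings are made-up samples.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeData:
    url: str


@dataclass
class FakeResult:
    """Hypothetical stand-in mirroring the documented success/data/error fields."""
    success: bool
    data: Optional[FakeData] = None
    error: Optional[str] = None


def summarize(results):
    """Split results into successful payloads and a tally of error messages."""
    succeeded = []
    errors = Counter()
    for r in results:
        if r.success:
            succeeded.append(r.data)  # data is read only after the success check
        else:
            errors[r.error] += 1
    return succeeded, errors


sample = [
    FakeResult(True, FakeData("https://example.com")),
    FakeResult(False, error="sample timeout"),
    FakeResult(False, error="sample timeout"),
    FakeResult(False, error="sample not found"),
]
pages, error_counts = summarize(sample)
print(f"{len(pages)} succeeded, {sum(error_counts.values())} failed")
```

Tallying by error message makes it easy to see whether failures share a single cause before deciding how to rerun them.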