Batch Scrape

Scrape multiple URLs in a single call. The sync client processes the URLs sequentially; the async client processes them concurrently, up to a configurable limit. The async client also provides batch_scrape_iter(), which yields each result as it completes so you can process results as a stream.

Sync Batch Scrape

from datablue import DataBlue

urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://stackoverflow.com",
]

with DataBlue(api_key="wh_your_api_key") as client:
    # concurrency is accepted here but has no effect: the sync client scrapes one URL at a time.
    results = client.batch_scrape(urls, concurrency=5)

    for result in results:
        if result.success:
            print(f"{result.data.url}: {result.data.metadata.word_count} words")
        else:
            print(f"FAILED: {result.error}")

Sync with Scrape Options

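from datablue import DataBlue

# `urls` is the list defined in the previous example.
with DataBlue(api_key="wh_your_api_key") as client:
    results = client.batch_scrape(
        urls,
        concurrency=3,  # ignored by the sync client
        scrape_options={
            "formats": ["markdown", "links"],
            "only_main_content": True,
            "timeout": 45000,
        },
    )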

Async Batch (Collect All)

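import asyncio

from datablue import AsyncDataBlue

async def main():
    # `urls` is the list defined in the sync example above.
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        results = await client.batch_scrape(
            urls,
            concurrency=10,
            scrape_options={"formats": ["markdown"]},
        )
        print(f"Received {len(results)} results")
        for r in results:
            # Check success before accessing r.data; failed pages report r.error instead.
            if r.success:
                print(f"  OK: {r.data.url}")
            else:
                print(f"  FAILED: {r.error}")

asyncio.run(main())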

Async Streaming (Recommended for Large Batches)

Use batch_scrape_iter() to process each result as soon as it arrives, instead of waiting for the whole batch to finish.

import asyncio

from datablue import AsyncDataBlue

urls = [f"https://example.com/page/{i}" for i in range(100)]

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        completed = 0
        # Results are yielded as pages complete, so the order may differ from `urls`.
        async for result in client.batch_scrape_iter(urls, concurrency=10):
            completed += 1
            if result.success:
                print(f"[{completed}/{len(urls)}] {result.data.url}: {result.data.metadata.word_count} words")
            else:
                print(f"[{completed}/{len(urls)}] FAILED: {result.error}")

asyncio.run(main())

Parameters

Parameter	Type	Default	Description
`urls`	list[str]	required	List of URLs to scrape
`concurrency`	int	5	Maximum number of concurrent requests. Honored by the async client only; the sync client ignores it and runs sequentially.
`scrape_options`	dict	None	Options passed to each scrape: formats, only_main_content, timeout, etc.

Methods Summary

Method	Client	Returns	Behavior
`batch_scrape()`	DataBlue	list[ScrapeResult]	Sequential, blocks until all done
`batch_scrape()`	AsyncDataBlue	list[ScrapeResult]	Concurrent, returns once all pages are done
`batch_scrape_iter()`	AsyncDataBlue	AsyncIterator[ScrapeResult]	Concurrent, yields as completed
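
To make the difference between the two concurrent behaviors in the table concrete, here is a minimal standard-library sketch. It is not DataBlue's implementation: fake_scrape, bounded, collect_all, and iter_completed are hypothetical names, and the semaphore is just one assumed way a concurrency limit could be enforced.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_scrape(url: str) -> Tuple[str, int]:
    """Hypothetical stand-in for one page scrape (not part of the DataBlue SDK)."""
    await asyncio.sleep(0.01 * (len(url) % 3))  # simulate variable page latency
    return url, len(url)


async def bounded(sem: asyncio.Semaphore, url: str) -> Tuple[str, int]:
    """Run one scrape while holding a semaphore slot, capping parallelism."""
    async with sem:
        return await fake_scrape(url)


async def collect_all(urls: List[str], concurrency: int) -> List[Tuple[str, int]]:
    """'Collect all' style: return a list only after every scrape has finished."""
    sem = asyncio.Semaphore(concurrency)
    # gather() returns results in the same order as its inputs.
    return list(await asyncio.gather(*(bounded(sem, u) for u in urls)))


async def iter_completed(urls: List[str], concurrency: int) -> AsyncIterator[Tuple[str, int]]:
    """'Yield as completed' style: hand back each result as soon as it is ready."""
    sem = asyncio.Semaphore(concurrency)
    tasks = [asyncio.ensure_future(bounded(sem, u)) for u in urls]
    # as_completed() yields awaitables in completion order, not input order.
    for next_done in asyncio.as_completed(tasks):
        yield await next_done


async def demo() -> None:
    urls = ["https://example.com/a", "https://example.com/bb", "https://example.com/ccc"]
    print(await collect_all(urls, concurrency=2))
    async for item in iter_completed(urls, concurrency=2):
        print(item)


if __name__ == "__main__":
    asyncio.run(demo())
```

With the streaming variant, work on early results overlaps with the scrapes still in flight, which is why it suits large batches.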

Note: The sync client runs batch scrape sequentially regardless of the concurrency parameter. For parallel execution, use the async client's batch_scrape() or batch_scrape_iter().

Error resilience: Batch methods never raise on individual page failures. Failed pages return a ScrapeResult with success=False and the error message in the error field. Always check result.success before accessing result.data.
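
Because failures come back as ordinary results, a common first step is to split a batch into successes and failures before doing anything else. The sketch below uses a hypothetical stand-in ScrapeResult dataclass with only the three fields mentioned above (success, data, error) so that it runs without the SDK; the real class may carry more fields, and partition is an illustrative helper, not an SDK function.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class ScrapeResult:
    """Hypothetical stand-in mirroring only the fields described in this section."""
    success: bool
    data: Optional[Any] = None
    error: Optional[str] = None


def partition(results: List[ScrapeResult]) -> Tuple[List[ScrapeResult], List[ScrapeResult]]:
    """Split results into (succeeded, failed) without ever touching .data on a failure."""
    succeeded = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return succeeded, failed


if __name__ == "__main__":
    # Illustrative data only; in real use this list comes from batch_scrape().
    batch = [
        ScrapeResult(success=True, data={"url": "https://example.com"}),
        ScrapeResult(success=False, error="timeout"),
    ]
    ok, bad = partition(batch)
    print(f"{len(ok)} succeeded, {len(bad)} failed")
    for r in bad:
        print(f"FAILED: {r.error}")
```

Keeping the success check in one place like this avoids scattering result.success guards through the rest of the processing code.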