Batch Scrape

Scrape multiple URLs in a single call. The sync client processes the URLs sequentially; the async client processes them concurrently, up to a configurable concurrency limit. The async client also provides batch_scrape_iter(), which yields each result as soon as it completes so results can be processed as a stream.

Sync Batch Scrape

from datablue import DataBlue

urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://stackoverflow.com",
]

with DataBlue(api_key="wh_your_api_key") as client:
    # concurrency is accepted here, but the sync client still runs sequentially
    results = client.batch_scrape(urls, concurrency=5)

    for result in results:
        if result.success:
            print(f"{result.data.url}: {result.data.metadata.word_count} words")
        else:
            print(f"FAILED: {result.error}")

Sync with Scrape Options

# Runs inside a `with DataBlue(...) as client:` block; `urls` is the list defined above
results = client.batch_scrape(
    urls,
    concurrency=3,
    scrape_options={
        "formats": ["markdown", "links"],
        "only_main_content": True,
        "timeout": 45000,
    },
)

Async Batch (Collect All)

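import asyncio

from datablue import AsyncDataBlue

async def main():
    # `urls` is the list defined in the sync example above
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        results = await client.batch_scrape(
            urls,
            concurrency=10,
            scrape_options={"formats": ["markdown"]},
        )
        print(f"Scraped {len(results)} pages")
        for r in results:
            # Check success before touching r.data
            if r.success:
                print(f"  {r.data.url}: ok")
            else:
                print(f"  FAILED: {r.error}")

asyncio.run(main())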

Async Streaming (Recommended for Large Batches)

Use batch_scrape_iter() to process each result as soon as it arrives, without waiting for the whole batch to finish.

import asyncio

from datablue import AsyncDataBlue

urls = [f"https://example.com/page/{i}" for i in range(100)]

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        completed = 0
        # Results arrive in completion order, not input order
        async for result in client.batch_scrape_iter(urls, concurrency=10):
            completed += 1
            if result.success:
                print(f"[{completed}/{len(urls)}] {result.data.url}: {result.data.metadata.word_count} words")
            else:
                print(f"[{completed}/{len(urls)}] FAILED: {result.error}")

asyncio.run(main())
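
To make "yields as completed" concrete, here is a minimal sketch of the general pattern using only the standard library. It is not DataBlue's implementation: iter_as_completed, fake_scrape, and DELAYS are hypothetical names, and asyncio.sleep stands in for the network request. A semaphore caps how many workers run at once, and asyncio.as_completed hands results back in the order they finish.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable

# Hypothetical per-URL latencies (seconds) used to simulate slow and fast pages.
DELAYS = {
    "https://example.com/slow": 0.15,
    "https://example.com/fast-1": 0.03,
    "https://example.com/fast-2": 0.03,
}


async def fake_scrape(url: str) -> str:
    """Stand-in for a real scrape: sleep for the configured delay, return the URL."""
    await asyncio.sleep(DELAYS[url])
    return url


async def iter_as_completed(
    items: Iterable[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> AsyncIterator[str]:
    """Run worker over items with at most `concurrency` in flight, yielding in completion order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        # Only `concurrency` workers may hold the semaphore at once.
        async with semaphore:
            return await worker(item)

    tasks = [asyncio.ensure_future(guarded(item)) for item in items]
    try:
        for finished in asyncio.as_completed(tasks):
            yield await finished
    finally:
        # If the consumer stops early, cancel whatever is still pending.
        for task in tasks:
            task.cancel()


async def demo() -> list:
    order = []
    async for url in iter_as_completed(DELAYS, fake_scrape, concurrency=2):
        order.append(url)
    return order


completion_order = asyncio.run(demo())
print(completion_order)
```

In this sketch the slow URL is submitted first but yielded last, which is why a streaming consumer should not assume results arrive in input order.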

Parameters

Parameter       Type       Default   Description
urls            list[str]  required  List of URLs to scrape
concurrency     int        5         Maximum concurrent requests (async only; sync runs sequentially)
scrape_options  dict       None      Options passed to each scrape: formats, only_main_content, timeout, etc.

Methods Summary

Method               Client         Returns                      Behavior
batch_scrape()       DataBlue       list[ScrapeResult]           Sequential, blocks until all done
batch_scrape()       AsyncDataBlue  list[ScrapeResult]           Concurrent, blocks until all done
batch_scrape_iter()  AsyncDataBlue  AsyncIterator[ScrapeResult]  Concurrent, yields as completed

Note: The sync client runs batch scrape sequentially regardless of the concurrency parameter. For parallel execution, use the async client's batch_scrape() or batch_scrape_iter().
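
The difference between sequential and concurrent execution can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a description of the SDK internals: Tracker, fake_scrape, run_sequential, and run_concurrent are hypothetical names, and asyncio.sleep replaces the actual request. It records the peak number of requests in flight for each strategy.

```python
import asyncio


class Tracker:
    """Counts how many simulated requests are in flight and remembers the peak."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0


async def fake_scrape(url: str, tracker: Tracker) -> str:
    # Stand-in for a network request; not a real scrape.
    tracker.current += 1
    tracker.peak = max(tracker.peak, tracker.current)
    await asyncio.sleep(0.01)
    tracker.current -= 1
    return url


async def run_sequential(urls) -> int:
    """One request at a time, as the sync client is described to behave."""
    tracker = Tracker()
    for url in urls:
        await fake_scrape(url, tracker)
    return tracker.peak


async def run_concurrent(urls, concurrency: int) -> int:
    """Up to `concurrency` requests at a time, capped by a semaphore."""
    tracker = Tracker()
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await fake_scrape(url, tracker)

    await asyncio.gather(*(guarded(u) for u in urls))
    return tracker.peak


urls = [f"https://example.com/page/{i}" for i in range(10)]
sequential_peak = asyncio.run(run_sequential(urls))
concurrent_peak = asyncio.run(run_concurrent(urls, concurrency=3))
print(f"sequential peak: {sequential_peak}, concurrent peak: {concurrent_peak}")
```

With ten simulated URLs and a limit of 3, the sequential run never has more than one request in flight, while the concurrent run reaches the limit and stays at or below it.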

Error resilience: Batch methods never raise on individual page failures. Failed pages return a ScrapeResult with success=False and the error message in the error field. Always check result.success before accessing result.data.
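
One way to follow this advice is to split a batch into successes and failures before touching any data. The sketch below uses hypothetical stand-in classes (PageData and ScrapeResult here are simplified dataclasses, not the SDK's types) that mirror only the success, data, and error fields described above. It also assumes data is unset on failure, which the text above implies but does not state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageData:
    """Hypothetical stand-in for the scraped page payload."""
    url: str


@dataclass
class ScrapeResult:
    """Hypothetical stand-in mirroring the success / data / error fields."""
    success: bool
    data: Optional[PageData] = None  # assumed to be None when success is False
    error: Optional[str] = None


def partition(results: List[ScrapeResult]) -> Tuple[List[ScrapeResult], List[ScrapeResult]]:
    """Split results into (succeeded, failed) without reading .data on failures."""
    succeeded = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return succeeded, failed


# Sample batch standing in for the return value of a batch call.
results = [
    ScrapeResult(success=True, data=PageData(url="https://example.com")),
    ScrapeResult(success=False, error="timeout"),
    ScrapeResult(success=True, data=PageData(url="https://github.com")),
]

succeeded, failed = partition(results)
print(f"{len(succeeded)} ok, {len(failed)} failed")
for r in succeeded:
    print(f"  ok: {r.data.url}")
for r in failed:
    print(f"  error: {r.error}")
```

Keeping the two lists separate means downstream code that reads result.data only ever sees successful results.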