Crawl

Crawl a website starting from a seed URL, discovering and scraping pages with a breadth-first (BFS), depth-first (DFS), or best-first strategy. The SDK provides two polling-based methods: crawl() blocks until the job finishes, while start_crawl() returns immediately so you can poll manually. Streaming variants are described under Streaming Crawl.

Blocking Crawl

The simplest approach. crawl() starts the job and polls until completion, then returns all results at once.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
        include_paths=["/3/library/*"],
        poll_interval=2.0,     # check every 2 seconds
        timeout=300.0,         # give up after 5 minutes
    )

    print(f"Status: {status.status}")                # "completed"
    print(f"Pages: {status.completed_pages}/{status.total_pages}")
    print(f"Progress: {status.progress:.0%}")         # "100%"

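    for page in status.data:
        # metadata is typed PageMetadata | None, so guard before reading word_count
        words = page.metadata.word_count if page.metadata else "n/a"
        print(f"  {page.url}: {words} words")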
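
The include_paths value in the example above is a glob pattern. How the service evaluates globs is not specified in this section, so the following is only a sketch that uses Python's fnmatch as a stand-in to show which URL paths a pattern such as /3/library/* would select. The helper path_allowed is hypothetical and not part of the SDK, and the rule that exclude patterns win over include patterns is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_allowed(url, include_paths=None, exclude_paths=None):
    """Hypothetical helper: approximate include/exclude filtering with fnmatch.

    Assumptions: patterns are matched against the URL path only, and
    exclude patterns take priority over include patterns.
    """
    path = urlparse(url).path
    if exclude_paths and any(fnmatchcase(path, p) for p in exclude_paths):
        return False
    if include_paths:
        return any(fnmatchcase(path, p) for p in include_paths)
    return True  # no include filter means every path is allowed


if __name__ == "__main__":
    patterns = ["/3/library/*"]
    print(path_allowed("https://docs.python.org/3/library/os.html", patterns))
    print(path_allowed("https://docs.python.org/3/tutorial/index.html", patterns))
```

Checking patterns locally like this can help catch a filter that is too narrow before spending crawl budget on it.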

Non-blocking Crawl (Manual Polling)

Use start_crawl() to get a job ID, then poll with get_crawl_status() at your own pace.

import time
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    # Start the crawl — returns immediately
    job = client.start_crawl(
        "https://docs.python.org/3/",
        max_pages=100,
        max_depth=3,
        concurrency=5,
        crawl_strategy="bfs",
    )
    print(f"Job ID: {job.job_id}")
    print(f"Status: {job.status}")    # "started"

    # Poll until done
    while True:
        status = client.get_crawl_status(job.job_id)
        print(f"  Progress: {status.completed_pages}/{status.total_pages}")

        if status.is_complete:
            break
        time.sleep(3)

    # Process results
    for page in status.data:
        if page.success:
            print(f"  {page.url}: {len(page.markdown or '')} chars")
        else:
            print(f"  {page.url}: FAILED — {page.error}")
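
The loop above polls until the job finishes, with no upper bound on how long that takes. A bounded variant might look like the sketch below. The helper wait_for_crawl is hypothetical (not an SDK function); it accepts any zero-argument callable that returns an object with an is_complete attribute, so it can wrap get_crawl_status() without depending on it.

```python
import time


def wait_for_crawl(fetch_status, interval=3.0, timeout=300.0, sleep=time.sleep):
    """Hypothetical helper: poll fetch_status() until is_complete or timeout.

    fetch_status is any zero-argument callable returning an object with an
    is_complete attribute, e.g. lambda: client.get_crawl_status(job.job_id).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.is_complete:
            return status
        # Stop before sleeping if the next poll would land past the deadline
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"crawl not finished after {timeout} seconds")
        sleep(interval)
```

With a real client this would be called as wait_for_crawl(lambda: client.get_crawl_status(job.job_id), timeout=600.0). The sleep parameter exists only so the helper can be exercised without real delays.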

Paginating Large Crawls

Results are returned in pages of 20 by default. If a crawl scraped more than 20 pages, a single status response holds only part of the data, and you must page through the results to retrieve everything:

# Fetch the first page of results (the default)
status = client.get_crawl_status(job.job_id)
# Pagination fields on the response:
# status.total_results: total number of results available
# status.page: current page number
# status.per_page: results per page (default 20)

Use the total_results, page, and per_page fields on the CrawlStatus response to work out how many result pages exist and iterate through them. If the crawl produced 20 or fewer results, they all arrive in the first response.
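
The number of result pages follows directly from total_results and per_page. A minimal sketch of that arithmetic is below; result_page_count is a hypothetical helper, and how a specific result page is requested is not shown in this section, so the sketch stops at the count.

```python
import math


def result_page_count(total_results, per_page=20):
    """Number of result pages needed to cover total_results items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_results / per_page)


if __name__ == "__main__":
    # 45 results at 20 per page need three pages: 20 + 20 + 5
    print(result_page_count(45))  # prints 3
```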

Cancel a Crawl

result = client.cancel_crawl(job.job_id)
print(result)  # {"success": True, "message": "Crawl cancelled"}

Async

import asyncio
from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        # Awaited crawl: returns once the job has finished
        status = await client.crawl(
            "https://docs.python.org/3/",
            max_pages=50,
            max_depth=2,
        )
        for page in status.data:
            print(page.url)

        # Non-blocking: start, check status once, then cancel
        job = await client.start_crawl("https://example.com", max_pages=20)
        status = await client.get_crawl_status(job.job_id)
        await client.cancel_crawl(job.job_id)

asyncio.run(main())

Parameters

Parameter             Type                          Default   Description
url                   str                           required  Starting URL to crawl
max_pages             int                           100       Maximum pages to crawl (1–1000)
max_depth             int                           3         Maximum link depth from starting URL (1–10)
concurrency           int                           3         Parallel scrape workers (1–10)
include_paths         list[str]                     None      Only crawl URLs matching these glob patterns
exclude_paths         list[str]                     None      Skip URLs matching these glob patterns
allow_external_links  bool                          False     Follow links to external domains
respect_robots_txt    bool                          True      Obey the site's robots.txt rules
filter_faceted_urls   bool                          True      Deduplicate faceted/navigation URL variations
crawl_strategy        Literal["bfs", "dfs", "bff"]  None      "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first frontier)
scrape_options        dict                          None      Options passed to each page scrape (formats, only_main_content, wait_for, etc.)
use_proxy             bool                          False     Route all requests through configured proxy
webhook_url           str                           None      URL to receive webhook on completion
webhook_secret        str                           None      HMAC secret for webhook signature verification
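
The webhook_secret parameter is described as an HMAC secret for signature verification, but this section does not specify the hash algorithm, the encoding, or the header that carries the signature. The sketch below therefore assumes a hex-encoded HMAC-SHA256 over the raw request body, which is a common scheme but may not match the service. The function name verify_webhook_signature is hypothetical.

```python
import hashlib
import hmac


def verify_webhook_signature(raw_body: bytes, received_signature: str, secret: str) -> bool:
    """Hypothetical check: compare an HMAC-SHA256 hex digest of the raw body.

    Assumption: the signature is a hex-encoded HMAC-SHA256 of the request
    body keyed with webhook_secret. Confirm the real scheme before relying on it.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, received_signature)
```

Whatever the actual scheme turns out to be, the comparison should run over the raw bytes of the request body rather than a re-serialized copy, since re-encoding JSON can change the bytes and invalidate the signature.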

The crawl() blocking method accepts three additional parameters:

Parameter      Type                           Default  Description
poll_interval  float                          2.0      Seconds between status polls
timeout        float                          300.0    Maximum seconds to wait before raising TimeoutError
on_progress    Callable[[CrawlStatus], None]  None     Optional callback invoked after each poll with the latest CrawlStatus. Useful for progress bars or logging.
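
An on_progress callback receives the latest CrawlStatus after each poll. The sketch below shows one way such a callback could format a progress line. FakeStatus is a stand-in defined only so the example runs without the SDK, and format_progress and log_progress are hypothetical names. The zero-total guard is an assumption about what the status may look like before any pages are discovered.

```python
from dataclasses import dataclass


@dataclass
class FakeStatus:
    """Stand-in for CrawlStatus with only the fields the callback reads."""
    status: str
    completed_pages: int
    total_pages: int


def format_progress(status) -> str:
    """Build a one-line progress message from a status object."""
    total = status.total_pages or 0
    pct = (status.completed_pages / total) if total else 0.0
    return f"[{status.status}] {status.completed_pages}/{total} pages ({pct:.0%})"


def log_progress(status) -> None:
    """Callback suitable for on_progress: prints the latest progress line."""
    print(format_progress(status))


if __name__ == "__main__":
    log_progress(FakeStatus("running", 10, 50))  # prints "[running] 10/50 pages (20%)"
```

With a real client the callback would be passed as client.crawl(url, on_progress=log_progress).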

Streaming Crawl

Use crawl_stream() to receive pages in real time via NDJSON streaming — no polling required. Pages are yielded as they are discovered and scraped.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    for page in client.crawl_stream("https://docs.example.com", max_pages=50):
        print(f"{page.url} — {len(page.markdown or '')} chars")
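
NDJSON (newline-delimited JSON) carries one complete JSON object per line, which is what lets a consumer handle each page as soon as its line arrives. crawl_stream() does this decoding for you; the sketch below only illustrates the format. The iter_ndjson helper is hypothetical, and the field names in the sample payload are illustrative, not the service's wire format.

```python
import json


def iter_ndjson(lines):
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical payload: the field names are illustrative only.
    sample = [
        '{"url": "https://docs.example.com/", "success": true}',
        "",
        '{"url": "https://docs.example.com/guide", "success": true}',
    ]
    for record in iter_ndjson(sample):
        print(record["url"])
```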

Streaming with Callbacks

Use crawl_stream_with_callback() for event-driven architectures. Provide callback functions instead of iterating.

from datablue import DataBlue

pages = []

with DataBlue(api_key="wh_your_api_key") as client:
    client.crawl_stream_with_callback(
        "https://example.com",
        max_pages=100,
        on_document=lambda page: pages.append(page),
        on_complete=lambda: print(f"Done! {len(pages)} pages"),
        on_error=lambda e: print(f"Error: {e}"),
    )

Method                                                    Returns                  Behavior
crawl_stream(url, **opts)                                 Iterator[CrawlPageData]  Yields pages via NDJSON stream as they are discovered
crawl_stream_with_callback(url, on_document=..., **opts)  None                     Callback-based streaming: on_document, on_complete, on_error

Response Model

class CrawlJob:                         # from start_crawl()
    success: bool                        # Whether the crawl job was accepted
    job_id: str                          # Unique job identifier for polling crawl status
    status: str                          # Current job status (typically "started")
    message: str | None                  # Human-readable status message or error description

class CrawlStatus:                       # from crawl() or get_crawl_status()
    success: bool                        # Whether the status request succeeded
    job_id: str                          # Unique job identifier
    status: str                          # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    total_pages: int                     # Total number of pages discovered for crawling
    completed_pages: int                 # Number of pages successfully scraped so far
    data: list[CrawlPageData]            # List of scraped page results (CrawlPageData objects)
    total_results: int                   # Total number of results available (for pagination)
    page: int                            # Current page number in paginated results (default: 1)
    per_page: int                        # Number of results per page (default: 20)
    error: str | None                    # Human-readable error message if the crawl failed

    # Properties
    is_complete: bool                    # True when status in {"completed", "failed", "cancelled"}
    progress: float                      # completed_pages / total_pages (0.0 to 1.0)

class CrawlPageData:
    id: str | None                       # Unique identifier for this page within the crawl job
    url: str | None                      # The URL of the crawled page
    markdown: str | None                 # Clean Markdown conversion of the page content
    fit_markdown: str | None             # Markdown trimmed to fit LLM context windows
    html: str | None                     # Cleaned HTML content with boilerplate removed
    raw_html: str | None                 # Original unmodified HTML source of the page
    links: list[str] | None              # List of URLs found on the page
    links_detail: dict | list | None     # Detailed link info including anchor text and attributes
    screenshot: str | None               # Base64-encoded PNG screenshot of the page
    structured_data: dict | None         # JSON-LD and microdata extracted from the page
    headings: list[dict] | None          # List of headings with level and text
    images: list[dict] | None            # List of images with src, alt, and dimensions
    extract: dict | None                 # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None         # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None             # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None        # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                    # Error message if scraping this page failed
    success: bool                        # Whether this individual page was scraped successfully
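
The two CrawlStatus properties are derived from other fields. The sketch below shows how they could be computed, following the comments in the model above. StatusSketch is an illustrative stand-in, not the SDK class, and returning 0.0 when total_pages is zero and capping the ratio at 1.0 are assumptions.

```python
from dataclasses import dataclass

TERMINAL_STATES = {"completed", "failed", "cancelled"}


@dataclass
class StatusSketch:
    """Illustrative stand-in for CrawlStatus; not the SDK class."""
    status: str
    total_pages: int
    completed_pages: int

    @property
    def is_complete(self) -> bool:
        # A job is finished once it reaches any terminal state
        return self.status in TERMINAL_STATES

    @property
    def progress(self) -> float:
        # Assumption: report 0.0 when nothing has been discovered yet,
        # rather than dividing by zero.
        if self.total_pages <= 0:
            return 0.0
        return min(self.completed_pages / self.total_pages, 1.0)
```

Note that is_complete is also true for "failed" and "cancelled", so code that needs a successful result should check status == "completed" as well.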