Crawl

Crawl a website starting from a seed URL, discovering and scraping pages with a BFS, DFS, or best-first strategy. The SDK provides two polling-based methods: crawl() blocks until the job finishes (or the timeout expires), while start_crawl() returns immediately so you can poll manually. Streaming variants are covered under Streaming Crawl.

Blocking Crawl

The simplest approach. crawl() starts the job and polls until completion, then returns all results at once.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
        include_paths=["/3/library/*"],
        poll_interval=2.0,     # check every 2 seconds
        timeout=300.0,         # give up after 5 minutes
    )

    print(f"Status: {status.status}")                # "completed"
    print(f"Pages: {status.completed_pages}/{status.total_pages}")
    print(f"Progress: {status.progress:.0%}")         # "100%"

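    for page in status.data:
        # metadata is optional (PageMetadata | None), so guard before reading it
        words = page.metadata.word_count if page.metadata else "unknown"
        print(f"  {page.url}: {words} words")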
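The include_paths value in the example above is a glob pattern. This section does not specify the service's exact matching rules, so the following is only a local approximation using Python's fnmatch, where * also matches "/". The helper name path_matches is hypothetical and not part of the SDK.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def path_matches(url: str, patterns: list[str]) -> bool:
    """Return True if the URL's path matches any of the glob patterns.

    Assumption: patterns are compared against the URL path only, using
    fnmatch semantics. The service itself may match differently.
    """
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in patterns)


include = ["/3/library/*"]
print(path_matches("https://docs.python.org/3/library/os.html", include))      # True
print(path_matches("https://docs.python.org/3/tutorial/index.html", include))  # False
```

A check like this can help sanity-test a pattern list before spending crawl quota on it.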

Non-blocking Crawl (Manual Polling)

Use start_crawl() to get a job ID, then poll with get_crawl_status() at your own pace.

import time
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    # Start the crawl — returns immediately
    job = client.start_crawl(
        "https://docs.python.org/3/",
        max_pages=100,
        max_depth=3,
        concurrency=5,
        crawl_strategy="bfs",
    )
    print(f"Job ID: {job.job_id}")
    print(f"Status: {job.status}")    # "started"

    # Poll until done
    while True:
        status = client.get_crawl_status(job.job_id)
        print(f"  Progress: {status.completed_pages}/{status.total_pages}")

        if status.is_complete:
            break
        time.sleep(3)

    # Process results
    for page in status.data:
        if page.success:
            print(f"  {page.url}: {len(page.markdown or '')} chars")
        else:
            print(f"  {page.url}: FAILED — {page.error}")
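The polling loop above runs until the job reaches a terminal state, with no upper bound on the wait. A minimal sketch of a bounded version follows. The helper poll_until is hypothetical (not part of the SDK) and takes plain callables, so it makes no assumptions about the client beyond what the example above already uses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval: float = 3.0,
    timeout: float = 300.0,
) -> T:
    """Call fetch() every `interval` seconds until is_done(result) is true.

    Raises TimeoutError if the next poll would land past the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"not finished after {timeout} seconds")
        time.sleep(interval)


# Usage with the client from the example above:
# status = poll_until(
#     lambda: client.get_crawl_status(job.job_id),
#     lambda s: s.is_complete,
# )
```

time.monotonic() is used rather than time.time() so that system clock adjustments do not shorten or extend the deadline.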

Paginating Large Crawls

Results are paginated (20 per page by default). For crawls that scrape more than 20 pages, you must page through the results to retrieve all scraped data:

# Fetch the first page of results (the default)
status = client.get_crawl_status(job.job_id)
# Pagination fields on the response:
# status.total_results: total number of results available
# status.page: current page number
# status.per_page: results per page (default 20)

Use the total_results, page, and per_page fields on the CrawlStatus response to determine how many pages of results exist and iterate through them. If the crawl produces 20 or fewer results, they all arrive in the first response.
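The number of result pages follows directly from total_results and per_page. The sketch below covers only that arithmetic; the argument used to request a specific result page is not shown in this section, so it is deliberately left out. The helper name result_page_count is hypothetical.

```python
import math


def result_page_count(total_results: int, per_page: int = 20) -> int:
    """Number of result pages needed to cover total_results items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_results / per_page)


print(result_page_count(50))  # 3
print(result_page_count(20))  # 1
print(result_page_count(0))   # 0
```

In practice you would pass status.total_results and status.per_page from the first response.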

Cancel a Crawl

result = client.cancel_crawl(job.job_id)
print(result)  # {"success": True, "message": "Crawl cancelled"}

Async

import asyncio
from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        # Blocking async crawl
        status = await client.crawl(
            "https://docs.python.org/3/",
            max_pages=50,
            max_depth=2,
        )
        for page in status.data:
            print(page.url)

        # Non-blocking: start, check status once, then cancel
        job = await client.start_crawl("https://example.com", max_pages=20)
        status = await client.get_crawl_status(job.job_id)
        await client.cancel_crawl(job.job_id)

# "async with" is only valid inside a coroutine, so run it via asyncio
asyncio.run(main())

Parameters

Parameter	Type	Default	Description
`url`	str	required	Starting URL to crawl
`max_pages`	int	100	Maximum pages to crawl (1–1000)
`max_depth`	int	3	Maximum link depth from starting URL (1–10)
`concurrency`	int	3	Parallel scrape workers (1–10)
`include_paths`	list[str]	None	Only crawl URLs matching these glob patterns
`exclude_paths`	list[str]	None	Skip URLs matching these glob patterns
`allow_external_links`	bool	False	Follow links to external domains
`respect_robots_txt`	bool	True	Obey the site's robots.txt rules
`filter_faceted_urls`	bool	True	Deduplicate faceted/navigation URL variations
`crawl_strategy`	Literal["bfs", "dfs", "bff"]	None	"bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first frontier)
`scrape_options`	dict	None	Options passed to each page scrape (formats, only_main_content, wait_for, etc.)
`use_proxy`	bool	False	Route all requests through configured proxy
`webhook_url`	str	None	URL to receive webhook on completion
`webhook_secret`	str	None	HMAC secret for webhook signature verification
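The webhook_secret parameter is described as an HMAC secret for signature verification. This section does not state the hash algorithm, the encoding, or the header that carries the signature, so the sketch below assumes a hex-encoded HMAC-SHA256 computed over the raw request body. Treat those details as assumptions to confirm against the webhook documentation. The function name verify_signature is hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Check a webhook signature.

    Assumption: the signature is hex(HMAC-SHA256(secret, raw_body)).
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Always verify against the raw bytes of the request body. Re-serializing parsed JSON can change whitespace or key order and invalidate the signature.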

The crawl() blocking method accepts three additional parameters:

Parameter	Type	Default	Description
`poll_interval`	float	2.0	Seconds between status polls
`timeout`	float	300.0	Maximum seconds to wait before raising TimeoutError
`on_progress`	Callable[[CrawlStatus], None]	None	Optional callback invoked after each poll with the latest CrawlStatus. Useful for progress bars or logging.
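As an illustration of on_progress, here is a minimal sketch of a logging callback. It reads only fields documented on CrawlStatus (status, completed_pages, total_pages, progress). The function names format_progress and log_progress are hypothetical, and the SimpleNamespace object stands in for a real CrawlStatus so the snippet runs without network access.

```python
from types import SimpleNamespace


def format_progress(status) -> str:
    """Build a one-line summary from a CrawlStatus-like object."""
    return (
        f"[{status.status}] {status.completed_pages}/{status.total_pages} "
        f"pages ({status.progress:.0%})"
    )


def log_progress(status) -> None:
    """Callback suitable for on_progress: print the summary line."""
    print(format_progress(status))


# Stand-in for a CrawlStatus, used only to demonstrate the output
fake = SimpleNamespace(status="running", completed_pages=25, total_pages=50, progress=0.5)
log_progress(fake)  # [running] 25/50 pages (50%)

# With the real client:
# client.crawl("https://docs.python.org/3/", max_pages=50, on_progress=log_progress)
```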

Streaming Crawl

Use crawl_stream() to receive pages in real time via NDJSON streaming — no polling required. Pages are yielded as they are discovered and scraped.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    for page in client.crawl_stream("https://docs.example.com", max_pages=50):
        print(f"{page.url} — {len(page.markdown or '')} chars")

Streaming with Callbacks

Use crawl_stream_with_callback() for event-driven architectures. Provide callback functions instead of iterating.

from datablue import DataBlue

pages = []

with DataBlue(api_key="wh_your_api_key") as client:
    client.crawl_stream_with_callback(
        "https://example.com",
        max_pages=100,
        on_document=lambda page: pages.append(page),
        on_complete=lambda: print(f"Done! {len(pages)} pages"),
        on_error=lambda e: print(f"Error: {e}"),
    )

Method	Returns	Behavior
`crawl_stream(url, **opts)`	Iterator[CrawlPageData]	Yields pages via NDJSON stream as they are discovered
`crawl_stream_with_callback(url, on_document=..., **opts)`	None	Callback-based streaming: on_document, on_complete, on_error
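NDJSON (newline-delimited JSON) carries one complete JSON object per line, which is what allows pages to be consumed before the crawl finishes. The SDK handles decoding for you; the sketch below only illustrates the format. The record fields in the sample are illustrative, not the documented wire format, and iter_ndjson is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


sample = [
    '{"url": "https://example.com/", "success": true}',
    "",
    '{"url": "https://example.com/about", "success": true}',
]
for record in iter_ndjson(sample):
    print(record["url"])
```

Because each line is parsed on its own, a consumer never has to buffer the whole response before acting on the first page.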

Response Model

class CrawlJob:                         # from start_crawl()
    success: bool                        # Whether the crawl job was accepted
    job_id: str                          # Unique job identifier for polling crawl status
    status: str                          # Current job status (typically "started")
    message: str | None                  # Human-readable status message or error description

class CrawlStatus:                       # from crawl() or get_crawl_status()
    success: bool                        # Whether the status request succeeded
    job_id: str                          # Unique job identifier
    status: str                          # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    total_pages: int                     # Total number of pages discovered for crawling
    completed_pages: int                 # Number of pages successfully scraped so far
    data: list[CrawlPageData]            # List of scraped page results (CrawlPageData objects)
    total_results: int                   # Total number of results available (for pagination)
    page: int                            # Current page number in paginated results (default: 1)
    per_page: int                        # Number of results per page (default: 20)
    error: str | None                    # Human-readable error message if the crawl failed

    # Properties
    is_complete: bool                    # True when status in {"completed", "failed", "cancelled"}
    progress: float                      # completed_pages / total_pages (0.0 to 1.0)

class CrawlPageData:
    id: str | None                       # Unique identifier for this page within the crawl job
    url: str | None                      # The URL of the crawled page
    markdown: str | None                 # Clean Markdown conversion of the page content
    fit_markdown: str | None             # Markdown trimmed to fit LLM context windows
    html: str | None                     # Cleaned HTML content with boilerplate removed
    raw_html: str | None                 # Original unmodified HTML source of the page
    links: list[str] | None              # List of URLs found on the page
    links_detail: dict | list | None     # Detailed link info including anchor text and attributes
    screenshot: str | None               # Base64-encoded PNG screenshot of the page
    structured_data: dict | None         # JSON-LD and microdata extracted from the page
    headings: list[dict] | None          # List of headings with level and text
    images: list[dict] | None            # List of images with src, alt, and dimensions
    extract: dict | None                 # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None         # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None             # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None        # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                    # Error message if scraping this page failed
    success: bool                        # Whether this individual page was scraped successfully
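The two computed properties on CrawlStatus can be sketched as follows. This is a stand-in written from the comments in the model above, not the SDK's source. The class name StatusSketch is hypothetical, and returning 0.0 when total_pages is zero is an assumption about how the division is guarded.

```python
from dataclasses import dataclass

# Statuses listed in the is_complete comment above
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


@dataclass
class StatusSketch:
    status: str
    total_pages: int
    completed_pages: int

    @property
    def is_complete(self) -> bool:
        """True once the job has reached a terminal state."""
        return self.status in TERMINAL_STATUSES

    @property
    def progress(self) -> float:
        """Fraction of discovered pages scraped, from 0.0 to 1.0."""
        if self.total_pages == 0:
            return 0.0  # assumption: avoid dividing by zero before discovery
        return self.completed_pages / self.total_pages


running = StatusSketch(status="running", total_pages=40, completed_pages=10)
print(running.is_complete, running.progress)  # False 0.25
```

Note that is_complete is also true for "failed" and "cancelled", so check status or error before treating the data as a full result set.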