Crawl
Crawl a website starting from a seed URL, discovering and scraping pages via BFS, DFS, or best-first strategy.
The SDK provides two methods: crawl() blocks until all pages are scraped, while start_crawl() returns immediately for manual polling.
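To make the three strategies concrete, here is a minimal sketch of how a crawl frontier could order pages under each one. The link graph, the `prefer_b` scoring function, and `crawl_order` itself are hypothetical illustrations; how the service actually scores pages for best-first crawling is not documented here.

```python
from collections import deque
import heapq

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def prefer_b(url):
    """Toy priority: lower is better, so pages under /b are visited first."""
    return 0 if url.startswith("/b") else 1


def crawl_order(seed, strategy, score=prefer_b):
    """Return the visit order for "bfs", "dfs", or "bff" (best-first)."""
    seen = {seed}
    order = []
    # Best-first uses a priority queue; BFS and DFS share a deque.
    frontier = [(score(seed), seed)] if strategy == "bff" else deque([seed])
    while frontier:
        if strategy == "bfs":
            url = frontier.popleft()           # oldest discovered page first
        elif strategy == "dfs":
            url = frontier.pop()               # newest discovered page first
        else:
            _, url = heapq.heappop(frontier)   # best-scoring page first
        order.append(url)
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if strategy == "bff":
                heapq.heappush(frontier, (score(link), link))
            else:
                frontier.append(link)
    return order
```

BFS finishes each depth level before going deeper, DFS follows one branch to the end before backtracking, and best-first always expands whichever known page scores best.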
Blocking Crawl
The simplest approach. crawl() starts the job and polls until completion, then returns all results at once.
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
        include_paths=["/3/library/*"],
        poll_interval=2.0,  # check every 2 seconds
        timeout=300.0,      # give up after 5 minutes
    )
    print(f"Status: {status.status}")  # "completed"
    print(f"Pages: {status.completed_pages}/{status.total_pages}")
    print(f"Progress: {status.progress:.0%}")  # "100%"
    for page in status.data:
        print(f"  {page.url}: {page.metadata.word_count} words")
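The `include_paths` pattern used above can be previewed locally before starting a crawl. The following is a rough sketch that assumes the patterns behave like shell-style globs applied to the URL path and that exclusions take precedence over inclusions; the service's exact matching rules may differ, and `path_allowed` is a hypothetical helper, not part of the SDK.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_allowed(url, include_paths=None, exclude_paths=None):
    """Approximate include/exclude filtering with shell-style globs."""
    path = urlparse(url).path
    # Assumption: an exclude match always wins.
    if exclude_paths and any(fnmatchcase(path, p) for p in exclude_paths):
        return False
    # Assumption: when include patterns are given, at least one must match.
    if include_paths:
        return any(fnmatchcase(path, p) for p in include_paths)
    return True
```

For example, `path_allowed("https://docs.python.org/3/library/os.html", include_paths=["/3/library/*"])` is true under these assumptions, while a URL under `/3/tutorial/` is filtered out.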
Non-blocking Crawl (Manual Polling)
Use start_crawl() to get a job ID, then poll with get_crawl_status() at your own pace.
import time
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    # Start the crawl (returns immediately)
    job = client.start_crawl(
        "https://docs.python.org/3/",
        max_pages=100,
        max_depth=3,
        concurrency=5,
        crawl_strategy="bfs",
    )
    print(f"Job ID: {job.job_id}")
    print(f"Status: {job.status}")  # "started"

    # Poll until done
    while True:
        status = client.get_crawl_status(job.job_id)
        print(f"  Progress: {status.completed_pages}/{status.total_pages}")
        if status.is_complete:
            break
        time.sleep(3)

    # Process results
    for page in status.data:
        if page.success:
            print(f"  {page.url}: {len(page.markdown or '')} chars")
        else:
            print(f"  {page.url}: FAILED: {page.error}")
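The `while True` loop above never gives up on a stalled job. One way to bound it is a small helper with a deadline, sketched below. `poll_until_complete` is a hypothetical helper rather than an SDK function; it only assumes that the status object exposes `is_complete`, as `CrawlStatus` does.

```python
import time


def poll_until_complete(fetch_status, poll_interval=3.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports completion or the deadline passes."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.is_complete:
            return status
        # Stop if the next poll would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"crawl did not finish within {timeout} seconds")
        sleep(poll_interval)
```

With a real client this would be called as `poll_until_complete(lambda: client.get_crawl_status(job.job_id))`. The `sleep` and `clock` parameters exist only so the helper can be tested without waiting.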
Paginating Large Crawls
Results are paginated (20 per page by default). For crawls with more than 20 pages, you must paginate through the results to retrieve all scraped data:
# Fetch the first page of results (the default)
status = client.get_crawl_status(job.job_id)

# Pagination info on the response:
#   status.total_results: total pages crawled
#   status.page: current page number
#   status.per_page: results per page (default 20)
Use the total_results, page, and per_page fields on the CrawlStatus response to work out how many pages of results exist and to iterate through them. If your crawl returns 20 or fewer pages, all results are in the first response.
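The number of result pages follows directly from those fields. A minimal sketch of the arithmetic is below; `result_page_count` is a hypothetical helper, and how a specific page is requested from the API is not covered here.

```python
def result_page_count(total_results, per_page=20):
    """Number of result pages needed to hold total_results items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Ceiling division using integers only.
    return (total_results + per_page - 1) // per_page
```

A crawl with 45 results at the default page size spans three pages: two full pages of 20 and a final page of 5.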
Cancel a Crawl
result = client.cancel_crawl(job.job_id)
print(result) # {"success": True, "message": "Crawl cancelled"}
Async
import asyncio
from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        # Blocking async crawl
        status = await client.crawl(
            "https://docs.python.org/3/",
            max_pages=50,
            max_depth=2,
        )
        for page in status.data:
            print(page.url)

        # Non-blocking: start + poll
        job = await client.start_crawl("https://example.com", max_pages=20)
        status = await client.get_crawl_status(job.job_id)
        await client.cancel_crawl(job.job_id)

asyncio.run(main())
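One reason to use the async client is to run several crawls concurrently. The sketch below shows the pattern with `asyncio.gather`. It uses a hypothetical `fake_crawl` coroutine in place of `client.crawl` so that it runs without network access; with a real client you would pass `client.crawl` as the `crawl` argument.

```python
import asyncio


async def fake_crawl(url, max_pages=50):
    """Stand-in for an awaitable crawl call (assumption: not the real SDK)."""
    await asyncio.sleep(0.01)  # simulate waiting on the network
    return {"url": url, "max_pages": max_pages}


async def crawl_many(urls, crawl=fake_crawl):
    """Start one crawl per URL and wait for all of them together."""
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(crawl(url, max_pages=20) for url in urls))


results = asyncio.run(crawl_many(["https://example.com", "https://example.org"]))
```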
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | Starting URL to crawl |
| max_pages | int | 100 | Maximum pages to crawl (1–1000) |
| max_depth | int | 3 | Maximum link depth from starting URL (1–10) |
| concurrency | int | 3 | Parallel scrape workers (1–10) |
| include_paths | list[str] | None | Only crawl URLs matching these glob patterns |
| exclude_paths | list[str] | None | Skip URLs matching these glob patterns |
| allow_external_links | bool | False | Follow links to external domains |
| respect_robots_txt | bool | True | Obey the site's robots.txt rules |
| filter_faceted_urls | bool | True | Deduplicate faceted/navigation URL variations |
| crawl_strategy | Literal["bfs", "dfs", "bff"] | None | "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first frontier) |
| scrape_options | dict | None | Options passed to each page scrape (formats, only_main_content, wait_for, etc.) |
| use_proxy | bool | False | Route all requests through configured proxy |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
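On the receiving side, a webhook handler would typically recompute the HMAC over the raw request body and compare it with the signature sent alongside the payload. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the actual algorithm, encoding, and header name are not specified in this document, so confirm them before relying on this. `verify_webhook_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature (assumed scheme) over the raw body."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature_hex)
```

Always verify against the raw bytes as received, since re-serializing parsed JSON can change whitespace or key order and invalidate the signature.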
The crawl() blocking method accepts three additional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| poll_interval | float | 2.0 | Seconds between status polls |
| timeout | float | 300.0 | Maximum seconds to wait before raising TimeoutError |
| on_progress | Callable[[CrawlStatus], None] | None | Optional callback invoked after each poll with the latest CrawlStatus. Useful for progress bars or logging. |
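As an illustration of `on_progress`, here is a small sketch of a callback that renders a text progress bar. `make_progress_printer` is a hypothetical helper; it relies only on the `completed_pages` and `total_pages` fields of `CrawlStatus` and guards against a total of zero before any pages have been discovered.

```python
def make_progress_printer(width=20, write=print):
    """Build an on_progress callback that writes a text progress bar."""
    def on_progress(status):
        total = status.total_pages
        # Avoid dividing by zero before any pages have been discovered.
        fraction = status.completed_pages / total if total else 0.0
        filled = int(fraction * width)
        bar = "#" * filled + "-" * (width - filled)
        write(f"[{bar}] {status.completed_pages}/{total}")
    return on_progress
```

It would be passed as `client.crawl(url, on_progress=make_progress_printer())`.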
Streaming Crawl
Use crawl_stream() to receive pages in real time via NDJSON streaming, with no polling required. Pages are yielded as they are discovered and scraped.
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    for page in client.crawl_stream("https://docs.example.com", max_pages=50):
        print(f"{page.url}: {len(page.markdown or '')} chars")
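NDJSON means one JSON document per line, which is what lets each page be handled as soon as its line arrives. The following sketch shows the decoding step that `crawl_stream()` handles for you. The field names in the sample lines are hypothetical, and `iter_ndjson` is not part of the SDK.

```python
import json


def iter_ndjson(lines):
    """Yield one decoded object per non-empty line of an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample stream for illustration only.
sample = [
    '{"url": "https://docs.example.com/a", "success": true}\n',
    "\n",
    '{"url": "https://docs.example.com/b", "success": false}\n',
]
for record in iter_ndjson(sample):
    print(record["url"], record["success"])
```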
Streaming with Callbacks
Use crawl_stream_with_callback() for event-driven architectures. Provide callback functions instead of iterating.
from datablue import DataBlue

pages = []

with DataBlue(api_key="wh_your_api_key") as client:
    client.crawl_stream_with_callback(
        "https://example.com",
        max_pages=100,
        on_document=lambda page: pages.append(page),
        on_complete=lambda: print(f"Done! {len(pages)} pages"),
        on_error=lambda e: print(f"Error: {e}"),
    )
| Method | Returns | Behavior |
|---|---|---|
| crawl_stream(url, **opts) | Iterator[CrawlPageData] | Yields pages via NDJSON stream as they are discovered |
| crawl_stream_with_callback(url, on_document=..., **opts) | None | Callback-based streaming: on_document, on_complete, on_error |
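The two methods are closely related: a callback interface can be layered over any page iterator. The sketch below shows one plausible way to do that, not the SDK's actual implementation. `stream_with_callbacks` is a hypothetical function, and skipping `on_complete` after an error is an assumption.

```python
def stream_with_callbacks(pages, on_document, on_complete=None, on_error=None):
    """Drive an iterator of pages and dispatch to callbacks."""
    try:
        for page in pages:
            on_document(page)
    except Exception as exc:
        # Errors from the stream or from on_document both land here.
        if on_error is None:
            raise
        on_error(exc)
        return  # assumption: on_complete is skipped after an error
    if on_complete is not None:
        on_complete()
```

With a real client, `pages` would be the iterator returned by `client.crawl_stream(...)`.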
Response Model
class CrawlJob:  # from start_crawl()
    success: bool  # Whether the crawl job was accepted
    job_id: str  # Unique job identifier for polling crawl status
    status: str  # Current job status (typically "started")
    message: str | None  # Human-readable status message or error description

class CrawlStatus:  # from crawl() or get_crawl_status()
    success: bool  # Whether the status request succeeded
    job_id: str  # Unique job identifier
    status: str  # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    total_pages: int  # Total number of pages discovered for crawling
    completed_pages: int  # Number of pages successfully scraped so far
    data: list[CrawlPageData]  # List of scraped page results (CrawlPageData objects)
    total_results: int  # Total number of results available (for pagination)
    page: int  # Current page number in paginated results (default: 1)
    per_page: int  # Number of results per page (default: 20)
    error: str | None  # Human-readable error message if the crawl failed

    # Properties
    is_complete: bool  # True when status in {"completed", "failed", "cancelled"}
    progress: float  # completed_pages / total_pages (0.0 to 1.0)

class CrawlPageData:
    id: str | None  # Unique identifier for this page within the crawl job
    url: str | None  # The URL of the crawled page
    markdown: str | None  # Clean Markdown conversion of the page content
    fit_markdown: str | None  # Markdown trimmed to fit LLM context windows
    html: str | None  # Cleaned HTML content with boilerplate removed
    raw_html: str | None  # Original unmodified HTML source of the page
    links: list[str] | None  # List of URLs found on the page
    links_detail: dict | list | None  # Detailed link info including anchor text and attributes
    screenshot: str | None  # Base64-encoded PNG screenshot of the page
    structured_data: dict | None  # JSON-LD and microdata extracted from the page
    headings: list[dict] | None  # List of headings with level and text
    images: list[dict] | None  # List of images with src, alt, and dimensions
    extract: dict | None  # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None  # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None  # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None  # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None  # Error message if scraping this page failed
    success: bool  # Whether this individual page was scraped successfully
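The `content_hash` field lends itself to change detection between two crawls of the same site. A minimal sketch follows, assuming you have kept the page lists from both runs. `hash_index` and `changed_urls` are hypothetical helpers that only read the `url`, `content_hash`, and `success` fields described above.

```python
def hash_index(pages):
    """Map url -> content_hash for successfully scraped pages."""
    return {
        page.url: page.content_hash
        for page in pages
        if page.success and page.url and page.content_hash
    }


def changed_urls(previous, current):
    """URLs that are new or whose hash differs from the previous crawl."""
    return sorted(
        url for url, digest in current.items() if previous.get(url) != digest
    )
```

Comparing `hash_index(old_status.data)` with `hash_index(new_status.data)` then yields the pages worth re-processing. Pages that disappeared between crawls are not reported by this sketch.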