POST/v1/crawl
Crawl
Start a crawl job that discovers and scrapes pages starting from a seed URL using BFS, DFS, or best-first strategy. Returns a job ID for tracking progress via SSE or polling.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| url | string | REQUIRED | The starting URL to crawl. |
| max_pages | number | optional | Maximum pages to crawl (1-1000). |
| max_depth | number | optional | Maximum link depth from the starting URL (1-10). |
| concurrency | number | optional | Number of parallel scrape workers (1-10). |
| crawl_strategy | string | optional | Strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first). |
| allow_external_links | boolean | optional | Follow links to external domains. |
| respect_robots_txt | boolean | optional | Obey the target site's robots.txt rules. |
| include_paths | string[] | optional | Only crawl URLs matching these glob path patterns. |
| exclude_paths | string[] | optional | Skip URLs matching these glob path patterns. |
| scrape_options | object | optional | Options passed to each page scrape: { formats, only_main_content, wait_for, timeout, include_tags, exclude_tags, mobile, extract }. |
| use_proxy | boolean | optional | Route all requests through a configured proxy. |
| filter_faceted_urls | boolean | optional | Deduplicate faceted/navigation URL variations. |
| webhook_url | string | optional | Webhook URL for crawl completion notification. |
| webhook_secret | string | optional | HMAC secret for webhook signature verification. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/crawl" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.python.org/3/",
"max_pages": 50,
"max_depth": 2,
"crawl_strategy": "bfs",
"include_paths": [
"/3/library/*"
]
}'System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
EXAMPLE RESPONSE
{
"success": true,
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started",
"message": "Crawl job started"
}