POST/v1/crawl

Crawl

Start a crawl job that discovers and scrapes pages starting from a seed URL using BFS, DFS, or best-first strategy. Returns a job ID for tracking progress via SSE or polling.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

NameTypeRequirementDescription
urlstringREQUIREDThe starting URL to crawl.
max_pagesnumberoptionalMaximum pages to crawl (1-1000).
max_depthnumberoptionalMaximum link depth from the starting URL (1-10).
concurrencynumberoptionalNumber of parallel scrape workers (1-10).
crawl_strategystringoptionalStrategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first).
allow_external_linksbooleanoptionalFollow links to external domains.
respect_robots_txtbooleanoptionalObey the target site's robots.txt rules.
include_pathsstring[]optionalOnly crawl URLs matching these glob path patterns.
exclude_pathsstring[]optionalSkip URLs matching these glob path patterns.
scrape_optionsobjectoptionalOptions passed to each page scrape: { formats, only_main_content, wait_for, timeout, include_tags, exclude_tags, mobile, extract }.
use_proxybooleanoptionalRoute all requests through a configured proxy.
filter_faceted_urlsbooleanoptionalDeduplicate faceted/navigation URL variations.
webhook_urlstringoptionalWebhook URL for crawl completion notification.
webhook_secretstringoptionalHMAC secret for webhook signature verification.

cURL Example

curl -X POST "https://api.datablue.dev/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://docs.python.org/3/",
  "max_pages": 50,
  "max_depth": 2,
  "crawl_strategy": "bfs",
  "include_paths": [
    "/3/library/*"
  ]
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Crawl job started"
}