Introduction
DataBlue is a self-hosted web scraping and structured data API platform. It provides a Firecrawl-compatible REST API for scraping, crawling, searching, and extracting content from any website, plus 22+ Data APIs for Google, YouTube, Twitter, and Reddit.
Key Features
- 5-tier parallel scraping engine — HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass run in a staggered race. First valid result wins. Strategy cache remembers winning strategy per domain.
- Self-learning strategy cache — Domains that require stealth are automatically detected and upgraded to hard mode on subsequent requests. No manual configuration needed.
- 22+ structured Data APIs — Google Search, Maps, News, Finance. YouTube channels, videos, comments. Twitter profiles, tweets. Reddit subreddits, posts, users. All return clean JSON.
- 100% self-hosted — Deploy on your own infrastructure with Docker Compose. No third-party SaaS dependencies. Your data never leaves your servers.
- Firecrawl-compatible API — Drop-in replacement for Firecrawl. Same endpoint paths, same request/response shapes. Migrate existing integrations with zero code changes.
Search
Search the web with Google, DuckDuckGo, or Brave and get scraped content from each result page. Returns markdown, HTML, links, screenshots, and structured data.
Scrape
Scrape any URL with automatic anti-bot bypass. Returns markdown, HTML, raw HTML, links, screenshots, headings, images, structured data, and LLM-extracted fields.
Extract
Extract structured data from any page using natural language prompts and JSON Schema. Powered by LLMs (OpenAI, Anthropic, Groq). Returns typed JSON matching your schema.
Data APIs
22+ structured data endpoints for Google (Search, Maps, News, Finance), YouTube (channels, videos, comments), Twitter (profiles, tweets), and Reddit (subreddits, posts, users).
Make Your First Request
curl -X POST "https://api.datablue.dev/v1/scrape" \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"formats": ["markdown", "links"]
}'
Authentication
DataBlue supports two authentication methods. Both are sent as Authorization: Bearer <token> in the request header.
| Method | Format | TTL | Source |
|---|---|---|---|
| JWT Token | eyJ... | 7 days | POST /v1/auth/login |
| API Key | wh_... | Persistent | Dashboard → API Keys |
JWT Authentication
Obtain a JWT token by authenticating with your email and password:
curl -X POST "https://api.datablue.dev/v1/auth/login" \
-H "Content-Type: application/json" \
-d '{"email": "you@example.com", "password": "your_password"}'
# Response: {"access_token": "eyJ...", "token_type": "bearer"}
API Key Authentication (Recommended)
API keys are persistent and do not expire. Generate them from the Dashboard under the API Keys panel. All API keys use the wh_ prefix.
# Use your API key in every request
curl -X POST "https://api.datablue.dev/v1/scrape" \
-H "Authorization: Bearer wh_abc123def456" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Security note: API keys and JWT tokens are interchangeable in the Authorization header. Sensitive data (LLM keys, proxy credentials) is encrypted at rest with Fernet (AES-256).
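Because both credential types share one header, client code can tell them apart by prefix alone. A trivial sketch (the prefixes come from the table above; the helper name is mine):

```python
def token_kind(token: str) -> str:
    # wh_ marks a persistent API key; eyJ is the base64url JWT header prefix.
    if token.startswith("wh_"):
        return "api_key"
    if token.startswith("eyJ"):
        return "jwt"
    return "unknown"

print(token_kind("wh_abc123def456"))  # api_key
```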
Quick Start
Step 1: Sign Up
Create an account at datablue.dev or register via the API:
curl -X POST "https://api.datablue.dev/v1/auth/register" \
-H "Content-Type: application/json" \
-d '{"email": "you@example.com", "password": "secure_password", "name": "Your Name"}'
Step 2: Generate an API Key
Log into the Dashboard and navigate to API Keys. Click Create New Key to generate a persistent API key with the wh_ prefix.
Step 3: Make Your First Request
cURL
curl -X POST "https://api.datablue.dev/v1/scrape" \
-H "Authorization: Bearer wh_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"formats": ["markdown"]
}'
Python
import requests
response = requests.post(
    "https://api.datablue.dev/v1/scrape",
    headers={"Authorization": "Bearer wh_your_api_key"},
    json={
        "url": "https://news.ycombinator.com",
        "formats": ["markdown"]
    },
)
data = response.json()
print(data["data"]["markdown"][:200])
JavaScript
const response = await fetch("https://api.datablue.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Authorization": "Bearer wh_your_api_key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://news.ycombinator.com",
    formats: ["markdown"]
  })
});
const data = await response.json();
console.log(data.data.markdown.slice(0, 200));
Example Response
{
  "success": true,
  "data": {
    "markdown": "# Hacker News\n\n1. Show HN: I built an open-source web scraper...",
    "metadata": {
      "title": "Hacker News",
      "language": "en",
      "source_url": "https://news.ycombinator.com",
      "status_code": 200,
      "word_count": 1847
    }
  }
}
Installation
The official datablue Python SDK provides both synchronous and asynchronous clients for the DataBlue API. Built on httpx and Pydantic v2, it offers full type safety, automatic retries with exponential backoff, and strongly-typed response models.
Requirements
| Dependency | Version |
|---|---|
| Python | >= 3.10 |
| httpx | >= 0.27.0 |
| pydantic | >= 2.0.0 |
Install from PyPI
pip install datablue
Or with a specific version:
pip install datablue==2.0.0
Install with Poetry / uv
# Poetry
poetry add datablue
# uv
uv add datablue
Verify Installation
python -c "import datablue; print(datablue.__version__)"
# 2.0.0
Quick Start
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)
Async support: Every method available on DataBlue (sync) is also available on AsyncDataBlue (async) with the same signature. Use await and async with for the async variant.
Using with AI Assistants
The v2.0.0 SDK is designed to be AI-first. It ships with two machine-readable reference files that AI coding assistants (Claude Code, Cursor, GitHub Copilot, etc.) can read for accurate code generation:
- CLAUDE.md — A structured quick-reference file at the SDK root (sdk/CLAUDE.md) containing all method signatures, response models, error types, and common patterns. AI assistants automatically read this file for context.
- llms.txt — A standardized machine-readable documentation file following the llms.txt convention. Provides a condensed API surface for LLM consumption.
Tip: When using an AI coding assistant with the DataBlue SDK, point it at the sdk/ directory. The CLAUDE.md file contains every method signature, typed model, and error type with copy-pasteable examples. This eliminates hallucinated API calls and ensures the AI generates code that actually works.
SDK Features
- Sync + Async clients — DataBlue for synchronous code, AsyncDataBlue for asyncio/FastAPI/Django
- Pydantic v2 response models — every response is a typed model with autocomplete and validation
- Automatic retries — exponential backoff on 429 and 5xx errors, configurable max retries
- Context manager support — with / async with for clean resource management
- Job polling built-in — crawl/search blocking methods poll automatically with configurable timeout
- Batch scraping — concurrent scraping with semaphore-based concurrency control and streaming results
- Typed error hierarchy — catch specific errors like RateLimitError, AuthenticationError, etc.
- Environment variable config — zero-config setup with DataBlue.from_env()
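The retry behavior in the list above follows the usual exponential-backoff shape. The exact base delay and cap below are illustrative, not the SDK's documented schedule:

```python
def backoff_delays(max_retries: int = 3, base: float = 1.0, cap: float = 30.0) -> list[float]:
    # Delay before retry n doubles each attempt, up to a ceiling.
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays())  # [1.0, 2.0, 4.0]
```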
Authentication
The SDK supports three authentication methods: API key (recommended), environment variables, and email/password login.
API Key (Recommended)
Pass your API key directly to the constructor. API keys use the wh_ prefix and never expire.
from datablue import DataBlue
# Sync client
client = DataBlue(api_key="wh_your_api_key")
result = client.scrape("https://example.com")
client.close()
# Or use context manager (recommended)
# Or use context manager (recommended)
with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)

from datablue import AsyncDataBlue

# Async client
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    result = await client.scrape("https://example.com")
    print(result.data.markdown)
Environment Variables
Set DATABLUE_API_KEY and optionally DATABLUE_API_URL in your environment, then use from_env():
# Shell — set environment variables
export DATABLUE_API_KEY=wh_your_api_key
export DATABLUE_API_URL=https://api.datablue.dev # optional, default: http://localhost:8000
export DATABLUE_TIMEOUT=120 # optional, seconds (default: 60)
export DATABLUE_MAX_RETRIES=5 # optional (default: 3)
from datablue import DataBlue
# Reads all DATABLUE_* environment variables automatically
with DataBlue.from_env() as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)
| Variable | Required | Default | Description |
|---|---|---|---|
| DATABLUE_API_KEY | Yes | — | Your API key (wh_... prefix) |
| DATABLUE_API_URL | No | http://localhost:8000 | Base URL of the DataBlue API |
| DATABLUE_TIMEOUT | No | 60 | Request timeout in seconds |
| DATABLUE_MAX_RETRIES | No | 3 | Max retry attempts on transient errors |
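Under the hood, from_env() resolves roughly what the table above describes: one required key plus three optional settings with defaults. A sketch of that resolution (the real SDK reads os.environ; the dict parameter here just makes the logic easy to test):

```python
def resolve_config(env: dict[str, str]) -> dict:
    # Mirrors the documented DATABLUE_* variables and their defaults.
    if "DATABLUE_API_KEY" not in env:
        raise ValueError("DATABLUE_API_KEY is required")
    return {
        "api_key": env["DATABLUE_API_KEY"],
        "api_url": env.get("DATABLUE_API_URL", "http://localhost:8000"),
        "timeout": int(env.get("DATABLUE_TIMEOUT", "60")),
        "max_retries": int(env.get("DATABLUE_MAX_RETRIES", "3")),
    }

print(resolve_config({"DATABLUE_API_KEY": "wh_demo"})["timeout"])  # 60
```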
Email/Password Login
Use login() to authenticate with email and password. This obtains a JWT token (7-day TTL) and stores it internally on the client instance.
from datablue import DataBlue
with DataBlue(api_url="https://api.datablue.dev") as client:
    # Authenticate — JWT is stored automatically
    auth = client.login("you@example.com", "your_password")
    print(f"Token: {auth['access_token'][:20]}...")
    # All subsequent requests use the JWT
    result = client.scrape("https://example.com")
    print(result.data.markdown)

# Async variant
from datablue import AsyncDataBlue

async with AsyncDataBlue(api_url="https://api.datablue.dev") as client:
    await client.login("you@example.com", "your_password")
    result = await client.scrape("https://example.com")
Recommendation: Use API keys for production and CI/CD. Use email/password login only for interactive scripts or development. API keys are persistent and do not expire, while JWT tokens expire after 7 days.
Scrape
Scrape a single URL and return structured content. The scrape() method blocks until the page is scraped and returns the result directly. The scraping engine uses a 5-tier parallel race: HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass — first valid result wins.
Basic Usage
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.success)        # True
    print(result.data.markdown)  # "# Example Domain\n\nThis domain is..."
    print(result.data.metadata)  # PageMetadata(title="Example Domain", ...)
With All Options
result = client.scrape(
    "https://news.ycombinator.com",
    formats=["markdown", "html", "links", "screenshot"],
    only_main_content=True,
    wait_for=2000,                        # wait 2s after page load
    timeout=45000,                        # 45s timeout
    include_tags=["article", "main"],     # only these HTML tags
    exclude_tags=["nav", "footer"],       # remove these tags
    headers={"Accept-Language": "en-US"},
    cookies={"session": "abc123"},
    mobile=True,
    mobile_device="iphone_14",
    css_selector="main.content",          # target specific element
    use_proxy=True,
    capture_network=True,
)

print(result.data.markdown)
print(result.data.html)
print(result.data.links)
print(result.data.screenshot)  # base64-encoded PNG
print(result.data.metadata.title)
print(result.data.metadata.word_count)
Saving Screenshots
Screenshots are returned as base64-encoded PNG strings. Save to a file, embed in HTML, or pass directly to an LLM:
import base64
result = client.scrape("https://example.com", formats=["screenshot"])
# Save to file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.screenshot))
# Embed in HTML
html_img = f'<img src="data:image/png;base64,{result.data.screenshot}" />'
# Store in database as a string field
db.save(url=result.data.metadata.source_url, screenshot=result.data.screenshot)
Async
from datablue import AsyncDataBlue
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    result = await client.scrape(
        "https://example.com",
        formats=["markdown", "links"],
        only_main_content=True,
    )
    print(result.data.markdown)
Browser Actions
Execute browser actions before content extraction. Useful for pages that require interaction (clicking buttons, filling forms, scrolling to load content).
result = client.scrape(
    "https://example.com/infinite-scroll",
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down", "amount": 3},
        {"type": "screenshot"},
    ],
)
Action Types Reference
| Action Type | Required Fields | Optional Fields | Description |
|---|---|---|---|
| click | selector | button, click_count, modifiers | Click an element |
| wait | — | milliseconds | Wait for duration |
| scroll | — | direction (up/down), amount | Scroll the page |
| type | selector, text | — | Type text into input |
| screenshot | — | — | Capture screenshot |
| hover | selector | — | Hover over element |
| press | key | modifiers | Press keyboard key |
| select | selector, value | — | Select dropdown option |
| fill_form | fields | — | Fill multiple form fields |
| evaluate | script | — | Run JavaScript |
| go_back | — | — | Navigate back |
| go_forward | — | — | Navigate forward |
LLM Extraction
Extract structured data from a page using an LLM. Pass a prompt and optional schema (JSON Schema) in the extract parameter.
result = client.scrape(
    "https://openai.com/pricing",
    extract={
        "prompt": "Extract all pricing tiers with name and price",
        "schema": {
            "type": "object",
            "properties": {
                "tiers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                },
            },
        },
    },
)

print(result.data.extract)  # {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, ...]}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | The URL to scrape |
| formats | list[Literal["markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images"]] | None | Output formats to return. Defaults to ["markdown"] if not specified. |
| only_main_content | bool | True | Extract only main content, removing nav/footer/sidebar |
| wait_for | int | 0 | Milliseconds to wait after page load before extracting |
| timeout | int | 30000 | Request timeout in milliseconds |
| include_tags | list[str] | None | Only include these HTML tags in extraction |
| exclude_tags | list[str] | None | Exclude these HTML tags from extraction |
| headers | dict[str, str] | None | Custom HTTP headers to send with the request |
| cookies | dict[str, str] | None | Custom cookies to send as name/value pairs |
| mobile | bool | False | Emulate a mobile device viewport |
| mobile_device | str | None | Device preset: "iphone_14", "pixel_7", "ipad_pro" |
| css_selector | str | None | Only extract content matching this CSS selector |
| xpath | str | None | XPath expression for targeted extraction |
| selectors | dict[str, Any] | None | Named CSS selectors for multi-element extraction |
| actions | list[dict] \| list[ActionStep] | None | Browser actions to execute before scraping. Accepts raw dicts or typed ActionStep model instances. |
| extract | dict \| ExtractConfig | None | LLM extraction config. Accepts a dict with "prompt" and "schema" keys, or a typed ExtractConfig model. |
| capture_network | bool | False | Capture browser network requests. Results appear in result.data.network_data as a dict with request/response details. |
| use_proxy | bool | False | Route through configured proxy for anti-bot bypass |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
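The webhook_secret parameter implies receiver-side HMAC verification. Below is a sketch of what that verification typically looks like; the signature header name and exact signing scheme (HMAC-SHA256 over the raw body, hex-encoded) are assumptions, so check them against your deployment:

```python
import hashlib
import hmac

def sign_body(secret: str, body: bytes) -> str:
    # Assumed scheme: HMAC-SHA256 over the raw request body, hex-encoded.
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def verify_webhook(secret: str, body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the signature through timing differences.
    return hmac.compare_digest(sign_body(secret, body), signature)

body = b'{"success": true, "job_id": "abc"}'
sig = sign_body("my_webhook_secret", body)
print(verify_webhook("my_webhook_secret", body, sig))  # True
```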
Response Model
Returns ScrapeResult — a Pydantic model with the following fields:
class ScrapeResult:
    success: bool           # Whether the scrape request succeeded
    data: PageData | None   # Scraped page content including markdown, html, links, and metadata
    error: str | None       # Human-readable error message if the scrape failed
    error_code: str | None  # Machine-readable error code (BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, NETWORK_ERROR)
    job_id: str | None      # Unique job identifier for async scrape requests

class PageData:
    url: str | None                        # The URL of the scraped page
    markdown: str | None                   # Clean Markdown conversion of the page content
    fit_markdown: str | None               # Markdown trimmed to fit LLM context windows
    html: str | None                       # Cleaned HTML content with boilerplate removed
    raw_html: str | None                   # Original unmodified HTML source of the page
    links: list[str] | None                # List of URLs found on the page
    links_detail: list[dict] | None        # Detailed link info including anchor text and attributes
    screenshot: str | None                 # Base64-encoded PNG screenshot of the page
    structured_data: dict | None           # JSON-LD and microdata extracted from the page
    headings: list[dict] | None            # List of headings with level and text (e.g. [{level: 1, text: "Title"}])
    images: list[dict] | None              # List of images with src, alt, and dimensions
    extract: dict | None                   # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None           # Source citations for extracted content
    markdown_with_citations: str | None    # Markdown content with inline citation references
    content_hash: str | None               # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None          # Page metadata including title, status_code, word_count, and SEO tags

class PageMetadata:
    title: str | None               # Page title from <title> tag
    description: str | None         # Meta description from <meta name='description'>
    language: str | None            # Page language code (e.g. 'en', 'fr', 'de')
    source_url: str | None          # The URL that was actually scraped (after redirects)
    status_code: int | None         # HTTP response status code (200, 404, 500, etc.)
    word_count: int                 # Number of words in the main content
    reading_time_seconds: int       # Estimated reading time in seconds
    content_length: int             # Response body size in bytes
    og_image: str | None            # OpenGraph image URL for social sharing previews
    canonical_url: str | None       # Canonical URL from <link rel='canonical'>
    favicon: str | None             # Favicon URL
    robots: str | None              # Robots meta tag content (e.g. 'noindex, nofollow')
    response_headers: dict | None   # HTTP response headers as key-value pairs
Crawl
Crawl a website starting from a seed URL, discovering and scraping pages via BFS, DFS, or best-first strategy.
The SDK provides two methods: crawl() blocks until all pages are scraped, while start_crawl() returns immediately for manual polling.
Blocking Crawl
The simplest approach. crawl() starts the job and polls until completion, then returns all results at once.
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    status = client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
        include_paths=["/3/library/*"],
        poll_interval=2.0,  # check every 2 seconds
        timeout=300.0,      # give up after 5 minutes
    )
    print(f"Status: {status.status}")  # "completed"
    print(f"Pages: {status.completed_pages}/{status.total_pages}")
    print(f"Progress: {status.progress:.0%}")  # "100%"
    for page in status.data:
        print(f"  {page.url} — {page.metadata.word_count} words")
Non-blocking Crawl (Manual Polling)
Use start_crawl() to get a job ID, then poll with get_crawl_status() at your own pace.
import time
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    # Start the crawl — returns immediately
    job = client.start_crawl(
        "https://docs.python.org/3/",
        max_pages=100,
        max_depth=3,
        concurrency=5,
        crawl_strategy="bfs",
    )
    print(f"Job ID: {job.job_id}")
    print(f"Status: {job.status}")  # "started"

    # Poll until done
    while True:
        status = client.get_crawl_status(job.job_id)
        print(f"  Progress: {status.completed_pages}/{status.total_pages}")
        if status.is_complete:
            break
        time.sleep(3)

    # Process results
    for page in status.data:
        if page.success:
            print(f"  {page.url}: {len(page.markdown or '')} chars")
        else:
            print(f"  {page.url}: FAILED — {page.error}")
Paginating Large Crawls
Results are paginated (20 per page by default). For crawls with more than 20 pages, you must paginate through the results to retrieve all scraped data:
# Get page 2 of results
status = client.get_crawl_status(job.job_id) # page 1 (default)
# Access pagination info:
# status.total_results — total pages crawled
# status.page — current page number
# status.per_page — results per page (default 20)
Use the total_results, page, and per_page fields on the CrawlStatus response to determine how many pages of results exist and iterate through them. If your crawl returns fewer than 20 pages, all results will be in the first response.
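Given total_results and per_page from CrawlStatus, the number of result pages is a ceiling division. A small sketch (how to request a specific result page is not shown above, so this only computes the count):

```python
import math

def result_page_count(total_results: int, per_page: int = 20) -> int:
    # 47 results at 20 per page means 3 status calls to see everything.
    return math.ceil(total_results / per_page) if total_results > 0 else 0

print(result_page_count(47))  # 3
```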
Cancel a Crawl
result = client.cancel_crawl(job.job_id)
print(result) # {"success": True, "message": "Crawl cancelled"}
Async
from datablue import AsyncDataBlue
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    # Blocking async crawl
    status = await client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
    )
    for page in status.data:
        print(page.url)

    # Non-blocking: start + poll
    job = await client.start_crawl("https://example.com", max_pages=20)
    status = await client.get_crawl_status(job.job_id)
    await client.cancel_crawl(job.job_id)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | Starting URL to crawl |
| max_pages | int | 100 | Maximum pages to crawl (1–1000) |
| max_depth | int | 3 | Maximum link depth from starting URL (1–10) |
| concurrency | int | 3 | Parallel scrape workers (1–10) |
| include_paths | list[str] | None | Only crawl URLs matching these glob patterns |
| exclude_paths | list[str] | None | Skip URLs matching these glob patterns |
| allow_external_links | bool | False | Follow links to external domains |
| respect_robots_txt | bool | True | Obey the site's robots.txt rules |
| filter_faceted_urls | bool | True | Deduplicate faceted/navigation URL variations |
| crawl_strategy | Literal["bfs", "dfs", "bff"] | None | "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first frontier) |
| scrape_options | dict | None | Options passed to each page scrape (formats, only_main_content, wait_for, etc.) |
| use_proxy | bool | False | Route all requests through configured proxy |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
The crawl() blocking method accepts three additional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| poll_interval | float | 2.0 | Seconds between status polls |
| timeout | float | 300.0 | Maximum seconds to wait before raising TimeoutError |
| on_progress | Callable[[CrawlStatus], None] | None | Optional callback invoked after each poll with the latest CrawlStatus. Useful for progress bars or logging. |
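An on_progress callback receives the latest CrawlStatus on each poll. Here is a small formatter that works with any object exposing the documented fields; the SimpleNamespace stand-in exists only so the sketch runs without a live crawl:

```python
from types import SimpleNamespace

def format_progress(status) -> str:
    # Renders completed/total plus the progress property as a percentage.
    return f"{status.completed_pages}/{status.total_pages} ({status.progress:.0%})"

fake = SimpleNamespace(completed_pages=25, total_pages=50, progress=0.5)
print(format_progress(fake))  # 25/50 (50%)
```

In a real crawl you would pass it as on_progress=lambda s: print(format_progress(s)).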
Streaming Crawl
Use crawl_stream() to receive pages in real time via NDJSON streaming — no polling required. Pages are yielded as they are discovered and scraped.
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    for page in client.crawl_stream("https://docs.example.com", max_pages=50):
        print(f"{page.url} — {len(page.markdown or '')} chars")
Streaming with Callbacks
Use crawl_stream_with_callback() for event-driven architectures. Provide callback functions instead of iterating.
from datablue import DataBlue
pages = []
with DataBlue(api_key="wh_your_api_key") as client:
    client.crawl_stream_with_callback(
        "https://example.com",
        max_pages=100,
        on_document=lambda page: pages.append(page),
        on_complete=lambda: print(f"Done! {len(pages)} pages"),
        on_error=lambda e: print(f"Error: {e}"),
    )
| Method | Returns | Behavior |
|---|---|---|
| crawl_stream(url, **opts) | Iterator[CrawlPageData] | Yields pages via NDJSON stream as they are discovered |
| crawl_stream_with_callback(url, on_document=..., **opts) | None | Callback-based streaming: on_document, on_complete, on_error |
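NDJSON framing means each line of the stream is one standalone JSON document. A sketch of the parsing the SDK handles for you (the field names in the sample lines are illustrative):

```python
import json

def parse_ndjson(raw: str) -> list[dict]:
    # One JSON object per non-blank line.
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

raw = '{"url": "https://example.com/"}\n{"url": "https://example.com/about"}\n'
print(len(parse_ndjson(raw)))  # 2
```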
Response Model
class CrawlJob:  # from start_crawl()
    success: bool        # Whether the crawl job was accepted
    job_id: str          # Unique job identifier for polling crawl status
    status: str          # Current job status (typically "started")
    message: str | None  # Human-readable status message or error description

class CrawlStatus:  # from crawl() or get_crawl_status()
    success: bool              # Whether the status request succeeded
    job_id: str                # Unique job identifier
    status: str                # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    total_pages: int           # Total number of pages discovered for crawling
    completed_pages: int       # Number of pages successfully scraped so far
    data: list[CrawlPageData]  # List of scraped page results (CrawlPageData objects)
    total_results: int         # Total number of results available (for pagination)
    page: int                  # Current page number in paginated results (default: 1)
    per_page: int              # Number of results per page (default: 20)
    error: str | None          # Human-readable error message if the crawl failed

    # Properties
    is_complete: bool          # True when status in {"completed", "failed", "cancelled"}
    progress: float            # completed_pages / total_pages (0.0 to 1.0)

class CrawlPageData:
    id: str | None                       # Unique identifier for this page within the crawl job
    url: str | None                      # The URL of the crawled page
    markdown: str | None                 # Clean Markdown conversion of the page content
    fit_markdown: str | None             # Markdown trimmed to fit LLM context windows
    html: str | None                     # Cleaned HTML content with boilerplate removed
    raw_html: str | None                 # Original unmodified HTML source of the page
    links: list[str] | None              # List of URLs found on the page
    links_detail: dict | list | None     # Detailed link info including anchor text and attributes
    screenshot: str | None               # Base64-encoded PNG screenshot of the page
    structured_data: dict | None         # JSON-LD and microdata extracted from the page
    headings: list[dict] | None          # List of headings with level and text
    images: list[dict] | None            # List of images with src, alt, and dimensions
    extract: dict | None                 # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None         # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None             # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None        # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                    # Error message if scraping this page failed
    success: bool                        # Whether this individual page was scraped successfully
Search
Search the web using Google (via SearXNG), DuckDuckGo, or Brave, then scrape each result page and return structured content.
Like crawl, search offers a blocking search() method and a non-blocking start_search() method.
Blocking Search
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    status = client.search(
        "best python web scraping libraries 2026",
        num_results=5,
        engine="google",
        formats=["markdown"],
        only_main_content=True,
    )
    print(f"Query: {status.query}")
    print(f"Results: {status.completed_results}/{status.total_results}")
    for item in status.data:
        print(f"\n--- {item.title} ---")
        print(f"URL: {item.url}")
        print(f"Snippet: {item.snippet}")
        if item.markdown:
            print(f"Content: {item.markdown[:200]}...")
Non-blocking Search
import time
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
    job = client.start_search(
        "machine learning frameworks comparison",
        num_results=10,
        engine="google",
    )
    print(f"Job ID: {job.job_id}")

    while True:
        status = client.get_search_status(job.job_id)
        print(f"  Progress: {status.completed_results}/{status.total_results}")
        if status.is_complete:
            break
        time.sleep(2)

    for item in status.data:
        print(f"  {item.title}: {item.url}")
Async
from datablue import AsyncDataBlue
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    status = await client.search(
        "latest AI research papers",
        num_results=5,
        engine="google",
    )
    for item in status.data:
        print(item.title, item.url)
With LLM Extraction
Apply LLM extraction to each search result for structured output:
status = client.search(
    "python web frameworks comparison",
    num_results=3,
    extract={
        "prompt": "Extract the framework name, pros, and cons",
        "schema": {
            "type": "object",
            "properties": {
                "framework": {"type": "string"},
                "pros": {"type": "array", "items": {"type": "string"}},
                "cons": {"type": "array", "items": {"type": "string"}},
            },
        },
    },
)

for item in status.data:
    if item.extract:
        print(f"{item.extract['framework']}: {item.extract['pros']}")
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | required | The search query string |
| num_results | int | 5 | Number of search results to scrape |
| engine | Literal["google", "duckduckgo", "brave"] | "google" | Search engine: "google" (via SearXNG), "duckduckgo", or "brave" |
| formats | list[str] | None | Output formats for scraped results: "markdown", "html", "links", etc. |
| only_main_content | bool | False | Extract only main content from each result page |
| headers | dict[str, str] | None | Custom HTTP headers for scraping result pages |
| cookies | dict[str, str] | None | Custom cookies for scraping result pages |
| mobile | bool | False | Emulate mobile device viewport for scraping |
| mobile_device | str | None | Mobile device preset name |
| extract | dict | None | LLM extraction config applied to each result: { "prompt", "schema" } |
| use_proxy | bool | False | Route requests through configured proxy |
| google_api_key | str | None | Google Custom Search API key (alternative to SearXNG) |
| google_cx | str | None | Google Custom Search Engine ID |
| brave_api_key | str | None | Brave Search API key |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
The search() blocking method accepts three additional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| poll_interval | float | 2.0 | Seconds between status polls |
| timeout | float | 300.0 | Maximum seconds to wait before raising TimeoutError |
| on_progress | Callable[[SearchStatus], None] | None | Optional callback invoked after each poll with the latest SearchStatus. Useful for progress bars or logging. |
Response Model
class SearchJob: # from start_search()
success: bool # Whether the search job was accepted
job_id: str # Unique job identifier for polling search status
status: str # Current job status (typically "started")
message: str | None # Human-readable status message or error description
class SearchStatus: # from search() or get_search_status()
job_id: str # Unique job identifier
status: str # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
query: str | None # The search query that was executed
total_results: int # Total number of search results found
completed_results: int # Number of results successfully scraped so far
data: list[SearchResultItem] # List of search result items with optional scraped content
error: str | None # Human-readable error message if the search failed
# Properties
is_complete: bool # True when status in {"completed", "failed", "cancelled"}
progress: float # completed_results / total_results (0.0 to 1.0)
class SearchResultItem:
id: str | None # Unique identifier for this search result
url: str # URL of the search result
title: str | None # Title of the search result from the search engine
snippet: str | None # Search result snippet/description text
success: bool # Whether this result was successfully scraped
markdown: str | None # Clean Markdown conversion of the scraped page content
html: str | None # Cleaned HTML content of the scraped page
links: list[str] | None # List of URLs found on the scraped page
links_detail: list[dict] | None # Detailed link info including anchor text and attributes
screenshot: str | None # Base64-encoded PNG screenshot of the page
structured_data: dict | None # JSON-LD and microdata extracted from the page
headings: list[dict] | None # List of headings with level and text
images: list[dict] | None # List of images with src, alt, and dimensions
extract: dict | None # LLM-extracted structured data matching the provided schema
metadata: PageMetadata | None # Page metadata including title, status_code, word_count, and SEO tags
error: str | None # Error message if scraping this result failed
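The progress property lends itself to a simple textual indicator, for example inside an on_progress callback. A minimal sketch using plain integers in place of a SearchStatus (the render_progress helper is illustrative, not part of the SDK):

```python
def render_progress(completed: int, total: int, width: int = 10) -> str:
    """Render a fixed-width progress bar like '[#####-----] 5/10'."""
    fraction = completed / total if total else 0.0
    filled = int(fraction * width)
    return f"[{'#' * filled}{'-' * (width - filled)}] {completed}/{total}"
```

In a callback, render_progress(status.completed_results, status.total_results) would print one line per poll.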
Map
Discover all URLs on a website by combining sitemap.xml parsing, robots.txt discovery, and link crawling.
The map() method returns a flat list of discovered URLs with metadata. This is useful for understanding site structure before launching a targeted crawl.
Basic Usage
from datablue import DataBlue
with DataBlue(api_key="wh_your_api_key") as client:
result = client.map("https://docs.python.org")
print(f"Total URLs: {result.total}")
for link in result.links:
print(f" {link.url}")
if link.title:
print(f" Title: {link.title}")
if link.lastmod:
print(f" Last modified: {link.lastmod}")
With Search Filter
# Only find URLs containing "tutorial"
result = client.map(
"https://docs.python.org",
search="tutorial",
limit=50,
)
print(f"Found {result.total} tutorial URLs")
for link in result.links:
print(f" {link.url}")
URL Shorthand
Use the urls property to get a flat list of URL strings:
result = client.map("https://example.com", limit=200)
# Get just the URLs as a plain list
url_list = result.urls # ["https://example.com/", "https://example.com/about", ...]
print(f"Found {len(url_list)} URLs")
# Feed into a crawl or batch scrape
crawl = client.crawl(url_list[0], max_pages=len(url_list))
Async
from datablue import AsyncDataBlue
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
result = await client.map(
"https://example.com",
limit=500,
include_subdomains=True,
)
print(f"Found {result.total} URLs")
for url in result.urls:
print(url)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | Website URL to map |
| search | str | None | Filter URLs matching this search string |
| limit | int | 100 | Maximum number of URLs to return |
| include_subdomains | bool | True | Include URLs from subdomains |
| use_sitemap | bool | True | Parse sitemap.xml for URL discovery |
Response Model
class MapResult:
success: bool # Whether the map request succeeded
total: int # Total number of links discovered on the site
links: list[LinkResult] # List of discovered links with URL and optional metadata
error: str | None # Human-readable error message if the map failed
job_id: str | None # Unique job identifier for async map requests
# Properties
urls: list[str] # Convenience: [link.url for link in links]
class LinkResult:
url: str # The discovered URL
title: str | None # Page title (from sitemap or page metadata)
description: str | None # Page description (from sitemap or meta tags)
lastmod: str | None # Last modification date in ISO 8601 format (from sitemap)
priority: float | None # Sitemap priority value between 0.0 and 1.0
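Because LinkResult carries sitemap priority, discovered URLs can be ordered before feeding them into a crawl. A sketch using plain dicts as stand-ins for LinkResult objects (rank_links is illustrative, not an SDK method):

```python
def rank_links(links: list[dict]) -> list[str]:
    """Order links by sitemap priority (highest first), then by URL for stability."""
    ordered = sorted(
        links,
        key=lambda link: (-(link.get("priority") or 0.0), link["url"]),
    )
    return [link["url"] for link in ordered]
```

Links without a priority value sort last, which matches the usual sitemap convention of treating missing priority as low.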
Batch Scrape
Scrape multiple URLs in a single call. The sync client runs sequentially, while the async client runs concurrently with configurable parallelism. For the async client, batch_scrape_iter() yields results as they complete for streaming processing.
Sync Batch Scrape
from datablue import DataBlue
urls = [
"https://example.com",
"https://news.ycombinator.com",
"https://github.com",
"https://stackoverflow.com",
]
with DataBlue(api_key="wh_your_api_key") as client:
results = client.batch_scrape(urls)  # sync client runs sequentially; concurrency applies to the async client
for result in results:
if result.success:
print(f"{result.data.url}: {result.data.metadata.word_count} words")
else:
print(f"FAILED: {result.error}")
Sync with Scrape Options
results = client.batch_scrape(
urls,
concurrency=3,
scrape_options={
"formats": ["markdown", "links"],
"only_main_content": True,
"timeout": 45000,
},
)
Async Batch (Collect All)
from datablue import AsyncDataBlue
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
results = await client.batch_scrape(
urls,
concurrency=10,
scrape_options={"formats": ["markdown"]},
)
print(f"Scraped {len(results)} pages")
for r in results:
print(f" {r.data.url}: {r.success}")
Async Streaming (Recommended for Large Batches)
Use batch_scrape_iter() to process results as they arrive — no need to wait for all pages to finish before starting processing.
from datablue import AsyncDataBlue
urls = ["https://example.com/page/" + str(i) for i in range(100)]
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
completed = 0
async for result in client.batch_scrape_iter(urls, concurrency=10):
completed += 1
if result.success:
print(f"[{completed}/{len(urls)}] {result.data.url} — {result.data.metadata.word_count} words")
else:
print(f"[{completed}/{len(urls)}] FAILED: {result.error}")
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | list[str] | required | List of URLs to scrape |
| concurrency | int | 5 | Maximum concurrent requests (async only; sync runs sequentially) |
| scrape_options | dict | None | Options passed to each scrape: formats, only_main_content, timeout, etc. |
Methods Summary
| Method | Client | Returns | Behavior |
|---|---|---|---|
| batch_scrape() | DataBlue | list[ScrapeResult] | Sequential, blocks until all done |
| batch_scrape() | AsyncDataBlue | list[ScrapeResult] | Concurrent, blocks until all done |
| batch_scrape_iter() | AsyncDataBlue | AsyncIterator[ScrapeResult] | Concurrent, yields as completed |
Note: The sync client runs batch scrape sequentially regardless of the concurrency parameter. For parallel execution, use the async client's batch_scrape() or batch_scrape_iter().
Error resilience: Batch methods never raise on individual page failures. Failed pages return a ScrapeResult with success=False and the error message in the error field. Always check result.success before accessing result.data.
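Since failures come back as results rather than exceptions, a common follow-up is splitting a batch into successes and failures. A sketch over any objects exposing a success attribute (partition_results is illustrative, not an SDK method):

```python
def partition_results(results):
    """Split batch results into (succeeded, failed) by the success flag."""
    succeeded = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return succeeded, failed
```

Failed results can then be retried as a smaller batch, while successes move on to processing.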
Error Handling
The SDK raises typed exceptions for all API errors. Every exception inherits from DataBlueError, making it easy to catch all errors or handle specific types.
The HTTP client automatically retries transient errors (429 and 5xx) with exponential backoff before raising.
Exception Hierarchy
DataBlueError # Base exception for all SDK errors
AuthenticationError # 401 — bad or missing API key / JWT
NotFoundError # 404 — resource does not exist
RateLimitError # 429 — rate limit exceeded (retryable)
ServerError # 5xx — server error (retryable)
JobFailedError # Polled job completed with "failed" status
TimeoutError # Polling timeout exceeded
Basic Error Handling
from datablue import (
DataBlue,
DataBlueError,
AuthenticationError,
RateLimitError,
NotFoundError,
ServerError,
JobFailedError,
TimeoutError,
)
with DataBlue(api_key="wh_your_api_key") as client:
try:
result = client.scrape("https://example.com")
except AuthenticationError as e:
print(f"Auth failed: {e.message}")
print(f"Status: {e.status_code}") # 401
print(f"Docs: {e.docs_url}") # https://docs.datablue.dev/errors/authentication
except RateLimitError as e:
print(f"Rate limited: {e.message}")
print(f"Retry after: {e.retry_after}s") # seconds to wait
print(f"Retryable: {e.is_retryable}") # True
except NotFoundError as e:
print(f"Not found: {e.message}") # 404
except ServerError as e:
print(f"Server error ({e.status_code}): {e.message}")
print(f"Retryable: {e.is_retryable}") # True
except DataBlueError as e:
print(f"API error: {e.message}")
print(f"Status: {e.status_code}")
print(f"Body: {e.response_body}")
Job Errors (Crawl / Search)
from datablue import DataBlue, JobFailedError, TimeoutError
with DataBlue(api_key="wh_your_api_key") as client:
try:
status = client.crawl(
"https://example.com",
max_pages=100,
timeout=60.0, # fail if not done in 60s
)
except TimeoutError as e:
print(f"Timed out after {e.elapsed:.1f}s")
print(f"Job ID: {e.job_id}")
# Optionally cancel the still-running job
client.cancel_crawl(e.job_id)
except JobFailedError as e:
print(f"Job failed: {e.message}")
print(f"Job ID: {e.job_id}")
print(f"Response: {e.response_body}")
Exception Attributes
| Attribute | Type | Available On | Description |
|---|---|---|---|
| message | str | All | Human-readable error description |
| status_code | int | None | All | HTTP status code (if from API response) |
| response_body | dict | None | All | Raw API response body |
| is_retryable | bool | All | Whether the request can be safely retried |
| retry_after | float | None | RateLimitError | Seconds to wait before retrying |
| docs_url | str | None | All | Link to documentation for this error type |
| job_id | str | None | JobFailedError, TimeoutError | Job ID that failed or timed out |
| elapsed | float | None | TimeoutError | Seconds elapsed before timeout |
AI-Friendly Error Messages
v2.0.0 errors include fix suggestions directly in the message, making them useful for both humans and AI coding assistants:
# AuthenticationError message includes fix instructions:
# "Authentication failed. Set DATABLUE_API_KEY environment variable
# or pass api_key to DataBlue(api_key='wh_...')"
# RateLimitError includes wait time:
# "Rate limit exceeded. Wait 42s before retrying,
# or reduce request frequency."
# TimeoutError includes fix suggestion:
# "Job crawl-abc123 did not complete within 300s.
# Try increasing the timeout parameter."
# ServerError indicates auto-retry:
# "Server error (502). This request will be automatically retried."
Automatic Retries
The SDK automatically retries on transient errors before raising an exception:
- 429 (Rate Limit) — waits for the Retry-After header, or uses exponential backoff (max 30s)
- 5xx (Server Error) — exponential backoff: 0.5s, 1s, 2s (max 10s per wait)
- Connection errors — same exponential backoff as 5xx
- Max retries: 3 by default, configurable via the max_retries parameter
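The 5xx schedule follows the delay = backoff_factor * 2^attempt rule documented under ClientConfig Fields, with a per-wait cap. A sketch that reproduces the documented 0.5s, 1s, 2s sequence:

```python
def backoff_schedule(max_retries: int = 3, factor: float = 0.5, cap: float = 10.0) -> list[float]:
    """Per-attempt wait times: factor * 2^attempt, capped per individual wait."""
    return [min(cap, factor * 2**attempt) for attempt in range(max_retries)]
```

With the defaults this yields [0.5, 1.0, 2.0]; raising max_retries extends the doubling until the 10s cap kicks in.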
Configuration
The SDK uses an immutable ClientConfig dataclass for all configuration. You can pass parameters directly to the constructor, use environment variables, or build a config object manually.
Constructor Parameters
from datablue import DataBlue
client = DataBlue(
api_url="https://api.datablue.dev", # Base URL (default: http://localhost:8000)
api_key="wh_your_api_key", # API key (wh_ prefix)
timeout=120.0, # Request timeout in seconds (default: 60)
max_retries=5, # Max retry attempts (default: 3)
)
ClientConfig Object
For advanced control, build a ClientConfig and pass it to the constructor:
from datablue import DataBlue, ClientConfig
config = ClientConfig(
api_url="https://api.datablue.dev",
api_key="wh_your_api_key",
timeout=120.0,
max_retries=5,
backoff_factor=1.0, # Multiplier for exponential backoff (default: 0.5)
)
client = DataBlue(config=config)
Config from Environment
from datablue import DataBlue, ClientConfig
# Build config from DATABLUE_* env vars
config = ClientConfig.from_env()
# Use with either client type
sync_client = DataBlue(config=config)
from datablue import AsyncDataBlue, ClientConfig
config = ClientConfig.from_env()
async_client = AsyncDataBlue(config=config)
Cloning Configs
Configs are immutable (frozen dataclass). Use clone() to create modified copies for different environments:
from datablue import DataBlue, ClientConfig
# Base config
prod = ClientConfig(
api_url="https://api.datablue.dev",
api_key="wh_prod_key",
timeout=60.0,
max_retries=3,
)
# Derive staging config (inherits everything except overrides)
staging = prod.clone(
api_url="https://staging.datablue.dev",
api_key="wh_staging_key",
)
# Derive a fast config for time-sensitive operations
fast = prod.clone(timeout=10.0, max_retries=1)
# Use each
with DataBlue(config=prod) as client:
result = client.scrape("https://example.com")
ClientConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
| api_url | str | http://localhost:8000 | Base URL of the DataBlue API (trailing slash auto-stripped) |
| api_key | str | None | None | API key with wh_ prefix |
| timeout | float | 60.0 | HTTP request timeout in seconds |
| max_retries | int | 3 | Maximum retry attempts on transient errors (429, 5xx, connection errors) |
| backoff_factor | float | 0.5 | Multiplier for exponential backoff: delay = factor * 2^attempt |
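clone() on a frozen dataclass is the same idea as dataclasses.replace: a new instance with selected fields overridden, the original untouched. An illustrative stand-in (this Config class is a sketch, not the SDK's ClientConfig):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    api_url: str = "http://localhost:8000"
    timeout: float = 60.0
    max_retries: int = 3

    def clone(self, **overrides) -> "Config":
        # replace() builds a new frozen instance; self is never mutated
        return replace(self, **overrides)
```

Deriving fast = base.clone(timeout=10.0) leaves base.timeout at 60.0, which is what makes sharing a base config across environments safe.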
Self-Hosted Setup
Point the SDK at your self-hosted DataBlue instance by setting the api_url:
# Direct constructor
with DataBlue(
api_url="https://scraper.internal.company.com",
api_key="wh_internal_key",
) as client:
result = client.scrape("https://example.com")
# Or via environment variables
export DATABLUE_API_URL=https://scraper.internal.company.com
export DATABLUE_API_KEY=wh_internal_key
from datablue import DataBlue
with DataBlue.from_env() as client:
result = client.scrape("https://example.com")
print(result.data.markdown)
Default URL: The SDK defaults to http://localhost:8000, which works out of the box with the Docker Compose development setup. For production deployments, always set the URL explicitly.
Complete API Reference (v2.0.0)
| Method | Description |
|---|---|
| scrape(url, **opts) | Scrape a single URL, returns ScrapeResult |
| crawl(url, **opts) | Crawl a site (blocking with polling), returns CrawlStatus |
| start_crawl(url, **opts) | Start crawl (non-blocking), returns CrawlJob |
| get_crawl_status(job_id) | Poll crawl status, returns CrawlStatus |
| cancel_crawl(job_id) | Cancel an in-progress crawl |
| crawl_stream(url, **opts) | Stream crawl pages via NDJSON, returns Iterator[CrawlPageData] |
| crawl_stream_with_callback(url, on_document=..., **opts) | Callback-based crawl streaming (on_document, on_complete, on_error) |
| search(query, **opts) | Search the web (blocking with polling), returns SearchStatus |
| start_search(query, **opts) | Start search (non-blocking), returns SearchJob |
| get_search_status(job_id) | Poll search status, returns SearchStatus |
| map(url, **opts) | Discover URLs on a site, returns MapResult |
| batch_scrape(urls, **opts) | Scrape multiple URLs, returns list[ScrapeResult] |
| batch_scrape_iter(urls, **opts) | Async-only: stream batch results as they complete, returns AsyncIterator[ScrapeResult] |
| login(email, password) | Authenticate with email/password, stores JWT internally |
| close() | Close the HTTP connection pool |
| from_env() | Class method: create client from DATABLUE_* env vars |
AI-First Documentation Files
v2.0.0 ships with machine-readable reference files for AI coding assistants:
| File | Location | Purpose |
|---|---|---|
| CLAUDE.md | sdk/CLAUDE.md | Complete SDK quick-reference: all method signatures, response models, error types, and patterns. Read automatically by Claude Code and other AI assistants. |
| llms.txt | sdk/llms.txt | Standardized machine-readable documentation following the llms.txt convention. Condensed API surface for LLM consumption. |
Why AI-first? AI coding assistants hallucinate API calls when they lack accurate documentation. The CLAUDE.md file ensures AI assistants generate code using real method signatures, real parameter names, and real response types — no guessing.
Scrape
Scrape a single URL and return the content in your desired format. Uses a 5-tier parallel scraping engine with automatic strategy selection and domain-level strategy caching.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| url | string | REQUIRED | The URL to scrape. Protocol is auto-prepended if missing. |
| formats | string[] | optional | Output formats: "markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images". |
| only_main_content | boolean | optional | Extract only the main content, removing navs, footers, sidebars. |
| mobile | boolean | optional | Emulate a mobile device viewport. |
| mobile_device | string | optional | Device preset name: "iphone_14", "pixel_7", "ipad_pro". |
| timeout | number | optional | Request timeout in milliseconds. |
| wait_for | number | optional | Wait this many ms after page load before extracting. |
| css_selector | string | optional | Only extract content matching this CSS selector. |
| xpath | string | optional | XPath expression for targeted extraction. |
| include_tags | string[] | optional | Only include these HTML tags in extraction. |
| exclude_tags | string[] | optional | Exclude these HTML tags from extraction. |
| use_proxy | boolean | optional | Route request through a configured proxy for anti-bot bypass. |
| headers | object | optional | Custom HTTP headers to send (e.g. { "Cookie": "session=abc" }). |
| cookies | object | optional | Custom cookies to send as name/value pairs. |
| actions | ActionStep[] | optional | Browser actions to perform before extraction: click, wait, scroll, type, screenshot, hover, press, select, fill_form, evaluate. |
| extract | object | optional | LLM extraction config: { prompt: string, schema: JSONSchema }. |
| webhook_url | string | optional | Webhook URL for job completion notification. |
| webhook_secret | string | optional | HMAC secret for webhook signature verification. |
| capture_network | boolean | optional | Capture browser network requests/responses. |
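The webhook_secret parameter implies HMAC-signed webhook deliveries. A receiver-side verification sketch; the signature header name, hex encoding, and SHA-256 digest are assumptions to check against your deployment:

```python
import hashlib
import hmac

def verify_webhook(payload: bytes, secret: str, signature: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare in constant time."""
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Always compare against the raw request body bytes, not a re-serialized JSON object, since serialization differences change the digest.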
cURL Example
curl -X POST "https://api.datablue.dev/v1/scrape" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"formats": [
"markdown",
"links"
],
"only_main_content": true
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"data": {
"markdown": "# Hacker News\n\n1. Show HN: I built an open-source web scraper with strategy caching\n2. Why Rust is eating the world...",
"links": [
"https://news.ycombinator.com/item?id=39912345",
"https://news.ycombinator.com/item?id=39912346",
"https://news.ycombinator.com/newest"
],
"metadata": {
"title": "Hacker News",
"description": null,
"language": "en",
"source_url": "https://news.ycombinator.com",
"status_code": 200,
"word_count": 1847,
"reading_time_seconds": 7,
"content_length": 12340
}
}
}
Crawl
Start a crawl job that discovers and scrapes pages starting from a seed URL using BFS, DFS, or best-first strategy. Returns a job ID for tracking progress via SSE or polling.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| url | string | REQUIRED | The starting URL to crawl. |
| max_pages | number | optional | Maximum pages to crawl (1-1000). |
| max_depth | number | optional | Maximum link depth from the starting URL (1-10). |
| concurrency | number | optional | Number of parallel scrape workers (1-10). |
| crawl_strategy | string | optional | Strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first). |
| allow_external_links | boolean | optional | Follow links to external domains. |
| respect_robots_txt | boolean | optional | Obey the target site's robots.txt rules. |
| include_paths | string[] | optional | Only crawl URLs matching these glob path patterns. |
| exclude_paths | string[] | optional | Skip URLs matching these glob path patterns. |
| scrape_options | object | optional | Options passed to each page scrape: { formats, only_main_content, wait_for, timeout, include_tags, exclude_tags, mobile, extract }. |
| use_proxy | boolean | optional | Route all requests through a configured proxy. |
| filter_faceted_urls | boolean | optional | Deduplicate faceted/navigation URL variations. |
| webhook_url | string | optional | Webhook URL for crawl completion notification. |
| webhook_secret | string | optional | HMAC secret for webhook signature verification. |
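include_paths and exclude_paths take glob patterns, which you can preview client-side with Python's fnmatch. A sketch (the allowed helper is illustrative, and the server's matcher may differ in edge cases):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """True if the URL's path matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    return any(fnmatch(path, pattern) for pattern in include)
```

This makes it easy to sanity-check a pattern like "/3/library/*" against a few known URLs before spending crawl credits.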
cURL Example
curl -X POST "https://api.datablue.dev/v1/crawl" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.python.org/3/",
"max_pages": 50,
"max_depth": 2,
"crawl_strategy": "bfs",
"include_paths": [
"/3/library/*"
]
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started",
"message": "Crawl job started"
}
Search
Search the web using Google (via SearXNG), DuckDuckGo, or Brave, then scrape each result page and return structured content. Async job with polling.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | The search query. |
| num_results | number | optional | Number of search results to scrape. |
| engine | string | optional | Search engine: "google" (SearXNG), "duckduckgo", or "brave". |
| formats | string[] | optional | Output formats for scraped results: "markdown", "html", "links", "screenshot", "structured_data", "headings", "images". |
| only_main_content | boolean | optional | Extract only the main content from each result. |
| use_proxy | boolean | optional | Route requests through a configured proxy. |
| mobile | boolean | optional | Emulate a mobile device viewport for scraping. |
| mobile_device | string | optional | Mobile device preset name. |
| headers | object | optional | Custom HTTP headers for scraping. |
| cookies | object | optional | Custom cookies for scraping. |
| extract | object | optional | LLM extraction config applied to each result: { prompt, schema }. |
| google_api_key | string | optional | Google Custom Search API key (alternative to SearXNG). |
| google_cx | string | optional | Google Custom Search Engine ID. |
| brave_api_key | string | optional | Brave Search API key. |
| webhook_url | string | optional | Webhook URL for search completion notification. |
| webhook_secret | string | optional | HMAC secret for webhook signature verification. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "best python web scraping libraries 2026",
"num_results": 5,
"engine": "google",
"formats": [
"markdown"
]
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"job_id": "550e8400-e29b-41d4-a716-446655440001",
"status": "started",
"message": "Search job started"
}
Map
Discover all URLs on a website by combining sitemap.xml parsing, robots.txt discovery, and link crawling. Returns a flat list of URLs with metadata.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| url | string | REQUIRED | The website URL to map. |
| search | string | optional | Filter URLs matching this search string. |
| limit | number | optional | Maximum number of URLs to return. |
| include_subdomains | boolean | optional | Include URLs from subdomains. |
| use_sitemap | boolean | optional | Parse sitemap.xml for URL discovery. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/map" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.python.org",
"limit": 200,
"include_subdomains": false
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"total": 187,
"links": [
{
"url": "https://docs.python.org/3/",
"title": "Python 3 Documentation",
"lastmod": "2026-03-28"
},
{
"url": "https://docs.python.org/3/tutorial/index.html",
"title": "The Python Tutorial"
},
{
"url": "https://docs.python.org/3/library/index.html",
"title": "The Python Standard Library"
},
{
"url": "https://docs.python.org/3/reference/index.html",
"title": "The Python Language Reference"
}
]
}
Extract
Extract structured data from web pages or raw content using LLM. Accepts URLs to scrape first, or raw markdown/HTML content directly. Returns typed JSON matching your schema.
Target Latency
1.2s - 4.5s
Credits
5 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| url | string | optional | Single URL to scrape then extract from. |
| urls | string[] | optional | Multiple URLs to scrape and extract from (async job). |
| content | string | optional | Raw markdown/text content to extract from (no scraping needed). |
| html | string | optional | Raw HTML to convert and extract from. |
| prompt | string | optional | Natural language extraction instruction (e.g. 'Extract all product names and prices'). |
| schema | object | optional | JSON Schema for structured output. The LLM will return data matching this schema. |
| provider | string | optional | LLM provider: "openai", "anthropic", "groq", etc. |
| only_main_content | boolean | optional | Extract only main content before LLM processing. |
| wait_for | number | optional | Wait ms after page load (for URLs). |
| timeout | number | optional | Scrape timeout in ms (for URLs). |
| use_proxy | boolean | optional | Use proxy for scraping. |
| headers | object | optional | Custom HTTP headers. |
| cookies | object | optional | Custom cookies. |
| webhook_url | string | optional | Webhook URL for extraction completion notification. |
| webhook_secret | string | optional | HMAC secret for webhook signature verification. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/extract" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://openai.com/pricing",
"prompt": "Extract all pricing tiers with name, price per million tokens, and context window",
"schema": {
"type": "object",
"properties": {
"tiers": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"input_price": {
"type": "string"
},
"output_price": {
"type": "string"
},
"context_window": {
"type": "string"
}
}
}
}
}
}
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"data": {
"url": "https://openai.com/pricing",
"extract": {
"tiers": [
{
"name": "GPT-4o",
"input_price": "$2.50/1M",
"output_price": "$10.00/1M",
"context_window": "128K"
},
{
"name": "GPT-4o mini",
"input_price": "$0.15/1M",
"output_price": "$0.60/1M",
"context_window": "128K"
},
{
"name": "GPT-4.1",
"input_price": "$2.00/1M",
"output_price": "$8.00/1M",
"context_window": "1M"
}
]
},
"content_length": 48230
}
}
Google Search
Search Google and get structured SERP data including organic results, featured snippets, People Also Ask, knowledge panels, AI overviews, videos, and related searches.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | Search query (1-2048 characters). |
| num_results | number | optional | Number of organic results (1-100). |
| page | number | optional | Result page number (1-10). |
| language | string | optional | Language code (hl parameter). |
| country | string | optional | Country code for geo-targeting (gl parameter, e.g. us, uk, in). |
| safe_search | boolean | optional | Enable safe search filter. |
| time_range | string | optional | Time filter: "hour", "day", "week", "month", "year". |
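The organic_results array reduces easily to a plain ranking. A parsing sketch over a response dict shaped like the documented output (top_urls is illustrative, not part of the API):

```python
def top_urls(serp: dict, limit: int = 3) -> list[str]:
    """Extract result URLs in SERP position order."""
    results = sorted(serp.get("organic_results", []), key=lambda r: r["position"])
    return [r["url"] for r in results[:limit]]
```

Sorting by position rather than trusting array order keeps the ranking stable even if results arrive out of order.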
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/google/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning frameworks comparison 2026",
"num_results": 10,
"language": "en",
"country": "us"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"query": "machine learning frameworks comparison 2026",
"total_results": "About 142,000,000 results",
"time_taken": 1.23,
"organic_results": [
{
"position": 1,
"title": "Best Machine Learning Frameworks in 2026: Complete Comparison",
"url": "https://towardsdatascience.com/ml-frameworks-2026",
"displayed_url": "towardsdatascience.com",
"snippet": "A comprehensive comparison of PyTorch, JAX, TensorFlow, and MLX for production ML in 2026...",
"date": "Mar 15, 2026"
},
{
"position": 2,
"title": "PyTorch vs JAX vs TensorFlow - Benchmark Results",
"url": "https://paperswithcode.com/benchmarks/frameworks",
"displayed_url": "paperswithcode.com",
"snippet": "Side-by-side benchmark results across training speed, inference latency, and memory usage..."
}
],
"featured_snippet": {
"title": "Top ML Frameworks 2026",
"url": "https://towardsdatascience.com/ml-frameworks-2026",
"content": "The top machine learning frameworks in 2026 are: 1. PyTorch 2.5 2. JAX 0.5 3. TensorFlow 2.18 4. MLX 0.22",
"type": "list"
},
"people_also_ask": [
{
"question": "What is the best ML framework for beginners?",
"snippet": "PyTorch is widely recommended for beginners..."
},
{
"question": "Is TensorFlow still relevant in 2026?",
"snippet": "Yes, TensorFlow remains widely used in production..."
}
],
"related_searches": [
{
"query": "pytorch vs jax performance"
},
{
"query": "best ml framework for production"
}
],
"knowledge_panel": null,
"ai_overview": null,
"videos": [
{
"title": "PyTorch vs JAX in 2026 - Full Comparison",
"url": "https://youtube.com/watch?v=abc123",
"duration": "18:42",
"channel": "Yannic Kilcher"
}
]
}
Google Maps
Search Google Maps for places, businesses, and points of interest. Supports search queries, coordinate-based nearby search, and single place detail lookups via place_id or CID. Returns full business data including reviews.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | optional | Search query (e.g. 'restaurants near Times Square'). Max 2048 characters. |
| coordinates | string | optional | GPS coordinates as 'lat,lng' (e.g. '40.7580,-73.9855'). |
| radius | number | optional | Search radius in meters (100-50000). Default 5000. |
| zoom | number | optional | Map zoom level (1-21). Auto-calculated from radius if not set. |
| type | string | optional | Place type filter: restaurant, hotel, gas_station, hospital, cafe, bar, gym, pharmacy, bank, supermarket, park, museum, airport. |
| keyword | string | optional | Additional keyword filter (max 500 chars). E.g. 'vegetarian', 'rooftop'. |
| min_rating | number | optional | Minimum star rating filter (1.0-5.0). |
| open_now | boolean | optional | Only show places that are currently open. |
| price_level | number | optional | Price range filter: 1=$, 2=$$, 3=$$$, 4=$$$$. |
| sort_by | string | optional | Sort order: "relevance", "distance", "rating", "reviews". |
| num_results | number | optional | Number of places to return (1-200). |
| place_id | string | optional | Google Place ID for detailed single place lookup (e.g. 'ChIJN1t_tDeuEmsRUsoyG'). |
| cid | string | optional | CID / Ludocid permanent business identifier. |
| data | string | optional | Google Maps data parameter (encoded place reference). |
| language | string | optional | Language code (hl parameter). |
| country | string | optional | Country code for geo-targeting (gl parameter). |
| include_reviews | boolean | optional | Include user reviews for each place. |
| reviews_limit | number | optional | Maximum reviews per place (1-20). Requires include_reviews=true. |
| reviews_sort | string | optional | Review sort: "most_relevant", "newest", "highest", "lowest". |
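Filters such as min_rating and open_now can also be applied client-side when post-processing a places list. A sketch over dicts shaped like the documented response (filter_places is illustrative, not part of the API):

```python
def filter_places(places: list[dict], min_rating: float = 0.0, open_now: bool = False) -> list[dict]:
    """Keep places at or above min_rating, optionally only those open now."""
    kept = [p for p in places if (p.get("rating") or 0.0) >= min_rating]
    if open_now:
        kept = [p for p in kept if p.get("open_now")]
    return kept
```

This is useful when you fetch a broad result set once and want to slice it several ways without extra API calls.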
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/google/maps" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "best ramen restaurants",
"coordinates": "40.7580,-73.9855",
"radius": 3000,
"num_results": 5,
"min_rating": 4,
"include_reviews": true,
"reviews_limit": 3
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
System capacity exceeded.
500 SYSTEM FAILURE
Internal core exception.
{
"success": true,
"query": "best ramen restaurants",
"coordinates_used": "40.7580,-73.9855",
"search_type": "search",
"total_results": "5",
"time_taken": 3.42,
"filters_applied": {
"min_rating": 4,
"radius": 3000
},
"places": [
{
"position": 1,
"title": "Ichiran Ramen",
"place_id": "ChIJ4Y8RmkRYwokR5ntGn3BDXY8",
"address": "132 W 31st St, New York, NY 10001",
"gps_coordinates": {
"latitude": 40.7487,
"longitude": -73.9903
},
"url": "https://maps.google.com/?cid=10325091291252938854",
"website": "https://www.ichiranusa.com",
"phone": "+1 212-465-0701",
"rating": 4.5,
"reviews": 3847,
"price": "$$",
"price_level": 2,
"type": "Ramen restaurant",
"subtypes": [
"Ramen restaurant",
"Japanese restaurant",
"Noodle shop"
],
"open_state": "Open - Closes 2 AM",
"open_now": true,
"thumbnail": "https://lh5.googleusercontent.com/p/AF1QipN...",
"user_reviews": [
{
"author_name": "Sarah Chen",
"rating": 5,
"text": "Best ramen in NYC. The solo booth concept is genius. Rich tonkotsu broth...",
"relative_time": "2 weeks ago"
},
{
"author_name": "Mike Johnson",
"rating": 4,
"text": "Great flavors, a bit pricey for what you get but the experience is unique.",
"relative_time": "1 month ago"
}
]
}
],
"related_searches": [
{
"query": "ramen near me"
},
{
"query": "japanese restaurants midtown"
}
]
}
Google News
Search Google News for articles. Supports time range filtering, language/country targeting, and relevance/date sorting. Returns article metadata including source, date, and thumbnail.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | News search query (1-2048 characters). |
| num_results | number | optional | Number of articles to return (1-500). |
| language | string | optional | Language code (hl parameter). |
| country | string | optional | Country code for geo-targeting (gl parameter, e.g. us, uk, in). |
| time_range | string | optional | Time filter: "hour", "day", "week", "month", "year". |
| sort_by | string | optional | Sort order: "relevance", "date". |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/google/news" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "artificial intelligence regulation",
"num_results": 10,
"time_range": "week",
"language": "en",
"country": "us"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"query": "artificial intelligence regulation",
"total_results": "10",
"time_taken": 2.14,
"source_strategy": "searxng_google_news",
"articles": [
{
"position": 1,
"title": "EU AI Act Enforcement Begins: What Companies Need to Know",
"url": "https://www.reuters.com/technology/eu-ai-act-enforcement-2026-04-01",
"source": "Reuters",
"source_url": "reuters.com",
"date": "3 hours ago",
"published_date": "2026-04-03T09:00:00Z",
"snippet": "The European Union's AI Act enters its enforcement phase today, requiring all AI systems classified as high-risk to undergo compliance assessments...",
"thumbnail": "https://static.reuters.com/image/ai-regulation.jpg"
},
{
"position": 2,
"title": "US Senate Proposes Comprehensive AI Safety Bill",
"url": "https://www.washingtonpost.com/technology/2026/04/02/ai-safety-bill",
"source": "The Washington Post",
"source_url": "washingtonpost.com",
"date": "1 day ago",
"snippet": "Bipartisan legislation would create a federal AI licensing framework for frontier models..."
}
],
"related_searches": [
{
"query": "AI regulation news today"
},
{
"query": "EU AI Act requirements"
}
]
}
Google Finance
Get financial market data from Google Finance. Omit query for a market overview (US, Europe, Asia, Crypto, Currencies, and Futures, plus market trends). Provide a ticker symbol (e.g. 'AAPL:NASDAQ') for a single stock quote with similar stocks and related news.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | optional | Stock ticker (e.g. 'AAPL:NASDAQ', 'BTC-USD'). Omit for market overview. Max 100 characters. |
| language | string | optional | Language code (hl parameter). |
| country | string | optional | Country code for geo-targeting (gl parameter). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/google/finance" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "AAPL:NASDAQ"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.87,
"stock": "AAPL:NASDAQ",
"name": "Apple Inc",
"price": "$198.50",
"price_movement": {
"percentage": "+1.24%",
"value": "+$2.43",
"movement": "up"
},
"currency": "USD",
"previous_close": "$196.07",
"after_hours_price": "$198.75",
"after_hours_movement": {
"percentage": "+0.13%",
"value": "+$0.25",
"movement": "up"
},
"similar_stocks": [
{
"stock": "MSFT:NASDAQ",
"link": "https://www.google.com/finance/quote/MSFT:NASDAQ",
"name": "Microsoft Corporation",
"price": "$428.50",
"price_movement": {
"percentage": "+0.87%",
"value": "+$3.70",
"movement": "up"
}
},
{
"stock": "GOOGL:NASDAQ",
"link": "https://www.google.com/finance/quote/GOOGL:NASDAQ",
"name": "Alphabet Inc",
"price": "$178.20",
"price_movement": {
"percentage": "-0.34%",
"value": "-$0.61",
"movement": "down"
}
}
],
"news": [
{
"title": "Apple Reports Record Q2 Revenue Driven by Services Growth",
"url": "https://www.cnbc.com/2026/04/02/apple-q2-earnings.html",
"source": "CNBC",
"snippet": "Apple beat Wall Street expectations with $97.4B in revenue...",
"published_timestamp": 1743638400
}
]
}
YouTube Channel
Get detailed channel metadata including subscriber count, video count, verification status, avatar, banner, country, and keywords. Accepts @handle or UC... channel ID.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| identifier | string | REQUIRED | Channel ID (UC...), @handle, or username. 1-100 characters. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/youtube/channel" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"identifier": "@mkbhd"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 2.31,
"channel": {
"id": "UCBJycsmduvYEL83R_U4JriQ",
"title": "MKBHD",
"description": "Quality Tech Videos | Updated weekly.",
"subscriber_count": 19800000,
"video_count": 1847,
"verified": true,
"avatar_url": "https://yt3.googleusercontent.com/lkBjk_ZQ...",
"banner_url": "https://yt3.googleusercontent.com/banner_ZQ...",
"channel_url": "https://www.youtube.com/@mkbhd",
"country": "US",
"keywords": "tech reviews gadgets smartphones MKBHD Marques Brownlee"
}
}
YouTube Videos
Get a channel's recent videos with view counts, duration, publish dates, and thumbnails. Accepts @handle or UC... channel ID.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| identifier | string | REQUIRED | Channel ID (UC...), @handle, or username. 1-100 characters. |
| limit | number | optional | Number of videos to return (1-50). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/youtube/videos" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"identifier": "@mkbhd",
"limit": 5
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 3.15,
"channel": "MKBHD",
"channel_id": "UCBJycsmduvYEL83R_U4JriQ",
"count": 5,
"videos": [
{
"video_id": "dQw4w9WgXcQ",
"title": "The BEST Smartphones of 2026!",
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"views": 4200000,
"views_text": "4.2M views",
"published": "2 weeks ago",
"duration": "22:14",
"duration_seconds": 1334,
"thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
"description": "Ranking the top smartphones I've tested this year..."
},
{
"video_id": "abc123def456",
"title": "Galaxy S26 Ultra Review: The Camera King?",
"url": "https://www.youtube.com/watch?v=abc123def456",
"views": 2800000,
"views_text": "2.8M views",
"published": "3 weeks ago",
"duration": "18:07",
"duration_seconds": 1087,
"thumbnail": "https://i.ytimg.com/vi/abc123def456/maxresdefault.jpg"
}
]
}
YouTube Search
Search YouTube for videos. Returns video metadata including channel info, view counts, duration, thumbnails, and badges.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | Search keywords (1-500 characters). |
| limit | number | optional | Maximum video results (1-50). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/youtube/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "python web scraping tutorial 2026",
"limit": 5
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 2.87,
"query": "python web scraping tutorial 2026",
"count": 5,
"results": [
{
"video_id": "xyz789abc123",
"title": "Python Web Scraping Full Course 2026 - BeautifulSoup, Selenium, Playwright",
"url": "https://www.youtube.com/watch?v=xyz789abc123",
"channel": "Tech With Tim",
"channel_id": "UC4JX40jDee_tINbkjycV4Sg",
"views": 890000,
"views_text": "890K views",
"published": "1 month ago",
"duration": "3:42:15",
"duration_seconds": 13335,
"thumbnail": "https://i.ytimg.com/vi/xyz789abc123/maxresdefault.jpg",
"description": "Learn web scraping from scratch with Python...",
"badges": [
"New"
]
},
{
"video_id": "def456ghi789",
"title": "Scrape Any Website in 5 Minutes with Python",
"url": "https://www.youtube.com/watch?v=def456ghi789",
"channel": "Fireship",
"channel_id": "UCsBjURrPoezykLs9EqgamOA",
"views": 1200000,
"views_text": "1.2M views",
"published": "2 months ago",
"duration": "8:23",
"duration_seconds": 503,
"thumbnail": "https://i.ytimg.com/vi/def456ghi789/maxresdefault.jpg"
}
]
}
YouTube Video Detail
Get detailed metadata for a single YouTube video including view count, likes, comment count, duration, keywords, live status, and full description.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| video_id | string | REQUIRED | YouTube video ID (e.g. 'dQw4w9WgXcQ'). 1-20 characters. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/youtube/video" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"video_id": "dQw4w9WgXcQ"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.94,
"video": {
"id": "dQw4w9WgXcQ",
"title": "Rick Astley - Never Gonna Give You Up (Official Music Video)",
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"description": "The official video for \"Never Gonna Give You Up\" by Rick Astley...",
"channel": "Rick Astley",
"channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
"views": 1500000000,
"likes": 16000000,
"comment_count": 3200000,
"duration_seconds": 212,
"published": "Oct 25, 2009",
"is_live": false,
"keywords": [
"rick astley",
"never gonna give you up",
"rickroll",
"music video",
"80s",
"pop"
],
"thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
}
}
YouTube Comments
Get comments for a YouTube video including author info, verification status, likes, reply count, pinned status, and creator heart status.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| video_id | string | REQUIRED | YouTube video ID. 1-20 characters. |
| limit | number | optional | Maximum comments to return (1-100). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/youtube/comments" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"video_id": "dQw4w9WgXcQ",
"limit": 5
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 2.56,
"video_id": "dQw4w9WgXcQ",
"count": 5,
"comments": [
{
"author": "Rick Astley",
"author_channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
"author_avatar": "https://yt3.ggpht.com/ytc/AIdro_n...",
"is_verified": true,
"is_creator": true,
"text": "Thank you for 1.5 billion views! Never gonna let you down.",
"likes": 892000,
"reply_count": 47200,
"published": "3 months ago",
"pinned": true,
"hearted": false
},
{
"author": "Internet Historian",
"author_channel_id": "UCR1D15p_vY6mVRqY5eXSA",
"author_avatar": "https://yt3.ggpht.com/ytc/another...",
"is_verified": true,
"is_creator": false,
"text": "We've been rickrolling for almost 20 years and it still never gets old.",
"likes": 234000,
"reply_count": 1200,
"published": "6 months ago",
"pinned": false,
"hearted": true
}
]
}
Twitter Profile
Get a Twitter/X user profile including follower/following counts, tweet count, verification status, bio, location, and profile images.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| username | string | REQUIRED | Twitter/X username without @ prefix. 1-50 characters. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/twitter/profile" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"username": "naval"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.82,
"user": {
"id": "745273",
"username": "naval",
"name": "Naval",
"description": "Angel investor, founder of AngelList. Building the future.",
"followers_count": 2100000,
"following_count": 1847,
"tweet_count": 18400,
"like_count": 42000,
"listed_count": 12500,
"verified": true,
"created_at": "2006-12-19T00:00:00Z",
"profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg",
"profile_banner_url": "https://pbs.twimg.com/profile_banners/745273/1600x600.jpg",
"location": "San Francisco, CA",
"url": "https://nav.al"
}
}
Twitter Tweets
Get recent tweets from a Twitter/X user timeline including engagement metrics, media attachments, hashtags, and URLs.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| username | string | REQUIRED | Twitter/X username without @ prefix. 1-50 characters. |
| limit | number | optional | Number of tweets to return (1-100). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/twitter/tweets" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"username": "naval",
"limit": 5
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 3.47,
"username": "naval",
"count": 5,
"tweets": [
{
"id": "1907654321098765432",
"text": "Specific knowledge is found by pursuing your genuine curiosity rather than whatever is hot right now.",
"created_at": "2026-04-02T14:30:00Z",
"likes": 24000,
"retweets": 4800,
"replies": 320,
"quotes": 180,
"views": 1200000,
"lang": "en",
"url": "https://x.com/naval/status/1907654321098765432",
"author": {
"username": "naval",
"name": "Naval",
"verified": true,
"profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg"
},
"media": [],
"hashtags": [],
"urls": []
},
{
"id": "1907543210987654321",
"text": "The most important skill for getting rich is becoming a perpetual learner.",
"created_at": "2026-04-01T18:15:00Z",
"likes": 18500,
"retweets": 3200,
"replies": 210,
"quotes": 95,
"views": 890000,
"lang": "en",
"url": "https://x.com/naval/status/1907543210987654321",
"author": {
"username": "naval",
"name": "Naval",
"verified": true
},
"media": [],
"hashtags": [],
"urls": []
}
]
}
Twitter Tweet Detail
Get detailed data for a single tweet by ID including full engagement metrics, media, author info, and embedded URLs.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| tweet_id | string | REQUIRED | Tweet ID (numeric string). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/twitter/tweet" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"tweet_id": "1907654321098765432"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.65,
"tweet": {
"id": "1907654321098765432",
"text": "Specific knowledge is found by pursuing your genuine curiosity rather than whatever is hot right now.",
"created_at": "2026-04-02T14:30:00Z",
"likes": 24000,
"retweets": 4800,
"replies": 320,
"quotes": 180,
"views": 1200000,
"lang": "en",
"url": "https://x.com/naval/status/1907654321098765432",
"author": {
"username": "naval",
"name": "Naval",
"verified": true,
"profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg"
},
"media": [],
"hashtags": [],
"urls": []
}
}
Twitter Search
Search Twitter/X for tweets matching keywords, hashtags, or @mentions. Supports enrichment with full tweet data (metrics, media, author). Enrichment adds ~2-4s but provides complete data.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | Search query: keywords, hashtags, @mentions, phrases. 1-500 characters. |
| limit | number | optional | Maximum tweet results (1-50). |
| enrich | boolean | optional | Fetch full tweet data (metrics, media, author). Adds ~2-4s latency. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/twitter/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "#webdev OR #python web scraping",
"limit": 10,
"enrich": true
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 4.12,
"query": "#webdev OR #python web scraping",
"count": 10,
"enriched": true,
"results": [
{
"id": "1907765432109876543",
"text": "Just built a web scraper using Python and Playwright that handles JavaScript-heavy sites beautifully. #python #webdev",
"created_at": "2026-04-03T10:00:00Z",
"likes": 450,
"retweets": 89,
"replies": 23,
"quotes": 12,
"views": 34000,
"lang": "en",
"url": "https://x.com/dev_sarah/status/1907765432109876543",
"author": {
"username": "dev_sarah",
"name": "Sarah Developer",
"verified": false
},
"media": [],
"hashtags": [
"python",
"webdev"
],
"urls": []
}
]
}
Reddit Subreddit
Get subreddit metadata including subscriber count, active users, description, icon, banner, and community type.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| name | string | REQUIRED | Subreddit name (e.g. 'python' or 'r/python'). 1-100 characters. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/reddit/subreddit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "python"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.45,
"subreddit": {
"name": "python",
"title": "Python",
"description": "News about the programming language Python.",
"long_description": "Welcome to r/Python! This is a community for all things Python - discussion, tutorials, projects, and news about the programming language.",
"subscribers": 1340000,
"active_users": 4200,
"created": "2008-01-25T00:00:00Z",
"icon": "https://styles.redditmedia.com/t5_2qh0y/python_icon.png",
"banner": "https://styles.redditmedia.com/t5_2qh0y/python_banner.png",
"over_18": false,
"type": "public",
"url": "https://www.reddit.com/r/python/"
}
}
Reddit Posts
Get posts from a subreddit with sorting (hot, new, top, rising, controversial) and time filtering. Returns full post metadata including score, comments, media, flair, and awards.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| subreddit | string | REQUIRED | Subreddit name (e.g. 'python' or 'r/python'). 1-100 characters. |
| sort | string | optional | Sort order: "hot", "new", "top", "rising", "controversial". |
| limit | number | optional | Number of posts to return (1-100). |
| time_filter | string | optional | Time filter for top/controversial: "hour", "day", "week", "month", "year", "all". |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/reddit/posts" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"subreddit": "python",
"sort": "top",
"limit": 5,
"time_filter": "week"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 2.18,
"subreddit": "python",
"sort": "top",
"count": 5,
"posts": [
{
"id": "1b2c3d4",
"title": "I wrote a Python library that makes web scraping 10x easier",
"author": "scraperdev",
"subreddit": "python",
"score": 2847,
"upvote_ratio": 0.96,
"num_comments": 342,
"url": "https://github.com/scraperdev/easyscrape",
"permalink": "/r/python/comments/1b2c3d4/i_wrote_a_python_library/",
"selftext": null,
"created": "2026-03-30T15:30:00Z",
"thumbnail": "https://b.thumbs.redditmedia.com/abc123.jpg",
"is_video": false,
"is_self": false,
"over_18": false,
"stickied": false,
"flair": "Library",
"awards": 5
},
{
"id": "4e5f6g7",
"title": "Python 3.14 Released - What's New",
"author": "python_dev",
"subreddit": "python",
"score": 1923,
"upvote_ratio": 0.98,
"num_comments": 187,
"url": "https://www.reddit.com/r/python/comments/4e5f6g7/python_314_released/",
"permalink": "/r/python/comments/4e5f6g7/python_314_released/",
"selftext": "The latest Python release brings several exciting features including...",
"created": "2026-03-28T12:00:00Z",
"is_video": false,
"is_self": true,
"over_18": false,
"stickied": false,
"flair": "News",
"awards": 3
}
]
}
Reddit Search
Search Reddit for posts matching keywords. Optionally restrict to a specific subreddit. Supports relevance, hot, top, new, and comments sorting.
Target Latency
1.2s - 4.5s
Credits
2 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| query | string | REQUIRED | Search keywords (1-500 characters). |
| limit | number | optional | Maximum results to return (1-100). |
| sort | string | optional | Sort order: "relevance", "hot", "top", "new", "comments". |
| subreddit | string | optional | Restrict search to a specific subreddit (1-100 characters). |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/reddit/search" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "web scraping python best practices",
"limit": 10,
"sort": "relevance",
"subreddit": "python"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 2.73,
"query": "web scraping python best practices",
"subreddit": "python",
"sort": "relevance",
"count": 10,
"results": [
{
"id": "8h9i0j1",
"title": "What are the best practices for web scraping in Python in 2026?",
"author": "curious_dev",
"subreddit": "python",
"score": 567,
"upvote_ratio": 0.93,
"num_comments": 89,
"url": "https://www.reddit.com/r/python/comments/8h9i0j1/best_practices_scraping/",
"permalink": "/r/python/comments/8h9i0j1/best_practices_scraping/",
"selftext": "I'm building a scraping pipeline and want to know the current best practices...",
"created": "2026-03-25T09:00:00Z",
"is_video": false,
"is_self": true,
"over_18": false,
"stickied": false,
"flair": "Discussion",
"awards": 1
}
]
}
Reddit User
Get a Reddit user's profile including karma breakdown, account age, gold status, verification, and employee status.
Target Latency
1.2s - 4.5s
Credits
1 cr/req
Parameters
| Name | Type | Requirement | Description |
|---|---|---|---|
| username | string | REQUIRED | Reddit username (e.g. 'spez' or 'u/spez'). 1-100 characters. |
cURL Example
curl -X POST "https://api.datablue.dev/v1/data/reddit/user" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"username": "spez"
}'
System Responses
200 OK
Request processed successfully.
401 UNAUTHORIZED
Missing or invalid API key.
429 RATE LIMIT
Rate limit exceeded for your plan.
500 SYSTEM FAILURE
Internal server error.
{
"success": true,
"time_taken": 1.32,
"user": {
"name": "spez",
"display_name": "spez",
"description": "CEO, Reddit",
"total_karma": 1250000,
"link_karma": 450000,
"comment_karma": 800000,
"created": "2005-06-06T00:00:00Z",
"icon": "https://styles.redditmedia.com/t5_6/spez_avatar.png",
"is_gold": true,
"verified": true,
"has_verified_email": true,
"is_employee": true,
"url": "https://www.reddit.com/user/spez/"
}
}
Rate Limits
Rate limits are enforced per API key or JWT token. Limits vary by plan tier. When exceeded, the API returns 429 Too Many Requests.
| Plan | Rate Limit | Monthly Credits | Max Pages/Crawl | Concurrency |
|---|---|---|---|---|
| Free | 10 req/min | 500 | 100 | 2 |
| Starter | 60 req/min | 3,000 | 500 | 5 |
| Plus | 120 req/min | 15,000 | 1,000 | 10 |
| Pro | 200 req/min | 50,000 | 1,000 | 10 |
| Growth | 300 req/min | 200,000 | 1,000 | 10 |
| Scale | 500 req/min | 1,000,000 | 1,000 | 10 |
Rate Limit Headers
Every API response includes these headers:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests per minute for your plan |
| X-RateLimit-Remaining | Remaining requests in the current window |
| X-RateLimit-Reset | Unix timestamp when the rate limit window resets |
429 Response Example
{
"error": "Rate limit exceeded",
"message": "You have exceeded the rate limit of 10 requests per minute. Please wait and try again.",
"retry_after": 42
}
Error Codes
DataBlue uses standard HTTP status codes. All error responses include a JSON body with error and message fields.
| Code | Name | Description |
|---|---|---|
| 400 | Bad Request | Invalid request body, missing required fields, or validation error (e.g. URL format, parameter constraints). |
| 401 | Unauthorized | Missing, invalid, or expired authentication token. Re-authenticate or generate a new API key. |
| 403 | Forbidden | Authenticated but insufficient permissions. Your plan may not include access to this endpoint. |
| 404 | Not Found | The requested resource (job ID, monitor, schedule) does not exist. |
| 429 | Rate Limited | You have exceeded your plan's rate limit. Wait for the reset window and retry. See the retry_after field. |
| 500 | Internal Error | Server-side error. This may indicate a bug or infrastructure issue. Retry with exponential backoff. |
| 503 | Service Unavailable | Endpoint is temporarily disabled or under maintenance. Data APIs may return this if a scraper is being updated. |
Error Response Format
// 400 Bad Request
{
"success": false,
"error": "Bad Request",
"message": "Field 'url' is required"
}
// 401 Unauthorized
{
"success": false,
"error": "Unauthorized",
"message": "Invalid or expired authentication token"
}
// 429 Rate Limited
{
"success": false,
"error": "Rate limit exceeded",
"message": "You have exceeded the rate limit of 60 requests per minute",
"retry_after": 23
}
// 500 Internal Error
{
"success": false,
"error": "Internal Server Error",
"message": "An unexpected error occurred. Request ID: req_a1b2c3d4"
}
Scrape-specific error codes: Scrape responses may include an error_code field with values like BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, or NETWORK_ERROR for more granular error classification.
Webhooks
Webhooks allow you to receive real-time notifications when jobs complete instead of polling. Pass webhook_url and optionally webhook_secret when starting any async job (crawl, search, extract).
Webhook Events
| Event | Description |
|---|---|
| scrape.completed | A scrape job has finished (success or failure) |
| crawl.completed | A crawl job has finished all pages |
| crawl.page | A single page within a crawl has been scraped |
| search.completed | A search job has finished all results |
| extract.completed | An extraction job has finished |
Payload Format
POST https://your-server.com/webhook
Headers:
Content-Type: application/json
X-Webhook-Signature: sha256=a1b2c3d4e5f6...
X-Webhook-Event: crawl.completed
Body:
{
"event": "crawl.completed",
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"timestamp": "2026-04-03T12:00:00Z",
"data": {
"total_pages": 47,
"completed_pages": 47,
"url": "https://docs.python.org"
}
}
Signature Verification
If you provide a webhook_secret, DataBlue signs each payload with HMAC-SHA256. The signature is sent in the X-Webhook-Signature header.
import hmac
import hashlib

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    """Verify the HMAC-SHA256 signature from DataBlue webhooks."""
    expected = "sha256=" + hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

# Usage in a FastAPI handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    signature = request.headers.get("X-Webhook-Signature", "")
    if not verify_webhook(body, signature, "your_webhook_secret"):
        raise HTTPException(status_code=401, detail="Invalid signature")
    data = await request.json()
    print(f"Event: {data['event']}, Job: {data['job_id']}")
Retry Policy
- 3 attempts total (1 initial + 2 retries)
- Exponential backoff: 10s, 60s, 300s between retries
- Retries triggered on: connection errors, 5xx responses, timeouts
- Successful delivery requires a 2xx response within 30 seconds
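Because deliveries are retried, the same event can arrive more than once, so handlers should be idempotent. A minimal dedup sketch keyed on (event, job_id), assuming an in-memory set (a real handler would use persistent storage, and crawl.page events would need the page URL added to the key since one job emits many of them):

```python
# Seen (event, job_id) pairs. Webhook retries mean duplicate deliveries
# are expected, so processing must be safe to skip on repeats.
_seen: set[tuple[str, str]] = set()

def process_once(payload: dict) -> bool:
    """Process a webhook payload at most once per (event, job_id).

    Returns True if the payload was processed, False if it was a
    duplicate delivery and was skipped.
    """
    key = (payload["event"], payload["job_id"])
    if key in _seen:
        return False
    _seen.add(key)
    # ... real work goes here (enqueue a task, write to a DB, etc.) ...
    return True
```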
Plans & Credits
Every API call consumes credits. The cost varies by endpoint complexity. Credits reset monthly based on your plan.
Credit Costs by Endpoint
| Endpoint | Credits | Notes |
|---|---|---|
| Scrape | 1 | Per URL |
| Crawl | 1 | Per page crawled |
| Search | 1 | Per search query |
| Map | 1 | Per map request |
| Extract | 5 | Uses LLM processing |
| Google Data APIs | | |
| Google Search | 1 | SERP results |
| Google Maps | 1 | Places search or detail |
| Google News | 1 | News articles |
| Google Finance | 1 | Market data or quote |
| YouTube Data APIs | | |
| YouTube Channel | 1 | Channel metadata |
| YouTube Videos | 2 | Channel video listing |
| YouTube Search | 2 | Video search |
| YouTube Video | 1 | Single video detail |
| YouTube Comments | 2 | Video comments |
| Twitter Data APIs | | |
| Twitter Profile | 1 | User profile |
| Twitter Tweets | 2 | User timeline |
| Twitter Tweet | 1 | Single tweet detail |
| Twitter Search | 2 | Tweet search |
| Reddit Data APIs | | |
| Reddit Subreddit | 1 | Subreddit metadata |
| Reddit Posts | 2 | Subreddit post listing |
| Reddit Search | 2 | Post search |
| Reddit User | 1 | User profile |
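Since per-endpoint costs are fixed, monthly usage is simple arithmetic. A sketch that totals credits for a planned workload (the function name and workload keys are illustrative; the costs mirror the table above):

```python
# Credits per request, taken from the cost table above (a subset).
COST = {"scrape": 1, "crawl_page": 1, "search": 1, "extract": 5,
        "youtube_videos": 2, "twitter_tweets": 2, "reddit_posts": 2}

def monthly_credits(workload: dict[str, int]) -> int:
    """Total credits for a workload given as {endpoint: requests per month}."""
    return sum(COST[name] * count for name, count in workload.items())

# Example: 2,000 scrapes + 500 extractions = 2,000 + 2,500 = 4,500 credits,
# which exceeds the Starter allowance (3,000) but fits comfortably in Plus.
usage = monthly_credits({"scrape": 2000, "extract": 500})
```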
Plan Comparison
| Feature | Free | Starter | Plus | Pro | Growth | Scale |
|---|---|---|---|---|---|---|
| Monthly Credits | 500 | 3,000 | 15,000 | 50,000 | 200,000 | 1,000,000 |
| Rate Limit | 10/min | 60/min | 120/min | 200/min | 300/min | 500/min |
| API Keys | 1 | 3 | 5 | 10 | 25 | Unlimited |
| Monitors | 0 | 2 | 10 | 25 | 100 | Unlimited |
| Schedules | 0 | 2 | 10 | 25 | 100 | Unlimited |