Introduction

DataBlue is a self-hosted web scraping and structured data API platform. It provides a Firecrawl-compatible REST API for scraping, crawling, searching, and extracting content from any website, plus 22+ Data APIs for Google, YouTube, Twitter, and Reddit.

Key Features

  • 5-tier parallel scraping engine — HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass run in a staggered race. First valid result wins. Strategy cache remembers winning strategy per domain.
  • Self-learning strategy cache — Domains that require stealth are automatically detected and upgraded to hard mode on subsequent requests. No manual configuration needed.
  • 22+ structured Data APIs — Google Search, Maps, News, Finance. YouTube channels, videos, comments. Twitter profiles, tweets. Reddit subreddits, posts, users. All return clean JSON.
  • 100% self-hosted — Deploy on your own infrastructure with Docker Compose. No third-party SaaS dependencies. Your data never leaves your servers.
  • Firecrawl-compatible API — Drop-in replacement for Firecrawl. Same endpoint paths, same request/response shapes. Migrate existing integrations with zero code changes.

Search

Search the web with Google, DuckDuckGo, or Brave and get scraped content from each result page. Returns markdown, HTML, links, screenshots, and structured data.

Scrape

Scrape any URL with automatic anti-bot bypass. Returns markdown, HTML, raw HTML, links, screenshots, headings, images, structured data, and LLM-extracted fields.

Extract

Extract structured data from any page using natural language prompts and JSON Schema. Powered by LLMs (OpenAI, Anthropic, Groq). Returns typed JSON matching your schema.

Data APIs

22+ structured data endpoints for Google (Search, Maps, News, Finance), YouTube (channels, videos, comments), Twitter (profiles, tweets), and Reddit (subreddits, posts, users).

Make Your First Request

curl -X POST "https://api.datablue.dev/v1/scrape" \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["markdown", "links"]
  }'

Authentication

DataBlue supports two authentication methods. Both are sent as Authorization: Bearer <token> in the request header.

Method     Format   TTL         Source
JWT Token  eyJ...   7 days      POST /v1/auth/login
API Key    wh_...   Persistent  Dashboard → API Keys

JWT Authentication

Obtain a JWT token by authenticating with your email and password:

curl -X POST "https://api.datablue.dev/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "password": "your_password"}'

# Response: {"access_token": "eyJ...", "token_type": "bearer"}

API Key Authentication (Recommended)

API keys are persistent and do not expire. Generate them from the Dashboard under the API Keys panel. All API keys use the wh_ prefix.

# Use your API key in every request
curl -X POST "https://api.datablue.dev/v1/scrape" \
  -H "Authorization: Bearer wh_abc123def456" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Security note: API keys and JWT tokens are interchangeable in the Authorization header. Sensitive data (LLM keys, proxy credentials) is encrypted at rest with Fernet (AES-128-CBC with HMAC-SHA256 authentication).

Quick Start

Step 1: Sign Up

Create an account at datablue.dev or register via the API:

curl -X POST "https://api.datablue.dev/v1/auth/register" \
  -H "Content-Type: application/json" \
  -d '{"email": "you@example.com", "password": "secure_password", "name": "Your Name"}'

Step 2: Generate an API Key

Log into the Dashboard and navigate to API Keys. Click Create New Key to generate a persistent API key with the wh_ prefix.

Step 3: Make Your First Request

cURL

curl -X POST "https://api.datablue.dev/v1/scrape" \
  -H "Authorization: Bearer wh_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["markdown"]
  }'

Python

import requests

response = requests.post(
    "https://api.datablue.dev/v1/scrape",
    headers={"Authorization": "Bearer wh_your_api_key"},
    json={
        "url": "https://news.ycombinator.com",
        "formats": ["markdown"]
    }
)
data = response.json()
print(data["data"]["markdown"][:200])

JavaScript

const response = await fetch("https://api.datablue.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Authorization": "Bearer wh_your_api_key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://news.ycombinator.com",
    formats: ["markdown"]
  })
});
const data = await response.json();
console.log(data.data.markdown.slice(0, 200));

Example Response

{
  "success": true,
  "data": {
    "markdown": "# Hacker News\n\n1. Show HN: I built an open-source web scraper...",
    "metadata": {
      "title": "Hacker News",
      "language": "en",
      "source_url": "https://news.ycombinator.com",
      "status_code": 200,
      "word_count": 1847
    }
  }
}

Installation

The official datablue Python SDK provides both synchronous and asynchronous clients for the DataBlue API. Built on httpx and Pydantic v2, it offers full type safety, automatic retries with exponential backoff, and strongly-typed response models.

Requirements

Dependency  Version
Python      >= 3.10
httpx       >= 0.27.0
pydantic    >= 2.0.0

Install from PyPI

pip install datablue

Or with a specific version:

pip install datablue==2.0.0

Install with Poetry / uv

# Poetry
poetry add datablue

# uv
uv add datablue

Verify Installation

python -c "import datablue; print(datablue.__version__)"
# 2.0.0

Quick Start

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)

Async support: Every method available on DataBlue (sync) is also available on AsyncDataBlue (async) with the same signature. Use await and async with for the async variant.

Using with AI Assistants

The v2.0.0 SDK is designed to be AI-first. It ships with two machine-readable reference files that AI coding assistants (Claude Code, Cursor, GitHub Copilot, etc.) can read for accurate code generation:

  • CLAUDE.md — A structured quick-reference file at the SDK root (sdk/CLAUDE.md) containing all method signatures, response models, error types, and common patterns. AI assistants automatically read this file for context.
  • llms.txt — A standardized machine-readable documentation file following the llms.txt convention. Provides a condensed API surface for LLM consumption.

Tip: When using an AI coding assistant with the DataBlue SDK, point it at the sdk/ directory. The CLAUDE.md file contains every method signature, typed model, and error type with copy-pasteable examples, which reduces hallucinated API calls and helps the assistant generate working code.

SDK Features

  • Sync + Async clients — DataBlue for synchronous code, AsyncDataBlue for asyncio/FastAPI/Django
  • Pydantic v2 response models — every response is a typed dataclass with autocomplete and validation
  • Automatic retries — exponential backoff on 429 and 5xx errors, configurable max retries
  • Context manager support — with / async with for clean resource management
  • Job polling built-in — crawl/search blocking methods poll automatically with configurable timeout
  • Batch scraping — concurrent scraping with semaphore-based concurrency control and streaming results
  • Typed error hierarchy — catch specific errors like RateLimitError, AuthenticationError, etc.
  • Environment variable config — zero-config setup with DataBlue.from_env()
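
The retry behavior in the list above reduces to two small decisions: is the error transient, and how long to wait. A sketch of the general pattern, assuming base-delay and cap values that the SDK does not document:

```python
def should_retry(status_code: int, attempt: int, max_retries: int = 3) -> bool:
    # Retry only transient failures: rate limits (429) and server errors (5xx)
    transient = status_code == 429 or 500 <= status_code < 600
    return transient and attempt < max_retries

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Delay doubles with each attempt, capped so retries never wait too long
    return min(cap, base * (2 ** attempt))

print([backoff_delay(a) for a in range(4)])  # [1.0, 2.0, 4.0, 8.0]
print(should_retry(429, attempt=0))          # True
print(should_retry(404, attempt=0))          # False
```

A 404 is never retried because re-requesting a missing page cannot succeed; only 429 and 5xx responses are worth waiting out.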

Authentication

The SDK supports three authentication methods: API key (recommended), environment variables, and email/password login.

API Key (Recommended)

Pass your API key directly to the constructor. API keys use the wh_ prefix and never expire.

from datablue import DataBlue

# Sync client
client = DataBlue(api_key="wh_your_api_key")
result = client.scrape("https://example.com")
client.close()

# Or use context manager (recommended)
with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)

from datablue import AsyncDataBlue

# Async client
async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    result = await client.scrape("https://example.com")
    print(result.data.markdown)

Environment Variables

Set DATABLUE_API_KEY and optionally DATABLUE_API_URL in your environment, then use from_env():

# Shell — set environment variables
export DATABLUE_API_KEY=wh_your_api_key
export DATABLUE_API_URL=https://api.datablue.dev    # optional, default: http://localhost:8000
export DATABLUE_TIMEOUT=120                          # optional, seconds (default: 60)
export DATABLUE_MAX_RETRIES=5                        # optional (default: 3)

from datablue import DataBlue

# Reads all DATABLUE_* environment variables automatically
with DataBlue.from_env() as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)

Variable              Required  Default                Description
DATABLUE_API_KEY      Yes       —                      Your API key (wh_... prefix)
DATABLUE_API_URL      No        http://localhost:8000  Base URL of the DataBlue API
DATABLUE_TIMEOUT      No        60                     Request timeout in seconds
DATABLUE_MAX_RETRIES  No        3                      Max retry attempts on transient errors

Email/Password Login

Use login() to authenticate with email and password. This obtains a JWT token (7-day TTL) and stores it internally on the client instance.

from datablue import DataBlue

with DataBlue(api_url="https://api.datablue.dev") as client:
    # Authenticate — JWT is stored automatically
    auth = client.login("you@example.com", "your_password")
    print(f"Token: {auth['access_token'][:20]}...")

    # All subsequent requests use the JWT
    result = client.scrape("https://example.com")
    print(result.data.markdown)

# Async variant
from datablue import AsyncDataBlue

async with AsyncDataBlue(api_url="https://api.datablue.dev") as client:
    await client.login("you@example.com", "your_password")
    result = await client.scrape("https://example.com")

Recommendation: Use API keys for production and CI/CD. Use email/password login only for interactive scripts or development. API keys are persistent and do not expire, while JWT tokens expire after 7 days.
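
To see why a stored JWT stops working after a week, you can inspect its exp claim client-side. A stdlib-only sketch (the toy token below is fabricated for demonstration; real tokens come from /v1/auth/login, and this does not verify the signature):

```python
import base64
import json

def jwt_expires_at(token: str) -> int:
    # A JWT is header.payload.signature; the payload is base64url-encoded JSON.
    # This only reads the claim; it does NOT verify the signature.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]  # Unix timestamp when the token expires

# Toy token for demonstration only
claims = base64.urlsafe_b64encode(json.dumps({"exp": 1900000000}).encode()).decode().rstrip("=")
token = f"eyJhbGciOiJIUzI1NiJ9.{claims}.sig"
print(jwt_expires_at(token))  # 1900000000
```

Comparing the returned timestamp against the current time tells you whether a cached token needs a fresh login.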

Scrape

Scrape a single URL and return structured content. The scrape() method is synchronous and returns immediately with the result. The scraping engine uses a 5-tier parallel race: HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass — first valid result wins.

Basic Usage

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.success)          # True
    print(result.data.markdown)    # "# Example Domain\n\nThis domain is..."
    print(result.data.metadata)    # PageMetadata(title="Example Domain", ...)

With All Options

result = client.scrape(
    "https://news.ycombinator.com",
    formats=["markdown", "html", "links", "screenshot"],
    only_main_content=True,
    wait_for=2000,                          # wait 2s after page load
    timeout=45000,                          # 45s timeout
    include_tags=["article", "main"],       # only these HTML tags
    exclude_tags=["nav", "footer"],         # remove these tags
    headers={"Accept-Language": "en-US"},
    cookies={"session": "abc123"},
    mobile=True,
    mobile_device="iphone_14",
    css_selector="main.content",            # target specific element
    use_proxy=True,
    capture_network=True,
)

print(result.data.markdown)
print(result.data.html)
print(result.data.links)
print(result.data.screenshot)       # base64-encoded PNG
print(result.data.metadata.title)
print(result.data.metadata.word_count)

Saving Screenshots

Screenshots are returned as base64-encoded PNG strings. Save to a file, embed in HTML, or pass directly to an LLM:

import base64

result = client.scrape("https://example.com", formats=["screenshot"])

# Save to file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.screenshot))

# Embed in HTML
html_img = f'<img src="data:image/png;base64,{result.data.screenshot}" />'

# Store in database as a string field
db.save(url=result.data.metadata.source_url, screenshot=result.data.screenshot)

Async

from datablue import AsyncDataBlue

async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    result = await client.scrape(
        "https://example.com",
        formats=["markdown", "links"],
        only_main_content=True,
    )
    print(result.data.markdown)

Browser Actions

Execute browser actions before content extraction. Useful for pages that require interaction (clicking buttons, filling forms, scrolling to load content).

result = client.scrape(
    "https://example.com/infinite-scroll",
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down", "amount": 3},
        {"type": "screenshot"},
    ],
)

Action Types Reference

Action Type  Required Fields              Optional Fields                 Description
click        selector                     button, click_count, modifiers  Click an element
wait         milliseconds                 —                               Wait for duration
scroll       direction (up/down), amount  —                               Scroll the page
type         selector, text               —                               Type text into input
screenshot   —                            —                               Capture screenshot
hover        selector                     —                               Hover over element
press        key                          modifiers                       Press keyboard key
select       selector, value              —                               Select dropdown option
fill_form    fields                       —                               Fill multiple form fields
evaluate     script                       —                               Run JavaScript
go_back      —                            —                               Navigate back
go_forward   —                            —                               Navigate forward

LLM Extraction

Extract structured data from a page using an LLM. Pass a prompt and optional schema (JSON Schema) in the extract parameter.

result = client.scrape(
    "https://openai.com/pricing",
    extract={
        "prompt": "Extract all pricing tiers with name and price",
        "schema": {
            "type": "object",
            "properties": {
                "tiers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                },
            },
        },
    },
)
print(result.data.extract)  # {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, ...]}

Parameters

  • url (str, required) — The URL to scrape
  • formats (list[Literal["markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images"]], default None) — Output formats to return. Defaults to ["markdown"] if not specified.
  • only_main_content (bool, default True) — Extract only main content, removing nav/footer/sidebar
  • wait_for (int, default 0) — Milliseconds to wait after page load before extracting
  • timeout (int, default 30000) — Request timeout in milliseconds
  • include_tags (list[str], default None) — Only include these HTML tags in extraction
  • exclude_tags (list[str], default None) — Exclude these HTML tags from extraction
  • headers (dict[str, str], default None) — Custom HTTP headers to send with the request
  • cookies (dict[str, str], default None) — Custom cookies to send as name/value pairs
  • mobile (bool, default False) — Emulate a mobile device viewport
  • mobile_device (str, default None) — Device preset: "iphone_14", "pixel_7", "ipad_pro"
  • css_selector (str, default None) — Only extract content matching this CSS selector
  • xpath (str, default None) — XPath expression for targeted extraction
  • selectors (dict[str, Any], default None) — Named CSS selectors for multi-element extraction
  • actions (list[dict] | list[ActionStep], default None) — Browser actions to execute before scraping. Accepts raw dicts or typed ActionStep model instances.
  • extract (dict | ExtractConfig, default None) — LLM extraction config. Accepts a dict with "prompt" and "schema" keys, or a typed ExtractConfig model.
  • capture_network (bool, default False) — Capture browser network requests. Results appear in result.data.network_data as a dict with request/response details.
  • use_proxy (bool, default False) — Route through configured proxy for anti-bot bypass
  • webhook_url (str, default None) — URL to receive webhook on completion
  • webhook_secret (str, default None) — HMAC secret for webhook signature verification
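
On the receiving end, a webhook signed with webhook_secret is typically verified by recomputing the HMAC over the raw body. A sketch of that pattern; the header name and hex encoding are assumptions, since the source does not specify the signature format:

```python
import hashlib
import hmac

def verify_webhook(payload: bytes, signature_hex: str, secret: str) -> bool:
    # Recompute HMAC-SHA256 over the raw body and compare in constant time
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

body = b'{"status": "completed", "job_id": "abc123"}'
sig = hmac.new(b"my_webhook_secret", body, hashlib.sha256).hexdigest()
print(verify_webhook(body, sig, "my_webhook_secret"))  # True
print(verify_webhook(body, sig, "wrong_secret"))       # False
```

Always verify against the raw request bytes, not a re-serialized JSON object, since re-serialization can reorder keys and change whitespace.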

Response Model

Returns ScrapeResult — a Pydantic model with the following fields:

class ScrapeResult:
    success: bool                    # Whether the scrape request succeeded
    data: PageData | None            # Scraped page content including markdown, html, links, and metadata
    error: str | None                # Human-readable error message if the scrape failed
    error_code: str | None           # Machine-readable error code (BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, NETWORK_ERROR)
    job_id: str | None               # Unique job identifier for async scrape requests

class PageData:
    url: str | None                  # The URL of the scraped page
    markdown: str | None             # Clean Markdown conversion of the page content
    fit_markdown: str | None         # Markdown trimmed to fit LLM context windows
    html: str | None                 # Cleaned HTML content with boilerplate removed
    raw_html: str | None             # Original unmodified HTML source of the page
    links: list[str] | None          # List of URLs found on the page
    links_detail: list[dict] | None  # Detailed link info including anchor text and attributes
    screenshot: str | None           # Base64-encoded PNG screenshot of the page
    structured_data: dict | None     # JSON-LD and microdata extracted from the page
    headings: list[dict] | None      # List of headings with level and text (e.g. [{level: 1, text: "Title"}])
    images: list[dict] | None        # List of images with src, alt, and dimensions
    extract: dict | None             # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None     # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None         # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None    # Page metadata including title, status_code, word_count, and SEO tags

class PageMetadata:
    title: str | None                # Page title from <title> tag
    description: str | None          # Meta description from <meta name='description'>
    language: str | None             # Page language code (e.g. 'en', 'fr', 'de')
    source_url: str | None           # The URL that was actually scraped (after redirects)
    status_code: int | None          # HTTP response status code (200, 404, 500, etc.)
    word_count: int                  # Number of words in the main content
    reading_time_seconds: int        # Estimated reading time in seconds
    content_length: int              # Response body size in bytes
    og_image: str | None             # OpenGraph image URL for social sharing previews
    canonical_url: str | None        # Canonical URL from <link rel='canonical'>
    favicon: str | None              # Favicon URL
    robots: str | None               # Robots meta tag content (e.g. 'noindex, nofollow')
    response_headers: dict | None    # HTTP response headers as key-value pairs

Crawl

Crawl a website starting from a seed URL, discovering and scraping pages via BFS, DFS, or best-first strategy. The SDK provides two methods: crawl() blocks until all pages are scraped, while start_crawl() returns immediately for manual polling.
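
The BFS strategy can be pictured as a queue of (url, depth) pairs over a link graph. A toy sketch in which a dict stands in for live scraping:

```python
from collections import deque

def bfs_order(start: str, links: dict[str, list[str]],
              max_depth: int = 2, max_pages: int = 10) -> list[str]:
    # links maps each URL to its outgoing links (a toy graph, not real scraping)
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # max_depth reached: visit but do not expand
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}
print(bfs_order("/", site))  # ['/', '/a', '/b', '/a/1', '/b/1']
```

DFS would swap the queue for a stack (visiting "/a/1" before "/b"), and best-first would pop whichever frontier URL scores highest instead of the oldest one.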

Blocking Crawl

The simplest approach. crawl() starts the job and polls until completion, then returns all results at once.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
        include_paths=["/3/library/*"],
        poll_interval=2.0,     # check every 2 seconds
        timeout=300.0,         # give up after 5 minutes
    )

    print(f"Status: {status.status}")                # "completed"
    print(f"Pages: {status.completed_pages}/{status.total_pages}")
    print(f"Progress: {status.progress:.0%}")         # "100%"

    for page in status.data:
        print(f"  {page.url} — {page.metadata.word_count} words")

Non-blocking Crawl (Manual Polling)

Use start_crawl() to get a job ID, then poll with get_crawl_status() at your own pace.

import time
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    # Start the crawl — returns immediately
    job = client.start_crawl(
        "https://docs.python.org/3/",
        max_pages=100,
        max_depth=3,
        concurrency=5,
        crawl_strategy="bfs",
    )
    print(f"Job ID: {job.job_id}")
    print(f"Status: {job.status}")    # "started"

    # Poll until done
    while True:
        status = client.get_crawl_status(job.job_id)
        print(f"  Progress: {status.completed_pages}/{status.total_pages}")

        if status.is_complete:
            break
        time.sleep(3)

    # Process results
    for page in status.data:
        if page.success:
            print(f"  {page.url}: {len(page.markdown or '')} chars")
        else:
            print(f"  {page.url}: FAILED — {page.error}")

Paginating Large Crawls

Results are paginated (20 per page by default). For crawls with more than 20 pages, you must paginate through the results to retrieve all scraped data:

# Fetch the current page of results (page 1 by default)
status = client.get_crawl_status(job.job_id)
# Access pagination info:
# status.total_results — total pages crawled
# status.page — current page number
# status.per_page — results per page (default 20)

Use the total_results, page, and per_page fields on the CrawlStatus response to determine how many pages of results exist and iterate through them. If your crawl returns fewer than 20 pages, all results will be in the first response.
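
The iteration itself is simple arithmetic over total_results and per_page. A sketch with a stubbed fetch function standing in for repeated status calls (the stub and its page argument are illustrative, not the SDK's exact signature):

```python
import math

def pages_needed(total_results: int, per_page: int = 20) -> int:
    # Number of status requests needed to see every result
    return max(1, math.ceil(total_results / per_page))

def fetch_all(fetch_page, total_results: int, per_page: int = 20) -> list:
    # fetch_page(n) stands in for a status call requesting result page n
    results = []
    for page in range(1, pages_needed(total_results, per_page) + 1):
        results.extend(fetch_page(page))
    return results

# Stub backend: 45 results served 20 at a time
data = [f"result-{i}" for i in range(45)]
fake_fetch = lambda page: data[(page - 1) * 20 : page * 20]
print(pages_needed(45), len(fetch_all(fake_fetch, 45)))  # 3 45
```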

Cancel a Crawl

result = client.cancel_crawl(job.job_id)
print(result)  # {"success": True, "message": "Crawl cancelled"}

Async

from datablue import AsyncDataBlue

async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    # Blocking async crawl
    status = await client.crawl(
        "https://docs.python.org/3/",
        max_pages=50,
        max_depth=2,
    )
    for page in status.data:
        print(page.url)

    # Non-blocking: start + poll
    job = await client.start_crawl("https://example.com", max_pages=20)
    status = await client.get_crawl_status(job.job_id)
    await client.cancel_crawl(job.job_id)

Parameters

  • url (str, required) — Starting URL to crawl
  • max_pages (int, default 100) — Maximum pages to crawl (1–1000)
  • max_depth (int, default 3) — Maximum link depth from starting URL (1–10)
  • concurrency (int, default 3) — Parallel scrape workers (1–10)
  • include_paths (list[str], default None) — Only crawl URLs matching these glob patterns
  • exclude_paths (list[str], default None) — Skip URLs matching these glob patterns
  • allow_external_links (bool, default False) — Follow links to external domains
  • respect_robots_txt (bool, default True) — Obey the site's robots.txt rules
  • filter_faceted_urls (bool, default True) — Deduplicate faceted/navigation URL variations
  • crawl_strategy (Literal["bfs", "dfs", "bff"], default None) — "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first frontier)
  • scrape_options (dict, default None) — Options passed to each page scrape (formats, only_main_content, wait_for, etc.)
  • use_proxy (bool, default False) — Route all requests through configured proxy
  • webhook_url (str, default None) — URL to receive webhook on completion
  • webhook_secret (str, default None) — HMAC secret for webhook signature verification

The crawl() blocking method accepts two additional parameters:

  • poll_interval (float, default 2.0) — Seconds between status polls
  • timeout (float, default 300.0) — Maximum seconds to wait before raising TimeoutError
  • on_progress (Callable[[CrawlStatus], None], default None) — Optional callback invoked after each poll with the latest CrawlStatus. Useful for progress bars or logging.

Streaming Crawl

Use crawl_stream() to receive pages in real time via NDJSON streaming — no polling required. Pages are yielded as they are discovered and scraped.

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    for page in client.crawl_stream("https://docs.example.com", max_pages=50):
        print(f"{page.url} — {len(page.markdown or '')} chars")
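
On the wire, the stream is plain NDJSON, which the stdlib can parse line by line. A sketch of the framing only (not the SDK's internals):

```python
import json

def parse_ndjson(stream_text: str) -> list[dict]:
    # NDJSON (newline-delimited JSON): each non-empty line is one document
    return [json.loads(line) for line in stream_text.splitlines() if line.strip()]

chunk = '{"url": "/a", "markdown": "# A"}\n{"url": "/b", "markdown": "# B"}\n'
pages = parse_ndjson(chunk)
print(len(pages), pages[0]["url"])  # 2 /a
```

Because each line is self-contained, a consumer can act on page "/a" before "/b" has even been scraped, which is what makes polling unnecessary.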

Streaming with Callbacks

Use crawl_stream_with_callback() for event-driven architectures. Provide callback functions instead of iterating.

from datablue import DataBlue

pages = []

with DataBlue(api_key="wh_your_api_key") as client:
    client.crawl_stream_with_callback(
        "https://example.com",
        max_pages=100,
        on_document=lambda page: pages.append(page),
        on_complete=lambda: print(f"Done! {len(pages)} pages"),
        on_error=lambda e: print(f"Error: {e}"),
    )

  • crawl_stream(url, **opts) → Iterator[CrawlPageData] — yields pages via NDJSON stream as they are discovered
  • crawl_stream_with_callback(url, on_document=..., **opts) → None — callback-based streaming with on_document, on_complete, on_error hooks

Response Model

class CrawlJob:                         # from start_crawl()
    success: bool                        # Whether the crawl job was accepted
    job_id: str                          # Unique job identifier for polling crawl status
    status: str                          # Current job status (typically "started")
    message: str | None                  # Human-readable status message or error description

class CrawlStatus:                       # from crawl() or get_crawl_status()
    success: bool                        # Whether the status request succeeded
    job_id: str                          # Unique job identifier
    status: str                          # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    total_pages: int                     # Total number of pages discovered for crawling
    completed_pages: int                 # Number of pages successfully scraped so far
    data: list[CrawlPageData]            # List of scraped page results (CrawlPageData objects)
    total_results: int                   # Total number of results available (for pagination)
    page: int                            # Current page number in paginated results (default: 1)
    per_page: int                        # Number of results per page (default: 20)
    error: str | None                    # Human-readable error message if the crawl failed

    # Properties
    is_complete: bool                    # True when status in {"completed", "failed", "cancelled"}
    progress: float                      # completed_pages / total_pages (0.0 to 1.0)

class CrawlPageData:
    id: str | None                       # Unique identifier for this page within the crawl job
    url: str | None                      # The URL of the crawled page
    markdown: str | None                 # Clean Markdown conversion of the page content
    fit_markdown: str | None             # Markdown trimmed to fit LLM context windows
    html: str | None                     # Cleaned HTML content with boilerplate removed
    raw_html: str | None                 # Original unmodified HTML source of the page
    links: list[str] | None              # List of URLs found on the page
    links_detail: dict | list | None     # Detailed link info including anchor text and attributes
    screenshot: str | None               # Base64-encoded PNG screenshot of the page
    structured_data: dict | None         # JSON-LD and microdata extracted from the page
    headings: list[dict] | None          # List of headings with level and text
    images: list[dict] | None            # List of images with src, alt, and dimensions
    extract: dict | None                 # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None         # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None             # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None        # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                    # Error message if scraping this page failed
    success: bool                        # Whether this individual page was scraped successfully

Map

Discover all URLs on a website by combining sitemap.xml parsing, robots.txt discovery, and link crawling. The map() method returns a flat list of discovered URLs with metadata. This is useful for understanding site structure before launching a targeted crawl.
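
The sitemap.xml side of discovery can be sketched with the stdlib XML parser. The sample sitemap below is fabricated for illustration:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str) -> list[dict]:
    # Extract loc/lastmod/priority from each <url> entry in the sitemap
    root = ET.fromstring(xml_text)
    return [
        {
            "url": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        }
        for url in root.findall("sm:url", NS)
    ]

sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod><priority>1.0</priority></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
print(len(parse_sitemap(sitemap)))  # 2
```

Fields missing from a sitemap entry come back as None, which matches the optional title/lastmod/priority fields on LinkResult below.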

Basic Usage

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.map("https://docs.python.org")

    print(f"Total URLs: {result.total}")
    for link in result.links:
        print(f"  {link.url}")
        if link.title:
            print(f"    Title: {link.title}")
        if link.lastmod:
            print(f"    Last modified: {link.lastmod}")

With Search Filter

# Only find URLs containing "tutorial"
result = client.map(
    "https://docs.python.org",
    search="tutorial",
    limit=50,
)

print(f"Found {result.total} tutorial URLs")
for link in result.links:
    print(f"  {link.url}")

URL Shorthand

Use the urls property to get a flat list of URL strings:

result = client.map("https://example.com", limit=200)

# Get just the URLs as a plain list
url_list = result.urls    # ["https://example.com/", "https://example.com/about", ...]
print(f"Found {len(url_list)} URLs")

# Feed the discovered URLs into a batch scrape
results = client.batch_scrape(url_list)

Async

from datablue import AsyncDataBlue

async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    result = await client.map(
        "https://example.com",
        limit=500,
        include_subdomains=True,
    )
    print(f"Found {result.total} URLs")
    for url in result.urls:
        print(url)

Parameters

  • url (str, required) — Website URL to map
  • search (str, default None) — Filter URLs matching this search string
  • limit (int, default 100) — Maximum number of URLs to return
  • include_subdomains (bool, default True) — Include URLs from subdomains
  • use_sitemap (bool, default True) — Parse sitemap.xml for URL discovery

Response Model

class MapResult:
    success: bool                        # Whether the map request succeeded
    total: int                           # Total number of links discovered on the site
    links: list[LinkResult]              # List of discovered links with URL and optional metadata
    error: str | None                    # Human-readable error message if the map failed
    job_id: str | None                   # Unique job identifier for async map requests

    # Properties
    urls: list[str]                      # Convenience: [link.url for link in links]

class LinkResult:
    url: str                             # The discovered URL
    title: str | None                    # Page title (from sitemap or page metadata)
    description: str | None              # Page description (from sitemap or meta tags)
    lastmod: str | None                  # Last modification date in ISO 8601 format (from sitemap)
    priority: float | None               # Sitemap priority value between 0.0 and 1.0

Batch Scrape

Scrape multiple URLs in a single call. The sync client runs sequentially, while the async client runs concurrently with configurable parallelism. For the async client, batch_scrape_iter() yields results as they complete for streaming processing.

Sync Batch Scrape

from datablue import DataBlue

urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://stackoverflow.com",
]

with DataBlue(api_key="wh_your_api_key") as client:
    results = client.batch_scrape(urls, concurrency=5)

    for result in results:
        if result.success:
            print(f"{result.data.url}: {result.data.metadata.word_count} words")
        else:
            print(f"FAILED: {result.error}")

Sync with Scrape Options

results = client.batch_scrape(
    urls,
    concurrency=3,
    scrape_options={
        "formats": ["markdown", "links"],
        "only_main_content": True,
        "timeout": 45000,
    },
)

Async Batch (Collect All)

from datablue import AsyncDataBlue

async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    results = await client.batch_scrape(
        urls,
        concurrency=10,
        scrape_options={"formats": ["markdown"]},
    )
    print(f"Scraped {len(results)} pages")
    for r in results:
        print(f"  {r.data.url}: {r.success}")

Async Streaming (Recommended for Large Batches)

Use batch_scrape_iter() to process results as they arrive — no need to wait for all pages to finish before starting processing.

from datablue import AsyncDataBlue

urls = ["https://example.com/page/" + str(i) for i in range(100)]

async with AsyncDataBlue(api_key="wh_your_api_key") as client:
    completed = 0
    async for result in client.batch_scrape_iter(urls, concurrency=10):
        completed += 1
        if result.success:
            print(f"[{completed}/{len(urls)}] {result.data.url} — {result.data.metadata.word_count} words")
        else:
            print(f"[{completed}/{len(urls)}] FAILED: {result.error}")

Parameters

Parameter       Type       Default   Description
urls            list[str]  required  List of URLs to scrape
concurrency     int        5         Maximum concurrent requests (async only; sync runs sequentially)
scrape_options  dict       None      Options passed to each scrape: formats, only_main_content, timeout, etc.

Methods Summary

Method               Client         Returns                      Behavior
batch_scrape()       DataBlue       list[ScrapeResult]           Sequential, blocks until all done
batch_scrape()       AsyncDataBlue  list[ScrapeResult]           Concurrent, blocks until all done
batch_scrape_iter()  AsyncDataBlue  AsyncIterator[ScrapeResult]  Concurrent, yields as completed

Note: The sync client runs batch scrape sequentially regardless of the concurrency parameter. For parallel execution, use the async client's batch_scrape() or batch_scrape_iter().

Error resilience: Batch methods never raise on individual page failures. Failed pages return a ScrapeResult with success=False and the error message in the error field. Always check result.success before accessing result.data.
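The success check can be wrapped in a small partition helper. This is an illustrative sketch using plain stand-in objects in place of the SDK's ScrapeResult:

```python
from types import SimpleNamespace

def partition_results(results):
    """Split batch results into (succeeded, failed) by the success flag."""
    succeeded = [r for r in results if r.success]
    failed = [r for r in results if not r.success]
    return succeeded, failed

# Stand-in results for demonstration; real batch_scrape() returns ScrapeResult objects
results = [
    SimpleNamespace(success=True, error=None),
    SimpleNamespace(success=False, error="HTTP 403"),
    SimpleNamespace(success=True, error=None),
]
ok, bad = partition_results(results)
print(len(ok), len(bad))  # 2 1
```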

Error Handling

The SDK raises typed exceptions for all API errors. Every exception inherits from DataBlueError, making it easy to catch all errors or handle specific types. The HTTP client automatically retries transient errors (429 and 5xx) with exponential backoff before raising.

Exception Hierarchy

DataBlueError                         # Base exception for all SDK errors
    AuthenticationError               # 401 — bad or missing API key / JWT
    NotFoundError                     # 404 — resource does not exist
    RateLimitError                    # 429 — rate limit exceeded (retryable)
    ServerError                       # 5xx — server error (retryable)
    JobFailedError                    # Polled job completed with "failed" status
    TimeoutError                      # Polling timeout exceeded

Basic Error Handling

from datablue import (
    DataBlue,
    DataBlueError,
    AuthenticationError,
    RateLimitError,
    NotFoundError,
    ServerError,
    JobFailedError,
    TimeoutError,
)

with DataBlue(api_key="wh_your_api_key") as client:
    try:
        result = client.scrape("https://example.com")
    except AuthenticationError as e:
        print(f"Auth failed: {e.message}")
        print(f"Status: {e.status_code}")        # 401
        print(f"Docs: {e.docs_url}")              # https://docs.datablue.dev/errors/authentication
    except RateLimitError as e:
        print(f"Rate limited: {e.message}")
        print(f"Retry after: {e.retry_after}s")   # seconds to wait
        print(f"Retryable: {e.is_retryable}")     # True
    except NotFoundError as e:
        print(f"Not found: {e.message}")           # 404
    except ServerError as e:
        print(f"Server error ({e.status_code}): {e.message}")
        print(f"Retryable: {e.is_retryable}")     # True
    except DataBlueError as e:
        print(f"API error: {e.message}")
        print(f"Status: {e.status_code}")
        print(f"Body: {e.response_body}")

Job Errors (Crawl / Search)

from datablue import DataBlue, JobFailedError, TimeoutError

with DataBlue(api_key="wh_your_api_key") as client:
    try:
        status = client.crawl(
            "https://example.com",
            max_pages=100,
            timeout=60.0,       # fail if not done in 60s
        )
    except TimeoutError as e:
        print(f"Timed out after {e.elapsed:.1f}s")
        print(f"Job ID: {e.job_id}")
        # Optionally cancel the still-running job
        client.cancel_crawl(e.job_id)
    except JobFailedError as e:
        print(f"Job failed: {e.message}")
        print(f"Job ID: {e.job_id}")
        print(f"Response: {e.response_body}")

Exception Attributes

Attribute      Type          Available On                  Description
message        str           All                           Human-readable error description
status_code    int | None    All                           HTTP status code (if from API response)
response_body  dict | None   All                           Raw API response body
is_retryable   bool          All                           Whether the request can be safely retried
retry_after    float | None  RateLimitError                Seconds to wait before retrying
docs_url       str | None    All                           Link to documentation for this error type
job_id         str | None    JobFailedError, TimeoutError  Job ID that failed or timed out
elapsed        float | None  TimeoutError                  Seconds elapsed before timeout
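One practical use of these attributes is a uniform wait-time policy for manual retries. A minimal sketch, assuming only the attribute names documented above (the helper and cap values here are illustrative, not part of the SDK):

```python
def retry_delay(exc, attempt: int, backoff_factor: float = 0.5, cap: float = 30.0):
    """Return seconds to wait before retrying, or None if the error is not retryable."""
    if not getattr(exc, "is_retryable", False):
        return None
    # Prefer the server-provided hint when present (rate limits)
    retry_after = getattr(exc, "retry_after", None)
    if retry_after is not None:
        return min(retry_after, cap)
    # Otherwise fall back to exponential backoff
    return min(backoff_factor * 2 ** attempt, cap)

class FakeRateLimit:
    # Stand-in carrying the documented attributes
    is_retryable = True
    retry_after = 42.0

print(retry_delay(FakeRateLimit(), attempt=0))  # 30.0 (42s hint capped at 30s)
```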

AI-Friendly Error Messages

As of v2.0.0, error messages include fix suggestions inline, making them useful for both humans and AI coding assistants:

# AuthenticationError message includes fix instructions:
# "Authentication failed. Set DATABLUE_API_KEY environment variable
#  or pass api_key to DataBlue(api_key='wh_...')"

# RateLimitError includes wait time:
# "Rate limit exceeded. Wait 42s before retrying,
#  or reduce request frequency."

# TimeoutError includes fix suggestion:
# "Job crawl-abc123 did not complete within 300s.
#  Try increasing the timeout parameter."

# ServerError indicates auto-retry:
# "Server error (502). This request will be automatically retried."

Automatic Retries

The SDK automatically retries on transient errors before raising an exception:

  • 429 (Rate Limit) — waits for the Retry-After header, or uses exponential backoff (max 30s)
  • 5xx (Server Error) — exponential backoff: 0.5s, 1s, 2s (max 10s per wait)
  • Connection errors — same exponential backoff as 5xx
  • Max retries: 3 by default, configurable via max_retries parameter
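The schedule above follows delay = backoff_factor * 2^attempt with a per-wait cap. A quick sketch of the default 5xx schedule (illustrative helper, not an SDK function):

```python
def backoff_schedule(max_retries: int = 3, backoff_factor: float = 0.5,
                     cap: float = 10.0) -> list[float]:
    """Delays in seconds for each retry attempt; defaults give 0.5s, 1s, 2s."""
    return [min(backoff_factor * 2 ** attempt, cap) for attempt in range(max_retries)]

print(backoff_schedule())   # [0.5, 1.0, 2.0]
print(backoff_schedule(6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0] -- capped at 10s
```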

Configuration

The SDK uses an immutable ClientConfig dataclass for all configuration. You can pass parameters directly to the constructor, use environment variables, or build a config object manually.

Constructor Parameters

from datablue import DataBlue

client = DataBlue(
    api_url="https://api.datablue.dev",   # Base URL (default: http://localhost:8000)
    api_key="wh_your_api_key",             # API key (wh_ prefix)
    timeout=120.0,                          # Request timeout in seconds (default: 60)
    max_retries=5,                          # Max retry attempts (default: 3)
)

ClientConfig Object

For advanced control, build a ClientConfig and pass it to the constructor:

from datablue import DataBlue, ClientConfig

config = ClientConfig(
    api_url="https://api.datablue.dev",
    api_key="wh_your_api_key",
    timeout=120.0,
    max_retries=5,
    backoff_factor=1.0,                  # Multiplier for exponential backoff (default: 0.5)
)

client = DataBlue(config=config)

Config from Environment

from datablue import DataBlue, AsyncDataBlue, ClientConfig

# Build config from DATABLUE_* env vars
config = ClientConfig.from_env()

# Use with either client type
sync_client = DataBlue(config=config)
async_client = AsyncDataBlue(config=config)

Cloning Configs

Configs are immutable (frozen dataclass). Use clone() to create modified copies for different environments:

from datablue import DataBlue, ClientConfig

# Base config
prod = ClientConfig(
    api_url="https://api.datablue.dev",
    api_key="wh_prod_key",
    timeout=60.0,
    max_retries=3,
)

# Derive staging config (inherits everything except overrides)
staging = prod.clone(
    api_url="https://staging.datablue.dev",
    api_key="wh_staging_key",
)

# Derive a fast config for time-sensitive operations
fast = prod.clone(timeout=10.0, max_retries=1)

# Use each
with DataBlue(config=prod) as client:
    result = client.scrape("https://example.com")

ClientConfig Fields

Field           Type        Default                Description
api_url         str         http://localhost:8000  Base URL of the DataBlue API (trailing slash auto-stripped)
api_key         str | None  None                   API key with wh_ prefix
timeout         float       60.0                   HTTP request timeout in seconds
max_retries     int         3                      Maximum retry attempts on transient errors (429, 5xx, connection errors)
backoff_factor  float       0.5                    Multiplier for exponential backoff: delay = factor * 2^attempt

Self-Hosted Setup

Point the SDK at your self-hosted DataBlue instance by setting the api_url:

# Direct constructor
with DataBlue(
    api_url="https://scraper.internal.company.com",
    api_key="wh_internal_key",
) as client:
    result = client.scrape("https://example.com")

# Or via environment variables (shell):
export DATABLUE_API_URL=https://scraper.internal.company.com
export DATABLUE_API_KEY=wh_internal_key

# Then in Python:
from datablue import DataBlue

with DataBlue.from_env() as client:
    result = client.scrape("https://example.com")
    print(result.data.markdown)

Default URL: The SDK defaults to http://localhost:8000, which works out of the box with the Docker Compose development setup. For production deployments, always set the URL explicitly.

Complete API Reference (v2.0.0)

Method                                                    Description
scrape(url, **opts)                                       Scrape a single URL, returns ScrapeResult
crawl(url, **opts)                                        Crawl a site (blocking with polling), returns CrawlStatus
start_crawl(url, **opts)                                  Start crawl (non-blocking), returns CrawlJob
get_crawl_status(job_id)                                  Poll crawl status, returns CrawlStatus
cancel_crawl(job_id)                                      Cancel an in-progress crawl
crawl_stream(url, **opts)                                 Stream crawl pages via NDJSON, returns Iterator[CrawlPageData]
crawl_stream_with_callback(url, on_document=..., **opts)  Callback-based crawl streaming (on_document, on_complete, on_error)
search(query, **opts)                                     Search the web (blocking with polling), returns SearchStatus
start_search(query, **opts)                               Start search (non-blocking), returns SearchJob
get_search_status(job_id)                                 Poll search status, returns SearchStatus
map(url, **opts)                                          Discover URLs on a site, returns MapResult
batch_scrape(urls, **opts)                                Scrape multiple URLs, returns list[ScrapeResult]
batch_scrape_iter(urls, **opts)                           Async-only: stream batch results as they complete, returns AsyncIterator[ScrapeResult]
login(email, password)                                    Authenticate with email/password, stores JWT internally
close()                                                   Close the HTTP connection pool
from_env()                                                Class method: create client from DATABLUE_* env vars

AI-First Documentation Files

v2.0.0 ships with machine-readable reference files for AI coding assistants:

File       Location       Purpose
CLAUDE.md  sdk/CLAUDE.md  Complete SDK quick-reference: all method signatures, response models, error types, and patterns. Read automatically by Claude Code and other AI assistants.
llms.txt   sdk/llms.txt   Standardized machine-readable documentation following the llms.txt convention. Condensed API surface for LLM consumption.

Why AI-first? AI coding assistants hallucinate API calls when they lack accurate documentation. The CLAUDE.md file ensures AI assistants generate code using real method signatures, real parameter names, and real response types — no guessing.

POST /v1/scrape

Scrape

Scrape a single URL and return the content in your desired format. Uses a 5-tier parallel scraping engine with automatic strategy selection and domain-level strategy caching.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name               Type          Requirement  Description
url                string        REQUIRED     The URL to scrape. Protocol is auto-prepended if missing.
formats            string[]      optional     Output formats: "markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images".
only_main_content  boolean       optional     Extract only the main content, removing navs, footers, sidebars.
mobile             boolean       optional     Emulate a mobile device viewport.
mobile_device      string        optional     Device preset name: "iphone_14", "pixel_7", "ipad_pro".
timeout            number        optional     Request timeout in milliseconds.
wait_for           number        optional     Wait this many ms after page load before extracting.
css_selector       string        optional     Only extract content matching this CSS selector.
xpath              string        optional     XPath expression for targeted extraction.
include_tags       string[]      optional     Only include these HTML tags in extraction.
exclude_tags       string[]      optional     Exclude these HTML tags from extraction.
use_proxy          boolean       optional     Route request through a configured proxy for anti-bot bypass.
headers            object        optional     Custom HTTP headers to send (e.g. { "Cookie": "session=abc" }).
cookies            object        optional     Custom cookies to send as name/value pairs.
actions            ActionStep[]  optional     Browser actions to perform before extraction: click, wait, scroll, type, screenshot, hover, press, select, fill_form, evaluate.
extract            object        optional     LLM extraction config: { prompt: string, schema: JSONSchema }.
webhook_url        string        optional     Webhook URL for job completion notification.
webhook_secret     string        optional     HMAC secret for webhook signature verification.
capture_network    boolean       optional     Capture browser network requests/responses.

cURL Example

curl -X POST "https://api.datablue.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://news.ycombinator.com",
  "formats": [
    "markdown",
    "links"
  ],
  "only_main_content": true
}'
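For callers not using the SDK, the same request body can be assembled in Python and POSTed with any HTTP client. build_scrape_request below is an illustrative helper (not part of the SDK) that omits unset optional fields, matching the cURL example:

```python
import json

def build_scrape_request(url: str, formats=None, only_main_content=None) -> dict:
    """Assemble a /v1/scrape request body, including only the fields that were set."""
    body = {"url": url}
    if formats is not None:
        body["formats"] = formats
    if only_main_content is not None:
        body["only_main_content"] = only_main_content
    return body

body = build_scrape_request(
    "https://news.ycombinator.com",
    formats=["markdown", "links"],
    only_main_content=True,
)
# POST this JSON to https://api.datablue.dev/v1/scrape with an
# Authorization: Bearer <key> header, as in the cURL example.
print(json.dumps(body))
```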

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "data": {
    "markdown": "# Hacker News\n\n1. Show HN: I built an open-source web scraper with strategy caching\n2. Why Rust is eating the world...",
    "links": [
      "https://news.ycombinator.com/item?id=39912345",
      "https://news.ycombinator.com/item?id=39912346",
      "https://news.ycombinator.com/newest"
    ],
    "metadata": {
      "title": "Hacker News",
      "description": null,
      "language": "en",
      "source_url": "https://news.ycombinator.com",
      "status_code": 200,
      "word_count": 1847,
      "reading_time_seconds": 7,
      "content_length": 12340
    }
  }
}
POST /v1/crawl

Crawl

Start a crawl job that discovers and scrapes pages starting from a seed URL using BFS, DFS, or best-first strategy. Returns a job ID for tracking progress via SSE or polling.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name                  Type      Requirement  Description
url                   string    REQUIRED     The starting URL to crawl.
max_pages             number    optional     Maximum pages to crawl (1-1000).
max_depth             number    optional     Maximum link depth from the starting URL (1-10).
concurrency           number    optional     Number of parallel scrape workers (1-10).
crawl_strategy        string    optional     Strategy: "bfs" (breadth-first), "dfs" (depth-first), or "bff" (best-first).
allow_external_links  boolean   optional     Follow links to external domains.
respect_robots_txt    boolean   optional     Obey the target site's robots.txt rules.
include_paths         string[]  optional     Only crawl URLs matching these glob path patterns.
exclude_paths         string[]  optional     Skip URLs matching these glob path patterns.
scrape_options        object    optional     Options passed to each page scrape: { formats, only_main_content, wait_for, timeout, include_tags, exclude_tags, mobile, extract }.
use_proxy             boolean   optional     Route all requests through a configured proxy.
filter_faceted_urls   boolean   optional     Deduplicate faceted/navigation URL variations.
webhook_url           string    optional     Webhook URL for crawl completion notification.
webhook_secret        string    optional     HMAC secret for webhook signature verification.

cURL Example

curl -X POST "https://api.datablue.dev/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://docs.python.org/3/",
  "max_pages": 50,
  "max_depth": 2,
  "crawl_strategy": "bfs",
  "include_paths": [
    "/3/library/*"
  ]
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Crawl job started"
}
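Since the endpoint returns immediately with a job_id, callers track progress via SSE or by polling. A minimal polling sketch; the status-endpoint shape assumed here (a fetch of the job by ID returning a "status" field) is an assumption modeled on the Firecrawl-compatible API, written with an injected fetch function so the loop itself is self-contained:

```python
import time

def wait_for_crawl(job_id: str, fetch_status, poll_interval: float = 2.0,
                   timeout: float = 300.0) -> dict:
    """Poll fetch_status(job_id) until the job leaves its running state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)  # e.g. an HTTP GET of the crawl job by ID
        if status.get("status") in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")

# Stub fetcher standing in for the HTTP call: done on the second poll.
responses = iter([{"status": "started"}, {"status": "completed", "total": 50}])
final = wait_for_crawl("550e8400", fetch_status=lambda _id: next(responses),
                       poll_interval=0.0)
print(final["status"])  # completed
```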
POST /v1/map

Map

Discover all URLs on a website by combining sitemap.xml parsing, robots.txt discovery, and link crawling. Returns a flat list of URLs with metadata.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name                Type     Requirement  Description
url                 string   REQUIRED     The website URL to map.
search              string   optional     Filter URLs matching this search string.
limit               number   optional     Maximum number of URLs to return.
include_subdomains  boolean  optional     Include URLs from subdomains.
use_sitemap         boolean  optional     Parse sitemap.xml for URL discovery.

cURL Example

curl -X POST "https://api.datablue.dev/v1/map" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://docs.python.org",
  "limit": 200,
  "include_subdomains": false
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "total": 187,
  "links": [
    {
      "url": "https://docs.python.org/3/",
      "title": "Python 3 Documentation",
      "lastmod": "2026-03-28"
    },
    {
      "url": "https://docs.python.org/3/tutorial/index.html",
      "title": "The Python Tutorial"
    },
    {
      "url": "https://docs.python.org/3/library/index.html",
      "title": "The Python Standard Library"
    },
    {
      "url": "https://docs.python.org/3/reference/index.html",
      "title": "The Python Language Reference"
    }
  ]
}
POST /v1/extract

Extract

Extract structured data from web pages or raw content using LLM. Accepts URLs to scrape first, or raw markdown/HTML content directly. Returns typed JSON matching your schema.

Target Latency

1.2s - 4.5s

Credits

5 cr/req

Parameters

Name               Type      Requirement  Description
url                string    optional     Single URL to scrape then extract from.
urls               string[]  optional     Multiple URLs to scrape and extract from (async job).
content            string    optional     Raw markdown/text content to extract from (no scraping needed).
html               string    optional     Raw HTML to convert and extract from.
prompt             string    optional     Natural language extraction instruction (e.g. 'Extract all product names and prices').
schema             object    optional     JSON Schema for structured output. The LLM will return data matching this schema.
provider           string    optional     LLM provider: "openai", "anthropic", "groq", etc.
only_main_content  boolean   optional     Extract only main content before LLM processing.
wait_for           number    optional     Wait ms after page load (for URLs).
timeout            number    optional     Scrape timeout in ms (for URLs).
use_proxy          boolean   optional     Use proxy for scraping.
headers            object    optional     Custom HTTP headers.
cookies            object    optional     Custom cookies.
webhook_url        string    optional     Webhook URL for extraction completion notification.
webhook_secret     string    optional     HMAC secret for webhook signature verification.

cURL Example

curl -X POST "https://api.datablue.dev/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "url": "https://openai.com/pricing",
  "prompt": "Extract all pricing tiers with name, price per million tokens, and context window",
  "schema": {
    "type": "object",
    "properties": {
      "tiers": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string"
            },
            "input_price": {
              "type": "string"
            },
            "output_price": {
              "type": "string"
            },
            "context_window": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "data": {
    "url": "https://openai.com/pricing",
    "extract": {
      "tiers": [
        {
          "name": "GPT-4o",
          "input_price": "$2.50/1M",
          "output_price": "$10.00/1M",
          "context_window": "128K"
        },
        {
          "name": "GPT-4o mini",
          "input_price": "$0.15/1M",
          "output_price": "$0.60/1M",
          "context_window": "128K"
        },
        {
          "name": "GPT-4.1",
          "input_price": "$2.00/1M",
          "output_price": "$8.00/1M",
          "context_window": "1M"
        }
      ]
    },
    "content_length": 48230
  }
}
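Once the schema-shaped payload comes back, the extract field is ordinary JSON in the shape of your schema. A sketch walking the example response above:

```python
# Trimmed copy of the example /v1/extract response shown above
response = {
    "success": True,
    "data": {
        "url": "https://openai.com/pricing",
        "extract": {
            "tiers": [
                {"name": "GPT-4o", "input_price": "$2.50/1M"},
                {"name": "GPT-4o mini", "input_price": "$0.15/1M"},
            ]
        },
    },
}

names = []
if response["success"]:
    # data.extract matches the JSON Schema sent in the request
    tiers = response["data"]["extract"]["tiers"]
    names = [t["name"] for t in tiers]
print(names)  # ['GPT-4o', 'GPT-4o mini']
```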
POST /v1/data/google/maps

Google Maps

Search Google Maps for places, businesses, and points of interest. Supports search queries, coordinate-based nearby search, and single place detail lookups via place_id or CID. Returns full business data including reviews.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name             Type     Requirement  Description
query            string   optional     Search query (e.g. 'restaurants near Times Square'). Max 2048 characters.
coordinates      string   optional     GPS coordinates as 'lat,lng' (e.g. '40.7580,-73.9855').
radius           number   optional     Search radius in meters (100-50000). Default 5000.
zoom             number   optional     Map zoom level (1-21). Auto-calculated from radius if not set.
type             string   optional     Place type filter: restaurant, hotel, gas_station, hospital, cafe, bar, gym, pharmacy, bank, supermarket, park, museum, airport.
keyword          string   optional     Additional keyword filter (max 500 chars). E.g. 'vegetarian', 'rooftop'.
min_rating       number   optional     Minimum star rating filter (1.0-5.0).
open_now         boolean  optional     Only show places that are currently open.
price_level      number   optional     Price range filter: 1=$, 2=$$, 3=$$$, 4=$$$$.
sort_by          string   optional     Sort order: "relevance", "distance", "rating", "reviews".
num_results      number   optional     Number of places to return (1-200).
place_id         string   optional     Google Place ID for detailed single place lookup (e.g. 'ChIJN1t_tDeuEmsRUsoyG').
cid              string   optional     CID / Ludocid permanent business identifier.
data             string   optional     Google Maps data parameter (encoded place reference).
language         string   optional     Language code (hl parameter).
country          string   optional     Country code for geo-targeting (gl parameter).
include_reviews  boolean  optional     Include user reviews for each place.
reviews_limit    number   optional     Maximum reviews per place (1-20). Requires include_reviews=true.
reviews_sort     string   optional     Review sort: "most_relevant", "newest", "highest", "lowest".

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/google/maps" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "query": "best ramen restaurants",
  "coordinates": "40.7580,-73.9855",
  "radius": 3000,
  "num_results": 5,
  "min_rating": 4,
  "include_reviews": true,
  "reviews_limit": 3
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "query": "best ramen restaurants",
  "coordinates_used": "40.7580,-73.9855",
  "search_type": "search",
  "total_results": "5",
  "time_taken": 3.42,
  "filters_applied": {
    "min_rating": 4,
    "radius": 3000
  },
  "places": [
    {
      "position": 1,
      "title": "Ichiran Ramen",
      "place_id": "ChIJ4Y8RmkRYwokR5ntGn3BDXY8",
      "address": "132 W 31st St, New York, NY 10001",
      "gps_coordinates": {
        "latitude": 40.7487,
        "longitude": -73.9903
      },
      "url": "https://maps.google.com/?cid=10325091291252938854",
      "website": "https://www.ichiranusa.com",
      "phone": "+1 212-465-0701",
      "rating": 4.5,
      "reviews": 3847,
      "price": "$$",
      "price_level": 2,
      "type": "Ramen restaurant",
      "subtypes": [
        "Ramen restaurant",
        "Japanese restaurant",
        "Noodle shop"
      ],
      "open_state": "Open - Closes 2 AM",
      "open_now": true,
      "thumbnail": "https://lh5.googleusercontent.com/p/AF1QipN...",
      "user_reviews": [
        {
          "author_name": "Sarah Chen",
          "rating": 5,
          "text": "Best ramen in NYC. The solo booth concept is genius. Rich tonkotsu broth...",
          "relative_time": "2 weeks ago"
        },
        {
          "author_name": "Mike Johnson",
          "rating": 4,
          "text": "Great flavors, a bit pricey for what you get but the experience is unique.",
          "relative_time": "1 month ago"
        }
      ]
    }
  ],
  "related_searches": [
    {
      "query": "ramen near me"
    },
    {
      "query": "japanese restaurants midtown"
    }
  ]
}
POST /v1/data/google/news

Google News

Search Google News for articles. Supports time range filtering, language/country targeting, and relevance/date sorting. Returns article metadata including source, date, and thumbnail.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name         Type    Requirement  Description
query        string  REQUIRED     News search query (1-2048 characters).
num_results  number  optional     Number of articles to return (1-500).
language     string  optional     Language code (hl parameter).
country      string  optional     Country code for geo-targeting (gl parameter, e.g. us, uk, in).
time_range   string  optional     Time filter: "hour", "day", "week", "month", "year".
sort_by      string  optional     Sort order: "relevance", "date".

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/google/news" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "query": "artificial intelligence regulation",
  "num_results": 10,
  "time_range": "week",
  "language": "en",
  "country": "us"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "query": "artificial intelligence regulation",
  "total_results": "10",
  "time_taken": 2.14,
  "source_strategy": "searxng_google_news",
  "articles": [
    {
      "position": 1,
      "title": "EU AI Act Enforcement Begins: What Companies Need to Know",
      "url": "https://www.reuters.com/technology/eu-ai-act-enforcement-2026-04-01",
      "source": "Reuters",
      "source_url": "reuters.com",
      "date": "3 hours ago",
      "published_date": "2026-04-03T09:00:00Z",
      "snippet": "The European Union's AI Act enters its enforcement phase today, requiring all AI systems classified as high-risk to undergo compliance assessments...",
      "thumbnail": "https://static.reuters.com/image/ai-regulation.jpg"
    },
    {
      "position": 2,
      "title": "US Senate Proposes Comprehensive AI Safety Bill",
      "url": "https://www.washingtonpost.com/technology/2026/04/02/ai-safety-bill",
      "source": "The Washington Post",
      "source_url": "washingtonpost.com",
      "date": "1 day ago",
      "snippet": "Bipartisan legislation would create a federal AI licensing framework for frontier models..."
    }
  ],
  "related_searches": [
    {
      "query": "AI regulation news today"
    },
    {
      "query": "EU AI Act requirements"
    }
  ]
}
POST /v1/data/google/finance

Google Finance

Get financial market data from Google Finance. Omit query for a market overview (US, Europe, Asia, Crypto, Currencies, Futures + trends). Provide a ticker symbol (e.g. 'AAPL:NASDAQ') for a single stock quote with similar stocks and news.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name      Type    Requirement  Description
query     string  optional     Stock ticker (e.g. 'AAPL:NASDAQ', 'BTC-USD'). Omit for market overview. Max 100 characters.
language  string  optional     Language code (hl parameter).
country   string  optional     Country code for geo-targeting (gl parameter).

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/google/finance" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "query": "AAPL:NASDAQ"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.87,
  "stock": "AAPL:NASDAQ",
  "name": "Apple Inc",
  "price": "$198.50",
  "price_movement": {
    "percentage": "+1.24%",
    "value": "+$2.43",
    "movement": "up"
  },
  "currency": "USD",
  "previous_close": "$196.07",
  "after_hours_price": "$198.75",
  "after_hours_movement": {
    "percentage": "+0.13%",
    "value": "+$0.25",
    "movement": "up"
  },
  "similar_stocks": [
    {
      "stock": "MSFT:NASDAQ",
      "link": "https://www.google.com/finance/quote/MSFT:NASDAQ",
      "name": "Microsoft Corporation",
      "price": "$428.50",
      "price_movement": {
        "percentage": "+0.87%",
        "value": "+$3.70",
        "movement": "up"
      }
    },
    {
      "stock": "GOOGL:NASDAQ",
      "link": "https://www.google.com/finance/quote/GOOGL:NASDAQ",
      "name": "Alphabet Inc",
      "price": "$178.20",
      "price_movement": {
        "percentage": "-0.34%",
        "value": "-$0.61",
        "movement": "down"
      }
    }
  ],
  "news": [
    {
      "title": "Apple Reports Record Q2 Revenue Driven by Services Growth",
      "url": "https://www.cnbc.com/2026/04/02/apple-q2-earnings.html",
      "source": "CNBC",
      "snippet": "Apple beat Wall Street expectations with $97.4B in revenue...",
      "published_timestamp": 1743638400
    }
  ]
}
POST /v1/data/youtube/channel

YouTube Channel

Get detailed channel metadata including subscriber count, video count, verification status, avatar, banner, country, and keywords. Accepts @handle or UC... channel ID.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name        Type    Requirement  Description
identifier  string  REQUIRED     Channel ID (UC...), @handle, or username. 1-100 characters.

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/youtube/channel" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "identifier": "@mkbhd"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

System capacity exceeded.

500 SYSTEM FAILURE

Internal core exception.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 2.31,
  "channel": {
    "id": "UCBJycsmduvYEL83R_U4JriQ",
    "title": "MKBHD",
    "description": "Quality Tech Videos | Updated weekly.",
    "subscriber_count": 19800000,
    "video_count": 1847,
    "verified": true,
    "avatar_url": "https://yt3.googleusercontent.com/lkBjk_ZQ...",
    "banner_url": "https://yt3.googleusercontent.com/banner_ZQ...",
    "channel_url": "https://www.youtube.com/@mkbhd",
    "country": "US",
    "keywords": "tech reviews gadgets smartphones MKBHD Marques Brownlee"
  }
}
POST /v1/data/youtube/videos

YouTube Videos

Get a channel's recent videos with view counts, duration, publish dates, and thumbnails. Accepts @handle or UC... channel ID.

Target Latency

1.2s - 4.5s

Credits

2 cr/req

Parameters

Name        Type    Requirement  Description
identifier  string  REQUIRED     Channel ID (UC...), @handle, or username. 1-100 characters.
limit       number  optional     Number of videos to return (1-50).

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/youtube/videos" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "identifier": "@mkbhd",
  "limit": 5
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 3.15,
  "channel": "MKBHD",
  "channel_id": "UCBJycsmduvYEL83R_U4JriQ",
  "count": 5,
  "videos": [
    {
      "video_id": "dQw4w9WgXcQ",
      "title": "The BEST Smartphones of 2026!",
      "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
      "views": 4200000,
      "views_text": "4.2M views",
      "published": "2 weeks ago",
      "duration": "22:14",
      "duration_seconds": 1334,
      "thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg",
      "description": "Ranking the top smartphones I've tested this year..."
    },
    {
      "video_id": "abc123def456",
      "title": "Galaxy S26 Ultra Review: The Camera King?",
      "url": "https://www.youtube.com/watch?v=abc123def456",
      "views": 2800000,
      "views_text": "2.8M views",
      "published": "3 weeks ago",
      "duration": "18:07",
      "duration_seconds": 1087,
      "thumbnail": "https://i.ytimg.com/vi/abc123def456/maxresdefault.jpg"
    }
  ]
}
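The `duration_seconds` field mirrors the human-readable `duration` string. If you only have the string form, the seconds can be recomputed client-side (a small sketch, not part of the API):

```python
def duration_to_seconds(duration: str) -> int:
    """Convert a duration string like "22:14" or "1:02:05" to total seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(duration_to_seconds("22:14"))  # → 1334, matching duration_seconds above
print(duration_to_seconds("18:07"))  # → 1087
```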
POST /v1/data/youtube/video

YouTube Video Detail

Get detailed metadata for a single YouTube video including view count, likes, comment count, duration, keywords, live status, and full description.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name Type Requirement Description
video_id string REQUIRED YouTube video ID (e.g. 'dQw4w9WgXcQ'). 1-20 characters.

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/youtube/video" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "video_id": "dQw4w9WgXcQ"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.94,
  "video": {
    "id": "dQw4w9WgXcQ",
    "title": "Rick Astley - Never Gonna Give You Up (Official Music Video)",
    "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "description": "The official video for \"Never Gonna Give You Up\" by Rick Astley...",
    "channel": "Rick Astley",
    "channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
    "views": 1500000000,
    "likes": 16000000,
    "comment_count": 3200000,
    "duration_seconds": 212,
    "published": "Oct 25, 2009",
    "is_live": false,
    "keywords": [
      "rick astley",
      "never gonna give you up",
      "rickroll",
      "music video",
      "80s",
      "pop"
    ],
    "thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg"
  }
}
POST /v1/data/youtube/comments

YouTube Comments

Get comments for a YouTube video including author info, verification status, likes, reply count, pinned status, and creator heart status.

Target Latency

1.2s - 4.5s

Credits

2 cr/req

Parameters

Name Type Requirement Description
video_id string REQUIRED YouTube video ID. 1-20 characters.
limit number optional Maximum comments to return (1-100).

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/youtube/comments" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "video_id": "dQw4w9WgXcQ",
  "limit": 5
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 2.56,
  "video_id": "dQw4w9WgXcQ",
  "count": 5,
  "comments": [
    {
      "author": "Rick Astley",
      "author_channel_id": "UCuAXFkgsw1L7xaCfnd5JJOw",
      "author_avatar": "https://yt3.ggpht.com/ytc/AIdro_n...",
      "is_verified": true,
      "is_creator": true,
      "text": "Thank you for 1.5 billion views! Never gonna let you down.",
      "likes": 892000,
      "reply_count": 47200,
      "published": "3 months ago",
      "pinned": true,
      "hearted": false
    },
    {
      "author": "Internet Historian",
      "author_channel_id": "UCR1D15p_vY6mVRqY5eXSA",
      "author_avatar": "https://yt3.ggpht.com/ytc/another...",
      "is_verified": true,
      "is_creator": false,
      "text": "We've been rickrolling for almost 20 years and it still never gets old.",
      "likes": 234000,
      "reply_count": 1200,
      "published": "6 months ago",
      "pinned": false,
      "hearted": true
    }
  ]
}
POST /v1/data/twitter/profile

Twitter Profile

Get a Twitter/X user profile including follower/following counts, tweet count, verification status, bio, location, and profile images.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name Type Requirement Description
username string REQUIRED Twitter/X username without @ prefix. 1-50 characters.

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/twitter/profile" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "username": "naval"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.82,
  "user": {
    "id": "745273",
    "username": "naval",
    "name": "Naval",
    "description": "Angel investor, founder of AngelList. Building the future.",
    "followers_count": 2100000,
    "following_count": 1847,
    "tweet_count": 18400,
    "like_count": 42000,
    "listed_count": 12500,
    "verified": true,
    "created_at": "2006-12-19T00:00:00Z",
    "profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg",
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/745273/1600x600.jpg",
    "location": "San Francisco, CA",
    "url": "https://nav.al"
  }
}
POST /v1/data/twitter/tweets

Twitter Tweets

Get recent tweets from a Twitter/X user timeline including engagement metrics, media attachments, hashtags, and URLs.

Target Latency

1.2s - 4.5s

Credits

2 cr/req

Parameters

Name Type Requirement Description
username string REQUIRED Twitter/X username without @ prefix. 1-50 characters.
limit number optional Number of tweets to return (1-100).

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/twitter/tweets" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "username": "naval",
  "limit": 5
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 3.47,
  "username": "naval",
  "count": 5,
  "tweets": [
    {
      "id": "1907654321098765432",
      "text": "Specific knowledge is found by pursuing your genuine curiosity rather than whatever is hot right now.",
      "created_at": "2026-04-02T14:30:00Z",
      "likes": 24000,
      "retweets": 4800,
      "replies": 320,
      "quotes": 180,
      "views": 1200000,
      "lang": "en",
      "url": "https://x.com/naval/status/1907654321098765432",
      "author": {
        "username": "naval",
        "name": "Naval",
        "verified": true,
        "profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg"
      },
      "media": [],
      "hashtags": [],
      "urls": []
    },
    {
      "id": "1907543210987654321",
      "text": "The most important skill for getting rich is becoming a perpetual learner.",
      "created_at": "2026-04-01T18:15:00Z",
      "likes": 18500,
      "retweets": 3200,
      "replies": 210,
      "quotes": 95,
      "views": 890000,
      "lang": "en",
      "url": "https://x.com/naval/status/1907543210987654321",
      "author": {
        "username": "naval",
        "name": "Naval",
        "verified": true
      },
      "media": [],
      "hashtags": [],
      "urls": []
    }
  ]
}
POST /v1/data/twitter/tweet

Twitter Tweet Detail

Get detailed data for a single tweet by ID including full engagement metrics, media, author info, and embedded URLs.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name Type Requirement Description
tweet_id string REQUIRED Tweet ID (numeric string).

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/twitter/tweet" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "tweet_id": "1907654321098765432"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.65,
  "tweet": {
    "id": "1907654321098765432",
    "text": "Specific knowledge is found by pursuing your genuine curiosity rather than whatever is hot right now.",
    "created_at": "2026-04-02T14:30:00Z",
    "likes": 24000,
    "retweets": 4800,
    "replies": 320,
    "quotes": 180,
    "views": 1200000,
    "lang": "en",
    "url": "https://x.com/naval/status/1907654321098765432",
    "author": {
      "username": "naval",
      "name": "Naval",
      "verified": true,
      "profile_image_url": "https://pbs.twimg.com/profile_images/naval_400x400.jpg"
    },
    "media": [],
    "hashtags": [],
    "urls": []
  }
}
POST /v1/data/reddit/subreddit

Reddit Subreddit

Get subreddit metadata including subscriber count, active users, description, icon, banner, and community type.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name Type Requirement Description
name string REQUIRED Subreddit name (e.g. 'python' or 'r/python'). 1-100 characters.

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/reddit/subreddit" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "name": "python"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.45,
  "subreddit": {
    "name": "python",
    "title": "Python",
    "description": "News about the programming language Python.",
    "long_description": "Welcome to r/Python! This is a community for all things Python - discussion, tutorials, projects, and news about the programming language.",
    "subscribers": 1340000,
    "active_users": 4200,
    "created": "2008-01-25T00:00:00Z",
    "icon": "https://styles.redditmedia.com/t5_2qh0y/python_icon.png",
    "banner": "https://styles.redditmedia.com/t5_2qh0y/python_banner.png",
    "over_18": false,
    "type": "public",
    "url": "https://www.reddit.com/r/python/"
  }
}
POST /v1/data/reddit/posts

Reddit Posts

Get posts from a subreddit with sorting (hot, new, top, rising, controversial) and time filtering. Returns full post metadata including score, comments, media, flair, and awards.

Target Latency

1.2s - 4.5s

Credits

2 cr/req

Parameters

Name Type Requirement Description
subreddit string REQUIRED Subreddit name (e.g. 'python' or 'r/python'). 1-100 characters.
sort string optional Sort order: "hot", "new", "top", "rising", "controversial".
limit number optional Number of posts to return (1-100).
time_filter string optional Time filter for top/controversial: "hour", "day", "week", "month", "year", "all".

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/reddit/posts" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "subreddit": "python",
  "sort": "top",
  "limit": 5,
  "time_filter": "week"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 2.18,
  "subreddit": "python",
  "sort": "top",
  "count": 5,
  "posts": [
    {
      "id": "1b2c3d4",
      "title": "I wrote a Python library that makes web scraping 10x easier",
      "author": "scraperdev",
      "subreddit": "python",
      "score": 2847,
      "upvote_ratio": 0.96,
      "num_comments": 342,
      "url": "https://github.com/scraperdev/easyscrape",
      "permalink": "/r/python/comments/1b2c3d4/i_wrote_a_python_library/",
      "selftext": null,
      "created": "2026-03-30T15:30:00Z",
      "thumbnail": "https://b.thumbs.redditmedia.com/abc123.jpg",
      "is_video": false,
      "is_self": false,
      "over_18": false,
      "stickied": false,
      "flair": "Library",
      "awards": 5
    },
    {
      "id": "4e5f6g7",
      "title": "Python 3.14 Released - What's New",
      "author": "python_dev",
      "subreddit": "python",
      "score": 1923,
      "upvote_ratio": 0.98,
      "num_comments": 187,
      "url": "https://www.reddit.com/r/python/comments/4e5f6g7/python_314_released/",
      "permalink": "/r/python/comments/4e5f6g7/python_314_released/",
      "selftext": "The latest Python release brings several exciting features including...",
      "created": "2026-03-28T12:00:00Z",
      "is_video": false,
      "is_self": true,
      "over_18": false,
      "stickied": false,
      "flair": "News",
      "awards": 3
    }
  ]
}
POST /v1/data/reddit/user

Reddit User

Get a Reddit user's profile including karma breakdown, account age, gold status, verification, and employee status.

Target Latency

1.2s - 4.5s

Credits

1 cr/req

Parameters

Name Type Requirement Description
username string REQUIRED Reddit username (e.g. 'spez' or 'u/spez'). 1-100 characters.

cURL Example

curl -X POST "https://api.datablue.dev/v1/data/reddit/user" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "username": "spez"
}'

System Responses

200 OK

Request processed successfully.

401 UNAUTHORIZED

Missing or invalid API key.

429 RATE LIMIT

Rate limit exceeded for your plan.

500 SYSTEM FAILURE

Internal server error.

EXAMPLE RESPONSE
{
  "success": true,
  "time_taken": 1.32,
  "user": {
    "name": "spez",
    "display_name": "spez",
    "description": "CEO, Reddit",
    "total_karma": 1250000,
    "link_karma": 450000,
    "comment_karma": 800000,
    "created": "2005-06-06T00:00:00Z",
    "icon": "https://styles.redditmedia.com/t5_6/spez_avatar.png",
    "is_gold": true,
    "verified": true,
    "has_verified_email": true,
    "is_employee": true,
    "url": "https://www.reddit.com/user/spez/"
  }
}
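Since the Reddit endpoints accept names with or without their r/ or u/ prefixes, a client may normalize either form before sending (a sketch of one possible client-side convenience; the server performs its own normalization):

```python
def normalize_reddit_name(name: str) -> str:
    """Strip an optional leading r/ or u/ (or /r/, /u/) prefix from a Reddit name."""
    for prefix in ("r/", "u/", "/r/", "/u/"):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name

print(normalize_reddit_name("r/python"))  # → python
print(normalize_reddit_name("u/spez"))    # → spez
print(normalize_reddit_name("python"))    # → python
```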

Rate Limits

Rate limits are enforced per API key or JWT token. Limits vary by plan tier. When exceeded, the API returns 429 Too Many Requests.

Plan Rate Limit Monthly Credits Max Pages/Crawl Concurrency
Free 10 req/min 500 100 2
Starter 60 req/min 3,000 500 5
Plus 120 req/min 15,000 1,000 10
Pro 200 req/min 50,000 1,000 10
Growth 300 req/min 200,000 1,000 10
Scale 500 req/min 1,000,000 1,000 10

Rate Limit Headers

Every API response includes these headers:

Header Description
X-RateLimit-Limit Maximum requests per minute for your plan
X-RateLimit-Remaining Remaining requests in the current window
X-RateLimit-Reset Unix timestamp when the rate limit window resets

429 Response Example

{
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 10 requests per minute. Please wait and try again.",
  "retry_after": 42
}
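A client can honor the `retry_after` hint with a simple retry loop. The sketch below takes a generic zero-argument request callable so the backoff logic stays independent of any particular HTTP library (the function name and the 5-second fallback are assumptions, not part of the API):

```python
import time

def call_with_rate_limit_retry(send, max_attempts: int = 3):
    """Call send() and retry on 429, sleeping for the server-provided retry_after.

    send() is any zero-argument callable returning (status_code, json_body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Prefer the explicit retry_after hint; fall back to a short default.
        time.sleep(body.get("retry_after", 5))
    return status, body

# Simulated responses: one 429, then a success.
responses = iter([
    (429, {"error": "Rate limit exceeded", "retry_after": 0}),
    (200, {"success": True}),
])
status, body = call_with_rate_limit_retry(lambda: next(responses))
print(status)  # → 200
```

The `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers can be used the same way to pace requests proactively instead of reacting to 429s.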

Error Codes

DataBlue uses standard HTTP status codes. All error responses include a JSON body with error and message fields.

Code Name Description
400 Bad Request Invalid request body, missing required fields, or validation error (e.g. URL format, parameter constraints).
401 Unauthorized Missing, invalid, or expired authentication token. Re-authenticate or generate a new API key.
403 Forbidden Authenticated but insufficient permissions. Your plan may not include access to this endpoint.
404 Not Found The requested resource (job ID, monitor, schedule) does not exist.
429 Rate Limited You have exceeded your plan's rate limit. Wait for the reset window and retry. See the retry_after field.
500 Internal Error Server-side error. This may indicate a bug or infrastructure issue. Retry with exponential backoff.
503 Service Unavailable Endpoint is temporarily disabled or under maintenance. Data APIs may return this if a scraper is being updated.

Error Response Format

// 400 Bad Request
{
  "success": false,
  "error": "Bad Request",
  "message": "Field 'url' is required"
}

// 401 Unauthorized
{
  "success": false,
  "error": "Unauthorized",
  "message": "Invalid or expired authentication token"
}

// 429 Rate Limited
{
  "success": false,
  "error": "Rate limit exceeded",
  "message": "You have exceeded the rate limit of 60 requests per minute",
  "retry_after": 23
}

// 500 Internal Error
{
  "success": false,
  "error": "Internal Server Error",
  "message": "An unexpected error occurred. Request ID: req_a1b2c3d4"
}

Scrape-specific error codes: Scrape responses may include an error_code field with values like BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, or NETWORK_ERROR for more granular error classification.

Webhooks

Webhooks allow you to receive real-time notifications when jobs complete instead of polling. Pass webhook_url and optionally webhook_secret when starting any async job (crawl, search, extract).

Webhook Events

Event Description
scrape.completed A scrape job has finished (success or failure)
crawl.completed A crawl job has finished all pages
crawl.page A single page within a crawl has been scraped
search.completed A search job has finished all results
extract.completed An extraction job has finished

Payload Format

POST https://your-server.com/webhook

Headers:
  Content-Type: application/json
  X-Webhook-Signature: sha256=a1b2c3d4e5f6...
  X-Webhook-Event: crawl.completed

Body:
{
  "event": "crawl.completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "timestamp": "2026-04-03T12:00:00Z",
  "data": {
    "total_pages": 47,
    "completed_pages": 47,
    "url": "https://docs.python.org"
  }
}

Signature Verification

If you provide a webhook_secret, DataBlue signs each payload with HMAC-SHA256. The signature is sent in the X-Webhook-Signature header.

import hmac
import hashlib

from fastapi import FastAPI, Request, HTTPException

app = FastAPI()

def verify_webhook(payload: bytes, signature: str, secret: str) -> bool:
    """Verify the HMAC-SHA256 signature from DataBlue webhooks."""
    expected = "sha256=" + hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

# Usage in a FastAPI handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    signature = request.headers.get("X-Webhook-Signature", "")
    if not verify_webhook(body, signature, "your_webhook_secret"):
        raise HTTPException(status_code=401, detail="Invalid signature")
    data = await request.json()
    print(f"Event: {data['event']}, Job: {data['job_id']}")

Retry Policy

  • 3 attempts total (1 initial + 2 retries)
  • Exponential backoff: 10s, 60s, 300s between retries
  • Retries triggered on: connection errors, 5xx responses, timeouts
  • Successful delivery requires a 2xx response within 30 seconds
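Because failed deliveries are retried, the same event can arrive more than once, so handlers should be idempotent. One possible approach (a sketch; in production the seen-set would live in a database or cache, not process memory) is to deduplicate on the (event, job_id) pair:

```python
processed: set[tuple[str, str]] = set()

def handle_event(payload: dict) -> bool:
    """Process a webhook payload once; return False for duplicate deliveries."""
    key = (payload["event"], payload["job_id"])
    if key in processed:
        return False  # already handled on an earlier delivery attempt
    processed.add(key)
    # ... real processing goes here ...
    return True

event = {"event": "crawl.completed", "job_id": "550e8400-e29b-41d4-a716-446655440000"}
print(handle_event(event))  # → True (first delivery)
print(handle_event(event))  # → False (retry, ignored)
```

Returning a 2xx quickly for duplicates also stops DataBlue's retry loop for that delivery.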

Plans & Credits

Every API call consumes credits. The cost varies by endpoint complexity. Credits reset monthly based on your plan.

Credit Costs by Endpoint

Endpoint Credits Notes
Scrape 1 Per URL
Crawl 1 Per page crawled
Search 1 Per search query
Map 1 Per map request
Extract 5 Uses LLM processing
Google Data APIs
Google Search 1 SERP results
Google Maps 1 Places search or detail
Google News 1 News articles
Google Finance 1 Market data or quote
YouTube Data APIs
YouTube Channel 1 Channel metadata
YouTube Videos 2 Channel video listing
YouTube Search 2 Video search
YouTube Video 1 Single video detail
YouTube Comments 2 Video comments
Twitter Data APIs
Twitter Profile 1 User profile
Twitter Tweets 2 User timeline
Twitter Tweet 1 Single tweet detail
Twitter Search 2 Tweet search
Reddit Data APIs
Reddit Subreddit 1 Subreddit metadata
Reddit Posts 2 Subreddit post listing
Reddit Search 2 Post search
Reddit User 1 User profile
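The per-endpoint costs above can be encoded as a lookup table to estimate monthly usage before committing to a plan. A sketch (costs copied from the table; the key names are illustrative, not official identifiers):

```python
CREDIT_COSTS = {
    "scrape": 1, "crawl_page": 1, "search": 1, "map": 1, "extract": 5,
    "google/search": 1, "google/maps": 1, "google/news": 1, "google/finance": 1,
    "youtube/channel": 1, "youtube/videos": 2, "youtube/search": 2,
    "youtube/video": 1, "youtube/comments": 2,
    "twitter/profile": 1, "twitter/tweets": 2, "twitter/tweet": 1, "twitter/search": 2,
    "reddit/subreddit": 1, "reddit/posts": 2, "reddit/search": 2, "reddit/user": 1,
}

def estimate_credits(usage: dict[str, int]) -> int:
    """Total credits for a map of endpoint -> monthly request count."""
    return sum(CREDIT_COSTS[endpoint] * count for endpoint, count in usage.items())

# 100 scrapes, 20 extractions, 50 YouTube video listings:
print(estimate_credits({"scrape": 100, "extract": 20, "youtube/videos": 50}))
# → 300, which fits comfortably in the Free plan's 500 monthly credits
```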

Plan Comparison

Feature Free Starter Plus Pro Growth Scale
Monthly Credits 500 3,000 15,000 50,000 200,000 1,000,000
Rate Limit 10/min 60/min 120/min 200/min 300/min 500/min
API Keys 1 3 5 10 25 Unlimited
Monitors 0 2 10 25 100 Unlimited
Schedules 0 2 10 25 100 Unlimited