Scrape

Scrape a single URL and return structured content. The scrape() method is synchronous: it blocks until the scrape finishes and returns the result directly. The scraping engine runs a 5-tier parallel race (HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass), and the first valid result wins.
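To make the "first valid result wins" idea concrete, here is a minimal sketch of a parallel race using asyncio. It is not DataBlue's engine code: the two tier functions and the race() helper are hypothetical stand-ins, and "valid" is assumed to mean "returned something other than None".

```python
import asyncio
from typing import Awaitable, Callable, Optional

Tier = Callable[[str], Awaitable[Optional[str]]]

# Hypothetical tiers: a real tier would fetch the page over HTTP, a browser, etc.
async def fast_but_blocked(url: str) -> Optional[str]:
    await asyncio.sleep(0.01)
    return None  # finished first, but produced no valid content

async def slow_but_valid(url: str) -> Optional[str]:
    await asyncio.sleep(0.05)
    return f"<html>content of {url}</html>"

async def race(url: str, tiers: list[Tier]) -> Optional[str]:
    """Start every tier at once and return the first non-None result."""
    tasks = [asyncio.create_task(tier(url)) for tier in tiers]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # one failing tier does not end the race
            if result is not None:
                return result
        return None  # every tier finished without a valid result
    finally:
        for task in tasks:
            task.cancel()  # stop tiers that are still running

winner = asyncio.run(race("https://example.com", [fast_but_blocked, slow_but_valid]))
print(winner)
```

The faster tier finishes first here but is skipped because its result is not valid, so the slower tier wins.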

Basic Usage

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.success)          # True
    print(result.data.markdown)    # "# Example Domain\n\nThis domain is..."
    print(result.data.metadata)    # PageMetadata(title="Example Domain", ...)

With More Options

result = client.scrape(
    "https://news.ycombinator.com",
    formats=["markdown", "html", "links", "screenshot"],
    only_main_content=True,
    wait_for=2000,                          # wait 2s after page load
    timeout=45000,                          # 45s timeout
    include_tags=["article", "main"],       # only these HTML tags
    exclude_tags=["nav", "footer"],         # remove these tags
    headers={"Accept-Language": "en-US"},
    cookies={"session": "abc123"},
    mobile=True,
    mobile_device="iphone_14",
    css_selector="main.content",            # target specific element
    use_proxy=True,
    capture_network=True,
)

print(result.data.markdown)
print(result.data.html)
print(result.data.links)
print(result.data.screenshot)       # base64-encoded PNG
print(result.data.metadata.title)
print(result.data.metadata.word_count)

Saving Screenshots

Screenshots are returned as base64-encoded PNG strings. Save them to a file, embed them in HTML, or store them as a string field:

import base64

result = client.scrape("https://example.com", formats=["screenshot"])

# Save to file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.screenshot))

# Embed in HTML
html_img = f'<img src="data:image/png;base64,{result.data.screenshot}" />'

# Store in a database as a string field (`db` is a placeholder for your own storage layer)
db.save(url=result.data.metadata.source_url, screenshot=result.data.screenshot)
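Since screenshot is typed as str | None in the response model, it may help to check the value before writing it anywhere. The sketch below decodes the string and verifies the standard 8-byte PNG signature; decode_screenshot is a hypothetical helper, not part of the SDK.

```python
import base64
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file

def decode_screenshot(screenshot_b64: Optional[str]) -> Optional[bytes]:
    """Return the decoded PNG bytes, or None if the value is missing or not a PNG."""
    if not screenshot_b64:
        return None
    try:
        # validate=True rejects characters outside the base64 alphabet
        raw = base64.b64decode(screenshot_b64, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    if not raw.startswith(PNG_SIGNATURE):
        return None
    return raw

# Demo with a stand-in payload (signature plus filler, not a real image)
sample = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")
print(decode_screenshot(sample) is not None)
```

With a real result, the returned bytes can be written out with pathlib.Path("screenshot.png").write_bytes(...).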

Async

import asyncio

from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        result = await client.scrape(
            "https://example.com",
            formats=["markdown", "links"],
            only_main_content=True,
        )
        print(result.data.markdown)

asyncio.run(main())
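A common reason to use the async client is scraping several URLs at once. The sketch below shows one way to cap concurrency with a semaphore. fake_scrape is a hypothetical stand-in for `await client.scrape(url)` so the example runs without an API key; whether a single AsyncDataBlue client may be shared across concurrent calls is an assumption to verify.

```python
import asyncio

async def fake_scrape(url: str) -> str:
    """Hypothetical stand-in for `await client.scrape(url)`."""
    await asyncio.sleep(0.01)
    return f"scraped:{url}"

async def scrape_many(urls: list[str], max_concurrency: int = 3) -> list[str]:
    """Scrape all URLs with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fake_scrape(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))

urls = [f"https://example.com/page/{i}" for i in range(5)]
results = asyncio.run(scrape_many(urls))
print(results)
```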

Browser Actions

Execute browser actions before content extraction. Useful for pages that require interaction (clicking buttons, filling forms, scrolling to load content).

result = client.scrape(
    "https://example.com/infinite-scroll",
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down", "amount": 3},
        {"type": "screenshot"},
    ],
)

Action Types Reference

Action Type	Required Fields	Optional Fields	Description
`click`	selector	button, click_count, modifiers	Click an element
`wait`	—	milliseconds	Wait for duration
`scroll`	—	direction (up/down), amount	Scroll the page
`type`	selector, text	—	Type text into input
`screenshot`	—	—	Capture screenshot
`hover`	selector	—	Hover over element
`press`	key	modifiers	Press keyboard key
`select`	selector, value	—	Select dropdown option
`fill_form`	fields	—	Fill multiple form fields
`evaluate`	script	—	Run JavaScript
`go_back`	—	—	Navigate back
`go_forward`	—	—	Navigate forward
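The required fields in the table above can be checked locally before sending a request. This is a minimal sketch: validate_actions is a hypothetical helper, not an SDK function, and it checks only field presence, not value types.

```python
# Required fields per action type, copied from the reference table
REQUIRED_FIELDS = {
    "click": {"selector"},
    "wait": set(),
    "scroll": set(),
    "type": {"selector", "text"},
    "screenshot": set(),
    "hover": {"selector"},
    "press": {"key"},
    "select": {"selector", "value"},
    "fill_form": {"fields"},
    "evaluate": {"script"},
    "go_back": set(),
    "go_forward": set(),
}

def validate_actions(actions: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_FIELDS:
            problems.append(f"action {index}: unknown type {kind!r}")
            continue
        missing = sorted(REQUIRED_FIELDS[kind] - set(action))
        if missing:
            problems.append(f"action {index} ({kind}): missing {', '.join(missing)}")
    return problems

print(validate_actions([
    {"type": "click", "selector": "button.load-more"},
    {"type": "type", "selector": "#q"},  # "text" is missing
]))
```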

LLM Extraction

Extract structured data from a page using an LLM. Pass a prompt and, optionally, a JSON Schema describing the desired output in the extract parameter.

result = client.scrape(
    "https://openai.com/pricing",
    extract={
        "prompt": "Extract all pricing tiers with name and price",
        "schema": {
            "type": "object",
            "properties": {
                "tiers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                },
            },
        },
    },
)
print(result.data.extract)  # {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, ...]}
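The schema above declares no required fields, and extract is typed as dict | None, so it can be worth reading the result defensively. The sketch below does that for the pricing example. pricing_tiers is a hypothetical helper, and whether the service enforces the schema on its side is not stated here.

```python
from typing import Any, Optional

def pricing_tiers(extract: Optional[dict[str, Any]]) -> list[tuple[str, str]]:
    """Return (name, price) pairs, skipping entries that are malformed."""
    if not isinstance(extract, dict):
        return []
    tiers = extract.get("tiers")
    if not isinstance(tiers, list):
        return []
    pairs = []
    for tier in tiers:
        if not isinstance(tier, dict):
            continue
        name, price = tier.get("name"), tier.get("price")
        if isinstance(name, str) and isinstance(price, str):
            pairs.append((name, price))
    return pairs

# Stand-in for result.data.extract; the second entry lacks a price
sample_extract = {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, {"name": "Incomplete"}]}
print(pricing_tiers(sample_extract))
```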

Parameters

Parameter	Type	Default	Description
`url`	str	required	The URL to scrape
`formats`	list[Literal["markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images"]]	None	Output formats to return. Defaults to ["markdown"] if not specified.
`only_main_content`	bool	True	Extract only main content, removing nav/footer/sidebar
`wait_for`	int	0	Milliseconds to wait after page load before extracting
`timeout`	int	30000	Request timeout in milliseconds
`include_tags`	list[str]	None	Only include these HTML tags in extraction
`exclude_tags`	list[str]	None	Exclude these HTML tags from extraction
`headers`	dict[str, str]	None	Custom HTTP headers to send with the request
`cookies`	dict[str, str]	None	Custom cookies to send as name/value pairs
`mobile`	bool	False	Emulate a mobile device viewport
`mobile_device`	str	None	Device preset: "iphone_14", "pixel_7", "ipad_pro"
`css_selector`	str	None	Only extract content matching this CSS selector
`xpath`	str	None	XPath expression for targeted extraction
`selectors`	dict[str, Any]	None	Named CSS selectors for multi-element extraction
`actions`	list[dict] | list[ActionStep]	None	Browser actions to execute before scraping. Accepts raw dicts or typed `ActionStep` model instances.
`extract`	dict | ExtractConfig	None	LLM extraction config. Accepts a dict with "prompt" and "schema" keys, or a typed `ExtractConfig` model.
`capture_network`	bool	False	Capture browser network requests. Results appear in `result.data.network_data` as a dict with request/response details.
`use_proxy`	bool	False	Route through configured proxy for anti-bot bypass
`webhook_url`	str	None	URL to receive webhook on completion
`webhook_secret`	str	None	HMAC secret for webhook signature verification
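On the receiving side, webhook_secret lets you check that a webhook really came from the service. The signature format is not documented in this section, so the sketch below assumes a hex-encoded HMAC-SHA256 over the raw request body. Confirm the actual algorithm, encoding, and header name before relying on it. verify_signature is a hypothetical helper.

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Check a hex HMAC-SHA256 signature (assumed format) against the raw body."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids timing side channels
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))

# Demo: sign a payload the way the sender is assumed to, then verify it
secret = "my_webhook_secret"
body = b'{"success": true}'
signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, signature))
```

Verify against the raw bytes of the request body rather than a re-serialized JSON object, since re-serialization can change whitespace or key order.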

Response Model

scrape() returns a ScrapeResult, a Pydantic model with the following fields:

class ScrapeResult:
    success: bool                    # Whether the scrape request succeeded
    data: PageData | None            # Scraped page content including markdown, html, links, and metadata
    error: str | None                # Human-readable error message if the scrape failed
    error_code: str | None           # Machine-readable error code (BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, NETWORK_ERROR)
    job_id: str | None               # Unique job identifier for async scrape requests

class PageData:
    url: str | None                  # The URL of the scraped page
    markdown: str | None             # Clean Markdown conversion of the page content
    fit_markdown: str | None         # Markdown trimmed to fit LLM context windows
    html: str | None                 # Cleaned HTML content with boilerplate removed
    raw_html: str | None             # Original unmodified HTML source of the page
    links: list[str] | None          # List of URLs found on the page
    links_detail: list[dict] | None  # Detailed link info including anchor text and attributes
    screenshot: str | None           # Base64-encoded PNG screenshot of the page
    structured_data: dict | None     # JSON-LD and microdata extracted from the page
    headings: list[dict] | None      # List of headings with level and text (e.g. [{"level": 1, "text": "Title"}])
    images: list[dict] | None        # List of images with src, alt, and dimensions
    extract: dict | None             # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None     # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None         # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None    # Page metadata including title, status_code, word_count, and SEO tags

class PageMetadata:
    title: str | None                # Page title from <title> tag
    description: str | None          # Meta description from <meta name='description'>
    language: str | None             # Page language code (e.g. 'en', 'fr', 'de')
    source_url: str | None           # The URL that was actually scraped (after redirects)
    status_code: int | None          # HTTP response status code (200, 404, 500, etc.)
    word_count: int                  # Number of words in the main content
    reading_time_seconds: int        # Estimated reading time in seconds
    content_length: int              # Response body size in bytes
    og_image: str | None             # OpenGraph image URL for social sharing previews
    canonical_url: str | None        # Canonical URL from <link rel='canonical'>
    favicon: str | None              # Favicon URL
    robots: str | None               # Robots meta tag content (e.g. 'noindex, nofollow')
    response_headers: dict | None    # HTTP response headers as key-value pairs
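Because failures are reported through success, error, and error_code, callers can branch on the code to decide what to do next. The sketch below is one possible policy, not documented behavior: which codes are worth retrying, and retrying BLOCKED_BY_WAF with use_proxy=True, are assumptions. FakeResult is a stand-in for ScrapeResult so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeResult:
    """Stand-in exposing the ScrapeResult fields used below."""
    success: bool
    error: Optional[str] = None
    error_code: Optional[str] = None

# Assumed policy: transient failures are retried unchanged
RETRY_AS_IS = {"TIMEOUT", "NETWORK_ERROR"}
# Assumed policy: a WAF block is retried through the proxy
RETRY_WITH_PROXY = {"BLOCKED_BY_WAF"}

def next_step(result) -> str:
    """Map a result to one of: done, retry, retry_with_proxy, give_up."""
    if result.success:
        return "done"
    if result.error_code in RETRY_AS_IS:
        return "retry"
    if result.error_code in RETRY_WITH_PROXY:
        return "retry_with_proxy"
    return "give_up"  # e.g. CAPTCHA_REQUIRED, JS_REQUIRED, or an unknown code

print(next_step(FakeResult(success=False, error="timed out", error_code="TIMEOUT")))
```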