Scrape

Scrape a single URL and return structured content. The scrape() method is synchronous: it blocks until the scrape completes and returns the result directly. The scraping engine races five tiers in parallel (HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass); the first valid result wins.
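
The "first valid result wins" race can be illustrated with plain asyncio. This is only a sketch of the general pattern, not the engine's actual implementation: race_first_valid, make_tier, and the rule that "valid" means non-empty are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stand-in for a scraping tier: takes a URL, returns page text or None.
Strategy = Callable[[str], Awaitable[Optional[str]]]


async def race_first_valid(url: str, strategies: list[Strategy]) -> Optional[str]:
    """Run all strategies concurrently and return the first non-empty result."""
    tasks = [asyncio.create_task(strategy(url)) for strategy in strategies]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # a failed tier simply drops out of the race
            if result:  # assumption: "valid" just means non-empty here
                return result
        return None
    finally:
        for task in tasks:
            task.cancel()  # stop the slower tiers once the race is decided
        await asyncio.gather(*tasks, return_exceptions=True)


def make_tier(delay: float, output: Optional[str]) -> Strategy:
    """Build a fake tier that sleeps for `delay` seconds, then returns `output`."""
    async def tier(url: str) -> Optional[str]:
        await asyncio.sleep(delay)
        return output
    return tier
```

A tier that finishes early with an empty result does not end the race; the loop keeps waiting for the next tier to complete.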

Basic Usage

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.success)          # True
    print(result.data.markdown)    # "# Example Domain\n\nThis domain is..."
    print(result.data.metadata)    # PageMetadata(title="Example Domain", ...)

With All Options

result = client.scrape(
    "https://news.ycombinator.com",
    formats=["markdown", "html", "links", "screenshot"],
    only_main_content=True,
    wait_for=2000,                          # wait 2s after page load
    timeout=45000,                          # 45s timeout
    include_tags=["article", "main"],       # only these HTML tags
    exclude_tags=["nav", "footer"],         # remove these tags
    headers={"Accept-Language": "en-US"},
    cookies={"session": "abc123"},
    mobile=True,
    mobile_device="iphone_14",
    css_selector="main.content",            # target specific element
    use_proxy=True,
    capture_network=True,
)

print(result.data.markdown)
print(result.data.html)
print(result.data.links)
print(result.data.screenshot)       # base64-encoded PNG
print(result.data.metadata.title)
print(result.data.metadata.word_count)

Saving Screenshots

Screenshots are returned as base64-encoded PNG strings. You can save them to a file, embed them in HTML, store them as a string field, or pass them directly to an LLM:

import base64

result = client.scrape("https://example.com", formats=["screenshot"])

# Save to file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.screenshot))

# Embed in HTML
html_img = f'<img src="data:image/png;base64,{result.data.screenshot}" />'

# Store in database as a string field (db stands in for your own storage layer)
db.save(url=result.data.metadata.source_url, screenshot=result.data.screenshot)
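
Because screenshot is typed as str | None, it can help to validate the payload before writing it to disk. The helper below is a minimal sketch using only the standard library; save_png is a hypothetical name, not part of the SDK.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def save_png(b64_png: str, path: str) -> int:
    """Decode a base64 PNG string, check its signature, write it, return the byte count."""
    raw = base64.b64decode(b64_png, validate=True)  # rejects non-base64 characters
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    Path(path).write_bytes(raw)
    return len(raw)
```

Call it as save_png(result.data.screenshot, "screenshot.png") after checking that result.data.screenshot is not None.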

Async

import asyncio

from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        result = await client.scrape(
            "https://example.com",
            formats=["markdown", "links"],
            only_main_content=True,
        )
        print(result.data.markdown)

asyncio.run(main())  # async with / await must run inside a coroutine
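
The async client is most useful when scraping several URLs at once. The sketch below bounds concurrency with a semaphore. scrape_many is a hypothetical helper, and it assumes the client's scrape coroutine can safely be called concurrently, which the text above does not state.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def scrape_many(
    scrape: Callable[[str], Awaitable[Any]],
    urls: list[str],
    limit: int = 5,
) -> list[Any]:
    """Scrape `urls` concurrently, at most `limit` at a time, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Any:
        async with semaphore:
            return await scrape(url)

    # return_exceptions=True keeps one failed URL from discarding the other results.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Inside the async with block this would be called as results = await scrape_many(client.scrape, urls). Entries that raised are returned as exception objects in the same position as their URL.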

Browser Actions

Execute browser actions before content extraction. Useful for pages that require interaction (clicking buttons, filling forms, scrolling to load content).

result = client.scrape(
    "https://example.com/infinite-scroll",
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down", "amount": 3},
        {"type": "screenshot"},
    ],
)

Action Types Reference

| Action Type | Required Fields | Optional Fields | Description |
| --- | --- | --- | --- |
| click | selector | button, click_count, modifiers | Click an element |
| wait | milliseconds | - | Wait for duration |
| scroll | direction (up/down), amount | - | Scroll the page |
| type | selector, text | - | Type text into input |
| screenshot | - | - | Capture screenshot |
| hover | selector | - | Hover over element |
| press | key | modifiers | Press keyboard key |
| select | selector, value | - | Select dropdown option |
| fill_form | fields | - | Fill multiple form fields |
| evaluate | script | - | Run JavaScript |
| go_back | - | - | Navigate back |
| go_forward | - | - | Navigate forward |
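
When actions are passed as raw dicts, a typo in a field name is easy to miss. The sketch below checks a list of actions before sending it. check_actions is a hypothetical helper, and the required-field sets are an assumption: they treat the first field group listed for each action in the reference table as required.

```python
# Assumed required fields per action type, transcribed from the reference table.
REQUIRED_FIELDS: dict[str, set[str]] = {
    "click": {"selector"},
    "wait": {"milliseconds"},
    "scroll": {"direction", "amount"},
    "type": {"selector", "text"},
    "screenshot": set(),
    "hover": {"selector"},
    "press": {"key"},
    "select": {"selector", "value"},
    "fill_form": {"fields"},
    "evaluate": {"script"},
    "go_back": set(),
    "go_forward": set(),
}


def check_actions(actions: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_FIELDS:
            problems.append(f"action {index}: unknown type {kind!r}")
            continue
        missing = REQUIRED_FIELDS[kind] - set(action)
        if missing:
            problems.append(f"action {index} ({kind}): missing {sorted(missing)}")
    return problems
```

This only checks field presence on the client side; the service remains the authority on what it accepts.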

LLM Extraction

Extract structured data from a page using an LLM. Pass a prompt and an optional JSON Schema in the extract parameter.

result = client.scrape(
    "https://openai.com/pricing",
    extract={
        "prompt": "Extract all pricing tiers with name and price",
        "schema": {
            "type": "object",
            "properties": {
                "tiers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                },
            },
        },
    },
)
print(result.data.extract)  # {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, ...]}
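
LLM output may not always match the requested shape, and extract is typed as dict | None, so it is worth checking the result before using it. The function below is a deliberately small sketch that understands only the type, properties, and items keywords used in the schema above. It is not a full JSON Schema validator, and matches is a hypothetical name.

```python
from typing import Any

# Maps the JSON Schema type names used in the example to Python types.
_TYPES = {"object": dict, "array": list, "string": str}


def matches(value: Any, schema: dict) -> bool:
    """Check `value` against a schema using only type, properties, and items."""
    expected = _TYPES.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        return False
    if isinstance(value, dict):
        # Properties are optional unless listed under "required", so only
        # validate the keys that are actually present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not matches(value[key], sub_schema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(matches(item, schema["items"]) for item in value)
    return True
```

For stricter checks (required keys, numeric types, formats), a dedicated JSON Schema library is the better choice.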

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | str | required | The URL to scrape |
| formats | list[Literal["markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images"]] | None | Output formats to return. Defaults to ["markdown"] if not specified. |
| only_main_content | bool | True | Extract only main content, removing nav/footer/sidebar |
| wait_for | int | 0 | Milliseconds to wait after page load before extracting |
| timeout | int | 30000 | Request timeout in milliseconds |
| include_tags | list[str] | None | Only include these HTML tags in extraction |
| exclude_tags | list[str] | None | Exclude these HTML tags from extraction |
| headers | dict[str, str] | None | Custom HTTP headers to send with the request |
| cookies | dict[str, str] | None | Custom cookies to send as name/value pairs |
| mobile | bool | False | Emulate a mobile device viewport |
| mobile_device | str | None | Device preset: "iphone_14", "pixel_7", "ipad_pro" |
| css_selector | str | None | Only extract content matching this CSS selector |
| xpath | str | None | XPath expression for targeted extraction |
| selectors | dict[str, Any] | None | Named CSS selectors for multi-element extraction |
| actions | list[dict] \| list[ActionStep] | None | Browser actions to execute before scraping. Accepts raw dicts or typed ActionStep model instances. |
| extract | dict \| ExtractConfig | None | LLM extraction config. Accepts a dict with "prompt" and "schema" keys, or a typed ExtractConfig model. |
| capture_network | bool | False | Capture browser network requests. Results appear in result.data.network_data as a dict with request/response details. |
| use_proxy | bool | False | Route through configured proxy for anti-bot bypass |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
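
On the receiving side, webhook_secret is used to check that a webhook really came from the service. The source does not specify the hash algorithm, the signature encoding, or the header that carries the signature, so the sketch below assumes HMAC-SHA256 over the raw request body with a hex-encoded digest. verify_signature is a hypothetical helper; confirm the actual scheme before relying on it.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if `signature_hex` is the HMAC-SHA256 of `body` under `secret`."""
    # Assumption: SHA-256 and hex encoding; the real scheme may differ.
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest takes constant time, so it does not leak how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Always verify against the raw request bytes, not a re-serialized JSON object, since re-serialization can change whitespace and key order.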

Response Model

Returns ScrapeResult — a Pydantic model with the following fields:

class ScrapeResult:
    success: bool                    # Whether the scrape request succeeded
    data: PageData | None            # Scraped page content including markdown, html, links, and metadata
    error: str | None                # Human-readable error message if the scrape failed
    error_code: str | None           # Machine-readable error code (BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, NETWORK_ERROR)
    job_id: str | None               # Unique job identifier for async scrape requests

class PageData:
    url: str | None                  # The URL of the scraped page
    markdown: str | None             # Clean Markdown conversion of the page content
    fit_markdown: str | None         # Markdown trimmed to fit LLM context windows
    html: str | None                 # Cleaned HTML content with boilerplate removed
    raw_html: str | None             # Original unmodified HTML source of the page
    links: list[str] | None          # List of URLs found on the page
    links_detail: list[dict] | None  # Detailed link info including anchor text and attributes
    screenshot: str | None           # Base64-encoded PNG screenshot of the page
    structured_data: dict | None     # JSON-LD and microdata extracted from the page
    headings: list[dict] | None      # List of headings with level and text (e.g. [{"level": 1, "text": "Title"}])
    images: list[dict] | None        # List of images with src, alt, and dimensions
    extract: dict | None             # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None     # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None         # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None    # Page metadata including title, status_code, word_count, and SEO tags

class PageMetadata:
    title: str | None                # Page title from <title> tag
    description: str | None          # Meta description from <meta name='description'>
    language: str | None             # Page language code (e.g. 'en', 'fr', 'de')
    source_url: str | None           # The URL that was actually scraped (after redirects)
    status_code: int | None          # HTTP response status code (200, 404, 500, etc.)
    word_count: int                  # Number of words in the main content
    reading_time_seconds: int        # Estimated reading time in seconds
    content_length: int              # Response body size in bytes
    og_image: str | None             # OpenGraph image URL for social sharing previews
    canonical_url: str | None        # Canonical URL from <link rel='canonical'>
    favicon: str | None              # Favicon URL
    robots: str | None               # Robots meta tag content (e.g. 'noindex, nofollow')
    response_headers: dict | None    # HTTP response headers as key-value pairs
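
The success and error_code fields make it possible to retry only failures that are likely to be transient. The sketch below is one way to do that. scrape_with_retry is a hypothetical helper, and treating TIMEOUT and NETWORK_ERROR as the retryable codes is an assumption, not documented behavior.

```python
import time
from typing import Any, Callable

# Assumption: these codes are transient; the others likely need a different
# configuration (for example use_proxy=True) rather than a plain retry.
RETRYABLE = {"TIMEOUT", "NETWORK_ERROR"}


def scrape_with_retry(
    scrape: Callable[[str], Any],
    url: str,
    attempts: int = 3,
    backoff: float = 1.0,
) -> Any:
    """Call `scrape(url)` until it succeeds, fails permanently, or attempts run out."""
    result = None
    for attempt in range(attempts):
        result = scrape(url)
        if result.success or result.error_code not in RETRYABLE:
            return result
        if attempt < attempts - 1:
            time.sleep(backoff * 2 ** attempt)  # exponential backoff between tries
    return result
```

With the synchronous client this would be called as scrape_with_retry(client.scrape, "https://example.com"). The last failed result is returned unchanged so the caller can still inspect error and error_code.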