Scrape
Scrape a single URL and return structured content. The scrape() method is synchronous: the call blocks until the scrape finishes and returns the result directly. The scraping engine runs a 5-tier parallel race (HTTP, stealth browser, Playwright, headless Chrome, and anti-bot bypass), and the first valid result wins.
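The "first valid result wins" race can be pictured with a small asyncio sketch. This is not DataBlue's engine code: the tier names come from the description above, while `fake_tier`, the delays, and the `is_valid` rule are hypothetical stand-ins chosen for illustration.

```python
import asyncio
from typing import Optional


async def fake_tier(name: str, delay: float, ok: bool) -> dict:
    """Hypothetical stand-in for one scraping tier."""
    await asyncio.sleep(delay)
    return {"tier": name, "content": f"<html>from {name}</html>" if ok else ""}


def is_valid(result: dict) -> bool:
    # Assumed validity rule for this sketch: non-empty content.
    return bool(result["content"])


async def race(tiers) -> Optional[dict]:
    """Start every tier at once and return the first valid result."""
    tasks = [asyncio.create_task(fake_tier(*tier)) for tier in tiers]
    try:
        for finished in asyncio.as_completed(tasks):
            result = await finished
            if is_valid(result):
                return result
        return None  # every tier finished without a valid result
    finally:
        for task in tasks:
            task.cancel()  # stop the tiers that lost the race


tiers = [
    ("http", 0.01, False),            # finishes first, but returns nothing usable
    ("stealth_browser", 0.03, True),
    ("playwright", 0.05, True),
]
winner = asyncio.run(race(tiers))
print(winner["tier"])
```

Here the fastest tier is rejected because its result is not valid, so the next tier to finish wins and the slower ones are cancelled.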
Basic Usage
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    result = client.scrape("https://example.com")
    print(result.success)        # True
    print(result.data.markdown)  # "# Example Domain\n\nThis domain is..."
    print(result.data.metadata)  # PageMetadata(title="Example Domain", ...)
With All Options
result = client.scrape(
    "https://news.ycombinator.com",
    formats=["markdown", "html", "links", "screenshot"],
    only_main_content=True,
    wait_for=2000,                      # wait 2s after page load
    timeout=45000,                      # 45s timeout
    include_tags=["article", "main"],   # only these HTML tags
    exclude_tags=["nav", "footer"],     # remove these tags
    headers={"Accept-Language": "en-US"},
    cookies={"session": "abc123"},
    mobile=True,
    mobile_device="iphone_14",
    css_selector="main.content",        # target specific element
    use_proxy=True,
    capture_network=True,
)
print(result.data.markdown)
print(result.data.html)
print(result.data.links)
print(result.data.screenshot) # base64-encoded PNG
print(result.data.metadata.title)
print(result.data.metadata.word_count)
Saving Screenshots
Screenshots are returned as base64-encoded PNG strings. You can save one to a file, embed it in HTML, store it as a string, or pass it directly to an LLM:
import base64

result = client.scrape("https://example.com", formats=["screenshot"])

# Save to file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.screenshot))

# Embed in HTML
html_img = f'<img src="data:image/png;base64,{result.data.screenshot}" />'

# Store in database as a string field (db is a placeholder for your own storage layer)
db.save(url=result.data.metadata.source_url, screenshot=result.data.screenshot)
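Because `screenshot` is typed as `str | None`, it can help to guard the decode step. The following is a minimal sketch, not part of the SDK: `save_screenshot` is a hypothetical helper that refuses to write anything when the field is missing, is not valid base64, or does not decode to bytes starting with the standard PNG signature.

```python
import base64
import os
import tempfile
from pathlib import Path
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_screenshot(screenshot_b64: Optional[str], path: str) -> bool:
    """Decode a base64 screenshot and write it to `path`.

    Returns False without writing when the field is empty, cannot be
    decoded, or does not start with the PNG signature.
    """
    if not screenshot_b64:
        return False
    try:
        raw = base64.b64decode(screenshot_b64)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    if not raw.startswith(PNG_SIGNATURE):
        return False
    Path(path).write_bytes(raw)
    return True


# Demo with made-up data instead of a real scrape result.
fake_png = base64.b64encode(PNG_SIGNATURE + b"not a real image").decode()
target = os.path.join(tempfile.mkdtemp(), "screenshot.png")

print(save_screenshot(fake_png, target))        # True
print(save_screenshot("bm90IGEgcG5n", target))  # False: decodes to "not a png"
print(save_screenshot(None, target))            # False
```

With a real result, the call would be `save_screenshot(result.data.screenshot, "screenshot.png")`.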
Async
import asyncio

from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        result = await client.scrape(
            "https://example.com",
            formats=["markdown", "links"],
            only_main_content=True,
        )
        print(result.data.markdown)

asyncio.run(main())
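A common reason to use the async client is scraping several URLs at once. The sketch below bounds concurrency with a semaphore. It assumes a single `AsyncDataBlue` client can serve concurrent `scrape()` calls, which this page does not confirm; `scrape_many` and `FakeClient` are hypothetical names, and `FakeClient` only exists so the example runs without network access.

```python
import asyncio


async def scrape_many(client, urls, max_concurrency=5):
    """Scrape several URLs concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url):
        async with semaphore:
            return await client.scrape(url, formats=["markdown"])

    # gather() preserves input order, so results line up with urls.
    return await asyncio.gather(*(scrape_one(url) for url in urls))


class FakeClient:
    """Hypothetical stand-in for AsyncDataBlue."""

    async def scrape(self, url, formats=None):
        await asyncio.sleep(0.01)
        return {"url": url, "formats": formats}


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = asyncio.run(scrape_many(FakeClient(), urls, max_concurrency=2))
print([r["url"] for r in results])
```

With the real SDK, `FakeClient()` would be replaced by the client opened with `async with AsyncDataBlue(...)`, and each item in `results` would be a `ScrapeResult`.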
Browser Actions
Execute browser actions before content extraction. Useful for pages that require interaction (clicking buttons, filling forms, scrolling to load content).
result = client.scrape(
    "https://example.com/infinite-scroll",
    actions=[
        {"type": "wait", "milliseconds": 1000},
        {"type": "click", "selector": "button.load-more"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "scroll", "direction": "down", "amount": 3},
        {"type": "screenshot"},
    ],
)
Action Types Reference
| Action Type | Required Fields | Optional Fields | Description |
|---|---|---|---|
| click | selector | button, click_count, modifiers | Click an element |
| wait | none | milliseconds | Wait for duration |
| scroll | none | direction (up/down), amount | Scroll the page |
| type | selector, text | none | Type text into input |
| screenshot | none | none | Capture screenshot |
| hover | selector | none | Hover over element |
| press | key | modifiers | Press keyboard key |
| select | selector, value | none | Select dropdown option |
| fill_form | fields | none | Fill multiple form fields |
| evaluate | script | none | Run JavaScript |
| go_back | none | none | Navigate back |
| go_forward | none | none | Navigate forward |
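The required fields in the table can be checked locally before a request is sent. This is a minimal sketch under the assumption that raw dict actions are used: `REQUIRED_FIELDS` is transcribed from the table above, and `validate_actions` is a hypothetical helper, not an SDK function.

```python
# Required fields per action type, transcribed from the reference table.
REQUIRED_FIELDS = {
    "click": {"selector"},
    "wait": set(),
    "scroll": set(),
    "type": {"selector", "text"},
    "screenshot": set(),
    "hover": {"selector"},
    "press": {"key"},
    "select": {"selector", "value"},
    "fill_form": {"fields"},
    "evaluate": {"script"},
    "go_back": set(),
    "go_forward": set(),
}


def validate_actions(actions):
    """Return a list of problems found in `actions` (empty list means OK)."""
    problems = []
    for index, action in enumerate(actions):
        action_type = action.get("type")
        if action_type not in REQUIRED_FIELDS:
            problems.append(f"action {index}: unknown type {action_type!r}")
            continue
        missing = REQUIRED_FIELDS[action_type] - set(action)
        if missing:
            problems.append(
                f"action {index} ({action_type}): missing {sorted(missing)}"
            )
    return problems


good = [
    {"type": "wait", "milliseconds": 1000},
    {"type": "click", "selector": "button.load-more"},
    {"type": "scroll", "direction": "down", "amount": 3},
]
bad = [{"type": "click"}, {"type": "teleport"}]

print(validate_actions(good))
print(validate_actions(bad))
```

The check only covers field presence; it does not validate selector syntax or value types.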
LLM Extraction
Extract structured data from a page using an LLM. Pass a prompt and optional schema (JSON Schema) in the extract parameter.
result = client.scrape(
    "https://openai.com/pricing",
    extract={
        "prompt": "Extract all pricing tiers with name and price",
        "schema": {
            "type": "object",
            "properties": {
                "tiers": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                        },
                    },
                },
            },
        },
    },
)
print(result.data.extract) # {"tiers": [{"name": "GPT-4o", "price": "$2.50/1M"}, ...]}
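The schema above does not list any `required` properties, so individual entries may lack `name` or `price`. One way to consume the result defensively is to convert it into typed objects. The sketch below uses a sample dict shaped like the output comment above; `PricingTier` and `parse_tiers` are hypothetical names, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass
class PricingTier:
    name: str
    price: str


def parse_tiers(extract):
    """Convert an `extract` dict into PricingTier objects.

    Entries that are not dicts, or that lack "name" or "price", are skipped.
    A missing or None `extract` yields an empty list.
    """
    tiers = []
    for item in (extract or {}).get("tiers", []):
        if isinstance(item, dict) and "name" in item and "price" in item:
            tiers.append(PricingTier(name=str(item["name"]), price=str(item["price"])))
    return tiers


sample = {
    "tiers": [
        {"name": "GPT-4o", "price": "$2.50/1M"},
        {"name": "incomplete entry"},  # no price, so it is skipped
    ]
}
print(parse_tiers(sample))
```

With a real scrape, the input would be `result.data.extract`.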
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | The URL to scrape |
| formats | list[Literal["markdown", "html", "raw_html", "links", "screenshot", "structured_data", "headings", "images"]] | None | Output formats to return. Defaults to ["markdown"] if not specified. |
| only_main_content | bool | True | Extract only main content, removing nav/footer/sidebar |
| wait_for | int | 0 | Milliseconds to wait after page load before extracting |
| timeout | int | 30000 | Request timeout in milliseconds |
| include_tags | list[str] | None | Only include these HTML tags in extraction |
| exclude_tags | list[str] | None | Exclude these HTML tags from extraction |
| headers | dict[str, str] | None | Custom HTTP headers to send with the request |
| cookies | dict[str, str] | None | Custom cookies to send as name/value pairs |
| mobile | bool | False | Emulate a mobile device viewport |
| mobile_device | str | None | Device preset: "iphone_14", "pixel_7", "ipad_pro" |
| css_selector | str | None | Only extract content matching this CSS selector |
| xpath | str | None | XPath expression for targeted extraction |
| selectors | dict[str, Any] | None | Named CSS selectors for multi-element extraction |
| actions | list[dict] or list[ActionStep] | None | Browser actions to execute before scraping. Accepts raw dicts or typed ActionStep model instances. |
| extract | dict or ExtractConfig | None | LLM extraction config. Accepts a dict with "prompt" and "schema" keys, or a typed ExtractConfig model. |
| capture_network | bool | False | Capture browser network requests. Results appear in result.data.network_data as a dict with request/response details. |
| use_proxy | bool | False | Route through configured proxy for anti-bot bypass |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
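On the receiving side, `webhook_secret` is used to check that a webhook really came from the service. This page does not specify the hash algorithm, the signature encoding, or the header that carries the signature, so the sketch below assumes HMAC-SHA256 with a hex digest over the raw request body. `verify_webhook` is a hypothetical helper; confirm the actual scheme before relying on it.

```python
import hashlib
import hmac


def verify_webhook(secret: str, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 hex signature over the raw request body.

    Assumption: SHA-256 and hex encoding. The real scheme is not stated
    on this page.
    """
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid timing side channels.
    return hmac.compare_digest(expected, signature_hex)


secret = "example_webhook_secret"  # hypothetical secret value
body = b'{"success": true}'
good_signature = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

print(verify_webhook(secret, body, good_signature))  # True
print(verify_webhook(secret, body, "0" * 64))        # False
```

The signature must be computed over the exact bytes received, before any JSON parsing or re-serialization.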
Response Model
Returns ScrapeResult, a Pydantic model with the following fields:
class ScrapeResult:
    success: bool  # Whether the scrape request succeeded
    data: PageData | None  # Scraped page content including markdown, html, links, and metadata
    error: str | None  # Human-readable error message if the scrape failed
    error_code: str | None  # Machine-readable error code (BLOCKED_BY_WAF, CAPTCHA_REQUIRED, TIMEOUT, JS_REQUIRED, NETWORK_ERROR)
    job_id: str | None  # Unique job identifier for async scrape requests
class PageData:
    url: str | None  # The URL of the scraped page
    markdown: str | None  # Clean Markdown conversion of the page content
    fit_markdown: str | None  # Markdown trimmed to fit LLM context windows
    html: str | None  # Cleaned HTML content with boilerplate removed
    raw_html: str | None  # Original unmodified HTML source of the page
    links: list[str] | None  # List of URLs found on the page
    links_detail: list[dict] | None  # Detailed link info including anchor text and attributes
    screenshot: str | None  # Base64-encoded PNG screenshot of the page
    structured_data: dict | None  # JSON-LD and microdata extracted from the page
    headings: list[dict] | None  # List of headings with level and text (e.g. [{"level": 1, "text": "Title"}])
    images: list[dict] | None  # List of images with src, alt, and dimensions
    extract: dict | None  # LLM-extracted structured data matching the provided schema
    citations: list[dict] | None  # Source citations for extracted content
    markdown_with_citations: str | None  # Markdown content with inline citation references
    content_hash: str | None  # SHA-256 hash of the content for change detection
    metadata: PageMetadata | None  # Page metadata including title, status_code, word_count, and SEO tags
class PageMetadata:
    title: str | None  # Page title from <title> tag
    description: str | None  # Meta description from <meta name='description'>
    language: str | None  # Page language code (e.g. 'en', 'fr', 'de')
    source_url: str | None  # The URL that was actually scraped (after redirects)
    status_code: int | None  # HTTP response status code (200, 404, 500, etc.)
    word_count: int  # Number of words in the main content
    reading_time_seconds: int  # Estimated reading time in seconds
    content_length: int  # Response body size in bytes
    og_image: str | None  # OpenGraph image URL for social sharing previews
    canonical_url: str | None  # Canonical URL from <link rel='canonical'>
    favicon: str | None  # Favicon URL
    robots: str | None  # Robots meta tag content (e.g. 'noindex, nofollow')
    response_headers: dict | None  # HTTP response headers as key-value pairs
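The `success` and `error_code` fields make it possible to retry only failures that are likely to be transient. The sketch below is one possible policy, not SDK behavior: treating `TIMEOUT` and `NETWORK_ERROR` as retryable is an assumption, `scrape_with_retry` is a hypothetical helper, and `fake_scrape` stands in for `client.scrape` so the example runs offline. It also assumes failures are reported through the result object; if the client raises exceptions instead, a `try`/`except` would be needed around the call.

```python
import time
from types import SimpleNamespace

# Assumption: these two codes are transient. The others (for example
# BLOCKED_BY_WAF or CAPTCHA_REQUIRED) likely need different options,
# not a blind retry.
RETRYABLE = {"TIMEOUT", "NETWORK_ERROR"}


def scrape_with_retry(scrape_fn, url, attempts=3, base_delay=1.0):
    """Call scrape_fn(url) until it succeeds or hits a non-retryable error."""
    result = None
    for attempt in range(attempts):
        result = scrape_fn(url)
        if result.success or result.error_code not in RETRYABLE:
            return result
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return result


# Hypothetical stand-in for client.scrape: times out once, then succeeds.
calls = []


def fake_scrape(url):
    calls.append(url)
    if len(calls) == 1:
        return SimpleNamespace(success=False, error_code="TIMEOUT", data=None)
    return SimpleNamespace(success=True, error_code=None, data="ok")


result = scrape_with_retry(fake_scrape, "https://example.com", base_delay=0.01)
print(result.success, len(calls))
```

With the real client, the call would be `scrape_with_retry(client.scrape, url)`.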