Search

Search the web using Google (via SearXNG), DuckDuckGo, or Brave, then scrape each result page and return structured content. Like crawl, search offers a blocking search() method and a non-blocking start_search() method.

Blocking Search

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.search(
        "best python web scraping libraries 2026",
        num_results=5,
        engine="google",
        formats=["markdown"],
        only_main_content=True,
    )

    print(f"Query: {status.query}")
    print(f"Results: {status.completed_results}/{status.total_results}")

    for item in status.data:
        print(f"\n--- {item.title} ---")
        print(f"URL: {item.url}")
        print(f"Snippet: {item.snippet}")
        if item.markdown:
            print(f"Content: {item.markdown[:200]}...")

Non-blocking Search

import time
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    job = client.start_search(
        "machine learning frameworks comparison",
        num_results=10,
        engine="google",
    )
    print(f"Job ID: {job.job_id}")

    while True:
        status = client.get_search_status(job.job_id)
        print(f"  Progress: {status.completed_results}/{status.total_results}")
        if status.is_complete:
            break
        time.sleep(2)

    for item in status.data:
        print(f"  {item.title}: {item.url}")
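
The loop above polls until the job finishes, so a job that never reaches a terminal state would block indefinitely. A minimal sketch of a bounded polling loop follows. `poll_until_complete` and its `get_status` argument are hypothetical helpers, not part of the datablue SDK; the only thing assumed about the status object is the `is_complete` property described in the response model.

```python
import time
from typing import Any, Callable


def poll_until_complete(
    get_status: Callable[[], Any],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> Any:
    """Call get_status() until it reports is_complete or the timeout elapses.

    Hypothetical helper: get_status is any zero-argument callable that
    returns an object with an `is_complete` attribute.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.is_complete:
            return status
        # Give up once the deadline has passed instead of looping forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"search did not finish within {timeout} seconds")
        time.sleep(interval)
```

With the client and job from the example above, this could be called as `poll_until_complete(lambda: client.get_search_status(job.job_id))`.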

Async

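import asyncio
from datablue import AsyncDataBlue

async def main() -> None:
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        status = await client.search(
            "latest AI research papers",
            num_results=5,
            engine="google",
        )
        for item in status.data:
            print(item.title, item.url)

# async with is only valid inside a coroutine, so run it through asyncio.run()
asyncio.run(main())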
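
One reason to reach for the async client is to run several searches at once. The sketch below shows one way that might look with `asyncio.gather`. `search_many` is a hypothetical helper, not an SDK function, and it only assumes that `client.search(query, **kwargs)` is awaitable as in the example above. Whether the service limits concurrent jobs is not stated here, so keep the batch small.

```python
import asyncio
from typing import Any


async def search_many(
    client: Any, queries: list[str], **search_kwargs: Any
) -> dict[str, Any]:
    """Run one search per query concurrently and map each query to its result.

    Hypothetical helper: `client` is any object exposing an awaitable
    search(query, **kwargs). Duplicate queries collapse to one dict entry.
    """
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(
        *(client.search(query, **search_kwargs) for query in queries)
    )
    return dict(zip(queries, results))
```

Inside a coroutine, `await search_many(client, ["query one", "query two"], num_results=5)` would return a dict keyed by query.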

With LLM Extraction

Apply LLM extraction to each search result for structured output:

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.search(
        "python web frameworks comparison",
        num_results=3,
        extract={
            "prompt": "Extract the framework name, pros, and cons",
            "schema": {
                "type": "object",
                "properties": {
                    "framework": {"type": "string"},
                    "pros": {"type": "array", "items": {"type": "string"}},
                    "cons": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    )

    for item in status.data:
        # Skip results without extracted data; the schema marks no field as
        # required, so read keys with .get() rather than indexing.
        if item.extract:
            print(f"{item.extract.get('framework')}: {item.extract.get('pros')}")
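
The schema above declares no required fields, so an extracted dict may be missing keys. As a rough sketch, the results can be filtered down to the rows that carry every field you need before further processing. `collect_extracts` is a hypothetical helper, not part of the SDK; it only relies on the `extract` attribute of each result item.

```python
from typing import Any, Iterable


def collect_extracts(
    items: Iterable[Any], required_keys: tuple[str, ...]
) -> list[dict]:
    """Return the extract dicts that contain every key in required_keys."""
    rows = []
    for item in items:
        data = getattr(item, "extract", None)
        # Keep only dict results that include all of the requested keys.
        if isinstance(data, dict) and all(key in data for key in required_keys):
            rows.append(data)
    return rows
```

For the search above, `collect_extracts(status.data, ("framework", "pros", "cons"))` would drop results where extraction failed or returned a partial object.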

Parameters

Parameter	Type	Default	Description
`query`	str	required	The search query string
`num_results`	int	5	Number of search results to scrape
`engine`	Literal["google", "duckduckgo", "brave"]	"google"	Search engine: "google" (via SearXNG), "duckduckgo", or "brave"
`formats`	list[str]	None	Output formats for scraped results: "markdown", "html", "links", etc.
`only_main_content`	bool	False	Extract only main content from each result page
`headers`	dict[str, str]	None	Custom HTTP headers for scraping result pages
`cookies`	dict[str, str]	None	Custom cookies for scraping result pages
`mobile`	bool	False	Emulate mobile device viewport for scraping
`mobile_device`	str	None	Mobile device preset name
`extract`	dict	None	LLM extraction config applied to each result: { "prompt", "schema" }
`use_proxy`	bool	False	Route requests through configured proxy
`google_api_key`	str	None	Google Custom Search API key (alternative to SearXNG)
`google_cx`	str	None	Google Custom Search Engine ID
`brave_api_key`	str	None	Brave Search API key
`webhook_url`	str	None	URL to receive webhook on completion
`webhook_secret`	str	None	HMAC secret for webhook signature verification
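
The table describes `webhook_secret` as an HMAC secret but does not state the hash algorithm, the signature encoding, or the header that carries the signature. The sketch below shows what verification on the receiving side might look like, assuming a hex-encoded HMAC-SHA256 over the raw request body. Those details and the `verify_webhook_signature` name are assumptions for illustration; check the actual webhook format before relying on them.

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature of the raw request body.

    Assumption: SHA-256 and hex encoding are illustrative choices, not
    confirmed details of the webhook format.
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

Verify against the raw bytes of the request body, before any JSON parsing, since re-serializing the payload can change the bytes and invalidate the signature.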

The blocking search() method accepts three additional parameters:

Parameter	Type	Default	Description
`poll_interval`	float	2.0	Seconds between status polls
`timeout`	float	300.0	Maximum seconds to wait before raising TimeoutError
`on_progress`	Callable[[SearchStatus], None]	None	Optional callback invoked after each poll with the latest SearchStatus. Useful for progress bars or logging.
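
As an illustration of `on_progress`, the sketch below builds a callback that prints one line per poll. `make_progress_logger` is a hypothetical helper, not part of the SDK; it assumes only the `completed_results` and `total_results` fields of SearchStatus.

```python
from typing import Any, Callable


def make_progress_logger(label: str) -> Callable[[Any], None]:
    """Build a callback that prints one progress line per poll."""

    def on_progress(status: Any) -> None:
        total = status.total_results
        done = status.completed_results
        # Guard against a zero total, which may occur before results are known.
        percent = 100.0 * done / total if total else 0.0
        print(f"[{label}] {done}/{total} ({percent:.0f}%)")

    return on_progress
```

It could then be passed as `client.search("some query", on_progress=make_progress_logger("search"), timeout=120.0)`.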

Response Model

class SearchJob:                         # from start_search()
    success: bool                        # Whether the search job was accepted
    job_id: str                          # Unique job identifier for polling search status
    status: str                          # Current job status (typically "started")
    message: str | None                  # Human-readable status message or error description

class SearchStatus:                      # from search() or get_search_status()
    job_id: str                          # Unique job identifier
    status: str                          # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    query: str | None                    # The search query that was executed
    total_results: int                   # Total number of search results found
    completed_results: int               # Number of results successfully scraped so far
    data: list[SearchResultItem]         # List of search result items with optional scraped content
    error: str | None                    # Human-readable error message if the search failed

    # Properties
    is_complete: bool                    # True when status in {"completed", "failed", "cancelled"}
    progress: float                      # completed_results / total_results (0.0 to 1.0)

class SearchResultItem:
    id: str | None                       # Unique identifier for this search result
    url: str                             # URL of the search result
    title: str | None                    # Title of the search result from the search engine
    snippet: str | None                  # Search result snippet/description text
    success: bool                        # Whether this result was successfully scraped
    markdown: str | None                 # Clean Markdown conversion of the scraped page content
    html: str | None                     # Cleaned HTML content of the scraped page
    links: list[str] | None              # List of URLs found on the scraped page
    links_detail: list[dict] | None      # Detailed link info including anchor text and attributes
    screenshot: str | None               # Base64-encoded PNG screenshot of the page
    structured_data: dict | None         # JSON-LD and microdata extracted from the page
    headings: list[dict] | None          # List of headings with level and text
    images: list[dict] | None            # List of images with src, alt, and dimensions
    extract: dict | None                 # LLM-extracted structured data matching the provided schema
    metadata: PageMetadata | None        # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                    # Error message if scraping this result failed
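
Because each SearchResultItem carries its own `success` flag and `error` message, a search can complete while some pages failed to scrape. A small sketch of separating the two groups follows; `split_results` is a hypothetical helper that relies only on the `success` field listed above.

```python
from typing import Any, Iterable


def split_results(items: Iterable[Any]) -> tuple[list[Any], list[Any]]:
    """Partition result items into (scraped, failed) using their success flag."""
    scraped: list[Any] = []
    failed: list[Any] = []
    for item in items:
        (scraped if item.success else failed).append(item)
    return scraped, failed
```

After a search, `scraped, failed = split_results(status.data)` lets you process the scraped pages and log `item.url` and `item.error` for the rest.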