Search
Search the web using Google (via SearXNG), DuckDuckGo, or Brave, then scrape each result page and return structured content.
Like crawl, search offers a blocking search() method and a non-blocking start_search() method.
Blocking Search
from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    status = client.search(
        "best python web scraping libraries 2026",
        num_results=5,
        engine="google",
        formats=["markdown"],
        only_main_content=True,
    )
    print(f"Query: {status.query}")
    print(f"Results: {status.completed_results}/{status.total_results}")
    for item in status.data:
        print(f"\n--- {item.title} ---")
        print(f"URL: {item.url}")
        print(f"Snippet: {item.snippet}")
        if item.markdown:
            print(f"Content: {item.markdown[:200]}...")
Non-blocking Search
import time

from datablue import DataBlue

with DataBlue(api_key="wh_your_api_key") as client:
    job = client.start_search(
        "machine learning frameworks comparison",
        num_results=10,
        engine="google",
    )
    print(f"Job ID: {job.job_id}")
    while True:
        status = client.get_search_status(job.job_id)
        print(f" Progress: {status.completed_results}/{status.total_results}")
        if status.is_complete:
            break
        time.sleep(2)
    for item in status.data:
        print(f" {item.title}: {item.url}")
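The polling loop above runs until the job reports completion, with no upper bound on the wait. A minimal sketch of a bounded variant follows. `wait_for_search` is a hypothetical helper, not part of the SDK; it assumes only the `get_search_status()` method and the `is_complete` property shown in this section.

```python
import time


def wait_for_search(client, job_id, poll_interval=2.0, timeout=300.0):
    """Poll get_search_status() until the job finishes or the deadline passes.

    Hypothetical helper (not part of the SDK). `client` is any object that
    exposes get_search_status(job_id) returning an object with `is_complete`.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_search_status(job_id)
        if status.is_complete:
            return status
        # Give up once the deadline has passed instead of looping forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"search job {job_id} did not finish within {timeout}s")
        time.sleep(poll_interval)
```

Called as `wait_for_search(client, job.job_id, timeout=60.0)`, it would return the final status or raise `TimeoutError`. Note that `is_complete` is also true for failed and cancelled jobs, so the caller should still inspect `status.status`.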
Async
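import asyncio

from datablue import AsyncDataBlue

async def main():
    async with AsyncDataBlue(api_key="wh_your_api_key") as client:
        status = await client.search(
            "latest AI research papers",
            num_results=5,
            engine="google",
        )
        for item in status.data:
            print(item.title, item.url)

# "async with" and "await" are only valid inside a coroutine, so run one.
asyncio.run(main())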
With LLM Extraction
Apply LLM extraction to each search result for structured output:
status = client.search(
    "python web frameworks comparison",
    num_results=3,
    extract={
        "prompt": "Extract the framework name, pros, and cons",
        "schema": {
            "type": "object",
            "properties": {
                "framework": {"type": "string"},
                "pros": {"type": "array", "items": {"type": "string"}},
                "cons": {"type": "array", "items": {"type": "string"}},
            },
        },
    },
)
for item in status.data:
    if item.extract:
        print(f"{item.extract['framework']}: {item.extract['pros']}")
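The schema above does not mark any property as required, so it seems safer to assume that a key may be missing from `item.extract`, in which case `item.extract['framework']` would raise `KeyError`. The sketch below reads the fields defensively. `summarize_extract` is an illustrative helper, not part of the SDK, and the fallback values are arbitrary choices.

```python
def summarize_extract(extract):
    """Build a one-line summary from an extraction dict.

    Illustrative helper (not part of the SDK). Returns None when there is
    no extraction, and tolerates missing or empty keys.
    """
    if not extract:
        return None
    name = extract.get("framework") or "(unknown)"
    pros = extract.get("pros") or []
    cons = extract.get("cons") or []
    return f"{name}: {len(pros)} pros, {len(cons)} cons"
```

In the loop above, `summarize_extract(item.extract)` could replace the direct indexing, with `None` results skipped.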
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | required | The search query string |
| num_results | int | 5 | Number of search results to scrape |
| engine | Literal["google", "duckduckgo", "brave"] | "google" | Search engine: "google" (via SearXNG), "duckduckgo", or "brave" |
| formats | list[str] | None | Output formats for scraped results: "markdown", "html", "links", etc. |
| only_main_content | bool | False | Extract only main content from each result page |
| headers | dict[str, str] | None | Custom HTTP headers for scraping result pages |
| cookies | dict[str, str] | None | Custom cookies for scraping result pages |
| mobile | bool | False | Emulate mobile device viewport for scraping |
| mobile_device | str | None | Mobile device preset name |
| extract | dict | None | LLM extraction config applied to each result: { "prompt", "schema" } |
| use_proxy | bool | False | Route requests through configured proxy |
| google_api_key | str | None | Google Custom Search API key (alternative to SearXNG) |
| google_cx | str | None | Google Custom Search Engine ID |
| brave_api_key | str | None | Brave Search API key |
| webhook_url | str | None | URL to receive webhook on completion |
| webhook_secret | str | None | HMAC secret for webhook signature verification |
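
The table describes webhook_secret as an HMAC secret but does not specify the signing scheme. The following is a minimal sketch of receiver-side verification under stated assumptions: HMAC-SHA256 computed over the raw request body, delivered as a hex string. The algorithm, the encoding, and where the signature appears in the request are all assumptions here and should be checked against the service's webhook documentation.

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex matches the HMAC of the raw body.

    Assumption (not confirmed by the docs): the signature is
    hex(HMAC-SHA256(secret, raw_body)).
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

The comparison is made against the raw bytes of the request, not a re-serialized JSON object, because re-serialization can change whitespace or key order and invalidate the signature.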
The blocking search() method accepts three additional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| poll_interval | float | 2.0 | Seconds between status polls |
| timeout | float | 300.0 | Maximum seconds to wait before raising TimeoutError |
| on_progress | Callable[[SearchStatus], None] | None | Optional callback invoked after each poll with the latest SearchStatus. Useful for progress bars or logging. |
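
As an illustration of on_progress, here is a small logging callback. It uses only the status, completed_results, and total_results fields documented under Response Model. The function names are illustrative, and the guard for a zero total assumes that total_results may be 0 before the engine has returned any results.

```python
def format_progress(status) -> str:
    """Render a one-line progress message from a SearchStatus-like object."""
    total = status.total_results
    done = status.completed_results
    # Avoid dividing by zero while the total is still unknown (assumed possible).
    pct = (done / total * 100) if total else 0.0
    return f"[{status.status}] {done}/{total} results scraped ({pct:.0f}%)"


def log_progress(status) -> None:
    """Callback with the Callable[[SearchStatus], None] shape."""
    print(format_progress(status))
```

It would be passed as `client.search("query", on_progress=log_progress, poll_interval=1.0, timeout=120.0)`.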
Response Model
class SearchJob:  # from start_search()
    success: bool            # Whether the search job was accepted
    job_id: str              # Unique job identifier for polling search status
    status: str              # Current job status (typically "started")
    message: str | None      # Human-readable status message or error description

class SearchStatus:  # from search() or get_search_status()
    job_id: str                     # Unique job identifier
    status: str                     # "pending" | "running" | "completed" | "failed" | "cancelled" | "started"
    query: str | None               # The search query that was executed
    total_results: int              # Total number of search results found
    completed_results: int          # Number of results successfully scraped so far
    data: list[SearchResultItem]    # List of search result items with optional scraped content
    error: str | None               # Human-readable error message if the search failed
    # Properties
    is_complete: bool               # True when status in {"completed", "failed", "cancelled"}
    progress: float                 # completed_results / total_results (0.0 to 1.0)

class SearchResultItem:
    id: str | None                    # Unique identifier for this search result
    url: str                          # URL of the search result
    title: str | None                 # Title of the search result from the search engine
    snippet: str | None               # Search result snippet/description text
    success: bool                     # Whether this result was successfully scraped
    markdown: str | None              # Clean Markdown conversion of the scraped page content
    html: str | None                  # Cleaned HTML content of the scraped page
    links: list[str] | None           # List of URLs found on the scraped page
    links_detail: list[dict] | None   # Detailed link info including anchor text and attributes
    screenshot: str | None            # Base64-encoded PNG screenshot of the page
    structured_data: dict | None      # JSON-LD and microdata extracted from the page
    headings: list[dict] | None       # List of headings with level and text
    images: list[dict] | None         # List of images with src, alt, and dimensions
    extract: dict | None              # LLM-extracted structured data matching the provided schema
    metadata: PageMetadata | None     # Page metadata including title, status_code, word_count, and SEO tags
    error: str | None                 # Error message if scraping this result failed
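
Because each SearchResultItem carries its own success and error fields, a completed search can still contain pages that failed to scrape. A short sketch of separating the two groups follows; `split_results` and `failure_report` are hypothetical helpers, not SDK functions, and they rely only on the fields listed above.

```python
def split_results(items):
    """Separate scraped results from failed ones using the `success` field."""
    succeeded = [item for item in items if item.success]
    failed = [item for item in items if not item.success]
    return succeeded, failed


def failure_report(failed):
    """Return one 'url: error' line per failed item."""
    return [f"{item.url}: {item.error or 'unknown error'}" for item in failed]
```

With a finished status, `ok, bad = split_results(status.data)` gives the items that are safe to read content from, and `failure_report(bad)` gives lines suitable for logging.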