Extraction
Reliable structured extraction is available through CSS selectors, XPath, regular expressions, and LLMs.
CSS, XPath, and regex extraction is configured through the `extract_scheme` parameter, an extraction schema in which each object defines a data or area “bucket.” These objects act as either a Rule (to pull data) or a Container (to scope data).
Defining an extract_scheme
1 - Select the data or container with CSS/XPath/regex locators
2 - Nest objects to scope extraction to subsections if needed
3 - Configure the field type for each rule (content/attribute/exists/count)
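
The three steps above can be sketched in Python as plain dictionaries. The helpers `rule` and `container` and the `li.tag` selector are hypothetical conveniences for illustration, not part of the API; only the keys (`label`, `type`, `selector`, `attribute`, `fields`) come from this page.

```python
import json

def rule(label, type_, **locator):
    """Build a Rule object: one locator plus a field type (hypothetical helper)."""
    return {"label": label, "type": type_, **locator}

def container(label, fields, **locator):
    """Build a Container object that scopes its child fields (hypothetical helper)."""
    return {"label": label, "type": "nest", **locator, "fields": fields}

# Step 1: select data with a CSS, XPath, or regex locator.
title = rule("title", "content", selector="h2")

# Step 3: configure the field type (an attribute value and a count here).
link = rule("link", "attribute", selector="a.main-link", attribute="href")
tag_count = rule("tag_count", "count", selector="li.tag")  # hypothetical selector

# Step 2: nest the rules inside a container to scope them to one subsection.
extract_scheme = [
    container("blog_posts", [title, link, tag_count], selector="article.post-card")
]

print(json.dumps({"extract_scheme": extract_scheme}, indent=2))
```

Building the schema from small functions keeps each object's locator and type next to each other, which may make larger schemas easier to review than hand-written JSON.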
Extraction Parameters
| Parameter | Type | Description |
|---|---|---|
| `label` | String | The key name for the field in the resulting JSON. |
| `type` | Enum | `content` (text), `attribute` (links/src), `exists` (bool), `count` (int), or `nest`. |
| `selector` | CSS | A standard CSS selector to target the element. |
| `xpath` | XPath | A standard XPath expression for complex navigation. |
| `regex` | Pattern | A regular expression to extract text from the raw HTML/text. |
| `attribute` | String | Required if `type="attribute"`. Specify the target (e.g., `href`, `src`). |
| `fields` | Array | Required if `type="nest"`. A recursive list of extraction objects. |
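
The two "Required if" rules in the table can be checked locally before a request is sent. The following is a minimal sketch; the `validate` function and its message format are assumptions made for illustration and say nothing about how the service itself validates a schema.

```python
ALLOWED_TYPES = {"content", "attribute", "exists", "count", "nest"}

def validate(obj, path="extract_scheme"):
    """Return a list of human-readable problems for one extraction object."""
    where = f"{path}.{obj.get('label', '?')}"
    problems = []
    kind = obj.get("type")
    if kind not in ALLOWED_TYPES:
        problems.append(f"{where}: unknown type {kind!r}")
    # `attribute` is required when type is "attribute".
    if kind == "attribute" and not obj.get("attribute"):
        problems.append(f"{where}: type 'attribute' requires attribute")
    # `fields` is required when type is "nest"; children are checked recursively.
    if kind == "nest":
        fields = obj.get("fields") or []
        if not fields:
            problems.append(f"{where}: type 'nest' requires fields")
        for child in fields:
            problems.extend(validate(child, where))
    return problems

# Example: the nested link rule forgot its `attribute` key.
scheme = [{
    "label": "posts", "type": "nest", "selector": "article.post-card",
    "fields": [{"label": "link", "type": "attribute", "selector": "a.main-link"}],
}]
for obj in scheme:
    for problem in validate(obj):
        print(problem)
```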
Below is a standard configuration for extracting a list of blog posts from a page:
```json
{
  "extract_scheme": [
    {
      "label": "blog_posts",
      "type": "nest",
      "selector": "article.post-card",
      "fields": [
        {
          "label": "title",
          "type": "content",
          "selector": "h2"
        },
        {
          "label": "link",
          "type": "attribute",
          "selector": "a.main-link",
          "attribute": "href"
        },
        {
          "label": "author_via_xpath",
          "type": "content",
          "xpath": ".//span[contains(@class, 'author-name')]"
        }
      ]
    }
  ]
}
```

⚠️
Set `delivery=json` to receive results. View “Output Formats” for more!

- Hybrid Targeting: You can mix and match locator types. For example, you can use a `selector` to find a parent container and then use `regex` or `xpath` to find specific data inside it.
- Use `regex` for raw strings: Best when data isn’t wrapped in its own tag (e.g., pulling a price out of a string like “Price: $45.00 only today”).
- Use `type: "nest"` for lists: Use this for repeated elements like product grids to keep your JSON structured.
- Use `type: "exists"` for validation: Perfect for checking if a “Sold Out” badge is present without needing the actual text.
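
To make the tips above concrete, here is a local emulation of hybrid targeting using only the Python standard library. It is a sketch of the idea, not the service's implementation: the sample markup, the price pattern, and the output keys are all assumptions, and the service's regex flavor and result shape may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup so the stdlib XML parser can read it.
PAGE = """
<div>
  <div class="product"><h3>Lamp</h3><p>Price: $45.00 only today</p></div>
  <div class="product"><h3>Desk</h3><p>Price: $120.50</p><span class="badge">Sold Out</span></div>
</div>
""".strip()

# Assumed price pattern: a dollar sign, digits, and optional cents.
PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract_products(page):
    root = ET.fromstring(page)
    products = []
    # Container (like type "nest"): each product card scopes the rules below.
    for card in root.findall("div[@class='product']"):
        raw_text = "".join(card.itertext())
        price = PRICE.search(raw_text)
        products.append({
            # Like type "content": the text of an element inside the container.
            "name": card.findtext("h3"),
            # Regex on raw text: the price is not wrapped in its own tag.
            "price": price.group(1) if price else None,
            # Like type "exists": only presence matters, not the badge text.
            "sold_out": card.find("span[@class='badge']") is not None,
        })
    return products

products = extract_products(PAGE)
print(products)
```

The container narrows the search first, so the regex only ever sees the text of one product card and cannot pick up a price from a neighboring item.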
AI Extraction
| Parameter | Type | Default | Required When | Description |
|---|---|---|---|---|
| `ai_enhance` | boolean | `false` | - | Enable AI processing |
| `ai_source` | string | - | `ai_enhance=true` | Source: `markdown` or `screenshot` |
| `ai_prompt` | string | - | No | Custom prompt for AI processing |
| `ai_force_json` | boolean | `true` | No | Force JSON output format |
AI Model: Google Gemini 2.0 Flash
Cost: +10 credits (additive)
Example:
```json
{
  "ai_enhance": true,
  "ai_source": "markdown",
  "ai_prompt": "Extract product name, price, and availability as JSON"
}
```

See AI Enhancement for detailed examples.
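
The "Required When" column can be enforced on the client side before a request is built. The sketch below assumes a hypothetical helper named `build_ai_options` and uses `https://example.com` as a placeholder URL; neither is part of the API.

```python
import json

def build_ai_options(source=None, prompt=None, force_json=True, enhance=True):
    """Return the AI-related request fields (hypothetical helper, not part of the API)."""
    if not enhance:
        return {"ai_enhance": False}
    # ai_source is required whenever ai_enhance is true.
    if source not in ("markdown", "screenshot"):
        raise ValueError("ai_source must be 'markdown' or 'screenshot' when ai_enhance is true")
    options = {"ai_enhance": True, "ai_source": source, "ai_force_json": force_json}
    # ai_prompt is optional, so it is only included when given.
    if prompt is not None:
        options["ai_prompt"] = prompt
    return options

payload = {
    "url": "https://example.com",  # placeholder URL
    **build_ai_options("markdown", "Extract product name, price, and availability as JSON"),
}
print(json.dumps(payload, indent=2))
```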
Request Mode with Filtering then Extracting (2 credits)
```json
{
  "url": "https://spa-website.com",
  "mode": "request",
  "excluded_tags": ["aside", "form", "nav"],
  "excluded_selectors": [".ads"],
  "proxy_type": "residential",
  "proxy_country": "US",
  "extract_scheme": [
    {
      "label": "main_content",
      "type": "nest",
      "selector": "div",
      "fields": [
        {
          "label": "headline",
          "type": "content",
          "xpath": "//h1"
        },
        {
          "label": "body_text",
          "type": "content",
          "selector": "p"
        },
        {
          "label": "external_link",
          "type": "attribute",
          "selector": "a",
          "attribute": "href"
        }
      ]
    }
  ]
}
```
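
Because each `label` becomes a key in the resulting JSON, it can help to list the labels a schema defines before sending it. The sketch below walks the schema from the request above; the `label_paths` helper and the dotted-path notation are assumptions for illustration only and do not describe the actual response shape.

```python
def label_paths(scheme, prefix=""):
    """List the labels a schema defines, using dots to show nesting."""
    paths = []
    for obj in scheme:
        path = f"{prefix}{obj['label']}"
        if obj.get("type") == "nest":
            # A container contributes its children, scoped under its own label.
            paths.extend(label_paths(obj.get("fields", []), prefix=path + "."))
        else:
            paths.append(path)
    return paths

extract_scheme = [{
    "label": "main_content",
    "type": "nest",
    "selector": "div",
    "fields": [
        {"label": "headline", "type": "content", "xpath": "//h1"},
        {"label": "body_text", "type": "content", "selector": "p"},
        {"label": "external_link", "type": "attribute", "selector": "a", "attribute": "href"},
    ],
}]

for path in label_paths(extract_scheme):
    print(path)
```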