Extraction

Reliable structured extraction is achieved through CSS selectors, XPath expressions, regular expressions, and LLMs.

CSS, XPath, and regex extraction is configured through the extract_scheme parameter, an extraction schema given as an array of objects, each defining a data or area “bucket.” Each object acts as either a Rule (to pull data) or a Container (to scope data).

Defining an extract_scheme

1 - Select the data or container with CSS, XPath, or regex locators

2 - Nest fields to scope extraction to subsections, if needed

3 - Configure the type of each extraction field (content/attribute/exists/count)

Extraction Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| label | String | The key name for the field in the resulting JSON. |
| type | Enum | content (text), attribute (links/src), exists (bool), count (int), or nest. |
| selector | CSS | A standard CSS selector to target the element. |
| xpath | XPath | A standard XPath expression for complex navigation. |
| regex | Pattern | A regular expression to extract text from the raw HTML/text. |
| attribute | String | Required if type="attribute". Specifies the target attribute (e.g., href, src). |
| fields | Array | Required if type="nest". A recursive list of extraction objects. |
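
The rules in the parameter table can be checked locally before a request is sent. The following is a minimal Python sketch, not part of the API: validate_scheme is a hypothetical helper, and two of its checks are assumptions rather than documented behavior (that every object needs at least one of selector, xpath, or regex, and that a nest needs a non-empty fields array).

```python
from typing import Any

VALID_TYPES = {"content", "attribute", "exists", "count", "nest"}
LOCATOR_KEYS = ("selector", "xpath", "regex")


def validate_scheme(fields: list[dict[str, Any]], path: str = "extract_scheme") -> list[str]:
    """Return a list of problems found; an empty list means the scheme passed."""
    problems: list[str] = []
    for i, field in enumerate(fields):
        where = f"{path}[{i}]"
        label = field.get("label")
        if not isinstance(label, str) or not label:
            problems.append(f"{where}: missing 'label'")
        ftype = field.get("type")
        if ftype not in VALID_TYPES:
            problems.append(f"{where}: unknown type {ftype!r}")
        # Assumption: every object carries at least one locator.
        if not any(key in field for key in LOCATOR_KEYS):
            problems.append(f"{where}: no selector, xpath, or regex")
        # Documented rule: 'attribute' is required when type is "attribute".
        if ftype == "attribute" and not field.get("attribute"):
            problems.append(f"{where}: type 'attribute' requires 'attribute'")
        # Documented rule: 'fields' is required when type is "nest".
        if ftype == "nest":
            children = field.get("fields")
            if not isinstance(children, list) or not children:
                # Assumption: an empty 'fields' array is treated as missing.
                problems.append(f"{where}: type 'nest' requires a non-empty 'fields' array")
            else:
                problems.extend(validate_scheme(children, f"{where}.fields"))
    return problems
```

Calling validate_scheme on the value of extract_scheme returns an empty list when every rule holds. Otherwise each message names the offending object by its position, such as extract_scheme[0].fields[1].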

Below is a standard configuration for extracting a list of blog posts from a page:

{
  "extract_scheme": [
    {
      "label": "blog_posts",
      "type": "nest",
      "selector": "article.post-card",
      "fields": [
        {
          "label": "title",
          "type": "content",
          "selector": "h2"
        },
        {
          "label": "link",
          "type": "attribute",
          "selector": "a.main-link",
          "attribute": "href"
        },
        {
          "label": "author_via_xpath",
          "type": "content",
          "xpath": ".//span[contains(@class, 'author-name')]" 
        }
      ]
    }
  ]
}
⚠️ Set delivery=json to receive results. See “Output Formats” for more.

  • Hybrid Targeting: You can mix and match locator types. For example, you can use a selector to find a parent container and then use regex or xpath to find specific data inside it.
  • Use regex for raw strings: Best when data isn’t wrapped in its own tag (e.g., pulling a price out of a string like “Price: $45.00 only today”).
  • Use type: "nest" for lists: Use this for repeated elements like product grids to keep your JSON structured.
  • Use type: "exists" for validation: Perfect for checking if a “Sold Out” badge is present without needing the actual text.
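
The tips above can be combined in one scheme. The Python sketch below first tests a price pattern locally against the sample string from the regex tip, then builds a hybrid scheme. The class names (div.product-card, .sold-out-badge), the labels, and the pattern itself are illustrative assumptions. Pairing type "content" with a regex locator is also assumed, and the service's regex dialect and match semantics may differ from Python's re module.

```python
import json
import re

# Illustrative pattern: a dollar sign, digits, and optional cents.
PRICE_PATTERN = r"\$\d+(?:\.\d{2})?"

# Local sanity check against the sample string from the tip above.
sample = "Price: $45.00 only today"
match = re.search(PRICE_PATTERN, sample)
price = match.group(0) if match else None

# Hybrid targeting: a CSS selector scopes each product card, then regex,
# exists, and count fields run inside that container.
product_scheme = [
    {
        "label": "products",
        "type": "nest",
        "selector": "div.product-card",  # hypothetical class name
        "fields": [
            {"label": "price", "type": "content", "regex": PRICE_PATTERN},
            {"label": "sold_out", "type": "exists", "selector": ".sold-out-badge"},
            {"label": "image_count", "type": "count", "selector": "img"},
        ],
    }
]

payload = json.dumps({"extract_scheme": product_scheme}, indent=2)
print(price)
print(payload)
```

Testing the pattern locally first helps separate a faulty regex from a faulty locator when a field comes back empty.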

AI Extraction

| Parameter | Type | Default | Required When | Description |
|-----------|------|---------|---------------|-------------|
| ai_enhance | boolean | false | - | Enable AI processing |
| ai_source | string | - | ai_enhance=true | Source: markdown or screenshot |
| ai_prompt | string | - | No | Custom prompt for AI processing |
| ai_force_json | boolean | true | No | Force JSON output format |

AI Model: Google Gemini 2.0 Flash
Cost: +10 credits (additive)

Example:

{
  "ai_enhance": true,
  "ai_source": "markdown",
  "ai_prompt": "Extract product name, price, and availability as JSON"
}

See AI Enhancement for detailed examples.
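
The “Required When” column in the AI parameter table can be enforced in client code. The following Python sketch assumes a hypothetical helper, build_ai_options, that is not part of the API. It always sets ai_source, because that field is required whenever ai_enhance is true, and it sends ai_force_json only when overriding the documented default.

```python
from typing import Any, Optional

VALID_AI_SOURCES = {"markdown", "screenshot"}


def build_ai_options(
    source: str,
    prompt: Optional[str] = None,
    force_json: bool = True,
) -> dict[str, Any]:
    """Build the AI-related request fields with ai_enhance switched on."""
    if source not in VALID_AI_SOURCES:
        raise ValueError(
            f"ai_source must be one of {sorted(VALID_AI_SOURCES)}, got {source!r}"
        )
    options: dict[str, Any] = {"ai_enhance": True, "ai_source": source}
    if prompt is not None:
        options["ai_prompt"] = prompt
    if not force_json:
        # ai_force_json defaults to true, so include it only when overriding.
        options["ai_force_json"] = False
    return options
```

Passing "markdown" and the prompt from the example above produces the same three fields shown there. The returned fields can be merged into the rest of the request body.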

Request Mode with Filtering then Extracting (2 credits)

{
  "url": "https://spa-website.com",
  "mode": "request",
  "excluded_tags": ["aside", "form", "nav"],
  "excluded_selectors": [".ads"],
  "proxy_type": "residential",
  "proxy_country": "US",
  "extract_scheme": [
    {
      "label": "main_content",
      "type": "nest",
      "selector": "div",
      "fields": [
        {
          "label": "headline",
          "type": "content",
          "xpath": "//h1"
        },
        {
          "label": "body_text",
          "type": "content",
          "selector": "p"
        },
        {
          "label": "external_link",
          "type": "attribute",
          "selector": "a",
          "attribute": "href"
        }
      ]
    }
  ]
}
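
A payload like the one above could be assembled and prepared for sending as follows. This Python sketch rests on several assumptions the text does not confirm: the endpoint URL is a placeholder, the use of POST with a JSON body is a guess based on the payload shape, and authentication is omitted entirely. It adds delivery=json so that extraction results are returned, and it builds the request object without sending it.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute the real endpoint from the API docs.
API_ENDPOINT = "https://api.example.com/scrape"


def build_request(payload: dict) -> urllib.request.Request:
    """Wrap a payload in a POST request with a JSON body (assumed transport)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "url": "https://spa-website.com",
    "mode": "request",
    "delivery": "json",  # needed to receive extraction results
    "excluded_tags": ["aside", "form", "nav"],
    "excluded_selectors": [".ads"],
    "extract_scheme": [
        {"label": "headline", "type": "content", "selector": "h1"},
    ],
}

request = build_request(payload)
# Sending is left out on purpose; urllib.request.urlopen(request) would
# perform the call once the real endpoint and credentials are in place.
```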