Extraction

Reliable structured extraction is available through CSS, XPath, regex, and LLMs.

CSS, XPath, and regex extraction is configured through the extract_scheme or scheme_id parameter. An extraction scheme is a list of objects, each defining a data/area “bucket.” Each object acts as either a Rule (to pull data) or a Container (to scope data).

⚠️

Note: extract_scheme can be defined inline with each scraper request or loaded from a presaved configuration via scheme_id. See the “Scheme” tab.

Defining an `extract_scheme`

1 - Select the data or container with CSS, XPath, or regex locators

2 - Nest objects to scope extraction to subsections, if needed

3 - Configure the field type for each rule (content/attribute/exists/count)
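
The three steps can be sketched in Python as plain dictionaries. This is only an illustration: the helper names `rule` and `container` are hypothetical (not part of the API), and the labels and selectors such as `div.product` are made up for the example.

```python
import json

def rule(label, type_, **locator):
    """Build a Rule object: a locator (selector, xpath, or regex) plus a field type."""
    return {"label": label, "type": type_, **locator}

def container(label, fields, **locator):
    """Build a Container object: a `nest` entry that scopes its child fields."""
    return {"label": label, "type": "nest", **locator, "fields": fields}

# Step 1: locate the repeated container. Step 2: nest rules inside it.
# Step 3: pick a field type for each rule.
scheme = [
    container(
        "products",
        selector="div.product",  # hypothetical selector
        fields=[
            rule("name", "content", selector="h3"),
            rule("image", "attribute", selector="img", attribute="src"),
            rule("sold_out", "exists", selector=".sold-out"),
            rule("review_count", "count", selector="li.review"),
        ],
    )
]

print(json.dumps({"extract_scheme": scheme}, indent=2))
```

Building the scheme in code rather than by hand makes it easier to reuse the same container with different field lists.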

Extraction Parameters

Parameter	Type	Description
`label`	String	The key name for the field in the resulting JSON.
`type`	Enum	`content` (text), `attribute` (links/src), `exists` (bool), `count` (int), or `nest`.
`selector`	CSS	A standard CSS selector to target the element.
`xpath`	XPath	A standard XPath expression for complex navigation.
`regex`	Pattern	A Regular Expression to extract text from the raw HTML/text.
`attribute`	String	Required if type=“attribute”. Specify the target (e.g., `href`, `src`).
`fields`	Array	Required if type=“nest”. A recursive list of extraction objects.
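
The "Required if" rules in the table can be checked locally before a request is sent. The following is a minimal sketch, not an official validator: `validate_scheme` is a hypothetical helper, and the check that every object carries at least one locator is an assumption, since the table does not state it.

```python
LOCATOR_KEYS = ("selector", "xpath", "regex")
FIELD_TYPES = ("content", "attribute", "exists", "count", "nest")

def validate_scheme(entries, path="extract_scheme"):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, entry in enumerate(entries):
        where = f"{path}[{i}]"
        if not entry.get("label"):
            problems.append(f"{where}: missing label")
        if entry.get("type") not in FIELD_TYPES:
            problems.append(f"{where}: unknown type {entry.get('type')!r}")
        # Assumption: each object needs at least one locator key.
        if not any(key in entry for key in LOCATOR_KEYS):
            problems.append(f"{where}: no selector, xpath, or regex")
        if entry.get("type") == "attribute" and not entry.get("attribute"):
            problems.append(f"{where}: type 'attribute' requires 'attribute'")
        if entry.get("type") == "nest":
            fields = entry.get("fields")
            if not isinstance(fields, list) or not fields:
                problems.append(f"{where}: type 'nest' requires a non-empty 'fields' array")
            else:
                # Recurse so nested containers are checked with a full path.
                problems.extend(validate_scheme(fields, f"{where}.fields"))
    return problems

bad = [{"label": "link", "type": "attribute", "selector": "a"}]
print(validate_scheme(bad))
```

Here the `link` entry is reported because it asks for an attribute without naming one.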

Below is a standard configuration for extracting a list of blog posts from a page:

{
  "extract_scheme": [
    {
      "label": "blog_posts",
      "type": "nest",
      "selector": "article.post-card",
      "fields": [
        {
          "label": "title",
          "type": "content",
          "selector": "h2"
        },
        {
          "label": "link",
          "type": "attribute",
          "selector": "a.main-link",
          "attribute": "href"
        },
        {
          "label": "author_via_xpath",
          "type": "content",
          "xpath": ".//span[contains(@class, 'author-name')]" 
        }
      ]
    }
  ]
}

⚠️

Set delivery=json to receive the extraction results. See the “Output Formats” tab.
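
As a rough sketch, the request parameters can be assembled in Python with `delivery` set to `json`. Several things here are assumptions: that the parameters travel together as one JSON object, that exactly one of `extract_scheme` or `scheme_id` is passed, and the `scheme_id` value `"blog-v1"`, which is invented. The endpoint and authentication are not covered in this section, so nothing is sent.

```python
import json

def build_payload(url, scheme=None, scheme_id=None):
    """Assemble request parameters from an inline scheme or a saved scheme_id."""
    # Assumption: a request carries one of the two, never both.
    if (scheme is None) == (scheme_id is None):
        raise ValueError("pass exactly one of scheme or scheme_id")
    payload = {"url": url, "delivery": "json"}
    if scheme is not None:
        payload["extract_scheme"] = scheme
    else:
        payload["scheme_id"] = scheme_id
    return payload

inline = build_payload(
    "https://example.com/blog",
    scheme=[{"label": "title", "type": "content", "selector": "h2"}],
)
saved = build_payload("https://example.com/blog", scheme_id="blog-v1")  # hypothetical id

print(json.dumps(inline, indent=2))
print(json.dumps(saved, indent=2))
```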

Hybrid Targeting: You can mix and match locator types. For example, you can use a selector to find a parent container and then use regex or xpath to find specific data inside it.
Use regex for raw strings: Best when data isn’t wrapped in its own tag (e.g., pulling a price out of a string like “Price: $45.00 only today”).
Use type: "nest" for lists: Use this for repeated elements like product grids to keep your JSON structured.
Use type: "exists" for validation: Perfect for checking if a “Sold Out” badge is present without needing the actual text.
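
A regex pattern can be tried locally before it goes into a scheme. The sketch below runs the price example through Python's `re` module. Whether the service returns the whole match or the first capture group is not stated here, so treat the capture group as an assumption. The `type` used for the regex rule, the container selector, and the labels are also illustrative.

```python
import re

text = "Price: $45.00 only today"

# Escaped dollar sign, then digits with an optional two-digit decimal part.
price_pattern = r"\$(\d+(?:\.\d{2})?)"

match = re.search(price_pattern, text)
price = match.group(1) if match else None
print(price)  # 45.00

# Hybrid targeting: a CSS selector scopes the container, a regex pulls the value.
hybrid_scheme = [
    {
        "label": "products",
        "type": "nest",
        "selector": "div.product",  # hypothetical selector
        "fields": [
            {"label": "price", "type": "content", "regex": price_pattern},
            {"label": "sold_out", "type": "exists", "selector": ".sold-out"},
        ],
    }
]
```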

Request Mode with Filtering then Extracting (2 credits)

{
  "url": "https://spa-website.com",
  "mode": "request",
  "excluded_tags": ["aside", "form", "nav"],
  "excluded_selectors": [".ads"],
  "proxy_type": "residential",
  "proxy_country": "US",
  "extract_scheme": [
    {
      "label": "main_content",
      "type": "nest",
      "selector": "div",
      "fields": [
        {
          "label": "headline",
          "type": "content",
          "xpath": ".//h1"
        },
        {
          "label": "body_text",
          "type": "content",
          "selector": "p"
        },
        {
          "label": "external_link",
          "type": "attribute",
          "selector": "a",
          "attribute": "href"
        }
      ]
    }
  ]
}
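
When XPath is used inside a nested container, it helps to keep in mind that standard XPath treats a leading `//` as a search from the document root, while `.//` stays within the current node. The blog post example earlier uses the `.//` form. How the service resolves each form is not documented in this section, so the following is only a local illustration using Python's standard library on made-up markup.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<body>"
    "<h1>Site title</h1>"
    "<div class='post'><h1>First post</h1><p>Hello</p></div>"
    "<div class='post'><h1>Second post</h1><p>World</p></div>"
    "</body>"
)

# Scoped search: one headline per container, found relative to that container.
results = []
for box in page.findall(".//div[@class='post']"):
    headline = box.find(".//h1")
    results.append(headline.text)

# Unscoped search over the whole page also picks up the site title.
all_h1 = [h.text for h in page.findall(".//h1")]

print(results)  # ['First post', 'Second post']
print(all_h1)   # ['Site title', 'First post', 'Second post']
```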

AI Extraction

Parameter	Type	Default	Required When	Description
`ai_enhance`	boolean	`false`	-	Enable AI processing
`ai_source`	string	-	ai_enhance=true	Source: `markdown` or `screenshot`
`ai_prompt`	string	-	No	Custom prompt for AI processing
`ai_force_json`	boolean	`true`	No	Force JSON output format

AI Model: Google Gemini 2.0 Flash
Cost: +30 credits (additive)

Example:

{
  "ai_enhance": true,
  "ai_source": "markdown",
  "ai_prompt": "Extract product name, price, and availability as JSON"
}

See AI Enhancement for detailed examples.
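
The parameter rules and the additive cost can be sketched as follows. The helper names are hypothetical, and the base cost of 2 credits is taken from the request-mode example above purely for illustration. Actual billing may depend on other options.

```python
AI_SURCHARGE_CREDITS = 30  # "+30 credits (additive)" from this section

def ai_params(source, prompt=None, force_json=True):
    """Build the AI parameters; ai_source is required once ai_enhance is true."""
    if source not in ("markdown", "screenshot"):
        raise ValueError("ai_source must be 'markdown' or 'screenshot'")
    params = {"ai_enhance": True, "ai_source": source, "ai_force_json": force_json}
    if prompt is not None:
        params["ai_prompt"] = prompt
    return params

def total_credits(base_credits, ai_enhance):
    """Add the AI surcharge on top of the base cost of the request."""
    return base_credits + (AI_SURCHARGE_CREDITS if ai_enhance else 0)

params = ai_params("markdown", "Extract product name, price, and availability as JSON")
print(params)
print(total_credits(2, True))  # 32
```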
