Domain Crawling

Our Domain Crawling API discovers URLs from any domain. Find and filter thousands of URLs in seconds, then optionally validate and scrape them all in one request—perfect for content audits, competitor analysis, and bulk data collection.

Why Domain Crawling?

Automated URL Discovery No need to manually browse websites or maintain URL lists. Our API discovers URLs from sitemaps and Common Crawl, giving broad coverage of any domain.

Smart Validation Optionally validate that discovered URLs are still accessible with parallel HEAD checks. Filter out broken links before you scrape, saving time and credits.

Integrated Scraping Combine discovery with scraping in a single request. Discovered URLs are automatically queued for scraping with your configured settings, respecting your concurrency limits.

Flexible Filtering Use regex patterns to target specific URL structures—find only blog posts, product pages, or any URL pattern you need.
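
For example, a request that limits discovery to blog posts might look like the sketch below; the regex value is purely illustrative, so adjust it to the URL structure you are targeting:

# "/blog/" is an illustrative url_pattern regex, not a required value
curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "max_urls": 200,
    "url_pattern": "/blog/"
  }'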

Parameters Overview

Parameter Type Required Description
domain string Yes Domain to crawl (e.g., “example.com”)
sources array No Discovery sources: ["sitemap"], ["commoncrawl"], or both. Default: both
max_urls integer Yes Maximum URLs to return (1-10,000)
check_if_live boolean No Validate URLs with HEAD checks. Default: true
url_pattern string No Regex pattern to filter URLs during discovery
scraper_config object No Scraper configuration applied to discovered URLs
async boolean No Process in background and return a task ID. Default: false
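
As a sketch of the scraper_config parameter, the request below queues every discovered URL for scraping; the nested "format" option is an illustrative placeholder, since the supported settings are defined by the Scraper API rather than this endpoint:

# The scraper_config contents are illustrative; check the Scraper API docs
# for the options your subscription actually supports.
curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "max_urls": 50,
    "scraper_config": {
      "format": "markdown"
    }
  }'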

Response Formats

Basic Discovery:

{
  "success": true,
  "domain": "example.com",
  "discovered_count": 150,
  "results": [
    {
      "url": "https://example.com/page1",
      "source": "sitemap"
    },
    {
      "url": "https://example.com/page2",
      "source": "commoncrawl"
    }
  ],
  "credits_used": 77.0,
  "credits_remaining": 923.0
}

With Scraper Integration:

{
  "success": true,
  "domain": "example.com",
  "discovered_count": 50,
  "scraper_tasks_submitted": 50,
  "results": [
    {
      "url": "https://example.com/page1",
      "source": "sitemap",
      "scrape_task_id": "abc-123",
      "scrape_check_url": "/api/v1/scraper/tasks/abc-123"
    }
  ],
  "credits_used": 27.0,
  "credits_remaining": 973.0
}

Each discovered URL with scraper integration includes a scrape_task_id for tracking individual scrape jobs.
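
You can then poll each job at its scrape_check_url; the request below uses the illustrative task ID from the response above, and the body it returns follows the Scraper API's task format:

# "abc-123" is the example scrape_task_id shown above
curl "https://scrape.evomi.com/api/v1/scraper/tasks/abc-123" \
  -H "x-api-key: YOUR_API_KEY"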

Error Handling

The API uses standard HTTP status codes:

Status Meaning Action
200 Success Results are ready
202 Accepted Async task is processing
400 Bad request Check your parameters
401 Unauthorized Verify your API key
402 Insufficient credits Add credits to your account
404 Task not found Invalid task ID
429 Rate limit exceeded Wait and retry

Sync vs Async Modes

Synchronous Mode (default) Returns results immediately. Best for small to medium domains (up to ~1000 URLs).

curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "max_urls": 500
  }'

Asynchronous Mode Returns a task ID immediately. Poll the status endpoint to retrieve results when ready. Best for large domains (1000+ URLs) or when integrating scraping.

curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "max_urls": 5000,
    "async": true
  }'

Response (202 Accepted):

{
  "success": true,
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "processing",
  "message": "Crawl task submitted for background processing",
  "check_url": "/api/v1/scraper/crawl/tasks/550e8400-e29b-41d4-a716-446655440000",
  "credits_reserved": 2502.5
}

Check Status:

curl "https://scrape.evomi.com/api/v1/scraper/crawl/tasks/550e8400-e29b-41d4-a716-446655440000" \
  -H "x-api-key: YOUR_API_KEY"

Transparent Pricing

Operation Cost
Sitemap discovery 2 credits
Common Crawl discovery 2 credits
URL validation (per URL) 0.5 credits

Examples:

  • Discover 100 URLs from sitemap + validate = 2 + (100 × 0.5) = 52 credits
  • Discover 500 URLs from both sources + validate = 4 + (500 × 0.5) = 254 credits
  • Discover 1000 URLs, no validation = 2-4 credits (2 credits per discovery source used)
⚠️
Scraping Integration: When using scraper_config to automatically scrape discovered URLs, each scrape is charged separately according to Scraper API pricing. Discovery costs and scraping costs are billed independently.
ℹ️
Smart Refunds: If fewer URLs are discovered than your max_urls limit, you’re only charged for actual validation performed. For example, if you request 1000 URLs but only 300 are found, you pay for 300 validations, not 1000.

Quick Start

Discover all URLs from a domain:

ℹ️
Your API key is available in the Evomi Dashboard. You need an active Scraper API subscription or credits.
curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "sources": ["sitemap"],
    "max_urls": 100,
    "check_if_live": true
  }'

Response:

{
  "success": true,
  "domain": "example.com",
  "discovered_count": 87,
  "results": [
    {
      "url": "https://example.com/",
      "source": "sitemap"
    },
    {
      "url": "https://example.com/about",
      "source": "sitemap"
    }
  ],
  "credits_used": 45.5,
  "credits_remaining": 954.5
}
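
Because the response is plain JSON, you can pipe it through jq (assuming it is installed) to save the discovered URLs for later use:

curl -s -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"domain": "example.com", "sources": ["sitemap"], "max_urls": 100}' \
  | jq -r '.results[].url' > urls.txt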

Base URL

https://scrape.evomi.com

All API requests use this base URL with the /api/v1/scraper/crawl endpoint.

Common Use Cases

Content Audits Discover all pages on your site to ensure completeness, check for broken links, or verify sitemap accuracy.

Competitor Analysis Map out competitor websites—find all product pages, blog posts, or landing pages for market research.

SEO Monitoring Track which URLs are indexed, validate they’re accessible, and monitor for 404 errors or redirects.

Bulk Data Collection Discover thousands of URLs and automatically scrape them with a single API call—perfect for large-scale data gathering.

Archive Research Access historical URLs from Common Crawl that may no longer appear in current sitemaps.

Discovery Sources

Sitemaps

Fast and structured. We automatically discover sitemaps from:

  • Common locations (/sitemap.xml, /sitemap_index.xml)
  • Robots.txt declarations
  • Sitemap indexes (recursive parsing)

Best for: Active websites with maintained sitemaps, complete coverage of current content.

Common Crawl

Historical web archive with billions of indexed URLs. Searches Common Crawl’s database for URLs matching your domain.

Best for: Finding historical URLs, discovering pages not in sitemaps, comprehensive competitor research.

ℹ️
You can use both sources in a single request for maximum coverage. The API automatically deduplicates URLs found in multiple sources.

Next Steps

Dive deeper into the Domain Crawling API with the Usage Examples.

⚠️
Start with small max_urls values to understand costs before scaling. Use check_if_live: false for initial discovery, then validate only the URLs you need.
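
For example, a low-cost first pass without validation could look like this; once you know which URLs matter, narrow the crawl with url_pattern and enable check_if_live on a follow-up request:

curl -X POST "https://scrape.evomi.com/api/v1/scraper/crawl" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "max_urls": 100,
    "check_if_live": false
  }'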