Scrapy

This guide provides instructions on how to integrate Evomi’s proxies with Scrapy, a powerful Python framework for web crawling and scraping.

Prerequisites

Before you begin, ensure you have the following:

  1. Python installed on your system
  2. Scrapy installed in your project environment
  3. Your Evomi proxy credentials (username, password, host, and port)

Installation

If you haven’t already installed Scrapy, you can do so using pip. It’s recommended to install it within a virtual environment for your project:

pip install scrapy

For SOCKS proxy support, an additional package will be needed (see “Using SOCKS5 Proxies” below).

Configuration

Integrating Evomi proxies into your Scrapy project involves configuring your project’s settings.py file and creating a custom downloader middleware.

Step 1: Configure Proxy Settings in settings.py

Open your Scrapy project’s settings.py file (usually located at your_project_name/your_project_name/settings.py) and add your Evomi proxy details:

# your_project_name/settings.py

# Evomi Proxy Configuration
PROXY_HOST = "rp.evomi.com"  # e.g., Residential Proxy Host
PROXY_PORT = "1000"  # e.g., Residential Proxy Port (kept as a string so it drops straight into the proxy URL)
PROXY_USER = "your_username"
PROXY_PASS = "your_password_session-anychars_mode-speed"

# Choose the scheme for your proxy: 'http' for HTTP/HTTPS proxies, or 'socks5h' for SOCKS5
# If using SOCKS5, you'll also need to configure scrapy-socks (see below).
PROXY_SCHEME = "http"

# Construct the full proxy URL
if PROXY_USER and PROXY_PASS:
    PROXY_URL = f"{PROXY_SCHEME}://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
else:  # For proxies that might not require authentication
    PROXY_URL = f"{PROXY_SCHEME}://{PROXY_HOST}:{PROXY_PORT}"

# User-Agent (optional, but recommended)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Replace your_username, your_password_session-anychars_mode-speed, PROXY_HOST, PROXY_PORT, and PROXY_SCHEME with your specific Evomi proxy details and desired protocol.
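The simple f-string above works as long as your credentials contain only URL-safe characters. If a username or password ever includes reserved characters such as ':', '@', or '/', the resulting URL will be malformed, so it is safer to percent-encode the credentials. A minimal stand-alone sketch (the build_proxy_url helper is hypothetical, not part of Scrapy or Evomi) using the stdlib urllib.parse.quote:

```python
from urllib.parse import quote

def build_proxy_url(scheme, host, port, user=None, password=None):
    """Build a proxy URL, percent-encoding credentials so reserved
    characters like ':' or '@' in the password don't break the URL."""
    if user and password:
        return (f"{scheme}://{quote(user, safe='')}:"
                f"{quote(password, safe='')}@{host}:{port}")
    return f"{scheme}://{host}:{port}"  # unauthenticated proxies

print(build_proxy_url("http", "rp.evomi.com", "1000",
                      "your_username",
                      "your_password_session-anychars_mode-speed"))
# → http://your_username:your_password_session-anychars_mode-speed@rp.evomi.com:1000
```

Underscores and hyphens are unreserved, so typical Evomi-style passwords pass through unchanged; only genuinely reserved characters get encoded.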

Step 2: Create a Custom Proxy Middleware

Create a new file named middlewares.py inside your Scrapy project’s main application directory (e.g., your_project_name/your_project_name/middlewares.py), or add to it if it already exists.

Add the following EvomiProxyMiddleware class:

# your_project_name/middlewares.py

from scrapy.exceptions import NotConfigured
import logging

logger = logging.getLogger(__name__)

class EvomiProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        logger.info(f"EvomiProxyMiddleware initialized. Proxy URL: {self.proxy_url}")

    @classmethod
    def from_crawler(cls, crawler):
        proxy_url = crawler.settings.get('PROXY_URL')
        if not proxy_url:
            raise NotConfigured("PROXY_URL not set in settings.py")
        return cls(proxy_url)

    def process_request(self, request, spider):
        # Don't overwrite proxy if it's already set (e.g. by RetryMiddleware or per-request)
        if 'proxy' not in request.meta:
            request.meta['proxy'] = self.proxy_url
            # You can log this for debugging if needed:
            # spider.logger.debug(f'Using proxy {self.proxy_url} for request {request.url}')
        return None # Continue processing this request

Step 3: Enable the Custom Middleware in settings.py

Now, go back to your settings.py file and enable EvomiProxyMiddleware by adding it to the DOWNLOADER_MIDDLEWARES setting. Give it an order lower than Scrapy’s built-in HttpProxyMiddleware (which runs at 750), so that your middleware has set request.meta['proxy'] by the time the built-in middleware reads it.

# your_project_name/settings.py (continued)

DOWNLOADER_MIDDLEWARES = {
    # Replace 'your_project_name' with the actual name of your Scrapy project module
    'your_project_name.middlewares.EvomiProxyMiddleware': 350, # Adjust order as needed
    # Scrapy's built-in HttpProxyMiddleware is usually at 750 and will use request.meta['proxy']
}

Make sure to replace your_project_name with the actual Python module name of your Scrapy project (the one containing settings.py and middlewares.py).

Example Spider

Here’s a simple Scrapy spider that you can use to test your proxy configuration. Create it in your project’s spiders directory (e.g., your_project_name/spiders/ip_checker.py):

# your_project_name/spiders/ip_checker.py
import scrapy

class IPCheckSpider(scrapy.Spider):
    name = 'ipcheck'
    start_urls = ['https://ip.evomi.com/s'] # This site returns the request's IP address

    def parse(self, response):
        self.logger.info(f"Visited {response.url} using proxy, got IP: {response.text.strip()}")
        yield {
            'ip': response.text.strip(),
            'url': response.url
        }

Explanation

  1. settings.py Configuration:

    • PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS: These store your fundamental Evomi proxy credentials.
    • PROXY_SCHEME: This setting (e.g., 'http' or 'socks5h') determines the protocol part of the proxy URL.
    • PROXY_URL: This is constructed in settings.py using the above details, forming a complete URL like http://your_username:your_password_session-anychars_mode-speed@rp.evomi.com:1000. This is the URL your custom middleware will use.
    • USER_AGENT: It’s good practice to set a common browser User-Agent, as some websites may block default Scrapy User-Agents.
    • ROBOTSTXT_OBEY: Tells Scrapy to respect the robots.txt rules of websites.
  2. EvomiProxyMiddleware (in middlewares.py):

    • from_crawler(cls, crawler): This class method is called when Scrapy initializes the middleware. It reads the PROXY_URL string from your project settings.
    • __init__(self, proxy_url): The constructor stores the proxy_url.
    • process_request(self, request, spider): This method is called for each request Scrapy makes. It sets request.meta['proxy'] = self.proxy_url. Scrapy’s built-in HttpProxyMiddleware (which is enabled by default) will then detect and use the proxy specified in request.meta['proxy'].
  3. Enabling Middleware (DOWNLOADER_MIDDLEWARES):

    • By adding 'your_project_name.middlewares.EvomiProxyMiddleware': 350 to DOWNLOADER_MIDDLEWARES in settings.py, you tell Scrapy to use your custom middleware. The number 350 defines its processing order (lower numbers are processed earlier).
  4. IPCheckSpider: This spider makes a request to https://ip.evomi.com/s. The response from this site is simply the IP address from which the request was received. If the proxy is working correctly, this IP should be that of your Evomi proxy server.
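Because the test endpoint returns a bare IP address, the spider's parse method can also validate the body before trusting it, which makes a misconfigured proxy fail loudly instead of silently yielding an error page. A sketch using the stdlib ipaddress module (the looks_like_ip helper is hypothetical):

```python
import ipaddress

def looks_like_ip(text):
    """Return True if the response body is a bare IPv4/IPv6 address,
    as the IP-echo endpoint returns on success."""
    try:
        ipaddress.ip_address(text.strip())
        return True
    except ValueError:
        return False

print(looks_like_ip("203.0.113.7\n"))       # → True
print(looks_like_ip("<html>error</html>"))  # → False
```

Inside the spider you would call this on response.text and, for example, log a warning or raise when it returns False.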

Evomi Proxy Endpoints & SOCKS5 Configuration

Adjust PROXY_HOST and PROXY_PORT in settings.py based on the Evomi product you are using.

Common Endpoints:

  • Residential Proxies: PROXY_HOST = "rp.evomi.com", PROXY_PORT = "1000"
  • Mobile Proxies: PROXY_HOST = "mp.evomi.com", PROXY_PORT = "3000"
  • Datacenter Proxies: PROXY_HOST = "dcp.evomi.com", PROXY_PORT = "2000"

(Always refer to your Evomi dashboard for the most current endpoint details.)
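If you switch between Evomi products often, the endpoints above can live in a single lookup table in settings.py so that changing product means changing one key. A minimal sketch (the EVOMI_ENDPOINTS name is illustrative; verify the values against your dashboard):

```python
# Endpoint lookup for the products listed above.
EVOMI_ENDPOINTS = {
    "residential": ("rp.evomi.com", "1000"),
    "mobile":      ("mp.evomi.com", "3000"),
    "datacenter":  ("dcp.evomi.com", "2000"),
}

PROXY_HOST, PROXY_PORT = EVOMI_ENDPOINTS["residential"]
print(f"{PROXY_HOST}:{PROXY_PORT}")  # → rp.evomi.com:1000
```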

Using SOCKS5 Proxies:

If your Evomi product uses a SOCKS5 proxy (e.g., rp.evomi.com:1002), you need to:

  1. Install scrapy-socks:

    pip install scrapy-socks
  2. Update settings.py:

    • Set the PROXY_SCHEME: PROXY_SCHEME = "socks5h" (or socks5 if DNS resolution should be local).
    • The PROXY_URL will then be constructed like socks5h://your_username:your_password_session-anychars_mode-speed@rp.evomi.com:1002.
    • Add DOWNLOAD_HANDLERS to tell Scrapy to use scrapy-socks for handling requests:
      # settings.py (for SOCKS5 proxies)
      # ... (PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS, PROXY_SCHEME='socks5h', PROXY_URL)
      
      DOWNLOAD_HANDLERS = {
          "http": "scrapy_socks.handlers.http.SOCKSDownloadHandler",
          "https": "scrapy_socks.handlers.http.SOCKSDownloadHandler",
      }

    Your EvomiProxyMiddleware will still set request.meta['proxy'] with the SOCKS URL, and the SOCKSDownloadHandler will process it.

Running the Spider

Navigate to your Scrapy project’s root directory in the terminal (the one containing scrapy.cfg) and run:

scrapy crawl ipcheck

If configured correctly, the output will include log messages from the spider showing the IP address of your Evomi proxy.

Tips and Troubleshooting

  • Credentials: Double-check PROXY_USER, PROXY_PASS, PROXY_HOST, PROXY_PORT, and PROXY_SCHEME in settings.py. Incorrect details are the most common source of errors.
  • Password Format: The proxy password (e.g., your_password_session-anychars_mode-speed) might include session parameters. Ensure you replace your_password correctly while keeping any additional required parts of the string intact.
  • Project Name: Ensure your_project_name in DOWNLOADER_MIDDLEWARES matches your project’s actual Python module name.
  • SSL/TLS Issues: Scrapy generally handles HTTPS connections well. If you face SSL errors with specific target websites (not the proxy itself), you might need to customize the SSL/TLS context factory using the DOWNLOADER_CLIENTCONTEXTFACTORY setting in settings.py. This is an advanced topic; consult the Scrapy documentation for details.
  • User-Agent: Always set a realistic USER_AGENT in settings.py (as shown in the example) or per-request to avoid being blocked by websites.
  • ROBOTSTXT_OBEY: While set to True by default in new Scrapy projects and in this guide for good practice, ensure this aligns with your scraping policy and the terms of service of the websites you target.
  • Dynamic Content: Scrapy fetches raw HTML. If a site relies heavily on JavaScript to load content, Scrapy alone might not be sufficient. For such cases, consider tools like Selenium or Playwright, potentially integrated with Scrapy (e.g., using scrapy-playwright or scrapy-selenium).
  • Debug Logging: For more detailed output, you can set LOG_LEVEL = 'DEBUG' in your settings.py.
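Proxy connections can occasionally time out or return transient errors, and Scrapy's built-in retry machinery can absorb most of these. A sketch of optional settings.py hardening (the specific values are illustrative, not Evomi requirements):

```python
# settings.py (optional hardening for flaky connections)
RETRY_ENABLED = True
RETRY_TIMES = 3        # retry each failed request up to 3 times
DOWNLOAD_TIMEOUT = 30  # seconds before a request is considered failed
# HTTP status codes that trigger a retry (includes 429 Too Many Requests)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```

Scrapy's RetryMiddleware reads these settings automatically; no extra middleware is needed.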

By following this guide, you should be able to successfully integrate Evomi’s proxies with your Scrapy projects for efficient and reliable web scraping.