Scrapy
This guide provides instructions on how to integrate Evomi’s proxies with Scrapy, a powerful Python framework for web crawling and scraping.
Prerequisites
Before you begin, ensure you have the following:
- Python installed on your system
- Scrapy installed in your project environment
- Your Evomi proxy credentials (username, password, host, and port)
Installation
If you haven’t already installed Scrapy, you can do so using pip. It’s recommended to install it within a virtual environment for your project:
pip install scrapy
For SOCKS proxy support, an additional package will be needed (see “Using SOCKS5 Proxies” below).
Configuration
Integrating Evomi proxies into your Scrapy project involves configuring your project’s `settings.py` file and creating a custom downloader middleware.
Step 1: Configure Proxy Settings in settings.py
Open your Scrapy project’s `settings.py` file (usually located at `your_project_name/your_project_name/settings.py`) and add your Evomi proxy details:
```python
# your_project_name/settings.py

# Evomi Proxy Configuration
PROXY_HOST = "rp.evomi.com"  # e.g., Residential Proxy Host
PROXY_PORT = "1000"          # e.g., Residential Proxy Port (kept as a string; it is interpolated into the URL)
PROXY_USER = "your_username"
PROXY_PASS = "your_password_session-anychars_mode-speed"

# Choose the scheme for your proxy: 'http' for HTTP/HTTPS proxies, or 'socks5h' for SOCKS5.
# If using SOCKS5, you'll also need to configure scrapy-socks (see below).
PROXY_SCHEME = "http"

# Construct the full proxy URL
if PROXY_USER and PROXY_PASS:
    PROXY_URL = f"{PROXY_SCHEME}://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
else:  # For proxies that don't require authentication
    PROXY_URL = f"{PROXY_SCHEME}://{PROXY_HOST}:{PROXY_PORT}"

# User-Agent (optional, but recommended)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```
Replace `your_username`, `your_password_session-anychars_mode-speed`, `PROXY_HOST`, `PROXY_PORT`, and `PROXY_SCHEME` with your specific Evomi proxy details and desired protocol.
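As a quick sanity check, the constructed `PROXY_URL` can be parsed back with Python’s standard library to confirm each component ends up where Scrapy expects it (the credential values below are placeholders, not real Evomi credentials):

```python
from urllib.parse import urlsplit

# Placeholder values mirroring the settings above
PROXY_SCHEME = "http"
PROXY_USER = "your_username"
PROXY_PASS = "your_password_session-anychars_mode-speed"
PROXY_HOST = "rp.evomi.com"
PROXY_PORT = "1000"

PROXY_URL = f"{PROXY_SCHEME}://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

# Split the URL back into its components to verify the construction
parts = urlsplit(PROXY_URL)
print(parts.scheme)    # http
print(parts.hostname)  # rp.evomi.com
print(parts.port)      # 1000
print(parts.username)  # your_username
```

If any component comes out wrong here (for instance, a password containing `@` or `:` shifting the host), that is a sign the credentials need URL-encoding before being placed in the URL.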
Step 2: Create a Custom Proxy Middleware
Create a new file named `middlewares.py` inside your Scrapy project’s main application directory (e.g., `your_project_name/your_project_name/middlewares.py`), or add to it if it already exists. Add the following `EvomiProxyMiddleware` class:
```python
# your_project_name/middlewares.py
import logging

from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class EvomiProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url
        logger.info(f"EvomiProxyMiddleware initialized. Proxy URL: {self.proxy_url}")

    @classmethod
    def from_crawler(cls, crawler):
        proxy_url = crawler.settings.get('PROXY_URL')
        if not proxy_url:
            raise NotConfigured("PROXY_URL not set in settings.py")
        return cls(proxy_url)

    def process_request(self, request, spider):
        # Don't overwrite the proxy if it's already set (e.g., by RetryMiddleware or per-request)
        if 'proxy' not in request.meta:
            request.meta['proxy'] = self.proxy_url
            # You can log this for debugging if needed:
            # spider.logger.debug(f'Using proxy {self.proxy_url} for request {request.url}')
        return None  # Continue processing this request
```
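The proxy-setting logic itself can be exercised outside a running crawl with a simple stand-in request object. This is a sketch for illustration only: `FakeRequest` is not part of Scrapy, and the class body is repeated so the snippet is self-contained.

```python
# Stand-in for scrapy.Request, with just enough surface to test process_request
class FakeRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


# Same proxy-setting logic as the middleware above, minus the Scrapy plumbing
class EvomiProxyMiddleware:
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            request.meta['proxy'] = self.proxy_url
        return None


mw = EvomiProxyMiddleware("http://user:[email protected]:1000")

# A fresh request gets the configured proxy:
req = FakeRequest("https://example.com")
mw.process_request(req, spider=None)
print(req.meta['proxy'])  # http://user:[email protected]:1000

# A request that already carries a proxy is left alone:
pinned = FakeRequest("https://example.com", meta={'proxy': 'http://other-proxy:8080'})
mw.process_request(pinned, spider=None)
print(pinned.meta['proxy'])  # http://other-proxy:8080
```

The second case is the reason for the `'proxy' not in request.meta` guard: it lets individual requests (or other middlewares) pin their own proxy without being overwritten.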
Step 3: Enable the Custom Middleware in settings.py
Now, go back to your `settings.py` file and enable `EvomiProxyMiddleware` by adding it to the `DOWNLOADER_MIDDLEWARES` setting. Order it before Scrapy’s built-in `HttpProxyMiddleware` (which runs at order 750) so that your middleware sets the proxy first:
```python
# your_project_name/settings.py (continued)
DOWNLOADER_MIDDLEWARES = {
    # Replace 'your_project_name' with the actual name of your Scrapy project module
    'your_project_name.middlewares.EvomiProxyMiddleware': 350,  # Adjust order as needed
    # Scrapy's built-in HttpProxyMiddleware is usually at 750 and will use request.meta['proxy']
}
```
Make sure to replace `your_project_name` with the actual Python module name of your Scrapy project (the one containing `settings.py` and `middlewares.py`).
Example Spider
Here’s a simple Scrapy spider you can use to test your proxy configuration. Create it in your project’s `spiders` directory (e.g., `your_project_name/spiders/ip_checker.py`):
```python
# your_project_name/spiders/ip_checker.py
import scrapy


class IPCheckSpider(scrapy.Spider):
    name = 'ipcheck'
    start_urls = ['https://ip.evomi.com/s']  # This site returns the request's IP address

    def parse(self, response):
        self.logger.info(f"Visited {response.url} using proxy, got IP: {response.text.strip()}")
        yield {
            'ip': response.text.strip(),
            'url': response.url,
        }
```
Explanation
- `settings.py` Configuration:
  - `PROXY_HOST`, `PROXY_PORT`, `PROXY_USER`, `PROXY_PASS`: These store your fundamental Evomi proxy credentials.
  - `PROXY_SCHEME`: This setting (e.g., `'http'` or `'socks5h'`) determines the protocol part of the proxy URL.
  - `PROXY_URL`: Constructed in `settings.py` from the details above, forming a complete URL like `http://your_username:[email protected]:1000`. This is the URL your custom middleware will use.
  - `USER_AGENT`: It’s good practice to set a common browser User-Agent, as some websites may block default Scrapy User-Agents.
  - `ROBOTSTXT_OBEY`: Tells Scrapy to respect the `robots.txt` rules of websites.
- `EvomiProxyMiddleware` (in `middlewares.py`):
  - `from_crawler(cls, crawler)`: This class method is called when Scrapy initializes the middleware. It reads the `PROXY_URL` string from your project settings.
  - `__init__(self, proxy_url)`: The constructor stores the `proxy_url`.
  - `process_request(self, request, spider)`: This method is called for each request Scrapy makes. It sets `request.meta['proxy'] = self.proxy_url`. Scrapy’s built-in `HttpProxyMiddleware` (which is enabled by default) then detects and uses the proxy specified in `request.meta['proxy']`.
- Enabling the Middleware (`DOWNLOADER_MIDDLEWARES`): Adding `'your_project_name.middlewares.EvomiProxyMiddleware': 350` to `DOWNLOADER_MIDDLEWARES` in `settings.py` tells Scrapy to use your custom middleware. The number `350` defines its processing order (lower numbers are processed earlier).
- `IPCheckSpider`: This spider makes a request to `https://ip.evomi.com/s`, which simply returns the IP address from which the request was received. If the proxy is working correctly, this should be the IP of your Evomi proxy server.
Evomi Proxy Endpoints & SOCKS5 Configuration
Adjust `PROXY_HOST` and `PROXY_PORT` in `settings.py` based on the Evomi product you are using.
Common Endpoints:
- Residential Proxies: `PROXY_HOST = "rp.evomi.com"`, `PROXY_PORT = "1000"`
- Mobile Proxies: `PROXY_HOST = "mp.evomi.com"`, `PROXY_PORT = "3000"`
- Datacenter Proxies: `PROXY_HOST = "dcp.evomi.com"`, `PROXY_PORT = "2000"`
(Always refer to your Evomi dashboard for the most current endpoint details.)
Using SOCKS5 Proxies:
If your Evomi product uses a SOCKS5 proxy (e.g., `rp.evomi.com:1002`), you need to:
- Install `scrapy-socks`: `pip install scrapy-socks`
- Update `settings.py`:
  - Set `PROXY_SCHEME = "socks5h"` (or `socks5` if DNS resolution should be local).
  - The `PROXY_URL` will then be constructed like `socks5h://your_username:[email protected]:1002`.
  - Add `DOWNLOAD_HANDLERS` to tell Scrapy to use `scrapy-socks` for handling requests:

```python
# settings.py (for SOCKS5 proxies)
# ... (PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS, PROXY_SCHEME='socks5h', PROXY_URL)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_socks.handlers.http.SOCKSDownloadHandler",
    "https": "scrapy_socks.handlers.http.SOCKSDownloadHandler",
}
```

Your `EvomiProxyMiddleware` will still set `request.meta['proxy']` with the SOCKS URL, and the `SOCKSDownloadHandler` will process it.
Running the Spider
Navigate to your Scrapy project’s root directory in the terminal (the one containing `scrapy.cfg`) and run:

```shell
scrapy crawl ipcheck
```
If configured correctly, the output will include log messages from the spider showing the IP address of your Evomi proxy.
Tips and Troubleshooting
- Credentials: Double-check `PROXY_USER`, `PROXY_PASS`, `PROXY_HOST`, `PROXY_PORT`, and `PROXY_SCHEME` in `settings.py`. Incorrect details are the most common source of errors.
- Password Format: The proxy password (e.g., `your_password_session-anychars_mode-speed`) might include session parameters. Ensure you replace `your_password` correctly while keeping any additional required parts of the string intact.
- Project Name: Ensure `your_project_name` in `DOWNLOADER_MIDDLEWARES` matches your project’s actual Python module name.
- SSL/TLS Issues: Scrapy generally handles HTTPS connections well. If you face SSL errors with specific target websites (not the proxy itself), you might need to customize the SSL/TLS context factory using the `DOWNLOADER_CLIENTCONTEXTFACTORY` setting in `settings.py`. This is an advanced topic; consult the Scrapy documentation for details.
- User-Agent: Always set a realistic `USER_AGENT` in `settings.py` (as shown in the example) or per request to avoid being blocked by websites.
- `ROBOTSTXT_OBEY`: While set to `True` by default in new Scrapy projects and in this guide for good practice, ensure this aligns with your scraping policy and the terms of service of the websites you target.
- Dynamic Content: Scrapy fetches raw HTML. If a site relies heavily on JavaScript to load content, Scrapy alone might not be sufficient. For such cases, consider tools like Selenium or Playwright, potentially integrated with Scrapy (e.g., using `scrapy-playwright` or `scrapy-selenium`).
- Debug Logging: For more detailed output, set `LOG_LEVEL = 'DEBUG'` in your `settings.py`.
By following this guide, you should be able to successfully integrate Evomi’s proxies with your Scrapy projects for efficient and reliable web scraping.