Beautiful Soup
This guide provides instructions on how to use Evomi’s proxies with Beautiful Soup, a popular Python library for web scraping and parsing HTML and XML documents, in conjunction with the `requests` library for network communication.
Prerequisites
Before you begin, ensure you have the following:
- Python installed on your system.
- The Beautiful Soup (`beautifulsoup4`) and `requests` libraries installed.
- Your Evomi proxy credentials (username and password).
Installation
If you haven’t already installed Beautiful Soup and `requests`, you can do so using pip:

```shell
pip install beautifulsoup4 requests
```
For SOCKS proxy support, you’ll need an extra component (see “Tips and Troubleshooting”).
Configuration
To use Evomi proxies with Beautiful Soup, you’ll configure the `requests` library to route HTTP/HTTPS requests through your Evomi proxy.
Here’s a basic setup:
```python
import requests
from bs4 import BeautifulSoup

# Evomi proxy configuration
proxy_host = "rp.evomi.com"  # Example: Residential Proxy
proxy_port = "1000"
proxy_username = "your_username"
proxy_password = "your_password_session-anychars_mode-speed"

# Construct the proxy URL with authentication.
# For HTTP/HTTPS proxies:
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
# For SOCKS5 proxies, the scheme would be 'socks5h' (or 'socks5' if DNS resolution is local):
# proxy_url = f"socks5h://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"

# Proxy dictionary for the requests library
proxies = {
    "http": proxy_url,
    "https": proxy_url,  # Use the same proxy URL for both HTTP and HTTPS
}

# Target URL to fetch
url_to_scrape = "https://ip.evomi.com/s"  # This site shows your current IP

try:
    # Make a request through the proxy
    response = requests.get(url_to_scrape, proxies=proxies, timeout=10)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Print the IP address (which should be the proxy's IP)
    print(f"Response from {url_to_scrape} using proxy:")
    print(soup.get_text().strip())
except requests.exceptions.ProxyError as e:
    print(f"Proxy Error: {e}")
    print("Please check your proxy_host, proxy_port, proxy_username, and proxy_password.")
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
```
Replace `your_username` with your actual Evomi proxy username and `your_password` with your actual password.
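If your username or password contains URL-special characters such as `@` or `:`, embedding them raw in the proxy URL will break parsing. A minimal sketch of percent-encoding the credentials first (the password here is made up for illustration):

```python
from urllib.parse import quote

proxy_username = "your_username"
proxy_password = "p@ss:word_session-anychars_mode-speed"  # hypothetical password with special characters

# Percent-encode the credentials so '@' and ':' inside them
# don't get confused with the URL's own delimiters.
proxy_url = (
    f"http://{quote(proxy_username, safe='')}:"
    f"{quote(proxy_password, safe='')}@rp.evomi.com:1000"
)
print(proxy_url)
# -> http://your_username:p%40ss%3Aword_session-anychars_mode-speed@rp.evomi.com:1000
```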
Explanation
Let’s break down the key parts of this script:
- Import Libraries: `requests` for making HTTP/HTTPS requests, and `BeautifulSoup` (from `bs4`) for parsing HTML/XML content.
- Proxy Configuration: `proxy_host`, `proxy_port`, `proxy_username`, and `proxy_password` store your specific Evomi proxy credentials and server details.
- Proxy URL Construction: `proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"` creates the full proxy URL string, embedding your username and password for authentication. The format `http://user:pass@host:port` is standard for HTTP/HTTPS proxies. For SOCKS proxies, the scheme in the URL would change (e.g., `socks5h://...`).
- `proxies` Dictionary: `proxies = {"http": proxy_url, "https": proxy_url}` tells the `requests` library which proxy to use for HTTP and HTTPS traffic. Both are routed through the same `proxy_url`.
- Making the Request: `response = requests.get(url_to_scrape, proxies=proxies, timeout=10)` sends a GET request to `url_to_scrape`. The `proxies=proxies` argument tells `requests` to use the configured proxy, and `timeout=10` is good practice to prevent your script from hanging indefinitely. `response.raise_for_status()` checks whether the request was successful (status code 2xx); if not (e.g., 403 Forbidden, 401 Unauthorized, 500 Internal Server Error), it raises an `HTTPError`.
- Parsing Content: `soup = BeautifulSoup(response.content, 'html.parser')` parses the raw byte content (`response.content`) of the successful response with Python’s built-in HTML parser.
- Extracting Information: `print(soup.get_text().strip())` extracts all the text from the parsed HTML and removes leading/trailing whitespace. For `https://ip.evomi.com/s`, this will be the IP address seen by the server, which should be your proxy’s IP.
- Error Handling: The `try...except` block catches proxy connection issues (`requests.exceptions.ProxyError`), HTTP errors (`requests.exceptions.HTTPError`), and other request-related problems (`requests.exceptions.RequestException`).
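The parsing and extraction steps work independently of the network layer, so you can exercise them on a static HTML string before wiring in the proxy. A minimal sketch (the IP address is a made-up placeholder):

```python
from bs4 import BeautifulSoup

# Stand-in for the body of a proxied response from ip.evomi.com/s.
html = "<html><body><p>203.0.113.7</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# get_text() flattens the document to its text nodes,
# exactly as in the main script above.
print(soup.get_text().strip())  # -> 203.0.113.7
```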
Evomi Proxy Endpoints
Depending on the Evomi product and protocol you’re using, adjust the `proxy_host`, `proxy_port`, and the protocol scheme in your `proxy_url` (`http://`, `https://`, `socks5h://`).
Residential Proxies
- HTTP/HTTPS Proxy: `rp.evomi.com:1000` (use `http://user:[email protected]:1000` in `proxy_url`)
- SOCKS5 Proxy: `rp.evomi.com:1002` (use `socks5h://user:[email protected]:1002` in `proxy_url`)
Mobile Proxies
- HTTP/HTTPS Proxy: `mp.evomi.com:3000`
- SOCKS5 Proxy: `mp.evomi.com:3002`
Datacenter Proxies
- HTTP/HTTPS Proxy: `dcp.evomi.com:2000`
- SOCKS5 Proxy: `dcp.evomi.com:2002`
Note on HTTPS Proxies vs. HTTPS URLs:
The `proxies` dictionary routes all HTTPS traffic (e.g., `https://example.com`) through the specified `proxy_url`. If your `proxy_url` itself uses `http://user:pass@host:port`, this is standard: the proxy server is responsible for handling the onward SSL/TLS connection to the final HTTPS destination. Some Evomi endpoints might offer a specific port for proxies that primarily handle HTTPS outgoing traffic (e.g., `rp.evomi.com:1001`), but typically the standard HTTP proxy port (`:1000`) handles both.
Advanced Usage: Scraping Specific Elements
Here’s an example of scraping titles from a news site (fictional example):
```python
import requests
from bs4 import BeautifulSoup

# Evomi proxy configuration (ensure these are correctly set)
proxy_username = "your_username"
proxy_password = "your_password_session-anychars_mode-speed"
proxy_host = "rp.evomi.com"
proxy_port = "1000"
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
proxies = {"http": proxy_url, "https": proxy_url}

target_site_url = "https://www.example-news.com"  # Replace with a real site you have permission to scrape

def scrape_titles(url, proxies_config):
    try:
        headers = {  # It's good practice to set a User-Agent
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, proxies=proxies_config, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        titles = []
        # This is an example selector; you'll need to inspect the target site's HTML
        # to find the correct tags and classes for the elements you want.
        for h2_tag in soup.find_all('h2', class_='article-title'):  # Adjust selector as needed
            titles.append(h2_tag.get_text(strip=True))
        return titles
    except requests.exceptions.RequestException as e:
        print(f"Error during scraping {url}: {e}")
        return []

# --- Main part of the script ---
if __name__ == "__main__":
    # Replace your_username and your_password
    if "your_username" in proxy_url or "your_password" in proxy_url:
        print("Please replace 'your_username' and 'your_password' in the script.")
    else:
        article_titles = scrape_titles(target_site_url, proxies)
        if article_titles:
            print(f"Found titles on {target_site_url}:")
            for i, title in enumerate(article_titles[:5]):  # Print first 5 titles
                print(f"{i+1}. {title}")
        else:
            print(f"No titles found or error occurred for {target_site_url}.")
```
This script defines a function `scrape_titles` that takes a URL and proxy configuration, fetches the page, and then uses Beautiful Soup’s `find_all` method to locate HTML elements (e.g., `<h2>` tags with the class `article-title`) and extract their text. Remember to inspect the HTML structure of your target website to determine the correct selectors.
Tips and Troubleshooting
- Credentials & Endpoint: Always double-check `proxy_username`, `proxy_password`, `proxy_host`, `proxy_port`, and the protocol scheme in `proxy_url`. Incorrect details are the most common cause of issues.
- Password Format: The proxy password often includes session parameters (e.g., `_session-anychars_mode-speed`). Ensure you replace `your_password` but keep these additional parameters intact if they are part of your assigned password string.
- SOCKS Proxies:
  - To use SOCKS5 proxies with `requests`, you need to install the `PySocks` library: `pip install requests[socks]`
  - Then update your `proxy_url` to use the `socks5h://` (or `socks5://`) scheme:

    ```python
    proxy_url = f"socks5h://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
    # 'socks5h://' means DNS resolution happens on the proxy server side.
    # 'socks5://' means DNS resolution happens on the client side. 'socks5h' is often preferred.
    ```
- SSL Verification (`verify=False`): If you encounter SSL errors specifically from the target website (not the proxy connection itself), you might be tempted to use `verify=False` in `requests.get()`:

  ```python
  # response = requests.get(url, proxies=proxies, verify=False)
  ```

  Use this with extreme caution, as it disables SSL certificate verification and makes your connection insecure. It’s better to resolve underlying SSL issues (e.g., by ensuring your system’s root CAs are up to date).
- User-Agent: Some websites block requests that don’t have a common browser User-Agent. Set one via the `headers` parameter in `requests.get()`:

  ```python
  headers = {'User-Agent': 'Mozilla/5.0 ...'}
  response = requests.get(url, proxies=proxies, headers=headers)
  ```
- Error Handling: The provided examples include basic error handling. For production scrapers, implement more robust error checking, retries with backoff, and logging.
- Website Terms of Service: Always respect the `robots.txt` file and the terms of service of any website you scrape. Implement rate limiting and be considerate to avoid overloading servers or getting your proxy IP blocked.
- Dynamic Content (JavaScript): Beautiful Soup and `requests` only fetch the initial HTML content. If a website loads content dynamically using JavaScript, these tools won’t execute the JavaScript. For such sites, consider tools like Selenium (see our Selenium integration guide) or Playwright.
By following this guide, you should be able to successfully integrate Evomi’s proxies with Beautiful Soup and `requests` for your web scraping tasks.