Beautiful Soup

This guide provides instructions on how to use Evomi’s proxies with Beautiful Soup, a popular Python library for web scraping and parsing HTML and XML documents, in conjunction with the requests library for network communication.

Prerequisites

Before you begin, ensure you have the following:

  1. Python installed on your system.
  2. Beautiful Soup (beautifulsoup4) and requests libraries installed.
  3. Your Evomi proxy credentials (username and password).

Installation

If you haven’t already installed Beautiful Soup and requests, you can do so using pip:

pip install beautifulsoup4 requests

For SOCKS proxy support, you’ll need an extra component (see “Tips and Troubleshooting”).

Configuration

To use Evomi proxies with Beautiful Soup, you’ll configure the requests library to route HTTP/HTTPS requests through your Evomi proxy.

Here’s a basic setup:

import requests
from bs4 import BeautifulSoup

# Evomi proxy configuration
proxy_host = "rp.evomi.com"  # Example: Residential Proxy
proxy_port = "1000"
proxy_username = "your_username"
proxy_password = "your_password_session-anychars_mode-speed"

# Construct the proxy URL with authentication
# For HTTP/HTTPS proxies:
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
# For SOCKS5 proxies, the scheme would be 'socks5h' (or 'socks5' if DNS resolution is local)
# proxy_url = f"socks5h://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"


# Proxy dictionary for the requests library
proxies = {
    "http": proxy_url,
    "https": proxy_url  # Use the same proxy URL for both HTTP and HTTPS
}

# Target URL to fetch
url_to_scrape = "https://ip.evomi.com/s" # This site shows your current IP

try:
    # Make a request through the proxy
    response = requests.get(url_to_scrape, proxies=proxies, timeout=10) # Timeout prevents hanging indefinitely
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Print the IP address (which should be the proxy's IP)
    print(f"Response from {url_to_scrape} using proxy:")
    print(soup.get_text().strip())

except requests.exceptions.ProxyError as e:
    print(f"Proxy Error: {e}")
    print("Please check your proxy_host, proxy_port, proxy_username, and proxy_password.")
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Replace your_username and your_password with your actual Evomi proxy credentials.

Explanation

Let’s break down the key parts of this script:

  1. Import Libraries:
    • requests: For making HTTP/HTTPS requests.
    • BeautifulSoup (from bs4): For parsing HTML/XML content.
  2. Proxy Configuration:
    • proxy_host, proxy_port, proxy_username, proxy_password: These variables store your specific Evomi proxy credentials and server details.
  3. Proxy URL Construction:
    • proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}": This line creates the full proxy URL string, embedding your username and password for authentication. This format http://user:pass@host:port is standard for HTTP/HTTPS proxies.
    • For SOCKS proxies, the scheme in the URL would change (e.g., socks5h://...).
  4. proxies Dictionary:
    • proxies = {"http": proxy_url, "https": proxy_url}: The requests library uses this dictionary to determine which proxy to use for HTTP and HTTPS traffic. Both are routed through the same proxy_url.
  5. Making the Request:
    • response = requests.get(url_to_scrape, proxies=proxies, timeout=10): This sends a GET request to url_to_scrape.
      • proxies=proxies: This argument tells requests to use the configured proxy.
      • timeout=10: It’s good practice to add a timeout to prevent your script from hanging indefinitely.
    • response.raise_for_status(): This will check if the request was successful (status code 2xx). If not (e.g., 403 Forbidden, 401 Unauthorized, 500 Internal Server Error), it will raise an HTTPError.
  6. Parsing Content:
    • soup = BeautifulSoup(response.content, 'html.parser'): The raw byte content (response.content) of the successful response is parsed by BeautifulSoup using Python’s built-in HTML parser.
  7. Extracting Information:
    • print(soup.get_text().strip()): This extracts all the text from the parsed HTML and removes leading/trailing whitespace. For https://ip.evomi.com/s, this will be the IP address seen by the server, which should be your proxy’s IP.
  8. Error Handling:
    • The try...except block catches potential errors like proxy connection issues (requests.exceptions.ProxyError), HTTP errors (requests.exceptions.HTTPError), or other request-related problems (requests.exceptions.RequestException).

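Steps 6 and 7 can be tried without any network access by feeding Beautiful Soup a hard-coded byte string in place of response.content (the HTML below is invented purely for illustration):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for response.content
html = b"<html><body><p>  203.0.113.7  </p></body></html>"

# Same parsing step as in the script above
soup = BeautifulSoup(html, 'html.parser')

# get_text() collapses the document to its text; strip() trims surrounding whitespace
print(soup.get_text().strip())  # -> 203.0.113.7
```

This makes it easy to experiment with get_text(), find(), and find_all() before pointing the script at a live site through the proxy.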
Evomi Proxy Endpoints

Depending on the Evomi product and protocol you’re using, adjust the proxy_host, proxy_port, and the protocol scheme in your proxy_url (http://, https://, socks5h://).

Residential Proxies

  • HTTP/HTTPS Proxy: rp.evomi.com:1000 (Use http://user:pass@rp.evomi.com:1000 in proxy_url)
  • SOCKS5 Proxy: rp.evomi.com:1002 (Use socks5h://user:pass@rp.evomi.com:1002 in proxy_url)

Mobile Proxies

  • HTTP/HTTPS Proxy: mp.evomi.com:3000
  • SOCKS5 Proxy: mp.evomi.com:3002

Datacenter Proxies

  • HTTP/HTTPS Proxy: dcp.evomi.com:2000
  • SOCKS5 Proxy: dcp.evomi.com:2002

Note on HTTPS Proxies vs. HTTPS URLs: The proxies dictionary routes all HTTPS traffic (e.g., https://example.com) through the specified proxy_url. If your proxy_url itself uses http://user:pass@host:port, this is standard. The proxy server is responsible for handling the onward SSL/TLS connection to the final HTTPS destination. Some Evomi endpoints might offer a specific port for proxies that primarily handle HTTPS outgoing traffic (e.g., rp.evomi.com:1001), but typically the standard HTTP proxy port (:1000) handles both.
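The endpoint lists above can be wrapped in a small helper so a single function produces the correct proxies dictionary for any product/protocol combination. This is a sketch based only on the hosts and ports listed above; the EVOMI_ENDPOINTS table and build_proxies function are illustrative helpers, not an official API:

```python
# Hosts and ports taken from the endpoint lists above: (host, HTTP port, SOCKS5 port)
EVOMI_ENDPOINTS = {
    "residential": ("rp.evomi.com", 1000, 1002),
    "mobile":      ("mp.evomi.com", 3000, 3002),
    "datacenter":  ("dcp.evomi.com", 2000, 2002),
}

def build_proxies(product, username, password, socks=False):
    """Return a requests-style proxies dict for the given Evomi product."""
    host, http_port, socks_port = EVOMI_ENDPOINTS[product]
    if socks:
        url = f"socks5h://{username}:{password}@{host}:{socks_port}"
    else:
        url = f"http://{username}:{password}@{host}:{http_port}"
    return {"http": url, "https": url}

# Example: residential HTTP proxy
proxies = build_proxies("residential", "your_username", "your_password")
```

The returned dictionary can be passed straight to requests.get(..., proxies=proxies).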

Advanced Usage: Scraping Specific Elements

Here’s an example of scraping titles from a news site (fictional example):

import requests
from bs4 import BeautifulSoup

# Evomi proxy configuration (ensure these are correctly set)
proxy_username = "your_username"
proxy_password = "your_password_session-anychars_mode-speed"
proxy_host = "rp.evomi.com"
proxy_port = "1000"
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
proxies = {"http": proxy_url, "https": proxy_url}

target_site_url = "https://www.example-news.com" # Replace with a real site you have permission to scrape

def scrape_titles(url, proxies_config):
    try:
        headers = { # It's good practice to set a User-Agent
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, proxies=proxies_config, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        titles = []
        # This is an example selector; you'll need to inspect the target site's HTML
        # to find the correct tags and classes for the elements you want.
        for h2_tag in soup.find_all('h2', class_='article-title'): # Adjust selector as needed
            titles.append(h2_tag.get_text(strip=True))
        return titles

    except requests.exceptions.RequestException as e:
        print(f"Error during scraping {url}: {e}")
        return []

# --- Main part of the script ---
if __name__ == "__main__":
    # Replace your_username and your_password
    if "your_username" in proxy_url or "your_password" in proxy_url:
        print("Please replace 'your_username' and 'your_password' in the script.")
    else:
        article_titles = scrape_titles(target_site_url, proxies)
        if article_titles:
            print(f"Found titles on {target_site_url}:")
            for i, title in enumerate(article_titles[:5]): # Print first 5 titles
                print(f"{i+1}. {title}")
        else:
            print(f"No titles found or error occurred for {target_site_url}.")

This script defines a function scrape_titles that takes a URL and proxy configuration, fetches the page, and then uses Beautiful Soup’s find_all method to locate HTML elements (e.g., <h2> tags with a class article-title) and extract their text. Remember to inspect the HTML structure of your target website to determine the correct selectors.
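The selector logic can be verified offline against a small hand-written HTML fragment (the markup below is invented purely to exercise find_all):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the structure scrape_titles expects
html = """
<html><body>
  <h2 class="article-title">First headline</h2>
  <h2 class="other">Not a title</h2>
  <h2 class="article-title">Second headline</h2>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# class_ (with trailing underscore) filters by CSS class; 'class' is a Python keyword
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2', class_='article-title')]
print(titles)  # -> ['First headline', 'Second headline']
```

Only the elements matching both the tag name and the class are returned, which is why inspecting the real site's HTML first is essential.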

Tips and Troubleshooting

  • Credentials & Endpoint: Always double-check proxy_username, proxy_password, proxy_host, proxy_port, and the protocol scheme in proxy_url. Incorrect details are the most common cause of issues.
  • Password Format: The proxy password often includes session parameters (e.g., _session-anychars_mode-speed). Ensure you replace your_password but keep these additional parameters intact if they are part of your assigned password string.
  • SOCKS Proxies:
    • To use SOCKS5 proxies with requests, you need to install the PySocks library:
      pip install "requests[socks]"
    • Then, update your proxy_url to use the socks5h:// (or socks5://) scheme:
      proxy_url = f"socks5h://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
      # 'socks5h://' means DNS resolution happens on the proxy server side.
      # 'socks5://' means DNS resolution happens on the client side. 'socks5h' is often preferred.
  • SSL Verification (verify=False):
    • If you encounter SSL errors specifically from the target website (not the proxy connection itself), you might be tempted to use verify=False in requests.get():
      # response = requests.get(url, proxies=proxies, verify=False)
    • Use with extreme caution, as this disables SSL certificate verification and makes your connection insecure. It’s better to resolve underlying SSL issues (e.g., by ensuring your system’s root CAs are up to date).
  • User-Agent: Some websites block requests that don’t have a common browser User-Agent. Set one via the headers parameter in requests.get():
    headers = {'User-Agent': 'Mozilla/5.0 ...'}
    response = requests.get(url, proxies=proxies, headers=headers)
  • Error Handling: The provided examples include basic error handling. For production scrapers, implement more robust error checking, retries with backoff, and logging.
  • Website Terms of Service: Always respect the robots.txt file and the terms of service of any website you scrape. Implement rate limiting and be considerate to avoid overloading servers or getting your proxy IP blocked.
  • Dynamic Content (JavaScript): Beautiful Soup and requests only fetch the initial HTML content. If a website loads content dynamically using JavaScript, these tools won’t execute the JavaScript. For such sites, consider tools like Selenium (see our Selenium integration guide) or Playwright.
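For the retries-with-backoff suggestion above, requests can delegate retry logic to urllib3's Retry class via an HTTPAdapter mounted on a Session. The retry count and status list below are illustrative defaults, not Evomi recommendations:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry up to 3 times on connection errors and common transient status codes,
# with exponentially increasing waits between attempts.
retry = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

# session.get() is then used exactly like requests.get():
# response = session.get(url_to_scrape, proxies=proxies, timeout=10)
```

Using a Session also reuses the underlying connection, which is noticeably faster when making many requests through the same proxy.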

By following this guide, you should be able to successfully integrate Evomi’s proxies with Beautiful Soup and requests for your web scraping tasks.