How to Set Up Exception Handling in a Python Crawler

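The key steps for exception handling in a Python crawler are catching exceptions, logging, retrying, setting timeouts, and using proxies. Exception handling is essential to a crawler's stability and reliability. Catching exceptions comes first: it keeps the program from crashing and gives it a chance to recover. For example, when an HTTP error is caught, the crawler can adjust the request parameters or retry the request.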

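Catching exceptions in detail:

Common exceptions in a Python crawler include network errors (such as ConnectionError), HTTP errors (such as HTTPError), and parsing errors (such as ValueError). Wrapping the request in a try-except statement keeps these exceptions from interrupting the program. A basic example: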

import requests
url = "http://example.com"
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
    data = response.text
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except requests.exceptions.ConnectionError as conn_err:
    print(f"Connection error occurred: {conn_err}")
except Exception as err:
    print(f"An error occurred: {err}")

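Catching these exceptions lets the crawler keep running when an error occurs instead of crashing, and handle each exception type differently.

I. Catching exceptions

Catching exceptions is the first step in handling crawler errors. It lets the program recognize and handle different kinds of errors without crashing. Common types are network errors, HTTP errors, and parsing errors.

1. Network errors

Network errors are usually caused by connection problems, such as an unresponsive server or an unstable network. With the requests library, they can be caught as follows: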

import requests
url = "http://example.com"
try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.ConnectionError as conn_err:
    print(f"Connection error occurred: {conn_err}")

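2. HTTP errors

An HTTP error means the request returned an error status code, such as 404 (Not Found) or 500 (Internal Server Error). Calling raise_for_status() raises requests.exceptions.HTTPError for 4xx and 5xx responses, which can then be caught: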

import requests
url = "http://example.com"
try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")

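3. Parsing errors

Parsing errors usually occur while parsing HTML or JSON data. BeautifulSoup tolerates malformed HTML, so the more common failure is that an expected element is missing: find() returns None, and accessing an attribute such as .text on it raises AttributeError. Catching that exception handles the case: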

from bs4 import BeautifulSoup
html = "<html><body><h1>Example</h1></body></html>"
try:
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('title').text
except AttributeError as attr_err:
    print(f"Parsing error occurred: {attr_err}")
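
The same pattern applies to JSON responses. The sketch below is an illustrative assumption rather than part of the original examples: parse_json_payload is a hypothetical helper that wraps the standard library's json.loads and returns a fallback value when the text is not valid JSON. Note that json.JSONDecodeError is a subclass of ValueError, so catching ValueError would cover it as well.

```python
import json


def parse_json_payload(text, default=None):
    """Parse a JSON string, returning `default` if the text is not valid JSON.

    Hypothetical helper, for illustration only.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError as json_err:
        # Raised when the text is not valid JSON (e.g. an HTML error page)
        print(f"JSON parsing error occurred: {json_err}")
        return default


good = parse_json_payload('{"title": "Example"}')
bad = parse_json_payload("<html>not json</html>", default={})
```

Returning a default keeps the crawl loop going; whether to skip, log, or re-raise depends on how important the record is.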

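II. Logging

Logging is an important way to monitor and debug a crawler. Logs record the crawler's running state, error messages, and processing steps, which makes problems easier to analyze and fix.

1. Using the logging library

Python's logging library provides flexible logging and can send messages to the console, a file, or other destinations. A simple example: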

import logging
import requests
# Configure logging
logging.basicConfig(level=logging.INFO, filename='crawler.log', filemode='w',
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
url = "http://example.com"
try:
    response = requests.get(url)
    response.raise_for_status()
    logger.info(f"Successfully fetched {url}")
except requests.exceptions.HTTPError as http_err:
    logger.error(f"HTTP error occurred: {http_err}")
except requests.exceptions.ConnectionError as conn_err:
    logger.error(f"Connection error occurred: {conn_err}")
except Exception as err:
    logger.error(f"An error occurred: {err}")

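2. Log levels

The logging library defines several levels that indicate the severity of a message. The common ones are:

DEBUG: detailed diagnostic information, typically used during development and debugging.
INFO: general information confirming that the program is running normally.
WARNING: a warning that something unexpected happened or a problem may be coming.
ERROR: an error; part of the program failed, but it can keep running.
CRITICAL: a serious error that may stop the program.

Choosing the level carefully controls how much is logged and makes analysis and debugging easier.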
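
To make the effect of the level threshold concrete, here is a minimal sketch. The logger name crawler_demo and the messages are made up for illustration; the output goes to an in-memory stream instead of a file so it is easy to inspect.

```python
import io
import logging

# Write to an in-memory stream so the effect of the level is easy to inspect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

logger = logging.getLogger("crawler_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # drop DEBUG and INFO records
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.debug("Request headers prepared")
logger.info("Fetched page 1")
logger.warning("Response took longer than expected")
logger.error("HTTP error occurred")

output = stream.getvalue()
print(output)
```

Only the WARNING and ERROR records reach the stream. Lowering the threshold to logging.DEBUG while developing and raising it in production is a common arrangement.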

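III. Retrying

While a crawler runs, an unstable network or a slow server can cause requests to fail. A retry mechanism sends the request again after a failure, which improves the crawler's stability and success rate.

1. Using the retrying library

The retrying library provides a simple way to retry a failed call. An example: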

from retrying import retry
import requests
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def fetch_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text
url = "http://example.com"
try:
    data = fetch_url(url)
    print(f"Successfully fetched {url}")
except Exception as err:
    print(f"Failed to fetch {url}: {err}")

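In this example, the @retry decorator sets the retry policy: stop_max_attempt_number is the maximum number of attempts (including the first call), and wait_fixed is the wait between attempts in milliseconds.

2. A custom retry mechanism

You can also write the retry logic yourself instead of using the retrying library. An example: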

import time
import requests
def fetch_url(url, max_retries=3, wait_time=2):
    for i in range(max_retries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as req_err:
            print(f"Attempt {i+1} failed: {req_err}")
            time.sleep(wait_time)
    raise Exception(f"Failed to fetch {url} after {max_retries} attempts")
url = "http://example.com"
try:
    data = fetch_url(url)
    print(f"Successfully fetched {url}")
except Exception as err:
    print(f"Failed to fetch {url}: {err}")

In this example, fetch_url tries the request in a loop and waits wait_time seconds after each failure, until max_retries attempts have been made.
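
A fixed wait is the simplest choice. One common variation, shown here as a sketch and not as part of the original examples, doubles the wait after each failure (exponential backoff) so that a struggling server gets more breathing room. The names retry_with_backoff, max_attempts, base_delay, and the injectable sleep parameter are all hypothetical.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts attempts have failed.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    All names here are hypothetical and chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as err:
            if attempt == max_attempts - 1:
                raise  # no attempts left: re-raise the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {err}; retrying in {delay}s")
            sleep(delay)
```

In a crawler, func would be a small function that performs one request and calls raise_for_status(), and the except clause would usually be narrowed to requests.exceptions.RequestException so that programming errors are not retried.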

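IV. Setting timeouts

Setting a timeout keeps a network request from hanging. By default requests does not time out, so without one a stalled connection can leave the program waiting indefinitely.

1. Request timeout

In the requests library, the timeout parameter sets the request timeout. An example: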

import requests
url = "http://example.com"
try:
    response = requests.get(url, timeout=5)  # Set the timeout to 5 seconds
    response.raise_for_status()
    data = response.text
    print(f"Successfully fetched {url}")
except requests.exceptions.Timeout as timeout_err:
    print(f"Request timed out: {timeout_err}")
except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")

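With a timeout set, requests raises a Timeout exception when the server does not respond within the specified time (5 seconds here), so the program does not wait indefinitely.

2. Connect and read timeouts

requests also lets you set the connect timeout and the read timeout separately by passing a tuple. The connect timeout limits the time to establish the connection; the read timeout limits how long the client waits for the server to send data. An example: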

import requests
url = "http://example.com"
try:
    response = requests.get(url, timeout=(3, 5))  # Connect timeout 3 s, read timeout 5 s
    response.raise_for_status()
    data = response.text
    print(f"Successfully fetched {url}")
except requests.exceptions.Timeout as timeout_err:
    print(f"Request timed out: {timeout_err}")
except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")

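Setting the two timeouts separately gives finer control and keeps a slow connection or a slow response from stalling the program.

V. Using proxies

Proxies are a common way to deal with IP bans and access restrictions. Routing requests through a proxy server hides the crawler's real IP address and can improve its success rate.

1. Setting a proxy

In the requests library, the proxies parameter sets the proxy. An example: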

import requests
url = "http://example.com"
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080"
}
try:
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    data = response.text
    print(f"Successfully fetched {url} using proxy")
except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")

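With a proxy configured, requests are sent from the proxy's IP address, which hides the real one.

2. Using a proxy pool

To make the crawler harder to block, rotate through a pool of proxy IPs. An example: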

import requests
import random
url = "http://example.com"
proxy_pool = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128"
]
def get_random_proxy():
    return random.choice(proxy_pool)
try:
    proxy = get_random_proxy()
    proxies = {"http": proxy, "https": proxy}
    response = requests.get(url, proxies=proxies)
    response.raise_for_status()
    data = response.text
    print(f"Successfully fetched {url} using proxy {proxy}")
except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")

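With a proxy pool, each request picks a proxy at random, which avoids the bans that come from sending every request from a single IP address.

VI. Other exception handling strategies

Beyond catching exceptions, logging, retrying, setting timeouts, and using proxies, a few other strategies improve a crawler's stability and reliability.

1. Limiting the request rate

To avoid being banned for requesting too often, limit the request rate. For example, add a random delay between requests to mimic a human visitor: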

import time
import random
import requests
url = "http://example.com"
def fetch_url(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.text
try:
    data = fetch_url(url)
    print(f"Successfully fetched {url}")
    time.sleep(random.uniform(1, 3))  # Random delay of 1 to 3 seconds
except requests.exceptions.RequestException as req_err:
    print(f"Request error occurred: {req_err}")
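
The delay matters most when many pages are fetched in a row. The following sketch applies the same idea to a list of URLs. The function crawl_with_delay and its parameters are hypothetical; fetch stands for any callable that downloads one URL, such as the fetch_url function above.

```python
import random
import time


def crawl_with_delay(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL and returns its content.
    Failures are recorded as None so one bad URL does not stop the crawl.
    """
    results = {}
    for index, url in enumerate(urls):
        try:
            results[url] = fetch(url)
        except Exception as err:
            print(f"Request error occurred for {url}: {err}")
            results[url] = None
        if index < len(urls) - 1:  # no need to wait after the last URL
            sleep(random.uniform(min_delay, max_delay))
    return results
```

Sleeping after failures as well as successes keeps the request rate bounded even when a site starts returning errors.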

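2. Handling abnormal data

Scraped data may be abnormal, for example with missing fields or malformed values. Catching the parsing exception and handling it, such as by substituting a default value, improves the completeness and accuracy of the data: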

from bs4 import BeautifulSoup
html = "<html><body><h1>Example</h1></body></html>"
try:
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('title').text
except AttributeError as attr_err:
    print(f"Parsing error occurred: {attr_err}")
    title = "Unknown"  # Fall back to a default value
finally:
    print(f"Title: {title}")
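
The same approach works after extraction, when records are cleaned before being stored. This is a minimal sketch under assumed field names ("title" and "price" are hypothetical): a missing title gets a default, while a record whose price cannot be converted is dropped.

```python
def clean_record(raw):
    """Return a normalized record, or None if it cannot be salvaged.

    The field names ("title", "price") are hypothetical examples.
    """
    # Use a default when the title is missing or blank
    title = (raw.get("title") or "").strip() or "Unknown"
    try:
        price = float(raw["price"])
    except (KeyError, TypeError, ValueError) as err:
        # Missing field, None, or a string that is not a number
        print(f"Skipping record with invalid price: {err!r}")
        return None
    return {"title": title, "price": price}


records = [
    {"title": " Example ", "price": "19.90"},
    {"price": "5"},
    {"title": "Broken", "price": "N/A"},
]
cleaned = [rec for rec in map(clean_record, records) if rec is not None]
```

Which fields get a default and which cause the record to be dropped is a judgment call that depends on how the data will be used.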