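How to Write a Web Crawler in Python, with Examples

Writing a web crawler in Python involves four key steps: choosing the target site, sending HTTP requests, parsing the HTML, and storing the data. Of these, parsing the HTML matters most, because it determines whether the data you need is extracted correctly. The examples and code that follow walk through each step.

I. Choose the target site

Before scraping any data, decide which site to crawl and where on its pages the data you need is located. Suppose we want to scrape the headlines and article text from a news site.

II. Send the HTTP request

We use Python's requests library to send an HTTP request and fetch the page content.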

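import requests
url = 'https://example.com/news'
# A timeout keeps the script from hanging if the server never answers.
response = requests.get(url, timeout=10)
if response.status_code == 200:
    page_content = response.text
else:
    # Keep page_content defined so the parsing step does not raise NameError.
    page_content = ''
    print(f'Failed to retrieve content: {response.status_code}')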

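III. Parse the HTML

We use the BeautifulSoup library to parse the HTML and extract the data we need.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')
# The tag and class names are examples; adjust them to the target page's markup.
titles = soup.find_all('h2', class_='news-title')
contents = soup.find_all('div', class_='news-content')
# zip() stops at the shorter list, so unmatched titles or contents are skipped.
for title, content in zip(titles, contents):
    print(f'Title: {title.get_text()}')
    print(f'Content: {content.get_text()}')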

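IV. Store the data

The scraped data can be saved to a file or a database. This example writes it to a CSV file.

import csv
# newline='' lets the csv module control line endings; utf-8 keeps non-ASCII text intact.
with open('news.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Content'])  # header row
    for title, content in zip(titles, contents):
        writer.writerow([title.get_text(), content.get_text()])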
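One quick way to check CSV output is to read it back. The sketch below is only an illustration: the sample rows are made up, and an in-memory buffer stands in for news.csv. It suggests that csv.writer quotes commas, quotation marks, and line breaks inside a field so that csv.DictReader can recover them unchanged.

```python
import csv
import io

# Hypothetical sample rows standing in for scraped titles and contents.
rows = [
    ("Markets rally, then dip", "Line one.\nLine two."),
    ('He said "no comment"', "Plain text"),
]

# An in-memory buffer replaces news.csv so the sketch leaves no file behind.
# newline="" mirrors the open() call above and leaves line endings untranslated.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Title", "Content"])
writer.writerows(rows)

# Read the data back; DictReader uses the header row as field names.
buffer.seek(0)
restored = [(r["Title"], r["Content"]) for r in csv.DictReader(buffer)]
print(restored == rows)
```

If the comparison ever fails on real data, the usual suspects are a missing newline='' argument or a mismatched encoding when reopening the file.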

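V. Optimize the crawler and handle exceptions

In practice, a crawler has to cope with a range of failure cases, and the crawling process itself usually needs tuning.

1. Handle exceptions

A crawler can run into many kinds of failures, such as a dropped network connection or the target site refusing access. A try-except block catches them.

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
except requests.RequestException as e:
    print(f'Error fetching {url}: {e}')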
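Catching the exception is often only half of the job, since many network failures are temporary and a retry succeeds. The following is a minimal sketch of retrying with exponential backoff. The helper name fetch_with_retry, its parameters, and the flaky_fetch stand-in are all hypothetical; a fake fetcher is used so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0,
                     retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff.

    All names here are illustrative. In a real crawler, `fetch` could wrap
    requests.get(...) followed by raise_for_status(), and `retry_on` could be
    (requests.RequestException,).
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except retry_on as exc:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f'Attempt {attempt + 1} failed ({exc}); retrying in {delay}s')
            sleep(delay)


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated network failure')
    return '<html>ok</html>'


waits = []  # records the delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 'https://example.com/news',
                          sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter is a design choice made for this sketch: it keeps the demonstration instant and makes the backoff schedule easy to inspect.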

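2. Set request headers

Some sites inspect the request headers to judge whether a request comes from a real browser. Sending browser-like headers makes it less likely that the site rejects the request.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers)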

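VI. Use proxies and delays

To reduce the risk of the crawler being blocked, route requests through proxies so they are spread across several IP addresses, and add delays to lower the request rate.

1. Use a proxy

Sending requests through a proxy server hides the crawler's real IP address from the target site.

# Example addresses. Each key is the scheme of the target URL; each value is the proxy to use.
# Most proxies are reached over plain http://, even when they carry https traffic.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(url, proxies=proxies)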

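2. Add delays

Pause for a random interval between requests so that the access pattern looks more like a human visitor's.

import time
import random
time.sleep(random.uniform(1, 3))  # wait between 1 and 3 seconds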
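The single sleep call above becomes more useful once it is placed inside the crawl loop. Here is a rough sketch of one way to do that. The names polite_pause and crawl are hypothetical, and the fetch argument is a stand-in for a real request function, so the example runs without network access.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration and return the chosen delay."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, pause=polite_pause):
    """Fetch each URL in order, pausing between requests (not after the last)."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            pause()
        pages.append(fetch(url))
    return pages


# Offline demonstration: delays are recorded rather than slept.
recorded = []
pages = crawl(
    ['https://example.com/news1',
     'https://example.com/news2',
     'https://example.com/news3'],
    fetch=lambda url: f'page for {url}',
    pause=lambda: polite_pause(sleep=recorded.append),
)
print(len(pages), recorded)
```

With three URLs the loop pauses twice, once before each request after the first.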

VII. Handle JavaScript-rendered pages

Some sites render their content dynamically with JavaScript, so the plain requests and BeautifulSoup approach cannot see it. The Selenium library handles such pages by driving a real browser.

1. Install Selenium

pip install selenium

2. Fetch dynamic content with Selenium

import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
try:
    driver.get(url)
    time.sleep(3)  # crude fixed wait for the page to finish rendering
    page_content = driver.page_source
finally:
    driver.quit()  # close the browser even if loading fails
soup = BeautifulSoup(page_content, 'html.parser')
titles = soup.find_all('h2', class_='news-title')
contents = soup.find_all('div', class_='news-content')
for title, content in zip(titles, contents):
    print(f'Title: {title.get_text()}')
    print(f'Content: {content.get_text()}')

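VIII. More advanced HTML parsing

When the HTML is more complex, regular expressions or XPath may be needed to extract the data.

1. Use regular expressions

import re
# re.S lets '.' match newlines, so titles that span several lines are captured.
# The pattern matches only this exact opening tag and keeps any nested markup.
pattern = re.compile(r'<h2 class="news-title">(.*?)</h2>', re.S)
titles = pattern.findall(page_content)
for title in titles:
    print(f'Title: {title}')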
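To make the behavior of that pattern concrete, the sketch below runs it against a small, made-up page fragment. It illustrates two things worth knowing: the captured group still contains nested tags and HTML entities, which need a further pass, and a heading with an extra class is silently missed. The sample markup and the post-processing steps are assumptions for illustration.

```python
import html
import re

# Hypothetical page fragment; the class name matches the example above.
sample = '''
<h2 class="news-title">Budget &amp; taxes</h2>
<h2 class="news-title">
    <a href="/news/2">Rain expected</a>
</h2>
<h2 class="news-title featured">Missed by the pattern</h2>
'''

pattern = re.compile(r'<h2 class="news-title">(.*?)</h2>', re.S)
tag = re.compile(r'<[^>]+>')

titles = []
for raw in pattern.findall(sample):
    text = tag.sub('', raw)       # drop nested tags such as <a>
    text = html.unescape(text)    # turn &amp; into &
    titles.append(text.strip())

print(titles)
```

The third heading is skipped because its class attribute is not exactly "news-title", which is one reason an HTML parser is usually the more robust choice.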

2. Use XPath

from lxml import etree
tree = etree.HTML(page_content)
# text() returns only the text directly inside <h2>, not text in nested tags.
titles = tree.xpath('//h2[@class="news-title"]/text()')
for title in titles:
    print(f'Title: {title}')

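IX. Concurrent crawling

To crawl faster, run the fetch tasks concurrently with multiple threads or multiple processes.

1. Use multiple threads

import threading
import requests
def fetch_url(url):
    # Catch network errors so a failed request does not end its thread with a traceback.
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException as e:
        print(f'Error fetching {url}: {e}')
        return
    if response.status_code == 200:
        print(f'Successfully fetched {url}')
    else:
        print(f'Failed to fetch {url}: {response.status_code}')
urls = ['https://example.com/news1', 'https://example.com/news2', 'https://example.com/news3']
threads = []
for page_url in urls:
    thread = threading.Thread(target=fetch_url, args=(page_url,))
    threads.append(thread)
    thread.start()
# Wait for every thread to finish before continuing.
for thread in threads:
    thread.join()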
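Starting one thread per URL works for three pages but does not scale to thousands, and the function above only prints its outcome. As a possible alternative, the sketch below uses the standard library's ThreadPoolExecutor to cap the number of worker threads and collect results and errors. The names fetch_all and fake_fetch are hypothetical; fake_fetch stands in for a real fetcher that would return response.text.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a bounded pool of worker threads.

    Returns two dicts: url -> result, and url -> exception for failures.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep one bad URL from stopping the rest
                errors[url] = exc
    return results, errors


# Offline stand-in for a real fetcher; the second URL fails on purpose.
def fake_fetch(url):
    if url.endswith('news2'):
        raise ConnectionError('simulated failure')
    return f'content of {url}'


urls = ['https://example.com/news1',
        'https://example.com/news2',
        'https://example.com/news3']
results, errors = fetch_all(urls, fake_fetch)
print(sorted(results), list(errors))
```

Keeping max_workers small also limits how hard the crawler hits the target site at any one moment.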

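2. Use multiple processes

from multiprocessing import Pool
import requests
def fetch_url(url):
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        print(f'Successfully fetched {url}')
    else:
        print(f'Failed to fetch {url}')
# The __main__ guard is required on platforms that start workers by spawning
# a fresh interpreter (Windows, and macOS by default).
if __name__ == '__main__':
    urls = ['https://example.com/news1', 'https://example.com/news2', 'https://example.com/news3']
    with Pool(4) as p:
        p.map(fetch_url, urls)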

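X. Common anti-crawler mechanisms and countermeasures

1. CAPTCHAs

Some sites use CAPTCHAs to block crawlers. OCR or a third-party CAPTCHA-solving service can recognize them automatically, but the results are often unreliable.

2. IP bans

Rotate IP addresses through a proxy pool to avoid being banned.

3. Dynamically loaded content

Use Selenium or a similar tool to simulate browser behavior and retrieve dynamically loaded content.

4. Rate limits

Add random delays to lower the request rate and avoid triggering rate limits.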
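The proxy-pool rotation mentioned under IP bans can be sketched in a few lines. This is a simplified illustration: the proxy addresses are placeholders, the helper name proxy_rotator is hypothetical, and a production pool would also need to drop proxies that stop responding.

```python
from itertools import cycle

# Placeholder addresses; replace them with proxies you are permitted to use.
PROXY_POOL = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
]


def proxy_rotator(pool):
    """Yield proxies mappings in the shape requests expects, cycling forever."""
    for proxy in cycle(pool):
        yield {'http': proxy, 'https': proxy}


rotator = proxy_rotator(PROXY_POOL)

# Each next() call returns the next proxy; the fourth wraps around to the first.
chosen = [next(rotator)['http'] for _ in range(4)]
print(chosen)
```

In a crawler, each request would then take the next mapping, for example requests.get(url, proxies=next(rotator), timeout=10).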

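XI. Clean and process the data

Scraped data often needs further cleaning and processing before it can be used.

1. Clean the data

Remove extra whitespace, leftover HTML tags, and similar noise.

import re
def clean_text(text):
    text = re.sub(r'<.*?>', '', text)  # remove HTML tags first
    text = re.sub(r'\s+', ' ', text)  # then collapse runs of whitespace
    return text.strip()
# Assumes titles and contents are the BeautifulSoup tags from the parsing step,
# not the plain strings returned by the regex or XPath examples.
cleaned_titles = [clean_text(title.get_text()) for title in titles]
cleaned_contents = [clean_text(content.get_text()) for content in contents]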
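To show what such a cleaning function does on messy input, here is a small self-contained variant run against a made-up string. Compared with the version above it makes two assumptions of its own: tags are replaced with a space rather than removed outright, so that adjacent blocks do not fuse into one word, and HTML entities are decoded with html.unescape.

```python
import html
import re


def clean_text(text):
    """Strip tags, decode entities, and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)  # replace each tag with a space
    text = html.unescape(text)            # &nbsp; becomes U+00A0, &amp; becomes &
    text = re.sub(r'\s+', ' ', text)      # \s also matches U+00A0 for str patterns
    return text.strip()


# Hypothetical raw value, as it might come from a regex-based extraction.
raw = '  Breaking:\n  <b>Rates</b> rise&nbsp;again ,<p>markets &amp; banks react</p> '
cleaned = clean_text(raw)
print(cleaned)
```

Replacing tags with a space can leave a stray space before punctuation, as this input shows; whether that trade-off is acceptable depends on the data.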

2. Store the data

Save the cleaned data to a database.

import sqlite3
conn = sqlite3.connect('news.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS news (title TEXT, content TEXT)''')
for title, content in zip(cleaned_titles, cleaned_contents):
    # The ? placeholders let sqlite3 escape the values, which prevents SQL injection.
    c.execute("INSERT INTO news (title, content) VALUES (?, ?)", (title, content))
conn.commit()
conn.close()
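Running a crawler repeatedly against the same site tends to insert the same articles again. One possible way to guard against that is sketched below. It departs from the schema above by assuming a UNIQUE constraint on the title column, and it uses made-up rows and an in-memory database so it can run on its own.

```python
import sqlite3

# Hypothetical cleaned rows standing in for cleaned_titles and cleaned_contents.
rows = [
    ('Rates rise again', 'The central bank raised rates.'),
    ('Rain expected', 'Forecasters predict rain this weekend.'),
]

# ':memory:' keeps the sketch self-contained; a real crawler would use news.db.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE IF NOT EXISTS news (title TEXT UNIQUE, content TEXT)'
)

# Insert the same rows twice to simulate two crawler runs.
for _ in range(2):
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            'INSERT OR IGNORE INTO news (title, content) VALUES (?, ?)', rows
        )

count = conn.execute('SELECT COUNT(*) FROM news').fetchone()[0]
stored_titles = [row[0] for row in
                 conn.execute('SELECT title FROM news ORDER BY title')]
conn.close()
print(count, stored_titles)
```

Using the title as the uniqueness key is a simplification; an article URL, if it is scraped as well, is usually a more reliable key.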