How to Crawl a Corpus with Python

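Crawling a corpus with Python involves four steps: choosing a suitable crawling tool, configuring request headers, parsing the page content, and extracting and saving the data. Choosing the tool matters most, because different tools fit different needs: Scrapy suits large-scale crawling, while BeautifulSoup suits small-scale data extraction. The sections below walk through crawling a corpus with both Scrapy and BeautifulSoup.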

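I. Choosing the Right Crawling Tool

Choosing a tool is the first step in crawling a corpus. The two most commonly used options in Python are Scrapy and BeautifulSoup. Scrapy is a full-featured crawling framework suited to large-scale crawls, while BeautifulSoup is an HTML parsing library that, paired with an HTTP client such as requests, suits small-scale extraction.

1. Scrapy

Installing Scrapy

Scrapy must be installed before it can be used. It can be installed with pip: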

pip install Scrapy

Creating a Scrapy project

After installation, create a Scrapy project and change into its directory:

scrapy startproject mycorpus
cd mycorpus

Creating a Spider

A Scrapy project needs a Spider that defines the crawling logic:

scrapy genspider example example.com

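Writing the Spider code

Write the crawling logic in the generated Spider file. The selectors below assume each entry on the page sits in a div.quote element; adjust them to the structure of the target site: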

import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

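Running the crawler

Once the Spider is written, run it from the project directory (appending -o corpus.json would also export the yielded items to a file):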

scrapy crawl example

2. BeautifulSoup

Installing BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML, and it can be used together with the requests library:

pip install beautifulsoup4 requests

Writing the crawling code

Fetch the page with requests and extract the data with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
URL = 'http://example.com'
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')
for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').get_text()
    # Same match as the Scrapy selector 'span small': a <small> inside any <span>
    author = quote.select_one('span small').get_text()
    tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
    print(f'Text: {text}, Author: {author}, Tags: {tags}')

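II. Configuring Request Headers

To reduce the chance of the server identifying the program as a crawler, configure the request headers so that requests resemble those of a regular browser. With requests, pass a headers argument: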

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(URL, headers=headers)

In Scrapy, set USER_AGENT in the settings.py file:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

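III. Parsing Page Content

Parsing is the key step in extracting data from HTML. The required fields can be extracted with BeautifulSoup or with the selectors that Scrapy provides.

1. Parsing with BeautifulSoup

Parse the HTML content with BeautifulSoup: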

soup = BeautifulSoup(response.content, 'html.parser')
for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').get_text()
    # Same match as the Scrapy selector 'span small': a <small> inside any <span>
    author = quote.select_one('span small').get_text()
    tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
    print(f'Text: {text}, Author: {author}, Tags: {tags}')

2. Parsing with Scrapy

In Scrapy, data can be extracted with CSS selectors or XPath. The example below uses CSS selectors:

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('span small::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

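IV. Extracting and Saving the Data

After extraction, the data needs to be saved to a file or a database for later processing.

1. Saving to a file

The data can be saved to a CSV or JSON file. The CSV example below assumes that quotes is a list of dictionaries with 'text', 'author', and 'tags' keys collected during parsing: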

import csv
with open('corpus.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Text', 'Author', 'Tags'])
    for quote in quotes:
        writer.writerow([quote['text'], quote['author'], ','.join(quote['tags'])])
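The text above mentions JSON as an alternative to CSV. A minimal sketch of that option follows, writing one JSON object per line (the JSON Lines layout). The function names `save_jsonl` and `load_jsonl`, the file name `corpus.jsonl`, and the sample items are assumptions made for illustration, not part of the original article.

```python
import json

# Hypothetical sample items, shaped like the dictionaries the spider yields.
quotes = [
    {'text': 'First sample sentence.', 'author': 'Author A', 'tags': ['demo', 'sample']},
    {'text': 'A second sentence from the café.', 'author': 'Author B', 'tags': []},
]


def save_jsonl(items, path):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as file:
        for item in items:
            file.write(json.dumps(item, ensure_ascii=False) + '\n')


def load_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding='utf-8') as file:
        return [json.loads(line) for line in file if line.strip()]


if __name__ == '__main__':
    save_jsonl(quotes, 'corpus.jsonl')
    print(len(load_jsonl('corpus.jsonl')), 'items saved')
```

Unlike the CSV version, this keeps the tags as a real list rather than a comma-joined string, which avoids ambiguity if a tag itself contains a comma.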

2. Saving to a database

The data can also be saved to a SQLite database:

import sqlite3
conn = sqlite3.connect('corpus.db')
c = conn.cursor()
# IF NOT EXISTS lets the script run again without failing on an existing table
c.execute('''CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)''')
for quote in quotes:
    c.execute("INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
              (quote['text'], quote['author'], ','.join(quote['tags'])))
conn.commit()
conn.close()
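Since the data is stored for later processing, it may help to see the round trip: inserting rows in bulk and then querying them back. The sketch below uses an in-memory database so that it leaves no file behind. The helper names `save_quotes` and `texts_by_author` and the sample rows are hypothetical.

```python
import sqlite3


def save_quotes(conn, items):
    """Create the table if needed and insert all items in one batch."""
    conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)')
    rows = [(q['text'], q['author'], ','.join(q['tags'])) for q in items]
    conn.executemany('INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)', rows)
    conn.commit()


def texts_by_author(conn, author):
    """Return the stored texts for one author, in insertion order."""
    cursor = conn.execute(
        'SELECT text FROM quotes WHERE author = ? ORDER BY rowid', (author,)
    )
    return [row[0] for row in cursor]


# Hypothetical sample data; ':memory:' keeps the demonstration off the disk.
sample = [
    {'text': 'First sample sentence.', 'author': 'Author A', 'tags': ['demo']},
    {'text': 'Second sample sentence.', 'author': 'Author B', 'tags': []},
    {'text': 'Third sample sentence.', 'author': 'Author A', 'tags': ['demo', 'extra']},
]
demo_conn = sqlite3.connect(':memory:')
save_quotes(demo_conn, sample)
print(texts_by_author(demo_conn, 'Author A'))
demo_conn.close()
```

Passing the connection in as a parameter means the same functions work unchanged with `sqlite3.connect('corpus.db')`.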

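V. Handling Anti-Crawling Measures

During a crawl you may run into anti-crawling measures such as CAPTCHAs and IP bans. The following techniques can help.

1. Using a proxy

A proxy hides the real IP address and lowers the risk of being banned: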

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get(URL, proxies=proxies)

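In Scrapy, enable the proxy middlewares in the settings.py file. ProxyMiddleware is a custom class that you write yourself; its module path must match the project name (mycorpus in this tutorial):

DOWNLOADER_MIDDLEWARES = {
    'mycorpus.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}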
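The settings refer to a custom ProxyMiddleware that the article does not show. One possible shape for it is sketched below, intended for the project's middlewares.py. It relies on the documented behavior that Scrapy's HttpProxyMiddleware uses the proxy stored in `request.meta['proxy']`. The proxy addresses are placeholders and the random selection strategy is an assumption.

```python
import random


class ProxyMiddleware:
    """Downloader middleware that assigns a proxy to each outgoing request."""

    # Placeholder addresses (assumption); replace with proxies you may use.
    PROXIES = [
        'http://10.10.1.10:3128',
        'http://10.10.1.11:3128',
    ]

    def process_request(self, request, spider):
        # Keep a proxy that the spider set explicitly; otherwise pick one at random.
        request.meta.setdefault('proxy', random.choice(self.PROXIES))
        # Returning None tells Scrapy to continue handling the request normally.
        return None
```

Because the class only touches `request.meta`, it needs no Scrapy imports of its own; Scrapy instantiates it from the path given in DOWNLOADER_MIDDLEWARES.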

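2. Using delays and a random User-Agent

Adding a delay between requests and rotating the User-Agent reduces the risk of being identified as a crawler: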

import random
import time
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    # Firefox identifies itself with Gecko/... and rv:..., not AppleWebKit/Safari
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0',
    # Add more User-Agent strings here
]
headers = {
    'User-Agent': random.choice(user_agents)
}
time.sleep(random.uniform(1, 3))
response = requests.get(URL, headers=headers)
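The snippet above sends a single request, whereas the delay and the User-Agent rotation only pay off when applied to every request in a crawl. A minimal sketch of such a loop follows. The function name `crawl_politely` and its parameters are hypothetical, and the actual HTTP call is passed in as `fetch` so the loop stays independent of any particular HTTP library.

```python
import random
import time

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0',
]


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield (url, result) pairs, pausing a random interval between requests.

    fetch is any callable taking (url, headers); sleep is injectable for testing.
    """
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        # Pick a fresh User-Agent for each request.
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        yield url, fetch(url, headers)
```

With requests, `fetch` could be `lambda url, headers: requests.get(url, headers=headers, timeout=10)`, and the generator would then be consumed with an ordinary `for url, response in crawl_politely(...)` loop.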

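In Scrapy, the delay is configured in the settings.py file. With RANDOMIZE_DOWNLOAD_DELAY enabled (the default), Scrapy waits between 0.5 and 1.5 times DOWNLOAD_DELAY before each request. These settings cover the delay only; rotating the User-Agent requires a custom downloader middleware.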

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
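The settings above handle only the delay. For the random User-Agent part, one common approach is a small downloader middleware, sketched here under the assumption that it lives in the project's middlewares.py. The class name `RandomUserAgentMiddleware` and the list of strings are illustrative and not taken from the original article.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    # Illustrative list (assumption); extend it with strings for current browsers.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0',
    ]

    def process_request(self, request, spider):
        # Overwrite the header on every request so the value varies across the crawl.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to continue handling the request normally.
        return None
```

To activate it, the class would be registered in DOWNLOADER_MIDDLEWARES under its module path, for example with a priority such as 400 (the exact number is a judgment call).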

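VI. Handling Dynamic Pages

Some pages load their content dynamically through JavaScript. In that case, Selenium or Splash can be used.

1. Using Selenium

Selenium is a browser automation and testing tool. Because it drives a real browser, it suits pages whose content is loaded dynamically.

Installing Selenium

First install Selenium: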

pip install selenium

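Installing the browser driver

Install the driver that matches your browser, for example ChromeDriver. Recent Selenium releases (4.6 and later) can download a matching driver automatically through Selenium Manager, so this step is mainly needed with older versions:

# Download ChromeDriver, extract it, and place it on the system PATH

Writing the code

Use Selenium to drive the browser and wait for the dynamic content to appear: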

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
service = Service('path/to/chromedriver')
driver = webdriver.Chrome(service=service)
driver.get(URL)
try:
    quotes = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))
    )
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.TAG_NAME, 'small').text
        tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, 'tag')]
        print(f'Text: {text}, Author: {author}, Tags: {tags}')
finally:
    driver.quit()

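2. Using Splash

Splash is a JavaScript rendering service, which also makes it suitable for dynamically loaded pages.

Installing Splash

First install Docker, then start the Splash service: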

docker run -p 8050:8050 scrapinghub/splash

Writing the code

Use scrapy-splash to combine Splash with Scrapy for dynamically loaded pages:

pip install scrapy-splash

In the Scrapy project, add the Splash settings:

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Then write the Spider code:

import scrapy
from scrapy_splash import SplashRequest
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 2})
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield SplashRequest(response.urljoin(next_page), self.parse, args={'wait': 2})