How to Scrape Multiple Pages of Zhihu with a Python Crawler

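Scraping multi-page content from Zhihu involves five steps: fetching page content with the requests library, parsing the HTML with BeautifulSoup, simulating a login to obtain Zhihu's cookies, handling pagination with a loop or recursion, and dealing with anti-scraping mechanisms. The last step is the crucial one, because Zhihu limits clients that send requests too frequently. Proxies, browser-like request headers, and simulated user behavior can all help get around these mechanisms.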

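1. Fetching page content with the requests library

requests is one of the most widely used HTTP libraries in Python. It sends HTTP requests and returns the response content. First, install it: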

pip install requests

Then send a GET request to fetch a Zhihu page:

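import requests

# Placeholder question ID; replace it with a real question URL
url = 'https://www.zhihu.com/question/123456789'
# A browser-like User-Agent, so the request does not announce itself as a script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# A timeout keeps the script from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
print(response.text)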

2. Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML, and it is well suited to extracting data from web pages. Install it first:

pip install beautifulsoup4

Then parse the HTML fetched in the previous step:

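from bs4 import BeautifulSoup

# 'response' is the requests response fetched in the previous step
soup = BeautifulSoup(response.text, 'html.parser')
# The class name is the one this example assumes; check the live markup in the browser's developer tools
questions = soup.find_all('div', class_='QuestionItem-title')
for question in questions:
    print(question.get_text(strip=True))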
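
To see what find_all returns without touching the network, the same parsing step can be tried on a small inline document. This is only a sketch: SAMPLE_HTML and the extract_titles helper are made up for illustration, and the class name is the one assumed throughout this article rather than a guarantee about Zhihu's real markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page (illustration only)
SAMPLE_HTML = """
<div class="QuestionItem-title">First title</div>
<div class="Other">ignored</div>
<div class="QuestionItem-title extra">  Second title </div>
"""


def extract_titles(html):
    """Return the stripped text of every div carrying the assumed title class."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_ matches elements that have this class among their classes
    divs = soup.find_all('div', class_='QuestionItem-title')
    return [div.get_text(strip=True) for div in divs]


if __name__ == '__main__':
    print(extract_titles(SAMPLE_HTML))
```

Keeping the parsing in its own function makes it easy to re-check the selector against saved HTML whenever the page layout changes.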

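3. Simulating login to obtain Zhihu's cookies

Zhihu restricts what visitors can see without logging in, so simulating a login gives access to more content. To do this, send a POST request with requests, carrying the parameters and headers the login needs. The login URL and form fields below are the ones this example assumes; compare them with the actual login request shown in the browser's developer tools, since the real flow may require additional parameters such as a captcha.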

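login_url = 'https://www.zhihu.com/login/phone_num'
login_data = {
    'phone_num': 'your_phone_number',
    'password': 'your_password'
}
# A Session stores the cookies set by the login response and sends them on later requests
session = requests.Session()
session.headers.update(headers)
login_response = session.post(login_url, data=login_data, timeout=10)
# Check the status code and the stored cookies to see whether the login worked
print(login_response.status_code)
print(session.cookies.get_dict())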
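
If the scripted login does not go through, one common fallback is to log in with a normal browser, copy the Cookie request header from the developer tools, and load it into the session. The helper below is a minimal sketch of that idea; parse_cookie_header and the cookie names in the example are hypothetical, not names Zhihu is known to use.

```python
def parse_cookie_header(raw):
    """Turn a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in raw.split(';'):
        # Split on the first '=' only, because cookie values may contain '='
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == '__main__':
    # Hypothetical cookie string; paste the real one copied from the browser
    raw_header = 'session_id=abc123; token=x=y'
    print(parse_cookie_header(raw_header))
```

The resulting dict can then be passed to session.cookies.update(...), after which the session sends those cookies on every request.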

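4. Handling pagination with a loop or recursion

Zhihu content is usually spread across several pages, so the crawler needs a loop or recursion to walk through them. Work out how the link to the next page is formed, then fetch the pages one by one until a page comes back empty.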

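base_url = 'https://www.zhihu.com/question/123456789?page='
max_pages = 50  # safety cap so the loop ends even if the site ignores the page parameter
page = 1
while page <= max_pages:
    response = session.get(base_url + str(page), timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    questions = soup.find_all('div', class_='QuestionItem-title')
    # An empty result means we have gone past the last page
    if not questions:
        break
    for question in questions:
        print(question.get_text(strip=True))
    page += 1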
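
The stopping logic is easier to reason about, and to test offline, when it is separated from the HTTP call. The generator below is a sketch under that assumption: iter_pages and its fetch_titles callback are hypothetical names, and the callback stands in for "fetch page N and return the parsed titles".

```python
def iter_pages(fetch_titles, max_pages=50):
    """Yield the titles of each page until a page is empty, repeats, or the cap is hit."""
    previous = None
    for page in range(1, max_pages + 1):
        titles = fetch_titles(page)
        # Stop on an empty page, or when the site keeps returning the same page
        if not titles or titles == previous:
            break
        yield titles
        previous = titles


if __name__ == '__main__':
    # Fake two-page site used only to demonstrate the loop
    fake_site = {1: ['a', 'b'], 2: ['c']}
    for titles in iter_pages(lambda page: fake_site.get(page, [])):
        print(titles)
```

Comparing each page with the previous one guards against sites that ignore an out-of-range page number and serve the same content again, which would otherwise keep the loop running until the cap.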

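5. Dealing with anti-scraping mechanisms

Zhihu uses several anti-scraping mechanisms, such as IP bans and captchas. The following methods help get around them:

Using proxies: a proxy hides the crawler's real IP address, which helps avoid a ban.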

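proxies = {
    'http': 'http://your_proxy:port',
    # Most proxies accept plain HTTP connections even for HTTPS targets;
    # use an https:// proxy URL only if the proxy itself speaks TLS
    'https': 'http://your_proxy:port'
}
response = session.get(url, proxies=proxies, timeout=10)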
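
A single proxy can still be banned, so crawlers often rotate through a pool. The sketch below shows one way to do that with the standard library; proxy_rotation is a hypothetical helper and the proxy addresses are placeholders.

```python
from itertools import cycle


def proxy_rotation(proxy_urls):
    """Return an endless iterator of requests-style proxies dicts, in round-robin order."""
    urls = list(proxy_urls)
    if not urls:
        raise ValueError('at least one proxy URL is required')
    # The same proxy URL is used for both http and https targets
    return ({'http': url, 'https': url} for url in cycle(urls))


if __name__ == '__main__':
    # Placeholder proxy addresses; replace them with real ones
    rotation = proxy_rotation(['http://proxy1:8080', 'http://proxy2:8080'])
    for _ in range(3):
        print(next(rotation))
```

Each call to next(rotation) gives a dict that can be passed as the proxies argument of session.get.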

Disguising request headers: setting sensible headers such as User-Agent and Referer makes the request look like it comes from a browser.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://www.zhihu.com'
}
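
Sending the exact same User-Agent on every request is itself a recognizable pattern, so some crawlers pick one at random per request. The following is a sketch only: build_headers is a hypothetical helper, and the second User-Agent string is an illustrative value rather than one taken from this article.

```python
import random

# Example User-Agent strings; the first is the one used in this article,
# the second is an illustrative value added for the demonstration
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]


def build_headers(user_agents=USER_AGENTS, referer='https://www.zhihu.com'):
    """Build request headers with a randomly chosen User-Agent and a fixed Referer."""
    return {
        'User-Agent': random.choice(user_agents),
        'Referer': referer,
    }


if __name__ == '__main__':
    print(build_headers())
```

The returned dict can be passed as the headers argument of an individual request, which overrides the session-wide default for that call.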

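Simulating user behavior: delays, random clicks, and similar actions make the crawler behave more like a person and lower the chance of it being detected. With requests alone, a random pause between requests is the simplest form: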

import time
import random
time.sleep(random.uniform(1, 3))
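
The random pause can be wrapped in a small object so that the delay range lives in one place and the crawler just calls wait() between requests. This is a minimal sketch; the Throttle class is hypothetical, and its sleep parameter exists only so the pause can be swapped out when testing.

```python
import random
import time


class Throttle:
    """Pause for a random interval between low and high seconds."""

    def __init__(self, low=1.0, high=3.0, sleep=time.sleep):
        if low < 0 or high < low:
            raise ValueError('expected 0 <= low <= high')
        self.low = low
        self.high = high
        self._sleep = sleep

    def wait(self):
        """Sleep for a randomly chosen delay and return the delay used."""
        delay = random.uniform(self.low, self.high)
        self._sleep(delay)
        return delay


if __name__ == '__main__':
    throttle = Throttle(0.1, 0.3)
    print('slept for', round(throttle.wait(), 2), 'seconds')
```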

6. Example code

The following complete example shows how to scrape multiple pages of Zhihu content:

import random
import time

import requests
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
MAX_PAGES = 50  # safety cap so the loop cannot run forever


def get_html(url, session, proxies=None):
    """Fetch a page through the shared session and return its HTML."""
    response = session.get(url, proxies=proxies, timeout=10)
    return response.text


def parse_html(html):
    """Return the text of every title element found in the page."""
    soup = BeautifulSoup(html, 'html.parser')
    questions = soup.find_all('div', class_='QuestionItem-title')
    return [q.get_text(strip=True) for q in questions]


def main():
    login_url = 'https://www.zhihu.com/login/phone_num'
    base_url = 'https://www.zhihu.com/question/123456789?page='
    login_data = {
        'phone_num': 'your_phone_number',
        'password': 'your_password'
    }
    session = requests.Session()
    # Set the User-Agent once on the session so every request sends it
    session.headers.update({'User-Agent': USER_AGENT})
    session.post(login_url, data=login_data, timeout=10)

    for page in range(1, MAX_PAGES + 1):
        html = get_html(base_url + str(page), session)
        questions = parse_html(html)
        # Stop when a page has no matching elements
        if not questions:
            break
        for question in questions:
            print(question)
        # Pause 1 to 3 seconds between pages to keep the request rate low
        time.sleep(random.uniform(1, 3))


if __name__ == '__main__':
    main()