How to Scrape JavaScript-Rendered Pages with Python

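There are four common ways to scrape JavaScript-rendered pages with Python: Selenium, Pyppeteer, requests-html, and Scrapy-Splash. Of these, Selenium is the most widely used and the most full-featured.

Selenium drives a real browser the way a user would, so the page is actually rendered before you read its data. This makes it a good fit for pages that require complex interaction. It supports multiple browsers and offers a rich API for locating and manipulating page elements precisely.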
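
To see why a plain HTTP fetch is not enough for such pages, consider the minimal sketch below. It parses a small, made-up HTML snippet with the standard-library parser, which does not execute scripts, so the element that JavaScript would fill in stays empty. The `SAMPLE_HTML` string, the `app` id, and the helper names are illustrative assumptions, not taken from any real site.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div id="app"> is filled in by JavaScript at runtime.
SAMPLE_HTML = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById('app').textContent = 'Hello from JS';</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, target_id):
    """Return the text a non-rendering client would see inside `target_id`."""
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The parser never runs the <script>, so the container is still empty.
print(repr(static_text(SAMPLE_HTML, "app")))  # prints ''
```

A browser-based tool would instead report the text written by the script, which is the gap the four approaches in this article are meant to close.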

I. Using Selenium

1. Installing Selenium

First, install the Selenium library and a browser driver. Using Chrome as an example:

pip install selenium

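Download the ChromeDriver release that matches your Chrome version and place it in a directory on your system PATH. With Selenium 4.6 or later, the bundled Selenium Manager can usually fetch a matching driver automatically, so this manual step may not be needed.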
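
A quick way to confirm that the driver really is on the PATH is to look it up from Python before starting a browser. The sketch below uses only the standard library; the helper name `find_driver` is hypothetical, and the executable name `chromedriver` is assumed (on Windows, `shutil.which` also tries the usual executable extensions).

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path:
    print(f"chromedriver found at {path}")
else:
    print("chromedriver is not on PATH")
```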

2. Initializing the browser

Open a browser instance with Selenium:

from selenium import webdriver

# Initialize the Chrome browser
driver = webdriver.Chrome()
# Open the target page
driver.get('https://example.com')
# Get the rendered page source
content = driver.page_source
print(content)
# Close the browser
driver.quit()
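
For unattended scraping it is often convenient to run Chrome without a visible window and to make sure the browser is closed even when loading fails. The following is a minimal sketch under those assumptions, not a definitive implementation: the function name `fetch_rendered_html` is hypothetical, and the `--headless=new` flag assumes a reasonably recent Chrome.

```python
def fetch_rendered_html(url, headless=True):
    """Open `url` in Chrome and return the HTML after JavaScript has run."""
    # Imported here so the helper can be defined even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Selects Chrome's newer headless mode (assumes a recent Chrome).
        options.add_argument("--headless=new")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always release the browser process, even if get() raises.
        driver.quit()
```

Calling `fetch_rendered_html('https://example.com')` would then return the same string as `driver.page_source` in the snippet above, without leaving a browser process behind.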

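3. Waiting for the page to load

Some content is loaded dynamically, so you need to wait before the full data is available. Selenium provides two kinds of wait, implicit and explicit:

Implicit wait: sets a global timeout. Whenever an element is looked up, WebDriver keeps polling for up to that long until the element appears.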

driver.implicitly_wait(10)  # wait up to 10 seconds for each element lookup

Explicit wait: execution continues only once a specified condition is met.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for an element to be present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'element_id'))
)
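
Conceptually, an explicit wait is a polling loop: check a condition, sleep briefly, and give up after a timeout. The sketch below illustrates that idea with the standard library only. `wait_until` and `element_ready` are hypothetical names, and this is not Selenium's actual implementation; the 0.5 second default simply mirrors WebDriverWait's default poll interval.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "the element has appeared": becomes truthy on the third check.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, poll=0.01))  # prints element
```

The same shape explains why an explicit wait returns as soon as the condition holds instead of always sleeping for the full timeout.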

4. Interacting with page elements

Selenium can locate elements in several ways and then click them, type into them, and so on:

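from selenium.webdriver.common.by import By

# Locate an element by ID and click it
element = driver.find_element(By.ID, 'element_id')
element.click()
# Locate an input by CSS selector, type a query, and submit the form
input_element = driver.find_element(By.CSS_SELECTOR, 'input[name="q"]')
input_element.send_keys('Python crawler')
input_element.submit()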
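
Scraping usually means reading data from many elements, not just acting on one. As a hedged illustration, the sketch below collects the text and target of every link on the current page. The helper name `extract_links` and the `a[href]` selector are assumptions chosen for the example, and `driver` is expected to be an already initialized WebDriver.

```python
def extract_links(driver):
    """Return (text, href) pairs for every link on the current page."""
    # Imported here so the helper can be defined even where Selenium is absent.
    from selenium.webdriver.common.by import By

    links = []
    # find_elements (plural) returns a list, which is empty when nothing matches.
    for anchor in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        links.append((anchor.text, anchor.get_attribute("href")))
    return links
```

Unlike `find_element`, which raises an exception when nothing matches, `find_elements` simply returns an empty list, which tends to be easier to handle in a scraper.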

II. Using Pyppeteer

Pyppeteer is an unofficial Python port of Puppeteer. It can drive a headless browser for scraping and testing.

1. Installing Pyppeteer

pip install pyppeteer

2. Using Pyppeteer

import asyncio
from pyppeteer import launch


async def main():
    # Launch the browser (headless by default)
    browser = await launch()
    page = await browser.newPage()
    # Open the target page
    await page.goto('https://example.com')
    # Wait until the target element appears
    await page.waitForSelector('#element_id')
    # Get the page content
    content = await page.content()
    print(content)
    # Close the browser
    await browser.close()


# Run the async function
asyncio.run(main())
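
Because Pyppeteer is built on asyncio, several pages can be fetched concurrently, and it is usually wise to cap how many are open at once. The sketch below shows that pattern in isolation. `fetch_all` is a hypothetical helper, and `fake_fetch` is a stand-in for a coroutine that would open a page and return its HTML, so the example runs without a browser.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=3):
    """Run `fetch(url)` for every URL with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a coroutine that returns a page's HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = [f"https://example.com/page/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls, fake_fetch))
print(len(pages))  # prints 5
```

In a real scraper, `fake_fetch` would be replaced by a coroutine that opens a new page, waits for a selector, and returns `await page.content()`.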

III. Using requests-html

requests-html combines the requests library with JavaScript rendering in a single, easy-to-use package.

1. Installing requests-html

pip install requests-html

2. Using requests-html

from requests_html import HTMLSession

# Create a session
session = HTMLSession()
# Fetch the target page
response = session.get('https://example.com')
# Render JavaScript (the first call downloads Chromium)
response.html.render()
# Get the rendered page content
content = response.html.html
print(content)
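
Once the page has been rendered, the same `response.html` object can be queried with CSS selectors instead of printing the whole document. The following is a minimal sketch, assuming a hypothetical helper name `rendered_texts` and illustrative `sleep` and `timeout` values that would need tuning for a real site.

```python
def rendered_texts(url, selector):
    """Render `url` with requests-html and return the text of matching elements."""
    # Imported here so the helper can be defined even where requests-html is absent.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # sleep gives scripts a moment to run after load; timeout caps the render.
        response.html.render(sleep=1, timeout=20)
        return [element.text for element in response.html.find(selector)]
    finally:
        session.close()
```

For example, `rendered_texts('https://example.com', 'h1')` would return the text of each `h1` element found after rendering.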

IV. Using Scrapy-Splash

Scrapy-Splash is a Scrapy extension that renders JavaScript pages through Splash, a separate rendering service.

1. Installing Scrapy-Splash

pip install scrapy-splash
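
The pip package only installs the Scrapy integration; the Splash service itself has to be running, commonly from its Docker image with port 8050 published. Before wiring it into Scrapy, it can help to query Splash's `render.html` HTTP endpoint directly. The sketch below assumes Splash is listening on `http://localhost:8050`; the helper names `splash_render_url` and `fetch_via_splash` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(target, wait=2, splash_url="http://localhost:8050"):
    """Build the URL of Splash's render.html endpoint for `target`."""
    query = urlencode({"url": target, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_via_splash(target, **kwargs):
    """Fetch the rendered HTML of `target`; needs a running Splash instance."""
    with urlopen(splash_render_url(target, **kwargs), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the URL needs no network access.
print(splash_render_url("https://example.com"))
# prints http://localhost:8050/render.html?url=https%3A%2F%2Fexample.com&wait=2
```

If `fetch_via_splash('https://example.com')` returns HTML, the service is reachable and the `SPLASH_URL` setting can point at the same address.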

2. Configuring Scrapy-Splash

Add the following to the Scrapy project's settings.py:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

3. Using Scrapy-Splash

Use SplashRequest in a Scrapy spider:

import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            # Ask Splash to wait 2 seconds so scripts can run before rendering
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # Process the rendered page content
        content = response.body
        print(content)