How to Save a Web Page with Python

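There are several ways to save a web page with Python: download the page with the requests library, parse it with BeautifulSoup before saving, drive a real browser with Selenium, or run a headless browser with Pyppeteer. The requests approach is the simplest and the most common: it sends an HTTP request, receives the page's HTML, and writes that HTML to a local file. Because requests does not execute JavaScript, pages that build their content in the browser need one of the browser-based methods. The sections below walk through each method, starting with requests.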

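I. Downloading page content with the requests library

1. Install and import requests

First, make sure the requests library is installed in your Python environment. If it is not, install it with: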

pip install requests

Once it is installed, import requests in your Python script:

import requests

2. Send an HTTP request to fetch the page

Use the get method of requests to send an HTTP GET request and retrieve the page's HTML:

url = 'http://example.com'
response = requests.get(url)
html_content = response.text

Here the URL of the page to save is assigned to url. requests.get(url) sends the request and returns a response object, and response.text holds the page's HTML decoded to a string.
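The short snippet above assumes that the request succeeds and that the text is decoded with the right encoding. A slightly more defensive sketch follows. The function name `fetch_html`, the `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the original example.

```python
def fetch_html(url, session=None, timeout=10.0):
    """Return the decoded HTML of `url`, raising on HTTP error statuses.

    `session` is a hypothetical injection point: any object with a
    requests-style `.get()` method works (for example a requests.Session).
    """
    if session is None:
        import requests  # imported lazily; only needed for the default session
        session = requests.Session()

    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = session.get(url, timeout=timeout)

    # Turn 4xx/5xx responses into an exception instead of saving an error page.
    response.raise_for_status()

    # When the server sends no charset, requests falls back to ISO-8859-1 for
    # text/* content types, which garbles many non-Latin pages. In that case
    # use the encoding guessed from the response body instead.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding

    return response.text
```

With this helper, `html_content = fetch_html('http://example.com')` would replace the three lines of the basic example.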

3. Save the HTML to a local file

Once you have the page's HTML, write it to a local file:

with open('example.html', 'w', encoding='utf-8') as file:
    file.write(html_content)

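This code uses Python's built-in open function to open example.html in write mode with UTF-8 encoding, then writes the HTML with file.write(html_content). The with statement closes the file automatically when the block ends.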
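When saving more than one page, a hard-coded file name becomes awkward. The sketch below derives a file name from the URL and writes the file with pathlib. The helper names `filename_for_url` and `save_html` are hypothetical, and the naming scheme is deliberately simple: it ignores query strings and does not sanitize every character a URL may contain.

```python
from pathlib import Path
from urllib.parse import urlparse


def filename_for_url(url):
    """Build a simple local file name such as 'example.com_docs_intro.html'."""
    parsed = urlparse(url)
    # Join host and path, then replace path separators with underscores.
    stem = (parsed.netloc + parsed.path).strip("/").replace("/", "_")
    if not stem:
        stem = "page"
    if not stem.endswith((".html", ".htm")):
        stem += ".html"
    return stem


def save_html(html_content, url, directory="."):
    """Write `html_content` as UTF-8 into `directory` and return the path."""
    target = Path(directory) / filename_for_url(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html_content, encoding="utf-8")
    return target
```

For example, `save_html(html_content, 'http://example.com', 'pages')` would write `pages/example.com.html`.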

II. Parsing and saving with BeautifulSoup

1. Install and import BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML documents. Install it together with the lxml parser:

pip install beautifulsoup4 lxml

Once they are installed, import them in your Python script:

from bs4 import BeautifulSoup
import requests

2. Fetch and parse the page

After fetching the page with requests, parse it with BeautifulSoup:

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

Here, soup is the parsed HTML document.

3. Save the parsed content

The parsed page can be written to a local file in a structured, indented form:

with open('structured_example.html', 'w', encoding='utf-8') as file:
    file.write(soup.prettify())

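The prettify method re-indents the HTML so that it is easier to read. Because it inserts whitespace and line breaks, the saved file is no longer identical to what the server sent.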
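Parsing also makes it possible to adjust the page before saving it. A saved copy often has broken links and images because relative URLs no longer resolve from the local file. The sketch below rewrites href and src attributes to absolute URLs. The function name `absolutize_links` is hypothetical, and it defaults to Python's built-in `html.parser` so that it runs without lxml; pass `parser="lxml"` to match the example above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutize_links(html, base_url, parser="html.parser"):
    """Return `html` with relative href/src attributes made absolute."""
    soup = BeautifulSoup(html, parser)
    for attribute in ("href", "src"):
        # Select every tag that carries this attribute, whatever the tag name.
        for tag in soup.find_all(attrs={attribute: True}):
            # urljoin leaves URLs that are already absolute unchanged.
            tag[attribute] = urljoin(base_url, tag[attribute])
    return str(soup)
```

The result of `absolutize_links(response.text, url)` can be written to a file in the same way as `soup.prettify()`.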

III. Simulating a browser with Selenium

1. Install and configure Selenium

Selenium is a tool for automating web browsers. Install it first:

pip install selenium

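Selenium also needs a browser driver such as ChromeDriver. Selenium 4.6 and later include Selenium Manager, which normally locates or downloads a matching driver automatically. With older versions, download the driver yourself and add its location to the system PATH environment variable.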

2. Fetch the page with Selenium

Import and use Selenium in your Python script:

from selenium import webdriver
driver = webdriver.Chrome()  # or webdriver.Firefox(), etc.
driver.get('http://example.com')
html_content = driver.page_source
driver.quit()

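Here, the page_source attribute returns the HTML of the page currently loaded in the browser.

3. Save the fetched page

As in the previous methods, write the HTML to a file: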

with open('selenium_example.html', 'w', encoding='utf-8') as file:
    file.write(html_content)
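The steps in this section can be combined into one helper that always closes the browser, even when loading the page fails. This is a minimal sketch: the function names `make_headless_chrome` and `save_rendered_page` are hypothetical, and the `--headless=new` argument assumes a recent Chrome build.

```python
from pathlib import Path


def make_headless_chrome():
    """Create a Chrome driver that runs without opening a window."""
    from selenium import webdriver  # imported lazily; needs selenium installed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome build
    return webdriver.Chrome(options=options)


def save_rendered_page(driver, url, path):
    """Load `url` in `driver`, write the page HTML to `path`, then quit."""
    try:
        driver.get(url)
        Path(path).write_text(driver.page_source, encoding="utf-8")
    finally:
        # Always release the browser process, even if loading or writing fails.
        driver.quit()
    return Path(path)
```

A call would look like `save_rendered_page(make_headless_chrome(), 'http://example.com', 'selenium_example.html')`. Passing the driver in as a parameter also makes it easy to switch to Firefox or another browser.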

IV. Running a headless browser with Pyppeteer

1. Install and configure Pyppeteer

Pyppeteer is an unofficial Python port of Puppeteer for controlling a headless browser. Install it first:

pip install pyppeteer

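2. Fetch the page with Pyppeteer

Use Pyppeteer in your Python script. Its API is asynchronous, so the code runs inside a coroutine. The first call to launch() downloads a compatible Chromium build if none is present: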

import asyncio
from pyppeteer import launch

async def main():
    # Start a headless Chromium instance and open a new tab
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://example.com')
    # content() returns the full HTML of the loaded page
    html_content = await page.content()
    await browser.close()
    with open('pyppeteer_example.html', 'w', encoding='utf-8') as file:
        file.write(html_content)

# asyncio.run() creates an event loop, runs main() to completion, and closes the loop
asyncio.run(main())
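Pages that load content after the initial response may be saved incomplete if the HTML is read too early. The sketch below waits for network activity to settle before reading the page, and reuses one browser for several URLs. The function names `render_html` and `render_pages` are hypothetical, and `networkidle2` is one of several `waitUntil` values Pyppeteer accepts; which one suits a given site is an assumption to verify.

```python
import asyncio


async def render_html(browser, url, wait_until="networkidle2"):
    """Open `url` in a new tab of `browser` and return the rendered HTML."""
    page = await browser.newPage()
    try:
        # Wait until network activity has mostly settled, so that content
        # loaded after the initial response is included.
        await page.goto(url, {"waitUntil": wait_until})
        return await page.content()
    finally:
        await page.close()


async def render_pages(urls):
    """Render several pages with a single browser; returns {url: html}."""
    from pyppeteer import launch  # imported lazily; needs pyppeteer installed

    browser = await launch()
    try:
        results = {}
        for url in urls:
            results[url] = await render_html(browser, url)
        return results
    finally:
        await browser.close()
```

Calling `asyncio.run(render_pages(['http://example.com']))` would return a dictionary whose values can be written to files as shown earlier.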