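How to Scrape Images from Multiple Pages in a Loop with Python

To scrape images from multiple pages with Python, combine an HTTP request library, an HTML parser, and a loop: request each page, parse its content, and download the images it references. The approach described here uses requests for the HTTP requests, BeautifulSoup to parse the HTML, and os to create the output directory and build file paths.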

I. Preparation

Before you start, make sure the required Python libraries are installed:

pip install requests
pip install beautifulsoup4

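II. Overview of the Basic Steps

Send an HTTP request: use requests to fetch the page's HTML.
Parse the HTML: use BeautifulSoup to find the image URLs.
Save the images: write the downloaded bytes to a local folder, using os to create the folder and build file paths.
Loop over the pages: repeat the steps above for each page.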

III. Step-by-Step Walkthrough

1. Send the HTTP request

First, send an HTTP request to get the page's HTML. For example:

import requests
url = "http://example.com/page1"
response = requests.get(url)
html_content = response.text
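
The request above sets no timeout and does not check the status code, so a stalled server or an error page would go unnoticed. Here is a minimal sketch of a more defensive fetch. The helper name `fetch_html` is hypothetical, and it is assumed to receive any object with a `get` method, such as `requests.Session()`:

```python
def fetch_html(session, url, timeout=10):
    """Return the HTML text of `url`, or None if the request fails.

    `session` is any object exposing get(url, timeout=...), for example
    requests.Session(). fetch_html is a hypothetical helper name.
    """
    try:
        response = session.get(url, timeout=timeout)
        # With requests, this raises an exception for 4xx/5xx responses.
        response.raise_for_status()
    except Exception as exc:  # requests raises requests.exceptions.RequestException
        print(f"Failed to fetch {url}: {exc}")
        return None
    return response.text


# Usage with requests (needs network access):
#   import requests
#   html_content = fetch_html(requests.Session(), "http://example.com/page1")
```

Passing the session in as an argument also lets several requests share one connection pool, which usually helps when many pages come from the same host.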

2. Parse the HTML

Use BeautifulSoup to parse the HTML and find the image URLs:

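from bs4 import BeautifulSoup
# Parse the page with Python's built-in HTML parser
soup = BeautifulSoup(html_content, 'html.parser')
image_tags = soup.find_all('img')
for img_tag in image_tags:
    # .get() returns None instead of raising KeyError when src is missing
    img_url = img_tag.get('src')
    if img_url:
        print(img_url)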
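
The src values found this way are often relative (for example `/static/a.jpg`) or protocol-relative (`//cdn.example.com/c.gif`), and requests cannot download them as they stand. One way to handle this, sketched below with the standard library's `urljoin`, is to resolve each value against the URL of the page it came from. The helper name `resolve_image_urls` and the sample URLs are made up for illustration:

```python
from urllib.parse import urljoin


def resolve_image_urls(page_url, raw_srcs):
    """Turn raw src attribute values into absolute URLs, preserving order."""
    absolute_urls = []
    for src in raw_srcs:
        src = (src or "").strip()
        # Skip missing or empty attributes and inline data: URIs.
        if not src or src.startswith("data:"):
            continue
        absolute_urls.append(urljoin(page_url, src))
    return absolute_urls


page_url = "http://example.com/gallery/page1"
raw_srcs = ["/static/a.jpg", "b.png", "//cdn.example.com/c.gif", None,
            "http://example.com/d.jpg"]
for img_url in resolve_image_urls(page_url, raw_srcs):
    print(img_url)
```

With BeautifulSoup, the raw values could be collected with `[img.get('src') for img in soup.find_all('img')]` and passed in as `raw_srcs`.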

3. Save the images

Save each image locally, using os to create the target directory:

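import os
from urllib.parse import urljoin
def save_image(img_url, folder_path):
    # Download the image and write its bytes to folder_path
    response = requests.get(img_url)
    if response.status_code == 200:
        with open(os.path.join(folder_path, img_url.split('/')[-1]), 'wb') as file:
            file.write(response.content)
# Create the folder if it does not exist
folder_path = 'images'
os.makedirs(folder_path, exist_ok=True)
# Save every image found on the page
for img_tag in image_tags:
    img_url = img_tag.get('src')
    if img_url:
        # Resolve relative src values against the page URL from step 1
        save_image(urljoin(url, img_url), folder_path)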
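
Taking the file name from `img_url.split('/')[-1]` keeps any query string (such as `cat.jpg?size=large`) and yields an empty name when the URL ends in a slash. A possible alternative, assuming a hypothetical helper `filename_from_url` and a made-up fallback name, parses the URL first:

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(img_url, default="image.jpg"):
    """Derive a local file name from an image URL.

    `default` is a made-up fallback used when the URL has no usable name.
    """
    # .path drops the ?query and #fragment parts of the URL.
    path = urlparse(img_url).path
    # Decode %20-style escapes, then keep only the last path segment.
    name = os.path.basename(unquote(path))
    if name in ("", ".", ".."):
        return default
    return name


print(filename_from_url("http://example.com/img/cat.jpg?size=large"))
print(filename_from_url("http://example.com/img/my%20cat.png"))
print(filename_from_url("http://example.com/img/"))
```

The result could replace `img_url.split('/')[-1]` inside `save_image`. Images from different pages that share a name would still overwrite each other, so a page-number prefix may be worth adding.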

4. Loop over multiple pages

Use a loop to visit each page and repeat the steps above:

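from urllib.parse import urljoin
base_url = "http://example.com/page"
num_pages = 10
for i in range(1, num_pages + 1):
    # Page URLs are assumed to be base_url followed by the page number
    url = f"{base_url}{i}"
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    image_tags = soup.find_all('img')
    for img_tag in image_tags:
        img_url = img_tag.get('src')
        if img_url:
            # Resolve relative src values against the page URL
            save_image(urljoin(url, img_url), folder_path)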
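
Appending the page number to `base_url` only works when the number is the last part of the URL. Many sites put it in a query parameter or pad it with zeros instead. A small sketch that covers these cases with a format template follows. The helper name `build_page_urls` and the URL patterns are assumptions, not taken from any particular site:

```python
def build_page_urls(url_template, first_page, last_page):
    """Return the page URLs for first_page..last_page (inclusive).

    url_template must contain a {page} placeholder, e.g.
    "http://example.com/list?page={page}".
    """
    return [url_template.format(page=n)
            for n in range(first_page, last_page + 1)]


for url in build_page_urls("http://example.com/list?page={page}", 1, 3):
    print(url)

# Zero-padded page numbers work through the format spec:
for url in build_page_urls("http://example.com/page{page:02d}.html", 9, 10):
    print(url)
```

The loop over `range(1, num_pages + 1)` can then iterate over the returned list instead.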

IV. Complete Example

Putting the steps together gives the following complete example:

import os
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def save_image(img_url, folder_path):
    # Download one image and write it into folder_path
    response = requests.get(img_url)
    if response.status_code == 200:
        with open(os.path.join(folder_path, img_url.split('/')[-1]), 'wb') as file:
            file.write(response.content)
def scrape_images(base_url, num_pages, folder_path):
    # Create the output folder if it does not exist
    os.makedirs(folder_path, exist_ok=True)
    for i in range(1, num_pages + 1):
        url = f"{base_url}{i}"
        response = requests.get(url)
        html_content = response.text
        soup = BeautifulSoup(html_content, 'html.parser')
        image_tags = soup.find_all('img')
        for img_tag in image_tags:
            # Skip <img> tags that have no src attribute
            img_url = img_tag.get('src')
            if img_url:
                # Resolve relative src values against the page URL
                save_image(urljoin(url, img_url), folder_path)
# Usage example
base_url = "http://example.com/page"
num_pages = 10
folder_path = 'images'
scrape_images(base_url, num_pages, folder_path)
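
When the same image appears on several pages, or the script is run again after an interruption, the example above downloads everything a second time. A rough sketch of one way to avoid that is to filter the URL list before downloading. The helper name `select_new_downloads` is hypothetical, and it assumes file names are derived as in `save_image`:

```python
import os


def select_new_downloads(img_urls, folder_path, seen_urls):
    """Return the URLs that still need downloading, preserving order.

    A URL is skipped if it was already seen in this run (tracked in the
    seen_urls set, which is updated in place) or if a file with the same
    name already exists in folder_path.
    """
    pending = []
    for img_url in img_urls:
        if img_url in seen_urls:
            continue
        seen_urls.add(img_url)
        # Same naming rule as save_image: last segment of the URL.
        target = os.path.join(folder_path, img_url.split('/')[-1])
        if not os.path.exists(target):
            pending.append(img_url)
    return pending


# Usage inside the page loop (seen_urls is created once, before the loop):
#   seen_urls = set()
#   for img_url in select_new_downloads(page_image_urls, folder_path, seen_urls):
#       save_image(img_url, folder_path)
```

Keeping the set outside the function lets it persist across pages, so duplicates are detected for the whole run rather than per page.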

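V. Notes

Anti-scraping measures: some sites block clients that send requests too frequently. Sending browser-like request headers and pausing between requests reduces the chance of being blocked.
URL format: make sure the URLs are built correctly, especially the pagination part.
Exception handling: handle exceptions so that a network error or similar problem does not stop the whole run.

The version below adds a browser-like User-Agent header, exception handling, and a delay between pages: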

import os
import time
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def save_image(img_url, folder_path, headers=None):
    # Send the same headers as the page request and give up after 10 seconds
    response = requests.get(img_url, headers=headers, timeout=10)
    if response.status_code == 200:
        with open(os.path.join(folder_path, img_url.split('/')[-1]), 'wb') as file:
            file.write(response.content)
def scrape_images(base_url, num_pages, folder_path):
    os.makedirs(folder_path, exist_ok=True)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    for i in range(1, num_pages + 1):
        url = f"{base_url}{i}"
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            # Skip this page but keep going with the next one
            print(f"Error fetching {url}: {e}")
            continue
        soup = BeautifulSoup(response.text, 'html.parser')
        for img_tag in soup.find_all('img'):
            img_url = img_tag.get('src')
            if not img_url:
                continue
            try:
                # Resolve relative src values against the page URL
                save_image(urljoin(url, img_url), folder_path, headers)
            except (requests.exceptions.RequestException, OSError) as e:
                # One failed image should not abort the rest of the page
                print(f"Error saving {img_url}: {e}")
        # Pause between pages to reduce the risk of being blocked
        time.sleep(2)
# Usage example
base_url = "http://example.com/page"
num_pages = 10
folder_path = 'images'
scrape_images(base_url, num_pages, folder_path)
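
The code above gives up on a page after a single failed request. For temporary network problems, retrying with an increasing delay is a common complement to the exception handling described in the notes. A minimal sketch follows. The helper name `retry_with_backoff` and its default values are illustrative choices, not part of requests:

```python
import time


def retry_with_backoff(action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts. If every attempt fails, the last exception is re-raised.
    The sleep parameter exists so the wait can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Usage with requests (needs network access):
#   import requests
#   def fetch():
#       response = requests.get(url, headers=headers, timeout=10)
#       response.raise_for_status()
#       return response
#   response = retry_with_backoff(fetch)
```

Catching `Exception` keeps the sketch independent of any library. In a real script it would be reasonable to narrow this to `requests.exceptions.RequestException`.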