How to Scrape Images with a Python Crawler

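There are several ways to scrape images with Python; commonly used libraries include requests, BeautifulSoup, and Selenium. The combination of requests and BeautifulSoup is the most common: requests sends the HTTP request and retrieves the page content, BeautifulSoup parses the HTML and extracts the image links, and the images are then downloaded. The steps are: send the HTTP request, parse the HTML content, extract the image links, and download and save the images. The sections below describe how to implement image scraping with these libraries.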

1. Sending the HTTP Request

The first step in scraping images is to send an HTTP request to fetch the page's HTML. Python's requests library is well suited to this task.

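import requests
url = 'https://example.com'
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
if response.status_code == 200:
    # Raw bytes; BeautifulSoup works out the encoding when it parses them
    html_content = response.content
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")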

The code above uses requests.get() to send an HTTP GET request and stores the result in the response object. If the request succeeds (status code 200), the HTML is saved in the html_content variable.

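2. Parsing the HTML Content

Once you have the page's HTML, the next step is to parse it so that the image links can be extracted. The BeautifulSoup library is well suited to parsing HTML.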

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

BeautifulSoup parses the HTML into a soup object that is convenient to query in the following steps.

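3. Extracting the Image Links

With the parsed soup object, extracting image links is straightforward. Image links are usually stored in the src attribute of <img> tags.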

image_tags = soup.find_all('img')
image_urls = [img['src'] for img in image_tags if 'src' in img.attrs]

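The code above uses soup.find_all('img') to find every <img> tag, then uses a list comprehension to read each tag's src attribute, skipping tags that have none. The result is image_urls, a list of all image links found on the page.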
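Some pages keep the real image address in an attribute other than src, for example when images are lazy-loaded. The following is a minimal sketch, not part of the steps above, of a helper that picks a URL from an <img> tag's attribute dictionary (img.attrs in BeautifulSoup). The function name pick_image_url and the attribute names data-src and data-original are assumptions based on common lazy-loading conventions; a given site may use different ones.

```python
def pick_image_url(attrs):
    """Return the most likely image URL from an <img> attribute dict, or None.

    Hypothetical helper: the attribute names checked here are assumptions.
    """
    for name in ("src", "data-src", "data-original"):
        value = attrs.get(name)
        # Skip empty values and inline "data:" placeholders used by lazy loaders
        if value and not value.strip().startswith("data:"):
            return value.strip()
    srcset = attrs.get("srcset")
    if srcset:
        # srcset is a comma-separated list of "URL descriptor" entries;
        # take the URL of the first entry (a simplification)
        first_entry = srcset.split(",")[0].strip()
        if first_entry:
            return first_entry.split()[0]
    return None
```

With the image_tags list from the code above, this could be applied as image_urls = [u for u in (pick_image_url(img.attrs) for img in image_tags) if u]. Splitting srcset on commas is a simplification that can misread URLs containing commas.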

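4. Downloading and Saving the Images

The final step is to download the images and save them. The get() method of requests can download each image, and the response body can then be written to a local file.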

import os
image_folder = 'downloaded_images'
if not os.path.exists(image_folder):
    os.makedirs(image_folder)
for i, image_url in enumerate(image_urls):
    try:
        image_response = requests.get(image_url)
        if image_response.status_code == 200:
            image_path = os.path.join(image_folder, f'image_{i+1}.jpg')
            with open(image_path, 'wb') as file:
                file.write(image_response.content)
            print(f"Downloaded {image_url}")
        else:
            print(f"Failed to download {image_url}. Status code: {image_response.status_code}")
    except Exception as e:
        print(f"An error occurred while downloading {image_url}: {e}")

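The code above first creates the downloaded_images folder if it does not already exist. It then loops over the image links, downloads each one with requests.get(), and saves it to that folder. The files are named 'image_1.jpg', 'image_2.jpg', and so on, regardless of each image's actual format.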
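Because every file above is saved with a .jpg extension, it can be useful to reuse the file name from the URL instead. The following is a small sketch under that assumption; filename_from_url is a hypothetical helper, and the numeric prefix is only one possible way to avoid name collisions.

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(image_url, index):
    """Build a local file name from an image URL (hypothetical helper)."""
    # Use only the path component, so query strings such as ?w=200 are dropped
    path = unquote(urlparse(image_url).path)
    name = os.path.basename(path)
    if not name or "." not in name:
        # No usable file name in the URL: fall back to the numbered scheme
        return f"image_{index + 1}.jpg"
    # Prefix with the position so that identical names do not overwrite each other
    return f"{index + 1:04d}_{name}"
```

In the download loop, image_path could then be built as os.path.join(image_folder, filename_from_url(image_url, i)). The sketch does not sanitize unusual characters in the name, which a real crawler may need to do.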

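5. Handling Dynamically Loaded Images

Some sites load their images dynamically with JavaScript, so requests and BeautifulSoup may not find those image links in the raw HTML. In that case, the Selenium library can be used to handle the dynamically loaded content.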

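from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
# Initialize the Selenium WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
url = 'https://example.com'
try:
    driver.get(url)
    # Wait up to 10 seconds for an <img> element to appear (adjust the timeout or condition as needed)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'img')))
    # Get the rendered page source
    html_content = driver.page_source
finally:
    # Close the browser, even if the wait times out
    driver.quit()
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Extract the image links
image_tags = soup.find_all('img')
image_urls = [img['src'] for img in image_tags if 'src' in img.attrs]
# Download and save the images (same as the earlier steps)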

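The code above uses Selenium to launch a browser and open the target page. driver.page_source returns the page source as rendered by the browser; BeautifulSoup then parses that HTML, and the image links are extracted and downloaded as before.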

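6. Handling Relative Image Links

In practice, many image links are relative paths and must be converted to absolute URLs before they can be downloaded. The urljoin() function in the urllib.parse module handles this.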

from urllib.parse import urljoin
base_url = 'https://example.com'
image_urls = [urljoin(base_url, img['src']) for img in image_tags if 'src' in img.attrs]

The code above uses urljoin() to resolve each relative image link against base_url. Links that are already absolute are returned unchanged.
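To make the behavior of urljoin() concrete, here is a short sketch that resolves several kinds of links against a page URL and keeps only http and https results. The function name resolve_image_urls and the sample URLs are illustrative assumptions, not part of the code above.

```python
from urllib.parse import urljoin, urlparse


def resolve_image_urls(page_url, links):
    """Resolve links against page_url and keep only http(s) URLs (hypothetical helper)."""
    resolved = []
    for link in links:
        absolute = urljoin(page_url, link.strip())
        # Drop inline data: URIs and other schemes that cannot be fetched over HTTP
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


sample_links = [
    "images/a.png",                      # relative to the page's directory
    "/static/b.jpg",                     # relative to the site root
    "//cdn.example.com/c.gif",           # protocol-relative
    "https://other.example.org/d.webp",  # already absolute
    "data:image/gif;base64,R0lGOD",      # inline image, filtered out
]
print(resolve_image_urls("https://example.com/gallery/index.html", sample_links))
```

Using the full URL of the page that contained the links as the base, rather than only the site root, matters for links like images/a.png, which resolve relative to the page's directory.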

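7. Dealing with Anti-Scraping Measures

Some sites use anti-scraping measures such as inspecting request headers or limiting the request rate. Setting request headers and adding delays between requests can help work around these limits.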

import time
import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i, image_url in enumerate(image_urls):
    try:
        image_response = requests.get(image_url, headers=headers)
        if image_response.status_code == 200:
            image_path = os.path.join(image_folder, f'image_{i+1}.jpg')
            with open(image_path, 'wb') as file:
                file.write(image_response.content)
            print(f"Downloaded {image_url}")
            # Random delay to avoid triggering anti-scraping measures
            time.sleep(random.uniform(1, 3))
        else:
            print(f"Failed to download {image_url}. Status code: {image_response.status_code}")
    except Exception as e:
        print(f"An error occurred while downloading {image_url}: {e}")

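The code above sets the User-Agent request header to mimic a browser, and waits a random 1 to 3 seconds after each successful download to avoid triggering anti-scraping measures.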
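When a site limits the request rate, a failed request can also be retried after a growing delay instead of being skipped. The sketch below illustrates that idea and is not part of the code above: fetch_with_backoff is a hypothetical helper, fetch stands for any function that downloads a URL and raises an exception on failure, and the delay values are arbitrary assumptions.

```python
import random
import time


def fetch_with_backoff(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with growing, jittered delays on failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error
                raise
            # Delay doubles each time (1s, 2s, 4s, ...) plus up to 0.5s of jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)
```

With requests, fetch could be a small function that calls requests.get(url, headers=headers, timeout=10) and then response.raise_for_status(), since requests.get() does not raise on HTTP error status codes by itself.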

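8. Summary

The steps above are enough to scrape images with Python: send the HTTP request, parse the HTML content, extract the image links, and download and save the images. For dynamically loaded images, use Selenium. For relative image links, use urljoin(). For anti-scraping measures, set request headers or add delays between requests.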

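9. Common Problems and Solutions

Broken or inaccessible image links: Some image links may be dead or unreachable. Check the response status code and skip links that fail.
Slow downloads: Download images concurrently, for example with multiple threads or processes.
Slow page loads: Increase the wait time, or use an explicit wait to make sure the content has finished loading.
Blocked IP address: If frequent requests to a site get your IP blocked, a proxy IP can be used to get around the block.
Inconsistent image formats: A site may serve images in several formats. Check the Content-Type response header to determine each image's format and save it with the matching file extension.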
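As an illustration of the last item, the following sketch maps a Content-Type header value to a file extension. The function name extension_for and the contents of the mapping table are assumptions; the table covers only a few common image types and falls back to the MIME subtype for the rest.

```python
# Assumed mapping for a few common image types; extend as needed
KNOWN_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
    "image/bmp": ".bmp",
    "image/svg+xml": ".svg",
}


def extension_for(content_type):
    """Return a file extension for an image Content-Type, or None if it is not an image."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=binary" and normalize the case
    mime = content_type.split(";", 1)[0].strip().lower()
    if not mime.startswith("image/"):
        return None
    # Unknown image types fall back to the subtype, e.g. image/avif -> .avif
    return KNOWN_EXTENSIONS.get(mime, "." + mime.split("/", 1)[1])
```

A download loop could call extension_for(image_response.headers.get('Content-Type')) and skip the response when the result is None, which also filters out error pages returned as HTML.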

10. A Complete Example

Below is a complete example showing how to scrape images with Python.

import os
import random
import time
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def download_images(url, folder='downloaded_images'):
    # Send the HTTP request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return
    # Parse the HTML content
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract the image links and resolve them against the page URL
    image_tags = soup.find_all('img')
    image_urls = [urljoin(url, img['src']) for img in image_tags if 'src' in img.attrs]
    # Create the folder if it does not exist
    os.makedirs(folder, exist_ok=True)
    # Download and save the images
    for i, image_url in enumerate(image_urls):
        try:
            image_response = requests.get(image_url, headers=headers, timeout=10)
            if image_response.status_code == 200:
                # The header may be missing or carry parameters such as "; charset=..."
                content_type = image_response.headers.get('Content-Type', '').split(';')[0].strip()
                if content_type.startswith('image/'):
                    # 'image/png' -> 'png', 'image/svg+xml' -> 'svg'
                    extension = content_type.split('/')[-1].split('+')[0]
                    image_path = os.path.join(folder, f'image_{i+1}.{extension}')
                    with open(image_path, 'wb') as file:
                        file.write(image_response.content)
                    print(f"Downloaded {image_url}")
                    # Random delay to avoid triggering anti-scraping measures
                    time.sleep(random.uniform(1, 3))
                else:
                    print(f"Skipped {image_url}: not an image (Content-Type: {content_type!r})")
            else:
                print(f"Failed to download {image_url}. Status code: {image_response.status_code}")
        except Exception as e:
            print(f"An error occurred while downloading {image_url}: {e}")
if __name__ == "__main__":
    target_url = 'https://example.com'
    download_images(target_url)

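This example defines a download_images() function that takes the URL of the target page and the folder in which to save the images. It sends an HTTP request to fetch the page, parses the HTML with BeautifulSoup, extracts the image links, and converts them to absolute URLs. It then creates the output folder if needed and loops over the image links, downloading and saving each one. Before saving a file, it checks the Content-Type header of the response to determine the image format and uses the matching file extension.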
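A page often references the same image more than once, for example a logo in both the header and the footer, and the example above would download each occurrence. The following one-function sketch removes duplicates while keeping the original order; unique_in_order is a hypothetical helper name and is not part of the example.

```python
def unique_in_order(urls):
    """Remove duplicate URLs while preserving first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order in Python 3.7+
    return list(dict.fromkeys(urls))
```

In download_images(), this could be applied right after the links are extracted: image_urls = unique_in_order(image_urls).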

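11. Advanced Techniques

Concurrent downloads: Downloading with multiple threads or processes can significantly increase download speed. The concurrent.futures module can be used to implement concurrent downloads.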

import concurrent.futures
import os
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def download_image(image_url, folder, headers, index):
    try:
        image_response = requests.get(image_url, headers=headers, timeout=10)
        if image_response.status_code == 200:
            # The header may be missing or carry parameters such as "; charset=..."
            content_type = image_response.headers.get('Content-Type', '').split(';')[0].strip()
            if content_type.startswith('image/'):
                # 'image/png' -> 'png', 'image/svg+xml' -> 'svg'
                extension = content_type.split('/')[-1].split('+')[0]
                image_path = os.path.join(folder, f'image_{index+1}.{extension}')
                with open(image_path, 'wb') as file:
                    file.write(image_response.content)
                print(f"Downloaded {image_url}")
        else:
            print(f"Failed to download {image_url}. Status code: {image_response.status_code}")
    except Exception as e:
        print(f"An error occurred while downloading {image_url}: {e}")
def download_images_concurrently(url, folder='downloaded_images'):
    # Send the HTTP request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code != 200:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return
    # Parse the HTML content
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract the image links and resolve them against the page URL
    image_tags = soup.find_all('img')
    image_urls = [urljoin(url, img['src']) for img in image_tags if 'src' in img.attrs]
    # Create the folder if it does not exist
    os.makedirs(folder, exist_ok=True)
    # Download the images concurrently, with at most 5 requests in flight
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(download_image, image_url, folder, headers, i) for i, image_url in enumerate(image_urls)]
        concurrent.futures.wait(futures)
if __name__ == "__main__":
    target_url = 'https://example.com'
    download_images_concurrently(target_url)

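This example defines a download_image() function that downloads a single image and is run in a thread pool. The download_images_concurrently() function creates the pool with concurrent.futures.ThreadPoolExecutor and submits one download task per image link to it.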
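The example above prints failures but does not report which downloads succeeded. As a hedged variation, the sketch below collects the outcome of each task with concurrent.futures.as_completed(). The names run_downloads and download_one are hypothetical; download_one stands for any function that downloads one URL and raises an exception on failure, unlike download_image() above, which catches its own errors.

```python
import concurrent.futures


def run_downloads(image_urls, download_one, max_workers=5):
    """Run download_one(url) for each URL in a thread pool and collect the outcomes."""
    succeeded = []
    failed = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which URL each future belongs to
        future_to_url = {executor.submit(download_one, url): url for url in image_urls}
        # as_completed yields futures as they finish, in completion order
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()  # re-raises any exception from the worker thread
                succeeded.append(url)
            except Exception as error:
                failed[url] = error
    return succeeded, failed
```

The returned failed dictionary could then be used to retry only the URLs that did not download. Keeping max_workers small also limits how hard the target site is hit.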

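Handling images loaded dynamically by JavaScript: If the images are loaded dynamically by JavaScript, Selenium can be used to simulate a browser, wait for the page to finish loading, and then extract the image links.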

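import os
import random
import time
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
def download_images_with_selenium(url, folder='downloaded_images'):
    # Headers for the image requests sent with requests (the browser sends its own)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    # Initialize the Selenium WebDriver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    try:
        # Open the target page
        driver.get(url)
        # Wait up to 10 seconds for an <img> element to appear (adjust the timeout or condition as needed)
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'img')))
        # Get the rendered page source
        html_content = driver.page_source
    finally:
        # Close the browser, even if the wait times out
        driver.quit()
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract the image links and resolve them against the page URL
    image_tags = soup.find_all('img')
    image_urls = [urljoin(url, img['src']) for img in image_tags if 'src' in img.attrs]
    # Create the folder if it does not exist
    os.makedirs(folder, exist_ok=True)
    # Download and save the images (same as the earlier steps)
    for i, image_url in enumerate(image_urls):
        try:
            image_response = requests.get(image_url, headers=headers, timeout=10)
            if image_response.status_code == 200:
                # The header may be missing or carry parameters such as "; charset=..."
                content_type = image_response.headers.get('Content-Type', '').split(';')[0].strip()
                if content_type.startswith('image/'):
                    # 'image/png' -> 'png', 'image/svg+xml' -> 'svg'
                    extension = content_type.split('/')[-1].split('+')[0]
                    image_path = os.path.join(folder, f'image_{i+1}.{extension}')
                    with open(image_path, 'wb') as file:
                        file.write(image_response.content)
                    print(f"Downloaded {image_url}")
                    time.sleep(random.uniform(1, 3))
            else:
                print(f"Failed to download {image_url}. Status code: {image_response.status_code}")
        except Exception as e:
            print(f"An error occurred while downloading {image_url}: {e}")
if __name__ == "__main__":
    target_url = 'https://example.com'
    download_images_with_selenium(target_url)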

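This example uses Selenium to launch a browser and open the target page. It reads the rendered page source from driver.page_source, parses it with BeautifulSoup, extracts the image links, and downloads the images.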

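12. Conclusion

The methods and techniques above cover the main cases of scraping images with Python. The core steps are: send the HTTP request, parse the HTML content, extract the image links, and download and save the images. Use Selenium to simulate a browser when images are loaded dynamically, urljoin() when image links are relative, and request headers or delays when a site has anti-scraping measures. Concurrent downloads can increase download speed. We hope this article serves as a useful reference.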