How to Search Baidu Images with a Python Crawler

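Searching Baidu Images with a Python crawler involves five main steps: building the request headers, sending the request, parsing the page, downloading the images, and saving them. Building the request headers matters most, because it makes the request look like it comes from a browser and so lowers the chance of the crawler being blocked. That step is covered first.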

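Building the request headers: when crawling Baidu Images, headers that mimic a browser help the request pass basic anti-crawler checks. The key fields are User-Agent, which identifies the type of browser making the request, and Referer, which names the page the request came from.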

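1. Building the request headers

A sensible set of request headers is the foundation of the crawler. The example below sets User-Agent to a desktop Chrome string, which gets past some basic anti-crawler checks, and sets Referer to the Baidu Images home page, so that the request resembles one made by a real user browsing the site.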

# Headers that make the request look like it comes from desktop Chrome
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Referer": "https://image.baidu.com/"
}
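
It can help to check which headers will be attached before anything is sent. The following is a minimal sketch that builds a request object without dispatching it. It uses the standard library's urllib.request.Request in place of requests so that it runs offline; the search URL is only a placeholder here, and no network traffic occurs.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Referer": "https://image.baidu.com/",
}

# Build the request object only; nothing is sent over the network.
request = Request(
    "https://image.baidu.com/search/index?tn=baiduimage&word=cat",
    headers=headers,
)

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

The same headers dictionary can be passed unchanged to requests.get, as the later steps do.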

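2. Sending the request

With the headers in place, the next step is to send the request and fetch the page content. The requests library handles this.

import requests

# Search URL for the keyword "cat"; the timeout stops a stalled connection from hanging the script
url = "https://image.baidu.com/search/index?tn=baiduimage&word=cat"
response = requests.get(url, headers=headers, timeout=10)
html_content = response.text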
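
The hard-coded URL above works for an ASCII keyword such as "cat", but a keyword containing spaces or Chinese characters has to be percent-encoded. One way to do that, sketched here with the standard library, is to build the query string with urlencode. The helper name build_search_url is an illustration, not part of any library, and the parameters are the ones used in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/index"


def build_search_url(keyword):
    """Return the search URL for a keyword, percent-encoding it as UTF-8."""
    query = urlencode({"tn": "baiduimage", "word": keyword})
    return f"{BASE_URL}?{query}"


print(build_search_url("cat"))
print(build_search_url("猫"))
```

With requests, passing a dictionary through the params argument of requests.get has the same effect.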

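3. Parsing the page

Once the page content is available, it has to be parsed to extract the image links. The BeautifulSoup library can parse the HTML. Note that image search pages often load results dynamically with JavaScript, so the static HTML may contain fewer img tags than the browser shows.

from bs4 import BeautifulSoup

# Parse the HTML and collect every <img> element
soup = BeautifulSoup(html_content, "html.parser")
img_tags = soup.find_all("img")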
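
Not every src value found this way is directly downloadable: some are missing, some are inline data: URIs, and some are relative or protocol-relative. The sketch below shows one way to filter and normalize them before downloading. The function name usable_image_urls and the sample values are hypothetical and serve only as illustration.

```python
from urllib.parse import urljoin


def usable_image_urls(srcs, page_url):
    """Resolve src values against page_url and keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # missing attribute or inline image: nothing to download
        absolute = urljoin(page_url, src)
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample = [
    None,
    "data:image/gif;base64,R0lGOD",
    "//img.example.com/a.jpg",
    "/static/b.png",
    "https://img.example.com/a.jpg",
]
print(usable_image_urls(sample, "https://image.baidu.com/search/index"))
```

With BeautifulSoup, the input list could be built as [tag.get("src") for tag in img_tags].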

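4. Downloading the images

With the image links extracted, the next step is to download the images: iterate over all img tags, read each src attribute, and fetch the image it points to.

import os

# Create the output directory if it does not exist yet
os.makedirs("images", exist_ok=True)
for index, img_tag in enumerate(img_tags):
    img_url = img_tag.get("src")
    # Skip missing src values and anything that is not an absolute http(s) URL
    if not img_url or not img_url.startswith("http"):
        continue
    img_response = requests.get(img_url, headers=headers, timeout=10)
    # Drop the query string from the name and prefix the index to avoid collisions
    base_name = os.path.basename(img_url.split("?")[0]) or "image.jpg"
    with open(os.path.join("images", f"{index}_{base_name}"), "wb") as img_file:
        img_file.write(img_response.content)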
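
Using the raw URL basename as a file name can be fragile, because image URLs often carry query strings, lack an extension, or repeat the same name. The helper below is a rough sketch of one way to derive safer names. The function filename_for, the extension whitelist, and the ".jpg" fallback are all assumptions made for illustration.

```python
import os
import re
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}


def filename_for(img_url, index):
    """Build a unique local file name from a running index and the URL path."""
    path = urlparse(img_url).path  # drops the query string and fragment
    stem, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ".jpg"  # assumed default when the URL has no usable extension
    # Replace characters that are awkward in file names with underscores
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem) or "image"
    return f"{index:04d}_{stem}{ext}"


print(filename_for("https://img.example.com/pics/cat.PNG?size=large", 3))
```

The zero-padded index keeps the files in download order and makes each name unique even when two URLs share a basename.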

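5. Saving the images

Saving happens in the same loop as downloading. The os module creates a local directory, here named images, and the body of each response is written into a file in that directory in binary mode ("wb").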

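6. Handling pagination

Search results are usually spread over several pages, so collecting all the images means following the pagination. On pages that expose a "next page" button, its link can be parsed and requested to move forward. The code below assumes that button is an a element with class "n"; if the page layout differs, the selector needs adjusting.

from urllib.parse import urljoin

# Assumes the next-page button is an <a> element with class "n"
next_page = soup.find("a", class_="n")
if next_page and next_page.get("href"):
    # The href is often relative, so resolve it against the current URL
    next_url = urljoin(url, next_page.get("href"))
    response = requests.get(next_url, headers=headers, timeout=10)
    html_content = response.text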
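
Following next-page links in a loop needs two safeguards: a page limit, and a check against revisiting a URL that has already been fetched. The generator below sketches that control flow. The names iter_pages, fetch_html, and find_next_href are hypothetical; the two callables are passed in so that the loop can be tried without network access.

```python
from urllib.parse import urljoin


def iter_pages(start_url, fetch_html, find_next_href, max_pages=5):
    """Yield (url, html) for each page, following next links up to max_pages."""
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch_html(url)
        yield url, html
        href = find_next_href(html)
        # Resolve a relative href against the page it was found on
        url = urljoin(url, href) if href else None


# Offline demonstration with fake pages instead of real HTTP responses
fake_pages = {
    "https://example.com/s?p=1": "next:/s?p=2",
    "https://example.com/s?p=2": "next:/s?p=1",  # links back: the loop must stop
}
for page_url, _ in iter_pages(
    "https://example.com/s?p=1",
    fake_pages.get,
    lambda html: html[5:] if html.startswith("next:") else None,
):
    print(page_url)
```

In the crawler, fetch_html would wrap requests.get and find_next_href would wrap the soup.find lookup shown above.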

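7. Handling exceptions

Network problems, anti-crawler measures, and other factors can all cause a request to fail, so the code should handle exceptions.

try:
    # The timeout turns a stalled connection into an exception instead of a hang
    response = requests.get(url, headers=headers, timeout=10)
    # Raise an HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Request failed: {e}")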
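
Printing the error is enough for a one-off script, but transient failures are often worth retrying. Below is a minimal sketch of retrying with exponential backoff. The helper with_retries and its parameters are assumptions for illustration and are not part of requests.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds, waiting base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)
```

In the crawler this could wrap a small function that calls requests.get and raise_for_status, with retry_on=(requests.RequestException,). The sleep parameter exists so that the waiting can be replaced in tests.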

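8. Summary

The steps above are enough to search Baidu Images with a Python crawler: build the request headers, send the request, parse the page, download and save the images, handle pagination, and handle exceptions. Each step has its own details to get right. The request headers and the pagination logic matter most for keeping the crawler stable and efficient.

The complete code example follows: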

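import os
from urllib.parse import quote, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def download_images_from_baidu(query, num_images=50):
    """Download up to num_images images for a search keyword into ./images."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Referer": "https://image.baidu.com/"
    }
    # Percent-encode the keyword so that non-ASCII queries form a valid URL
    url = f"https://image.baidu.com/search/index?tn=baiduimage&word={quote(query)}"
    image_count = 0
    os.makedirs("images", exist_ok=True)
    while url and image_count < num_images:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            html_content = response.text
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            break
        soup = BeautifulSoup(html_content, "html.parser")
        for img_tag in soup.find_all("img"):
            if image_count >= num_images:
                break
            src = img_tag.get("src")
            # Skip missing src values and inline data: URIs
            if not src or src.startswith("data:"):
                continue
            # Resolve relative and protocol-relative links against the page URL
            img_url = urljoin(url, src)
            try:
                img_response = requests.get(img_url, headers=headers, timeout=10)
                img_response.raise_for_status()
                # Drop the query string and prefix a counter so names stay unique
                base_name = os.path.basename(urlparse(img_url).path) or "image.jpg"
                file_path = os.path.join("images", f"{image_count:04d}_{base_name}")
                with open(file_path, "wb") as img_file:
                    img_file.write(img_response.content)
                image_count += 1
            except requests.RequestException as e:
                print(f"Failed to download image {img_url}: {e}")
        # Assumes the next-page button is an <a> element with class "n"
        next_page = soup.find("a", class_="n")
        next_href = next_page.get("href") if next_page else None
        url = urljoin(url, next_href) if next_href else None


if __name__ == "__main__":
    download_images_from_baidu("cat", num_images=50)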