How to Scrape Youku Videos with Python

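Scraping a Youku video with Python involves four core steps: fetching the video page source, parsing the page to extract the video link, downloading the video file, and handling segmented video.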

This article walks through each of these steps in detail and provides concrete code examples to help you understand and implement the process.

1. Fetching the Video Page Source

Before scraping a Youku video, we first need the source of the video page. Python's requests library can handle this. Here is a simple example:

import requests
url = "https://v.youku.com/v_show/id_XMjY3OTE5MzY0OA==.html"
response = requests.get(url)
html = response.text
print(html)

In this example, requests.get() sends an HTTP GET request, and the response body is stored in the variable html. You can print this variable to inspect the page source.
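
A bare requests.get() call sends no browser-like headers, sets no timeout, and does not check the status code. The sketch below wraps the same request in a small helper. The name fetch_html, the header values, and the 10-second timeout are illustrative assumptions, not something Youku is known to require.

```python
# Hypothetical header values; adjust them to whatever your browser actually sends.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Referer": "https://v.youku.com/",
}


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    if session is None:
        # Imported lazily so the helper can be exercised with a stub session.
        import requests
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling fetch_html("https://v.youku.com/v_show/id_XMjY3OTE5MzY0OA==.html") would then stand in for the three request lines of the original example.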

2. Parsing the Page to Extract the Video Link

Once we have the page source, we need to parse the HTML to extract the video link. The BeautifulSoup library can handle this. Here is an example:

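from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# find() returns None when the tag is absent, so check before indexing
meta_tag = soup.find('meta', {'property': 'og:video:url'})
video_url = meta_tag['content'] if meta_tag else None
print(video_url)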

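In this example, BeautifulSoup parses the HTML and looks for the meta tag that holds the video link. We then read that tag's content attribute, which is the link to the video. A page may not include this tag at all, in which case find() returns None.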
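
For readers who want to see what the meta-tag lookup amounts to, here is a dependency-free sketch that uses the standard library's html.parser in place of BeautifulSoup. The names MetaCollector and find_meta_property are hypothetical, and the helper returns None rather than raising when the tag is missing.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property")
        content = attr_map.get("content")
        if prop and content:
            # Keep the first occurrence if a property appears more than once.
            self.properties.setdefault(prop, content)


def find_meta_property(html, name):
    """Return the content of the first matching meta tag, or None if absent."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.properties.get(name)
```

A caller can then test the result for None before attempting a download, instead of catching a TypeError from indexing a missing tag.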

3. Downloading the Video File

Next, we need to download the video file. The requests library can handle this as well. Here is an example:

import requests
video_url = "https://example.com/video.mp4"
response = requests.get(video_url, stream=True)
with open('video.mp4', 'wb') as file:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            file.write(chunk)

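In this example, requests.get() sends an HTTP GET request with the stream parameter set to True, so the video file is downloaded chunk by chunk rather than loaded into memory at once. Each chunk is then written to a file.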
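
One weakness of writing directly to the final filename is that an interrupted download leaves a truncated file that looks complete. A minimal sketch of one way around this, assuming a hypothetical helper named save_chunks, is to write to a temporary ".part" file and rename it only after the last chunk arrives:

```python
import os


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    part_path = path + ".part"
    written = 0
    with open(part_path, "wb") as file:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                file.write(chunk)
                written += len(chunk)
    # Expose the final name only once every chunk has been written.
    os.replace(part_path, path)
    return written
```

With a streaming response this could be called as save_chunks(response.iter_content(chunk_size=1024 * 1024), "video.mp4"); the larger chunk size is a tuning choice, not a requirement.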

4. Handling Video Segments

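Videos on Youku are usually split into multiple segments, which need to be merged into one complete video file. We can use the ffmpeg command-line tool for this. First, install ffmpeg: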

sudo apt-get install ffmpeg

Then use the following code to merge the segments:

import subprocess
segment_list = ['segment1.mp4', 'segment2.mp4', 'segment3.mp4']
with open('segments.txt', 'w') as file:
    for segment in segment_list:
        file.write(f"file '{segment}'\n")
subprocess.run(['ffmpeg', '-f', 'concat', '-safe', '0', '-i', 'segments.txt', '-c', 'copy', 'output.mp4'])

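In this example, we first create a text file that lists the filenames of all the video segments. We then run ffmpeg with its concat demuxer to merge the segments into one complete video file. The "-c copy" option copies the streams without re-encoding.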
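
The list file and the ffmpeg command line can be built by two small functions, which also makes it easy to handle filenames that contain a single quote. This is a sketch with hypothetical helper names (concat_list_text, concat_command); it only constructs the text and the argument list and does not run ffmpeg itself.

```python
def concat_list_text(paths):
    """Return the contents of an ffmpeg concat list for the given paths."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path, output_path):
    """Return the argument list for a stream-copy concat run."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]
```

Passing the result of concat_command to subprocess.run with check=True raises an exception if ffmpeg exits with an error, which is easier to notice than a silently missing output file.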

5. Dealing with Anti-Scraping Measures

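Video sites such as Youku usually employ anti-scraping measures to prevent automated downloads. You may need to deal with CAPTCHAs, JavaScript-based encryption, dynamically loaded content, and similar obstacles. The Selenium library can simulate browser behavior to get past these. Here is an example: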

from selenium import webdriver
url = "https://v.youku.com/v_show/id_XMjY3OTE5MzY0OA==.html"
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
print(html)
driver.quit()

In this example, Selenium launches a Chrome browser instance and loads the video page. We then read the rendered page source and store it in the variable html.

6. Parsing the M3U8 File

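A Youku video link usually points to an M3U8 file, which lists the links to multiple TS segments. We need to parse the M3U8 file and download every TS segment. The m3u8 library can handle this. Here is an example: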

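import m3u8
import requests
m3u8_url = "https://example.com/video.m3u8"
m3u8_obj = m3u8.load(m3u8_url)
# absolute_uri resolves relative segment paths against the playlist URL
segment_urls = [segment.absolute_uri for segment in m3u8_obj.segments]
for index, segment_url in enumerate(segment_urls, start=1):
    response = requests.get(segment_url, stream=True)
    response.raise_for_status()
    # Number the files so names stay unique and free of query strings
    with open(f'segment{index}.ts', 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file.write(chunk)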

In this example, m3u8.load() loads the M3U8 file, and we extract the links to all TS segments. We then download the segments one by one and save each to a file.
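
To make the structure of a media playlist concrete, the following sketch extracts segment URLs using only the standard library. It is a simplification of what the m3u8 library does: it assumes a simple media playlist in which every non-empty line that does not start with "#" is a segment URI, and it resolves relative URIs against the playlist URL. The function name segment_urls_from_playlist is hypothetical.

```python
from urllib.parse import urljoin


def segment_urls_from_playlist(playlist_text, playlist_url):
    """Return absolute segment URLs listed in a simple media playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Tags and comments start with '#'; everything else is a segment URI.
        if not line or line.startswith("#"):
            continue
        urls.append(urljoin(playlist_url, line))
    return urls
```

Resolving against the playlist URL matters because playlists often list segments as relative paths such as "seg0.ts", which cannot be requested directly.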

7. Merging the TS Segments

After downloading all the TS segments, we need to merge them into one complete video file. The ffmpeg command-line tool can handle this. Here is an example:

import subprocess
ts_files = ['segment1.ts', 'segment2.ts', 'segment3.ts']
with open('segments.txt', 'w') as file:
    for ts_file in ts_files:
        file.write(f"file '{ts_file}'\n")
subprocess.run(['ffmpeg', '-f', 'concat', '-safe', '0', '-i', 'segments.txt', '-c', 'copy', 'output.mp4'])

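In this example, we create a text file that lists the filenames of all the TS segments. We then run ffmpeg to merge the segments, in the order listed, into one complete video file.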
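
Because the segments are joined in the order they appear in the list, the filenames must be sorted correctly when they are collected from disk. A plain string sort places "segment10.ts" before "segment2.ts". The sketch below shows a natural sort key that compares the numeric parts as integers; natural_key is a hypothetical helper name.

```python
import re


def natural_key(name):
    """Split a name into text and integer parts so numbers sort numerically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


ts_files = sorted(["segment10.ts", "segment2.ts", "segment1.ts"], key=natural_key)
print(ts_files)  # ['segment1.ts', 'segment2.ts', 'segment10.ts']
```

Zero-padding the filenames at download time (for example "segment0001.ts") is an alternative that makes an ordinary string sort sufficient.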

8. Handling Video Encryption

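Some Youku videos may be encrypted, in which case the downloaded TS segments must be decrypted before they can be played. The cryptography library can decrypt these segments. Here is an example: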

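from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
import requests
key_url = "https://example.com/key"
key = requests.get(key_url).content
iv = b'\x00' * 16  # Initialization vector (placeholder; use the IV the playlist specifies)
ts_files = ['segment1.ts', 'segment2.ts', 'segment3.ts']
for ts_file in ts_files:
    with open(ts_file, 'rb') as file:
        encrypted_data = file.read()
    cipher = Cipher(algorithms.AES(key), modes.CBC(iv))
    decryptor = cipher.decryptor()
    padded_data = decryptor.update(encrypted_data) + decryptor.finalize()
    # Strip PKCS7 padding, assuming the standard HLS AES-128 scheme is in use
    unpadder = padding.PKCS7(128).unpadder()
    decrypted_data = unpadder.update(padded_data) + unpadder.finalize()
    with open(ts_file, 'wb') as file:
        file.write(decrypted_data)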

In this example, we first download the encryption key. We then decrypt each TS segment with AES in CBC mode and write the decrypted data back to the file.
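
The all-zero initialization vector above is only a placeholder. If the stream follows the standard HLS AES-128 scheme described in RFC 8216 (an assumption; nothing here confirms which scheme Youku uses), the IV is either given as a hexadecimal IV attribute on the playlist's EXT-X-KEY tag or, when that attribute is absent, derived from the segment's media sequence number as a 16-byte big-endian value. The helper name segment_iv is hypothetical.

```python
def segment_iv(sequence_number, iv_attribute=None):
    """Return the 16-byte IV for a segment under the HLS AES-128 convention."""
    if iv_attribute:
        # The attribute is a hexadecimal string such as "0x0123...".
        hex_digits = iv_attribute
        if hex_digits.lower().startswith("0x"):
            hex_digits = hex_digits[2:]
        return bytes.fromhex(hex_digits.zfill(32))
    # No explicit IV: use the media sequence number, big-endian, zero-padded.
    return sequence_number.to_bytes(16, "big")
```

Under this convention each segment gets its own IV, so the value would be computed inside the decryption loop rather than once before it.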

9. Handling Video Quality and Format

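Youku videos may be offered in several quality and format options. You can find these options by parsing the page source or the M3U8 file, then choose the appropriate link to download. Here is an example: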

import m3u8
import requests
m3u8_url = "https://example.com/video.m3u8"
m3u8_obj = m3u8.load(m3u8_url)
variant_playlists = m3u8_obj.playlists
for variant in variant_playlists:
    print(f"Resolution: {variant.stream_info.resolution}, URL: {variant.uri}")
# Select a variant based on resolution or other criteria
# absolute_uri resolves relative paths against the playlist URL
selected_variant_url = variant_playlists[0].absolute_uri
m3u8_obj = m3u8.load(selected_variant_url)
segment_urls = [segment.absolute_uri for segment in m3u8_obj.segments]
for segment_url in segment_urls:
    response = requests.get(segment_url, stream=True)
    with open(segment_url.split('/')[-1], 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file.write(chunk)
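
The listing above simply takes the first variant in the master playlist. As a sketch of one possible selection rule, the helper below picks the variant with the largest pixel count. It takes plain (resolution, url) pairs so that it does not depend on any particular library; pick_highest_resolution is a hypothetical name, and a resolution of None (for example an audio-only variant) is treated as zero pixels.

```python
def pick_highest_resolution(variants):
    """Return the URL of the variant with the most pixels.

    variants is an iterable of (resolution, url) pairs, where resolution
    is a (width, height) tuple or None.
    """
    def pixel_count(item):
        resolution, _url = item
        return resolution[0] * resolution[1] if resolution else 0

    # max() raises ValueError on an empty sequence, i.e. no variants found.
    return max(variants, key=pixel_count)[1]
```

With the m3u8 library, the pairs could be built as [(p.stream_info.resolution, p.absolute_uri) for p in m3u8_obj.playlists]. Choosing by bandwidth instead of resolution would be an equally reasonable rule.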