How to Extract Images from a Word Document with Python

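Extracting images from a Word document in Python takes three steps: read the .docx file, locate the embedded images, and save them to disk. The two tools used most often are python-docx, a third-party library, and zipfile, a standard-library module. Together they let you automate document processing and image extraction. The steps below open the document with python-docx and use zipfile to reach the image files.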

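I. Install and import the required libraries

Before you start, make sure the required libraries are installed. python-docx is a library for reading and writing Word documents. Install it with pip: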

pip install python-docx

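We also need zipfile, because a .docx file is actually a ZIP archive that holds the document content along with resources such as images. zipfile is part of the Python standard library, so there is nothing extra to install.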

Once python-docx is installed, import the modules in your Python script:

from docx import Document
import zipfile
import os

II. Read the Word file and extract the images

1. Read the Word file

First, open the Word file. python-docx opens and parses a .docx document in a single call.

doc = Document('example.docx')

2. Unpack the Word file

A Word file is actually a ZIP archive that contains the document content and the embedded images. We can unpack it with the zipfile module.

with zipfile.ZipFile('example.docx', 'r') as docx:
    docx.extractall('extracted_content')
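
If you only need to know which images a document contains, you can read the archive listing without unpacking anything to disk. The following is a minimal sketch of that idea; the helper name list_media_members is hypothetical, and it assumes the images live under word/media/ as described in this article.

```python
import zipfile

def list_media_members(docx_path):
    """Return the archive paths of all files stored under word/media/."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name for name in archive.namelist()
            # Skip directory entries, which end with a slash
            if name.startswith('word/media/') and not name.endswith('/')
        ]
```

Calling list_media_members('example.docx') returns paths such as word/media/image1.png, which you can pass to ZipFile.open or ZipFile.read later.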

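3. Locate the image files

In the unpacked folder, images are usually stored in the word/media directory. We can iterate over that directory to find all the image files.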

media_path = 'extracted_content/word/media'
if os.path.exists(media_path):
    for file_name in os.listdir(media_path):
        if file_name.lower().endswith(('.png', '.jpeg', '.jpg', '.bmp')):
            file_path = os.path.join(media_path, file_name)
            print(f'Found image: {file_path}')
else:
    print('No media directory found.')
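
The extension check can also be factored into a small helper that ignores case and covers more formats. The sketch below is illustrative only: the name is_image_name and the IMAGE_EXTENSIONS set are assumptions, and the extra extensions (.gif, .tif, .tiff, .emf, .wmf) are formats that Word documents may embed but that this article's loop does not look for.

```python
from pathlib import PurePosixPath

# Assumed set of extensions; adjust it to the formats you care about
IMAGE_EXTENSIONS = {'.png', '.jpeg', '.jpg', '.bmp', '.gif', '.tif', '.tiff', '.emf', '.wmf'}

def is_image_name(file_name):
    """Return True if the file name has a known image extension, ignoring case."""
    return PurePosixPath(file_name).suffix.lower() in IMAGE_EXTENSIONS
```

You could then write `if is_image_name(file_name):` in place of the endswith check in the loop.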

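4. Save the image files

Once the image files are found, we can copy them to a directory of our choice, for example an images folder under the current working directory.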

output_dir = 'images'
os.makedirs(output_dir, exist_ok=True)
for file_name in os.listdir(media_path):
    if file_name.lower().endswith(('.png', '.jpeg', '.jpg', '.bmp')):
        src_path = os.path.join(media_path, file_name)
        dst_path = os.path.join(output_dir, file_name)
        with open(src_path, 'rb') as src_file, open(dst_path, 'wb') as dst_file:
            dst_file.write(src_file.read())
        print(f'Saved image: {dst_path}')
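
The unpack and copy steps can also be combined by streaming each media file straight from the archive into the output directory, which leaves no intermediate extracted_content folder behind. This is a sketch under stated assumptions rather than the article's method: extract_media is a hypothetical helper, and it copies every file under word/media/ regardless of extension.

```python
import os
import shutil
import zipfile

def extract_media(docx_path, output_dir):
    """Copy every file under word/media/ from the archive into output_dir."""
    saved = []
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if not name.startswith('word/media/') or name.endswith('/'):
                continue
            # Keep only the base name so files land directly in output_dir
            dst_path = os.path.join(output_dir, os.path.basename(name))
            with archive.open(name) as src, open(dst_path, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            saved.append(dst_path)
    return saved
```

The function returns the list of written paths, so the caller can print them or hand them to a later processing step.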

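III. Process the extracted images

1. Convert the image format

Sometimes the extracted images need to be converted to another format. The Pillow library can do this; the example below converts each image to PDF.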

pip install Pillow

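from PIL import Image
for file_name in os.listdir(output_dir):
    if file_name.lower().endswith(('.png', '.jpeg', '.jpg', '.bmp')):
        img_path = os.path.join(output_dir, file_name)
        # Replace the original extension instead of appending to it (image1.png -> image1.pdf)
        pdf_path = os.path.splitext(img_path)[0] + '.pdf'
        with Image.open(img_path) as img:
            # Convert to RGB before saving as PDF
            img.convert('RGB').save(pdf_path)
        print(f'Converted image to PDF: {pdf_path}')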

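2. Compress the images

If the image files are too large, Pillow can also compress them. Note that the quality setting applies to lossy formats such as JPEG; for PNG, Pillow ignores it, and optimize only makes the file smaller losslessly.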

for file_name in os.listdir(output_dir):
    if file_name.lower().endswith(('.png', '.jpeg', '.jpg', '.bmp')):
        img_path = os.path.join(output_dir, file_name)
        img = Image.open(img_path)
        img.save(img_path, optimize=True, quality=85)
        print(f'Compressed image: {img_path}')

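IV. Complete example: automated image extraction

The complete example below extracts the images from a Word file, saves them locally, and then optionally converts and compresses each one: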

from docx import Document
import zipfile
import os
from PIL import Image

def extract_images_from_docx(docx_path, output_dir):
    # Open the document with python-docx; this fails early if the file is not a valid .docx
    Document(docx_path)
    # A .docx file is a ZIP archive, so unpack it to reach the embedded media
    with zipfile.ZipFile(docx_path, 'r') as docx:
        docx.extractall('extracted_content')
    media_path = 'extracted_content/word/media'
    if os.path.exists(media_path):
        os.makedirs(output_dir, exist_ok=True)
        for file_name in os.listdir(media_path):
            # Compare extensions case-insensitively
            if file_name.lower().endswith(('.png', '.jpeg', '.jpg', '.bmp')):
                src_path = os.path.join(media_path, file_name)
                dst_path = os.path.join(output_dir, file_name)
                with open(src_path, 'rb') as src_file, open(dst_path, 'wb') as dst_file:
                    dst_file.write(src_file.read())
                print(f'Saved image: {dst_path}')
                # Optionally convert the image to PDF (image1.png -> image1.pdf)
                pdf_path = os.path.splitext(dst_path)[0] + '.pdf'
                img = Image.open(dst_path)
                img.convert('RGB').save(pdf_path)
                print(f'Converted image to PDF: {pdf_path}')
                # Optionally compress the image in place
                img.save(dst_path, optimize=True, quality=85)
                print(f'Compressed image: {dst_path}')
    else:
        print('No media directory found.')

# Example usage
extract_images_from_docx('example.docx', 'images')

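With the steps and sample code above, you can extract the images from a Word document and save them locally. Automating the task saves time, especially when there are many documents to process.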