How to batch-extract text from images with Python

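To extract text from images in bulk, Python offers several tools built around optical character recognition (OCR). The usual combination is the Tesseract OCR engine, Pytesseract (a Python wrapper for Tesseract), Pillow (image handling), and OpenCV (a computer vision library). Install these first, then write a script that reads the image files, preprocesses each image, and runs OCR on it to extract the text. The detailed steps and example code follow.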

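1. Install the required libraries

Before writing any code, make sure the necessary libraries are installed: Pillow, Pytesseract, and OpenCV. You can install them with the following command: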

pip install pillow pytesseract opencv-python

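You also need the Tesseract OCR engine itself, since Pytesseract only wraps it. Installation instructions are available on the tesseract-ocr GitHub page. After installing, make sure Tesseract is on the system PATH.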
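A quick way to confirm that the engine is reachable before running a batch is to look it up on the PATH. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical and not part of Pytesseract.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it or set tesseract_cmd explicitly.")
    else:
        print(f"Tesseract found at: {path}")
```

If the lookup returns None, the explicit path configuration described in the next step is the usual fallback.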

2. Import the libraries and configure the environment

In the Python script, import the necessary libraries and configure the Tesseract path.

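import pytesseract
from PIL import Image
import cv2
import os

# Configure the path to the Tesseract executable (Windows example;
# not needed if tesseract is already on the PATH)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'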

3. Read the image files

Write a function that collects the image files. The os module can list every image in a given folder:

def get_image_files(directory):
    supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff')
    # Compare lowercased names so files such as 'SCAN.JPG' are not skipped
    return [os.path.join(directory, f) for f in os.listdir(directory) if f.lower().endswith(supported_formats)]
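The function above looks only at the top level of the folder. If the images are spread across subfolders, a recursive variant may be more convenient. The sketch below uses pathlib; the name `get_image_files_recursive` and the added '.tif' extension are assumptions, not part of the original script.

```python
from pathlib import Path

# Extensions treated as images; '.tif' is added here as an assumption
SUPPORTED_FORMATS = {'.png', '.jpg', '.jpeg', '.bmp', '.tif', '.tiff'}


def get_image_files_recursive(directory):
    """Return sorted paths of all supported images under directory, including subfolders."""
    root = Path(directory)
    return sorted(
        str(p) for p in root.rglob('*')
        if p.is_file() and p.suffix.lower() in SUPPORTED_FORMATS
    )
```

Sorting the result keeps the processing order stable from run to run, which makes the output file easier to compare between runs.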

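4. Image preprocessing

Preprocessing is critical to OCR accuracy. OpenCV covers the common operations, such as grayscale conversion, binarization, and denoising. The function below applies grayscale conversion and Otsu binarization: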

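def preprocess_image(image_path):
    image = cv2.imread(image_path)
    # Convert to grayscale
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Binarize; with THRESH_OTSU the threshold is computed automatically,
    # so the value passed here (0) is ignored
    _, binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return binary_image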
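When the THRESH_OTSU flag is set, OpenCV chooses the threshold itself rather than using the number passed to cv2.threshold. To make that step less of a black box, here is an illustrative pure-Python sketch of Otsu's method on a flat list of 8-bit gray values. It is not OpenCV's implementation, and the function name `otsu_threshold` is hypothetical; it only shows the idea of picking the level that maximizes the between-class variance.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance,
    treating pixels <= t as one class and pixels > t as the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray values at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue          # no pixels in the lower class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # every pixel is already in the lower class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Two clearly separated groups of pixels: dark text (10) and light paper (200)
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
binary = [255 if p > t else 0 for p in sample]
```

For a scan with dark text on a light background, the chosen level falls between the two groups, which is why Otsu binarization usually needs no hand-tuned threshold.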

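5. Text extraction

Use Pytesseract to extract text from the preprocessed image. The lang='eng' argument selects English; other languages require the corresponding Tesseract language data to be installed.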

def extract_text_from_image(image):
    return pytesseract.image_to_string(image, lang='eng')
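The image_to_string call also accepts a config string of Tesseract command-line options, and several languages can be combined with '+'. The sketch below shows one way to assemble those arguments; the helper `build_ocr_options` is hypothetical, and the 'chi_sim' example assumes the Simplified Chinese language data is installed.

```python
def build_ocr_options(languages=('eng',), psm=None):
    """Build keyword arguments for pytesseract.image_to_string.

    languages: Tesseract language codes, joined with '+' (for example 'chi_sim+eng').
    psm: optional page segmentation mode, passed as '--psm N'.
    """
    options = {'lang': '+'.join(languages)}
    if psm is not None:
        options['config'] = f'--psm {psm}'
    return options


# Example (requires pytesseract and the chi_sim language data to be installed):
# text = pytesseract.image_to_string(image, **build_ocr_options(('chi_sim', 'eng'), psm=6))
```

Page segmentation mode 6 tells Tesseract to treat the image as a single uniform block of text, which can help with cropped screenshots; whether it improves results depends on the images.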

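6. Batch processing

Combine the steps above into a single batch function that walks through every image file in the folder and extracts its text.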

def batch_extract_text_from_images(directory):
    image_files = get_image_files(directory)
    extracted_texts = {}
    for image_file in image_files:
        preprocessed_image = preprocess_image(image_file)
        text = extract_text_from_image(preprocessed_image)
        extracted_texts[image_file] = text
    return extracted_texts

7. Save the results

Write the extracted text to a file.

def save_extracted_texts(texts, output_file):
    with open(output_file, 'w', encoding='utf-8') as f:
        for image_file, text in texts.items():
            f.write(f"File: {image_file}\n")
            f.write(text)
            f.write("\n\n")

# Usage example
directory = 'path_to_your_image_directory'
output_file = 'extracted_texts.txt'
extracted_texts = batch_extract_text_from_images(directory)
save_extracted_texts(extracted_texts, output_file)
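A plain text file is easy to read but awkward to parse again later. If the results feed another program, a structured format may be a better fit. The following sketch writes the same dictionary as JSON; the function name `save_extracted_texts_json` is an assumption, not part of the original script.

```python
import json


def save_extracted_texts_json(texts, output_file):
    """Write a {image_path: text} dictionary to a UTF-8 JSON file."""
    with open(output_file, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(texts, f, ensure_ascii=False, indent=2)
```

The file can then be loaded back with json.load, which returns the original mapping from image path to text.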

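8. Optimization and error handling

To make the script more robust and faster, add error handling and some optimizations, for example coping with images of varying resolution and quality, supporting multiple languages, and improving throughput on large batches. The versions below handle unreadable images without aborting the batch: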

def preprocess_image(image_path):
    try:
        image = cv2.imread(image_path)
        # cv2.imread returns None instead of raising when a file cannot be read
        if image is None:
            raise ValueError("Image not found or unable to read")
        gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # With THRESH_OTSU the threshold is computed automatically; the 0 is ignored
        _, binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        return binary_image
    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return None


def batch_extract_text_from_images(directory):
    image_files = get_image_files(directory)
    extracted_texts = {}
    for image_file in image_files:
        preprocessed_image = preprocess_image(image_file)
        if preprocessed_image is not None:
            text = extract_text_from_image(preprocessed_image)
            extracted_texts[image_file] = text
        else:
            # Record the failure so the image still appears in the output
            extracted_texts[image_file] = "Error processing image"
    return extracted_texts

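9. Parallel processing

For large numbers of images, parallel processing can speed things up. Python's threading or multiprocessing facilities, such as concurrent.futures, are suitable: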

from concurrent.futures import ThreadPoolExecutor


def batch_extract_text_from_images(directory):
    image_files = get_image_files(directory)
    extracted_texts = {}

    def process_image(image_file):
        preprocessed_image = preprocess_image(image_file)
        if preprocessed_image is not None:
            return image_file, extract_text_from_image(preprocessed_image)
        else:
            return image_file, "Error processing image"

    # Leaving the with block waits for all submitted tasks to finish
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(process_image, image_files)

    for image_file, text in results:
        extracted_texts[image_file] = text
    return extracted_texts
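With executor.map, an exception raised while processing one image surfaces when the results are iterated and stops the loop. An alternative is to submit the tasks individually and collect failures separately, which also allows simple progress output. The sketch below illustrates that pattern; `run_in_parallel` and its `ocr_one` parameter (any callable that takes a path and returns text) are hypothetical names, not part of the script above. Threads are a reasonable fit here because Pytesseract runs the Tesseract executable as a separate process.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(image_files, ocr_one, max_workers=4):
    """Apply ocr_one(path) -> str to every path; return (texts, errors) dictionaries."""
    texts, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the file it is processing
        future_to_file = {executor.submit(ocr_one, f): f for f in image_files}
        for done, future in enumerate(as_completed(future_to_file), start=1):
            image_file = future_to_file[future]
            try:
                texts[image_file] = future.result()
            except Exception as exc:
                # Keep going; one bad image should not abort the batch
                errors[image_file] = str(exc)
            print(f"[{done}/{len(future_to_file)}] {image_file}")
    return texts, errors
```

In the script above, ocr_one could be a small function that calls preprocess_image followed by extract_text_from_image. Keeping errors in their own dictionary avoids mixing failure messages with genuine OCR output.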