How to Batch-Extract Text from Images with Python

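Batch text extraction from images in Python typically combines four pieces: the Tesseract OCR engine, the pytesseract wrapper library, image preprocessing with OpenCV, and a script that loops over a folder of images.

Tesseract OCR is an open-source optical character recognition engine that is well suited to extracting text from images. The pytesseract library makes it easy to call Tesseract from Python, and OpenCV can be used to preprocess images before recognition to improve the results. Each of these is described in detail below.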

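I. TESSERACT OCR AND PYTESSERACT

1. Installation and configuration

Tesseract OCR is a powerful open-source optical character recognition tool, and it must be installed before anything else. On Debian or Ubuntu it can be installed with the following command: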

sudo apt-get install tesseract-ocr

Once Tesseract is installed, Python can call the engine through pytesseract, a Python wrapper library that can be installed with the following command:

pip install pytesseract
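
Before going further, it can help to confirm that the Tesseract binary is actually reachable from Python. The sketch below is one way to check, assuming the executable is named `tesseract`; the helper name `find_tesseract` is made up for this illustration and is not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or set tesseract_cmd manually")
else:
    print(f"tesseract found at {path}")
```

If a path is found, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` instead of hard-coding a location.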

2. Basic usage

After installation, the following code extracts the text from a single image:

import pytesseract
from PIL import Image

# Set the path to the tesseract executable (only needed if it is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

# Open the image
image = Image.open('sample_image.png')

# Recognize the text in the image
text = pytesseract.image_to_string(image)
print(text)

3. Batch-processing images

To process images in bulk, we can write a script that handles every image file in a folder:

import os
from PIL import Image
import pytesseract

def extract_text_from_images(folder_path):
    # Set the path to the tesseract executable
    pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
    # List every file in the folder, in a stable order
    files = sorted(os.listdir(folder_path))
    for file in files:
        # Compare the extension case-insensitively so '.PNG' or '.JPG' also match
        if file.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(folder_path, file)
            # Recognize the text; the file is closed when the with-block exits
            with Image.open(image_path) as image:
                text = pytesseract.image_to_string(image)
            print(f'Extracted text from {file}:')
            print(text)
            print('-' * 30)

# Call the function to process every image in the given folder
extract_text_from_images('path_to_your_folder')
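
The file-selection step of the loop above can also be written with `pathlib`, which makes it easy to skip subdirectories and to match extensions regardless of case. This is a minimal sketch rather than part of the original script; the name `list_image_files` and the `IMAGE_EXTENSIONS` set are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import List

# Extensions treated as images in this sketch (an assumption; extend as needed)
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def list_image_files(folder_path: str) -> List[Path]:
    """Return the image files directly inside folder_path, sorted by name."""
    folder = Path(folder_path)
    return sorted(
        entry
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Each returned `Path` can be passed to `Image.open` directly, so the OCR loop only has to iterate over the result.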

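II. Image Preprocessing with OpenCV

1. Installing OpenCV

To improve OCR accuracy, we can preprocess the images before recognition. OpenCV is a powerful image-processing library that can be installed with the following command: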

pip install opencv-python

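2. Preprocessing the image

Preprocessing an image with OpenCV (for example, converting it to grayscale and binarizing it) can improve Tesseract's recognition accuracy: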

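import cv2
import pytesseract
from PIL import Image

def preprocess_image(image_path):
    image = cv2.imread(image_path)
    # cv2.imread returns None instead of raising when the file cannot be read
    if image is None:
        raise FileNotFoundError(f'Could not read image: {image_path}')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Denoise the grayscale image before thresholding; filtering an already
    # binarized image reintroduces gray levels and can erode the characters
    denoised_image = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)
    # Binarize with a fixed threshold of 128
    _, binary_image = cv2.threshold(denoised_image, 128, 255, cv2.THRESH_BINARY)
    return binary_image

def extract_text_from_image(image_path):
    # Preprocess the image
    preprocessed_image = preprocess_image(image_path)
    # Convert to a PIL image
    pil_image = Image.fromarray(preprocessed_image)
    # Recognize the text
    text = pytesseract.image_to_string(pil_image)
    return text

# Example usage
image_path = 'sample_image.png'
extracted_text = extract_text_from_image(image_path)
print(extracted_text)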
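
A fixed threshold of 128 works when the background is bright and evenly lit, but scans with a different exposure may need a different cut-off. Otsu's method picks the threshold from the image histogram by maximizing the between-class variance; in OpenCV it is available by passing `cv2.THRESH_BINARY + cv2.THRESH_OTSU` to `cv2.threshold` with a threshold argument of 0. To show what that selection does, here is a minimal pure-Python sketch over a flat list of grayscale values. The function names `otsu_threshold` and `binarize` are hypothetical, and the code is an illustration rather than a replacement for the OpenCV call.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level (0-255) that best separates dark from bright pixels."""
    # Build a 256-bin histogram of the grayscale values
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0
    sum_dark = 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        # Between-class variance (up to a constant factor) for this split
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if value > threshold else 0 for value in pixels]
```

For an image made of dark text pixels and bright background pixels, the chosen level falls between the two groups, so `binarize` separates them without a hand-picked constant.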

III. Improving the Batch Script

Combining the OpenCV preprocessing step with the batch loop gives an improved script:

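import os
from PIL import Image
import pytesseract
import cv2

def preprocess_image(image_path):
    image = cv2.imread(image_path)
    # cv2.imread returns None instead of raising when the file cannot be read
    if image is None:
        raise FileNotFoundError(f'Could not read image: {image_path}')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Denoise the grayscale image before thresholding; filtering an already
    # binarized image reintroduces gray levels and can erode the characters
    denoised_image = cv2.fastNlMeansDenoising(gray, None, 10, 7, 21)
    # Binarize with a fixed threshold of 128
    _, binary_image = cv2.threshold(denoised_image, 128, 255, cv2.THRESH_BINARY)
    return binary_image

def extract_text_from_image(image_path):
    # Preprocess the image
    preprocessed_image = preprocess_image(image_path)
    # Convert to a PIL image
    pil_image = Image.fromarray(preprocessed_image)
    # Recognize the text
    text = pytesseract.image_to_string(pil_image)
    return text

def extract_text_from_images(folder_path):
    # Set the path to the tesseract executable
    pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
    # List every file in the folder, in a stable order
    files = sorted(os.listdir(folder_path))
    for file in files:
        # Compare the extension case-insensitively so '.PNG' or '.JPG' also match
        if file.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(folder_path, file)
            text = extract_text_from_image(image_path)
            print(f'Extracted text from {file}:')
            print(text)
            print('-' * 30)

# Call the function to process every image in the given folder
extract_text_from_images('path_to_your_folder')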
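
Each call to pytesseract launches the Tesseract executable as a separate process, so a large folder can often be processed faster by running several images at once. The sketch below is one hedged way to do that with the standard library's `concurrent.futures`. Here `ocr_func` stands for any function that takes a path and returns text (for example `extract_text_from_image` above); the name `ocr_many` and the default of `max_workers=4` are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def ocr_many(
    image_paths: Iterable[str],
    ocr_func: Callable[[str], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run ocr_func on every path concurrently.

    Returns (results, errors): text keyed by path, and the exception for
    each path that failed, so one bad image does not stop the batch.
    """
    paths = list(image_paths)
    results: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every image, remembering which future belongs to which path
        futures = {path: pool.submit(ocr_func, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when a single image fails
                errors[path] = exc
    return results, errors
```

Collecting failures separately means an unreadable or corrupt file is reported at the end instead of aborting the whole run.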

IV. Saving the Batch Results

To make the results easier to manage, the extracted text can be saved to a text file:

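# Relies on the imports and extract_text_from_image() from the previous script

def extract_text_from_images(folder_path, output_file):
    # Set the path to the tesseract executable
    pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
    # List every file in the folder, in a stable order
    files = sorted(os.listdir(folder_path))
    # Write UTF-8 so non-ASCII text is saved correctly on every platform
    with open(output_file, 'w', encoding='utf-8') as f:
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(folder_path, file)
                text = extract_text_from_image(image_path)
                f.write(f'Extracted text from {file}:\n')
                f.write(text)
                f.write('\n' + '-' * 30 + '\n')

# Call the function to process every image in the folder and save the results
extract_text_from_images('path_to_your_folder', 'output_text.txt')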
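
A single combined file is convenient to skim, but when results need to be matched back to their source images it can be easier to write one text file per image. A minimal sketch, assuming the recognized text is already held in a dictionary keyed by image file name; the function name `save_texts` and the output layout are illustrative choices, not something the original script prescribes.

```python
from pathlib import Path
from typing import Dict, List


def save_texts(texts: Dict[str, str], output_dir: str) -> List[Path]:
    """Write each text to <output_dir>/<image stem>.txt and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in texts.items():
        target = out / (Path(image_name).stem + ".txt")
        # UTF-8 keeps non-ASCII characters intact regardless of the OS default
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Note that two images sharing a stem, such as `scan.png` and `scan.jpg`, would map to the same output file in this sketch, so a real script may want to keep the original extension in the name.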

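V. Other Tips for Improving OCR Accuracy

1. Adjusting the DPI

OCR accuracy can be affected by image resolution, and Tesseract generally works best on text captured at around 300 DPI (dots per inch) or more. Specifying the DPI of the images can improve recognition: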

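# Relies on the imports and preprocess_image() from the earlier scripts

def extract_text_from_image_with_dpi(image_path, dpi=300):
    preprocessed_image = preprocess_image(image_path)
    pil_image = Image.fromarray(preprocessed_image)
    # An image built from an array carries no DPI metadata, so rewriting the
    # DPI tag of the file on disk would never reach Tesseract. Pass the
    # resolution explicitly with the --dpi option (Tesseract 4 or later).
    # This leaves the source file untouched and does not add pixels.
    return pytesseract.image_to_string(pil_image, config=f'--dpi {dpi}')

def extract_text_from_images_with_dpi_adjustment(folder_path, output_file, dpi=300):
    pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
    files = sorted(os.listdir(folder_path))
    with open(output_file, 'w', encoding='utf-8') as f:
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(folder_path, file)
                # Recognize the text with the DPI stated explicitly
                text = extract_text_from_image_with_dpi(image_path, dpi)
                f.write(f'Extracted text from {file}:\n')
                f.write(text)
                f.write('\n' + '-' * 30 + '\n')

# Call the function to process every image in the folder and save the results
extract_text_from_images_with_dpi_adjustment('path_to_your_folder', 'output_text.txt')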
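
Changing the DPI value that accompanies an image does not add any pixels, so a low-resolution scan may still need to be enlarged before recognition. The arithmetic is a simple ratio: the new size is the old size multiplied by the target DPI over the source DPI. The helper below is a hypothetical sketch of that calculation (the name `scaled_size` is not from any library); the resulting size could then be handed to a resize call such as Pillow's `Image.resize`.

```python
from typing import Tuple


def scaled_size(
    width: int, height: int, source_dpi: float, target_dpi: float = 300
) -> Tuple[int, int]:
    """Return the pixel dimensions that represent the same page at target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    scale = target_dpi / source_dpi
    return round(width * scale), round(height * scale)
```

For example, a 600 x 400 pixel image scanned at 150 DPI would need to become 1200 x 800 pixels to represent the same page at 300 DPI.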

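2. Using different OCR language packs

Tesseract supports many language packs. Specifying the one that matches the text in the images improves recognition: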

# Relies on the imports and preprocess_image() from the earlier scripts

def extract_text_from_image_with_language(image_path, language='eng'):
    preprocessed_image = preprocess_image(image_path)
    pil_image = Image.fromarray(preprocessed_image)
    # lang is the Tesseract language code, e.g. 'eng' or 'chi_sim'
    text = pytesseract.image_to_string(pil_image, lang=language)
    return text

def extract_text_from_images_with_language(folder_path, output_file, language='eng'):
    pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
    files = sorted(os.listdir(folder_path))
    with open(output_file, 'w', encoding='utf-8') as f:
        for file in files:
            if file.lower().endswith(('.png', '.jpg', '.jpeg')):
                image_path = os.path.join(folder_path, file)
                text = extract_text_from_image_with_language(image_path, language)
                f.write(f'Extracted text from {file}:\n')
                f.write(text)
                f.write('\n' + '-' * 30 + '\n')

# Call the function to process every image in the folder and save the results
extract_text_from_images_with_language('path_to_your_folder', 'output_text.txt', language='eng')
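
Tesseract can also load several languages at once when their codes are joined with `+`, for example `chi_sim+eng` for images that mix Simplified Chinese and English. Each language's trained data must be installed first (on Debian or Ubuntu, for instance, through the `tesseract-ocr-chi-sim` package). The sketch below builds such a `lang` value and checks it against a list of installed languages. The helper name `build_lang_arg` is hypothetical, and the list of installed codes is assumed to come from something like `pytesseract.get_languages()`.

```python
from typing import Iterable, Sequence


def build_lang_arg(requested: Sequence[str], installed: Iterable[str]) -> str:
    """Join language codes with '+' after checking that each one is installed."""
    if not requested:
        raise ValueError("At least one language code is required")
    available = set(installed)
    missing = [code for code in requested if code not in available]
    if missing:
        raise ValueError(f"Language data not installed: {', '.join(missing)}")
    return "+".join(requested)
```

The returned string can be passed as the `language` argument of `extract_text_from_image_with_language`, which forwards it to `image_to_string` as `lang`.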

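VI. Project Management

When processing large numbers of images, it is useful to track and manage the OCR tasks in a project management system. Two options are PingCode, a project management system for R&D teams, and Worktile, a general-purpose project management tool.

1. PingCode

PingCode is a project management tool designed for R&D teams that can help a team manage an OCR project efficiently. In PingCode you can create tasks, assign owners, set deadlines, and track progress in real time.

2. Worktile

Worktile is a general-purpose project management tool suited to many kinds of projects. With Worktile you can set up a kanban board for the OCR project and track the status of each task so that the project finishes on time.

With the methods above, we can batch-extract text from images efficiently and use a project management tool to keep the work on track.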
