How to convert a CHM file to TXT with Python

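There are several ways to convert a CHM file to TXT with Python: use a third-party library, extract the CHM archive to HTML files, run the pages through an HTML parsing library, or write a custom script. The simplest is an existing third-party library such as pychm, which can read the contents of a CHM file directly. The sections below show how to read a CHM file with pychm and convert its contents to TXT, then cover the alternatives.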

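I. Installing and using the pychm library

pychm is a Python library for working with CHM files; it is a set of bindings for the chmlib C library. With it we can read the pages stored in a CHM file and write them out as TXT. First, install pychm: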

pip install pychm

Once it is installed, a short Python script can read the CHM file and write its contents to a TXT file.

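from chm import chm, chmlib  # pychm installs the module "chm"
def collect_path(chm_handle, ui, paths):
    # Enumeration callback: remember each entry's path inside the archive.
    paths.append(ui.path)
    return chmlib.CHM_ENUMERATOR_CONTINUE
def iter_html_pages(chm_file_path):
    # Yield raw bytes of every .htm/.html page (verify these pychm calls locally).
    archive = chm.CHMFile()
    if not archive.LoadCHM(chm_file_path):
        raise OSError('cannot open ' + chm_file_path)
    paths = []
    chmlib.chm_enumerate(archive.file, chmlib.CHM_ENUMERATE_ALL, collect_path, paths)
    for path in paths:
        name = path.decode('utf-8', 'ignore') if isinstance(path, bytes) else path
        status, ui = archive.ResolveObject(path)
        if name.lower().endswith(('.htm', '.html')) and status == chmlib.CHM_RESOLVE_SUCCESS:
            yield archive.RetrieveObject(ui)[1]
def extract_chm_to_txt(chm_file_path, output_txt_file_path):
    with open(output_txt_file_path, 'w', encoding='utf-8') as output_file:
        for html_content in iter_html_pages(chm_file_path):
            output_file.write(html_content.decode('utf-8', errors='ignore'))
            output_file.write('\n\n')
if __name__ == "__main__":
    extract_chm_to_txt('example.chm', 'output.txt')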

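In the script above, extract_chm_to_txt takes two arguments: the path of the CHM file and the path of the output TXT file. It reads every HTML page in the CHM file and appends its contents to the TXT file. The pages are written as-is, so the output still contains HTML tags; section III shows how to strip them.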
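The script above decodes every page as UTF-8, but pages in older CHM files may be stored in a legacy encoding. Below is a minimal sketch of a fallback decoder. The helper name `decode_html_bytes` is hypothetical, and the candidate list (UTF-8, GB18030, Windows-1252) is only an assumption that should be adjusted to the files at hand.

```python
def decode_html_bytes(data, encodings=("utf-8", "gb18030", "cp1252")):
    """Decode raw page bytes by trying each candidate encoding in order.

    The default candidate list is an assumption, not something a CHM
    file guarantees; pass your own tuple if you know the source locale.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep going, but mark the undecodable bytes.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_html_bytes("中文".encode("gb18030")))
```

Order matters here: strict encodings such as UTF-8 should come first, and permissive single-byte code pages should come last, because they accept most byte sequences and would otherwise hide a better match.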

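II. Extracting the CHM file to HTML files

A CHM file is essentially a compressed collection of HTML files. We can use chmlib or another tool to extract the HTML files from it, then read them with a Python script and pull out the text.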

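1. Extracting the CHM file with chmlib

First install the chmlib command-line tools (the package name below is for Debian and Ubuntu):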

sudo apt-get install libchm-bin

Then extract the CHM file with the extract_chmLib tool:

extract_chmLib example.chm output_directory
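The same extraction step can be driven from Python so that the whole conversion runs as one script. The sketch below assumes `extract_chmLib` is on the PATH and takes the two arguments shown above. `build_extract_command` and `extract_chm` are hypothetical helper names, and whether the tool signals failure through its exit status is not confirmed here.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(chm_path, output_dir, tool="extract_chmLib"):
    """Return the argument list for the extraction command shown above."""
    return [tool, str(chm_path), str(output_dir)]


def extract_chm(chm_path, output_dir, tool="extract_chmLib"):
    """Run the extraction tool and return the output directory as a Path."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(tool + " was not found on PATH")
    # Creating the directory up front is harmless if the tool also creates it.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_extract_command(chm_path, output_dir, tool), check=True)
    return Path(output_dir)


if __name__ == "__main__":
    print(build_extract_command("example.chm", "output_directory"))
```

Passing the command as a list rather than a shell string avoids quoting problems when the CHM path contains spaces.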

2. Reading the extracted HTML files with Python

After extraction, we can read the HTML files with the BeautifulSoup library and pull out the text. First install BeautifulSoup:

pip install beautifulsoup4

Then write a Python script that reads the HTML files and extracts their text:

import os
from bs4 import BeautifulSoup
def extract_html_to_txt(html_directory, output_txt_file_path):
    with open(output_txt_file_path, 'w', encoding='utf-8') as output_file:
        for root, dirs, files in os.walk(html_directory):
            dirs.sort()  # visit subdirectories in a stable order
            for file in sorted(files):
                # CHM pages commonly use .htm as well as .html
                if file.lower().endswith(('.htm', '.html')):
                    file_path = os.path.join(root, file)
                    # Open in binary mode so BeautifulSoup can detect the page encoding
                    with open(file_path, 'rb') as html_file:
                        soup = BeautifulSoup(html_file, 'html.parser')
                    text_content = soup.get_text()
                    output_file.write(text_content)
                    output_file.write('\n\n')
if __name__ == "__main__":
    extract_html_to_txt('output_directory', 'output.txt')

This script walks the directory of extracted HTML files, reads each one, extracts its text, and writes all of the text to the output TXT file.
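Text pulled out of HTML this way tends to carry runs of blank lines and stray spaces left over from the page layout. A minimal cleanup sketch follows; `normalize_text` is a hypothetical helper, and the rule it applies (collapse spaces, keep at most one blank line between blocks) is just one reasonable choice.

```python
import re


def normalize_text(text):
    """Collapse repeated spaces and squeeze runs of blank lines to one."""
    # Collapse spaces, tabs and non-breaking spaces inside each line.
    lines = [re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines()]
    kept = []
    for line in lines:
        # Keep a blank line only when the previous kept line has content.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    print(normalize_text("Title\n\n\n\nFirst   paragraph \n"))
```

It could be applied to the result of soup.get_text() before the text is written to the output file.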

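III. Processing CHM content with an HTML parsing library

The pages read through pychm are still HTML, so another library is needed to turn them into text. For example, html2text converts HTML into plain, Markdown-style text. First install html2text: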

pip install html2text

Then modify the earlier script to convert the HTML with html2text:

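import html2text
from chm import chm, chmlib  # pychm installs the module "chm"
def collect_path(chm_handle, ui, paths):
    # Enumeration callback: remember each entry's path inside the archive.
    paths.append(ui.path)
    return chmlib.CHM_ENUMERATOR_CONTINUE
def iter_html_pages(chm_file_path):
    # Yield raw bytes of every .htm/.html page (verify these pychm calls locally).
    archive = chm.CHMFile()
    if not archive.LoadCHM(chm_file_path):
        raise OSError('cannot open ' + chm_file_path)
    paths = []
    chmlib.chm_enumerate(archive.file, chmlib.CHM_ENUMERATE_ALL, collect_path, paths)
    for path in paths:
        name = path.decode('utf-8', 'ignore') if isinstance(path, bytes) else path
        status, ui = archive.ResolveObject(path)
        if name.lower().endswith(('.htm', '.html')) and status == chmlib.CHM_RESOLVE_SUCCESS:
            yield archive.RetrieveObject(ui)[1]
def extract_chm_to_txt(chm_file_path, output_txt_file_path):
    converter = html2text.HTML2Text()
    converter.ignore_links = True  # drop link targets, keep the link text
    with open(output_txt_file_path, 'w', encoding='utf-8') as output_file:
        for html_content in iter_html_pages(chm_file_path):
            text_content = converter.handle(html_content.decode('utf-8', errors='ignore'))
            output_file.write(text_content)
            output_file.write('\n\n')
if __name__ == "__main__":
    extract_chm_to_txt('example.chm', 'output.txt')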

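In the script above, html2text converts each HTML page to text. It removes the HTML tags and renders structure such as headings and lists in Markdown-style notation, which makes the output TXT file cleaner and easier to read.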
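Where installing html2text is not an option, the standard library's html.parser can do a rougher version of the same job. This is a minimal sketch, not a replacement for html2text: the class `TextExtractor`, the function `html_to_text`, and the set of tags treated as block-level are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and breaking on block tags."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside <script> or <style> is not part of the visible page.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello &amp; bye</p>"))
```

Unlike html2text, this sketch discards all structure apart from line breaks, so headings, lists and links come out as plain lines.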

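IV. Writing a custom script for more complex processing

Sometimes the content of a CHM file needs more complex processing, such as extracting specific parts of the text or handling particular HTML tags. A custom Python script can cover these needs.

1. Extracting specific parts of the text

Suppose we only want specific parts of the CHM file, such as headings and paragraphs. We can modify the earlier script to select specific HTML tags with BeautifulSoup: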

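from bs4 import BeautifulSoup
from chm import chm, chmlib  # pychm installs the module "chm"
def collect_path(chm_handle, ui, paths):
    # Enumeration callback: remember each entry's path inside the archive.
    paths.append(ui.path)
    return chmlib.CHM_ENUMERATOR_CONTINUE
def iter_html_pages(chm_file_path):
    # Yield raw bytes of every .htm/.html page (verify these pychm calls locally).
    archive = chm.CHMFile()
    if not archive.LoadCHM(chm_file_path):
        raise OSError('cannot open ' + chm_file_path)
    paths = []
    chmlib.chm_enumerate(archive.file, chmlib.CHM_ENUMERATE_ALL, collect_path, paths)
    for path in paths:
        name = path.decode('utf-8', 'ignore') if isinstance(path, bytes) else path
        status, ui = archive.ResolveObject(path)
        if name.lower().endswith(('.htm', '.html')) and status == chmlib.CHM_RESOLVE_SUCCESS:
            yield archive.RetrieveObject(ui)[1]
def extract_chm_to_txt(chm_file_path, output_txt_file_path):
    with open(output_txt_file_path, 'w', encoding='utf-8') as output_file:
        for html_content in iter_html_pages(chm_file_path):
            # Passing bytes lets BeautifulSoup detect the page encoding itself
            soup = BeautifulSoup(html_content, 'html.parser')
            for tag in soup.find_all(['h1', 'h2', 'h3', 'p']):
                output_file.write(tag.get_text())
                output_file.write('\n\n')
if __name__ == "__main__":
    extract_chm_to_txt('example.chm', 'output.txt')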

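In this script, BeautifulSoup selects the heading (h1, h2, h3) and paragraph (p) tags in each HTML page and writes their text to the TXT file.

2. Handling specific HTML tags

To handle specific HTML tags such as tables (table) or lists (ul, ol), add the corresponding logic. For example, the following converts each HTML table into a tab-separated text table: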

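from bs4 import BeautifulSoup
from chm import chm, chmlib  # pychm installs the module "chm"
def collect_path(chm_handle, ui, paths):
    # Enumeration callback: remember each entry's path inside the archive.
    paths.append(ui.path)
    return chmlib.CHM_ENUMERATOR_CONTINUE
def iter_html_pages(chm_file_path):
    # Yield raw bytes of every .htm/.html page (verify these pychm calls locally).
    archive = chm.CHMFile()
    if not archive.LoadCHM(chm_file_path):
        raise OSError('cannot open ' + chm_file_path)
    paths = []
    chmlib.chm_enumerate(archive.file, chmlib.CHM_ENUMERATE_ALL, collect_path, paths)
    for path in paths:
        name = path.decode('utf-8', 'ignore') if isinstance(path, bytes) else path
        status, ui = archive.ResolveObject(path)
        if name.lower().endswith(('.htm', '.html')) and status == chmlib.CHM_RESOLVE_SUCCESS:
            yield archive.RetrieveObject(ui)[1]
def extract_chm_to_txt(chm_file_path, output_txt_file_path):
    with open(output_txt_file_path, 'w', encoding='utf-8') as output_file:
        for html_content in iter_html_pages(chm_file_path):
            soup = BeautifulSoup(html_content, 'html.parser')
            for tag in soup.find_all(['h1', 'h2', 'h3', 'p', 'table']):
                # Skip tags nested inside a table; the enclosing table already covers them
                if tag.find_parent('table') is not None:
                    continue
                if tag.name == 'table':
                    output_file.write(extract_table_text(tag))
                else:
                    output_file.write(tag.get_text())
                output_file.write('\n\n')
def extract_table_text(table_tag):
    # One line per row, cells separated by tabs
    table_text = ""
    for row in table_tag.find_all('tr'):
        cells = row.find_all(['th', 'td'])
        # Collapse whitespace inside a cell so each row stays on one line
        row_text = "\t".join(cell.get_text(' ', strip=True) for cell in cells)
        table_text += row_text + '\n'
    return table_text
if __name__ == "__main__":
    extract_chm_to_txt('example.chm', 'output.txt')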
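Tab-separated rows only line up when the cells have similar widths. If the TXT file is meant to be read directly, the rows can instead be padded into aligned columns. The sketch below assumes the table has already been reduced to a list of rows of strings; `format_table` is a hypothetical helper and not part of any library used above.

```python
def format_table(rows, sep="  "):
    """Pad every column to its widest cell so rows line up in plain text."""
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Short rows are padded with empty cells so every row has n_cols entries.
    padded = [list(row) + [""] * (n_cols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    lines = []
    for row in padded:
        line = sep.join(cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(line.rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table([["Name", "Type"], ["path", "str"], ["encoding", "str"]]))
```

The widths are measured with len(), so columns containing full-width characters such as Chinese text will not align exactly in a monospaced font.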