How to Do Simple English Word Frequency Counting with Python

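A simple English word frequency count in Python comes down to three steps: reading the text, cleaning the data, and counting the words. First we read the data from a text file, then we clean it by removing punctuation and normalizing case, and finally we count how often each word occurs using a counter or a dictionary. The sections below walk through each step in detail.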

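I. Reading the Text

Before counting anything, the text has to be loaded into the program. Python offers several ways to read a text file; the most common is the built-in open() function.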

def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text
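
For a file too large to hold comfortably in memory, one alternative to read() is to iterate over the file line by line and update the counts as you go. The following is only a minimal sketch under that assumption; the names iter_lines and count_words_streaming are hypothetical and not part of the program built in this article, and the sketch skips the punctuation cleaning described in the next section.

```python
from collections import Counter

def iter_lines(file_path):
    """Yield the lines of a UTF-8 text file one at a time (hypothetical helper)."""
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line

def count_words_streaming(file_path):
    """Count lowercased, whitespace-separated tokens without loading the whole file."""
    counts = Counter()
    for line in iter_lines(file_path):
        # Counter.update adds one to the count of every token in the list
        counts.update(line.lower().split())
    return counts
```

Because a file object is itself an iterator over lines, only one line is held in memory at a time.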

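II. Cleaning the Data

Data cleaning prepares the text that was read so that it can be counted reliably. It mainly involves removing punctuation, converting everything to lowercase, and removing stop words that carry little meaning (stop words are covered in section V).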

import string
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    translator = str.maketrans("", "", string.punctuation)
    text = text.translate(translator)
    return text
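
One side effect of deleting every punctuation character is that apostrophes and hyphens inside words disappear too, so "don't" becomes "dont" and "well-known" becomes "wellknown". The sketch below shows this on a sample sentence and contrasts it with a regular-expression tokenizer. The tokenize function and its pattern are assumptions made for illustration, not part of the original program, and whether in-word apostrophes should be kept depends on your text.

```python
import re
import string

def clean_text(text):
    # Same approach as above: lowercase, then delete all ASCII punctuation
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    # Hypothetical alternative: runs of letters, optionally joined by apostrophes
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

sample = "Don't panic: it's well-known!"
print(clean_text(sample))  # dont panic its wellknown
print(tokenize(sample))    # ["don't", 'panic', "it's", 'well', 'known']
```

Note that string.punctuation only covers ASCII characters, so typographic quotes and dashes are left in place by clean_text.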

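III. Counting Word Frequencies

Counting word frequencies means tallying how many times each word appears in the text. The Counter class in Python's collections module simplifies this step.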

from collections import Counter
def calculate_word_frequencies(text):
    words = text.split()
    word_frequencies = Counter(words)
    return word_frequencies
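
To make clear what Counter is doing, here is a minimal sketch of the same count written with a plain dictionary. The function name count_with_dict is hypothetical and used only for illustration.

```python
from collections import Counter

def count_with_dict(text):
    """Count whitespace-separated words using a plain dict (hypothetical helper)."""
    frequencies = {}
    for word in text.split():
        # dict.get returns 0 the first time a word is seen
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies

sample = "the cat and the hat"
print(count_with_dict(sample))                # {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
print(Counter(sample.split()).most_common(1)) # [('the', 2)]
```

Both approaches produce the same counts. Counter adds conveniences such as most_common(), which returns the words sorted from most to least frequent.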

IV. Putting It Together

Combining the steps above gives a complete word frequency program.

import string
from collections import Counter
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text
def clean_text(text):
    text = text.lower()
    translator = str.maketrans("", "", string.punctuation)
    text = text.translate(translator)
    return text
def calculate_word_frequencies(text):
    words = text.split()
    word_frequencies = Counter(words)
    return word_frequencies
def main(file_path):
    text = read_file(file_path)
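    # Use a different name than the clean_text function; assigning to
    # clean_text here would make it a local variable and raise UnboundLocalError
    cleaned_text = clean_text(text)
    word_frequencies = calculate_word_frequencies(cleaned_text)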
    return word_frequencies
if __name__ == "__main__":
    file_path = 'sample.txt'
    word_frequencies = main(file_path)
    for word, freq in word_frequencies.items():
        print(f"{word}: {freq}")
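
Iterating over items() lists words in the order they were first seen, not by frequency. If you would rather see the most frequent words first, a small sketch using Counter.most_common() might look like this. The helper name top_words and the default of 10 are assumptions chosen for illustration.

```python
from collections import Counter

def top_words(word_frequencies, n=10):
    """Return the n most frequent (word, count) pairs from a Counter."""
    return word_frequencies.most_common(n)

freqs = Counter("to be or not to be that is the question".split())
for word, freq in top_words(freqs, 3):
    print(f"{word}: {freq}")
```

Calling most_common() with no argument returns every word, sorted from most to least frequent.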

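V. Further Optimization and Extensions

1. Removing Stop Words

Stop words are words that occur very often in a text but say little about its content, such as "the", "is", and "in". Removing them keeps these function words from dominating the results, so the counts better reflect what the text is about.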

def remove_stop_words(text):
    stop_words = set(["the", "is", "in", "and", "to", "of", "a", "with"])
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

2. Visualizing the Word Frequencies

To present the results more intuitively, we can plot them with Python's matplotlib library.

import matplotlib.pyplot as plt
def plot_word_frequencies(word_frequencies):
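    # most_common(10) returns the 10 highest-count (word, count) pairs;
    # slicing keys() would only give the first 10 words seen in the text
    top_items = word_frequencies.most_common(10)
    words = [word for word, _ in top_items]
    frequencies = [count for _, count in top_items]
    plt.figure(figsize=(10, 5))
    plt.bar(words, frequencies)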
    plt.xlabel('Words')
    plt.ylabel('Frequencies')
    plt.title('Top 10 Word Frequencies')
    plt.show()

VI. Complete Example with the Improvements

import string
from collections import Counter
import matplotlib.pyplot as plt
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text
def clean_text(text):
    text = text.lower()
    translator = str.maketrans("", "", string.punctuation)
    text = text.translate(translator)
    return text
def remove_stop_words(text):
    stop_words = set(["the", "is", "in", "and", "to", "of", "a", "with"])
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)
def calculate_word_frequencies(text):
    words = text.split()
    word_frequencies = Counter(words)
    return word_frequencies
def plot_word_frequencies(word_frequencies):
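    # most_common(10) returns the 10 highest-count (word, count) pairs;
    # slicing keys() would only give the first 10 words seen in the text
    top_items = word_frequencies.most_common(10)
    words = [word for word, _ in top_items]
    frequencies = [count for _, count in top_items]
    plt.figure(figsize=(10, 5))
    plt.bar(words, frequencies)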
    plt.xlabel('Words')
    plt.ylabel('Frequencies')
    plt.title('Top 10 Word Frequencies')
    plt.show()
def main(file_path):
    text = read_file(file_path)
    cleaned_text = clean_text(text)
    filtered_text = remove_stop_words(cleaned_text)
    word_frequencies = calculate_word_frequencies(filtered_text)
    plot_word_frequencies(word_frequencies)
if __name__ == "__main__":
    file_path = 'sample.txt'
    main(file_path)