How to Remove Duplicate Words in Python

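Python offers several ways to remove duplicate words from a string: a set, a dictionary (dict), or a regular expression (regex). A set is the most direct option because it cannot contain duplicate elements, although it does not preserve word order. A dictionary removes duplicates through the uniqueness of its keys and keeps insertion order. Regular expressions suit more complex string processing, such as collapsing consecutive repeats. The sections below cover each approach in turn.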

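1. Removing duplicate words with a set

A set is an unordered data structure whose defining property is that it holds no duplicate elements. That property makes it easy to remove repeated words from a string.

1.1 Converting the string to a set

First split the string on whitespace into a list of words, then convert the list to a set. Duplicate words are dropped automatically, but the original word order is lost.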

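def remove_duplicates_with_set(input_string):
    # Split on whitespace, then let the set discard repeated words.
    words = input_string.split()
    unique_words = set(words)
    # Sets are unordered, so the word order of the result is arbitrary.
    return " ".join(unique_words)

# Example
input_string = "Python is great and Python is dynamic"
output_string = remove_duplicates_with_set(input_string)
print(output_string)  # the five unique words, in an order that may vary between runs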
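When the original order does not matter but a reproducible result does, sorting the set is one option. The following is a minimal sketch; the helper name `remove_duplicates_sorted` is hypothetical and not part of the original article.

```python
def remove_duplicates_sorted(input_string):
    words = input_string.split()
    # set() removes duplicates; sorted() makes the output order deterministic.
    # Default string sorting is by code point, so uppercase letters sort first.
    return " ".join(sorted(set(words)))


print(remove_duplicates_sorted("Python is great and Python is dynamic"))
# Python and dynamic great is
```

The result is alphabetical rather than in first-appearance order, so this only helps when stable output is the goal.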

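1.2 Preserving the original order

Because a set is unordered, use a dictionary when the words must stay in their original order. Since Python 3.7, dictionaries are guaranteed to preserve insertion order.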

def remove_duplicates_keep_order(input_string):
    words = input_string.split()
    seen = {}  # keys record each word in first-seen order
    for word in words:
        if word not in seen:
            seen[word] = True
    return " ".join(seen.keys())

# Example
input_string = "Python is great and Python is dynamic"
output_string = remove_duplicates_keep_order(input_string)
print(output_string)  # Python is great and dynamic
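An order-preserving variant can also keep a set for membership checks and a list for the output, which works on any Python 3 version regardless of dictionary ordering. This is a sketch under that assumption; the name `remove_duplicates_with_seen_set` is hypothetical.

```python
def remove_duplicates_with_seen_set(input_string):
    seen = set()   # words already emitted, for O(1) membership checks
    result = []    # unique words in first-seen order
    for word in input_string.split():
        if word not in seen:
            seen.add(word)
            result.append(word)
    return " ".join(result)


print(remove_duplicates_with_seen_set("Python is great and Python is dynamic"))
# Python is great and dynamic
```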

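2. Removing duplicate words with a dictionary

Dictionaries are used widely in Python, and they can remove duplicate words while keeping their order.

2.1 Uniqueness of dictionary keys

Dictionary keys are unique, so building a dictionary from the word list removes duplicates. dict.fromkeys does this in a single call.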

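def remove_duplicates_with_dict(input_string):
    words = input_string.split()
    # dict.fromkeys keeps one key per distinct word, in first-seen order.
    unique_words = dict.fromkeys(words)
    # Iterating a dict yields its keys, so join() receives the unique words.
    return " ".join(unique_words)

# Example
input_string = "Python is great and Python is dynamic"
output_string = remove_duplicates_with_dict(input_string)
print(output_string)  # Python is great and dynamic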
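To make the intermediate step visible, the sketch below prints the dictionary that `dict.fromkeys` builds and wraps the idea in a small helper for any iterable of hashable items. The name `unique_in_order` is hypothetical.

```python
def unique_in_order(items):
    # dict.fromkeys maps every item to None; a repeated key collapses onto
    # its first occurrence, so the first-seen order is kept.
    return list(dict.fromkeys(items))


words = "Python is great and Python is dynamic".split()
print(dict.fromkeys(words))
# {'Python': None, 'is': None, 'great': None, 'and': None, 'dynamic': None}
print(unique_in_order(words))
# ['Python', 'is', 'great', 'and', 'dynamic']
```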

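3. Removing duplicate words with a regular expression

Regular expressions are a powerful string-processing tool, well suited to complex text-processing tasks.

3.1 Matching repeated words with a regular expression

A regular expression can detect and collapse consecutively repeated words. Unlike the set and dictionary approaches, it leaves non-adjacent repeats in place.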

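import re

def remove_consecutive_duplicates(input_string):
    # (\w+) captures a word; ( \1\b)+ matches one or more repeats of it,
    # each preceded by a single space. The whole run is replaced by the word.
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', input_string)

# Example
input_string = "Python is is great and and Python is dynamic"
output_string = remove_consecutive_duplicates(input_string)
print(output_string)  # Python is great and Python is dynamic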
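The pattern above only matches repeats separated by exactly one space and written in the same case. A possible variation, offered as a sketch rather than a definitive pattern, uses `\s+` and `re.IGNORECASE`; the names `_REPEATED_WORD` and `remove_consecutive_duplicates_flexible` are hypothetical.

```python
import re

# \s+ tolerates any run of whitespace between repeats;
# re.IGNORECASE lets "The the" count as a repeated word.
_REPEATED_WORD = re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE)


def remove_consecutive_duplicates_flexible(input_string):
    # Keep the first occurrence (group 1) and drop the repeats that follow it.
    return _REPEATED_WORD.sub(r"\1", input_string)


print(remove_consecutive_duplicates_flexible("The the cat sat  sat on the mat"))
# The cat sat on the mat
```

The first occurrence determines the casing that survives, and whitespace outside the repeated run is left untouched.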

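4. Refinements and extensions

4.1 Handling case

In some situations "Python" and "python" should count as the same word. Converting the string to a single case before deduplicating handles this, at the cost of lowercasing the output.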

def remove_duplicates_case_insensitive(input_string):
    # Lowercase first so "Python" and "python" become the same key.
    words = input_string.lower().split()
    unique_words = dict.fromkeys(words)
    return " ".join(unique_words)

# Example
input_string = "Python is great and python is dynamic"
output_string = remove_duplicates_case_insensitive(input_string)
print(output_string)  # python is great and dynamic
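If the output should keep the casing of each word's first occurrence, the comparison key can be separated from the stored word. This is a minimal sketch of that idea; `remove_duplicates_preserve_case` is a hypothetical name, and `str.casefold` is used here as a stricter alternative to `lower()` for caseless comparison.

```python
def remove_duplicates_preserve_case(input_string):
    seen = {}  # maps case-folded key -> word as first written
    for word in input_string.split():
        # setdefault stores the word only when its folded key is new.
        seen.setdefault(word.casefold(), word)
    return " ".join(seen.values())


print(remove_duplicates_preserve_case("Python is great and python is dynamic"))
# Python is great and dynamic
```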

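4.2 Handling punctuation

Punctuation attached to a word interferes with matching ("Python," and "Python" compare as different words), so strip it before deduplicating.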

import string

def remove_duplicates_ignore_punctuation(input_string):
    # Build a translation table that deletes every ASCII punctuation character.
    translator = str.maketrans('', '', string.punctuation)
    stripped_input = input_string.translate(translator)
    words = stripped_input.split()
    unique_words = dict.fromkeys(words)
    return " ".join(unique_words)

# Example
input_string = "Python, is great! And Python is dynamic."
output_string = remove_duplicates_ignore_punctuation(input_string)
print(output_string)  # Python is great And dynamic
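Deleting every punctuation character also changes words such as "isn't" into "isnt". One way to avoid that, sketched below, is to strip punctuation only from the edges of each token and to combine this with case-insensitive comparison. The name `remove_duplicates_normalized` is hypothetical, and like the version above it handles only the ASCII characters in `string.punctuation`.

```python
import string


def remove_duplicates_normalized(input_string):
    seen = {}  # maps case-folded key -> cleaned word as first written
    for token in input_string.split():
        # Strip punctuation only from the edges, so "isn't" keeps its apostrophe.
        word = token.strip(string.punctuation)
        if not word:
            continue  # the token was pure punctuation, e.g. "--"
        seen.setdefault(word.casefold(), word)
    return " ".join(seen.values())


print(remove_duplicates_normalized("Python, is great! And python isn't great."))
# Python is great And isn't
```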