How to Extract Numbers from Text in Python

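Python offers several ways to extract numbers from text: regular expressions (regex), string methods, and list comprehensions. The most common approaches are pulling out every number with a regular expression, splitting the string and filtering the pieces, and filtering with a list comprehension. This article walks through each approach with example code and shows where each one fits.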

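1. Extracting Numbers with Regular Expressions

Regular expressions are a powerful text-processing tool, well suited to complex string matching and extraction. Python's re module provides regular expression support.

Importing the re module

First, import the re module: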

import re

Finding all numbers

The re.findall() function finds every match in the text and returns them as a list:

text = "Python 3.8 was released in October 2019. Python 3.9 followed in October 2020."
numbers = re.findall(r'\d+', text)
print(numbers)  # Output: ['3', '8', '2019', '3', '9', '2020']

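In this example, the regular expression r'\d+' matches one or more consecutive digit characters, and re.findall() returns a list of all non-overlapping matches as strings. Because the decimal point is not a digit, 3.8 is split into '3' and '8'.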
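
Since re.findall() returns strings, a conversion step is usually needed before doing arithmetic. The following is a minimal sketch of that step. It also uses re.finditer(), which yields match objects that record where each number was found. The variable names values and positions are illustrative only.

```python
import re

text = "Python 3.8 was released in October 2019."

# re.findall() returns strings; convert them when numeric values are needed.
values = [int(s) for s in re.findall(r"\d+", text)]
print(values)  # [3, 8, 2019]

# re.finditer() yields match objects that also carry each match's start index.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('3', 7), ('8', 9), ('2019', 35)]
```

Keeping the start index can be handy when the surrounding words are needed to decide what a number means.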

Extracting floating-point numbers

If the text contains floating-point numbers, use a slightly more elaborate regular expression:

text = "The price is 45.67 dollars and the discount is 12.5%."
numbers = re.findall(r'\d+\.\d+', text)
print(numbers)  # Output: ['45.67', '12.5']

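In this example, the regular expression r'\d+\.\d+' matches one or more digits, a literal decimal point, and one or more digits. Integers without a decimal point are not matched by this pattern.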
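
When a text mixes integers and decimals, one pattern with an optional fractional part can cover both. The sketch below is one possible way to write it; the name NUMBER and the sample sentence are made up for illustration. The fractional part uses a non-capturing group (?:...) because re.findall() returns only the group contents when a pattern contains capturing groups.

```python
import re

# Optional sign, digits, then an optional fractional part (non-capturing group).
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

text = "The price is 45.67 dollars, the discount is 12.5%, and 3 items cost -2 less."
matches = NUMBER.findall(text)
print(matches)  # ['45.67', '12.5', '3', '-2']
```

One caveat with the optional sign: in a range such as 2019-2020, the hyphen would be read as a minus sign on the second number.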

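2. Splitting and Filtering with String Methods

Regular expressions can be overkill for simple text-processing tasks. In those cases, Python's built-in string methods are enough.

Splitting the string

First, use str.split() to split the string into a list of words: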

text = "Python 3.8 was released in October 2019. Python 3.9 followed in October 2020."
words = text.split()
print(words)

Filtering the numbers

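Then use a list comprehension to filter the numbers out of the word list. Tokens such as '2019.' carry trailing punctuation, so strip it before testing:

numbers = [word.strip('.,') for word in words if word.strip('.,').isdigit()]
print(numbers)  # Output: ['2019', '2020']

In this example, str.isdigit() checks whether a string consists only of digit characters. Version numbers such as '3.8' contain a decimal point, so they are filtered out, and without the strip() call '2019.' and '2020.' would be rejected as well.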
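
It can help to see exactly which tokens str.isdigit() accepts. The sketch below prints the result for a few sample tokens and defines a stricter check. The helper name is_plain_int is hypothetical, not part of the standard library. It relies on str.isascii(), which is available from Python 3.7.

```python
def is_plain_int(token):
    """True only for non-empty strings made of ASCII digits 0-9."""
    return token.isascii() and token.isdigit()

# Trailing punctuation, decimal points, and signs all make isdigit() False,
# while some non-ASCII characters such as the superscript two make it True.
samples = ["2019", "2019.", "3.8", "-5", "²"]
for s in samples:
    print(repr(s), s.isdigit(), is_plain_int(s))
```

The superscript case matters because int('²') raises ValueError even though '²'.isdigit() is True, so a stricter test avoids a crash in a later conversion step.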

3. Filtering with a List Comprehension

A list comprehension is a concise and efficient way to extract numbers from a string.

Extracting integers

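Combine a list comprehension with str.isdigit(), stripping trailing punctuation from each word first:

text = "Python 3.8 was released in October 2019. Python 3.9 followed in October 2020."
tokens = [word.strip('.,') for word in text.split()]
numbers = [int(token) for token in tokens if token.isdigit()]
print(numbers)  # Output: [2019, 2020]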

Extracting floating-point numbers

To extract floating-point numbers, wrap the conversion in a try-except block:

text = "The price is 45.67 dollars and the discount is 12.5%."
numbers = []
for word in text.split():
    try:
        numbers.append(float(word))
    except ValueError:
        pass
print(numbers)  # Output: [45.67]

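In this example, each word is passed to float(); if the conversion raises ValueError, the word is skipped. The token '12.5%.' fails to convert because of its trailing characters, which is why only 45.67 appears in the result.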
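
A small variation on the try-except approach trims surrounding punctuation before converting, so that a token like '12.5%.' is not lost. This is only a sketch: the function name floats_from_words and the default strip_chars set are assumptions chosen for this example. It also filters with math.isfinite(), because float() accepts the words 'nan', 'inf', and 'infinity'.

```python
import math

def floats_from_words(text, strip_chars=".,;:!?%()"):
    """Collect words that parse as finite floats after trimming punctuation."""
    found = []
    for word in text.split():
        token = word.strip(strip_chars)
        try:
            value = float(token)
        except ValueError:
            continue  # not a number, skip it
        if math.isfinite(value):  # drop 'nan' and 'inf', which float() accepts
            found.append(value)
    return found

text = "The price is 45.67 dollars and the discount is 12.5%."
print(floats_from_words(text))  # [45.67, 12.5]
```

Note that this version returns integers as floats too, for example 7 becomes 7.0.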

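4. A Combined Example

To see how these methods apply in practice, consider a combined example that extracts all integers and floating-point numbers from a text and stores them in two separate lists.

Example code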

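import re

def extract_numbers(text):
    # Integers: digit runs that are not part of a decimal such as 45.67
    integers = re.findall(r'(?<![\d.])\d+(?!\.?\d)', text)
    # Floats: digits, a decimal point, then more digits
    floats = re.findall(r'\b\d+\.\d+\b', text)
    return [int(num) for num in integers], [float(num) for num in floats]

text = "The price of item A is 45.67 dollars, item B is 12.5 dollars, and item C is 100 dollars."
integers, floats = extract_numbers(text)
print("Integers:", integers)  # Output: Integers: [100]
print("Floats:", floats)  # Output: Floats: [45.67, 12.5]

In this example, r'\b\d+\.\d+\b' matches floating-point numbers. For integers, a plain r'\b\d+\b' is not enough: there is a word boundary on both sides of a decimal point, so it would also match the 45 and the 67 in 45.67. The lookarounds in r'(?<![\d.])\d+(?!\.?\d)' reject any digit run that is preceded by a digit or a decimal point, or followed by a digit or by a decimal point and a digit. The strings returned by re.findall() are then converted to int and float respectively.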
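
An alternative to running two separate searches is to match every number once and sort the results afterwards, which guarantees that the two halves of a decimal are never counted as integers. The following sketch illustrates the idea; the names NUMBER and split_numbers are hypothetical.

```python
import re

# One pattern for both kinds: digits with an optional fractional part.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")

def split_numbers(text):
    """Return (integers, floats), classifying each match by its decimal point."""
    integers, floats = [], []
    for s in NUMBER.findall(text):
        if "." in s:
            floats.append(float(s))
        else:
            integers.append(int(s))
    return integers, floats

text = "The price of item A is 45.67 dollars, item B is 12.5 dollars, and item C is 100 dollars."
print(split_numbers(text))  # ([100], [45.67, 12.5])
```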

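5. Handling Complex Text

More complex text may call for a combination of methods. For example, the text may mix number formats (integers, floating-point numbers, percentages, and so on), or certain characters may need to be ignored.

Example code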

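import re

def extract_numbers(text):
    # Match integers and floats, with an optional minus sign and percent sign
    pattern = r'-?\b\d+(?:\.\d+)?\b%?'
    matches = re.findall(pattern, text)
    integers = []
    floats = []
    for match in matches:
        # Remove the percent sign
        match = match.replace('%', '')
        try:
            if '.' in match:
                floats.append(float(match))
            else:
                integers.append(int(match))
        except ValueError:
            pass
    return integers, floats

text = "The temperature dropped to -5.4% yesterday. The discount is -10% and the price is 123.45 dollars."
integers, floats = extract_numbers(text)
print("Integers:", integers)  # Output: Integers: [-10]
print("Floats:", floats)  # Output: Floats: [-5.4, 123.45]

In this example, the regular expression r'-?\b\d+(?:\.\d+)?\b%?' matches integers, floating-point numbers, and percentages, each with an optional leading minus sign. The word boundary sits before the optional %, because a \b placed after % cannot match when the percent sign is followed by a space or punctuation. Each matched string is then stripped of its percent sign and converted to a numeric value. Note that 123.45 is captured whole and therefore lands in the float list.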
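
Removing the percent sign discards the fact that a value was a percentage. If that distinction matters, capturing groups can keep it. The sketch below is one way to do so, under the assumption that every value can be treated as a float; the names PERCENT_AWARE and extract_with_percent_flag are hypothetical.

```python
import re

# Group 1 captures the signed number, group 2 captures an optional percent sign.
PERCENT_AWARE = re.compile(r"(-?\b\d+(?:\.\d+)?)(%?)")

def extract_with_percent_flag(text):
    """Return (value, is_percent) pairs in the order they appear in the text."""
    return [(float(num), sign == "%") for num, sign in PERCENT_AWARE.findall(text)]

text = "The temperature dropped to -5.4% yesterday. The discount is -10% and the price is 123.45 dollars."
pairs = extract_with_percent_flag(text)
print(pairs)  # [(-5.4, True), (-10.0, True), (123.45, False)]
```

With two capturing groups in the pattern, re.findall() returns one tuple per match, which is what the comprehension unpacks.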

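6. Processing Large Text Files

Sometimes numbers have to be extracted from a large text file. In that case, read the file line by line and apply the methods above to each line.

Example code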

import re

def extract_numbers_from_file(file_path):
    integers = []
    floats = []
    # Optional minus sign, digits, optional fraction, optional percent sign
    pattern = r'-?\b\d+(?:\.\d+)?\b%?'
    # Assumes the file is UTF-8 encoded; adjust the encoding if it is not
    with open(file_path, 'r', encoding='utf-8') as file:
        # Iterating over the file object reads one line at a time
        for line in file:
            matches = re.findall(pattern, line)
            for match in matches:
                match = match.replace('%', '')
                try:
                    if '.' in match:
                        floats.append(float(match))
                    else:
                        integers.append(int(match))
                except ValueError:
                    pass
    return integers, floats

file_path = 'example.txt'
integers, floats = extract_numbers_from_file(file_path)
print("Integers:", integers)
print("Floats:", floats)
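
Reading line by line keeps only one line of the file in memory, but the two result lists still grow with the number of matches. For very large files, a generator that yields values as they are found is one option. The following is a sketch of that idea; iter_numbers is a hypothetical name, the file is assumed to be UTF-8 encoded, and the demo writes its own temporary file so that it runs on its own.

```python
import os
import re
import tempfile

NUMBER = re.compile(r"-?\b\d+(?:\.\d+)?\b")

def iter_numbers(file_path, encoding="utf-8"):
    """Lazily yield (line_number, value) pairs, reading one line at a time."""
    with open(file_path, "r", encoding=encoding) as fh:
        for line_no, line in enumerate(fh, start=1):
            for s in NUMBER.findall(line):
                yield line_no, (float(s) if "." in s else int(s))

# Demo: write a small temporary file, scan it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("width 12 height 7.5\nno numbers here\noffset -3\n")
    demo_path = tmp.name

result = list(iter_numbers(demo_path))
os.remove(demo_path)
print(result)  # [(1, 12), (1, 7.5), (3, -3)]
```

A caller that only needs a running total or the first few values can consume the generator directly instead of building a list.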