How to Write a Lexer in Python

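Writing a lexer (lexical analyzer) in Python comes down to four core steps: defining the lexical rules, writing a lexer class, implementing the state-machine logic, and processing the input code. Defining the lexical rules matters most, because the rules determine how the lexer recognizes and classifies each token.

The sections below build a basic lexer in Python and walk through each step in turn.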

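I. Defining the Lexical Rules

Lexical rules pair each token type the program recognizes with a regular expression. Common token types include keywords, identifiers, operators, numbers, and separators. Python's re module is enough to define these rules.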

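import re

# Lexical rules: each rule is a (name, regular expression) pair.
# Order matters: alternatives are tried left to right, so KEYWORD
# must come before IDENTIFIER and UNKNOWN must come last.
rules = [
    ('KEYWORD', r'\b(if|else|while|return|function)\b'),
    ('IDENTIFIER', r'\b[A-Za-z_][A-Za-z0-9_]*\b'),
    ('NUMBER', r'\b\d+(\.\d+)?\b'),
    ('OPERATOR', r'[+\-*/=<>!&|]'),
    ('SEPARATOR', r'[(),;{}]'),
    ('WHITESPACE', r'\s+'),
    ('UNKNOWN', r'.'),  # any other single character
]

# Combine all rules into one pattern; each rule becomes a named group
master_pattern = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in rules))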
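
To see why the order of the rules matters, the short sketch below builds two combined patterns from a reduced rule set and checks which named group `match.lastgroup` reports. The names `build_pattern`, `keyword_first` and `identifier_first` are made up for this illustration.

```python
import re

def build_pattern(rules):
    """Join (name, regex) pairs into one pattern of named groups."""
    return re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in rules))

keyword_rule = ('KEYWORD', r'\b(if|else|while|return|function)\b')
identifier_rule = ('IDENTIFIER', r'\b[A-Za-z_][A-Za-z0-9_]*\b')

keyword_first = build_pattern([keyword_rule, identifier_rule])
identifier_first = build_pattern([identifier_rule, keyword_rule])

# Alternatives are tried left to right, so the first rule that matches wins.
print(keyword_first.match('if').lastgroup)     # KEYWORD
print(identifier_first.match('if').lastgroup)  # IDENTIFIER

# \b keeps a keyword from matching the prefix of a longer identifier.
print(keyword_first.match('iffy').lastgroup)   # IDENTIFIER
```

With IDENTIFIER listed first, every keyword would be reported as a plain identifier, which is why the rule list puts KEYWORD ahead of it.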

II. Writing the Lexer Class

The lexer class reads the input code and splits it into tokens according to the rules defined above.

class Lexer:
    def __init__(self, rules):
        self.rules = rules
        # Compile all rules into a single pattern of named groups
        self.master_pattern = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in rules))

    def tokenize(self, code):
        tokens = []
        pos = 0
        while pos < len(code):
            match = self.master_pattern.match(code, pos)
            if match:
                token_type = match.lastgroup  # name of the rule that matched
                token_value = match.group(token_type)
                if token_type != 'WHITESPACE':  # skip whitespace
                    tokens.append((token_type, token_value))
                pos = match.end()
            else:
                # Only reached if the rule set has no catch-all rule such as UNKNOWN
                raise SyntaxError(f'Illegal character at position {pos}')
        return tokens
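
As a variation on the class above, the sketch below yields tokens lazily with `re.finditer` and wraps each one in a named tuple that also records where it starts. The `Token` type and the `iter_tokens` function are names invented for this example; the rule set is the one from section I.

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    type: str
    value: str
    pos: int  # offset of the token's first character in the input

RULES = [
    ('KEYWORD', r'\b(if|else|while|return|function)\b'),
    ('IDENTIFIER', r'\b[A-Za-z_][A-Za-z0-9_]*\b'),
    ('NUMBER', r'\b\d+(\.\d+)?\b'),
    ('OPERATOR', r'[+\-*/=<>!&|]'),
    ('SEPARATOR', r'[(),;{}]'),
    ('WHITESPACE', r'\s+'),
    ('UNKNOWN', r'.'),
]
MASTER_PATTERN = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in RULES))

def iter_tokens(code: str) -> Iterator[Token]:
    """Yield tokens one at a time, skipping whitespace."""
    for match in MASTER_PATTERN.finditer(code):
        if match.lastgroup != 'WHITESPACE':
            yield Token(match.lastgroup, match.group(), match.start())

for token in iter_tokens('x = 42;'):
    print(token)
# Token(type='IDENTIFIER', value='x', pos=0)
# Token(type='OPERATOR', value='=', pos=2)
# Token(type='NUMBER', value='42', pos=4)
# Token(type='SEPARATOR', value=';', pos=6)
```

Note that `finditer` silently skips input that no rule matches. That is safe here only because WHITESPACE and the catch-all UNKNOWN rule together cover every character.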

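III. Implementing the State-Machine Logic

A lexer is at heart a state machine: in each state it inspects the next piece of input and decides what to do. In the regex-based lexer this logic lives in the tokenize loop. The only explicit state is the scan position pos; each successful match is one transition that consumes a token and moves pos forward, while the character-level work is delegated to the regex engine. The class is the same as in the previous section, annotated here to show that reading.

class Lexer:
    def __init__(self, rules):
        self.rules = rules
        self.master_pattern = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in rules))

    def tokenize(self, code):
        tokens = []
        pos = 0  # state: current scan position
        while pos < len(code):  # stop once all input is consumed
            match = self.master_pattern.match(code, pos)  # try every rule at pos
            if match:
                token_type = match.lastgroup
                token_value = match.group(token_type)
                if token_type != 'WHITESPACE':  # skip whitespace
                    tokens.append((token_type, token_value))
                pos = match.end()  # transition: resume right after the token
            else:
                # Error state: no rule matches at pos
                raise SyntaxError(f'Illegal character at position {pos}')
        return tokens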
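
For comparison, an explicit state machine that reads one character at a time could look like the sketch below. It is a simplified illustration, not a replacement for the class above: the `scan` function, the state names and the `SYMBOL` token type are assumptions of this example, and it only handles identifiers, integers and single-character symbols.

```python
def scan(code):
    """Hand-written state machine for identifiers, integers and single-character symbols."""
    tokens = []
    state = 'START'
    start = 0  # index where the current token began
    i = 0
    while i <= len(code):
        ch = code[i] if i < len(code) else ''  # '' marks end of input
        if state == 'START':
            if ch == '':
                break
            start = i
            if ch.isalpha() or ch == '_':
                state = 'IN_IDENTIFIER'
            elif ch.isdigit():
                state = 'IN_NUMBER'
            elif not ch.isspace():
                tokens.append(('SYMBOL', ch))
            i += 1
        elif state == 'IN_IDENTIFIER':
            if ch.isalnum() or ch == '_':
                i += 1
            else:
                tokens.append(('IDENTIFIER', code[start:i]))
                state = 'START'  # do not advance: re-examine ch from START
        elif state == 'IN_NUMBER':
            if ch.isdigit():
                i += 1
            else:
                tokens.append(('NUMBER', code[start:i]))
                state = 'START'  # do not advance: re-examine ch from START
    return tokens

print(scan('count = count + 10;'))
# [('IDENTIFIER', 'count'), ('SYMBOL', '='), ('IDENTIFIER', 'count'),
#  ('SYMBOL', '+'), ('NUMBER', '10'), ('SYMBOL', ';')]
```

The regex version hides these transitions inside the pattern engine, which keeps the code short; the hand-written version makes every state and transition visible and is easier to extend with context-dependent behavior.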

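IV. Processing the Input Code

Finally, feed the input code to the lexer to split it into tokens. A simple function can supply the input code and call the lexer's tokenize method.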

def main():
    # Sample input: a small function in a JavaScript-like language
    code = """
    function add(a, b) {
        return a + b;
    }
    """
    lexer = Lexer(rules)
    tokens = lexer.tokenize(code)
    for token in tokens:
        print(token)


if __name__ == '__main__':
    main()
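
In practice the input usually comes from a file or from standard input rather than from a string literal. The following sketch shows one possible way to load it; `load_source` is a hypothetical helper, and the convention that the first command-line argument names the source file is an assumption of this example.

```python
import io
import sys

def load_source(argv, stdin=sys.stdin):
    """Return the text of the file named in argv[1], or read stdin if no file is given."""
    if len(argv) > 1:
        with open(argv[1], encoding='utf-8') as f:
            return f.read()
    return stdin.read()

# Demonstration with an in-memory stream standing in for stdin
code = load_source(['lexer.py'], stdin=io.StringIO('return a + b;'))
print(code)  # return a + b;
```

In a real script the call would be `load_source(sys.argv)`, and the returned text would be passed to the lexer's tokenize method.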

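V. Advanced Features

A more capable lexer needs additional features such as handling comments and strings and reporting errors. The examples below show one way to implement each.

1. Handling Comments

The lexical rules can be extended to handle comments, which usually come in two forms: single-line and multi-line.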

rules = [
    # Comment rule; it must come before OPERATOR so that '/' is not consumed first.
    # '//[^\n]*' runs to the end of the line; '[\s\S]*?' lets a block comment span lines.
    ('COMMENT', r'//[^\n]*|/\*[\s\S]*?\*/'),
    # Other rules
    ('KEYWORD', r'\b(if|else|while|return|function)\b'),
    ('IDENTIFIER', r'\b[A-Za-z_][A-Za-z0-9_]*\b'),
    ('NUMBER', r'\b\d+(\.\d+)?\b'),
    ('OPERATOR', r'[+\-*/=<>!&|]'),
    ('SEPARATOR', r'[(),;{}]'),
    ('WHITESPACE', r'\s+'),
    ('UNKNOWN', r'.'),
]

# Modified tokenize method of the Lexer class: skips comments as well as whitespace
def tokenize(self, code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = self.master_pattern.match(code, pos)
        if match:
            token_type = match.lastgroup
            token_value = match.group(token_type)
            if token_type not in ('WHITESPACE', 'COMMENT'):  # skip whitespace and comments
                tokens.append((token_type, token_value))
            pos = match.end()
        else:
            raise SyntaxError(f'Illegal character at position {pos}')
    return tokens
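
Comment patterns are sensitive to regex flags: without `re.MULTILINE`, `$` matches only at the very end of the input (or just before a final newline), and without `re.DOTALL`, `.` does not match a newline. The sketch below uses a pattern that needs neither flag and checks it on a small sample. It assumes C-style comment syntax, and the `COMMENT` and `source` names are only for illustration.

```python
import re

# Assumed C-style comments: '//' to end of line, '/* ... */' possibly spanning lines.
COMMENT = re.compile(r'//[^\n]*|/\*[\s\S]*?\*/')

source = """a = 1; // first
/* block
   comment */
b = a / 2;"""

# Both comment forms are found, including the one that spans two lines.
print(COMMENT.findall(source))
# ['// first', '/* block\n   comment */']

# Removing every comment leaves only the code; the division on the last line is untouched.
print(repr(COMMENT.sub('', source)))
# 'a = 1; \n\nb = a / 2;'
```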

2. Handling Strings

Strings are usually enclosed in quotes, either single or double. The lexical rules can be extended with a rule for them.

rules = [
    # String rule: single- or double-quoted text without escape sequences.
    # It comes first so that quoted text is not split into other tokens.
    ('STRING', r'\'[^\']*\'|\"[^\"]*\"'),
    # Other rules
    ('KEYWORD', r'\b(if|else|while|return|function)\b'),
    ('IDENTIFIER', r'\b[A-Za-z_][A-Za-z0-9_]*\b'),
    ('NUMBER', r'\b\d+(\.\d+)?\b'),
    ('OPERATOR', r'[+\-*/=<>!&|]'),
    ('SEPARATOR', r'[(),;{}]'),
    ('WHITESPACE', r'\s+'),
    ('UNKNOWN', r'.'),
]

# The tokenize method needs no change for strings:
# a STRING token is appended to the result like any other token
def tokenize(self, code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = self.master_pattern.match(code, pos)
        if match:
            token_type = match.lastgroup
            token_value = match.group(token_type)
            if token_type not in ('WHITESPACE', 'COMMENT'):  # skip whitespace and comments
                tokens.append((token_type, token_value))
            pos = match.end()
        else:
            raise SyntaxError(f'Illegal character at position {pos}')
    return tokens
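
A string rule built only from "anything except the quote" stops at the first quote character, so it cannot handle escaped quotes such as `\"`. One possible extension is sketched below. It assumes that a backslash escapes the next character (as in C or JavaScript) and that strings do not span lines; `STRING` and `samples` are illustrative names.

```python
import re

# Assumption: a backslash escapes any single character; strings end on the same line.
STRING = re.compile(
    r'"(?:\\.|[^"\\\n])*"'    # double-quoted: escape pairs or ordinary characters
    r"|'(?:\\.|[^'\\\n])*'"   # single-quoted: same idea with the other quote
)

samples = ['"hello"', r'"say \"hi\""', "'it\\'s'"]
for s in samples:
    m = STRING.match(s)
    print(m is not None and m.group() == s)  # True for each sample

# An unterminated string does not match at all.
print(STRING.match('"oops'))  # None
```

In the full lexer, an unterminated string would then fall through to the remaining rules, so the opening quote would surface as an UNKNOWN token.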

3. Error Reporting

When the lexer meets an illegal character, it should report where the error occurred. This can be added to the tokenize method.

def tokenize(self, code):
    tokens = []
    pos = 0
    while pos < len(code):
        match = self.master_pattern.match(code, pos)
        if match and match.lastgroup != 'UNKNOWN':
            token_type = match.lastgroup
            token_value = match.group(token_type)
            if token_type not in ('WHITESPACE', 'COMMENT'):  # skip whitespace and comments
                tokens.append((token_type, token_value))
            pos = match.end()
        else:
            # Either no rule matched or only the catch-all UNKNOWN rule did
            raise SyntaxError(f'Illegal character at position {pos}: {code[pos]!r}')
    return tokens
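
A raw character offset is hard to use in a multi-line file, so error messages often give a line and column instead. The helper below is a minimal sketch of that conversion; `line_and_column` is a hypothetical function name, and 1-based numbering is an assumption of this example.

```python
def line_and_column(code, pos):
    """Convert a 0-based offset into 1-based (line, column) numbers."""
    line = code.count('\n', 0, pos) + 1
    last_newline = code.rfind('\n', 0, pos)  # -1 if pos is on the first line
    column = pos - last_newline
    return line, column

code = 'a = 1;\nb = a @ 2;\n'
pos = code.index('@')
line, column = line_and_column(code, pos)
print(f'Illegal character {code[pos]!r} at line {line}, column {column}')
# Illegal character '@' at line 2, column 7
```

Inside tokenize, the same call could be made just before raising SyntaxError so that the message carries the line and column rather than the offset.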