A Brief Look at the Power of Regular Expressions: Exploring Convenient Data Processing (Part 2)

2023-11-18 04:05:15

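Dear readers, in "Regular Expression Basics: An Illustrated Tutorial (part 1)" we got a first look at the power of regular expressions. In this article we continue exploring them and learn how to use them to solve practical problems.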

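1. Advanced regular expression syntax

In the previous article we covered the basic syntax of regular expressions: character matching, groups, quantifiers, and boundary anchors. In this section we look at more advanced syntax: backreferences, conditional expressions, and recursion.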

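2. Backreferences

A backreference lets us refer, later in the pattern or in a replacement string, to a substring that an earlier capture group matched; \1 stands for the text captured by group 1. Capture groups are the prerequisite, so the example below starts there: it captures each run of three space-separated words in a string as three groups.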

import re

text = "The quick brown fox jumps over the lazy dog."

pattern = r"(\w+) (\w+) (\w+)"

matches = re.findall(pattern, text)

for match in matches:
    print(" ".join(match))

# Output:
# The quick brown
# fox jumps over
# the lazy dog
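
The groups above become backreferences once \1 appears in the pattern or in a replacement string. The following is a minimal sketch, using a made-up sentence, that finds accidentally doubled words and collapses them; the names doubled, repeated_words, and deduplicated are illustrative only.

```python
import re

text = "This is is a test of the the backreference syntax."

# (\w+) captures a word; \1 must then match that same text again.
doubled = re.compile(r"\b(\w+) \1\b")

repeated_words = [m.group(1) for m in doubled.finditer(text)]
print(repeated_words)

# In a replacement string, \1 inserts whatever group 1 captured.
deduplicated = doubled.sub(r"\1", text)
print(deduplicated)
```

The \b anchors keep the pattern from matching inside longer words, for example the "is" at the end of "This".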

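3. Conditional expressions

Conditional logic lets a pattern choose between different substrings: a character class such as [aeiou] matches any one of the listed characters, and | matches either of two sub-patterns. For example, we can match only those words in a string that begin with a vowel.

import re

text = "The quick brown fox jumps over the lazy dog."

# \b anchors the match at the start of a word; [aeiou] requires a vowel there.
pattern = r"\b[aeiou]\w*"

matches = re.findall(pattern, text, flags=re.IGNORECASE)

for match in matches:
    print(match)

# Output:
# over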
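
Python's re module also has a true conditional construct, (?(1)yes|no), which tests whether a numbered group took part in the match. As a small sketch (the sample strings and the name quoted_or_bare are made up for illustration), the pattern below accepts a word that is either bare or fully quoted, but not half quoted.

```python
import re

# Group 1 optionally matches an opening quote. (?(1)") then requires a
# closing quote only if group 1 took part in the match.
quoted_or_bare = re.compile(r'^(")?(\w+)(?(1)")$')

samples = ['"fox"', 'fox', '"fox', 'fox"']
results = {s: bool(quoted_or_bare.match(s)) for s in samples}
print(results)
```

The "no" branch is omitted here, which means it matches the empty string when the opening quote is absent.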

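4. Recursion

Recursion lets us match nested structures, such as HTML tags that contain other tags. Python's built-in re module has no recursive pattern syntax, so the example below pairs each opening tag with its closing tag through the backreference \1 and then recurses in Python over the content of every match. The lazy .*? stops at the first matching closing tag, so this sketch does not handle a tag nested inside another tag of the same name.

import re

text = "<html><head></head><body><h1>Welcome to the regular expression tutorial</h1></body></html>"

# \1 requires the closing tag to repeat the name captured by group 1.
pattern = re.compile(r"<([a-z][a-z0-9]*)>(.*?)</\1>")

def find_tags(fragment):
    # Print each tag found in fragment, then descend into its content.
    for name, inner in pattern.findall(fragment):
        print((name, inner))
        find_tags(inner)

find_tags(text)

# Output:
# ('html', '<head></head><body><h1>Welcome to the regular expression tutorial</h1></body>')
# ('head', '')
# ('body', '<h1>Welcome to the regular expression tutorial</h1>')
# ('h1', 'Welcome to the regular expression tutorial')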
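
Another way to handle arbitrarily deep nesting without recursive pattern syntax is to apply a non-nested pattern repeatedly, peeling off the innermost level each time. The sketch below uses parentheses rather than HTML tags to stay short; the function name is_balanced and the sample inputs are assumptions for illustration.

```python
import re

# Matches a pair of parentheses that contains no other parentheses.
innermost = re.compile(r"\([^()]*\)")

def is_balanced(expression):
    """Return True if every parenthesis in expression is properly paired."""
    previous = None
    # Keep deleting innermost pairs until nothing changes.
    while previous != expression:
        previous = expression
        expression = innermost.sub("", expression)
    # Any parenthesis left over had no partner.
    return "(" not in expression and ")" not in expression

print(is_balanced("f(a, g(b, (c)))"))
print(is_balanced("f(a, g(b)"))
```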

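5. Practical examples

Now let us work through a few practical cases that show regular expressions in real use.

5.1 Extracting the protocol, domain, port, path, and query string from a URL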

import re

url = "https://www.example.com:8080/path/to/resource?query=string"

# Groups: 1 protocol, 2 domain, 3 port, 4 path, 5 query string, 6 fragment.
pattern = r"^(?:(https?)://)?(?:[^@/\n]+@)?([^:/?#\n]+)(?::(\d+))?(/[^?#]*)?(?:\?([^#]*))?(?:#(.*))?$"

match = re.match(pattern, url)

print("Protocol:", match.group(1))
print("Domain:", match.group(2))
print("Port:", match.group(3))
print("Path:", match.group(4))
print("Query String:", match.group(5))
print("Fragment:", match.group(6))

# Output:
# Protocol: https
# Domain: www.example.com
# Port: 8080
# Path: /path/to/resource
# Query String: query=string
# Fragment: None
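
Numbered groups become hard to track in a pattern of this size. One option is to use named groups together with re.VERBOSE, as in the sketch below. It is a simplified variant written for this illustration (it omits the user-info part, and the group names are assumptions), not a complete URL parser.

```python
import re

url_pattern = re.compile(
    r"""
    ^(?:(?P<protocol>https?)://)?   # optional scheme
    (?P<domain>[^:/?\#\s]+)         # host name
    (?::(?P<port>\d+))?             # optional port
    (?P<path>/[^?\#]*)?             # optional path
    (?:\?(?P<query>[^\#]*))?        # optional query string
    (?:\#(?P<fragment>.*))?$        # optional fragment
    """,
    re.VERBOSE,
)

url = "https://www.example.com:8080/path/to/resource?query=string"

# groupdict() maps each group name to its captured text (None if absent).
parts = url_pattern.match(url).groupdict()
print(parts)
```

In verbose mode an unescaped # starts a comment, which is why the literal # characters in the pattern are written as \#.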

5.2 Extracting email addresses from a string

import re

text = """
Regular expression tutorial

Email addresses:

* john.doe@example.com
* jane.doe@example.com
* info@example.com
"""

pattern = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"

matches = re.findall(pattern, text)

for match in matches:
    print(match)

# Output:
# john.doe@example.com
# jane.doe@example.com
# info@example.com
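
Splitting the pattern into two capture groups makes it possible to work with the local part and the domain separately. The sketch below counts addresses per domain; the sample text, the stricter domain sub-pattern, and the name domain_counts are assumptions made for this illustration.

```python
import re
from collections import Counter

text = "Contacts: john.doe@example.com, jane.doe@example.com, info@example.org"

# Group 1 is the part before "@", group 2 the domain after it.
email_pattern = re.compile(
    r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+)"
)

# finditer yields match objects, so each group can be read individually.
domain_counts = Counter(m.group(2) for m in email_pattern.finditer(text))
print(domain_counts)
```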

5.3 Extracting IP addresses from a log file

import re

log_file = "access.log"

pattern = r"^(?:\d{1,3}\.){3}\d{1,3}"

with open(log_file, "r") as f:
    for line in f:
        match = re.search(pattern, line)
        if match:
            print(match.group())

# Example output:
# 127.0.0.1
# 192.168.1.1
# 10.0.0.1
# ...
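
Note that \d{1,3} also accepts values such as 999, so the pattern can return strings that are not valid IPv4 addresses. One possible safeguard is to pass each candidate through the standard library's ipaddress module, as sketched below. The log lines are invented for illustration and the helper name extract_valid_ips is hypothetical.

```python
import ipaddress
import re

# Made-up sample lines standing in for the contents of a log file.
log_lines = [
    '127.0.0.1 - - "GET / HTTP/1.1" 200',
    '999.300.1.1 - - "GET /admin HTTP/1.1" 404',
    '192.168.1.1 - - "POST /login HTTP/1.1" 302',
]

candidate = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}")

def extract_valid_ips(lines):
    """Return the leading IPv4 address of each line, skipping invalid ones."""
    valid = []
    for line in lines:
        match = candidate.search(line)
        if not match:
            continue
        try:
            # ip_address raises ValueError when an octet is outside 0-255.
            valid.append(str(ipaddress.ip_address(match.group())))
        except ValueError:
            pass
    return valid

valid_ips = extract_valid_ips(log_lines)
print(valid_ips)
```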