A practical guide to extracting patterns from a list of FASTA sequences

2024-03-05 11:53:31

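Extracting patterns from FASTA sequences with SeqIO and re.findall

Introduction

Handling large volumes of biological sequence data is a routine task in bioinformatics, and FASTA is a widely used plain-text format for storing it. This article shows how to use Python's Biopython library together with the regular expression module (re) to iterate over the sequences in a FASTA file and extract a specific pattern from each one.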

Steps for extracting a pattern

1. Import the required libraries

from Bio import SeqIO
import re

2. Parse the FASTA file

records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))
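
To see what the parsed records contain without needing a file on disk, the sketch below feeds SeqIO.parse an in-memory handle. The FASTA text and its record names are made up for illustration; only SeqIO.parse and the id, description and seq attributes come from Biopython.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical two-record FASTA text, standing in for prot_sequences.fasta.
fasta_text = (
    ">seq1 toy protein\n"
    "MKWAPLL\n"
    ">seq2 another toy\n"
    "GGWSPAA\n"
    "TTT\n"
)

# SeqIO.parse accepts a file name or an open handle and yields SeqRecord objects.
records = list(SeqIO.parse(StringIO(fasta_text), 'fasta'))

ids = [record.id for record in records]          # text up to the first space
seqs = [str(record.seq) for record in records]   # wrapped lines are joined

for record in records:
    print(record.id, record.description, str(record.seq), sep=" | ")
```

Wrapping the parser in list() loads every record into memory. For very large files, looping directly over SeqIO.parse(...) processes one record at a time instead.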

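3. Define the regular expression pattern

Define a regular expression that describes the motif to look for. For example, to find a tryptophan (W) followed by any single residue and then a proline (P):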

pattern = r'W.P'
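
A quick way to check what this pattern matches is to try it on a short string. The sketch below uses a made-up toy sequence. The stricter character-class variant is an assumption added for comparison, not part of the original pattern.

```python
import re

# Same pattern as in the article: W, any one character, then P.
pattern = r'W.P'

# Hypothetical toy sequence, for illustration only.
seq = "MKWAPLLwgpTTWP"

# With re.I the lowercase 'wgp' is matched as well.
matches = re.findall(pattern, seq, re.I)
# Without the flag only the uppercase hit is found.
matches_case_sensitive = re.findall(pattern, seq)

# '.' matches any character, including gap or stop symbols such as '-' or '*'.
# A character class limited to the 20 standard amino acids is stricter
# (illustrative alternative, not the article's pattern).
strict = r'W[ACDEFGHIKLMNPQRSTVWY]P'
loose_hit = re.findall(pattern, "W-P")
strict_hit = re.findall(strict, "W-P")

print(matches, matches_case_sensitive, loose_hit, strict_hit)
```

The trailing "WP" in the toy sequence is not reported because the pattern needs exactly one character between W and P.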

4. Iterate over the sequence list

Loop over every sequence record parsed from the FASTA file:

for record in records:

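5. Find the pattern

Inside the loop, use re.findall() to search each sequence for the pattern. str(record.seq) converts the Seq object to a plain string for the re module, and the re.I flag makes the match case-insensitive: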

matches = re.findall(pattern, str(record.seq), re.I)
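
re.findall() returns only the matched text, and only non-overlapping matches. When positions or overlapping hits matter, re.finditer() can help. The sketch below uses a made-up toy sequence, and the lookahead wrapper is one possible approach rather than something the original workflow uses.

```python
import re

pattern = r'W.P'
seq = "AAWWPPGG"  # hypothetical toy sequence

# findall scans left to right and never reuses a character,
# so the overlapping 'WPP' starting at index 3 is skipped.
non_overlapping = re.findall(pattern, seq, re.I)

# finditer yields match objects, which carry the 0-based start position.
hits = [(m.start(), m.group()) for m in re.finditer(pattern, seq, re.I)]

# Wrapping the pattern in a zero-width lookahead with a capture group
# reports overlapping occurrences too.
overlapping = [
    (m.start(), m.group(1))
    for m in re.finditer(f'(?=({pattern}))', seq, re.I)
]

print(non_overlapping)
print(hits)
print(overlapping)
```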

6. Save the matches

If any matches are found, append the sequence identifier and the matches to the output file:

if matches:
    with open(outfile, 'a') as f:
        result = f"{record.id}\t{matches}"
        f.write(result + '\n')
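
The snippet above reopens the file for every matching record and writes the matches as a Python list literal. One possible variation is sketched below: it writes to a handle that is opened once and joins the matches with commas. The write_matches helper and the toy (id, sequence) pairs are hypothetical, and an in-memory buffer stands in for the output file so the example runs on its own.

```python
import io
import re


def write_matches(named_seqs, pattern, handle):
    """Write one tab-separated line per sequence with at least one match."""
    regex = re.compile(pattern, re.I)  # compile once, reuse for every sequence
    n_written = 0
    for seq_id, seq in named_seqs:
        matches = regex.findall(seq)
        if matches:
            handle.write(f"{seq_id}\t{','.join(matches)}\n")
            n_written += 1
    return n_written


# Hypothetical pairs standing in for (record.id, str(record.seq)).
toy = [("seq1", "MKWAPLL"), ("seq2", "GGGG"), ("seq3", "wtpWSP")]

buffer = io.StringIO()  # stands in for a file opened once with open(outfile, 'w')
count = write_matches(toy, r'W.P', buffer)
print(buffer.getvalue(), end="")
```

Opening the real file with mode 'w' overwrites earlier results on each run, whereas the 'a' mode used in the article keeps appending to them. Which is preferable depends on whether results from several runs should accumulate.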

Complete code example

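from Bio import SeqIO
import re

# Output file; results are appended, so reruns add to any existing content.
outfile = 'sekvenser.txt'

# Read every record from the FASTA file into a list.
records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))

# W, then any single character, then P.
pattern = r'W.P'

for record in records:
    # Case-insensitive search; returns a list of non-overlapping matches.
    matches = re.findall(pattern, str(record.seq), re.I)
    if matches:
        with open(outfile, 'a') as f:
            # One line per sequence: identifier, tab, list of matches.
            result = f"{record.id}\t{matches}"
            f.write(result + '\n')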