Issue
i'm a beginner to learning python, and need some help in trying to get the following details from a multi-fasta file:
- list out all ORFs in all three reading frames for each sequence in a multi fasta format
- list out the no. of ORFs and longest ORF for each sequence along with its position
- find out the identifier/sequence description starting with a '>' for the longest ORfs.
I cannot seem to get even the lsiting of ORFs right :'). I'd really appreciate all the help i can get.
I tried listing out the ORFs from each sequence and counting them by first making a dictionary whose keys are the sequence descriptions/identifiers starting with a '>' and values are the sequences themselves.
my code:
def findcodon(file,frame):
f=open(file)
dictionary=makedictionary(file)
mlist = list(dictionary.values())
orfcount=0
startpos=0
endpos=0
stopcodon=['tga','tag','taa','TGA','TAG','TGA']
startcodon=['ATG','atg']
for i in mlist:
for j in range(frame,len(i),3):
codon=i[j:j+3]
if codon in startcodon:
startpos=int(j)
elif codon in stopcodon:
endpos=int(j)
orf=i[j][startpos:endpos]
orfcount+=1
print('orf',orfcount,':',orf)
print('total no.of orfs:',orfcount)
findcodon('myseq.txt',1)
My output:
orf 1 :
total no.of orfs: 1
orf 2 :
total no.of orfs: 2
orf 3 :
total no.of orfs: 3
orf 4 :
total no.of orfs: 4
orf 5 :
total no.of orfs: 5
Solution
In the original code, the start and end positions for each ORF were being overwritten in each loop iteration, leading to incorrect ORF extraction. Additionally, the logic for extracting ORFs using list slicing wasn't accurate, and the total number of ORFs was printed inside the loop, resulting in a count of 1 for each ORF due to repetition.
def find_codon(file, frame):
# Open the file and read sequences into a dictionary
dictionary = {}
with open(file, 'r') as f:
identifier = ''
sequence = ''
for line in f:
line = line.strip()
if line.startswith('>'):
if identifier:
dictionary[identifier] = sequence
identifier = line
sequence = ''
else:
sequence += line
# Add the last sequence
if identifier:
dictionary[identifier] = sequence
# Define start and stop codons
stop_codons = {'tga', 'tag', 'taa'}
start_codon = 'atg'
# Iterate through each sequence
for identifier, sequence in dictionary.items():
orf_count = 0
longest_orf = ('', '', 0) # (ORF, position, length)
# Loop through each reading frame
for i in range(frame):
start_pos = None
for j in range(i, len(sequence), 3):
codon = sequence[j:j+3].lower()
if codon == start_codon:
start_pos = j
elif codon in stop_codons and start_pos is not None:
orf_count += 1
orf = sequence[start_pos:j+3]
orf_length = j - start_pos + 3
if orf_length > longest_orf[2]:
longest_orf = (orf, start_pos, orf_length)
print(f'ORF {orf_count}: {orf} at position {start_pos}')
start_pos = None
print(f'Total number of ORFs: {orf_count}')
if longest_orf[0]:
print(f'Longest ORF: {longest_orf[0]} at position {longest_orf[1]}')
# Example usage
find_codon('example_fasta.fasta', 3)
Answered By - Umar
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.