Wednesday, February 7, 2024

[FIXED] How to code for a python program (without using biopython) to read & list out all open reading frames (ORFs) of each sequence from a multi-fasta file

February 07, 2024 bioinformatics, biopython, python-3.x No comments

Issue

i'm a beginner to learning python, and need some help in trying to get the following details from a multi-fasta file:

list out all ORFs in all three reading frames for each sequence in a multi fasta format
list out the no. of ORFs and longest ORF for each sequence along with its position
find out the identifier/sequence description starting with a '>' for the longest ORfs.

I cannot seem to get even the lsiting of ORFs right :'). I'd really appreciate all the help i can get.

I tried listing out the ORFs from each sequence and counting them by first making a dictionary whose keys are the sequence descriptions/identifiers starting with a '>' and values are the sequences themselves.

my code:

def findcodon(file,frame):
    f=open(file)
    dictionary=makedictionary(file)
    mlist = list(dictionary.values())
    orfcount=0
    startpos=0
    endpos=0
    stopcodon=['tga','tag','taa','TGA','TAG','TGA']
    startcodon=['ATG','atg']
    for i in mlist:
        for j in range(frame,len(i),3):
            codon=i[j:j+3]
            if codon in startcodon:
                startpos=int(j)
            elif codon in stopcodon:
                endpos=int(j)
        orf=i[j][startpos:endpos]
        orfcount+=1
        print('orf',orfcount,':',orf)
        print('total no.of orfs:',orfcount)


findcodon('myseq.txt',1)

My output:

orf 1 : 
total no.of orfs: 1
orf 2 : 
total no.of orfs: 2
orf 3 : 
total no.of orfs: 3
orf 4 : 
total no.of orfs: 4
orf 5 : 
total no.of orfs: 5

Solution

In the original code, the start and end positions for each ORF were being overwritten in each loop iteration, leading to incorrect ORF extraction. Additionally, the logic for extracting ORFs using list slicing wasn't accurate, and the total number of ORFs was printed inside the loop, resulting in a count of 1 for each ORF due to repetition.

def find_codon(file, frame):
    # Open the file and read sequences into a dictionary
    dictionary = {}
    with open(file, 'r') as f:
        identifier = ''
        sequence = ''
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if identifier:
                    dictionary[identifier] = sequence
                identifier = line
                sequence = ''
            else:
                sequence += line

        # Add the last sequence
        if identifier:
            dictionary[identifier] = sequence

    # Define start and stop codons
    stop_codons = {'tga', 'tag', 'taa'}
    start_codon = 'atg'

    # Iterate through each sequence
    for identifier, sequence in dictionary.items():
        orf_count = 0
        longest_orf = ('', '', 0)  # (ORF, position, length)

        # Loop through each reading frame
        for i in range(frame):
            start_pos = None
            for j in range(i, len(sequence), 3):
                codon = sequence[j:j+3].lower()
                if codon == start_codon:
                    start_pos = j
                elif codon in stop_codons and start_pos is not None:
                    orf_count += 1
                    orf = sequence[start_pos:j+3]
                    orf_length = j - start_pos + 3
                    if orf_length > longest_orf[2]:
                        longest_orf = (orf, start_pos, orf_length)
                    print(f'ORF {orf_count}: {orf} at position {start_pos}')
                    start_pos = None

        print(f'Total number of ORFs: {orf_count}')
        if longest_orf[0]:
            print(f'Longest ORF: {longest_orf[0]} at position {longest_orf[1]}')

# Example usage
find_codon('example_fasta.fasta', 3)

Answered By - Umar

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, February 7, 2024

[FIXED] How to code for a python program (without using biopython) to read & list out all open reading frames (ORFs) of each sequence from a multi-fasta file

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels