Saturday, December 9, 2023

[FIXED] Parsing a .pdb file in Python

December 09, 2023 python No comments

Issue

I'm trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb

If you scroll down far enough you will get to a line that looks like this: ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N

The first element "atom" means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, "MET" is the name of the residue, "A" relates to the the type of chain, 1 (the second "1") is the atom count and then the next 3 numbers are the x-y-z positions in space.

What I need output is something like this (where the "1" below corresponds to the atom count, not the general count): MET A 1 63.645 97.355 31.526

To make matters more complicated, sometimes the atom count (the second "1" in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:

ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C

While they both refer to residue 21, the biochem isn't precise enough to get an exact position, so they give two options. Ideally, I would specify "1", "2" or whatever, but if I just take the first option that would be ok. Finally, for the type of atom ("N") in my original example, I only want to get those lines with a "CA".

I'm a newbie to python, and my training is in biostats so I was wondering what's the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?

I realize it's a bit to ask, but some guidance would be a ton of help! I've programmed all the statistics bits in using R, but now I just need to get my files in the right format!

Thanks!

Solution

That's a long description. I am not sure I got all that correctly :-) If the fields (for the lines start with ATOM) are fixed, you could use a split and some comaprisons. I've used a hash to see if the entry is already seen to eliminate the duplicate as you wanted. Hope this would give you a start,

visited = {}
for line in open('3GFT.pdb'):
    list = line.split()
    id = list[0]
    if id == 'ATOM':
        type = list[2]
        if type == 'CA':
            residue = list[3]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:8]
            if atom_count >= 0:
                if type_of_chain not in visited:
                    visited[type_of_chain] = 1
                    print residue,type_of_chain,atom_count,' '.join(position)

Will output,

MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590

Answered By - dpp

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, December 9, 2023

[FIXED] Parsing a .pdb file in Python

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels