Issue
I'm trying to write a quick parser for .pdb files (they show protein structure). An example of a protein I am looking at is KRAS (common in cancer) and is here: http://www.rcsb.org/pdb/files/3GFT.pdb
If you scroll down far enough you will get to a line that looks like this: ATOM 1 N MET A 1 63.645 97.355 31.526 1.00 33.80 N
The first element "atom" means this relates to an actual atom in the protein. The 1 relates to a general count, N relates to the type of atom, "MET" is the name of the residue, "A" relates to the the type of chain, 1 (the second "1") is the atom count and then the next 3 numbers are the x-y-z positions in space.
What I need output is something like this (where the "1" below corresponds to the atom count, not the general count): MET A 1 63.645 97.355 31.526
To make matters more complicated, sometimes the atom count (the second "1" in this case) is negative. In those cases I want to skip that line an continue until I hit a positive entry as those elements relate to the biochemistry needed to find the positions and not the actual protein. To make matters even worse, sometimes you get a line as such:
ATOM 139 CA AILE A 21 63.260 111.496 12.203 0.50 12.87 C
ATOM 140 CA BILE A 21 63.275 111.495 12.201 0.50 12.17 C
While they both refer to residue 21, the biochem isn't precise enough to get an exact position, so they give two options. Ideally, I would specify "1", "2" or whatever, but if I just take the first option that would be ok. Finally, for the type of atom ("N") in my original example, I only want to get those lines with a "CA".
I'm a newbie to python, and my training is in biostats so I was wondering what's the best way to do this? Do I parse this line by line with a for loop? Is there a way to do it faster in python? How do I handle the double entries for some atoms?
I realize it's a bit to ask, but some guidance would be a ton of help! I've programmed all the statistics bits in using R, but now I just need to get my files in the right format!
Thanks!
Solution
That's a long description. I am not sure I got all that correctly :-) If the fields (for the lines start with ATOM) are fixed, you could use a split and some comaprisons. I've used a hash to see if the entry is already seen to eliminate the duplicate as you wanted. Hope this would give you a start,
visited = {}
for line in open('3GFT.pdb'):
list = line.split()
id = list[0]
if id == 'ATOM':
type = list[2]
if type == 'CA':
residue = list[3]
type_of_chain = list[4]
atom_count = int(list[5])
position = list[6:8]
if atom_count >= 0:
if type_of_chain not in visited:
visited[type_of_chain] = 1
print residue,type_of_chain,atom_count,' '.join(position)
Will output,
MET A 1 62.935 97.579
GLY B 0 39.524 105.916
GLY C 0 67.295 110.376
MET D 1 59.311 124.106
GLY E 0 44.038 96.819
GLY F 0 44.187 123.590
Answered By - dpp
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.