Issue
I asked this question recently and received very helpful advice. I have been building on the advice but have again encountered a problem I can't solve. I would be very grateful for any advice.
In my previous situation, I read data from a well-organized external file into a dataframe. The data were already in columns and all I had to do to get the data was:
import pandas as pd
df = pd.read_table("organizedValues.txt", delimiter="\t")
In the current problem, the data I need are not in a well-organized file. They're posted on a website, and I want to put them into a Pandas dataframe. Here are the details.
There are many different molecules. There's a separate webpage for each molecule and it always has the same structure. Here's an example of how one molecule's webpage appears.
CH4 Methane
13.00000 Things in unit
4.00000 L, units in unit defintion
262.22300 mMass (lb/mol)
300.00000 K_0 (C)
100.45200 V_0 (m^3/mol)
719.08310 Eta_0 (-)
0.00000 Some parameter
0.00000 Another parameter
0.00000 Strain (-)
200.00000 K_1 (C)
I followed advice here to read data from a URL and save it in a file. Now I have a file called "ch4.txt." The problem is that it's a mess. I can't use "read_table" as I did in my previous post. Here's a very small snippet as an example:
{"name”:”test2”,”path”:”test2”,”contentType":"file"},{"name”:”test5”,”path”:”test”5,”contentType":"file"}],"totalCount”:500}},”fileTreeProcessingTime”:25,”foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id”:yyy,”defaultBranch":"main","name”:”code”name,”ownerLogin”:”username”,”currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2022-04-31T01:30:52.000Z",v=4","public":true,"private":false,"isOrgOwned":false},"symbolsExpanded":false,"treeExpanded":true,"refInfo":{"name":"main","listCacheKey”:”XXX”,”canEdit":false,"refType":"branch","currentOid”:”XXX”},”path”:”or”,”currentUser":null,"blob":{"rawLines":[" CH4 Methane "," 13.00000 Strain (-) "," 0.00000 K_0 (C)
"," 300.00000 Things in unit "," 13.00000
Eta_0 (-) "," 756
I need to search the file for keywords like "Strain (-)" and then extract the values after the keywords (in this case, 0.00000). For this particular example, I would want to end up with a Pandas dataframe like:
   Name  Strain (-)    K_0 (C)  Things in unit  Eta_0 (-)
1   CH4     0.00000  300.00000        13.00000        756
A cleaner way to do it (so I won't be storing many files) would be to get the data directly from the website and read it into a dataframe in an organized way. However, I have spent a few hours on trying to find information about how to do this and have not been successful. If anyone has encountered this kind of thing before, I would love to know what worked. Thank you.
Solution
Assuming a substring starting with "blob" can be extracted from the input (the sample is an incomplete fragment), here's how to convert it to the expected dataframe:
import json
import pandas as pd
from io import StringIO
# fragment may need to be surrounded with {}
rawstr = '''
{
"blob": {
"rawLines": ["CH4 Methane ",
" 13.00000 Strain (-) ",
" 0.00000 K_0 (C) ",
" 300.00000 Things in unit ",
" 13.00000 Eta_0 (-) ", "756 ."
]
}
}
'''
# parsing the fragment as json
data = json.load(StringIO(rawstr))
cdata = []
for v in data["blob"]["rawLines"]:
    # split each line into its first token and the remaining text
    vals = [x for x in v.strip().split(' ') if x != '']
    cdata.append([vals[0], " ".join(vals[1:])])
datadict = {}
datadict['Name'] = [cdata[0][0]]
for s in range(1, len(cdata) - 1):
    # value must be taken from next item in the list that's why it's s+1
    datadict[cdata[s][1]] = cdata[s+1][0]
df = pd.DataFrame(data=datadict, dtype=object)
print(df)
Result
   Name  Strain (-)    K_0 (C)  Things in unit  Eta_0 (-)
0   CH4     0.00000  300.00000        13.00000        756
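The answer assumes that the part of the file starting with "blob" can be extracted first. A minimal sketch of one way to do that is shown here, assuming the saved text contains a "rawLines" key followed by a valid JSON array written with straight quotes (the curly quotes in the question's snippet would not parse as JSON). The function name extract_raw_lines and the sample string are made up for illustration. The text could equally come from a downloaded page instead of a saved file.

```python
import json


def extract_raw_lines(text):
    """Return the list stored under "rawLines" in text, or None if the key is absent."""
    key = '"rawLines":'
    pos = text.find(key)
    if pos == -1:
        return None
    start = pos + len(key)
    # raw_decode expects the value to begin exactly at the index, so skip whitespace.
    while start < len(text) and text[start].isspace():
        start += 1
    # raw_decode parses a single JSON value and ignores whatever follows it,
    # so the rest of the file does not have to be valid JSON.
    lines, _end = json.JSONDecoder().raw_decode(text, start)
    return lines


# Made-up fragment imitating the saved file: unrelated text surrounds the array.
messy = 'junk,"blob":{"rawLines":[" CH4 Methane "," 13.00000 Strain (-) "],"other":1},more junk'
print(extract_raw_lines(messy))
```

The returned list can be used in place of data["blob"]["rawLines"] in the code above, which avoids having to repair the whole file into valid JSON.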
Answered By - LMC
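A note on the pairing of values and labels: the sample webpage in the question shows each value on the same line as its label (for example "300.00000 K_0 (C)"), while the fragment handled above pairs a label with the value on the following line. If the real files follow the webpage layout, a variant along these lines might fit better. This is only a sketch under that assumption; lines_to_row is a hypothetical helper, and the input lines are copied from the question's sample page.

```python
import pandas as pd


def lines_to_row(raw_lines):
    """Build one record from lines shaped like 'VALUE LABEL'.

    The first non-empty line is assumed to be 'FORMULA NAME'.
    """
    cleaned = [ln.strip() for ln in raw_lines if ln.strip()]
    row = {"Name": cleaned[0].split()[0]}
    for ln in cleaned[1:]:
        # The first token is the value; everything after it is the label.
        value, _, label = ln.partition(" ")
        row[label.strip()] = value
    return row


ch4_lines = [
    " CH4 Methane ",
    " 13.00000 Things in unit ",
    " 300.00000 K_0 (C) ",
    " 719.08310 Eta_0 (-) ",
    " 0.00000 Strain (-) ",
]

# One dictionary per molecule; more rows can be appended to the list.
df = pd.DataFrame([lines_to_row(ch4_lines)])
print(df)
```

The values are kept as strings here, as in the answer above; pd.to_numeric can convert individual columns if numbers are needed.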