Thursday, December 14, 2023

[FIXED] Python Pandas: how to get data from a website into a dataframe by searching for values following key words?

Labels: dataframe, file-io, pandas, python, python-3.x

Issue

I asked this question recently and received very helpful advice. I have been building on the advice but have again encountered a problem I can't solve. I would be very grateful for any advice.

In my previous situation, I read data from a well-organized external file into a dataframe. The data were already in columns and all I had to do to get the data was:

import pandas as pd

df = pd.read_table("organizedValues.txt", delimiter="\t")

In the current problem, the data I need are not in a well-organized file. They're posted on a website, and I want to put them into a Pandas dataframe. Here are the details.

There are many different molecules. There's a separate webpage for each molecule and it always has the same structure. Here's an example of how one molecule's webpage appears.

CH4   Methane                  
    13.00000        Things in unit                             
     4.00000        L, units in unit definition                     
   262.22300        mMass (lb/mol)                              
   300.00000        K_0 (C)                                           
   100.45200        V_0 (m^3/mol)                                    
   719.08310        Eta_0 (-)                                       
     0.00000        Some parameter                           
     0.00000        Another parameter                           
     0.00000        Strain (-)                            
   200.00000        K_1 (C)

I followed advice here to read data from a URL and save it in a file. Now I have a file called "ch4.txt." The problem is that it's a mess. I can't use "read_table" as I did in my previous post. Here's a very small snippet as an example:

{"name”:”test2”,”path”:”test2”,”contentType":"file"},{"name”:”test5”,”path”:”test”5,”contentType":"file"}],"totalCount”:500}},”fileTreeProcessingTime”:25,”foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id”:yyy,”defaultBranch":"main","name”:”code”name,”ownerLogin”:”username”,”currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2022-04-31T01:30:52.000Z",v=4","public":true,"private":false,"isOrgOwned":false},"symbolsExpanded":false,"treeExpanded":true,"refInfo":{"name":"main","listCacheKey”:”XXX”,”canEdit":false,"refType":"branch","currentOid”:”XXX”},”path”:”or”,”currentUser":null,"blob":{"rawLines":[" CH4 Methane "," 13.00000 Strain (-) "," 0.00000 K_0 (C)
"," 300.00000 Things in unit "," 13.00000
Eta_0 (-) "," 756

I need to search the file for keywords like "Strain (-)" and then extract the values after the keywords (in this case, 0.00000). For this particular example, I would want to end up with a Pandas dataframe like:

    Name   Strain (-)   K_0 (C)    Things in unit  Eta_0 (-)
1   CH4    0.00000      300.00000  13.00000        756

A cleaner way to do it (so I won't be storing many files) would be to get the data directly from the website and read it into a dataframe in an organized way. However, I have spent a few hours trying to find information about how to do this and have not been successful. If anyone has encountered this kind of thing before, I would love to know what worked. Thank you.
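
One way the direct route could look, as a minimal sketch rather than a tested recipe: it assumes each molecule's page can be fetched as plain text in the layout shown earlier (a number followed by its label on each line). The names parse_molecule_text and fetch_text, and the sample string, are hypothetical and introduced here only for illustration.

```python
import re
from urllib.request import urlopen

# Matches a line such as "   262.22300        mMass (lb/mol)": a number, then a label.
LINE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s+(\S.*?)\s*$")


def parse_molecule_text(text):
    """Turn one molecule's plain-text listing into a {column: value} row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumption: the first non-empty line is "<formula> <name>", e.g. "CH4   Methane".
    row = {"Name": lines[0].split()[0]}
    for line in lines[1:]:
        match = LINE_RE.match(line)
        if match:
            value, label = match.groups()
            row[label] = value  # kept as text, like the 0.00000 in the question
    return row


def fetch_text(url):
    """Download a page as text (hypothetical helper; the URL is up to the caller)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Offline sample in the same layout as the page, so no network access is needed here.
sample = """CH4   Methane
    13.00000        Things in unit
   262.22300        mMass (lb/mol)
   300.00000        K_0 (C)
     0.00000        Strain (-)
"""
row = parse_molecule_text(sample)
print(row)
```

With pandas imported as pd, pd.DataFrame([row]) would give a one-row dataframe, and a list with one such row per molecule would give one dataframe row per molecule without any intermediate files.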

Solution

Assuming a substring starting with "blob" can be extracted from the input (the sample is an incomplete fragment), here's how to convert it to the expected dataframe:

import json
import pandas as pd

# The fragment may need to be wrapped in {} to form valid JSON
rawstr = '''
{
  "blob": {
    "rawLines": ["CH4   Methane                  ",
    "            13.00000        Strain (-)                             ",
    "             0.00000   K_0 (C)     ",
    "         300.00000     Things in unit   ",
    "             13.00000  Eta_0 (-)   ", "756    ."
    ]
  }
}
'''

# Parse the fragment as JSON
data = json.loads(rawstr)

# Split each raw line into [first token, rest of the line]
cdata = []
for v in data["blob"]["rawLines"]:
    vals = v.split()
    cdata.append([vals[0], " ".join(vals[1:])])

# The first line holds the molecule name
datadict = {"Name": [cdata[0][0]]}

for s in range(1, len(cdata) - 1):
    # In this fragment each label's value is the first token of the next line, hence s + 1
    datadict[cdata[s][1]] = cdata[s + 1][0]

df = pd.DataFrame(data=datadict, dtype=object)

print(df)

Result

  Name Strain (-)    K_0 (C) Things in unit Eta_0 (-)
0  CH4    0.00000  300.00000       13.00000       756
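
The answer assumes the "blob" fragment has already been cut out of the saved page. A minimal sketch of one way that step might be done is below, assuming the real file uses straight double quotes and contains a complete "rawLines" array (the snippet in the question is truncated and has curly quotes, so it would not parse as-is). The function name extract_raw_lines and the sample page text are hypothetical.

```python
import json


def extract_raw_lines(page_text, key='"rawLines"'):
    """Return the list stored under "rawLines" inside a larger, messy page dump."""
    key_pos = page_text.find(key)
    if key_pos == -1:
        raise ValueError("no rawLines key found in the page text")
    array_start = page_text.find("[", key_pos)
    if array_start == -1:
        raise ValueError("no array found after the rawLines key")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so the rest of the page does not matter.
    lines, _end = json.JSONDecoder().raw_decode(page_text, array_start)
    return lines


# Made-up page text: junk before and after the part that matters.
page_text = (
    'junk before {"refType":"branch"},"blob":{"rawLines":'
    '["CH4   Methane  ","  13.00000  Strain (-) ","  0.00000  K_0 (C) "],'
    '"other":1}} junk after'
)
raw_lines = extract_raw_lines(page_text)
print(raw_lines)
```

The returned list has the same shape as data["blob"]["rawLines"] in the code above, so it could be fed to the same loop.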

Answered By - LMC

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
