Saturday, October 16, 2021

[FIXED] Trouble Looping through JSON elements pulled using API

Labels: json, loops, pandas, python, web-scraping

Issue

I am trying to pull search results data from an API on a website and put it into a pandas dataframe. I've been able to successfully pull the info from the API into a JSON format.

The next step I'm stuck on is how to loop through the search results on a particular page and then again for each page of results.

Here is what I've tried so far:

#Step 1: Connect to an API
import requests
import json
response_API = requests.get('https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page=1')
#200

#Step 2: Get the data from API
data = response_API.text

#Step 3: Parse the data into JSON format
parse_json = json.loads(data)

#Step 4: Extract data
title = parse_json['results'][0]['title']
pub_date = parse_json['results'][0]['publication_date']
agency = parse_json['results'][0]['agencies'][0]['name']

Here is where I've tried to put this all into a loop:

import numpy as np
import pandas as pd
df=[]
for page in np.arange(0,7):
    url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
    response_API = requests.get(url)
    print(response_API.status_code)
    data = response_API.text
    parse_json = json.loads(data)

    for i in parse_json:
        title = parse_json['results'][i]['title']
        pub_date = parse_json['results'][i]['publication_date']
        agency = parse_json['results'][i]['agencies'][0]['name']    
        df.append([title,pub_date,agency])


cols = ["Title", "Date","Agency"]

df = pd.DataFrame(df,columns=cols)

I feel like I'm close to the correct answer, but I'm not sure how to move forward from here. I need to iterate through the results where I placed the i's when parsing the JSON data, but I get an error that reads, "TypeError: list indices must be integers or slices, not str". I understand I can't put the i's in those spots, but how else am I supposed to iterate through the results?

Any help would be appreciated! Thank you!

Solution

I think you are very close!
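The TypeError in the question comes from `for i in parse_json`: iterating over a dict yields its keys, which are strings, and those strings are then used to index the "results" list. Here is a minimal sketch of the problem and the fix, using a made-up two-item payload (the `sample` data is hypothetical, not real API output):

```python
import json

# A tiny stand-in for one page of the API response (hypothetical sample data).
sample = json.loads('{"count": 2, "results": [{"title": "A"}, {"title": "B"}]}')

# Iterating over a dict yields its keys, which are strings.
keys = [k for k in sample]

# Indexing the "results" list with one of those string keys raises TypeError.
try:
    sample["results"][keys[0]]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Iterating over the list itself yields each result dict directly.
titles = [res["title"] for res in sample["results"]]
```

So the loop should run over `parse_json['results']`, not over `parse_json`.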

import pandas as pd
import requests

# {page} is filled in with str.format() inside the loop below
BASE_URL = "https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}"

results = []
for page in range(0, 7):
    response = requests.get(BASE_URL.format(page=page))
    if response.ok:
        resp_json = response.json()
        for res in resp_json["results"]:
            results.append(
                [
                    res["title"],
                    res["publication_date"],
                    [agency["name"] for agency in res["agencies"]]
                ]
            )

df = pd.DataFrame(results, columns=["Title", "Date", "Agencies"])

In this block of code, I used the requests library's built-in .json() method, which parses the response body as JSON and returns the corresponding Python object (here, a dict), provided the body is valid JSON.

The if response.ok check is a less verbose way, provided by requests, to confirm that the status code is below 400. It avoids errors that could occur when trying to parse the response after a failed HTTP call.
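To see the effect of that guard without calling the real API, here is a rough sketch that factors the loop into a helper. The names `collect_results`, `FakeResponse` and `fake_get_page` are hypothetical and exist only for this illustration. The fake response mimics just the two things the helper relies on: an `ok` attribute and a `json()` method.

```python
from typing import Any, Callable, Iterable, List


def collect_results(get_page: Callable[[int], Any], pages: Iterable[int]) -> List[dict]:
    """Gather the "results" entries from every page whose response is OK."""
    collected = []
    for page in pages:
        response = get_page(page)
        if not response.ok:  # status code 400 or above: skip instead of parsing
            continue
        collected.extend(response.json().get("results", []))
    return collected


class FakeResponse:
    """Minimal stand-in exposing only `ok` and `json()` (not the requests class)."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    @property
    def ok(self):
        return self.status_code < 400

    def json(self):
        return self._payload


def fake_get_page(page):
    # Pretend page 2 fails with a server error; the other pages succeed.
    if page == 2:
        return FakeResponse(500, {})
    return FakeResponse(200, {"results": [{"title": f"doc {page}"}]})


rows = collect_results(fake_get_page, range(1, 4))
```

With the real library, `get_page` could be something like `lambda page: requests.get(BASE_URL.format(page=page))`. The failed page is simply skipped, so one bad request does not abort the whole run.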

Finally, I'm not sure exactly what data you need in your DataFrame, but each object in the "results" list returned for each page has "agencies" as a list of agencies. I wasn't sure whether you wanted to drop everything but the first one, so I kept all the names as a list.
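If a single string per row is easier to work with than a list, one option is to join the agency names before appending. This is only a sketch: `join_agency_names` is a hypothetical helper, and the sample agencies below are made up.

```python
def join_agency_names(agencies, sep="; "):
    """Join the "name" of each agency dict into one string, skipping missing names."""
    names = [agency["name"] for agency in agencies if agency.get("name")]
    return sep.join(names)


# Hypothetical sample shaped like the "agencies" list of one result.
sample_agencies = [
    {"name": "Agency One"},
    {"raw_name": "NO DISPLAY NAME"},  # no "name" key, so it is skipped
    {"name": "Agency Two"},
]

joined = join_agency_names(sample_agencies)
```

In the loop, `join_agency_names(res["agencies"])` would then replace the list comprehension.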

Edit:

In case the response objects don't contain the proper keys, we can use the .get() method of Python dictionaries.

# ...snip
        for res in resp_json["results"]:
            results.append(
                [
                    res.get("title"),  # This will return `None` as a default, instead of causing a KeyError
                    res.get("publication_date"),
                    [
                        # Here, get the 'raw_name' or None, in case 'name' key doesn't exist
                        agency.get("name", agency.get("raw_name"))
                        for agency in res.get("agencies", [])
                    ]
                ]
            )
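To check how those .get() calls behave on incomplete data, here is a small self-contained sketch. The `to_row` helper and both sample records are hypothetical and only mirror the shape used above.

```python
def to_row(res):
    """Build one [title, date, agency names] row, tolerating missing keys."""
    return [
        res.get("title"),
        res.get("publication_date"),
        [
            # Fall back to 'raw_name', then to None, when 'name' is absent.
            agency.get("name", agency.get("raw_name"))
            for agency in res.get("agencies", [])
        ],
    ]


# Hypothetical records: one complete, one with missing keys.
complete = {
    "title": "Some Rule",
    "publication_date": "2021-09-15",
    "agencies": [{"name": "Agency One", "raw_name": "AGENCY ONE"}],
}
partial = {
    "title": "Only a title",
    "agencies": [{"raw_name": "SOME AGENCY"}, {}],
}

complete_row = to_row(complete)
partial_row = to_row(partial)
no_agencies_row = to_row({"title": "No agencies key"})
```

One caveat: the default passed to .get() only applies when the key is absent. If "agencies" were present but set to None, the comprehension would still fail, so `res.get("agencies") or []` may be the safer form in that case.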

Answered By - Sparrow1029

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
