Issue
I am trying to get internet penetration data from the World Bank, and while parsing it for further processing I get this error. Here is the code:
import pandas as pd
import requests
import csv
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'  # Maximum number of results per page
}
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
# Extract relevant data from the parsed response
parsed_data = []
for entry in soup.find_all('record'):
    country_iso = entry.find('field', {'name': 'countryiso3code'}).get_text()
    country_name = entry.find('field', {'name': 'country'}).get_text()
    value = entry.find('field', {'name': 'value'}).get_text()
    for date_entry in entry.find_all('data'):
        date = date_entry.get('date')
        parsed_data.append({
            'countryiso3code': country_iso,
            'country': country_name,
            'date': date,
            'value': value
        })
# Create a DataFrame from the parsed data
df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df = df[df['date'].astype(int) >= 1990]
Error:
KeyError Traceback (most recent call last)
Cell In[15], line 47
44 df = pd.DataFrame(parsed_data)
46 # Add the 'date' column to the DataFrame
---> 47 df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
49 # Filter data for the past 21 years as it's the first available data input to the World Bank
50 df = df[df['date'].astype(int) >= 1990]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py:349, in RangeIndex.get_loc(self, key)
347 raise KeyError(key) from err
348 if isinstance(key, Hashable):
--> 349 raise KeyError(key)
350 self._check_indexing_error(key)
351 raise KeyError(key)
KeyError: 'date'
I am a beginner with this whole web scraping stuff. Can someone help me out? I tried changing some of the parsing code for the date, but no luck.
Solution
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
This step doesn't make sense: you have the data in JSON format, convert it into a string, and then parse it as HTML. But JSON is not HTML, so Beautiful Soup can't parse it in any meaningful way.
When soup.find_all('record') runs, it finds no records, and therefore the loop runs zero times.
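To see this concretely, here is a tiny sketch with a made-up record (the structure is invented purely for illustration):
from bs4 import BeautifulSoup
import json

# A made-up record, purely for illustration
sample = [{'countryiso3code': 'USA', 'date': '2020', 'value': 88.5}]
data_json = json.dumps(sample)

soup = BeautifulSoup(data_json, 'html.parser')
print(soup.find_all('record'))  # [] -- no tags in a JSON string, so parsed_data stays empty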
Instead, I would suggest something like this:
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
df = pd.json_normalize(data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
This converts the JSON response directly into a DataFrame.
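One thing to check afterwards: if the response follows the usual World Bank v2 layout, json_normalize flattens the nested country and indicator objects into dotted column names such as country.value. Verify with df.columns, then keep only what you need, for example:
# Assumes the usual World Bank v2 layout; confirm with print(df.columns) first
df = df.rename(columns={'country.value': 'country'})
df = df[['countryiso3code', 'country', 'date', 'value']]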
df = df[df['date'].astype(int) >= 1990]
This step isn't doing what you expect. Converting the date to an int gives the number of nanoseconds since 1970, so this code is checking whether the date is later than about 0.002 ms after Jan 1, 1970.
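For example, here is what a single converted date looks like:
import pandas as pd

ts = pd.Timestamp('1990-01-01')
print(ts.value)  # 631152000000000000 -- nanoseconds since 1970, not a year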
You probably want to check the year, instead:
df = df[df['date'].dt.year >= 1990]
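Putting it together, a minimal sketch of the whole fetch-and-filter flow, using the same URL and parameters as above:
import pandas as pd
import requests

url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {'format': 'json', 'date': '1990:2022', 'per_page': '10000'}

r = requests.get(url, params=params)
data = r.json()[1]  # index 1 contains the observations

df = pd.json_normalize(data)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df[df['date'].dt.year >= 1990]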
Answered By - Nick ODell