Issue
I am trying to get internet penetration data from the World Bank, and while parsing it for further processing I get this error. Here is the code:
import pandas as pd
import requests
import csv
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'  # Maximum number of results per page
}
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
# Extract relevant data from the parsed response
parsed_data = []
for entry in soup.find_all('record'):
    country_iso = entry.find('field', {'name': 'countryiso3code'}).get_text()
    country_name = entry.find('field', {'name': 'country'}).get_text()
    value = entry.find('field', {'name': 'value'}).get_text()
    for date_entry in entry.find_all('data'):
        date = date_entry.get('date')
        parsed_data.append({
            'countryiso3code': country_iso,
            'country': country_name,
            'date': date,
            'value': value
        })
# Create a DataFrame from the parsed data
df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df = df[df['date'].astype(int) >= 1990]
Error:
KeyError Traceback (most recent call last)
Cell In[15], line 47
44 df = pd.DataFrame(parsed_data)
46 # Add the 'date' column to the DataFrame
---> 47 df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
49 # Filter data for the past 21 years as it's the first available data input to the World Bank
50 df = df[df['date'].astype(int) >= 1990]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py:349, in RangeIndex.get_loc(self, key)
347 raise KeyError(key) from err
348 if isinstance(key, Hashable):
--> 349 raise KeyError(key)
350 self._check_indexing_error(key)
351 raise KeyError(key)
KeyError: 'date'
I am a beginner with this whole web scraping stuff. Can someone help me out? I tried changing some of the parsing code for the date, but no luck.
Solution
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
This step doesn't make sense: you have the data in JSON format, convert it into a string, and then parse it as HTML. But JSON is not HTML, so Beautiful Soup can't parse it in any meaningful way.
When soup.find_all('record') runs, it finds no records, and therefore the loop runs zero times.
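To see this concretely, here is a tiny sketch with a made-up record (the structure is invented purely for illustration):
from bs4 import BeautifulSoup
import json

# A made-up record, purely for illustration
sample = [{'countryiso3code': 'USA', 'date': '2020', 'value': 88.5}]
data_json = json.dumps(sample)

soup = BeautifulSoup(data_json, 'html.parser')
print(soup.find_all('record'))  # [] -- no tags in a JSON string, so parsed_data stays empty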
Instead, I would suggest something like this:
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
df = pd.json_normalize(data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
This converts the JSON response directly into a DataFrame.
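One thing to check afterwards: if the response follows the usual World Bank v2 layout, json_normalize flattens the nested country and indicator objects into dotted column names such as country.value. Verify with df.columns, then keep only what you need, for example:
# Assumes the usual World Bank v2 layout; confirm with print(df.columns) first
df = df.rename(columns={'country.value': 'country'})
df = df[['countryiso3code', 'country', 'date', 'value']]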
df = df[df['date'].astype(int) >= 1990]
This step isn't doing what you expect. Converting the date to an int gives the number of nanoseconds since 1970, so this code is checking whether the date is later than about 0.002 ms after Jan 1, 1970.
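For example, here is what a single converted date looks like:
import pandas as pd

ts = pd.Timestamp('1990-01-01')
print(ts.value)  # 631152000000000000 -- nanoseconds since 1970, not a year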
You probably want to check the year, instead:
df = df[df['date'].dt.year >= 1990]
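Putting it together, a minimal sketch of the whole fetch-and-filter flow, using the same URL and parameters as above:
import pandas as pd
import requests

url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {'format': 'json', 'date': '1990:2022', 'per_page': '10000'}

r = requests.get(url, params=params)
data = r.json()[1]  # index 1 contains the observations

df = pd.json_normalize(data)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df[df['date'].dt.year >= 1990]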
Answered By - Nick ODell