Issue
I'm trying to scrape Citi Bike trip data. Since there are multiple files I want to download, I thought it would be better to automate this with Python by collecting the desired links and then downloading them in a separate step.
Here is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/tripdata/index.html'
html_source = requests.get(url).text
soup = BeautifulSoup(html_source, "html.parser")
soup.prettify()  # no-op here: prettify() returns a string that isn't used
# I'm successful until I add '.find_all('tr')' at the end
citibikedata = soup.find('tbody', id="tbody-content").find_all('tr')
print(citibikedata)
When I try to print, I get an empty list. If I check the length with len(), I get 0. However, if I remove the find_all(), I get results for just the tbody-content.
I suspect that for some reason I'm not able to access the tr tag. Meanwhile, there's another layer of tag, 'td', that I have to access in order to fetch the data I'm actually looking for: the href and text in the a tag.
What am I missing? I would appreciate your help; many thanks in advance.
I couldn't find online resources showing how to access tags that have no classes, which is my suspicion for the issue. Perhaps it's not that.
Solution
The main issue here is that the data is loaded dynamically by JavaScript, so the table rows never appear in the static HTML that requests receives. You can instead fetch the data from the source it comes from: https://s3.amazonaws.com/tripdata returns an XML document listing the bucket's contents. Simply iterate over its Key elements and build your URLs.
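You can see this for yourself with a small sketch that reuses the question's own URL and selector: the tbody is present in the static source, but it contains no rows.

import requests
from bs4 import BeautifulSoup

html_source = requests.get('https://s3.amazonaws.com/tripdata/index.html').text
soup = BeautifulSoup(html_source, 'html.parser')
# The tbody exists in the static HTML, but it has no <tr> children -
# the rows are injected by JavaScript in the browser, which requests never runs.
print(soup.find('tbody', id='tbody-content'))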
Example
import requests
from bs4 import BeautifulSoup

base_url = 'https://s3.amazonaws.com/tripdata/'

# html.parser lowercases the tag names, but select() matches type
# selectors case-insensitively, so 'Key' still finds every <Key> element.
links = [
    base_url + key.text
    for key in BeautifulSoup(requests.get(base_url).text, 'html.parser').select('Key')
]
links
Output
['https://s3.amazonaws.com/tripdata/201306-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201307-201402-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201308-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201309-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201310-citibike-tripdata.zip',...]
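If you also want to download the archives, a minimal sketch could stream each zip to disk; the download helper and the tripdata directory here are illustrative, not part of the original answer.

import os
import requests

def download(url, dest_dir='tripdata'):
    # Hypothetical helper: streams one archive to disk in chunks so
    # large zips are never held entirely in memory.
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename

for link in links:
    download(link)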
Based on your comment, you could use a dict or a list of dicts:
import requests
from bs4 import BeautifulSoup

base_url = 'https://s3.amazonaws.com/tripdata/'

data = []
for e in BeautifulSoup(requests.get(base_url).text, 'html.parser').select('Contents'):
    data.append({
        'url': base_url + e.key.text,    # <Key> holds the file name
        'date': e.lastmodified.text      # <LastModified> holds the timestamp
    })
data
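The list can then be filtered or reduced by the timestamp; this is a sketch assuming data was built as above, relying on the fact that S3's ISO 8601 timestamps sort correctly as plain strings.

# ISO 8601 timestamps sort lexicographically, so max() on the
# 'date' string yields the most recently modified archive.
latest = max(data, key=lambda e: e['date'])
print(latest['url'], latest['date'])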
Answered By - HedgeHog