Issue
I'm trying to scrape Citi Bike trip data. Since there are multiple files I want to download, I thought it would be better to automate this with Python by collecting the desired links and then downloading them in a separate step.
Here is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/tripdata/index.html'
html_source = requests.get(url).text
soup = BeautifulSoup(html_source, "html.parser")
soup.prettify()  # no-op here: prettify() returns a string that isn't used
# I'm successful until I add '.find_all('tr')' at the end
citibikedata = soup.find('tbody', id="tbody-content").find_all('tr')
print(citibikedata)
When I try to print, I get an empty list. If I check the length with len(), I get 0. However, if I remove the find_all(), I get results for just the tbody-content.
I suspect that for some reason I'm not able to access the tr tag. Meanwhile, there's another layer of tag, 'td', that I have to access in order to fetch the data I'm actually looking for: the href and text in the a tag.
What am I missing? I would appreciate your help; many thanks in advance.
I couldn't find online resources showing how to access tags that have no classes, which is my suspicion for the issue. Perhaps it's not that.
Solution
The main issue here is that the data is loaded dynamically by JavaScript, so the table rows never appear in the static HTML that requests receives. You can instead fetch the data from the source it comes from: https://s3.amazonaws.com/tripdata returns an XML document listing the bucket's contents. Simply iterate over its Key elements and build your URLs.
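You can see this for yourself with a small sketch that reuses the question's own URL and selector: the tbody is present in the static source, but it contains no rows.

import requests
from bs4 import BeautifulSoup

html_source = requests.get('https://s3.amazonaws.com/tripdata/index.html').text
soup = BeautifulSoup(html_source, 'html.parser')
# The tbody exists in the static HTML, but it has no <tr> children -
# the rows are injected by JavaScript in the browser, which requests never runs.
print(soup.find('tbody', id='tbody-content'))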
Example
import requests
from bs4 import BeautifulSoup

base_url = 'https://s3.amazonaws.com/tripdata/'

# html.parser lowercases the tag names, but select() matches type
# selectors case-insensitively, so 'Key' still finds every <Key> element.
links = [
    base_url + key.text
    for key in BeautifulSoup(requests.get(base_url).text, 'html.parser').select('Key')
]
links
Output
['https://s3.amazonaws.com/tripdata/201306-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201307-201402-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201308-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201309-citibike-tripdata.zip',
'https://s3.amazonaws.com/tripdata/201310-citibike-tripdata.zip',...]
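If you also want to download the archives, a minimal sketch could stream each zip to disk; the download helper and the tripdata directory here are illustrative, not part of the original answer.

import os
import requests

def download(url, dest_dir='tripdata'):
    # Hypothetical helper: streams one archive to disk in chunks so
    # large zips are never held entirely in memory.
    os.makedirs(dest_dir, exist_ok=True)
    filename = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename

for link in links:
    download(link)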
Based on your comment, you could use a dict or a list of dicts:
import requests
from bs4 import BeautifulSoup

base_url = 'https://s3.amazonaws.com/tripdata/'

data = []
for e in BeautifulSoup(requests.get(base_url).text, 'html.parser').select('Contents'):
    data.append({
        'url': base_url + e.key.text,    # <Key> holds the file name
        'date': e.lastmodified.text      # <LastModified> holds the timestamp
    })
data
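The list can then be filtered or reduced by the timestamp; this is a sketch assuming data was built as above, relying on the fact that S3's ISO 8601 timestamps sort correctly as plain strings.

# ISO 8601 timestamps sort lexicographically, so max() on the
# 'date' string yields the most recently modified archive.
latest = max(data, key=lambda e: e['date'])
print(latest['url'], latest['date'])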
Answered By - HedgeHog