Issue
The page at this URL has some CSV data files under the folder 'usage-stats' that I'd like to use for analysis. However, I'm unable to scrape them from this page. When I run the code below, I only get the link to the terms and conditions page instead of the links to the CSV data files:
import requests
from bs4 import BeautifulSoup

def fetch_file_list(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        links = [link.get("href") for link in soup.find_all("a")]
        return links

fetch_file_list(<urlname>)
This code gives the below output:
['https://tfl.gov.uk/corporate/terms-and-conditions/transport-data-service']
I also tried using the URL with the folder name included, but got the same result. I'm new to scraping.
When only the T&C link is returned, are we supposed to assume that scraping data from this site is impossible? If not, how else should I do it?
I need the data from the usage-stats folder for 2021 to 2023, which is a lot of files. I'd like to find a simple way to read them automatically, if that is possible.
Solution
The data is loaded dynamically via JavaScript, so the file links are not in the HTML that requests receives. Look up the XHR / API requests the page makes in your browser's developer tools (Network tab).
One of them returns an XML document listing the files in the bucket; you can parse it and extract the file URLs from it.
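To make the shape of that XML concrete, here is a minimal sketch that parses a listing with the standard library's ElementTree instead of BeautifulSoup. The SAMPLE string is a trimmed, illustrative stand-in and not the real response, and the namespace is assumed to be the usual S3 one; extract_keys is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample of the listing XML (not the real response).
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>cycling.data.tfl.gov.uk</Name>
  <IsTruncated>false</IsTruncated>
  <Contents><Key>usage-stats/</Key></Contents>
  <Contents><Key>usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv</Key></Contents>
</ListBucketResult>"""

# Assumption: the listing uses the standard S3 XML namespace.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def extract_keys(xml_text):
    """Return the text of every <Key> inside a <Contents> element."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("s3:Contents/s3:Key", NS)]


print(extract_keys(SAMPLE))
```

Each key is a path inside the bucket, so the folder itself ('usage-stats/') shows up as an entry alongside the files and has to be filtered out.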
Example
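import requests
from bs4 import BeautifulSoup

# The listing is XML; html.parser lowercases tag names, so <Contents><Key> is selected as 'contents key'.
url = 'https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/?list-type=2&max-keys=1500'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for c in soup.select('contents key'):
    # Keep only the CSV files in the usage-stats folder.
    if c.text.startswith('usage-stats') and c.text.endswith('.csv'):
        print('https://cycling.data.tfl.gov.uk/' + c.text)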
https://cycling.data.tfl.gov.uk/usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv
https://cycling.data.tfl.gov.uk/usage-stats/01b Journey Data Extract 24Jan16-06Feb16.csv
https://cycling.data.tfl.gov.uk/usage-stats/01bJourneyDataExtract24Jan16-06Feb16.csv
https://cycling.data.tfl.gov.uk/usage-stats/02aJourneyDataExtract07Fe16-20Feb2016.csv
https://cycling.data.tfl.gov.uk/usage-stats/02bJourneyDataExtract21Feb16-05Mar2016.csv
https://cycling.data.tfl.gov.uk/usage-stats/03JourneyDataExtract06Mar2016-31Mar2016.csv
...
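For the 2021 to 2023 range asked about in the question, the same listing can be narrowed down by file name. The sketch below is one possible approach, not a tested solution for this bucket: it assumes every file name ends with a date like 23Jan16 or 31Mar2016 (the names above are not fully consistent, so some files may be missed), it uses the standard library's urllib in place of requests, and the helper names end_year, list_keys and csv_urls are made up for illustration. S3 listings return at most 1,000 keys per request, so list_keys passes a prefix and follows the continuation token rather than relying on max-keys.

```python
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

LIST_URL = "https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/"
FILE_BASE = "https://cycling.data.tfl.gov.uk/"
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}  # assumed S3 namespace

# Assumption: the file name ends with a date such as 23Jan16 or 31Mar2016.
END_DATE = re.compile(r"\d{1,2}[A-Za-z]+(\d{4}|\d{2})\.csv$")


def end_year(key):
    """Return the year of the last date in the file name, or None if absent."""
    match = END_DATE.search(key)
    if match is None:
        return None
    year = int(match.group(1))
    return year + 2000 if year < 100 else year


def list_keys(prefix="usage-stats/"):
    """Collect all keys under prefix, following continuation tokens."""
    keys, token = [], None
    while True:
        params = {"list-type": "2", "prefix": prefix}
        if token:
            params["continuation-token"] = token
        url = LIST_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        keys += [el.text for el in root.findall("s3:Contents/s3:Key", NS)]
        token = root.findtext("s3:NextContinuationToken", namespaces=NS)
        if not token:
            return keys


def csv_urls(keys, first_year=2021, last_year=2023):
    """Build download URLs for CSV keys whose end date falls in the range."""
    return [
        FILE_BASE + urllib.parse.quote(key)  # quote() encodes spaces in names
        for key in keys
        if key.endswith(".csv") and first_year <= (end_year(key) or 0) <= last_year
    ]
```

Calling csv_urls(list_keys()) would then give the URLs to download, and each one can be passed straight to pandas.read_csv(url) if pandas is available. It is worth printing the keys that end_year cannot parse before trusting the filter.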
Answered By - HedgeHog