Friday, December 8, 2023

[FIXED] Web Scraping returns T&C link

Tags: beautifulsoup, python, python-requests, web-scraping

Issue

This URL has some CSV data files under the folder 'usage-stats' that I'd like to use for analysis. However, I'm unable to scrape them from the page. When I run the code below, I get only the terms and conditions link instead of the links to the CSV data files:

import requests
from bs4 import BeautifulSoup

def fetch_file_list(url):
    links = []  # stays empty if the request does not succeed
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        # collect the href of every anchor tag on the page
        links = [link.get("href") for link in soup.find_all("a")]
    return links

fetch_file_list(<urlname>)

This code gives the following output:

['https://tfl.gov.uk/corporate/terms-and-conditions/transport-data-service']

I also tried using the URL with the folder name included, but that made no difference. I'm new to scraping.

When only the T&C link is returned, are we supposed to assume that scraping data from this site is impossible? If not, how else should I do it?

I need the data from the usage-stats folder for 2021 to 2023, which is a lot of files. I'd like to find a simple way to read them automatically, if that is possible.

Solution

The data is loaded dynamically via JavaScript, so the links are not in the HTML that requests receives. You have to look up the XHR/API requests in your browser's developer tools.

One of those requests returns an XML document listing the files, which you can parse to extract the file URLs.
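The listing endpoint appears to be a plain S3 bucket index, so the XML can also be read with an XML parser and followed page by page. The sketch below is only an illustration under some assumptions: that the response uses the usual S3 ListObjectsV2 layout (`Contents`/`Key`, `IsTruncated`, `NextContinuationToken`), and that a single response may be truncated (S3 listings are normally capped at 1,000 keys per response, whatever `max-keys` asks for). The names `parse_listing`, `fetch_page` and `list_keys` are made up for this example, and it uses the standard library's urllib rather than requests only to stay dependency-free.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Bucket endpoint taken from the example in this post.
BUCKET_URL = "https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/"


def _local(tag):
    """Reduce a namespaced tag such as '{http://...}Key' to 'Key'."""
    return tag.rsplit("}", 1)[-1]


def parse_listing(xml_text):
    """Return (keys, next_token) for one listing response.

    next_token is None when the response is not marked as truncated.
    """
    root = ET.fromstring(xml_text)
    keys, truncated, token = [], False, None
    for el in root.iter():
        name = _local(el.tag)
        if name == "Key":
            keys.append(el.text or "")
        elif name == "IsTruncated":
            truncated = (el.text or "").strip().lower() == "true"
        elif name == "NextContinuationToken":
            token = el.text
    return keys, (token if truncated else None)


def fetch_page(params):
    """Download one page of the listing as text (performs a network call)."""
    with urlopen(BUCKET_URL + "?" + urlencode(params), timeout=30) as resp:
        return resp.read().decode("utf-8")


def list_keys(prefix="usage-stats/", fetch=fetch_page):
    """Collect every key under prefix, following continuation tokens."""
    keys, token = [], None
    while True:
        params = {"list-type": "2", "prefix": prefix}
        if token:
            params["continuation-token"] = token
        page_keys, token = parse_listing(fetch(params))
        keys.extend(page_keys)
        if token is None:
            return keys
```

Calling `list_keys()` would then return the key names under `usage-stats/`. The `fetch` parameter exists only so the paging logic can be tried with canned XML instead of a live request.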

Example

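import requests
from bs4 import BeautifulSoup

# S3 bucket listing (XML) that the page requests via JavaScript
url = 'https://s3-eu-west-1.amazonaws.com/cycling.data.tfl.gov.uk/?list-type=2&max-keys=1500'
# html.parser lowercases tag names, so <Contents><Key> matches 'contents key'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for c in soup.select('contents key'):
    # keep only the CSV files in the usage-stats folder
    if c.text.startswith('usage-stats') and c.text.endswith('.csv'):
        print('https://cycling.data.tfl.gov.uk/' + c.text)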

https://cycling.data.tfl.gov.uk/usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv
https://cycling.data.tfl.gov.uk/usage-stats/01b Journey Data Extract 24Jan16-06Feb16.csv
https://cycling.data.tfl.gov.uk/usage-stats/01bJourneyDataExtract24Jan16-06Feb16.csv
https://cycling.data.tfl.gov.uk/usage-stats/02aJourneyDataExtract07Fe16-20Feb2016.csv
https://cycling.data.tfl.gov.uk/usage-stats/02bJourneyDataExtract21Feb16-05Mar2016.csv
https://cycling.data.tfl.gov.uk/usage-stats/03JourneyDataExtract06Mar2016-31Mar2016.csv
...
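Since the question asks only for 2021 to 2023, the keys could be narrowed down by the year at the end of each file name. This is a rough sketch, not a tested recipe: it assumes the later files follow the same naming pattern as the ones printed above (a date such as `23Jan16` or `20Feb2016` right before `.csv`), which has not been verified here. The helper names `end_year` and `csv_urls` are hypothetical. Some keys contain spaces, so the sketch also percent-encodes them when building the URL.

```python
import re
from urllib.parse import quote

# Download host used in the example above.
BASE_URL = "https://cycling.data.tfl.gov.uk/"

# Assumed pattern: month letters, then a 4- or 2-digit year, then '.csv'.
_END_YEAR = re.compile(r"[A-Za-z]{2,}(\d{4}|\d{2})\.csv$")


def end_year(key):
    """Return the year of the end date in a key, or None if none is found."""
    match = _END_YEAR.search(key)
    if match is None:
        return None
    year = int(match.group(1))
    # Two-digit years such as '16' are assumed to mean 2016.
    return year + 2000 if year < 100 else year


def csv_urls(keys, first_year=2021, last_year=2023):
    """Build download URLs for CSV keys whose end year is in the range."""
    urls = []
    for key in keys:
        year = end_year(key)
        if year is not None and first_year <= year <= last_year:
            # quote() keeps '/' and encodes spaces as %20.
            urls.append(BASE_URL + quote(key))
    return urls
```

Each resulting URL could then be passed to something like `pandas.read_csv`, which accepts URLs, though files whose names do not match the assumed pattern would be skipped and need a manual check.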

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
