Issue
I'm new to web scraping, but I have written a scraper that extracts some information from the sports site TimeForm. While developing it I tested against HTML that I downloaded from the page source and saved to a local file, so I wouldn't keep hammering their site. But when I try to go "live" against the site I get no output at all. So I thought I would just try fetching the HTML with this:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.timeform.com/horse-racing/racecards")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)
This finishes without error but produces no output. I am able to fetch the HTML for https://www.timeform.com/ and many other pages on TimeForm's site.
I checked TimeForm's robots.txt file to see whether it restricts scrapers, and all it says is:
User-agent: *
Sitemap: https://www.timeform.com/sitemap.xml
# Don't index the reroute or error url's
User-agent: *
Disallow: /something-has-gone-wrong
Disallow: /horse-racing/something-has-gone-wrong
Disallow: /horse-racing/account/sign-in
Disallow: /free-bets/reroute/
# Block access to whole site for badly behaved crawlers
User-agent: Yandex
User-agent: Baiduspider
User-agent: SemrushBot
User-agent: PetalBot
User-agent: MJ12bot
User-agent: BLEXBot
User-agent: dotbot
User-agent: AhrefsBot
Disallow: /
My initial thought was that I don't know whether it's a scraper issue or the site blocking me. Is it possible it's blocking my IP / my scraper for the "/horse-racing/" sub-pages alone? If so, how do I find out whether that's the cause, and could I get around it with IP rotation?
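As an aside, rules like these can be checked locally with the standard library's urllib.robotparser rather than read by eye. A minimal sketch using an abbreviated copy of the robots.txt excerpt above (the live file on timeform.com is of course authoritative):

```python
from urllib import robotparser

# Abbreviated copy of the robots.txt rules quoted above
# (sitemap line and most of the blocked-bot list omitted).
ROBOTS = """\
User-agent: *
Disallow: /something-has-gone-wrong
Disallow: /horse-racing/something-has-gone-wrong
Disallow: /horse-racing/account/sign-in
Disallow: /free-bets/reroute/

User-agent: AhrefsBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# A generic crawler may fetch the racecards page...
print(rp.can_fetch("*", "https://www.timeform.com/horse-racing/racecards"))
# ...but not the sign-in page, and AhrefsBot may fetch nothing at all.
print(rp.can_fetch("*", "https://www.timeform.com/horse-racing/account/sign-in"))
print(rp.can_fetch("AhrefsBot", "https://www.timeform.com/horse-racing/racecards"))
```

So robots.txt does not forbid an ordinary scraper from the racecards page; whatever is blocking the request is happening at the server, not in these rules.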
However, I now suspect it's because the page loads its content dynamically (the racecards change daily), so it isn't static HTML like my saved page-source file.
My question now is: if that is the issue, can I use Selenium or something else to render these dynamic pages into static HTML for me to scrape? And if so, is there an easy way to adapt my code, or am I looking at a complete rewrite? I've tried researching this but keep going down long "rabbit holes" with no real outcome. Could I use something like Selenium WebDriver to load the page and then run my current BeautifulSoup code against it?
Thanks
Solution
Try setting the User-Agent HTTP header when making the request:
import requests
from bs4 import BeautifulSoup
url = 'https://www.timeform.com/horse-racing/racecards'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for r in soup.select('.w-racecard-grid-race-result'):
    # the nearest preceding <h2> holds the course name and race time
    title = r.find_previous('h2')
    if not title:
        continue
    print(title.text, r.get_text(strip=True, separator=' '))
Prints:
Aintree 13:45 Chase 3m 210y Grade 1
Aintree 14:20 Hurdle 2m 4f Hcap
Aintree 14:55 Hurdle 2m 103y Grade 1
Aintree 15:30 Chase 2m 3f 200y Grade 1
Aintree 16:05 Chase 2m 5f 19y (Nat) Hcap
Aintree 16:40 Hurdle 3m 149y Grade 1
Aintree 17:15 Hurdle 2m 103y Appr Hcap
Leicester 13:30 Flat 1m 3f 179y Stks
Leicester 14:00 Flat 5f Stks
Leicester 14:35 Flat 7f Stks
Leicester 15:10 Flat 7f Hcap
Leicester 15:45 Flat 6f Clm Stks
...
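As for the Selenium question: the parsing step only ever sees a string of HTML, so the same BeautifulSoup logic works unchanged whether that string comes from requests, a file saved to disk, or (if the page really were JavaScript-rendered) Selenium's driver.page_source. No rewrite would be needed either way. A minimal sketch, using a made-up snippet that imitates the racecard structure (the class name is taken from the selector above; the surrounding markup is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for whatever HTML source you use
# (requests response, saved file, or Selenium's driver.page_source).
SAMPLE = """
<h2>Aintree 13:45</h2>
<div class="w-racecard-grid-race-result">Chase <span>3m 210y</span> Grade 1</div>
<h2>Leicester 13:30</h2>
<div class="w-racecard-grid-race-result">Flat <span>1m 3f 179y</span> Stks</div>
"""

def extract_races(html):
    """Parse an HTML string and return one line per race, regardless of
    where the HTML came from."""
    soup = BeautifulSoup(html, 'html.parser')
    races = []
    for r in soup.select('.w-racecard-grid-race-result'):
        title = r.find_previous('h2')
        if not title:
            continue
        races.append(f"{title.text} {r.get_text(strip=True, separator=' ')}")
    return races

print(extract_races(SAMPLE))
```

Keeping the parsing in a function like this means swapping the fetch mechanism later (e.g. to Selenium) touches only the line that produces the HTML string.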
Answered By - Andrej Kesely