Issue
I am trying to scrape content from a website using Scrapy's CrawlSpider class, but I am being blocked, as shown in the response below. I suspect the problem has to do with my crawler's User-Agent, so I added a custom user-agent middleware, but the blocking persists. I would appreciate any suggestions on how to resolve this.
I did not consider using Splash, because the content and links I want to scrape are not rendered with JavaScript.
My Scrapy spider class:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from datetime import datetime
import arrow
import re
import pandas as pd


class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/']
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides//']
    base_url = 'https://thegreyhoundrecorder.com.au'

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody/tr/td[2]/a"), callback='parse_item', follow=True),
    )

    def clean_date(self, dm):
        year = pd.to_datetime('now').year  # get the current year
        race_date = pd.to_datetime(dm + ' ' + str(year)).strftime('%d/%m/%Y')
        return race_date

    def parse_item(self, response):
        for race in response.xpath("//div[@class= 'fieldsSingleRace']"):
            title = ''.join(race.xpath(".//div/h1[@class='title']/text()").extract_first())
            Track = title.split('-')[0].strip()
            date = title.split('-')[1].strip()
            final_date = self.clean_date(date)
            race_number = ''.join(race.xpath(".//tr[@id = 'tableHeader']/td[1]/text()").extract())
            num = list(race_number)
            final_race_number = "".join(num[::len(num) - 1])  # keep the first and last characters of the header cell
            Distance = race.xpath("//tr[@id = 'tableHeader']/td[3]/text()").extract()
            TGR_Grade = race.xpath("//tr[@id = 'tableHeader']/td[4]/text()").extract()
            TGR1 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[1]/text()").extract()
            TGR2 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[2]/text()").extract()
            TGR3 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[3]/text()").extract()
            TGR4 = response.xpath("//tbody/tr[1][@class='fieldsTableRow raceTipsRow']//div/span[4]/text()").extract()
            yield {
                'Track': Track,
                'Date': final_date,
                '#': final_race_number,
                'Distance': Distance,
                'TGR_Grade': TGR_Grade,
                'TGR1': TGR1,
                'TGR2': TGR2,
                'TGR3': TGR3,
                'TGR4': TGR4,
                'user-agent': response.request.headers.get('User-Agent').decode('utf-8'),
            }
My custom middleware class:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import logging
import random


class UserAgentRotatorMiddleware(UserAgentMiddleware):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agents_list = [
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0',
        'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Safari/601.3.9',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393',
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            # Pick a random user agent for each outgoing request.
            self.user_agent = random.choice(self.user_agents_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
I have also updated DOWNLOADER_MIDDLEWARES in settings.py to enable my custom middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'greyhound_recorder_website.middlewares.UserAgentRotatorMiddleware': 400,
}
AutoThrottle is enabled as well:
AUTOTHROTTLE_ENABLED = True
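Along with that, these are the related throttling knobs in settings.py that could be tuned (the values below are illustrative, not ones I have verified against this site):
AUTOTHROTTLE_START_DELAY = 5           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay when responses are slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
DOWNLOAD_DELAY = 1                     # fixed baseline delay between requests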
Here is the robots.txt of the website.
User-agent: bingbot
Crawl-delay: 10
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: *
Disallow: /wp-admin/
Spider log output in the terminal:
2021-09-24 11:52:06 [scrapy.core.engine] INFO: Spider opened
2021-09-24 11:52:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-09-24 11:52:06 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\8470p\Desktop\web-scraping\greyhound_recorder_website\.scrapy\httpcache
2021-09-24 11:52:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-09-24 11:52:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://thegreyhoundrecorder.com.au/robots.txt> (referer: None) ['cached']
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 3 without any user agent to enforce it on.
2021-09-24 11:52:06 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
Solution
The major hindrance is allowed_domains. You have to be careful with it, otherwise CrawlSpider fails to produce the desired output. Another problem is the trailing // at the end of start_urls; use a single / instead. And instead of allowed_domains = ['thegreyhoundrecorder.com.au/form-guides/'], you should list only the domain name, like this:
allowed_domains = ['thegreyhoundrecorder.com.au']
Lastly, you can add your real user agent in the settings.py file, and it is better practice to set ROBOTSTXT_OBEY = False.
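A minimal sketch of those changes, assuming the rest of the spider stays exactly as posted in the question (the USER_AGENT string below is only an example; use any real browser string):
from scrapy.spiders import CrawlSpider

class GreyhoundSpider(CrawlSpider):
    name = 'greyhound'
    allowed_domains = ['thegreyhoundrecorder.com.au']                  # domain only, no path
    start_urls = ['https://thegreyhoundrecorder.com.au/form-guides/']  # single trailing slash
    # ... rules, clean_date and parse_item unchanged from the question ...

# settings.py
ROBOTSTXT_OBEY = False  # stop Scrapy from consulting robots.txt
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'  # example browser UA

With the path removed from allowed_domains, the offsite filtering should no longer drop the form-guide links that the Rule extracts.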
Answered By - Fazlul