Issue
I am trying to fetch the links from the scorecard column on this page...
I am using a crawlspider, and trying to access the links with this xpath expression....
"//tbody//tr[@class='data1']//td[last()]//a[@class='data-link']"
This expression works inside the scrapy shell, and fetches all 48 links. When I use the spider it scrapes nothing.
I have tried 20 different XPath expressions to no avail, and I have also tried 'allow' and CSS selectors. I believe I shouldn't include @href, as the CrawlSpider takes care of this.
I am flummoxed as I have a very similar crawlspider that works with no issues.
Here is the full code
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class IntlistmakerSpider(CrawlSpider):
    name = 'intlistmaker'
    allowed_domains = ['www.espncricinfo.com']
    start_urls = 'https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground'

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tbody//tr[@class='data1']//td[last()]//a[@class='data-link']"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        raw_url = response.url
        yield {
            'url': raw_url,
        }
The output
2021-05-25 18:03:07 [scrapy.core.engine] INFO: Spider opened
2021-05-25 18:03:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-25 18:03:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2021-05-25 18:03:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stats.espncricinfo.com/robots.txt> (referer: None)
2021-05-25 18:03:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground> (referer: None)
2021-05-25 18:03:08 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stats.espncricinfo.com': <GET https://stats.espncricinfo.com/uae/engine/match/439500.html>
2021-05-25 18:03:08 [scrapy.core.engine] INFO: Closing spider (finished)
2021-05-25 18:03:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 524,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 18522,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.528078,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 5, 25, 17, 3, 8, 872122),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'offsite/domains': 1,
'offsite/filtered': 48,
'request_depth_max': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 5, 25, 17, 3, 7, 344044)}
2021-05-25 18:03:08 [scrapy.core.engine] INFO: Spider closed (finished)
Here is the spider that works:
class ListmakerSpider(CrawlSpider):
    name = 'listmaker'
    allowed_domains = ['www.espncricinfo.com']
    start_urls = [psl21]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@data-hover='Scorecard']"), callback='parse_item', follow=True),
    )
This spider successfully extracts the scorecard links from this page....
https://www.espncricinfo.com/series/ipl-2021-1249214/match-results
Please can anyone suggest how I can alter the xpath expression in the first example, so I can isolate and retrieve the scorecard urls.
Thanks in advance.
Solution
The key line in the log is this one:
2021-05-25 18:03:08 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'stats.espncricinfo.com': <GET https://stats.espncricinfo.com/uae/engine/match/439500.html>
You have set allowed_domains to "www.espncricinfo.com", which doesn't match "stats.espncricinfo.com", so every extracted scorecard link is dropped as offsite (note 'offsite/filtered': 48 in the stats). Change allowed_domains to ["espncricinfo.com"] to solve that.
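To make the mismatch concrete, here is a minimal sketch of the host check, assuming the offsite filter accepts a host when it equals an allowed domain or is a subdomain of one. The helper name is_onsite is hypothetical, and this is a simplified approximation rather than Scrapy's own code.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of one.

    Hypothetical helper: a simplified approximation of an offsite check,
    not Scrapy's actual implementation.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# URL taken from the "Filtered offsite request" log line.
scorecard = "https://stats.espncricinfo.com/uae/engine/match/439500.html"

# 'stats.espncricinfo.com' is neither 'www.espncricinfo.com' nor a subdomain of it.
print(is_onsite(scorecard, ["www.espncricinfo.com"]))  # False

# The bare domain covers 'www', 'stats' and any other subdomain.
print(is_onsite(scorecard, ["espncricinfo.com"]))  # True
```

This would also explain why the other spider works: its scorecard links live on www.espncricinfo.com, which matches its allowed_domains entry exactly.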
In the version of scrapy I'm using, start_urls has to be a list, so you should fix that too.
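A short illustration of why the type matters, assuming the spider builds its initial requests by looping over start_urls. The helper initial_urls is hypothetical and only stands in for that loop; the point is that iterating a string yields single characters rather than one URL.

```python
def initial_urls(start_urls):
    """Hypothetical stand-in for a loop that creates one request per entry."""
    return [url for url in start_urls]


url = "https://stats.espncricinfo.com/uae/engine/records/team/match_results.html?class=3;id=1965;type=ground"

as_string = initial_urls(url)    # a bare string is iterated character by character
as_list = initial_urls([url])    # a list is iterated URL by URL

print(as_string[:5])  # ['h', 't', 't', 'p', 's']
print(len(as_list))   # 1
```

In the spider itself, that means wrapping the URL in square brackets: start_urls = ['https://stats.espncricinfo.com/...'].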
Your XPath should now work. Try to keep your expressions as simple as possible in the future. In this case, a working CSS selector could be ".data1 > td:last-of-type > a".
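As a rough check of the simpler expression, the sketch below applies an XPath counterpart of that CSS selector to a tiny, hypothetical stand-in for the results table. It uses the standard library's ElementTree instead of Scrapy's selectors only to stay dependency-free. The markup and the second href are invented for illustration, and the exact class match (@class='data1') is an assumption about the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the results table.
# Only the first scorecard href comes from the log; the rest is invented.
SAMPLE = """
<table>
  <tbody>
    <tr class="data1">
      <td><a class="data-link" href="/team/a.html">Team A</a></td>
      <td><a class="data-link" href="/uae/engine/match/439500.html">scorecard</a></td>
    </tr>
    <tr class="data1">
      <td><a class="data-link" href="/team/b.html">Team B</a></td>
      <td><a class="data-link" href="/uae/engine/match/000000.html">scorecard</a></td>
    </tr>
  </tbody>
</table>
""".strip()

root = ET.fromstring(SAMPLE)

# Each 'data1' row -> its last cell -> the link directly inside that cell.
links = root.findall(".//tr[@class='data1']/td[last()]/a")
hrefs = [a.get("href") for a in links]

print(hrefs)  # ['/uae/engine/match/439500.html', '/uae/engine/match/000000.html']
```

Only the link in the last cell of each row is picked up; the team links in the earlier cells are ignored.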
Answered By - tomjn