Issue
I need to iterate through all the boxscore links with scrapy and then extract the passing, rushing, and receiving tables from each boxscore to create a dataset. The main problem is that my code returns nothing when I run it.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']

    # fixed to iterate through all box scores
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//table/tbody/tr[1]/td[7]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # table of stats; need to fix so that it only prints the text
        # and not the HTML elements
        item['table'] = response.xpath('//table/tbody').extract_first()
        print(item['table'])
        yield item
I was able to get it to iterate and save to a file, but I couldn't limit it to just the boxscores, and it prints the HTML tags. I need help cleaning it up so that it extracts only the text and follows only the boxscore links. Thanks for any help.
Solution
I recommend running scrapy shell against the page you want to scrape; it makes it much easier to work out where the information sits in the HTML structure. On http://www.footballdb.com/games, restrict the link extractor to //td//a (anchors inside table cells) rather than matching every link on the page, so that only the boxscore links are followed.
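To see why restricting extraction to anchors inside table cells filters out the navigation links, here is a minimal standard-library sketch; the sample HTML is made up, and in the spider itself LinkExtractor(restrict_xpaths='//td//a') does this work for you:

```python
# Only anchors nested inside <td> elements are kept, so site-wide
# navigation links elsewhere on the page are ignored.
import xml.etree.ElementTree as ET

SAMPLE = """<html><body>
<a href="/standings">Standings</a>
<table><tr>
  <td><a href="/games/boxscore.html?gid=1">16-10</a></td>
</tr></table>
</body></html>"""

root = ET.fromstring(SAMPLE)
links = [a.get("href") for td in root.iter("td") for a in td.iter("a")]
print(links)  # only the link found inside a <td> survives
```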
The HTML structure of https://www.footballdb.com/games/boxscore.html?gid=... isn't great: there are almost no id attributes to identify where the different statistics live. First, if you want to determine the two opponents, for example "Cleveland Browns at New York Jets", check whether that element has an id in the HTML structure. On this website it has none, and neither do its parent tags, so the best we can do is the most specific path available:

response.xpath('//center//h1/text()').get()

Since only one result is returned, we can use get() directly.
Now, if we want the date of the match (e.g. December 27, 2020), analysis of the HTML structure suggests we proceed like this:

response.xpath('//center//div/text()').getall()[2]

In this case several results are returned, so you must first use getall() and then pick out the position of the information you are looking for.
We can do the same for the venue (e.g. MetLife Stadium, East Rutherford, NJ):

response.xpath('//center//div/text()').getall()[3]
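The get()/getall() distinction above can be mimicked with the standard library alone; the HTML fragment below is an invented miniature of the page's center block (the filler div texts are assumptions), with find() playing the role of .get() (first match) and findall() of .getall():

```python
import xml.etree.ElementTree as ET

SAMPLE = """<center>
  <h1>Cleveland Browns at New York Jets</h1>
  <div>2020 NFL Week 16</div>
  <div>Final</div>
  <div>December 27, 2020</div>
  <div>MetLife Stadium, East Rutherford, NJ</div>
</center>"""

root = ET.fromstring(SAMPLE)
match = root.find(".//h1").text                    # single result: like .get()
texts = [d.text for d in root.findall(".//div")]   # several results: like .getall()
date, location = texts[2], texts[3]                # pick the ones we want by position
print(match, date, location, sep="\n")
```

Indexing by position is fragile (any extra div shifts the results), but with no id attributes to hook onto it is the practical option here.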
For the statistics the same technique applies: find a path that is as specific as possible and, if several results are returned, locate the one that interests us. For the table tag you can't just convert it to text; it's not that simple. You will have to go through each table header th, then through each row tr, and then each column td.
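That header-then-rows walk can be sketched like this; the table below is a made-up miniature of a boxscore stats table (player names and numbers are illustrative only), and with scrapy you would issue the equivalent th/tr/td queries on the response:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<table>
  <tr><th>Player</th><th>Yds</th><th>TD</th></tr>
  <tr><td>B. Mayfield</td><td>196</td><td>0</td></tr>
  <tr><td>S. Darnold</td><td>175</td><td>2</td></tr>
</table>"""

table = ET.fromstring(SAMPLE)
headers = [th.text for th in table.iter("th")]     # column names from <th>
rows = []
for tr in table.iter("tr"):                        # each row...
    cells = [td.text for td in tr.findall("td")]   # ...then each <td> cell
    if cells:  # the header row has no <td>, so it is skipped
        rows.append(dict(zip(headers, cells)))
print(rows)
```

This yields one dict per player, keyed by column name, which is much easier to turn into a dataset than the raw HTML string.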
I hope you now have a more general view of the process. Here is your code with the corrections and additions described above:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//td//a'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        item['stats'] = {'visitor': {}, 'home': {}}
        item['match'] = response.xpath('//center//h1/text()').get()
        item['date'] = response.xpath('//center//div/text()').getall()[2]
        item['location'] = response.xpath('//center//div/text()').getall()[3]
        item['stats']['visitor']['name'] = response.xpath('//div[@class="boxdiv_visitor"]//span/text()').get()
        item['stats']['home']['name'] = response.xpath('//div[@class="boxdiv_home"]//span/text()').get()
        print(item)
        yield item
Answered By - Torpedo