Issue
I need to iterate through all the boxscore links with scrapy and then extract the passing, rushing, and receiving tables from each boxscore to create a dataset. The main problem is that my code returns nothing when I run it.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']

    # fixed to iterate through all box scores
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//table/tbody/tr[1]/td[7]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # table of stats; need to fix so that it only prints the text
        # and not the HTML elements
        item['table'] = response.xpath('//table/tbody').extract_first()
        print(item['table'])
        yield item
I was able to get it to iterate and save to a file, but I couldn't limit it to just the boxscores, and it prints the HTML tags. I need help cleaning it up so that it extracts only the text and follows only the boxscore links. Thanks for any help.
Solution
I recommend running scrapy shell against the page you want to scrape; it makes it much easier to work out where the information sits in the HTML structure. On http://www.footballdb.com/games, restrict the link extractor to //td//a (anchors inside table cells) rather than matching every link on the page, so that only the boxscore links are followed.
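To see why restricting extraction to anchors inside table cells filters out the navigation links, here is a minimal standard-library sketch; the sample HTML is made up, and in the spider itself LinkExtractor(restrict_xpaths='//td//a') does this work for you:

```python
# Only anchors nested inside <td> elements are kept, so site-wide
# navigation links elsewhere on the page are ignored.
import xml.etree.ElementTree as ET

SAMPLE = """<html><body>
<a href="/standings">Standings</a>
<table><tr>
  <td><a href="/games/boxscore.html?gid=1">16-10</a></td>
</tr></table>
</body></html>"""

root = ET.fromstring(SAMPLE)
links = [a.get("href") for td in root.iter("td") for a in td.iter("a")]
print(links)  # only the link found inside a <td> survives
```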
The HTML structure of https://www.footballdb.com/games/boxscore.html?gid=... isn't great: there are almost no id attributes to identify where the different statistics live. First, if you want to determine the two opponents, for example "Cleveland Browns at New York Jets", check whether that element has an id in the HTML structure. On this website it has none, and neither do its parent tags, so the best we can do is the most specific path available:

response.xpath('//center//h1/text()').get()

Since only one result is returned, we can use get() directly.
Now, if we want the date of the match (e.g. December 27, 2020), analysis of the HTML structure suggests we proceed like this:

response.xpath('//center//div/text()').getall()[2]

In this case several results are returned, so you must first use getall() and then pick out the position of the information you are looking for.
We can do the same for the venue (e.g. MetLife Stadium, East Rutherford, NJ):

response.xpath('//center//div/text()').getall()[3]
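The get()/getall() distinction above can be mimicked with the standard library alone; the HTML fragment below is an invented miniature of the page's center block (the filler div texts are assumptions), with find() playing the role of .get() (first match) and findall() of .getall():

```python
import xml.etree.ElementTree as ET

SAMPLE = """<center>
  <h1>Cleveland Browns at New York Jets</h1>
  <div>2020 NFL Week 16</div>
  <div>Final</div>
  <div>December 27, 2020</div>
  <div>MetLife Stadium, East Rutherford, NJ</div>
</center>"""

root = ET.fromstring(SAMPLE)
match = root.find(".//h1").text                    # single result: like .get()
texts = [d.text for d in root.findall(".//div")]   # several results: like .getall()
date, location = texts[2], texts[3]                # pick the ones we want by position
print(match, date, location, sep="\n")
```

Indexing by position is fragile (any extra div shifts the results), but with no id attributes to hook onto it is the practical option here.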
For the statistics the same technique applies: find a path that is as specific as possible and, if several results are returned, locate the one that interests us. For the table tag you can't just convert it to text; it's not that simple. You will have to go through each table header th, then through each row tr, and then each column td.
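That header-then-rows walk can be sketched like this; the table below is a made-up miniature of a boxscore stats table (player names and numbers are illustrative only), and with scrapy you would issue the equivalent th/tr/td queries on the response:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<table>
  <tr><th>Player</th><th>Yds</th><th>TD</th></tr>
  <tr><td>B. Mayfield</td><td>196</td><td>0</td></tr>
  <tr><td>S. Darnold</td><td>175</td><td>2</td></tr>
</table>"""

table = ET.fromstring(SAMPLE)
headers = [th.text for th in table.iter("th")]     # column names from <th>
rows = []
for tr in table.iter("tr"):                        # each row...
    cells = [td.text for td in tr.findall("td")]   # ...then each <td> cell
    if cells:  # the header row has no <td>, so it is skipped
        rows.append(dict(zip(headers, cells)))
print(rows)
```

This yields one dict per player, keyed by column name, which is much easier to turn into a dataset than the raw HTML string.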
I hope you now have a more general view of the process. Here is your code with the corrections and additions described above:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//td//a'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        item['stats'] = {'visitor': {}, 'home': {}}
        item['match'] = response.xpath('//center//h1/text()').get()
        item['date'] = response.xpath('//center//div/text()').getall()[2]
        item['location'] = response.xpath('//center//div/text()').getall()[3]
        item['stats']['visitor']['name'] = response.xpath('//div[@class="boxdiv_visitor"]//span/text()').get()
        item['stats']['home']['name'] = response.xpath('//div[@class="boxdiv_home"]//span/text()').get()
        print(item)
        yield item
Answered By - Torpedo