Friday, January 28, 2022

[FIXED] Crawler not producing any output

Labels: python-3.x, scrapy, web-crawler, web-scraping

Issue

I'm building my first web scraper. I'm simply trying to get a list of names and append them to a CSV file. The scraper runs, but not as intended: the output file contains only one name, which is always the last name scraped. It's a different name each time I rerun the scraper. In this case the name written to the CSV file was Ola Aina.

import scrapy
from scrapy.crawler import CrawlerProcess

#Create the spider class
class premSpider(scrapy.Spider):
    name = "premSpider"
    
    def start_requests(self):
        
        # Create a List of Urls with which we wish to scrape
        urls = ['https://www.premierleague.com/players']
        
        #Iterate through each url and send it to be parsed
        
        for url in urls:
            
            #yield kind of acts like return
            yield scrapy.Request(url = url, callback = self.parse)
            
    def parse(self, response):
        
        #extract links to player pages
        plinks = response.xpath('//tr').css('a::attr(href)').extract()
        
        #follow links to specific player pages
        for plink in plinks:
            
            yield response.follow(url = plink, callback = self.parse2)
            
    def parse2(self, response):
        
        plinks2 = response.xpath('//a[@href="stats"]').css('a::attr(href)').extract()
        
        for link2 in plinks2:
            
            yield response.follow(url = link2, callback = self.parse3)
        
    def parse3(self, response):
        
        names= response.xpath('//div[@class="name t-colour"]/text()').extract()
        
        filepath = 'playerlinks.csv'
        
        with open(filepath, 'w') as f:
            f.writelines([name + '\n' for name in names])

process = CrawlerProcess()

process.crawl(premSpider)

process.start()

Solution

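Each call to parse3 opens playerlinks.csv in 'w' mode, which truncates the file, so every player page overwrites the one before it and only the last response processed is left. Scrapy handles requests concurrently, so which page finishes last varies between runs. Opening the file in append mode ('a') would avoid that, but you could also use Scrapy's own "FEEDS" export.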

Add this just below your spider name:

custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
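A slightly more explicit variant of this setting might look like the sketch below. The "fields" and "overwrite" options are additions for illustration and are not part of the original answer; "overwrite" assumes Scrapy 2.4 or later. For local files the feed export appends by default, so without it a rerun would add rows to an existing results1.csv.

```python
# Sketch of a more explicit FEEDS setting; it belongs in the spider class body.
# "fields" and "overwrite" are illustrative additions, not from the answer.
custom_settings = {
    "FEEDS": {
        "results1.csv": {
            "format": "csv",
            "fields": ["names"],  # column order; matches the key yielded by parse3
            "overwrite": True,    # assumes Scrapy >= 2.4; replace the file on each run
        }
    }
}
```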

And modify parse3 to read as below:

    def parse3(self, response):   
        names=response.xpath('.//div[@class="name t-colour"]/text()').get()
        yield {'names':names}
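For context on why the original parse3 left only one name in the file, here is a minimal sketch using only the standard library. It writes three batches in separate open() calls, the way parse3 runs once per response, first in 'w' mode and then in 'a' mode. The helper write_names and the placeholder player names are hypothetical and stand in for the scraped pages.

```python
import os
import tempfile


def write_names(path, batches, mode):
    """Write each batch in its own open() call, then return the file's lines."""
    for names in batches:
        with open(path, mode) as f:
            f.writelines(name + "\n" for name in names)
    with open(path) as f:
        return f.read().splitlines()


# Hypothetical batches standing in for three player pages.
batches = [["Player A"], ["Player B"], ["Player C"]]

with tempfile.TemporaryDirectory() as tmp:
    # 'w' truncates the file on every open, so only the last batch survives.
    overwritten = write_names(os.path.join(tmp, "w.csv"), batches, "w")
    # 'a' appends, so every batch is kept.
    appended = write_names(os.path.join(tmp, "a.csv"), batches, "a")

print(overwritten)  # ['Player C']
print(appended)     # ['Player A', 'Player B', 'Player C']
```

Append mode would fix the symptom, but the feed export also takes care of the CSV header and quoting, which is one reason to prefer it.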

Answered By - Dr Pi

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
