Issue
I'm having a strange issue with Scrapy. I followed the tutorial for traversing links, but for some reason nothing is happening.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import pandas as pd
from time import strftime


class Covid_Crawler(scrapy.Spider):
    name = "Covid_Crawler"
    allowed_domains = ['worldometers.info/coronavirus/']
    start_urls = ['https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/']

    def parse(self, response):
        count = 0
        soup = BeautifulSoup(response.text, "lxml")
        try:
            covid_table = soup.find('table')
            df = pd.read_html(str(covid_table))[0]
            print(df)
            df.to_csv("CovidFile.csv", index=False)
        except:
            print("Table not found")

        NEXT_PAGE_SELECTOR = 'a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).getall()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
For some reason, when I run this spider it grabs the table from the first page just fine, but it doesn't follow any of the other links. When I run it I get something like this:
2020-12-12 20:45:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/> (referer: None)
2020-12-12 20:45:15 [numexpr.utils] INFO: NumExpr defaulting to 6 threads.
Country Cases Deaths Region
0 United States 16549366 305082 North America
1 India 9857380 143055 Asia
2 Brazil 6880595 181143 South America
3 Russia 2625848 46453 Europe
4 France 2365319 57761 Europe
.. ... ... ... ...
214 MS Zaandam 9 2 NaN
215 Marshall Islands 4 0 Australia/Oceania
216 Wallis & Futuna 3 0 Australia/Oceania
217 Samoa 2 0 Australia/Oceania
218 Vanuatu 1 0 Australia/Oceania
[219 rows x 4 columns]
2020-12-12 20:45:15 [scrapy.core.scraper] ERROR: Spider error processing <GET
https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/> (referer: None)
Traceback (most recent call last):
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\Documents\Crawler_Test\Covid_Crawler\Covid_Crawler\spiders\Crawler_spider.py", line 84, in parse
yield response.follow(next_page, callback=self.parse)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 169, in follow
return super().follow(
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\__init__.py", line 143, in follow
url = self.urljoin(url)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 102, in urljoin
return urljoin(get_base_url(self), url)
File "C:\Users\Zach Kunz\anaconda3\lib\urllib\parse.py", line 512, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "C:\Users\Zach Kunz\anaconda3\lib\urllib\parse.py", line 121, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2020-12-12 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
And when I use the Scrapy shell to check whether it's extracting links, I get this:
In [6]: response.css('a::attr(href)').getall()
Out[6]:
['/',
'/coronavirus/',
'/population/',
'/coronavirus/',
'/coronavirus/',
'/coronavirus/coronavirus-cases/',
'/coronavirus/worldwide-graphs/',
'/coronavirus/#countries',
'/coronavirus/coronavirus-death-rate/',
'/coronavirus/coronavirus-incubation-period/',
'/coronavirus/coronavirus-age-sex-demographics/',
'/coronavirus/coronavirus-symptoms/',
'/coronavirus/',
'/coronavirus/coronavirus-death-toll/',
'/coronavirus/#countries',
'/coronavirus/',
'/coronavirus/coronavirus-cases/',
'/coronavirus/coronavirus-death-toll/',
'/coronavirus/coronavirus-death-rate/',
'/coronavirus/coronavirus-incubation-period/',
'/coronavirus/coronavirus-age-sex-demographics/',
'/coronavirus/coronavirus-symptoms/',
'/coronavirus/countries-where-coronavirus-has-spread/',
'/coronavirus/#countries',
'/',
'/about/',
'/faq/',
'/languages/',
'/contact/',
'/newsletter-subscribe/',
'https://twitter.com/Worldometers',
'https://www.facebook.com/Worldometers.info',
'/disclaimer/']
Any help or insight would be much appreciated. And if you're willing to help with another problem: I'm also looking for a way to store all of the tables I collect in multiple csv or xlsx files. Thanks!
Solution
response.follow() can't work with a list; it expects a single URL (string) per call, which is why passing the result of getall() raises the TypeError above. Loop over the extracted links and yield one request per URL:
next_pages = response.css(NEXT_PAGE_SELECTOR).getall()
for next_page in next_pages:
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
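If you're on Scrapy 2.0 or newer, response.follow_all() accepts an iterable of URLs (or a css/xpath expression) and yields one request per extracted link, so the loop can be written more compactly. A minimal sketch under that assumption (the spider name here is just for illustration):

import scrapy


class CovidLinksSpider(scrapy.Spider):
    name = "covid_links"  # hypothetical name for this sketch
    # allowed_domains should be a bare domain; a URL with a path (as in the
    # question) will trigger an offsite-middleware warning and can cause
    # followed links to be filtered out.
    allowed_domains = ['worldometers.info']
    start_urls = ['https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/']

    def parse(self, response):
        # follow_all() (Scrapy >= 2.0) builds one Request per matched link;
        # Scrapy's built-in dupefilter still skips URLs already seen.
        yield from response.follow_all(css='a::attr(href)', callback=self.parse)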
Answered By - gangabass