Issue
I'm trying to throw together a Scrapy spider for a German second-hand products website, using code I have successfully deployed on other projects. However, this time I'm running into a TypeError and I can't seem to figure out why.
Comparing to this question ('TypeError: expected string or bytes-like object' while scraping a site), it seems as if the spider is being fed a non-string URL. But upon checking the individual chunks of code responsible for generating the URLs to scrape, they all seem to spit out strings.
To describe the general functionality of the spider & make it easier to read:
- The URL generator is responsible for providing the starting URL (first page of search results)
- The parse_search_pages function is responsible for pulling a list of URLs from the posts on that page.
- It checks the DataFrame to see whether a post was already scraped in the past. If not, it will scrape it.
- The parse_listing function is called on an individual post. It uses the x_paths dictionary to pull all the data, then continues to the next page using the CrawlSpider rules.
It's been about two years since I last used this code, and I'm aware a lot of functionality might have changed. So hopefully you can help me shine a light on what I'm doing wrong?
Cheers, R.
///
The code
import pandas as pd
import scrapy
from datetime import date
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# whitevan scraper - Ebay Kleinanzeigen "Elektronik" category scraper
# 1. URL filters out "Gesuche", "Gewerblich" & sets sorting to "Günstigste zuerst"
# to-do: scrapes only listings marked "Zu verschenken"
# to-do: make sure reserviert and removed ads are also removed from the CSV
TODAY = date.today().strftime("%d/%m/%Y")
df = pd.read_csv(
    r'C:\Users\stefa\Documents\VSCodeProjects\scrapers\whitevan\data\whitevan.csv', delimiter=';')
pd.set_option('display.max_columns', None)
# pick city & category to scrape
city_pick = "berlin" # berlin, munich, hannover
category_pick = "electronics" # electronics
PRE = "https://www.",
DOMAIN = "ebay-kleinanzeigen.de",
def url_generator(city, category):
    # Function generates an eBay-Kleinanzeigen URL from chosen city & category
    # To-do: make sorting & filtering a function variable
    URL_LIBRARY = {
        "sorting": ["sortierung:preis", "sortierung:zeit"],
        "seller": ["anbieter:privat", "anbieter:gewerblich"],
        "listing": ["angebote", "gesuche"],
        "cities": {
            "berlin": ["berlin", "l3331"],
            "munich": ["muenchen", "l6411"],
            "hannover": ["hannover", "l3155"]
        },
        "categories": {
            "electronics": ["s-multimedia-elektronik", "c161"]
        }
    }
    return "/{category}/{city}/{sorting}/{seller}/{listing}/{code}{city_code}".format(
        category=URL_LIBRARY["categories"][category][0],
        city=URL_LIBRARY["cities"][city][0],
        sorting=URL_LIBRARY["sorting"][0],
        seller=URL_LIBRARY["seller"][0],
        listing=URL_LIBRARY["listing"][0],
        code=URL_LIBRARY["categories"][category][1],
        city_code=URL_LIBRARY["cities"][city][1]
    )
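# For illustration (not part of the original script): with the picks above,
# url_generator("berlin", "electronics") should return the path
# "/s-multimedia-elektronik/berlin/sortierung:preis/anbieter:privat/angebote/c161l3331",
# which matches the URL requested in the crawl log below.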
# tested with scrapy shell
x_paths = {
    'header': '//h1[@class="boxedarticle--title"]/text()',
    'description': '//p[@class="text-force-linebreak "]/text()',
    'location': '//span[@id="viewad-locality"]/text()',
    'listing_date': '//div[@id="viewad-extra-info"]/div/span/text()',
    'url': '//head/link[@rel="canonical"]/@href',
    'type': '//li[contains(text(),"Art")]/span/text()',
    'subgroup': '//li[contains(text(),"Gerät & Zubehör")]/span/text()',
    'condition': '//li[contains(text(),"Zustand")]/span/text()',
    'shipping': '//li[contains(text(),"Versand")]/span/text()',
    'user': '//span[@class="text-body-regular-strong text-force-linebreak"]/a/text()',
    'phone_no': '//span[@id="viewad-contact-phone"]/text()',
    'satisfaction': '//span[@class="userbadges-vip userbadges-profile-rating"]/span/text()',
    'friendliness': '//span[@class="userbadges-vip userbadges-profile-friendliness"]/span/text()',
    'reliability': '//span[@class="userbadges-vip userbadges-profile-reliability"]/span/text()',
    'user_id': '//a[@id="poster-other-ads-link"]/@href',
    'posts_online': '//a[@id="poster-other-ads-link"]/text()'
}
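# Example of how one of the XPaths above could be verified (hypothetical listing URL,
# added for illustration):
#   scrapy shell "https://www.ebay-kleinanzeigen.de/s-anzeige/..."
#   >>> response.xpath('//h1[@class="boxedarticle--title"]/text()').get()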
class Whitevan(CrawlSpider):
    name = 'whitevan'
    allowed_domains = [DOMAIN]
    search_url = url_generator(city_pick, category_pick)
    start_urls = [f"https://www.ebay-kleinanzeigen.de{search_url}"]
    rules = [
        Rule(
            LinkExtractor(
                restrict_xpaths='//a[@class="pagination-next"]'
            ),
            callback='parse_search_pages',
            follow=True
        )
    ]

    def parse_search_pages(self, response):
        # creates a list of each post's respective URLs to be scraped
        url_list = response.xpath(
            '//li[@class="ad-listitem lazyload-item "]/article/div/a/@href').getall()
        # adds the top level URL to the url so it can be compared to the URLs in the dataframe
        for item in url_list:
            full_url = f"https://www.ebay-kleinanzeigen.de{item}"
            # checks if URL exists in dataframe (thus can be skipped)
            if not df['url'].str.contains(full_url).any():
                # yields the function responsible for scraping the individual post
                yield scrapy.Request(full_url, callback=self.parse_listing)
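            # Note (added for clarity, not in the original code): str.contains()
            # interprets full_url as a regular expression, so special characters in the
            # URL could affect the match. An exact comparison such as
            # (df['url'] == full_url).any(), or passing regex=False, would be stricter.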

    def parse_listing(self, response):
        temp_dict = {'date_scraped': TODAY}
        # goes through the dictionary of xpaths, checks the response & adds it to a temp_dict.
        # yields the temp_dict to be added to a CSV.
        for key in x_paths.keys():
            if response.xpath(x_paths[key]):
                temp_dict[key] = response.xpath(x_paths[key]).extract_first()
            else:
                temp_dict[key] = None
        yield temp_dict

    parse_start_url = parse_search_pages
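    # Added note: the assignment above makes CrawlSpider call parse_search_pages on the
    # start URL's response as well; the rules only trigger callbacks for links extracted
    # from downloaded pages (here, the "pagination-next" link).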
Output from Terminal:
PS C:\Users\stefa\Documents\VSCodeProjects\scrapers\whitevan> conda activate C:\ProgramData\Anaconda3\envs\whitevan
PS C:\Users\stefa\Documents\VSCodeProjects\scrapers\whitevan> & C:/ProgramData/Anaconda3/envs/whitevan/python.exe c:/Users/stefa/Documents/VSCodeProjects/scrapers/whitevan/whitevan/main.py
2022-02-26 12:43:03 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: whitevan)
2022-02-26 12:43:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 21.7.0, Python 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.0, Platform Windows-10-10.0.19044-SP0
2022-02-26 12:43:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-26 12:43:03 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'whitevan',
'COOKIES_ENABLED': False,
'DOWNLOAD_DELAY': 1,
'NEWSPIDER_MODULE': 'whitevan.spiders',
'SPIDER_MODULES': ['whitevan.spiders']}
2022-02-26 12:43:03 [scrapy.extensions.telnet] INFO: Telnet Password: e670bb7369bd25dd
2022-02-26 12:43:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-02-26 12:43:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-26 12:43:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-26 12:43:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-26 12:43:03 [scrapy.core.engine] INFO: Spider opened
2022-02-26 12:43:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-26 12:43:03 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method OffsiteMiddleware.spider_opened of <scrapy.spidermiddlewares.offsite.OffsiteMiddleware object at 0x00000197491DF880>>
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\utils\defer.py", line 157, in maybeDeferred_coro
result = f(*args, **kw)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 76, in spider_opened
self.host_regex = self.get_host_regex(spider)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 62, in get_host_regex
elif url_pattern.match(domain):
TypeError: expected string or bytes-like object
2022-02-26 12:43:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2022-02-26 12:43:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ebay-kleinanzeigen.de/s-multimedia-elektronik/berlin/sortierung:preis/anbieter:privat/angebote/c161l3331> (referer: None)
2022-02-26 12:43:04 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.ebay-kleinanzeigen.de/s-multimedia-elektronik/berlin/sortierung:preis/anbieter:privat/angebote/c161l3331> (referer: None)
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 31, in process_spider_output
if x.dont_filter or self.should_follow(x, spider):
File "C:\ProgramData\Anaconda3\envs\whitevan\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 46, in should_follow
regex = self.host_regex
AttributeError: 'OffsiteMiddleware' object has no attribute 'host_regex'
2022-02-26 12:43:04 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-26 12:43:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 307,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 24282,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.146168,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 2, 26, 11, 43, 4, 745511),
'httpcompression/response_bytes': 180025,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/ERROR': 2,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2022, 2, 26, 11, 43, 3, 599343)}
2022-02-26 12:43:04 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
So the answer is simple :) always triple-check your code! There were still some trailing commas where they shouldn't have been. They turned PRE and DOMAIN into one-element tuples, so allowed_domains ended up containing a tuple instead of a string.
Incorrect
PRE = "https://www.",
DOMAIN = "ebay-kleinanzeigen.de",
Fixed
PRE = "https://www."
DOMAIN = "ebay-kleinanzeigen.de"
Answered By - Revers3