Issue
I am trying to gather a list of URLs from a website, combine each one with a base URL, and then continue crawling inside the resulting pages. Once combined, the spider should crawl those URLs one by one and then scrape the details from each.
The page hierarchy is:
MainPage
> Categories
> List of Company
> Details of each company (the data I want)
Instead, it returns TypeError: can only concatenate str (not "list") to str. Below is the code for my Scrapy spider:
import scrapy
from scrapy.spiders import Rule
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
# from urllib.parse import urljoin


class ZomatoSpider(scrapy.Spider):
    name = 'zomato'
    allowed_domain = ['foodbizmalaysia.com']
    start_urls = ['http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850']

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "cookie": "dnetsid=5kegaefgfpb0efhf3idfxn30; afrvt=14846924c9bb4e87b5576addf94f8cc4; _ga=GA1.2.1937980614.1603360774; _gid=GA1.2.1358979332.1603360774"
    }

    def parse(self, response):
        url = "http://www.foodbizmalaysia.com/"
        yield scrapy.Request(url,
                             callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        base_url = 'http://www.foodbizmalaysia.com'
        sel = Selector(response)
        sites = sel.xpath('/html')
        for data in sites:
            categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
            category_url = base_url + categories
            request = scrapy.Request(
                category_url,
                callback=self.parse_restaurant_company,
                headers=self.headers
            )
            yield request

    def parse_restaurant_company(self, response):
        base_url = 'http://www.foodbizmalaysia.com'
        sel = Selector(response)
        sites = sel.xpath('/html')
        for data in sites:
            company = data.xpath('//a[contains(@id, "ContentPlaceHolder1_dgrdCompany_Hyperlink4_")]/@href').extract_first()
            company_url = base_url + company
            # for i in company:
            #     yield response.urljoin(
            #         'http://www.foodbizmalaysia.com', i[1:],
            #         callback=self.parse_company_details)
            request = scrapy.Request(
                company_url,
                callback=self.parse_company_details,
                headers=self.headers
            )
            yield request

    def parse_company_details(self, response):
        sel = Selector(response)
        sites = sel.xpath('/html')
        yield {
            'name': sites.xpath('//span[@class="coprofileh3"]/text()').get()
        }
Below is the log from scrapy runspider:
2020-10-23 10:58:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-10-23 10:58:50 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.9.0, Python 3.8.6 (default, Sep 25 2020, 09:36:53) - [GCC 10.2.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.8, Platform Linux-5.5.0-kali2-amd64-x86_64-with-glibc2.29
2020-10-23 10:58:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-23 10:58:50 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2020-10-23 10:58:50 [scrapy.extensions.telnet] INFO: Telnet Password: 97316bde34a4b21d
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-23 10:58:50 [scrapy.core.engine] INFO: Spider opened
2020-10-23 10:58:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-23 10:58:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-10-23 10:58:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850> (referer: None)
2020-10-23 10:58:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.foodbizmalaysia.com/> (referer: http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850)
2020-10-23 10:58:54 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.foodbizmalaysia.com/> (referer: http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850)
Traceback (most recent call last):
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
yield next(it)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
return next(self.data)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
return next(self.data)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/Dev/0temp/zomato.py", line 34, in parse_api
category_url = base_url + categories
TypeError: can only concatenate str (not "list") to str
2020-10-23 10:58:54 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-23 10:58:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 752,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 34411,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 3.888395,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 23, 2, 58, 54, 321201),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 53633024,
'memusage/startup': 53633024,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2020, 10, 23, 2, 58, 50, 432806)}
2020-10-23 10:58:54 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
There are some inconsistencies in your code that make it seem like the code you are executing is NOT the same code you posted. For example, this is your parse_api method (copied and pasted):
def parse_api(self, response):
    base_url = 'http://www.foodbizmalaysia.com'
    sel = Selector(response)
    sites = sel.xpath('/html')
    for data in sites:
        categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
        request = scrapy.Request(
            category_url,
            callback=self.parse_restaurant_company,
            headers=self.headers
        )
        yield request
That would raise a NameError, as category_url isn't defined anywhere. That's not the only inconsistency; here is a piece of your execution log:
File "/home/limjack4511/Dev/0temp/zomato.py", line 33, in parse_api
category_url = base_url + categories
TypeError: can only concatenate str (not "list") to str
It tells me that in the parse_api method this line raises the error: category_url = base_url + categories. But that line doesn't exist in this method (not in the one you posted, at least); you do have that same line, but inside another method, parse_restaurant_company.
The error is telling you that you are trying to concatenate a string with a list, which means that of base_url and categories, one is a string and the other is a list. I can't tell which is which because I can't trust the code you posted.
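To see the error in isolation: concatenating a Python str with a list raises exactly this TypeError, while iterating over the list (or taking a single element from it) works. A minimal sketch, using made-up category paths in place of what .extract() would return:

```python
base_url = 'http://www.foodbizmalaysia.com'
# .extract() returns a list of strings, something like:
categories = ['/category/3/bakery-pastry-supplies', '/category/5/beverages']

try:
    category_url = base_url + categories  # str + list -> TypeError
except TypeError as exc:
    print(exc)  # can only concatenate str (not "list") to str

# Either iterate over the list to build every URL...
urls = [base_url + href for href in categories]

# ...or take a single element (what .get()/.extract_first() would give you):
first_url = base_url + categories[0]
print(first_url)
```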
Edit:
Now, with the full code, I can tell you the problem is here (in the parse_api method):
for data in sites:
    categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
    category_url = base_url + categories
You are calling .extract() when defining categories. The extract() method returns a list, not a string. Replace it with .get() or .extract_first().
On another note: you probably want to use data.xpath('.//div[... instead of data.xpath('//div[..., because the first form looks for the XPath inside the data node. Without the leading dot, it looks for the XPath in the whole document, ignoring the context already established by the data variable.
Answered By - renatodvc