Issue
I am trying to gather a list of URLs from a website, combine each one with a base URL, and then continue crawling inside the resulting pages. Once combined, the spider should crawl those URLs one by one and then scrape the details from each.
The page hierarchy is:
MainPage
> Categories
> List of Company
> Details of each company (the data I want)
Instead, it returns TypeError: can only concatenate str (not "list") to str. Below is the code for my Scrapy spider:
import scrapy
from scrapy.spiders import Rule
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
# from urllib.parse import urljoin


class ZomatoSpider(scrapy.Spider):
    name = 'zomato'
    allowed_domain = ['foodbizmalaysia.com']
    start_urls = ['http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850']

    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "cookie": "dnetsid=5kegaefgfpb0efhf3idfxn30; afrvt=14846924c9bb4e87b5576addf94f8cc4; _ga=GA1.2.1937980614.1603360774; _gid=GA1.2.1358979332.1603360774"
    }

    def parse(self, response):
        url = "http://www.foodbizmalaysia.com/"
        yield scrapy.Request(url,
                             callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        base_url = 'http://www.foodbizmalaysia.com'
        sel = Selector(response)
        sites = sel.xpath('/html')
        for data in sites:
            categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
            category_url = base_url + categories
            request = scrapy.Request(
                category_url,
                callback=self.parse_restaurant_company,
                headers=self.headers
            )
            yield request

    def parse_restaurant_company(self, response):
        base_url = 'http://www.foodbizmalaysia.com'
        sel = Selector(response)
        sites = sel.xpath('/html')
        for data in sites:
            company = data.xpath('//a[contains(@id, "ContentPlaceHolder1_dgrdCompany_Hyperlink4_")]/@href').extract_first()
            company_url = base_url + company
            # for i in company:
            #     yield response.urljoin(
            #         'http://www.foodbizmalaysia.com', i[1:],
            #         callback=self.parse_company_details)
            request = scrapy.Request(
                company_url,
                callback=self.parse_company_details,
                headers=self.headers
            )
            yield request

    def parse_company_details(self, response):
        sel = Selector(response)
        sites = sel.xpath('/html')
        yield {
            'name': sites.xpath('//span[@class="coprofileh3"]/text()').get()
        }
Below is the log from scrapy runspider:
2020-10-23 10:58:50 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrapybot)
2020-10-23 10:58:50 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.9.0, Python 3.8.6 (default, Sep 25 2020, 09:36:53) - [GCC 10.2.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.8, Platform Linux-5.5.0-kali2-amd64-x86_64-with-glibc2.29
2020-10-23 10:58:50 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-23 10:58:50 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2020-10-23 10:58:50 [scrapy.extensions.telnet] INFO: Telnet Password: 97316bde34a4b21d
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-23 10:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-23 10:58:50 [scrapy.core.engine] INFO: Spider opened
2020-10-23 10:58:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-23 10:58:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-10-23 10:58:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850> (referer: None)
2020-10-23 10:58:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.foodbizmalaysia.com/> (referer: http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850)
2020-10-23 10:58:54 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.foodbizmalaysia.com/> (referer: http://www.foodbizmalaysia.com/category/3/bakery-pastry-supplies?classid=DS-B42850)
Traceback (most recent call last):
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
yield next(it)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
return next(self.data)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
return next(self.data)
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/limjack4511/.local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "/home/limjack4511/Dev/0temp/zomato.py", line 34, in parse_api
category_url = base_url + categories
TypeError: can only concatenate str (not "list") to str
2020-10-23 10:58:54 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-23 10:58:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 752,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 34411,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 3.888395,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 23, 2, 58, 54, 321201),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'memusage/max': 53633024,
'memusage/startup': 53633024,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2020, 10, 23, 2, 58, 50, 432806)}
2020-10-23 10:58:54 [scrapy.core.engine] INFO: Spider closed (finished)
Solution
There are some inconsistencies in your code that make it seem like the code you are executing is NOT the same code you posted. For example, this is your parse_api method (copied and pasted):
def parse_api(self, response):
    base_url = 'http://www.foodbizmalaysia.com'
    sel = Selector(response)
    sites = sel.xpath('/html')
    for data in sites:
        categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
        request = scrapy.Request(
            category_url,
            callback=self.parse_restaurant_company,
            headers=self.headers
        )
        yield request
That would raise a NameError, as category_url isn't defined anywhere. That's not the only inconsistency; here is a piece of your execution log:
File "/home/limjack4511/Dev/0temp/zomato.py", line 33, in parse_api
category_url = base_url + categories
TypeError: can only concatenate str (not "list") to str
It tells me that in the parse_api method this line raises the error: category_url = base_url + categories. But that line doesn't exist in this method (not in the one you posted, at least); you do have that same line, but inside another method, parse_restaurant_company.
The error is telling you that you are trying to concatenate a string with a list, which means that of base_url and categories, one is a string and the other is a list. I can't tell which is which because I can't trust the code you posted.
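To see the error in isolation: concatenating a Python str with a list raises exactly this TypeError, while iterating over the list (or taking a single element from it) works. A minimal sketch, using made-up category paths in place of what .extract() would return:

```python
base_url = 'http://www.foodbizmalaysia.com'
# .extract() returns a list of strings, something like:
categories = ['/category/3/bakery-pastry-supplies', '/category/5/beverages']

try:
    category_url = base_url + categories  # str + list -> TypeError
except TypeError as exc:
    print(exc)  # can only concatenate str (not "list") to str

# Either iterate over the list to build every URL...
urls = [base_url + href for href in categories]

# ...or take a single element (what .get()/.extract_first() would give you):
first_url = base_url + categories[0]
print(first_url)
```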
Edit:
Now, with the full code, I can tell you the problem is here (in the parse_api method):
for data in sites:
    categories = data.xpath('//div[@class="post_content"]/a[contains(@href, "category")]/@href').extract()
    category_url = base_url + categories
You are calling .extract() when defining categories. The extract() method returns a list, not a string. Replace it with .get() or .extract_first().
On another note: you probably want to use data.xpath('.//div[... instead of data.xpath('//div[..., because the first form looks for the XPath inside the data node. Without the leading dot, it looks for the XPath in the whole document, ignoring the context already established by the data variable.
Answered By - renatodvc