Issue
I'm a Python newbie trying to crawl kununu with Scrapy. When I run the spider below, it reports 0 crawled pages and 0 scraped items.
Output:
...
'scrapy.extensions.logstats.LogStats']
2021-07-25 11:56:08 [scrapy.core.engine] INFO: Spider opened
2021-07-25 11:56:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 11:56:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.kununu.com/de/joimax1/kommentare> from <GET https://www.kununu.com/de/joimax1/kommentare/>
2021-07-25 11:56:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.kununu.com/de/joimax1/kommentare> from <GET http://www.kununu.com/de/joimax1/kommentare>
Aktuelle Seite : https://www.kununu.com/de/joimax1/kommentare
....
import scrapy
import logging


class KununuSpider(scrapy.Spider):
    name = "kununu"
    allowed_domains = ["kununu.com"]

    # Reduce the log level of some loggers to avoid "spam" messages on the command line
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.core.scraper')
        logger.setLevel(logging.INFO)
        logger2 = logging.getLogger('scrapy.core.engine')
        logger2.setLevel(logging.INFO)
        logger3 = logging.getLogger('scrapy.middleware')
        logger3.setLevel(logging.WARNING)
        logger4 = logging.getLogger('kununu')
        logger4.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

    def start_requests(self):
        yield scrapy.Request('https://www.kununu.com/de/joimax1/kommentare/', self.parse)

    def parse(self, response):
        print("Aktuelle Seite : {}".format(response.url))
        review_list = response.css('article.company-profile-review')
        print(review_list)
        for elem in review_list:
            item = {
                'url': response.url,
                'date': elem.css('span::text')[1].extract(),
                'title': elem.css('a::text')[0].extract(),
                'rating': elem.css('div.tile-heading::text')[0].extract()
            }
            yield item
        next_page_url = response.css('a.btn.btn-default.btn-block::attr(href)')  # does this attribute exist at all, or is an empty list returned?
        if next_page_url:
            next_page_url = next_page_url[0].extract()
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
        else:
            self.log('Last page reached: ' + response.url)
            self.log('Last page contained {} item(s)'.format(len(review_list)))
Solution
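You get 0 items because the reviews are not part of the HTML that Scrapy downloads. The page loads them afterwards with JavaScript from a backend endpoint that returns JSON, so the CSS selectors match nothing. To find that endpoint, open Chrome DevTools, go to the Network tab and filter by XHR. Select the reviews request, read its URL in the Headers tab, and check the Preview tab to see the data it returns.
Here is the working solution, which requests that JSON endpoint directly: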
import scrapy
import json


class KununuSpider(scrapy.Spider):
    name = 'kununu'

    # Request headers copied from the XHR request shown in DevTools
    headers = {
        "authority": "www.kununu.com",
        "path": "/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=2",
        "scheme": "https",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9,bn;q=0.8,es;q=0.7,ar;q=0.6",
        "content-type": "application/json",
        "referer": "https://www.kununu.com/de/joimax1/kommentare",
        #"sec-ch-ua":""Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-origin",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
        "x-lang": "de_DE"
    }

    def start_requests(self):
        # Request the JSON endpoint directly instead of the HTML page
        yield scrapy.Request(
            url='https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1',
            callback=self.parse,
            method="GET",
            headers=self.headers
        )

    def parse(self, response):
        # The body is JSON; the reviews are in the 'reviews' list
        data = json.loads(response.body)
        for resp in data['reviews']:
            items = {
                'title': resp['title'],
                'date': resp['createdAt'],
                'rating': resp['roundedScore']
            }
            yield items
OUTPUT:
2021-07-25 17:28:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1> (referer: https://www.kununu.com/de/joimax1/kommentare)
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mit viel Abstand betrachtet leider viel Negatives und wenig Positives', 'date': '2021-06-30T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Eigene Meinung ist nicht willkommen.', 'date': '2021-02-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Gar nicht so schlimm', 'date': '2021-04-21T00:00:00+00:00', 'rating': 4}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Es könnte alles so schön sein...', 'date': '2021-01-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Außen Hui...', 'date': '2020-12-16T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Mirco-Managment as its best', 'date': '2020-08-20T00:00:00+00:00', 'rating': 2}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Katastrophal', 'date': '2020-07-01T00:00:00+00:00', 'rating': 1}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Licht und Schatten sind sehr nahe beieinander.', 'date': '2020-05-01T00:00:00+00:00', 'rating': 3.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Leider keine Empfehlung von mir', 'date': '2019-11-19T00:00:00+00:00', 'rating': 2.5}
2021-07-25 17:28:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.kununu.com/middlewares/profiles/de/joimax1/263a64cb-b345-4e2b-9014-cc733dcd643f/reviews?reviewType=employees&page=1>
{'title': 'Wohl und Weh nahe beieinander', 'date': '2019-03-30T00:00:00+00:00', 'rating': 3}
2021-07-25 17:28:12 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-25 17:28:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 748,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12850,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
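
The spider above requests only page=1, while the original question also wanted to follow later pages. A minimal sketch of how that could be handled is shown here, using only the standard library so it can be tried without Scrapy. The helper names with_page and extract_items are hypothetical, and the stop condition (an empty 'reviews' list means the last page has been passed) is an assumption about the endpoint, not something the output above confirms.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Endpoint URL taken from the solution above.
API_URL = (
    "https://www.kununu.com/middlewares/profiles/de/joimax1/"
    "263a64cb-b345-4e2b-9014-cc733dcd643f/reviews"
    "?reviewType=employees&page=1"
)


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_items(body):
    """Turn a JSON response body into a list of item dicts.

    Assumes the same keys the solution reads: title, createdAt, roundedScore.
    Returns an empty list when 'reviews' is missing or empty, which this
    sketch treats as the signal to stop paginating (an assumption).
    """
    data = json.loads(body)
    return [
        {
            "title": review["title"],
            "date": review["createdAt"],
            "rating": review["roundedScore"],
        }
        for review in data.get("reviews", [])
    ]
```

Inside the spider's parse method, one option would be to yield everything from extract_items(response.body) and, if that list is not empty, yield another scrapy.Request for with_page(response.url, next_page) with the same headers and callback. The current page number would have to be tracked, for example by reading it back from response.url.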
Answered By - Fazlul