Issue
First of all, thank you if you are reading this.
I have been using Python with Scrapy to scrape some basic data. Now I want to pull in additional information, but I am stuck on pagination. The website is https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html
The next-page element is:
<span class="jslink pg-btn page-next" data-href="https://home.mobile.de/regional/baden-württemberg/2.html" title="Zur nächsten Seite"> </span>
What XPath expression can I use in Rule(LinkExtractor(restrict_xpaths=""))?
I'm using the crawl template. My code so far:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Baden1Spider(CrawlSpider):
    name = 'baden1'
    allowed_domains = ['home.mobile.de']
    start_urls = ['https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html?fbclid=IwAR0MpRTx1TrrrBdg2cKr5E08QiP4fE-pjOAwb7_UsEytToJmWFEfpdD6X0w/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='box']/div[@class='row ']"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//span[@class='jslink pg-btn page-next']"))
    )

    def parse_item(self, response):
        yield {
            'Dealer Name': response.xpath("//address[@class='fullAddress']/strong/text()").get(),
            'Street': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text())").get(),
            'ZIP Code': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[0],
            'City': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[1],
            'Phone Number 1': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text())").get(),
            'Phone Number 2': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
            'Source': response.url
        }
N.B. This is my first post on Stack Overflow. Please pardon any mistakes.
Solution
Here is the pagination:
Your code works as it is. The starting URL, https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html, is the same one you mentioned. It is the URL you land on when you click the first page, and it is where the data comes from. I build the pagination directly in start_urls with a list comprehension, so you can widen or narrow the range of page numbers at any time. Here I scrape only five pages, range(0, 5), which yields 160 items in total. To scrape more pages, or all of them, put the page numbers you want inside range().
CODE:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Baden1Spider(CrawlSpider):
    name = 'baden1'
    allowed_domains = ['home.mobile.de']
    # One start URL per listing page: 0.html to 4.html. Widen the range to crawl more pages.
    start_urls = ['https://home.mobile.de/regional/baden-w%C3%BCrttemberg/' + str(x) + '.html' for x in range(0, 5)]

    rules = (
        # Follow every dealer link on a listing page and parse the dealer page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='box']/div[@class='row ']"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//span[@class='jslink pg-btn page-next']"))
    )

    def parse_item(self, response):
        yield {
            'Dealer Name': response.xpath("//address[@class='fullAddress']/strong/text()").get(),
            'Street': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text())").get(),
            'ZIP Code': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[0],
            'City': response.xpath("normalize-space(//div[contains(@class, 'addressData')]/text()/following::text()[1])").get().split()[1],
            'Phone Number 1': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text())").get(),
            'Phone Number 2': response.xpath("normalize-space(//div[contains(@class, 'dealerContactPhoneNumbers')]/text()/following::text()[1])").get(),
            'Source': response.url
        }
OUTPUT (a portion of the full log):
{'Dealer Name': 'Abbas KfZ An- und Verkauf', 'Street': 'schießstattweg 18', 'ZIP Code': '88677', 'City': 'Markdorf', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 56730811', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/ABBASKFZANUNDVERKAUF'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/SCHAIBLEMASCHINENHANDEL>
{'Dealer Name': 'Schaible Maschinenhandel', 'Street': 'In Oberwiesen 7', 'ZIP Code': '88682', 'City': 'Salem', 'Phone Number 1': 'Tel.:\xa0+49 (0)7553 60146', 'Phone Number 2': 'Mobiltelefon:\xa0+49 (0)171 7998515', 'Source': 'https://home.mobile.de/SCHAIBLEMASCHINENHANDEL'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/RUSH-AUTOMOBILE>
{'Dealer Name': 'RUSH Automobile UG (haftungsbeschränkt)', 'Street': 'Hallendorferstrasse 6', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7551 949277', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)171 3608800', 'Source': 'https://home.mobile.de/RUSH-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/FIRST-CLASS-AUTOMOBILE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-MUTTER> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LOCFAHRZEUGE>
{'Dealer Name': 'LOC Fahrzeuge OHG', 'Street': 'Meersburger Straße 2', 'ZIP Code': '88690', 'City': 'Uhldingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7556 928597', 'Phone Number 2': 'Fax:\xa0+49 (0)7556 928583', 'Source': 'https://home.mobile.de/LOCFAHRZEUGE'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AH-SCHMID-BERMATINGEN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/FIRST-CLASS-AUTOMOBILE>
{'Dealer Name': 'First Class Automobile Seit 1989', 'Street': 'Büro: Oberer Höhenweg 29', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)176 20491640', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 91111', 'Source': 'https://home.mobile.de/FIRST-CLASS-AUTOMOBILE'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-MUTTER>
{'Dealer Name': 'Autohaus Matthias Mutter', 'Street': 'Salemerstrasse 42', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 912100', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 91110', 'Source': 'https://home.mobile.de/AH-MUTTER'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AH-SCHMID-BERMATINGEN>
{'Dealer Name': 'Autohaus Schmid', 'Street': 'Salemer Straße 30', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 7544 2375', 'Phone Number 2': 'Fax:\xa0+49 7544 1355', 'Source': 'https://home.mobile.de/AH-SCHMID-BERMATINGEN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/YAMAHA-NESENSOHN> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/YAMAHA-NESENSOHN>
{'Dealer Name': 'Yamaha Nesensohn', 'Street': 'Salemerstrasse 51', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 2902', 'Phone Number 2': 'Fax:\xa0+49 (0)7544 73025', 'Source': 'https://home.mobile.de/YAMAHA-NESENSOHN'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUS-KIRCHHOFF> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOMOBILEREHM> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUS-KIRCHHOFF>
{'Dealer Name': 'Autohaus Kirchhoff', 'Street': 'Am Luckengraben 4', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 8450', 'Phone Number 2': 'Fax:\xa0+49 (0)7554 8252', 'Source': 'https://home.mobile.de/AUTOHAUS-KIRCHHOFF'}
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/AUTOHAUSREICHLEOHG> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOMOBILEREHM>
{'Dealer Name': 'Automobile Rehm', 'Street': 'Heidbühlstr. 9', 'ZIP Code': '88697', 'City': 'Bermatingen', 'Phone Number 1': 'Tel.:\xa0+49 175 2234111', 'Phone Number 2': '', 'Source': 'https://home.mobile.de/AUTOMOBILEREHM'}
2021-08-06 12:40:14 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG>
{'Dealer Name': 'Autohaus Sailer GmbH & Co.KG', 'Street': 'Hofäckerstr. 1', 'ZIP Code': '88697', 'City': 'Bermatingen-Ahausen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7544 968300', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7544 9683018', 'Source': 'https://home.mobile.de/AUTOHAUSSAILERGMBHCOKG'}
2021-08-06 12:40:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE> (referer: https://home.mobile.de/regional/baden-w%C3%BCrttemberg/0.html)
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1>
{'Dealer Name': 'Patrick Kayser', 'Street': 'Langbrühl 6', 'ZIP Code': '88709', 'City': 'Hagnau', 'Phone Number 1': 'Tel.:\xa0+49 (0)178 6524858', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)7532 4458081', 'Source': 'https://home.mobile.de/PATRICKKAYSERHAGNAUAMBODENSEE1'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/AUTOHAUSREICHLEOHG>
{'Dealer Name': 'Autohaus Reichle OHG', 'Street': 'Hauptstraße 57', 'ZIP Code': '88699', 'City': 'Frickingen-Altheim', 'Phone Number 1': 'Tel.:\xa0+49 7554 8337', 'Phone Number 2': 'Mobiltelefon:\xa0+49 151 65828855', 'Source': 'https://home.mobile.de/AUTOHAUSREICHLEOHG'}
2021-08-06 12:40:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE>
{'Dealer Name': 'Lackiermeisterbetrieb & KFZ Service', 'Street': 'Lippertsreuterstr. 6b', 'ZIP Code': '88699', 'City': 'Frickingen', 'Phone Number 1': 'Tel.:\xa0+49 (0)7554 9892115', 'Phone Number 2': '2. Tel.-Nr.:\xa0+49 (0)1525 2160629', 'Source': 'https://home.mobile.de/LACKIERMEISTERBETRIEBKFZSERVICE'}
2021-08-06 12:40:15 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-06 12:40:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369317,
'downloader/request_count': 165,
'downloader/request_method_count/GET': 165,
'downloader/response_bytes': 2468479,
'downloader/response_count': 165,
'downloader/response_status_count/200': 165,
'elapsed_time_seconds': 17.246198,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 6, 6, 40, 15, 130449),
'httpcompression/response_bytes': 6481573,
'httpcompression/response_count': 165,
'item_scraped_count': 160,
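One caveat about the item fields: 'ZIP Code' and 'City' are built with .split()[0] and .split()[1], which raises IndexError if the address line comes back empty and cuts a city name of more than one word down to its first word. The phone values in the log also keep their label and a non-breaking space (\xa0). Below is a minimal sketch of two helpers that could be called from parse_item. split_zip_city and split_phone are hypothetical names, not part of Scrapy, and the multi-word city in the usage lines is only an illustrative value.

```python
def split_zip_city(line):
    """Split a 'ZIP City' line into (zip_code, city); tolerate empty input."""
    # maxsplit=1 keeps everything after the ZIP code together as the city.
    parts = (line or "").split(maxsplit=1)
    zip_code = parts[0] if parts else ""
    city = parts[1] if len(parts) > 1 else ""
    return zip_code, city


def split_phone(raw):
    """Split 'Label:<nbsp>number' into (label, number); tolerate empty input."""
    # Replace the non-breaking space that appears in the scraped values.
    text = (raw or "").replace("\xa0", " ").strip()
    # Labels in the log end with a colon: "Tel.:", "2. Tel.-Nr.:", "Fax:".
    label, sep, number = text.partition(": ")
    return (label, number) if sep else ("", text)


# Usage with values shaped like those in the log above.
print(split_zip_city("88339 Bad Waldsee"))
print(split_phone("Fax:\xa0+49 (0)7556 928583"))
```

Returning the label separately keeps the distinction between telephone, mobile and fax numbers, which the 'Phone Number 2' field currently mixes together.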
Answered By - F.Hoque