Issue
I need to crawl all the regional versions of this site (mkm-metal.ru). If I understood correctly, the geolocation is passed both as the REGION_ID parameter in the URL (https://mkm-metal.ru/?REGION_ID=141) and as a cookie ('BITRIX_SM_CITY_ID': loc_id).
import scrapy
import re


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self, **cb_kwargs):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            cb_kwargs['cookies'] = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # meta={'cookiejar': loc_id},
                cookies=cb_kwargs['cookies'],
                cb_kwargs=cb_kwargs,
            )

    def parse(self, response, **cb_kwargs):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # meta={'cookiejar': response.meta['cookiejar']},
            cookies=cb_kwargs['cookies'],
        )

    def parse_2(self, response, **cb_kwargs):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
But in my case, parse_2 prints only one city (for the first ID, 142). What's wrong? Where is the error?
Here is the log:
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=142> (referer: None)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=8> (referer: None)
2022-06-05 17:32:46 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://mkm-metal.ru/catalog/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2022-06-05 17:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/catalog/> (referer: https://mkm-metal.ru/?REGION_ID=142)
Бугульма https://mkm-metal.ru/catalog/
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=12> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mkm-metal.ru/?REGION_ID=96> (referer: None)
2022-06-05 17:32:47 [scrapy.core.engine] INFO: Closing spider (finished)
Solution
In the parse method you request the same URL (https://mkm-metal.ru/catalog/) for every cookie. Scrapy's default dupefilter fingerprints a request by its method, URL, and body; cookies are not part of the fingerprint, so all four catalog requests look identical, only the first one is sent, and the rest are silently dropped. That is the "Filtered duplicate request" line in your log. Add dont_filter=True to that request:
import scrapy


class Mkm(scrapy.Spider):
    name = 'mkm'

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            url = f"https://mkm-metal.ru/?REGION_ID={loc_id}"
            # Build a fresh dict per iteration; passing one shared
            # mutable dict through cb_kwargs would let later loop
            # iterations overwrite the cookies of earlier requests.
            cookies = {'BITRIX_SM_CITY_ID': loc_id}
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                cookies=cookies,
                cb_kwargs={'cookies': cookies},
            )

    def parse(self, response, cookies):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            cookies=cookies,
            # The catalog URL is the same for every region, so tell
            # the dupefilter not to drop the repeated requests.
            dont_filter=True,
        )

    def parse_2(self, response):
        city = response.css('a.place span::text').get().strip()
        print(city, response.url)
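The commented-out meta={'cookiejar': ...} lines in your code point at an alternative worth mentioning: Scrapy's cookies middleware can keep a separate cookie jar per region, so the four sessions cannot overwrite each other's BITRIX_SM_CITY_ID cookie in the single default jar. A minimal sketch of that variant, not tested against the live site (the spider name mkm_jars is just for illustration):

import scrapy


class MkmJars(scrapy.Spider):
    name = 'mkm_jars'

    def start_requests(self):
        for loc_id in ['142', '8', '12', '96']:
            yield scrapy.Request(
                url=f"https://mkm-metal.ru/?REGION_ID={loc_id}",
                callback=self.parse,
                cookies={'BITRIX_SM_CITY_ID': loc_id},
                # One isolated cookie jar per region id.
                meta={'cookiejar': loc_id},
            )

    def parse(self, response):
        yield scrapy.Request(
            url='https://mkm-metal.ru/catalog/',
            callback=self.parse_2,
            # Reuse the same jar; the region cookie travels with it,
            # so there is no need to pass cookies through cb_kwargs.
            meta={'cookiejar': response.meta['cookiejar']},
            dont_filter=True,
        )

    def parse_2(self, response):
        city = response.css('a.place span::text').get()
        print(city.strip() if city else None, response.url)

Note that dont_filter=True is still required here: the dupefilter ignores meta, so the four catalog requests are duplicates regardless of which cookie jar they use.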
Answered By - SuperUser