Issue
I have something similar to the following code. I know that in this example it would be possible to navigate directly to the "yourself" tag page, but in my application I need to go to page 1 in order to get the links to page 2, and I need the links from page 2 in order to get to page 3, and so on (i.e. the URLs don't follow a specific pattern).
import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = [
        "https://quotes.toscrape.com/",
    ]

    def parse(self, response):
        links = response.css(
            'a[class="tag"][href*=inspirational]::attr(href)'
        ).extract()
        for link in links:
            yield response.follow(link, self.parse_inspirational)

    def parse_inspirational(self, response):
        links = response.css('a[class="tag"][href*=life]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_life)

    def parse_life(self, response):
        links = response.css('a[class="tag"][href*=yourself]::attr(href)').extract()
        for link in links:
            yield response.follow(link, self.parse_yourself)

    def parse_yourself(self, response):
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)
Since the same pattern of following a link and looking for a new CSS pattern is repeated three times, I want to write a function that iterates over a list of CSS strings and recursively yields the responses. This is what I came up with, but it doesn't work. I'm expecting it to print the same output as the original long version:
def parse_recurse(self, response, css_str=None):
    links = response.css(css_str.pop(0)).extract()
    for link in links:
        yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"css_str": css_str})

def parse(self, response):
    css = ['a[class="tag"][href*=inspirational]::attr(href)',
           'a[class="tag"][href*=life]::attr(href)',
           'a[class="tag"][href*=yourself]::attr(href)']
    response = self.parse_recurse(response, css_str=css)
    for resp in response.css('span[itemprop="text"]::text').extract():
        print(resp)
Solution
You can't do response = self.parse_recurse(...), because parse_recurse yields only requests, not responses.
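To see why, remember that calling a generator function never runs its body; it only creates a generator object, so no request is sent and the result is certainly not a response. A minimal plain-Python sketch (the names are just for illustration):

def parse_recurse():
    yield "a request"

result = parse_recurse()  # the body has not run yet
print(result)             # <generator object parse_recurse at 0x...>
print(next(result))       # only now does the body run and yield "a request"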
Normally a callback yields a request, Scrapy catches it and sends the request to the engine, which will later send the request to the server, get the response from the server, and execute the callback with this response.
See details in the documentation: Architecture overview
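In other words, a spider never calls its callback directly or assigns its result; it yields the Request and lets the engine invoke the callback with the downloaded Response. A minimal sketch of that idiom (the spider name and the page-2 URL are just illustrative):

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield the Request; the engine downloads the page and
        # calls self.parse_next with the resulting Response
        yield scrapy.Request("https://quotes.toscrape.com/page/2/", callback=self.parse_next)

    def parse_next(self, response):
        print(response.url)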
You have to use start_requests to run parse_recurse with the list road, and parse_recurse should check whether road is empty. If road is not empty, then it should yield requests with callback parse_recurse and a smaller road (so it runs the recursion). And if road is empty, then it should yield requests with callback parse, which will get the text.
import scrapy


class SampleSpider(scrapy.Spider):
    name = "sample"

    start_urls = ["https://quotes.toscrape.com/"]

    road = [
        'a[class="tag"][href*=inspirational]::attr(href)',
        'a[class="tag"][href*=life]::attr(href)',
        'a[class="tag"][href*=yourself]::attr(href)',
    ]

    def start_requests(self):
        """Run starting URL with full road."""
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_recurse, cb_kwargs={"road": self.road})

    def parse_recurse(self, response, road):
        """If road is not empty then send to parse_recurse with smaller road.
        If road is empty then send to parse."""
        first = road[0]
        rest = road[1:]

        links = response.css(first).extract()

        if rest:
            # repeat recursion
            for link in links:
                yield response.follow(link, callback=self.parse_recurse, cb_kwargs={"road": rest})
        else:
            # exit recursion
            for link in links:
                yield response.follow(link, callback=self.parse)

    def parse(self, response):
        for resp in response.css('span[itemprop="text"]::text').extract():
            print(resp)


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})

c.crawl(SampleSpider)
c.start()
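Note the design choice in parse_recurse: it passes road[1:] through cb_kwargs, so each followed link gets its own shortened copy of the list. The css_str.pop(0) in the attempt from the question mutates a single list that is shared by all pending callbacks, so sibling requests would see a road that is already shorter than expected.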
Answered By - furas