Issue
I'm trying to refactor my code and break things to learn and I broke something, hope you can help me learn.
I got a working scraper that runs over multiple pages as follows:
class someSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com&page=1']

    def parse(self, response):
        result_parsed = json.loads(result)
        results = result_parsed.get('results') #yield actual results
        current_page_number = result_parsed.get('currentPage') #gets the page from the link as part of the API response
        for result in results:
            count += 1
            yield{
                ... #gives me the results as desired
            }
        go_to_nextpage(self, current_page_number) #### THIS DOES NOT WORK, not error, just stops at one page ####
        #### THIS WORKS ####
        # next_page_number = result_parsed.get('currentPage') +1
        # yield scrapy.Request(
        #     url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
        #     callback=self.parse
        # )
With go_to_nextpage() defined as:
def go_to_nextpage(self, current_page_number):
    next_page_number = current_page_number +1
    yield scrapy.Request(
        url=f'https://www.example.com&page={next_page_number}',
        callback=self.parse
    )
I'm guessing I don't properly understand 2 things:
- the working of the self keyword
- the way the callback method and the parse method work / interact
any help is appreciated
Solution
There are a couple of issues that I can hopefully help clarify.
You are not using the self parameter correctly.
- In Python, when you call an instance method from outside the class like this: myclass.method(), myclass is the variable holding an instance of the class.
- When the same method is called from inside another instance method, the self variable, which is automatically passed as the first parameter, is used instead: self.method().
- In the context of your code it should look like this: self.go_to_nextpage(current_page_number)
Scrapy can only process requests that are returned/yielded from its parser callback.
- You correctly yield the items as you indicated, but the request that is yielded by the go_to_nextpage method never reaches Scrapy, because your current code doesn't do anything with the return value.
- Another issue is that you are yielding a single result in go_to_nextpage, which automatically turns that method into a generator. Calling a generator function only creates a generator object; its body does not run until something iterates over it.
- The easiest solution to this is to simply return the request instead of yielding it.
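To make the two points above concrete, here is a minimal sketch that runs without Scrapy. The Pager class and its method names are hypothetical stand-ins for the spider, not part of the original code; the sketch only shows how self binding and generator functions behave in plain Python.

```python
class Pager:
    """Hypothetical stand-in for the spider; no Scrapy involved."""

    def __init__(self):
        self.log = []

    def next_page_gen(self, current):
        # The `yield` below makes this a generator function.
        self.log.append("generator body ran")
        yield current + 1

    def next_page_value(self, current):
        # A plain `return` means the body runs as soon as the method is called.
        self.log.append("plain method ran")
        return current + 1


pager = Pager()

# Calling the generator function only builds a generator object;
# nothing in its body has executed yet.
gen = pager.next_page_gen(1)
ran_before_iteration = list(pager.log)

# Iterating (here via list) is what actually runs the body.
values = list(gen)

# Calling through the instance passes `self` automatically...
bound_result = pager.next_page_value(1)
# ...which is equivalent to passing the instance explicitly via the class.
explicit_result = Pager.next_page_value(pager, 1)
```

In the spider, the discarded call behaves like the `gen` line without the `list(gen)` step: the generator is created and then thrown away, so the request inside it is never produced.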
Here is an example of what it should look like:
import json

import scrapy


class someSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com&page=1']

    def parse(self, response):
        # Parse the JSON body of the response (the original passed an undefined name here).
        result_parsed = json.loads(response.text)
        results = result_parsed.get('results')
        current_page_number = result_parsed.get('currentPage')
        count = 0
        for result in results:
            count += 1
            yield {...}  # item fields, as in the question
        # go_to_nextpage(self, current_page_number)  <- this line is the issue
        # because you don't handle the return value.
        yield self.go_to_nextpage(current_page_number)

    def go_to_nextpage(self, current_page_number):
        next_page_number = current_page_number + 1
        return scrapy.Request(
            url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
            callback=self.parse,
        )
If you wanted to use yield in your go_to_nextpage method you could write it like this:
class someSpider(scrapy.Spider):

    def parse(self, response):
        ...
        # Iterate over the generator and re-yield each request to Scrapy.
        for request in self.go_to_nextpage(current_page_number):
            yield request

    def go_to_nextpage(self, current_page_number):
        next_page_number = current_page_number + 1
        yield scrapy.Request(
            url=f'https://www.immoweb.be/en/search-results/house-and-apartment/for-sale/brussels/district?countries=BE&hasRecommendationActivated=true&page={next_page_number}&orderBy=relevance&searchType=similar',
            callback=self.parse,
        )
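As a side note, a loop that re-yields everything from a helper generator can also be written with yield from. The sketch below uses plain Python with hypothetical function names, and URL strings in place of scrapy.Request objects so it runs without Scrapy; it is only meant to show that both forms produce the same sequence.

```python
def next_requests(current_page):
    # Hypothetical stand-in for go_to_nextpage: yields a URL string
    # instead of a scrapy.Request so the sketch has no dependencies.
    yield f"https://www.example.com&page={current_page + 1}"


def parse_with_loop(current_page):
    yield {"page": current_page}  # stand-in for a scraped item
    # Explicitly re-yield each value produced by the helper generator.
    for request in next_requests(current_page):
        yield request


def parse_with_yield_from(current_page):
    yield {"page": current_page}  # stand-in for a scraped item
    # Delegate to the helper generator; each of its values is yielded in turn.
    yield from next_requests(current_page)


loop_output = list(parse_with_loop(1))
delegated_output = list(parse_with_yield_from(1))
```

Inside a spider the same pattern would presumably read yield from self.go_to_nextpage(current_page_number), which keeps the helper as a generator while still handing its requests back to Scrapy.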
Answered By - Alexander