Issue
I am scraping real estate prices, but I only want the data before a certain date (say 2010), which means I need to follow the next-page link only up to a certain page. How do I achieve this?
I can find the page at which the following should stop manually, but obviously I want to avoid that.
Can we somehow use the number of items scraped? For example, on the website below I scrape 10 items per page. Say I only want the data up to page 14 (including page 14 but not page 15); then 14 x 10 = 140 items should be scraped. Can I then tell Scrapy to stop when the number of items scraped reaches 140?
import scrapy

class PropertySpider(scrapy.Spider):
    name = 'property'
    start_urls = [
        'http://house.speakingsame.com/p.php?q=Fortitude+Valley&p=0&s=1&st=&type=House&count=288&region=Fortitude+Valley&lat=0&lng=0&sta=qld&htype=&agent=0&minprice=0&maxprice=0&minbed=0&maxbed=0&minland=0&maxland=0'
    ]

    def parse(self, response):
        # my code here
        next_page = response.xpath("/html/body/center/table").xpath(".//tr").xpath(".//td")[-1].css('a').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Solution
Scrapy provides the CloseSpider extension.
class scrapy.extensions.closespider.CloseSpider
Closes a spider automatically when some conditions are met, using a specific closing reason for each condition.
Enabling the extension gives you access to several settings that can halt the spider at some point, including CLOSESPIDER_ITEMCOUNT, which does exactly what you are asking. (Note that requests already in flight when the limit is hit are still processed, so the final count can slightly exceed 140.)
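Besides the project-wide settings.py shown next, Scrapy also honours a per-spider custom_settings class attribute, so the limit can travel with the spider itself. A sketch, reusing the asker's spider name:

```python
import scrapy

class PropertySpider(scrapy.Spider):
    name = 'property'
    # Per-spider override: close this spider once 140 items have been scraped.
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 140,
    }
```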
In your settings.py file:

EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}

# CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_ITEMCOUNT = 140  # change the value to suit your needs
# CLOSESPIDER_PAGECOUNT = 0
# CLOSESPIDER_ERRORCOUNT = 0
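If you would rather not rely on the item count at all, you can also stop following the next-page link directly in the spider. A sketch, assuming the page number is carried in the `p` query parameter (as in the start URL, where p=0 appears to be the first page, so pages 1-14 correspond to p=0-13); the helper name is hypothetical:

```python
from urllib.parse import urlparse, parse_qs

# Pages 1..14 correspond to p=0..13, so stop once p would exceed 13.
MAX_P = 13

def should_follow(next_page_url, max_p=MAX_P):
    """Return True while the URL's "p" query value is within max_p."""
    query = parse_qs(urlparse(next_page_url).query)
    p = int(query.get("p", ["0"])[0])
    return p <= max_p
```

Inside parse, the follow is then guarded with `if next_page is not None and should_follow(next_page): yield response.follow(next_page, callback=self.parse)`.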
Answered By - Alexander