Issue
I am trying to retrieve all the results from the following website by making a request as follows:
import requests
import scrapy


class MyPropertySpider(scrapy.Spider):
    name = 'my_property'
    start_urls = [
        'https://www.myproperty.co.za/search?last=1y&coords%5Blat%5D=-33.2277918&coords%5Blng%5D=21.8568586&coords%5Bnw%5D%5Blat%5D=-30.4302599&coords%5Bnw%5D%5Blng%5D=17.7575637&coords%5Bse%5D%5Blat%5D=-47.1313489&coords%5Bse%5D%5Blng%5D=38.2216904&description=Western%20Cape%2C%20South%20Africa&status=For%20Sale',
    ]

    def parse(self, response):
        # Headers copied from the browser's request to the search API
        headers = {
            'authority': 'jf6e1ij07f.execute-api.eu-west-1.amazonaws.com',
            'pragma': 'no-cache',
            'cache-control': 'no-cache',
            'accept': 'application/json, text/plain, */*',
            'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Mobile Safari/537.36',
            'content-type': 'application/json;charset=UTF-8',
            'origin': 'https://www.myproperty.co.za',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'cors',
            'sec-fetch-dest': 'empty',
            'referer': 'https://www.myproperty.co.za/',
            'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
        }
        # JSON body of the search request; "limit" and "start" control paging
        data = '{"clientOfficeId":[],"countryCode":"za","sortField":"distance","sortOrder":"asc","last":"0.5y","statuses":["For Sale","Pending Sale","Under Offer","Final Sale","Auction"],"coords":{"lat":"-33.9248685","lng":"18.4240553","nw":{"lat":"-33.47127","lng":"18.3074488"},"se":{"lat":"-34.3598061","lng":"19.00467"}},"radius":2500,"nearbySuburbs":true,"limit":210,"start":0}'
        response = requests.post('https://jf6e1ij07f.execute-api.eu-west-1.amazonaws.com/p/search', headers=headers,
                                 data=data)
However, I am only able to get 200 results from the page, even though 1000+ are available on the given search page. I see that the limit in the request data is 210, and when I try to increase it, the result does not change. I am not sure how to fix this (or whether it is even possible). Any suggestions? Thanks in advance!
Solution
Since you are using Scrapy, I suggest you use FormRequest instead of the requests lib. You can do the same POST request with both. (For a JSON body like this one, Scrapy's JsonRequest is probably the closer fit.) See the Scrapy docs if you want to read up on this method.
This is the form data you are passing. It gives the server all the search parameters you are interested in.
data = {
    "clientOfficeId": [],
    "countryCode": "za",
    "sortField": "distance",
    "sortOrder": "asc",
    "last": "0.5y",
    "statuses": ["For Sale", "Pending Sale", "Under Offer", "Final Sale", "Auction"],
    "coords": {"lat": "-33.9248685", "lng": "18.4240553",
               "nw": {"lat": "-33.47127", "lng": "18.3074488"},
               "se": {"lat": "-34.3598061", "lng": "19.00467"}},
    "radius": 2500,
    "nearbySuburbs": True,
    "limit": 210,
    "start": 0
}
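The dictionary above is a Python object, so it has to be serialized to JSON before it can be sent as the request body. Here is a minimal sketch using the standard json module; build_body is a hypothetical helper name, and the payload is shortened for illustration:

```python
import json


def build_body(payload, start):
    """Return the JSON request body for the page beginning at `start`.

    `build_body` is a hypothetical helper, not part of Scrapy or requests.
    """
    # Copy the payload so the caller's dict is not modified, then set the offset
    page = dict(payload, start=start)
    # json.dumps turns Python's True into JSON's true
    return json.dumps(page)


# Shortened payload, for illustration only
body = build_body({"nearbySuburbs": True, "limit": 210, "start": 0}, start=210)
print(body)
```

The resulting string can be passed as the body of the POST request, as in the original data string.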
Since the server isn't willing to give you all the data at once (I haven't tested, but you said that increasing limit didn't change the result), it expects you to "paginate" through the data, just as you would on a website.
When you send the form above, it returns 210 results, so the next time you call it you need to tell the server you want the NEXT 210 results, not the same ones you already received. For that you will use the start field in the form. In your next request use "start":210 and keep adding to it until the server starts returning empty responses. (Usually the responses are not completely empty, but the field holding the results comes back empty.)
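The paging loop described above can be sketched as follows. This is only an illustration under assumptions: paginate and fetch_page are hypothetical names, and fetch_page stands in for whatever function sends the POST request and returns the list of results from the response (the actual name of the results field has not been checked here).

```python
def paginate(fetch_page, base_payload, page_size=210):
    """Yield every result by advancing `start` until a page comes back empty.

    `fetch_page` is a hypothetical callable: it takes the payload dict,
    performs the POST request, and returns the list of results.
    """
    start = 0
    while True:
        # Same search parameters each time; only the offset changes
        payload = dict(base_payload, start=start, limit=page_size)
        results = fetch_page(payload)
        if not results:
            # An empty result list means there is nothing left to fetch
            break
        yield from results
        start += page_size


# Stand-in for the real API: serves slices of 500 fake results
def fake_fetch(payload):
    data = list(range(500))
    return data[payload["start"]:payload["start"] + payload["limit"]]


all_results = list(paginate(fake_fetch, {"countryCode": "za"}))
print(len(all_results))
```

In a Scrapy spider the same idea would typically be written in the callback: process the current page, then yield a new request with start increased by the page size, and stop once the results field is empty.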
Answered By - renatodvc