Issue
I want to scrape this page "https://www.yaencontre.com/alquiler/pisos/barcelona" to get the price, latitude, and longitude of every apartment.
I'm able to get the price, but not the latitude and longitude.
Here's my attempt:
#!/usr/bin/env python3
import scrapy
from scraping.items import fields
import pandas as pd
import re

n = 'barcelona'
list_of_urls = []
for i in range(1, 2):
    url = 'https://www.yaencontre.com/alquiler/pisos/barcelona/pag-{}'.format(i)
    list_of_urls.append(url)

class scraperApp(scrapy.Spider):
    name = n
    start_urls = list_of_urls

    def parse(self, response):
        for href in response.xpath("//a[@class='d-ellipsis']/@href"):
            u = 'https://www.yaencontre.com' + href.extract()
            print(u)
            yield scrapy.Request(u, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        if response:
            item = fields()
            item['vivienda'] = n
            item['price'] = response.xpath("//div[@class='price-wrapper mb-sm']/span").extract_first()
            item['lat'] = response.xpath('substring-after(substring-before(//img[@class="d-block"]/@src, "%2C"), "=")').extract()
            item['lon'] = response.xpath('substring-after(substring-before(//img[@class="d-block"]/@src, "&zoom"), "%2C")*1').extract()
            print(item)
            yield item
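For what it's worth, the `substring-before`/`substring-after` XPath in the attempt above is just string slicing on the map image's `src`. A quick offline sketch of that logic in plain Python (the sample URL here is invented to match the `=`, `%2C`, and `&zoom` markers the XPath relies on; the real page's URL may be shaped differently):

```python
import re

# Hypothetical static-map src, shaped to match the "=", "%2C" and "&zoom"
# markers used in the question's XPath -- not taken from the real page.
src = "https://maps.googleapis.com/maps/api/staticmap?center=41.3885%2C2.1666&zoom=15"

# Latitude sits between "=" and the URL-encoded comma "%2C";
# longitude sits between "%2C" and "&zoom".
match = re.search(r"=([0-9.+-]+)%2C([0-9.+-]+)", src)
if match:
    lat, lon = float(match.group(1)), float(match.group(2))
    print(lat, lon)
```

If this fails on the live page, it usually means no `img.d-block` element with such a `src` is present in the raw HTML, which is exactly the situation described in the solution below: the data never appears in the static markup.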
Solution
The data is loaded dynamically by JavaScript from a hidden API via a GET request, so you can grab the required fields directly from the API's JSON response instead of parsing the rendered HTML. Below is a working solution as an example.
import scrapy
import json

class TestSpider(scrapy.Spider):
    name = "test"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://api.yaencontre.com/v3/searchmap?family=FLAT&lang=es&latMax=45.58873524958013&latMin=37.00876645649905&location=barcelona&lonMax=7.707878748840145&lonMin=-11.628058751159855&operation=RENT&orderBy=RELEVANCE&size=200",
            callback=self.parse,
            method="GET",
            headers=self.headers,
        )

    def parse(self, response):
        json_response = json.loads(response.text)
        for item in json_response["result"]["items"]:
            yield {
                'lat': item['lat'],
                'lon': item['lon'],
                'price': item['price'],
            }
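The `parse` callback only walks the JSON payload, so the extraction can be sanity-checked offline against a hand-made response body that mirrors the `result.items` shape seen in the output below (the values here are invented, not real API data):

```python
import json

# Hand-made sample mirroring the API's {"result": {"items": [...]}} shape;
# the listing values are invented for illustration.
payload = '{"result": {"items": [{"lat": 41.39, "lon": 2.17, "price": 1890}]}}'

def extract(text):
    """Yield lat/lon/price for each listing in an API response body."""
    for item in json.loads(text)["result"]["items"]:
        yield {'lat': item['lat'], 'lon': item['lon'], 'price': item['price']}

print(list(extract(payload)))
```

To run the spider itself, `scrapy runspider` (or `scrapy crawl test` inside a project) with `-O items.json` will export the yielded dicts to a file.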
Output:
{'lat': 41.388501578993406, 'lon': 2.1665850093524353, 'price': 4800}
2023-03-22 00:30:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.yaencontre.com/v3/searchmap?family=FLAT&lang=es&latMax=45.58873524958013&latMin=37.00876645649905&location=barcelona&lonMax=7.707878748840145&lonMin=-11.628058751159855&operation=RENT&orderBy=RELEVANCE&size=200>
{'lat': 41.39807929851551, 'lon': 2.1822061506785406, 'price': 1890}
{'lat': 41.380495743458205, 'lon': 2.1555876168586092, 'price': 2650}
{'lat': 41.38348724393205, 'lon': 2.1584030638883083, 'price': 2600}
{'lat': 41.37912095360161, 'lon': 2.1722173322787586, 'price': 2990}
2023-03-22 00:30:25 [scrapy.core.engine] INFO: Closing spider (finished)
2023-03-22 00:30:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 538,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 135736,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.20321,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 3, 21, 18, 30, 25, 752063),
'item_scraped_count': 200
... so on
Answered By - Md. Fazlul Hoque