Issue
Can someone please tell me how to scrape the data (names and numbers) from this page using Scrapy? The data is loaded dynamically. If you check the Network tab, you'll find a POST request to https://www.icab.es/rest/icab-api/collegiates. I copied it as cURL and sent the request through Postman, but I am getting an error. Could someone please help me? URL: https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales/?extraSearch=false&probono=false
Solution
This is a very good question! Next time, though, please include your code and format the post a little better (see How to Ask).
Solution:
You need to recreate the request. I inspected the request with Burp Suite.
I captured the headers for the URL in 'start_urls', and both the headers and the body for the json_url.
If you request the json_url directly from start_requests you'll get a 401 error, so we first request the 'start_urls' URL and only then request the json_url.
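As a side note, the JSON body does not have to be typed out as one long string literal. The sketch below builds it from a dictionary with json.dumps. The filter keys and default values are the ones captured from the browser request; the names DEFAULT_FILTERS and build_body are hypothetical helpers introduced only for illustration, and which filter values the server actually accepts has not been verified here.

```python
import json

# Filter keys and defaults copied from the captured POST body.
# DEFAULT_FILTERS and build_body are illustrative names, not part of Scrapy or the site.
DEFAULT_FILTERS = {
    "keyword": "",
    "name": "",
    "surname": "",
    "street": "",
    "postalCode": "",
    "collegiateNumber": "",
    "dedication": "",
    "language": "",
    "paginationFirst": "1",
    "paginationLast": "25",
    "paginationOrder": "surname",
    "paginationOrderAscDesc": "ASC",
}


def build_body(**overrides):
    """Return the JSON request body, replacing any filter passed as a keyword argument."""
    unknown = set(overrides) - set(DEFAULT_FILTERS)
    if unknown:
        # Fail early on typos instead of sending a key the captured request never had.
        raise KeyError(f"unknown filter(s): {sorted(unknown)}")
    filters = {**DEFAULT_FILTERS, **overrides}
    # Compact separators reproduce the captured body without extra whitespace.
    return json.dumps({"filters": filters}, separators=(",", ":"))


print(build_body())
```

The returned string can be passed as the body argument of the POST request in place of the hand-written literal.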
The complete code:
import scrapy


class Temp(scrapy.Spider):
    name = "tempspider"
    allowed_domains = ['icab.es']
    start_urls = ['https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales']
    json_url = 'https://www.icab.es/rest/icab-api/collegiates'

    def start_requests(self):
        # Browser-like headers for the HTML search page (captured with Burp Suite).
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Origin": "https://www.icab.es",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "www.icab.es",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Sec-GPC": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        }
        # Visit the search page first; requesting json_url directly returns a 401.
        yield scrapy.Request(url=self.start_urls[0], headers=headers, callback=self.parse)

    def parse(self, response):
        # Headers captured for the POST request to the JSON endpoint.
        headers = {
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Pragma": "no-cache",
            "Sec-GPC": "1",
            'Accept': 'application/json',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9',
            'Content-Type': 'application/json',
            'Host': 'www.icab.es',
            'Sec-Ch-Ua': '"Chromium";v="91", " Not;A Brand";v="99"',
            'Sec-Ch-Ua-Mobile': '?0',
            'Origin': 'https://www.icab.es',
            'Referer': 'https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Dest': 'empty',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
            "X-KL-Ajax-Request": "Ajax_Request",
        }
        # JSON body captured from the browser: empty filters, records 1 to 25, ordered by surname.
        body = '{"filters":{"keyword":"","name":"","surname":"","street":"","postalCode":"","collegiateNumber":"","dedication":"","language":"","paginationFirst":"1","paginationLast":"25","paginationOrder":"surname","paginationOrderAscDesc":"ASC"}}'
        yield scrapy.Request(url=self.json_url, headers=headers, body=body, method='POST', callback=self.parse_json)

    def parse_json(self, response):
        # The endpoint returns JSON; the records are under the 'members' key.
        json_response = response.json()
        members = json_response['members']
        for member in members:
            yield {
                'randomPosition': member['randomPosition'],
                'collegiateNumber': member['collegiateNumber'],
                'surname': member['surname'],
                'name': member['name'],
                'gender': member['gender'],
            }
Output:
{'randomPosition': '27661107', 'collegiateNumber': '35080', 'surname': 'Abad Bamala', 'name': 'Ana', 'gender': 'M'}
{'randomPosition': '98668217', 'collegiateNumber': '14890', 'surname': 'Abad Calvo', 'name': 'Encarnacion', 'gender': 'M'}
{'randomPosition': '53180188', 'collegiateNumber': '29746', 'surname': 'Abad de Brocá', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '41073111', 'collegiateNumber': '31865', 'surname': 'Abad Esteve', 'name': 'Joan Domènec', 'gender': 'H'}
{'randomPosition': '63371735', 'collegiateNumber': '29647', 'surname': 'Abad Fernández', 'name': 'Dolors', 'gender': 'M'}
{'randomPosition': '30290704', 'collegiateNumber': '45016', 'surname': 'Abad Hernández', 'name': 'Laura', 'gender': 'M'}
{'randomPosition': '57510617', 'collegiateNumber': '16083', 'surname': 'Abad Mariné', 'name': 'Jose Antonio', 'gender': 'H'}
................
................
................
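The request body above asks only for records 1 to 25. To fetch further records you would presumably send more POST requests with shifted paginationFirst and paginationLast values. The sketch below computes those values per page. It assumes the two fields are 1-based, inclusive record positions, which matches the captured "1" and "25" but has not been verified against the API; page_bounds is a hypothetical helper name.

```python
def page_bounds(page, page_size=25):
    """Return (paginationFirst, paginationLast) as strings for a 1-based page number.

    Assumption: the API treats both values as 1-based, inclusive record positions.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive integers")
    first = (page - 1) * page_size + 1
    last = page * page_size
    # The captured request sends these values as strings, so keep that form.
    return str(first), str(last)


# Print the windows for the first three pages.
for page in range(1, 4):
    first, last = page_bounds(page)
    print(f"page {page}: paginationFirst={first}, paginationLast={last}")
```

Each pair can be substituted into the body of a new POST request to the json_url. Stopping when a response contains no members is one plausible end condition, though that behavior is also an assumption.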
Answered By - SuperUser