Issue
I have this code and I want to iterate over list_of_urls, but I don't know how to pass each entry into the url variable of the request. Is there a way to pass this list and iterate over the pageNumber?
import scrapy
import json

list_of_urls = []
for i in range(1, 3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    to_append = [url]
    for j in to_append:
        list_of_urls.append(j)
print(list_of_urls)

class TestSpider(scrapy.Spider):
    name = "test"
    headers = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber=7&pageSize=42',
            callback=self.parse,
            method="GET",
            headers=self.headers
        )

    def parse(self, response):
        json_response = json.loads(response.text)
        res = json_response["result"]["items"]
        for item in res:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }
Solution
Yes, there are many ways to do this.
One way would be to simply use a for loop and iterate over the list_of_urls variable inside your start_requests method.
Example:
...
list_of_urls = []
for i in range(1, 3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    list_of_urls.append(url)
print(list_of_urls)
...
...
def start_requests(self):
    for url in list_of_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            method="GET",
            headers=self.headers)
Another would be to simply move your list_of_urls code inside the start_requests method:
def start_requests(self):
    for i in range(1, 3):
        url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
        yield scrapy.Request(url=url, headers=self.headers)
Some additional tips:
You can use the custom_settings class attribute to set the USER_AGENT setting instead of setting it in the headers for every request.
As you can see in my first example, you were unnecessarily adding the url to a list and then iterating over that list to append its contents to list_of_urls, when you could have simply appended the url directly.
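If you prefer, the same list can be built in a single list comprehension (this is just a sketch of the loop above, with the URL split across lines for readability):

```python
# Build all page URLs in one step with a list comprehension.
base = ('https://api.yaencontre.com/v3/search?family=FLAT&lang=es'
        '&location=albacete-provincia&operation=RENT'
        '&pageNumber={}&pageSize=42')
list_of_urls = [base.format(i) for i in range(1, 3)]
print(list_of_urls)  # two URLs, for pageNumber=1 and pageNumber=2
```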
The "GET" method is the default for scrapy requests, so there is no need to set it explicitly; the same is true for the callback, since scrapy uses self.parse by default.
In your parse method you can simply use response.json() instead of json_response = json.loads(response.text).
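The two calls are equivalent; here is a minimal plain-Python sketch of why (FakeResponse is a hypothetical stand-in for a real scrapy Response, used only to illustrate the equivalence):

```python
import json

class FakeResponse:
    # Hypothetical stand-in for scrapy's Response, for illustration only.
    def __init__(self, text):
        self.text = text

    def json(self):
        # scrapy's Response.json() parses the response body as JSON.
        return json.loads(self.text)

resp = FakeResponse('{"result": {"items": [{"realEstate": {"price": 500}}]}}')
# Both approaches produce the same parsed data.
assert resp.json() == json.loads(resp.text)
print(resp.json()["result"]["items"][0]["realEstate"]["price"])  # 500
```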
Using all of the above, your code could look something like this:
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    def start_requests(self):
        for i in range(1, 3):
            yield scrapy.Request('https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i))

    def parse(self, response):
        for item in response.json()["result"]["items"]:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }
Answered By - Alexander