Friday, June 3, 2022

[FIXED] Scrapy: parse the data from multiple pages (pagination) and combine the yield output in a single array

Labels: python, scrapy

Issue

What I'm trying to do is to scrape multiple pages and yield the result in a single array.

What I've tried so far:

import scrapy


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

    def parse(self, response):
        # extract data
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            yield item

This gives me three separate lists of prices:

['$740,000', '$998,000', '$620,000', ......, '$719,000', '$2,975,000', '$1,099,000']
['$500,000', '$474,000', '$725,000', ......, '$895,000', '$619,500', '$1,199,000']
['$1,095,000', '$475,000', '$700,000', ........, '$950,000', '$995,000', '$639,950']

What I am looking for is one single list like this:

$740,000 - 1
$998,000 - 2
$620,000 - 3
$719,000 - 4
     .
     .
     .
$995,000 - 143
$639,950 - 144

Solution

I am not sure exactly how the example lists were produced, but presumably each of the three lists comes from one call to parse in RealtorSpider: Scrapy invokes the callback once per response, and each call yields its own item. Since parse uses yield, it returns a generator rather than a list, so nothing combines the results from different pages; you have to collect them yourself.
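
To see why three pages give three lists, here is a minimal sketch that does not use Scrapy at all. The fake_pages data and the parse_page function are hypothetical stand-ins for the real responses and for the parse callback. The point is only that a generator callback called once per page produces one list per page, and that the caller has to merge them.

```python
# Hypothetical stand-in data: one list of price strings per results page.
fake_pages = [
    ["$740,000", "$998,000"],
    ["$500,000", "$474,000"],
    ["$1,095,000", "$475,000"],
]


def parse_page(prices_on_page):
    """Mimic a Scrapy callback: yield one item per page."""
    yield {"price": prices_on_page}


# Calling the callback once per page gives one item per page ...
items = [item for page in fake_pages for item in parse_page(page)]

# ... so the per-page lists have to be merged explicitly.
all_prices = []
for item in items:
    all_prices.extend(item["price"])

print(len(items))   # number of separate items, one per page
print(all_prices)   # one combined list
```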

I suggest editing your realtor.py file as follows:

import scrapy
import json

class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]
    prices = []  # accumulates one list of prices per parsed page
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

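    def parse(self, response):
        # Each response is one results page; collect its prices.
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            self.prices.append(item["price"])
            yield item
        # Flatten the per-page lists and rewrite data.json after every page,
        # so the file always holds all prices collected so far.
        data = [x for y in self.prices for x in y]
        with open("data.json", "w") as f:
            f.write(json.dumps(data))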
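
The list comprehension in parse flattens a list of lists into one list. As a small standalone illustration (the sample values are made up to mirror the question, and prices_per_page is an assumed name for what self.prices holds), itertools.chain.from_iterable from the standard library gives the same result:

```python
from itertools import chain

# Assumed shape of self.prices after three pages: one inner list per page.
prices_per_page = [
    ["$740,000", "$998,000"],
    ["$500,000"],
    ["$1,095,000", "$475,000"],
]

# Nested comprehension, as used in the spider.
flat_a = [x for y in prices_per_page for x in y]

# Equivalent standard-library form.
flat_b = list(chain.from_iterable(prices_per_page))

print(flat_a == flat_b)
print(flat_b)
```

Either form keeps the prices in the order the pages were parsed, which may differ from page order because Scrapy handles requests asynchronously.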

With this change, running scrapy crawl realtor in a shell generates a file named data.json that contains all prices in one flat list, which is what you are after. You can then read it back:

import json

# Load the flat list of prices written by the spider.
with open("data.json") as f:
    data = json.load(f)
data

Output

['$575,000',
 '$399,950',
 '$620,000',
 '$1,150,000',
 '$1,100,000',
 '$880,000',
 '$735,000',
 '$337,000',
 '$759,800',
 '$330,000',
 '$575,000',
 '$740,000',
 '$639,950',
 '$950,000',
 '$575,000',
 '$895,000',
 '$950,000',
 '$675,000',
 '$629,000',
 '$2,000,000',
 '$1,325,000',
 '$714,900',
 '$699,950',
 '$998,000',
 '$1,150,000',
 '$849,999',
 '$999,000',
 '$1,050,000',
 '$750,000',
 '$2,975,000',
 '$1,300,000',
 '$1,350,000',
 '$400,000',
 '$1,349,000',
 '$1,175,000',
 '$1,049,000',
 '$3,500,000',
 '$849,000',
 '$719,000',
 '$734,950',
 '$1,099,000',
 '$769,000',
 '$489,000',
 '$1,095,000',
 '$700,000',
 '$475,000',
 '$450,000',
 '$625,000',
 '$330,000',
 '$425,000',
 '$685,000',
 '$385,000',
 '$649,950',
 '$815,000',
 '$699,000',
 '$525,000',
 '$1,495,000',
 '$325,000',
 '$835,000',
 '$599,950',
 '$1,150,000',
 '$895,000',
 '$998,900',
 '$775,000',
 '$565,000',
 '$750,000',
 '$879,000',
 '$325,000',
 '$1,000,000',
 '$785,000',
 '$725,000',
 '$899,000',
 '$1,095,000',
 '$1,175,000',
 '$815,000',
 '$2,300,000',
 '$950,000',
 '$929,000',
 '$1,249,900',
 '$1,650,000',
 '$1,500,000',
 '$639,950',
 '$995,000',
 '$750,000',
 '$630,000',
 '$999,000',
 '$474,000',
 '$390,000',
 '$485,000',
 '$725,000',
 '$500,000',
 '$340,000',
 '$689,000',
 '$525,000',
 '$650,000',
 '$589,950',
 '$665,000',
 '$725,000',
 '$460,000',
 '$749,450',
 '$1,088,000',
 '$525,000',
 '$495,000',
 '$830,000',
 '$475,000',
 '$999,000',
 '$849,950',
 '$848,000',
 '$480,000',
 '$538,000',
 '$4,585,000',
 '$1,150,000',
 '$1,045,000',
 '$730,000',
 '$630,000',
 '$1,950,000',
 '$899,000',
 '$1,975,000',
 '$1,179,500',
 '$2,100,000',
 '$829,000',
 '$2,750,000',
 '$895,000',
 '$849,950',
 '$619,500',
 '$1,199,000']
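
The question asked for each price followed by its position in the list. Assuming data is a flat list like the one shown in the output, a short sketch with enumerate produces that format. number_prices is a hypothetical helper name, and the three sample prices are copied from the start of the output.

```python
def number_prices(prices):
    """Return lines such as '$575,000 - 1', counting from 1."""
    return [f"{price} - {i}" for i, price in enumerate(prices, start=1)]


# First three prices from the output, used as sample input.
sample = ["$575,000", "$399,950", "$620,000"]

for line in number_prices(sample):
    print(line)
```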

Answered By - Amirhossein Kiani

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
