Issue
I'm trying to scrape multiple pages and yield the results as a single array.
What I've tried so far:
import scrapy


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"
        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

    def parse(self, response):
        # extract data
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            yield item
This gives me three separate lists of prices:
['$740,000', '$998,000', '$620,000', ......, '$719,000', '$2,975,000', '$1,099,000']
['$500,000', '$474,000', '$725,000', ......, '$895,000', '$619,500', '$1,199,000']
['$1,095,000', '$475,000', '$700,000', ........, '$950,000', '$995,000', '$639,950']
What I am looking for is one single list, like this:
$740,000 - 1
$998,000 - 2
$620,000 - 3
$719,000 - 4
.
.
.
$995,000 - 143
$639,950 - 144
Solution
I am not sure exactly what produced the example lists, but suppose you called one of the functions in RealtorSpider and it gave you the three lists. Since these functions use yield to return their values, they return a generator, so you probably need to call list on their output to get a list instead of a generator.
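To illustrate the generator point in isolation, here is a minimal sketch that does not use Scrapy. The function fake_parse and the sample prices are made up for illustration; they only stand in for a callback that yields one item per page.

```python
from itertools import chain


def fake_parse(pages):
    """Hypothetical stand-in for a spider callback: yields one item per page."""
    for prices in pages:
        yield {"price": prices}


# Made-up sample data: one list of prices per page.
pages = [["$740,000", "$998,000"], ["$500,000"], ["$1,095,000", "$475,000"]]

gen = fake_parse(pages)  # a generator object; no page has been processed yet
items = list(gen)        # consuming it gives a list of three dicts

# Flatten the per-page lists into one single list of prices.
all_prices = list(chain.from_iterable(item["price"] for item in items))
print(all_prices)
```

The flattening step does the same job as a nested list comprehension; chain.from_iterable is just one way to write it.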
I suggest you edit your realtor.py file as follows:
import scrapy
import json


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]
    # Class-level list shared by every parse() call; it collects
    # one list of prices per matched "ul.property-list" element.
    prices = []
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"
        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

    def parse(self, response):
        # extract data
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            self.prices.append(item["price"])
            yield item
        # Flatten everything collected so far and rewrite data.json;
        # once the last page has been parsed, the file holds every price.
        data = [x for y in self.prices for x in y]
        with open("data.json", "w") as f:
            f.write(json.dumps(data))
After making these changes, running scrapy crawl realtor in a shell will generate a file named data.json. This file contains exactly what you want, so you can just read it:
import json
data = json.load(open("data.json"))
data
Output
['$575,000',
'$399,950',
'$620,000',
'$1,150,000',
'$1,100,000',
'$880,000',
'$735,000',
'$337,000',
'$759,800',
'$330,000',
'$575,000',
'$740,000',
'$639,950',
'$950,000',
'$575,000',
'$895,000',
'$950,000',
'$675,000',
'$629,000',
'$2,000,000',
'$1,325,000',
'$714,900',
'$699,950',
'$998,000',
'$1,150,000',
'$849,999',
'$999,000',
'$1,050,000',
'$750,000',
'$2,975,000',
'$1,300,000',
'$1,350,000',
'$400,000',
'$1,349,000',
'$1,175,000',
'$1,049,000',
'$3,500,000',
'$849,000',
'$719,000',
'$734,950',
'$1,099,000',
'$769,000',
'$489,000',
'$1,095,000',
'$700,000',
'$475,000',
'$450,000',
'$625,000',
'$330,000',
'$425,000',
'$685,000',
'$385,000',
'$649,950',
'$815,000',
'$699,000',
'$525,000',
'$1,495,000',
'$325,000',
'$835,000',
'$599,950',
'$1,150,000',
'$895,000',
'$998,900',
'$775,000',
'$565,000',
'$750,000',
'$879,000',
'$325,000',
'$1,000,000',
'$785,000',
'$725,000',
'$899,000',
'$1,095,000',
'$1,175,000',
'$815,000',
'$2,300,000',
'$950,000',
'$929,000',
'$1,249,900',
'$1,650,000',
'$1,500,000',
'$639,950',
'$995,000',
'$750,000',
'$630,000',
'$999,000',
'$474,000',
'$390,000',
'$485,000',
'$725,000',
'$500,000',
'$340,000',
'$689,000',
'$525,000',
'$650,000',
'$589,950',
'$665,000',
'$725,000',
'$460,000',
'$749,450',
'$1,088,000',
'$525,000',
'$495,000',
'$830,000',
'$475,000',
'$999,000',
'$849,950',
'$848,000',
'$480,000',
'$538,000',
'$4,585,000',
'$1,150,000',
'$1,045,000',
'$730,000',
'$630,000',
'$1,950,000',
'$899,000',
'$1,975,000',
'$1,179,500',
'$2,100,000',
'$829,000',
'$2,750,000',
'$895,000',
'$849,950',
'$619,500',
'$1,199,000']
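If you also want the numbered format shown in the question, something like the following sketch could work. The helper number_prices is a made-up name, and the three prices are a small sample standing in for the contents of data.json.

```python
import json


def number_prices(prices, start=1):
    """Hypothetical helper: format each price as 'PRICE - N', counting from start."""
    return [f"{price} - {i}" for i, price in enumerate(prices, start=start)]


# Made-up sample standing in for the list stored in data.json.
sample = json.loads('["$740,000", "$998,000", "$620,000"]')

for line in number_prices(sample):
    print(line)
```

With the real file, you would pass the list loaded from data.json instead of sample.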
Answered By - Amirhossein Kiani