Issue
I'm new to Scrapy. I made a script to scrape data from a website and it works fine: I get the results as a JSON file and it looks perfect. Now when I try to use my script to scrape multiple URLs (same site), it works and I get the data in the JSON file for each URL, but there is a bug. The structure it prints is as below (as coded in the script):
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...}  # URL1
]
When I put in 2 URLs to scrape, I get this:
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...},  # URL1
  {Title: ..., Description: ..., Brochure: ...},  # URL2
  {titleDesc: ..., Content: ...},  # URL2
  {attribute: ...}  # URL2
]
It is still fine, but when I add more URLs, the structure gets messed up and becomes like this:
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...},  # URL1
  {Title: ..., Description: ..., Brochure: ...},  # URL2
  {Title: ..., Description: ..., Brochure: ...},  # URL3
  {titleDesc: ..., Content: ...},  # URL2
  {attribute: ...},  # URL2
  {titleDesc: ..., Content: ...},  # URL3
  {attribute: ...}  # URL3
]
If you look closely, you will notice that the title of the third URL comes right after the title of the second one. Can somebody help, please?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                        "desc": res
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css('.lie-one-canshu::text')[0].get().strip()}
                yield dicti
            except:
                continue
UPDATE: I noticed that the bug isn't consistent; sometimes I execute the script and the result is fine.
Solution
Scrapy is asynchronous, so there is no guarantee of the order in which items are processed or output, at least not out of the box. If you want all of the output from a single URL to come out together, then I suggest you only yield one item from each call to the parse method.
For example:
def parse(self, response):
    results = {
        'items': [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results['items'].append({
                    "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                    "desc": res
                })
            except:
                continue
            res = ""
    for post in response.css(".lie-one-canshu"):
        try:
            results['items'].append({
                "attribute": post.css('.lie-one-canshu::text')[0].get().strip()
            })
        except:
            continue
    yield results
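With this change, each call to parse yields a single dict whose "items" list contains everything scraped from that URL, so the groups can no longer interleave. As a rough sketch (field values elided, and output.json is just a placeholder file name), exporting with Scrapy's JSON feed, e.g. scrapy crawl attributes -o output.json, should then give one object of this shape per URL:

    # Approximate shape of the single item yielded for each URL (values elided):
    {
        "items": [
            {"title": "...", "desc": "...", "brochure": "brochure"},
            {"descTitle": "...", "desc": "..."},   # one per collapse header
            {"attribute": "..."},                  # one per .lie-one-canshu block
        ]
    }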
Answered By - Alexander