Issue
I'm new to Scrapy. I made a script to scrape data from a website and it works fine: I get the results as a JSON file and it looks perfect. Now when I try to use my script to scrape multiple URLs (same site), it works and I get the data in the JSON file for each URL, but there is a bug. The structure it prints is as below (as coded in the script):
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...}  # URL1
]
When I put in 2 URLs to scrape, I get this:
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...},  # URL1
  {Title: ..., Description: ..., Brochure: ...},  # URL2
  {titleDesc: ..., Content: ...},  # URL2
  {attribute: ...}  # URL2
]
It is still fine, but when I add more URLs, the structure gets messed up and becomes like this:
[
  {Title: ..., Description: ..., Brochure: ...},  # URL1
  {titleDesc: ..., Content: ...},  # URL1
  {attribute: ...},  # URL1
  {Title: ..., Description: ..., Brochure: ...},  # URL2
  {Title: ..., Description: ..., Brochure: ...},  # URL3
  {titleDesc: ..., Content: ...},  # URL2
  {attribute: ...},  # URL2
  {titleDesc: ..., Content: ...},  # URL3
  {attribute: ...}  # URL3
]
If you look closely, you will notice that the title of the third URL comes right after the title of the second one. Can somebody help, please?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "attributes"
    start_urls = [
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/161/",
        "https://product.sanyglobal.com/concrete_machinery/truck_mixer/119/162/",
    ]

    def parse(self, response):
        yield {
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }
        for post in response.css(".el-collapse"):
            for i in range(len(post.css(".el-collapse-item__header"))):
                res = ""
                lst = post.css(".value-el-desc")
                x = lst[i].css(".value-el-desc p::text").extract()
                for y in x:
                    res += y.strip() + "&&"
                try:
                    yield {
                        "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                        "desc": res
                    }
                except:
                    continue
                res = ""
        for post in response.css(".lie-one-canshu"):
            try:
                dicti = {"attribute": post.css('.lie-one-canshu::text')[0].get().strip()}
                yield dicti
            except:
                continue
UPDATE: I noticed that the bug isn't consistent; sometimes I execute the script and the result is fine.
Solution
Scrapy is asynchronous, so there is no guarantee of the order in which items are processed or output, at least not out of the box. If you want all of the output from a single URL to come out together, then I suggest you only yield one item from each call to the parse method.
For example:
def parse(self, response):
    results = {
        'items': [{
            "title": response.css("div.sku-top-title::text").get(),
            "desc": response.css("div.sku-top-desc::text").get(),
            "brochure": 'brochure'
        }]
    }
    for post in response.css(".el-collapse"):
        for i in range(len(post.css(".el-collapse-item__header"))):
            res = ""
            lst = post.css(".value-el-desc")
            x = lst[i].css(".value-el-desc p::text").extract()
            for y in x:
                res += y.strip() + "&&"
            try:
                results['items'].append({
                    "descTitle": post.css('.el-collapse-item__header::text')[i].get().strip(),
                    "desc": res
                })
            except:
                continue
            res = ""
    for post in response.css(".lie-one-canshu"):
        try:
            results['items'].append({
                "attribute": post.css('.lie-one-canshu::text')[0].get().strip()
            })
        except:
            continue
    yield results
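With this change, each call to parse yields a single dict whose "items" list contains everything scraped from that URL, so the groups can no longer interleave. As a rough sketch (field values elided, and output.json is just a placeholder file name), exporting with Scrapy's JSON feed, e.g. scrapy crawl attributes -o output.json, should then give one object of this shape per URL:

    # Approximate shape of the single item yielded for each URL (values elided):
    {
        "items": [
            {"title": "...", "desc": "...", "brochure": "brochure"},
            {"descTitle": "...", "desc": "..."},   # one per collapse header
            {"attribute": "..."},                  # one per .lie-one-canshu block
        ]
    }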
Answered By - Alexander