Issue
I am trying to extract information, but it gives me an "unhashable list" error. This is the page link: https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004
    import scrapy
    from scrapy.http import Request
    from scrapy.crawler import CrawlerProcess


    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
        custom_settings = {
            'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
            'DOWNLOAD_DELAY': 1,
            'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

        def parse(self, response):
            wev = {}
            tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
            det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()
            wev[tuple(tic)] = [i.strip() for i in det]
            yield wev
It gives me output like this:
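For context on the error: in Python a dict key must be hashable, so a plain list cannot be used as a key (a tuple can). A minimal sketch with made-up sample values, showing why `tuple(tic)` "works" but doesn't pair labels with values, while `zip()` does:

```python
# Hypothetical sample data mimicking the scraped labels and values
tic = ['Status:', 'Stary nr wpisu:']
det = ['Były adwokat', '1077']

wev = {}
wev[tuple(tic)] = det  # works: a tuple is hashable, but all labels become ONE key

try:
    wev[tic] = det     # a list is mutable, hence unhashable
except TypeError as e:
    print(e)           # unhashable type: 'list'

# zip() pairs the two lists element by element instead,
# giving one key/value per label:
paired = dict(zip(tic, det))
print(paired)  # {'Status:': 'Były adwokat', 'Stary nr wpisu:': '1077'}
```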
Solution
You have to use zip() to group values from tic and det:

    for name, value in zip(tic, det):
        wev[name.strip()] = value.strip()
and this will give wev with:

    {
        'Status:': 'Były adwokat',
        'Data wpisu w aktualnej izbie na listę adwokatów:': '2013-09-01',
        'Data skreślenia z listy:': '2019-07-23',
        'Ostatnie miejsce wpisu:': 'Katowice',
        'Stary nr wpisu:': '1077',
        'Zastępca:': 'Pieprzyk Mirosław'
    }
and this will give a CSV with the correct values:

    Status:,Data wpisu w aktualnej izbie na listę adwokatów:,Data skreślenia z listy:,Ostatnie miejsce wpisu:,Stary nr wpisu:,Zastępca:
    Były adwokat,2013-09-01,2019-07-23,Katowice,1077,Pieprzyk Mirosław
EDIT:
Alternatively, you should first get the rows and later search for the name and value in every row:
    all_rows = response.xpath("//div[@class='line_list_K']/div")
    for row in all_rows:
        name = row.xpath(".//span/text()").get()
        value = row.xpath(".//div/text()").get()
        wev[name.strip()] = value.strip()
This method can sometimes be safer if some row doesn't have a value, or if a row keeps a value in an unusual way, like the email, which is added by JavaScript (and Scrapy can't run JavaScript) but is kept as attributes in the tag <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com">. Because only some pages have an email, it may not add this value to the file - so you need to add a default value, wev = {'Email:': '', ...}, at the start. The same problem can occur with other values.
    wev = {'Email:': ''}

    for row in all_rows:
        name = row.xpath(".//span/text()").get()
        value = row.xpath(".//div/text()").get()
        if name and value:
            wev[name.strip()] = value.strip()
        elif name and name.strip() == 'Email:':
            # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
            div = row.xpath('./div')
            email_a = div.attrib['data-ea']
            email_b = div.attrib['data-eb']
            wev[name.strip()] = f'{email_a}@{email_b}'
Full working code:
    # rejestradwokatow

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            #'https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9',
            'https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004',
            'https://rejestradwokatow.pl/adwokat/adach-micha-55082',
        ]
        custom_settings = {
            'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
            'DOWNLOAD_DELAY': 1,
            'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

        def parse(self, response):
            # it may need default values when an item doesn't exist on the page
            wev = {
                'Status:': '',
                'Data wpisu w aktualnej izbie na listę adwokatów:': '',
                'Stary nr wpisu:': '',
                'Adres do korespondencji:': '',
                'Fax:': '',
                'Email:': '',
            }

            all_rows = response.xpath("//div[@class='line_list_K']/div")
            for row in all_rows:
                name = row.xpath(".//span/text()").get()
                value = row.xpath(".//div/text()").get()
                if name and value:
                    wev[name.strip()] = value.strip()
                elif name and name.strip() == 'Email:':
                    # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
                    div = row.xpath('./div')
                    email_a = div.attrib['data-ea']
                    email_b = div.attrib['data-eb']
                    wev[name.strip()] = f'{email_a}@{email_b}'

            print(wev)
            yield wev


    # --- run without creating project and save results in `output.csv` ---

    c = CrawlerProcess({
        #'USER_AGENT': 'Mozilla/5.0',
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })
    c.crawl(TestSpider)
    c.start()
Answered By - furas