Issue
I am trying to extract information, but it gives me an "unhashable list" error. This is the page link: https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004
    import scrapy
    from scrapy.http import Request
    from scrapy.crawler import CrawlerProcess


    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
        custom_settings = {
            'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
            'DOWNLOAD_DELAY': 1,
            'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

        def parse(self, response):
            wev = {}
            tic = response.xpath("//div[@class='line_list_K']//div//span//text()").getall()
            det = response.xpath("//div[@class='line_list_K']//div//div//text()").getall()
            wev[tuple(tic)] = [i.strip() for i in det]
            yield wev
It gives me output like this:
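For context on the error: in Python a dict key must be hashable, so a plain list cannot be used as a key (a tuple can). A minimal sketch with made-up sample values, showing why `tuple(tic)` "works" but doesn't pair labels with values, while `zip()` does:

```python
# Hypothetical sample data mimicking the scraped labels and values
tic = ['Status:', 'Stary nr wpisu:']
det = ['Były adwokat', '1077']

wev = {}
wev[tuple(tic)] = det  # works: a tuple is hashable, but all labels become ONE key

try:
    wev[tic] = det     # a list is mutable, hence unhashable
except TypeError as e:
    print(e)           # unhashable type: 'list'

# zip() pairs the two lists element by element instead,
# giving one key/value per label:
paired = dict(zip(tic, det))
print(paired)  # {'Status:': 'Były adwokat', 'Stary nr wpisu:': '1077'}
```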
Solution
You have to use zip() to group values from tic and det:

    for name, value in zip(tic, det):
        wev[name.strip()] = value.strip()
and this will give wev with:

    {
        'Status:': 'Były adwokat',
        'Data wpisu w aktualnej izbie na listę adwokatów:': '2013-09-01',
        'Data skreślenia z listy:': '2019-07-23',
        'Ostatnie miejsce wpisu:': 'Katowice',
        'Stary nr wpisu:': '1077',
        'Zastępca:': 'Pieprzyk Mirosław'
    }
and this will give a CSV with the correct values:

    Status:,Data wpisu w aktualnej izbie na listę adwokatów:,Data skreślenia z listy:,Ostatnie miejsce wpisu:,Stary nr wpisu:,Zastępca:
    Były adwokat,2013-09-01,2019-07-23,Katowice,1077,Pieprzyk Mirosław
EDIT:
Alternatively, you should first get the rows and later search for the name and value in every row:
    all_rows = response.xpath("//div[@class='line_list_K']/div")
    for row in all_rows:
        name = row.xpath(".//span/text()").get()
        value = row.xpath(".//div/text()").get()
        wev[name.strip()] = value.strip()
This method can sometimes be safer if some row doesn't have a value, or if a row keeps a value in an unusual way, like the email, which is added by JavaScript (and Scrapy can't run JavaScript) but is kept as attributes in the tag <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com">. Because only some pages have an email, it may not add this value to the file - so you need to add a default value, wev = {'Email:': '', ...}, at the start. The same problem can occur with other values.
    wev = {'Email:': ''}

    for row in all_rows:
        name = row.xpath(".//span/text()").get()
        value = row.xpath(".//div/text()").get()
        if name and value:
            wev[name.strip()] = value.strip()
        elif name and name.strip() == 'Email:':
            # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
            div = row.xpath('./div')
            email_a = div.attrib['data-ea']
            email_b = div.attrib['data-eb']
            wev[name.strip()] = f'{email_a}@{email_b}'
Full working code:
    # rejestradwokatow

    import scrapy
    from scrapy.crawler import CrawlerProcess


    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            #'https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9',
            'https://rejestradwokatow.pl/adwokat/abaewicz-agnieszka-51004',
            'https://rejestradwokatow.pl/adwokat/adach-micha-55082',
        ]
        custom_settings = {
            'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
            'DOWNLOAD_DELAY': 1,
            'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
        }

        def parse(self, response):
            # it may need default values when an item doesn't exist on the page
            wev = {
                'Status:': '',
                'Data wpisu w aktualnej izbie na listę adwokatów:': '',
                'Stary nr wpisu:': '',
                'Adres do korespondencji:': '',
                'Fax:': '',
                'Email:': '',
            }

            all_rows = response.xpath("//div[@class='line_list_K']/div")
            for row in all_rows:
                name = row.xpath(".//span/text()").get()
                value = row.xpath(".//div/text()").get()
                if name and value:
                    wev[name.strip()] = value.strip()
                elif name and name.strip() == 'Email:':
                    # <div class="address_e" data-ea="adwokat.adach" data-eb="gmail.com"></div>
                    div = row.xpath('./div')
                    email_a = div.attrib['data-ea']
                    email_b = div.attrib['data-eb']
                    wev[name.strip()] = f'{email_a}@{email_b}'

            print(wev)
            yield wev


    # --- run without creating project and save results in `output.csv` ---

    c = CrawlerProcess({
        #'USER_AGENT': 'Mozilla/5.0',
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })
    c.crawl(TestSpider)
    c.start()
Answered By - furas