Issue
I am new to Scrapy and Python.
Nevertheless, I've created a spider that extracts the information I need. The only issue is that I can't remove the \n and \t characters from the output while keeping the HTML tags in place.
For example:
My current output is:
{'specification': ['<div class="col-lg-5 model__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<ul class="offer__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li class="offer__spec-elem">\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--left muted">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Бренд</span>\n\t\t\t\t\t\t\t\t\t\t\t</div>\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--right">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Huawei</span>\n\t\t\t\t\t\t\t\t\t\t\t</div> ...']}
Desired output:
{'specification': ['<div class="col-lg-5 model__spec"><ul class="offer__spec"><li class="offer__spec-elem"><div class="offer__spec-elem--left muted"><span>Бренд</span></div><div class="offer__spec-elem--right"><span>Huawei</span></div> ...']}
My script:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('descendant::div[@class="col-lg-5 model__spec"]'):
            yield {
                'specification': quote.getall()
            }
I've tried using normalize-space, but it removed the \t and \n characters along with all the HTML tags, and I got raw text:
def parse(self, response):
    for quote in response.xpath('normalize-space(descendant::div[@class="col-lg-5 model__spec"])'):
        yield {
            'specification': quote.getall()
        }
Output:
{'specification': ['Бренд Huawei Емкость аккумулятора 3340 мАч Диагональ экрана 6.1 Процессор HiSilicon Kirin 710 Количество ядер процессора 8 Частота процессора 2.2 ГГц Встроенная память 128 ГБ Оперативная память 4 ГБ Беспроводные коммуникации 3G, 4G(LTE), Bluetooth, GPS, NFC, Wi-Fi, ГЛОНАСС Стандарт связи 3G (WCDMA/UMTS), 4G (LTE), GSM Все характеристики']}
Thanks in advance.
Solution
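normalize-space() operates on the string value of a node, which is only its text content, so the tags are gone before any whitespace is collapsed. Keep your original XPath and strip the \t and \n characters in Python instead. Try this out: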
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('(descendant::div[@class="col-lg-5 model__spec"])'):
            # getall() returns the matched HTML as a list of strings
            quote = quote.getall()
            # Remove every tab and newline; the tags themselves are untouched
            quote = [i.replace("\t", "").replace("\n", "") for i in quote]
            yield {
                'specification': quote
            }
Answered By - Shivam
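A possible variation on the answer above, offered as a sketch rather than part of the original answer: chained replace() calls also delete tabs and newlines that sit inside text nodes, which can glue two words together. The helper below (the name strip_layout_whitespace is hypothetical, not a Scrapy function) uses only the standard re module. It first removes whitespace-only runs between tags, then turns any remaining tab or newline run into a single space.

```python
import re


def strip_layout_whitespace(html: str) -> str:
    """Drop tabs/newlines from an HTML fragment while keeping its tags."""
    # Whitespace-only runs between two tags are pure indentation: remove them.
    html = re.sub(r">\s+<", "><", html)
    # Any tab/newline run left inside a text node becomes one space,
    # so adjacent words are not glued together.
    html = re.sub(r"[\t\r\n]+", " ", html)
    return html.strip()


# Sample fragment shaped like the output shown in the question
sample = (
    '<div class="offer__spec-elem--left muted">\n\t\t\t\t'
    '<span>Бренд</span>\n\t\t\t</div>'
)
print(strip_layout_whitespace(sample))
# <div class="offer__spec-elem--left muted"><span>Бренд</span></div>
```

Inside parse(), this would be applied to each string returned by getall(), for example [strip_layout_whitespace(s) for s in quote.getall()]. Note that the first substitution also removes a plain space between two inline tags (such as "</span> <span>"), which may change how the fragment renders, so adjust the pattern if that spacing matters.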