Saturday, February 12, 2022

[FIXED] How to remove \n \t from Scrapy output but leave HTML tags there


Issue

I am new to Scrapy and Python.

Nevertheless, I've created a spider that extracts the information I need. The only issue is that I can't remove the \n and \t characters from the output while keeping the HTML tags in place.

For example:

My current output is:

{'specification': ['<div class="col-lg-5 model__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<ul class="offer__spec">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li class="offer__spec-elem">\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--left muted">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Бренд</span>\n\t\t\t\t\t\t\t\t\t\t\t</div>\n\t\t\t\t\t\t\t\t\t\t\t<div class="offer__spec-elem--right">\n\t\t\t\t\t\t\t\t\t\t\t\t<span>Huawei</span>\n\t\t\t\t\t\t\t\t\t\t\t</div> ...']}

Desired output:

{'specification': ['<div class="col-lg-5 model__spec"><ul class="offer__spec"><li class="offer__spec-elem"><div class="offer__spec-elem--left muted"><span>Бренд</span></div><div class="offer__spec-elem--right"><span>Huawei</span></div> ...']}

My script:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('descendant::div[@class="col-lg-5 model__spec"]'):
            yield {
                'specification': quote.getall()
            }

I tried to use normalize-space, but it removed \t and \n along with all the HTML tags, and I got raw text:

    def parse(self, response):
        for quote in response.xpath('normalize-space(descendant::div[@class="col-lg-5 model__spec"])'):
            yield {
                'specification': quote.getall()
            }

Output:

{'specification': ['Бренд Huawei Емкость аккумулятора 3340 мАч Диагональ экрана 6.1 Процессор HiSilicon Kirin 710 Количество ядер процессора 8 Частота процессора 2.2 ГГц Встроенная память 128 ГБ Оперативная память 4 ГБ Беспроводные коммуникации 3G, 4G(LTE), Bluetooth, GPS, NFC, Wi-Fi, ГЛОНАСС Стандарт связи 3G (WCDMA/UMTS), 4G (LTE), GSM Все характеристики']}

Thanks in advance.
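
A note on why the normalize-space attempt returns text only: in XPath 1.0, normalize-space() first converts its argument to a string, and the string value of an element is the concatenation of its descendant text nodes, so the markup is already gone before any whitespace is collapsed. The sketch below imitates that behaviour with the standard library only. The names string_value and normalize_space are hypothetical helpers, not Scrapy or parsel APIs, and the snippet is a shortened, made-up stand-in for the real page.

```python
import re
import xml.etree.ElementTree as ET


def string_value(element):
    """Approximate the XPath string value: all descendant text, no tags."""
    return "".join(element.itertext())


def normalize_space(text):
    """Collapse runs of space, tab, CR and LF to one space and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


# Assumed sample markup, much shorter than the real product page
snippet = '<div class="model__spec">\n\t<span>Бренд</span>\n\t<span>Huawei</span>\n</div>'
element = ET.fromstring(snippet)

print(normalize_space(string_value(element)))  # Бренд Huawei
```

The tags never reach the whitespace step, which matches the raw-text output shown above.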

Solution

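normalize-space() works on the string value of a node, so it discards the markup. Keep the original XPath so that getall() still returns the HTML, then strip the tab and newline characters from each string with str.replace():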

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'https://boo.ua/catalog/smartfony/huawei-p30-lite-4-128gb--mar-lx1a-/',
        'https://boo.ua/catalog/smartfony/huawei-p20-2018-4-128gb-black-eml-l29/',
    ]

    def parse(self, response):
        for quote in response.xpath('descendant::div[@class="col-lg-5 model__spec"]'):
            # getall() returns the serialized HTML of the matched node as a list of strings
            fragments = quote.getall()
            # Drop tab and newline characters; the tags themselves are left untouched
            fragments = [fragment.replace("\t", "").replace("\n", "") for fragment in fragments]
            yield {
                'specification': fragments
            }

Answered By - Shivam
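
As a follow-up sketch (not part of the original answer), the same cleanup can be pulled into small helpers and tried without running a spider. Both function names are hypothetical, and the sample string is a shortened fragment modelled on the output in the question. The first helper removes every tab, newline and carriage return; the second, an alternative technique, removes only whitespace that sits directly between two tags, on the assumption that line breaks inside text content should survive.

```python
import re


def strip_tabs_newlines(fragments):
    """Remove every tab, newline and carriage return from each string."""
    return [re.sub(r"[\t\r\n]", "", fragment) for fragment in fragments]


def strip_between_tags(fragments):
    """Remove only whitespace runs located directly between '>' and '<'."""
    return [re.sub(r"(?<=>)\s+(?=<)", "", fragment) for fragment in fragments]


# Shortened sample in the shape returned by getall() in the question
raw = ['<div class="offer__spec-elem--left muted">\n\t\t\t<span>Бренд</span>\n\t\t</div>']

print(strip_tabs_newlines(raw))
print(strip_between_tags(raw))
# Both print:
# ['<div class="offer__spec-elem--left muted"><span>Бренд</span></div>']
```

The two helpers differ only when whitespace appears inside text: for '<p>two\nwords</p>' the first yields '<p>twowords</p>', while the second leaves the string unchanged. Inside parse(), either one could be applied to the list returned by getall() before yielding the item.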

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
