Issue
I'm trying to strip \r, \n and \t characters in a Scrapy spider and then write the results to a JSON file.
I have a "description" field that is full of newlines, and the output isn't what I want: each description should be matched to its title.
I tried map(unicode.strip, ...) but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.
This is my code:
def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())
I also tried:
item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()
But it raised an error. What's the best way?
Solution
unicode.strip only deals with whitespace characters at the beginning and end of strings ("Return a copy of the string with the leading and trailing characters removed."), not with \n, \r, or \t in the middle.
You can either use a custom function to remove those characters inside the string (using the regular expression module, re), or use XPath's normalize-space(), which "returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space."
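For the regular-expression route, a minimal sketch could look like the following. The helper names collapse_whitespace and clean_fragments are hypothetical (they are not part of Scrapy), the sample strings simply mimic what extract() returns for a paragraph like the one in the question, and the code is written for Python 3, so it uses plain str rather than unicode.

```python
import re

# Hypothetical helpers, not part of Scrapy: they only post-process
# the strings that .extract() would hand back.
_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace every run of whitespace (\\r, \\n, \\t, spaces) with one space
    and trim the ends, roughly what XPath's normalize-space() does."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def clean_fragments(fragments):
    """Normalize each extracted text node and drop the ones left empty."""
    cleaned = (collapse_whitespace(fragment) for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]


# Sample input shaped like the output of .extract() on text nodes.
raw = [
    "\n\n  This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n",
    "\n\nI think we're done here\n\n",
]

print(clean_fragments(raw))
# Join the cleaned fragments if a single description string is wanted.
print(" ".join(clean_fragments(raw)))
```

Unlike normalize-space() on the whole element, this only sees the text nodes you extracted, so text inside child elements such as a link is not included unless you select it as well.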
Example Python shell session:
>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
...
... This is some text,
... with some newlines \r
... and some \t tabs \t too;
...
... <a href="http://example.com"> and a link too
... </a>
...
... I think we're done here
...
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>>
>>> # you'll want to use relative XPath expressions, starting with "." (here ".//");
>>> # a leading "//" would search the whole document, not just this div
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
u"\n\nI think we're done here\n\n"]
>>>
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>>
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>>
Answered By - paul trmbrth