Issue
I want to extract text from HTML with the structure below.
selector = scrapy.Selector(text="""
<li>Text1
<ul>
<li>Text2</li>
<li>Text3</li>
</ul>
</li>
""")
Both of the following expressions return all of the text, including newlines, but lose the structure of the HTML.
selector.xpath('/descendant-or-self::*/text()').extract()
selector.xpath('//li/text()').extract()
Is there a way of accessing these elements via some path? I would expect to be able to select the first text (Text1) via something like this,
selector.xpath('//li/text()').extract()
as the rest (Text2 and Text3) can be selected via,
selector.xpath('//li/ul/li/text()').extract()
Solution
Your expected output is not clear, so I assume you want to extract the text nodes from the top-level li tags and from the ul/li tags.
The following XPath expression selects the text that sits directly inside the top-level li tags:
import re
el = ''.join(selector.xpath('//*[@class="a"]/ancestor::li/text()').extract()).replace('\n', '').strip()
txt = re.sub(r'\s+', ' ', el)
and
sel =' '.join(selector.xpath('//*[@class="a"]/ancestor::ul//li//text()').extract()).replace('\n','').strip()
txt2 = re.sub(r'\s+',' ',sel)
The second expression selects all of the text beneath the li tags inside the ul.
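The two expressions differ mainly in /text() (text nodes that are direct children of an element) versus //text() (text nodes anywhere beneath it). As a rough illustration of that "own text versus nested text" split, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy. It assumes the markup is well-formed XML, which happens to hold for the snippet in the question; the names SNIPPET and own_text are made up for this example.

```python
import xml.etree.ElementTree as ET

# The snippet from the question. This sketch assumes it parses as
# well-formed XML; real-world HTML often will not.
SNIPPET = """
<li>Text1
<ul>
<li>Text2</li>
<li>Text3</li>
</ul>
</li>
"""


def own_text(element):
    """Return the text directly inside `element`, before its first child."""
    return (element.text or "").strip()


# The root element is the outer <li>.
root = ET.fromstring(SNIPPET.strip())

# Text that belongs to the outer <li> itself.
top_text = own_text(root)

# Text of each inner <li>, reached through an explicit path.
nested_texts = [own_text(li) for li in root.findall("./ul/li")]

print(top_text)
print(nested_texts)
```

Note that own_text only covers the text before the first child element; ElementTree stores any text that follows a child in that child's tail attribute, so a fuller version would need to read that as well.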
P.S. I use the re module only to remove the extra whitespace.
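The whitespace cleanup can be sketched on its own, independent of Scrapy. The helper name collapse_whitespace and the sample string are hypothetical; the only library call used is re.sub from the standard library.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample that mimics joined text nodes with stray
# newlines and indentation.
raw = "Text1\n\n   Text7\n"
print(collapse_whitespace(raw))
```

Because \s also matches newlines, a helper like this handles them in the same pass and turns them into single spaces rather than deleting them.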
Verified in the Scrapy shell:
%paste
selector = scrapy.Selector(text="""
<li>Text1
<ul>
<li>Text2</li>
<li>Text3</li>
<li><class="a">
<i>Text4</i>
Text5
<cite style="Style2" class="a">
<a href="href1" title="Title1"> Text6</a>.
</cite>
<span class="b" title="Title2">
<span style="Style1"></span>
</span>
</li>
<li>
Text7
<cite style="Style2" class="a">
<i>Text8</i>
Text9
<a href="href2" title="Title2">Text10</a>.
</cite>
<span class="b" title="Title3">
<span style="Style3"></span>
</span>
</li>
</ul>
</li>
""")
el =''.join(selector.xpath('//*[@class="a"]/ancestor::li/text()').extract()).replace('\n','').strip()
In [3]: el
Out[3]: 'Text1 Text7'
In [4]: import re
In [5]: txt = re.sub(r'\s+',' ',el)
In [6]: txt
Out[6]: 'Text1 Text7'
In [7]: sel =' '.join(selector.xpath('//*[@class="a"]/ancestor::ul//li//text()').extract()).replace('\n','').strip(
...: )
In [8]: sel
Out[8]: 'Text2 Text3 Text4 Text5 Text6 . Text7 Text8 Text9 Text10 .'
In [9]: txt2 = re.sub(r'\s+',' ',sel)
In [10]: txt2
Out[10]: 'Text2 Text3 Text4 Text5 Text6 . Text7 Text8 Text9 Text10 .'
Answered By - F.Hoque