Issue
Hi all, I would like to extract all the text from an HTML block using XPath in Scrapy.
Let's say we have a block like this:
<div>
<p>Blahblah</p>
<p><a>Bluhbluh</a></p>
<p><a><span>Bliblih</span></a></p>
</div>
I want to extract the text as ["Blahblah", "Bluhbluh", "Bliblih"]. I want XPath to recursively look for text in the div node.
I have tried: //div/p[descendant-or-self::*]/text()
but it does not extract the text of nested elements.
Cheers! Seb
Solution
You can use XPath's string() function on each p element:
>>> import scrapy
>>> selector = scrapy.Selector(text="""<div>
... <p>Blahblah</p>
... <p><a>Bluhbluh</a></p>
... <p><a><span>Bliblih</span></a></p>
... </div>""")
>>> [p.xpath("string()").extract() for p in selector.xpath('//div/p')]
[[u'Blahblah'], [u'Bluhbluh'], [u'Bliblih']]
>>> import operator
>>> map(operator.itemgetter(0), [p.xpath("string()").extract() for p in selector.xpath('//div/p')])
[u'Blahblah', u'Bluhbluh', u'Bliblih']
>>>
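The reason string() works where text() did not is that the string-value of an element is the concatenation of all of its descendant text nodes, whereas p/text() only selects text nodes that are direct children of p. As a rough illustration of that difference, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Scrapy. The helper names paragraph_texts and direct_texts are hypothetical, and parsing with an XML parser is an assumption that only holds because this snippet happens to be well-formed.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question; it is well-formed, so an XML parser accepts it.
HTML = """<div>
<p>Blahblah</p>
<p><a>Bluhbluh</a></p>
<p><a><span>Bliblih</span></a></p>
</div>"""


def paragraph_texts(markup):
    """Return one string per direct <p> child: all descendant text, joined."""
    root = ET.fromstring(markup)
    # itertext() walks the element and its descendants in document order,
    # which mirrors what XPath's string() concatenates.
    return ["".join(p.itertext()) for p in root.findall("p")]


def direct_texts(markup):
    """Return only the text sitting directly inside each <p>, before any child."""
    root = ET.fromstring(markup)
    # p.text is None when <p> starts with a child element; this is only a
    # rough analogue of selecting p/text() in XPath.
    return [p.text for p in root.findall("p")]


print(paragraph_texts(HTML))
print(direct_texts(HTML))
```

Only the first helper recovers the nested text. The second one finds text in the first paragraph alone, which matches the behaviour described in the question.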
Answered By - paul trmbrth