Issue
I'm trying to scrape pages where some sites have a plain H-tag and other sites have other tags nested inside the H-tag.
Some examples:
<h1>Text</h1>
<h1><a href="">Text</a></h1>
<h1><span>Text</span></h1>
<h1><span>Text</span><span>Text2</span></h1>
And many more...
Must I write a check for every HTML tag myself, or is there a nice way to do this in Scrapy?
A nasty and unwanted way would be:
import json

h1 = response.xpath('//h1').extract()
if '<a' in h1[0]:
    h1 = json.dumps(response.xpath('//h1/a/text()').extract(), ensure_ascii=False)
elif '<span' in h1[0]:
    h1 = json.dumps(response.xpath('//h1/span/text()').extract(), ensure_ascii=False)
else:
    h1 = json.dumps(response.xpath('//h1/text()').extract(), ensure_ascii=False)
Solution
A nice way is to use the XPath string() function, which returns the concatenated text of the first matching node and all of its descendants:
response.xpath('string(//h1)').extract_first()
Answered By - gangabass