Issue
I want to use Scrapy to extract the text of any text-bearing tag (h1, p, span, strong, and so on) inside the section tag, while ignoring others such as img:
<section>
<h1>text</h1>
<h2>text</h2>
<span>text</span>
<img>text</img>
<p>text</p>
<p>text</p>
<p>text</p>
</section>
My starting code looks something like this:
import scrapy


class example(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ['example']  # placeholder for the real start URL

    def parse(self, response):
        self.log('//////////////////////////////////////////////////////////////')
        section = response.xpath('//section')
        # Only the text of <p> tags is selected here.
        for p in section.xpath('.//p/text()'):
            self.log('//////////////////////////////////////////////////////////////')
            self.log(p.extract())
As I said, instead of selecting only p tags, I need to get the text of every text tag. Is there a way to do this?
Solution
In this case, a straightforward option is to loop over every HTML tag inside the section and filter by its name:
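def parse(self, response):
    # Tag names whose text should be collected.
    req_tags = ['h1', 'p', 'span', 'strong']
    section_selector = response.css('section')
    for section in section_selector:
        texts = []
        # '*' matches every descendant element of the section.
        for tag in section.css('*'):
            # tag.root is the underlying lxml element.
            if tag.root.tag in req_tags:
                texts = texts + tag.css('*::text').getall()
        self.log(texts)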
With this approach, every tag name you want to keep must be listed explicitly in the req_tags list.
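To check the filtering logic without running a crawl, the same loop can be sketched offline. This is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors (possible here only because the sample snippet happens to be well-formed XML), and SAMPLE, REQ_TAGS and collect_text are hypothetical names that do not appear in the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the markup in the question.
SAMPLE = """<section>
<h1>Heading</h1>
<h2>Subheading</h2>
<span>Inline</span>
<img>ignored</img>
<p>One</p>
<p>Two</p>
</section>"""

# Same tag names as req_tags in the answer.
REQ_TAGS = ("h1", "p", "span", "strong")


def collect_text(markup, req_tags=REQ_TAGS):
    """Return the leading text of every wanted tag inside each <section>."""
    root = ET.fromstring(markup)
    texts = []
    for section in root.iter("section"):
        # iter() walks the section and all of its descendants in document order.
        for element in section.iter():
            # Keep only wanted tags that actually carry non-blank text.
            if element.tag in req_tags and element.text and element.text.strip():
                texts.append(element.text.strip())
    return texts


if __name__ == "__main__":
    print(collect_text(SAMPLE))
```

Because h2 is not in REQ_TAGS, its text is skipped along with img; adding "h2" to the tuple would include it. Real pages are rarely well-formed XML, so inside a spider the Scrapy selectors from the answer remain the tool to use.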
Answered By - Georgiy