Issue
I researched but couldn't find an answer to my question: I want to get the main content while ignoring the commented-out content. How should I do this?
<td>
<!--
<i class="fab fa-youtube" aria-hidden="true" style="color: #f00;"></i>
-->
main content
</td>
My Scrapy spider looks like:
'name': row.xpath('td[2]/text()').get()
But this code gives me only whitespace such as \n\t. Please help, thank you.
Solution
When /text() in XPath or ::text in CSS fails to produce the desired result, I use another library, html2text, which converts an HTML fragment into plain (Markdown-style) text.
To install it:
pip3 install html2text
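from html2text import HTML2Text
# Configure the converter once so that it emits plain text without Markdown markup
h = HTML2Text()
h.ignore_links = True      # do not emit Markdown link syntax
h.ignore_images = True     # skip images
h.ignore_emphasis = True   # no * or _ markers for bold/italic text
# Inside the Scrapy project
# Pass the whole cell's HTML (not just its text nodes) and trim the surrounding whitespace
name = h.handle(row.xpath('td[2]').get()).strip()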
yield ....
Answered By - Ahmed Ellban