Issue
I want to insert whitespace into the resulting text when I strip tags and extract text using lxml. I don't really know lxml. Via this answer (which seems based on a comment on the same page from @bluu), I have the following:
import lxml.html
def strip_html(s):
    return str(lxml.html.fromstring(s).text_content())
When I try it with this:
strip_html("<p>This what you want.</p><p>This what you get.</p>")
I get this:
'This what you want.This what you get.'
But I want this:
'This what you want. This what you get.'
What I really want is the equivalent of this:
from bs4 import BeautifulSoup
s = "<p>This what you want.</p><p>This what you get.</p>"
BeautifulSoup(s, "lxml").get_text(separator=" ")
which does give the desired output - for all tags - but I want to avoid the amazing BeautifulSoup in this case. I also want it to work for all tags, and without my having to spell out all the tags, or loop and search for particular characters etc. I have looked at the code of bs4's element.py to try to adapt the separator and I see it's not a simple matter. I was also looking at lxml.html.clean as in this answer.
Solution
You could select all elements that contain text, iterate over them, and join() the resulting list with a separator:
s = "<p>This what you want.</p><p>This what you get.</p>"
' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])
Example
import lxml.html
def strip_html(s):
    return ' '.join([e.text_content() for e in lxml.html.document_fromstring(s).xpath("//*[text()]")])
strip_html("<p>This what you want.</p><p>This what you get.</p>")
Output
This what you want. This what you get.
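One caveat worth noting: with nested markup, the `//*[text()]` approach can repeat text, because `text_content()` on a parent already includes the text of its children (for example, `<p>Hello <b>world</b></p>` would give `Hello world world`). A minimal sketch of a variant that selects the text nodes themselves with `//text()` is shown below. The `separator` parameter and the decision to strip and skip whitespace-only chunks are assumptions made for illustration, not part of the original answer.

```python
import lxml.html


def strip_html(s, separator=" "):
    # document_fromstring wraps the fragment in <html><body>, so it also
    # accepts several sibling top-level elements.
    root = lxml.html.document_fromstring(s)
    # //text() selects every text node exactly once, in document order,
    # so text inside nested elements is not repeated.
    chunks = (t.strip() for t in root.xpath("//text()"))
    # Assumption: drop empty / whitespace-only chunks before joining.
    return separator.join(c for c in chunks if c)


print(strip_html("<p>This what you want.</p><p>This what you get.</p>"))
print(strip_html("<p>Hello <b>world</b></p>"))
```

Because each chunk is stripped and then joined, a separator is also inserted between inline fragments such as `foo<b>bar</b>`, which is similar in spirit to `get_text(separator=" ")` in BeautifulSoup; whether that is desirable depends on the input.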
Answered By - HedgeHog