Issue
I'm building a news parser that summarizes news from different sites and generates keywords based on the article content. Most news sources wrap the content inside the <article> tag, so I'm extracting that tag to get the content.
The problem is that Beautiful Soup returns the raw HTML inside the <article> tag, which often contains images, links, and tags like <b>. My question is: is there a simple way to get the written content of the page as a user sees it, ignoring everything that isn't text? The only approach I have is looping through every tag inside the article and checking its inner HTML for text content (see the sketch after this list). The reasons I haven't already done that are:
- there may be tags nested inside other tags which I'd need to parse;
- there are tags I'd need to ignore, such as script tags, which the browser doesn't display;
- there may be a built-in way to do this in Beautiful Soup or another HTML-focused library.
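To make the manual approach concrete, here is a minimal sketch of what I mean, assuming BeautifulSoup 4 (the sample HTML and the list of tags to strip are illustrative):

from bs4 import BeautifulSoup

html = "<article><p>hello <b>world</b></p><script>var x = 1;</script></article>"
soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")
# remove tags the browser doesn't display before extracting text
for tag in article(["script", "style"]):
    tag.decompose()
print(article.get_text(separator=" ", strip=True))  # prints: hello world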
For example, the following <p> tag
<p>
hello <b>world</b> <br> <img src="world.png">. fine <a href="#"> day </a> isn't it?
</p>
would become
hello world. fine day isn't it?
So, is there a better way to extract the page's text using Beautiful Soup or another HTML parsing library? Note: I don't care about rendering JS; script tags can simply be ignored.
Solution
I ended up using html2text. It skips the text inside script tags (BeautifulSoup's getText doesn't) and handles nested HTML.
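For illustration, a minimal sketch of how this can be used, assuming the html2text package is installed (the options shown are attributes of its HTML2Text class; the sample HTML is from the question):

import html2text

html = '<p>hello <b>world</b> <br> <img src="world.png">. fine <a href="#"> day </a> isn\'t it?</p>'

h = html2text.HTML2Text()
h.ignore_links = True     # keep anchor text, drop the link markup
h.ignore_images = True    # drop <img> tags entirely
h.ignore_emphasis = True  # drop the *...* markers produced for <b>/<i>
h.body_width = 0          # don't hard-wrap the output
print(h.handle(html).strip())  # roughly: hello world . fine day isn't it?

There is also a module-level shortcut, html2text.html2text(html), which converts with default options if you don't need to tweak anything.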
Answered By - Samuel