Issue
Using Beautiful Soup, I'd like to detect porn keywords in a web page. I build the keyword list by concatenating two lists of porn keywords (one in French, the other in English).
Here's my code (from BeautifulSoup find two different strings):
proxy_support = urllib.request.ProxyHandler(my_proxies)
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
lst_porn_keyword_eng = str(urllib.request.urlopen("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt").read()).split('\\n')
# the textfile starts with a LF, deleting it.
if lst_porn_keyword_eng[0] == "b\"":
    del lst_porn_keyword_eng[0]
lst_porn_keyword_fr = str(urllib.request.urlopen("https://raw.githubusercontent.com/darwiin/french-badwords-list/master/list.txt").read()).split('\\n')
lst_porn_keyword = lst_porn_keyword_eng + lst_porn_keyword_fr
lst_porn_keyword_found = []
with urllib.request.urlopen("http://www.example.com") as page_to_check:
    soup = BeautifulSoup(page_to_check, "html5lib")
    for node in soup.find_all(text=lambda text: any(x in text for x in lst_porn_keyword)):
        lst_porn_keyword_found.append(str(node.text))
return lst_porn_keyword_found
This code runs without errors, but keywords are reported even when they shouldn't be. For instance, the text of the second node found in "http://www.example.com" is "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission." and none of these words are in lst_porn_keyword.
Solution
Your soup.find_all() call matches every text node in the document, not just the visible HTML text. That includes the CSS inside the page's style element:
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
}
div {
    width: 600px;
    margin: 5em auto;
    padding: 2em;
    background-color: #fdfdff;
    border-radius: 0.5em;
    box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    div {
        margin: 0 auto;
        width: auto;
    }
}
The words "color" and "gin" and the character " appear both in lst_porn_keyword and in the CSS, which triggered your detection.
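To illustrate how a lone " can end up in the keyword list, and why "gin" fires on CSS, here is a small self-contained sketch. The byte string raw is a made-up stand-in for a downloaded word list (it is not the real file), css is a shortened sample, and decode("utf-8") assumes the list is UTF-8 encoded.

```python
# Hypothetical stand-in for the downloaded file: it starts with a LF and
# contains an apostrophe, so the bytes repr is wrapped in double quotes.
raw = b"\ngin\nit's\n"

# What the question does: str() on bytes returns the repr, not the text,
# so the b" prefix and the closing quote become list entries.
hacked = str(raw).split("\\n")
print(hacked)  # ['b"', 'gin', "it's", '"']

# Decoding first gives clean entries; the filter drops blank lines.
clean = [w for w in raw.decode("utf-8").splitlines() if w]
print(clean)  # ['gin', "it's"]

# A plain substring test then fires on CSS text.
css = 'body { margin: 0; font-family: "Segoe UI"; }'
hits = [w for w in hacked if w in css]
print(hits)  # ['gin', '"']
```

Decoding the response and calling splitlines() removes the stray quote entries, but "gin" still matches inside "margin" as long as the test is a substring check.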
Partial matches, like "gin" inside "margin", are also a problem with the substring test passed to soup.find_all(). Consider using regular expressions with word delimiters, as in the example below:
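import re  # the standard-library module is enough here

for word in lst_porn_keyword:
    # \b matches a word boundary, including the start and end of the string;
    # re.escape() keeps punctuation in a keyword from being read as regex syntax.
    if re.search(rf"\b{re.escape(word)}\b", str(node)):
        print(f"detected in text: {word}")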
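As a rough sketch, the same idea can be wrapped in a reusable function. The name find_keywords and its parameters are hypothetical, not part of Beautiful Soup or of the original code. It uses lookarounds rather than requiring a literal delimiter character on each side, so a keyword at the very start or end of the text also matches, and it skips empty entries in the list.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words (case-insensitive)."""
    found = []
    for word in keywords:
        if not word.strip():
            continue  # skip empty entries, which would otherwise match anywhere
        # (?<!\w) and (?!\w) require that no word character touches the keyword,
        # without consuming the neighbouring characters.
        pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(word)
    return found

print(find_keywords("margin: 0 auto;", ["gin", "auto"]))  # ['auto']
print(find_keywords("A gin and tonic", ["gin"]))          # ['gin']
```

Calling it as find_keywords(str(node), lst_porn_keyword) inside the question's loop would report only whole-word hits for each text node.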
Answered By - krasnapolsky