Wednesday, December 7, 2022

[FIXED] Find keyword from a list in a page using BeautifulSoup


Issue

Using Beautiful Soup, I'd like to detect porn keywords in a web page. I build the keyword list by concatenating two lists of porn keywords (one in French, the other in English).

Here's my code (from BeautifulSoup find two different strings):

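import urllib.request
from bs4 import BeautifulSoup

# my_proxies: a dict mapping protocol names to proxy URLs, defined elsewhere
proxy_support = urllib.request.ProxyHandler(my_proxies)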
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
lst_porn_keyword_eng = str(urllib.request.urlopen("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt").read()).split('\\n')
# The text file starts with a LF; delete the leftover first element.
if lst_porn_keyword_eng[0] == "b\"":
    del lst_porn_keyword_eng[0]
lst_porn_keyword_fr = str(urllib.request.urlopen("https://raw.githubusercontent.com/darwiin/french-badwords-list/master/list.txt").read()).split('\\n')

lst_porn_keyword = lst_porn_keyword_eng + lst_porn_keyword_fr
lst_porn_keyword_found = []

with urllib.request.urlopen("http://www.example.com") as page_to_check:
    soup = BeautifulSoup(page_to_check, "html5lib")
    for node in soup.find_all(text=lambda text: any(x in text for x in lst_porn_keyword)):
        lst_porn_keyword_found.append(str(node.text))

return lst_porn_keyword_found

This code runs without errors, but keywords are reported even when they shouldn't be. For instance, the text of the second node found in "http://www.example.com" is "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.", and none of these words are in lst_porn_keyword.

Solution

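Your soup.find_all(text=...) call searches every text node in the document, including the contents of the <style> element, so it returns the page's CSS rather than only the visible HTML text: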

    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
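
One way to keep the CSS out of the search is to skip text nodes whose parent tag is never rendered. The following is only a sketch under assumptions: SAMPLE_HTML is a made-up page standing in for http://www.example.com, visible_strings and NON_VISIBLE_PARENTS are hypothetical names, and the built-in "html.parser" is used instead of html5lib so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for http://www.example.com
SAMPLE_HTML = """
<html>
  <head>
    <title>Example Domain</title>
    <style>div { margin: 5em auto; }</style>
  </head>
  <body>
    <div><p>This domain is for use in illustrative examples.</p></div>
  </body>
</html>
"""

# Tags whose text content is not shown to the reader (an assumed, non-exhaustive set)
NON_VISIBLE_PARENTS = {"style", "script", "noscript", "template"}


def visible_strings(html):
    """Return the stripped text nodes that do not sit inside <style>, <script>, etc."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        str(node).strip()
        for node in soup.find_all(string=True)
        # drop CSS/JS text and whitespace-only nodes
        if node.parent.name not in NON_VISIBLE_PARENTS and node.strip()
    ]


print(visible_strings(SAMPLE_HTML))
```

The keyword check can then run over the returned strings only, so a rule such as "margin" in the stylesheet is never inspected.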

The words "color" and "gin" and the character " appear both in lst_porn_keyword and in the CSS, which triggered your detection.
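
The stray " entry most likely comes from calling str() on the bytes returned by read(): that yields the bytes repr, so the b prefix and the surrounding quotes become part of the text. A small sketch of the effect follows; raw is a made-up payload, not the real word list, and UTF-8 is an assumed encoding.

```python
# Hypothetical payload standing in for the downloaded word list:
# it starts with a line feed and contains an apostrophe.
raw = b"\nfoo\nbar's\n"

# str() on bytes returns the repr, so each newline becomes a literal
# backslash + "n", and the b prefix and quotes end up inside the text.
via_str = str(raw).split("\\n")

# Decoding first and splitting on real line breaks gives clean entries;
# the comprehension also drops blank lines.
via_decode = [line.strip() for line in raw.decode("utf-8").splitlines() if line.strip()]

print(via_str)
print(via_decode)

# The leftover quote entry matches any text that contains a double quote.
print(via_str[-1] in 'font-family: "Segoe UI";')
```

Decoding the response before splitting removes the need to delete the first list element by hand and avoids the lone quote entry at the end.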

Partial matches such as "gin" in "margin" are a second problem: the test x in text is a plain substring check, so it ignores word boundaries. Consider using regular expressions with word delimiters, as in the example below:

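import re

# node is one text node from the soup, e.g. from soup.find_all(string=True)
text = str(node)
for word in lst_porn_keyword:
    # skip empty entries; re.escape() neutralises regex metacharacters in the keyword,
    # and \b also matches at the start and end of the text, unlike \W
    if word and re.search(rf"\b{re.escape(word)}\b", text):
        print(f"detected in text: {word}")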
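
For a version that can be run on its own, here is a minimal sketch of whole-word matching wrapped in a function. The name find_keywords and the sample keywords are illustrative assumptions, and the lookarounds (?<!\w) and (?!\w) are one possible choice of delimiter that also copes with keywords beginning or ending in punctuation.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for word in keywords:
        if not word.strip():
            continue  # an empty keyword would match everywhere
        # (?<!\w) and (?!\w) require that no word character touches the match,
        # which also works at the very start or end of the text.
        pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(word)
    return found


keywords = ["gin", "color", ""]
print(find_keywords("margin: 0; background-color: #f0f0f2;", keywords))
print(find_keywords("A glass of gin.", keywords))
```

Note that "color" is still reported for the CSS line, because the hyphen in "background-color" counts as a delimiter. Word boundaries remove partial matches such as "margin", but the stylesheet text still has to be excluded separately.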

Answered By - krasnapolsky

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
