Issue
I have looked through various other questions, but none seem to fit the bill, so here goes.
I have a list of words:
l = ['red','green','yellow','blue','orange']
I also have the source code of a web page in another variable. I am using the requests library:
import requests
url = 'https://google.com'
response = requests.get(url)
source = response.text  # decoded str; response.content is bytes, which a str regex pattern cannot search in Python 3
I then created a substring lookup function like so
import re

def find_all_substrings(string, sub):
    # Start index of every occurrence of sub in string
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts
I now look up the words using the following code, which is where I am stuck:
for word in l:
    substrings = find_all_substrings(source, word)
    new = []
    for pos in substrings:
        ok = False
        if not ok:
            print(word + ";")
            if word not in new:
                new.append(word)
                print(new)
    page['words'] = new
My ideal output looks like the following
Found words - ['red', 'green']
Solution
If all you want is a list of words that are present, you can avoid most of the regex processing and just use
found_words = [word for word in target_words if word in page_content]
(I've renamed your string -> page_content and l -> target_words.)
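As a minimal sketch of that one-liner in action, the snippet below uses a made-up page_content string in place of the real page source (the sample HTML is an assumption for illustration, not the asker's data):

```python
# Hypothetical sample text standing in for the downloaded page source.
page_content = "<p>The red car passed a green light.</p>"
target_words = ['red', 'green', 'yellow', 'blue', 'orange']

# Keep each target word that occurs anywhere in the page text.
found_words = [word for word in target_words if word in page_content]

print("Found words -", found_words)
```

Note that `in` is a plain substring test, so 'red' would also be reported for a page that only contains "hundred". The result keeps the order of target_words and never contains duplicates, because each target word is tested exactly once.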
If you need additional information or processing (e.g. the regexes or a BeautifulSoup parser) and end up with a list of items that you need to deduplicate, you can just run it through a set() call. If you need a list instead of a set, convert it back with list(); if you want a guaranteed (alphabetical) order for found_words, use sorted(). Any of the following should work fine:
found_words = set(possibly_redundant_list_of_found_words)
found_words = list(set(possibly_redundant_list_of_found_words))
found_words = sorted(set(possibly_redundant_list_of_found_words))
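To make the difference between the three variants concrete, here is a small sketch on a made-up list of findings. The last line is not part of the answer above: dict.fromkeys() is an alternative I am adding for the case where first-seen order matters, and it relies on dicts preserving insertion order (Python 3.7+).

```python
# Hypothetical findings containing repeats.
possibly_redundant_list_of_found_words = ['red', 'green', 'red', 'green', 'red']

# A set: duplicates removed, no defined order.
as_set = set(possibly_redundant_list_of_found_words)

# A list again, but the element order is whatever the set yields.
as_list = list(set(possibly_redundant_list_of_found_words))

# A list in alphabetical order.
as_sorted = sorted(set(possibly_redundant_list_of_found_words))

# Alternative (an addition, not from the answer): keep first-seen order.
in_first_seen_order = list(dict.fromkeys(possibly_redundant_list_of_found_words))

print(as_sorted)
print(in_first_seen_order)
```

Only the sorted() and dict.fromkeys() variants give a reproducible order from run to run.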
If you've got some sort of data structure you're parsing (because BeautifulSoup & regex can provide supplemental information about position & context, and you might care about those), then just define a custom function extract_word_from_struct() which extracts the word from that structure, call it inside a comprehension, and deduplicate the result with set():
possibly_redundant_list_of_found_words = [extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings]
found_words = set(word for word in possibly_redundant_list_of_found_words if word in target_words)
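One way this could look, assuming the "structs" are re.Match objects produced by re.finditer() (the sample text and the [a-z]+ pattern are illustrative assumptions, and extract_word_from_struct() is whatever accessor fits your own structure):

```python
import re

target_words = ['red', 'green', 'yellow', 'blue', 'orange']
# Hypothetical sample text standing in for the page source.
page_content = "red green red purple green"

def extract_word_from_struct(struct):
    """Return the matched text from a re.Match object (assumed struct type)."""
    return struct.group(0)

# Each finding is a Match object that also carries position info via .start().
possibly_redundant_list_of_findings = list(re.finditer(r"[a-z]+", page_content))

possibly_redundant_list_of_found_words = [
    extract_word_from_struct(struct)
    for struct in possibly_redundant_list_of_findings
]

# Keep only the words we care about, dropping duplicates.
found_words = set(
    word for word in possibly_redundant_list_of_found_words
    if word in target_words
)

print(sorted(found_words))
```

Because this variant compares whole matched tokens against target_words, "purple" is dropped and a token such as "hundred" would not count as 'red', unlike the plain substring test.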
Answered By - Sarah Messer