Issue
I am trying to find keywords on a webpage using a spider (web crawler) that stores each matching keyword against the URL in a CSV file. The issue is that if a keyword appears multiple times on the same page, the URL is duplicated in the CSV file. How do I remove the duplicate links for a keyword?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item


class BuzzwordSpider(CrawlSpider):  # class and spider name are illustrative; the question omits them
    name = "buzzwords"
    allowed_domains = ["www.geo.tv"]
    start_urls = ["https://www.geo.tv/"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]
    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        crawl_count = self.__class__.crawl_count

        wordlist = [
            "Imran",
            "Hello",
            "Nauman",
        ]

        url = response.url
        contenttype = response.headers.get("content-type", b"").decode("utf-8").lower()
        data = response.body.decode("utf-8")

        for word in wordlist:
            # find_all_substrings (defined elsewhere) yields every position
            # at which `word` occurs in `data`
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                ok = False
                if not ok:  # always True, so every occurrence prints another row
                    self.__class__.words_found += 1
                    print(word + ";" + url + ";")
        return Item()
Solution
I am not entirely sure what you're asking, but it sounds like all you need to do is stop iterating over the full iterable returned by find_all_substrings. Just break out of the inner loop after the first match, since you know all the additional iterations would only produce duplicates.

For example:
for word in wordlist:
    substrings = find_all_substrings(data, word)
    for pos in substrings:
        self.__class__.words_found += 1
        print(word + ";" + url + ";")
        break  # stop after the first occurrence of this word on the page
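
If the same keyword/URL pair can also be reached more than once during the crawl (for example, when the same page is linked from several places), a small variation is to remember which (word, url) pairs have already been written and append only new rows to the CSV file. The sketch below is not part of the original answer: the seen_pairs set, the record_match name, and the matches.csv file name are all assumptions for illustration.

import csv

# Illustrative helper (not from the original answer): remembers which
# (word, url) pairs have already been written and appends only new rows.
seen_pairs = set()

def record_match(word, url, path="matches.csv"):
    if (word, url) in seen_pairs:
        return  # this keyword was already recorded for this URL
    seen_pairs.add((word, url))
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([word, url])

Inside check_buzzwords, the loop from the answer would then call record_match(word, url) instead of print, still breaking out of the inner loop after the first occurrence.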
Answered By - Alexander