Issue
I am working on a web-scraping project. When the code runs, it first saves the links of around 100 products into a list called current_products.
After some time this is done again, but the links are saved into another list called new_products. Then I compare them: either they are the same (no new products listed on the website), or the new_products list differs from the current_products list and an item was added. I have the following code:
if (new_products == current_products):
    print("No new items found. Trying again in 5 seconds...")
    new_products = []
else:
    print("New products found")
    products_new = set(new_products) - set(current_products)
    print(products_new)
As test data, I put a duplicate link into new_products. The code reports that the lists are different, but then it prints no new product link (the difference between the lists), just an empty set (set()). Any idea how to fix it?
EDIT: It doesn't work with DUPLICATE test data. When I put in a brand-new link (one that is not in either list), it works perfectly. Is there a way to IGNORE duplicates?
Solution
Here is a trivial example of how this could be happening. List equality takes element order and duplicates into account, while set difference ignores both:
>>> a = [1, 2]
>>> b = [2, 1]
>>> a == b
False
>>> set(a) - set(b)
set()
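The duplicate link described in the question triggers the same mismatch. A minimal sketch, assuming made-up placeholder URLs (they are not from the question):

```python
# Hypothetical placeholder URLs, used only for illustration.
current_products = ["https://example.com/p/1", "https://example.com/p/2"]

# Same links as before, but the first one appears twice.
new_products = current_products + ["https://example.com/p/1"]

# List equality sees the extra element, so the lists differ.
lists_equal = new_products == current_products

# Set difference ignores duplicates, so nothing is left over.
difference = set(new_products) - set(current_products)

print(lists_equal)  # False
print(difference)   # set()
```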
To avoid this, either sort the lists before comparing them (this handles reordering but not duplicates), or just compare sets to start with:
ds = set(a) - set(b)
if ds:
    print('diffs')
else:
    print('no diffs')
Building a set is more efficient for large data than sorting because it is an O(n) operation on average, versus O(n log n) for sorting.
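Applied to the scraper in the question, the set-based check might look like the sketch below. The helper name find_new_products and the example URLs are assumptions made for illustration; only the list names come from the question. Unlike a plain set difference, this version also keeps the links in the order they were scraped.

```python
def find_new_products(current_products, new_products):
    """Return links in new_products that are not in current_products.

    Duplicates and ordering differences are ignored; the result keeps
    the order in which each new link was first seen.
    """
    seen = set(current_products)
    fresh = []
    for link in new_products:
        if link not in seen:
            seen.add(link)  # also filters repeats within new_products
            fresh.append(link)
    return fresh


# Hypothetical placeholder URLs, used only for illustration.
current = ["https://example.com/p/1", "https://example.com/p/2"]
duplicate_only = current + ["https://example.com/p/1"]
with_new_item = current + ["https://example.com/p/3", "https://example.com/p/3"]

print(find_new_products(current, duplicate_only))  # []
print(find_new_products(current, with_new_item))   # ['https://example.com/p/3']
```

An empty result means no new items were found, so the caller can branch on `if fresh:` instead of comparing the two lists directly.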
Answered By - Mad Physicist