Issue
I am scraping the prices of the laptops on the first page of Amazon's search results. It scrapes all of the prices, but they are not in the same order as they appear on the website. What could be the problem?
Here is my code, which also contains the link to the page:
import requests
from bs4 import BeautifulSoup
URL = "https://www.amazon.com/s?k=laptop&crid=288NMI7Z5E2WR&sprefix=laptop%2Caps%2C572&ref=nb_sb_noss_1"
header = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
"Accept-Language": "hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7"
}
response = requests.get(URL, headers=header)
web_page = response.text
soup = BeautifulSoup(web_page, 'lxml')
boxs = soup.find_all("div", class_="puisg-col puisg-col-4-of-12 puisg-col-8-of-16 puisg-col-12-of-20 puisg-col-12-of-24 puis-list-col-right")
for box in boxs:
    name = box.find("span", class_="a-size-medium a-color-base a-text-normal").getText()
    price = box.find("span", class_="a-offscreen").getText()
    print(price)
And here is a snapshot of the prices that I get:
Solution
Websites like Amazon rank products with algorithms that take into account things like previous behavior on the site, location, and search history (dynamic content personalization). When you scrape such a website with a script that stores no cookies, each request is likely to be treated as coming from a different "user", so the ordering can differ from what your browser shows.
You can, however, use a session that keeps cookies between requests to simulate more persistent user behavior. Keep in mind that even with a session, if the shop's algorithm decides to shuffle the product listings, the order of the results you scrape may still not match what you see in your browser. A session can make the results more consistent, but most probably not completely consistent.
You can try it out:
import requests
from bs4 import BeautifulSoup
# Create a session object
s = requests.Session()
# Set headers
s.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9"
})
URL = "https://www.amazon.com/s?k=laptop&crid=288NMI7Z5E2WR&sprefix=laptop%2Caps%2C572&ref=nb_sb_noss_1"
# Use the session to get the page content
response = s.get(URL)
web_page = response.text
soup = BeautifulSoup(web_page, 'lxml')
# Define the correct class or tag structure to find the elements you want
# This is just an example and might not match Amazon's current page structure
boxes = soup.find_all("div", {"class": "YOUR-CSS-CLASS-FOR-ITEM-CONTAINER"})
for box in boxes:
    name = box.find("span", {"class": "a-size-medium a-color-base a-text-normal"}).getText()
    price = box.find("span", {"class": "a-offscreen"}).getText()
    print(price)
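When comparing the scraped order with the browser, it can help to print each item's position and name next to its price, and to handle containers that have no price element: `find` returns `None` in that case, and calling `getText()` on `None` raises an `AttributeError`. The following is only a minimal sketch under stated assumptions. The sample HTML, the `item` and `title` class names, and the `extract_items` helper are hypothetical stand-ins and do not reflect Amazon's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a results page.
# The real Amazon markup is different and changes often.
SAMPLE_HTML = """
<div class="item"><span class="title">Laptop A</span><span class="a-offscreen">$499.00</span></div>
<div class="item"><span class="title">Laptop B</span></div>
<div class="item"><span class="title">Laptop C</span><span class="a-offscreen">$899.00</span></div>
"""


def extract_items(html, container_class="item", title_class="title",
                  price_class="a-offscreen"):
    """Return (position, name, price) tuples in document order.

    A missing name or price is reported as None instead of raising.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    containers = soup.find_all("div", class_=container_class)
    for position, box in enumerate(containers, start=1):
        name_tag = box.find("span", class_=title_class)
        price_tag = box.find("span", class_=price_class)
        name = name_tag.get_text(strip=True) if name_tag else None
        price = price_tag.get_text(strip=True) if price_tag else None
        rows.append((position, name, price))
    return rows


items = extract_items(SAMPLE_HTML)
for position, name, price in items:
    print(position, name, price)
```

With the name and position printed alongside each price, it is easier to tell whether the page itself returned the items in a different order or whether some containers, for example ones without a price, are shifting the list.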
Answered By - Miles