Issue
I am trying to scrape reviews from Amazon. The reviews can span multiple pages, so to scrape more than one page I construct a list of links, which I later scrape one by one:
# Construct list of links to scrape multiple pages
links = []
for x in range(1, 5):
    links.append(f'https://www.amazon.de/-/en/SanDisk-microSDHC-memory-adapter-performance/product-reviews/B08GY9NYRM/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
I then use requests and Beautiful Soup to obtain the raw review data, as below:
import requests
from bs4 import BeautifulSoup

# Scrape all links in the constructed list
# (HEADERS is the dict of request headers; its contents are not shown here)
reviews = []
for link in links:
    html = requests.get(link, headers=HEADERS)
    if html.status_code == 200:
        # HTTP response was successful
        soup = BeautifulSoup(html.text, 'html.parser')
        results = soup.find_all('span', {'data-hook': 'review-body'})
        print(len(results))
        for review in results:
            reviews.append(review.text.replace('\n', ''))
    else:
        # HTTP response was unsuccessful
        print('[BAD HTML RESPONSE] Response Code =', html.status_code)
Each page contains 10 reviews. I receive all 10 reviews for the first page (&pageNumber=1), but for every following page I receive none.
When I inspect the corresponding soup objects, I can't find the review information. Why is this?
I also tried scraping only page 2 outside of the for loop, but no review information was returned.
Two months ago the same code worked on over 80 pages. I do not understand why it is not working now. Has Amazon changed something? Thanks for your time and help!
Solution
I came across exactly the same problem. After a bit of research, it turns out you need to send a proper set of headers (not just the User-Agent). I'm not sure which headers you used, but this works for me:
Go to http://httpbin.org/get in your browser. Copy everything under "headers", remove "Host", and use the result as your headers.
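A minimal sketch of that step, assuming you have pasted the JSON your browser shows at http://httpbin.org/get into a string. The helper name headers_from_httpbin and the sample values below are hypothetical placeholders, not real browser output; use what your own browser displays.

```python
import json

# Hypothetical sample of what a browser might show at http://httpbin.org/get.
# The values are placeholders; paste your own browser's output instead.
HTTPBIN_RESPONSE = """
{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (placeholder)"
  },
  "url": "http://httpbin.org/get"
}
"""


def headers_from_httpbin(raw_json, drop=("Host",)):
    """Return the "headers" object from the pasted JSON, minus the names in drop."""
    headers = json.loads(raw_json)["headers"]
    dropped = {name.lower() for name in drop}
    # Header names are case-insensitive, so compare them in lowercase.
    return {k: v for k, v in headers.items() if k.lower() not in dropped}


# Everything under "headers" except "Host".
HEADERS = headers_from_httpbin(HTTPBIN_RESPONSE)
```

The resulting HEADERS dict can then be passed as headers=HEADERS to requests.get, as in the question. The JSON should come from the browser rather than from a requests call to httpbin, since the latter would only echo the headers that requests itself sends.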
Hopefully, this works for you!
Answered By - renazxcv