Issue
I'm trying to scrape this page: https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy= and it works fine on the first page, but when I try to go through the other pages, it scrapes the first page's items over and over. I tried changing the URL to https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy= (page=2) so it goes to the second page; that works in the browser but not in my code. I also tried scraping the link off the Next Page button on the site, but the issue remains.
Here's my code (scraping the link from the Next Page button on the site):
import requests
from bs4 import BeautifulSoup

product_links = []
next_page = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy='
for count in range(1, 3):
    product_response = requests.get(next_page, timeout=3)
    soup = BeautifulSoup(product_response.text, 'html5lib')
    products = soup.find_all('div', class_='product_name')
    print('-' * 200)
    print('Printing contents from ' + next_page)
    print('-' * 200)
    for product in products:
        print(count, product.contents[1])
        product_links.append(product.contents[1].get('href'))
    next_page = soup.find('a', class_='pageNavLink pageNavNext').get('href')
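(As an aside on the next-button approach: the href scraped from a pagination link is often site-relative, so it should be resolved against the page it came from before being passed to requests.get(). A minimal sketch using the standard library's urljoin; the href value below is hypothetical:

import requests  # noqa: F401  (used when actually fetching the page)
from urllib.parse import urljoin

# The URL the current page was fetched from.
current = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy='
# A hypothetical relative href pulled off the next-page anchor.
href = '/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy='
# urljoin keeps the scheme and host from `current` and swaps in the new path/query.
next_page = urljoin(current, href)
print(next_page)
# https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy=

This doesn't fix the underlying problem described in the answer below, but it avoids requesting a malformed URL when the href isn't absolute.)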
Here's my other code (iterating through pages in the URL query):
import requests
from bs4 import BeautifulSoup

url = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?cid=1082&facetNameValue=&sort=&size=100&page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')
# See how many pages there are on the site
page_count = soup.find_all('label', class_='control-label')[3].contents[0].split(' ')[1]
# Go through each item on every page and collect the links
print('Getting links...')
product_links = []
for count in range(1, 3):
    page_url = f'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page={count}&sortBy='
    product_response = requests.get(page_url, timeout=3)
    soup = BeautifulSoup(product_response.text, 'html5lib')
    products = soup.find_all('div', class_='product_name')
    print('-' * 200)
    print('Printing contents from ' + page_url)
    print('-' * 200)
    for product in products:
        print(count, product.contents[1])
        product_links.append(product.contents[1].get('href'))
Also, the loop goes from 1 to 3 for testing purposes. In the live version, range(1, 3) would be replaced with range(1, int(page_count) + 1), since page_count is parsed out of the page as a string.
Solution
From some quick playing around, the page parameter affects what's sent after page load, not what's sent initially. It seems like the site works by sending the first page, reading the page query parameter, then fetching the proper page afterward if page != 1.
To see this, navigate to a page after the first, open the dev tools, go to the Network tab, and search for "Annabelles" (or some other product name on the page that you want to scrape). You'll see that it comes from a products.jhtm document sent last, not from the page you requested using requests.
You could try just requesting that endpoint directly (if you can figure out their API), or use Selenium or something similar instead of requests + BS4 so that the "page requesting logic" on the initial page actually runs. Plain requests won't work here since it doesn't execute JavaScript, and BS4 only parses whatever HTML it's given.
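A minimal sketch of the Selenium route, under the assumption that selenium and a matching chromedriver are installed; the CSS class ('product_name') and URL pattern are taken from the question's code. The browser executes the site's JavaScript, so the page indicated by the page parameter actually gets fetched before we read the HTML:

from bs4 import BeautifulSoup


def extract_product_links(html):
    """Pull the product hrefs out of a rendered page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href')
            for div in soup.find_all('div', class_='product_name')
            for a in div.find_all('a')]


def scrape_pages(last_page):
    # Imported here so the parsing helper above stays usable without Selenium.
    from selenium import webdriver
    driver = webdriver.Chrome()
    product_links = []
    try:
        for page in range(1, last_page + 1):
            # Let the browser load the page and run its JS, then read the
            # rendered DOM via page_source.
            driver.get('https://www.bargainw.com/wholesale/1082/'
                       f'Wholesale-Products.html?sort=&size=100&page={page}&sortBy=')
            product_links.extend(extract_product_links(driver.page_source))
    finally:
        driver.quit()
    return product_links

Depending on how slowly the site renders, you may also need an explicit wait (e.g. WebDriverWait on the product list) before reading page_source.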
Answered By - Carcigenicate