Issue
I'm trying to scrape this page: https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy= and it works fine on the first page, but when I try to go through the other pages, it scrapes the first page's items over and over. I tried changing the URL to https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy= (page=2) so it goes to the second page; that works in the browser but not in my code. I also tried scraping the link off the Next Page button on the site, but the issue remains.
Here's my code (scraping the link from the Next Page button on the site):
import requests
from bs4 import BeautifulSoup

product_links = []
next_page = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy='
for count in range(1, 3):
    product_response = requests.get(next_page, timeout=3)
    soup = BeautifulSoup(product_response.text, 'html5lib')
    products = soup.find_all('div', class_='product_name')
    print('-' * 200)
    print('Printing contents from ' + next_page)
    print('-' * 200)
    for product in products:
        print(count, product.contents[1])
        product_links.append(product.contents[1].get('href'))
    next_page = soup.find('a', class_='pageNavLink pageNavNext').get('href')
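(As an aside on the next-button approach: the href scraped from a pagination link is often site-relative, so it should be resolved against the page it came from before being passed to requests.get(). A minimal sketch using the standard library's urljoin; the href value below is hypothetical:

import requests  # noqa: F401  (used when actually fetching the page)
from urllib.parse import urljoin

# The URL the current page was fetched from.
current = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=1&sortBy='
# A hypothetical relative href pulled off the next-page anchor.
href = '/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy='
# urljoin keeps the scheme and host from `current` and swaps in the new path/query.
next_page = urljoin(current, href)
print(next_page)
# https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page=2&sortBy=

This doesn't fix the underlying problem described in the answer below, but it avoids requesting a malformed URL when the href isn't absolute.)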
Here's my other code (iterating through pages in the URL query):
import requests
from bs4 import BeautifulSoup

url = 'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?cid=1082&facetNameValue=&sort=&size=100&page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')
# See how many pages there are on the site
page_count = soup.find_all('label', class_='control-label')[3].contents[0].split(' ')[1]
# Go through each item on every page and collect the links
print('Getting links...')
product_links = []
for count in range(1, 3):
    page_url = f'https://www.bargainw.com/wholesale/1082/Wholesale-Products.html?sort=&size=100&page={count}&sortBy='
    product_response = requests.get(page_url, timeout=3)
    soup = BeautifulSoup(product_response.text, 'html5lib')
    products = soup.find_all('div', class_='product_name')
    print('-' * 200)
    print('Printing contents from ' + page_url)
    print('-' * 200)
    for product in products:
        print(count, product.contents[1])
        product_links.append(product.contents[1].get('href'))
Also, the loop goes from 1 to 3 for testing purposes. In the live version, range(1, 3) would be replaced with range(1, int(page_count) + 1), since page_count is parsed out of the page as a string.
Solution
From some quick playing around, the page parameter affects what's sent after page load, not what's sent initially. It seems like the site works by sending the first page, reading the page query parameter, then fetching the proper page afterward if page != 1.
To see this, navigate to a page after the first, open the dev tools, go to the Network tab, and search for "Annabelles" (or some other product name on the page that you want to scrape). You'll see that it comes from a products.jhtm document sent last, not from the page you requested using requests.
You could try just requesting that endpoint directly (if you can figure out their API), or use Selenium or something similar instead of requests + BS4 so that the "page requesting logic" on the initial page actually runs. Plain requests won't work here since it doesn't execute JavaScript, and BS4 only parses whatever HTML it's given.
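A minimal sketch of the Selenium route, under the assumption that selenium and a matching chromedriver are installed; the CSS class ('product_name') and URL pattern are taken from the question's code. The browser executes the site's JavaScript, so the page indicated by the page parameter actually gets fetched before we read the HTML:

from bs4 import BeautifulSoup


def extract_product_links(html):
    """Pull the product hrefs out of a rendered page's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href')
            for div in soup.find_all('div', class_='product_name')
            for a in div.find_all('a')]


def scrape_pages(last_page):
    # Imported here so the parsing helper above stays usable without Selenium.
    from selenium import webdriver
    driver = webdriver.Chrome()
    product_links = []
    try:
        for page in range(1, last_page + 1):
            # Let the browser load the page and run its JS, then read the
            # rendered DOM via page_source.
            driver.get('https://www.bargainw.com/wholesale/1082/'
                       f'Wholesale-Products.html?sort=&size=100&page={page}&sortBy=')
            product_links.extend(extract_product_links(driver.page_source))
    finally:
        driver.quit()
    return product_links

Depending on how slowly the site renders, you may also need an explicit wait (e.g. WebDriverWait on the product list) before reading page_source.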
Answered By - Carcigenicate