Issue
So I currently have a function:
from selenium import webdriver
from bs4 import BeautifulSoup

def main(search_term):
    # RUN MAIN PROGRAM ROUTINE
    chromedriver = "chromedriver path"
    driver = webdriver.Chrome(chromedriver)
    records = []
    url = get_url(search_term)  # get_url (defined elsewhere) returns a URL template with a page placeholder
    # SELECT NUMBER OF PAGES TO CRAWL
    for page in range(1, 21):
    # for page in itertools.count():
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('div', {'data-component-type': 's-search-result'})
        print(page)
        for item in results:
            record = extract_record(item)  # extract_record (defined elsewhere) parses one search result
            if record:
                records.append(record)
which scrapes data from pages 1 through 20 of the search results for a given search_term such as "electronics", "cosmetics", or "airpod pro case".
However, I realized that some searches return fewer pages (1 to 3, 1 to 7, 1 to 20, and so on) depending on how specific my search_term is.
I was thinking I could keep scraping while the next button is enabled, and stop once my code notices that the next button is disabled, which would mean it has reached the last page of the results.
The XPaths of the enabled and disabled next buttons are:
next_button_enabled = driver.find_element_by_xpath('//li[@class="a-last"]')
next_button_disabled = driver.find_element_by_xpath('//li[@class="a-disabled a-last"]')
but I am not sure how to combine this information with what I have written so far.
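One way I imagine using this, sketched as a check on the page source rather than a Selenium element lookup (which raises NoSuchElementException when the element is missing). This assumes Amazon still marks the last page's next button with the classes "a-disabled a-last" as in the XPaths above; has_next_page is just a name I made up:

```python
from bs4 import BeautifulSoup

def has_next_page(page_source: str) -> bool:
    """Return True while an enabled 'Next' button is present in the page."""
    soup = BeautifulSoup(page_source, 'html.parser')
    # On the last results page the <li> carries both 'a-disabled' and 'a-last'
    return soup.select_one('li.a-disabled.a-last') is None
```

In the loop this could replace the fixed range(1, 21): fetch a page, collect its records, then break as soon as has_next_page(driver.page_source) is False.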
Solution
Since a results-page URL looks like https://www.amazon.com/s?k=phone&page=2, you can do some basic link hacking. The only thing you need to find out is how many pages there are in total. soup.find('ul', class_="a-pagination").find_all('li')
will retrieve the pagination list; the last page number is the second-to-last item in that list:
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url = 'https://www.amazon.com/s?k=phone'  # or https://www.amazon.com/s?k=maison+kitsune+airpod+pro+case
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(url)
soup = BeautifulSoup(wd.page_source, "html.parser")

# The second-to-last <li> in the pagination list holds the highest page number
last_page = int([i.get_text() for i in soup.find('ul', class_="a-pagination").find_all('li')][-2])

for page in range(2, last_page + 1):
    page_url = f'{url}&page={page}'
    # get page_url with Selenium as above
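If you want to guard against very specific search terms that return a single page (no pagination list at all), the lookup can be wrapped in a small helper. get_last_page is a hypothetical name; the parsing is the same second-to-last-item trick as above:

```python
from bs4 import BeautifulSoup

def get_last_page(page_source: str) -> int:
    """Parse the highest page number from Amazon's pagination list."""
    pagination = BeautifulSoup(page_source, 'html.parser').find(
        'ul', class_='a-pagination')
    if pagination is None:
        # No pagination list at all: only one page of results
        return 1
    # Items run Previous, 1, 2, ..., N, Next - the number we want is at [-2]
    return int(pagination.find_all('li')[-2].get_text(strip=True))
```

You would then call get_last_page(wd.page_source) instead of the inline expression, and the range loop works unchanged even for single-page results.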
Answered By - RJ Adriaansen