Issue
So I currently have a function:
from selenium import webdriver
from bs4 import BeautifulSoup

def main(search_term):
    # RUN MAIN PROGRAM ROUTINE
    chromedriver = "chromedriver path"
    driver = webdriver.Chrome(chromedriver)
    records = []
    url = get_url(search_term)  # get_url (defined elsewhere) returns a URL template with a page placeholder
    # SELECT NUMBER OF PAGES TO CRAWL
    for page in range(1, 21):
    # for page in itertools.count():
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('div', {'data-component-type': 's-search-result'})
        print(page)
        for item in results:
            record = extract_record(item)  # extract_record (defined elsewhere) parses one search result
            if record:
                records.append(record)
which scrapes data from pages 1 through 20 of the search results for a given search_term such as "electronics", "cosmetics", or "airpod pro case".
However, I realized that some searches return fewer pages (1 to 3, 1 to 7, 1 to 20, and so on) depending on how specific my search_term is.
I was thinking I could keep scraping while the next button is enabled, and stop once my code notices that the next button is disabled, which would mean it has reached the last page of the results.
The XPaths of the enabled and disabled next buttons are:
next_button_enabled = driver.find_element_by_xpath('//li[@class="a-last"]')
next_button_disabled = driver.find_element_by_xpath('//li[@class="a-disabled a-last"]')
but I am not sure how to combine this information with what I have written so far.
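One way I imagine using this, sketched as a check on the page source rather than a Selenium element lookup (which raises NoSuchElementException when the element is missing). This assumes Amazon still marks the last page's next button with the classes "a-disabled a-last" as in the XPaths above; has_next_page is just a name I made up:

```python
from bs4 import BeautifulSoup

def has_next_page(page_source: str) -> bool:
    """Return True while an enabled 'Next' button is present in the page."""
    soup = BeautifulSoup(page_source, 'html.parser')
    # On the last results page the <li> carries both 'a-disabled' and 'a-last'
    return soup.select_one('li.a-disabled.a-last') is None
```

In the loop this could replace the fixed range(1, 21): fetch a page, collect its records, then break as soon as has_next_page(driver.page_source) is False.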
Solution
Since a results-page URL looks like https://www.amazon.com/s?k=phone&page=2, you can do some basic link hacking. The only thing you need to find out is how many pages there are in total. soup.find('ul', class_="a-pagination").find_all('li')
will retrieve the pagination list; the last page number is the second-to-last item in that list:
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

url = 'https://www.amazon.com/s?k=phone'  # or https://www.amazon.com/s?k=maison+kitsune+airpod+pro+case
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(url)
soup = BeautifulSoup(wd.page_source, "html.parser")

# The second-to-last <li> in the pagination list holds the highest page number
last_page = int([i.get_text() for i in soup.find('ul', class_="a-pagination").find_all('li')][-2])

for page in range(2, last_page + 1):
    page_url = f'{url}&page={page}'
    # get page_url with Selenium as above
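If you want to guard against very specific search terms that return a single page (no pagination list at all), the lookup can be wrapped in a small helper. get_last_page is a hypothetical name; the parsing is the same second-to-last-item trick as above:

```python
from bs4 import BeautifulSoup

def get_last_page(page_source: str) -> int:
    """Parse the highest page number from Amazon's pagination list."""
    pagination = BeautifulSoup(page_source, 'html.parser').find(
        'ul', class_='a-pagination')
    if pagination is None:
        # No pagination list at all: only one page of results
        return 1
    # Items run Previous, 1, 2, ..., N, Next - the number we want is at [-2]
    return int(pagination.find_all('li')[-2].get_text(strip=True))
```

You would then call get_last_page(wd.page_source) instead of the inline expression, and the range loop works unchanged even for single-page results.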
Answered By - RJ Adriaansen