Issue
Please take a look at this website. My goal is to take a screenshot of every PDF link on a page, given its URL.
First, I requested the URL, parsed the HTML, and found all the PDF links:
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"

# Fetch the raw HTML with a browser-like User-Agent and parse it
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Collect every <a> tag and keep the ones whose href points to a PDF
links = page_soup.find_all('a')
i = 0
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Found file: ", i, link.get('href', ''))
This works as expected: all 864 files show up.
Next, I try to take a full-window Selenium screenshot of the page view containing each link:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"

# Parse the static HTML first to build the list of PDF links
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')

i = 0
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

# For each PDF link, locate the matching <a> element in the live page
# and save a full-window screenshot
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        element = driver.find_element(By.XPATH, '//a[@href="' + url_pdf + '"]')
        _ = element.screenshot_as_png
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, link.get('href', ''))
driver.quit()
It fails as soon as the loop reaches a link that lives on the next page of the paginated table:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://aplng.com.au/wp-content/uploads/2022/06/Australia-Pacific-LNG-Pty-Limited-FY2021-Tax-Contribution-Report.pdf"]"}
(Session info: chrome=117.0.5938.88); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
My questions:
- Why does the plain URL request return all the PDF links, while the webdriver approach does not?
- All the similar problems I've found online suggest clicking the next-page element. But that solution seems too site-specific. Is there a more general solution?
- If clicking through the pages is the only/best solution, how can I make it robust enough to handle similar cases?
I have read about AJAX, but I don't really understand it. My understanding of web technology is still minimal, so feel free to explain as thoroughly as you need.
Solution
There is no general solution that works for every site. It is a kind of reverse engineering, and the right approach differs from case to case.
In your case the easiest option is to select the All option in the Show on page dropdown, so that every row (and therefore every PDF link) is rendered at once.
Other resources may require other approaches, such as:
- Continuously scrolling to the end of the page (infinite scroll)
- Clicking the next pagination button in a loop (see the sketch after this list)
- Passing a page-number parameter in the URL, if the site's pagination is driven by a parameter such as example.com/?page=1 or ?p=1
- Passing filter parameters in the URL that return all the needed data at once, e.g. example.com/?limit=1000
- Selecting an All option for the output (this case)
It's site-specific and depends on the pagination logic implemented on the particular resource.
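For instance, the pagination-button approach mentioned in the list above might look roughly like the following sketch. The selectors are assumptions about a DataTables-style pager (a[href$='.pdf'] for the links, a.paginate_button.next with a disabled class for the control) and must be adapted to the actual site; in practice you would also add explicit waits between clicks.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/document-library/")  # placeholder URL

pdf_links = set()
while True:
    # Collect the PDF links that are currently rendered on this page
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']"):
        pdf_links.add(a.get_attribute("href"))
    try:
        # Hypothetical selector for a DataTables-style "next" button
        next_button = driver.find_element(By.CSS_SELECTOR, "a.paginate_button.next")
    except NoSuchElementException:
        break  # no pagination control found
    # Many pagination widgets mark the last page with a "disabled" class
    if "disabled" in (next_button.get_attribute("class") or ""):
        break
    next_button.click()

print(len(pdf_links), "PDF links collected")
driver.quit()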
Your case:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# other imports

driver.get(url)
wait = WebDriverWait(driver, 10)

# Open the "Show on page" dropdown and pick "All" so every row is rendered
items_dropdown = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dataTables_length [class*=select2-container]')))
items_dropdown.click()
wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@class='select2-results']//li[text()='All']"))).click()

for link in links:
    if '.pdf' in link.get('href', ''):
        # your code
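One possible way to fill in that placeholder, assuming the variables from the question's second script (links, driver) and the wait/By objects above are still in scope, is sketched below; scrolling each element into view before the full-window screenshot is an assumption about what the capture should show.

i = 0
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        # With "All" selected, every row should now be present in the DOM
        element = wait.until(
            EC.presence_of_element_located((By.XPATH, f'//a[@href="{url_pdf}"]'))
        )
        # Scroll the link into view so it appears in the window screenshot
        driver.execute_script("arguments[0].scrollIntoView(true);", element)
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, url_pdf)
driver.quit()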
Answered By - Yaroslavm