Issue
Please take a look at this website. My goal is to take a screenshot of every PDF link on a page, given its URL.
First, I requested the URL, parsed the HTML, and found all the PDF links:
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"

# Fetch the raw HTML with a browser-like User-Agent and parse it
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Collect every <a> tag and keep the ones whose href points to a PDF
links = page_soup.find_all('a')
i = 0
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Found file: ", i, link.get('href', ''))
This works as expected: all 864 files show up.
Next, I try to take a full-window Selenium screenshot of the page view containing each link:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"

# Parse the static HTML first to build the list of PDF links
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')

i = 0
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

# For each PDF link, locate the matching <a> element in the live page
# and save a full-window screenshot
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        element = driver.find_element(By.XPATH, '//a[@href="' + url_pdf + '"]')
        _ = element.screenshot_as_png
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, link.get('href', ''))
driver.quit()
It fails as soon as the loop reaches a link that lives on the next page of the paginated table:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://aplng.com.au/wp-content/uploads/2022/06/Australia-Pacific-LNG-Pty-Limited-FY2021-Tax-Contribution-Report.pdf"]"}
(Session info: chrome=117.0.5938.88); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
My questions:
- Why does the plain URL request return all the PDF links, while the webdriver approach does not?
- All the similar problems I've found online suggest clicking the next-page element. But that solution seems too site-specific. Is there a more general solution?
- If clicking through the pages is the only/best solution, how can I make it robust enough to handle similar cases?
I have read about AJAX, but I don't really understand it. My understanding of web technology is still minimal, so feel free to explain as thoroughly as you need.
Solution
There is no general solution that works for every site. It is a kind of reverse engineering, and the right approach differs from case to case.
In your case the easiest option is to select the All option in the Show on page dropdown, so that every row (and therefore every PDF link) is rendered at once.
Other resources may require other approaches, such as:
- Continuously scrolling to the end of the page (infinite scroll)
- Clicking the next pagination button in a loop (see the sketch after this list)
- Passing a page-number parameter in the URL, if the site's pagination is driven by a parameter such as example.com/?page=1 or ?p=1
- Passing filter parameters in the URL that return all the needed data at once, e.g. example.com/?limit=1000
- Selecting an All option for the output (this case)
It's site-specific and depends on the pagination logic implemented on the particular resource.
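For instance, the pagination-button approach mentioned in the list above might look roughly like the following sketch. The selectors are assumptions about a DataTables-style pager (a[href$='.pdf'] for the links, a.paginate_button.next with a disabled class for the control) and must be adapted to the actual site; in practice you would also add explicit waits between clicks.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/document-library/")  # placeholder URL

pdf_links = set()
while True:
    # Collect the PDF links that are currently rendered on this page
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']"):
        pdf_links.add(a.get_attribute("href"))
    try:
        # Hypothetical selector for a DataTables-style "next" button
        next_button = driver.find_element(By.CSS_SELECTOR, "a.paginate_button.next")
    except NoSuchElementException:
        break  # no pagination control found
    # Many pagination widgets mark the last page with a "disabled" class
    if "disabled" in (next_button.get_attribute("class") or ""):
        break
    next_button.click()

print(len(pdf_links), "PDF links collected")
driver.quit()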
Your case:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# other imports

driver.get(url)
wait = WebDriverWait(driver, 10)

# Open the "Show on page" dropdown and pick "All" so every row is rendered
items_dropdown = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dataTables_length [class*=select2-container]')))
items_dropdown.click()
wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@class='select2-results']//li[text()='All']"))).click()

for link in links:
    if '.pdf' in link.get('href', ''):
        # your code
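One possible way to fill in that placeholder, assuming the variables from the question's second script (links, driver) and the wait/By objects above are still in scope, is sketched below; scrolling each element into view before the full-window screenshot is an assumption about what the capture should show.

i = 0
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        # With "All" selected, every row should now be present in the DOM
        element = wait.until(
            EC.presence_of_element_located((By.XPATH, f'//a[@href="{url_pdf}"]'))
        )
        # Scroll the link into view so it appears in the window screenshot
        driver.execute_script("arguments[0].scrollIntoView(true);", element)
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, url_pdf)
driver.quit()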
Answered By - Yaroslavm