Issue
I have several webpages that I would like to scrape using Selenium. I want to automate this and run it on a remote machine. Since each website is different, each script needs different functionality to complete the job. Instead of repeating the same code in every script to start a virtual display and a webdriver, I have a rough idea of using a decorator that starts up a virtual display and webdriver, like so:
    from typing import Callable

    from pyvirtualdisplay import Display
    from selenium import webdriver

    def open_headless_browser(func: Callable) -> Callable:
        disp = Display(visible=False, size=(100, 100))
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        options.add_argument("--dns-prefetch-disable")

        def start() -> None:
            with disp:
                with webdriver.Chrome(options=options) as wd:
                    func()  # note: wd is never handed to func

        return start
And then I can potentially have my scripts (the ones that actually perform the scraping) like so:
    from selenium.webdriver.common.by import By

    @open_headless_browser
    def scrape_abc(url_abc: str) -> None:
        driver.get(url_abc)
        driver.find_elements(By.XPATH, 'abc')

    @open_headless_browser
    def scrape_xyz(url_xyz: str) -> None:
        driver.get(url_xyz)
        driver.find_elements(By.CSS_SELECTOR, 'xyz')
However, several things concern me:
- Is the code in my scrape_abc and scrape_xyz functions considered a bit awkward, because it does not have any idea of what driver is (since it is defined in the decorator)?
- Would this even work? Am I over-complicating things, or am I just approaching this idea incorrectly?
- Is this pythonic?
I am on Python 3.10, Selenium 4.15, pyvirtualdisplay 3.0.
EDIT: after some thinking, this approach will not work after all. The decorated functions will not have access to the webdriver object defined in the decorator
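The scoping problem described in the EDIT can be reproduced without Selenium. The following is only a minimal sketch: `open_resource`, `scrape`, and the `resource` string are hypothetical stand-ins for the decorator, a scraper, and the webdriver.

```python
from typing import Callable


def open_resource(func: Callable) -> Callable:
    def start() -> None:
        # Hypothetical stand-in for the webdriver; local to start().
        resource = "stand-in for the webdriver"
        func()  # func is called without the resource, so it cannot reach it

    return start


@open_resource
def scrape() -> None:
    # `resource` is looked up in scrape's own scope and the module globals,
    # not in the scope of the wrapper that called it.
    print(resource)


try:
    scrape()
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__
```

Calling `scrape()` raises a NameError here, because a function only sees names from the scope where it was defined, not from its caller.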
Solution
"EDIT: after some thinking, this approach will not work after all. The decorated functions will not have access to the webdriver object defined in the decorator"
Sure it will, you just need to pass wd as an argument to the function, something like this:
    def open_headless_browser(func: Callable) -> Callable:
        disp = Display(visible=False, size=(100, 100))
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        options.add_argument("--dns-prefetch-disable")

        def start() -> None:
            with disp:
                with webdriver.Chrome(options=options) as wd:
                    func(wd)  # hand the live driver to the decorated function

        return start
Then your functions will look like:
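    from selenium.webdriver.common.by import By

    # url_abc and url_xyz are assumed to be defined at module level.
    @open_headless_browser
    def scrape_abc(driver: webdriver.Chrome) -> None:
        driver.get(url_abc)
        # The find_elements_by_* helpers were removed in Selenium 4.3; use By locators.
        driver.find_elements(By.XPATH, 'abc')

    @open_headless_browser
    def scrape_xyz(driver: webdriver.Chrome) -> None:
        driver.get(url_xyz)
        driver.find_elements(By.XPATH, 'xyz')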
If you want to be able to pass in a URL, you need to define arguments in the wrapper function, too:
    from typing import Callable

    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def open_headless_browser(func: Callable) -> Callable:
        disp = Display(visible=False, size=(100, 100))
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        options.add_argument("--dns-prefetch-disable")

        def start(url: str) -> None:
            with disp:
                with webdriver.Chrome(options=options) as wd:
                    func(wd, url)  # driver first, then the caller's URL

        return start

    @open_headless_browser
    def scrape_abc(driver: webdriver.Chrome, url: str) -> None:
        driver.get(url)
        driver.find_elements(By.XPATH, 'abc')
Then it's just a case of remembering that although you define the function as having two arguments, you only call it with one.
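If different scrapers need different arguments, one option is to forward `*args` and `**kwargs` from the wrapper instead of hard-coding `url`. The following is only a sketch of that idea, not part of the original answer: `with_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` is a stand-in so the example runs without a browser. With Selenium, the factory could be something like `lambda: webdriver.Chrome(options=options)`, since the driver works as a context manager.

```python
import functools
from contextlib import AbstractContextManager
from typing import Any, Callable


class FakeDriver(AbstractContextManager):
    """Hypothetical stand-in for webdriver.Chrome; records calls only."""

    def __init__(self) -> None:
        self.visited: list[str] = []
        self.closed = False

    def get(self, url: str) -> None:
        self.visited.append(url)

    def __exit__(self, *exc_info: Any) -> None:
        # Mirrors a real driver being shut down when the with-block ends.
        self.closed = True


def with_driver(factory: Callable[[], Any]) -> Callable:
    """Build a decorator that opens a driver and passes it as the first argument."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the scraper's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            with factory() as driver:
                # Forward whatever the caller passed, after the injected driver.
                return func(driver, *args, **kwargs)

        return wrapper

    return decorator


@with_driver(FakeDriver)
def scrape_abc(driver: FakeDriver, url: str, retries: int = 1) -> FakeDriver:
    driver.get(url)
    return driver


# The caller never supplies the driver, only the remaining arguments.
used = scrape_abc("https://example.com/abc", retries=2)
```

Using `functools.wraps` keeps `scrape_abc.__name__` intact, which helps when logging which scraper ran.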
Answered By - RoadieRich