Issue
I've seen a couple of posts on using Selenium in Databricks using %sh
to install Chrome drivers and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in Databricks. Even when I changed the download path while instantiating Chrome to a folder mounted on Azure Blob Storage, the file would not appear there after downloading.
The following links show people with the same problem but no clear answer:
https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html
And some struggling with getting Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html
Is there a clear guide to use Selenium on Databricks and manage downloaded files?
Solution
Here is a guide to installing and using Selenium, including moving a file downloaded via Selenium to your mounted storage. Each step should go in its own cell.
Selenium installation guide found here: https://forums.databricks.com/questions/15480/how-to-add-webdriver-for-selenium-in-databricks.html?childToView=21347#answer-21347
- Install Selenium
%pip install selenium
- Do your imports
import pickle as pkl
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
- Download the latest Chrome driver to the DBFS root storage /tmp/. Make sure you grab the latest driver version available, because it has to match the (latest) Chrome version installed below.
%sh
wget https://chromedriver.storage.googleapis.com/91.0.4472.19/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
- Unzip the file to a new folder in the DBFS root /tmp/. I tried a non-root path and it did not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
- Get the latest Chrome download and install it.
%sh
sudo add-apt-repository ppa:canonical-chromium-builds/stage
/usr/bin/yes | sudo apt update
/usr/bin/yes | sudo apt install chromium-browser
- Configure your storage account. This example is for Azure Blob Storage with ADLS Gen2.
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
- Configure your mounting location and mount.
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://" + container + "@" + storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

if not any(mount_point in mount_info for mount_info in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=storage,
        mount_point=mount_point,
        extra_configs=configs)
    print(mount_point + " has been mounted.")
else:
    print(mount_point + " was already mounted.")

print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")
- Create a method for instantiating the Chrome browser. I need to load a cookies file that I have placed in my utils folder, which points to /mnt/container-data/utils/selenium. Make sure the arguments are the same (no-sandbox, headless, disable-dev-shm-usage).
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The path where files downloaded during this browser session are placed.
    chrome_driver_path : str
        The path of the chromedriver executable binary.
    cookies_path : str
        The path of the cookies file to load (.pkl file).
    url : str
        The URL of the page to load initially.

    Returns
    -------
    Browser
        The instantiated browser object.
    """
    options = Options()
    prefs = {
        "download.default_directory": download_path + "/",
        "directory_upgrade": True,
    }
    options.add_experimental_option("prefs", prefs)
    options.add_argument("--no-sandbox")
    options.add_argument("--headless")
    options.add_argument("--disable-dev-shm-usage")
    browser = webdriver.Chrome(chrome_driver_path, options=options)
    browser.get(url)
    cookies = pkl.load(open(cookies_path, "rb"))
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    return browser
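The cookies file is assumed to already exist in the utils folder. One way it could have been created (my own sketch, not part of the original answer) is by dumping the cookies from an already-authenticated Selenium session, e.g. run locally where you can log in interactively; `browser` here is an assumed, already-logged-in webdriver instance:

```python
import pickle as pkl

def save_cookies(browser, cookies_path):
    # `browser` is assumed to be an authenticated Selenium webdriver
    # instance; get_cookies() returns a list of dicts in exactly the
    # shape that add_cookie() can replay later.
    with open(cookies_path, "wb") as f:
        pkl.dump(browser.get_cookies(), f)
```

Upload the resulting .pkl file to your mounted utils folder so the cluster can read it.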
- Instantiate the browser. Set the download location to the DBFS root file system /tmp/downloads. Make sure the cookies path is prefixed with /dbfs so the full cookies path looks like /dbfs/mnt/...
browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs" + utils_folder + "cookies.pkl",
    url="YOUR_URL"
)
- Do your navigating and any downloads you need.
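Downloads are asynchronous, so a file may not be fully written when your click returns. A small polling helper (my own sketch, not from the original answer) can wait for the download to finish; it relies on Chrome writing in-progress downloads with a .crdownload extension:

```python
import os
import time

def wait_for_download(download_dir, suffix=".csv", timeout=60):
    # Poll the download directory until a finished file with `suffix`
    # exists and no partial (.crdownload) files remain; return its path.
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(download_dir) if os.path.isdir(download_dir) else []
        done = [n for n in names if n.endswith(suffix)]
        partial = [n for n in names if n.endswith(".crdownload")]
        if done and not partial:
            return os.path.join(download_dir, done[0])
        time.sleep(1)
    raise TimeoutError(f"No {suffix} file appeared in {download_dir}")
```

Call it after triggering the download, e.g. `wait_for_download("/tmp/downloads")`, before copying the file anywhere.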
- OPTIONAL: Examine your download location. In this example I downloaded a CSV file, so I walk the download folder until I find that file format.
import os
import os.path

for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
- Copy the file from the DBFS root /tmp to your mounted storage (/mnt/container-data/raw/). You can rename the file during this operation as well. Note that with dbutils you can only access the root file system using the file: prefix.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
Answered By - kindofhungry