Issue
I've seen a couple of posts on using Selenium in Databricks using %sh
to install Chrome drivers and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in Databricks. Even when I changed the download path while instantiating Chrome to a folder mounted on Azure Blob Storage, the file would not appear there after downloading.
The following links show people with the same problem but no clear answer:
https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html
And some struggling with getting Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html
Is there a clear guide to use Selenium on Databricks and manage downloaded files?
Solution
Here is a guide to installing and using Selenium, including moving a file downloaded via Selenium to your mounted storage. Each step should go in its own cell.
Selenium installation guide found here: https://forums.databricks.com/questions/15480/how-to-add-webdriver-for-selenium-in-databricks.html?childToView=21347#answer-21347
- Install Selenium
%pip install selenium
- Do your imports
import pickle as pkl
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
- Download the latest Chrome driver to the DBFS root storage /tmp/. Make sure you grab the latest driver version available, because it has to match the (latest) Chrome version installed below.
%sh
wget https://chromedriver.storage.googleapis.com/91.0.4472.19/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
- Unzip the file to a new folder in the DBFS root /tmp/. I tried a non-root path and it did not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
- Get the latest Chrome download and install it.
%sh
sudo add-apt-repository ppa:canonical-chromium-builds/stage
/usr/bin/yes | sudo apt update
/usr/bin/yes | sudo apt install chromium-browser
- Configure your storage account. This example is for Azure Blob Storage with ADLS Gen2.
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
- Configure your mounting location and mount.
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://" + container + "@" + storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

if not any(mount_point in mount_info for mount_info in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=storage,
        mount_point=mount_point,
        extra_configs=configs)
    print(mount_point + " has been mounted.")
else:
    print(mount_point + " was already mounted.")

print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")
- Create a method for instantiating the Chrome browser. I need to load a cookies file that I have placed in my utils folder, which points to /mnt/container-data/utils/selenium. Make sure the arguments are the same (no-sandbox, headless, disable-dev-shm-usage).
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The path where files downloaded during this browser session are placed.
    chrome_driver_path : str
        The path of the chromedriver executable binary.
    cookies_path : str
        The path of the cookies file to load (.pkl file).
    url : str
        The URL of the page to load initially.

    Returns
    -------
    Browser
        The instantiated browser object.
    """
    options = Options()
    prefs = {
        "download.default_directory": download_path + "/",
        "directory_upgrade": True,
    }
    options.add_experimental_option("prefs", prefs)
    options.add_argument("--no-sandbox")
    options.add_argument("--headless")
    options.add_argument("--disable-dev-shm-usage")
    browser = webdriver.Chrome(chrome_driver_path, options=options)
    browser.get(url)
    cookies = pkl.load(open(cookies_path, "rb"))
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    return browser
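The cookies file is assumed to already exist in the utils folder. One way it could have been created (my own sketch, not part of the original answer) is by dumping the cookies from an already-authenticated Selenium session, e.g. run locally where you can log in interactively; `browser` here is an assumed, already-logged-in webdriver instance:

```python
import pickle as pkl

def save_cookies(browser, cookies_path):
    # `browser` is assumed to be an authenticated Selenium webdriver
    # instance; get_cookies() returns a list of dicts in exactly the
    # shape that add_cookie() can replay later.
    with open(cookies_path, "wb") as f:
        pkl.dump(browser.get_cookies(), f)
```

Upload the resulting .pkl file to your mounted utils folder so the cluster can read it.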
- Instantiate the browser. Set the download location to the DBFS root file system /tmp/downloads. Make sure the cookies path is prefixed with /dbfs so the full cookies path looks like /dbfs/mnt/...
browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs" + utils_folder + "cookies.pkl",
    url="YOUR_URL"
)
- Do your navigating and any downloads you need.
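Downloads are asynchronous, so a file may not be fully written when your click returns. A small polling helper (my own sketch, not from the original answer) can wait for the download to finish; it relies on Chrome writing in-progress downloads with a .crdownload extension:

```python
import os
import time

def wait_for_download(download_dir, suffix=".csv", timeout=60):
    # Poll the download directory until a finished file with `suffix`
    # exists and no partial (.crdownload) files remain; return its path.
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(download_dir) if os.path.isdir(download_dir) else []
        done = [n for n in names if n.endswith(suffix)]
        partial = [n for n in names if n.endswith(".crdownload")]
        if done and not partial:
            return os.path.join(download_dir, done[0])
        time.sleep(1)
    raise TimeoutError(f"No {suffix} file appeared in {download_dir}")
```

Call it after triggering the download, e.g. `wait_for_download("/tmp/downloads")`, before copying the file anywhere.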
- OPTIONAL: Examine your download location. In this example I downloaded a CSV file, so I walk the download folder until I find that file format.
import os
import os.path

for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
- Copy the file from the DBFS root /tmp to your mounted storage (/mnt/container-data/raw/). You can rename the file during this operation as well. Note that with dbutils you can only access the root file system using the file: prefix.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
Answered By - kindofhungry