Issue
In a CLI, I moved to hadoop directory(in EMR) and downloaded ojdbc.jar file. And I tried to connect Oracle DB using spark shell commands below:
pyspark \
--jars "/home/hadoop/ojdbc6.jar" \
--master yarn-client \
--num-executors 5 \
--driver-memory 14g \
--executor-memory 14g \
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:USER/HOST@//IP:PORT/SERVICE") \
.option("dbtable", "TABLE") \
.option("user", "USER") \
.option("password", "PASSWORD") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
It is working but using a terminal is inconvenient for me. So I want to connect EMR Notebooks to Oracle DB and tried codes below:
from pyspark import SparkContext, SparkConf
spark_conf = SparkConf().setAll([('spark.pyspark.python', 'python3'),
('spark.pyspark.virtualenv.enabled', 'true'),
('spark.pyspark.virtualenv.type', 'native'),
('spark.pyspark.virtualenv.bin.path', '/usr/bin/virtualenv'),
('spark.driver.extraClassPath', '/home/hadoop/ojdbc6.jar')])\
.setAppName('SparkJob')
sc = SparkContext.getOrCreate(conf=spark_conf)
df = sqlContext.read \
.format("jdbc") \
.options(url="jdbc:oracle:thin:USER/HOST@//IP:PORT/SERVICE",
dbtable="TABLE",
user="USER",
password="PASSWORD",
driver = "oracle.jdbc.driver.OracleDriver") \
.load()
And got an error:
An error occurred while calling o97.load.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
Is there any missing step, in order to connect to the Oracle DB from the EMR Jupyter Notebooks?
Solution
You can use this solution, First upload the jdbc driver to s3 bucket and copy the link, then you can specify the jars files in first cell(first to be executed). For example I did this for ms sql jdbc driver (you need oracle here).
%%configure -f
{
"conf": {
"spark.jars": "s3://jar-test/mssql-jdbc-8.4.0.jre8.jar"
}
}
Also, here is the snippet to read from jdbc.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession, Window, DataFrame,Row
spark_session = SparkSession\
.builder\
.appName("test") \
.getOrCreate()
spark_context = spark_session.sparkContext
sql_context = SQLContext(spark_context)
df = sql_context.read.format("jdbc") \
.option("url", "jdbc:sqlserver://<public-dns>:<port>") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("dbtable","<table-name>") \
.option("user", "<username>") \
.option("password", "<password>") \
.load()
Answered By - A.B
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.