Issue
I am trying to launch an EMR cluster that includes the PostgreSQL JDBC driver JAR so I can load data from Postgres and analyze it with PySpark. The JAR I would like to include is stored in S3. I have tried the following things:
1 - Entering the following configuration:
[
    {
        "Classification": "presto-connector-postgresql",
        "Properties": {
            "connection-url": "jdbc:postgresql://example.net:5432/database",
            "connection-user": "MYUSER",
            "connection-password": "MYPASS"
        },
        "Configurations": []
    }
]
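Note that this classification configures Presto's PostgreSQL connector, not Spark, so it does not put the driver on PySpark's classpath. With Presto installed on the cluster, the connector would instead be queried through a Presto catalog, roughly like this (a sketch; the `postgresql` catalog name and `public` schema are assumptions, not taken from the question):

```sql
-- Hypothetical Presto query against the connector configured above;
-- the catalog name "postgresql" is assumed from the classification name.
SELECT * FROM postgresql.public.slm_files LIMIT 100;
```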
2 - Adding the JAR as a custom step (selecting the JAR from S3)
3 - Adding the JAR as a custom bootstrap action (selecting the JAR from S3)
None of these worked: I can't figure out how to use the connector from Step 1 within Jupyter, and both the custom step and the bootstrap action fail when I launch the cluster. How can I launch an EMR cluster with the PostgreSQL driver installed so I can query my data from Jupyter?
EDIT:
I used the following bootstrap script to copy the JAR to my master and worker nodes:
#!/bin/bash
aws s3 cp s3://BUCKETNAME/postgresql-42.2.8.jar /mnt1/myfolder
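Copying the JAR onto the nodes is not enough on its own: Spark only loads JDBC drivers that are on its classpath, and a file sitting in /mnt1/myfolder is not. One way to wire it up at launch time would be a spark-defaults classification pointing at the copied file (a sketch, assuming the path from the bootstrap script above; note that EMR ships its own extraClassPath defaults, which setting these properties would override):

```json
[
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.driver.extraClassPath": "/mnt1/myfolder/postgresql-42.2.8.jar",
            "spark.executor.extraClassPath": "/mnt1/myfolder/postgresql-42.2.8.jar"
        }
    }
]
```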
But I still get the following error:
An error was encountered:
An error occurred while calling o90.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver
With the following code:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbcURL") \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("query", "select * from slm_files limit 100") \
    .load()
df.count()
Solution
Using this code in the first cell of my Jupyter notebook solved it for me:
%%configure -f
{
    "conf": {
        "spark.jars": "s3://JAR-LOCATION/postgresql-42.2.8.jar"
    }
}
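This restarts the Spark session with the S3 JAR added via spark.jars, so the driver class is on both the driver and executor classpaths. If pulling from Maven Central is acceptable instead of hosting the JAR in S3, spark.jars.packages with the driver's Maven coordinates should work the same way (a sketch; this assumes the cluster has outbound internet access to resolve the artifact):

```
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.postgresql:postgresql:42.2.8"
    }
}
```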
Answered By - DBA108642