Issue
I am trying to connect from my laptop to a Spark cluster that I created locally with Docker. The docker-compose file I used is the following:
services:
  spark-master:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
  spark-worker:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-2:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-3:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
The compose file above uses the Bitnami Spark image with three workers and one master. The code I am using to connect from my Jupyter notebook is the following:
import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Day1_1").master("spark://localhost:7077").getOrCreate()
df_NYTaxi = spark.read.csv(file)
Running the code above produces the following error:
: java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
I have tried many things, but each time I either cannot connect to the Docker containers at all, or I can connect but the job times out.
My local Spark version is 3.2.1, and the image uses the same version.
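As a first diagnostic for this kind of symptom, it can help to confirm that the ports published by the compose file are reachable from the laptop at all. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical and not part of Spark or of the original post.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_reach("localhost", 7077)` checks the master port and `can_reach("localhost", 7075)` checks the web UI port from the compose file. A `True` result only shows that the port is published. It does not show that the workers can reach the driver running on the laptop, which a standalone cluster also needs.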
Solution
The workaround was to build a custom Docker image, run it as several containers with docker-compose, attach to a running container from VS Code, and run the scripts from inside it.
Here is the docker-compose file after modification:
version: '2'
services:
  spark:
    build: .
    container_name: spark_master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
  spark-worker:
    build: .
    container_name: spark_worker_1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
  spark-worker-2:
    build: .
    container_name: spark_worker_2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
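One visible difference from the first compose file is that the master service is now named `spark`, which matches the host in `SPARK_MASTER_URL=spark://spark:7077`. In the first file the service was named `spark-master` while the workers still pointed at `spark`. Compose makes service names resolvable as hostnames on the project network, so a mismatch like this is worth checking for, although the post does not establish whether it contributed to the original error. The sketch below is one way to spot it, assuming the compose file has already been loaded into a plain dict; the function name is hypothetical, and it deliberately ignores `container_name` and network aliases.

```python
from urllib.parse import urlparse


def master_host_mismatches(services: dict) -> dict:
    """Map each service to its SPARK_MASTER_URL host when no service has that name."""
    mismatches = {}
    for name, spec in services.items():
        for entry in spec.get("environment", []):
            key, _, value = entry.partition("=")
            if key == "SPARK_MASTER_URL":
                host = urlparse(value).hostname
                if host not in services:
                    mismatches[name] = host
    return mismatches


# Both dicts are trimmed to the keys that matter for this check.
first_file = {
    "spark-master": {"environment": ["SPARK_MODE=master"]},
    "spark-worker": {"environment": ["SPARK_MASTER_URL=spark://spark:7077"]},
}
modified_file = {
    "spark": {"environment": ["SPARK_MODE=master"]},
    "spark-worker": {"environment": ["SPARK_MASTER_URL=spark://spark:7077"]},
}

print(master_host_mismatches(first_file))     # {'spark-worker': 'spark'}
print(master_host_mismatches(modified_file))  # {}
```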
The Dockerfile for building this image is the following:
FROM bitnami/spark:3.2.1
USER root
# Installing package into Spark if needed
# spark-shell --master local --packages "<package name>"
RUN pip install findspark
EXPOSE 8080
EXPOSE 7075
EXPOSE 7077
After building this image (you first need to create two folders named execution_scripts and resources next to the compose file), you can attach to the running container from VS Code, or in a similar way from any other IDE.
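Once attached to the master container, a script placed in `/execution_scripts` could look roughly like the sketch below. It is an illustration under assumptions, not the author's actual script: the helper names, the `header=True` option, and any file name passed to `run` are hypothetical. It relies on two things that do come from the compose file above: the master is reachable as `spark:7077` inside the compose network, and `/resources` is mounted in the master and in every worker, so the same path is valid on all of them.

```python
import posixpath


def master_url(host: str = "spark", port: int = 7077) -> str:
    """Build the standalone master URL as seen from inside the compose network."""
    return f"spark://{host}:{port}"


def resource_path(filename: str, base: str = "/resources") -> str:
    """Path of a data file inside the shared /resources volume."""
    return posixpath.join(base, filename)


def run(filename: str) -> None:
    """Connect to the cluster, read one CSV file, and show a few rows."""
    # Imported lazily so the helpers above work without a Spark install.
    import findspark  # installed by the Dockerfile above
    findspark.init()
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Day1_1")
        .master(master_url())
        .getOrCreate()
    )
    try:
        # header=True is an assumption about the CSV layout.
        df = spark.read.csv(resource_path(filename), header=True)
        df.show(5)
    finally:
        spark.stop()
```

Calling `run(...)` with the name of a CSV file that exists under `./resources` on the host would then execute the read on the cluster rather than from the laptop.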
Answered By - TheDataJanit0r