Thursday, June 16, 2022

[FIXED] Connecting to a local Docker Spark Cluster

apache-spark, docker, docker-compose, jupyter-notebook, python

Issue

I am trying to connect from my laptop to a Spark cluster that I created locally. The docker-compose file I used is the following:


services:
  spark-master:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
  spark-worker:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-2:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-3:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

The compose file above uses a Bitnami image to run 3 workers and 1 master. The code I am using to connect from my Jupyter notebook is the following:

import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Day1_1").master("spark://localhost:7077").getOrCreate()
df_NYTaxi =  spark.read.csv(file)

After running the above code, I get the following error:

: java.lang.NullPointerException
    at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
I have tried a lot of things, but every time either I can't connect to the Docker cluster at all, or I can connect but the job times out.
My local Spark version is 3.2.1, and the image uses the same version.
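
One quick way to narrow down a failure like this is to check whether the published master port is reachable from the host at all. The sketch below is only an illustration using the Python standard library; the helper name `port_is_open` is made up here, and `localhost:7077` is taken from the port mapping in the compose file above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# The compose file publishes the master RPC port as 7077 on the host.
print("master port reachable:", port_is_open("localhost", 7077))
```

A `True` result only shows that the port mapping works. It does not prove that a driver on the host can run jobs, because the executors inside the containers also have to connect back to the driver.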

Solution

The workaround was to build a custom Docker image, run the cluster from it as multiple containers, attach to a container through VS Code, and run the scripts from inside.

Here is the docker-compose file after modification:

version: '2'

services:
  spark:
    build: .
    container_name: spark_master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"

    
  spark-worker:
    build: .
    container_name: spark_worker_1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
 
  spark-worker-2:
    build: .
    container_name: spark_worker_2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
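
One difference between the two compose files is easy to miss: the master service is named `spark` here, which matches the host in `SPARK_MASTER_URL=spark://spark:7077`, whereas the file in the question named it `spark-master`. Whether that mismatch is related to the stack trace in the question is not established. The sketch below just checks that every worker points at an existing service name. The helper `unresolved_master_hosts` and the trimmed-down dictionaries are illustrative, and the check assumes containers find each other by compose service name only.

```python
from urllib.parse import urlparse


def unresolved_master_hosts(services: dict) -> dict:
    """Map each service to the master host it references when that host
    is not the name of any service in the same compose file."""
    known = set(services)
    problems = {}
    for name, svc in services.items():
        for entry in svc.get("environment", []):
            key, _, value = entry.partition("=")
            if key == "SPARK_MASTER_URL":
                host = urlparse(value).hostname
                if host not in known:
                    problems[name] = host
    return problems


# Trimmed-down versions of the two compose files (one worker each).
original = {
    "spark-master": {"environment": ["SPARK_MODE=master"]},
    "spark-worker": {"environment": ["SPARK_MODE=worker",
                                     "SPARK_MASTER_URL=spark://spark:7077"]},
}
modified = {
    "spark": {"environment": ["SPARK_MODE=master"]},
    "spark-worker": {"environment": ["SPARK_MODE=worker",
                                     "SPARK_MASTER_URL=spark://spark:7077"]},
}

print(unresolved_master_hosts(original))  # {'spark-worker': 'spark'}
print(unresolved_master_hosts(modified))  # {}
```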

The Dockerfile for building this image is the following:

FROM bitnami/spark:3.2.1
USER root

# Installing package into Spark if needed
# spark-shell --master local --packages "<package name>"
RUN pip install findspark
EXPOSE 8080
EXPOSE 7075
EXPOSE 7077

After building this image (you first need to create two folders called execution_scripts and resources next to the compose file), you can attach to the running container in VS Code, or in a similar way from any other IDE.
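
As a rough illustration of what a script run from inside the container might look like, here is a minimal sketch. It assumes the script lives in the mounted `execution_scripts` folder and that a CSV file exists under `/resources`; the file name `nyc_taxi.csv` and the helper names are hypothetical. The master is addressed by its compose service name, `spark`, instead of `localhost`.

```python
def master_url(host: str = "spark", port: int = 7077) -> str:
    """Build the standalone master URL; 'spark' is the compose service name."""
    return f"spark://{host}:{port}"


def main(csv_path: str = "/resources/nyc_taxi.csv") -> None:
    """Connect to the cluster from inside a container and preview a CSV file."""
    # Imported here so the module can be loaded even where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Day1_1")
        .master(master_url())
        .getOrCreate()
    )
    try:
        # /resources is mounted into the master and the workers alike,
        # so the executors can read the file at the same path.
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        df.show(5)
    finally:
        spark.stop()
```

Calling `main()` at the end of the script, or from an interactive session attached to the container, starts the job. Because the driver then runs on the same Docker network as the workers, no traffic has to cross between the host and the containers.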

Answered By - TheDataJanit0r

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
