Issue
I'm new to PySpark and tried some simple code like this:
from pyspark import SparkConf, SparkContext
# create (or reuse) a local SparkContext
conf = SparkConf().setAppName('Read File')
sc = SparkContext.getOrCreate(conf=conf)
# read the file as an RDD of lines and print it
rdd = sc.textFile('data1.txt')
print(rdd.collect())
# split every line on spaces and print the result
rdd2 = rdd.map(lambda x: x.split(' '))
print(rdd2.collect())
but the rdd2.collect() call always fails with an error like:
ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 5)/ 2]
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
I have the following versions installed, all local, running on Windows 10 with cmd.exe:
- Python 3.12.1
- Java 11.0.20
- Spark 3.5.0
- Hadoop 3.3.6
I have also declared all the environment variables: JAVA_HOME, SCALA_HOME, HADOOP_HOME, SPARK_HOME, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. The last two point to python.exe in the Python installation directory.
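For clarity, the last two were set roughly like this in cmd.exe for the current session (the path shown is only a placeholder, not the actual installation path):
REM placeholder path; replace with the actual python.exe location
set PYSPARK_PYTHON=C:\Python312\python.exe
set PYSPARK_DRIVER_PYTHON=C:\Python312\python.exe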
I have tried uninstalling and reinstalling everything, changing versions, and changing the environment variables, but I don't know what to do now.
Solution
Finally I ended up using Docker with the dedicated PySpark Jupyter image "https://hub.docker.com/r/jupyter/pyspark-notebook", and it works correctly without problems. In case anyone runs into this kind of problem, you can also follow this guide "https://medium.com/@suci/running-pyspark-on-jupyter-notebook-with-docker-602b18ac4494" and this one "https://subhamkharwal.medium.com/data-lakehouse-with-pyspark-setup-pyspark-docker-jupyter-lab-env-1261a8a55697".
Thanks to these I was able to install Docker easily and load the "jupyter/pyspark-notebook" image correctly, so everything works as it should.
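For anyone trying the same route, starting the image looks roughly like this (the local folder mapping is only an example path; port 8888 is JupyterLab's default):
docker pull jupyter/pyspark-notebook
REM map a local work folder (example path) into the container and expose JupyterLab
docker run -it -p 8888:8888 -v C:\spark-work:/home/jovyan/work jupyter/pyspark-notebook
The container then prints a URL with an access token that you open in the browser.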
Answered By - Roterun