Monday, October 25, 2021

[FIXED] AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

amazon-emr, amazon-web-services, apache-spark, jupyter-notebook, pyspark

Issue

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic Python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).

The problem arises once I perform operations that actually hit the Spark machinery. For example:

sc.parallelize(list(range(10))).map(lambda x: x**2).collect()

I am no Spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also with some Python frames. I believe the central part of the stack trace is the following:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0

The full stack trace is here.

A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:

The path python3 (from --python=python3) does not exist

I tried running the /usr/bin/virtualenv command manually on the master node (after logging in through ssh), and that worked fine, but the error is of course still present after I did that.
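Since the failing task ran on an executor node rather than on the master, one way to narrow this down is to check on each node whether the name python3 resolves on PATH. The following is only a minimal sketch using the standard library; the helper names find_interpreter and report are hypothetical, and the PATH seen in an interactive ssh session may differ from the one seen by the process that launches /usr/bin/virtualenv.

```python
import shutil
import socket


def find_interpreter(name="python3"):
    """Return the absolute path that `name` resolves to on PATH, or None."""
    return shutil.which(name)


def report(name="python3"):
    """Build a one-line summary for the node this script runs on."""
    path = find_interpreter(name)
    status = path if path is not None else "NOT FOUND on PATH"
    return f"{socket.gethostname()}: {name} -> {status}"


if __name__ == "__main__":
    # Run this on every node (master and core nodes) and compare the output.
    print(report())
```

A node that reports NOT FOUND would be consistent with the "The path python3 (from --python=python3) does not exist" message in the stdout log, though this check alone does not prove that it is the cause.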

While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.

Technical information on the cluster setup:

emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.

Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.

Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.

Solution

I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
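For anyone who creates the cluster from code rather than through the console, the release is selected with the ReleaseLabel parameter. The sketch below only builds the request parameters for boto3's run_job_flow and does not call AWS. The cluster name is hypothetical, the role names are the EMR default roles and are an assumption about the account, and the instance settings and application list are copied from the question; whether all of them are available on emr-5.29.0 is not verified here.

```python
def build_cluster_request(release_label="emr-5.29.0"):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    applications = ("Spark", "Livy", "JupyterHub", "Hive", "Ganglia", "Zeppelin")
    return {
        "Name": "notebook-cluster",  # hypothetical name
        "ReleaseLabel": release_label,  # the release that worked in this answer
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": "r5a.2xlarge",
            "SlaveInstanceType": "r5a.2xlarge",
            "InstanceCount": 3,  # one master plus two core nodes
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account uses the default EMR roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request()

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr", region_name="eu-west-1").run_job_flow(**request)
```

Keeping the release label as a function argument makes it easy to retry the same configuration on a later release and see whether the virtualenv failure still occurs.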

Answered By - josteinb

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
