Issue
This simple example from the latest PySpark documentation does not work on the EMR Studio Spark cluster (current version: 3.3.1-amzn-0):
import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is a pandas.DataFrame holding one group's rows
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
On the EMR cluster, however, the call fails, and the error looks like this:
An error was encountered:
An error occurred while calling o184.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 59) (ip-10-130-55-119.us-east-1.aws.(website).com executor 7): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --no-pip --system-site-packages virtualenv_application_1693557403809_0024_0
at org.apache.spark.api.python.VirtualEnvFactory.execCommand(VirtualEnvFactory.scala:125)
at org.apache.spark.api.python.VirtualEnvFactory.setupVirtualEnv(VirtualEnvFactory.scala:83)
at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:95)
I am convinced this is a problem of Python package versions, as another user had a similar problem with a previous version of Spark (see here). However, I did not succeed in finding the right versions of pandas/pyarrow to use...
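For anyone debugging along the same lines, here is a minimal sketch to compare the pandas/pyarrow versions seen by the driver with those seen by the executors' Python workers. It assumes a live SparkSession named spark; the helper worker_versions is my own name, not part of any API:

import pandas as pd
import pyarrow as pa

# Versions visible to the driver process
print("driver:", pd.__version__, pa.__version__)

def worker_versions(_):
    # Runs inside an executor's Python worker
    import pandas, pyarrow
    return [(pandas.__version__, pyarrow.__version__)]

# A one-partition job, so the check runs on a single executor
print("executors:",
      spark.sparkContext.parallelize([0], 1).flatMap(worker_versions).collect())

If the driver and executor versions disagree, that mismatch is a likely culprit for applyInPandas failures.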
Solution
The solution was to open a ticket with AWS Support, and they fixed the issue. Part of the fix was to put this %%configure magic in the first notebook cell:
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type": "native",
        "spark.pyspark.virtualenv.bin.path": "/usr/local/bin/virtualenv"
    }
}
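Note that the failing command in the stack trace invokes /usr/bin/virtualenv, while the configuration above points spark.pyspark.virtualenv.bin.path at /usr/local/bin/virtualenv instead. As a quick sanity check that the settings were actually picked up by the new session started by %%configure -f, something like this can be run in a later cell (a sketch, assuming the session is available as spark):

# Print each virtualenv-related setting, or "<unset>" if it was not applied
for key in (
    "spark.pyspark.python",
    "spark.pyspark.virtualenv.enabled",
    "spark.pyspark.virtualenv.type",
    "spark.pyspark.virtualenv.bin.path",
):
    print(key, "=", spark.conf.get(key, "<unset>"))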
Answered By - mountrix