Issue
I installed PySpark under Anaconda by issuing the following commands at a Conda prompt:
conda create -n py39 python=3.9 anaconda
conda activate py39
conda install openjdk
conda install pyspark
conda install -c conda-forge findspark
As can be seen, this is all within the py39 environment.
Additionally, I fetched Hadoop 2.7.1 from GitHub and created c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1 to contain the corresponding README.md file and bin subfolder [1]. Here, %HOMEPATH% is \Users\User.Name. Finally, I had to create the file %SPARK_HOME%/conf/spark-defaults.conf (Annex A).
With the above setup, I could launch PySpark using the following myspark.cmd script located in c:%HOMEPATH%\anaconda3\envs\py39\bin\:
set "PYSPARK_DRIVER_PYTHON=python"
set "PYSPARK_PYTHON=python"
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
pyspark
I am now following this page to be able to use Spyder instead of the Conda command line. I am using the following SpyderSpark.cmd script to set the variables and launch Spyder:
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=C:%HOMEPATH%\anaconda3\envs\py39\Library"
set "SPARK_HOME=C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
C:%HOMEPATH%\anaconda3\pythonw.exe ^
C:%HOMEPATH%\anaconda3\cwp.py ^
C:%HOMEPATH%\anaconda3\envs\py39 ^
C:%HOMEPATH%\anaconda3\envs\py39\pythonw.exe ^
C:%HOMEPATH%\anaconda3\envs\py39\Scripts\spyder-script.py
Some points that may not be clear:
- Folder %JAVA_HOME%\bin contains java.exe and javac.exe.
- The second half of the above code block is the command that is executed by Anaconda's shortcut for Spyder (py39).
As I am still trying to get SpyderSpark.cmd to work, I execute it from the Conda prompt, specifically the py39 environment. This way, it inherits environment variables that I may have missed in SpyderSpark.cmd. Issuing SpyderSpark.cmd launches the Spyder GUI, but Spark commands aren't recognized at the console. Here is a transcript of the response to the first few lines of code from this tutorial:
In [1]: columns = ["language","users_count"]
...: data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
In [2]: spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
NameError: name 'SparkSession' is not defined
The likely cause is that all but the PYTHONPATH variable propagated their values into the Spyder session. From the Spyder console:
import os
print(os.environ.get("HADOOP_HOME"))
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYTHONPATH"))
c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
C:\Users\User.Name\anaconda3\envs\py39\Library
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
Python
Python
None
Why isn't PYTHONPATH propagating into the Spyder session, and how can I fix this?
I don't think that this Q&A explains the problem, because I am launching Spyder from a CMD environment after setting the variable. Furthermore, all the other variables succeed in propagating to the Spyder session.
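The six print statements above can be condensed into a small helper. This is only a sketch for use at the Spyder console: the variable names are the ones set in SpyderSpark.cmd, while the function names report_environment and missing_variables are made up for illustration.

```python
import os

# Variable names taken from SpyderSpark.cmd in the question.
VARIABLES = [
    "HADOOP_HOME",
    "JAVA_HOME",
    "SPARK_HOME",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_PYTHON",
    "PYTHONPATH",
]


def report_environment(names, environ=None):
    """Return {name: value or None} for each requested variable."""
    if environ is None:
        environ = os.environ
    return {name: environ.get(name) for name in names}


def missing_variables(names, environ=None):
    """Return the names that are unset or set to an empty string."""
    report = report_environment(names, environ)
    return [name for name, value in report.items() if not value]


if __name__ == "__main__":
    for name, value in report_environment(VARIABLES).items():
        print(f"{name} = {value}")
    print("Unset or empty:", missing_variables(VARIABLES))
```

Passing a plain dictionary as environ makes it possible to try the helper without touching the real environment.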
Notes
[1] Using Cygwin, I found that for all the files in c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1\bin, the permission bits for execution were disabled and needed to be explicitly enabled.
Afternote 2023-09-02:
Respondents posted helpful hints on how to get Spark commands recognized in Spyder, i.e., to first issue from pyspark.sql import SparkSession. I didn't see this tutorial code because it was in a screen capture and the image was blocked by AdBlocker. Also, it was not needed after issuing pyspark from the Conda prompt of the py39 environment. It was needed after issuing SpyderSpark.cmd, as I found from the comments, and this allowed the Spark statements to be recognized. I assume, therefore, that pyspark imports SparkSession on the user's behalf, making it unnecessary to explicitly import it after launching pyspark from the Conda prompt.
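For reference, the import suggested in the comments might be wrapped as follows. This is a minimal sketch, not the tutorial's code: make_session is a hypothetical helper name, and actually creating a session assumes that pyspark is installed and that Java and Spark are set up as described in the question.

```python
import importlib.util


def make_session(app_name="SparkByExamples.com"):
    """Create a SparkSession, or return None if pyspark cannot be found."""
    if importlib.util.find_spec("pyspark") is None:
        return None
    # The explicit import that the Spyder console needed in the question.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


# Possible use at the Spyder console (requires a working Spark setup):
#   spark = make_session()
#   columns = ["language", "users_count"]
#   data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
#   df = spark.createDataFrame(data, columns)
```

The check with find_spec only avoids an ImportError when pyspark is absent; it does not verify that Java or Hadoop are configured.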
As useful as it was to know that SparkSession needs to be imported from within Spyder, it doesn't answer the question of why 1 of 6 environment variables fails to propagate from SpyderSpark.cmd to Spyder, i.e., variable PYTHONPATH. Admittedly, it solved the real showstopper for me at present, which is to get Spark working from Spyder, for which I thank the respondents. I would still be interested in why PYTHONPATH doesn't propagate.
On a separate but related issue, I found it tricky to create a shortcut to SpyderSpark.cmd that doesn't leave a redundant terminal on the desktop. The solution turned out to be to prefix the Spyder launching command with start:
set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
start "" ^
%USERPROFILE%\anaconda3\pythonw.exe ^
%USERPROFILE%\anaconda3\cwp.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\pythonw.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py
All the arguments starting with %USERPROFILE% would ideally be enclosed in double-quotes in case they expand to include non-alphanumeric characters. For some reason, I couldn't do that without incurring the incorrect behaviour in Annex B (below). Therefore, I did not adorn the arguments with double-quotes.
With SpyderSpark.cmd as revised above, the Target field of the Windows shortcut should contain:
%SystemRoot%\System32\cmd.exe /D /C "%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"
I found it handy to simply copy the Spyder shortcut and modify the Target field. For the sake of readability, here is the same command broken into two physical lines (which isn't suitable for the Target field of a shortcut):
%SystemRoot%\System32\cmd.exe /D /C ^
"%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"
Thanks to Mofi for advice that improved this afternote.
Further troubleshooting 2023-09-03
To further troubleshoot the propagation of environment variable PYTHONPATH into Spyder, I followed Mofi's advice and revised SpyderSpark.cmd to use the console-oriented python rather than the GUI-oriented pythonw:
set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
set PYTHONPATH & REM HHHHHHHHHHHHHHHHH
%USERPROFILE%\anaconda3\python.exe ^
%USERPROFILE%\anaconda3\cwp-debug.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\python.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py
As can be seen from above, PYTHONPATH is also displayed to the screen prior to the Spyder launching command. Furthermore, SpyderSpark.cmd was revised to use a modified cwp.py, dubbed cwp-debug.py, wherein PYTHONPATH is printed out twice:
import os
import sys
import subprocess
from os.path import join, pathsep
from menuinst.knownfolders import FOLDERID, get_folder_path, PathNotFoundException

# call as: python cwp.py PREFIX ARGs...
prefix = sys.argv[1]
args = sys.argv[2:]

new_paths = pathsep.join([prefix,
                          join(prefix, "Library", "mingw-w64", "bin"),
                          join(prefix, "Library", "usr", "bin"),
                          join(prefix, "Library", "bin"),
                          join(prefix, "Scripts")])
print(os.environ["PYTHONPATH"])  # debug: value inherited from SpyderSpark.cmd
env = os.environ.copy()
env['PATH'] = new_paths + pathsep + env['PATH']
env['CONDA_PREFIX'] = prefix

documents_folder, exception = get_folder_path(FOLDERID.Documents)
if exception:
    documents_folder, exception = get_folder_path(FOLDERID.PublicDocuments)
if not exception:
    os.chdir(documents_folder)
print(env["PYTHONPATH"])  # debug: value about to be passed to the child process
sys.exit(subprocess.call(args, env=env))
When SpyderSpark.cmd is executed from a CMD console, the expected PYTHONPATH is printed out by SpyderSpark.cmd and at both locations in cwp-debug.py. Furthermore, PYTHONPATH is echoed to the screen when it is prepended to in SpyderSpark.cmd. I have lumped together lines in the session transcript so that the different echoings of PYTHONPATH are easier to recognize:
C:\Users\User.Name> C:\Users\User.Name\anaconda3\envs\py39\bin\SpyderSpark.cmd
C:\Users\User.Name> set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name> set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name> set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name> set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name> set "PYSPARK_PYTHON=Python"
C:\Users\User.Name> set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name> set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name> set PYTHONPATH & REM HHHHHHHHHHHHHHHHH
PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;
C:\Users\User.Name> C:\Users\User.Name\anaconda3\python.exe C:\Users\User.Name\anaconda3\cwp-debug.py C:\Users\User.Name\anaconda3\envs\py39 C:\Users\User.Name\anaconda3\envs\py39\python.exe C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;
fromIccProfile: failed minimal tag size sanity
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
"class": algorithms.Blowfish,
C:\Users\User.Name>
The final warnings about fromIccProfile and Blowfish are innocuous. Explanations about the fromIccProfile warning can be found here and here, while the Blowfish warning is just about deprecation. Therefore, the modifications to SpyderSpark.cmd and cwp.py (in the form of cwp-debug.py) did not reveal why PYTHONPATH fails to propagate to Spyder.
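The same kind of check can be reproduced in isolation, without Spyder or cwp.py. The sketch below is an illustration rather than part of the original scripts: child_sees is a hypothetical helper and the demo paths are placeholders. It copies the environment the way cwp.py does, sets one variable, and asks a child Python process what it sees.

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Start a child Python with name=value set and return the value it reads."""
    env = os.environ.copy()  # same approach as cwp.py: copy, then modify
    env[name] = value
    result = subprocess.run(
        [sys.executable, "-c",
         f"import os; print(os.environ.get({name!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    demo = os.pathsep.join(["demo_dir_a", "demo_dir_b"])  # placeholder paths
    print(child_sees("PYTHONPATH", demo))
```

If the child reports the value here but Spyder's console does not, the variable is presumably being dropped somewhere after the interpreter starts rather than by the launching mechanism.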
The next step was to check whether PYTHONPATH was being clobbered by spyder-script.py, which is a very short script:
import re
import sys
from spyder.app.start import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
I'm actually trying to spin up on Python, so I'm wondering whether anyone can help decipher this code.
Further troubleshooting 2023-09-06
Mofi explained that the regular expression substitution in spyder-script.py above simply strips away a suffix -script.py[w] or .exe from the script name, which merely affects the file identification shown in diagnostic messages.
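The substitution can be tried on its own. In this sketch the pattern is copied from spyder-script.py, while the function name strip_launcher_suffix and the sample names are made up for illustration.

```python
import re

# Pattern copied from spyder-script.py: an optional "-script.py",
# "-script.pyw" or ".exe" suffix, anchored at the end of the string.
SUFFIX = r'(-script\.pyw?|\.exe)?$'


def strip_launcher_suffix(name):
    """Remove a trailing -script.py, -script.pyw or .exe, if present."""
    return re.sub(SUFFIX, '', name)


if __name__ == "__main__":
    for sample in ("spyder-script.py", "spyder-script.pyw",
                   "spyder.exe", "spyder"):
        print(sample, "->", strip_launcher_suffix(sample))
```

Each of the four samples is reduced to spyder; a name without either suffix is returned unchanged.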
I noticed that the ensuing statement invokes main() from module spyder.app.start. I examined %USERPROFILE%\anaconda3\envs\py39\Lib\site-packages\spyder\app\start.py, with emphasis on main(). I found preamble code that removes PYTHONPATH paths from sys.path. I confirmed this from within Spyder: sys.path contains neither of the PySpark paths that are added to PYTHONPATH by SpyderSpark.cmd. PYTHONPATH is empty before running SpyderSpark.cmd, so there are no other paths to check.
As for the disappearance of PYTHONPATH itself, I could see no code in start.py that modifies os.environ['PYTHONPATH'] or removes that variable from the environment. However, it doesn't really matter, as PYTHONPATH merely contributes to sys.path and start.py explicitly removes PYTHONPATH paths from sys.path.
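A check of this kind, i.e., whether given paths appear in sys.path, might be written as below. This is a sketch with hypothetical function names; it normalizes case and separators because Windows paths such as lib\site-packages and Lib\site-packages refer to the same folder, and it drops empty entries such as the one left by the trailing semicolon in the PYTHONPATH set by SpyderSpark.cmd.

```python
import os
import sys


def _normalize(path):
    """Normalize separators, trailing slashes and (on Windows) letter case."""
    return os.path.normcase(os.path.normpath(path))


def split_paths(value):
    """Split a PYTHONPATH-style string, dropping empty entries."""
    return [p for p in (value or "").split(os.pathsep) if p]


def missing_from_sys_path(candidates, search_path=None):
    """Return the candidate paths that do not appear in search_path."""
    if search_path is None:
        search_path = sys.path
    present = {_normalize(p) for p in search_path if p}
    return [c for c in candidates if _normalize(c) not in present]


if __name__ == "__main__":
    wanted = split_paths(os.environ.get("PYTHONPATH"))
    print("Not in sys.path:", missing_from_sys_path(wanted))
```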
I lack the experience to appreciate why this is done. Spyder is supposed to provide a development IDE, but it's hard to use if it removes the paths in PYTHONPATH.
Annex A: %SPARK_HOME%/conf/spark-defaults.conf
Here, %SPARK_HOME% is C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark:
spark.eventLog.enabled true
spark.eventLog.dir C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.sql.autoBroadcastJoinThreshold -1
Annex B: Incorrect behaviour when start arguments are double-quoted in SpyderSpark.cmd
When SpyderSpark.cmd is run, a terminal console appears with the following messages:
C:\Users\User.Name\Documents\Python Scripts>set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name\Documents\Python Scripts>set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name\Documents\Python Scripts>set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>start "" "C:\Users\User.Name\anaconda3\pythonw.exe" ^
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\cwp.py" "C:\Users\User.Name\anaconda3\envs\py39" ^
[main 2023-09-02T23:29:02.117Z] update#setState idle
[main 2023-09-02T23:29:04.434Z] WSL is not installed, so could not detect WSL profiles
The VS Code app then appears, opened to a file cwp.py (the 2nd argument supplied to start in SpyderSpark.cmd). When I exit VS Code, the following additional messages are printed to the terminal console, followed by the appearance of the Spyder app:
[main 2023-09-02T23:29:09.998Z] Extension host with pid 21404 exited with code: 0, signal: unknown.
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\envs\py39\pythonw.exe" "C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py"
When I exit Spyder, the terminal console then disappears.
2023-09-06 afternote: According to Mofi, the cause of all this unexpected behaviour is incorrect parsing of the Spyder launching command as a multi-line statement. Specifically, the caret symbol at the end of a physical line indicates that the statement continues on the next line, and this caret should not be preceded by a space. Rather, the next physical line, onto which the statement continues, should start with a space. With this fix, the arguments to start can be double-quoted and the script still launches Spyder in the expected manner. Here is the revised and properly working SpyderSpark.cmd:
set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"
start ""^
 "%USERPROFILE%\anaconda3\pythonw.exe"^
 "%USERPROFILE%\anaconda3\cwp.py"^
 "%USERPROFILE%\anaconda3\envs\py39"^
 "%USERPROFILE%\anaconda3\envs\py39\pythonw.exe"^
 "%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py"
Other than for aesthetics, I have not seen this prescription described anywhere, i.e., to avoid a space before the caret and to start the next line with a space. However, it works. In this specific case, the need to start a continuation line with a space could be due to the fact that the first character would otherwise be ", which is meant to delimit a file path but is not part of the file path. Since the first character of a continuation line is automatically escaped, we do not want the " to be the first character or else it loses its special meaning.
Solution
As noted in section Further troubleshooting 2023-09-06 in the question, start.py removes PYTHONPATH paths from sys.path. With a bit more test driving of Spyder, one possible reason became clear: Spyder maintains its own PYTHONPATH as a tool parameter (Tools -> PYTHONPATH manager).
This Q&A shows how to add to sys.path the paths from the PYTHONPATH of the shell that executed Spyder, ideally testing for their presence beforehand. One could write a script to do this and specify the script's full path via Spyder's Tools -> Preferences -> IPython console -> Startup (tab).
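Such a startup script might look like the sketch below. It is an illustration under stated assumptions, not code from the linked Q&A: extend_sys_path is a hypothetical helper, and EXTRA_PYTHONPATH is a made-up variable name that the launching script would have to set. It is tried first because, in the question, PYTHONPATH itself showed up as None in the Spyder console while the other variables propagated.

```python
import os
import sys


def extend_sys_path(value, search_path=None):
    """Append each entry of a PYTHONPATH-style string that is not yet present.

    Returns the list of entries that were actually added.
    """
    if search_path is None:
        search_path = sys.path
    added = []
    for entry in (value or "").split(os.pathsep):
        if entry and entry not in search_path:
            search_path.append(entry)
            added.append(entry)
    return added


# EXTRA_PYTHONPATH is a hypothetical variable name (an assumption of this
# sketch); PYTHONPATH is used as a fallback in case it is visible after all.
extend_sys_path(os.environ.get("EXTRA_PYTHONPATH")
                or os.environ.get("PYTHONPATH"))
```

Appending rather than prepending keeps the paths that Spyder and Conda set up ahead of the custom ones, which seems the safer default.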
It's not clear to me why Spyder was designed to maintain its own PYTHONPATH rather than inheriting it from the environment. PYTHONPATH depends on how other Python packages/modules are set up, which changes piecewise with time. It's hazardous to manage manually.
It is interesting to see that in 2019, Spyder did inherit PYTHONPATH from the environment if Tools -> Preferences -> Python interpreter -> Python interpreter was set to Default (i.e. the same as Spyder's) (see this GitHub bug ticket). In this bug ticket, it was going to be fixed imminently, so it seems that the fix was to use the PYTHONPATH from Spyder's PYTHONPATH manager in all cases.
Just as a test, however, I set Python interpreter to C:\Users\User.Name\anaconda3\envs\py39\python.exe, which was the path shown by issuing where python from the Conda prompt of the py39 environment. After each of the following tests, still nothing was shown by import os; os.environ.get("PYTHONPATH"):
- Issuing exit() from the Spyder console, which I guess restarts the Python interpreter
- Exiting and restarting Spyder
Also relevant from 2019 is this request "Avoid dropping predefined PYTHONPATH when using an external interpreter". I'm not familiar enough with GitHub to follow what happened to this request, so if anyone can weigh in, thanks!
Finally, I also found that unless you put custom paths into PYTHONPATH, it doesn't need to propagate into Spyder.
In the posted question, the two paths added to PYTHONPATH are from issuing os.environ.get("PYTHONPATH") after launching pyspark from within myspark.cmd. Even though PYTHONPATH was empty within Spyder, I found that the two paths are included in sys.path, directly or indirectly.
The first of the two PYTHONPATH paths is %SPARK_HOME%\python\lib\py4j-0.10.9.7-src.zip. If I try to add this to Spyder's PYTHONPATH manager, it complains that the path is invalid. Examining the files therein and using Cygwin's find on folder %USERPROFILE%\anaconda3\envs/py39 reveals that the files are unpacked into package folder %USERPROFILE%\anaconda3\envs\py39\Lib\site-packages\py4j. If I correctly understand my readings about packages, package py4j is available to Python because sys.path already includes %USERPROFILE%\anaconda3\envs\py39\lib\site-packages. Why PYTHONPATH pointed to the zip file from within a pyspark session, I do not know.
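Where a package such as py4j is being loaded from can also be checked from the console rather than with find. The following is a sketch using the standard importlib machinery; locate is a hypothetical helper name, and it returns None when the module cannot be found.

```python
import importlib.util


def locate(module_name):
    """Return the folder (for a package) or file (for a module), else None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        # A package: report the first folder it is loaded from.
        return list(spec.submodule_search_locations)[0]
    return spec.origin


if __name__ == "__main__":
    for name in ("py4j", "pyspark"):
        print(name, "->", locate(name))
```

A result under site-packages would be consistent with the package being found through sys.path rather than through the zip file.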
The second PYTHONPATH path is %SPARK_HOME%\python. This is already included in sys.path within Spyder.
A question that remained was whether Spyder's startup scripts managed to add these two paths to sys.path only because SpyderSpark.cmd added them to PYTHONPATH beforehand. I removed the setting of PYTHONPATH from SpyderSpark.cmd and, sure enough, the two paths were still present in sys.path within Spyder.
It seems, therefore, that the scripts responsible for launching Spyder also add the necessary paths to sys.path without the need for PYTHONPATH. I suspect that this is because Conda sets up the environment with all the necessary dependencies.
I also confirmed that if PYTHONPATH did contain custom paths prior to launching Spyder, they did not propagate to sys.path within Spyder. This is consistent with the fact that the py4j zip file pointed to by PYTHONPATH also didn't propagate into sys.path. I presume, therefore, that one would need to use Spyder's PYTHONPATH manager for custom paths or add the custom paths to sys.path using code.
Answered By - user2153235