Wednesday, December 27, 2023

[FIXED] PYTHONPATH not propagating from CMD to Spyder

Labels: anaconda, cmd, pyspark, python, spyder

Issue

I installed PySpark under Anaconda by issuing the following commands at a Conda prompt:

conda create -n py39 python=3.9 anaconda
conda activate py39
conda install openjdk
conda install pyspark
conda install -c conda-forge findspark

As can be seen, this is all within the py39 environment. Additionally, I fetched Hadoop 2.7.1 from GitHub and created c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1 to contain the corresponding README.md file and bin subfolder [1]. Here, %HOMEPATH% is \Users\User.Name. Finally, I had to create file %SPARK_HOME%/conf/spark-defaults.conf (Annex A).

With the above setup, I could launch PySpark using the following myspark.cmd script located in c:%HOMEPATH%\anaconda3\envs\py39\bin\:

set "PYSPARK_DRIVER_PYTHON=python"
set "PYSPARK_PYTHON=python"
set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
pyspark

I am now following this page to be able to use Spyder instead of the Conda command line. I am using the following SpyderSpark.cmd script to set the variables and launch Spyder:

set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=C:%HOMEPATH%\anaconda3\envs\py39\Library"
set "SPARK_HOME=C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

C:%HOMEPATH%\anaconda3\pythonw.exe ^
C:%HOMEPATH%\anaconda3\cwp.py ^
C:%HOMEPATH%\anaconda3\envs\py39 ^
C:%HOMEPATH%\anaconda3\envs\py39\pythonw.exe ^
C:%HOMEPATH%\anaconda3\envs\py39\Scripts\spyder-script.py

Some points that may not be clear:

Folder %JAVA_HOME%\bin contains java.exe and javac.exe
The second half of the above code block is the command that is executed by Anaconda's shortcut for Spyder (py39)

As I am still trying to get SpyderSpark.cmd to work, I execute it from the Conda prompt, specifically the py39 environment. This way, it inherits environment variables that I may have missed in SpyderSpark.cmd. Issuing SpyderSpark.cmd launches the Spyder GUI, but Spark commands aren't recognized at the console. Here is a transcript of the response to the first few lines of code from this tutorial:

In [1]: columns = ["language","users_count"]
   ...: data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
In [2]: spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
NameError: name 'SparkSession' is not defined

The likely cause is that all but the PYTHONPATH variable propagated their values into the Spyder session. From the Spyder console:

import os
print(os.environ.get("HADOOP_HOME"))
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(os.environ.get("PYSPARK_PYTHON"))
print(os.environ.get("PYTHONPATH"))

   c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
   C:\Users\User.Name\anaconda3\envs\py39\Library
   C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
   Python
   Python
   None

Why isn't PYTHONPATH propagating into the Spyder session, and how can I fix this?

I don't think that this Q&A explains the problem because I am launching Spyder from a CMD environment after setting the variable. Furthermore, all the other variables succeed in propagating to the Spyder session.

Notes

[1] Using Cygwin, I found that for all the files in c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1\bin, the permission bits for execution were disabled and needed to be explicitly enabled.

Afternote 2023-09-02:

Respondents posted helpful hints on how to get Spark commands recognized in Spyder, i.e., to first issue from pyspark.sql import SparkSession. I didn't see this tutorial code because it was in a screen capture and the image was blocked by AdBlocker. Also, it was not needed after issuing pyspark from the Conda prompt of the py39 environment. It was needed after issuing SpyderSpark.cmd, as I found from the comments, and this allowed the Spark statements to be recognized. I assume, therefore, that pyspark imports SparkSession on the user's behalf, making it unnecessary to explicitly import it after launching pyspark from the Conda prompt.

As useful as it was to know that SparkSession needs to be imported from within Spyder, it doesn't answer the question of why one of the six environment variables, PYTHONPATH, fails to propagate from SpyderSpark.cmd to Spyder. Admittedly, it solved the real showstopper for me at present, which is to get Spark working from Spyder, for which I thank the respondents. I would still be interested in why PYTHONPATH doesn't propagate.

On a separate but related issue, I found it tricky to create a shortcut to SpyderSpark.cmd that doesn't leave a redundant terminal on the desktop. The solution turned out to be to prefix the Spyder launching command with start:

set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

start "" ^
%USERPROFILE%\anaconda3\pythonw.exe ^
%USERPROFILE%\anaconda3\cwp.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\pythonw.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py

All the arguments starting with %USERPROFILE% would ideally be enclosed in double-quotes in case they expand to include non-alphanumeric characters. For some reason, I couldn't do that without incurring the incorrect behaviour in Annex B (below). Therefore, I did not adorn the arguments with double-quotes.

With SpyderSpark as revised above, the Target field of the Windows shortcut should contain:

%SystemRoot%\System32\cmd.exe /D /C "%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"

I found it handy to simply copy the Spyder shortcut and modify the Target field. For the sake of readability, here is the same command broken into two physical lines (which isn't suitable for the Target field of a shortcut):

%SystemRoot%\System32\cmd.exe /D /C ^
   "%USERPROFILE%\anaconda3\envs\py39\bin\SpyderSpark.cmd"

Thanks to Mofi for advice that improved this afternote.

Further troubleshooting 2023-09-03

To further troubleshoot the propagation of environment variable PYTHONPATH into Spyder, I followed Mofi's advice and revised SpyderSpark.cmd to use the console-oriented python rather than the GUI-oriented pythonw:

set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

set PYTHONPATH & REM HHHHHHHHHHHHHHHHH
%USERPROFILE%\anaconda3\python.exe ^
%USERPROFILE%\anaconda3\cwp-debug.py ^
%USERPROFILE%\anaconda3\envs\py39 ^
%USERPROFILE%\anaconda3\envs\py39\python.exe ^
%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py

As can be seen from above, PYTHONPATH is also displayed to the screen prior to the Spyder launching command. Furthermore, SpyderSpark.cmd was revised to use a modified cwp.py, dubbed cwp-debug.py, wherein PYTHONPATH is printed out twice:

import os
import sys
import subprocess
from os.path import join, pathsep

from menuinst.knownfolders import FOLDERID, get_folder_path, PathNotFoundException

# call as: python cwp.py PREFIX ARGs...

prefix = sys.argv[1]
args = sys.argv[2:]

new_paths = pathsep.join([prefix,
                         join(prefix, "Library", "mingw-w64", "bin"),
                         join(prefix, "Library", "usr", "bin"),
                         join(prefix, "Library", "bin"),
                         join(prefix, "Scripts")])
print(os.environ["PYTHONPATH"]) ###################
env = os.environ.copy()
env['PATH'] = new_paths + pathsep + env['PATH']
env['CONDA_PREFIX'] = prefix

documents_folder, exception = get_folder_path(FOLDERID.Documents)
if exception:
    documents_folder, exception = get_folder_path(FOLDERID.PublicDocuments)
if not exception:
    os.chdir(documents_folder)
print(env["PYTHONPATH"]) ######################
sys.exit(subprocess.call(args, env=env))

When SpyderSpark.cmd is executed from a CMD console, the expected PYTHONPATH is printed out by SpyderSpark.cmd and at both locations in cwp-debug.py. Furthermore, PYTHONPATH is echoed to the screen when it is prepended to in SpyderSpark.cmd. I have lumped together lines in the session transcript so that the different echoings of PYTHONPATH are easier to recognize:

C:\Users\User.Name> C:\Users\User.Name\anaconda3\envs\py39\bin\SpyderSpark.cmd

C:\Users\User.Name> set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name> set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name> set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name> set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name> set "PYSPARK_PYTHON=Python"
C:\Users\User.Name> set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name> set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"

C:\Users\User.Name> set PYTHONPATH   & REM HHHHHHHHHHHHHHHHH
PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;

C:\Users\User.Name> C:\Users\User.Name\anaconda3\python.exe C:\Users\User.Name\anaconda3\cwp-debug.py C:\Users\User.Name\anaconda3\envs\py39 C:\Users\User.Name\anaconda3\envs\py39\python.exe C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py

C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;

fromIccProfile: failed minimal tag size sanity
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,

C:\Users\User.Name>

The final warnings about fromIccProfile and Blowfish are innocuous. Explanations about the fromIccProfile warning can be found here and here while the Blowfish warning is just about deprecation. Therefore, the modifications to SpyderSpark and cwp.py (in the form of cwp-debug.py) did not reveal why PYTHONPATH fails to propagate to Spyder.

The next step was to check whether PYTHONPATH was being clobbered by spyder-script.py, which is a very short script:

import re
import sys

from spyder.app.start import main

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

I'm actually trying to spin up on Python, so I'm wondering whether anyone can help decipher this code.

Further troubleshooting 2023-09-06

Mofi explained that the regular expression substitution in spyder-script.py above simply strips away a suffix -script.py[w] or .exe from the script name, which merely affects the file identification shown in diagnostic messages.
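To see what that substitution does in isolation, here is a small standalone sketch. The pattern is the one from spyder-script.py above; the helper name strip_launcher_suffix and the sample file names are made up for illustration only.

```python
import re

# Same pattern as in spyder-script.py: an optional "-script.py",
# "-script.pyw" or ".exe" suffix, anchored at the end of the string.
SUFFIX = r'(-script\.pyw?|\.exe)?$'


def strip_launcher_suffix(argv0):
    """Return argv0 without a trailing -script.py[w] or .exe (hypothetical helper)."""
    return re.sub(SUFFIX, '', argv0)


if __name__ == '__main__':
    # Sample names are illustrative; only the suffix matters.
    for name in ('spyder-script.py', 'spyder-script.pyw', 'spyder.exe', 'spyder'):
        print(name, '->', strip_launcher_suffix(name))
```

Nothing in this substitution reads or writes environment variables, so it cannot be what affects PYTHONPATH.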

I noticed that the ensuing statement invokes main() from module spyder.app.start. I examined %USERPROFILE%\anaconda3\envs\py39\Lib\site-packages\spyder\app\start.py, with emphasis on main(). I found preamble code that removes PYTHONPATH paths from sys.path. I confirmed this from within Spyder: sys.path contains neither of the PySpark paths that are added to PYTHONPATH by SpyderSpark.cmd. PYTHONPATH is empty before running SpyderSpark.cmd, so there are no other paths to check.

As for the disappearance of PYTHONPATH itself, I could see no code in start.py that modifies os.environ['PYTHONPATH'] or removes that variable from the environment. However, it doesn't really matter, as PYTHONPATH merely contributes to sys.path and start.py explicitly removes PYTHONPATH paths from sys.path.
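The relationship between PYTHONPATH and sys.path can be checked outside Spyder with a plain interpreter. The following is only a sketch under stated assumptions, not anything taken from Spyder's code: the helper name child_sees_path and the temporary directory are made up. It starts a child Python with PYTHONPATH set, reports whether that directory is in the child's sys.path at startup, then removes the entry from sys.path and reports whether the variable is still present in os.environ.

```python
import json
import os
import subprocess
import sys
import tempfile

# Code run by the child interpreter: check sys.path, drop the entry from
# sys.path, then check that os.environ still holds the variable.
CHILD_CODE = r"""
import json, os, sys
target = os.environ["PYTHONPATH"]
norm = lambda p: os.path.normcase(os.path.realpath(p))
in_path = norm(target) in [norm(p) for p in sys.path]
sys.path[:] = [p for p in sys.path if norm(p) != norm(target)]
print(json.dumps({
    "in_sys_path_at_startup": in_path,
    "env_after_removal": os.environ.get("PYTHONPATH") == target,
}))
"""


def child_sees_path(directory):
    """Launch a child Python with PYTHONPATH=directory (hypothetical helper)."""
    env = os.environ.copy()
    env["PYTHONPATH"] = directory
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env)
    return json.loads(out.decode())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(child_sees_path(tmp))
```

With a plain interpreter, both values should come back true: the variable feeds sys.path once at startup, and editing sys.path afterwards leaves the variable alone. That is consistent with start.py being able to strip the paths from sys.path without touching os.environ, and it means the disappearance of the variable itself inside Spyder must happen somewhere else.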

I lack the experience to appreciate why this is done. Spyder is supposed to provide a development IDE, but it's hard to use if it removes the paths in PYTHONPATH.

Annex A: %SPARK_HOME%/conf/spark-defaults.conf

Here, %SPARK_HOME% is C:%HOMEPATH%\anaconda3\envs\py39\lib\site-packages\pyspark:

spark.eventLog.enabled true
spark.eventLog.dir C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.sql.autoBroadcastJoinThreshold -1

Annex B: Incorrect behaviour when start arguments are double-quoted in SpyderSpark.cmd

When SpyderSpark.cmd is run, a terminal console appears with the following messages:

C:\Users\User.Name\Documents\Python Scripts>set "HADOOP_HOME=C:\Users\User.Name\AppData\Local\Hadoop\2.7.1"
C:\Users\User.Name\Documents\Python Scripts>set "JAVA_HOME=C:\Users\User.Name\anaconda3\envs\py39\Library"
C:\Users\User.Name\Documents\Python Scripts>set "SPARK_HOME=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_DRIVER_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYSPARK_PYTHON=Python"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>set "PYTHONPATH=C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python;C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;"
C:\Users\User.Name\Documents\Python Scripts>start "" "C:\Users\User.Name\anaconda3\pythonw.exe" ^
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\cwp.py" "C:\Users\User.Name\anaconda3\envs\py39" ^
[main 2023-09-02T23:29:02.117Z] update#setState idle
[main 2023-09-02T23:29:04.434Z] WSL is not installed, so could not detect WSL profiles

The VS Code app then appears, opened to a file cwp.py (the 2nd argument supplied to start in SpyderSpark.cmd). When I exit VS Code, the following additional messages are printed to the terminal console, followed by the appearance of the Spyder app:

[main 2023-09-02T23:29:09.998Z] Extension host with pid 21404 exited with code: 0, signal: unknown.
C:\Users\User.Name\Documents\Python Scripts>"C:\Users\User.Name\anaconda3\envs\py39\pythonw.exe" "C:\Users\User.Name\anaconda3\envs\py39\Scripts\spyder-script.py"

When I exit Spyder, the terminal console then disappears.

2023-09-06 afternote: According to Mofi, the cause of all this unexpected behaviour is incorrect parsing of the Spyder launching command as a multi-line statement. Specifically, the caret symbol at the end of a physical line indicates the continuation of the statement on the next line, and this caret should not be preceded by a space. Rather, the next physical line, which the statement continues onto, should start with a space. With this fix, arguments to start can be double-quoted and the script still launches Spyder in the expected manner. Here is the revised and properly working SpyderSpark.cmd:

set "HADOOP_HOME=%USERPROFILE%\AppData\Local\Hadoop\2.7.1"
set "JAVA_HOME=%USERPROFILE%\anaconda3\envs\py39\Library"
set "SPARK_HOME=%USERPROFILE%\anaconda3\envs\py39\lib\site-packages\pyspark"
set "PYSPARK_DRIVER_PYTHON=Python"
set "PYSPARK_PYTHON=Python"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python\lib\py4j-0.10.9.7-src.zip;%PYTHONPATH%"
set "PYTHONPATH=%SPARK_HOME%\python\lib\site-packages\pyspark\python;%PYTHONPATH%"

start ""^
 "%USERPROFILE%\anaconda3\pythonw.exe"^
 "%USERPROFILE%\anaconda3\cwp.py"^
 "%USERPROFILE%\anaconda3\envs\py39"^
 "%USERPROFILE%\anaconda3\envs\py39\pythonw.exe"^
 "%USERPROFILE%\anaconda3\envs\py39\Scripts\spyder-script.py"

Other than for aesthetics, I have not seen described anywhere this prescription to avoid a space before the caret and to start the next line with a space. However, it works. In this specific case, the need to start a continuation line with a space could be due to the fact that the first character is ", which is meant to delimit a file path but is not part of the file path. Since the first character of a continuation line is automatically escaped, we do not want the " to be the first character or else it loses its special meaning.

Solution

As noted in section Further troubleshooting 2023-09-06 in the question, start.py removes PYTHONPATH paths from sys.path. With a bit more test driving of Spyder, one possible reason became clear: Spyder maintains its own PYTHONPATH as a tool parameter (Tools -> PYTHONPATH manager).

This Q&A shows how to add to sys.path the paths from the PYTHONPATH of the shell that executed Spyder, ideally testing for their presence beforehand. One could write a script to do this and specify the script's full path via Spyder's Tools -> Preferences -> IPython console -> Startup (tab).
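A startup script along those lines might look like the sketch below. It is not taken from the linked Q&A. Because os.environ.get("PYTHONPATH") returned None in the Spyder console, the sketch reads the paths from a differently named variable, SPYDER_EXTRA_PATHS. That name is made up here (Spyder does not define it) and would have to be set in SpyderSpark.cmd alongside the other variables; the function name add_missing_paths is likewise hypothetical.

```python
import os
import sys


def add_missing_paths(raw, path_list=None):
    """Append each entry of the os.pathsep-separated string raw to path_list
    (sys.path by default) unless it is empty or already present.
    Returns the entries that were actually added."""
    if path_list is None:
        path_list = sys.path
    # Compare in normalized form so that case or slash differences on
    # Windows do not produce duplicates.
    present = {os.path.normcase(os.path.normpath(p)) for p in path_list if p}
    added = []
    for entry in raw.split(os.pathsep):
        if not entry:
            continue  # e.g. the empty item left by a trailing separator
        key = os.path.normcase(os.path.normpath(entry))
        if key not in present:
            path_list.append(entry)
            present.add(key)
            added.append(entry)
    return added


# SPYDER_EXTRA_PATHS is a hypothetical variable name, not something Spyder defines.
add_missing_paths(os.environ.get('SPYDER_EXTRA_PATHS', ''))
```

Skipping empty entries matters here because the PYTHONPATH built by SpyderSpark.cmd ends with a trailing semicolon, as the transcript in the question shows.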

It's not clear to me why Spyder was designed to maintain its own PYTHONPATH rather than inheriting it from the environment. PYTHONPATH depends on how other Python packages/modules are set up, which changes piece-wise with time. It's hazardous to manage manually.

It is interesting to see that in 2019, Spyder did inherit PYTHONPATH from the environment if Tools -> Preferences -> Python interpreter -> Python interpreter was set to Default (i.e. the same as Spyder's) (see this GitHub bug ticket). In this bug ticket, it was going to be fixed imminently, so it seems that the fix was to use the PYTHONPATH from Spyder's PYTHONPATH manager in all cases.

Just as a test, however, I set Python interpreter to C:\Users\User.Name\anaconda3\envs\py39\python.exe, which was the path shown by issuing where python from the Conda prompt of the py39 environment. After each of the following tests, import os; os.environ.get("PYTHONPATH") still showed nothing:

exit() from the Spyder console, which I guess restarts the python interpreter
Exiting and restarting Spyder

Also relevant from 2019 is this request "Avoid dropping predefined PYTHONPATH when using an external interpreter". I'm not familiar enough with GitHub to follow what happened to this request, so if anyone can weigh in, thanks!

Finally, I also found that unless you put custom paths into PYTHONPATH, it doesn't need to propagate into Spyder.

In the posted question, the two paths added to PYTHONPATH are from issuing os.environ.get("PYTHONPATH") after launching pyspark from within myspark.cmd. Even though PYTHONPATH was empty within Spyder, I found that the two paths are included in sys.path, directly or indirectly.

The first of the two PYTHONPATH paths is %SPARK_HOME%\python\lib\py4j-0.10.9.7-src.zip. If I try to add this to Spyder's PYTHONPATH manager, it complains that the path is invalid. Examining the files therein and using Cygwin's find on folder %USERPROFILE%\anaconda3\envs\py39 reveals that the files are unpacked into package folder %USERPROFILE%\anaconda3\envs\py39\Lib\site-packages\py4j. If I correctly understand my readings about packages, package py4j is available to Python because sys.path already includes %USERPROFILE%\anaconda3\envs\py39\lib\site-packages. Why PYTHONPATH pointed to the zip file from within a pyspark session, I do not know.
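Rather than inferring this from the folder layout, one can ask Python where it would load a package from, using importlib.util.find_spec from the standard library. A minimal sketch follows. The helper name where_is is made up, and json appears in the demonstration only because it is always available; in the py39 environment the name of interest would be py4j.

```python
import importlib.util


def where_is(package):
    """Return the file a top-level package would be imported from,
    or None if it cannot be found (hypothetical helper)."""
    spec = importlib.util.find_spec(package)
    return None if spec is None else spec.origin


if __name__ == '__main__':
    # 'py4j' prints None on an interpreter where it is not installed.
    for name in ('json', 'py4j'):
        print(name, '->', where_is(name))
```

If the reported location for py4j is under site-packages, the package is being found through the environment's normal sys.path rather than through the zip file.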

The second PYTHONPATH path is %SPARK_HOME%\python. This is already included in sys.path within Spyder.

A question that remained was whether Spyder startup scripts figured out to add these two paths to sys.path because SpyderSpark.cmd added them to PYTHONPATH beforehand. I removed the setting of PYTHONPATH from SpyderSpark.cmd and sure enough, the two paths were still present in sys.path within Spyder.

It seems, therefore, that the scripts responsible for launching Spyder also add the necessary paths to sys.path without the need for PYTHONPATH. I suspect that this is because Conda sets up the environment with all the necessary dependencies.

I also confirmed that if PYTHONPATH did contain custom paths prior to launching Spyder, they did not propagate to sys.path within Spyder. This is consistent with the fact that the py4j zip file pointed to by PYTHONPATH also didn't propagate into sys.path. I presume, therefore, that one would need to use Spyder's PYTHONPATH manager for custom paths or add the custom paths to sys.path using code.

Answered By - user2153235

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
