Issue
I am using Spark with Python, both interactively by launching the command pyspark from the Terminal and by running an entire script with the command spark-submit pythonFile.py. I am using it to analyze a local CSV file, so no distributed computation is performed.
I would like to use the library matplotlib to plot columns of a dataframe, but when importing matplotlib I get the error ImportError: No module named matplotlib. I then came across this question and tried the command sc.addPyFile(), but I could not find any file related to matplotlib on my OS (OSX) to pass to it. For this reason I created a virtual environment and installed matplotlib in it. Navigating through the virtual environment I saw there was no file such as matplotlib.py, so I tried to pass the entire folder with sc.addPyFile("venv/lib/python3.7/site-packages/matplotlib"), but again with no success. At this point I do not know which file I should include, or how, and I have run out of ideas.
Is there a simple way to import the matplotlib library inside Spark (installing it with virtualenv or referencing the OS installation)? And if so, which *.py files should I pass to the command sc.addPyFile()? Again, I am not interested in distributed computation: the Python code will run only locally on my machine.
Solution
I will post what I have done. First of all, I am working with virtualenv, so I created a new environment with virtualenv path, activated it with source path/bin/activate, and installed the packages I needed with pip3 install packageName.
After that I wrote a Python script that creates a zip archive of the libraries installed by virtualenv under ./path/lib/python3.7/site-packages/. The code of this script is the following (here it zips only numpy):
import zipfile
import os

# function to archive a single package
def ziplib(general_path, libName):
    libpath = os.path.join(general_path, libName)  # path of the package directory to archive
    zippath = libName + '.zip'  # output archive in the current (writable) directory
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # make it verbose, good for debugging
        zf.writepy(libpath)
        return zippath  # return the path to the generated zip archive
    finally:
        zf.close()

general_path = './path/lib/python3.7/site-packages/'
matplotlib_name = 'matplotlib'
seaborn_name = 'seaborn'
numpy_name = 'numpy'

zip_path = ziplib(general_path, numpy_name)  # generate the zip archive containing the library
print(zip_path)
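If you need all three archives, the same function can be called in a loop; this is a small sketch reusing ziplib and the names defined above:

# build one archive per package, printing each generated path
for name in [matplotlib_name, seaborn_name, numpy_name]:
    print(ziplib(general_path, name))  # matplotlib.zip, seaborn.zip, numpy.zip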
After that, the archives must be referenced in the pyspark file myPyspark.py. You do this by calling the method addPyFile() of the SparkContext class; afterwards you can import the packages in your code as usual. In my case I did the following:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.addPyFile("matplotlib.zip")  # generated with testZip.py
sc.addPyFile("numpy.zip")  # generated with testZip.py

import matplotlib
import numpy
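Once the imports succeed, plotting a column of the dataframe works as usual. A minimal sketch of that last step, assuming a local CSV file data.csv with a numeric column y (both the file name and the column name are hypothetical):

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: write the plot to a file
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # local file, no distributed work

# collect the column to the driver (fine here: the data is small and local)
y = [row["y"] for row in df.select("y").collect()]
plt.plot(y)
plt.savefig("plot.png")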
When you launch the script, you have to reference the zip archives on the command line with --py-files, which takes a comma-separated list. For example:
sudo spark-submit --py-files matplotlib.zip,numpy.zip myPyspark.py
I considered two archives because it was clear to me how to import one of them but not two.
Answered By - Francesco Boi