Issue
I am interested in testing many of the internal classes and functions defined within sklearn (e.g. adding a print statement to the tree builder so I can see how the tree got built). However, since many of the internals are written in Cython, I would like to learn the best practices and workflows for testing these functions in a Jupyter notebook.
For example, I managed to import the Stack class from the tree._utils module. I was even able to construct it, but unable to call any of its methods. Any thoughts on what I should do in order to call and test cdef classes and their methods from Python?
%%cython
from sklearn.tree import _utils
s = _utils.Stack(10)
print(s.top())
# AttributeError: 'sklearn.tree._utils.Stack' object has no attribute 'top'
Solution
There are some problems which must be solved in order to be able to use the C-interfaces of the internal classes.
First problem (skip if your sklearn version is >= 0.21.x):
Until version 0.21.x, sklearn used implicit relative imports (as in Python2); compiling it with Cython's language_level=3 (the default in IPython3) does not work. So for versions < 0.21.x, language_level=2 must be set (i.e. %%cython -2), or better yet, scikit-learn should be updated.
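On such an older installation, a cell would then start like this (a minimal sketch; the ... stands for the actual cimports and code):
%%cython -2
# compile with language_level=2, matching the Python2-style
# implicit relative imports of sklearn < 0.21.x
...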
Second problem:
We need to include the path to the numpy headers. Let's take a look at a simpler version:
%%cython
from sklearn.tree._tree cimport Node
print("loaded")
which fails with the unhelpful error "command 'gcc' failed with exit status 1" - the real reason can only be seen in the terminal, where gcc writes its error message (and not in the notebook):
fatal error: numpy/arrayobject.h: No such file or directory compilation terminated.
_tree.pxd uses the numpy-API, and thus we need to provide the location of the numpy headers. That means we need to add include_dirs=[numpy.get_include()] to the Extension definition. There are two ways to do it in the %%cython-magic. The first is via the -I option:
%%cython -I <path from numpy.get_include()>
...
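The concrete path can be looked up in a separate Python cell first (numpy.get_include() is part of numpy's public API; the printed path below is only an example):
import numpy
print(numpy.get_include())  # e.g. .../site-packages/numpy/core/include
and the printed path is then pasted into the -I option of the magic.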
or the somewhat dirtier trick of exploiting the fact that the %%cython magic adds the numpy includes automatically whenever it sees the string "numpy" in the cell, so that a comment like
%%cython
# requires numpy headers
...
is enough.
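With this trick, the simpler version from above compiles:
%%cython
# requires numpy headers   (triggers the automatic numpy include)
from sklearn.tree._tree cimport Node
print("loaded")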
Last but not least:
Note: since 0.22 this is no longer an issue, as the pxd-files are included in the installation (see this).
The pxd-files must be present in the installation for us to be able to cimport them. This is the case for the pxd-files from the sklearn.tree subpackage, as one can see in the local setup.py-file (given this PR, this seems to be a more or less random decision without a strategy behind it):
...
config.add_data_files("_criterion.pxd")
config.add_data_files("_splitter.pxd")
config.add_data_files("_tree.pxd")
config.add_data_files("_utils.pxd")
...
but not for some other cython-extensions, in particular not for the sklearn.neighbors subpackage. Now, that is a problem for your example:
%%cython
# requires numpy headers
from sklearn.tree._utils cimport Stack
s = Stack(10)
print(s.top())
fails to be cythonized, because _utils.pxd cimports data structures from the pxd-files of neighbors:
...
from sklearn.neighbors.quad_tree cimport Cell
...
which are not present in the installation.
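Whether a subpackage ships its pxd-files can be checked directly in the installation, for example (a small sketch; the exact file lists depend on the installed version):
import pathlib
import sklearn.tree, sklearn.neighbors

for pkg in (sklearn.tree, sklearn.neighbors):
    pxds = sorted(p.name for p in pathlib.Path(pkg.__path__[0]).glob("*.pxd"))
    print(pkg.__name__, "->", pxds)
# for versions < 0.22, sklearn.tree lists its pxd-files
# while sklearn.neighbors comes up empty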
The situation is described in more detail in this SO-post; your options to build are (as described in the link):
- copy the pxd-files to the installation (see the sketch below)
- reinstall from the downloaded source with pip install -e
- reinstall from the downloaded source after manipulating the corresponding local setup.py-files.
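A minimal sketch of the first option, assuming a source checkout of the matching scikit-learn version in ./scikit-learn (the path is hypothetical, adjust it to your setup):
import pathlib, shutil
import sklearn.neighbors

src = pathlib.Path("./scikit-learn/sklearn/neighbors")  # assumed source checkout
dst = pathlib.Path(sklearn.neighbors.__path__[0])       # installed subpackage
for pxd in src.glob("*.pxd"):
    shutil.copy(pxd, dst)                               # now cimport-able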
Another option is to ask the developers of sklearn to include pxd-files into the installation, so not only building but also distribution becomes possible.
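Once the pxd-files are available, the cell from your example cythonizes - but note that cdef-methods still cannot be called from pure Python; they have to be called from within Cython code. A sketch, assuming the Stack-interface of the 0.21-era _utils.pxd (which declares a cdef method is_empty):
%%cython
# requires numpy headers
from sklearn.tree._utils cimport Stack

cdef Stack s = Stack(10)     # typed variable, so cdef-methods are visible
print(s.is_empty())          # callable here, inside the compiled cell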
Answered By - ead