Issue
Context: in Gaussian Process (GP) regression we can use two approaches:
(I) Fit the kernel parameters via Maximum Likelihood (maximize data likelihood) and use the GP defined by these parameters for prediction.
(II) Bayesian approach: put a parametric prior distribution on the kernel parameters. The parameters of this prior distribution are called the hyperparameters. Condition on the data to obtain a posterior distribution for the kernel parameters and now either
(IIa) fit the kernel parameters by maximizing their posterior density (the MAP estimate) and use the GP defined by the MAP parameters for prediction, or
(IIb) (the full Bayesian approach): predict with the mixture model obtained by integrating the GP predictions over the posterior distribution of the kernel parameters (the two predictive rules are written out below).
(IIb) is the principal approach advocated in the reference [RW2006] cited in the package.
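Written out in my notation (zero-mean GP assumed, \theta the kernel parameters, (X, y) the training data, x_* a test input), the two predictive rules are:

(I): \hat{\theta} = \arg\max_{\theta} \log p(y \mid X, \theta), then predict with p(f_* \mid x_*, X, y, \hat{\theta}).

(IIb): p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, X, y, \theta)\, p(\theta \mid X, y)\, d\theta, where p(\theta \mid X, y) \propto p(y \mid X, \theta)\, p(\theta).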
The point is that hyperparameters exist only in the Bayesian approach and are the parameters of the prior distribution on kernel parameters.
Therefore I am confused about the use of the term "hyperparameters" in the documentation, e.g. here where it is stated that "Kernels are parameterized by a vector of hyperparameters".
This must be interpreted as a sort of indirect parameterization via conditioning on the data, since the hyperparameters do not directly determine the kernel parameters. Then an example is given of the exponential kernel and its length-scale parameter. That is definitely not a hyperparameter in the sense in which the term is generally used.
No distinction seems to be drawn between kernel parameters and hyperparameters. This is confusing, and it is now unclear whether the package uses the Bayesian approach at all. For example, where do we specify the parametric family of prior distributions on the kernel parameters?
Question: does scikit-learn use approach (I) or (II)?
Here is my own tentative answer: the confusion comes from the fact that a Gaussian Process is often called a "prior on functions", which suggests some sort of Bayesianism. Worse still, the process is infinite-dimensional, so restricting to the finite data dimensions is some sort of "marginalization". This is also confusing, since in general marginalization appears only in the Bayesian approach, where you have a joint distribution of data and parameters and often marginalize out one or the other.
The correct view here, however, is the following: the Gaussian Process is the model and the kernel parameters are the model parameters; in scikit-learn there are no hyperparameters, since there is no prior distribution on the kernel parameters; the so-called LML (log marginal likelihood) is the ordinary data likelihood given the model parameters, and the parameter fit is ordinary maximum data likelihood. In short, the approach is (I) and not (II).
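To be explicit about the word "marginal": for a zero-mean GP with Gaussian observation noise (the standard setup, my notation), the LML marginalizes out the latent function values f, not the kernel parameters \theta:

\log p(y \mid X, \theta) = \log \int p(y \mid f)\, p(f \mid X, \theta)\, df = -\tfrac{1}{2} y^\top (K_\theta + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log\lvert K_\theta + \sigma_n^2 I\rvert - \tfrac{n}{2}\log 2\pi.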
Solution
If you read the scikit-learn documentation on GP regression, you clearly see that the kernel (hyper)parameters are optimized. Take a look, for example, at the description of the argument n_restarts_optimizer: "The number of restarts of the optimizer for finding the kernel’s parameters which maximize the log-marginal likelihood." In the terms of your question, that is approach (I).
I would note two more things though:
- In my mind, the fact that they are called "hyperparameters" here simply means that they are treated as deterministic quantities that can be estimated directly. Otherwise, they would be random variables, which is why they would have a distribution. Another way to think of it: did you define a prior for it? If not, then it is a parameter. If you did, then the prior's own hyperparameter(s) may be what needs to be determined.
- Note that the GaussianProcessRegressor class "exposes a method log_marginal_likelihood(theta), which can be used externally for other ways of selecting hyperparameters, e.g., via Markov chain Monte Carlo." So, technically it is possible to make it "fully Bayesian" (your approach (IIb)), but you must provide the inference method yourself; a rough sketch follows below.
Answered By - ATony