Issue
I'm trying to create a smooth function to represent some data I am working with. Trouble is the data is pretty noisy, and simply using the least squares method to create a polynomial line of best fit does not come out that well.
polynomial = np.polyfit(df.x, df.y, 5)  # degree 5 just as an example; polyfit needs a degree
After some digging, it seems that for this sort of thing some people first transform their y data into its natural log, which produces a nicer curve.
df['y'] = np.log(df['y'])
log_polynomial_coeffs = np.polyfit(df.x, df.y, 5)  # again, degree 5 as an example
And lo and behold it seems to work. When I turn this polynomial into a function and graph a bunch of points (by just taking e to the power of the output), it looks like the kind of thing I want:
log_polynomial = np.poly1d(log_polynomial_coeffs)
x_values = np.arange(0, 100, .01)
plt.plot(
    x_values,
    np.exp(log_polynomial(x_values))
)
However, this requires me to transform the input data into log data, then find a polynomial of best fit, then turn that polynomial into a function, then turn that function into a bunch of discrete points, then make another polynomial of best fit around those points. Because this is something I am going to run a lot of times, that's a lot of extra steps.
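For reference, the full round-trip I'm describing looks roughly like this (just a sketch; the degree of 5 and the log_y column name are arbitrary):
df['log_y'] = np.log(df['y'])                      # transform to log space
log_coeffs = np.polyfit(df.x, df.log_y, 5)         # fit in log space
log_polynomial = np.poly1d(log_coeffs)             # turn coefficients into a callable
x_values = np.arange(0, 100, .01)                  # sample discrete points
y_values = np.exp(log_polynomial(x_values))        # transform back to non-log space
final_coeffs = np.polyfit(x_values, y_values, 5)   # refit in non-log space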
Is there a way to take a polynomial that I got using log inputs and transform it back into non-log space without manually exponentiating all of the outputs to get discrete points and then fitting another line of best fit to those points in non-log space?
*edit: Here is some more background on my use case. I have a set of discrete points that form a curve. The second derivative of this curve contains important information that I want to extract cleanly, but if I take the second derivative without any smoothing it ends up very noisy, over-interpreting small kinks in the curve as huge spikes in the second derivative.
I have tried various things to smooth out the curve, but everything else I've tried either failed to adequately address the noise or disrupted the inherent signal I am trying to isolate. Here's what I've tried so far:
Least squares line of best fit in non-log space
polynomial = np.polyfit(df.x, df.y, 5)
1 dimensional gaussian filter
scipy.ndimage.gaussian_filter1d(df.y, 3)
Cubic spline
t,c,k=scipy.interpolate.splrep(df.x,df.y,k=3)
B-spline
spline = scipy.interpolate.BSpline(t, c, k, extrapolate=False)
The reason I am trying to log transform the data before finding the polynomial of best fit is that I found some research papers that use this method for the exact use case I am working on, and when I test it on my data it seems to work well.
*edit2: As @NickODell astutely pointed out, many polynomials cannot be transformed back to non-log space and still be represented in the standard Python polynomial format of a list of coefficients for integer powers of x. So, as far as I can tell, directly turning the log-space polynomial back into a polynomial in non-log space is effectively impossible.
What I really want, though, is the second derivative of my input data, free of noise and retaining all of the signal. I was hopeful that I might be able to do this by transforming the log-space polynomial directly back to non-log space and differentiating it directly, but apparently that's not going to be possible. However, it can still be represented as a function that I can differentiate numerically with np.gradient, which would effectively allow the same thing. I'm less confident this would actually significantly increase the efficiency of the program, though. I'll try messing around with some profiling and see if I can get a handle on how big a deal this is.
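Something like this is what I have in mind for the np.gradient route (just a sketch):
x_values = np.arange(0, 100, .01)
y_values = np.exp(log_polynomial(x_values))   # evaluate back in non-log space
dy_dx = np.gradient(y_values, x_values)       # first derivative
d2y_dx2 = np.gradient(dy_dx, x_values)        # second derivative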
*edit3 I want to thank @jlandercy and @Nick ODell for their clear and thorough answers. I wish I could select both of them as the answer.
Solution
If I correctly understand your requirements, you want to:
- Apply a log transform on the y axis;
- Fit polynomial data with heavy noise;
- Predict values at new x points;
- Take the second derivative of the fitted curve;
- Automate the process, which must be performed multiple times.
Sklearn is a perfect package for performing such operations, through Pipelines.
First, we create some noisy polynomial data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
np.random.seed(12345)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
P = np.poly1d([6, 5, 4, 3, 2, 10])  # generating polynomial (highest degree first)
y = P(x[:, 0])                      # clean signal
n = 0.5 * np.random.randn(y.size)   # Gaussian noise
yn = y + n                          # noisy observations
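Since the log transform used below requires strictly positive targets, it is worth checking that the noisy data stays positive:
print(yn.min())  # must be > 0 for the log transform below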
Simple regression
As a baseline, we create a pipeline that fits a polynomial to your data by linear regression:
pipeline_lin = Pipeline([
    ("transformer", PolynomialFeatures(5)),
    ("model", LinearRegression(fit_intercept=False))
])
And fit it to the noisy dataset:
pipeline_lin.fit(x, yn)
pipeline_lin["model"].coef_
# array([10.0226118 , 1.51830909, 2.23638955, 2.91665499, 5.99904944, 7.88614861])
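For comparison, the true coefficients of the generating polynomial, in the same ascending order that PolynomialFeatures produces (1, x, x², …):
print(P.coeffs[::-1])  # [10  2  3  4  5  6]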
Log Transform
If you wish to apply a logarithmic transformation to the target (the y axis; note this requires strictly positive values), we can change the pipeline to:
pipeline_log = Pipeline([
    ("transformer", PolynomialFeatures(5)),
    ("model", TransformedTargetRegressor(
        regressor=LinearRegression(fit_intercept=False),
        func=np.log,
        inverse_func=np.exp
    ))
])
pipeline_log.fit(x, yn)
pipeline_log["model"].regressor_.coef_
# array([ 2.2901081 , 0.15049063, 0.46685674, 0.32678866, -0.12522207, 0.34130133])
Naturally, these coefficients are not directly comparable to the original polynomial's, since the fit was performed in log space.
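If you want to see the relationship explicitly, the pipeline's predictions are simply the exponential of the log-space polynomial evaluated on the expanded features (a quick sanity check, using only objects already defined above):
X_poly = pipeline_log["transformer"].transform(x)
coef_log = pipeline_log["model"].regressor_.coef_
manual = np.exp(X_poly @ coef_log)
print(np.allclose(manual, pipeline_log.predict(x)))  # True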
Now we can predict target values at new feature points without having to handle the forward and inverse log transforms ourselves:
xlin = np.linspace(-1, 1, 200).reshape(-1, 1)
yh_lin = pipeline_lin.predict(xlin)
yh_log = pipeline_log.predict(xlin)
fig, axe = plt.subplots()
axe.scatter(x, yn, marker=".", label="Data")
axe.plot(xlin, yh_lin, label="Linear")
axe.plot(xlin, yh_log, label="Log Linear")
axe.legend()
axe.grid()
Automation
The process can be repeated multiple times by simply chaining fit and predict:
# Iterate datasets:
for i in range(10):
    x = ...   # Select X data
    yn = ...  # Select Y data
    # Fit and predict:
    yhat = pipeline_log.fit(x, yn).predict(xlin)
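A more concrete sketch, assuming (hypothetically) that your datasets are stored as a list of (x, y) array pairs called datasets:
results = []
for x_i, y_i in datasets:  # datasets is a hypothetical list of (x, y) pairs of 1-D arrays
    xlin_i = np.linspace(x_i.min(), x_i.max(), 200).reshape(-1, 1)
    yhat_i = pipeline_log.fit(x_i.reshape(-1, 1), y_i).predict(xlin_i)
    results.append((xlin_i, yhat_i))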
Alternative
If you don't know the polynomial order and don't require the fit to be a polynomial regression (e.g. you just want to predict plausible values), then a Gaussian Process can be a good alternative:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

pipeline_gpr = Pipeline([
    ("model", TransformedTargetRegressor(
        regressor=GaussianProcessRegressor(
            kernel=1.0 * RBF(),
            alpha=0.01
        ),
        func=np.log,
        inverse_func=np.exp
    ))
])
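It is fitted and used exactly like the previous pipelines; this also defines yh_gpr, which appears in the derivative section below:
pipeline_gpr.fit(x, yn)
yh_gpr = pipeline_gpr.predict(xlin)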
Taking second derivative
Numerically
A potential candidate for estimating the second derivative is savgol_filter from scipy:
from scipy import signal
dx = xlin[1,0] - xlin[0,0]
yh_lin_d2 = signal.savgol_filter(yh_lin, 15, 3, deriv=2, delta=dx)
yh_log_d2 = signal.savgol_filter(yh_log, 15, 3, deriv=2, delta=dx)
yh_gpr_d2 = signal.savgol_filter(yh_gpr, 15, 3, deriv=2, delta=dx)
Caution: taking the derivative (or, even worse, the second derivative) of a fit is not a trivial operation. As expected, the second derivative of the 5th-order polynomial fit (Linear) is a 3rd-order polynomial.
From regressed coefficients
If the fitted model is a polynomial, we can use the regressed coefficients to compute a derivative of any order.
Phat = np.poly1d(pipeline_lin["model"].coef_[::-1])  # poly1d expects coefficients in descending order
Phat_xx = Phat.deriv(m=2)
This can be used to evaluate the second derivative at new points:
yh_d2 = Phat_xx(np.squeeze(xlin))
This solution agrees closely with yh_lin_d2.
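A quick way to check the agreement:
print(np.max(np.abs(yh_d2 - yh_lin_d2)))  # residual of the Savitzky-Golay approximation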
Proposal
Here is a complete proposal for computing the second derivative of a set of points that follow a polynomial trend:
def fit_diff(x, y, poly_order=5, resolution=200):
    pipeline = Pipeline([
        ("transformer", PolynomialFeatures(poly_order)),
        ("model", LinearRegression(fit_intercept=False))
    ])
    xlin = np.linspace(x.min(), x.max(), resolution).reshape(-1, 1)
    yhat = pipeline.fit(x, y).predict(xlin)
    dx = xlin[1, 0] - xlin[0, 0]
    dy2dx2 = signal.savgol_filter(yhat, 15, 3, deriv=2, delta=dx)
    return dy2dx2
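For example, applied to the noisy synthetic data from above:
d2 = fit_diff(x, yn)  # second derivative estimated on a 200-point grid spanning x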
Answered By - jlandercy