Issue
I am building a logistic regression model on Snowflake using Python. I ran the same logistic regression in R locally, but want to move it to my Snowflake data warehouse. I'm having some success, but I'm not nearly as familiar with Python as I am with R.
I believe the regression is fitting and producing a model. I don't really know what the predicted probabilities look like, but that is a secondary concern at this point.
I just want to return a Snowflake DataFrame built from a pandas DataFrame, and I can't get it to work.
Below is a snippet of my code.
import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session):
    #
    # EVERYTHING BEFORE WHAT'S BELOW IS DATA TRANSFORMATION, ALL OF IT WORKS JUST FINE
    # AS FAR AS I KNOW
    # ind_cols and dep_cols are arrays of column names
    # defining which columns are independent variables and which are dependent.
    # Here I split the sample into independent and dependent columns,
    # and use LogisticRegression from scikit-learn.
    X = full_sample[ind_cols].to_pandas()
    y = full_sample[dep_col].to_pandas()
    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    ret_df_lm = ret_df[ind_cols].to_pandas()
    lm = LogisticRegression()
    lm.fit(X, y)
    y_pred = lm.predict_proba(ret_df_lm)
    y_final = session.table(y_pred)
    #retention_pred = lm.predict(ret_df)
    return y_final
When I try to return y_final I get the error TypeError: sequence item 0: expected str instance, numpy.ndarray found. I've got to be missing something. I've tried other things, like Snowflake's session.write_pandas(), but I'm not sure it's what I need.
How do I get y_final to be a Snowflake DataFrame?
Solution
I fixed your code with the following observations:
- I had to generate random data to stand in for your tables.
- The original error came from session.table(y_pred): session.table expects a table name (a string), not the NumPy array that predict_proba returns.
- To return a Snowpark DataFrame you need to convert the pandas one: return session.create_dataframe(y_final).
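The TypeError from the question can be reproduced without Snowflake. This is a minimal sketch that assumes, based on the error message rather than anything stated in the post, that session.table treats a non-string iterable as the parts of a table name and joins them with "."; the array values are made up:

```python
import numpy as np

# Made-up stand-in for the output of lm.predict_proba(...):
# one row per sample, one column per class.
y_pred = np.array([[0.3, 0.7], [0.6, 0.4]])

# Assumption: session.table joins an iterable of name parts
# (database, schema, table) with ".". Doing the same by hand on the
# array fails because its items are ndarray rows, not strings.
try:
    ".".join(y_pred)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

If that assumption holds, the printed message should match the one in the question, which explains why the fix is to build a DataFrame from the array rather than to look it up as a table.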
# The Snowpark package is required for Python Worksheets.
# You can add more packages by selecting them using the Packages control and then importing them.
import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd
import numpy as np

def main(session: snowpark.Session):
    #X = full_sample[ind_cols].to_pandas()
    #y = full_sample[dep_col].to_pandas()

    # Number of samples and features
    n_samples = 100  # for example, 100 samples
    n_features = 5  # for example, 5 features

    # Generate random data for X
    np.random.seed(0)  # for reproducibility
    X_data = np.random.rand(n_samples, n_features)
    X = pd.DataFrame(X_data, columns=[f'feature_{i}' for i in range(n_features)])

    # Generate random binary data for y
    y_data = np.random.randint(2, size=n_samples)
    y = pd.DataFrame(y_data, columns=['target'])

    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    # ret_df_lm = ret_df[ind_cols].to_pandas()
    ret_df_data = np.random.rand(n_samples, n_features)
    ret_df = pd.DataFrame(ret_df_data, columns=[f'feature_{i}' for i in range(n_features)])

    lm = LogisticRegression()
    # Pass the target as a 1-D Series; a one-column DataFrame also works
    # but makes scikit-learn emit a DataConversionWarning.
    lm.fit(X, y['target'])
    y_pred = lm.predict_proba(ret_df)

    # y_final = session.table(y_pred)
    #retention_pred = lm.predict(ret_df)
    # predict_proba returns a NumPy array with one column per class
    y_final = pd.DataFrame(y_pred, columns=['Prob_0', 'Prob_1'])

    # return a Snowpark DataFrame instead of a Pandas one
    return session.create_dataframe(y_final)
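The column names Prob_0 and Prob_1 above assume a binary target labelled 0 and 1. As a slightly more general sketch, the columns can be named from the class labels, since scikit-learn orders the predict_proba columns like lm.classes_. The helper name proba_to_frame and the sample values are hypothetical and only for illustration:

```python
import numpy as np
import pandas as pd

def proba_to_frame(y_pred, classes, prefix="Prob_"):
    """Wrap a (n_samples, n_classes) probability array in a pandas DataFrame.

    `classes` is expected to be lm.classes_, which gives the column order
    of lm.predict_proba. Hypothetical helper, not part of any library.
    """
    y_pred = np.asarray(y_pred)
    if y_pred.ndim != 2 or y_pred.shape[1] != len(classes):
        raise ValueError("y_pred must have one column per class")
    return pd.DataFrame(y_pred, columns=[f"{prefix}{c}" for c in classes])

# Made-up values standing in for lm.predict_proba(ret_df) and lm.classes_.
y_final = proba_to_frame([[0.3, 0.7], [0.6, 0.4]], classes=[0, 1])
print(list(y_final.columns))
```

Inside main, the result would then be handed back the same way as above: return session.create_dataframe(proba_to_frame(y_pred, lm.classes_)).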
Answered By - Felipe Hoffa