Issue
I encoded my data using FeatureHasher. The accuracy I get on the training data is about 90 percent, but I am still not able to predict new data points with the model. I guess something is wrong with how I encode a new data point. Please have a look at the training data encoding code:
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split

columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type', 'os', 'country_name', 'city_name', 'class']
attribute_weights = {
    'port_cluster': 0.7,
    'protocol': 0.4,
    'city_name': 0.4,
    'org': 0.5,
    'asn': 0.5,
    'event_type': 0.6,
    'os': 0.3,
    'country_name': 0.4,
    'class': 0.3,
}

# Create a dictionary to store the hashed features
hashed_feature_dict = {}

# Iterate over unique src_ips
unique_src_ips = df['src_ip'].unique()
for src_ip in unique_src_ips:
    src_ip_data = df[df['src_ip'] == src_ip]
    # Initialize the FeatureHasher for the current src_ip
    hasher = FeatureHasher(n_features=20, input_type='string')
    src_ip_hashed_feature_dict = {}
    # Iterate over columns to hash and store hashed features
    for column in columns_to_hash:
        hashed_features = hasher.fit_transform(src_ip_data[column].astype(str).values.reshape(-1, 1))
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        src_ip_hashed_feature_dict[column] = weighted_hashed_features
    hashed_features_array = np.concatenate([src_ip_hashed_feature_dict[column] for column in columns_to_hash], axis=1)
    hashed_feature_dict[src_ip] = hashed_features_array

# Concatenate all hashed features
all_hashed_features = np.concatenate(list(hashed_feature_dict.values()), axis=0)

# Split the data into training and testing sets
X = all_hashed_features
y = df['cluster_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And here is the prediction code for a new data point:
# Create a dictionary to store the hashed features for the new data point
new_data = {
    'src_ip': '65.21.234.90',
    'asn': 24940,
    'country_name': 'Finland',
    'city_name': 'Helsinki',
    'open_ports': ['5060/sip', '2000/ikettle'],
    'protocol': 'UDP',
    'ip_rep': None,
    'first_seen': '2023-08-22T16:56:45.733Z',
    'last_time': '2023-08-22T16:56:45.733Z',
    'class': 'A',
    'event_type': 'sip',
    'event_data': {
        'request_line': 'OPTIONS sip:[email protected] SIP/2.0',
        'uri': 'sip:[email protected]',
        'version': 'SIP/2.0',
        'method': 'OPTIONS'
    },
    'link': None,
    'os': 'None',
    'org': 'Hetzner Online GmbH',
    'port_cluster': -1
}
# Initialize the FeatureHasher for the new data point
# hasher = FeatureHasher(n_features=11, input_type='string')
new_data_hashed_feature_dict = {}

# Iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        # Convert a single string to a list containing that string
        feature_value = [str(new_data[column])]
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features

# Concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash], axis=1)

# Predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
Solution
In your training data encoding, you create a new FeatureHasher for each unique src_ip, while in your prediction code you reuse a generic hasher object. The FeatureHasher class does not learn anything from the data, so the hashing is only consistent if both hashers are configured identically: note your commented-out n_features=11 versus the n_features=20 used in training, which would change the feature width. You need to use the same hashing process, with the same parameters, for both training and prediction to ensure consistency.
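To illustrate that point: because FeatureHasher is stateless, two instances built with the same parameters hash the same input identically, so what actually breaks consistency is a parameter mismatch such as a different n_features. A minimal sketch (the 'UDP' value is just an example taken from the question's data):

```python
from sklearn.feature_extraction import FeatureHasher

# Two independent hashers with identical parameters hash identically,
# because FeatureHasher applies a fixed hash function and learns nothing.
h_train = FeatureHasher(n_features=20, input_type='string')
h_predict = FeatureHasher(n_features=20, input_type='string')

a = h_train.transform([['UDP']]).toarray()
b = h_predict.transform([['UDP']]).toarray()
print((a == b).all())  # True: same parameters, same vector

# A hasher with a different n_features changes the feature width,
# which is what breaks consistency between training and prediction.
h_wrong = FeatureHasher(n_features=11, input_type='string')
print(h_wrong.transform([['UDP']]).toarray().shape)  # (1, 11) instead of (1, 20)
```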
Also, in your training data you concatenate the hashed features of all rows belonging to each src_ip, whereas in your prediction code you hash and concatenate the features of a single record. If the number of hashed features differs between training and prediction, that will throw off your model.
Another remark: you iterate through the dictionary using the columns_to_hash list, but you need to make sure that every key listed in columns_to_hash is actually present in the new_data dictionary. If your code silently skips missing keys, the input at prediction time will no longer have the same shape as the data the model was trained on:
new_data_hashed_feature_dict = {}

# Here I iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        feature_value = [str(new_data[column])]  # Convert a single string to a list containing that string
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features
    else:
        print(f"Warning: Missing key {column} in new data")

# Then I concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash if column in new_data], axis=1)

# Here you predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
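As a sketch of what a single, shared encoding path could look like, here is one possible way to hash a record consistently for both training and prediction. This is a minimal illustration, not your exact pipeline: the helper name encode_row is my own, missing keys are hashed as a 'missing' placeholder instead of being skipped, and the columns and weights are taken from your question.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type',
                   'os', 'country_name', 'city_name', 'class']
attribute_weights = {'port_cluster': 0.7, 'org': 0.5, 'asn': 0.5,
                     'protocol': 0.4, 'event_type': 0.6, 'os': 0.3,
                     'country_name': 0.4, 'city_name': 0.4, 'class': 0.3}

# One hasher, created once and reused everywhere: the same n_features for
# training and prediction guarantees a consistent feature width.
hasher = FeatureHasher(n_features=20, input_type='string')

def encode_row(row):
    """Hash and weight each column of one record (hypothetical helper).

    Missing keys are hashed as the placeholder string 'missing', so the
    output width never depends on which keys the record happens to have.
    """
    parts = []
    for column in columns_to_hash:
        value = str(row.get(column, 'missing'))
        hashed = hasher.transform([[value]]).toarray()
        parts.append(hashed * attribute_weights[column])
    return np.concatenate(parts, axis=1)

# Example record with the fields from the question's new data point
new_data = {'port_cluster': -1, 'org': 'Hetzner Online GmbH', 'asn': 24940,
            'protocol': 'UDP', 'event_type': 'sip', 'os': 'None',
            'country_name': 'Finland', 'city_name': 'Helsinki', 'class': 'A'}
features = encode_row(new_data)
print(features.shape)  # (1, 180): 9 columns x 20 hashed features each
```

Because every record (training row or new data point) goes through the same function and the same hasher, the model always sees a 9 × 20 = 180-wide vector, which removes the shape mismatch discussed above.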
Answered By - Amira Bedhiafi