Issue
I encoded my data using FeatureHasher. The accuracy I get on the training data is about 90 percent, but I am still not able to predict new data points with the model. I guess something is wrong with how I encode a new data point. Please have a look at the training data encoding code:
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split

columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type', 'os', 'country_name', 'city_name', 'class']
attribute_weights = {
    'port_cluster': 0.7,
    'protocol': 0.4,
    'city_name': 0.4,
    'org': 0.5,
    'asn': 0.5,
    'event_type': 0.6,
    'os': 0.3,
    'country_name': 0.4,
    'class': 0.3,
}

# Create a dictionary to store the hashed features
hashed_feature_dict = {}

# Iterate over unique src_ips
unique_src_ips = df['src_ip'].unique()
for src_ip in unique_src_ips:
    src_ip_data = df[df['src_ip'] == src_ip]
    # Initialize the FeatureHasher for the current src_ip
    hasher = FeatureHasher(n_features=20, input_type='string')
    src_ip_hashed_feature_dict = {}
    # Iterate over columns to hash and store hashed features
    for column in columns_to_hash:
        hashed_features = hasher.fit_transform(src_ip_data[column].astype(str).values.reshape(-1, 1))
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        src_ip_hashed_feature_dict[column] = weighted_hashed_features
    hashed_features_array = np.concatenate([src_ip_hashed_feature_dict[column] for column in columns_to_hash], axis=1)
    hashed_feature_dict[src_ip] = hashed_features_array

# Concatenate all hashed features
all_hashed_features = np.concatenate(list(hashed_feature_dict.values()), axis=0)

# Split the data into training and testing sets
X = all_hashed_features
y = df['cluster_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And here is the prediction code for a new data point:
# Create a dictionary to store the hashed features for the new data point
new_data = {
    'src_ip': '65.21.234.90',
    'asn': 24940,
    'country_name': 'Finland',
    'city_name': 'Helsinki',
    'open_ports': ['5060/sip', '2000/ikettle'],
    'protocol': 'UDP',
    'ip_rep': None,
    'first_seen': '2023-08-22T16:56:45.733Z',
    'last_time': '2023-08-22T16:56:45.733Z',
    'class': 'A',
    'event_type': 'sip',
    'event_data': {
        'request_line': 'OPTIONS sip:[email protected] SIP/2.0',
        'uri': 'sip:[email protected]',
        'version': 'SIP/2.0',
        'method': 'OPTIONS'
    },
    'link': None,
    'os': 'None',
    'org': 'Hetzner Online GmbH',
    'port_cluster': -1
}
# Initialize the FeatureHasher for the new data point
# hasher = FeatureHasher(n_features=11, input_type='string')
new_data_hashed_feature_dict = {}

# Iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        # Convert a single string to a list containing that string
        feature_value = [str(new_data[column])]
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features

# Concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash], axis=1)

# Predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
Solution
In your training data encoding, you create a new FeatureHasher for each unique src_ip, while in your prediction code you reuse a generic hasher object. The FeatureHasher class does not learn anything from the data, so the hashing is only consistent if both hashers are configured identically: note your commented-out n_features=11 versus the n_features=20 used in training, which would change the feature width. You need to use the same hashing process, with the same parameters, for both training and prediction to ensure consistency.
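To illustrate that point: because FeatureHasher is stateless, two instances built with the same parameters hash the same input identically, so what actually breaks consistency is a parameter mismatch such as a different n_features. A minimal sketch (the 'UDP' value is just an example taken from the question's data):

```python
from sklearn.feature_extraction import FeatureHasher

# Two independent hashers with identical parameters hash identically,
# because FeatureHasher applies a fixed hash function and learns nothing.
h_train = FeatureHasher(n_features=20, input_type='string')
h_predict = FeatureHasher(n_features=20, input_type='string')

a = h_train.transform([['UDP']]).toarray()
b = h_predict.transform([['UDP']]).toarray()
print((a == b).all())  # True: same parameters, same vector

# A hasher with a different n_features changes the feature width,
# which is what breaks consistency between training and prediction.
h_wrong = FeatureHasher(n_features=11, input_type='string')
print(h_wrong.transform([['UDP']]).toarray().shape)  # (1, 11) instead of (1, 20)
```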
Also, in your training data you concatenate the hashed features of all rows belonging to each src_ip, whereas in your prediction code you hash and concatenate the features of a single record. If the number of hashed features differs between training and prediction, that will throw off your model.
Another remark: you iterate through the dictionary using the columns_to_hash list, but you need to make sure that every key listed in columns_to_hash is actually present in the new_data dictionary. If your code silently skips missing keys, the input at prediction time will no longer have the same shape as the data the model was trained on:
new_data_hashed_feature_dict = {}

# Here I iterate over columns to hash and store hashed features for the new data point
for column in columns_to_hash:
    if column in new_data:
        feature_value = [str(new_data[column])]  # Convert a single string to a list containing that string
        hashed_features = hasher.transform([feature_value])
        weighted_hashed_features = hashed_features.toarray() * attribute_weights[column]
        new_data_hashed_feature_dict[column] = weighted_hashed_features
    else:
        print(f"Warning: Missing key {column} in new data")

# Then I concatenate all hashed features for the new data point
new_data_features = np.concatenate([new_data_hashed_feature_dict[column] for column in columns_to_hash if column in new_data], axis=1)

# Here you predict the cluster label for the new data point
predicted_label = model.predict(new_data_features)
print(f"Predicted Cluster Label: {predicted_label[0]}")
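As a sketch of what a single, shared encoding path could look like, here is one possible way to hash a record consistently for both training and prediction. This is a minimal illustration, not your exact pipeline: the helper name encode_row is my own, missing keys are hashed as a 'missing' placeholder instead of being skipped, and the columns and weights are taken from your question.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

columns_to_hash = ['port_cluster', 'org', 'asn', 'protocol', 'event_type',
                   'os', 'country_name', 'city_name', 'class']
attribute_weights = {'port_cluster': 0.7, 'org': 0.5, 'asn': 0.5,
                     'protocol': 0.4, 'event_type': 0.6, 'os': 0.3,
                     'country_name': 0.4, 'city_name': 0.4, 'class': 0.3}

# One hasher, created once and reused everywhere: the same n_features for
# training and prediction guarantees a consistent feature width.
hasher = FeatureHasher(n_features=20, input_type='string')

def encode_row(row):
    """Hash and weight each column of one record (hypothetical helper).

    Missing keys are hashed as the placeholder string 'missing', so the
    output width never depends on which keys the record happens to have.
    """
    parts = []
    for column in columns_to_hash:
        value = str(row.get(column, 'missing'))
        hashed = hasher.transform([[value]]).toarray()
        parts.append(hashed * attribute_weights[column])
    return np.concatenate(parts, axis=1)

# Example record with the fields from the question's new data point
new_data = {'port_cluster': -1, 'org': 'Hetzner Online GmbH', 'asn': 24940,
            'protocol': 'UDP', 'event_type': 'sip', 'os': 'None',
            'country_name': 'Finland', 'city_name': 'Helsinki', 'class': 'A'}
features = encode_row(new_data)
print(features.shape)  # (1, 180): 9 columns x 20 hashed features each
```

Because every record (training row or new data point) goes through the same function and the same hasher, the model always sees a 9 × 20 = 180-wide vector, which removes the shape mismatch discussed above.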
Answered By - Amira Bedhiafi