Issue
My objective is to first sort a dataframe into 3 categories and then create 3 new dataframes containing those 3 categories. Here is the code I have below.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']
train_path = tf.keras.utils.get_file(
"iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
"iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
train.pop('SepalWidth')
train.pop('PetalWidth')
flower0 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower1 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower2 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
for row in range(len(train)):
    species = train.iloc[row]['Species']
    info = train.iloc[row]
    info.pop('Species')
    if species == 0.0:
        flower0.append(info)
    elif species == 1.0:
        flower1.append(info)
    else:
        flower2.append(info)
print(flower0)
plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()
I am very new to machine learning and data engineering, so I wanted to visualize what my data looks like on a scatter plot. I cannot plot this data in 4 dimensions (I have 4 features: Sepal width/length and Petal width/length), so I decided to plot just 2, Sepal length and Petal length. I deleted the unnecessary columns using the .pop() method and am stuck at this code chunk.
flower0 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower1 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower2 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
for row in range(len(train)):
    species = train.iloc[row]['Species']
    info = train.iloc[row]
    info.pop('Species')
    if species == 0.0:
        flower0.append(info)
    elif species == 1.0:
        flower1.append(info)
    else:
        flower2.append(info)
print(flower0)
plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()
Here I am creating 3 empty dataframes with the 2 columns I want to use later for axis plotting, and am looping through the large dataset in the for loop. The for loop sorts the rows by species and then appends them to the corresponding dataframe. Here the appending does not seem to work because when I print out one of the new dataframes it reads:
Empty DataFrame
Columns: [SepalLength, PetalLength]
Index: []
Does anyone know how I should go about adding these rows to specific new dataframes? Thank you so much in advance!!
Side question if you want brownie points: Is this the best way of displaying the scatter plot? I looked online and it said the best was to plot the data in different scatter sets so that I can change each group's color independently. My entire goal is just to visually see each of the flowers' petal length and sepal length in different colors.
Solution
I don't think you need a for loop here, and for a large data set, word on the street is that iterating through a dataframe row by row is highly inefficient compared with selecting rows in one vectorized step.
Just drop the for loop and replace the definitions of flower0, flower1, flower2 with a loc definition that uses a boolean mask on Species.
# change definition to what you want using loc
flower0 = train.loc[train.Species==0.0][['SepalLength', 'PetalLength']]
flower1 = train.loc[train.Species==1.0][['SepalLength', 'PetalLength']]
flower2 = train.loc[train.Species>1 ][['SepalLength', 'PetalLength']]
# drop the for loop
plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()
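On the side question, one possible way to get one color per species without writing three separate scatter calls is to loop over groupby. This is only a sketch: the small dataframe below is made-up stand-in data for `train`, and the `COLORS` list and the mapping of Species codes 0/1/2 to the names in `SPECIES` are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny made-up stand-in for the `train` dataframe (values are illustrative only).
train = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.4, 5.9, 6.3, 7.1],
    "PetalLength": [1.4, 1.5, 4.5, 4.2, 6.0, 5.9],
    "Species":     [0, 0, 1, 1, 2, 2],
})
SPECIES = ["Setosa", "Versicolor", "Virginica"]
COLORS = ["red", "blue", "green"]  # assumed color choice, one per species code

fig, ax = plt.subplots()
# groupby yields one (label, sub-dataframe) pair per distinct Species value.
for label, group in train.groupby("Species"):
    ax.scatter(group["SepalLength"], group["PetalLength"],
               color=COLORS[int(label)], label=SPECIES[int(label)])
ax.set_xlabel("SepalLength")
ax.set_ylabel("PetalLength")
ax.legend()
plt.show()
```

Plotting each group as its own scatter set keeps the per-group color control you were after, and the legend comes for free from the `label` argument.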
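In any case, I believe you're getting an empty dataframe because DataFrame.append does not modify the dataframe in place: it returns a new dataframe, and the loop discards that result. Reassigning (flower0 = flower0.append(info)) would work on older pandas, but DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. To add a series (info = train.iloc[row]) as a row of an existing dataframe, use df = pd.concat([df, s.to_frame().T])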
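If you do want to keep a row-by-row loop, a minimal sketch of the concat approach might look like this. The four-row dataframe is made-up stand-in data for `train`, and collecting the rows in a plain list (named `rows` here, a hypothetical helper) before a single concat is my own choice to avoid rebuilding the dataframe on every iteration.

```python
import pandas as pd

# Made-up stand-in for `train`; values are illustrative only.
train = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.4, 6.3],
    "PetalLength": [1.4, 1.5, 4.5, 6.0],
    "Species":     [0, 0, 1, 2],
})

rows = []  # a plain Python list: list.append DOES modify it in place
for row in range(len(train)):
    info = train.iloc[row]
    if info["Species"] == 0:
        # drop returns a new Series without the Species entry
        rows.append(info.drop("Species"))

# to_frame().T turns each Series into a one-row dataframe;
# concat returns a NEW dataframe, so the result must be assigned.
flower0 = pd.concat([s.to_frame().T for s in rows])
print(flower0)
```

The key point is the assignment on the last step: like the old append, pd.concat returns a new object and leaves its inputs untouched.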
Answered By - born_naked