Issue
I am new to Python and am struggling with a calculation. I have several thousand rows of data in a CSV table in the following format:
This data is in the wrong format: several of my xmin/ymin values are higher than the corresponding xmax/ymax values (examples can be seen in the image link above). I need to create new columns and use either numpy or pandas to reorder the data so that the values are in the correct order, for example with this code:
import numpy as np
xmin_new = np.min(xmin, xmax)
xmax_new = np.max(xmin, xmax)
ymin_new = np.min(ymin, ymax)
ymax_new = np.max(ymin, ymax)
The problem is that I can't work out how to define a column from a CSV and iterate through the rows to do this. Can anyone suggest how I could modify this script to accomplish this?
import pandas
import numpy as np
import os
import csv
#Set cwd
os.chdir("C:\\Users\\desired_directory")
#Open desired csv file
v = open("train.csv")
r = csv.reader(v)
row0 = r.next()
#print header to look at file
print row0
row0.append('xmin_new')
row0.append('xmax_new')
row0.append('ymin_new')
row0.append('ymax_new')
#Check appends
print row0
xmin_new = np.min(xmin, xmax)
xmax_new = np.max(xmin, xmax)
ymin_new = np.min(ymin, ymax)
ymax_new = np.max(ymin, ymax)
#Errors occur here saying that the "xmin_new" column is undefined.
#Also looking to save the file to the directory, but unsure of how to do this properly.
Solution
If you are looking for speed, numpy is a good way to go. I assume you know how to read the whole file into a DataFrame (look up pandas.read_csv()).
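For completeness, here is a minimal sketch of the reading step. The CSV text is a hypothetical stand-in for train.csv (the real file's other columns are not shown in the question); with a real file you would pass its path to pandas.read_csv() instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv; only the four coordinate columns are assumed.
csv_text = """xmin,xmax,ymin,ymax
5,0,3,3
7,9,3,5
2,4,7,6
"""

# With a real file this would be: df = pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names, so no manual row iteration is needed.
print(df.columns.tolist())
print(len(df))
```

Each column is then available as df['xmin'], df['xmax'], and so on.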
import numpy as np
import pandas as pd

# First, make a reproducible example
# In your case, you would read the df instead
n = 6
np.random.seed(0)
cols = 'xmin xmax ymin ymax'.split()
df = pd.DataFrame(
    np.random.randint(0, 10, (n, 4)),
    columns=cols,
).assign(foo=np.random.choice(list('abcd'), n))
>>> df
   xmin  xmax  ymin  ymax foo
0     5     0     3     3   a
1     7     9     3     5   d
2     2     4     7     6   a
3     8     8     1     6   d
4     7     7     8     1   b
5     5     9     8     9   c
Then, the actual bit:
# reorder min/max for both x and y
#
# Note: cols must be ['xmin', 'xmax', 'ymin', 'ymax']
# or ['ymin', 'ymax', 'xmin', 'xmax']
z = df[cols].values.reshape(-1, 2)
df[cols] = np.c_[z.min(1), z.max(1)].reshape(-1, 4)
And now:
>>> df
   xmin  xmax  ymin  ymax foo
0     0     5     3     3   a
1     7     9     3     5   d
2     2     4     6     7   a
3     8     8     1     6   d
4     7     7     1     8   b
5     5     9     8     9   c
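If the reshape trick looks opaque, this sketch breaks it into its three steps on two rows taken from the example above. The names a, pairs and fixed are only illustrative.

```python
import numpy as np

# Two records of [xmin, xmax, ymin, ymax] (rows 0 and 2 of the example).
a = np.array([[5, 0, 3, 3],
              [2, 4, 7, 6]])

# Step 1: view the data as (low, high) pairs; x pairs and y pairs alternate.
z = a.reshape(-1, 2)

# Step 2: put the smaller value of each pair first and the larger second.
pairs = np.c_[z.min(1), z.max(1)]

# Step 3: fold the pairs back into four columns per record.
fixed = pairs.reshape(-1, 4)
print(fixed)
```

This relies on each min/max pair being adjacent in cols, which is why the column order matters.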
Note: if you want to create new columns instead, as in your question, use this:
cols_new = [f'{k}_new' for k in cols]
z = df[cols].values.reshape(-1, 2)
df[cols_new] = np.c_[z.min(1), z.max(1)].reshape(-1, 4)
There is also a slightly more verbose, pandas-only way:
df = df.assign(
    xmin=df[['xmin', 'xmax']].min(1),
    xmax=df[['xmin', 'xmax']].max(1),
    ymin=df[['ymin', 'ymax']].min(1),
    ymax=df[['ymin', 'ymax']].max(1),
)
The same remark applies as before: if you intend to create new columns instead, use df.assign(xmin_new=...) etc.
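A third option, closest to the code in the question, is sketched below. np.min and np.max reduce a single array (their second positional argument is an axis), whereas np.minimum and np.maximum compare two arrays element by element, which is what the question intends. The sample data and the output name "train_fixed.csv" are hypothetical; the sketch writes to an in-memory buffer so that it runs without touching the disk.

```python
import io

import numpy as np
import pandas as pd

# Small hypothetical sample with some min/max values swapped.
df = pd.DataFrame({"xmin": [5, 7, 2], "xmax": [0, 9, 4],
                   "ymin": [3, 3, 7], "ymax": [3, 5, 6]})

# Element-wise comparison of two columns; no explicit loop over rows.
df["xmin_new"] = np.minimum(df["xmin"], df["xmax"])
df["xmax_new"] = np.maximum(df["xmin"], df["xmax"])
df["ymin_new"] = np.minimum(df["ymin"], df["ymax"])
df["ymax_new"] = np.maximum(df["ymin"], df["ymax"])

# Save the result. To write a file, pass a path such as "train_fixed.csv"
# (hypothetical name) instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing index=False keeps the row index out of the saved CSV.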
Answered By - Pierre D