Issue
I am new to Python. I have a numpy.array
which size is 66049x1
(66049 rows and 1 column). The values are sorted smallest to largest and are of float
type, with some of them being repeated.
I need to determine the frequency of occurrences of each value (the number of times a given value is equalled but not surpassed, e.g. X<=x in statistical terms), in order to later plot the Sample Cumulative Distribution Function.
The code I am currently using is as follows, but it is extremely slow, as it has to loop 66049x66049=4362470401
times. Is there any way to augment the speed of such piece of code? Will perhaps the use of dictionaries
help in any way? Unfortunately I cannot change the size of the arrays I am working with.
+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))
#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
temp=x3[i]
for j in range(len(x3)):
if (temp<=x3[j]):
count[j]=count[j]+1
#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
x4[i,0]=x3[i]
x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++
Solution
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data'])
df_p.T.plot(kind='hist')
plt.show()
That whole script took a very short period to execute (~2s) for (100,000x1)
array. I didn't time, but if you provide the time it took to do yours we can compare.
I used [Counter][2]
from collections
to count the number of occurrences, my experiences with it have always been great (timewise). I converted it into DataFrame
to plot and used T
to transpose.
Your data does replicate a bit, but you can try and refine it some more. As it is, it's pretty fast.
Edit
Create CDF using cumsum()
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p['cumu'].plot(kind='line')
plt.show()
Edit 2
For scatter()
plot you must specify the (x,y) explicitly. Also, calling df_p['cumu']
will result in a Series
, not a DataFrame
.
To properly display a scatter plot you'll need the following:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
arr = np.random.randint(0, 100, (100000,1))
df = pd.DataFrame(arr)
cnt = Counter(df[0])
df_p = pd.DataFrame(cnt, index=['data']).T
df_p['cumu'] = df_p['data'].cumsum()
df_p.plot(kind='scatter', x='data', y='cumu')
plt.show()
Answered By - Leb
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.