Issue
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
I want to delete the outliers that fall outside the green lines from my data. My actual data is in a DataFrame and contains many outliers. It would also be helpful to scale the distance between the green lines for larger values of x.
x = [1,1.1,2,3,4,5,5.5,6,7,8,9,10, 10.10]
y = [1,0.1,2,3,4,5,1,6,7,8,9,10, 16]
df = pd.DataFrame(list(zip(x, y)),
                  columns=['x_vals', 'y_vals'])
plt.scatter(df.x_vals,df.y_vals)
z = np.polyfit(df.x_vals,df.y_vals, 1)
p = np.poly1d(z)
z = plt.plot(df.x_vals,p(df.x_vals),"r--")
z_1 = plt.plot(df.x_vals,p(df.x_vals) + 1,"g--")
z_2 = plt.plot(df.x_vals,p(df.x_vals) - 1,"g--")
plt.show()
Solution
You can select the outliers as the points whose y-value is either larger than the upper limit or smaller than the lower limit; all other points are kept.
The following code example circles the remaining points in purple and marks the outliers with crosses.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
###import math
x = [1, 1.1, 2, 3, 4, 5, 5.5, 6, 7, 8, 9, 10, 10.10]
y = [1, 0.1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 16]
df = pd.DataFrame(list(zip(x, y)),
                  columns=['x_vals', 'y_vals'])
plt.scatter(df.x_vals, df.y_vals)
z = np.polyfit(df.x_vals, df.y_vals, 1)
p = np.poly1d(z)
z = plt.plot(df.x_vals, p(df.x_vals), "r--")
delta = 1  # half-width of the accepted band around the fitted line
z_1 = plt.plot(df.x_vals, p(df.x_vals) + delta, "g--")
z_2 = plt.plot(df.x_vals, p(df.x_vals) - delta, "g--")
plt.fill_between(df.x_vals, p(df.x_vals) - delta, p(df.x_vals) + delta, color='g', alpha=0.1)
# outliers: rows below the lower limit or above the upper limit
outliers = df[(df.y_vals < p(df.x_vals) - delta) | (df.y_vals > p(df.x_vals) + delta)]
# cleaned: the remaining rows, i.e. those inside the band (limits included)
cleaned = df[(df.y_vals >= p(df.x_vals) - delta) & (df.y_vals <= p(df.x_vals) + delta)]
# circle the kept points in purple and mark the outliers with large crosses
plt.scatter(cleaned.x_vals, cleaned.y_vals, s=100, fc='none', ec='purple')
plt.scatter(outliers.x_vals, outliers.y_vals, marker='x', s=200, fc='none', ec='r')
plt.show()
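The question also asks for a band that gets wider for larger x, which the example above does not cover. One possible approach, sketched here as an assumption and not as part of the original answer, is to replace the constant delta with a value that grows linearly with x. The names base_delta and growth and their values are hypothetical and would need tuning for real data.

```python
import numpy as np
import pandas as pd

x = [1, 1.1, 2, 3, 4, 5, 5.5, 6, 7, 8, 9, 10, 10.10]
y = [1, 0.1, 2, 3, 4, 5, 1, 6, 7, 8, 9, 10, 16]
df = pd.DataFrame({'x_vals': x, 'y_vals': y})

# Same straight-line fit as in the example above.
p = np.poly1d(np.polyfit(df.x_vals, df.y_vals, 1))

# Hypothetical tolerance: base_delta at x = 0, growing linearly with x.
# Both constants are illustrative assumptions, not recommended values.
base_delta = 1.0
growth = 0.2
delta = base_delta + growth * df.x_vals

# Vertical distance of each point from the fitted line.
residual = df.y_vals - p(df.x_vals)

# Keep the rows whose residual lies inside the x-dependent band.
inside = residual.abs() <= delta
cleaned = df[inside].reset_index(drop=True)
outliers = df[~inside]
```

Here cleaned is the DataFrame with the outliers removed. To draw the widened band, p(df.x_vals) + delta and p(df.x_vals) - delta can be passed to plt.plot and plt.fill_between in the same way as the constant-width version. Note that the line is still fitted on all points, outliers included, so strong outliers can shift the band itself.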
Answered By - JohanC