Issue
Main objective: Find cities that sell toys the most different from one another (top 10 differentials). For example Los Angeles sells the most Toys 3 and 4 and the city most opposite of that would be Salt Lake City, which sells Toy 9 and 15 the most and Toys 3 and 4 the least.
I have a CSV that I have put in a dataframe.
It has hundreds of rows currently and each row has 15 columns... Example:
City | Toy1 | Toy2 | Toy3 | ToyN |
---|---|---|---|---|
Los Angeles | 15 | 20 | 1 | 44 |
Miami | 33 | 2 | 545 | 15 |
Dallas | 111 | 222 | 545 | 448 |
City N | 15 | 555 | 44 | 987 |
So I need Los Angeles to compare Toy1 to all other cities, Toy2, through ToyN. And then so on for each city against the rest of the rows in the dataframe.
I am having trouble structuring this as I need a calculation difference on every column and doing a comparison between each city.
Expected Output: A new column with a difference score for City vs City. Example: |City|Toy1|Toy2|Toy3|ToyN|DiffMiami|Diff Dallas| |----|----|----|----|----|----|----| |Los Angeles|15|20|1|44|-17|15|
I have been trying to use DataFrame.diff() but not sure how to structure to use it in this scenario. Any suggestions would be gladly taken. Thanks.
Solution
In my proposed solution, for each pair of cities A,B we calculate sum_i(abs(toy_i(A) - toy_i(B))) where toy_i(A) is the number of toys i sold in city A etc
we report the results as a matrix of cities
This is easiest done in numpy
First we load the data
from io import StringIO
data = StringIO('''
City Toy1 Toy2 Toy3 ToyN
LosAngeles 15 20 1 44
Miami 33 2 545 15
Dallas 111 222 545 448
CityN 15 555 44 987
''')
df = pd.read_csv(data, sep = '\s+')
df2 = df.set_index('City')
v = df2.values
Then a bit of numpy wizardy, inspired by https://stackoverflow.com/a/46266707/14551426, to calculate pairwise sum of abs differences, and transforming back into a df
res = np.sum(np.abs(v - v[:, None]),axis=2)
df3 = pd.DataFrame(data = res, index = df2.index, columns = df2.index)
df3
output:
City LosAngeles Miami Dallas CityN
City
LosAngeles 0 609 1246 1521
Miami 609 0 731 2044
Dallas 1246 731 0 1469
CityN 1521 2044 1469 0
we see the largest value is for the Miami/CityN combination hence this are the two cities with the largest differences
it would not be too difficult to find the top 10 largest numbers here either:
df3.unstack().sort_values()
produces
City City
LosAngeles LosAngeles 0
Miami Miami 0
Dallas Dallas 0
CityN CityN 0
LosAngeles Miami 609
Miami LosAngeles 609
Dallas 731
Dallas Miami 731
LosAngeles Dallas 1246
Dallas LosAngeles 1246
CityN 1469
CityN Dallas 1469
LosAngeles CityN 1521
CityN LosAngeles 1521
Miami CityN 2044
CityN Miami 2044
Answered By - piterbarg
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.