Issue
I'm analyzing Political Advertisements from Facebook, which is a dataset released here, by ProPublica.
Here's what I mean. I have an entire column of targets that I want to analyze but it's formatted in a very inaccessible way for someone of my skill level.
This is from just 1 cell:
[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
and another:
[{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]
What I need to do is to separate every "target" item to become the column label with each of its corresponding "segment" to be a possible value within that column.
Or, is the solution to create a function to call each dictionary key within each row to count frequency?
Solution
- The columns are
lists
ofdicts
.- Each
dict
in thelist
can be moved to a separate column by usingpandas.explode()
. - Convert the column of
dicts
to a dataframe where the keys are column headers and the values are observations, by usingpandas.json_normalize()
,.join()
this back todf
.
- Each
- Use
.drop()
to remove the unneeded column. - If the column contains list of dicts that are strings (e.g.
"[{key: value}]"
), refer to this solution in Splitting dictionary/list inside a Pandas Column into Separate Columns, and use:df.col2 = df.col2.apply(literal_eval)
, withfrom ast import literal_eval
.
import pandas as pd
# create sample dataframe
df = pd.DataFrame({'col1': ['x', 'y'], 'col2': [[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}], [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]]})
# display(df)
col1 col2
0 x [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
1 y [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]
# use explode to give each dict in a list a separate row
df = df.explode('col2', ignore_index=True)
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])
display(df)
col1 target segment
0 x NAge 21 and older
1 x MinAge 21
2 x Retargeting people who may be similar to their customers
3 x Region the United States
4 y NAge 18 and older
5 y Location Type HOME
6 y Interest Hispanic culture
7 y Interest Republican Party (United States)
8 y Location Granularity country
9 y Country the United States
10 y MinAge 18
Get count
- If the goal is to get the
count
for each'target'
and associated'segment'
counts = df.groupby(['target', 'segment']).count()
Updated
- This update is implemented for the full file
import pandas as pd
from ast import literal_eval
# load the file
df = pd.read_csv('en-US.csv')
# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')
# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')
# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)
# use explode to give each dict in a list a separate row
df = df.explode('targets', ignore_index=True)
# fillna with {} is required for json_normalize
df.targets = df.targets.fillna({i: {} for i in df.index})
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
normalized = pd.json_normalize(df.targets)
# get the counts
counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')
Answered By - Trenton McKinney
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.