Issue
I have some data in pandas dataframe form below, where the columns represent discrete skills and the rows represent discrete jobs. A 1 is present only if the skill is required by the job, otherwise 0.
       skill_1  skill_2
job_1        1        0
job_2        0        0
job_3        1        1
I want to create a graph to visualize this relationship between jobs and skills using networkx. I've tried two methods: nx.from_pandas_adjacency, applied to the dataframe itself, and nx.from_numpy_matrix, applied to a numpy representation of the dataframe with the column and row names removed.
In both cases an error was raised because this is a non-square matrix. This makes sense, as networkx is likely interpreting both columns and rows as the same set of nodes. However, the columns and rows represent distinctly different things here. Two jobs are connected by the skill(s) they share and two skills are connected by the job(s) they share, but there is no direct edge between any two skills or any two jobs.
How can I import my data into networkx given that my rows and columns are different sets of nodes?
Solution
One option is to generate the missing rows and columns so that the matrix becomes square, with every job and every skill appearing on both axes.
(I was curious about a vectorised method to achieve this, so I asked this question, which has answers that provide such a method.)
import networkx as nx
import pandas as pd

df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})
skills = list(df.columns)

# Add an all-zero column for every job
for job in df.index:
    df[job] = 0
# Add an all-zero row for every skill (DataFrame.append was removed in pandas 2.0)
for skill in skills:
    df.loc[skill] = 0
Which gives us:
>>> df
         skill_1  skill_2  job_1  job_2  job_3
job_1          1        0      0      0      0
job_2          0        0      0      0      0
job_3          1        1      0      0      0
skill_1        0        0      0      0      0
skill_2        0        0      0      0      0
We can then read this into networkx using nx.from_pandas_adjacency (assuming you want a directed graph):
G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)
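The vectorised approach mentioned above is not spelled out in this answer. One possible sketch, which is an assumption on my part rather than the method from the linked question, pads the dataframe with DataFrame.reindex instead of looping. The names nodes and square are illustrative only.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

# Every job and every skill becomes a label on both axes.
nodes = df.index.union(df.columns)

# reindex adds the missing rows and columns in one call and fills them with 0.
square = df.reindex(index=nodes, columns=nodes, fill_value=0)

# Non-zero cells become directed edges from the row label to the column label.
G = nx.from_pandas_adjacency(square, create_using=nx.DiGraph)
print(sorted(G.edges()))
```

This should produce the same three job-to-skill edges as the loop version, without modifying the original dataframe.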
Alternatively, we can use df.stack() to turn the dataframe into one row per (job, skill) pair and add the nodes and edges directly:
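df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})
G = nx.DiGraph()
# After reset_index(), each row holds the job ('level_0'), the skill ('level_1')
# and the 0/1 flag, which sits under the column label 0
for _, row in df.stack().reset_index().iterrows():
    G.add_node(row['level_0'])
    G.add_node(row['level_1'])
    if row[0]:
        G.add_edge(row['level_0'], row['level_1'])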
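Since the question describes jobs and skills as two separate node sets, the same stacked data could also be loaded as an undirected bipartite graph. The following is only a sketch and not part of the original answer: it assumes an undirected graph is acceptable, and the names B, job_graph and skill_graph are illustrative. It uses networkx's bipartite.projected_graph to derive which jobs share a skill and which skills share a job.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

B = nx.Graph()
# Tag each node with the set it belongs to (the usual networkx convention).
B.add_nodes_from(df.index, bipartite=0)    # jobs
B.add_nodes_from(df.columns, bipartite=1)  # skills

# Keep only the (job, skill) pairs flagged with 1 and use them as edges.
stacked = df.stack()
B.add_edges_from(stacked[stacked == 1].index)

# Projections: jobs linked by a shared skill, skills linked by a shared job.
job_graph = bipartite.projected_graph(B, list(df.index))
skill_graph = bipartite.projected_graph(B, list(df.columns))
print(list(job_graph.edges()), list(skill_graph.edges()))
```

The bipartite graph B itself contains no job-to-job or skill-to-skill edges; those appear only in the derived projections, so B can be used for drawing and the projections for questions such as "which jobs overlap".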
Answered By - CDJB