Wednesday, November 3, 2021

[FIXED] map values in a dataframe from a dictionary using pyspark

Labels: apache-spark, pyspark, python

Issue

I want to know how to map values in a specific column in a dataframe.

I have a dataframe which looks like:

df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])

+-----+-------+
| col1|   col2|
+-----+-------+
|india|  japan|
|  usa|uruguay|
+-----+-------+

I have a dictionary (stored as an RDD of key-value pairs) from which I want to map the values:

dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])

The output I want is:

+-----+-------+--------+--------+
| col1|   col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india|  japan|     ind|     jpn|
|  usa|uruguay|      us|     urg|
+-----+-------+--------+--------+

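I tried calling the RDD lookup function inside a udf, but it doesn't work. It throws an error that references SPARK-5063. This is the approach that failed:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def map_val(x):
    # dicts is an RDD here, and RDD methods cannot be called from inside a udf
    return dicts.lookup(x)[0]

myfun = udf(lambda x: map_val(x), StringType())

df = df.withColumn('col1_map', myfun('col1')) # doesn't work
df = df.withColumn('col2_map', myfun('col2')) # doesn't work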

Solution

udf way

I suggest converting the list of tuples to a dict and broadcasting it so that it can be used inside a udf:

from pyspark.sql import functions as f
from pyspark.sql import types as t

# broadcast a plain Python dict so that every executor gets a read-only copy
dicts = sc.broadcast(dict([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]))

def newCols(x):
    # .get returns None (null in the new column) for values missing from the dict
    return dicts.value.get(x)

callnewColsUdf = f.udf(newCols, t.StringType())

df.withColumn('col1_map', callnewColsUdf(f.col('col1')))\
    .withColumn('col2_map', callnewColsUdf(f.col('col2')))\
    .show(truncate=False)

which should give you

+-----+-------+--------+--------+
|col1 |col2   |col1_map|col2_map|
+-----+-------+--------+--------+
|india|japan  |ind     |jpn     |
|usa  |uruguay|us      |urg     |
+-----+-------+--------+--------+
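Inside the udf the mapping is an ordinary Python dict lookup, so its behavior for unmapped values can be checked without Spark. The following is only a sketch: the plain dict `lookup` stands in for `dicts.value`, and `strict_lookup`, `safe_lookup` and the `default` parameter are hypothetical names that do not appear in the original answer. Indexing with `[]` raises a KeyError for a value that is not in the dict, which would fail the Spark task, while `.get` returns a default instead.

```python
# Plain dict standing in for `dicts.value` (the contents of the broadcast variable).
lookup = dict([('india', 'ind'), ('usa', 'us'), ('japan', 'jpn'), ('uruguay', 'urg')])

def strict_lookup(x):
    # Indexing raises KeyError for values that are missing from the dict.
    return lookup[x]

def safe_lookup(x, default=None):
    # dict.get returns `default` instead of raising, so unmapped values
    # (including None) would end up as nulls in the new column.
    return lookup.get(x, default)

print(safe_lookup('india'))    # ind
print(safe_lookup('france'))   # None
try:
    strict_lookup('france')
except KeyError as err:
    print('KeyError:', err)    # KeyError: 'france'
```

Which behavior is preferable depends on the data: failing fast surfaces unexpected values early, while returning None keeps the job running and leaves a null that can be filtered later.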

join way (slower than udf way)

All you have to do is convert the dicts RDD to a dataframe as well and use two joins with aliases, as follows:

df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])

dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]).toDF(['key', 'value'])

from pyspark.sql import functions as f
df.join(dicts, df['col1'] == dicts['key'], 'inner')\
    .select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
    .join(dicts, df['col2'] == dicts['key'], 'inner') \
    .select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
    .show(truncate=False)

which should give you the same result
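The two joins follow the same pattern, so they can be written as a loop over the columns to map. This is a rough sketch under a few assumptions, not part of the original answer: `mapped_name` and `map_columns` are hypothetical helpers, the lookup dataframe is assumed to have the columns `key` and `value` as above, and the input dataframe is assumed not to contain columns named `_key_<col>` or `<col>_map` already. It also swaps the inner join for a left join, so rows with an unmapped value are kept with a null instead of being dropped, and adds a broadcast hint on the assumption that the lookup table is small.

```python
def mapped_name(col_name, suffix="_map"):
    """Name of the output column for a given input column (hypothetical convention)."""
    return col_name + suffix


def map_columns(df, lookup_df, col_names):
    """Left-join lookup_df (columns 'key' and 'value') once per column in col_names."""
    # Imported here so that mapped_name can be used without Spark installed.
    from pyspark.sql import functions as f

    result = df
    for name in col_names:
        key_col = "_key_" + name  # temporary join-key column, dropped after the join
        renamed = lookup_df.select(
            f.col("key").alias(key_col),
            f.col("value").alias(mapped_name(name)),
        )
        # A left join keeps rows whose value has no match (the mapped column is null);
        # f.broadcast is only a hint that the lookup table is small.
        result = result.join(
            f.broadcast(renamed), f.col(name) == f.col(key_col), "left"
        ).drop(key_col)
    return result
```

With the `df` and `dicts` dataframes defined above, it would be called as `map_columns(df, dicts, ['col1', 'col2']).show(truncate=False)`.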

Answered By - Ramesh Maharjan

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
