Issue
I have a dataframe with a column that stores bank names, but several different values refer to the same bank. The data looks something like this:
+---+--------------------+
| id|                name|
+---+--------------------+
|  1|     BANCO SANTANDER|
|  2|           SANTANDER|
|  3|BANCO SANTANDER S.A.|
|  4|           JP MORGAN|
|  5|     JP MORGAN CHASE|
|  6|            CITIBANK|
|  7|                CITI|
|  8|           CITIGROUP|
|  9|       HSBC HOLDINGS|
| 10|                HBSC|
+---+--------------------+
Since a single bank can have one or more variants to replace, and I have an extensive list of institutions to correct, I created a dict rather than writing CASE WHEN statements by hand, which would take a lot of time. The dict looks like this:
bank_dict = {
    ('JP MORGAN CHASE',): 'JP MORGAN',
    ('CITI', 'CITIGROUP'): 'CITIBANK',
    ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'): 'SANTANDER',
    ('HSBC HOLDINGS',): 'HSBC'
}
What I need to do is check whether the current value matches any of the values in a dict key and, if so, replace it with the corresponding dict value. The expected result would be the following:
+---+--------------------+---------+
| id|                name| new_name|
+---+--------------------+---------+
|  1|     BANCO SANTANDER|SANTANDER|
|  2|           SANTANDER|SANTANDER|
|  3|BANCO SANTANDER S.A.|SANTANDER|
|  4|           JP MORGAN|JP MORGAN|
|  5|     JP MORGAN CHASE|JP MORGAN|
|  6|            CITIBANK| CITIBANK|
|  7|                CITI| CITIBANK|
|  8|           CITIGROUP| CITIBANK|
|  9|       HSBC HOLDINGS|     HSBC|
| 10|                HBSC|     HBSC|
+---+--------------------+---------+
What do I need to do to make this work?
Solution
You can use a UDF, which is a simple way to apply a Python function to every value of a column in a PySpark DataFrame:
from pyspark.sql import types as T
from pyspark.sql.functions import udf

# replace the name with the value in the dict
def replace_name(name):
    for k, v in bank_dict.items():
        if name in k:
            return v
    return name  # no match: keep the original name

udf_replace_name = udf(replace_name, T.StringType())
df = df.withColumn('new_name', udf_replace_name('name'))
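Alternatively, use a pandas_udf. It receives each batch of names as a pandas Series rather than one value at a time, and it must return a string column:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf('string')
def replace_name(names: pd.Series) -> pd.Series:
    # map each name in the batch to its replacement, or keep it if no key contains it
    return names.map(lambda name: next((v for k, v in bank_dict.items() if name in k), name))
df = df.withColumn('new_name', replace_name(col('name')))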
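Both versions above scan every key tuple for every row. As a minimal sketch of one way to avoid that, the dict can be flattened once into a plain alias-to-name lookup. The names flatten_aliases, flat_bank_dict and normalize_name are hypothetical and not part of the original answer; bank_dict is copied from the question.

```python
# Mapping copied from the question: tuple of aliases -> canonical bank name.
bank_dict = {
    ('JP MORGAN CHASE',): 'JP MORGAN',
    ('CITI', 'CITIGROUP'): 'CITIBANK',
    ('BANCO SANTANDER', 'BANCO SANTANDER S.A.', 'SANTANDER CREDIT CARDS'): 'SANTANDER',
    ('HSBC HOLDINGS',): 'HSBC',
}


def flatten_aliases(grouped):
    """Turn {(alias, ...): canonical} into {alias: canonical}."""
    flat = {}
    for aliases, canonical in grouped.items():
        for alias in aliases:
            flat[alias] = canonical
    return flat


# Hypothetical helper names, introduced only for this illustration.
flat_bank_dict = flatten_aliases(bank_dict)


def normalize_name(name):
    """Return the canonical name, or the input unchanged when no alias matches."""
    return flat_bank_dict.get(name, name)


print(normalize_name('CITI'))
print(normalize_name('JP MORGAN'))
```

normalize_name could stand in for the body of replace_name in the plain UDF. The flat dict is also the shape that PySpark's DataFrame.replace accepts as its to_replace argument, so copying name into new_name and then calling df.replace(flat_bank_dict, subset=['new_name']) may avoid a Python UDF altogether. That variant is untested against the data in the question.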
Answered By - Zakaria Hamane