Issue
I would like to remove the double-quotes "" when the value is empty when registering my CSV in S3 with the method df.write.save() of Spark
Spark Version : 2.4.0
Python Version : 3.6.5
Here is my code in Python for load the csv file:
df = spark.read.load(
path('in'),
format = 'csv',
delimiter = '|',
encoding = 'utf-8',
header = 'true'
)
The loaded CSV file:
|id|first_name|last_name|zip_code|
|1 | |Elsner |57315 |
|2 |Noelle | | |
|3 |James |Moser |48256 |
Here is my code in Python for write the csv file:
df.write.save(
path('out'),
format = 'csv',
delimiter = '|',
header = 'true'
)
The CSV file written:
|id|first_name|last_name|zip_code|
|1 |"" |Elsner |57315 |
|2 |Noelle |"" |"" |
|3 |James |Moser |48256 |
How to remove the double quote when registering?
Thank you very much by advance.
Solution
You have empty string in your data frame, if you want to write them as nulls, you can replace empty string with null, and then set nullValues=None
when saving it:
df.replace('', None) # replace empty string with null
.write.save(
path('out'),
format='csv',
delimiter='|',
header=True,
nullValue=None # write null value as None
)
And it will saved as:
id|first_name|last_name|zip_code
1||Elsner|57315
2|Noelle||
3|James|Moser|48256
Answered By - Psidom
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.