Issue
I'd like to do something similar to array_max(), but I need to return another field that is not the one being used for the comparison. Here's the data in a column, let's call it df.idInfo:
[
  {
    "accountId": 123,
    "lastUpdated": {
      "sourceSystemTimestamp": "2023-01-01 07:44:58.000",
      "userId": null
    },
    "individualNumber": 33333
  },
  {
    "accountId": 456,
    "lastUpdated": {
      "sourceSystemTimestamp": "2023-02-12 18:44:58.000",
      "userId": null
    },
    "individualNumber": 22222
  },
  {
    "accountId": 789,
    "lastUpdated": {
      "sourceSystemTimestamp": "2023-03-23 23:44:58.000",
      "userId": null
    },
    "individualNumber": 11111
  }
]
I want to get the individualNumber from the element that has the latest value of lastUpdated.sourceSystemTimestamp.

Is there an efficient way to do this? Are there any built-in Python or PySpark functions I could leverage?
Solution
You can use array_max to get the maximum sourceSystemTimestamp, then filter the same array on that value to fetch the corresponding individualNumber:
from pyspark.sql import functions as F

df = (df
      # Extract all timestamps from the array and take the largest one.
      .withColumn('max_time',
                  F.array_max(F.transform(
                      'idInfo',
                      lambda x: x['lastUpdated']['sourceSystemTimestamp'])))
      # Keep only the element whose timestamp equals the max,
      # then pull individualNumber from it.
      .withColumn('max_number',
                  F.filter(
                      'idInfo',
                      lambda x: x['lastUpdated']['sourceSystemTimestamp'] == F.col('max_time')
                  )[0]['individualNumber']))
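Since the question also asks about built-in Python functions: if the data is already a plain Python list of dicts (for example after collecting a single row to the driver), the built-in max() with a key function does the same selection in one pass. This is a minimal sketch using the sample data from the question; note it works here because timestamps in "YYYY-MM-DD HH:MM:SS" format compare correctly as strings.

```python
data = [
    {"accountId": 123,
     "lastUpdated": {"sourceSystemTimestamp": "2023-01-01 07:44:58.000", "userId": None},
     "individualNumber": 33333},
    {"accountId": 456,
     "lastUpdated": {"sourceSystemTimestamp": "2023-02-12 18:44:58.000", "userId": None},
     "individualNumber": 22222},
    {"accountId": 789,
     "lastUpdated": {"sourceSystemTimestamp": "2023-03-23 23:44:58.000", "userId": None},
     "individualNumber": 11111},
]

# max() with a key returns the whole element, so the other fields stay available.
latest = max(data, key=lambda x: x["lastUpdated"]["sourceSystemTimestamp"])
print(latest["individualNumber"])  # 11111
```

For data that stays inside a DataFrame, though, the PySpark version above avoids collecting to the driver.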
Answered By - Emma