Issue
I am certain this is a simple change, but I am having trouble matching a date column against a list of items. The example uses the holidays package, but the problem applies to other use cases as well; my experience in PySpark is limited!
Using the holidays package (https://pypi.org/project/holidays/), I retrieve a dict of holidays:
nyse_holidays=holidays.financial.ny_stock_exchange.NewYorkStockExchange(years=2018)
print(nyse_holidays)
{datetime.date(2018, 12, 5): 'Day of Mourning for President George H.W. Bush', datetime.date(2018, 1, 1): "New Year's Day", datetime.date(2018, 1, 15): 'Martin Luther King Jr. Day', datetime.date(2018, 2, 19): "Washington's Birthday", datetime.date(2018, 3, 30): 'Good Friday', datetime.date(2018, 5, 28): 'Memorial Day', datetime.date(2018, 7, 4): 'Independence Day', datetime.date(2018, 9, 3): 'Labor Day', datetime.date(2018, 11, 22): 'Thanksgiving Day', datetime.date(2018, 12, 25): 'Christmas Day'}
I also have a Spark DataFrame with the following schema:
root
|-- id: long (nullable = false)
|-- date: timestamp (nullable = false)
|-- year: integer (nullable = false)
|-- month: integer (nullable = false)
|-- day: string (nullable = false)
|-- day_of_year: string (nullable = false)
|-- hour: string (nullable = false)
|-- minute: string (nullable = false)
|-- is_weekend: boolean (nullable = false)
|-- only_date: date (nullable = false)
I simply want to add another field indicating whether the date for that row is a holiday.
The following code never matches any dates:
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
mapping_expr = create_map([lit(x) for x in chain(*nyse_holidays.items())])
#search_date
display(df.withColumn("value", mapping_expr["only_date"]).filter(col("value").isNotNull()))
If I change the code to use a fixed value to check that the mapping_expr works, then it works fine:
search_date = datetime.strptime('2018-01-01', '%Y-%m-%d')
display(df.withColumn("value", mapping_expr[search_date]).filter(col("value").isNotNull()))
Preferably the code would just use the 'date' field, but I thought I would create an only_date field.
Any recommendations? I'm sure I am just missing something silly. I assume it's the conversion of the field being passed into mapping_expr.
Solution
The problem is not a type mismatch: the keys in the nyse_holidays dictionary are datetime.date objects, and the only_date field is a Spark date, so the two line up. The actual issue is the lookup expression. mapping_expr["only_date"] passes the literal string "only_date" as the map key, so no date ever matches. Wrapping the column name in col() makes Spark look up each row's only_date value instead:
from itertools import chain
import holidays
from pyspark.sql.functions import col, create_map, lit
nyse_holidays = holidays.financial.ny_stock_exchange.NewYorkStockExchange(years=2018)
#create a mapping expression; lit() accepts the datetime.date keys directly
mapping_expr = create_map([lit(x) for x in chain(*nyse_holidays.items())])
#look up each row's only_date in the map; non-holiday dates come back as null
df_with_holidays = df.withColumn("is_holiday", mapping_expr[col("only_date")])
#show the resulting dataframe
df_with_holidays.show()
Answered By - Yes