Issue
I have a simple Spark DataFrame with three columns (contract name, contract start date, contract length in months):
root
 |-- contract_name: string (nullable = true)
 |-- start_date: date (nullable = true)
 |-- length: long (nullable = true)
As contracts renew continually, I would like to add a column showing the next future date on which each contract will renew. In Python, I calculate this recursively as follows:
from datetime import datetime
from dateutil.relativedelta import relativedelta

def find_next_date(date, periods):
    new_date = date + relativedelta(months=periods)
    if new_date <= datetime.today():
        return find_next_date(new_date, periods)
    else:
        return new_date
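As an aside, for very old start dates this recursion can hit Python's recursion limit; an iterative sketch of the same logic behaves identically. The `today` parameter is an addition of this sketch (not in the original function) purely so the result can be checked against a fixed reference date:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def find_next_date_iter(date, periods, today=None):
    # Iterative equivalent of the recursive version above. `today`
    # defaults to now, matching the original behavior.
    today = today or datetime.today()
    new_date = date + relativedelta(months=periods)
    while new_date <= today:
        new_date += relativedelta(months=periods)
    return new_date
```

For example, a contract started 2022-01-14 with 5-month periods, checked on 2022-06-15, next renews on 2022-11-14.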
However, is there a way to do this directly in PySpark?
Solution
You can create a sequence of renewal dates, stepping from start_date in increments of length months until just past today, and pick the last one.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('cont1', '2022-01-14', 5),
     ('cont2', '2022-01-15', 5),
     ('cont3', '2022-01-16', 5),
     ('cont4', '2022-02-14', 1),
     ('cont5', '2022-02-15', 1),
     ('cont6', '2022-02-16', 1)],
    ['contract_name', 'start_date', 'length'])
df = df.withColumn('start_date', F.col('start_date').cast('date'))

# Renewal dates from start_date up to `length` months past today,
# stepping by `length` months; the last element is the next renewal.
dates = F.expr("sequence(start_date, add_months(current_date(), length), make_interval(0, length))")
df = df.withColumn('next_date', F.element_at(dates, -1))
df.show()
# +-------------+----------+------+----------+
# |contract_name|start_date|length| next_date|
# +-------------+----------+------+----------+
# | cont1|2022-01-14| 5|2022-11-14|
# | cont2|2022-01-15| 5|2022-11-15|
# | cont3|2022-01-16| 5|2022-06-16|
# | cont4|2022-02-14| 1|2022-07-14|
# | cont5|2022-02-15| 1|2022-07-15|
# | cont6|2022-02-16| 1|2022-06-16|
# +-------------+----------+------+----------+
Note that next_date depends on current_date(); the output above corresponds to running the query on 2022-06-15.
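If you would rather not materialize an array per row, the same date can be computed arithmetically: count the whole periods elapsed since start_date and add one more. Below is a stdlib-only Python sketch of that arithmetic; the helper name and the day-of-month handling are assumptions of this sketch, and the Spark analogue (untested here) would combine months_between, floor and add_months in an expression:

```python
from datetime import date

def next_renewal(start, length, today):
    # Whole calendar months elapsed between start and today; a partial
    # month (today's day-of-month before start's) does not count.
    months_elapsed = (today.year - start.year) * 12 + (today.month - start.month)
    if today.day < start.day:
        months_elapsed -= 1
    # Next multiple of `length` months strictly after today.
    k = months_elapsed // length + 1
    total = (start.month - 1) + k * length
    # Assumes start is not in the future and that start's day-of-month
    # exists in the target month (e.g. not the 31st landing in February).
    return date(start.year + total // 12, total % 12 + 1, start.day)
```

With the sample data above and today = 2022-06-15, this reproduces the next_date column shown in the output.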
Answered By - ZygD