Issue
I have a pyspark dataframe that essentially looks like the following table:
Product | Name |
---|---|
abcd - 12 | abcd |
xyz - 123543 | xyz |
I am hoping to create a new column (UPC) that only contains the numbers to the right of the hyphen in the Product column.
I know in Excel I can use the Right function with len and find but from what I can tell, those do not have equivalents in Python.
I have tried creating 2 new columns, LastHyphen (because the product column could have more than 1 hyphen) and ProductLength. I was then hoping to plug those into the substring function but I keep getting a "Column is not iterable" error.
df4 = df3.withColumn("LastHyphen",length(col("PRODUCT"))-locate('-',reverse(col("PRODUCT"))))
df4 = df4.withColumn("ProductLength",length(col("PRODUCT")))
df4 = df4.withColumn("UPC", substring("PRODUCT", df4.LastHyphen, df4.ProductLength - df4.LastHyphen))
TypeError: Column is not iterable
I am hoping to get an output like:
Product | UPC |
---|---|
abcd - 12 | 12 |
xyz - 123543 | 123543 |
Solution
There's a similar question here and the answer there involves a regexp split.
In your specific circumstance it might be easiest to do a regular expression to extract the UPC from the string.
from pyspark.sql import Row
from pyspark.sql.functions import col, regexp_extract
df = spark.createDataFrame(
[
Row(product="abcd - 12", name="abcd"),
Row(product="xyz - 123543", name="xyz"),
Row(product="xyz - abc - 123456", name="xyz - abc"),
]
)
df.withColumn("UPC", regexp_extract(col("product"), ".* - ([0-9]{1,})", 1)).show()
+------------------+---------+------+
| product| name| UPC|
+------------------+---------+------+
| abcd - 12| abcd| 12|
| xyz - 123543| xyz|123543|
|xyz - abc - 123456|xyz - abc|123456|
+------------------+---------+------+
Answered By - jonathan-dufault-kr
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.