Tuesday, July 19, 2022

[FIXED] How to explode feature vector to a column in PySpark Dataframe?

Labels: apache-spark-sql, jupyter-notebook, pyspark, python, rdd

Issue

id	texts	vector
0	[a, b, c]	(3,[0,1,2],[1.0,1.0,1.0])
1	[a, b, c]	(3,[0,1,2],[2.0,2.0,1.0])

This is my current Spark DataFrame. I want to convert it to something like the following:

id	texts	list_2
0	a	1.0
0	b	1.0
0	c	1.0
1	a	2.0
1	b	2.0
1	c	1.0

Solution

from pyspark.sql.types import ArrayType, DoubleType, FloatType, StructField, StructType
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import col, explode, udf



def to_array_(v):
    # toArray() is available on both dense and sparse vectors.
    return v.toArray().tolist()


def to_vector_(v):
    return Vectors.dense(v)


to_array = udf(to_array_, ArrayType(DoubleType()))  # watch your return type
to_vector = udf(to_vector_, VectorUDT())  # helper to build an example for your question
# Intended for a feature vector whose elements are themselves vectors; I have not contrived an example of that.
getFeatureVector = udf(lambda v: v[2], VectorUDT())
# Works for this example and gives the general idea of how to access vector elements.
getFeatureVectorExample = udf(lambda v: float(v[2]), FloatType())

schema = ["id", "texts", "vector"]
data = [
    (0, ['a', 'b', 'c'], [1.0, 1.0, 1.0]),  # small cheat: plain lists, converted to vectors below
    (1, ['a', 'b', 'c'], [2.0, 2.0, 1.0]),
]
df = spark.createDataFrame(data, schema)  # assumes an active SparkSession named `spark`

# Convert the array to a vector so I can prove this works.
df = df.withColumn("vector", to_vector(df.vector))
# DataFrame[id: bigint, texts: array<string>, vector: vector]

This raises the question of how to access an element of a vector. We use another UDF that does the translation for us:

df.select(col('*'), getFeatureVectorExample( df.vector ) ).show()
+---+---------+-------------+----------------+
| id|    texts|       vector|<lambda>(vector)|
+---+---------+-------------+----------------+
|  0|[a, b, c]|[1.0,1.0,1.0]|             1.0|
|  1|[a, b, c]|[2.0,2.0,1.0]|             1.0|
+---+---------+-------------+----------------+

OK, so now we know how to get the element we're interested in. The rest of this example shows how to convert a vector into an array and then explode it.

# Note: chaining two explodes yields every text/feature combination, as the output shows.
(
    df.withColumn('text', explode(df.texts))  # I use withColumn as I'm lazy
    .withColumn('feature', explode(to_array(df.vector)))  # can't have two explodes in one select, so don't try that
    .drop('texts', 'vector')  # bookkeeping to clean up columns you don't want
    .show()
)
+---+----+-------+
| id|text|feature|
+---+----+-------+
|  0|   a|    1.0|
|  0|   a|    1.0|
|  0|   a|    1.0|
|  0|   b|    1.0|
|  0|   b|    1.0|
|  0|   b|    1.0|
|  0|   c|    1.0|
|  0|   c|    1.0|
|  0|   c|    1.0|
|  1|   a|    2.0|
|  1|   a|    2.0|
|  1|   a|    1.0|
|  1|   b|    2.0|
|  1|   b|    2.0|
|  1|   b|    1.0|
|  1|   c|    2.0|
|  1|   c|    2.0|
|  1|   c|    1.0|
+---+----+-------+
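The output above contains every text/feature combination (nine rows per id), whereas the question asks for one row per position. A minimal sketch of a pairwise variant follows, assuming the same `id`, `texts`, and `vector` columns. The names `pair_texts_with_values` and `explode_pairwise` are hypothetical, not part of the original answer. The idea is to zip the two sequences inside a single UDF and explode only once.

```python
def pair_texts_with_values(texts, values):
    """Pair each text with the value at the same position (plain Python, no Spark)."""
    return [(text, float(value)) for text, value in zip(texts, values)]


def explode_pairwise(df):
    """Hypothetical helper: one output row per (text, feature) pair.

    Assumes `df` has the columns `id`, `texts` (array<string>) and
    `vector` (an ML vector), as in the example above.
    """
    # Imported lazily so the pure-Python helper can be checked without Spark.
    from pyspark.sql.functions import explode, udf
    from pyspark.sql.types import (
        ArrayType, DoubleType, StringType, StructField, StructType,
    )

    # Each array element is a struct holding one text and its feature value.
    pair_type = ArrayType(StructType([
        StructField("text", StringType()),
        StructField("feature", DoubleType()),
    ]))
    zip_udf = udf(
        lambda texts, vector: pair_texts_with_values(texts, vector.toArray().tolist()),
        pair_type,
    )
    # A single explode over the zipped array avoids the cross product.
    return (
        df.withColumn("pair", explode(zip_udf(df.texts, df.vector)))
          .select("id", "pair.text", "pair.feature")
    )


print(pair_texts_with_values(["a", "b", "c"], [2.0, 2.0, 1.0]))
# [('a', 2.0), ('b', 2.0), ('c', 1.0)]
```

Calling `explode_pairwise(df).show()` on the DataFrame built earlier should then give three rows per id. Spark 2.4 and later also ship a built-in `arrays_zip` function that can replace the zipping UDF once the vector has been converted to an array column.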

To further clarify: if you wish to access elements of a vector, you can wrap the access in a UDF.

This one pulls out the last element (index 2) of a vector and declares the return type as a vector, which only fits when that element is itself a vector. It gives a hint of how to access other elements:

getFeatureVector=udf(lambda v:v[2],VectorUDT())

If the elements are of different types, you will need extra logic to handle them and the return type. Here is an example that accesses the first element (index 0) of a vector and returns it as a FloatType:

getFeatureVectorExample=udf(lambda v:v[0],FloatType())
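Rather than hard-coding the index in each lambda, the position can be made a parameter. The sketch below assumes a vector column of numeric elements; `element_at` and `make_element_udf` are hypothetical names introduced for illustration. Converting with `float()` means the UDF returns a plain Python float rather than a NumPy scalar.

```python
def element_at(values, index):
    """Return the element at `index` as a plain Python float."""
    return float(values[index])


def make_element_udf(index):
    """Hypothetical factory: build a UDF that extracts element `index` of a vector column."""
    # Imported lazily so `element_at` can be checked without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    return udf(lambda v: element_at(v, index), DoubleType())


print(element_at([2.0, 2.0, 1.0], 0))
# 2.0
```

With the DataFrame from the example, usage would look like `df.select("id", make_element_udf(0)(df.vector).alias("first"))`.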

You can of course combine these elements and return a more complex structure that may suit your needs. I suggest returning them as a struct, since you can use 'column_name.*' to expand the struct's fields into columns, or struct_column.field_name to access a single field as a column. See the example below for how to build out the return type.

A further example using multiple elements in a struct and turning them into columns:


def structExample(v):
    # Both fields carry the first element in this example.
    return (
        float(v[0]),
        float(v[0]),
    )

getstructExample = udf(structExample, StructType([
    StructField("flt", FloatType(), False),
    StructField("array", FloatType()),
]))

df.select(col('*'), getstructExample( df.vector ).alias("struct") ).select(col("struct.*")).show()
+---+-----+
|flt|array|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
+---+-----+
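In the example above both struct fields hold the same float. A struct can also mix types. The sketch below returns the first element together with the full list of values, under the assumption that the input is an ML vector column. The names `first_and_all` and `make_struct_udf` and the field names `first` and `values` are hypothetical.

```python
def first_and_all(values):
    """Return (first element, all elements) as plain Python floats."""
    return (float(values[0]), [float(x) for x in values])


def make_struct_udf():
    """Hypothetical factory: a UDF returning struct<first: double, values: array<double>>."""
    # Imported lazily so `first_and_all` can be checked without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

    schema = StructType([
        StructField("first", DoubleType(), False),
        StructField("values", ArrayType(DoubleType())),
    ])
    # Convert the vector to a list first so the helper only sees plain Python values.
    return udf(lambda v: first_and_all(v.toArray().tolist()), schema)


print(first_and_all([2.0, 2.0, 1.0]))
# (2.0, [2.0, 2.0, 1.0])
```

Usage would follow the same pattern as above: `df.select(make_struct_udf()(df.vector).alias("struct")).select("struct.*")`.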

Answered By - Matt Andruff

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
