
How To Stack Two Columns Into A Single One In Pyspark?

I have the following PySpark DataFrame:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

I want to stack col1 and col2 into a single column.

Solution 1:

The simplest approach is to merge col1 and col2 into an array column and then explode it:

df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df.selectExpr('id', 'explode(array(col1, col2))').show()
+---+---+
| id|col|
+---+---+
|  A|  2|
|  A|  3|
|  A|  2|
|  A|  4|
|  A|  4|
|  A|  6|
|  B|  1|
|  B|  2|
+---+---+

You can drop duplicates if you don't need them.

Solution 2:

To do this, group by "id", then collect the values from both "col1" and "col2" into lists in an aggregation, and explode the concatenated list back into a single column. To keep only the unique numbers, drop the duplicates afterwards.

The end result also has the numbers sorted; this is achieved by sorting the concatenated lists in the aggregation with sort_array.

The following code:

from pyspark.sql.functions import concat, collect_list, explode, col, sort_array

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)

result = (
    df.groupBy("id")
    .agg(sort_array(concat(collect_list("col1"), collect_list("col2"))).alias("all_numbers"))
    .orderBy("id")
    .withColumn("number", explode(col("all_numbers")))
    .dropDuplicates()
    .select("id", "number")
)
result.show()

will yield:

+---+------+
| id|number|
+---+------+
|  A|     2|
|  A|     3|
|  A|     4|
|  A|     6|
|  B|     1|
|  B|     2|
+---+------+

Solution 3:

A rather simple solution when only a few columns are involved:

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)


df.show()

+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+

df1 = df.select(['id', 'col1'])
df2 = df.select(['id', 'col2']).withColumnRenamed('col2', 'col1')

df_new = df1.union(df2)
df_new = df_new.drop_duplicates()
df_new.show()

+---+----+
| id|col1|
+---+----+
|  A|   3|
|  A|   4|
|  B|   1|
|  A|   6|
|  A|   2|
|  B|   2|
+---+----+
