
Spark 2.2.2 - Joining Multiple RDDs Giving Out Of Memory Exception. Resulting RDD Has 124 Columns. What Should Be The Optimal Joining Method?

I have a file which has multiple values for each phone number. For example: phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 att…

Solution 1:

I would probably use window functions:

from pyspark.sql.window import Window
import pyspark.sql.functions as spf

df = spark.createDataFrame([
    (123, 1, 'a', 2, 'c'),
    (123, 2, 'b', 1, 'd'),
    (456, 3, 'e', 4, 'f')
], ['phone', 'priority1', 'attribute1', 'priority2', 'attribute2'])

# For each phone, take each attribute from the row with the lowest
# corresponding priority (first() over a window ordered by that priority).
w = Window.partitionBy('phone')
df2 = (
    df
    .select(
        'phone',
        spf.first('attribute1').over(w.orderBy('priority1')).alias('attribute1'),
        spf.first('attribute2').over(w.orderBy('priority2')).alias('attribute2'),
    )
)

# Collapse to one row per phone; every row in a partition now carries the
# same resolved attribute values, so first() just deduplicates.
(
    df2
    .groupby('phone')
    .agg(*[spf.first(c).alias(c) for c in df2.columns if c != 'phone'])
    .toPandas()
)

Gives:

   phone attribute1 attribute2
0    123          a          d
1    456          e          f

It's an exercise for the reader to template this out (e.g. using list comprehensions) to generalize to all attributes and priorities.
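A minimal sketch of that generalization, reusing df, spf, and Window from the snippet above and assuming the columns follow the priorityN/attributeN naming of the toy DataFrame (the real 124-column schema would substitute its own names; n_pairs is a hypothetical knob for how many priority/attribute pairs exist):

n_pairs = 2  # assumption: number of priority/attribute pairs in the data

w = Window.partitionBy('phone')
df2 = df.select(
    'phone',
    *[
        spf.first('attribute{}'.format(i))
           .over(w.orderBy('priority{}'.format(i)))
           .alias('attribute{}'.format(i))
        for i in range(1, n_pairs + 1)
    ]
)

# One row per phone, each attribute resolved by its own priority column.
result = (
    df2
    .groupby('phone')
    .agg(*[spf.first(c).alias(c) for c in df2.columns if c != 'phone'])
)

This keeps everything in a single DataFrame pass with window functions rather than joining many RDDs, which is what was blowing up memory in the first place.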
