Spark 2.2.2 - Joining Multiple RDDs Giving Out Of Memory Exception. Resulting RDD Has 124 Columns. What Should Be The Optimal Joining Method?
I have a file which has multiple values for each phone number. For example:
phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 att
Solution 1:
I would probably use window functions:
from pyspark.sql.window import Window
import pyspark.sql.functions as spf
# Sample data: multiple priority/attribute groups per phone number.
df = spark.createDataFrame([
    (123, 1, 'a', 2, 'c'),
    (123, 2, 'b', 1, 'd'),
    (456, 3, 'e', 4, 'f')
], ['phone', 'priority1', 'attribute1', 'priority2', 'attribute2'])

w = Window.partitionBy('phone')

# For each attribute, take the value from the row with the lowest priority
# within that phone's partition.
df2 = (
    df
    .select(
        'phone',
        spf.first('attribute1').over(w.orderBy('priority1')).alias('attribute1'),
        spf.first('attribute2').over(w.orderBy('priority2')).alias('attribute2'),
    )
)

# Collapse the duplicated per-row results down to one row per phone.
(
    df2
    .groupby('phone')
    .agg(*[spf.first(c).alias(c) for c in df2.columns if c != 'phone'])
    .toPandas()
)
Gives:
   phone attribute1 attribute2
0    123          a          d
1    456          e          f
It's an exercise for the reader to template this out (e.g. using list comprehensions) to generalize to all attributes and priorities.
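A minimal sketch of that templating, assuming the columns come in priorityN/attributeN pairs as in the toy DataFrame above (the pairs and result names are only illustrative; if one priority column governs several attribute columns, each of those attributes would simply reuse the same orderBy):

from pyspark.sql.window import Window
import pyspark.sql.functions as spf

# Assumption: columns follow the priorityN / attributeN naming pattern.
pairs = sorted(
    int(c.replace('priority', '')) for c in df.columns if c.startswith('priority')
)

w = Window.partitionBy('phone')

# For every pair, pick the attribute value from the lowest-priority row
# within each phone's partition.
df2 = df.select(
    'phone',
    *[
        spf.first('attribute{}'.format(i))
           .over(w.orderBy('priority{}'.format(i)))
           .alias('attribute{}'.format(i))
        for i in pairs
    ]
)

# One row per phone with the winning attribute values.
result = (
    df2
    .groupby('phone')
    .agg(*[spf.first(c).alias(c) for c in df2.columns if c != 'phone'])
)

This keeps everything in a single DataFrame partitioned by phone, rather than repeatedly joining many wide RDDs, which is where the original approach was running out of memory.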