
How To Optimize This Code On Spark?

How can I make this code more efficient in Spark? I need to calculate the minimum, maximum, count, and mean of my data. Here is a sample of the data:

Name  Shop     Money
A     Shop001  99.99
A     Shop001  8

Solution 1:

You should use aggregateByKey for more efficient processing. The idea is to keep a state vector of (count, min, max, sum) per key and let the aggregation functions produce the final values. You can also use a tuple as the key; there is no need to concatenate the key fields into a single string.

data = [
    ['x', 'shop1', 1],
    ['x', 'shop1', 2],
    ['x', 'shop2', 3],
    ['x', 'shop2', 4],
    ['x', 'shop3', 5],
    ['y', 'shop4', 6],
    ['y', 'shop4', 7],
    ['y', 'shop4', 8],
]

# State vector layout: [count, min, max, sum]

def add(state, x):
    # Fold a single value into the per-partition state.
    state[0] += 1
    state[1] = min(state[1], x)
    state[2] = max(state[2], x)
    state[3] += x
    return state

def merge(state1, state2):
    # Combine the states produced for the same key on different partitions.
    state1[0] += state2[0]
    state1[1] = min(state1[1], state2[1])
    state1[2] = max(state1[2], state2[2])
    state1[3] += state2[3]
    return state1

# sc is an existing SparkContext (e.g. the one provided by the pyspark shell).
# The zero value starts min at +inf and max at -inf so it works for any value range,
# not just values below an arbitrary bound like 10000.
res = (sc.parallelize(data)
         .map(lambda x: ((x[0], x[1]), x[2]))
         .aggregateByKey([0, float('inf'), float('-inf'), 0], add, merge))

for x in res.collect():
    print('Client "%s" shop "%s": count %d min %f max %f avg %f' % (
        x[0][0], x[0][1],
        x[1][0], x[1][1], x[1][2], float(x[1][3]) / float(x[1][0])
    ))
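For reference, running the script above on the sample data should print something along these lines (collect() does not guarantee ordering, so the rows may come back in a different order):

Client "x" shop "shop1": count 2 min 1.000000 max 2.000000 avg 1.500000
Client "x" shop "shop2": count 2 min 3.000000 max 4.000000 avg 3.500000
Client "x" shop "shop3": count 1 min 5.000000 max 5.000000 avg 5.000000
Client "y" shop "shop4": count 3 min 6.000000 max 8.000000 avg 7.000000

The reason this is efficient is that aggregateByKey combines values map-side, so only the small four-element state vector per key is shuffled across the cluster, rather than every individual row.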
