Problem description
I have a PySpark DataFrame with a column of numbers. I need to sum that column and get the result back as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a DataFrame back.
+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+
I would like 130 returned as an int stored in a variable, to be used elsewhere in the program.
result = 130
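Recommended answer
The simplest way, really, is to collect the result and index into the first row and first column, since collect() returns a list of Row objects rather than a bare number:
df.groupBy().sum().collect()[0][0]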
But this is a very slow operation. Avoid groupByKey; use the RDD API with reduceByKey instead:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
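To make the indexing in the one-liner above easier to follow, here is a plain-Python sketch that emulates the same steps locally, without Spark. The names rows, pairs, grouped and reduced are illustrative only and are not part of any Spark API, and a real reduceByKey combines values within and across partitions instead of building lists first.

```python
from collections import defaultdict
from functools import reduce

# Same sample rows as the DataFrame in the question.
rows = [("A", 20), ("B", 30), ("D", 80)]

# map step: pair every value in the second column with the same constant key, 1.
pairs = [(1, row[1]) for row in rows]

# reduceByKey step, emulated locally: collect the values for each key,
# then fold each group with the same lambda used in the one-liner.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
reduced = [(key, reduce(lambda x, y: x + y, values)) for key, values in grouped.items()]

# collect()[0][1]: take the first (and only) pair, then its second element.
result = reduced[0][1]
print(result, type(result))
```

Because every row shares the key 1, the reduction leaves a single (key, sum) pair, which is why [0][1] picks out the total as a plain Python int.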
I tried this on a bigger dataset and measured the processing time:
RDD and reduceByKey: 2.23 s
GroupByKey: 30.5 s