Problem Description
I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor:
My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
The error that I get when this occurs is:
NoneType"對象沒有附加"屬性.
'NoneType' object has no attribute 'append'.
My keys are integers and values V1,...,Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).
Recommended Answer
Map and ReduceByKey
The input type and output type of reduce must be the same; therefore, if you want to aggregate a list, you have to map the input to lists. Afterwards you combine the lists into one list.
Combine lists
You'll need a method to combine lists into one list. Python provides some methods to combine lists.
append modifies the first list and will always return None.
x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]
extend also modifies the first list in place and returns None, but it unpacks the elements of its argument instead of nesting it:
x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]
Both methods return None, but you'll need a method that returns the combined list; therefore, just use the plus sign.
x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
Spark
file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
          .map(lambda actor: (actor.split(",")[0], actor))
          # transform each value into a list
          .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
          # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
          .reduceByKey(lambda a, b: a + b))
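The same pattern as a self-contained local sketch (made-up data and a local SparkContext instead of the HDFS file above), matching the question's integer keys and tuple values:

from pyspark import SparkContext

sc = SparkContext("local", "kmv-example")

# hypothetical (key, value) pairs: integer keys, tuple values
pairs = sc.parallelize([(1, ("a", 1)), (1, ("b", 2)), (2, ("c", 3))])

# wrap each value in a list, then concatenate the lists per key
kmv = pairs.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)

print(kmv.collect())
# [(1, [('a', 1), ('b', 2)]), (2, [('c', 3)])]  (key order may vary)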
CombineByKey
It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
GroupByKey
It's also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.
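A sketch of that variant, again on the hypothetical pairs RDD: groupByKey yields an iterable per key, so mapValues(list) converts it into the list the question asks for:

kmv = pairs.groupByKey().mapValues(list)
# [(1, [('a', 1), ('b', 2)]), (2, [('c', 3)])]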