Problem description
I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).
Why is pandas so much faster than data.table? Is it because of an inherent speed advantage Python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
Here's the R code and the Python code used to benchmark the various packages.
It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.
Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table(), but that code isn't hooked up yet to replace the levels-to-levels match.
Are pandas merges faster than data.table for regular integer columns? That should be a way to isolate the algorithm itself from the factor issues.
Also, data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE), a.k.a. last observation carried forward.
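To make "last observation carried forward" concrete, here is a minimal sketch in plain Python. It is not how data.table implements roll=TRUE; the function name rolling_join, the sample rows, and the use of integers in place of datetimes are all assumptions made for illustration. The only idea borrowed from the text is that the table is kept sorted by (id, time), so the prevailing row can be found by binary search.

```python
from bisect import bisect_right

def rolling_join(x_rows, i_rows):
    """For each (id, time) in i_rows, return the value of the last x row
    with the same id and a time <= the requested time, or None."""
    # x_rows must already be sorted by (id, time); that sort order plays
    # the role of the key in this sketch.
    keys = [(r[0], r[1]) for r in x_rows]
    out = []
    for id_, t in i_rows:
        # Index of the last key that is <= (id_, t), or -1 if there is none.
        pos = bisect_right(keys, (id_, t)) - 1
        if pos >= 0 and keys[pos][0] == id_:
            out.append(x_rows[pos][2])
        else:
            out.append(None)  # no earlier observation for this id
    return out

# Hypothetical data: (id, time, value), with integer times for brevity.
prices = sorted([
    ("A", 1, 10.0),
    ("A", 5, 11.0),
    ("B", 2, 20.0),
])
queries = [("A", 3), ("A", 5), ("A", 0), ("B", 9), ("C", 1)]
result = rolling_join(prices, queries)
print(result)
```

The query ("A", 3) has no exact match, so it picks up the value observed at time 1; ("A", 0) and ("C", 1) have nothing to roll forward from and yield None.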
I'll need some time to confirm, as it's the first I've seen of the comparison to data.table as presented.
UPDATE from data.table v1.8.0 released July 2012
- Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. >10,000). This was exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of the Python package pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.
Also in that release:
- Character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
- New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.
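The "preliminary step" described in these notes can be sketched in plain Python: before two factor columns can be compared, the levels of i have to be matched against the levels of x, and i's integer codes rewritten accordingly. This is an illustration of the contract only. The helper match below is a hypothetical stand-in that uses a Python dict (a hash table), whereas the notes above say chmatch() relies on R's internal string cache and builds no hash table. Positions here are 0-based, unlike R.

```python
def match(x, table):
    """Position in `table` of each element of `x`, or None if absent.
    Same contract as R's match(), but 0-based."""
    first_pos = {}
    for pos, value in enumerate(table):
        first_pos.setdefault(value, pos)  # keep the first occurrence only
    return [first_pos.get(value) for value in x]

# Two factor-like columns: integer codes pointing into their own level lists.
x_levels = ["apple", "banana", "cherry"]
i_levels = ["cherry", "apple", "durian"]
i_codes = [0, 0, 1, 2]  # cherry, cherry, apple, durian

# Step 1 (the preliminary step): map each level of i to its position in x's levels.
level_map = match(i_levels, x_levels)

# Step 2: rewrite i's codes in terms of x's levels so the codes are comparable.
i_codes_in_x = [level_map[c] for c in i_codes]

print(level_map)
print(i_codes_in_x)
```

Step 1 runs once per level rather than once per row, which is why its cost only becomes visible when the number of distinct levels is large.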
As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.
But as I wrote originally, above:
"data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE), a.k.a. last observation carried forward."
So the pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the two columns combined. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys, for example, is on the to-do list.
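The contrast between hashing a combined two-column key and relying on sort order can be sketched in plain Python. Neither function below is taken from pandas or data.table; hash_join, sorted_key_join, and the sample rows are hypothetical names and data chosen for illustration, and the real implementations are certainly more elaborate.

```python
from bisect import bisect_left, bisect_right

def hash_join(left, right):
    """Inner equi-join on the first two fields, using a dict keyed by the pair."""
    index = {}
    for row in right:
        index.setdefault((row[0], row[1]), []).append(row[2])
    return [(a, b, lv, rv)
            for a, b, lv in left
            for rv in index.get((a, b), [])]

def sorted_key_join(left, right_sorted):
    """Same join, but right_sorted is kept ordered by its first two fields
    and matches are located by binary search instead of hashing."""
    keys = [(r[0], r[1]) for r in right_sorted]
    out = []
    for a, b, lv in left:
        lo = bisect_left(keys, (a, b))   # first row with this key
        hi = bisect_right(keys, (a, b))  # one past the last row with this key
        out.extend((a, b, lv, right_sorted[k][2]) for k in range(lo, hi))
    return out

left = [("x", "p", 1), ("y", "q", 2), ("z", "r", 3)]
right = [("y", "q", 20), ("x", "p", 10), ("x", "p", 11)]

hashed = hash_join(left, right)
ordered = sorted_key_join(left, sorted(right))
print(hashed)
print(ordered)
```

Both approaches return the same rows for an exact-match join. The sorted version can also answer "nearest earlier key" queries by inspecting the position just before the insertion point, which a hash index cannot do; that is the trade-off the paragraph above describes.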
In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.