
Why were pandas merges in python faster than data.table merges in R in 2012?


I recently came across the pandas library for python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package in R (my language of choice for analysis).

Why is pandas so much faster than data.table? Is it because of an inherent speed advantage python has over R, or is there some tradeoff I'm not aware of? Is there a way to perform inner and outer joins in data.table without resorting to merge(X, Y, all=FALSE) and merge(X, Y, all=TRUE)?
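For concreteness, here is a toy version of the merge() calls I mean (X and Y are made-up keyed tables, not the benchmark data):

    library(data.table)
    X <- data.table(id = 1:4, x = letters[1:4], key = "id")
    Y <- data.table(id = 3:6, y = LETTERS[3:6], key = "id")
    merge(X, Y, all = FALSE)  # inner join: only ids present in both X and Y
    merge(X, Y, all = TRUE)   # full outer join: all ids, NA where absent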

Here's the R code and the Python code used to benchmark the various packages.

Solution

It looks like Wes may have discovered a known issue in data.table when the number of unique strings (levels) is large: 10,000.

Does Rprof() reveal most of the time spent in the call sortedmatch(levels(i[[lc]]), levels(x[[rc]]))? This isn't really the join itself (the algorithm), but a preliminary step.
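A profiling run along these lines would show it (a sketch; the factor tables here are invented stand-ins for the benchmark's data, and sortedmatch() was internal to data.table at the time):

    library(data.table)
    set.seed(42)
    lev <- as.character(1:10000)                # 10,000 unique levels
    X <- data.table(k = factor(sample(lev, 1e6, replace = TRUE)), x = 1L, key = "k")
    Y <- data.table(k = factor(lev), y = 2L, key = "k")
    Rprof("join.out")                           # start R's sampling profiler
    ans <- merge(X, Y)                          # the join under test
    Rprof(NULL)                                 # stop profiling
    head(summaryRprof("join.out")$by.self)      # is the levels-matching step at the top?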

Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by test.data.table(), but that code isn't hooked up yet to replace the levels-to-levels match.

Are pandas merges faster than data.table for regular integer columns? That would be a way to isolate the algorithm itself from the factor issues.
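That isolation test might look something like this (a sketch; the sizes and table names are invented):

    library(data.table)
    set.seed(1)
    N <- 1e6L
    X <- data.table(id = sample.int(1e4L, N, replace = TRUE), x = rnorm(N), key = "id")
    Y <- data.table(id = 1:1e4L, y = rnorm(1e4L), key = "id")
    system.time(ans <- merge(X, Y))   # integer key only: no factor levels involved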

Also, data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.
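A toy illustration of both aspects (made-up quotes/trades tables):

    library(data.table)
    quotes <- data.table(id = "A",
                         datetime = as.POSIXct("2012-01-01") + c(5, 25),
                         bid = c(99.0, 100.5))
    trades <- data.table(id = "A",
                         datetime = as.POSIXct("2012-01-01") + c(10, 20, 30),
                         price = c(100, 101, 102))
    setkey(quotes, id, datetime)   # i) multi-column ordered key (id, datetime)
    setkey(trades, id, datetime)
    quotes[trades, roll = TRUE]    # ii) each trade gets the prevailing quote (LOCF)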

I'll need some time to confirm, as it's the first I've seen of the comparison to data.table as presented.


UPDATE from data.table v1.8.0 released July 2012

• Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. > 10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings, of which 600,000 are unique, is now reduced from 16s to 0.5s, for example.

Also in that release:

• character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.
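A sketch of that behaviour change (toy table; in earlier versions setkey() would have coerced id to factor):

    library(data.table)
    DT <- data.table(id = c("b", "a", "c"), v = 1:3)
    setkey(DT, id)    # sorts by id; id stays character in v1.8.0+
    class(DT$id)      # "character"
    DT["a"]           # keyed lookup on a character value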

• New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.
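For character vectors they behave like their base equivalents (toy data):

    library(data.table)
    x   <- c("apple", "pear", "banana")
    tab <- c("banana", "apple")
    chmatch(x, tab)   # 2 NA 1 -- same result as match(x, tab)
    x %chin% tab      # TRUE FALSE TRUE -- same result as x %in% tab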

As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. NEWS is updated live.


But as I wrote originally, above:

data.table has time series merge in mind. Two aspects to that: i) multi-column ordered keys such as (id,datetime) ii) fast prevailing join (roll=TRUE) a.k.a. last observation carried forward.

So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the combined two columns. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). On the list is to add secondary keys, for example.
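To illustrate that last point (a toy table; setkey() physically reorders the rows, and keyed lookups then binary search that ordering rather than probing a hash):

    library(data.table)
    DT <- data.table(id = c(3L, 1L, 2L), v = c("c", "a", "b"))
    setkey(DT, id)    # rows are now stored sorted by id in RAM
    DT                # id = 1, 2, 3: the sort order itself is the "index"
    DT[J(2L)]         # keyed lookup via binary search on the sorted column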

In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.
