
Share a dictionary of pandas DataFrames across multiprocessing in Python



Question


I have a dictionary of pandas DataFrames in Python. The total size of this dictionary is about 2GB. However, when I share it across 16 processes (the subprocesses only read the dict's data and never modify it), it takes 32GB of RAM. So I would like to ask whether it is possible to share this dictionary across multiprocessing without copying it. I tried converting it to a manager.dict(), but that seems to take too long. What would be the most standard way to achieve this? Thank you.
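For context, here is a minimal sketch (with hypothetical names) of the pattern that runs into this: even read-only access from a Pool can leave each worker holding its own copy of the inherited data, since fork's copy-on-write pages are touched by Python's reference counting and pickled arguments are copied outright.

import multiprocessing as mp
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ~2GB dictionary of DataFrames.
frames = {i: pd.DataFrame(np.random.rand(1000, 10)) for i in range(100)}

def work(key):
    # Read-only, yet each of the 16 workers can still end up with its own copy.
    return frames[key].to_numpy().sum()

if __name__ == '__main__':
    with mp.Pool(16) as pool:
        print(sum(pool.map(work, list(frames))))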

Answer

The best solution I've found (and it only works for some types of problems) is a client/server setup using Python's BaseManager and SyncManager classes. To do this, you first set up a server that serves up a proxy class for the data.

                  DataServer.py

#!/usr/bin/python
from multiprocessing.managers import SyncManager
import numpy

# Global for storing the data to be served
gData = {}

# Proxy class to be shared with different processes.
# Don't put big data in here since that will force it to be piped to the
# other process when instantiated there; instead just return a portion of
# the global data when requested.
class DataProxy(object):
    def __init__(self):
        pass

    def getData(self, key, default=None):
        return gData.get(key, default)

if __name__ == '__main__':
    port = 5000

    print('Simulate loading some data')
    for i in range(1000):
        gData[i] = numpy.random.rand(1000)

    # Start the server on address (host, port)
    print('Serving data. Press <ctrl>-c to stop.')
    class myManager(SyncManager): pass
    myManager.register('DataProxy', DataProxy)
    mgr = myManager(address=('', port), authkey=b'DataProxy01')
    server = mgr.get_server()
    server.serve_forever()
                  

Run the above once and leave it running. Below is the client class you use to access the data.

                  DataClient.py

from multiprocessing.managers import BaseManager
import psutil  # third-party module for process info (not strictly required)

# Grab the shared proxy class. All methods in that class will be available here.
class DataClient(object):
    def __init__(self, port):
        assert self._checkForProcess('DataServer.py'), 'Must have DataServer running'
        class myManager(BaseManager): pass
        myManager.register('DataProxy')
        self.mgr = myManager(address=('localhost', port), authkey=b'DataProxy01')
        self.mgr.connect()
        self.proxy = self.mgr.DataProxy()

    # Verify the server is running (not required)
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            try:
                # The process name is usually just 'python', so look for the
                # script name in the command line instead.
                if name in ' '.join(proc.cmdline()):
                    return True
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        return False
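With the server running, any script can pull individual entries on demand; a short usage sketch (assuming DataServer.py above is already running on port 5000):

from DataClient import DataClient

client = DataClient(5000)          # connect to the manager on localhost:5000
array = client.proxy.getData(42)   # fetch one entry, not the whole store
print(array[:5])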
                  

Below is the test code to try this out with multiprocessing.

                  TestMP.py

#!/usr/bin/python
import time
import multiprocessing as mp
import numpy
from DataClient import *

# Confusing, but the "proxy" will be global to each subprocess;
# it's not shared across all processes.
gProxy = None
gMode = None
gDummy = None

def init(port, mode):
    global gProxy, gMode, gDummy
    gProxy = DataClient(port).proxy
    gMode = mode
    gDummy = numpy.random.rand(1000)  # Same as the dummy in the server
    #print('Init proxy', id(gProxy), 'in', mp.current_process())

def worker(key):
    global gProxy, gMode, gDummy
    if 0 == gMode:    # get from proxy
        array = gProxy.getData(key)
    elif 1 == gMode:  # bypass retrieval to test the difference
        array = gDummy
    else:
        assert 0, 'unknown mode: %s' % gMode
    for i in range(1000):
        x = sum(array)
    return x

if __name__ == '__main__':
    port = 5000
    maxkey = 1000
    numpts = 100

    for mode in [1, 0]:
        for nprocs in [16, 1]:
            if 0 == mode: print('Using client/server and %d processes' % nprocs)
            if 1 == mode: print('Using local data and %d processes' % nprocs)
            keys = [numpy.random.randint(0, maxkey) for k in range(numpts)]
            pool = mp.Pool(nprocs, initializer=init, initargs=(port, mode))
            start = time.time()
            ret_data = pool.map(worker, keys, chunksize=1)
            print('   took %4.3f seconds' % (time.time() - start))
            pool.close()
                  

When I run this on my machine I get...

                  Using local data and 16 processes
                     took 0.695 seconds
                  Using local data and 1 processes
                     took 5.849 seconds
                  Using client/server and 16 processes
                     took 0.811 seconds
                  Using client/server and 1 processes
                     took 5.956 seconds
                  

Whether this works for your multiprocessing system depends on how often you have to grab the data. There's a small overhead associated with each transfer. You can see this if you turn down the number of iterations in the x = sum(array) loop: at some point you'll spend more time getting data than working on it.
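If the per-transfer overhead dominates, one way to amortize it (my own sketch, not part of the original answer) is to fetch several keys in one round trip, e.g. by adding a hypothetical getMany method to DataProxy in DataServer.py:

class DataProxy(object):
    def getData(self, key, default=None):
        return gData.get(key, default)

    def getMany(self, keys):
        # One round trip for many keys instead of one per key.
        return {k: gData.get(k) for k in keys}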

Besides multiprocessing, I also like this pattern because I only have to load my big array data once in the server program, and it stays loaded until I kill the server. That means I can run a bunch of separate scripts against the data and they execute quickly; there's no waiting for the data to load.
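For example, a throwaway analysis script (hypothetical; it assumes the server is already up) starts essentially instantly, because only the entries it asks for cross the wire:

from DataClient import DataClient

proxy = DataClient(5000).proxy
# No load time: the big data stays in the server; only these arrays transfer.
print([proxy.getData(k).mean() for k in range(10)])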

While the approach here is somewhat similar to using a database, it has the advantage of working with any type of Python object, not just simple DB tables of strings, ints, etc. I've found that a DB is a bit faster for those simple types, but for me it tends to be more work programmatically, and my data doesn't always port easily to a database.


Related articles

What exactly is Python multiprocessing Module's .join() Method Doing?
Passing multiple parameters to pool.map() function in Python
multiprocessing.pool.MaybeEncodingError: 'TypeError("cannot serialize '_io.BufferedReader' object",)'
Python Multiprocess Pool. How to exit the script when one of the worker process determines no more work needs to be done?
How do you pass a Queue reference to a function managed by pool.map_async()?
yet another confusion with multiprocessing error, 'module' object has no attribute 'f'
