問題描述
上個月,當我們嘗試使用 Python 2.6.x 多處理包在多臺不同 (linux) 計算機之間共享隊列時,我們遇到了一個持續存在的問題.我也直接向 Jesse Noller 提出了這個問題,因為我們還沒有在 StackOverflow、Python 文檔、源代碼或其他在線網站上找到任何說明該問題的內容.
In the last month, we've had a persistent problem with the Python 2.6.x multiprocessing package when we've tried to use it to share a queue among several different (linux) computers. I've posed this question directly to Jesse Noller as well since we haven't yet found anything that elucidates the issue on StackOverflow, Python docs, source code or elsewhere online.
我們的工程師團隊無法解決這個問題,我們向 python 用戶組中的很多人提出了這個問題,但無濟于事.我希望有人能提供一些見解,因為我覺得我們做錯了什么,但離問題太近了,看不到它的本質.
Our team of engineers hasn't been able to solve this one, and we've posed the question to quite a few people in python user groups to no avail. I was hoping someone could shed some insight, since I feel like we're doing something incorrect but are too close to the problem to see it for what it is.
這是癥狀:
Traceback (most recent call last):
File "/var/django_root/dev/com/brightscope/data/processes/daemons/deferredupdates/servers/queue_server.py", line 65, in get_from_queue
return queue, queue.get(block=False)
File "<string>", line 2, in get
File "/usr/local/lib/python2.6/multiprocessing/managers.py", line 725, in _callmethod
conn.send((self._id, methodname, args, kwds))
IOError: [Errno 32] Broken pipe
(我展示了我們的代碼在共享隊列對象上調用 queue.get() 的位置,該對象由擴展 SyncManger 的管理器托管).
(I'm showing where our code calls queue.get() on a shared queue object, hosted by a manager that extends SyncManger).
這個問題的特殊之處在于,如果我們連接到單臺機器上的這個共享隊列(我們稱之為 machine A
),即使來自許多并發進程,我們似乎也永遠不會遇到問題.只有當我們從其他機器(我們稱這些機器 B 和 C
)連接到隊列(同樣,使用擴展多處理 SyncManager 并且目前沒有添加額外功能的類)并運行大量在我們遇到問題的同時進入和退出隊列.
What's peculiar about the issue is that if we connect to this shared queue on a single machine (let's call this machine A
), even from lots of concurrent processes, we never seem to run into an issue. It's only when we connect to the queue (again, using a class that extends multiprocessing SyncManager and currently adds no additional functionality) from other machines (let's call these machines B and C
) and run a high volume of items into and out of the queue at the same time that we experience a problem.
就好像 python 的 multiprocessing 包處理本地連接(即使它們仍然使用相同的 manager.connect() 連接方法)以一種從 machine A
工作的方式但是當遠程連接是至少從 machines B 或 C
之一同時生成,我們得到了 Broken pipe 錯誤.
It is as though python's multiprocessing package handles local connections (even though they are still using the same manager.connect() connection method) in a manner that works from machine A
but when remote connections are made simultaneously from at least one of machines B or C
we get a Broken pipe error.
在我的團隊所做的所有閱讀中,我們認為問題與鎖定有關.我們想也許我們不應該使用 Queue.Queue
,而是使用 multiprocessing.Queue
,但是我們切換了,問題仍然存在(我們還注意到 SyncManager 自己的共享隊列是Queue.Queue 的實例).
In all the reading my team has done, we thought the problem was related to locking. We thought maybe we shouldn't use Queue.Queue
, but instead multiprocessing.Queue
, but we switched and the problem persisted (we also noticed that SyncManager's own shared Queue is an instance of Queue.Queue).
我們正在努力解決如何調試問題,因為它很難重現,但確實經常發生(如果我們從隊列中插入和 .get() 處理大量項目,則每天多次).
We are pulling our hair out about how to even debug the issue, since it's hard to reproduce but does happen fairly frequently (many times per day if we are inserting and .get()ing lots of items from the queue).
我們創建的方法 get_from_queue
嘗試以隨機睡眠間隔重試從隊列中獲取項目約 10 次,但似乎如果它失敗一次,它將失敗十次(這導致我相信 .register() 和 .connect() 連接到管理器可能不會給服務器提供另一個套接字連接,但我無法通過閱讀文檔或查看 Python 內部源代碼來確認這一點).
The method we created get_from_queue
attempts to retry acquiring the item from a queue ~10 times with randomized sleep intervals, but it seems like if it fails once, it will fail all ten times (which lead me to believe that .register() and .connect()ing to a manager perhaps doesn't give another socket connection to the server, but I couldn't confirm this either by reading the docs or looking at the Python internal source code).
任何人都可以提供有關我們可能查看的位置或我們如何跟蹤實際發生的情況的任何見解嗎?
Can anyone provide any insight into where we might look or how we might track what's actually happening?
我們如何使用 multiprocessing.BaseManager
或 multiprocessing.SyncManager
在管道損壞的情況下啟動新連接?
How can we start a new connection in the event of a broken pipe using multiprocessing.BaseManager
or multiprocessing.SyncManager
?
我們首先如何防止管道破裂?
How can we prevent the broken pipe in the first place?
推薦答案
僅供參考 如果其他人運行同樣的錯誤,在與 Python 核心開發團隊的 Ask Solem 和 Jesse Noller 進行廣泛咨詢后,看起來這實際上是一個當前 python 2.6.x(可能是 2.7+,可能是 3.x)中的錯誤.他們正在尋找可能的解決方案,并且可能會在 Python 的未來版本中包含一個修復程序.
FYI In case anyone else runs by this same error, after extensive consulting with Ask Solem and Jesse Noller of Python's core dev team, it looks like this is actually a bug in current python 2.6.x (and possibly 2.7+ and possibly 3.x). They are looking at possible solutions and a fix will probably be included in a future version of Python.
這篇關于使用 Python 多處理管理器 (BaseManager/SyncManager) 與遠程機器共享隊列時管道損壞的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!