Problem Description
I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format so that anyone could run it on a Hadoop cluster, such as Amazon Web Services (AWS). The pipeline currently consists of a series of Python scripts that wrap different binary executables and manage the input and output using the Python subprocess and tempfile modules. Unfortunately, I didn't write the binary executables, and many of them either don't take STDIN or don't emit STDOUT in a 'usable' fashion (e.g., they only send it to files). These problems are why I've wrapped most of them in Python.
So far I've been able to modify my Python code such that I have a mapper and a reducer that I can run on my local machine in the standard 'test format':
$ cat data.txt | mapper.py | reducer.py
The mapper formats each line of data the way the binary it wraps expects it, sends the text to the binary using subprocess.Popen (this also allows me to mask a lot of spurious STDOUT), then collects the STDOUT I want and formats it into lines of text appropriate for the reducer. The problems arise when I try to replicate the command on a local Hadoop install. I can get the mapper to execute, but it gives an error suggesting that it can't find the binary executable.
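The wrapping pattern described above can be sketched as a minimal streaming mapper. This is a hypothetical illustration, not my actual script: `/bin/echo` stands in for the real wrapped binary, and the tab-separated key/value output format is an assumption.

```python
import sys
import shlex
from subprocess import Popen, PIPE

BINARY = "/bin/echo"  # stand-in for the real binary executable


def run_binary(argument):
    """Run the wrapped binary on one formatted argument and capture its STDOUT."""
    cli_parts = shlex.split("%s %s" % (BINARY, argument))
    p = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
    out, _err = p.communicate()  # capturing _err masks spurious diagnostics
    return out.decode("utf-8", "replace")


def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        # emit tab-separated key/value lines for the reducer
        print("%s\t%s" % (line, run_binary(line).strip()))


if __name__ == "__main__":
    main()
```

Run locally with `cat data.txt | python mapper.py` to check the output before submitting it to streaming.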
文件"/Users/me/Desktop/hadoop-0.21.0/./phyml.py",第 69 行,在main() 文件/Users/me/Desktop/hadoop-0.21.0/./mapper.py",第 66 行,主要phyml(無)文件/Users/me/Desktop/hadoop-0.21.0/./mapper.py",第 46 行,在 phyml 中ft = Popen(cli_parts,stdin=PIPE,stderr=PIPE,stdout=PIPE)文件"/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py",第 621 行,在 init 中錯誤讀取,錯誤寫入)文件/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py",第 1126 行,在 _execute_child 中引發(fā) child_exceptionOSError: [Errno 13] 權限被拒絕
File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in main() File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main phyml(None) File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE) File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in init errread, errwrite) File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child raise child_exception OSError: [Errno 13] Permission denied
My hadoop command looks like the following:
./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
-input /Users/me/Desktop/Code/AWS/temp/data.txt
-output /Users/me/Desktop/aws_test
-mapper mapper.py
-reducer reducer.py
-file /Users/me/Desktop/Code/AWS/temp/mapper.py
-file /Users/me/Desktop/Code/AWS/temp/reducer.py
-file /Users/me/Desktop/Code/AWS/temp/binary
As I noted above, it looks to me like the mapper isn't aware of the binary - perhaps it's not being sent to the compute node? Unfortunately I can't really tell what the problem is. Any help would be greatly appreciated. It would be particularly nice to see some Hadoop streaming mappers/reducers written in Python that wrap binary executables. I can't imagine I'm the first one to try to do this! In fact, here is another post asking essentially the same question, but it hasn't been answered yet...
Hadoop/Elastic Map Reduce with binary executables?
Recommended Answer
After much googling (etc.) I figured out how to include executable binaries/scripts/modules that are accessible to your mappers/reducers. The trick is to upload all your files to HDFS first.
$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py
Then you need to format your streaming command like the following template:
$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar
-file /local/file/system/data/data.txt
-file /local/file/system/mapper.py
-file /local/file/system/reducer.py
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py
-input data.txt
-output output/
-mapper mapper.py
-reducer reducer.py
-verbose
If you're linking a Python module, you'll need to add the following code to your mapper/reducer scripts:
import sys
sys.path.append('.')
import module
If you're accessing a binary via subprocess, your code should look something like this:
cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
mp.communicate()[0]
Hope this helps.