#5 open
Quinn Slack

Homedisco "can't access local input file" errors

Reported by Quinn Slack | January 1st, 2009 @ 11:11 PM

In the git source, the homedisco example (at the bottom of util/homedisco.py) fails to run with the following error:


sqs2 ~/src/disco: python util/homedisco.py
**<MSG>[09/01/02 00:47:19 none ()] Received a new map job! 
**<MSG>[09/01/02 00:47:19 none ()] Done: 3 entries mapped in total 
**<OUT>[09/01/02 00:47:19 none ()] 0 chunk://localhost/homedisco@1230878839/map-chunk-0 
**<MSG>[09/01/02 00:47:19 none ()] Received a new reduce job! 
**<MSG>[09/01/02 00:47:19 none ()] Starting reduce 
connect_input(fname=chunkfile://data/homedisco@1230878839/map-chunk-0)
Traceback (most recent call last):
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 39, in open_local
    f = file(fname)
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230878839/map-chunk-0'
None
**<DAT>[09/01/02 00:47:19 none (chunkfile://data/homedisco@1230878839/map-chunk-0)] Can't access a local input file: chunkfile://data/homedisco@1230878839/map-chunk-0 
Traceback (most recent call last):
  File "util/homedisco.py", line 78, in <module>
    reduce = fun_reduce)
  File "util/homedisco.py", line 44, in new_job
    disco_worker.op_reduce(req)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 430, in op_reduce
    fun_reduce(red_in.iter(), red_out, red_params)
  File "util/homedisco.py", line 60, in fun_reduce
    for k, v in iter:
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 285, in multi_file_iterator
    sze, fd = connect_input(fname)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 131, in connect_input
    return open_local(input, local_file, is_chunk)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 50, in open_local
    % input, input)
  File "/Users/sqs/src/disco/node/disconode/disco_worker.py", line 39, in open_local
    f = file(fname)
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230878839/map-chunk-0'

It appears that open_local derives the filename incorrectly from the chunkfile:// URI it is given. It strips the scheme but does not prepend the value of the DISCO_ROOT environment variable as it should, so the resulting relative path is looked up under the current working directory.
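To illustrate the intended behavior, here is a minimal sketch of how such a URI could be mapped to a path under DISCO_ROOT. The helper name `resolve_local_path` and the example root `/srv/disco` are hypothetical and are not taken from the Disco source or from the attached patch; only the URI shape and the DISCO_ROOT variable come from this ticket.

```python
import os


def resolve_local_path(uri, disco_root=None):
    """Hypothetical helper: turn 'chunkfile://data/...' into a path under DISCO_ROOT.

    This is an illustration of the idea in the ticket, not Disco's actual code.
    """
    if disco_root is None:
        # Fall back to the environment variable named in the ticket.
        disco_root = os.environ.get("DISCO_ROOT", "")
    # Drop the scheme ("chunkfile", "file", ...) and keep the relative part.
    scheme, sep, relative = uri.partition("://")
    if not sep:
        # No scheme present: treat the whole string as the relative path.
        relative = uri
    # Prepend the root so the lookup no longer depends on the working directory.
    return os.path.join(disco_root, relative)


if __name__ == "__main__":
    print(resolve_local_path(
        "chunkfile://data/homedisco@1230878839/map-chunk-0", "/srv/disco"))
```

The same mapping would presumably apply to the `file://data/...` addresses that result_iterator receives, since they have the same relative form.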

result_iterator also tries to load the result from a relative path when it should prepend DISCO_ROOT. If only the open_local issue is fixed, it fails with this error:


**<MSG>[09/01/02 00:55:54 none ()] Received a new map job! 
**<MSG>[09/01/02 00:55:54 none ()] Done: 3 entries mapped in total 
**<OUT>[09/01/02 00:55:54 none ()] 0 chunk://localhost/homedisco@1230879354/map-chunk-0 
**<MSG>[09/01/02 00:55:54 none ()] Received a new reduce job! 
**<MSG>[09/01/02 00:55:54 none ()] Starting reduce 
connect_input(fname=chunkfile://data/homedisco@1230879354/map-chunk-0)
**<MSG>[09/01/02 00:55:54 none ()] Reduce done: 3 entries reduced in total 
**<MSG>[09/01/02 00:55:54 none ()] Reduce done 
**<OUT>[09/01/02 00:55:54 none ()] 0 disco://localhost/homedisco@1230879354/reduce-disco-0 
['file://data/homedisco@1230879354/reduce-disco-0']
Traceback (most recent call last):
  File "util/homedisco.py", line 80, in <module>
    for k, v in result_iterator(res):
  File "build/bdist.macosx-10.5-i386/egg/disco/core.py", line 261, in result_iterator
IOError: [Errno 2] No such file or directory: 'data/homedisco@1230879354/reduce-disco-0'

After applying this patch, the correct output is returned:


sqs2 ~/src/disco: python util/homedisco.py
**<MSG>[09/01/02 00:57:57 none ()] Received a new map job! 
**<MSG>[09/01/02 00:57:57 none ()] Done: 3 entries mapped in total 
**<OUT>[09/01/02 00:57:57 none ()] 0 chunk://localhost/homedisco@1230879477/map-chunk-0 
**<MSG>[09/01/02 00:57:57 none ()] Received a new reduce job! 
**<MSG>[09/01/02 00:57:57 none ()] Starting reduce 
**<MSG>[09/01/02 00:57:57 none ()] Reduce done: 3 entries reduced in total 
**<MSG>[09/01/02 00:57:57 none ()] Reduce done 
**<OUT>[09/01/02 00:57:57 none ()] 0 disco://localhost/homedisco@1230879477/reduce-disco-0 
KEY red:dog VALUE dog
KEY red:cat VALUE cat
KEY red:possum VALUE possum

The patch also fixes the problem for a custom HomeDisco job I wrote, but there's no test suite for me to determine whether it is correct in all cases. Specifically, it does not appear to introduce issues when running remote jobs (i.e., not through HomeDisco), but I can't guarantee anything. Also, there may be a better way of doing this. (I saw that the LOCAL_PATH env var exists, but it already has "/data" at the end, and the filenames we are appending to $DISCO_ROOT have "/data" at the beginning, so using LOCAL_PATH would result in an incorrect "/data/data".)
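The "/data/data" concern can be shown with a small sketch. The root `/srv/disco` and the function name `join_under` are made up for illustration; the only facts taken from this ticket are that LOCAL_PATH ends in "/data" and that the relative filenames begin with "data/".

```python
import os


def join_under(base, relative):
    """Hypothetical helper: join a base directory and a relative filename."""
    return os.path.join(base, relative)


# Assumed example values, not taken from a real installation.
disco_root = "/srv/disco"
local_path = disco_root + "/data"  # LOCAL_PATH already ends in "/data"
relative = "data/homedisco@1230879477/reduce-disco-0"

good = join_under(disco_root, relative)  # single "data" component
bad = join_under(local_path, relative)   # duplicated "data/data" component

if __name__ == "__main__":
    print(good)
    print(bad)
```

Joining against DISCO_ROOT keeps a single "data" component, while joining against LOCAL_PATH duplicates it, which is why the patch uses DISCO_ROOT.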


Disco is an open-source implementation of the Map-Reduce framework for distributed computing. Like the original framework, Disco supports parallel computations over large data sets on unreliable clusters of computers.
