Here's what fastgpu does:
- poll to_run
- find first file
- check there's an available worker id
- move it to running
- handle the script
  - create lock file
  - redirect stdout/err to out
  - run it
  - when done, move it to complete or failed
  - unlock
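The steps above can be sketched in plain Python using only the standard library. This is a minimal illustration, not fastgpu's implementation: the function name poll_once, the interpreter parameter, the lock-file location, and the output file suffixes are all assumptions made for the example.

```python
import shutil
import subprocess
from pathlib import Path


def poll_once(root, worker_ids, interpreter=None):
    """Handle at most one script under `root`; return True if one was run.

    Hypothetical sketch: directory names follow the list above, while the
    lock-file location and the `interpreter` argument are assumptions.
    """
    root = Path(root)
    to_run, running, out = root / "to_run", root / "running", root / "out"
    complete, failed = root / "complete", root / "failed"
    for d in (to_run, running, out, complete, failed):
        d.mkdir(parents=True, exist_ok=True)

    # Find the first file (by sorted name) in to_run, skipping directories.
    scripts = sorted(p for p in to_run.iterdir() if p.is_file())
    if not scripts:
        return False

    # Check there's an available worker id, i.e. one with no lock file.
    free = [i for i in worker_ids if not (root / f"{i}.lock").exists()]
    if not free:
        return False

    # Move the script to running.
    script = running / scripts[0].name
    shutil.move(str(scripts[0]), str(script))

    # Handle the script: lock, redirect output, run, move, unlock.
    lock = root / f"{free[0]}.lock"
    lock.touch()
    try:
        cmd = [interpreter, str(script)] if interpreter else [str(script)]
        with open(out / f"{script.name}.stdout", "w") as so, \
             open(out / f"{script.name}.stderr", "w") as se:
            code = subprocess.call(cmd, stdout=so, stderr=se)
        (out / f"{script.name}.exitcode").write_text(str(code))
        dest = complete if code == 0 else failed
        shutil.move(str(script), str(dest / script.name))
    finally:
        lock.unlink()
    return True
```

Calling poll_once repeatedly (for example in a loop with a short sleep) would drain to_run one script at a time. The sketch runs scripts synchronously, so it never uses more than one worker at once.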
To demonstrate how to use fastgpu, we first create a directory to store our scripts and outputs:
path = Path('data')
These are all the subdirectories that are created for us. Your scripts go in to_run.
path_run,path_running,path_complete,path_fail,path_out = setup_dirs(path)
Let's create a scripts directory with a couple of "scripts" (actually symlinks for this demo) in it.
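def _setup_test_env():
    # Start from a clean slate, then recreate the standard subdirectories
    shutil.rmtree('data', ignore_errors=True)
    res = setup_dirs(path)
    # Symlink the two demo scripts into to_run
    os.symlink(Path('test_scripts/script_succ.sh').absolute(), path_run/'script_succ.sh')
    os.symlink(Path('test_scripts/script_fail.sh').absolute(), path_run/'script_fail.sh')
    # A directory inside to_run, which the script finder should skip
    (path_run/'test_dir').mkdir(exist_ok=True)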
_setup_test_env()
These functions are used to find and run scripts, and move scripts to the appropriate subdirectory at the appropriate time.
test_eq(find_next_script(path_run).name, 'script_fail.sh')
assert not find_next_script(path_complete)
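The two checks above suggest that the next script is the first regular file in sorted name order, and that nothing is returned for a directory with no files. A minimal sketch of that behaviour follows, assuming this is how the lookup works; find_next_script_sketch is a hypothetical name, not the library function.

```python
from pathlib import Path
from typing import Optional


def find_next_script_sketch(directory) -> Optional[Path]:
    """Return the first regular file in `directory` by sorted name, else None."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    # is_file() follows symlinks, so symlinked scripts count as files,
    # while subdirectories are ignored.
    files = sorted(p for p in directory.iterdir() if p.is_file())
    return files[0] if files else None
```

With script_fail.sh, script_succ.sh, and a test_dir subdirectory present, this sketch would return script_fail.sh, since it sorts first and the directory is skipped.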
This abstract class locks and unlocks resources using lockfiles. Override all_ids to make the list of resources available. See FixedWorkerPool for a simple example and details on each method.
The simplest possible ResourcePoolBase subclass: the resources are just a list of ids. For instance:
_setup_test_env()
wp = FixedWorkerPool(L.range(4), path)
If there are no locks, this does nothing:
wp.unlock(0)
Initially all resources are available (unlocked), so the first from the provided list will be returned:
test_eq(wp.find_next(), 0)
After locking the first resource, it is no longer returned next:
wp.lock(0)
test_eq(wp.find_next(), 1)
This is the normal way to access a resource. It simply combines find_next and lock:
wp.lock_next()
test_eq(wp.find_next(), 2)
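To make the locking behaviour above concrete, here is a minimal sketch of a lockfile-backed pool. It is an illustration under assumptions, not the library's code: the class name LockfilePoolSketch, the is_available helper, and the "<id>.lock" file naming are invented for the example.

```python
from pathlib import Path


class LockfilePoolSketch:
    """Hypothetical pool: a resource is locked iff '<id>.lock' exists."""

    def __init__(self, worker_ids, path):
        self.worker_ids = list(worker_ids)
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def all_ids(self):
        # Subclasses could override this to enumerate other resources.
        return self.worker_ids

    def _lockpath(self, ident):
        return self.path / f"{ident}.lock"

    def is_available(self, ident):
        return not self._lockpath(ident).exists()

    def find_next(self):
        # First unlocked id in the order given, or None if all are locked.
        return next((i for i in self.all_ids() if self.is_available(i)), None)

    def lock(self, ident):
        self._lockpath(ident).touch()

    def unlock(self, ident):
        # Unlocking an id that isn't locked does nothing.
        p = self._lockpath(ident)
        if p.exists():
            p.unlink()

    def lock_next(self):
        # Combine find_next and lock; returns the locked id (or None).
        ident = self.find_next()
        if ident is not None:
            self.lock(ident)
        return ident
```

Note that find_next followed by lock is not atomic in this sketch, so two processes polling the same directory could claim the same id. Creating the lock file with os.open and the O_CREAT | O_EXCL flags is one way to close that gap.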
_setup_test_env()
wp = FixedWorkerPool(L.range(4), path)
_setup_test_env()
f = find_next_script(path_run)
wp._run(f, 0)
test_eq(find_next_script(path_run), path_run/'script_succ.sh')
test_eq((path_out/'script_fail.sh.exitcode').read_text(), '1')
assert (path_fail/'script_fail.sh').exists()
_setup_test_env()
wp.poll_scripts()
assert not find_next_script(path_run), find_next_script(path_run)
test_eq((path_out/'script_fail.sh.exitcode').read_text(), '1')
test_eq((path_out/'script_succ.sh.exitcode').read_text(), '0')
assert not (path_run/'script_fail.sh').exists()
assert (path_fail/'script_fail.sh').exists()
assert (path_complete/'script_succ.sh').exists()
test_eq((path_out/'script_succ.sh.stdout').read_text(), '0\n')
# wp = ResourcePoolGPU('data')
# wp.find_next()
This is a resource pool that uses pynvml to find GPUs that aren't being used (based on whether they have memory allocated). It is implemented by overriding two methods from ResourcePoolBase. Usage is identical to FixedWorkerPool, except that you don't need to pass in worker_ids, since available GPUs are considered to be the resource pool.
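As a rough illustration of the memory-based check, the sketch below queries used memory per GPU through pynvml and then filters the ids. The function names and the max_used_bytes threshold are assumptions for this example, and the library's own availability rule may differ. pynvml is imported inside the query function so the filtering logic can be used on machines without NVML.

```python
from typing import Dict, List


def query_gpu_memory_used() -> Dict[int, int]:
    """Return {gpu_index: bytes of memory in use}, as reported by NVML."""
    import pynvml  # lazy import: requires an NVIDIA driver and pynvml

    pynvml.nvmlInit()
    try:
        used = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            used[i] = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        return used
    finally:
        pynvml.nvmlShutdown()


def free_gpu_ids(used_by_id: Dict[int, int], max_used_bytes: int = 0) -> List[int]:
    """Ids whose used memory is at or below the (hypothetical) threshold."""
    return sorted(i for i, used in used_by_id.items() if used <= max_used_bytes)
```

On a GPU machine, free_gpu_ids(query_gpu_memory_used()) would list the candidate ids. An idle GPU may still report a small amount of used memory, so a threshold of zero may be too strict in practice.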