Getting this error
#4 opened by Gemneye
I changed cfg_number back to 1 (although the documentation seems to say that for the 24B model it should be set to 2), and now I get this error:
(magi) root@a86f02cd24e3:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
[W423 01:59:37.770111795 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]:     pipeline = MagiPipeline(args.config_file)
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in __init__
[rank0]:     dist_init(self.config)
[rank0]:   File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]:     assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
E0423 01:59:39.976000 129005648852800 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 5920) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
inference/pipeline/entry.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time       : 2025-04-23_01:59:39
  host       : a86f02cd24e3
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 5920)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
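
From the assertion that fails, it looks like the product of cp_size and pp_size in engine_config has to equal the number of processes torchrun launches (the world size). Below is a minimal sketch of that check as I understand it from the traceback; the function name and the example values are my own assumptions, not taken from run.sh or the shipped config:

```python
# Minimal sketch of the check in dist_utils.py, reconstructed from the traceback
# above. cp_size/pp_size come from engine_config; world_size is what torchrun
# reports (nproc_per_node * nnodes). Names and example values are assumptions.
def check_parallel_layout(cp_size: int, pp_size: int, world_size: int) -> None:
    # The pipeline asserts the parallel layout covers exactly world_size ranks.
    assert cp_size * pp_size == world_size, (
        f"cp_size ({cp_size}) * pp_size ({pp_size}) != world size ({world_size})"
    )

check_parallel_layout(cp_size=1, pp_size=1, world_size=1)    # passes on one process
# check_parallel_layout(cp_size=2, pp_size=1, world_size=1)  # AssertionError, as above
```

So on a single-GPU box I would expect the run to work only when cp_size * pp_size is 1, or when torchrun is started with enough processes to match the configured layout. Is that the intended setup for the 24B model?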