Another Error

#5
by Gemneye - opened

I tried upgrading torch, torchvision, and torchaudio to see if it made a difference, and now I'm getting a new error. I also downloaded the distilled models in case I could not run the 24B model.

(magi) root@46e1abf287b8:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]: pipeline = MagiPipeline(args.config_file)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in __init__
[rank0]: dist_init(self.config)
[rank0]: File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]: assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
[rank0]:[W423 02:54:17.933492678 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0423 02:54:19.241000 5094 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5163) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 8, in
sys.exit(main())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-04-23_02:54:19
  host      : 46e1abf287b8
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5163)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Sand AI org

It looks like the config needs some modifications. Could you let me know how many GPUs you’re using and what type they are?
Also, make sure that pp_size * cp_size equals the total number of GPUs.
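For example, on a 2-GPU machine the two relevant keys in the engine_config section of 24B_config.json would be set roughly like this (illustrative values; the number of processes torchrun launches in example/24B/run.sh has to match the same product):

    "pp_size": 1,
    "cp_size": 2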

I started all over from scratch. I am getting further but still having problems.

[2025-04-24 01:04:51,105 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0424 01:04:52.556000 132482488543040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3378) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

This is from the 24B_config.json file:

55 "clean_chunk_kvrange": 1,
56 "clean_t": 0.9999,
57 "seed": 83746,
58 "num_frames": 121,
59 "video_size_h": 540,
60 "video_size_w": 960,
61 "num_steps": 8,
62 "window_size": 4,
63 "fps": 24,
64 "chunk_width": 6,
65 "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
66 "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
67 "t5_device": "cuda",
68 "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
69 "scale_factor": 0.18215,
70 "temporal_downsample_factor": 4

I have no idea what is going on, but the files in the directory configured by the "load" parameter are the same as those on Hugging Face. I am not sure about this error: "assert os.path.exists(inference_weight_dir)". I tried pointing the path one level back, but that did not make a difference. I tried this with both a single L40 and with 2x L40s; I am not sure whether those specs are too low for this. I will try one of the other configurations with the other models, but I certainly cannot get this to work.

I used cp_size = 2 when I was using 2x L40s and cp_size = 1 when using a single L40.

Sand AI org

Change the "load" path to "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base" (drop the trailing inference_weight component).
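That is, the load entry in 24B_config.json becomes:

    "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base",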

Same Problem:

(magi) root@ca1683f2b34d:/workspace# ls -l /workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight
total 46757232
-rw-rw-rw- 1 root root  4988160184 Apr 23 21:18 model-00001-of-00006.safetensors
-rw-rw-rw- 1 root root  7247764000 Apr 23 21:18 model-00002-of-00006.safetensors
-rw-rw-rw- 1 root root 19327358992 Apr 23 21:19 model-00003-of-00006.safetensors
-rw-rw-rw- 1 root root  9663682528 Apr 23 21:18 model-00004-of-00006.safetensors
-rw-rw-rw- 1 root root  3623890200 Apr 23 21:18 model-00005-of-00006.safetensors
-rw-rw-rw- 1 root root  3028420248 Apr 23 21:18 model-00006-of-00006.safetensors
-rw-rw-rw- 1 root root      126708 Apr 23 21:17 model.safetensors.index.json
(magi) root@ca1683f2b34d:/workspace#
(magi) root@ca1683f2b34d:/workspace/MAGI-1# cat example/24B/24B_config.json 
{
    "model_config": {
        "model_name": "videodit_ardf",
        "num_layers": 48,
        "hidden_size": 6144,
        "ffn_hidden_size": 16384,
        "num_attention_heads": 48,
        "num_query_groups": 8,
        "kv_channels": 128,
        "layernorm_epsilon": 1e-06,
        "apply_layernorm_1p": true,
        "x_rescale_factor": 0.1,
        "half_channel_vae": true,
        "params_dtype": "torch.bfloat16",
        "patch_size": 2,
        "t_patch_size": 1,
        "in_channels": 32,
        "out_channels": 32,
        "cond_hidden_ratio": 0.25,
        "caption_channels": 4096,
        "caption_max_length": 800,
        "xattn_cond_hidden_ratio": 1.0,
        "cond_gating_ratio": 1.0,
        "gated_linear_unit": true
    },
    "runtime_config": {
        "cfg_number": 1,
        "cfg_t_range": [
            0.0,
            0.0217,
            0.1,
            0.3,
            0.999
        ],
        "prev_chunk_scales": [
            1.5,
            1.5,
            1.5,
            1.0,
            1.0
        ],
        "text_scales": [
            7.5,
            7.5,
            7.5,
            0.0,
            0.0
        ],
        "noise2clean_kvrange": [
            5,
            4,
            3,
            2
        ],
        "clean_chunk_kvrange": 1,
        "clean_t": 0.9999,
        "seed": 83746,
        "num_frames": 121,
        "video_size_h": 540,
        "video_size_w": 960,
        "num_steps": 8,
        "window_size": 4,
        "fps": 24,
        "chunk_width": 6,
        "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base",
        "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
        "t5_device": "cuda",
        "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
        "scale_factor": 0.18215,
        "temporal_downsample_factor": 4
    },
    "engine_config": {
        "distributed_backend": "nccl",
        "distributed_timeout_minutes": 15,
        "pp_size": 1,
        "cp_size": 1,
        "cp_strategy": "cp_ulysses",
        "ulysses_overlap_degree": 1,
        "fp8_quant": true,
        "distill_nearly_clean_chunk_threshold": 0.3,
        "shortcut_mode": "8,16,16",
        "distill": true,
        "kv_offload": true,
        "enable_cuda_graph": false
    }
}
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[W425 00:54:29.094511239 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[2025-04-25 00:54:29,391 - INFO] Initialize torch distribution and model parallel successfully
[2025-04-25 00:54:29,391 - INFO] MagiConfig(model_config=ModelConfig(model_name='videodit_ardf', num_layers=48, hidden_size=6144, ffn_hidden_size=16384, num_attention_heads=48, num_query_groups=8, kv_channels=128, layernorm_epsilon=1e-06, apply_layernorm_1p=True, x_rescale_factor=0.1, half_channel_vae=True, params_dtype=torch.bfloat16, patch_size=2, t_patch_size=1, in_channels=32, out_channels=32, cond_hidden_ratio=0.25, caption_channels=4096, caption_max_length=800, xattn_cond_hidden_ratio=1.0, cond_gating_ratio=1.0, gated_linear_unit=True), runtime_config=RuntimeConfig(cfg_number=1, cfg_t_range=[0.0, 0.0217, 0.1, 0.3, 0.999], prev_chunk_scales=[1.5, 1.5, 1.5, 1.0, 1.0], text_scales=[7.5, 7.5, 7.5, 0.0, 0.0], noise2clean_kvrange=[5, 4, 3, 2], clean_chunk_kvrange=1, clean_t=0.9999, seed=83746, num_frames=121, video_size_h=540, video_size_w=960, num_steps=8, window_size=4, fps=24, chunk_width=6, t5_pretrained='/workspace/MAGI-1-models/models/T5/ckpt/t5', t5_device='cuda', vae_pretrained='/workspace/MAGI-1-models/models/VAE', scale_factor=0.18215, temporal_downsample_factor=4, load='/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base'), engine_config=EngineConfig(distributed_backend='nccl', distributed_timeout_minutes=15, pp_size=1, cp_size=1, cp_strategy='cp_ulysses', ulysses_overlap_degree=1, fp8_quant=True, distill_nearly_clean_chunk_threshold=0.3, shortcut_mode='8,16,16', distill=True, kv_offload=True, enable_cuda_graph=False))
/workspace/MAGI-1/inference/pipeline/video_process.py:229: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/utils/tensor_new.cpp:1544.)
  video = torch.frombuffer(out, dtype=torch.uint8).view(1, h, w, 3)
[2025-04-25 00:54:46,251 - INFO] Precompute validation prompt embeddings
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 30.09s/it]
[2025-04-25 00:55:49,201 - INFO] VideoDiTModel(
  (x_embedder): Conv3d(32, 6144, kernel_size=(1, 2, 2), stride=(1, 2, 2), bias=False)
  (t_embedder): TimestepEmbedder(
    (mlp): Sequential(
      (0): Linear(in_features=256, out_features=1536, bias=True)
      (1): SiLU()
      (2): Linear(in_features=1536, out_features=1536, bias=True)
    )
  )
  (y_embedder): CaptionEmbedder(
    (y_proj_xattn): Sequential(
      (0): Linear(in_features=4096, out_features=6144, bias=True)
      (1): SiLU()
    )
    (y_proj_adaln): Sequential(
      (0): Linear(in_features=4096, out_features=1536, bias=True)
    )
  )
  (rope): LearnableRotaryEmbeddingCat()
  (videodit_blocks): TransformerBlock(
    (layers): ModuleList(
      (0): TransformerLayer(
        (ada_modulate_layer): AdaModulateLayer(
          (act): SiLU()
          (proj): Sequential(
            (0): Linear(in_features=1536, out_features=12288, bias=True)
          )
        )
        (self_attention): FullyParallelAttention(
          (linear_qkv): CustomLayerNormLinear(
            (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
            (q): Linear(in_features=6144, out_features=6144, bias=False)
            (qx): Linear(in_features=6144, out_features=6144, bias=False)
            (k): Linear(in_features=6144, out_features=1024, bias=False)
            (v): Linear(in_features=6144, out_features=1024, bias=False)
          )
          (linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
          (linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
          (q_layernorm): FusedLayerNorm()
          (q_layernorm_xattn): FusedLayerNorm()
          (k_layernorm): FusedLayerNorm()
          (k_layernorm_xattn): FusedLayerNorm()
        )
        (self_attn_post_norm): FusedLayerNorm()
        (mlp): CustomMLP(
          (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
          (linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
        )
        (mlp_post_norm): FusedLayerNorm()
      )
      (1-46): 46 x TransformerLayer(
        (ada_modulate_layer): AdaModulateLayer(
          (act): SiLU()
          (proj): Sequential(
            (0): Linear(in_features=1536, out_features=12288, bias=True)
          )
        )
        (self_attention): FullyParallelAttention(
          (linear_qkv): CustomLayerNormLinear(
            (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
            (q): PerTensorQuantizedFp8Linear()
            (qx): PerTensorQuantizedFp8Linear()
            (k): PerTensorQuantizedFp8Linear()
            (v): PerTensorQuantizedFp8Linear()
          )
          (linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
          (linear_proj): PerChannelQuantizedFp8Linear()
          (q_layernorm): FusedLayerNorm()
          (q_layernorm_xattn): FusedLayerNorm()
          (k_layernorm): FusedLayerNorm()
          (k_layernorm_xattn): FusedLayerNorm()
        )
        (self_attn_post_norm): FusedLayerNorm()
        (mlp): CustomMLP(
          (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): PerTensorQuantizedFp8Linear()
          (linear_fc2): PerChannelQuantizedFp8Linear()
        )
        (mlp_post_norm): FusedLayerNorm()
      )
      (47): TransformerLayer(
        (ada_modulate_layer): AdaModulateLayer(
          (act): SiLU()
          (proj): Sequential(
            (0): Linear(in_features=1536, out_features=12288, bias=True)
          )
        )
        (self_attention): FullyParallelAttention(
          (linear_qkv): CustomLayerNormLinear(
            (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
            (q): Linear(in_features=6144, out_features=6144, bias=False)
            (qx): Linear(in_features=6144, out_features=6144, bias=False)
            (k): Linear(in_features=6144, out_features=1024, bias=False)
            (v): Linear(in_features=6144, out_features=1024, bias=False)
          )
          (linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
          (linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
          (q_layernorm): FusedLayerNorm()
          (q_layernorm_xattn): FusedLayerNorm()
          (k_layernorm): FusedLayerNorm()
          (k_layernorm_xattn): FusedLayerNorm()
        )
        (self_attn_post_norm): FusedLayerNorm()
        (mlp): CustomMLP(
          (layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
          (linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
        )
        (mlp_post_norm): FusedLayerNorm()
      )
    )
    (final_layernorm): FusedLayerNorm()
  )
  (final_linear): FinalLinear(
    (linear): Linear(in_features=6144, out_features=128, bias=False)
  )
)
[2025-04-25 00:55:49,212 - INFO] (cp, pp) rank (0, 0): param count 23902014382, model size 24.65 GB
[2025-04-25 00:55:49,212 - INFO] Build DiTModel successfully
[2025-04-25 00:55:49,212 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]:     pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]:     self._run(prompt, prefix_video, output_path)
[rank0]:   File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]:     dit = get_dit(self.config)
[rank0]:   File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]:     model = load_checkpoint(model)
[rank0]:   File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]:     state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]:   File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]:     assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0425 00:55:50.917000 136965425092416 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3142) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-04-25_00:55:50
  host      : ca1683f2b34d
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3142)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(magi) root@ca1683f2b34d:/workspace/MAGI-1# 

If you're using the 24B_base model, please set cfg_number=3, fp8_quant=false, and distill=false.
The default config on GitHub seems a bit confusing; I'll try to update it when I get a chance.
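Concretely, with the config posted above that means setting cfg_number to 3 in runtime_config and flipping the two flags in engine_config, with every other field left exactly as in the file shown earlier:

    "engine_config": {
        "distributed_backend": "nccl",
        "distributed_timeout_minutes": 15,
        "pp_size": 1,
        "cp_size": 1,
        "cp_strategy": "cp_ulysses",
        "ulysses_overlap_degree": 1,
        "fp8_quant": false,
        "distill_nearly_clean_chunk_threshold": 0.3,
        "shortcut_mode": "8,16,16",
        "distill": false,
        "kv_offload": true,
        "enable_cuda_graph": false
    }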
