Another Error
I tried upgrading torch, torchvision, and torchaudio to see if it made a difference, and now I am getting a new error. I also downloaded the distilled models in case I could not run the 24B model.
(magi) root@46e1abf287b8:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]: pipeline = MagiPipeline(args.config_file)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in init
[rank0]: dist_init(self.config)
[rank0]: File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]: assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
[rank0]:[W423 02:54:17.933492678 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0423 02:54:19.241000 5094 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5163) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 8, in
sys.exit(main())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-23_02:54:19
host : 46e1abf287b8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5163)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
It looks like the config needs some modifications. Could you let me know how many GPUs you're using and what type they are?
Also, make sure that pp_size * cp_size equals the total number of GPUs.
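For reference, this is essentially the rule the failing assert in dist_utils.py enforces; the sketch below just mirrors it (it is not the repo's code) so you can see what has to line up with the number of processes torchrun starts.

# Sketch of the constraint behind the AssertionError in dist_init
# (mirrors the assert shown in the traceback above; not copied from the repo).
import torch.distributed as dist

def check_parallel_layout(cp_size: int, pp_size: int) -> None:
    world_size = dist.get_world_size()  # on a single node this equals torchrun's --nproc_per_node
    assert cp_size * pp_size == world_size, (
        f"cp_size ({cp_size}) * pp_size ({pp_size}) != world size ({world_size})"
    )

So with 2 GPUs, launch with --nproc_per_node=2 and make the product equal 2 (for example cp_size=2, pp_size=1); with 1 GPU, both must be 1.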
I started all over from scratch. I am getting further but still having problems.
[2025-04-24 01:04:51,105 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0424 01:04:52.556000 132482488543040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3378) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This is from the 24B_config.json file
55 "clean_chunk_kvrange": 1,
56 "clean_t": 0.9999,
57 "seed": 83746,
58 "num_frames": 121,
59 "video_size_h": 540,
60 "video_size_w": 960,
61 "num_steps": 8,
62 "window_size": 4,
63 "fps": 24,
64 "chunk_width": 6,
65 "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
66 "t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
67 "t5_device": "cuda",
68 "vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
69 "scale_factor": 0.18215,
70 "temporal_downsample_factor": 4
I have no idea what is going on, but the files in the directory configured by the "load" parameter are the same as those on Hugging Face. I am not sure about this error: "assert os.path.exists(inference_weight_dir)". I tried changing the path back one directory level, but that did not make a difference. I tried this with both a single L40 and with 2xL40s; I am not sure whether those specs are too low for this. I will try one of the other configurations with the other models, but I certainly cannot get this to work.
I used cp_size=2 when I was using 2xL40s and cp_size=1 when using 1xL40.
Change the load path to "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base".
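If it still fails, here is a quick hedged check (a sketch, not code from the repo). It assumes checkpointing.py derives inference_weight_dir from the "load" path, roughly by appending an inference_weight subdirectory, which is what the failing assert suggests; the exact name it builds may differ, so compare against what actually exists on disk.

# Hedged sanity check for the failing assert in load_state_dict.
# Assumption: inference_weight_dir is derived from the "load" path by appending
# an "inference_weight" subdirectory; the real logic lives in
# inference/infra/checkpoint/checkpointing.py and may add variant suffixes.
import os

load_dir = "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base"
candidate = os.path.join(load_dir, "inference_weight")
print(candidate, "->", "exists" if os.path.isdir(candidate) else "MISSING")
print("contents of load dir:", sorted(os.listdir(load_dir)))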
Same Problem:
(magi) root@ca1683f2b34d:/workspace# ls -l /workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight
total 46757232
-rw-rw-rw- 1 root root 4988160184 Apr 23 21:18 model-00001-of-00006.safetensors
-rw-rw-rw- 1 root root 7247764000 Apr 23 21:18 model-00002-of-00006.safetensors
-rw-rw-rw- 1 root root 19327358992 Apr 23 21:19 model-00003-of-00006.safetensors
-rw-rw-rw- 1 root root 9663682528 Apr 23 21:18 model-00004-of-00006.safetensors
-rw-rw-rw- 1 root root 3623890200 Apr 23 21:18 model-00005-of-00006.safetensors
-rw-rw-rw- 1 root root 3028420248 Apr 23 21:18 model-00006-of-00006.safetensors
-rw-rw-rw- 1 root root 126708 Apr 23 21:17 model.safetensors.index.json
(magi) root@ca1683f2b34d:/workspace#
(magi) root@ca1683f2b34d:/workspace/MAGI-1# cat example/24B/24B_config.json
{
"model_config": {
"model_name": "videodit_ardf",
"num_layers": 48,
"hidden_size": 6144,
"ffn_hidden_size": 16384,
"num_attention_heads": 48,
"num_query_groups": 8,
"kv_channels": 128,
"layernorm_epsilon": 1e-06,
"apply_layernorm_1p": true,
"x_rescale_factor": 0.1,
"half_channel_vae": true,
"params_dtype": "torch.bfloat16",
"patch_size": 2,
"t_patch_size": 1,
"in_channels": 32,
"out_channels": 32,
"cond_hidden_ratio": 0.25,
"caption_channels": 4096,
"caption_max_length": 800,
"xattn_cond_hidden_ratio": 1.0,
"cond_gating_ratio": 1.0,
"gated_linear_unit": true
},
"runtime_config": {
"cfg_number": 1,
"cfg_t_range": [
0.0,
0.0217,
0.1,
0.3,
0.999
],
"prev_chunk_scales": [
1.5,
1.5,
1.5,
1.0,
1.0
],
"text_scales": [
7.5,
7.5,
7.5,
0.0,
0.0
],
"noise2clean_kvrange": [
5,
4,
3,
2
],
"clean_chunk_kvrange": 1,
"clean_t": 0.9999,
"seed": 83746,
"num_frames": 121,
"video_size_h": 540,
"video_size_w": 960,
"num_steps": 8,
"window_size": 4,
"fps": 24,
"chunk_width": 6,
"load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base",
"t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
"t5_device": "cuda",
"vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
"scale_factor": 0.18215,
"temporal_downsample_factor": 4
},
"engine_config": {
"distributed_backend": "nccl",
"distributed_timeout_minutes": 15,
"pp_size": 1,
"cp_size": 1,
"cp_strategy": "cp_ulysses",
"ulysses_overlap_degree": 1,
"fp8_quant": true,
"distill_nearly_clean_chunk_threshold": 0.3,
"shortcut_mode": "8,16,16",
"distill": true,
"kv_offload": true,
"enable_cuda_graph": false
}
}
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[W425 00:54:29.094511239 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[2025-04-25 00:54:29,391 - INFO] Initialize torch distribution and model parallel successfully
[2025-04-25 00:54:29,391 - INFO] MagiConfig(model_config=ModelConfig(model_name='videodit_ardf', num_layers=48, hidden_size=6144, ffn_hidden_size=16384, num_attention_heads=48, num_query_groups=8, kv_channels=128, layernorm_epsilon=1e-06, apply_layernorm_1p=True, x_rescale_factor=0.1, half_channel_vae=True, params_dtype=torch.bfloat16, patch_size=2, t_patch_size=1, in_channels=32, out_channels=32, cond_hidden_ratio=0.25, caption_channels=4096, caption_max_length=800, xattn_cond_hidden_ratio=1.0, cond_gating_ratio=1.0, gated_linear_unit=True), runtime_config=RuntimeConfig(cfg_number=1, cfg_t_range=[0.0, 0.0217, 0.1, 0.3, 0.999], prev_chunk_scales=[1.5, 1.5, 1.5, 1.0, 1.0], text_scales=[7.5, 7.5, 7.5, 0.0, 0.0], noise2clean_kvrange=[5, 4, 3, 2], clean_chunk_kvrange=1, clean_t=0.9999, seed=83746, num_frames=121, video_size_h=540, video_size_w=960, num_steps=8, window_size=4, fps=24, chunk_width=6, t5_pretrained='/workspace/MAGI-1-models/models/T5/ckpt/t5', t5_device='cuda', vae_pretrained='/workspace/MAGI-1-models/models/VAE', scale_factor=0.18215, temporal_downsample_factor=4, load='/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base'), engine_config=EngineConfig(distributed_backend='nccl', distributed_timeout_minutes=15, pp_size=1, cp_size=1, cp_strategy='cp_ulysses', ulysses_overlap_degree=1, fp8_quant=True, distill_nearly_clean_chunk_threshold=0.3, shortcut_mode='8,16,16', distill=True, kv_offload=True, enable_cuda_graph=False))
/workspace/MAGI-1/inference/pipeline/video_process.py:229: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/utils/tensor_new.cpp:1544.)
video = torch.frombuffer(out, dtype=torch.uint8).view(1, h, w, 3)
[2025-04-25 00:54:46,251 - INFO] Precompute validation prompt embeddings
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████| 2/2 [01:00<00:00, 30.09s/it]
[2025-04-25 00:55:49,201 - INFO] VideoDiTModel(
(x_embedder): Conv3d(32, 6144, kernel_size=(1, 2, 2), stride=(1, 2, 2), bias=False)
(t_embedder): TimestepEmbedder(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=1536, bias=True)
(1): SiLU()
(2): Linear(in_features=1536, out_features=1536, bias=True)
)
)
(y_embedder): CaptionEmbedder(
(y_proj_xattn): Sequential(
(0): Linear(in_features=4096, out_features=6144, bias=True)
(1): SiLU()
)
(y_proj_adaln): Sequential(
(0): Linear(in_features=4096, out_features=1536, bias=True)
)
)
(rope): LearnableRotaryEmbeddingCat()
(videodit_blocks): TransformerBlock(
(layers): ModuleList(
(0): TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): Linear(in_features=6144, out_features=6144, bias=False)
(qx): Linear(in_features=6144, out_features=6144, bias=False)
(k): Linear(in_features=6144, out_features=1024, bias=False)
(v): Linear(in_features=6144, out_features=1024, bias=False)
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
(linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
)
(mlp_post_norm): FusedLayerNorm()
)
(1-46): 46 x TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): PerTensorQuantizedFp8Linear()
(qx): PerTensorQuantizedFp8Linear()
(k): PerTensorQuantizedFp8Linear()
(v): PerTensorQuantizedFp8Linear()
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): PerChannelQuantizedFp8Linear()
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): PerTensorQuantizedFp8Linear()
(linear_fc2): PerChannelQuantizedFp8Linear()
)
(mlp_post_norm): FusedLayerNorm()
)
(47): TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): Linear(in_features=6144, out_features=6144, bias=False)
(qx): Linear(in_features=6144, out_features=6144, bias=False)
(k): Linear(in_features=6144, out_features=1024, bias=False)
(v): Linear(in_features=6144, out_features=1024, bias=False)
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
(linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
)
(mlp_post_norm): FusedLayerNorm()
)
)
(final_layernorm): FusedLayerNorm()
)
(final_linear): FinalLinear(
(linear): Linear(in_features=6144, out_features=128, bias=False)
)
)
[2025-04-25 00:55:49,212 - INFO] (cp, pp) rank (0, 0): param count 23902014382, model size 24.65 GB
[2025-04-25 00:55:49,212 - INFO] Build DiTModel successfully
[2025-04-25 00:55:49,212 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0425 00:55:50.917000 136965425092416 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3142) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-25_00:55:50
host : ca1683f2b34d
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3142)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(magi) root@ca1683f2b34d:/workspace/MAGI-1#
If you're using the 24B_base model, please set cfg_number=3, fp8_quant=false, and distill=false.
The default config on GitHub seems a bit confusing; I'll try to update it when I get a chance.
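A minimal sketch of applying those changes by patching the example config in place; the key locations (cfg_number under runtime_config, fp8_quant and distill under engine_config) follow the config dump earlier in this thread, and the path assumes you are running from the MAGI-1 repo root.

# Sketch: apply the suggested 24B_base settings to example/24B/24B_config.json.
# Key placement follows the config shown above; adjust the path if yours differs.
import json

cfg_path = "example/24B/24B_config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["runtime_config"]["cfg_number"] = 3
cfg["engine_config"]["fp8_quant"] = False
cfg["engine_config"]["distill"] = False

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)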