It seems that the weight files do not correspond to the model correctly. When the pipeline is loaded, a large number of parameters are reported as missing from the model checkpoint at
StreamingT2V-StreamingModelscope/t2v_enhanced/huggingface.co/ali-vilab/text-to-video-ms-1.7b and are newly initialized: ['down_blocks.0.attentions.1.transformer_blocks.0.attn2.conv.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.4.temporal_transformer.attention.to_out.0.weight', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.norm.bias', 'up_blocks.1.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.3.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.11.temporal_transformer.proj_out.bias', 'up_blocks.2.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.attention.to_q.weight', 'up_blocks.2.attentions.0.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.proj_out.bias', 'down_blocks.0.attentions.0.transformer_blocks.0.attn2.conv.bias', 'down_blocks.2.attentions.0.transformer_blocks.0.attn2.conv.bias', 'down_blocks.2.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.conv.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.attention.to_out.0.weight', 'down_blocks.0.attentions.0.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.10.temporal_transformer.attention.to_k.weight', 'up_blocks.2.attentions.0.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.11.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.10.temporal_transformer.attention.to_v.weight', 'up_blocks.3.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'down_blocks.2.attentions.1.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.1.temporal_transformer.attention.to_k.weight', 'up_blocks.3.attentions.1.transformer_blocks.0.attn2.conv.bias', 'down_blocks.0.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.norm.bias', 'up_blocks.3.attentions.1.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.proj_in.weight', 'up_blocks.1.attentions.1.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_mid_block.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.norm.bias', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.alpha', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.norm.weight', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.alpha', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.conv_ln.bias', 'up_blocks.1.attentions.2.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.attention.to_k.weight', 'cross_attention_merger_down_blocks.9.temporal_transformer.attention.to_out.0.bias', 
'cross_attention_merger_down_blocks.11.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.3.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.proj_out.weight', 'up_blocks.2.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.proj_in.weight', 'cross_attention_merger_mid_block.temporal_transformer.norm.weight', 'cross_attention_merger_mid_block.temporal_transformer.attention.to_q.weight', 'up_blocks.1.attentions.0.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.8.temporal_transformer.attention.to_k.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.norm.bias', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.10.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.attention.to_v.weight', 'up_blocks.3.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.0.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.8.temporal_transformer.proj_out.weight', 'mid_block.attentions.0.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.11.temporal_transformer.proj_in.bias', 'up_blocks.2.attentions.2.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.6.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.11.temporal_transformer.attention.to_k.weight', 'up_blocks.1.attentions.1.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.norm.weight', 'up_blocks.3.attentions.1.transformer_blocks.0.attn2.alpha', 'mid_block.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.8.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.attention.to_out.0.bias', 'down_blocks.2.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'down_blocks.0.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'up_blocks.2.attentions.2.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.11.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.attention.to_k.weight', 'up_blocks.1.attentions.1.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.proj_in.weight', 'up_blocks.2.attentions.1.transformer_blocks.0.attn2.alpha', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.conv.bias', 'down_blocks.2.attentions.0.transformer_blocks.0.attn2.conv.weight', 'down_blocks.0.attentions.0.transformer_blocks.0.attn2.conv.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.alpha', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.alpha', 'up_blocks.1.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.9.temporal_transformer.proj_out.weight', 
'cross_attention_merger_down_blocks.1.temporal_transformer.proj_out.bias', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.8.temporal_transformer.proj_out.bias', 'up_blocks.2.attentions.2.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.attention.to_q.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.6.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.6.temporal_transformer.attention.to_k.weight', 'up_blocks.2.attentions.1.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.4.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.3.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.10.temporal_transformer.proj_in.bias', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_mid_block.temporal_transformer.attention.to_v.weight', 'up_blocks.2.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.proj_in.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.attn2.conv.bias', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.conv_ln.weight', 'up_blocks.1.attentions.0.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.9.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.2.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.10.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.0.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.10.temporal_transformer.proj_out.bias', 'up_blocks.2.attentions.1.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_mid_block.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.10.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.10.temporal_transformer.proj_in.weight', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.proj_in.weight', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.5.temporal_transformer.proj_out.bias', 
'cross_attention_merger_mid_block.temporal_transformer.attention.to_k.weight', 'up_blocks.2.attentions.2.transformer_blocks.0.attn2.alpha', 'down_blocks.0.attentions.1.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_mid_block.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.11.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.10.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.attention.to_v.weight', 'down_blocks.0.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.0.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.2.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_mid_block.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.0.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.10.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.10.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.9.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.6.temporal_transformer.proj_out.weight', 'mid_block.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'mid_block.attentions.0.transformer_blocks.0.attn2.alpha', 'down_blocks.2.attentions.0.transformer_blocks.0.attn2.alpha', 'mid_block.attentions.0.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.11.temporal_transformer.attention.to_out.0.bias', 'down_blocks.1.attentions.0.transformer_blocks.0.attn2.conv.weight', 'cross_attention_merger_down_blocks.4.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.attention.to_q.weight', 'down_blocks.0.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.7.temporal_transformer.attention.to_out.0.bias', 'up_blocks.2.attentions.0.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.7.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.proj_out.bias', 'down_blocks.0.attentions.1.transformer_blocks.0.attn2.alpha', 'cross_attention_merger_down_blocks.11.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.9.temporal_transformer.attention.to_k.weight', 'down_blocks.2.attentions.1.transformer_blocks.0.attn2.conv.bias', 'down_blocks.2.attentions.1.transformer_blocks.0.attn2.conv.weight', 
'up_blocks.1.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.2.temporal_transformer.norm.weight', 'cross_attention_merger_down_blocks.6.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.5.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.11.temporal_transformer.norm.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.proj_in.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.attention.to_k.weight', 'cross_attention_merger_down_blocks.3.temporal_transformer.attention.to_k.weight', 'cross_attention_merger_down_blocks.0.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.attention.to_out.0.weight', 'up_blocks.2.attentions.0.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_mid_block.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.9.temporal_transformer.attention.to_out.0.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.proj_out.weight', 'cross_attention_merger_down_blocks.7.temporal_transformer.attention.to_out.0.weight', 'up_blocks.2.attentions.2.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.8.temporal_transformer.proj_in.bias', 'cross_attention_merger_mid_block.temporal_transformer.proj_out.weight', 'up_blocks.1.attentions.0.transformer_blocks.0.attn2.conv.bias', 'down_blocks.2.attentions.1.transformer_blocks.0.attn2.conv_ln.bias', 'cross_attention_merger_down_blocks.6.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.3.temporal_transformer.attention.to_v.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.proj_in.bias', 'cross_attention_merger_down_blocks.3.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.attention.to_out.0.bias', 'down_blocks.1.attentions.1.transformer_blocks.0.attn2.conv.bias', 'cross_attention_merger_down_blocks.11.temporal_transformer.attention.to_q.weight', 'up_blocks.1.attentions.1.transformer_blocks.0.attn2.conv_ln.weight', 'down_blocks.2.attentions.0.transformer_blocks.0.attn2.conv_ln.weight', 'cross_attention_merger_down_blocks.1.temporal_transformer.attention.to_q.weight', 'cross_attention_merger_down_blocks.5.temporal_transformer.attention.to_k.weight', 'cross_attention_merger_down_blocks.2.temporal_transformer.proj_out.bias', 'cross_attention_merger_down_blocks.4.temporal_transformer.proj_in.bias', 'cross_attention_merger_mid_block.temporal_transformer.attention.to_out.0.bias', 'cross_attention_merger_down_blocks.7.temporal_transformer.attention.to_k.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
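Concretely, every name in that list exists in the instantiated model but has no counterpart in the checkpoint, so it was randomly initialized. One quick way to see which side is out of sync is to diff the two key sets. The sketch below is only a diagnostic illustration under my own assumptions: the checkpoint path, the variable names, and the idea that the weights sit in a plain torch-loadable file (possibly nested under a 'state_dict' entry) are placeholders, not the repository's actual loading code.

```python
import torch
from torch import nn

def report_key_mismatch(model: nn.Module, ckpt_path: str) -> None:
    """Diff a module's parameter names against a checkpoint file (without
    loading anything) to see which keys would end up 'newly initialized'."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints nest the weights

    model_keys = set(model.state_dict().keys())
    file_keys = set(state_dict.keys())

    missing = sorted(model_keys - file_keys)     # reported as "newly initialized"
    unexpected = sorted(file_keys - model_keys)  # present in the file but unused

    print(f"{len(missing)} missing keys, e.g. {missing[:3]}")
    print(f"{len(unexpected)} unexpected keys, e.g. {unexpected[:3]}")

# Hypothetical usage -- both arguments are placeholders, not the repo's API:
# report_key_mismatch(pipe.unet, "t2v_enhanced/checkpoints/streaming_t2v.ckpt")
```

If the missing keys match the attn2.conv* / conv_ln* / alpha and cross_attention_merger_* names listed above, the file being loaded simply does not contain those additional modules, which would be consistent with a mismatched or incomplete weight download.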
Call extend channel loader with conv_in.
Call extend channel loader with conv_out.
PIPE LOADING DONE
CUSTOM XFORMERS ATTENTION USED.
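The "CUSTOM XFORMERS ATTENTION USED." line just reports that the repository installs its own xFormers-based attention processors instead of the stock implementation. Purely for orientation (this is not the repository's code, and the model id below is only the base checkpoint already mentioned in the log), the standard diffusers way to switch a pipeline to memory-efficient xFormers attention looks like this:

```python
# Requires the `xformers` package to be installed.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe.enable_xformers_memory_efficient_attention()
```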
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Call extend channel loader with base_model.conv_in.
Call extend channel loader with base_model.conv_out.
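The "Call extend channel loader with conv_in / conv_out" (and base_model.conv_in / base_model.conv_out) messages come from the step that widens the pretrained UNet's first and last convolutions so they can accept extra channels (for example, concatenated conditioning latents). As a generic illustration only — the helper below and its zero-initialization policy are my assumptions, not the repository's actual extend-channel loader — such widening typically copies the pretrained kernel into the original channel slots and zero-initializes the added ones, so the extended layer initially behaves exactly like the original:

```python
import torch
import torch.nn as nn

def extend_conv_in_channels(conv: nn.Conv2d, extra_in_channels: int) -> nn.Conv2d:
    """Generic sketch of 'extending' a pretrained conv_in: build a new conv
    with more input channels, copy the pretrained weights, and zero-init
    the added channels. Not the repository's implementation."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight  # keep pretrained kernel
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: a conv_in for a 4-channel latent, extended by 4 conditioning channels.
conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
extended = extend_conv_in_channels(conv_in, extra_in_channels=4)
print(extended.weight.shape)  # torch.Size([320, 8, 3, 3])
```

Zero-initializing the extra input channels is a common choice because it leaves the pretrained behaviour untouched until the new channels are actually trained.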