W0121 13:50:03.885000 139922494187328 torch/distributed/run.py:779]
W0121 13:50:03.885000 139922494187328 torch/distributed/run.py:779] *****************************************
W0121 13:50:03.885000 139922494187328 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 13:50:03.885000 139922494187328 torch/distributed/run.py:779] *****************************************
[2025-01-21 13:50:05,946] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 13:50:05,947] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /home/.triton/autotune: No such file or directory
df: /home/.triton/autotune: No such file or directory
FlashAttention2 is not installed.
FlashAttention2 is not installed.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Traceback (most recent call last):
  File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 35, in <module>
Traceback (most recent call last):
  File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 35, in <module>
    from internvl.patch import (concat_pad_data_collator,
  File "/home/InternVL/internvl_chat/internvl/patch/__init__.py", line 7, in <module>
    from internvl.patch import (concat_pad_data_collator,
  File "/home/InternVL/internvl_chat/internvl/patch/__init__.py", line 7, in <module>
    from .internlm2_packed_training_patch import replace_internlm2_attention_class
  File "/home/InternVL/internvl_chat/internvl/patch/internlm2_packed_training_patch.py", line 8, in <module>
    from .internlm2_packed_training_patch import replace_internlm2_attention_class
  File "/home/InternVL/internvl_chat/internvl/patch/internlm2_packed_training_patch.py", line 8, in <module>
    from flash_attn.flash_attn_interface import flash_attn_varlen_func
ModuleNotFoundError: No module named 'flash_attn'
    from flash_attn.flash_attn_interface import flash_attn_varlen_func
ModuleNotFoundError: No module named 'flash_attn'
E0121 13:50:09.412000 139922494187328 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3488) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-21_13:50:09
  host      : eec3b8dfaf80
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3489)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-21_13:50:09
  host      : eec3b8dfaf80
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3488)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
W0121 13:54:57.881000 140659944720192 torch/distributed/run.py:779]
W0121 13:54:57.881000 140659944720192 torch/distributed/run.py:779] *****************************************
W0121 13:54:57.881000 140659944720192 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 13:54:57.881000 140659944720192 torch/distributed/run.py:779] *****************************************
[2025-01-21 13:54:59,899] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 13:54:59,899] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 35, in <module>
    from internvl.patch import (concat_pad_data_collator,
  File "/home/InternVL/internvl_chat/internvl/patch/__init__.py", line 19, in <module>
    from .train_dataloader_patch import replace_train_dataloader
  File "/home/InternVL/internvl_chat/internvl/patch/train_dataloader_patch.py", line 7, in <module>
    import datasets
ModuleNotFoundError: No module named 'datasets'
Traceback (most recent call last):
  File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 35, in <module>
    from internvl.patch import (concat_pad_data_collator,
  File "/home/InternVL/internvl_chat/internvl/patch/__init__.py", line 19, in <module>
    from .train_dataloader_patch import replace_train_dataloader
  File "/home/InternVL/internvl_chat/internvl/patch/train_dataloader_patch.py", line 7, in <module>
    import datasets
ModuleNotFoundError: No module named 'datasets'
E0121 13:55:03.308000 140659944720192 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 4794) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-21_13:55:03
  host      : eec3b8dfaf80
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4795)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-21_13:55:03
  host      : eec3b8dfaf80
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4794)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
W0121 13:56:05.733000 140713030416192 torch/distributed/run.py:779]
W0121 13:56:05.733000 140713030416192 torch/distributed/run.py:779] *****************************************
W0121 13:56:05.733000 140713030416192 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 13:56:05.733000 140713030416192 torch/distributed/run.py:779] ***************************************** [2025-01-21 13:56:07,744] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-01-21 13:56:07,769] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. Using PIL to load images. [2025-01-21 13:56:11,543] [INFO] [comm.py:652:init_distributed] cdb=None [2025-01-21 13:56:11,543] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. Using PIL to load images. 01/21/2025 13:56:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 01/21/2025 13:56:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=8, gradient_checkpointing=False, 
gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/runs/Jan21_13-56-11_eec3b8dfaf80, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 01/21/2025 
13:56:11 - INFO - __main__ - Loading Tokenizer: OpenGVLab/InternVL2_5-1B [2025-01-21 13:56:11,653] [INFO] [comm.py:652:init_distributed] cdb=None 01/21/2025 13:56:11 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,318 >> loading file vocab.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/vocab.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,319 >> loading file merges.txt from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/merges.txt [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,319 >> loading file added_tokens.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/added_tokens.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,319 >> loading file special_tokens_map.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/special_tokens_map.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,319 >> loading file tokenizer_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/tokenizer_config.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 13:56:16,319 >> loading file tokenizer.json from cache at None [WARNING|logging.py:314] 2025-01-21 13:56:16,750 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-01-21 13:56:16,754 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
--> after Client(conf_path) --> after Client(conf_path) 01/21/2025 13:56:16 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:729] 2025-01-21 13:56:17,300 >> loading configuration file config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/config.json [INFO|configuration_utils.py:792] 2025-01-21 13:56:17,307 >> Model config InternVLChatConfig { "_commit_hash": "4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "OpenGVLab/InternVL2_5-1B--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 896, "llm_config": { "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, 
"num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": 32768, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, 
"is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 01/21/2025 13:56:17 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3476] 2025-01-21 13:57:02,796 >> loading weights file model.safetensors from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/model.safetensors [INFO|modeling_utils.py:1426] 2025-01-21 13:57:02,859 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:826] 2025-01-21 13:57:02,863 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2025-01-21 13:57:02,957 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 } [INFO|modeling_utils.py:4350] 2025-01-21 13:57:05,407 >> All model checkpoint weights were used when initializing InternVLChatModel. 
[INFO|modeling_utils.py:4358] 2025-01-21 13:57:05,407 >> All the weights of InternVLChatModel were initialized from the model checkpoint at OpenGVLab/InternVL2_5-1B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:781] 2025-01-21 13:57:05,916 >> loading configuration file generation_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/generation_config.json
[INFO|configuration_utils.py:826] 2025-01-21 13:57:05,916 >> Generate config GenerationConfig {
  "eos_token_id": [
    151644,
    151645,
    151643
  ]
}
01/21/2025 13:57:05 - INFO - __main__ - Finished
01/21/2025 13:57:05 - INFO - __main__ - model.config.force_image_size: 448
01/21/2025 13:57:05 - INFO - __main__ - data_args.force_image_size: 448
01/21/2025 13:57:05 - INFO - __main__ - model.config.vision_config.image_size: 448
01/21/2025 13:57:05 - INFO - __main__ - [Dataset] num_image_token: 256
01/21/2025 13:57:05 - INFO - __main__ - [Dataset] dynamic_image_size: True
01/21/2025 13:57:05 - INFO - __main__ - [Dataset] use_thumbnail: True
01/21/2025 13:57:05 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
01/21/2025 13:57:05 - INFO - __main__ - Formatting inputs...Skip in lazy mode
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
[rank0]:     main()
[rank0]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 981, in main
[rank0]:     train_dataset = build_datasets(
[rank0]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 727, in build_datasets
[rank0]:     dataset = LazySupervisedDataset(
[rank0]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 337, in __init__
[rank0]:     with open(meta['annotation'], 'r') as f:
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/home/output.jsonl'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in <module>
[rank1]:     main()
[rank1]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 981, in main
[rank1]:     train_dataset = build_datasets(
[rank1]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 727, in build_datasets
[rank1]:     dataset = LazySupervisedDataset(
[rank1]:   File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 337, in __init__
[rank1]:     with open(meta['annotation'], 'r') as f:
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: '/home/output.jsonl'
W0121 13:57:07.172000 140713030416192 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 5207 closing signal SIGTERM
E0121 13:57:07.387000 140713030416192 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 5206) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-21_13:57:07
  host      : eec3b8dfaf80
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5206)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
W0121 14:00:22.595000 139693165225792 torch/distributed/run.py:779]
W0121 14:00:22.595000 139693165225792 torch/distributed/run.py:779] *****************************************
W0121 14:00:22.595000 139693165225792 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 14:00:22.595000 139693165225792 torch/distributed/run.py:779] *****************************************
[2025-01-21 14:00:24,631] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 14:00:24,632] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. Using PIL to load images.
[2025-01-21 14:00:28,388] [INFO] [comm.py:652:init_distributed] cdb=None [2025-01-21 14:00:28,388] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 01/21/2025 14:00:28 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 01/21/2025 14:00:28 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=8, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, 
logging_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/runs/Jan21_14-00-28_eec3b8dfaf80, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 01/21/2025 14:00:28 - INFO - __main__ - Loading Tokenizer: OpenGVLab/InternVL2_5-1B [2025-01-21 14:00:28,484] [INFO] [comm.py:652:init_distributed] cdb=None 01/21/2025 14:00:28 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:00:29,160 >> loading file vocab.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/vocab.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 
14:00:29,161 >> loading file merges.txt from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/merges.txt [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:00:29,161 >> loading file added_tokens.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/added_tokens.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:00:29,161 >> loading file special_tokens_map.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/special_tokens_map.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:00:29,161 >> loading file tokenizer_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/tokenizer_config.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:00:29,161 >> loading file tokenizer.json from cache at None [WARNING|logging.py:314] 2025-01-21 14:00:29,185 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. --> after Client(conf_path) [WARNING|logging.py:314] 2025-01-21 14:00:29,544 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. --> after Client(conf_path) 01/21/2025 14:00:29 - INFO - __main__ - Loading InternVLChatModel... 
[INFO|configuration_utils.py:729] 2025-01-21 14:00:29,791 >> loading configuration file config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/config.json [INFO|configuration_utils.py:792] 2025-01-21 14:00:29,793 >> Model config InternVLChatConfig { "_commit_hash": "4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "OpenGVLab/InternVL2_5-1B--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 896, "llm_config": { "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, 
"output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": 32768, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, 
"length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 01/21/2025 14:00:29 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3476] 2025-01-21 14:00:29,796 >> loading weights file model.safetensors from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/model.safetensors [INFO|modeling_utils.py:1426] 2025-01-21 14:00:29,822 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:826] 2025-01-21 14:00:29,824 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2025-01-21 14:00:29,883 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 } [INFO|modeling_utils.py:4350] 2025-01-21 14:00:32,112 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4358] 2025-01-21 14:00:32,113 >> All the weights of InternVLChatModel were initialized from the model checkpoint at OpenGVLab/InternVL2_5-1B. 
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:781] 2025-01-21 14:00:32,344 >> loading configuration file generation_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/generation_config.json [INFO|configuration_utils.py:826] 2025-01-21 14:00:32,345 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } 01/21/2025 14:00:32 - INFO - __main__ - Finished 01/21/2025 14:00:32 - INFO - __main__ - model.config.force_image_size: 448 01/21/2025 14:00:32 - INFO - __main__ - data_args.force_image_size: 448 01/21/2025 14:00:32 - INFO - __main__ - model.config.vision_config.image_size: 448 01/21/2025 14:00:32 - INFO - __main__ - [Dataset] num_image_token: 256 01/21/2025 14:00:32 - INFO - __main__ - [Dataset] dynamic_image_size: True 01/21/2025 14:00:32 - INFO - __main__ - [Dataset] use_thumbnail: True 01/21/2025 14:00:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 01/21/2025 14:00:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode 01/21/2025 14:00:34 - INFO - __main__ - Add dataset: vidor with length: 7000 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3779579255334184 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3779579255334184 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight 01/21/2025 
14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight 01/21/2025 
14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.o_proj.lora_B.default.weight 01/21/2025 
14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - 
INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ 
- language_model.base_model.model.model.layers.4.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - 
language_model.base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - 
language_model.base_model.model.model.layers.7.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - 
language_model.base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - 
language_model.base_model.model.model.layers.9.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ 
- language_model.base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - 
__main__ - language_model.base_model.model.model.layers.11.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - 
__main__ - language_model.base_model.model.model.layers.12.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - 
INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight 01/21/2025 
14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.k_proj.lora_B.default.weight 
01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - 
language_model.base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - 
__main__ - language_model.base_model.model.model.layers.18.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - 
INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 
- INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.down_proj.lora_A.default.weight 01/21/2025 
14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.down_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.q_proj.lora_A.default.weight 
01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.up_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.up_proj.lora_B.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_A.default.weight 01/21/2025 14:00:34 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_B.default.weight [INFO|trainer.py:571] 2025-01-21 14:00:34,861 >> Using auto half precision backend [2025-01-21 14:00:35,077] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.2, git-hash=unknown, git-branch=unknown [2025-01-21 14:00:35,077] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2 [2025-01-21 14:00:36,221] [INFO] 
[logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Creating extension directory /home/.cache/torch_extensions/py310_cu121/fused_adam... Detected CUDA files, patching ldflags Emitting ninja build file /home/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/py3.10/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ 
-U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o [2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/py3.10/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o [3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so Loading extension module fused_adam... 
Time to load fused_adam op: 28.913005113601685 seconds [2025-01-21 14:01:05,141] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-01-21 14:01:05,142] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-01-21 14:01:05,197] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-01-21 14:01:05,197] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-01-21 14:01:05,198] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-01-21 14:01:05,198] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-01-21 14:01:05,198] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-01-21 14:01:05,198] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-01-21 14:01:05,198] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False Loading extension module fused_adam... 
Time to load fused_adam op: 28.971607208251953 seconds [2025-01-21 14:01:05,439] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-01-21 14:01:05,440] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.99 GB CA 2.1 GB Max_CA 2 GB [2025-01-21 14:01:05,440] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.21 GB, percent = 12.0% [2025-01-21 14:01:05,619] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-01-21 14:01:05,620] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.99 GB CA 2.11 GB Max_CA 2 GB [2025-01-21 14:01:05,620] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.21 GB, percent = 12.0% [2025-01-21 14:01:05,620] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized [2025-01-21 14:01:05,799] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-01-21 14:01:05,800] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.98 GB CA 2.11 GB Max_CA 2 GB [2025-01-21 14:01:05,800] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.21 GB, percent = 12.0% [2025-01-21 14:01:05,804] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-01-21 14:01:05,804] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-01-21 14:01:05,804] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-01-21 14:01:05,804] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-01-21 14:01:05,812] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-01-21 14:01:05,813] [INFO] 
[config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] amp_params ................... False [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] comms_config ................. [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-01-21 14:01:05,813] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 8 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-01-21 14:01:05,814] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] pld_params ................... False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] train_batch_size ............. 16 [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] world_size ................... 2 [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-01-21 14:01:05,815] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-01-21 14:01:05,816] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-01-21 14:01:05,816] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 8, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2025-01-21 14:01:05,816 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-01-21 14:01:05,816 >> Num examples = 7,000 [INFO|trainer.py:1723] 2025-01-21 14:01:05,816 >> Num Epochs = 1 [INFO|trainer.py:1724] 2025-01-21 14:01:05,816 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-01-21 14:01:05,816 >> Total train batch size (w. 
parallel, distributed & accumulation) = 16 [INFO|trainer.py:1728] 2025-01-21 14:01:05,816 >> Gradient Accumulation steps = 8 [INFO|trainer.py:1729] 2025-01-21 14:01:05,816 >> Total optimization steps = 437 [INFO|trainer.py:1730] 2025-01-21 14:01:05,822 >> Number of trainable parameters = 8,798,208 0%| | 0/437 [00:00 [rank1]: main() [rank1]: File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main [rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank1]: return inner_training_loop( [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop [rank1]: tr_loss_step = self.training_step(model, inputs) [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2781, in training_step [rank1]: self.accelerator.backward(loss) [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 2188, in backward [rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs) [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward [rank1]: self.engine.step() [rank1]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step [rank1]: assert not self.inside_no_sync_ctxt, \ [rank1]: AssertionError: It is illegal to call Engine.step() inside no_sync context manager [rank0]: Traceback (most recent call last): [rank0]: File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in [rank0]: main() [rank0]: File "/home/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main [rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint) [rank0]: File 
"/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train [rank0]: return inner_training_loop( [rank0]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop [rank0]: tr_loss_step = self.training_step(model, inputs) [rank0]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2781, in training_step [rank0]: self.accelerator.backward(loss) [rank0]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 2188, in backward [rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs) [rank0]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 175, in backward [rank0]: self.engine.step() [rank0]: File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step [rank0]: assert not self.inside_no_sync_ctxt, \ [rank0]: AssertionError: It is illegal to call Engine.step() inside no_sync context manager 0%| | 0/437 [00:30 sys.exit(main()) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper return f(*args, **kwargs) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main run(args) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run elastic_launch( File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================ internvl/train/internvl_chat_finetune.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-01-21_14:01:37 host : eec3b8dfaf80 rank : 0 (local_rank: 0) exitcode : 1 (pid: 5933) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ W0121 14:07:32.468000 140160394336064 torch/distributed/run.py:779] W0121 14:07:32.468000 140160394336064 torch/distributed/run.py:779] ***************************************** W0121 14:07:32.468000 140160394336064 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0121 14:07:32.468000 140160394336064 torch/distributed/run.py:779] ***************************************** [2025-01-21 14:07:34,496] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-01-21 14:07:34,499] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect) petrel_client is not installed. If you read data locally instead of from ceph, ignore it.petrel_client is not installed. If you read data locally instead of from ceph, ignore it. petrel_client is not installed. Using PIL to load images. petrel_client is not installed. Using PIL to load images. 
[2025-01-21 14:07:38,273] [INFO] [comm.py:652:init_distributed] cdb=None [2025-01-21 14:07:38,273] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 01/21/2025 14:07:38 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 01/21/2025 14:07:38 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=8, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, 
logging_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/runs/Jan21_14-07-38_eec3b8dfaf80, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=200, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 01/21/2025 14:07:38 - INFO - __main__ - Loading Tokenizer: OpenGVLab/InternVL2_5-1B [2025-01-21 14:07:38,373] [INFO] [comm.py:652:init_distributed] cdb=None 01/21/2025 14:07:38 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:07:39,121 >> loading file vocab.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/vocab.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 
14:07:39,122 >> loading file merges.txt from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/merges.txt [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:07:39,122 >> loading file added_tokens.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/added_tokens.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:07:39,122 >> loading file special_tokens_map.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/special_tokens_map.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:07:39,122 >> loading file tokenizer_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/tokenizer_config.json [INFO|tokenization_utils_base.py:2027] 2025-01-21 14:07:39,122 >> loading file tokenizer.json from cache at None [WARNING|logging.py:314] 2025-01-21 14:07:39,532 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. --> after Client(conf_path) 01/21/2025 14:07:39 - INFO - __main__ - Loading InternVLChatModel... [WARNING|logging.py:314] 2025-01-21 14:07:39,668 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
--> after Client(conf_path) [INFO|configuration_utils.py:729] 2025-01-21 14:07:39,769 >> loading configuration file config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/config.json [INFO|configuration_utils.py:792] 2025-01-21 14:07:39,772 >> Model config InternVLChatConfig { "_commit_hash": "4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "OpenGVLab/InternVL2_5-1B--configuration_internvl_chat.InternVLChatConfig", "AutoModel": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 896, "llm_config": { "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 21, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, 
"output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": 32768, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, 
"layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 01/21/2025 14:07:39 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3476] 2025-01-21 14:07:39,774 >> loading weights file model.safetensors from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/model.safetensors [INFO|modeling_utils.py:1426] 2025-01-21 14:07:39,801 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:826] 2025-01-21 14:07:39,803 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2025-01-21 14:07:39,882 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 } [INFO|modeling_utils.py:4350] 2025-01-21 14:07:42,096 >> All model checkpoint weights were used when initializing InternVLChatModel. 
[INFO|modeling_utils.py:4358] 2025-01-21 14:07:42,096 >> All the weights of InternVLChatModel were initialized from the model checkpoint at OpenGVLab/InternVL2_5-1B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|configuration_utils.py:781] 2025-01-21 14:07:42,333 >> loading configuration file generation_config.json from cache at /home/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/4dcf9845f6a6d8d6c4b188aae707a265cfbe4be5/generation_config.json [INFO|configuration_utils.py:826] 2025-01-21 14:07:42,334 >> Generate config GenerationConfig { "eos_token_id": [ 151644, 151645, 151643 ] } 01/21/2025 14:07:42 - INFO - __main__ - Finished 01/21/2025 14:07:42 - INFO - __main__ - model.config.force_image_size: 448 01/21/2025 14:07:42 - INFO - __main__ - data_args.force_image_size: 448 01/21/2025 14:07:42 - INFO - __main__ - model.config.vision_config.image_size: 448 01/21/2025 14:07:42 - INFO - __main__ - [Dataset] num_image_token: 256 01/21/2025 14:07:42 - INFO - __main__ - [Dataset] dynamic_image_size: True 01/21/2025 14:07:42 - INFO - __main__ - [Dataset] use_thumbnail: True 01/21/2025 14:07:42 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 01/21/2025 14:07:42 - INFO - __main__ - Formatting inputs...Skip in lazy mode 01/21/2025 14:07:44 - INFO - __main__ - Add dataset: vidor with length: 7000 trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3779579255334184 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.1.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.2.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.2.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.3.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.3.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.4.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.4.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.5.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.5.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.6.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.7.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.7.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.8.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.8.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.9.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.9.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ 
- language_model.base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - 
__main__ - language_model.base_model.model.model.layers.11.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.11.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO 
- __main__ - language_model.base_model.model.model.layers.12.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.12.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - 
INFO - __main__ - language_model.base_model.model.model.layers.13.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.14.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.q_proj.lora_A.default.weight 01/21/2025 
14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.15.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.k_proj.lora_A.default.weight 
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.16.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - 
language_model.base_model.model.model.layers.17.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.17.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - 
__main__ - language_model.base_model.model.model.layers.18.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.18.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.self_attn.o_proj.lora_B.default.weight 01/21/2025 
14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.19.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.gate_proj.lora_B.default.weight 
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.up_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.gate_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.gate_proj.lora_B.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.up_proj.lora_A.default.weight 01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.up_proj.lora_B.default.weight 
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.down_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.down_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.q_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.q_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.k_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.k_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.v_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.v_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.o_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.self_attn.o_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.gate_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.gate_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.up_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.up_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.down_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.22.mlp.down_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.q_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.q_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.k_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.k_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.v_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.o_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.self_attn.o_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.gate_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.gate_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.up_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.up_proj.lora_B.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_A.default.weight
01/21/2025 14:07:44 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_B.default.weight
[INFO|trainer.py:571] 2025-01-21 14:07:44,831 >> Using auto half precision backend
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3779579255334184
[2025-01-21 14:07:45,050] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2025-01-21 14:07:45,051] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-01-21 14:07:46,194] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/py3.10/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/TH -isystem /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/envs/py3.10/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 28.399606943130493 seconds
[2025-01-21 14:08:14,601] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-01-21 14:08:14,601] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-01-21 14:08:14,653] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-01-21 14:08:14,654] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2025-01-21 14:08:14,654] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-01-21 14:08:14,654] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-01-21 14:08:14,654] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-01-21 14:08:14,654] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-01-21 14:08:14,654] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
Loading extension module fused_adam...
Time to load fused_adam op: 28.471649646759033 seconds
[2025-01-21 14:08:14,897] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-01-21 14:08:14,898] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.99 GB CA 2.1 GB Max_CA 2 GB
[2025-01-21 14:08:14,898] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.33 GB, percent = 12.1%
[2025-01-21 14:08:15,076] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-01-21 14:08:15,077] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.99 GB CA 2.11 GB Max_CA 2 GB
[2025-01-21 14:08:15,077] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.33 GB, percent = 12.1%
[2025-01-21 14:08:15,077] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized
[2025-01-21 14:08:15,250] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-01-21 14:08:15,251] [INFO] [utils.py:782:see_memory_usage] MA 1.98 GB Max_MA 1.98 GB CA 2.11 GB Max_CA 2 GB
[2025-01-21 14:08:15,251] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 30.33 GB, percent = 12.1%
[2025-01-21 14:08:15,255] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-01-21 14:08:15,255] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-01-21 14:08:15,256] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2025-01-21 14:08:15,256] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-01-21 14:08:15,263] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] amp_enabled .................. False
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] amp_params ................... False
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] bfloat16_enabled ............. True
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] comms_config .................
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] communication_data_type ...... None
[2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] compression_config ...........
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-01-21 14:08:15,264] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 8 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-01-21 14:08:15,265] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] pld_params ................... False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] train_batch_size ............. 16 [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] world_size ................... 2 [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-01-21 14:08:15,266] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-01-21 14:08:15,267] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 8, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2025-01-21 14:08:15,267 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-01-21 14:08:15,267 >> Num examples = 7,000 [INFO|trainer.py:1723] 2025-01-21 14:08:15,267 >> Num Epochs = 1 [INFO|trainer.py:1724] 2025-01-21 14:08:15,267 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-01-21 14:08:15,267 >> Total train batch size (w. 
parallel, distributed & accumulation) = 16 [INFO|trainer.py:1728] 2025-01-21 14:08:15,267 >> Gradient Accumulation steps = 8 [INFO|trainer.py:1729] 2025-01-21 14:08:15,267 >> Total optimization steps = 437 [INFO|trainer.py:1730] 2025-01-21 14:08:15,273 >> Number of trainable parameters = 8,798,208 0%| | 0/437 [00:00> Saving model checkpoint to work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200 [INFO|configuration_utils.py:473] 2025-01-21 14:30:25,573 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/config.json [INFO|configuration_utils.py:594] 2025-01-21 14:30:25,574 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/generation_config.json [INFO|modeling_utils.py:2493] 2025-01-21 14:30:28,259 >> Model weights saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-01-21 14:30:28,260 >> tokenizer config file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-01-21 14:30:28,261 >> Special tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-01-21 14:30:28,261 >> added tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/added_tokens.json [2025-01-21 14:30:28,596] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint global_step200 is about to be saved! 
[2025-01-21 14:30:28,623] [INFO] [logging.py:128:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt [2025-01-21 14:30:28,624] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt... [2025-01-21 14:30:30,718] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/mp_rank_00_model_states.pt. [2025-01-21 14:30:30,720] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2025-01-21 14:30:30,789] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2025-01-21 14:30:30,789] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-200/global_step200/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-01-21 14:30:30,789] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step200 is ready now! 
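The batch figures in the configuration dump and trainer summary above agree with each other. A minimal sketch of the arithmetic follows, assuming the usual DeepSpeed relation (train batch size = micro batch per GPU × gradient accumulation steps × world size) and assuming the trainer floors the per-rank batch count when dividing by the accumulation steps. The function names are made up for illustration and are not part of either library.

```python
import math


def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def optimizer_steps_per_epoch(num_examples: int, micro_batch_per_gpu: int,
                              grad_accum_steps: int, world_size: int) -> int:
    """Optimizer steps in one epoch, assuming a floored division by the accumulation steps."""
    # Each rank sees its own shard of the dataset, one micro-batch at a time.
    batches_per_rank = math.ceil(num_examples / (micro_batch_per_gpu * world_size))
    return max(batches_per_rank // grad_accum_steps, 1)


# Values taken from the log: micro batch 1, accumulation 8, world size 2, 7,000 examples.
print(effective_batch_size(1, 8, 2))             # matches train_batch_size = 16
print(optimizer_steps_per_epoch(7000, 1, 8, 2))  # matches Total optimization steps = 437
```

With these inputs each rank processes 3,500 micro-batches, and 3,500 // 8 gives the 437 steps shown in the progress bar.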
dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5308 [2025-01-21 14:30:31,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.70 | bwd_microstep: 383.54 | bwd_inner_microstep: 383.33 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.15 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3121 [2025-01-21 14:30:32,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.43 | bwd_microstep: 243.15 | bwd_inner_microstep: 242.97 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7865 [2025-01-21 14:30:33,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 510.71 | bwd_microstep: 583.79 | bwd_inner_microstep: 583.57 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:30:34,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 538.35 | bwd_microstep: 605.75 | bwd_inner_microstep: 605.58 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:30:34,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.48 | bwd_microstep: 300.21 | bwd_inner_microstep: 300.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4824 [2025-01-21 14:30:35,606] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 303.19 | bwd_microstep: 352.56 | bwd_inner_microstep: 352.39 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2720 [2025-01-21 14:30:36,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 182.47 | bwd_microstep: 211.86 | bwd_inner_microstep: 211.56 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6404 [2025-01-21 14:30:36,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.78 | optimizer_step: 0.38 [2025-01-21 14:30:36,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 413.32 | bwd_microstep: 482.60 | bwd_inner_microstep: 474.74 | bwd_allreduce_microstep: 7.72 | step_microstep: 11.58 [2025-01-21 14:30:36,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2775.47 | bwd: 3163.59 | bwd_inner: 3154.57 | bwd_allreduce: 8.24 | step: 12.41 46%|████▌ | 201/437 [22:21<33:44, 8.58s/it] {'loss': 0.4242, 'learning_rate': 2.3619148449539965e-05, 'epoch': 0.46} 46%|████▌ | 201/437 [22:21<33:44, 8.58s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3025 [2025-01-21 14:30:37,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.05 | bwd_microstep: 223.77 | bwd_inner_microstep: 223.60 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 
14:30:38,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.91 | bwd_microstep: 605.34 | bwd_inner_microstep: 605.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3706 [2025-01-21 14:30:39,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.34 | bwd_microstep: 273.71 | bwd_inner_microstep: 273.40 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4969 [2025-01-21 14:30:39,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.78 | bwd_microstep: 360.03 | bwd_inner_microstep: 359.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7865 [2025-01-21 14:30:40,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 509.39 | bwd_microstep: 583.15 | bwd_inner_microstep: 582.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3863 [2025-01-21 14:30:41,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.39 | bwd_microstep: 287.11 | bwd_inner_microstep: 286.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:30:42,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 357.04 | bwd_microstep: 398.01 | bwd_inner_microstep: 397.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8170 [2025-01-21 14:30:43,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.64 | optimizer_step: 0.33 [2025-01-21 14:30:43,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 519.76 | 
bwd_microstep: 617.09 | bwd_inner_microstep: 609.40 | bwd_allreduce_microstep: 7.44 | step_microstep: 10.82 [2025-01-21 14:30:43,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2944.47 | bwd: 3348.34 | bwd_inner: 3339.54 | bwd_allreduce: 7.92 | step: 11.60 46%|████▌ | 202/437 [22:28<31:10, 7.96s/it] {'loss': 0.3044, 'learning_rate': 2.3472963553338614e-05, 'epoch': 0.46} 46%|████▌ | 202/437 [22:28<31:10, 7.96s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4044 [2025-01-21 14:30:44,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.40 | bwd_microstep: 300.14 | bwd_inner_microstep: 299.95 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:30:45,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.77 | bwd_microstep: 605.26 | bwd_inner_microstep: 604.96 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2573 [2025-01-21 14:30:45,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.54 | bwd_microstep: 201.07 | bwd_inner_microstep: 200.75 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4953 [2025-01-21 14:30:46,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.36 | bwd_microstep: 362.45 | bwd_inner_microstep: 362.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 
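The size-mismatch warnings above compare the number of selected image placeholder positions in the input (7884) with the number of ViT embeddings (7936 or 8192). A rough check of those numbers follows, assuming each image tile contributes 256 visual tokens, which is what 7936 / 31 and 8192 / 32 both give. One possible reading is that the sample was cut at the reported dynamic token length of 8192, so some placeholders were lost, but the log alone does not confirm this. The helper name is hypothetical.

```python
def placeholder_shortfall(num_tiles: int, tokens_per_tile: int, selected_positions: int):
    """Return (expected ViT tokens, how many placeholder positions are missing)."""
    vit_tokens = num_tiles * tokens_per_tile
    return vit_tokens, vit_tokens - selected_positions


# Tile counts and the 7884 figure are copied from the warning lines in the log;
# 256 tokens per tile is an assumption inferred from the ratios.
for tiles in (31, 32):
    vit_tokens, missing = placeholder_shortfall(tiles, 256, 7884)
    print(f"tiles={tiles} vit_tokens={vit_tokens} missing_placeholders={missing}")
```

If the assumption holds, the reported `vit_embeds` sizes are exactly tiles × 256, and the gap to 7884 is the number of image positions that did not survive into `input_embeds[selected]`.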
dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3333 [2025-01-21 14:30:46,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.55 | bwd_microstep: 255.06 | bwd_inner_microstep: 254.74 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3596 [2025-01-21 14:30:47,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.14 | bwd_microstep: 268.21 | bwd_inner_microstep: 268.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:30:47,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.60 | bwd_microstep: 203.75 | bwd_inner_microstep: 203.43 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4025 [2025-01-21 14:30:48,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.83 | optimizer_step: 0.36 [2025-01-21 14:30:48,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.72 | bwd_microstep: 309.39 | bwd_inner_microstep: 301.25 | bwd_allreduce_microstep: 8.01 | step_microstep: 12.08 [2025-01-21 14:30:48,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2223.91 | bwd: 2505.45 | bwd_inner: 2495.92 | bwd_allreduce: 8.49 | step: 12.90 46%|████▋ | 203/437 [22:33<27:31, 7.06s/it] {'loss': 0.2674, 'learning_rate': 2.3326587091662605e-05, 'epoch': 0.46} 46%|████▋ | 203/437 [22:33<27:31, 7.06s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5341 [2025-01-21 14:30:49,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.59 | bwd_microstep: 386.53 | bwd_inner_microstep: 386.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic 
token length: 7170 [2025-01-21 14:30:50,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.68 | bwd_microstep: 534.89 | bwd_inner_microstep: 534.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7916 [2025-01-21 14:30:51,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.73 | bwd_microstep: 590.15 | bwd_inner_microstep: 589.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6862 [2025-01-21 14:30:52,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.98 | bwd_microstep: 507.71 | bwd_inner_microstep: 507.50 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6295 [2025-01-21 14:30:53,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.64 | bwd_microstep: 464.86 | bwd_inner_microstep: 464.66 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4163 [2025-01-21 14:30:53,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.96 | bwd_microstep: 305.21 | bwd_inner_microstep: 305.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3341 [2025-01-21 14:30:54,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.46 | bwd_microstep: 255.93 | bwd_inner_microstep: 255.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7417 [2025-01-21 14:30:55,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.74 | optimizer_step: 0.33 [2025-01-21 14:30:55,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 477.83 | bwd_microstep: 559.72 | bwd_inner_microstep: 552.12 | bwd_allreduce_microstep: 7.49 | step_microstep: 11.13 [2025-01-21 14:30:55,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3149.67 | bwd: 3605.16 | bwd_inner: 3596.41 | bwd_allreduce: 7.98 | step: 11.96 47%|████▋ | 204/437 [22:40<27:19, 7.04s/it] {'loss': 0.2977, 'learning_rate': 2.3180027138502913e-05, 'epoch': 0.47} 47%|████▋ | 204/437 [22:40<27:19, 7.04s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4402 [2025-01-21 14:30:56,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.94 | bwd_microstep: 320.77 | bwd_inner_microstep: 320.54 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4706 [2025-01-21 14:30:56,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.98 | bwd_microstep: 344.18 | bwd_inner_microstep: 343.67 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.27 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5212 [2025-01-21 14:30:57,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 340.95 | bwd_microstep: 380.55 | bwd_inner_microstep: 380.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7051 [2025-01-21 14:30:58,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.22 | bwd_microstep: 521.79 | bwd_inner_microstep: 521.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6778 [2025-01-21 14:30:59,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.67 | bwd_microstep: 499.96 | bwd_inner_microstep: 499.46 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 
[2025-01-21 14:31:00,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.59 | bwd_microstep: 416.69 | bwd_inner_microstep: 416.37 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6941 [2025-01-21 14:31:01,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.05 | bwd_microstep: 511.27 | bwd_inner_microstep: 511.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.18 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5779 [2025-01-21 14:31:02,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.62 | optimizer_gradients: 0.79 | optimizer_step: 0.35 [2025-01-21 14:31:02,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 369.67 | bwd_microstep: 489.67 | bwd_inner_microstep: 425.70 | bwd_allreduce_microstep: 63.86 | step_microstep: 23.16 [2025-01-21 14:31:02,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3011.90 | bwd: 3485.09 | bwd_inner: 3419.33 | bwd_allreduce: 64.60 | step: 24.38 47%|████▋ | 205/437 [22:46<26:51, 6.95s/it] {'loss': 0.3201, 'learning_rate': 2.303329177797172e-05, 'epoch': 0.47} 47%|████▋ | 205/437 [22:46<26:51, 6.95s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5605 [2025-01-21 14:31:02,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.82 | bwd_microstep: 411.76 | bwd_inner_microstep: 411.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3746 [2025-01-21 14:31:03,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.75 | bwd_microstep: 278.57 | bwd_inner_microstep: 278.05 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2585 [2025-01-21 14:31:03,942] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 178.08 | bwd_microstep: 214.68 | bwd_inner_microstep: 214.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5469 [2025-01-21 14:31:04,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.18 | bwd_microstep: 404.43 | bwd_inner_microstep: 404.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4665 [2025-01-21 14:31:05,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.16 | bwd_microstep: 343.58 | bwd_inner_microstep: 343.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7306 [2025-01-21 14:31:06,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.82 | bwd_microstep: 545.42 | bwd_inner_microstep: 545.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:31:07,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 305.43 | bwd_microstep: 337.99 | bwd_inner_microstep: 337.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5435 [2025-01-21 14:31:07,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.70 | optimizer_step: 0.34 [2025-01-21 14:31:07,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 354.23 | bwd_microstep: 407.16 | bwd_inner_microstep: 399.45 | bwd_allreduce_microstep: 7.52 | step_microstep: 11.23 [2025-01-21 14:31:07,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2590.31 | bwd: 2943.77 | bwd_inner: 2934.68 | bwd_allreduce: 8.12 | step: 12.19 47%|████▋ | 206/437 [22:52<25:23, 6.60s/it] {'loss': 0.2589, 
'learning_rate': 2.2886389103856534e-05, 'epoch': 0.47} 47%|████▋ | 206/437 [22:52<25:23, 6.60s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3133 [2025-01-21 14:31:08,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.25 | bwd_microstep: 242.93 | bwd_inner_microstep: 242.75 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:31:09,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.15 | bwd_microstep: 606.66 | bwd_inner_microstep: 606.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7364 [2025-01-21 14:31:10,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.33 | bwd_microstep: 549.78 | bwd_inner_microstep: 549.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4416 [2025-01-21 14:31:11,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 291.82 | bwd_microstep: 321.21 | bwd_inner_microstep: 321.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4930 [2025-01-21 14:31:11,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.98 | bwd_microstep: 360.54 | bwd_inner_microstep: 360.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per 
sample: 23.0, dynamic token length: 6214 [2025-01-21 14:31:12,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.18 | bwd_microstep: 460.84 | bwd_inner_microstep: 460.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:31:13,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 220.49 | bwd_microstep: 242.68 | bwd_inner_microstep: 242.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2340 [2025-01-21 14:31:13,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.42 | optimizer_gradients: 0.94 | optimizer_step: 0.35 [2025-01-21 14:31:13,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.93 | bwd_microstep: 266.71 | bwd_inner_microstep: 232.33 | bwd_allreduce_microstep: 34.27 | step_microstep: 19.13 [2025-01-21 14:31:13,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2638.97 | bwd: 3051.47 | bwd_inner: 3015.97 | bwd_allreduce: 34.78 | step: 19.93 47%|████▋ | 207/437 [22:58<24:30, 6.39s/it] {'loss': 0.2949, 'learning_rate': 2.2739327219173707e-05, 'epoch': 0.47} 47%|████▋ | 207/437 [22:58<24:30, 6.39s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4174 [2025-01-21 14:31:14,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 267.72 | bwd_microstep: 306.45 | bwd_inner_microstep: 306.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3977 [2025-01-21 14:31:15,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.92 | bwd_microstep: 297.40 | bwd_inner_microstep: 297.03 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7932 [2025-01-21 
14:31:16,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 516.03 | bwd_microstep: 588.72 | bwd_inner_microstep: 588.56 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6043 [2025-01-21 14:31:17,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.25 | bwd_microstep: 441.51 | bwd_inner_microstep: 441.18 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2859 [2025-01-21 14:31:17,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.45 | bwd_microstep: 222.02 | bwd_inner_microstep: 221.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7079 [2025-01-21 14:31:18,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.55 | bwd_microstep: 524.68 | bwd_inner_microstep: 524.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:31:19,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 509.77 | bwd_microstep: 577.27 | bwd_inner_microstep: 577.06 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:31:20,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.53 | optimizer_gradients: 0.77 | optimizer_step: 0.35 [2025-01-21 14:31:20,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 357.18 | bwd_microstep: 407.63 | bwd_inner_microstep: 399.84 | bwd_allreduce_microstep: 7.68 | step_microstep: 13.91 [2025-01-21 14:31:20,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2955.70 | bwd: 3365.79 | bwd_inner: 3356.72 | bwd_allreduce: 8.19 | step: 14.68 48%|████▊ | 208/437 
[23:05<24:35, 6.44s/it] {'loss': 0.2762, 'learning_rate': 2.259211423572152e-05, 'epoch': 0.48} 48%|████▊ | 208/437 [23:05<24:35, 6.44s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6665 [2025-01-21 14:31:21,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 430.49 | bwd_microstep: 494.06 | bwd_inner_microstep: 493.84 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:31:22,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.92 | bwd_microstep: 604.91 | bwd_inner_microstep: 604.73 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:31:23,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.09 | bwd_microstep: 606.58 | bwd_inner_microstep: 606.27 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3379 [2025-01-21 14:31:24,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.59 | 
bwd_microstep: 253.50 | bwd_inner_microstep: 253.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6807 [2025-01-21 14:31:25,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 441.04 | bwd_microstep: 505.27 | bwd_inner_microstep: 505.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5200 [2025-01-21 14:31:25,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.61 | bwd_microstep: 378.72 | bwd_inner_microstep: 378.56 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5989 [2025-01-21 14:31:26,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.14 | bwd_microstep: 433.56 | bwd_inner_microstep: 433.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 4061 [2025-01-21 14:31:27,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.68 | optimizer_step: 0.34 [2025-01-21 14:31:27,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 253.28 | bwd_microstep: 311.88 | bwd_inner_microstep: 304.13 | bwd_allreduce_microstep: 7.64 | step_microstep: 11.19 [2025-01-21 14:31:27,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3175.02 | bwd: 3588.60 | bwd_inner: 3579.71 | bwd_allreduce: 8.13 | step: 11.94 48%|████▊ | 209/437 [23:12<25:07, 6.61s/it] {'loss': 0.3394, 'learning_rate': 2.2444758273632693e-05, 'epoch': 0.48} 48%|████▊ | 209/437 [23:12<25:07, 6.61s/it]dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7801 [2025-01-21 14:31:28,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.90 | bwd_microstep: 580.49 | bwd_inner_microstep: 580.16 | 
bwd_allreduce_microstep: 0.11 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6647 [2025-01-21 14:31:29,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 430.05 | bwd_microstep: 488.68 | bwd_inner_microstep: 488.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5489 [2025-01-21 14:31:30,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.34 | bwd_microstep: 403.61 | bwd_inner_microstep: 403.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6785 [2025-01-21 14:31:31,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 440.56 | bwd_microstep: 503.18 | bwd_inner_microstep: 503.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5186 [2025-01-21 14:31:31,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.69 | bwd_microstep: 379.97 | bwd_inner_microstep: 379.65 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:31:33,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 522.51 | bwd_microstep: 605.77 | bwd_inner_microstep: 605.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:31:33,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.59 | bwd_microstep: 221.06 | bwd_inner_microstep: 220.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2642 [2025-01-21 14:31:34,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | 
optimizer_gradients: 0.87 | optimizer_step: 0.36 [2025-01-21 14:31:34,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.27 | bwd_microstep: 245.82 | bwd_inner_microstep: 237.85 | bwd_allreduce_microstep: 7.86 | step_microstep: 12.08 [2025-01-21 14:31:34,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2977.74 | bwd: 3428.71 | bwd_inner: 3419.50 | bwd_allreduce: 8.34 | step: 12.86 48%|████▊ | 210/437 [23:18<25:02, 6.62s/it] {'loss': 0.2609, 'learning_rate': 2.2297267460926548e-05, 'epoch': 0.48} 48%|████▊ | 210/437 [23:18<25:02, 6.62s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5683 [2025-01-21 14:31:34,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.98 | bwd_microstep: 415.51 | bwd_inner_microstep: 415.35 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:31:36,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.87 | bwd_microstep: 606.02 | bwd_inner_microstep: 605.70 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7656 [2025-01-21 14:31:37,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.80 | bwd_microstep: 569.36 | bwd_inner_microstep: 569.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6310 [2025-01-21 14:31:38,050] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 408.31 | bwd_microstep: 465.77 | bwd_inner_microstep: 465.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8149 [2025-01-21 14:31:39,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.21 | bwd_microstep: 608.72 | bwd_inner_microstep: 608.55 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6270 [2025-01-21 14:31:40,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.87 | bwd_microstep: 463.64 | bwd_inner_microstep: 463.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:31:41,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.69 | bwd_microstep: 478.47 | bwd_inner_microstep: 478.26 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:31:42,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.68 | optimizer_step: 0.34 [2025-01-21 14:31:42,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.73 | bwd_microstep: 610.09 | bwd_inner_microstep: 602.56 | bwd_allreduce_microstep: 7.43 | step_microstep: 10.82 [2025-01-21 14:31:42,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3699.30 | bwd: 4217.70 | bwd_inner: 4209.06 | bwd_allreduce: 7.89 | step: 11.61 48%|████▊ | 211/437 [23:26<26:40, 7.08s/it] {'loss': 0.295, 'learning_rate': 2.2149649933060625e-05, 'epoch': 0.48} 48%|████▊ | 211/437 [23:26<26:40, 7.08s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3777 [2025-01-21 14:31:42,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.87 | bwd_microstep: 
280.02 | bwd_inner_microstep: 279.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4805 [2025-01-21 14:31:43,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 312.65 | bwd_microstep: 351.43 | bwd_inner_microstep: 351.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6353 [2025-01-21 14:31:44,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.98 | bwd_microstep: 469.10 | bwd_inner_microstep: 468.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4214 [2025-01-21 14:31:44,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.81 | bwd_microstep: 309.29 | bwd_inner_microstep: 309.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4714 [2025-01-21 14:31:45,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.73 | bwd_microstep: 344.15 | bwd_inner_microstep: 343.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4951 [2025-01-21 14:31:46,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.87 | bwd_microstep: 360.68 | bwd_inner_microstep: 360.46 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6985 [2025-01-21 14:31:47,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 440.19 | bwd_microstep: 517.31 | bwd_inner_microstep: 517.12 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:31:48,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.92 | optimizer_gradients: 0.68 | optimizer_step: 0.34 [2025-01-21 14:31:48,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.94 | bwd_microstep: 625.47 | bwd_inner_microstep: 617.80 | bwd_allreduce_microstep: 7.54 | step_microstep: 11.09 [2025-01-21 14:31:48,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2848.87 | bwd: 3257.58 | bwd_inner: 3248.83 | bwd_allreduce: 8.02 | step: 11.89 49%|████▊ | 212/437 [23:33<25:42, 6.86s/it] {'loss': 0.3794, 'learning_rate': 2.200191383248197e-05, 'epoch': 0.48} 49%|████▊ | 212/437 [23:33<25:42, 6.86s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3495 [2025-01-21 14:31:49,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.72 | bwd_microstep: 262.28 | bwd_inner_microstep: 262.07 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3426 [2025-01-21 14:31:49,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.31 | bwd_microstep: 256.51 | bwd_inner_microstep: 256.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3675 [2025-01-21 14:31:50,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.59 | bwd_microstep: 273.65 | bwd_inner_microstep: 273.48 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4681 [2025-01-21 14:31:50,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.00 | bwd_microstep: 344.12 | bwd_inner_microstep: 343.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5470 [2025-01-21 14:31:51,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.41 | bwd_microstep: 402.74 | 
bwd_inner_microstep: 402.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5191 [2025-01-21 14:31:52,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.24 | bwd_microstep: 379.33 | bwd_inner_microstep: 379.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:31:52,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.33 | bwd_microstep: 282.90 | bwd_inner_microstep: 282.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:31:54,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.67 | optimizer_step: 0.38 [2025-01-21 14:31:54,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.80 | bwd_microstep: 613.08 | bwd_inner_microstep: 605.33 | bwd_allreduce_microstep: 7.53 | step_microstep: 11.25 [2025-01-21 14:31:54,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2523.24 | bwd: 2814.73 | bwd_inner: 2805.97 | bwd_allreduce: 7.98 | step: 12.06 49%|████▊ | 213/437 [23:38<24:08, 6.47s/it] {'loss': 0.4467, 'learning_rate': 2.1854067308177967e-05, 'epoch': 0.49} 49%|████▊ | 213/437 [23:38<24:08, 6.47s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6630 [2025-01-21 14:31:55,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 418.21 | bwd_microstep: 485.70 | bwd_inner_microstep: 485.53 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.20 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3413 [2025-01-21 14:31:55,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.29 | bwd_microstep: 256.24 | bwd_inner_microstep: 256.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6562 [2025-01-21 14:31:56,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.70 | bwd_microstep: 483.63 | bwd_inner_microstep: 483.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2578 [2025-01-21 14:31:56,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.69 | bwd_microstep: 219.29 | bwd_inner_microstep: 219.08 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7852 [2025-01-21 14:31:58,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.26 | bwd_microstep: 585.58 | bwd_inner_microstep: 585.35 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4358 [2025-01-21 14:31:58,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.73 | bwd_microstep: 320.69 | bwd_inner_microstep: 320.38 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:31:59,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.57 | bwd_microstep: 300.86 | bwd_inner_microstep: 300.55 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2695 [2025-01-21 14:31:59,775] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:31:59,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.66 | bwd_microstep: 296.39 | bwd_inner_microstep: 244.44 | bwd_allreduce_microstep: 51.81 | step_microstep: 13.96 [2025-01-21 14:31:59,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2496.96 | bwd: 2948.53 | bwd_inner: 2895.34 | bwd_allreduce: 52.31 | step: 14.78 49%|████▉ | 214/437 [23:44<23:08, 6.23s/it] {'loss': 0.2828, 'learning_rate': 2.1706118515226894e-05, 'epoch': 0.49} 49%|████▉ | 214/437 [23:44<23:08, 6.23s/it]dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5127 [2025-01-21 14:32:00,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.12 | bwd_microstep: 375.75 | bwd_inner_microstep: 375.41 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3165 [2025-01-21 14:32:00,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.92 | bwd_microstep: 243.33 | bwd_inner_microstep: 243.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.18 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2889 [2025-01-21 14:32:01,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.21 | bwd_microstep: 216.21 | bwd_inner_microstep: 216.02 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:32:02,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.27 | bwd_microstep: 605.20 | bwd_inner_microstep: 604.71 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:32:03,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.86 | bwd_microstep: 479.21 | bwd_inner_microstep: 478.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:32:04,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.33 | bwd_microstep: 477.82 | bwd_inner_microstep: 477.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2686 [2025-01-21 14:32:04,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.73 | bwd_microstep: 211.49 | bwd_inner_microstep: 211.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6862 [2025-01-21 14:32:05,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:32:05,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.76 | bwd_microstep: 514.62 | bwd_inner_microstep: 507.03 | bwd_allreduce_microstep: 7.36 | step_microstep: 11.03 [2025-01-21 14:32:05,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2743.03 | bwd: 3123.75 | bwd_inner: 3114.78 | bwd_allreduce: 7.98 | step: 11.88 49%|████▉ | 215/437 [23:50<22:53, 6.19s/it] {'loss': 0.3457, 'learning_rate': 2.1558075614348065e-05, 'epoch': 0.49} 49%|████▉ | 215/437 [23:50<22:53, 6.19s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2508 [2025-01-21 
14:32:06,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.48 | bwd_microstep: 205.91 | bwd_inner_microstep: 205.68 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3760 [2025-01-21 14:32:06,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.66 | bwd_microstep: 280.23 | bwd_inner_microstep: 279.92 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2338 [2025-01-21 14:32:07,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.86 | bwd_microstep: 203.33 | bwd_inner_microstep: 203.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6286 [2025-01-21 14:32:08,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.82 | bwd_microstep: 466.84 | bwd_inner_microstep: 466.54 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4682 [2025-01-21 14:32:08,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.97 | bwd_microstep: 344.84 | bwd_inner_microstep: 344.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5736 [2025-01-21 14:32:09,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.65 | bwd_microstep: 421.11 | bwd_inner_microstep: 420.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3598 [2025-01-21 14:32:10,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.65 | bwd_microstep: 268.14 | bwd_inner_microstep: 267.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, 
dynamic token length: 7802 [2025-01-21 14:32:11,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.66 | optimizer_step: 0.33 [2025-01-21 14:32:11,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.63 | bwd_microstep: 587.05 | bwd_inner_microstep: 579.49 | bwd_allreduce_microstep: 7.46 | step_microstep: 11.08 [2025-01-21 14:32:11,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2423.59 | bwd: 2777.57 | bwd_inner: 2768.74 | bwd_allreduce: 7.92 | step: 11.84 49%|████▉ | 216/437 [23:55<21:56, 5.96s/it] {'loss': 0.2655, 'learning_rate': 2.1409946771451705e-05, 'epoch': 0.49} 49%|████▉ | 216/437 [23:55<21:56, 5.96s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3221 [2025-01-21 14:32:11,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.08 | bwd_microstep: 242.15 | bwd_inner_microstep: 241.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2361 [2025-01-21 14:32:12,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.80 | bwd_microstep: 214.11 | bwd_inner_microstep: 213.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2320 [2025-01-21 14:32:12,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.16 | bwd_microstep: 213.76 | bwd_inner_microstep: 213.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4948 [2025-01-21 14:32:13,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.79 | bwd_microstep: 361.19 | bwd_inner_microstep: 360.88 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.16 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:32:14,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.40 | bwd_microstep: 606.00 | bwd_inner_microstep: 605.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7558 [2025-01-21 14:32:15,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.19 | bwd_microstep: 563.14 | bwd_inner_microstep: 562.91 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:32:15,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.54 | bwd_microstep: 198.61 | bwd_inner_microstep: 198.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2455 [2025-01-21 14:32:16,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:32:16,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.07 | bwd_microstep: 236.60 | bwd_inner_microstep: 228.81 | bwd_allreduce_microstep: 7.68 | step_microstep: 11.55 [2025-01-21 14:32:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2229.85 | bwd: 2635.68 | bwd_inner: 2626.69 | bwd_allreduce: 8.16 | step: 12.36 50%|████▉ | 217/437 [24:01<20:52, 5.69s/it] {'loss': 0.3977, 'learning_rate': 2.1261740157188498e-05, 'epoch': 0.5} 50%|████▉ | 217/437 [24:01<20:52, 5.69s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic 
token length: 4248 [2025-01-21 14:32:16,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.89 | bwd_microstep: 311.01 | bwd_inner_microstep: 310.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:32:18,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 554.10 | bwd_microstep: 605.41 | bwd_inner_microstep: 605.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3126 [2025-01-21 14:32:18,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.31 | bwd_microstep: 242.86 | bwd_inner_microstep: 242.55 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4442 [2025-01-21 14:32:19,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.43 | bwd_microstep: 330.27 | bwd_inner_microstep: 330.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4153 [2025-01-21 14:32:19,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.22 | bwd_microstep: 309.87 | bwd_inner_microstep: 309.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5469 [2025-01-21 14:32:20,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 360.46 | bwd_microstep: 
404.59 | bwd_inner_microstep: 404.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3828 [2025-01-21 14:32:21,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 257.43 | bwd_microstep: 282.93 | bwd_inner_microstep: 282.76 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2895 [2025-01-21 14:32:21,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.54 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:32:21,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.76 | bwd_microstep: 241.50 | bwd_inner_microstep: 233.60 | bwd_allreduce_microstep: 7.79 | step_microstep: 11.91 [2025-01-21 14:32:21,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2427.44 | bwd: 2728.57 | bwd_inner: 2719.52 | bwd_allreduce: 8.26 | step: 12.72 50%|████▉ | 218/437 [24:06<20:26, 5.60s/it] {'loss': 0.274, 'learning_rate': 2.111346394649897e-05, 'epoch': 0.5} 50%|████▉ | 218/437 [24:06<20:26, 5.60s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5617 [2025-01-21 14:32:22,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.35 | bwd_microstep: 414.19 | bwd_inner_microstep: 413.89 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8182 [2025-01-21 14:32:23,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 529.58 | bwd_microstep: 610.72 | bwd_inner_microstep: 610.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3370 [2025-01-21 14:32:24,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.71 | bwd_microstep: 254.73 | bwd_inner_microstep: 254.56 | bwd_allreduce_microstep: 
0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4684 [2025-01-21 14:32:24,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.71 | bwd_microstep: 345.31 | bwd_inner_microstep: 345.14 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:32:26,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.85 | bwd_microstep: 608.02 | bwd_inner_microstep: 607.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5715 [2025-01-21 14:32:26,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.15 | bwd_microstep: 419.28 | bwd_inner_microstep: 419.12 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7099 [2025-01-21 14:32:27,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 451.27 | bwd_microstep: 526.80 | bwd_inner_microstep: 526.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4517 [2025-01-21 14:32:28,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.76 | optimizer_step: 0.34 [2025-01-21 14:32:28,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.24 | bwd_microstep: 344.48 | bwd_inner_microstep: 335.24 | 
bwd_allreduce_microstep: 9.12 | step_microstep: 11.60 [2025-01-21 14:32:28,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3097.71 | bwd: 3523.65 | bwd_inner: 3513.22 | bwd_allreduce: 9.60 | step: 12.41 50%|█████ | 219/437 [24:13<21:43, 5.98s/it] {'loss': 0.4635, 'learning_rate': 2.0965126318162476e-05, 'epoch': 0.5} 50%|█████ | 219/437 [24:13<21:43, 5.98s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3962 [2025-01-21 14:32:29,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.02 | bwd_microstep: 294.02 | bwd_inner_microstep: 293.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6031 [2025-01-21 14:32:30,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.40 | bwd_microstep: 440.81 | bwd_inner_microstep: 440.49 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4679 [2025-01-21 14:32:30,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.54 | bwd_microstep: 344.41 | bwd_inner_microstep: 344.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3871 [2025-01-21 14:32:31,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.88 | bwd_microstep: 288.33 | bwd_inner_microstep: 288.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3599 [2025-01-21 14:32:31,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.91 | bwd_microstep: 269.68 | bwd_inner_microstep: 269.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7820 [2025-01-21 14:32:32,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 514.30 | bwd_microstep: 584.30 | bwd_inner_microstep: 584.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2654 [2025-01-21 14:32:33,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.89 | bwd_microstep: 205.95 | bwd_inner_microstep: 205.78 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7662 [2025-01-21 14:32:34,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:32:34,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.45 | bwd_microstep: 576.70 | bwd_inner_microstep: 569.15 | bwd_allreduce_microstep: 7.43 | step_microstep: 11.13 [2025-01-21 14:32:34,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2653.24 | bwd: 3004.32 | bwd_inner: 2995.59 | bwd_allreduce: 7.89 | step: 11.94 50%|█████ | 220/437 [24:19<21:30, 5.95s/it] {'loss': 0.3107, 'learning_rate': 2.0816735454346134e-05, 'epoch': 0.5} 50%|█████ | 220/437 [24:19<21:30, 5.95s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5381 [2025-01-21 14:32:35,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 353.66 | bwd_microstep: 397.63 | bwd_inner_microstep: 397.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2341 [2025-01-21 14:32:35,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.86 | bwd_microstep: 210.42 | bwd_inner_microstep: 209.91 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5756 [2025-01-21 14:32:36,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.28 | bwd_microstep: 424.10 | 
bwd_inner_microstep: 423.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3334 [2025-01-21 14:32:37,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.23 | bwd_microstep: 255.13 | bwd_inner_microstep: 254.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2800 [2025-01-21 14:32:37,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.69 | bwd_microstep: 220.38 | bwd_inner_microstep: 220.08 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:32:38,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.47 | bwd_microstep: 604.16 | bwd_inner_microstep: 603.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:32:39,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.29 | bwd_microstep: 220.47 | bwd_inner_microstep: 220.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6162 [2025-01-21 14:32:39,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.14 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:32:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 401.56 | 
bwd_microstep: 472.18 | bwd_inner_microstep: 458.24 | bwd_allreduce_microstep: 13.82 | step_microstep: 13.78 [2025-01-21 14:32:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2475.87 | bwd: 2804.64 | bwd_inner: 2789.22 | bwd_allreduce: 14.41 | step: 14.75 51%|█████ | 221/437 [24:24<20:56, 5.82s/it] {'loss': 0.4861, 'learning_rate': 2.0668299540153494e-05, 'epoch': 0.51} 51%|█████ | 221/437 [24:24<20:56, 5.82s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3384 [2025-01-21 14:32:40,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.41 | bwd_microstep: 253.64 | bwd_inner_microstep: 253.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6400 [2025-01-21 14:32:41,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 415.86 | bwd_microstep: 475.09 | bwd_inner_microstep: 474.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7884 [2025-01-21 14:32:42,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.83 | bwd_microstep: 591.32 | bwd_inner_microstep: 591.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3097 [2025-01-21 14:32:43,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.37 | bwd_microstep: 241.11 | bwd_inner_microstep: 240.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8116 [2025-01-21 14:32:44,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 529.51 | bwd_microstep: 607.18 | bwd_inner_microstep: 606.99 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4660 [2025-01-21 14:32:44,871] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.01 | bwd_microstep: 343.54 | bwd_inner_microstep: 343.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3330 [2025-01-21 14:32:45,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.81 | bwd_microstep: 253.90 | bwd_inner_microstep: 253.74 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:32:46,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:32:46,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.48 | bwd_microstep: 470.11 | bwd_inner_microstep: 462.44 | bwd_allreduce_microstep: 7.57 | step_microstep: 11.19 [2025-01-21 14:32:46,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2835.13 | bwd: 3236.01 | bwd_inner: 3227.28 | bwd_allreduce: 8.05 | step: 11.97 51%|█████ | 222/437 [24:30<21:21, 5.96s/it] {'loss': 0.2846, 'learning_rate': 2.051982676317302e-05, 'epoch': 0.51} 51%|█████ | 222/437 [24:30<21:21, 5.96s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3419 [2025-01-21 14:32:46,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.66 | bwd_microstep: 255.33 | bwd_inner_microstep: 255.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5749 [2025-01-21 14:32:47,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.97 | bwd_microstep: 424.42 | bwd_inner_microstep: 423.91 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7580 [2025-01-21 14:32:48,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 493.03 | bwd_microstep: 565.20 | bwd_inner_microstep: 565.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7045 [2025-01-21 14:32:49,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.62 | bwd_microstep: 523.60 | bwd_inner_microstep: 523.39 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:32:50,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.20 | bwd_microstep: 603.75 | bwd_inner_microstep: 603.59 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:32:51,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 218.77 | bwd_microstep: 243.08 | bwd_inner_microstep: 242.78 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:32:52,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.71 | bwd_microstep: 318.32 | bwd_inner_microstep: 318.08 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5737 [2025-01-21 14:32:52,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.09 | optimizer_gradients: 0.73 | optimizer_step: 0.35 [2025-01-21 14:32:52,884] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 368.98 | bwd_microstep: 453.95 | bwd_inner_microstep: 423.01 | bwd_allreduce_microstep: 30.84 | step_microstep: 13.94 [2025-01-21 14:32:52,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2972.77 | bwd: 3387.85 | bwd_inner: 3355.46 | bwd_allreduce: 31.42 | step: 14.92 51%|█████ | 223/437 [24:37<21:56, 6.15s/it] {'loss': 0.3598, 'learning_rate': 2.0371325313026502e-05, 'epoch': 0.51} 51%|█████ | 223/437 [24:37<21:56, 6.15s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5669 [2025-01-21 14:32:53,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.22 | bwd_microstep: 416.21 | bwd_inner_microstep: 416.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:32:54,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.06 | bwd_microstep: 604.76 | bwd_inner_microstep: 604.59 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2331 [2025-01-21 14:32:55,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.49 | bwd_microstep: 201.27 | bwd_inner_microstep: 200.92 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5733 [2025-01-21 14:32:56,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.46 | bwd_microstep: 423.18 | bwd_inner_microstep: 423.02 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7311 [2025-01-21 14:32:57,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.86 | bwd_microstep: 547.38 | bwd_inner_microstep: 547.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2790 [2025-01-21 14:32:57,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.98 | bwd_microstep: 219.92 | bwd_inner_microstep: 219.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:32:58,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.85 | bwd_microstep: 215.06 | bwd_inner_microstep: 214.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:32:59,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.72 | optimizer_step: 0.42 [2025-01-21 14:32:59,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.78 | bwd_microstep: 692.22 | bwd_inner_microstep: 615.78 | bwd_allreduce_microstep: 76.32 | step_microstep: 13.85 [2025-01-21 14:32:59,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2848.56 | bwd: 3320.14 | bwd_inner: 3242.55 | bwd_allreduce: 76.81 | step: 14.61 51%|█████▏ | 224/437 [24:43<22:05, 6.22s/it] {'loss': 0.2647, 'learning_rate': 2.022280338091731e-05, 'epoch': 0.51} 51%|█████▏ | 224/437 [24:43<22:05, 6.22s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4556 [2025-01-21 14:32:59,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.06 | bwd_microstep: 334.96 | bwd_inner_microstep: 334.76 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 
dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3411 [2025-01-21 14:33:00,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.37 | bwd_microstep: 258.43 | bwd_inner_microstep: 258.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3351 [2025-01-21 14:33:00,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.11 | bwd_microstep: 256.06 | bwd_inner_microstep: 255.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6795 [2025-01-21 14:33:01,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.94 | bwd_microstep: 502.43 | bwd_inner_microstep: 502.07 | bwd_allreduce_microstep: 0.15 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3869 [2025-01-21 14:33:02,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.88 | bwd_microstep: 287.72 | bwd_inner_microstep: 287.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:33:03,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 555.06 | bwd_microstep: 608.10 | bwd_inner_microstep: 607.79 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:33:04,091] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 159.30 | bwd_microstep: 203.62 | bwd_inner_microstep: 203.13 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2851 [2025-01-21 14:33:04,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:33:04,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 187.80 | bwd_microstep: 240.72 | bwd_inner_microstep: 233.07 | bwd_allreduce_microstep: 7.54 | step_microstep: 11.35 [2025-01-21 14:33:04,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2359.35 | bwd: 2692.20 | bwd_inner: 2682.96 | bwd_allreduce: 8.19 | step: 12.32 51%|█████▏ | 225/437 [24:49<20:59, 5.94s/it] {'loss': 0.4459, 'learning_rate': 2.0074269159178606e-05, 'epoch': 0.51} 51%|█████▏ | 225/437 [24:49<20:59, 5.94s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6207 [2025-01-21 14:33:05,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.61 | bwd_microstep: 459.04 | bwd_inner_microstep: 458.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6657 [2025-01-21 14:33:06,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 429.08 | bwd_microstep: 494.64 | bwd_inner_microstep: 494.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.17 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4225 [2025-01-21 14:33:07,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.90 | bwd_microstep: 313.61 | bwd_inner_microstep: 313.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6042 [2025-01-21 14:33:07,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.87 | bwd_microstep: 441.16 
| bwd_inner_microstep: 440.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3358 [2025-01-21 14:33:08,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.97 | bwd_microstep: 257.17 | bwd_inner_microstep: 257.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:33:09,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.36 | bwd_microstep: 606.76 | bwd_inner_microstep: 606.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:33:10,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.73 | bwd_microstep: 562.56 | bwd_inner_microstep: 562.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3452 [2025-01-21 14:33:11,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:33:11,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 222.28 | bwd_microstep: 273.69 | bwd_inner_microstep: 263.15 | bwd_allreduce_microstep: 10.30 | step_microstep: 11.29 [2025-01-21 14:33:11,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2991.64 | bwd: 3408.75 | bwd_inner: 3397.20 | bwd_allreduce: 10.75 | step: 12.12 52%|█████▏ | 226/437 
[24:55<21:36, 6.15s/it] {'loss': 0.5016, 'learning_rate': 1.9925730840821404e-05, 'epoch': 0.52} 52%|█████▏ | 226/437 [24:55<21:36, 6.15s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6190 [2025-01-21 14:33:12,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 401.08 | bwd_microstep: 458.50 | bwd_inner_microstep: 458.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3430 [2025-01-21 14:33:12,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.03 | bwd_microstep: 258.15 | bwd_inner_microstep: 257.96 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:33:13,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.23 | bwd_microstep: 606.07 | bwd_inner_microstep: 605.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5755 [2025-01-21 14:33:14,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.69 | bwd_microstep: 421.53 | bwd_inner_microstep: 421.37 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2515 [2025-01-21 14:33:15,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.64 | bwd_microstep: 203.00 | bwd_inner_microstep: 202.67 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.10 
dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:33:15,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.39 | bwd_microstep: 358.21 | bwd_inner_microstep: 358.02 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6200 [2025-01-21 14:33:16,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 397.26 | bwd_microstep: 459.93 | bwd_inner_microstep: 459.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7753 [2025-01-21 14:33:17,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.70 | optimizer_step: 0.34 [2025-01-21 14:33:17,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.81 | bwd_microstep: 587.37 | bwd_inner_microstep: 579.76 | bwd_allreduce_microstep: 7.51 | step_microstep: 11.34 [2025-01-21 14:33:17,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2955.97 | bwd: 3352.90 | bwd_inner: 3344.07 | bwd_allreduce: 8.00 | step: 12.14 52%|█████▏ | 227/437 [25:02<21:55, 6.27s/it] {'loss': 0.3645, 'learning_rate': 1.9777196619082693e-05, 'epoch': 0.52} 52%|█████▏ | 227/437 [25:02<21:55, 6.27s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3330 [2025-01-21 14:33:18,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.75 | bwd_microstep: 250.35 | bwd_inner_microstep: 250.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6173 [2025-01-21 14:33:19,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 402.04 | bwd_microstep: 458.94 | bwd_inner_microstep: 458.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, 
dynamic token length: 6601 [2025-01-21 14:33:20,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.27 | bwd_microstep: 485.29 | bwd_inner_microstep: 485.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7108 [2025-01-21 14:33:21,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.64 | bwd_microstep: 526.87 | bwd_inner_microstep: 526.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3916 [2025-01-21 14:33:21,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.22 | bwd_microstep: 290.15 | bwd_inner_microstep: 289.98 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7317 [2025-01-21 14:33:22,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.75 | bwd_microstep: 549.40 | bwd_inner_microstep: 549.20 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:33:23,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.72 | bwd_microstep: 604.88 | bwd_inner_microstep: 604.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.17 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8043 [2025-01-21 14:33:25,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | 
optimizer_gradients: 0.70 | optimizer_step: 0.34 [2025-01-21 14:33:25,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 518.67 | bwd_microstep: 608.96 | bwd_inner_microstep: 601.13 | bwd_allreduce_microstep: 7.62 | step_microstep: 11.24 [2025-01-21 14:33:25,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3313.89 | bwd: 3774.98 | bwd_inner: 3766.16 | bwd_allreduce: 8.05 | step: 12.06 52%|█████▏ | 228/437 [25:09<22:55, 6.58s/it] {'loss': 0.5755, 'learning_rate': 1.9628674686973508e-05, 'epoch': 0.52} 52%|█████▏ | 228/437 [25:09<22:55, 6.58s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7014 [2025-01-21 14:33:26,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 449.05 | bwd_microstep: 519.46 | bwd_inner_microstep: 519.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4236 [2025-01-21 14:33:26,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.59 | bwd_microstep: 311.46 | bwd_inner_microstep: 311.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7377 [2025-01-21 14:33:27,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.02 | bwd_microstep: 550.70 | bwd_inner_microstep: 550.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2858 [2025-01-21 14:33:28,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.05 | bwd_microstep: 214.56 | bwd_inner_microstep: 214.38 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6539 [2025-01-21 14:33:29,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.60 | bwd_microstep: 482.45 | bwd_inner_microstep: 482.29 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4125 [2025-01-21 14:33:29,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.09 | bwd_microstep: 303.58 | bwd_inner_microstep: 303.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8068 [2025-01-21 14:33:30,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.55 | bwd_microstep: 608.32 | bwd_inner_microstep: 608.12 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2493 [2025-01-21 14:33:31,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:33:31,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.87 | bwd_microstep: 235.69 | bwd_inner_microstep: 228.04 | bwd_allreduce_microstep: 7.55 | step_microstep: 11.24 [2025-01-21 14:33:31,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2797.66 | bwd: 3226.35 | bwd_inner: 3217.60 | bwd_allreduce: 8.02 | step: 12.03 52%|█████▏ | 229/437 [25:16<22:28, 6.48s/it] {'loss': 0.3132, 'learning_rate': 1.948017323682699e-05, 'epoch': 0.52} 52%|█████▏ | 229/437 [25:16<22:28, 6.48s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7509 [2025-01-21 14:33:32,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.47 | bwd_microstep: 561.19 | bwd_inner_microstep: 560.97 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5048 [2025-01-21 14:33:33,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.98 | bwd_microstep: 368.71 | bwd_inner_microstep: 368.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic 
ViT batch size: 21, images per sample: 21.0, dynamic token length: 5794 [2025-01-21 14:33:33,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 380.46 | bwd_microstep: 425.78 | bwd_inner_microstep: 425.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5744 [2025-01-21 14:33:34,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.10 | bwd_microstep: 421.91 | bwd_inner_microstep: 421.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5716 [2025-01-21 14:33:35,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.64 | bwd_microstep: 420.15 | bwd_inner_microstep: 419.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3374 [2025-01-21 14:33:36,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.40 | bwd_microstep: 254.47 | bwd_inner_microstep: 254.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:33:37,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.38 | bwd_microstep: 605.31 | bwd_inner_microstep: 605.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:33:38,424] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.73 | optimizer_step: 0.34 [2025-01-21 14:33:38,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.52 | bwd_microstep: 623.69 | bwd_inner_microstep: 616.06 | bwd_allreduce_microstep: 7.52 | step_microstep: 11.26 [2025-01-21 14:33:38,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3220.79 | bwd: 3681.32 | bwd_inner: 3672.62 | bwd_allreduce: 8.01 | step: 12.03 53%|█████▎ | 230/437 [25:23<23:02, 6.68s/it] {'loss': 0.5478, 'learning_rate': 1.9331700459846516e-05, 'epoch': 0.53} 53%|█████▎ | 230/437 [25:23<23:02, 6.68s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6130 [2025-01-21 14:33:39,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.46 | bwd_microstep: 445.79 | bwd_inner_microstep: 445.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6893 [2025-01-21 14:33:40,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.88 | bwd_microstep: 511.10 | bwd_inner_microstep: 510.76 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4993 [2025-01-21 14:33:41,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.70 | bwd_microstep: 366.93 | bwd_inner_microstep: 366.61 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4699 [2025-01-21 14:33:41,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.20 | bwd_microstep: 344.87 | bwd_inner_microstep: 344.55 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6525 [2025-01-21 14:33:42,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.35 | bwd_microstep: 479.14 | 
bwd_inner_microstep: 478.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6528 [2025-01-21 14:33:43,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 431.10 | bwd_microstep: 481.15 | bwd_inner_microstep: 480.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:33:44,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 255.17 | bwd_microstep: 279.26 | bwd_inner_microstep: 278.93 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7238 [2025-01-21 14:33:45,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:33:45,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 464.69 | bwd_microstep: 553.68 | bwd_inner_microstep: 542.51 | bwd_allreduce_microstep: 11.05 | step_microstep: 11.69 [2025-01-21 14:33:45,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3053.38 | bwd: 3462.04 | bwd_inner: 3449.48 | bwd_allreduce: 11.52 | step: 12.49 53%|█████▎ | 231/437 [25:29<22:59, 6.70s/it] {'loss': 0.274, 'learning_rate': 1.918326454565387e-05, 'epoch': 0.53} 53%|█████▎ | 231/437 [25:29<22:59, 6.70s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:33:46,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.78 | bwd_microstep: 614.41 | bwd_inner_microstep: 614.24 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7180 [2025-01-21 14:33:47,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.68 | bwd_microstep: 533.08 | bwd_inner_microstep: 532.89 | bwd_allreduce_microstep: 0.06 
| step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:33:48,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.69 | bwd_microstep: 607.03 | bwd_inner_microstep: 606.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4929 [2025-01-21 14:33:49,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.85 | bwd_microstep: 362.30 | bwd_inner_microstep: 362.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2272 [2025-01-21 14:33:49,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.48 | bwd_microstep: 205.04 | bwd_inner_microstep: 204.86 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:33:50,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.28 | bwd_microstep: 263.80 | bwd_inner_microstep: 263.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:33:50,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.82 | bwd_microstep: 209.36 | bwd_inner_microstep: 209.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8013 [2025-01-21 14:33:51,744] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.76 | optimizer_step: 0.34 [2025-01-21 14:33:51,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.25 | bwd_microstep: 605.46 | bwd_inner_microstep: 597.72 | bwd_allreduce_microstep: 7.63 | step_microstep: 11.60 [2025-01-21 14:33:51,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2941.66 | bwd: 3400.61 | bwd_inner: 3391.80 | bwd_allreduce: 8.10 | step: 12.38 53%|█████▎ | 232/437 [25:36<22:45, 6.66s/it] {'loss': 0.3535, 'learning_rate': 1.9034873681837534e-05, 'epoch': 0.53} 53%|█████▎ | 232/437 [25:36<22:45, 6.66s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6721 [2025-01-21 14:33:52,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 432.63 | bwd_microstep: 495.86 | bwd_inner_microstep: 495.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4576 [2025-01-21 14:33:53,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.91 | bwd_microstep: 335.86 | bwd_inner_microstep: 335.64 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2368 [2025-01-21 14:33:53,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.84 | bwd_microstep: 207.20 | bwd_inner_microstep: 206.90 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7628 [2025-01-21 14:33:54,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.39 | bwd_microstep: 569.56 | bwd_inner_microstep: 569.26 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4181 [2025-01-21 14:33:55,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
274.25 | bwd_microstep: 308.07 | bwd_inner_microstep: 307.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7846 [2025-01-21 14:33:56,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.06 | bwd_microstep: 587.46 | bwd_inner_microstep: 587.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8103 [2025-01-21 14:33:57,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.69 | bwd_microstep: 604.91 | bwd_inner_microstep: 604.59 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:33:58,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 1.33 | optimizer_step: 0.37 [2025-01-21 14:33:58,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.68 | bwd_microstep: 387.64 | bwd_inner_microstep: 377.28 | bwd_allreduce_microstep: 10.22 | step_microstep: 17.24 [2025-01-21 14:33:58,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3045.29 | bwd: 3496.69 | bwd_inner: 3485.01 | bwd_allreduce: 10.71 | step: 18.02 53%|█████▎ | 233/437 [25:43<22:46, 6.70s/it] {'loss': 0.2747, 'learning_rate': 1.8886536053501042e-05, 'epoch': 0.53} 53%|█████▎ | 233/437 [25:43<22:46, 6.70s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8088 [2025-01-21 14:33:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 520.41 | bwd_microstep: 605.95 | bwd_inner_microstep: 605.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6029 [2025-01-21 14:34:00,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.75 | bwd_microstep: 441.01 | bwd_inner_microstep: 
440.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5498 [2025-01-21 14:34:01,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.01 | bwd_microstep: 408.02 | bwd_inner_microstep: 407.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6257 [2025-01-21 14:34:02,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.55 | bwd_microstep: 464.06 | bwd_inner_microstep: 463.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5716 [2025-01-21 14:34:03,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 376.39 | bwd_microstep: 419.89 | bwd_inner_microstep: 419.72 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:34:03,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 305.73 | bwd_microstep: 338.88 | bwd_inner_microstep: 338.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:34:04,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.46 | bwd_microstep: 498.44 | bwd_inner_microstep: 498.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4347 [2025-01-21 14:34:05,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.74 | optimizer_step: 0.36 [2025-01-21 14:34:05,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.14 | bwd_microstep: 328.08 | bwd_inner_microstep: 320.15 | bwd_allreduce_microstep: 7.82 | step_microstep: 11.54 [2025-01-21 14:34:05,356] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3088.25 | bwd: 3504.46 | bwd_inner: 3495.48 | bwd_allreduce: 8.29 | step: 12.36 54%|█████▎ | 234/437 [25:50<22:47, 6.74s/it] {'loss': 0.2108, 'learning_rate': 1.873825984281151e-05, 'epoch': 0.53} 54%|█████▎ | 234/437 [25:50<22:47, 6.74s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5641 [2025-01-21 14:34:06,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 370.95 | bwd_microstep: 417.38 | bwd_inner_microstep: 417.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:34:07,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.40 | bwd_microstep: 606.62 | bwd_inner_microstep: 606.40 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2835 [2025-01-21 14:34:07,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.74 | bwd_microstep: 214.09 | bwd_inner_microstep: 213.60 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3613 [2025-01-21 14:34:08,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.69 | bwd_microstep: 272.06 | bwd_inner_microstep: 271.82 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7538 [2025-01-21 14:34:09,411] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 494.53 | bwd_microstep: 560.49 | bwd_inner_microstep: 560.27 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:34:09,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 158.86 | bwd_microstep: 196.40 | bwd_inner_microstep: 196.11 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:34:10,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.91 | bwd_microstep: 605.32 | bwd_inner_microstep: 605.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7470 [2025-01-21 14:34:12,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.76 | optimizer_step: 0.40 [2025-01-21 14:34:12,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.59 | bwd_microstep: 567.74 | bwd_inner_microstep: 559.84 | bwd_allreduce_microstep: 7.78 | step_microstep: 11.66 [2025-01-21 14:34:12,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3022.51 | bwd: 3440.28 | bwd_inner: 3430.85 | bwd_allreduce: 8.42 | step: 12.71 54%|█████▍ | 235/437 [25:56<22:38, 6.73s/it] {'loss': 0.3552, 'learning_rate': 1.8590053228548305e-05, 'epoch': 0.54} 54%|█████▍ | 235/437 [25:56<22:38, 6.73s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8073 
[2025-01-21 14:34:13,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 519.59 | bwd_microstep: 604.61 | bwd_inner_microstep: 604.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2910 [2025-01-21 14:34:13,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.77 | bwd_microstep: 218.12 | bwd_inner_microstep: 217.93 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4487 [2025-01-21 14:34:14,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.06 | bwd_microstep: 332.48 | bwd_inner_microstep: 332.31 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7346 [2025-01-21 14:34:15,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.39 | bwd_microstep: 549.86 | bwd_inner_microstep: 549.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7619 [2025-01-21 14:34:16,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.30 | bwd_microstep: 570.81 | bwd_inner_microstep: 570.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6011 [2025-01-21 14:34:17,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.40 | bwd_microstep: 437.35 | bwd_inner_microstep: 437.10 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5201 [2025-01-21 14:34:18,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.81 | bwd_microstep: 380.19 | bwd_inner_microstep: 379.87 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 13, images per 
sample: 13.0, dynamic token length: 3592 [2025-01-21 14:34:18,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.70 | optimizer_step: 0.34 [2025-01-21 14:34:18,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.22 | bwd_microstep: 278.24 | bwd_inner_microstep: 270.20 | bwd_allreduce_microstep: 7.81 | step_microstep: 11.32 [2025-01-21 14:34:18,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2956.35 | bwd: 3371.78 | bwd_inner: 3362.57 | bwd_allreduce: 8.27 | step: 12.14 54%|█████▍ | 236/437 [26:03<22:22, 6.68s/it] {'loss': 0.3144, 'learning_rate': 1.844192438565194e-05, 'epoch': 0.54} 54%|█████▍ | 236/437 [26:03<22:22, 6.68s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:34:19,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.65 | bwd_microstep: 603.98 | bwd_inner_microstep: 603.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5327 [2025-01-21 14:34:20,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.03 | bwd_microstep: 386.19 | bwd_inner_microstep: 386.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8160 [2025-01-21 14:34:21,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 529.77 | bwd_microstep: 610.08 | bwd_inner_microstep: 609.91 | bwd_allreduce_microstep: 0.07 | 
step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4428 [2025-01-21 14:34:22,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.97 | bwd_microstep: 324.81 | bwd_inner_microstep: 324.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7046 [2025-01-21 14:34:23,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.01 | bwd_microstep: 523.87 | bwd_inner_microstep: 523.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:34:24,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.29 | bwd_microstep: 435.11 | bwd_inner_microstep: 434.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:34:25,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.80 | bwd_microstep: 518.05 | bwd_inner_microstep: 517.86 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3611 [2025-01-21 14:34:25,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:34:25,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.69 | bwd_microstep: 337.61 | bwd_inner_microstep: 271.72 | bwd_allreduce_microstep: 65.77 | step_microstep: 14.11 [2025-01-21 14:34:25,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3248.03 | bwd: 3739.84 | bwd_inner: 3672.86 | bwd_allreduce: 66.27 | step: 14.90 54%|█████▍ | 237/437 [26:10<22:48, 6.84s/it] {'loss': 0.2116, 'learning_rate': 1.829388148477311e-05, 'epoch': 0.54} 54%|█████▍ | 237/437 [26:10<22:48, 6.84s/it]dynamic ViT batch size: 23, images 
per sample: 23.0, dynamic token length: 6744 [2025-01-21 14:34:26,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 421.46 | bwd_microstep: 497.43 | bwd_inner_microstep: 497.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5320 [2025-01-21 14:34:27,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.66 | bwd_microstep: 389.63 | bwd_inner_microstep: 389.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.31 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:34:28,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.73 | bwd_microstep: 605.86 | bwd_inner_microstep: 605.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7330 [2025-01-21 14:34:29,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.29 | bwd_microstep: 546.98 | bwd_inner_microstep: 546.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7308 [2025-01-21 14:34:30,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.33 | bwd_microstep: 547.02 | bwd_inner_microstep: 546.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:34:31,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
436.47 | bwd_microstep: 498.75 | bwd_inner_microstep: 498.43 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5807 [2025-01-21 14:34:32,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 368.15 | bwd_microstep: 427.48 | bwd_inner_microstep: 427.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4451 [2025-01-21 14:34:33,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:34:33,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.33 | bwd_microstep: 334.89 | bwd_inner_microstep: 326.94 | bwd_allreduce_microstep: 7.78 | step_microstep: 11.85 [2025-01-21 14:34:33,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3363.26 | bwd: 3848.16 | bwd_inner: 3839.12 | bwd_allreduce: 8.26 | step: 12.82 54%|█████▍ | 238/437 [26:18<23:17, 7.02s/it] {'loss': 0.3492, 'learning_rate': 1.814593269182204e-05, 'epoch': 0.54} 54%|█████▍ | 238/437 [26:18<23:17, 7.02s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2700 [2025-01-21 14:34:33,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.38 | bwd_microstep: 207.43 | bwd_inner_microstep: 207.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7913 [2025-01-21 14:34:34,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 516.03 | bwd_microstep: 589.87 | bwd_inner_microstep: 589.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6296 [2025-01-21 14:34:35,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.82 | bwd_microstep: 464.74 | bwd_inner_microstep: 464.58 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4684 [2025-01-21 14:34:36,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 306.50 | bwd_microstep: 343.02 | bwd_inner_microstep: 342.85 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2536 [2025-01-21 14:34:36,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.89 | bwd_microstep: 220.15 | bwd_inner_microstep: 219.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4920 [2025-01-21 14:34:37,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.57 | bwd_microstep: 365.56 | bwd_inner_microstep: 365.38 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:34:38,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.55 | bwd_microstep: 377.30 | bwd_inner_microstep: 377.09 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5150 [2025-01-21 14:34:39,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.39 | optimizer_gradients: 0.84 | optimizer_step: 0.35 [2025-01-21 14:34:39,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 330.18 | bwd_microstep: 450.85 | bwd_inner_microstep: 379.50 | bwd_allreduce_microstep: 71.13 | step_microstep: 18.50 [2025-01-21 14:34:39,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2583.75 | bwd: 3019.05 | bwd_inner: 2946.64 | bwd_allreduce: 71.60 | step: 19.31 55%|█████▍ | 239/437 [26:23<22:00, 6.67s/it] {'loss': 0.2632, 'learning_rate': 1.7998086167518033e-05, 'epoch': 0.55} 55%|█████▍ | 239/437 [26:23<22:00, 
6.67s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5913 [2025-01-21 14:34:39,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.93 | bwd_microstep: 434.79 | bwd_inner_microstep: 434.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6091 [2025-01-21 14:34:40,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.23 | bwd_microstep: 446.45 | bwd_inner_microstep: 446.21 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4726 [2025-01-21 14:34:41,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.80 | bwd_microstep: 344.86 | bwd_inner_microstep: 344.69 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2599 [2025-01-21 14:34:41,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.50 | bwd_microstep: 206.18 | bwd_inner_microstep: 206.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:34:43,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.11 | bwd_microstep: 607.11 | bwd_inner_microstep: 606.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6527 [2025-01-21 14:34:44,063] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.94 | bwd_microstep: 481.49 | bwd_inner_microstep: 481.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7585 [2025-01-21 14:34:45,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.90 | bwd_microstep: 565.49 | bwd_inner_microstep: 565.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6245 [2025-01-21 14:34:46,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.82 | optimizer_step: 0.35 [2025-01-21 14:34:46,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.67 | bwd_microstep: 471.98 | bwd_inner_microstep: 464.14 | bwd_allreduce_microstep: 7.72 | step_microstep: 11.84 [2025-01-21 14:34:46,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3135.92 | bwd: 3558.48 | bwd_inner: 3549.53 | bwd_allreduce: 8.22 | step: 12.61 55%|█████▍ | 240/437 [26:30<22:09, 6.75s/it] {'loss': 0.4334, 'learning_rate': 1.785035006693938e-05, 'epoch': 0.55} 55%|█████▍ | 240/437 [26:30<22:09, 6.75s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7944 [2025-01-21 14:34:47,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 517.19 | bwd_microstep: 598.44 | bwd_inner_microstep: 598.21 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5247 [2025-01-21 14:34:47,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.35 | bwd_microstep: 383.95 | bwd_inner_microstep: 383.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4155 [2025-01-21 14:34:48,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 274.63 | bwd_microstep: 305.82 | bwd_inner_microstep: 305.63 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2802 [2025-01-21 14:34:49,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.37 | bwd_microstep: 209.98 | bwd_inner_microstep: 209.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4645 [2025-01-21 14:34:49,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.34 | bwd_microstep: 344.83 | bwd_inner_microstep: 344.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:34:50,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 510.30 | bwd_microstep: 579.81 | bwd_inner_microstep: 579.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6701 [2025-01-21 14:34:51,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 420.00 | bwd_microstep: 495.07 | bwd_inner_microstep: 494.86 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3902 [2025-01-21 14:34:52,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.88 | optimizer_step: 0.35 [2025-01-21 14:34:52,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.16 | bwd_microstep: 301.53 | bwd_inner_microstep: 290.87 | bwd_allreduce_microstep: 10.41 | step_microstep: 14.24 [2025-01-21 14:34:52,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2828.16 | bwd: 3219.56 | bwd_inner: 3207.70 | bwd_allreduce: 10.94 | step: 15.04 55%|█████▌ | 241/437 [26:37<21:35, 6.61s/it] {'loss': 0.2417, 'learning_rate': 
1.770273253907346e-05, 'epoch': 0.55} 55%|█████▌ | 241/437 [26:37<21:35, 6.61s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:34:53,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.94 | bwd_microstep: 613.29 | bwd_inner_microstep: 612.98 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7127 [2025-01-21 14:34:54,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.46 | bwd_microstep: 527.51 | bwd_inner_microstep: 527.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5509 [2025-01-21 14:34:55,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.31 | bwd_microstep: 408.45 | bwd_inner_microstep: 408.29 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2590 [2025-01-21 14:34:55,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.87 | bwd_microstep: 205.52 | bwd_inner_microstep: 205.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6558 [2025-01-21 14:34:56,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.88 | bwd_microstep: 485.33 | bwd_inner_microstep: 485.03 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6530 [2025-01-21 14:34:57,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.51 | bwd_microstep: 486.26 | bwd_inner_microstep: 485.95 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:34:58,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 374.97 | bwd_microstep: 418.95 | bwd_inner_microstep: 418.73 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3926 [2025-01-21 14:34:59,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.79 | optimizer_step: 0.35 [2025-01-21 14:34:59,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.27 | bwd_microstep: 300.91 | bwd_inner_microstep: 293.11 | bwd_allreduce_microstep: 7.69 | step_microstep: 11.77 [2025-01-21 14:34:59,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3023.06 | bwd: 3446.37 | bwd_inner: 3437.19 | bwd_allreduce: 8.24 | step: 12.56 55%|█████▌ | 242/437 [26:43<21:34, 6.64s/it] {'loss': 0.2955, 'learning_rate': 1.7555241726367317e-05, 'epoch': 0.55} 55%|█████▌ | 242/437 [26:43<21:34, 6.64s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7006 [2025-01-21 14:35:00,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.02 | bwd_microstep: 518.27 | bwd_inner_microstep: 518.10 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2854 [2025-01-21 14:35:00,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.19 | bwd_microstep: 216.15 | bwd_inner_microstep: 215.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:35:01,664] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.38 | bwd_microstep: 604.66 | bwd_inner_microstep: 604.35 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5475 [2025-01-21 14:35:02,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.66 | bwd_microstep: 404.29 | bwd_inner_microstep: 404.09 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:35:03,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.41 | bwd_microstep: 604.26 | bwd_inner_microstep: 604.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:35:04,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.95 | bwd_microstep: 604.67 | bwd_inner_microstep: 604.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:35:05,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 405.79 | bwd_microstep: 460.94 | bwd_inner_microstep: 460.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5572 [2025-01-21 14:35:06,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 
14:35:06,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.53 | bwd_microstep: 422.21 | bwd_inner_microstep: 414.34 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.77 [2025-01-21 14:35:06,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3382.80 | bwd: 3835.58 | bwd_inner: 3826.56 | bwd_allreduce: 8.25 | step: 12.53 56%|█████▌ | 243/437 [26:51<22:14, 6.88s/it] {'loss': 0.538, 'learning_rate': 1.7407885764278488e-05, 'epoch': 0.56} 56%|█████▌ | 243/437 [26:51<22:14, 6.88s/it]dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5120 [2025-01-21 14:35:07,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 330.26 | bwd_microstep: 373.55 | bwd_inner_microstep: 373.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4497 [2025-01-21 14:35:07,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.42 | bwd_microstep: 336.99 | bwd_inner_microstep: 336.71 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2352 [2025-01-21 14:35:08,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.13 | bwd_microstep: 211.24 | bwd_inner_microstep: 211.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4175 [2025-01-21 14:35:08,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.02 | bwd_microstep: 308.10 | bwd_inner_microstep: 307.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3623 [2025-01-21 14:35:09,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.23 | bwd_microstep: 271.14 | bwd_inner_microstep: 270.97 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch 
size: 15, images per sample: 15.0, dynamic token length: 4141 [2025-01-21 14:35:10,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.15 | bwd_microstep: 307.14 | bwd_inner_microstep: 306.84 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2539 [2025-01-21 14:35:10,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.24 | bwd_microstep: 204.00 | bwd_inner_microstep: 203.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:35:11,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.70 | optimizer_step: 0.33 [2025-01-21 14:35:11,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.36 | bwd_microstep: 509.46 | bwd_inner_microstep: 501.60 | bwd_allreduce_microstep: 7.68 | step_microstep: 11.40 [2025-01-21 14:35:11,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2201.62 | bwd: 2521.74 | bwd_inner: 2512.81 | bwd_allreduce: 8.15 | step: 12.22 56%|█████▌ | 244/437 [26:56<20:16, 6.30s/it] {'loss': 0.2689, 'learning_rate': 1.7260672780826296e-05, 'epoch': 0.56} 56%|█████▌ | 244/437 [26:56<20:16, 6.30s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8119 [2025-01-21 14:35:12,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 521.22 | bwd_microstep: 608.05 | bwd_inner_microstep: 607.73 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6079 [2025-01-21 14:35:13,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.85 | bwd_microstep: 443.83 | bwd_inner_microstep: 443.48 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6299 
[2025-01-21 14:35:14,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.67 | bwd_microstep: 468.06 | bwd_inner_microstep: 467.57 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5245 [2025-01-21 14:35:15,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.90 | bwd_microstep: 387.26 | bwd_inner_microstep: 387.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6799 [2025-01-21 14:35:16,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.42 | bwd_microstep: 504.43 | bwd_inner_microstep: 504.24 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6517 [2025-01-21 14:35:17,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.14 | bwd_microstep: 481.13 | bwd_inner_microstep: 480.79 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:35:17,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.25 | bwd_microstep: 262.83 | bwd_inner_microstep: 262.61 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6921 [2025-01-21 14:35:18,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.86 | optimizer_step: 0.37 [2025-01-21 14:35:18,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 431.97 | bwd_microstep: 522.17 | bwd_inner_microstep: 514.10 | bwd_allreduce_microstep: 7.96 | step_microstep: 12.05 [2025-01-21 14:35:18,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3207.27 | bwd: 3677.94 | bwd_inner: 3668.15 | bwd_allreduce: 8.62 | step: 13.04 56%|█████▌ 
| 245/437 [27:03<20:57, 6.55s/it] {'loss': 0.2849, 'learning_rate': 1.7113610896143473e-05, 'epoch': 0.56} 56%|█████▌ | 245/437 [27:03<20:57, 6.55s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7922 [2025-01-21 14:35:19,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.85 | bwd_microstep: 595.29 | bwd_inner_microstep: 594.97 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4694 [2025-01-21 14:35:20,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.46 | bwd_microstep: 344.80 | bwd_inner_microstep: 344.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7857 [2025-01-21 14:35:21,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.27 | bwd_microstep: 585.49 | bwd_inner_microstep: 585.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6792 [2025-01-21 14:35:22,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 442.70 | bwd_microstep: 505.24 | bwd_inner_microstep: 504.97 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.17 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7580 [2025-01-21 14:35:23,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.78 | bwd_microstep: 564.59 | bwd_inner_microstep: 564.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2262 [2025-01-21 14:35:23,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.45 | bwd_microstep: 203.72 | bwd_inner_microstep: 203.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7602 [2025-01-21 
14:35:25,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 486.39 | bwd_microstep: 566.14 | bwd_inner_microstep: 565.81 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:35:26,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.70 | optimizer_step: 0.37 [2025-01-21 14:35:26,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.26 | bwd_microstep: 614.20 | bwd_inner_microstep: 606.41 | bwd_allreduce_microstep: 7.68 | step_microstep: 11.37 [2025-01-21 14:35:26,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3464.00 | bwd: 3979.60 | bwd_inner: 3970.47 | bwd_allreduce: 8.20 | step: 12.20 56%|█████▋ | 246/437 [27:10<21:55, 6.89s/it] {'loss': 0.5197, 'learning_rate': 1.6966708222028284e-05, 'epoch': 0.56} 56%|█████▋ | 246/437 [27:10<21:55, 6.89s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7945 [2025-01-21 14:35:27,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 517.93 | bwd_microstep: 591.66 | bwd_inner_microstep: 591.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7657 [2025-01-21 14:35:28,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.49 | bwd_microstep: 572.07 | bwd_inner_microstep: 571.85 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 14, images per 
sample: 14.0, dynamic token length: 3902 [2025-01-21 14:35:29,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.03 | bwd_microstep: 286.39 | bwd_inner_microstep: 286.06 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4911 [2025-01-21 14:35:29,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.04 | bwd_microstep: 359.68 | bwd_inner_microstep: 359.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:35:30,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.71 | bwd_microstep: 608.09 | bwd_inner_microstep: 607.60 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4815 [2025-01-21 14:35:31,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.38 | bwd_microstep: 353.63 | bwd_inner_microstep: 353.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3657 [2025-01-21 14:35:32,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 233.42 | bwd_microstep: 272.68 | bwd_inner_microstep: 272.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4135 [2025-01-21 14:35:32,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.91 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:35:32,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.96 | bwd_microstep: 315.98 | bwd_inner_microstep: 307.97 | bwd_allreduce_microstep: 7.90 | step_microstep: 12.26 [2025-01-21 14:35:32,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2959.80 | bwd: 3360.34 | bwd_inner: 3350.87 | bwd_allreduce: 8.50 | step: 13.23 57%|█████▋ | 247/437 [27:17<21:29, 6.79s/it] {'loss': 0.4282, 'learning_rate': 1.681997286149709e-05, 'epoch': 0.56} 57%|█████▋ | 247/437 [27:17<21:29, 6.79s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4571 [2025-01-21 14:35:33,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.86 | bwd_microstep: 342.20 | bwd_inner_microstep: 342.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3676 [2025-01-21 14:35:34,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.15 | bwd_microstep: 274.25 | bwd_inner_microstep: 273.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6815 [2025-01-21 14:35:35,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.83 | bwd_microstep: 506.09 | bwd_inner_microstep: 505.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4403 [2025-01-21 14:35:35,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.44 | bwd_microstep: 323.36 | bwd_inner_microstep: 323.19 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2799 [2025-01-21 14:35:36,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.50 | bwd_microstep: 215.27 | bwd_inner_microstep: 215.05 | 
bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:35:37,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 440.88 | bwd_microstep: 498.24 | bwd_inner_microstep: 497.91 | bwd_allreduce_microstep: 0.15 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4712 [2025-01-21 14:35:37,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 301.37 | bwd_microstep: 345.79 | bwd_inner_microstep: 345.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:35:38,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.90 | optimizer_step: 0.36 [2025-01-21 14:35:38,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.58 | bwd_microstep: 616.85 | bwd_inner_microstep: 608.55 | bwd_allreduce_microstep: 8.06 | step_microstep: 11.96 [2025-01-21 14:35:38,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2749.44 | bwd: 3122.19 | bwd_inner: 3112.68 | bwd_allreduce: 8.59 | step: 12.77 57%|█████▋ | 248/437 [27:23<20:44, 6.58s/it] {'loss': 0.4306, 'learning_rate': 1.6673412908337402e-05, 'epoch': 0.57} 57%|█████▋ | 248/437 [27:23<20:44, 6.58s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7960 [2025-01-21 14:35:40,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 519.70 | 
bwd_microstep: 593.17 | bwd_inner_microstep: 593.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3157 [2025-01-21 14:35:40,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.00 | bwd_microstep: 242.72 | bwd_inner_microstep: 242.40 | bwd_allreduce_microstep: 0.20 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2600 [2025-01-21 14:35:40,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.14 | bwd_microstep: 212.83 | bwd_inner_microstep: 212.50 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.14 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5223 [2025-01-21 14:35:41,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.04 | bwd_microstep: 382.93 | bwd_inner_microstep: 382.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5487 [2025-01-21 14:35:42,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.78 | bwd_microstep: 404.74 | bwd_inner_microstep: 404.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5194 [2025-01-21 14:35:43,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.32 | bwd_microstep: 378.80 | bwd_inner_microstep: 378.50 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5179 [2025-01-21 14:35:44,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.99 | bwd_microstep: 379.50 | bwd_inner_microstep: 379.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2602 [2025-01-21 14:35:44,464] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.90 | optimizer_step: 0.36 [2025-01-21 14:35:44,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.56 | bwd_microstep: 249.63 | bwd_inner_microstep: 241.36 | bwd_allreduce_microstep: 8.10 | step_microstep: 12.69 [2025-01-21 14:35:44,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2475.35 | bwd: 2844.45 | bwd_inner: 2834.92 | bwd_allreduce: 8.68 | step: 13.49 57%|█████▋ | 249/437 [27:29<19:39, 6.27s/it] {'loss': 0.2655, 'learning_rate': 1.6527036446661396e-05, 'epoch': 0.57} 57%|█████▋ | 249/437 [27:29<19:39, 6.27s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:35:45,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.83 | bwd_microstep: 604.82 | bwd_inner_microstep: 604.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7421 [2025-01-21 14:35:46,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.99 | bwd_microstep: 552.37 | bwd_inner_microstep: 552.06 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6335 [2025-01-21 14:35:47,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 412.21 | bwd_microstep: 467.47 | bwd_inner_microstep: 467.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4988 [2025-01-21 
14:35:48,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.51 | bwd_microstep: 362.47 | bwd_inner_microstep: 362.22 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3652 [2025-01-21 14:35:48,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.62 | bwd_microstep: 272.24 | bwd_inner_microstep: 272.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3095 [2025-01-21 14:35:49,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.90 | bwd_microstep: 277.49 | bwd_inner_microstep: 277.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2549 [2025-01-21 14:35:49,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.78 | bwd_microstep: 205.83 | bwd_inner_microstep: 205.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7504 [2025-01-21 14:35:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.66 | optimizer_step: 0.33 [2025-01-21 14:35:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.03 | bwd_microstep: 568.49 | bwd_inner_microstep: 560.82 | bwd_allreduce_microstep: 7.56 | step_microstep: 10.88 [2025-01-21 14:35:50,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2873.70 | bwd: 3311.33 | bwd_inner: 3302.43 | bwd_allreduce: 8.03 | step: 11.72 57%|█████▋ | 250/437 [27:35<19:41, 6.32s/it] {'loss': 0.2738, 'learning_rate': 1.638085155046004e-05, 'epoch': 0.57} 57%|█████▋ | 250/437 [27:35<19:41, 6.32s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5938 [2025-01-21 14:35:51,740] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 384.23 | bwd_microstep: 435.51 | bwd_inner_microstep: 435.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4248 [2025-01-21 14:35:52,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.07 | bwd_microstep: 310.58 | bwd_inner_microstep: 310.39 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8167 [2025-01-21 14:35:53,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 530.41 | bwd_microstep: 611.16 | bwd_inner_microstep: 610.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5223 [2025-01-21 14:35:54,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.04 | bwd_microstep: 382.20 | bwd_inner_microstep: 382.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2555 [2025-01-21 14:35:54,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.57 | bwd_microstep: 205.57 | bwd_inner_microstep: 205.41 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:35:55,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.94 | bwd_microstep: 605.42 | bwd_inner_microstep: 605.21 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 
dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3595 [2025-01-21 14:35:56,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.78 | bwd_microstep: 268.70 | bwd_inner_microstep: 268.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:35:56,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:35:56,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.45 | bwd_microstep: 258.79 | bwd_inner_microstep: 231.29 | bwd_allreduce_microstep: 27.27 | step_microstep: 14.02 [2025-01-21 14:35:56,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2663.35 | bwd: 3078.05 | bwd_inner: 3049.46 | bwd_allreduce: 27.76 | step: 14.81 57%|█████▋ | 251/437 [27:41<19:15, 6.21s/it] {'loss': 0.4238, 'learning_rate': 1.623486628315773e-05, 'epoch': 0.57} 57%|█████▋ | 251/437 [27:41<19:15, 6.21s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4579 [2025-01-21 14:35:57,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 298.55 | bwd_microstep: 337.21 | bwd_inner_microstep: 337.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:35:58,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 555.78 | bwd_microstep: 605.61 | bwd_inner_microstep: 605.29 | 
bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7871 [2025-01-21 14:35:59,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.97 | bwd_microstep: 589.15 | bwd_inner_microstep: 588.98 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7331 [2025-01-21 14:36:00,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.19 | bwd_microstep: 550.26 | bwd_inner_microstep: 550.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2563 [2025-01-21 14:36:01,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.99 | bwd_microstep: 203.25 | bwd_inner_microstep: 203.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3350 [2025-01-21 14:36:01,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.55 | bwd_microstep: 257.82 | bwd_inner_microstep: 257.50 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:36:02,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.98 | bwd_microstep: 203.90 | bwd_inner_microstep: 203.71 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:36:02,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 14:36:02,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.77 | bwd_microstep: 300.45 | bwd_inner_microstep: 263.71 | bwd_allreduce_microstep: 36.62 | step_microstep: 14.85 [2025-01-21 14:36:02,808] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2671.62 | bwd: 3047.78 | bwd_inner: 3009.84 | bwd_allreduce: 37.09 | step: 15.63 58%|█████▊ | 252/437 [27:47<18:54, 6.13s/it] {'loss': 0.2554, 'learning_rate': 1.608908869716751e-05, 'epoch': 0.58} 58%|█████▊ | 252/437 [27:47<18:54, 6.13s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:36:04,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.44 | bwd_microstep: 607.19 | bwd_inner_microstep: 607.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7920 [2025-01-21 14:36:05,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 516.33 | bwd_microstep: 588.83 | bwd_inner_microstep: 588.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2310 [2025-01-21 14:36:05,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.81 | bwd_microstep: 202.25 | bwd_inner_microstep: 202.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:36:06,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.14 | bwd_microstep: 607.75 | bwd_inner_microstep: 607.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7061 [2025-01-21 14:36:07,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.52 | bwd_microstep: 524.40 | bwd_inner_microstep: 524.21 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:36:08,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 437.82 | bwd_microstep: 498.52 | bwd_inner_microstep: 498.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2766 [2025-01-21 14:36:09,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.79 | bwd_microstep: 212.24 | bwd_inner_microstep: 211.93 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:36:10,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.69 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 14:36:10,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.14 | bwd_microstep: 646.39 | bwd_inner_microstep: 617.74 | bwd_allreduce_microstep: 28.53 | step_microstep: 24.98 [2025-01-21 14:36:10,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3387.82 | bwd: 3887.70 | bwd_inner: 3857.89 | bwd_allreduce: 29.02 | step: 25.83 58%|█████▊ | 253/437 [27:55<20:06, 6.56s/it] {'loss': 0.295, 'learning_rate': 1.5943526833446917e-05, 'epoch': 0.58} 58%|█████▊ | 253/437 [27:55<20:06, 6.56s/it]dynamic ViT batch size: 28, images per 
sample: 28.0, dynamic token length: 7786 [2025-01-21 14:36:11,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 498.34 | bwd_microstep: 580.65 | bwd_inner_microstep: 580.47 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6975 [2025-01-21 14:36:12,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.57 | bwd_microstep: 512.52 | bwd_inner_microstep: 512.33 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2361 [2025-01-21 14:36:12,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 162.05 | bwd_microstep: 198.55 | bwd_inner_microstep: 198.39 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5774 [2025-01-21 14:36:13,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 381.03 | bwd_microstep: 427.81 | bwd_inner_microstep: 427.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7061 [2025-01-21 14:36:14,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.91 | bwd_microstep: 524.66 | bwd_inner_microstep: 524.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6503 [2025-01-21 14:36:15,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.85 | bwd_microstep: 482.60 | bwd_inner_microstep: 482.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 7411 [2025-01-21 14:36:16,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.85 | bwd_microstep: 553.06 | bwd_inner_microstep: 552.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 
dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7813 [2025-01-21 14:36:17,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.70 | optimizer_step: 0.33 [2025-01-21 14:36:17,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 500.55 | bwd_microstep: 590.91 | bwd_inner_microstep: 583.46 | bwd_allreduce_microstep: 7.35 | step_microstep: 11.15 [2025-01-21 14:36:17,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3327.00 | bwd: 3870.90 | bwd_inner: 3862.37 | bwd_allreduce: 7.84 | step: 11.91 58%|█████▊ | 254/437 [28:02<20:48, 6.82s/it] {'loss': 0.3214, 'learning_rate': 1.579818872105444e-05, 'epoch': 0.58} 58%|█████▊ | 254/437 [28:02<20:48, 6.82s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:36:18,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.91 | bwd_microstep: 604.31 | bwd_inner_microstep: 604.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6317 [2025-01-21 14:36:19,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.35 | bwd_microstep: 467.81 | bwd_inner_microstep: 467.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6557 [2025-01-21 14:36:20,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.24 | bwd_microstep: 485.10 | bwd_inner_microstep: 484.94 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4134 [2025-01-21 14:36:21,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.47 | bwd_microstep: 305.84 | bwd_inner_microstep: 305.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4125 [2025-01-21 14:36:22,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.68 | bwd_microstep: 302.54 | bwd_inner_microstep: 302.30 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7010 [2025-01-21 14:36:23,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.99 | bwd_microstep: 517.41 | bwd_inner_microstep: 517.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2924 [2025-01-21 14:36:23,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.95 | bwd_microstep: 220.48 | bwd_inner_microstep: 220.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7108 [2025-01-21 14:36:24,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:36:24,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.97 | bwd_microstep: 535.84 | bwd_inner_microstep: 527.91 | bwd_allreduce_microstep: 7.70 | step_microstep: 11.56 [2025-01-21 14:36:24,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3040.38 | bwd: 3439.45 | bwd_inner: 3430.52 | bwd_allreduce: 8.17 | step: 12.39 58%|█████▊ | 255/437 [28:09<20:36, 6.79s/it] {'loss': 0.2114, 'learning_rate': 1.565308237670666e-05, 'epoch': 0.58} 58%|█████▊ | 255/437 [28:09<20:36, 6.79s/it]dynamic 
ViT batch size: 11, images per sample: 11.0, dynamic token length: 3243 [2025-01-21 14:36:25,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.12 | bwd_microstep: 246.88 | bwd_inner_microstep: 246.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4219 [2025-01-21 14:36:25,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.86 | bwd_microstep: 311.68 | bwd_inner_microstep: 311.47 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:36:26,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.06 | bwd_microstep: 607.17 | bwd_inner_microstep: 606.98 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5725 [2025-01-21 14:36:27,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.84 | bwd_microstep: 421.37 | bwd_inner_microstep: 421.05 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5726 [2025-01-21 14:36:28,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.86 | bwd_microstep: 421.58 | bwd_inner_microstep: 421.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4662 [2025-01-21 14:36:29,138] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 309.61 | bwd_microstep: 344.11 | bwd_inner_microstep: 343.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2790 [2025-01-21 14:36:29,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.49 | bwd_microstep: 214.86 | bwd_inner_microstep: 214.69 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:36:30,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.87 | optimizer_step: 0.36 [2025-01-21 14:36:30,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.12 | bwd_microstep: 235.69 | bwd_inner_microstep: 227.77 | bwd_allreduce_microstep: 7.79 | step_microstep: 12.14 [2025-01-21 14:36:30,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2469.81 | bwd: 2803.46 | bwd_inner: 2794.39 | bwd_allreduce: 8.28 | step: 12.94 59%|█████▊ | 256/437 [28:14<19:19, 6.40s/it] {'loss': 0.3037, 'learning_rate': 1.550821580433604e-05, 'epoch': 0.59} 59%|█████▊ | 256/437 [28:14<19:19, 6.40s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:36:31,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.87 | bwd_microstep: 614.02 | bwd_inner_microstep: 613.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7497 [2025-01-21 14:36:32,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.88 | bwd_microstep: 559.73 | bwd_inner_microstep: 559.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic 
ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:36:33,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.82 | bwd_microstep: 610.83 | bwd_inner_microstep: 610.62 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5755 [2025-01-21 14:36:34,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 376.95 | bwd_microstep: 423.01 | bwd_inner_microstep: 422.79 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3087 [2025-01-21 14:36:34,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.39 | bwd_microstep: 242.33 | bwd_inner_microstep: 242.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5735 [2025-01-21 14:36:35,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.30 | bwd_microstep: 422.60 | bwd_inner_microstep: 422.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7829 [2025-01-21 14:36:36,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.84 | bwd_microstep: 584.01 | bwd_inner_microstep: 583.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:36:37,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.69 | optimizer_step: 0.38 [2025-01-21 14:36:37,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.74 | 
bwd_microstep: 286.84 | bwd_inner_microstep: 279.12 | bwd_allreduce_microstep: 7.61 | step_microstep: 11.33 [2025-01-21 14:36:37,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3202.61 | bwd: 3743.51 | bwd_inner: 3734.74 | bwd_allreduce: 8.10 | step: 12.15 59%|█████▉ | 257/437 [28:21<19:54, 6.64s/it] {'loss': 0.3683, 'learning_rate': 1.5363596994649433e-05, 'epoch': 0.59} 59%|█████▉ | 257/437 [28:21<19:54, 6.64s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3712 [2025-01-21 14:36:37,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.55 | bwd_microstep: 281.08 | bwd_inner_microstep: 280.79 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3457 [2025-01-21 14:36:38,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.66 | bwd_microstep: 262.78 | bwd_inner_microstep: 262.60 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6593 [2025-01-21 14:36:39,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.11 | bwd_microstep: 488.17 | bwd_inner_microstep: 487.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7877 [2025-01-21 14:36:40,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.88 | bwd_microstep: 590.36 | bwd_inner_microstep: 590.20 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:36:41,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.68 | bwd_microstep: 606.19 | bwd_inner_microstep: 605.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3614 [2025-01-21 14:36:42,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.18 | bwd_microstep: 269.49 | bwd_inner_microstep: 269.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6480 [2025-01-21 14:36:42,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.85 | bwd_microstep: 477.48 | bwd_inner_microstep: 477.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:36:43,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.72 | optimizer_step: 0.35 [2025-01-21 14:36:43,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.38 | bwd_microstep: 468.25 | bwd_inner_microstep: 460.56 | bwd_allreduce_microstep: 7.58 | step_microstep: 11.47 [2025-01-21 14:36:43,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3032.14 | bwd: 3443.92 | bwd_inner: 3435.07 | bwd_allreduce: 8.05 | step: 12.30 59%|█████▉ | 258/437 [28:28<19:51, 6.66s/it] {'loss': 0.4236, 'learning_rate': 1.5219233924687351e-05, 'epoch': 0.59} 59%|█████▉ | 258/437 [28:28<19:51, 6.66s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3518 [2025-01-21 14:36:44,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.67 | bwd_microstep: 262.12 | bwd_inner_microstep: 261.78 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 19, images per 
sample: 19.0, dynamic token length: 5267 [2025-01-21 14:36:45,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 344.13 | bwd_microstep: 383.63 | bwd_inner_microstep: 383.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6568 [2025-01-21 14:36:46,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.95 | bwd_microstep: 484.19 | bwd_inner_microstep: 483.93 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4139 [2025-01-21 14:36:46,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.28 | bwd_microstep: 307.97 | bwd_inner_microstep: 307.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7839 [2025-01-21 14:36:47,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.33 | bwd_microstep: 584.42 | bwd_inner_microstep: 584.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:36:48,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.26 | bwd_microstep: 479.29 | bwd_inner_microstep: 479.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:36:49,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.46 | bwd_microstep: 199.35 | bwd_inner_microstep: 199.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3741 [2025-01-21 14:36:49,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.16 | optimizer_gradients: 0.95 | optimizer_step: 0.36 [2025-01-21 14:36:49,749] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.61 | bwd_microstep: 292.80 | bwd_inner_microstep: 282.37 | bwd_allreduce_microstep: 10.31 | step_microstep: 17.72 [2025-01-21 14:36:49,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2627.52 | bwd: 2993.88 | bwd_inner: 2982.30 | bwd_allreduce: 10.77 | step: 18.47 59%|█████▉ | 259/437 [28:34<19:02, 6.42s/it] {'loss': 0.2787, 'learning_rate': 1.5075134557383931e-05, 'epoch': 0.59} 59%|█████▉ | 259/437 [28:34<19:02, 6.42s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3494 [2025-01-21 14:36:50,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.46 | bwd_microstep: 261.92 | bwd_inner_microstep: 261.74 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5308 [2025-01-21 14:36:51,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.68 | bwd_microstep: 387.06 | bwd_inner_microstep: 386.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8164 [2025-01-21 14:36:52,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.30 | bwd_microstep: 609.63 | bwd_inner_microstep: 609.46 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4708 [2025-01-21 14:36:52,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.05 | bwd_microstep: 345.11 | bwd_inner_microstep: 344.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3350 [2025-01-21 14:36:53,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.17 | bwd_microstep: 251.82 | bwd_inner_microstep: 251.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images 
per sample: 10.0, dynamic token length: 2808 [2025-01-21 14:36:53,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.28 | bwd_microstep: 208.22 | bwd_inner_microstep: 208.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 5885 [2025-01-21 14:36:54,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 347.10 | bwd_microstep: 434.76 | bwd_inner_microstep: 434.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:36:55,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:36:55,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 554.08 | bwd_microstep: 616.94 | bwd_inner_microstep: 609.06 | bwd_allreduce_microstep: 7.66 | step_microstep: 11.40 [2025-01-21 14:36:55,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2726.97 | bwd: 3115.59 | bwd_inner: 3106.72 | bwd_allreduce: 8.12 | step: 12.17 59%|█████▉ | 260/437 [28:40<18:37, 6.31s/it] {'loss': 0.588, 'learning_rate': 1.4931306841127691e-05, 'epoch': 0.59} 59%|█████▉ | 260/437 [28:40<18:37, 6.31s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3522 [2025-01-21 14:36:56,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.85 | bwd_microstep: 264.19 | bwd_inner_microstep: 263.92 | bwd_allreduce_microstep: 0.07 | 
step_microstep: 0.16 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5813 [2025-01-21 14:36:57,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.95 | bwd_microstep: 429.17 | bwd_inner_microstep: 428.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7914 [2025-01-21 14:36:58,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.79 | bwd_microstep: 590.75 | bwd_inner_microstep: 590.54 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4451 [2025-01-21 14:36:58,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.02 | bwd_microstep: 323.21 | bwd_inner_microstep: 323.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2815 [2025-01-21 14:36:59,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.76 | bwd_microstep: 211.59 | bwd_inner_microstep: 211.26 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:36:59,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 208.94 | bwd_microstep: 221.50 | bwd_inner_microstep: 221.27 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:37:00,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.91 | bwd_microstep: 517.44 | bwd_inner_microstep: 517.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6997 [2025-01-21 14:37:01,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.72 | optimizer_step: 
0.34 [2025-01-21 14:37:01,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 448.37 | bwd_microstep: 554.85 | bwd_inner_microstep: 519.04 | bwd_allreduce_microstep: 35.70 | step_microstep: 14.08 [2025-01-21 14:37:01,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2718.41 | bwd: 3112.85 | bwd_inner: 3075.78 | bwd_allreduce: 36.21 | step: 14.97 60%|█████▉ | 261/437 [28:46<18:18, 6.24s/it] {'loss': 0.2892, 'learning_rate': 1.4787758709323155e-05, 'epoch': 0.6} 60%|█████▉ | 261/437 [28:46<18:18, 6.24s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6170 [2025-01-21 14:37:02,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.26 | bwd_microstep: 458.00 | bwd_inner_microstep: 457.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:37:03,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.78 | bwd_microstep: 605.15 | bwd_inner_microstep: 604.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:37:05,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.84 | bwd_microstep: 604.10 | bwd_inner_microstep: 603.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5194 [2025-01-21 14:37:05,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.76 | bwd_microstep: 379.40 | bwd_inner_microstep: 379.24 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3069 [2025-01-21 14:37:06,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.43 | bwd_microstep: 226.80 | bwd_inner_microstep: 226.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:37:07,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.98 | bwd_microstep: 605.74 | bwd_inner_microstep: 605.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:37:08,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.07 | bwd_microstep: 416.03 | bwd_inner_microstep: 415.80 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:37:08,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
1.00 | optimizer_gradients: 0.82 | optimizer_step: 0.36 [2025-01-21 14:37:08,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.91 | bwd_microstep: 245.23 | bwd_inner_microstep: 224.02 | bwd_allreduce_microstep: 21.10 | step_microstep: 14.36 [2025-01-21 14:37:08,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3149.87 | bwd: 3540.58 | bwd_inner: 3518.23 | bwd_allreduce: 21.60 | step: 15.11 60%|█████▉ | 262/437 [28:53<18:47, 6.44s/it] {'loss': 0.5691, 'learning_rate': 1.4644498079953215e-05, 'epoch': 0.6} 60%|█████▉ | 262/437 [28:53<18:47, 6.44s/it]warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:10,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.95 | bwd_microstep: 605.07 | bwd_inner_microstep: 604.87 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7095 [2025-01-21 14:37:11,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.80 | bwd_microstep: 524.80 | bwd_inner_microstep: 524.63 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3623 [2025-01-21 14:37:11,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.97 | bwd_microstep: 266.01 | bwd_inner_microstep: 265.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3062 [2025-01-21 14:37:12,003] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.81 | bwd_microstep: 229.32 | bwd_inner_microstep: 229.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5431 [2025-01-21 14:37:12,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.07 | bwd_microstep: 399.14 | bwd_inner_microstep: 398.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:37:13,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 487.79 | bwd_microstep: 560.64 | bwd_inner_microstep: 560.33 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3368 [2025-01-21 14:37:14,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.69 | bwd_microstep: 253.79 | bwd_inner_microstep: 253.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4815 [2025-01-21 14:37:15,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.87 | optimizer_step: 0.35 [2025-01-21 14:37:15,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 312.34 | bwd_microstep: 362.42 | bwd_inner_microstep: 353.80 | bwd_allreduce_microstep: 8.42 | step_microstep: 13.29 [2025-01-21 14:37:15,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2825.27 | bwd: 3201.31 | bwd_inner: 3191.56 | bwd_allreduce: 8.94 | step: 14.07 60%|██████ | 263/437 [28:59<18:32, 6.39s/it] {'loss': 0.2227, 'learning_rate': 1.45015328551424e-05, 'epoch': 0.6} 60%|██████ | 263/437 [28:59<18:32, 6.39s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5941 [2025-01-21 14:37:15,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
383.68 | bwd_microstep: 434.50 | bwd_inner_microstep: 434.33 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6388 [2025-01-21 14:37:16,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 412.82 | bwd_microstep: 471.66 | bwd_inner_microstep: 471.48 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3694 [2025-01-21 14:37:17,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 243.19 | bwd_microstep: 274.17 | bwd_inner_microstep: 274.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5018 [2025-01-21 14:37:18,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.11 | bwd_microstep: 367.22 | bwd_inner_microstep: 366.73 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2604 [2025-01-21 14:37:18,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.85 | bwd_microstep: 208.67 | bwd_inner_microstep: 208.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4974 [2025-01-21 14:37:19,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.01 | bwd_microstep: 363.62 | bwd_inner_microstep: 363.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4685 [2025-01-21 14:37:19,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.48 | bwd_microstep: 345.19 | bwd_inner_microstep: 345.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:37:21,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:37:21,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.26 | bwd_microstep: 614.98 | bwd_inner_microstep: 607.12 | bwd_allreduce_microstep: 7.75 | step_microstep: 11.50 [2025-01-21 14:37:21,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2725.24 | bwd: 3080.19 | bwd_inner: 3070.98 | bwd_allreduce: 8.34 | step: 12.46 60%|██████ | 264/437 [29:05<18:07, 6.29s/it] {'loss': 0.5576, 'learning_rate': 1.4358870920720982e-05, 'epoch': 0.6} 60%|██████ | 264/437 [29:05<18:07, 6.29s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4447 [2025-01-21 14:37:21,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.57 | bwd_microstep: 323.44 | bwd_inner_microstep: 323.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7353 [2025-01-21 14:37:22,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.08 | bwd_microstep: 547.57 | bwd_inner_microstep: 547.39 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6021 [2025-01-21 14:37:23,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.43 | bwd_microstep: 440.39 | bwd_inner_microstep: 440.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, 
dynamic token length: 3069 [2025-01-21 14:37:24,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.33 | bwd_microstep: 222.21 | bwd_inner_microstep: 222.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5175 [2025-01-21 14:37:24,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.12 | bwd_microstep: 380.31 | bwd_inner_microstep: 380.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:26,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.18 | bwd_microstep: 605.80 | bwd_inner_microstep: 605.63 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:27,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.89 | bwd_microstep: 605.21 | bwd_inner_microstep: 605.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 
[2025-01-21 14:37:27,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.77 | optimizer_step: 0.34 [2025-01-21 14:37:27,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.67 | bwd_microstep: 261.54 | bwd_inner_microstep: 218.44 | bwd_allreduce_microstep: 42.99 | step_microstep: 13.97 [2025-01-21 14:37:27,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2965.11 | bwd: 3386.60 | bwd_inner: 3342.45 | bwd_allreduce: 43.46 | step: 14.73 61%|██████ | 265/437 [29:12<18:16, 6.37s/it] {'loss': 0.6571, 'learning_rate': 1.4216520145790027e-05, 'epoch': 0.61} 61%|██████ | 265/437 [29:12<18:16, 6.37s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6088 [2025-01-21 14:37:28,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 388.32 | bwd_microstep: 444.96 | bwd_inner_microstep: 444.79 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4026 [2025-01-21 14:37:29,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.84 | bwd_microstep: 296.68 | bwd_inner_microstep: 296.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3665 [2025-01-21 14:37:29,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.91 | bwd_microstep: 271.89 | bwd_inner_microstep: 271.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4675 [2025-01-21 14:37:30,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.74 | bwd_microstep: 344.39 | bwd_inner_microstep: 344.21 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:37:31,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.37 | bwd_microstep: 607.25 | bwd_inner_microstep: 606.90 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6480 [2025-01-21 14:37:32,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.49 | bwd_microstep: 477.36 | bwd_inner_microstep: 477.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:37:33,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 374.74 | bwd_microstep: 417.88 | bwd_inner_microstep: 417.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4758 [2025-01-21 14:37:34,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.09 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 14:37:34,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.83 | bwd_microstep: 366.51 | bwd_inner_microstep: 351.29 | bwd_allreduce_microstep: 14.99 | step_microstep: 14.09 [2025-01-21 14:37:34,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2865.06 | bwd: 3227.04 | bwd_inner: 3210.61 | bwd_allreduce: 15.45 | step: 14.93 61%|██████ | 266/437 [29:18<18:07, 6.36s/it] {'loss': 0.4547, 'learning_rate': 1.4074488382287324e-05, 'epoch': 0.61} 61%|██████ | 266/437 [29:18<18:07, 6.36s/it]dynamic ViT batch size: 30, images per sample: 30.0, 
dynamic token length: 8192 [2025-01-21 14:37:35,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 534.68 | bwd_microstep: 613.43 | bwd_inner_microstep: 613.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4249 [2025-01-21 14:37:35,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.63 | bwd_microstep: 312.99 | bwd_inner_microstep: 312.77 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3132 [2025-01-21 14:37:36,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.10 | bwd_microstep: 244.48 | bwd_inner_microstep: 244.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2241 [2025-01-21 14:37:36,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.23 | bwd_microstep: 201.43 | bwd_inner_microstep: 201.11 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:37:37,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.08 | bwd_microstep: 603.85 | bwd_inner_microstep: 603.63 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7667 [2025-01-21 14:37:38,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 487.86 | bwd_microstep: 570.73 | bwd_inner_microstep: 570.56 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5490 [2025-01-21 14:37:39,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 355.51 | bwd_microstep: 406.12 | bwd_inner_microstep: 405.96 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch 
size: 18, images per sample: 18.0, dynamic token length: 5034 [2025-01-21 14:37:40,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:37:40,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.52 | bwd_microstep: 376.19 | bwd_inner_microstep: 368.59 | bwd_allreduce_microstep: 7.41 | step_microstep: 10.93 [2025-01-21 14:37:40,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2886.45 | bwd: 3329.36 | bwd_inner: 3320.65 | bwd_allreduce: 7.92 | step: 11.72 61%|██████ | 267/437 [29:25<18:05, 6.39s/it] {'loss': 0.3705, 'learning_rate': 1.3932783464554286e-05, 'epoch': 0.61} 61%|██████ | 267/437 [29:25<18:05, 6.39s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7491 [2025-01-21 14:37:41,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.25 | bwd_microstep: 558.80 | bwd_inner_microstep: 558.59 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.14 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5810 [2025-01-21 14:37:42,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 381.38 | bwd_microstep: 425.40 | bwd_inner_microstep: 425.21 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:43,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.58 | bwd_microstep: 606.06 | bwd_inner_microstep: 605.84 | 
bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3092 [2025-01-21 14:37:44,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.74 | bwd_microstep: 241.19 | bwd_inner_microstep: 240.71 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:45,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.90 | bwd_microstep: 609.97 | bwd_inner_microstep: 609.64 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:37:46,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.45 | bwd_microstep: 477.86 | bwd_inner_microstep: 477.53 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:37:47,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.64 | bwd_microstep: 563.18 | bwd_inner_microstep: 563.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:37:47,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.65 | optimizer_step: 0.36 [2025-01-21 14:37:47,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.16 | bwd_microstep: 327.19 | 
bwd_inner_microstep: 319.25 | bwd_allreduce_microstep: 7.72 | step_microstep: 11.13 [2025-01-21 14:37:47,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3363.94 | bwd: 3809.83 | bwd_inner: 3800.33 | bwd_allreduce: 8.32 | step: 12.14 61%|██████▏ | 268/437 [29:32<18:51, 6.69s/it] {'loss': 0.4904, 'learning_rate': 1.379141320890381e-05, 'epoch': 0.61} 61%|██████▏ | 268/437 [29:32<18:51, 6.69s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3857 [2025-01-21 14:37:48,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.03 | bwd_microstep: 286.17 | bwd_inner_microstep: 286.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5824 [2025-01-21 14:37:49,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.05 | bwd_microstep: 427.93 | bwd_inner_microstep: 427.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2637 [2025-01-21 14:37:49,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.53 | bwd_microstep: 219.17 | bwd_inner_microstep: 218.99 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5009 [2025-01-21 14:37:50,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.66 | bwd_microstep: 369.67 | bwd_inner_microstep: 369.48 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:51,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.95 | bwd_microstep: 610.41 | bwd_inner_microstep: 610.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6811 [2025-01-21 14:37:52,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.97 | bwd_microstep: 504.01 | bwd_inner_microstep: 503.81 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6243 [2025-01-21 14:37:53,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.65 | bwd_microstep: 461.11 | bwd_inner_microstep: 460.95 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:37:54,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 14:37:54,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 405.49 | bwd_microstep: 471.12 | bwd_inner_microstep: 463.07 | bwd_allreduce_microstep: 7.93 | step_microstep: 11.94 [2025-01-21 14:37:54,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2936.17 | bwd: 3349.73 | bwd_inner: 3340.54 | bwd_allreduce: 8.43 | step: 12.78 62%|██████▏ | 269/437 [29:39<18:35, 6.64s/it] {'loss': 0.4299, 'learning_rate': 1.3650385413189151e-05, 'epoch': 0.61} 62%|██████▏ | 269/437 [29:39<18:35, 6.64s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3716 [2025-01-21 14:37:54,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.80 | bwd_microstep: 276.89 | bwd_inner_microstep: 276.73 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must 
match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:37:56,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.35 | bwd_microstep: 607.92 | bwd_inner_microstep: 607.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6575 [2025-01-21 14:37:57,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.96 | bwd_microstep: 484.33 | bwd_inner_microstep: 484.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7891 [2025-01-21 14:37:58,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.45 | bwd_microstep: 587.52 | bwd_inner_microstep: 587.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3355 [2025-01-21 14:37:58,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.43 | bwd_microstep: 256.25 | bwd_inner_microstep: 256.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 
896]) [2025-01-21 14:37:59,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.59 | bwd_microstep: 608.08 | bwd_inner_microstep: 607.87 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:38:00,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.02 | bwd_microstep: 262.89 | bwd_inner_microstep: 262.57 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3834 [2025-01-21 14:38:01,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:38:01,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 247.62 | bwd_microstep: 292.49 | bwd_inner_microstep: 284.74 | bwd_allreduce_microstep: 7.56 | step_microstep: 11.38 [2025-01-21 14:38:01,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2986.05 | bwd: 3376.52 | bwd_inner: 3367.64 | bwd_allreduce: 8.05 | step: 12.18 62%|██████▏ | 270/437 [29:45<18:26, 6.63s/it] {'loss': 0.4775, 'learning_rate': 1.3509707856373779e-05, 'epoch': 0.62} 62%|██████▏ | 270/437 [29:45<18:26, 6.63s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7494 [2025-01-21 14:38:02,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 483.30 | bwd_microstep: 557.34 | bwd_inner_microstep: 557.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3622 [2025-01-21 14:38:02,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.05 | bwd_microstep: 270.47 | bwd_inner_microstep: 270.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:38:03,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.51 | bwd_microstep: 604.99 | bwd_inner_microstep: 604.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7590 [2025-01-21 14:38:04,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.38 | bwd_microstep: 564.16 | bwd_inner_microstep: 563.84 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3869 [2025-01-21 14:38:05,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 258.10 | bwd_microstep: 285.11 | bwd_inner_microstep: 284.90 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:38:05,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.54 | bwd_microstep: 262.81 | bwd_inner_microstep: 262.33 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:38:07,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 534.10 | bwd_microstep: 613.94 | bwd_inner_microstep: 613.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4209 [2025-01-21 14:38:07,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.67 | optimizer_step: 0.36 [2025-01-21 
14:38:07,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.33 | bwd_microstep: 318.64 | bwd_inner_microstep: 310.83 | bwd_allreduce_microstep: 7.70 | step_microstep: 11.18 [2025-01-21 14:38:07,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3061.14 | bwd: 3477.64 | bwd_inner: 3468.36 | bwd_allreduce: 8.30 | step: 12.17 62%|██████▏ | 271/437 [29:52<18:27, 6.67s/it] {'loss': 0.2769, 'learning_rate': 1.3369388298102312e-05, 'epoch': 0.62} 62%|██████▏ | 271/437 [29:52<18:27, 6.67s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4547 [2025-01-21 14:38:08,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.90 | bwd_microstep: 334.31 | bwd_inner_microstep: 334.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2920 [2025-01-21 14:38:08,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.12 | bwd_microstep: 221.33 | bwd_inner_microstep: 221.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2328 [2025-01-21 14:38:09,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.28 | bwd_microstep: 201.88 | bwd_inner_microstep: 201.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3086 [2025-01-21 14:38:09,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.68 | bwd_microstep: 242.45 | bwd_inner_microstep: 242.29 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:38:10,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.76 | bwd_microstep: 204.46 | bwd_inner_microstep: 204.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch 
size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:38:10,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 318.86 | bwd_microstep: 360.02 | bwd_inner_microstep: 359.70 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:38:11,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.00 | bwd_microstep: 266.19 | bwd_inner_microstep: 266.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4272 [2025-01-21 14:38:12,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 2.86 | optimizer_gradients: 0.79 | optimizer_step: 0.34 [2025-01-21 14:38:12,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 279.88 | bwd_microstep: 337.96 | bwd_inner_microstep: 315.22 | bwd_allreduce_microstep: 22.63 | step_microstep: 17.58 [2025-01-21 14:38:12,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 1893.31 | bwd: 2168.73 | bwd_inner: 2144.87 | bwd_allreduce: 23.09 | step: 18.37 62%|██████▏ | 272/437 [29:56<16:23, 5.96s/it] {'loss': 0.2478, 'learning_rate': 1.3229434478272492e-05, 'epoch': 0.62} 62%|██████▏ | 272/437 [29:56<16:23, 5.96s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6764 [2025-01-21 14:38:13,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 432.59 | bwd_microstep: 498.67 | bwd_inner_microstep: 498.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6991 [2025-01-21 14:38:14,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 449.85 | bwd_microstep: 518.31 | bwd_inner_microstep: 518.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token 
length: 2903 [2025-01-21 14:38:14,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.47 | bwd_microstep: 216.81 | bwd_inner_microstep: 216.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3919 [2025-01-21 14:38:15,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.64 | bwd_microstep: 291.63 | bwd_inner_microstep: 291.46 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2556 [2025-01-21 14:38:15,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.67 | bwd_microstep: 210.14 | bwd_inner_microstep: 209.97 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5206 [2025-01-21 14:38:16,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.69 | bwd_microstep: 381.73 | bwd_inner_microstep: 381.42 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4400 [2025-01-21 14:38:16,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.24 | bwd_microstep: 323.21 | bwd_inner_microstep: 323.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:38:18,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:38:18,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.14 | bwd_microstep: 596.78 | bwd_inner_microstep: 579.54 | bwd_allreduce_microstep: 17.13 | step_microstep: 13.28 [2025-01-21 14:38:18,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2660.13 | bwd: 3037.40 | bwd_inner: 3019.05 | bwd_allreduce: 17.59 | step: 
14.06 62%|██████▏ | 273/437 [30:02<16:15, 5.95s/it] {'loss': 0.3056, 'learning_rate': 1.3089854116608279e-05, 'epoch': 0.62} 62%|██████▏ | 273/437 [30:02<16:15, 5.95s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3950 [2025-01-21 14:38:18,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.03 | bwd_microstep: 292.50 | bwd_inner_microstep: 292.34 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7637 [2025-01-21 14:38:19,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.53 | bwd_microstep: 567.42 | bwd_inner_microstep: 567.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2304 [2025-01-21 14:38:20,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.74 | bwd_microstep: 201.51 | bwd_inner_microstep: 201.35 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5490 [2025-01-21 14:38:20,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 360.60 | bwd_microstep: 406.06 | bwd_inner_microstep: 405.88 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6532 [2025-01-21 14:38:21,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.37 | bwd_microstep: 482.78 | bwd_inner_microstep: 482.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:38:22,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.65 | bwd_microstep: 607.91 | bwd_inner_microstep: 607.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:38:23,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.19 | bwd_microstep: 261.39 | bwd_inner_microstep: 261.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2878 [2025-01-21 14:38:23,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.43 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:38:23,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.80 | bwd_microstep: 273.69 | bwd_inner_microstep: 247.05 | bwd_allreduce_microstep: 26.39 | step_microstep: 18.64 [2025-01-21 14:38:23,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2630.77 | bwd: 3093.38 | bwd_inner: 3065.79 | bwd_allreduce: 26.81 | step: 19.43 63%|██████▎ | 274/437 [30:08<16:10, 5.95s/it] {'loss': 0.4498, 'learning_rate': 1.295065491223401e-05, 'epoch': 0.63} 63%|██████▎ | 274/437 [30:08<16:10, 5.95s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4672 [2025-01-21 14:38:24,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 302.33 | bwd_microstep: 344.22 | bwd_inner_microstep: 343.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7656 [2025-01-21 14:38:25,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 495.41 | bwd_microstep: 567.84 | bwd_inner_microstep: 567.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 18, 
images per sample: 18.0, dynamic token length: 4988 [2025-01-21 14:38:26,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.03 | bwd_microstep: 362.90 | bwd_inner_microstep: 362.55 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3389 [2025-01-21 14:38:26,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.79 | bwd_microstep: 255.82 | bwd_inner_microstep: 255.60 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4674 [2025-01-21 14:38:27,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.26 | bwd_microstep: 345.63 | bwd_inner_microstep: 345.42 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5714 [2025-01-21 14:38:28,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 383.97 | bwd_microstep: 420.66 | bwd_inner_microstep: 420.31 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.17 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7902 [2025-01-21 14:38:29,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.74 | bwd_microstep: 588.04 | bwd_inner_microstep: 587.83 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:38:30,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:38:30,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.04 | bwd_microstep: 613.00 | bwd_inner_microstep: 605.59 | bwd_allreduce_microstep: 7.20 | step_microstep: 10.73 [2025-01-21 14:38:30,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3101.33 | bwd: 3498.28 | bwd_inner: 3489.47 | bwd_allreduce: 7.76 | step: 11.69 63%|██████▎ | 275/437 [30:15<16:49, 6.23s/it] {'loss': 0.5963, 'learning_rate': 1.2811844543249748e-05, 'epoch': 0.63} 63%|██████▎ | 275/437 [30:15<16:49, 6.23s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2919 [2025-01-21 14:38:31,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.48 | bwd_microstep: 217.58 | bwd_inner_microstep: 217.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6045 [2025-01-21 14:38:32,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.92 | bwd_microstep: 441.63 | bwd_inner_microstep: 441.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2566 [2025-01-21 14:38:32,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.88 | bwd_microstep: 207.70 | bwd_inner_microstep: 207.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4939 [2025-01-21 14:38:33,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.86 | bwd_microstep: 360.95 | bwd_inner_microstep: 360.60 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.21 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4135 [2025-01-21 14:38:33,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.84 | bwd_microstep: 301.78 | 
bwd_inner_microstep: 301.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:38:35,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.36 | bwd_microstep: 605.61 | bwd_inner_microstep: 605.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:38:35,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.15 | bwd_microstep: 304.62 | bwd_inner_microstep: 304.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3851 [2025-01-21 14:38:36,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.09 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:38:36,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.66 | bwd_microstep: 301.10 | bwd_inner_microstep: 288.90 | bwd_allreduce_microstep: 11.98 | step_microstep: 13.99 [2025-01-21 14:38:36,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2424.00 | bwd: 2741.13 | bwd_inner: 2727.78 | bwd_allreduce: 12.49 | step: 14.89 63%|██████▎ | 276/437 [30:20<16:02, 5.98s/it] {'loss': 0.4305, 'learning_rate': 1.2673430666307738e-05, 'epoch': 0.63} 63%|██████▎ | 276/437 [30:20<16:02, 5.98s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3785 [2025-01-21 14:38:36,788] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 244.78 | bwd_microstep: 281.04 | bwd_inner_microstep: 280.87 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3737 [2025-01-21 14:38:37,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.87 | bwd_microstep: 280.61 | bwd_inner_microstep: 280.39 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:38:38,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.73 | bwd_microstep: 606.59 | bwd_inner_microstep: 606.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5980 [2025-01-21 14:38:39,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.68 | bwd_microstep: 435.02 | bwd_inner_microstep: 434.69 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:38:39,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.29 | bwd_microstep: 199.87 | bwd_inner_microstep: 199.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:38:40,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.73 | bwd_microstep: 246.17 | bwd_inner_microstep: 246.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 
dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4846 [2025-01-21 14:38:40,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.84 | bwd_microstep: 353.63 | bwd_inner_microstep: 353.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6089 [2025-01-21 14:38:41,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:38:41,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.75 | bwd_microstep: 489.90 | bwd_inner_microstep: 445.75 | bwd_allreduce_microstep: 44.01 | step_microstep: 13.77 [2025-01-21 14:38:41,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2525.50 | bwd: 2892.96 | bwd_inner: 2847.65 | bwd_allreduce: 44.53 | step: 14.57 63%|██████▎ | 277/437 [30:26<15:40, 5.88s/it] {'loss': 0.3375, 'learning_rate': 1.2535420916190106e-05, 'epoch': 0.63} 63%|██████▎ | 277/437 [30:26<15:40, 5.88s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5951 [2025-01-21 14:38:42,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 383.91 | bwd_microstep: 433.59 | bwd_inner_microstep: 433.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6637 [2025-01-21 14:38:43,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 429.81 | bwd_microstep: 486.77 | bwd_inner_microstep: 486.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4471 [2025-01-21 14:38:44,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.61 | bwd_microstep: 325.77 | bwd_inner_microstep: 325.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, 
dynamic token length: 7879 [2025-01-21 14:38:45,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.68 | bwd_microstep: 589.93 | bwd_inner_microstep: 589.64 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5201 [2025-01-21 14:38:46,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 344.08 | bwd_microstep: 378.71 | bwd_inner_microstep: 378.51 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:38:46,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.52 | bwd_microstep: 242.85 | bwd_inner_microstep: 242.53 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5857 [2025-01-21 14:38:47,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 370.66 | bwd_microstep: 433.00 | bwd_inner_microstep: 432.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7918 [2025-01-21 14:38:48,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.82 | optimizer_step: 0.35 [2025-01-21 14:38:48,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.94 | bwd_microstep: 603.17 | bwd_inner_microstep: 592.90 | bwd_allreduce_microstep: 10.11 | step_microstep: 14.25 [2025-01-21 14:38:48,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3072.04 | bwd: 3493.91 | bwd_inner: 3482.42 | bwd_allreduce: 10.59 | step: 15.09 64%|██████▎ | 278/437 [30:33<16:18, 6.16s/it] {'loss': 0.3299, 'learning_rate': 1.2397822905387707e-05, 'epoch': 0.64} 64%|██████▎ | 278/437 [30:33<16:18, 6.16s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4231 [2025-01-21 14:38:49,303] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.60 | bwd_microstep: 311.91 | bwd_inner_microstep: 311.75 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5233 [2025-01-21 14:38:50,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.98 | bwd_microstep: 383.18 | bwd_inner_microstep: 382.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3605 [2025-01-21 14:38:50,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.47 | bwd_microstep: 269.44 | bwd_inner_microstep: 269.23 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:38:51,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.66 | bwd_microstep: 605.97 | bwd_inner_microstep: 605.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:38:52,960] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.41 | bwd_microstep: 605.70 | bwd_inner_microstep: 605.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3033 [2025-01-21 14:38:53,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.83 | bwd_microstep: 227.03 | bwd_inner_microstep: 226.71 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:38:54,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.06 | bwd_microstep: 318.94 | bwd_inner_microstep: 318.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7961 [2025-01-21 14:38:55,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.66 | optimizer_step: 0.37 [2025-01-21 14:38:55,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 505.76 | bwd_microstep: 601.46 | bwd_inner_microstep: 594.15 | bwd_allreduce_microstep: 7.21 | step_microstep: 10.90 [2025-01-21 14:38:55,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2964.63 | bwd: 3323.76 | bwd_inner: 3315.23 | bwd_allreduce: 7.69 | step: 11.70 64%|██████▍ | 279/437 [30:39<16:29, 6.26s/it] {'loss': 0.537, 'learning_rate': 1.2260644223680228e-05, 'epoch': 0.64} 64%|██████▍ | 279/437 [30:39<16:29, 6.26s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:38:56,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.17 | bwd_microstep: 605.33 | bwd_inner_microstep: 605.12 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4436 [2025-01-21 14:38:57,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.64 | bwd_microstep: 323.80 | bwd_inner_microstep: 323.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3910 [2025-01-21 14:38:57,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.42 | bwd_microstep: 290.55 | bwd_inner_microstep: 290.23 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7315 [2025-01-21 14:38:58,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.26 | bwd_microstep: 546.48 | bwd_inner_microstep: 546.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5186 [2025-01-21 14:38:59,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.36 | bwd_microstep: 379.27 | bwd_inner_microstep: 378.97 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:00,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.69 
| bwd_microstep: 605.44 | bwd_inner_microstep: 605.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:39:01,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 388.45 | bwd_microstep: 434.66 | bwd_inner_microstep: 434.47 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:39:01,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.06 | optimizer_gradients: 0.84 | optimizer_step: 0.36 [2025-01-21 14:39:01,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.91 | bwd_microstep: 240.76 | bwd_inner_microstep: 230.79 | bwd_allreduce_microstep: 9.85 | step_microstep: 14.45 [2025-01-21 14:39:01,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3073.76 | bwd: 3426.42 | bwd_inner: 3415.21 | bwd_allreduce: 10.34 | step: 15.24 64%|██████▍ | 280/437 [30:46<16:45, 6.40s/it] {'loss': 0.2732, 'learning_rate': 1.212389243771756e-05, 'epoch': 0.64} 64%|██████▍ | 280/437 [30:46<16:45, 6.40s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5581 [2025-01-21 14:39:02,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.14 | bwd_microstep: 412.67 | bwd_inner_microstep: 412.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4986 [2025-01-21 14:39:03,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.57 | bwd_microstep: 363.71 | bwd_inner_microstep: 363.47 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7076 [2025-01-21 14:39:04,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.07 | bwd_microstep: 524.84 | bwd_inner_microstep: 524.61 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7591 [2025-01-21 14:39:05,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.88 | bwd_microstep: 566.82 | bwd_inner_microstep: 566.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:06,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.98 | bwd_microstep: 604.28 | bwd_inner_microstep: 604.10 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:39:07,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.87 | bwd_microstep: 606.06 | bwd_inner_microstep: 605.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:39:08,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.51 | bwd_microstep: 520.40 | bwd_inner_microstep: 520.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4680 [2025-01-21 14:39:09,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.80 | optimizer_step: 0.36 [2025-01-21 14:39:09,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.60 | bwd_microstep: 355.09 | 
bwd_inner_microstep: 347.17 | bwd_allreduce_microstep: 7.81 | step_microstep: 11.86 [2025-01-21 14:39:09,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3460.43 | bwd: 3953.99 | bwd_inner: 3944.96 | bwd_allreduce: 8.31 | step: 12.73 64%|██████▍ | 281/437 [30:54<17:37, 6.78s/it] {'loss': 0.325, 'learning_rate': 1.1987575090602408e-05, 'epoch': 0.64} 64%|██████▍ | 281/437 [30:54<17:37, 6.78s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7340 [2025-01-21 14:39:10,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 465.69 | bwd_microstep: 548.94 | bwd_inner_microstep: 548.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.01 | bwd_microstep: 604.16 | bwd_inner_microstep: 603.98 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:13,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.10 | bwd_microstep: 
607.12 | bwd_inner_microstep: 606.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3359 [2025-01-21 14:39:13,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.59 | bwd_microstep: 258.03 | bwd_inner_microstep: 257.72 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:14,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.40 | bwd_microstep: 605.32 | bwd_inner_microstep: 605.03 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2543 [2025-01-21 14:39:15,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.34 | bwd_microstep: 205.20 | bwd_inner_microstep: 205.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:39:15,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 172.97 | bwd_microstep: 208.07 | bwd_inner_microstep: 207.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of 
tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:16,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:39:16,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.14 | bwd_microstep: 617.71 | bwd_inner_microstep: 609.79 | bwd_allreduce_microstep: 7.74 | step_microstep: 11.55 [2025-01-21 14:39:16,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3239.07 | bwd: 3654.67 | bwd_inner: 3645.62 | bwd_allreduce: 8.22 | step: 12.34 65%|██████▍ | 282/437 [31:01<17:47, 6.89s/it] {'loss': 1.0755, 'learning_rate': 1.185169970147424e-05, 'epoch': 0.64} 65%|██████▍ | 282/437 [31:01<17:47, 6.89s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5642 [2025-01-21 14:39:17,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 368.77 | bwd_microstep: 416.73 | bwd_inner_microstep: 416.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6157 [2025-01-21 14:39:18,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.29 | bwd_microstep: 458.52 | bwd_inner_microstep: 458.19 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3357 [2025-01-21 14:39:18,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.26 | bwd_microstep: 257.12 | bwd_inner_microstep: 256.91 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 
warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:20,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.20 | bwd_microstep: 604.61 | bwd_inner_microstep: 604.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3346 [2025-01-21 14:39:20,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.88 | bwd_microstep: 256.00 | bwd_inner_microstep: 255.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5450 [2025-01-21 14:39:21,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.85 | bwd_microstep: 401.70 | bwd_inner_microstep: 401.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:22,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.03 | bwd_microstep: 610.80 | bwd_inner_microstep: 610.63 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:39:23,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:39:23,560] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 438.48 | bwd_microstep: 505.05 | bwd_inner_microstep: 497.70 | bwd_allreduce_microstep: 7.24 | step_microstep: 10.79 [2025-01-21 14:39:23,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3105.61 | bwd: 3510.66 | bwd_inner: 3502.18 | bwd_allreduce: 7.72 | step: 11.57 65%|██████▍ | 283/437 [31:08<17:38, 6.87s/it] {'loss': 0.6557, 'learning_rate': 1.1716273765094517e-05, 'epoch': 0.65} 65%|██████▍ | 283/437 [31:08<17:38, 6.87s/it]dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4833 [2025-01-21 14:39:24,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.11 | bwd_microstep: 352.91 | bwd_inner_microstep: 352.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3194 [2025-01-21 14:39:24,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 217.15 | bwd_microstep: 243.89 | bwd_inner_microstep: 243.55 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5554 [2025-01-21 14:39:25,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.86 | bwd_microstep: 411.94 | bwd_inner_microstep: 411.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3118 [2025-01-21 14:39:26,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.75 | bwd_microstep: 243.61 | bwd_inner_microstep: 243.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2826 [2025-01-21 14:39:26,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.25 | bwd_microstep: 222.73 | bwd_inner_microstep: 222.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 
8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:27,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.75 | bwd_microstep: 605.26 | bwd_inner_microstep: 605.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:39:28,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.32 | bwd_microstep: 356.86 | bwd_inner_microstep: 356.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7860 [2025-01-21 14:39:29,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.77 | optimizer_step: 0.35 [2025-01-21 14:39:29,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.88 | bwd_microstep: 602.40 | bwd_inner_microstep: 587.15 | bwd_allreduce_microstep: 15.12 | step_microstep: 14.10 [2025-01-21 14:39:29,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2691.91 | bwd: 3039.72 | bwd_inner: 3023.34 | bwd_allreduce: 15.60 | step: 14.93 65%|██████▍ | 284/437 [31:14<16:49, 6.60s/it] {'loss': 0.4481, 'learning_rate': 1.1581304751433305e-05, 'epoch': 0.65} 65%|██████▍ | 284/437 [31:14<16:49, 6.60s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2388 [2025-01-21 14:39:29,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.46 | bwd_microstep: 203.29 | bwd_inner_microstep: 203.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch 
size: 16, images per sample: 16.0, dynamic token length: 4465 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:39:30,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.43 | bwd_microstep: 326.49 | bwd_inner_microstep: 326.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6565 [2025-01-21 14:39:31,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 424.73 | bwd_microstep: 484.33 | bwd_inner_microstep: 484.12 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4696 [2025-01-21 14:39:32,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.70 | bwd_microstep: 342.78 | bwd_inner_microstep: 342.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.60 | bwd_microstep: 606.07 | bwd_inner_microstep: 605.92 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3046 [2025-01-21 14:39:33,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.12 | bwd_microstep: 225.88 | bwd_inner_microstep: 225.58 | 
bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:39:34,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 306.06 | bwd_microstep: 340.00 | bwd_inner_microstep: 339.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:39:35,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.06 | optimizer_gradients: 0.79 | optimizer_step: 0.35 [2025-01-21 14:39:35,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.93 | bwd_microstep: 899.84 | bwd_inner_microstep: 400.54 | bwd_allreduce_microstep: 499.19 | step_microstep: 14.10 [2025-01-21 14:39:35,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2613.84 | bwd: 3428.81 | bwd_inner: 2928.37 | bwd_allreduce: 499.66 | step: 14.87 65%|██████▌ | 285/437 [31:20<16:28, 6.50s/it] {'loss': 0.3972, 'learning_rate': 1.1446800105257232e-05, 'epoch': 0.65} 65%|██████▌ | 285/437 [31:20<16:28, 6.50s/it]warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:36,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.60 | bwd_microstep: 604.39 | bwd_inner_microstep: 604.22 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3183 [2025-01-21 14:39:37,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
214.63 | bwd_microstep: 242.49 | bwd_inner_microstep: 242.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2584 [2025-01-21 14:39:37,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.42 | bwd_microstep: 207.73 | bwd_inner_microstep: 207.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3646 [2025-01-21 14:39:38,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.64 | bwd_microstep: 273.31 | bwd_inner_microstep: 273.14 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6518 [2025-01-21 14:39:39,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.95 | bwd_microstep: 480.57 | bwd_inner_microstep: 480.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4389 [2025-01-21 14:39:39,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.88 | bwd_microstep: 321.56 | bwd_inner_microstep: 321.39 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:41,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.48 | bwd_microstep: 606.22 | bwd_inner_microstep: 606.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, 
images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:39:42,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:39:42,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.14 | bwd_microstep: 553.25 | bwd_inner_microstep: 320.09 | bwd_allreduce_microstep: 233.05 | step_microstep: 13.25 [2025-01-21 14:39:42,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2721.60 | bwd: 3289.64 | bwd_inner: 3055.43 | bwd_allreduce: 233.53 | step: 14.02 65%|██████▌ | 286/437 [31:26<16:09, 6.42s/it] {'loss': 0.3317, 'learning_rate': 1.1312767245718836e-05, 'epoch': 0.65} 65%|██████▌ | 286/437 [31:26<16:09, 6.42s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5804 [2025-01-21 14:39:42,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.55 | bwd_microstep: 425.71 | bwd_inner_microstep: 425.48 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6060 [2025-01-21 14:39:43,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.23 | bwd_microstep: 442.66 | bwd_inner_microstep: 442.48 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7106 [2025-01-21 14:39:44,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.65 | bwd_microstep: 526.75 | bwd_inner_microstep: 526.58 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4693 [2025-01-21 14:39:45,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.42 | bwd_microstep: 343.62 | bwd_inner_microstep: 343.42 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7607 
[2025-01-21 14:39:46,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.90 | bwd_microstep: 567.05 | bwd_inner_microstep: 566.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5188 [2025-01-21 14:39:47,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.44 | bwd_microstep: 379.87 | bwd_inner_microstep: 379.65 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4733 [2025-01-21 14:39:47,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.59 | bwd_microstep: 345.72 | bwd_inner_microstep: 345.42 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5234 [2025-01-21 14:39:48,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:39:48,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.03 | bwd_microstep: 391.50 | bwd_inner_microstep: 383.93 | bwd_allreduce_microstep: 7.35 | step_microstep: 11.31 [2025-01-21 14:39:48,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3025.63 | bwd: 3423.01 | bwd_inner: 3414.33 | bwd_allreduce: 7.83 | step: 12.13 66%|██████▌ | 287/437 [31:33<16:14, 6.50s/it] {'loss': 0.2758, 'learning_rate': 1.1179213565947366e-05, 'epoch': 0.66} 66%|██████▌ | 287/437 [31:33<16:14, 6.50s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3226 [2025-01-21 14:39:49,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.52 | bwd_microstep: 242.74 | bwd_inner_microstep: 242.40 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7409 [2025-01-21 14:39:50,262] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 480.24 | bwd_microstep: 551.89 | bwd_inner_microstep: 551.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6854 [2025-01-21 14:39:51,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.96 | bwd_microstep: 506.82 | bwd_inner_microstep: 506.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7626 [2025-01-21 14:39:52,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.31 | bwd_microstep: 568.03 | bwd_inner_microstep: 567.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2831 [2025-01-21 14:39:52,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 191.61 | bwd_microstep: 215.67 | bwd_inner_microstep: 215.34 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7585 [2025-01-21 14:39:53,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.43 | bwd_microstep: 564.92 | bwd_inner_microstep: 564.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:39:54,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.72 | bwd_microstep: 603.94 | bwd_inner_microstep: 603.78 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:39:55,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.66 | optimizer_step: 0.33 [2025-01-21 14:39:55,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.17 | bwd_microstep: 229.63 | 
bwd_inner_microstep: 222.21 | bwd_allreduce_microstep: 7.32 | step_microstep: 11.08 [2025-01-21 14:39:55,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3023.80 | bwd: 3483.76 | bwd_inner: 3475.13 | bwd_allreduce: 7.77 | step: 11.83 66%|██████▌ | 288/437 [31:40<16:18, 6.57s/it] {'loss': 0.2696, 'learning_rate': 1.1046146432640923e-05, 'epoch': 0.66} 66%|██████▌ | 288/437 [31:40<16:18, 6.57s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2709 [2025-01-21 14:39:55,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.13 | bwd_microstep: 209.00 | bwd_inner_microstep: 208.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3936 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:39:56,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.58 | bwd_microstep: 292.66 | bwd_inner_microstep: 292.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2320 [2025-01-21 14:39:56,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.42 | bwd_microstep: 199.30 | bwd_inner_microstep: 199.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2555 [2025-01-21 14:39:57,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.71 | bwd_microstep: 206.43 | bwd_inner_microstep: 206.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3859 [2025-01-21 14:39:57,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.93 | bwd_microstep: 287.90 | 
bwd_inner_microstep: 287.68 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4910 [2025-01-21 14:39:58,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.50 | bwd_microstep: 360.59 | bwd_inner_microstep: 360.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:39:59,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.33 | bwd_microstep: 498.47 | bwd_inner_microstep: 498.26 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:40:00,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.06 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:40:00,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.53 | bwd_microstep: 896.95 | bwd_inner_microstep: 607.55 | bwd_allreduce_microstep: 289.18 | step_microstep: 13.24 [2025-01-21 14:40:00,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2337.95 | bwd: 2951.43 | bwd_inner: 2661.02 | bwd_allreduce: 289.62 | step: 14.02 66%|██████▌ | 289/437 [31:45<15:25, 6.25s/it] {'loss': 0.3465, 'learning_rate': 1.0913573185660167e-05, 'epoch': 0.66} 66%|██████▌ | 289/437 [31:45<15:25, 6.25s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3021 [2025-01-21 14:40:01,406] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 195.56 | bwd_microstep: 224.48 | bwd_inner_microstep: 224.31 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2878 [2025-01-21 14:40:01,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.30 | bwd_microstep: 211.57 | bwd_inner_microstep: 211.25 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2861 [2025-01-21 14:40:02,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.45 | bwd_microstep: 218.98 | bwd_inner_microstep: 218.72 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4135 [2025-01-21 14:40:02,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.64 | bwd_microstep: 301.67 | bwd_inner_microstep: 301.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8079 [2025-01-21 14:40:04,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.74 | bwd_microstep: 604.45 | bwd_inner_microstep: 604.14 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:40:05,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.81 | bwd_microstep: 607.59 | bwd_inner_microstep: 607.38 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6860 [2025-01-21 14:40:06,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 436.36 | bwd_microstep: 
507.34 | bwd_inner_microstep: 507.15 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5331 [2025-01-21 14:40:06,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.76 | optimizer_step: 0.36 [2025-01-21 14:40:06,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 345.41 | bwd_microstep: 398.17 | bwd_inner_microstep: 390.17 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.65 [2025-01-21 14:40:06,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2702.13 | bwd: 3074.37 | bwd_inner: 3065.18 | bwd_allreduce: 8.25 | step: 12.46 66%|██████▋ | 290/437 [31:51<15:07, 6.18s/it] {'loss': 0.3807, 'learning_rate': 1.078150113762344e-05, 'epoch': 0.66} 66%|██████▋ | 290/437 [31:51<15:07, 6.18s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3988 [2025-01-21 14:40:07,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.48 | bwd_microstep: 295.60 | bwd_inner_microstep: 295.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3389 [2025-01-21 14:40:08,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.34 | bwd_microstep: 251.22 | bwd_inner_microstep: 251.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4675 [2025-01-21 14:40:08,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 306.23 | bwd_microstep: 342.45 | bwd_inner_microstep: 342.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4146 [2025-01-21 14:40:09,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.42 | bwd_microstep: 308.09 | bwd_inner_microstep: 307.88 | 
bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2809 [2025-01-21 14:40:09,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 190.94 | bwd_microstep: 215.23 | bwd_inner_microstep: 215.07 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:40:10,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.62 | bwd_microstep: 539.07 | bwd_inner_microstep: 538.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:40:11,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.00 | bwd_microstep: 608.38 | bwd_inner_microstep: 607.90 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6293 [2025-01-21 14:40:13,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.97 | optimizer_gradients: 0.82 | optimizer_step: 0.39 [2025-01-21 14:40:13,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.67 | bwd_microstep: 1070.40 | bwd_inner_microstep: 466.56 | bwd_allreduce_microstep: 603.71 | step_microstep: 19.57 [2025-01-21 14:40:13,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2686.54 | bwd: 3630.61 | bwd_inner: 3025.43 | bwd_allreduce: 604.31 | step: 20.57 67%|██████▋ | 291/437 [31:58<15:18, 6.29s/it] 
{'loss': 0.3017, 'learning_rate': 1.0649937573503419e-05, 'epoch': 0.67} 67%|██████▋ | 291/437 [31:58<15:18, 6.29s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6191 [2025-01-21 14:40:14,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.22 | bwd_microstep: 457.71 | bwd_inner_microstep: 457.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5808 [2025-01-21 14:40:15,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 380.39 | bwd_microstep: 426.38 | bwd_inner_microstep: 426.04 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2863 [2025-01-21 14:40:15,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.65 | bwd_microstep: 218.84 | bwd_inner_microstep: 218.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5991 [2025-01-21 14:40:16,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.67 | bwd_microstep: 435.35 | bwd_inner_microstep: 435.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3587 [2025-01-21 14:40:17,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.67 | bwd_microstep: 270.09 | bwd_inner_microstep: 269.79 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:40:18,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.60 | bwd_microstep: 579.80 | bwd_inner_microstep: 579.59 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4177 [2025-01-21 14:40:18,785] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 267.47 | bwd_microstep: 307.57 | bwd_inner_microstep: 307.41 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5035 [2025-01-21 14:40:19,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.34 [2025-01-21 14:40:19,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.87 | bwd_microstep: 376.27 | bwd_inner_microstep: 368.39 | bwd_allreduce_microstep: 7.66 | step_microstep: 11.19 [2025-01-21 14:40:19,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2707.35 | bwd: 3072.15 | bwd_inner: 3063.06 | bwd_allreduce: 8.11 | step: 12.06 67%|██████▋ | 292/437 [32:04<14:59, 6.21s/it] {'loss': 0.3269, 'learning_rate': 1.051888975022525e-05, 'epoch': 0.67} 67%|██████▋ | 292/437 [32:04<14:59, 6.21s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6649 [2025-01-21 14:40:20,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.33 | bwd_microstep: 488.29 | bwd_inner_microstep: 488.10 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2922 [2025-01-21 14:40:20,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.19 | bwd_microstep: 218.14 | bwd_inner_microstep: 217.64 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4437 [2025-01-21 14:40:21,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.68 | bwd_microstep: 324.76 | bwd_inner_microstep: 324.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8103 [2025-01-21 14:40:22,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 527.54 | bwd_microstep: 605.94 | bwd_inner_microstep: 605.69 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3828 [2025-01-21 14:40:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.78 | bwd_microstep: 282.79 | bwd_inner_microstep: 282.48 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:40:23,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.83 | bwd_microstep: 205.92 | bwd_inner_microstep: 205.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:40:24,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 360.21 | bwd_microstep: 400.77 | bwd_inner_microstep: 400.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:40:25,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:40:25,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.35 | bwd_microstep: 1069.26 | bwd_inner_microstep: 236.29 | bwd_allreduce_microstep: 832.84 | step_microstep: 13.70 [2025-01-21 14:40:25,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2431.74 | bwd: 3596.04 | bwd_inner: 2761.64 | bwd_allreduce: 833.42 | step: 14.67 67%|██████▋ | 293/437 [32:10<14:55, 6.22s/it] {'loss': 0.1849, 'learning_rate': 1.0388364896266326e-05, 'epoch': 0.67} 67%|██████▋ | 293/437 [32:10<14:55, 6.22s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4362 [2025-01-21 14:40:26,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.05 | bwd_microstep: 320.59 | 
bwd_inner_microstep: 320.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:40:27,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.89 | bwd_microstep: 606.04 | bwd_inner_microstep: 605.82 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6036 [2025-01-21 14:40:28,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.43 | bwd_microstep: 440.52 | bwd_inner_microstep: 440.35 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7330 [2025-01-21 14:40:29,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.66 | bwd_microstep: 546.32 | bwd_inner_microstep: 546.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6792 [2025-01-21 14:40:30,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 440.03 | bwd_microstep: 503.65 | bwd_inner_microstep: 503.48 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5186 [2025-01-21 14:40:31,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.46 | bwd_microstep: 378.25 | bwd_inner_microstep: 378.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, 
dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:40:32,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 549.85 | bwd_microstep: 606.56 | bwd_inner_microstep: 606.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:40:33,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.67 | optimizer_step: 0.34 [2025-01-21 14:40:33,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.13 | bwd_microstep: 545.76 | bwd_inner_microstep: 538.18 | bwd_allreduce_microstep: 7.36 | step_microstep: 10.55 [2025-01-21 14:40:33,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3490.35 | bwd: 3947.81 | bwd_inner: 3939.23 | bwd_allreduce: 7.79 | step: 11.30 67%|██████▋ | 294/437 [32:18<15:51, 6.66s/it] {'loss': 0.4315, 'learning_rate': 1.0258370211257511e-05, 'epoch': 0.67} 67%|██████▋ | 294/437 [32:18<15:51, 6.66s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5650 [2025-01-21 14:40:34,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.71 | bwd_microstep: 414.31 | bwd_inner_microstep: 414.07 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8163 [2025-01-21 14:40:35,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.33 | bwd_microstep: 609.13 | bwd_inner_microstep: 608.80 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5499 [2025-01-21 14:40:36,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.96 | bwd_microstep: 
404.91 | bwd_inner_microstep: 404.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2564 [2025-01-21 14:40:36,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.05 | bwd_microstep: 211.39 | bwd_inner_microstep: 211.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3879 [2025-01-21 14:40:37,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.63 | bwd_microstep: 288.53 | bwd_inner_microstep: 288.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:40:38,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.95 | bwd_microstep: 604.80 | bwd_inner_microstep: 604.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.16 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:40:38,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.89 | bwd_microstep: 262.12 | bwd_inner_microstep: 261.81 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:40:39,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.68 | optimizer_step: 0.33 [2025-01-21 14:40:39,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 388.66 | bwd_microstep: 441.59 | bwd_inner_microstep: 434.21 | bwd_allreduce_microstep: 7.27 | step_microstep: 11.10 [2025-01-21 14:40:39,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2860.03 | 
bwd: 3236.90 | bwd_inner: 3228.27 | bwd_allreduce: 7.75 | step: 11.93 68%|██████▊ | 295/437 [32:24<15:30, 6.55s/it] {'loss': 0.2808, 'learning_rate': 1.0128912865586038e-05, 'epoch': 0.67} 68%|██████▊ | 295/437 [32:24<15:30, 6.55s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5574 [2025-01-21 14:40:40,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.04 | bwd_microstep: 411.07 | bwd_inner_microstep: 410.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5574 [2025-01-21 14:40:41,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.15 | bwd_microstep: 412.01 | bwd_inner_microstep: 411.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2883 [2025-01-21 14:40:41,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.95 | bwd_microstep: 218.69 | bwd_inner_microstep: 218.20 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6322 [2025-01-21 14:40:42,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 411.41 | bwd_microstep: 468.25 | bwd_inner_microstep: 468.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6567 [2025-01-21 14:40:43,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.59 | bwd_microstep: 486.42 | bwd_inner_microstep: 486.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6548 [2025-01-21 14:40:44,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.92 | bwd_microstep: 485.38 | bwd_inner_microstep: 485.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch 
size: 21, images per sample: 21.0, dynamic token length: 5733 [2025-01-21 14:40:45,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.10 | bwd_microstep: 421.39 | bwd_inner_microstep: 421.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3095 [2025-01-21 14:40:45,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:40:45,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.00 | bwd_microstep: 246.43 | bwd_inner_microstep: 238.80 | bwd_allreduce_microstep: 7.52 | step_microstep: 11.08 [2025-01-21 14:40:45,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2767.00 | bwd: 3149.81 | bwd_inner: 3140.83 | bwd_allreduce: 8.11 | step: 12.09 68%|██████▊ | 296/437 [32:30<15:06, 6.43s/it] {'loss': 0.292, 'learning_rate': 1.0000000000000006e-05, 'epoch': 0.68} 68%|██████▊ | 296/437 [32:30<15:06, 6.43s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3380 [2025-01-21 14:40:46,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.03 | bwd_microstep: 252.84 | bwd_inner_microstep: 252.68 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8043 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:40:47,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 520.26 | bwd_microstep: 597.50 | bwd_inner_microstep: 597.17 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8149 [2025-01-21 14:40:48,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 527.93 | bwd_microstep: 607.12 | bwd_inner_microstep: 606.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3375 [2025-01-21 14:40:49,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.81 | bwd_microstep: 253.64 | bwd_inner_microstep: 253.48 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8100 [2025-01-21 14:40:50,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.03 | bwd_microstep: 606.59 | bwd_inner_microstep: 606.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:40:51,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 340.50 | bwd_microstep: 376.13 | bwd_inner_microstep: 375.95 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:40:52,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.48 | bwd_microstep: 519.44 | bwd_inner_microstep: 519.29 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6075 [2025-01-21 14:40:53,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.81 | optimizer_step: 0.34 [2025-01-21 14:40:53,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.00 | bwd_microstep: 909.20 | bwd_inner_microstep: 442.46 | bwd_allreduce_microstep: 466.63 | step_microstep: 13.74 [2025-01-21 14:40:53,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3208.91 | bwd: 4122.59 | bwd_inner: 3654.74 | bwd_allreduce: 467.09 | step: 14.50 68%|██████▊ | 297/437 [32:38<15:47, 6.77s/it] {'loss': 0.2757, 'learning_rate': 
9.871638725214481e-06, 'epoch': 0.68} 68%|██████▊ | 297/437 [32:38<15:47, 6.77s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2847 [2025-01-21 14:40:53,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.58 | bwd_microstep: 213.77 | bwd_inner_microstep: 213.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3232 [2025-01-21 14:40:54,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.81 | bwd_microstep: 245.80 | bwd_inner_microstep: 245.48 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3155 [2025-01-21 14:40:54,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.71 | bwd_microstep: 242.49 | bwd_inner_microstep: 242.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3369 [2025-01-21 14:40:55,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.70 | bwd_microstep: 254.79 | bwd_inner_microstep: 254.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:40:56,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.45 | bwd_microstep: 519.19 | bwd_inner_microstep: 519.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:40:57,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.19 | bwd_microstep: 477.89 | bwd_inner_microstep: 477.68 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:40:58,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 374.81 | bwd_microstep: 417.06 | bwd_inner_microstep: 416.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4313 [2025-01-21 14:40:58,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.97 | optimizer_step: 0.36 [2025-01-21 14:40:58,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.24 | bwd_microstep: 503.96 | bwd_inner_microstep: 318.91 | bwd_allreduce_microstep: 184.90 | step_microstep: 15.38 [2025-01-21 14:40:58,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2370.31 | bwd: 2875.09 | bwd_inner: 2688.82 | bwd_allreduce: 185.39 | step: 16.19 68%|██████▊ | 298/437 [32:43<14:47, 6.38s/it] {'loss': 0.2608, 'learning_rate': 9.743836121519297e-06, 'epoch': 0.68} 68%|██████▊ | 298/437 [32:43<14:47, 6.38s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2699 [2025-01-21 14:40:59,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.29 | bwd_microstep: 205.85 | bwd_inner_microstep: 205.36 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4522 [2025-01-21 14:41:00,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.57 | bwd_microstep: 332.97 | bwd_inner_microstep: 332.47 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5278 [2025-01-21 14:41:00,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 344.75 | bwd_microstep: 384.13 | bwd_inner_microstep: 383.97 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3656 [2025-01-21 14:41:01,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.67 | bwd_microstep: 272.67 | 
bwd_inner_microstep: 272.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4718 [2025-01-21 14:41:02,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.90 | bwd_microstep: 345.16 | bwd_inner_microstep: 344.84 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5429 [2025-01-21 14:41:02,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 361.29 | bwd_microstep: 400.12 | bwd_inner_microstep: 399.79 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:41:03,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.60 | bwd_microstep: 538.96 | bwd_inner_microstep: 538.80 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:41:04,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:41:04,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.40 | bwd_microstep: 838.35 | bwd_inner_microstep: 262.15 | bwd_allreduce_microstep: 576.05 | step_microstep: 13.72 [2025-01-21 14:41:04,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2447.29 | bwd: 3318.42 | bwd_inner: 2740.39 | bwd_allreduce: 576.78 | step: 14.85 68%|██████▊ | 299/437 [32:49<14:25, 6.27s/it] {'loss': 0.2605, 'learning_rate': 9.616599238388501e-06, 'epoch': 0.68} 68%|██████▊ | 299/437 [32:49<14:25, 6.27s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5635 [2025-01-21 14:41:05,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 366.92 | bwd_microstep: 416.98 | bwd_inner_microstep: 416.66 | bwd_allreduce_microstep: 
0.10 | step_microstep: 0.13 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8184 [2025-01-21 14:41:06,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 529.96 | bwd_microstep: 612.90 | bwd_inner_microstep: 612.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4449 [2025-01-21 14:41:07,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.54 | bwd_microstep: 325.40 | bwd_inner_microstep: 325.10 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7858 [2025-01-21 14:41:08,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.13 | bwd_microstep: 586.03 | bwd_inner_microstep: 585.53 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:41:09,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.81 | bwd_microstep: 203.05 | bwd_inner_microstep: 202.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:41:10,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.24 | bwd_microstep: 580.35 | bwd_inner_microstep: 580.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:41:11,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.98 | bwd_microstep: 615.40 | bwd_inner_microstep: 615.24 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5604 [2025-01-21 14:41:12,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.74 | 
optimizer_step: 0.34 [2025-01-21 14:41:12,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.09 | bwd_microstep: 419.66 | bwd_inner_microstep: 411.95 | bwd_allreduce_microstep: 7.60 | step_microstep: 11.40 [2025-01-21 14:41:12,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3297.52 | bwd: 3759.94 | bwd_inner: 3750.70 | bwd_allreduce: 8.25 | step: 12.40 69%|██████▊ | 300/437 [32:56<15:01, 6.58s/it] {'loss': 0.2572, 'learning_rate': 9.48993509409151e-06, 'epoch': 0.69} 69%|██████▊ | 300/437 [32:56<15:01, 6.58s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6973 [2025-01-21 14:41:13,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.85 | bwd_microstep: 512.16 | bwd_inner_microstep: 512.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6138 [2025-01-21 14:41:14,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.53 | bwd_microstep: 445.60 | bwd_inner_microstep: 445.36 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7138 [2025-01-21 14:41:15,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 458.18 | bwd_microstep: 528.42 | bwd_inner_microstep: 528.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7599 [2025-01-21 14:41:16,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.21 | bwd_microstep: 564.72 | bwd_inner_microstep: 564.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5418 [2025-01-21 14:41:17,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 361.09 | bwd_microstep: 398.47 | bwd_inner_microstep: 398.31 | bwd_allreduce_microstep: 0.06 | 
step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:41:18,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.65 | bwd_microstep: 604.18 | bwd_inner_microstep: 604.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6365 [2025-01-21 14:41:19,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 405.64 | bwd_microstep: 469.19 | bwd_inner_microstep: 469.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:41:20,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.76 | optimizer_step: 0.36 [2025-01-21 14:41:20,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.63 | bwd_microstep: 623.43 | bwd_inner_microstep: 615.50 | bwd_allreduce_microstep: 7.80 | step_microstep: 11.72 [2025-01-21 14:41:20,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3619.62 | bwd: 4146.31 | bwd_inner: 4137.27 | bwd_allreduce: 8.28 | step: 12.51 69%|██████▉ | 301/437 [33:04<15:52, 7.01s/it] {'loss': 0.2759, 'learning_rate': 9.363850675306013e-06, 'epoch': 0.69} 69%|██████▉ | 301/437 [33:04<15:52, 7.01s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6667 [2025-01-21 14:41:21,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 429.76 | bwd_microstep: 492.12 | bwd_inner_microstep: 491.64 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.29 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6069 [2025-01-21 14:41:22,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.18 | bwd_microstep: 440.94 | bwd_inner_microstep: 440.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images 
per sample: 27.0, dynamic token length: 7352 [2025-01-21 14:41:23,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.79 | bwd_microstep: 549.46 | bwd_inner_microstep: 549.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:24,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.65 | bwd_microstep: 606.20 | bwd_inner_microstep: 605.98 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2281 [2025-01-21 14:41:24,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.28 | bwd_microstep: 206.74 | bwd_inner_microstep: 206.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7838 [2025-01-21 14:41:25,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.02 | bwd_microstep: 585.98 | bwd_inner_microstep: 585.44 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5979 [2025-01-21 14:41:26,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.14 | bwd_microstep: 434.83 | bwd_inner_microstep: 434.62 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:41:27,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.73 | optimizer_step: 0.38 [2025-01-21 14:41:27,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.31 | bwd_microstep: 270.30 | 
bwd_inner_microstep: 262.63 | bwd_allreduce_microstep: 7.57 | step_microstep: 11.53 [2025-01-21 14:41:27,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3162.96 | bwd: 3586.79 | bwd_inner: 3577.38 | bwd_allreduce: 8.34 | step: 12.71 69%|██████▉ | 302/437 [33:11<15:45, 7.00s/it] {'loss': 0.2789, 'learning_rate': 9.238352936732549e-06, 'epoch': 0.69} 69%|██████▉ | 302/437 [33:11<15:45, 7.00s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8065 [2025-01-21 14:41:28,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 520.01 | bwd_microstep: 603.37 | bwd_inner_microstep: 603.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3668 [2025-01-21 14:41:28,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.75 | bwd_microstep: 274.03 | bwd_inner_microstep: 273.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3863 [2025-01-21 14:41:29,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.81 | bwd_microstep: 287.74 | bwd_inner_microstep: 287.56 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:41:30,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.55 | bwd_microstep: 518.79 | bwd_inner_microstep: 518.57 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4364 [2025-01-21 14:41:31,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.21 | bwd_microstep: 322.24 | bwd_inner_microstep: 322.00 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.21 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3638 [2025-01-21 14:41:31,702] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.48 | bwd_microstep: 272.73 | bwd_inner_microstep: 272.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5038 [2025-01-21 14:41:32,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.43 | bwd_microstep: 368.19 | bwd_inner_microstep: 367.93 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3878 [2025-01-21 14:41:33,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.77 | optimizer_step: 0.36 [2025-01-21 14:41:33,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.18 | bwd_microstep: 618.50 | bwd_inner_microstep: 289.31 | bwd_allreduce_microstep: 328.98 | step_microstep: 13.42 [2025-01-21 14:41:33,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2573.23 | bwd: 3265.73 | bwd_inner: 2935.32 | bwd_allreduce: 329.52 | step: 14.37 69%|██████▉ | 303/437 [33:18<15:01, 6.73s/it] {'loss': 0.2701, 'learning_rate': 9.113448800710929e-06, 'epoch': 0.69} 69%|██████▉ | 303/437 [33:18<15:01, 6.73s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5408 [2025-01-21 14:41:34,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 353.21 | bwd_microstep: 397.30 | bwd_inner_microstep: 397.09 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5055 [2025-01-21 14:41:34,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.77 | bwd_microstep: 369.40 | bwd_inner_microstep: 369.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7919 [2025-01-21 14:41:35,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 516.13 | bwd_microstep: 589.62 | bwd_inner_microstep: 589.37 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:37,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.03 | bwd_microstep: 605.84 | bwd_inner_microstep: 605.62 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3092 [2025-01-21 14:41:37,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.46 | bwd_microstep: 237.00 | bwd_inner_microstep: 236.68 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:41:38,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 359.47 | bwd_microstep: 401.79 | bwd_inner_microstep: 401.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:41:39,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.49 | bwd_microstep: 460.37 | bwd_inner_microstep: 460.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:41:39,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.83 | optimizer_step: 0.35 [2025-01-21 14:41:39,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.48 | bwd_microstep: 228.21 | bwd_inner_microstep: 220.25 | bwd_allreduce_microstep: 7.71 | step_microstep: 11.81 [2025-01-21 14:41:39,766] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2900.88 | bwd: 3289.66 | bwd_inner: 3280.51 | bwd_allreduce: 8.18 | step: 12.67 70%|██████▉ | 304/437 [33:24<14:42, 6.64s/it] {'loss': 0.2623, 'learning_rate': 8.989145156838387e-06, 'epoch': 0.69} 70%|██████▉ | 304/437 [33:24<14:42, 6.64s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4068 [2025-01-21 14:41:40,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.58 | bwd_microstep: 303.99 | bwd_inner_microstep: 303.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7637 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:41,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.30 | bwd_microstep: 567.06 | bwd_inner_microstep: 566.89 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6838 [2025-01-21 14:41:42,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.41 | bwd_microstep: 504.59 | bwd_inner_microstep: 504.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4675 [2025-01-21 14:41:43,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 306.70 | bwd_microstep: 344.01 | bwd_inner_microstep: 343.85 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:41:44,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.17 | bwd_microstep: 497.53 | bwd_inner_microstep: 497.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 
26.0, dynamic token length: 7007 [2025-01-21 14:41:45,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.44 | bwd_microstep: 518.67 | bwd_inner_microstep: 518.49 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6078 [2025-01-21 14:41:45,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 385.92 | bwd_microstep: 442.30 | bwd_inner_microstep: 442.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:41:47,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.68 | optimizer_step: 0.34 [2025-01-21 14:41:47,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.50 | bwd_microstep: 622.38 | bwd_inner_microstep: 614.59 | bwd_allreduce_microstep: 7.69 | step_microstep: 11.38 [2025-01-21 14:41:47,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3318.84 | bwd: 3800.65 | bwd_inner: 3791.73 | bwd_allreduce: 8.17 | step: 12.18 70%|██████▉ | 305/437 [33:31<15:04, 6.85s/it] {'loss': 0.2679, 'learning_rate': 8.865448861589572e-06, 'epoch': 0.7} 70%|██████▉ | 305/437 [33:31<15:04, 6.85s/it]dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4906 [2025-01-21 14:41:47,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 315.19 | bwd_microstep: 358.70 | bwd_inner_microstep: 358.41 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7669 [2025-01-21 14:41:48,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.71 | bwd_microstep: 571.96 | bwd_inner_microstep: 571.63 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.19 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2371 [2025-01-21 14:41:49,358] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.31 | bwd_microstep: 207.06 | bwd_inner_microstep: 206.85 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3404 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:49,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.27 | bwd_microstep: 256.27 | bwd_inner_microstep: 256.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:51,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 554.73 | bwd_microstep: 607.65 | bwd_inner_microstep: 607.44 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.19 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:41:51,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.63 | bwd_microstep: 399.90 | bwd_inner_microstep: 399.61 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3606 [2025-01-21 14:41:52,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.16 | bwd_microstep: 269.57 | bwd_inner_microstep: 269.36 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3782 [2025-01-21 14:41:52,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.74 | optimizer_step: 0.35 [2025-01-21 
14:41:52,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.55 | bwd_microstep: 291.54 | bwd_inner_microstep: 283.72 | bwd_allreduce_microstep: 7.65 | step_microstep: 11.54 [2025-01-21 14:41:52,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2614.35 | bwd: 2962.83 | bwd_inner: 2953.62 | bwd_allreduce: 8.25 | step: 12.57 70%|███████ | 306/437 [33:37<14:17, 6.55s/it] {'loss': 0.4513, 'learning_rate': 8.74236673793833e-06, 'epoch': 0.7} 70%|███████ | 306/437 [33:37<14:17, 6.55s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6664 [2025-01-21 14:41:53,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 431.96 | bwd_microstep: 494.43 | bwd_inner_microstep: 494.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:41:55,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.21 | bwd_microstep: 607.16 | bwd_inner_microstep: 606.96 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2592 [2025-01-21 14:41:55,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.35 | bwd_microstep: 209.08 | bwd_inner_microstep: 208.91 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:41:56,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.93 | bwd_microstep: 
606.91 | bwd_inner_microstep: 606.41 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.30 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5731 [2025-01-21 14:41:57,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.02 | bwd_microstep: 419.91 | bwd_inner_microstep: 419.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2803 [2025-01-21 14:41:57,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.87 | bwd_microstep: 209.56 | bwd_inner_microstep: 209.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:41:58,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.20 | bwd_microstep: 478.74 | bwd_inner_microstep: 478.40 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:42:00,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:42:00,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.58 | bwd_microstep: 610.40 | bwd_inner_microstep: 603.08 | bwd_allreduce_microstep: 7.22 | step_microstep: 10.89 [2025-01-21 14:42:00,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3219.95 | bwd: 3636.37 | bwd_inner: 3627.53 | bwd_allreduce: 7.81 | step: 11.87 70%|███████ | 307/437 [33:44<14:32, 6.71s/it] {'loss': 0.2597, 'learning_rate': 8.619905574981378e-06, 'epoch': 0.7} 70%|███████ | 307/437 [33:44<14:32, 6.71s/it]warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, 
dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:42:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.71 | bwd_microstep: 605.23 | bwd_inner_microstep: 604.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5684 [2025-01-21 14:42:02,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.58 | bwd_microstep: 416.86 | bwd_inner_microstep: 416.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7922 [2025-01-21 14:42:03,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.48 | bwd_microstep: 589.13 | bwd_inner_microstep: 588.87 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7072 [2025-01-21 14:42:04,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.75 | bwd_microstep: 521.74 | bwd_inner_microstep: 521.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5194 [2025-01-21 14:42:04,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.40 | bwd_microstep: 378.89 | bwd_inner_microstep: 378.71 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3595 [2025-01-21 14:42:05,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.69 | bwd_microstep: 271.50 | bwd_inner_microstep: 271.01 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2781 [2025-01-21 14:42:05,883] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.34 | bwd_microstep: 208.89 | bwd_inner_microstep: 208.40 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:42:07,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.77 | optimizer_step: 0.34 [2025-01-21 14:42:07,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.00 | bwd_microstep: 1152.31 | bwd_inner_microstep: 606.09 | bwd_allreduce_microstep: 545.99 | step_microstep: 13.13 [2025-01-21 14:42:07,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3170.82 | bwd: 4144.75 | bwd_inner: 3596.93 | bwd_allreduce: 546.66 | step: 14.25 70%|███████ | 308/437 [33:52<14:58, 6.96s/it] {'loss': 0.2465, 'learning_rate': 8.498072127563793e-06, 'epoch': 0.7} 70%|███████ | 308/437 [33:52<14:58, 6.96s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4583 [2025-01-21 14:42:08,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.69 | bwd_microstep: 335.95 | bwd_inner_microstep: 335.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3133 [2025-01-21 14:42:08,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.45 | bwd_microstep: 243.02 | bwd_inner_microstep: 242.75 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4426 [2025-01-21 14:42:09,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.79 | bwd_microstep: 323.81 | bwd_inner_microstep: 323.48 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3592 [2025-01-21 14:42:09,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 243.87 | bwd_microstep: 268.98 | bwd_inner_microstep: 268.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:42:10,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.55 | bwd_microstep: 208.06 | bwd_inner_microstep: 207.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5416 [2025-01-21 14:42:11,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.08 | bwd_microstep: 399.95 | bwd_inner_microstep: 399.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3945 [2025-01-21 14:42:11,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 250.39 | bwd_microstep: 293.33 | bwd_inner_microstep: 293.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2322 [2025-01-21 14:42:12,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:42:12,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.02 | bwd_microstep: 708.10 | bwd_inner_microstep: 231.73 | bwd_allreduce_microstep: 476.26 | step_microstep: 14.03 [2025-01-21 14:42:12,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 1993.66 | bwd: 2781.33 | bwd_inner: 2303.81 | bwd_allreduce: 476.74 | step: 14.87 71%|███████ | 309/437 [33:57<13:35, 6.37s/it] {'loss': 0.2899, 'learning_rate': 8.37687311590647e-06, 'epoch': 0.71} 71%|███████ | 309/437 [33:57<13:35, 6.37s/it]dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4902 [2025-01-21 14:42:13,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.74 | bwd_microstep: 358.94 | 
bwd_inner_microstep: 358.64 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3677 [2025-01-21 14:42:13,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.26 | bwd_microstep: 274.01 | bwd_inner_microstep: 273.68 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3370 [2025-01-21 14:42:14,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.26 | bwd_microstep: 255.20 | bwd_inner_microstep: 255.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3614 [2025-01-21 14:42:14,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.58 | bwd_microstep: 266.28 | bwd_inner_microstep: 265.75 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5684 [2025-01-21 14:42:15,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.90 | bwd_microstep: 417.22 | bwd_inner_microstep: 417.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5153 [2025-01-21 14:42:16,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 340.45 | bwd_microstep: 377.42 | bwd_inner_microstep: 377.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:42:17,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.54 | bwd_microstep: 479.35 | bwd_inner_microstep: 479.04 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3467 warning: The size of tensor a (7884) must match the size of tensor b (8192) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:19,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:42:19,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.48 | bwd_microstep: 1376.99 | bwd_inner_microstep: 262.54 | bwd_allreduce_microstep: 1114.25 | step_microstep: 13.54 [2025-01-21 14:42:19,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2390.03 | bwd: 3805.62 | bwd_inner: 2689.59 | bwd_allreduce: 1114.85 | step: 14.36 71%|███████ | 310/437 [34:03<13:31, 6.39s/it] {'loss': 0.321, 'learning_rate': 8.256315225235392e-06, 'epoch': 0.71} 71%|███████ | 310/437 [34:03<13:31, 6.39s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2361 [2025-01-21 14:42:19,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.05 | bwd_microstep: 199.70 | bwd_inner_microstep: 199.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:42:20,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.62 | bwd_microstep: 607.40 | bwd_inner_microstep: 607.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8133 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:42:21,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.34 | 
bwd_microstep: 607.61 | bwd_inner_microstep: 607.30 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2819 [2025-01-21 14:42:22,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.09 | bwd_microstep: 212.61 | bwd_inner_microstep: 212.31 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:42:23,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 436.74 | bwd_microstep: 498.99 | bwd_inner_microstep: 498.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2601 [2025-01-21 14:42:23,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 173.63 | bwd_microstep: 210.72 | bwd_inner_microstep: 210.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7056 [2025-01-21 14:42:24,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 451.37 | bwd_microstep: 524.75 | bwd_inner_microstep: 524.54 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5950 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:26,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.11 | optimizer_gradients: 0.77 | optimizer_step: 0.34 [2025-01-21 14:42:26,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 385.50 | bwd_microstep: 1151.69 | bwd_inner_microstep: 439.10 | bwd_allreduce_microstep: 712.44 | step_microstep: 13.80 [2025-01-21 14:42:26,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd: 2870.18 | bwd: 4013.61 | bwd_inner: 3299.81 | bwd_allreduce: 712.92 | step: 14.58 71%|███████ | 311/437 [34:10<13:52, 6.60s/it] {'loss': 0.5139, 'learning_rate': 8.136405105412897e-06, 'epoch': 0.71} 71%|███████ | 311/437 [34:10<13:52, 6.60s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6537 [2025-01-21 14:42:27,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 413.12 | bwd_microstep: 481.48 | bwd_inner_microstep: 481.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5505 [2025-01-21 14:42:27,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.66 | bwd_microstep: 407.12 | bwd_inner_microstep: 406.79 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7608 [2025-01-21 14:42:28,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.31 | bwd_microstep: 566.55 | bwd_inner_microstep: 566.06 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.31 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2549 [2025-01-21 14:42:29,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.16 | bwd_microstep: 202.90 | bwd_inner_microstep: 202.66 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7850 [2025-01-21 14:42:30,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.70 | bwd_microstep: 584.83 | bwd_inner_microstep: 584.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3861 [2025-01-21 14:42:31,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 259.97 | bwd_microstep: 288.41 | bwd_inner_microstep: 288.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 
dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:42:32,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.27 | bwd_microstep: 606.96 | bwd_inner_microstep: 606.77 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2690 [2025-01-21 14:42:34,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:42:34,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.68 | bwd_microstep: 1765.71 | bwd_inner_microstep: 226.59 | bwd_allreduce_microstep: 1539.00 | step_microstep: 13.37 [2025-01-21 14:42:34,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2936.70 | bwd: 4904.13 | bwd_inner: 3363.51 | bwd_allreduce: 1539.63 | step: 14.37 71%|███████▏ | 312/437 [34:18<14:40, 7.04s/it] {'loss': 0.4021, 'learning_rate': 8.017149370570884e-06, 'epoch': 0.71} 71%|███████▏ | 312/437 [34:18<14:40, 7.04s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:42:35,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 530.80 | bwd_microstep: 613.58 | bwd_inner_microstep: 613.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7350 [2025-01-21 14:42:36,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.61 | bwd_microstep: 547.03 | bwd_inner_microstep: 
546.81 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6244 [2025-01-21 14:42:37,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.13 | bwd_microstep: 461.90 | bwd_inner_microstep: 461.42 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:42:37,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.71 | bwd_microstep: 261.71 | bwd_inner_microstep: 261.40 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:42:38,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 318.94 | bwd_microstep: 354.97 | bwd_inner_microstep: 354.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:42:39,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.20 | bwd_microstep: 613.54 | bwd_inner_microstep: 613.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:40,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 549.76 | bwd_microstep: 604.21 | bwd_inner_microstep: 604.00 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6277 
[2025-01-21 14:42:41,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.63 | optimizer_step: 0.39 [2025-01-21 14:42:41,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.95 | bwd_microstep: 472.63 | bwd_inner_microstep: 465.07 | bwd_allreduce_microstep: 7.45 | step_microstep: 11.14 [2025-01-21 14:42:41,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3452.96 | bwd: 3929.75 | bwd_inner: 3920.70 | bwd_allreduce: 8.07 | step: 12.10 72%|███████▏ | 313/437 [34:26<14:54, 7.22s/it] {'loss': 0.4034, 'learning_rate': 7.89855459874598e-06, 'epoch': 0.72} 72%|███████▏ | 313/437 [34:26<14:54, 7.22s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3489 [2025-01-21 14:42:42,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.88 | bwd_microstep: 260.29 | bwd_inner_microstep: 260.06 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2877 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:42,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.46 | bwd_microstep: 220.70 | bwd_inner_microstep: 220.37 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2562 [2025-01-21 14:42:43,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.40 | bwd_microstep: 203.36 | bwd_inner_microstep: 203.17 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2284 [2025-01-21 14:42:43,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.66 | bwd_microstep: 201.72 | bwd_inner_microstep: 
201.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2768 [2025-01-21 14:42:43,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.84 | bwd_microstep: 213.25 | bwd_inner_microstep: 212.95 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:42:44,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 305.23 | bwd_microstep: 338.23 | bwd_inner_microstep: 338.03 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4255 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:45,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.57 | bwd_microstep: 313.75 | bwd_inner_microstep: 313.54 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4481 [2025-01-21 14:42:47,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.83 | optimizer_step: 0.35 [2025-01-21 14:42:47,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.63 | bwd_microstep: 1806.29 | bwd_inner_microstep: 331.92 | bwd_allreduce_microstep: 1474.24 | step_microstep: 14.08 [2025-01-21 14:42:47,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 1829.50 | bwd: 3557.72 | bwd_inner: 2082.05 | bwd_allreduce: 1474.76 | step: 14.91 72%|███████▏ | 314/437 [34:32<13:48, 6.73s/it] {'loss': 0.3049, 'learning_rate': 7.780627331516697e-06, 'epoch': 0.72} 72%|███████▏ | 314/437 [34:32<13:48, 6.73s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 
3218 [2025-01-21 14:42:47,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.85 | bwd_microstep: 242.45 | bwd_inner_microstep: 242.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6890 [2025-01-21 14:42:48,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.54 | bwd_microstep: 507.96 | bwd_inner_microstep: 507.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2318 [2025-01-21 14:42:49,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.16 | bwd_microstep: 207.41 | bwd_inner_microstep: 207.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3897 [2025-01-21 14:42:49,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.46 | bwd_microstep: 287.64 | bwd_inner_microstep: 287.48 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:42:50,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.35 | bwd_microstep: 282.39 | bwd_inner_microstep: 282.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:42:51,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 337.46 | bwd_microstep: 375.64 | bwd_inner_microstep: 375.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6544 [2025-01-21 14:42:52,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.32 | bwd_microstep: 484.40 | bwd_inner_microstep: 484.10 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 28, images per 
sample: 28.0, dynamic token length: 7765 [2025-01-21 14:42:53,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.63 | optimizer_step: 0.33 [2025-01-21 14:42:53,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.48 | bwd_microstep: 583.64 | bwd_inner_microstep: 576.51 | bwd_allreduce_microstep: 7.02 | step_microstep: 11.11 [2025-01-21 14:42:53,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2584.43 | bwd: 2971.65 | bwd_inner: 2963.42 | bwd_allreduce: 7.48 | step: 11.90 72%|███████▏ | 315/437 [34:37<13:06, 6.45s/it] {'loss': 0.2607, 'learning_rate': 7.6633740736426e-06, 'epoch': 0.72} 72%|███████▏ | 315/437 [34:37<13:06, 6.45s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5921 [2025-01-21 14:42:54,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.94 | bwd_microstep: 432.05 | bwd_inner_microstep: 431.88 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4531 [2025-01-21 14:42:54,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.15 | bwd_microstep: 333.11 | bwd_inner_microstep: 332.81 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:42:55,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 538.45 | bwd_microstep: 606.22 | bwd_inner_microstep: 606.06 | bwd_allreduce_microstep: 0.07 | 
step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2331 [2025-01-21 14:42:56,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.09 | bwd_microstep: 204.20 | bwd_inner_microstep: 204.03 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5744 [2025-01-21 14:42:57,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.91 | bwd_microstep: 420.56 | bwd_inner_microstep: 420.29 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7849 [2025-01-21 14:42:58,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.26 | bwd_microstep: 584.60 | bwd_inner_microstep: 584.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2274 [2025-01-21 14:42:58,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.92 | bwd_microstep: 204.05 | bwd_inner_microstep: 203.83 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:42:59,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:42:59,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.71 | bwd_microstep: 613.03 | bwd_inner_microstep: 605.23 | bwd_allreduce_microstep: 7.68 | step_microstep: 11.19 [2025-01-21 14:42:59,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2989.27 | bwd: 3397.94 | bwd_inner: 3388.98 | bwd_allreduce: 8.15 | step: 11.99 
72%|███████▏ | 316/437 [34:44<13:06, 6.50s/it] {'loss': 0.4599, 'learning_rate': 7.546801292705539e-06, 'epoch': 0.72} 72%|███████▏ | 316/437 [34:44<13:06, 6.50s/it]dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5136 [2025-01-21 14:43:00,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 328.10 | bwd_microstep: 374.17 | bwd_inner_microstep: 374.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2676 [2025-01-21 14:43:00,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.34 | bwd_microstep: 210.94 | bwd_inner_microstep: 210.44 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.43 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4455 [2025-01-21 14:43:01,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.96 | bwd_microstep: 323.52 | bwd_inner_microstep: 323.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7867 [2025-01-21 14:43:02,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.44 | bwd_microstep: 585.69 | bwd_inner_microstep: 585.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8121 [2025-01-21 14:43:03,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.35 | bwd_microstep: 605.86 | bwd_inner_microstep: 605.57 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7006 [2025-01-21 14:43:04,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.28 | bwd_microstep: 519.14 | bwd_inner_microstep: 518.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4350 
[2025-01-21 14:43:05,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 282.76 | bwd_microstep: 316.75 | bwd_inner_microstep: 316.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7911 [2025-01-21 14:43:06,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.72 | optimizer_step: 0.38 [2025-01-21 14:43:06,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.88 | bwd_microstep: 596.05 | bwd_inner_microstep: 588.16 | bwd_allreduce_microstep: 7.78 | step_microstep: 13.24 [2025-01-21 14:43:06,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3089.94 | bwd: 3532.29 | bwd_inner: 3523.04 | bwd_allreduce: 8.36 | step: 14.38 73%|███████▎ | 317/437 [34:51<13:12, 6.61s/it] {'loss': 0.3117, 'learning_rate': 7.430915418752867e-06, 'epoch': 0.72} 73%|███████▎ | 317/437 [34:51<13:12, 6.61s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5506 [2025-01-21 14:43:07,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 356.05 | bwd_microstep: 405.80 | bwd_inner_microstep: 405.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4789 [2025-01-21 14:43:08,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 312.05 | bwd_microstep: 350.82 | bwd_inner_microstep: 350.52 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3433 [2025-01-21 14:43:08,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.85 | bwd_microstep: 256.80 | bwd_inner_microstep: 256.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8163 [2025-01-21 14:43:09,836] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.90 | bwd_microstep: 609.03 | bwd_inner_microstep: 608.87 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3616 [2025-01-21 14:43:10,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.75 | bwd_microstep: 269.80 | bwd_inner_microstep: 269.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2819 [2025-01-21 14:43:10,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.87 | bwd_microstep: 209.77 | bwd_inner_microstep: 209.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4143 [2025-01-21 14:43:11,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.68 | bwd_microstep: 303.77 | bwd_inner_microstep: 303.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:43:12,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.82 | optimizer_step: 0.36 [2025-01-21 14:43:12,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.98 | bwd_microstep: 614.89 | bwd_inner_microstep: 605.83 | bwd_allreduce_microstep: 8.83 | step_microstep: 11.81 [2025-01-21 14:43:12,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2675.96 | bwd: 3020.81 | bwd_inner: 3010.65 | bwd_allreduce: 9.25 | step: 12.60 73%|███████▎ | 318/437 [34:57<12:41, 6.40s/it] {'loss': 0.4483, 'learning_rate': 7.3157228439427765e-06, 'epoch': 0.73} 73%|███████▎ 
| 318/437 [34:57<12:41, 6.40s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8180 [2025-01-21 14:43:13,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 521.90 | bwd_microstep: 609.21 | bwd_inner_microstep: 609.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:43:14,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.71 | bwd_microstep: 605.85 | bwd_inner_microstep: 605.53 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5551 [2025-01-21 14:43:15,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.36 | bwd_microstep: 410.28 | bwd_inner_microstep: 410.12 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5205 [2025-01-21 14:43:16,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.03 | bwd_microstep: 380.12 | bwd_inner_microstep: 379.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4930 [2025-01-21 14:43:17,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.62 | bwd_microstep: 360.89 | bwd_inner_microstep: 360.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5973 [2025-01-21 14:43:18,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.87 | bwd_microstep: 434.17 | bwd_inner_microstep: 434.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch 
size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:43:18,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.57 | bwd_microstep: 219.81 | bwd_inner_microstep: 219.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:43:19,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.70 | optimizer_step: 0.33 [2025-01-21 14:43:19,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.05 | bwd_microstep: 383.46 | bwd_inner_microstep: 375.87 | bwd_allreduce_microstep: 7.37 | step_microstep: 11.09 [2025-01-21 14:43:19,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3035.97 | bwd: 3403.92 | bwd_inner: 3395.29 | bwd_allreduce: 7.81 | step: 11.89 73%|███████▎ | 319/437 [35:03<12:44, 6.48s/it] {'loss': 0.2225, 'learning_rate': 7.201229922191726e-06, 'epoch': 0.73} 73%|███████▎ | 319/437 [35:03<12:44, 6.48s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7918 [2025-01-21 14:43:20,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.17 | bwd_microstep: 588.21 | bwd_inner_microstep: 587.90 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4494 [2025-01-21 14:43:21,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 295.50 | bwd_microstep: 330.72 | bwd_inner_microstep: 330.55 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:43:22,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.81 | bwd_microstep: 604.23 | bwd_inner_microstep: 604.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6292 [2025-01-21 14:43:23,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.23 | bwd_microstep: 466.59 | bwd_inner_microstep: 466.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3120 [2025-01-21 14:43:23,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.16 | bwd_microstep: 242.70 | bwd_inner_microstep: 242.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4393 [2025-01-21 14:43:24,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.29 | bwd_microstep: 323.54 | bwd_inner_microstep: 323.37 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:43:25,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.88 | bwd_microstep: 605.49 | bwd_inner_microstep: 605.27 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7845 [2025-01-21 14:43:26,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.74 | optimizer_step: 0.33 [2025-01-21 14:43:26,550] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 500.66 | bwd_microstep: 595.37 | bwd_inner_microstep: 583.93 | bwd_allreduce_microstep: 11.32 | step_microstep: 11.24 [2025-01-21 14:43:26,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3289.54 | bwd: 3756.99 | bwd_inner: 3744.37 | bwd_allreduce: 11.80 | step: 12.02 73%|███████▎ | 320/437 [35:11<13:06, 6.72s/it] {'loss': 0.467, 'learning_rate': 7.087442968823952e-06, 'epoch': 0.73} 73%|███████▎ | 320/437 [35:11<13:06, 6.72s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2686 [2025-01-21 14:43:26,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.14 | bwd_microstep: 204.80 | bwd_inner_microstep: 204.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2619 [2025-01-21 14:43:27,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.09 | bwd_microstep: 211.30 | bwd_inner_microstep: 211.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6292 [2025-01-21 14:43:28,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.38 | bwd_microstep: 468.10 | bwd_inner_microstep: 467.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3088 [2025-01-21 14:43:28,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.04 | bwd_microstep: 242.42 | bwd_inner_microstep: 241.97 | bwd_allreduce_microstep: 0.22 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5718 [2025-01-21 14:43:29,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.66 | bwd_microstep: 422.05 | bwd_inner_microstep: 421.84 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token 
length: 3827 [2025-01-21 14:43:30,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.07 | bwd_microstep: 279.93 | bwd_inner_microstep: 279.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:43:31,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.28 | bwd_microstep: 604.75 | bwd_inner_microstep: 604.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6693 [2025-01-21 14:43:33,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.77 | optimizer_step: 0.34 [2025-01-21 14:43:33,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 431.50 | bwd_microstep: 1207.15 | bwd_inner_microstep: 495.88 | bwd_allreduce_microstep: 711.16 | step_microstep: 13.67 [2025-01-21 14:43:33,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2587.00 | bwd: 3640.63 | bwd_inner: 2928.08 | bwd_allreduce: 711.75 | step: 14.44 73%|███████▎ | 321/437 [35:17<12:50, 6.64s/it] {'loss': 0.4134, 'learning_rate': 6.974368260223123e-06, 'epoch': 0.73} 73%|███████▎ | 321/437 [35:17<12:50, 6.64s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6041 [2025-01-21 14:43:33,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.84 | bwd_microstep: 442.00 | bwd_inner_microstep: 441.67 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.12 dynamic 
ViT batch size: 20, images per sample: 20.0, dynamic token length: 5600 [2025-01-21 14:43:34,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.49 | bwd_microstep: 411.81 | bwd_inner_microstep: 411.59 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6087 [2025-01-21 14:43:35,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.74 | bwd_microstep: 443.97 | bwd_inner_microstep: 443.78 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7386 [2025-01-21 14:43:36,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.39 | bwd_microstep: 550.58 | bwd_inner_microstep: 550.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7870 [2025-01-21 14:43:37,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.00 | bwd_microstep: 586.99 | bwd_inner_microstep: 586.80 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8102 [2025-01-21 14:43:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.01 | bwd_microstep: 608.73 | bwd_inner_microstep: 608.40 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.14 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:43:39,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.44 | bwd_microstep: 221.64 | bwd_inner_microstep: 221.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3939 [2025-01-21 14:43:39,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.73 | optimizer_step: 0.34 [2025-01-21 
14:43:39,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.10 | bwd_microstep: 301.42 | bwd_inner_microstep: 293.71 | bwd_allreduce_microstep: 7.62 | step_microstep: 11.41 [2025-01-21 14:43:39,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3104.82 | bwd: 3567.28 | bwd_inner: 3558.32 | bwd_allreduce: 8.10 | step: 12.24 74%|███████▎ | 322/437 [35:24<12:52, 6.72s/it] {'loss': 0.3155, 'learning_rate': 6.862012033486145e-06, 'epoch': 0.74} 74%|███████▎ | 322/437 [35:24<12:52, 6.72s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2809 [2025-01-21 14:43:40,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.12 | bwd_microstep: 204.47 | bwd_inner_microstep: 204.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6897 [2025-01-21 14:43:41,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.02 | bwd_microstep: 510.53 | bwd_inner_microstep: 510.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5565 [2025-01-21 14:43:42,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.79 | bwd_microstep: 409.08 | bwd_inner_microstep: 408.92 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4480 [2025-01-21 14:43:42,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.96 | bwd_microstep: 328.76 | bwd_inner_microstep: 328.55 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4169 [2025-01-21 14:43:43,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.64 | bwd_microstep: 307.87 | bwd_inner_microstep: 307.66 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT 
batch size: 22, images per sample: 22.0, dynamic token length: 5947 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:43:44,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 387.49 | bwd_microstep: 433.85 | bwd_inner_microstep: 433.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6212 [2025-01-21 14:43:45,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 399.24 | bwd_microstep: 460.53 | bwd_inner_microstep: 460.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2615 [2025-01-21 14:43:46,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.87 | optimizer_step: 0.36 [2025-01-21 14:43:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.89 | bwd_microstep: 1317.10 | bwd_inner_microstep: 227.55 | bwd_allreduce_microstep: 1089.42 | step_microstep: 14.08 [2025-01-21 14:43:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2520.00 | bwd: 3972.33 | bwd_inner: 2881.60 | bwd_allreduce: 1089.95 | step: 14.91 74%|███████▍ | 323/437 [35:31<12:46, 6.72s/it] {'loss': 0.4041, 'learning_rate': 6.7503804860791115e-06, 'epoch': 0.74} 74%|███████▍ | 323/437 [35:31<12:46, 6.72s/it]dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7775 [2025-01-21 14:43:47,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.60 | bwd_microstep: 576.59 | bwd_inner_microstep: 576.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6175 [2025-01-21 14:43:48,630] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 400.67 | bwd_microstep: 456.59 | bwd_inner_microstep: 456.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5796 [2025-01-21 14:43:49,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.22 | bwd_microstep: 424.56 | bwd_inner_microstep: 424.39 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7344 [2025-01-21 14:43:50,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.05 | bwd_microstep: 548.18 | bwd_inner_microstep: 547.99 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.17 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8091 [2025-01-21 14:43:51,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.41 | bwd_microstep: 606.20 | bwd_inner_microstep: 605.98 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:43:52,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.52 | bwd_microstep: 199.16 | bwd_inner_microstep: 199.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8160 [2025-01-21 14:43:53,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.08 | bwd_microstep: 610.57 | bwd_inner_microstep: 610.39 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4413 [2025-01-21 14:43:53,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.69 | optimizer_step: 0.37 [2025-01-21 14:43:53,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.23 | bwd_microstep: 333.84 | bwd_inner_microstep: 323.66 | 
bwd_allreduce_microstep: 9.97 | step_microstep: 11.55 [2025-01-21 14:43:53,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3258.63 | bwd: 3755.82 | bwd_inner: 3744.53 | bwd_allreduce: 10.42 | step: 12.40 74%|███████▍ | 324/437 [35:38<12:57, 6.88s/it] {'loss': 0.2644, 'learning_rate': 6.6394797754955055e-06, 'epoch': 0.74} 74%|███████▍ | 324/437 [35:38<12:57, 6.88s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7978 [2025-01-21 14:43:55,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.99 | bwd_microstep: 594.17 | bwd_inner_microstep: 594.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2925 [2025-01-21 14:43:55,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.45 | bwd_microstep: 219.29 | bwd_inner_microstep: 219.08 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3645 [2025-01-21 14:43:56,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.38 | bwd_microstep: 273.85 | bwd_inner_microstep: 273.62 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3643 [2025-01-21 14:43:56,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 246.67 | bwd_microstep: 272.75 | bwd_inner_microstep: 272.39 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5996 [2025-01-21 14:43:57,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.31 | bwd_microstep: 436.64 | bwd_inner_microstep: 436.42 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7043 [2025-01-21 14:43:58,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 457.31 | bwd_microstep: 525.57 | bwd_inner_microstep: 525.39 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7031 [2025-01-21 14:43:59,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.13 | bwd_microstep: 519.94 | bwd_inner_microstep: 519.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6389 [2025-01-21 14:44:00,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.68 | optimizer_step: 0.33 [2025-01-21 14:44:00,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 411.60 | bwd_microstep: 478.37 | bwd_inner_microstep: 470.70 | bwd_allreduce_microstep: 7.57 | step_microstep: 11.25 [2025-01-21 14:44:00,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2919.67 | bwd: 3320.73 | bwd_inner: 3311.75 | bwd_allreduce: 8.11 | step: 12.16 74%|███████▍ | 325/437 [35:45<12:37, 6.76s/it] {'loss': 0.269, 'learning_rate': 6.529316018916478e-06, 'epoch': 0.74} 74%|███████▍ | 325/437 [35:45<12:37, 6.76s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3758 [2025-01-21 14:44:00,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.56 | bwd_microstep: 278.19 | bwd_inner_microstep: 278.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4980 [2025-01-21 14:44:01,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.07 | bwd_microstep: 362.07 | bwd_inner_microstep: 361.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7865 [2025-01-21 14:44:02,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.72 | bwd_microstep: 585.89 | 
bwd_inner_microstep: 585.73 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7302 [2025-01-21 14:44:03,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 475.75 | bwd_microstep: 546.15 | bwd_inner_microstep: 545.98 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4383 [2025-01-21 14:44:04,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 291.07 | bwd_microstep: 321.46 | bwd_inner_microstep: 321.12 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:44:04,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.82 | bwd_microstep: 206.60 | bwd_inner_microstep: 206.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:44:05,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.15 | bwd_microstep: 415.19 | bwd_inner_microstep: 415.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8038 [2025-01-21 14:44:07,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.73 | optimizer_step: 0.34 [2025-01-21 14:44:07,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.00 | bwd_microstep: 1504.26 | bwd_inner_microstep: 598.99 | bwd_allreduce_microstep: 905.16 | step_microstep: 13.54 [2025-01-21 14:44:07,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2906.97 | bwd: 4219.94 | bwd_inner: 3313.53 | bwd_allreduce: 905.62 | step: 14.30 75%|███████▍ | 326/437 [35:52<12:50, 6.94s/it] {'loss': 0.2354, 'learning_rate': 6.419895292873504e-06, 'epoch': 0.75} 75%|███████▍ | 
326/437 [35:52<12:50, 6.94s/it]dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4806 [2025-01-21 14:44:08,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 312.20 | bwd_microstep: 351.82 | bwd_inner_microstep: 351.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.16 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:44:09,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 549.88 | bwd_microstep: 605.53 | bwd_inner_microstep: 605.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3122 [2025-01-21 14:44:10,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.60 | bwd_microstep: 243.82 | bwd_inner_microstep: 243.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7085 [2025-01-21 14:44:11,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.93 | bwd_microstep: 525.62 | bwd_inner_microstep: 525.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6806 [2025-01-21 14:44:12,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.11 | bwd_microstep: 504.26 | bwd_inner_microstep: 504.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3881 [2025-01-21 14:44:12,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.93 | bwd_microstep: 288.47 | bwd_inner_microstep: 288.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch 
size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:44:13,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.57 | bwd_microstep: 261.33 | bwd_inner_microstep: 261.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5347 [2025-01-21 14:44:13,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:44:13,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 333.86 | bwd_microstep: 396.46 | bwd_inner_microstep: 388.59 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.28 [2025-01-21 14:44:13,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2809.91 | bwd: 3177.45 | bwd_inner: 3168.48 | bwd_allreduce: 8.23 | step: 12.12 75%|███████▍ | 327/437 [35:58<12:19, 6.72s/it] {'loss': 0.2503, 'learning_rate': 6.311223632913173e-06, 'epoch': 0.75} 75%|███████▍ | 327/437 [35:58<12:19, 6.72s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7483 [2025-01-21 14:44:15,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.61 | bwd_microstep: 556.40 | bwd_inner_microstep: 556.24 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3477 [2025-01-21 14:44:15,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.30 | bwd_microstep: 260.69 | bwd_inner_microstep: 260.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8145 [2025-01-21 14:44:16,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.34 | bwd_microstep: 609.68 | bwd_inner_microstep: 609.46 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 
7335 [2025-01-21 14:44:17,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.21 | bwd_microstep: 547.62 | bwd_inner_microstep: 547.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6249 [2025-01-21 14:44:18,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.06 | bwd_microstep: 463.27 | bwd_inner_microstep: 463.06 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:44:19,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.97 | bwd_microstep: 210.88 | bwd_inner_microstep: 210.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5946 [2025-01-21 14:44:19,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.31 | bwd_microstep: 435.59 | bwd_inner_microstep: 435.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6307 [2025-01-21 14:44:20,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.77 | optimizer_step: 0.35 [2025-01-21 14:44:20,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.63 | bwd_microstep: 475.79 | bwd_inner_microstep: 467.79 | bwd_allreduce_microstep: 7.81 | step_microstep: 11.65 [2025-01-21 14:44:20,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3086.27 | bwd: 3560.04 | bwd_inner: 3551.04 | bwd_allreduce: 8.30 | step: 12.46 75%|███████▌ | 328/437 [36:05<12:17, 6.77s/it] {'loss': 0.2419, 'learning_rate': 6.203307033264272e-06, 'epoch': 0.75} 75%|███████▌ | 328/437 [36:05<12:17, 6.77s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4643 [2025-01-21 14:44:21,492] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 300.24 | bwd_microstep: 342.42 | bwd_inner_microstep: 342.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3422 [2025-01-21 14:44:22,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.04 | bwd_microstep: 257.41 | bwd_inner_microstep: 257.05 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:44:23,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.50 | bwd_microstep: 604.12 | bwd_inner_microstep: 603.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2301 [2025-01-21 14:44:23,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.00 | bwd_microstep: 202.63 | bwd_inner_microstep: 202.46 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7301 [2025-01-21 14:44:24,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.78 | bwd_microstep: 546.49 | bwd_inner_microstep: 546.18 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.17 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7104 [2025-01-21 14:44:25,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.32 | bwd_microstep: 524.06 | bwd_inner_microstep: 523.84 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7714 [2025-01-21 14:44:26,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.70 | bwd_microstep: 576.32 | bwd_inner_microstep: 576.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4125 [2025-01-21 14:44:27,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.75 | optimizer_step: 0.34 [2025-01-21 14:44:27,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.22 | bwd_microstep: 738.51 | bwd_inner_microstep: 305.96 | bwd_allreduce_microstep: 432.41 | step_microstep: 13.44 [2025-01-21 14:44:27,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2936.62 | bwd: 3792.08 | bwd_inner: 3358.36 | bwd_allreduce: 432.86 | step: 14.28 75%|███████▌ | 329/437 [36:12<12:17, 6.82s/it] {'loss': 0.3827, 'learning_rate': 6.096151446507155e-06, 'epoch': 0.75} 75%|███████▌ | 329/437 [36:12<12:17, 6.82s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7303 [2025-01-21 14:44:28,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.88 | bwd_microstep: 545.02 | bwd_inner_microstep: 544.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6337 [2025-01-21 14:44:29,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.76 | bwd_microstep: 468.24 | bwd_inner_microstep: 468.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5735 [2025-01-21 14:44:30,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.21 | bwd_microstep: 420.71 | bwd_inner_microstep: 420.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 
dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5718 [2025-01-21 14:44:31,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.53 | bwd_microstep: 419.66 | bwd_inner_microstep: 419.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4646 [2025-01-21 14:44:32,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.35 | bwd_microstep: 341.60 | bwd_inner_microstep: 341.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:44:33,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.34 | bwd_microstep: 560.36 | bwd_inner_microstep: 560.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:44:33,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.65 | bwd_microstep: 203.36 | bwd_inner_microstep: 203.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5835 [2025-01-21 14:44:34,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:44:34,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 370.05 | bwd_microstep: 435.04 | bwd_inner_microstep: 427.52 | bwd_allreduce_microstep: 7.41 | step_microstep: 10.99 [2025-01-21 14:44:34,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2981.62 | bwd: 3394.12 | bwd_inner: 3385.53 | bwd_allreduce: 7.87 | step: 11.74 76%|███████▌ | 330/437 [36:19<12:03, 6.76s/it] {'loss': 0.2289, 'learning_rate': 5.989762783245423e-06, 'epoch': 0.75} 76%|███████▌ | 330/437 [36:19<12:03, 6.76s/it]dynamic ViT batch size: 10, images per sample: 10.0, 
dynamic token length: 3077 [2025-01-21 14:44:34,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.55 | bwd_microstep: 239.71 | bwd_inner_microstep: 239.55 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6093 [2025-01-21 14:44:35,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 396.03 | bwd_microstep: 442.91 | bwd_inner_microstep: 442.71 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3916 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:44:36,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.50 | bwd_microstep: 290.26 | bwd_inner_microstep: 290.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8140 [2025-01-21 14:44:37,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.70 | bwd_microstep: 608.31 | bwd_inner_microstep: 608.12 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2546 [2025-01-21 14:44:37,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.51 | bwd_microstep: 203.25 | bwd_inner_microstep: 203.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6233 [2025-01-21 14:44:38,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.17 | bwd_microstep: 461.08 | bwd_inner_microstep: 460.91 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6501 [2025-01-21 14:44:39,682] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.75 | bwd_microstep: 479.87 | bwd_inner_microstep: 479.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5584 [2025-01-21 14:44:40,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.66 | optimizer_step: 0.33 [2025-01-21 14:44:40,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.91 | bwd_microstep: 417.49 | bwd_inner_microstep: 409.80 | bwd_allreduce_microstep: 7.58 | step_microstep: 11.10 [2025-01-21 14:44:40,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2761.96 | bwd: 3142.99 | bwd_inner: 3134.26 | bwd_allreduce: 8.03 | step: 11.89 76%|███████▌ | 331/437 [36:25<11:36, 6.57s/it] {'loss': 0.3037, 'learning_rate': 5.884146911779865e-06, 'epoch': 0.76} 76%|███████▌ | 331/437 [36:25<11:36, 6.57s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3856 [2025-01-21 14:44:41,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.87 | bwd_microstep: 287.27 | bwd_inner_microstep: 287.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4083 [2025-01-21 14:44:41,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.35 | bwd_microstep: 303.52 | bwd_inner_microstep: 303.18 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2437 [2025-01-21 14:44:42,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.25 | bwd_microstep: 199.83 | bwd_inner_microstep: 199.56 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3993 [2025-01-21 14:44:42,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 264.26 | bwd_microstep: 297.45 | bwd_inner_microstep: 297.14 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6013 [2025-01-21 14:44:43,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.03 | bwd_microstep: 437.78 | bwd_inner_microstep: 437.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3880 [2025-01-21 14:44:44,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.46 | bwd_microstep: 288.21 | bwd_inner_microstep: 288.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6523 [2025-01-21 14:44:45,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.50 | bwd_microstep: 481.11 | bwd_inner_microstep: 480.90 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4923 [2025-01-21 14:44:46,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:44:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.94 | bwd_microstep: 1285.14 | bwd_inner_microstep: 360.71 | bwd_allreduce_microstep: 924.32 | step_microstep: 13.25 [2025-01-21 14:44:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2347.48 | bwd: 3580.43 | bwd_inner: 2654.66 | bwd_allreduce: 924.86 | step: 14.05 76%|███████▌ | 332/437 [36:31<11:16, 6.44s/it] {'loss': 0.3167, 'learning_rate': 5.779309657784786e-06, 'epoch': 0.76} 76%|███████▌ | 332/437 [36:31<11:16, 6.44s/it]dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7690 [2025-01-21 14:44:47,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.58 | bwd_microstep: 572.92 | 
bwd_inner_microstep: 572.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3682 [2025-01-21 14:44:48,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.79 | bwd_microstep: 279.46 | bwd_inner_microstep: 279.15 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2870 [2025-01-21 14:44:48,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.13 | bwd_microstep: 244.48 | bwd_inner_microstep: 244.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4993 [2025-01-21 14:44:49,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.40 | bwd_microstep: 372.33 | bwd_inner_microstep: 372.14 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:44:50,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.27 | bwd_microstep: 606.30 | bwd_inner_microstep: 606.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:44:51,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.47 | bwd_microstep: 319.17 | bwd_inner_microstep: 318.97 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, 
dynamic token length: 4092 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:44:51,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 272.78 | bwd_microstep: 304.48 | bwd_inner_microstep: 304.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3674 [2025-01-21 14:44:54,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:44:54,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 240.84 | bwd_microstep: 2103.77 | bwd_inner_microstep: 274.72 | bwd_allreduce_microstep: 1828.93 | step_microstep: 13.58 [2025-01-21 14:44:54,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2600.10 | bwd: 4803.05 | bwd_inner: 2972.83 | bwd_allreduce: 1829.44 | step: 14.39 76%|███████▌ | 333/437 [36:38<11:47, 6.80s/it] {'loss': 0.5754, 'learning_rate': 5.6752568039866754e-06, 'epoch': 0.76} 76%|███████▌ | 333/437 [36:38<11:47, 6.80s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3828 [2025-01-21 14:44:54,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.71 | bwd_microstep: 278.88 | bwd_inner_microstep: 278.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2646 [2025-01-21 14:44:55,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 181.38 | bwd_microstep: 211.79 | bwd_inner_microstep: 211.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4488 [2025-01-21 14:44:55,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.92 | 
bwd_microstep: 329.66 | bwd_inner_microstep: 329.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6293 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:44:56,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.83 | bwd_microstep: 465.51 | bwd_inner_microstep: 465.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3616 [2025-01-21 14:44:57,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.34 | bwd_microstep: 265.94 | bwd_inner_microstep: 265.46 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:44:58,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.20 | bwd_microstep: 605.30 | bwd_inner_microstep: 605.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:44:59,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.94 | bwd_microstep: 318.23 | bwd_inner_microstep: 318.07 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6316 [2025-01-21 14:45:00,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.82 | optimizer_step: 0.42 [2025-01-21 14:45:00,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.78 | bwd_microstep: 916.42 | bwd_inner_microstep: 467.03 | bwd_allreduce_microstep: 449.29 | step_microstep: 13.99 [2025-01-21 14:45:00,481] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2584.93 | bwd: 3391.92 | bwd_inner: 2941.16 | bwd_allreduce: 449.88 | step: 14.97 76%|███████▋ | 334/437 [36:45<11:21, 6.62s/it] {'loss': 0.2943, 'learning_rate': 5.5719940898452205e-06, 'epoch': 0.76} 76%|███████▋ | 334/437 [36:45<11:21, 6.62s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5935 [2025-01-21 14:45:01,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 383.40 | bwd_microstep: 431.67 | bwd_inner_microstep: 431.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3963 [2025-01-21 14:45:01,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.89 | bwd_microstep: 294.84 | bwd_inner_microstep: 294.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2574 [2025-01-21 14:45:02,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.93 | bwd_microstep: 213.58 | bwd_inner_microstep: 213.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7597 [2025-01-21 14:45:03,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.74 | bwd_microstep: 565.59 | bwd_inner_microstep: 565.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:04,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
550.27 | bwd_microstep: 606.53 | bwd_inner_microstep: 606.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:45:05,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 509.04 | bwd_microstep: 579.94 | bwd_inner_microstep: 579.68 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.22 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4775 [2025-01-21 14:45:06,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.72 | bwd_microstep: 349.01 | bwd_inner_microstep: 348.69 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.14 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7078 [2025-01-21 14:45:07,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:45:07,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.97 | bwd_microstep: 533.00 | bwd_inner_microstep: 524.50 | bwd_allreduce_microstep: 8.37 | step_microstep: 11.14 [2025-01-21 14:45:07,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3145.77 | bwd: 3574.30 | bwd_inner: 3564.60 | bwd_allreduce: 8.86 | step: 12.08 77%|███████▋ | 335/437 [36:52<11:25, 6.72s/it] {'loss': 0.376, 'learning_rate': 5.46952721123674e-06, 'epoch': 0.77} 77%|███████▋ | 335/437 [36:52<11:25, 6.72s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6837 [2025-01-21 14:45:08,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 435.77 | bwd_microstep: 509.70 | bwd_inner_microstep: 509.38 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 
896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:09,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.17 | bwd_microstep: 604.32 | bwd_inner_microstep: 604.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6838 [2025-01-21 14:45:10,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.48 | bwd_microstep: 507.24 | bwd_inner_microstep: 507.07 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2867 [2025-01-21 14:45:11,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.35 | bwd_microstep: 218.31 | bwd_inner_microstep: 218.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4174 [2025-01-21 14:45:11,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.66 | bwd_microstep: 307.51 | bwd_inner_microstep: 307.34 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:12,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 549.52 | bwd_microstep: 604.57 | bwd_inner_microstep: 604.25 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:45:13,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 387.78 | bwd_microstep: 435.91 | bwd_inner_microstep: 435.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 4107 
[2025-01-21 14:45:14,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.40 | optimizer_gradients: 0.84 | optimizer_step: 0.35 [2025-01-21 14:45:14,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 255.56 | bwd_microstep: 388.05 | bwd_inner_microstep: 304.68 | bwd_allreduce_microstep: 83.26 | step_microstep: 18.48 [2025-01-21 14:45:14,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3089.12 | bwd: 3575.73 | bwd_inner: 3491.13 | bwd_allreduce: 83.73 | step: 19.25 77%|███████▋ | 336/437 [36:59<11:24, 6.78s/it] {'loss': 0.4093, 'learning_rate': 5.367861820139995e-06, 'epoch': 0.77} 77%|███████▋ | 336/437 [36:59<11:24, 6.78s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6649 [2025-01-21 14:45:15,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.79 | bwd_microstep: 486.16 | bwd_inner_microstep: 486.00 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6350 [2025-01-21 14:45:16,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.21 | bwd_microstep: 468.21 | bwd_inner_microstep: 468.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5254 [2025-01-21 14:45:16,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.34 | bwd_microstep: 384.26 | bwd_inner_microstep: 384.08 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7886 [2025-01-21 14:45:18,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.70 | bwd_microstep: 593.05 | bwd_inner_microstep: 592.56 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7844 [2025-01-21 14:45:19,211] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.65 | bwd_microstep: 584.23 | bwd_inner_microstep: 584.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:45:19,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 254.83 | bwd_microstep: 279.05 | bwd_inner_microstep: 278.84 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 3054 [2025-01-21 14:45:20,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.34 | bwd_microstep: 228.23 | bwd_inner_microstep: 227.92 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:21,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:45:21,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.24 | bwd_microstep: 612.73 | bwd_inner_microstep: 605.11 | bwd_allreduce_microstep: 7.51 | step_microstep: 11.17 [2025-01-21 14:45:21,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3198.95 | bwd: 3636.08 | bwd_inner: 3627.04 | bwd_allreduce: 8.11 | step: 12.10 77%|███████▋ | 337/437 [37:06<11:26, 6.86s/it] {'loss': 0.4033, 'learning_rate': 5.267003524324423e-06, 'epoch': 0.77} 77%|███████▋ | 337/437 [37:06<11:26, 6.86s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3144 [2025-01-21 14:45:21,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.46 | bwd_microstep: 240.75 | bwd_inner_microstep: 240.59 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2586 [2025-01-21 14:45:22,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.96 | bwd_microstep: 207.09 | bwd_inner_microstep: 206.86 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2320 [2025-01-21 14:45:22,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.34 | bwd_microstep: 231.49 | bwd_inner_microstep: 231.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:45:23,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.73 | bwd_microstep: 607.99 | bwd_inner_microstep: 607.66 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:45:24,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 358.55 | bwd_microstep: 399.32 | bwd_inner_microstep: 399.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:45:25,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.33 | bwd_microstep: 198.29 | bwd_inner_microstep: 198.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7294 [2025-01-21 14:45:26,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.22 | bwd_microstep: 540.86 | bwd_inner_microstep: 540.69 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:27,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.64 | optimizer_step: 0.32 [2025-01-21 14:45:27,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.51 | bwd_microstep: 611.35 | bwd_inner_microstep: 603.95 | bwd_allreduce_microstep: 7.27 | step_microstep: 10.97 [2025-01-21 14:45:27,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2661.92 | bwd: 3037.26 | bwd_inner: 3028.65 | bwd_allreduce: 7.81 | step: 11.78 77%|███████▋ | 338/437 [37:12<10:51, 6.59s/it] {'loss': 0.3671, 'learning_rate': 5.166957887040849e-06, 'epoch': 0.77} 77%|███████▋ | 338/437 [37:12<10:51, 6.59s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4283 [2025-01-21 14:45:27,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.68 | bwd_microstep: 312.71 | bwd_inner_microstep: 312.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3128 [2025-01-21 14:45:28,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.36 | bwd_microstep: 243.20 | bwd_inner_microstep: 243.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5472 [2025-01-21 14:45:29,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.62 | bwd_microstep: 403.75 | bwd_inner_microstep: 403.45 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7574 warning: The size of tensor a (7884) must match the size of tensor 
b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:45:30,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.15 | bwd_microstep: 563.96 | bwd_inner_microstep: 563.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:45:30,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.71 | bwd_microstep: 318.58 | bwd_inner_microstep: 318.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:45:31,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.66 | bwd_microstep: 262.54 | bwd_inner_microstep: 262.24 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:45:32,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 254.53 | bwd_microstep: 280.09 | bwd_inner_microstep: 279.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:45:33,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:45:33,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.68 | bwd_microstep: 802.31 | bwd_inner_microstep: 614.67 | bwd_allreduce_microstep: 187.54 | step_microstep: 13.48 [2025-01-21 14:45:33,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2652.21 | bwd: 3187.27 | bwd_inner: 2998.40 | bwd_allreduce: 188.00 | step: 14.28 78%|███████▊ | 339/437 [37:18<10:30, 6.43s/it] {'loss': 0.2423, 'learning_rate': 5.067730426714583e-06, 'epoch': 0.77} 78%|███████▊ | 339/437 [37:18<10:30, 
6.43s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3285 [2025-01-21 14:45:33,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.37 | bwd_microstep: 244.38 | bwd_inner_microstep: 244.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7397 [2025-01-21 14:45:34,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.75 | bwd_microstep: 554.74 | bwd_inner_microstep: 554.52 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5229 [2025-01-21 14:45:35,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.50 | bwd_microstep: 382.20 | bwd_inner_microstep: 382.00 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5742 [2025-01-21 14:45:36,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.57 | bwd_microstep: 424.30 | bwd_inner_microstep: 424.10 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7325 [2025-01-21 14:45:37,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.72 | bwd_microstep: 545.87 | bwd_inner_microstep: 545.69 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4653 [2025-01-21 14:45:38,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.23 | bwd_microstep: 342.28 | bwd_inner_microstep: 342.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:45:39,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.82 | bwd_microstep: 538.99 | bwd_inner_microstep: 538.83 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2966 [2025-01-21 14:45:39,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:45:39,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.29 | bwd_microstep: 236.22 | bwd_inner_microstep: 228.54 | bwd_allreduce_microstep: 7.56 | step_microstep: 11.27 [2025-01-21 14:45:39,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2864.11 | bwd: 3269.12 | bwd_inner: 3260.32 | bwd_allreduce: 8.08 | step: 12.07 78%|███████▊ | 340/437 [37:24<10:21, 6.41s/it] {'loss': 0.2651, 'learning_rate': 4.969326616641052e-06, 'epoch': 0.78} 78%|███████▊ | 340/437 [37:24<10:21, 6.41s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3204 [2025-01-21 14:45:40,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.54 | bwd_microstep: 242.28 | bwd_inner_microstep: 242.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4217 [2025-01-21 14:45:40,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 276.72 | bwd_microstep: 311.31 | bwd_inner_microstep: 311.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6087 [2025-01-21 14:45:41,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.45 | bwd_microstep: 448.14 | bwd_inner_microstep: 447.97 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4156 [2025-01-21 14:45:42,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.50 | bwd_microstep: 306.14 | bwd_inner_microstep: 305.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 
warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:45:43,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.75 | bwd_microstep: 606.48 | bwd_inner_microstep: 606.20 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:45:43,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.99 | bwd_microstep: 201.50 | bwd_inner_microstep: 201.19 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:45:44,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 405.93 | bwd_microstep: 459.21 | bwd_inner_microstep: 458.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4954 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:45:46,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.75 | optimizer_step: 0.34 [2025-01-21 14:45:46,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 316.44 | bwd_microstep: 1494.51 | bwd_inner_microstep: 362.38 | bwd_allreduce_microstep: 1132.02 | step_microstep: 13.20 [2025-01-21 14:45:46,645] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2582.17 | bwd: 4069.72 | bwd_inner: 2936.32 | bwd_allreduce: 1132.51 | step: 14.01 78%|███████▊ | 341/437 [37:31<10:28, 6.55s/it] {'loss': 0.5764, 'learning_rate': 4.871751884683895e-06, 'epoch': 0.78} 78%|███████▊ | 341/437 [37:31<10:28, 6.55s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4406 [2025-01-21 14:45:47,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.15 | bwd_microstep: 321.88 | bwd_inner_microstep: 321.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3772 [2025-01-21 14:45:47,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.06 | bwd_microstep: 279.20 | bwd_inner_microstep: 279.03 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4847 [2025-01-21 14:45:48,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.70 | bwd_microstep: 354.49 | bwd_inner_microstep: 354.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7133 [2025-01-21 14:45:49,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.40 | bwd_microstep: 527.20 | bwd_inner_microstep: 527.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6285 [2025-01-21 14:45:50,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.96 | bwd_microstep: 466.20 | bwd_inner_microstep: 466.03 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7595 [2025-01-21 14:45:51,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.59 | bwd_microstep: 565.59 | bwd_inner_microstep: 565.43 | bwd_allreduce_microstep: 0.06 | 
step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4911 [2025-01-21 14:45:52,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.41 | bwd_microstep: 359.82 | bwd_inner_microstep: 359.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6134 [2025-01-21 14:45:53,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:45:53,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 396.33 | bwd_microstep: 454.12 | bwd_inner_microstep: 446.44 | bwd_allreduce_microstep: 7.46 | step_microstep: 11.19 [2025-01-21 14:45:53,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2917.44 | bwd: 3328.62 | bwd_inner: 3319.88 | bwd_allreduce: 7.94 | step: 11.99 78%|███████▊ | 342/437 [37:37<10:19, 6.52s/it] {'loss': 0.3681, 'learning_rate': 4.775011612975562e-06, 'epoch': 0.78} 78%|███████▊ | 342/437 [37:37<10:19, 6.52s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6293 [2025-01-21 14:45:54,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.92 | bwd_microstep: 464.28 | bwd_inner_microstep: 464.09 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2707 [2025-01-21 14:45:54,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.56 | bwd_microstep: 210.51 | bwd_inner_microstep: 210.28 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7127 [2025-01-21 14:45:55,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.08 | bwd_microstep: 528.60 | bwd_inner_microstep: 528.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 16, images 
per sample: 16.0, dynamic token length: 4446 [2025-01-21 14:45:56,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.78 | bwd_microstep: 324.34 | bwd_inner_microstep: 324.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5234 [2025-01-21 14:45:56,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.59 | bwd_microstep: 382.96 | bwd_inner_microstep: 382.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7049 [2025-01-21 14:45:57,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.09 | bwd_microstep: 524.85 | bwd_inner_microstep: 524.57 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:45:58,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 255.12 | bwd_microstep: 281.67 | bwd_inner_microstep: 281.44 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:45:59,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.81 | optimizer_step: 0.34 [2025-01-21 14:45:59,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 389.33 | bwd_microstep: 911.86 | bwd_inner_microstep: 434.59 | bwd_allreduce_microstep: 477.10 | step_microstep: 13.82 [2025-01-21 14:45:59,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2775.32 | bwd: 3629.19 | bwd_inner: 3150.79 | bwd_allreduce: 477.61 | step: 14.64 78%|███████▊ | 343/437 [37:44<10:16, 6.56s/it] 
{'loss': 0.3841, 'learning_rate': 4.679111137620442e-06, 'epoch': 0.78} 78%|███████▊ | 343/437 [37:44<10:16, 6.56s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:46:00,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.46 | bwd_microstep: 605.05 | bwd_inner_microstep: 604.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2460 [2025-01-21 14:46:01,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.96 | bwd_microstep: 202.83 | bwd_inner_microstep: 202.66 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2881 [2025-01-21 14:46:01,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.80 | bwd_microstep: 219.64 | bwd_inner_microstep: 219.33 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3653 [2025-01-21 14:46:02,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.47 | bwd_microstep: 273.59 | bwd_inner_microstep: 273.39 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4713 [2025-01-21 14:46:02,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.48 | bwd_microstep: 344.68 | bwd_inner_microstep: 344.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 
17, images per sample: 17.0, dynamic token length: 4667 [2025-01-21 14:46:03,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.43 | bwd_microstep: 343.17 | bwd_inner_microstep: 343.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:46:04,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.57 | bwd_microstep: 615.26 | bwd_inner_microstep: 615.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8118 [2025-01-21 14:46:07,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.72 | optimizer_step: 0.35 [2025-01-21 14:46:07,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 520.74 | bwd_microstep: 1608.87 | bwd_inner_microstep: 606.01 | bwd_allreduce_microstep: 1002.76 | step_microstep: 12.92 [2025-01-21 14:46:07,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2819.76 | bwd: 4213.21 | bwd_inner: 3209.21 | bwd_allreduce: 1003.22 | step: 13.69 79%|███████▊ | 344/437 [37:51<10:29, 6.77s/it] {'loss': 0.2937, 'learning_rate': 4.5840557484005355e-06, 'epoch': 0.79} 79%|███████▊ | 344/437 [37:51<10:29, 6.77s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7663 [2025-01-21 14:46:08,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.41 | bwd_microstep: 569.16 | bwd_inner_microstep: 568.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4999 [2025-01-21 14:46:08,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.47 | bwd_microstep: 367.72 | bwd_inner_microstep: 367.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token 
length: 7070 [2025-01-21 14:46:09,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.67 | bwd_microstep: 523.32 | bwd_inner_microstep: 523.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6780 [2025-01-21 14:46:10,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.37 | bwd_microstep: 498.63 | bwd_inner_microstep: 498.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5949 [2025-01-21 14:46:11,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.11 | bwd_microstep: 434.44 | bwd_inner_microstep: 434.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:46:12,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.07 | bwd_microstep: 606.86 | bwd_inner_microstep: 606.66 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:46:13,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 222.89 | bwd_microstep: 247.72 | bwd_inner_microstep: 247.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5049 [2025-01-21 14:46:14,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.70 | optimizer_step: 0.37 [2025-01-21 14:46:14,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.70 | bwd_microstep: 376.72 | bwd_inner_microstep: 369.11 | 
bwd_allreduce_microstep: 7.51 | step_microstep: 11.24 [2025-01-21 14:46:14,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3191.54 | bwd: 3624.69 | bwd_inner: 3616.01 | bwd_allreduce: 7.99 | step: 12.06 79%|███████▉ | 345/437 [37:58<10:30, 6.85s/it] {'loss': 0.2957, 'learning_rate': 4.4898506884836565e-06, 'epoch': 0.79} 79%|███████▉ | 345/437 [37:58<10:30, 6.85s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3941 [2025-01-21 14:46:14,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.95 | bwd_microstep: 291.90 | bwd_inner_microstep: 291.58 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3134 [2025-01-21 14:46:15,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.87 | bwd_microstep: 242.88 | bwd_inner_microstep: 242.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4969 [2025-01-21 14:46:15,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.12 | bwd_microstep: 362.59 | bwd_inner_microstep: 362.37 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3081 [2025-01-21 14:46:16,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 213.87 | bwd_microstep: 241.14 | bwd_inner_microstep: 240.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2769 [2025-01-21 14:46:16,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 188.60 | bwd_microstep: 218.58 | bwd_inner_microstep: 218.41 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:46:17,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 203.41 | bwd_microstep: 221.99 | bwd_inner_microstep: 221.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:46:17,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.10 | bwd_microstep: 203.09 | bwd_inner_microstep: 202.85 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7189 [2025-01-21 14:46:19,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.82 | optimizer_step: 0.38 [2025-01-21 14:46:19,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.71 | bwd_microstep: 1613.39 | bwd_inner_microstep: 534.36 | bwd_allreduce_microstep: 1078.92 | step_microstep: 13.34 [2025-01-21 14:46:19,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2007.43 | bwd: 3395.71 | bwd_inner: 2315.48 | bwd_allreduce: 1079.43 | step: 14.13 79%|███████▉ | 346/437 [38:04<09:49, 6.48s/it] {'loss': 0.2435, 'learning_rate': 4.3965011541342606e-06, 'epoch': 0.79} 79%|███████▉ | 346/437 [38:04<09:49, 6.48s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5493 [2025-01-21 14:46:20,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 354.93 | bwd_microstep: 404.69 | bwd_inner_microstep: 404.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3246 [2025-01-21 14:46:20,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.88 | bwd_microstep: 244.11 | bwd_inner_microstep: 243.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7661 [2025-01-21 14:46:22,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.86 | bwd_microstep: 568.84 | 
bwd_inner_microstep: 568.68 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2556 [2025-01-21 14:46:22,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.66 | bwd_microstep: 205.74 | bwd_inner_microstep: 205.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6540 [2025-01-21 14:46:23,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.84 | bwd_microstep: 483.56 | bwd_inner_microstep: 483.35 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2806 [2025-01-21 14:46:23,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 189.83 | bwd_microstep: 208.99 | bwd_inner_microstep: 208.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4358 [2025-01-21 14:46:24,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 290.42 | bwd_microstep: 319.53 | bwd_inner_microstep: 319.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:46:26,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.83 | optimizer_step: 0.36 [2025-01-21 14:46:26,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 543.38 | bwd_microstep: 1418.82 | bwd_inner_microstep: 606.28 | bwd_allreduce_microstep: 812.43 | step_microstep: 13.82 [2025-01-21 14:46:26,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2689.64 | bwd: 
3854.40 | bwd_inner: 3040.77 | bwd_allreduce: 812.91 | step: 14.62 79%|███████▉ | 347/437 [38:11<09:51, 6.57s/it] {'loss': 0.3995, 'learning_rate': 4.304012294426781e-06, 'epoch': 0.79} 79%|███████▉ | 347/437 [38:11<09:51, 6.57s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4400 [2025-01-21 14:46:27,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 280.39 | bwd_microstep: 320.37 | bwd_inner_microstep: 320.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:46:28,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.33 | bwd_microstep: 614.55 | bwd_inner_microstep: 614.22 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4548 [2025-01-21 14:46:28,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.74 | bwd_microstep: 334.22 | bwd_inner_microstep: 334.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4484 [2025-01-21 14:46:29,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 296.44 | bwd_microstep: 332.68 | bwd_inner_microstep: 332.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6025 [2025-01-21 14:46:30,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.51 | bwd_microstep: 439.90 | bwd_inner_microstep: 439.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8143 [2025-01-21 14:46:31,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.17 | bwd_microstep: 609.88 | bwd_inner_microstep: 609.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch 
size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:46:32,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 255.20 | bwd_microstep: 280.01 | bwd_inner_microstep: 279.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:46:33,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:46:33,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.95 | bwd_microstep: 569.23 | bwd_inner_microstep: 561.34 | bwd_allreduce_microstep: 7.72 | step_microstep: 11.26 [2025-01-21 14:46:33,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3072.55 | bwd: 3500.97 | bwd_inner: 3492.01 | bwd_allreduce: 8.18 | step: 12.07 80%|███████▉ | 348/437 [38:17<09:50, 6.64s/it] {'loss': 0.2983, 'learning_rate': 4.212389210961629e-06, 'epoch': 0.8} 80%|███████▉ | 348/437 [38:17<09:50, 6.64s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3875 [2025-01-21 14:46:33,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 248.05 | bwd_microstep: 286.16 | bwd_inner_microstep: 286.00 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5044 [2025-01-21 14:46:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.12 | bwd_microstep: 368.05 | bwd_inner_microstep: 367.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5524 [2025-01-21 14:46:35,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.12 | bwd_microstep: 408.77 | bwd_inner_microstep: 408.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 
6537 [2025-01-21 14:46:36,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.78 | bwd_microstep: 480.68 | bwd_inner_microstep: 480.36 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5990 [2025-01-21 14:46:37,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.00 | bwd_microstep: 435.35 | bwd_inner_microstep: 435.17 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:46:37,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 374.21 | bwd_microstep: 415.65 | bwd_inner_microstep: 415.49 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:46:39,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.10 | bwd_microstep: 607.77 | bwd_inner_microstep: 607.48 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.15 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:46:40,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.68 | optimizer_step: 0.36 [2025-01-21 14:46:40,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 534.30 | bwd_microstep: 828.47 | bwd_inner_microstep: 613.51 | bwd_allreduce_microstep: 214.79 | step_microstep: 13.45 [2025-01-21 14:46:40,525] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3209.52 | bwd: 3831.03 | bwd_inner: 3614.94 | bwd_allreduce: 215.26 | step: 14.26 80%|███████▉ | 349/437 [38:25<10:00, 6.83s/it] {'loss': 0.4387, 'learning_rate': 4.121636957583805e-06, 'epoch': 0.8} 80%|███████▉ | 349/437 [38:25<10:00, 6.83s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5442 [2025-01-21 14:46:41,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 354.01 | bwd_microstep: 398.87 | bwd_inner_microstep: 398.71 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2636 [2025-01-21 14:46:41,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.38 | bwd_microstep: 203.90 | bwd_inner_microstep: 203.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6327 [2025-01-21 14:46:42,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.73 | bwd_microstep: 470.22 | bwd_inner_microstep: 470.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6529 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:46:43,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.57 | bwd_microstep: 482.68 | bwd_inner_microstep: 482.39 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2271 [2025-01-21 14:46:43,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 163.50 | bwd_microstep: 202.54 | bwd_inner_microstep: 202.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 
15.0, dynamic token length: 4092 [2025-01-21 14:46:44,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.46 | bwd_microstep: 302.14 | bwd_inner_microstep: 301.97 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3566 [2025-01-21 14:46:45,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.31 | bwd_microstep: 266.16 | bwd_inner_microstep: 265.95 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:46:46,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.78 | optimizer_step: 0.34 [2025-01-21 14:46:46,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 538.04 | bwd_microstep: 1087.74 | bwd_inner_microstep: 615.44 | bwd_allreduce_microstep: 472.12 | step_microstep: 13.84 [2025-01-21 14:46:46,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2574.83 | bwd: 3414.39 | bwd_inner: 2940.98 | bwd_allreduce: 472.60 | step: 14.68 80%|████████ | 350/437 [38:31<09:38, 6.65s/it] {'loss': 0.3402, 'learning_rate': 4.031760540104115e-06, 'epoch': 0.8} 80%|████████ | 350/437 [38:31<09:38, 6.65s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4072 [2025-01-21 14:46:47,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.07 | bwd_microstep: 302.98 | bwd_inner_microstep: 302.48 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3168 [2025-01-21 14:46:47,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.62 | bwd_microstep: 243.65 | bwd_inner_microstep: 243.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:46:49,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.18 | bwd_microstep: 607.28 | bwd_inner_microstep: 607.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7081 [2025-01-21 14:46:50,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.14 | bwd_microstep: 525.73 | bwd_inner_microstep: 525.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5990 [2025-01-21 14:46:50,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 389.26 | bwd_microstep: 435.80 | bwd_inner_microstep: 435.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3857 [2025-01-21 14:46:51,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.42 | bwd_microstep: 286.74 | bwd_inner_microstep: 286.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3586 [2025-01-21 14:46:51,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 238.51 | bwd_microstep: 268.49 | bwd_inner_microstep: 268.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6824 [2025-01-21 14:46:52,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.80 | optimizer_step: 
0.35 [2025-01-21 14:46:52,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 442.44 | bwd_microstep: 514.53 | bwd_inner_microstep: 506.66 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.84 [2025-01-21 14:46:52,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2818.47 | bwd: 3185.36 | bwd_inner: 3176.13 | bwd_allreduce: 8.36 | step: 12.82 80%|████████ | 351/437 [38:37<09:20, 6.52s/it] {'loss': 0.3425, 'learning_rate': 3.942764916023067e-06, 'epoch': 0.8} 80%|████████ | 351/437 [38:37<09:20, 6.52s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6514 [2025-01-21 14:46:53,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 416.19 | bwd_microstep: 478.92 | bwd_inner_microstep: 478.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4825 [2025-01-21 14:46:54,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 312.92 | bwd_microstep: 353.01 | bwd_inner_microstep: 352.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5043 [2025-01-21 14:46:55,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.23 | bwd_microstep: 367.61 | bwd_inner_microstep: 367.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4461 [2025-01-21 14:46:55,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 294.26 | bwd_microstep: 326.79 | bwd_inner_microstep: 326.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8161 
[2025-01-21 14:46:57,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 528.96 | bwd_microstep: 608.87 | bwd_inner_microstep: 608.57 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7357 [2025-01-21 14:46:58,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.27 | bwd_microstep: 547.64 | bwd_inner_microstep: 547.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4093 [2025-01-21 14:46:58,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.59 | bwd_microstep: 303.74 | bwd_inner_microstep: 303.57 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:47:00,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.80 | optimizer_step: 0.34 [2025-01-21 14:47:00,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.38 | bwd_microstep: 1153.55 | bwd_inner_microstep: 227.71 | bwd_allreduce_microstep: 925.72 | step_microstep: 13.42 [2025-01-21 14:47:00,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2814.65 | bwd: 4140.25 | bwd_inner: 3213.30 | bwd_allreduce: 926.18 | step: 14.21 81%|████████ | 352/437 [38:44<09:31, 6.72s/it] {'loss': 0.34, 'learning_rate': 3.854654994257412e-06, 'epoch': 0.8} 81%|████████ | 352/437 [38:44<09:31, 6.72s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6207 [2025-01-21 14:47:01,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 399.75 | bwd_microstep: 461.33 | bwd_inner_microstep: 461.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3163 [2025-01-21 14:47:01,543] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.66 | bwd_microstep: 243.80 | bwd_inner_microstep: 243.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3148 [2025-01-21 14:47:02,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.26 | bwd_microstep: 242.53 | bwd_inner_microstep: 242.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6039 [2025-01-21 14:47:02,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.05 | bwd_microstep: 443.18 | bwd_inner_microstep: 443.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6264 [2025-01-21 14:47:03,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.14 | bwd_microstep: 462.10 | bwd_inner_microstep: 461.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3863 [2025-01-21 14:47:04,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.03 | bwd_microstep: 287.45 | bwd_inner_microstep: 287.29 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2923 [2025-01-21 14:47:04,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 186.62 | bwd_microstep: 222.03 | bwd_inner_microstep: 221.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7471 [2025-01-21 14:47:05,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | optimizer_gradients: 0.69 | optimizer_step: 0.34 [2025-01-21 14:47:05,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.30 | bwd_microstep: 565.00 | 
bwd_inner_microstep: 557.17 | bwd_allreduce_microstep: 7.60 | step_microstep: 11.03 [2025-01-21 14:47:05,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2561.64 | bwd: 2927.54 | bwd_inner: 2918.69 | bwd_allreduce: 8.03 | step: 11.82 81%|████████ | 353/437 [38:50<08:59, 6.42s/it] {'loss': 0.3359, 'learning_rate': 3.7674356348693764e-06, 'epoch': 0.81} 81%|████████ | 353/437 [38:50<08:59, 6.42s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6424 [2025-01-21 14:47:06,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 413.44 | bwd_microstep: 475.62 | bwd_inner_microstep: 475.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6845 [2025-01-21 14:47:07,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.56 | bwd_microstep: 506.16 | bwd_inner_microstep: 505.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4153 [2025-01-21 14:47:08,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.24 | bwd_microstep: 306.47 | bwd_inner_microstep: 306.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3606 [2025-01-21 14:47:08,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.29 | bwd_microstep: 270.89 | bwd_inner_microstep: 270.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:10,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.75 | bwd_microstep: 610.60 | bwd_inner_microstep: 610.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:47:10,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.97 | bwd_microstep: 204.25 | bwd_inner_microstep: 204.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6812 [2025-01-21 14:47:11,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 435.85 | bwd_microstep: 505.52 | bwd_inner_microstep: 505.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5069 [2025-01-21 14:47:12,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.78 | optimizer_step: 0.34 [2025-01-21 14:47:12,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.63 | bwd_microstep: 773.18 | bwd_inner_microstep: 371.93 | bwd_allreduce_microstep: 401.13 | step_microstep: 13.68 [2025-01-21 14:47:12,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2873.57 | bwd: 3652.81 | bwd_inner: 3250.49 | bwd_allreduce: 401.61 | step: 14.50 81%|████████ | 354/437 [38:57<09:01, 6.52s/it] {'loss': 0.393, 'learning_rate': 3.681111648798592e-06, 'epoch': 0.81} 81%|████████ | 354/437 [38:57<09:01, 6.52s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4547 [2025-01-21 14:47:13,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 286.16 | bwd_microstep: 333.75 | bwd_inner_microstep: 333.58 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 30, images per 
sample: 30.0, dynamic token length: 8192 [2025-01-21 14:47:14,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.87 | bwd_microstep: 613.48 | bwd_inner_microstep: 613.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6009 [2025-01-21 14:47:15,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.92 | bwd_microstep: 436.70 | bwd_inner_microstep: 436.40 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7585 [2025-01-21 14:47:16,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.20 | bwd_microstep: 566.57 | bwd_inner_microstep: 566.26 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3861 [2025-01-21 14:47:16,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 258.81 | bwd_microstep: 285.54 | bwd_inner_microstep: 285.23 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:18,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.38 | bwd_microstep: 605.82 | bwd_inner_microstep: 605.66 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:47:19,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 388.55 | bwd_microstep: 435.36 | bwd_inner_microstep: 435.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4806 [2025-01-21 
14:47:19,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:47:19,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.79 | bwd_microstep: 359.28 | bwd_inner_microstep: 351.49 | bwd_allreduce_microstep: 7.68 | step_microstep: 11.22 [2025-01-21 14:47:19,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3214.53 | bwd: 3636.61 | bwd_inner: 3627.58 | bwd_allreduce: 8.14 | step: 12.03 81%|████████ | 355/437 [39:04<09:08, 6.69s/it] {'loss': 0.4085, 'learning_rate': 3.5956877975967163e-06, 'epoch': 0.81} 81%|████████ | 355/437 [39:04<09:08, 6.69s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4316 [2025-01-21 14:47:20,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 279.08 | bwd_microstep: 315.65 | bwd_inner_microstep: 315.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3924 [2025-01-21 14:47:20,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.88 | bwd_microstep: 294.36 | bwd_inner_microstep: 294.17 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2555 [2025-01-21 14:47:21,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.90 | bwd_microstep: 209.80 | bwd_inner_microstep: 209.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7297 [2025-01-21 14:47:22,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.38 | bwd_microstep: 547.11 | bwd_inner_microstep: 546.91 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7826 [2025-01-21 14:47:23,510] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 511.97 | bwd_microstep: 584.93 | bwd_inner_microstep: 584.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:47:24,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.20 | bwd_microstep: 498.69 | bwd_inner_microstep: 498.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 5063 [2025-01-21 14:47:25,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.09 | bwd_microstep: 371.21 | bwd_inner_microstep: 371.00 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5940 [2025-01-21 14:47:26,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.75 | optimizer_step: 0.36 [2025-01-21 14:47:26,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 383.84 | bwd_microstep: 442.58 | bwd_inner_microstep: 434.62 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.63 [2025-01-21 14:47:26,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2854.16 | bwd: 3264.49 | bwd_inner: 3255.46 | bwd_allreduce: 8.28 | step: 12.56 81%|████████▏ | 356/437 [39:10<08:53, 6.59s/it] {'loss': 0.2618, 'learning_rate': 3.5111687931647984e-06, 'epoch': 0.81} 81%|████████▏ | 356/437 [39:10<08:53, 6.59s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6263 [2025-01-21 14:47:26,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 403.94 | bwd_microstep: 461.84 | bwd_inner_microstep: 461.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6178 [2025-01-21 14:47:27,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.76 | bwd_microstep: 
459.61 | bwd_inner_microstep: 459.39 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6546 [2025-01-21 14:47:28,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.93 | bwd_microstep: 482.67 | bwd_inner_microstep: 482.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3625 [2025-01-21 14:47:29,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.15 | bwd_microstep: 269.88 | bwd_inner_microstep: 269.56 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5986 [2025-01-21 14:47:30,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.66 | bwd_microstep: 435.79 | bwd_inner_microstep: 435.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7824 [2025-01-21 14:47:31,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.58 | bwd_microstep: 583.95 | bwd_inner_microstep: 583.80 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:47:32,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 388.48 | bwd_microstep: 432.42 | bwd_inner_microstep: 432.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 5421 [2025-01-21 14:47:33,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.39 | optimizer_gradients: 0.86 | optimizer_step: 0.39 [2025-01-21 14:47:33,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 337.40 | bwd_microstep: 617.71 | bwd_inner_microstep: 401.35 | bwd_allreduce_microstep: 216.14 | step_microstep: 18.78 
[2025-01-21 14:47:33,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3100.73 | bwd: 3744.00 | bwd_inner: 3526.47 | bwd_allreduce: 216.63 | step: 19.57 82%|████████▏ | 357/437 [39:17<08:58, 6.73s/it] {'loss': 0.2674, 'learning_rate': 3.427559297493359e-06, 'epoch': 0.82} 82%|████████▏ | 357/437 [39:17<08:58, 6.73s/it]dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:47:34,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.82 | bwd_microstep: 606.44 | bwd_inner_microstep: 606.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.17 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3370 [2025-01-21 14:47:34,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.92 | bwd_microstep: 254.28 | bwd_inner_microstep: 254.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6528 [2025-01-21 14:47:35,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 430.29 | bwd_microstep: 482.21 | bwd_inner_microstep: 482.02 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:36,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.54 | 
bwd_microstep: 607.84 | bwd_inner_microstep: 607.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2540 [2025-01-21 14:47:37,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.57 | bwd_microstep: 211.00 | bwd_inner_microstep: 210.76 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:47:38,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.70 | bwd_microstep: 579.84 | bwd_inner_microstep: 579.53 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6763 [2025-01-21 14:47:39,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 433.08 | bwd_microstep: 499.20 | bwd_inner_microstep: 498.89 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2606 [2025-01-21 14:47:39,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:47:39,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.12 | bwd_microstep: 237.06 | bwd_inner_microstep: 229.32 | bwd_allreduce_microstep: 7.62 | step_microstep: 11.67 [2025-01-21 14:47:39,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3045.85 | bwd: 3478.00 | bwd_inner: 3468.98 | bwd_allreduce: 8.15 | step: 12.51 82%|████████▏ | 358/437 [39:24<08:52, 6.74s/it] {'loss': 0.251, 'learning_rate': 3.3448639224052703e-06, 'epoch': 0.82} 82%|████████▏ | 358/437 [39:24<08:52, 6.74s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8163 [2025-01-21 14:47:41,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 519.93 | bwd_microstep: 609.02 | bwd_inner_microstep: 608.86 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7434 [2025-01-21 14:47:42,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 483.30 | bwd_microstep: 555.09 | bwd_inner_microstep: 554.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4471 [2025-01-21 14:47:42,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.77 | bwd_microstep: 326.68 | bwd_inner_microstep: 326.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6825 [2025-01-21 14:47:43,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.82 | bwd_microstep: 504.49 | bwd_inner_microstep: 504.28 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6501 [2025-01-21 14:47:44,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.43 | bwd_microstep: 477.94 | bwd_inner_microstep: 477.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4370 [2025-01-21 14:47:45,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 289.77 | bwd_microstep: 321.08 | bwd_inner_microstep: 320.77 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:47:46,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 305.67 | bwd_microstep: 339.33 | bwd_inner_microstep: 339.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7182 [2025-01-21 14:47:47,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.96 | 
optimizer_gradients: 0.75 | optimizer_step: 0.34 [2025-01-21 14:47:47,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.04 | bwd_microstep: 545.69 | bwd_inner_microstep: 535.19 | bwd_allreduce_microstep: 10.36 | step_microstep: 13.37 [2025-01-21 14:47:47,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3220.55 | bwd: 3679.47 | bwd_inner: 3667.80 | bwd_allreduce: 10.84 | step: 14.16 82%|████████▏ | 359/437 [39:31<08:55, 6.86s/it] {'loss': 0.2529, 'learning_rate': 3.2630872293013403e-06, 'epoch': 0.82} 82%|████████▏ | 359/437 [39:31<08:55, 6.86s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3510 [2025-01-21 14:47:47,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.42 | bwd_microstep: 262.37 | bwd_inner_microstep: 262.21 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3943 [2025-01-21 14:47:48,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.10 | bwd_microstep: 295.53 | bwd_inner_microstep: 295.30 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2856 [2025-01-21 14:47:48,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 201.09 | bwd_microstep: 215.28 | bwd_inner_microstep: 214.93 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.13 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6235 [2025-01-21 14:47:49,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.64 | bwd_microstep: 463.76 | bwd_inner_microstep: 463.57 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:47:50,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.94 | bwd_microstep: 417.17 | bwd_inner_microstep: 416.98 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:47:50,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.67 | bwd_microstep: 200.72 | bwd_inner_microstep: 200.40 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:47:51,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.22 | bwd_microstep: 461.71 | bwd_inner_microstep: 461.42 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.19 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7360 [2025-01-21 14:47:52,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:47:52,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.21 | bwd_microstep: 557.68 | bwd_inner_microstep: 550.07 | bwd_allreduce_microstep: 7.51 | step_microstep: 11.34 [2025-01-21 14:47:52,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2538.11 | bwd: 2874.37 | bwd_inner: 2865.36 | bwd_allreduce: 8.05 | step: 12.26 82%|████████▏ | 360/437 [39:37<08:20, 6.50s/it] {'loss': 0.1951, 'learning_rate': 3.182233728908741e-06, 'epoch': 0.82} 82%|████████▏ | 360/437 [39:37<08:20, 6.50s/it]warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:53,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.69 | 
bwd_microstep: 603.77 | bwd_inner_microstep: 603.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7773 [2025-01-21 14:47:55,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 499.00 | bwd_microstep: 578.88 | bwd_inner_microstep: 578.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2305 [2025-01-21 14:47:55,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.80 | bwd_microstep: 208.35 | bwd_inner_microstep: 207.91 | bwd_allreduce_microstep: 0.17 | step_microstep: 0.26 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:56,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.74 | bwd_microstep: 606.52 | bwd_inner_microstep: 606.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5983 [2025-01-21 14:47:57,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.89 | bwd_microstep: 435.27 | bwd_inner_microstep: 435.06 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:47:57,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.17 | bwd_microstep: 197.00 | bwd_inner_microstep: 196.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:47:59,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 554.79 | bwd_microstep: 607.21 | bwd_inner_microstep: 606.89 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8064 [2025-01-21 14:48:00,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.72 | optimizer_step: 0.35 [2025-01-21 14:48:00,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.96 | bwd_microstep: 612.14 | bwd_inner_microstep: 604.35 | bwd_allreduce_microstep: 7.66 | step_microstep: 11.26 [2025-01-21 14:48:00,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3401.91 | bwd: 3849.31 | bwd_inner: 3840.09 | bwd_allreduce: 8.25 | step: 12.21 83%|████████▎ | 361/437 [39:44<08:36, 6.80s/it] {'loss': 0.3467, 'learning_rate': 3.102307881032165e-06, 'epoch': 0.83} 83%|████████▎ | 361/437 [39:44<08:36, 6.80s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4587 [2025-01-21 14:48:00,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.55 | bwd_microstep: 336.22 | bwd_inner_microstep: 336.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2732 [2025-01-21 14:48:01,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.78 | bwd_microstep: 206.58 | bwd_inner_microstep: 206.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:48:02,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.30 | bwd_microstep: 606.21 | bwd_inner_microstep: 606.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5221 [2025-01-21 14:48:03,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.40 | bwd_microstep: 381.43 | bwd_inner_microstep: 381.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5450 [2025-01-21 14:48:04,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 361.86 | bwd_microstep: 401.38 | bwd_inner_microstep: 401.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:48:04,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.30 | bwd_microstep: 459.67 | bwd_inner_microstep: 459.46 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7086 [2025-01-21 14:48:05,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 450.28 | bwd_microstep: 525.11 | bwd_inner_microstep: 524.89 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4816 [2025-01-21 14:48:06,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.26 | optimizer_gradients: 0.74 | optimizer_step: 0.33 [2025-01-21 14:48:06,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 312.39 | bwd_microstep: 360.35 | bwd_inner_microstep: 352.75 | bwd_allreduce_microstep: 7.50 | step_microstep: 11.37 [2025-01-21 14:48:06,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2906.69 | bwd: 3277.07 | bwd_inner: 3268.34 | bwd_allreduce: 8.00 | step: 12.15 83%|████████▎ | 362/437 [39:51<08:21, 6.68s/it] {'loss': 0.3827, 'learning_rate': 3.023314094307859e-06, 'epoch': 0.83} 83%|████████▎ | 362/437 [39:51<08:21, 6.68s/it]dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7990 [2025-01-21 14:48:07,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 518.17 | bwd_microstep: 595.97 | bwd_inner_microstep: 595.64 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4756 [2025-01-21 14:48:08,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 311.40 | bwd_microstep: 349.85 | bwd_inner_microstep: 349.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3372 [2025-01-21 14:48:08,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.84 | bwd_microstep: 255.79 | bwd_inner_microstep: 255.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4960 [2025-01-21 14:48:09,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.43 | bwd_microstep: 361.94 | bwd_inner_microstep: 361.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3092 [2025-01-21 14:48:10,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.27 | bwd_microstep: 241.80 | bwd_inner_microstep: 241.47 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 
[2025-01-21 14:48:11,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 454.65 | bwd_microstep: 520.49 | bwd_inner_microstep: 520.31 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:48:11,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.98 | bwd_microstep: 266.91 | bwd_inner_microstep: 266.70 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:48:12,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.63 | optimizer_step: 0.33 [2025-01-21 14:48:12,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.47 | bwd_microstep: 612.21 | bwd_inner_microstep: 604.59 | bwd_allreduce_microstep: 7.50 | step_microstep: 10.98 [2025-01-21 14:48:12,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2782.04 | bwd: 3205.09 | bwd_inner: 3196.18 | bwd_allreduce: 7.99 | step: 11.83 83%|████████▎ | 363/437 [39:57<08:04, 6.54s/it] {'loss': 0.2283, 'learning_rate': 2.9452567259604215e-06, 'epoch': 0.83} 83%|████████▎ | 363/437 [39:57<08:04, 6.54s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4443 [2025-01-21 14:48:13,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 281.89 | bwd_microstep: 324.38 | bwd_inner_microstep: 324.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5349 [2025-01-21 14:48:14,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 345.11 | bwd_microstep: 387.17 | bwd_inner_microstep: 387.00 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the 
size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:48:15,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.72 | bwd_microstep: 605.24 | bwd_inner_microstep: 605.08 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2669 [2025-01-21 14:48:15,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.55 | bwd_microstep: 214.61 | bwd_inner_microstep: 214.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5999 [2025-01-21 14:48:16,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.47 | bwd_microstep: 436.25 | bwd_inner_microstep: 436.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2289 [2025-01-21 14:48:17,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.88 | bwd_microstep: 207.97 | bwd_inner_microstep: 207.65 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:48:18,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 474.47 | bwd_microstep: 540.18 | bwd_inner_microstep: 539.86 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:48:19,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.77 | optimizer_step: 0.34 [2025-01-21 14:48:19,543] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.76 | bwd_microstep: 861.93 | bwd_inner_microstep: 616.25 | bwd_allreduce_microstep: 245.53 | step_microstep: 13.69 [2025-01-21 14:48:19,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2917.71 | bwd: 3577.84 | bwd_inner: 3330.99 | bwd_allreduce: 245.99 | step: 14.44 83%|████████▎ | 364/437 [40:04<08:01, 6.60s/it] {'loss': 0.3542, 'learning_rate': 2.868140081562487e-06, 'epoch': 0.83} 83%|████████▎ | 364/437 [40:04<08:01, 6.60s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7554 [2025-01-21 14:48:20,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 483.82 | bwd_microstep: 565.42 | bwd_inner_microstep: 565.26 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4562 [2025-01-21 14:48:21,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.72 | bwd_microstep: 336.14 | bwd_inner_microstep: 335.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3137 [2025-01-21 14:48:21,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.35 | bwd_microstep: 243.52 | bwd_inner_microstep: 243.30 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2829 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:48:22,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 200.44 | bwd_microstep: 215.59 | bwd_inner_microstep: 215.38 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6529 [2025-01-21 14:48:23,158] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.73 | bwd_microstep: 481.78 | bwd_inner_microstep: 481.57 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:48:23,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.11 | bwd_microstep: 261.14 | bwd_inner_microstep: 260.98 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4018 [2025-01-21 14:48:24,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.00 | bwd_microstep: 305.01 | bwd_inner_microstep: 304.68 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5755 [2025-01-21 14:48:26,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.78 | optimizer_step: 0.34 [2025-01-21 14:48:26,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.49 | bwd_microstep: 1413.62 | bwd_inner_microstep: 424.74 | bwd_allreduce_microstep: 988.76 | step_microstep: 13.66 [2025-01-21 14:48:26,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2501.49 | bwd: 3822.35 | bwd_inner: 2832.18 | bwd_allreduce: 989.28 | step: 14.51 84%|████████▎ | 365/437 [40:10<07:54, 6.59s/it] {'loss': 0.2871, 'learning_rate': 2.791968414797217e-06, 'epoch': 0.83} 84%|████████▎ | 365/437 [40:10<07:54, 6.59s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6685 [2025-01-21 14:48:27,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 429.83 | bwd_microstep: 493.89 | bwd_inner_microstep: 493.59 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5305 [2025-01-21 14:48:27,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 344.34 | bwd_microstep: 386.06 | bwd_inner_microstep: 385.87 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5025 [2025-01-21 14:48:28,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.22 | bwd_microstep: 368.54 | bwd_inner_microstep: 368.34 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:48:29,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.09 | bwd_microstep: 608.58 | bwd_inner_microstep: 608.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6283 [2025-01-21 14:48:30,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.96 | bwd_microstep: 466.61 | bwd_inner_microstep: 466.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7538 [2025-01-21 14:48:31,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.12 | bwd_microstep: 564.09 | bwd_inner_microstep: 563.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:48:32,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 358.32 | bwd_microstep: 399.89 | bwd_inner_microstep: 399.58 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 6298 [2025-01-21 14:48:33,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 
0.68 | optimizer_step: 0.33 [2025-01-21 14:48:33,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.19 | bwd_microstep: 473.90 | bwd_inner_microstep: 466.15 | bwd_allreduce_microstep: 7.52 | step_microstep: 10.95 [2025-01-21 14:48:33,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3285.91 | bwd: 3761.67 | bwd_inner: 3752.75 | bwd_allreduce: 7.94 | step: 11.74 84%|████████▍ | 366/437 [40:18<08:02, 6.79s/it] {'loss': 0.323, 'learning_rate': 2.7167459272236718e-06, 'epoch': 0.84} 84%|████████▍ | 366/437 [40:18<08:02, 6.79s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2542 [2025-01-21 14:48:33,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.36 | bwd_microstep: 206.41 | bwd_inner_microstep: 206.23 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5045 [2025-01-21 14:48:34,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.77 | bwd_microstep: 369.47 | bwd_inner_microstep: 369.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:48:35,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.15 | bwd_microstep: 605.92 | bwd_inner_microstep: 605.61 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4929 [2025-01-21 14:48:36,402] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 321.51 | bwd_microstep: 359.93 | bwd_inner_microstep: 359.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:48:37,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.45 | bwd_microstep: 606.74 | bwd_inner_microstep: 606.53 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:48:38,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.34 | bwd_microstep: 477.30 | bwd_inner_microstep: 477.08 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:48:39,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.15 | bwd_microstep: 607.20 | bwd_inner_microstep: 606.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5061 [2025-01-21 14:48:40,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
1.02 | optimizer_gradients: 0.70 | optimizer_step: 0.38 [2025-01-21 14:48:40,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.56 | bwd_microstep: 872.48 | bwd_inner_microstep: 372.24 | bwd_allreduce_microstep: 500.13 | step_microstep: 13.25 [2025-01-21 14:48:40,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3215.12 | bwd: 4105.60 | bwd_inner: 3604.06 | bwd_allreduce: 500.65 | step: 14.07 84%|████████▍ | 367/437 [40:25<08:11, 7.02s/it] {'loss': 0.9199, 'learning_rate': 2.6424767680450657e-06, 'epoch': 0.84} 84%|████████▍ | 367/437 [40:25<08:11, 7.02s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3053 [2025-01-21 14:48:41,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.99 | bwd_microstep: 225.92 | bwd_inner_microstep: 225.76 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6038 [2025-01-21 14:48:42,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.05 | bwd_microstep: 441.14 | bwd_inner_microstep: 440.95 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5489 [2025-01-21 14:48:43,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.39 | bwd_microstep: 406.54 | bwd_inner_microstep: 406.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3595 [2025-01-21 14:48:43,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.53 | bwd_microstep: 269.66 | bwd_inner_microstep: 269.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:48:44,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.41 | bwd_microstep: 498.92 | bwd_inner_microstep: 
498.75 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:48:45,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 405.16 | bwd_microstep: 461.10 | bwd_inner_microstep: 460.93 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5801 [2025-01-21 14:48:46,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.87 | bwd_microstep: 426.59 | bwd_inner_microstep: 426.43 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:48:47,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.99 | optimizer_step: 0.34 [2025-01-21 14:48:47,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.48 | bwd_microstep: 614.76 | bwd_inner_microstep: 606.95 | bwd_allreduce_microstep: 7.61 | step_microstep: 13.63 [2025-01-21 14:48:47,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2952.71 | bwd: 3344.76 | bwd_inner: 3335.97 | bwd_allreduce: 8.08 | step: 14.43 84%|████████▍ | 368/437 [40:32<07:54, 6.87s/it] {'loss': 0.3337, 'learning_rate': 2.5691650338799012e-06, 'epoch': 0.84} 84%|████████▍ | 368/437 [40:32<07:54, 6.87s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2450 [2025-01-21 14:48:47,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.89 | bwd_microstep: 202.45 | bwd_inner_microstep: 202.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4753 
[2025-01-21 14:48:48,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 311.22 | bwd_microstep: 348.49 | bwd_inner_microstep: 348.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6836 [2025-01-21 14:48:49,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 442.89 | bwd_microstep: 505.09 | bwd_inner_microstep: 504.93 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7347 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:48:50,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.68 | bwd_microstep: 549.59 | bwd_inner_microstep: 549.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:48:51,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.52 | bwd_microstep: 606.18 | bwd_inner_microstep: 605.96 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:48:52,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.71 | bwd_microstep: 478.58 | bwd_inner_microstep: 478.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:48:53,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.62 | bwd_microstep: 580.64 | 
bwd_inner_microstep: 580.36 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3713 [2025-01-21 14:48:54,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.69 | optimizer_step: 0.34 [2025-01-21 14:48:54,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.50 | bwd_microstep: 287.50 | bwd_inner_microstep: 276.98 | bwd_allreduce_microstep: 10.30 | step_microstep: 11.50 [2025-01-21 14:48:54,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3114.88 | bwd: 3558.65 | bwd_inner: 3547.00 | bwd_allreduce: 10.79 | step: 12.34 84%|████████▍ | 369/437 [40:39<07:47, 6.88s/it] {'loss': 0.4007, 'learning_rate': 2.496814768535989e-06, 'epoch': 0.84} 84%|████████▍ | 369/437 [40:39<07:47, 6.88s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6723 [2025-01-21 14:48:55,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 432.01 | bwd_microstep: 496.07 | bwd_inner_microstep: 495.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4455 [2025-01-21 14:48:55,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.38 | bwd_microstep: 325.04 | bwd_inner_microstep: 324.85 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3402 [2025-01-21 14:48:56,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.27 | bwd_microstep: 257.56 | bwd_inner_microstep: 257.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7361 [2025-01-21 14:48:57,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.58 | bwd_microstep: 551.68 | bwd_inner_microstep: 551.51 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5184 [2025-01-21 14:48:58,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.09 | bwd_microstep: 380.36 | bwd_inner_microstep: 380.11 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:48:59,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.25 | bwd_microstep: 377.92 | bwd_inner_microstep: 377.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 4049 [2025-01-21 14:48:59,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 252.09 | bwd_microstep: 301.43 | bwd_inner_microstep: 301.24 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:49:00,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:49:00,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.92 | bwd_microstep: 615.95 | bwd_inner_microstep: 608.10 | bwd_allreduce_microstep: 7.71 | step_microstep: 11.45 [2025-01-21 14:49:00,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2914.43 | bwd: 3306.14 | bwd_inner: 3297.22 | bwd_allreduce: 8.18 | step: 12.27 85%|████████▍ | 370/437 [40:45<07:32, 6.75s/it] {'loss': 0.3352, 'learning_rate': 2.4254299627874045e-06, 'epoch': 0.85} 85%|████████▍ | 370/437 [40:45<07:32, 6.75s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6499 
[2025-01-21 14:49:01,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 415.33 | bwd_microstep: 478.25 | bwd_inner_microstep: 478.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6567 [2025-01-21 14:49:02,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.80 | bwd_microstep: 484.85 | bwd_inner_microstep: 484.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:49:03,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.46 | bwd_microstep: 604.86 | bwd_inner_microstep: 604.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6038 [2025-01-21 14:49:04,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.55 | bwd_microstep: 439.49 | bwd_inner_microstep: 439.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7596 [2025-01-21 14:49:05,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.00 | bwd_microstep: 564.46 | bwd_inner_microstep: 564.20 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7043 [2025-01-21 14:49:06,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.59 | bwd_microstep: 523.66 | 
bwd_inner_microstep: 523.39 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3223 [2025-01-21 14:49:07,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.55 | bwd_microstep: 242.99 | bwd_inner_microstep: 242.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7948 [2025-01-21 14:49:08,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.66 | optimizer_step: 0.33 [2025-01-21 14:49:08,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.41 | bwd_microstep: 601.49 | bwd_inner_microstep: 594.03 | bwd_allreduce_microstep: 7.36 | step_microstep: 10.89 [2025-01-21 14:49:08,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3435.54 | bwd: 3940.17 | bwd_inner: 3931.49 | bwd_allreduce: 7.87 | step: 11.70 85%|████████▍ | 371/437 [40:53<07:42, 7.01s/it] {'loss': 0.3592, 'learning_rate': 2.3550145541543666e-06, 'epoch': 0.85} 85%|████████▍ | 371/437 [40:53<07:42, 7.01s/it]dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 5174 [2025-01-21 14:49:09,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.55 | bwd_microstep: 378.46 | bwd_inner_microstep: 378.29 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3717 [2025-01-21 14:49:09,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.25 | bwd_microstep: 277.09 | bwd_inner_microstep: 276.92 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4466 [2025-01-21 14:49:10,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.95 | bwd_microstep: 324.30 | bwd_inner_microstep: 324.14 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3370 [2025-01-21 14:49:10,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.50 | bwd_microstep: 250.22 | bwd_inner_microstep: 250.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2836 [2025-01-21 14:49:11,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.40 | bwd_microstep: 217.30 | bwd_inner_microstep: 216.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:49:11,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.66 | bwd_microstep: 358.78 | bwd_inner_microstep: 358.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5295 [2025-01-21 14:49:12,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 347.58 | bwd_microstep: 386.47 | bwd_inner_microstep: 385.97 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6825 [2025-01-21 14:49:14,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.85 | optimizer_step: 0.36 [2025-01-21 14:49:14,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 441.43 | bwd_microstep: 794.34 | bwd_inner_microstep: 507.24 | bwd_allreduce_microstep: 286.87 | step_microstep: 14.04 [2025-01-21 14:49:14,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2388.15 | bwd: 2987.14 | bwd_inner: 2698.71 | bwd_allreduce: 287.45 | step: 15.04 85%|████████▌ | 372/437 [40:58<07:07, 6.58s/it] {'loss': 0.2621, 'learning_rate': 2.2855724266860314e-06, 'epoch': 0.85} 85%|████████▌ | 372/437 [40:58<07:07, 
6.58s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3006 [2025-01-21 14:49:14,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.96 | bwd_microstep: 224.04 | bwd_inner_microstep: 223.88 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2625 [2025-01-21 14:49:14,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.20 | bwd_microstep: 205.64 | bwd_inner_microstep: 205.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3400 [2025-01-21 14:49:15,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.27 | bwd_microstep: 254.39 | bwd_inner_microstep: 254.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5770 [2025-01-21 14:49:16,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 380.34 | bwd_microstep: 424.00 | bwd_inner_microstep: 423.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4938 [2025-01-21 14:49:16,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.43 | bwd_microstep: 360.76 | bwd_inner_microstep: 360.59 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7824 [2025-01-21 14:49:18,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 509.90 | bwd_microstep: 585.02 | bwd_inner_microstep: 584.85 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:49:18,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.89 | bwd_microstep: 261.54 | bwd_inner_microstep: 261.27 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6514 [2025-01-21 14:49:19,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.78 | optimizer_step: 0.35 [2025-01-21 14:49:19,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.83 | bwd_microstep: 944.55 | bwd_inner_microstep: 480.13 | bwd_allreduce_microstep: 464.29 | step_microstep: 13.25 [2025-01-21 14:49:19,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2473.67 | bwd: 3260.07 | bwd_inner: 2794.59 | bwd_allreduce: 464.75 | step: 14.07 85%|████████▌ | 373/437 [41:04<06:49, 6.40s/it] {'loss': 0.2718, 'learning_rate': 2.217107410746271e-06, 'epoch': 0.85} 85%|████████▌ | 373/437 [41:04<06:49, 6.40s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4019 [2025-01-21 14:49:20,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.88 | bwd_microstep: 296.44 | bwd_inner_microstep: 296.21 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3088 [2025-01-21 14:49:21,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.53 | bwd_microstep: 241.74 | bwd_inner_microstep: 241.53 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3084 [2025-01-21 14:49:21,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.40 | bwd_microstep: 244.80 | bwd_inner_microstep: 244.60 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7308 [2025-01-21 14:49:22,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.20 | bwd_microstep: 545.43 | bwd_inner_microstep: 545.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 
0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:49:22,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.42 | bwd_microstep: 200.35 | bwd_inner_microstep: 199.86 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:49:23,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.31 | bwd_microstep: 476.64 | bwd_inner_microstep: 476.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:49:24,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.01 | bwd_microstep: 318.64 | bwd_inner_microstep: 318.32 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4491 [2025-01-21 14:49:25,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.89 | optimizer_gradients: 0.68 | optimizer_step: 0.34 [2025-01-21 14:49:25,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 285.82 | bwd_microstep: 339.96 | bwd_inner_microstep: 332.36 | bwd_allreduce_microstep: 7.46 | step_microstep: 10.95 [2025-01-21 14:49:25,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2341.40 | bwd: 2664.18 | bwd_inner: 2655.15 | bwd_allreduce: 8.07 | step: 11.91 86%|████████▌ | 374/437 [41:09<06:20, 6.04s/it] {'loss': 0.1917, 'learning_rate': 2.149623282802378e-06, 'epoch': 0.85} 86%|████████▌ | 374/437 [41:09<06:20, 6.04s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5925 [2025-01-21 14:49:26,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.08 | bwd_microstep: 431.85 | bwd_inner_microstep: 431.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 
20.0, dynamic token length: 5584 [2025-01-21 14:49:26,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.35 | bwd_microstep: 410.72 | bwd_inner_microstep: 410.55 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7384 [2025-01-21 14:49:27,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.14 | bwd_microstep: 551.70 | bwd_inner_microstep: 551.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2342 [2025-01-21 14:49:28,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.76 | bwd_microstep: 200.07 | bwd_inner_microstep: 199.75 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2303 [2025-01-21 14:49:28,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.89 | bwd_microstep: 211.69 | bwd_inner_microstep: 211.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2300 [2025-01-21 14:49:29,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.75 | bwd_microstep: 210.09 | bwd_inner_microstep: 209.77 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:49:29,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.59 | bwd_microstep: 223.48 | bwd_inner_microstep: 222.99 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3500 [2025-01-21 14:49:31,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.82 | optimizer_step: 0.34 [2025-01-21 14:49:31,190] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 229.77 | bwd_microstep: 1381.70 | bwd_inner_microstep: 265.18 | bwd_allreduce_microstep: 1116.30 | step_microstep: 13.91 [2025-01-21 14:49:31,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.18 | bwd: 3621.46 | bwd_inner: 2503.40 | bwd_allreduce: 1116.83 | step: 14.89 86%|████████▌ | 375/437 [41:15<06:13, 6.03s/it] {'loss': 0.2845, 'learning_rate': 2.0831237652167656e-06, 'epoch': 0.86} 86%|████████▌ | 375/437 [41:15<06:13, 6.03s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2474 [2025-01-21 14:49:31,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.27 | bwd_microstep: 203.58 | bwd_inner_microstep: 203.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7947 [2025-01-21 14:49:32,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 517.88 | bwd_microstep: 593.18 | bwd_inner_microstep: 592.91 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3671 [2025-01-21 14:49:33,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 246.19 | bwd_microstep: 273.79 | bwd_inner_microstep: 273.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6570 [2025-01-21 14:49:34,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.27 | bwd_microstep: 483.62 | bwd_inner_microstep: 483.46 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7848 [2025-01-21 14:49:35,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 511.24 | bwd_microstep: 587.07 | bwd_inner_microstep: 586.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic 
token length: 2806 [2025-01-21 14:49:35,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.71 | bwd_microstep: 205.25 | bwd_inner_microstep: 205.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4886 [2025-01-21 14:49:36,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.20 | bwd_microstep: 357.89 | bwd_inner_microstep: 357.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7207 [2025-01-21 14:49:37,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.70 | optimizer_step: 0.34 [2025-01-21 14:49:37,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.80 | bwd_microstep: 541.52 | bwd_inner_microstep: 533.63 | bwd_allreduce_microstep: 7.77 | step_microstep: 11.18 [2025-01-21 14:49:37,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2839.41 | bwd: 3246.03 | bwd_inner: 3237.05 | bwd_allreduce: 8.23 | step: 11.98 86%|████████▌ | 376/437 [41:22<06:13, 6.12s/it] {'loss': 0.2991, 'learning_rate': 2.0176125260416544e-06, 'epoch': 0.86} 86%|████████▌ | 376/437 [41:22<06:13, 6.12s/it]dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7738 [2025-01-21 14:49:38,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 497.50 | bwd_microstep: 573.63 | bwd_inner_microstep: 573.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6599 [2025-01-21 14:49:39,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.24 | bwd_microstep: 486.20 | bwd_inner_microstep: 485.98 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2345 [2025-01-21 14:49:39,943] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.43 | bwd_microstep: 199.27 | bwd_inner_microstep: 199.11 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:49:41,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.37 | bwd_microstep: 604.81 | bwd_inner_microstep: 604.50 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4116 [2025-01-21 14:49:41,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.30 | bwd_microstep: 304.30 | bwd_inner_microstep: 304.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:49:42,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.93 | bwd_microstep: 561.00 | bwd_inner_microstep: 560.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:49:43,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.61 | bwd_microstep: 560.43 | bwd_inner_microstep: 560.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:49:45,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.66 | optimizer_step: 0.32 [2025-01-21 14:49:45,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.06 | bwd_microstep: 620.11 | bwd_inner_microstep: 612.78 | bwd_allreduce_microstep: 7.24 | step_microstep: 10.61 
[2025-01-21 14:49:45,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3428.27 | bwd: 3909.89 | bwd_inner: 3901.40 | bwd_allreduce: 7.71 | step: 11.39 86%|████████▋ | 377/437 [41:29<06:33, 6.55s/it] {'loss': 0.3188, 'learning_rate': 1.9530931788167274e-06, 'epoch': 0.86} 86%|████████▋ | 377/437 [41:29<06:33, 6.55s/it]dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8192 [2025-01-21 14:49:46,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 515.52 | bwd_microstep: 614.57 | bwd_inner_microstep: 614.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3160 [2025-01-21 14:49:46,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.70 | bwd_microstep: 242.03 | bwd_inner_microstep: 241.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3687 [2025-01-21 14:49:47,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.55 | bwd_microstep: 275.62 | bwd_inner_microstep: 275.34 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4979 [2025-01-21 14:49:47,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.39 | bwd_microstep: 364.75 | bwd_inner_microstep: 364.53 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3105 [2025-01-21 14:49:48,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.76 | bwd_microstep: 242.07 | bwd_inner_microstep: 241.90 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7576 [2025-01-21 14:49:49,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.31 | bwd_microstep: 565.87 | 
bwd_inner_microstep: 565.70 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5705 [2025-01-21 14:49:50,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.80 | bwd_microstep: 420.31 | bwd_inner_microstep: 420.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4676 [2025-01-21 14:49:51,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.84 | optimizer_step: 0.35 [2025-01-21 14:49:51,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 300.81 | bwd_microstep: 355.38 | bwd_inner_microstep: 347.38 | bwd_allreduce_microstep: 7.89 | step_microstep: 11.94 [2025-01-21 14:49:51,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2683.69 | bwd: 3080.73 | bwd_inner: 3071.62 | bwd_allreduce: 8.36 | step: 12.74 86%|████████▋ | 378/437 [41:35<06:16, 6.38s/it] {'loss': 0.2927, 'learning_rate': 1.889569282369823e-06, 'epoch': 0.86} 86%|████████▋ | 378/437 [41:35<06:16, 6.38s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2344 [2025-01-21 14:49:51,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.80 | bwd_microstep: 206.41 | bwd_inner_microstep: 206.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6309 [2025-01-21 14:49:52,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 411.16 | bwd_microstep: 467.83 | bwd_inner_microstep: 467.65 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8133 [2025-01-21 14:49:53,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.97 | bwd_microstep: 607.65 | bwd_inner_microstep: 607.47 | bwd_allreduce_microstep: 
0.08 | step_microstep: 0.12 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:49:54,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.26 | bwd_microstep: 607.55 | bwd_inner_microstep: 607.06 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:49:55,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.33 | bwd_microstep: 302.12 | bwd_inner_microstep: 301.64 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.29 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:49:56,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.36 | bwd_microstep: 498.36 | bwd_inner_microstep: 498.18 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:49:57,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 437.59 | bwd_microstep: 496.76 | bwd_inner_microstep: 496.59 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8114 [2025-01-21 14:49:58,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.65 | optimizer_step: 0.33 [2025-01-21 14:49:58,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.85 | bwd_microstep: 614.06 | bwd_inner_microstep: 606.56 | bwd_allreduce_microstep: 7.29 | step_microstep: 10.91 [2025-01-21 14:49:58,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3332.15 | bwd: 3800.96 | bwd_inner: 3791.85 | bwd_allreduce: 7.99 | 
step: 12.05 87%|████████▋ | 379/437 [41:43<06:27, 6.68s/it] {'loss': 0.2112, 'learning_rate': 1.8270443406206273e-06, 'epoch': 0.87} 87%|████████▋ | 379/437 [41:43<06:27, 6.68s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7585 [2025-01-21 14:49:59,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 482.85 | bwd_microstep: 567.11 | bwd_inner_microstep: 566.90 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.15 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3426 [2025-01-21 14:50:00,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.49 | bwd_microstep: 257.76 | bwd_inner_microstep: 257.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3412 [2025-01-21 14:50:00,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.95 | bwd_microstep: 257.25 | bwd_inner_microstep: 257.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4705 [2025-01-21 14:50:01,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.76 | bwd_microstep: 349.35 | bwd_inner_microstep: 349.19 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5990 [2025-01-21 14:50:02,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.26 | bwd_microstep: 435.95 | bwd_inner_microstep: 435.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:50:02,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.83 | bwd_microstep: 302.61 | bwd_inner_microstep: 302.42 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token 
length: 7272 [2025-01-21 14:50:03,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.62 | bwd_microstep: 537.42 | bwd_inner_microstep: 537.10 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2800 [2025-01-21 14:50:04,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.78 | optimizer_step: 0.38 [2025-01-21 14:50:04,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.62 | bwd_microstep: 968.49 | bwd_inner_microstep: 232.46 | bwd_allreduce_microstep: 735.92 | step_microstep: 13.90 [2025-01-21 14:50:04,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2559.23 | bwd: 3676.06 | bwd_inner: 2938.78 | bwd_allreduce: 736.43 | step: 14.72 87%|████████▋ | 380/437 [41:49<06:16, 6.61s/it] {'loss': 0.2621, 'learning_rate': 1.7655218023874131e-06, 'epoch': 0.87} 87%|████████▋ | 380/437 [41:49<06:16, 6.61s/it]dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6027 [2025-01-21 14:50:05,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 385.81 | bwd_microstep: 439.52 | bwd_inner_microstep: 439.35 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.15 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7600 [2025-01-21 14:50:06,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.88 | bwd_microstep: 567.07 | bwd_inner_microstep: 566.76 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5460 [2025-01-21 14:50:07,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.24 | bwd_microstep: 404.50 | bwd_inner_microstep: 404.20 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7570 [2025-01-21 14:50:08,717] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 491.92 | bwd_microstep: 565.60 | bwd_inner_microstep: 565.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6212 [2025-01-21 14:50:09,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.19 | bwd_microstep: 460.06 | bwd_inner_microstep: 459.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 5217 [2025-01-21 14:50:10,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.59 | bwd_microstep: 382.00 | bwd_inner_microstep: 381.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4869 [2025-01-21 14:50:11,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 314.72 | bwd_microstep: 356.73 | bwd_inner_microstep: 356.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2303 [2025-01-21 14:50:11,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.77 | optimizer_step: 0.35 [2025-01-21 14:50:11,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 161.06 | bwd_microstep: 265.26 | bwd_inner_microstep: 223.18 | bwd_allreduce_microstep: 41.97 | step_microstep: 14.13 [2025-01-21 14:50:11,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2937.25 | bwd: 3440.88 | bwd_inner: 3397.58 | bwd_allreduce: 42.45 | step: 15.00 87%|████████▋ | 381/437 [41:56<06:10, 6.61s/it] {'loss': 0.2712, 'learning_rate': 1.7050050611967872e-06, 'epoch': 0.87} 87%|████████▋ | 381/437 [41:56<06:10, 6.61s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4388 [2025-01-21 14:50:12,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 280.29 | bwd_microstep: 319.90 | bwd_inner_microstep: 319.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3999 [2025-01-21 14:50:12,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.73 | bwd_microstep: 296.84 | bwd_inner_microstep: 296.65 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:50:13,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.65 | bwd_microstep: 605.90 | bwd_inner_microstep: 605.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6288 [2025-01-21 14:50:14,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.74 | bwd_microstep: 464.92 | bwd_inner_microstep: 464.76 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:50:15,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.00 | 
bwd_microstep: 604.65 | bwd_inner_microstep: 604.49 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:50:16,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.13 | bwd_microstep: 355.60 | bwd_inner_microstep: 355.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3036 [2025-01-21 14:50:17,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 197.30 | bwd_microstep: 226.78 | bwd_inner_microstep: 226.52 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6922 [2025-01-21 14:50:18,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.69 | optimizer_step: 0.35 [2025-01-21 14:50:18,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.85 | bwd_microstep: 520.52 | bwd_inner_microstep: 512.76 | bwd_allreduce_microstep: 7.56 | step_microstep: 11.24 [2025-01-21 14:50:18,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3009.55 | bwd: 3395.26 | bwd_inner: 3386.37 | bwd_allreduce: 8.05 | step: 12.03 87%|████████▋ | 382/437 [42:02<06:03, 6.62s/it] {'loss': 0.527, 'learning_rate': 1.6454974550965185e-06, 'epoch': 0.87} 87%|████████▋ | 382/437 [42:02<06:03, 6.62s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2825 [2025-01-21 14:50:18,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.91 | bwd_microstep: 212.42 | bwd_inner_microstep: 212.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6404 [2025-01-21 14:50:19,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 413.79 | bwd_microstep: 475.63 | bwd_inner_microstep: 475.44 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6092 [2025-01-21 14:50:20,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.14 | bwd_microstep: 443.74 | bwd_inner_microstep: 443.58 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4470 [2025-01-21 14:50:20,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.54 | bwd_microstep: 325.78 | bwd_inner_microstep: 325.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2844 [2025-01-21 14:50:21,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.42 | bwd_microstep: 217.00 | bwd_inner_microstep: 216.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5971 [2025-01-21 14:50:22,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.97 | bwd_microstep: 434.95 | bwd_inner_microstep: 434.78 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:50:22,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.12 | bwd_microstep: 343.11 | bwd_inner_microstep: 342.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7613 [2025-01-21 14:50:24,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.65 | optimizer_step: 0.33 [2025-01-21 14:50:24,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 485.68 | bwd_microstep: 573.31 | bwd_inner_microstep: 565.72 | bwd_allreduce_microstep: 7.39 | step_microstep: 10.96 [2025-01-21 14:50:24,041] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2664.41 | bwd: 3026.06 | bwd_inner: 3017.50 | bwd_allreduce: 7.85 | step: 11.80 88%|████████▊ | 383/437 [42:08<05:45, 6.41s/it] {'loss': 0.3912, 'learning_rate': 1.5870022664714225e-06, 'epoch': 0.88} 88%|████████▊ | 383/437 [42:08<05:45, 6.41s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3476 [2025-01-21 14:50:24,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.26 | bwd_microstep: 259.74 | bwd_inner_microstep: 259.58 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5327 [2025-01-21 14:50:25,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.32 | bwd_microstep: 388.62 | bwd_inner_microstep: 388.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6529 [2025-01-21 14:50:26,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.86 | bwd_microstep: 481.39 | bwd_inner_microstep: 481.24 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5735 [2025-01-21 14:50:27,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.23 | bwd_microstep: 421.56 | bwd_inner_microstep: 421.32 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7843 [2025-01-21 14:50:28,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.76 | bwd_microstep: 586.61 | bwd_inner_microstep: 586.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3332 [2025-01-21 14:50:28,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.29 | bwd_microstep: 253.59 | bwd_inner_microstep: 253.43 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:50:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.95 | bwd_microstep: 606.91 | bwd_inner_microstep: 606.74 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6009 [2025-01-21 14:50:30,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.43 | optimizer_gradients: 0.73 | optimizer_step: 0.35 [2025-01-21 14:50:30,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.57 | bwd_microstep: 576.37 | bwd_inner_microstep: 436.82 | bwd_allreduce_microstep: 139.37 | step_microstep: 18.18 [2025-01-21 14:50:30,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3038.08 | bwd: 3574.92 | bwd_inner: 3434.40 | bwd_allreduce: 139.82 | step: 18.98 88%|████████▊ | 384/437 [42:15<05:46, 6.54s/it] {'loss': 0.4273, 'learning_rate': 1.529522721862291e-06, 'epoch': 0.88} 88%|████████▊ | 384/437 [42:15<05:46, 6.54s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4654 [2025-01-21 14:50:31,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 301.02 | bwd_microstep: 342.35 | bwd_inner_microstep: 342.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3164 [2025-01-21 14:50:32,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
216.67 | bwd_microstep: 244.32 | bwd_inner_microstep: 244.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2850 [2025-01-21 14:50:32,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.35 | bwd_microstep: 215.86 | bwd_inner_microstep: 215.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2560 [2025-01-21 14:50:32,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.18 | bwd_microstep: 208.30 | bwd_inner_microstep: 208.00 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5979 [2025-01-21 14:50:33,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.66 | bwd_microstep: 438.43 | bwd_inner_microstep: 438.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:50:34,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.25 | bwd_microstep: 220.69 | bwd_inner_microstep: 220.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4095 [2025-01-21 14:50:34,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.81 | bwd_microstep: 304.68 | bwd_inner_microstep: 304.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7857 [2025-01-21 14:50:37,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.76 | optimizer_step: 0.34 [2025-01-21 14:50:37,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.33 | bwd_microstep: 1837.75 | bwd_inner_microstep: 588.36 | bwd_allreduce_microstep: 1249.26 | 
step_microstep: 13.68 [2025-01-21 14:50:37,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2263.10 | bwd: 3812.51 | bwd_inner: 2562.02 | bwd_allreduce: 1249.73 | step: 14.44 88%|████████▊ | 385/437 [42:21<05:36, 6.47s/it] {'loss': 0.2741, 'learning_rate': 1.473061991787923e-06, 'epoch': 0.88} 88%|████████▊ | 385/437 [42:21<05:36, 6.47s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 7175 [2025-01-21 14:50:38,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.51 | bwd_microstep: 531.52 | bwd_inner_microstep: 531.34 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5636 [2025-01-21 14:50:39,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 367.80 | bwd_microstep: 417.98 | bwd_inner_microstep: 417.80 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5013 [2025-01-21 14:50:39,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.31 | bwd_microstep: 368.12 | bwd_inner_microstep: 367.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6316 [2025-01-21 14:50:40,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.87 | bwd_microstep: 467.76 | bwd_inner_microstep: 467.60 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6018 [2025-01-21 14:50:41,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.82 | bwd_microstep: 439.33 | bwd_inner_microstep: 439.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4936 [2025-01-21 14:50:42,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.55 | 
bwd_microstep: 361.08 | bwd_inner_microstep: 360.89 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:50:43,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.35 | bwd_microstep: 605.87 | bwd_inner_microstep: 605.71 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5203 [2025-01-21 14:50:44,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.10 | optimizer_gradients: 0.83 | optimizer_step: 0.35 [2025-01-21 14:50:44,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 330.67 | bwd_microstep: 686.75 | bwd_inner_microstep: 381.26 | bwd_allreduce_microstep: 305.38 | step_microstep: 14.16 [2025-01-21 14:50:44,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3150.72 | bwd: 3878.54 | bwd_inner: 3571.93 | bwd_allreduce: 305.85 | step: 14.95 88%|████████▊ | 386/437 [42:29<05:41, 6.70s/it] {'loss': 0.4739, 'learning_rate': 1.4176231905702476e-06, 'epoch': 0.88} 88%|████████▊ | 386/437 [42:29<05:41, 6.70s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3892 [2025-01-21 14:50:45,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 249.03 | bwd_microstep: 287.63 | bwd_inner_microstep: 287.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3968 [2025-01-21 14:50:45,599] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.18 | bwd_microstep: 294.61 | bwd_inner_microstep: 294.42 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4429 [2025-01-21 14:50:46,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 292.26 | bwd_microstep: 324.29 | bwd_inner_microstep: 324.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5481 [2025-01-21 14:50:47,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 363.43 | bwd_microstep: 405.52 | bwd_inner_microstep: 405.35 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:50:48,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.95 | bwd_microstep: 604.88 | bwd_inner_microstep: 604.72 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8108 [2025-01-21 14:50:49,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.89 | bwd_microstep: 606.94 | bwd_inner_microstep: 606.63 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:50:50,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.73 | bwd_microstep: 609.18 | bwd_inner_microstep: 609.01 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4037 [2025-01-21 14:50:51,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.72 | optimizer_step: 0.34 [2025-01-21 14:50:51,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 264.01 | bwd_microstep: 1087.82 | bwd_inner_microstep: 302.85 | bwd_allreduce_microstep: 784.87 | step_microstep: 13.15 [2025-01-21 14:50:51,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3052.33 | bwd: 4220.99 | bwd_inner: 3434.88 | bwd_allreduce: 785.34 | step: 13.93 89%|████████▊ | 387/437 [42:36<05:47, 6.94s/it] {'loss': 0.7205, 'learning_rate': 1.363209376162542e-06, 'epoch': 0.88} 89%|████████▊ | 387/437 [42:36<05:47, 6.94s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:50:53,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 531.65 | bwd_microstep: 612.93 | bwd_inner_microstep: 612.77 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6416 [2025-01-21 14:50:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 412.53 | bwd_microstep: 476.92 | bwd_inner_microstep: 476.74 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8140 [2025-01-21 14:50:55,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
526.12 | bwd_microstep: 608.20 | bwd_inner_microstep: 608.04 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7336 [2025-01-21 14:50:56,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.16 | bwd_microstep: 549.57 | bwd_inner_microstep: 549.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4930 [2025-01-21 14:50:56,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.10 | bwd_microstep: 361.29 | bwd_inner_microstep: 361.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3066 [2025-01-21 14:50:57,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 206.84 | bwd_microstep: 227.81 | bwd_inner_microstep: 227.60 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:50:58,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 472.63 | bwd_microstep: 539.81 | bwd_inner_microstep: 539.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5678 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:51:00,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:51:00,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 365.81 | bwd_microstep: 1220.05 | bwd_inner_microstep: 416.75 | bwd_allreduce_microstep: 803.19 | step_microstep: 14.00 [2025-01-21 14:51:00,113] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 3315.65 | bwd: 4596.71 | bwd_inner: 3792.31 | bwd_allreduce: 803.68 | step: 14.82 89%|████████▉ | 388/437 [42:44<05:58, 7.31s/it] {'loss': 0.3439, 'learning_rate': 1.309823549980751e-06, 'epoch': 0.89} 89%|████████▉ | 388/437 [42:44<05:58, 7.31s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3003 [2025-01-21 14:51:00,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.41 | bwd_microstep: 223.03 | bwd_inner_microstep: 222.68 | bwd_allreduce_microstep: 0.14 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3136 [2025-01-21 14:51:01,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 216.24 | bwd_microstep: 244.14 | bwd_inner_microstep: 243.94 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4970 [2025-01-21 14:51:01,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.06 | bwd_microstep: 363.66 | bwd_inner_microstep: 363.36 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3068 [2025-01-21 14:51:02,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 207.49 | bwd_microstep: 269.07 | bwd_inner_microstep: 268.83 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:51:03,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 543.05 | bwd_microstep: 605.00 | bwd_inner_microstep: 604.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7034 [2025-01-21 14:51:04,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.27 | bwd_microstep: 520.40 | bwd_inner_microstep: 520.12 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:51:05,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 288.45 | bwd_microstep: 319.03 | bwd_inner_microstep: 318.87 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6445 [2025-01-21 14:51:06,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.06 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:51:06,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 415.40 | bwd_microstep: 764.41 | bwd_inner_microstep: 476.64 | bwd_allreduce_microstep: 287.66 | step_microstep: 13.86 [2025-01-21 14:51:06,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2644.20 | bwd: 3308.88 | bwd_inner: 3019.76 | bwd_allreduce: 288.15 | step: 14.68 89%|████████▉ | 389/437 [42:51<05:34, 6.97s/it] {'loss': 0.3784, 'learning_rate': 1.2574686567379324e-06, 'epoch': 0.89} 89%|████████▉ | 389/437 [42:51<05:34, 6.97s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6085 [2025-01-21 14:51:07,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.51 | bwd_microstep: 442.02 | bwd_inner_microstep: 441.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4969 [2025-01-21 14:51:07,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 319.68 | bwd_microstep: 360.92 | 
bwd_inner_microstep: 360.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.26 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:51:09,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.18 | bwd_microstep: 608.12 | bwd_inner_microstep: 607.93 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6521 [2025-01-21 14:51:09,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.20 | bwd_microstep: 479.30 | bwd_inner_microstep: 479.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6254 [2025-01-21 14:51:10,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.73 | bwd_microstep: 461.83 | bwd_inner_microstep: 461.66 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:51:11,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.58 | bwd_microstep: 338.95 | bwd_inner_microstep: 338.63 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:51:12,408] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 389.17 | bwd_microstep: 432.98 | bwd_inner_microstep: 432.81 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7801 [2025-01-21 14:51:14,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.81 | optimizer_step: 0.35 [2025-01-21 14:51:14,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.91 | bwd_microstep: 1370.40 | bwd_inner_microstep: 579.43 | bwd_allreduce_microstep: 790.75 | step_microstep: 13.76 [2025-01-21 14:51:14,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3299.79 | bwd: 4494.64 | bwd_inner: 3702.59 | bwd_allreduce: 791.18 | step: 14.69 89%|████████▉ | 390/437 [42:59<05:42, 7.29s/it] {'loss': 0.3971, 'learning_rate': 1.2061475842818337e-06, 'epoch': 0.89} 89%|████████▉ | 390/437 [42:59<05:42, 7.29s/it]dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7496 [2025-01-21 14:51:15,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 481.12 | bwd_microstep: 559.02 | bwd_inner_microstep: 558.79 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5032 [2025-01-21 14:51:16,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 324.80 | bwd_microstep: 367.47 | bwd_inner_microstep: 367.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6873 [2025-01-21 14:51:17,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.97 | bwd_microstep: 507.52 | bwd_inner_microstep: 507.36 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6540 [2025-01-21 14:51:18,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.53 | 
bwd_microstep: 481.09 | bwd_inner_microstep: 480.94 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2532 [2025-01-21 14:51:18,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 174.35 | bwd_microstep: 203.69 | bwd_inner_microstep: 203.53 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:51:18,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 185.55 | bwd_microstep: 212.47 | bwd_inner_microstep: 212.29 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:51:19,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.96 | bwd_microstep: 286.61 | bwd_inner_microstep: 286.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4230 [2025-01-21 14:51:20,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.78 | optimizer_step: 0.39 [2025-01-21 14:51:20,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.38 | bwd_microstep: 792.61 | bwd_inner_microstep: 315.49 | bwd_allreduce_microstep: 477.01 | step_microstep: 13.91 [2025-01-21 14:51:20,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2543.49 | bwd: 3410.61 | bwd_inner: 2932.42 | bwd_allreduce: 477.52 | step: 14.75 89%|████████▉ | 391/437 [43:05<05:20, 6.96s/it] {'loss': 0.2784, 'learning_rate': 1.1558631634356e-06, 'epoch': 0.89} 89%|████████▉ | 391/437 [43:05<05:20, 6.96s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3763 [2025-01-21 14:51:21,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.22 | bwd_microstep: 284.97 | bwd_inner_microstep: 284.79 | 
bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5802 [2025-01-21 14:51:21,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 381.67 | bwd_microstep: 427.13 | bwd_inner_microstep: 426.90 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8146 [2025-01-21 14:51:23,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 527.51 | bwd_microstep: 606.80 | bwd_inner_microstep: 606.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:51:24,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.05 | bwd_microstep: 606.31 | bwd_inner_microstep: 606.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3598 [2025-01-21 14:51:24,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.85 | bwd_microstep: 273.96 | bwd_inner_microstep: 273.80 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:51:25,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.75 | bwd_microstep: 605.96 | bwd_inner_microstep: 605.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) 
[2025-01-21 14:51:26,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.54 | bwd_microstep: 376.29 | bwd_inner_microstep: 376.07 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4255 [2025-01-21 14:51:27,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:51:27,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 269.85 | bwd_microstep: 514.68 | bwd_inner_microstep: 313.30 | bwd_allreduce_microstep: 201.27 | step_microstep: 13.32 [2025-01-21 14:51:27,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3066.29 | bwd: 3696.21 | bwd_inner: 3493.75 | bwd_allreduce: 201.76 | step: 14.11 90%|████████▉ | 392/437 [43:12<05:13, 6.97s/it] {'loss': 0.3908, 'learning_rate': 1.1066181678416266e-06, 'epoch': 0.9} 90%|████████▉ | 392/437 [43:12<05:13, 6.97s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4270 [2025-01-21 14:51:28,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 268.16 | bwd_microstep: 313.49 | bwd_inner_microstep: 312.99 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.30 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5846 [2025-01-21 14:51:28,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 382.33 | bwd_microstep: 429.04 | bwd_inner_microstep: 428.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4756 [2025-01-21 14:51:29,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.88 | bwd_microstep: 346.98 | bwd_inner_microstep: 346.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2274 [2025-01-21 14:51:30,029] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.94 | bwd_microstep: 207.13 | bwd_inner_microstep: 206.97 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477 [2025-01-21 14:51:30,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 423.40 | bwd_microstep: 476.25 | bwd_inner_microstep: 476.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8066 [2025-01-21 14:51:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 524.83 | bwd_microstep: 603.91 | bwd_inner_microstep: 603.69 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3460 [2025-01-21 14:51:32,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.24 | bwd_microstep: 260.95 | bwd_inner_microstep: 260.45 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6784 [2025-01-21 14:51:33,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:51:33,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.38 | bwd_microstep: 509.12 | bwd_inner_microstep: 501.73 | bwd_allreduce_microstep: 7.26 | step_microstep: 10.95 [2025-01-21 14:51:33,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2748.01 | bwd: 3147.08 | bwd_inner: 3138.02 | bwd_allreduce: 7.99 | step: 12.13 90%|████████▉ | 393/437 [43:18<04:55, 6.71s/it] {'loss': 0.2788, 'learning_rate': 1.058415313808565e-06, 'epoch': 0.9} 90%|████████▉ 
| 393/437 [43:18<04:55, 6.71s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4567 [2025-01-21 14:51:34,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 297.26 | bwd_microstep: 334.58 | bwd_inner_microstep: 334.41 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.14 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7903 [2025-01-21 14:51:35,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.87 | bwd_microstep: 587.50 | bwd_inner_microstep: 587.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6541 [2025-01-21 14:51:36,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.52 | bwd_microstep: 482.94 | bwd_inner_microstep: 482.78 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6799 [2025-01-21 14:51:37,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 441.87 | bwd_microstep: 502.95 | bwd_inner_microstep: 502.79 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7319 [2025-01-21 14:51:38,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 476.81 | bwd_microstep: 547.33 | bwd_inner_microstep: 547.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4104 [2025-01-21 14:51:38,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.31 | bwd_microstep: 303.45 | bwd_inner_microstep: 303.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:51:40,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 522.04 | bwd_microstep: 609.32 | 
bwd_inner_microstep: 609.01 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7802 [2025-01-21 14:51:41,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.75 | optimizer_step: 0.35 [2025-01-21 14:51:41,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 507.41 | bwd_microstep: 585.94 | bwd_inner_microstep: 578.30 | bwd_allreduce_microstep: 7.54 | step_microstep: 11.37 [2025-01-21 14:51:41,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3457.91 | bwd: 3954.13 | bwd_inner: 3945.31 | bwd_allreduce: 8.00 | step: 12.22 90%|█████████ | 394/437 [43:25<05:00, 6.99s/it] {'loss': 0.2195, 'learning_rate': 1.0112572601615022e-06, 'epoch': 0.9} 90%|█████████ | 394/437 [43:25<05:00, 6.99s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5389 [2025-01-21 14:51:42,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 353.34 | bwd_microstep: 396.47 | bwd_inner_microstep: 396.31 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6951 [2025-01-21 14:51:43,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.49 | bwd_microstep: 510.84 | bwd_inner_microstep: 510.68 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2330 [2025-01-21 14:51:43,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.91 | bwd_microstep: 205.43 | bwd_inner_microstep: 205.27 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4945 [2025-01-21 14:51:44,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.01 | bwd_microstep: 360.62 | bwd_inner_microstep: 360.29 | bwd_allreduce_microstep: 
0.12 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4153 [2025-01-21 14:51:44,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.60 | bwd_microstep: 302.60 | bwd_inner_microstep: 302.44 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6773 [2025-01-21 14:51:45,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 439.84 | bwd_microstep: 500.02 | bwd_inner_microstep: 499.85 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:51:46,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.34 | bwd_microstep: 561.39 | bwd_inner_microstep: 561.06 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 [2025-01-21 14:51:48,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.74 | optimizer_step: 0.34 [2025-01-21 14:51:48,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 304.93 | bwd_microstep: 1191.96 | bwd_inner_microstep: 338.59 | bwd_allreduce_microstep: 853.25 | step_microstep: 13.70 [2025-01-21 14:51:48,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2792.31 | bwd: 4029.45 | bwd_inner: 3174.86 | bwd_allreduce: 853.73 | step: 14.48 90%|█████████ | 395/437 [43:33<04:54, 7.01s/it] {'loss': 0.2877, 'learning_rate': 9.65146608095293e-07, 'epoch': 0.9} 90%|█████████ | 395/437 [43:33<04:54, 7.01s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6415 [2025-01-21 14:51:49,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 412.71 | bwd_microstep: 474.69 | bwd_inner_microstep: 474.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 
10, images per sample: 10.0, dynamic token length: 2894 [2025-01-21 14:51:49,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 193.57 | bwd_microstep: 217.73 | bwd_inner_microstep: 217.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3378 [2025-01-21 14:51:50,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.39 | bwd_microstep: 255.00 | bwd_inner_microstep: 254.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4114 [2025-01-21 14:51:50,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 274.04 | bwd_microstep: 306.76 | bwd_inner_microstep: 306.60 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:51:51,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.51 | bwd_microstep: 200.04 | bwd_inner_microstep: 199.76 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:51:52,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.79 | bwd_microstep: 560.89 | bwd_inner_microstep: 560.72 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4093 [2025-01-21 14:51:52,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.23 | bwd_microstep: 301.53 | bwd_inner_microstep: 301.37 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5998 [2025-01-21 14:51:53,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.08 | optimizer_gradients: 0.81 | optimizer_step: 0.34 [2025-01-21 14:51:53,806] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 392.80 | bwd_microstep: 537.92 | bwd_inner_microstep: 437.98 | bwd_allreduce_microstep: 99.83 | step_microstep: 14.22 [2025-01-21 14:51:53,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2417.87 | bwd: 2854.70 | bwd_inner: 2753.64 | bwd_allreduce: 100.31 | step: 15.03 91%|█████████ | 396/437 [43:38<04:28, 6.55s/it] {'loss': 0.1981, 'learning_rate': 9.200859010310847e-07, 'epoch': 0.91} 91%|█████████ | 396/437 [43:38<04:28, 6.55s/it]dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5055 [2025-01-21 14:51:54,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.00 | bwd_microstep: 367.44 | bwd_inner_microstep: 367.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7663 [2025-01-21 14:51:55,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.76 | bwd_microstep: 569.41 | bwd_inner_microstep: 569.24 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7908 [2025-01-21 14:51:56,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.03 | bwd_microstep: 588.69 | bwd_inner_microstep: 588.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6294 [2025-01-21 14:51:57,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.45 | bwd_microstep: 466.67 | bwd_inner_microstep: 466.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4427 [2025-01-21 14:51:58,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 291.60 | bwd_microstep: 324.26 | bwd_inner_microstep: 324.09 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, 
images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:51:58,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.92 | bwd_microstep: 198.98 | bwd_inner_microstep: 198.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:51:59,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 256.04 | bwd_microstep: 280.91 | bwd_inner_microstep: 280.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4613 [2025-01-21 14:51:59,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.69 | optimizer_step: 0.34 [2025-01-21 14:51:59,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 299.66 | bwd_microstep: 348.62 | bwd_inner_microstep: 340.69 | bwd_allreduce_microstep: 7.70 | step_microstep: 11.11 [2025-01-21 14:51:59,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2762.32 | bwd: 3145.10 | bwd_inner: 3136.20 | bwd_allreduce: 8.16 | step: 11.89 91%|█████████ | 397/437 [43:44<04:17, 6.43s/it] {'loss': 0.2398, 'learning_rate': 8.760776244760283e-07, 'epoch': 0.91} 91%|█████████ | 397/437 [43:44<04:17, 6.43s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6702 [2025-01-21 14:52:00,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 430.62 | bwd_microstep: 494.09 | bwd_inner_microstep: 493.93 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3425 [2025-01-21 14:52:01,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 227.81 | bwd_microstep: 256.66 | bwd_inner_microstep: 256.49 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2623 
[2025-01-21 14:52:01,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.74 | bwd_microstep: 209.49 | bwd_inner_microstep: 209.26 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7619 [2025-01-21 14:52:02,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 495.79 | bwd_microstep: 570.13 | bwd_inner_microstep: 569.90 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5205 [2025-01-21 14:52:03,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.10 | bwd_microstep: 381.56 | bwd_inner_microstep: 381.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3585 [2025-01-21 14:52:04,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.98 | bwd_microstep: 269.17 | bwd_inner_microstep: 268.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:52:05,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 376.43 | bwd_microstep: 417.46 | bwd_inner_microstep: 417.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:52:05,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:52:05,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 337.47 | bwd_microstep: 384.41 | bwd_inner_microstep: 376.76 | bwd_allreduce_microstep: 7.42 | step_microstep: 10.74 [2025-01-21 14:52:05,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2627.78 | bwd: 2983.09 | bwd_inner: 2974.41 | bwd_allreduce: 7.86 | step: 11.54 
91%|█████████ | 398/437 [43:50<04:03, 6.25s/it] {'loss': 0.217, 'learning_rate': 8.33124205886171e-07, 'epoch': 0.91} 91%|█████████ | 398/437 [43:50<04:03, 6.25s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5726 [2025-01-21 14:52:06,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 366.54 | bwd_microstep: 420.99 | bwd_inner_microstep: 420.82 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4884 [2025-01-21 14:52:07,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 313.94 | bwd_microstep: 356.86 | bwd_inner_microstep: 356.68 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6127 [2025-01-21 14:52:08,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.49 | bwd_microstep: 446.25 | bwd_inner_microstep: 446.07 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5253 [2025-01-21 14:52:08,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 343.44 | bwd_microstep: 386.19 | bwd_inner_microstep: 386.02 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4968 [2025-01-21 14:52:09,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.67 | bwd_microstep: 361.66 | bwd_inner_microstep: 361.50 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4658 [2025-01-21 14:52:10,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.96 | bwd_microstep: 342.33 | bwd_inner_microstep: 342.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 
[2025-01-21 14:52:11,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 534.29 | bwd_microstep: 613.72 | bwd_inner_microstep: 613.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4044 [2025-01-21 14:52:12,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.85 | optimizer_step: 0.35 [2025-01-21 14:52:12,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.76 | bwd_microstep: 336.10 | bwd_inner_microstep: 303.23 | bwd_allreduce_microstep: 32.72 | step_microstep: 14.33 [2025-01-21 14:52:12,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2848.91 | bwd: 3264.22 | bwd_inner: 3230.37 | bwd_allreduce: 33.18 | step: 15.11 91%|█████████▏| 399/437 [43:56<03:58, 6.28s/it] {'loss': 0.3294, 'learning_rate': 7.912280145325702e-07, 'epoch': 0.91} 91%|█████████▏| 399/437 [43:56<03:58, 6.28s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3753 [2025-01-21 14:52:12,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.93 | bwd_microstep: 279.31 | bwd_inner_microstep: 279.05 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4999 [2025-01-21 14:52:13,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 330.30 | bwd_microstep: 366.11 | bwd_inner_microstep: 365.89 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3122 [2025-01-21 14:52:13,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.52 | bwd_microstep: 242.92 | bwd_inner_microstep: 242.60 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4946 [2025-01-21 14:52:14,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.24 | bwd_microstep: 363.00 | bwd_inner_microstep: 362.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7043 [2025-01-21 14:52:15,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.18 | bwd_microstep: 522.85 | bwd_inner_microstep: 522.60 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4921 [2025-01-21 14:52:16,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.29 | bwd_microstep: 359.35 | bwd_inner_microstep: 359.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:52:17,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 359.69 | bwd_microstep: 399.31 | bwd_inner_microstep: 399.09 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.10 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8192 [2025-01-21 14:52:18,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.78 | optimizer_step: 0.36 [2025-01-21 14:52:18,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 525.91 | bwd_microstep: 624.30 | bwd_inner_microstep: 616.31 | bwd_allreduce_microstep: 7.87 | step_microstep: 12.39 [2025-01-21 14:52:18,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2772.90 | bwd: 3157.30 | bwd_inner: 3147.98 | bwd_allreduce: 8.43 | step: 13.22 92%|█████████▏| 400/437 [44:02<03:51, 6.24s/it] {'loss': 0.3187, 'learning_rate': 7.503913613705971e-07, 'epoch': 0.91} 
92%|█████████▏| 400/437 [44:02<03:51, 6.24s/it][INFO|trainer.py:2936] 2025-01-21 14:52:19,478 >> Saving model checkpoint to work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400 [INFO|configuration_utils.py:473] 2025-01-21 14:52:19,480 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/config.json [INFO|configuration_utils.py:594] 2025-01-21 14:52:19,481 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/generation_config.json [INFO|modeling_utils.py:2493] 2025-01-21 14:52:22,202 >> Model weights saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-01-21 14:52:22,204 >> tokenizer config file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-01-21 14:52:22,205 >> Special tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-01-21 14:52:22,205 >> added tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/added_tokens.json [2025-01-21 14:52:22,512] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint global_step400 is about to be saved! [2025-01-21 14:52:22,539] [INFO] [logging.py:128:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt [2025-01-21 14:52:22,539] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt... 
[2025-01-21 14:52:24,358] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/mp_rank_00_model_states.pt. [2025-01-21 14:52:24,360] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2025-01-21 14:52:24,412] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2025-01-21 14:52:24,413] [INFO] [engine.py:3536:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tmp-checkpoint-400/global_step400/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-01-21 14:52:24,413] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step400 is ready now! 
[INFO|trainer.py:3028] 2025-01-21 14:52:24,429 >> Deleting older checkpoint [work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/checkpoint-200] due to args.save_total_limit dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7442 [2025-01-21 14:52:25,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 478.32 | bwd_microstep: 553.08 | bwd_inner_microstep: 552.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5019 [2025-01-21 14:52:26,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.87 | bwd_microstep: 366.27 | bwd_inner_microstep: 366.09 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2594 [2025-01-21 14:52:26,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 175.50 | bwd_microstep: 208.06 | bwd_inner_microstep: 207.70 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7050 [2025-01-21 14:52:27,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.26 | bwd_microstep: 520.98 | bwd_inner_microstep: 520.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7039 [2025-01-21 14:52:28,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.59 | bwd_microstep: 520.00 | bwd_inner_microstep: 519.68 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6742 [2025-01-21 14:52:29,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 436.56 | bwd_microstep: 495.46 | bwd_inner_microstep: 495.25 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, 
dynamic token length: 6477 [2025-01-21 14:52:30,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.34 | bwd_microstep: 476.34 | bwd_inner_microstep: 476.15 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3149 [2025-01-21 14:52:31,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.73 | optimizer_step: 0.36 [2025-01-21 14:52:31,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 205.64 | bwd_microstep: 248.79 | bwd_inner_microstep: 241.05 | bwd_allreduce_microstep: 7.63 | step_microstep: 11.41 [2025-01-21 14:52:31,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2950.92 | bwd: 3389.12 | bwd_inner: 3380.11 | bwd_allreduce: 8.10 | step: 12.25 92%|█████████▏| 401/437 [44:16<04:57, 8.27s/it] {'loss': 0.2297, 'learning_rate': 7.106164989124708e-07, 'epoch': 0.92} 92%|█████████▏| 401/437 [44:16<04:57, 8.27s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7189 [2025-01-21 14:52:32,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 459.64 | bwd_microstep: 533.59 | bwd_inner_microstep: 533.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4475 [2025-01-21 14:52:32,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.11 | bwd_microstep: 325.59 | bwd_inner_microstep: 325.38 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4956 [2025-01-21 14:52:33,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 321.25 | bwd_microstep: 361.68 | bwd_inner_microstep: 361.50 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6270 warning: The size of tensor 
a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:52:34,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.59 | bwd_microstep: 464.33 | bwd_inner_microstep: 464.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5993 [2025-01-21 14:52:35,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.81 | bwd_microstep: 433.25 | bwd_inner_microstep: 433.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7573 [2025-01-21 14:52:36,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.97 | bwd_microstep: 566.26 | bwd_inner_microstep: 566.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:52:37,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.87 | bwd_microstep: 604.31 | bwd_inner_microstep: 604.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:52:38,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.99 | optimizer_gradients: 0.86 | optimizer_step: 0.41 [2025-01-21 14:52:38,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.46 | 
bwd_microstep: 415.60 | bwd_inner_microstep: 375.41 | bwd_allreduce_microstep: 39.98 | step_microstep: 14.89 [2025-01-21 14:52:38,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3240.54 | bwd: 3704.73 | bwd_inner: 3663.51 | bwd_allreduce: 40.42 | step: 15.73 92%|█████████▏| 402/437 [44:23<04:38, 7.95s/it] {'loss': 0.4698, 'learning_rate': 6.719056211030128e-07, 'epoch': 0.92} 92%|█████████▏| 402/437 [44:23<04:38, 7.95s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4299 [2025-01-21 14:52:39,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 278.12 | bwd_microstep: 315.81 | bwd_inner_microstep: 315.50 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7415 [2025-01-21 14:52:40,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.51 | bwd_microstep: 553.14 | bwd_inner_microstep: 552.95 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3344 [2025-01-21 14:52:40,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 226.03 | bwd_microstep: 257.48 | bwd_inner_microstep: 257.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3863 [2025-01-21 14:52:41,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.77 | bwd_microstep: 287.75 | bwd_inner_microstep: 287.47 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6500 [2025-01-21 14:52:42,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.04 | bwd_microstep: 478.16 | bwd_inner_microstep: 478.00 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6224 [2025-01-21 
14:52:43,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.00 | bwd_microstep: 459.61 | bwd_inner_microstep: 459.30 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:52:44,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 470.39 | bwd_microstep: 537.93 | bwd_inner_microstep: 537.78 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:52:44,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.73 | optimizer_step: 0.35 [2025-01-21 14:52:44,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.70 | bwd_microstep: 225.42 | bwd_inner_microstep: 217.75 | bwd_allreduce_microstep: 7.57 | step_microstep: 11.42 [2025-01-21 14:52:44,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2716.39 | bwd: 3115.43 | bwd_inner: 3106.51 | bwd_allreduce: 8.03 | step: 12.28 92%|█████████▏| 403/437 [44:29<04:10, 7.38s/it] {'loss': 0.2209, 'learning_rate': 6.342608631986346e-07, 'epoch': 0.92} 92%|█████████▏| 403/437 [44:29<04:10, 7.38s/it]dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6854 [2025-01-21 14:52:45,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 435.47 | bwd_microstep: 505.66 | bwd_inner_microstep: 505.32 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2614 [2025-01-21 14:52:45,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.55 | bwd_microstep: 210.19 | bwd_inner_microstep: 210.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7889 warning: The size of tensor a (7884) must match the size of tensor b 
(7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:52:47,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 512.75 | bwd_microstep: 587.32 | bwd_inner_microstep: 587.16 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7848 [2025-01-21 14:52:48,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 509.84 | bwd_microstep: 587.34 | bwd_inner_microstep: 586.85 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5189 [2025-01-21 14:52:48,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 341.81 | bwd_microstep: 377.97 | bwd_inner_microstep: 377.64 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.10 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4648 [2025-01-21 14:52:49,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.64 | bwd_microstep: 343.68 | bwd_inner_microstep: 343.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4673 [2025-01-21 14:52:50,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 291.17 | bwd_microstep: 343.18 | bwd_inner_microstep: 343.02 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6320 [2025-01-21 14:52:51,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:52:51,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 403.67 | bwd_microstep: 794.18 | bwd_inner_microstep: 470.29 | bwd_allreduce_microstep: 323.77 | step_microstep: 13.12 [2025-01-21 14:52:51,484] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2978.74 | bwd: 3749.69 | bwd_inner: 3424.24 | bwd_allreduce: 324.37 | step: 14.07 92%|█████████▏| 404/437 [44:36<03:59, 7.25s/it] {'loss': 0.3496, 'learning_rate': 5.976843016495482e-07, 'epoch': 0.92} 92%|█████████▏| 404/437 [44:36<03:59, 7.25s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6220 [2025-01-21 14:52:52,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 400.10 | bwd_microstep: 458.29 | bwd_inner_microstep: 458.09 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4040 [2025-01-21 14:52:52,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.94 | bwd_microstep: 301.22 | bwd_inner_microstep: 301.05 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2861 [2025-01-21 14:52:53,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.46 | bwd_microstep: 217.81 | bwd_inner_microstep: 217.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2271 [2025-01-21 14:52:53,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.36 | bwd_microstep: 199.89 | bwd_inner_microstep: 199.73 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3595 [2025-01-21 14:52:54,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.73 | bwd_microstep: 263.91 | bwd_inner_microstep: 263.75 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:52:54,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 183.76 | bwd_microstep: 203.56 | bwd_inner_microstep: 203.40 | 
bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:52:55,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.64 | bwd_microstep: 537.83 | bwd_inner_microstep: 537.66 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8192 [2025-01-21 14:52:58,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.74 | optimizer_step: 0.33 [2025-01-21 14:52:58,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.76 | bwd_microstep: 1908.62 | bwd_inner_microstep: 615.00 | bwd_allreduce_microstep: 1293.50 | step_microstep: 13.10 [2025-01-21 14:52:58,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2433.61 | bwd: 4091.26 | bwd_inner: 2796.58 | bwd_allreduce: 1293.98 | step: 13.89 93%|█████████▎| 405/437 [44:42<03:47, 7.10s/it] {'loss': 0.3261, 'learning_rate': 5.621779539852435e-07, 'epoch': 0.93} 93%|█████████▎| 405/437 [44:42<03:47, 7.10s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3948 [2025-01-21 14:52:58,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.96 | bwd_microstep: 294.28 | bwd_inner_microstep: 294.12 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2867 [2025-01-21 14:52:59,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.90 | bwd_microstep: 217.64 | bwd_inner_microstep: 217.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7108 
[2025-01-21 14:53:00,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.46 | bwd_microstep: 526.02 | bwd_inner_microstep: 525.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3105 [2025-01-21 14:53:00,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 211.46 | bwd_microstep: 237.02 | bwd_inner_microstep: 236.85 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5453 [2025-01-21 14:53:01,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 361.66 | bwd_microstep: 405.45 | bwd_inner_microstep: 405.24 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:53:02,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.91 | bwd_microstep: 375.52 | bwd_inner_microstep: 375.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:53:02,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 221.96 | bwd_microstep: 247.99 | bwd_inner_microstep: 247.76 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4038 [2025-01-21 14:53:04,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.79 | optimizer_step: 0.34 [2025-01-21 14:53:04,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.48 | bwd_microstep: 1375.62 | bwd_inner_microstep: 303.92 | bwd_allreduce_microstep: 1071.56 | step_microstep: 17.62 [2025-01-21 14:53:04,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2306.61 | bwd: 3679.66 | bwd_inner: 2606.88 | bwd_allreduce: 1072.04 | step: 18.42 
93%|█████████▎| 406/437 [44:49<03:31, 6.83s/it] {'loss': 0.2379, 'learning_rate': 5.277437787031892e-07, 'epoch': 0.93} 93%|█████████▎| 406/437 [44:49<03:31, 6.83s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3043 [2025-01-21 14:53:04,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 195.09 | bwd_microstep: 259.23 | bwd_inner_microstep: 259.06 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:53:06,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 532.98 | bwd_microstep: 616.10 | bwd_inner_microstep: 615.88 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:53:07,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.12 | bwd_microstep: 604.50 | bwd_inner_microstep: 604.34 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5785 [2025-01-21 14:53:08,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.23 | bwd_microstep: 425.17 | bwd_inner_microstep: 424.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7058 [2025-01-21 14:53:09,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 455.49 | bwd_microstep: 523.19 | bwd_inner_microstep: 522.92 | bwd_allreduce_microstep: 
0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5983 [2025-01-21 14:53:09,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 389.02 | bwd_microstep: 433.11 | bwd_inner_microstep: 432.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:53:10,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 203.37 | bwd_microstep: 220.04 | bwd_inner_microstep: 219.87 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:53:10,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.80 | optimizer_step: 0.34 [2025-01-21 14:53:10,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.16 | bwd_microstep: 226.93 | bwd_inner_microstep: 219.15 | bwd_allreduce_microstep: 7.67 | step_microstep: 11.59 [2025-01-21 14:53:10,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2877.29 | bwd: 3308.41 | bwd_inner: 3299.49 | bwd_allreduce: 8.15 | step: 12.42 93%|█████████▎| 407/437 [44:55<03:21, 6.71s/it] {'loss': 0.347, 'learning_rate': 4.943836751608211e-07, 'epoch': 0.93} 93%|█████████▎| 407/437 [44:55<03:21, 6.71s/it]dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4098 [2025-01-21 14:53:11,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 265.22 | bwd_microstep: 306.84 | bwd_inner_microstep: 306.64 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7408 [2025-01-21 14:53:12,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 479.83 | bwd_microstep: 549.20 | bwd_inner_microstep: 548.99 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 31, 
images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:53:13,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.67 | bwd_microstep: 605.39 | bwd_inner_microstep: 605.18 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4675 [2025-01-21 14:53:14,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 307.36 | bwd_microstep: 342.36 | bwd_inner_microstep: 342.20 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5991 [2025-01-21 14:53:15,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 389.19 | bwd_microstep: 434.78 | bwd_inner_microstep: 434.58 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7567 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:53:16,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.70 | bwd_microstep: 562.36 | bwd_inner_microstep: 562.20 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:53:16,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.26 | bwd_microstep: 194.57 | bwd_inner_microstep: 194.40 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5686 [2025-01-21 14:53:17,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.98 | optimizer_gradients: 0.97 | optimizer_step: 0.42 [2025-01-21 14:53:17,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 358.76 | bwd_microstep: 881.58 | bwd_inner_microstep: 415.33 | bwd_allreduce_microstep: 466.00 | step_microstep: 14.80 [2025-01-21 14:53:17,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2989.83 | bwd: 3877.23 | bwd_inner: 3409.90 | bwd_allreduce: 466.47 | step: 15.63 93%|█████████▎| 408/437 [45:02<03:17, 6.83s/it] {'loss': 0.3664, 'learning_rate': 4.6209948347075483e-07, 'epoch': 0.93} 93%|█████████▎| 408/437 [45:02<03:17, 6.83s/it]dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6091 [2025-01-21 14:53:18,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 391.99 | bwd_microstep: 442.37 | bwd_inner_microstep: 442.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7649 [2025-01-21 14:53:19,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.52 | bwd_microstep: 568.43 | bwd_inner_microstep: 568.22 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6562 [2025-01-21 14:53:20,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 426.27 | bwd_microstep: 483.02 | bwd_inner_microstep: 482.71 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6004 [2025-01-21 14:53:21,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 390.05 | bwd_microstep: 435.67 | bwd_inner_microstep: 435.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2554 [2025-01-21 14:53:22,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.73 | bwd_microstep: 210.51 | 
bwd_inner_microstep: 210.34 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5708 [2025-01-21 14:53:22,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 376.93 | bwd_microstep: 420.06 | bwd_inner_microstep: 419.87 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:53:23,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 338.79 | bwd_microstep: 376.95 | bwd_inner_microstep: 376.73 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.18 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:53:24,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.74 | optimizer_step: 0.35 [2025-01-21 14:53:24,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 339.10 | bwd_microstep: 385.22 | bwd_inner_microstep: 377.35 | bwd_allreduce_microstep: 7.76 | step_microstep: 11.54 [2025-01-21 14:53:24,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2933.20 | bwd: 3322.35 | bwd_inner: 3313.24 | bwd_allreduce: 8.29 | step: 12.43 94%|█████████▎| 409/437 [45:09<03:08, 6.72s/it] {'loss': 0.2325, 'learning_rate': 4.308929843993115e-07, 'epoch': 0.93} 94%|█████████▎| 409/437 [45:09<03:08, 6.72s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2662 [2025-01-21 14:53:24,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.61 | bwd_microstep: 210.04 | bwd_inner_microstep: 209.55 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 25, images per sample: 25.0, dynamic 
token length: 6866 [2025-01-21 14:53:25,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.82 | bwd_microstep: 508.11 | bwd_inner_microstep: 507.79 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6798 [2025-01-21 14:53:26,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 441.66 | bwd_microstep: 503.73 | bwd_inner_microstep: 503.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4920 [2025-01-21 14:53:27,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.87 | bwd_microstep: 359.90 | bwd_inner_microstep: 359.72 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3827 [2025-01-21 14:53:28,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 254.16 | bwd_microstep: 279.80 | bwd_inner_microstep: 279.64 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5681 [2025-01-21 14:53:28,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 376.90 | bwd_microstep: 418.50 | bwd_inner_microstep: 418.33 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6833 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:53:29,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 442.40 | bwd_microstep: 506.09 | bwd_inner_microstep: 505.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5743 [2025-01-21 14:53:30,723] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.95 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:53:30,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 378.05 | bwd_microstep: 427.91 | bwd_inner_microstep: 420.61 | bwd_allreduce_microstep: 7.19 | step_microstep: 10.98 [2025-01-21 14:53:30,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2840.29 | bwd: 3214.25 | bwd_inner: 3205.50 | bwd_allreduce: 7.79 | step: 11.94 94%|█████████▍| 410/437 [45:15<02:57, 6.59s/it] {'loss': 0.3251, 'learning_rate': 4.0076589926826503e-07, 'epoch': 0.94} 94%|█████████▍| 410/437 [45:15<02:57, 6.59s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:53:31,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 530.86 | bwd_microstep: 614.95 | bwd_inner_microstep: 614.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3262 [2025-01-21 14:53:32,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.77 | bwd_microstep: 245.09 | bwd_inner_microstep: 244.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2359 [2025-01-21 14:53:32,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.90 | bwd_microstep: 201.77 | bwd_inner_microstep: 201.51 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6281 [2025-01-21 14:53:33,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.64 | bwd_microstep: 465.53 | bwd_inner_microstep: 465.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7596 [2025-01-21 14:53:34,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 488.08 | bwd_microstep: 565.33 | bwd_inner_microstep: 565.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.13 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:53:35,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 539.20 | bwd_microstep: 606.34 | bwd_inner_microstep: 606.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:53:36,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 358.55 | bwd_microstep: 398.39 | bwd_inner_microstep: 398.22 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:53:37,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.71 | optimizer_step: 0.34 [2025-01-21 14:53:37,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 452.47 | bwd_microstep: 525.24 | bwd_inner_microstep: 517.75 | bwd_allreduce_microstep: 7.39 | step_microstep: 11.17 [2025-01-21 14:53:37,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3158.33 | bwd: 3622.76 | bwd_inner: 3614.11 | bwd_allreduce: 7.87 | step: 11.97 94%|█████████▍| 411/437 [45:22<02:54, 6.72s/it] {'loss': 0.3222, 'learning_rate': 3.7171988985991835e-07, 'epoch': 0.94} 94%|█████████▍| 411/437 [45:22<02:54, 6.72s/it]dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 3126 [2025-01-21 14:53:38,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.05 | bwd_microstep: 241.97 | bwd_inner_microstep: 241.77 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT 
batch size: 18, images per sample: 18.0, dynamic token length: 5082 [2025-01-21 14:53:38,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 327.07 | bwd_microstep: 369.63 | bwd_inner_microstep: 369.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7993 [2025-01-21 14:53:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 518.22 | bwd_microstep: 598.29 | bwd_inner_microstep: 598.12 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7867 [2025-01-21 14:53:41,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.94 | bwd_microstep: 587.61 | bwd_inner_microstep: 587.44 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7331 [2025-01-21 14:53:42,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.54 | bwd_microstep: 547.18 | bwd_inner_microstep: 547.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6760 [2025-01-21 14:53:43,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 438.06 | bwd_microstep: 498.31 | bwd_inner_microstep: 498.14 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:53:44,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 451.80 | bwd_microstep: 517.42 | bwd_inner_microstep: 517.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3623 [2025-01-21 14:53:44,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.88 | optimizer_step: 0.36 [2025-01-21 14:53:44,769] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 232.76 | bwd_microstep: 282.26 | bwd_inner_microstep: 271.92 | bwd_allreduce_microstep: 10.21 | step_microstep: 14.89 [2025-01-21 14:53:44,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3164.28 | bwd: 3642.80 | bwd_inner: 3631.42 | bwd_allreduce: 10.68 | step: 15.72 94%|█████████▍| 412/437 [45:29<02:50, 6.81s/it] {'loss': 0.3088, 'learning_rate': 3.4375655832542763e-07, 'epoch': 0.94} 94%|█████████▍| 412/437 [45:29<02:50, 6.81s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3785 [2025-01-21 14:53:45,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 246.63 | bwd_microstep: 279.32 | bwd_inner_microstep: 279.09 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2640 [2025-01-21 14:53:45,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.91 | bwd_microstep: 213.00 | bwd_inner_microstep: 212.77 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6321 [2025-01-21 14:53:46,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.82 | bwd_microstep: 467.24 | bwd_inner_microstep: 467.08 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4921 [2025-01-21 14:53:47,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 318.93 | bwd_microstep: 357.81 | bwd_inner_microstep: 357.51 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.10 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:53:48,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.52 | bwd_microstep: 604.98 | bwd_inner_microstep: 604.77 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 
21, images per sample: 21.0, dynamic token length: 5682 [2025-01-21 14:53:49,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 375.30 | bwd_microstep: 416.72 | bwd_inner_microstep: 416.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5147 [2025-01-21 14:53:50,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 329.84 | bwd_microstep: 377.91 | bwd_inner_microstep: 377.68 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4458 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:53:51,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.98 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:53:51,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.67 | bwd_microstep: 1269.85 | bwd_inner_microstep: 326.76 | bwd_allreduce_microstep: 942.85 | step_microstep: 13.36 [2025-01-21 14:53:51,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2678.47 | bwd: 3986.99 | bwd_inner: 3042.67 | bwd_allreduce: 943.40 | step: 14.20 95%|█████████▍| 413/437 [45:36<02:44, 6.84s/it] {'loss': 0.3003, 'learning_rate': 3.1687744709644197e-07, 'epoch': 0.94} 95%|█████████▍| 413/437 [45:36<02:44, 6.84s/it]dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4297 [2025-01-21 14:53:52,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 277.55 | bwd_microstep: 314.81 | bwd_inner_microstep: 314.61 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7676 [2025-01-21 14:53:53,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 494.55 | bwd_microstep: 570.78 | bwd_inner_microstep: 570.61 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6549 [2025-01-21 14:53:54,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.96 | bwd_microstep: 484.23 | bwd_inner_microstep: 484.06 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8141 [2025-01-21 14:53:55,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.88 | bwd_microstep: 609.42 | bwd_inner_microstep: 609.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2288 [2025-01-21 14:53:55,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.84 | bwd_microstep: 200.64 | bwd_inner_microstep: 200.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:53:57,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.36 | bwd_microstep: 607.68 | bwd_inner_microstep: 607.47 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:53:58,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.69 | bwd_microstep: 561.00 | bwd_inner_microstep: 560.68 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6500 [2025-01-21 14:53:59,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.73 
| optimizer_step: 0.34 [2025-01-21 14:53:59,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 406.74 | bwd_microstep: 487.26 | bwd_inner_microstep: 479.61 | bwd_allreduce_microstep: 7.48 | step_microstep: 11.25 [2025-01-21 14:53:59,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3326.41 | bwd: 3835.95 | bwd_inner: 3827.20 | bwd_allreduce: 7.95 | step: 12.06 95%|█████████▍| 414/437 [45:43<02:41, 7.01s/it] {'loss': 0.359, 'learning_rate': 2.9108403880000247e-07, 'epoch': 0.95} 95%|█████████▍| 414/437 [45:43<02:41, 7.01s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7177 [2025-01-21 14:54:00,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.65 | bwd_microstep: 531.61 | bwd_inner_microstep: 531.45 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6609 [2025-01-21 14:54:01,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 428.37 | bwd_microstep: 488.60 | bwd_inner_microstep: 488.35 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.14 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2601 [2025-01-21 14:54:01,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 192.48 | bwd_microstep: 264.39 | bwd_inner_microstep: 264.16 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7616 [2025-01-21 14:54:02,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.12 | bwd_microstep: 568.30 | bwd_inner_microstep: 568.14 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3889 [2025-01-21 14:54:03,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 260.25 | bwd_microstep: 287.57 | bwd_inner_microstep: 287.41 | bwd_allreduce_microstep: 0.06 | 
step_microstep: 0.11 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5718 [2025-01-21 14:54:04,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 379.38 | bwd_microstep: 420.30 | bwd_inner_microstep: 420.13 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7275 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:54:05,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 473.76 | bwd_microstep: 538.72 | bwd_inner_microstep: 538.46 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:06,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.15 | optimizer_gradients: 0.62 | optimizer_step: 0.33 [2025-01-21 14:54:06,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.70 | bwd_microstep: 611.64 | bwd_inner_microstep: 604.36 | bwd_allreduce_microstep: 7.16 | step_microstep: 10.70 [2025-01-21 14:54:06,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3237.54 | bwd: 3711.26 | bwd_inner: 3702.82 | bwd_allreduce: 7.64 | step: 11.50 95%|█████████▍| 415/437 [45:50<02:35, 7.06s/it] {'loss': 0.4534, 'learning_rate': 2.663777561767855e-07, 'epoch': 0.95} 95%|█████████▍| 415/437 [45:50<02:35, 7.06s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5306 [2025-01-21 14:54:07,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 344.43 | bwd_microstep: 384.76 | 
bwd_inner_microstep: 384.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6851 [2025-01-21 14:54:08,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 446.05 | bwd_microstep: 508.60 | bwd_inner_microstep: 508.37 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6839 [2025-01-21 14:54:08,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.02 | bwd_microstep: 507.17 | bwd_inner_microstep: 507.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3652 [2025-01-21 14:54:09,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 242.34 | bwd_microstep: 272.41 | bwd_inner_microstep: 272.18 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6286 [2025-01-21 14:54:10,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.35 | bwd_microstep: 466.30 | bwd_inner_microstep: 466.14 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4694 [2025-01-21 14:54:11,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 308.72 | bwd_microstep: 343.32 | bwd_inner_microstep: 343.17 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5725 [2025-01-21 14:54:11,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.21 | bwd_microstep: 421.27 | bwd_inner_microstep: 421.11 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7007 [2025-01-21 14:54:12,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.69 | optimizer_step: 0.33 [2025-01-21 14:54:12,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 451.33 | bwd_microstep: 525.77 | bwd_inner_microstep: 518.41 | bwd_allreduce_microstep: 7.21 | step_microstep: 10.80 [2025-01-21 14:54:12,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3027.27 | bwd: 3429.74 | bwd_inner: 3421.29 | bwd_allreduce: 7.68 | step: 11.61 95%|█████████▌| 416/437 [45:57<02:25, 6.95s/it] {'loss': 0.3747, 'learning_rate': 2.4275996200261e-07, 'epoch': 0.95} 95%|█████████▌| 416/437 [45:57<02:25, 6.95s/it]dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3291 [2025-01-21 14:54:13,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.79 | bwd_microstep: 244.50 | bwd_inner_microstep: 244.33 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6312 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:54:14,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 411.02 | bwd_microstep: 468.76 | bwd_inner_microstep: 468.50 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:15,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 549.96 | 
bwd_microstep: 605.38 | bwd_inner_microstep: 605.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7039 [2025-01-21 14:54:16,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 453.46 | bwd_microstep: 520.72 | bwd_inner_microstep: 520.44 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:54:16,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 202.98 | bwd_microstep: 219.54 | bwd_inner_microstep: 219.37 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:54:17,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.38 | bwd_microstep: 302.30 | bwd_inner_microstep: 302.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4887 [2025-01-21 14:54:18,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 318.13 | bwd_microstep: 355.37 | bwd_inner_microstep: 355.06 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6517 [2025-01-21 14:54:19,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.80 | optimizer_step: 0.36 [2025-01-21 14:54:19,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.15 | bwd_microstep: 827.47 | bwd_inner_microstep: 481.63 | bwd_allreduce_microstep: 345.71 | step_microstep: 14.10 [2025-01-21 14:54:19,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2829.70 | bwd: 3544.15 | bwd_inner: 3197.16 | bwd_allreduce: 346.19 | step: 14.87 95%|█████████▌| 417/437 [46:04<02:16, 6.84s/it] {'loss': 0.2838, 'learning_rate': 2.2023195901327731e-07, 
'epoch': 0.95} 95%|█████████▌| 417/437 [46:04<02:16, 6.84s/it]dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5446 [2025-01-21 14:54:20,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 354.24 | bwd_microstep: 400.87 | bwd_inner_microstep: 400.70 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7696 [2025-01-21 14:54:21,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.16 | bwd_microstep: 573.90 | bwd_inner_microstep: 573.65 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5004 [2025-01-21 14:54:22,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 325.07 | bwd_microstep: 367.33 | bwd_inner_microstep: 367.16 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.14 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8130 [2025-01-21 14:54:23,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 526.64 | bwd_microstep: 609.01 | bwd_inner_microstep: 608.80 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2555 [2025-01-21 14:54:23,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.01 | bwd_microstep: 205.90 | bwd_inner_microstep: 205.58 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4921 [2025-01-21 14:54:24,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 320.51 | bwd_microstep: 359.60 | bwd_inner_microstep: 359.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:54:24,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 236.13 | 
bwd_microstep: 261.45 | bwd_inner_microstep: 261.13 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7498 [2025-01-21 14:54:26,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.83 | optimizer_step: 0.38 [2025-01-21 14:54:26,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 483.60 | bwd_microstep: 570.06 | bwd_inner_microstep: 561.96 | bwd_allreduce_microstep: 7.97 | step_microstep: 11.99 [2025-01-21 14:54:26,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2918.19 | bwd: 3348.24 | bwd_inner: 3338.93 | bwd_allreduce: 8.43 | step: 12.80 96%|█████████▌| 418/437 [46:10<02:08, 6.74s/it] {'loss': 0.2783, 'learning_rate': 1.9879498983270685e-07, 'epoch': 0.96} 96%|█████████▌| 418/437 [46:10<02:08, 6.74s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6924 [2025-01-21 14:54:27,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 447.56 | bwd_microstep: 509.17 | bwd_inner_microstep: 509.00 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4716 [2025-01-21 14:54:27,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 309.48 | bwd_microstep: 344.04 | bwd_inner_microstep: 343.86 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4432 [2025-01-21 14:54:28,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 291.93 | bwd_microstep: 324.71 | bwd_inner_microstep: 324.49 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 20, images 
per sample: 20.0, dynamic token length: 5476 [2025-01-21 14:54:29,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.15 | bwd_microstep: 404.60 | bwd_inner_microstep: 404.41 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6778 [2025-01-21 14:54:30,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 440.42 | bwd_microstep: 500.70 | bwd_inner_microstep: 500.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4622 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:54:30,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 306.88 | bwd_microstep: 338.51 | bwd_inner_microstep: 338.35 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:54:31,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 159.49 | bwd_microstep: 202.49 | bwd_inner_microstep: 202.32 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:33,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.42 | optimizer_gradients: 0.74 | optimizer_step: 0.35 [2025-01-21 14:54:33,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.22 | bwd_microstep: 1671.01 | bwd_inner_microstep: 611.04 | bwd_allreduce_microstep: 1059.82 | step_microstep: 17.85 [2025-01-21 
14:54:33,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2870.96 | bwd: 4295.35 | bwd_inner: 3234.31 | bwd_allreduce: 1060.29 | step: 18.65 96%|█████████▌| 419/437 [46:18<02:04, 6.94s/it] {'loss': 0.4112, 'learning_rate': 1.7845023690439944e-07, 'epoch': 0.96} 96%|█████████▌| 419/437 [46:18<02:04, 6.94s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7296 [2025-01-21 14:54:34,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 468.51 | bwd_microstep: 540.06 | bwd_inner_microstep: 539.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7209 [2025-01-21 14:54:35,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 461.38 | bwd_microstep: 535.90 | bwd_inner_microstep: 535.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2318 [2025-01-21 14:54:35,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.68 | bwd_microstep: 203.45 | bwd_inner_microstep: 203.28 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2271 [2025-01-21 14:54:36,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 160.65 | bwd_microstep: 198.69 | bwd_inner_microstep: 198.52 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032 [2025-01-21 14:54:36,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.48 | bwd_microstep: 226.43 | bwd_inner_microstep: 225.94 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.21 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 8047 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:37,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 508.85 | bwd_microstep: 598.65 | bwd_inner_microstep: 598.49 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4668 [2025-01-21 14:54:38,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 301.34 | bwd_microstep: 345.01 | bwd_inner_microstep: 344.68 | bwd_allreduce_microstep: 0.11 | step_microstep: 0.14 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3760 [2025-01-21 14:54:39,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.05 | optimizer_gradients: 0.68 | optimizer_step: 0.36 [2025-01-21 14:54:39,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 244.50 | bwd_microstep: 515.76 | bwd_inner_microstep: 280.70 | bwd_allreduce_microstep: 234.95 | step_microstep: 12.95 [2025-01-21 14:54:39,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2515.22 | bwd: 3164.14 | bwd_inner: 2927.60 | bwd_allreduce: 235.56 | step: 13.81 96%|█████████▌| 420/437 [46:24<01:52, 6.63s/it] {'loss': 0.3825, 'learning_rate': 1.591988224262053e-07, 'epoch': 0.96} 96%|█████████▌| 420/437 [46:24<01:52, 6.63s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2341 [2025-01-21 14:54:39,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 164.40 | bwd_microstep: 196.46 | bwd_inner_microstep: 196.30 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2338 [2025-01-21 14:54:40,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 167.07 | bwd_microstep: 208.06 | bwd_inner_microstep: 207.89 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per 
sample: 9.0, dynamic token length: 2583 [2025-01-21 14:54:40,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.48 | bwd_microstep: 221.11 | bwd_inner_microstep: 220.79 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7610 [2025-01-21 14:54:41,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 494.33 | bwd_microstep: 566.96 | bwd_inner_microstep: 566.76 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.12 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:42,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 555.37 | bwd_microstep: 605.87 | bwd_inner_microstep: 605.66 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6247 [2025-01-21 14:54:43,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.92 | bwd_microstep: 462.64 | bwd_inner_microstep: 462.47 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:54:44,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 170.65 | bwd_microstep: 202.42 | bwd_inner_microstep: 202.25 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), 
vit_embeds.shape=torch.Size([7936, 896]) dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5417 [2025-01-21 14:54:45,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.90 | optimizer_gradients: 0.81 | optimizer_step: 0.34 [2025-01-21 14:54:45,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 359.96 | bwd_microstep: 512.80 | bwd_inner_microstep: 400.89 | bwd_allreduce_microstep: 111.78 | step_microstep: 13.83 [2025-01-21 14:54:45,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2501.01 | bwd: 2976.45 | bwd_inner: 2863.33 | bwd_allreduce: 112.29 | step: 14.66 96%|█████████▋| 421/437 [46:29<01:41, 6.35s/it] {'loss': 0.496, 'learning_rate': 1.4104180828844237e-07, 'epoch': 0.96} 96%|█████████▋| 421/437 [46:29<01:41, 6.35s/it]dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2403 [2025-01-21 14:54:45,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 165.97 | bwd_microstep: 200.65 | bwd_inner_microstep: 200.16 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4745 [2025-01-21 14:54:46,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 311.77 | bwd_microstep: 350.92 | bwd_inner_microstep: 350.74 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.25 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7591 [2025-01-21 14:54:47,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.60 | bwd_microstep: 567.72 | bwd_inner_microstep: 567.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7836 [2025-01-21 14:54:48,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.56 | bwd_microstep: 589.61 | bwd_inner_microstep: 589.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT 
batch size: 27, images per sample: 27.0, dynamic token length: 7305 [2025-01-21 14:54:49,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 477.22 | bwd_microstep: 547.56 | bwd_inner_microstep: 547.26 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:54:50,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.80 | bwd_microstep: 611.44 | bwd_inner_microstep: 610.95 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:54:51,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 540.52 | bwd_microstep: 604.97 | bwd_inner_microstep: 604.82 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6714 [2025-01-21 14:54:52,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:54:52,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 430.24 | bwd_microstep: 502.59 | bwd_inner_microstep: 495.04 | bwd_allreduce_microstep: 7.33 | step_microstep: 10.74 [2025-01-21 14:54:52,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3483.54 | bwd: 3975.69 | bwd_inner: 3966.45 | bwd_allreduce: 8.04 | step: 12.07 97%|█████████▋| 422/437 [46:37<01:41, 6.75s/it] {'loss': 0.4617, 'learning_rate': 1.239801960153053e-07, 'epoch': 0.96} 
97%|█████████▋| 422/437 [46:37<01:41, 6.75s/it]dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 5080 [2025-01-21 14:54:53,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 326.88 | bwd_microstep: 371.22 | bwd_inner_microstep: 371.05 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:54:54,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.28 | bwd_microstep: 606.22 | bwd_inner_microstep: 606.04 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6279 [2025-01-21 14:54:55,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.60 | bwd_microstep: 465.53 | bwd_inner_microstep: 465.36 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6275 [2025-01-21 14:54:56,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.31 | bwd_microstep: 465.68 | bwd_inner_microstep: 465.51 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5459 [2025-01-21 14:54:57,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.32 | bwd_microstep: 404.77 | bwd_inner_microstep: 404.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:54:58,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 490.01 | bwd_microstep: 560.73 | bwd_inner_microstep: 560.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 
dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 5947 [2025-01-21 14:54:59,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 389.91 | bwd_microstep: 434.77 | bwd_inner_microstep: 434.51 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2453 [2025-01-21 14:54:59,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.76 | optimizer_step: 0.34 [2025-01-21 14:54:59,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 168.09 | bwd_microstep: 237.87 | bwd_inner_microstep: 227.38 | bwd_allreduce_microstep: 10.27 | step_microstep: 11.66 [2025-01-21 14:54:59,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3097.22 | bwd: 3546.91 | bwd_inner: 3535.36 | bwd_allreduce: 10.73 | step: 12.45 97%|█████████▋| 423/437 [46:44<01:35, 6.79s/it] {'loss': 0.2298, 'learning_rate': 1.0801492670962976e-07, 'epoch': 0.97} 97%|█████████▋| 423/437 [46:44<01:35, 6.79s/it]dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4706 [2025-01-21 14:55:00,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 301.88 | bwd_microstep: 345.05 | bwd_inner_microstep: 344.56 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7648 [2025-01-21 14:55:01,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.29 | bwd_microstep: 569.82 | bwd_inner_microstep: 569.32 | bwd_allreduce_microstep: 0.20 | step_microstep: 0.28 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5238 [2025-01-21 14:55:02,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 342.94 | bwd_microstep: 383.25 | bwd_inner_microstep: 382.92 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, 
dynamic token length: 5471 [2025-01-21 14:55:02,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 362.53 | bwd_microstep: 404.70 | bwd_inner_microstep: 404.48 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4134 [2025-01-21 14:55:03,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 273.76 | bwd_microstep: 305.24 | bwd_inner_microstep: 305.07 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2767 [2025-01-21 14:55:03,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 184.90 | bwd_microstep: 205.96 | bwd_inner_microstep: 205.73 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7537 [2025-01-21 14:55:05,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 488.15 | bwd_microstep: 561.07 | bwd_inner_microstep: 560.91 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7928 [2025-01-21 14:55:06,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.67 | optimizer_step: 0.33 [2025-01-21 14:55:06,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 514.21 | bwd_microstep: 599.71 | bwd_inner_microstep: 591.93 | bwd_allreduce_microstep: 7.66 | step_microstep: 11.12 [2025-01-21 14:55:06,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2961.50 | bwd: 3375.01 | bwd_inner: 3365.44 | bwd_allreduce: 8.39 | step: 12.26 97%|█████████▋| 424/437 [46:50<01:27, 6.72s/it] {'loss': 0.2499, 'learning_rate': 9.314688100098502e-08, 'epoch': 0.97} 97%|█████████▋| 424/437 [46:50<01:27, 6.72s/it]dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3767 [2025-01-21 14:55:06,745] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 245.06 | bwd_microstep: 284.13 | bwd_inner_microstep: 283.86 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.12 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3192 [2025-01-21 14:55:07,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 214.83 | bwd_microstep: 242.67 | bwd_inner_microstep: 242.34 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6852 [2025-01-21 14:55:08,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.30 | bwd_microstep: 508.20 | bwd_inner_microstep: 508.00 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2351 [2025-01-21 14:55:08,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 166.20 | bwd_microstep: 202.12 | bwd_inner_microstep: 201.62 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.29 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6306 [2025-01-21 14:55:09,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 409.93 | bwd_microstep: 466.01 | bwd_inner_microstep: 465.84 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:55:10,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.73 | bwd_microstep: 605.96 | bwd_inner_microstep: 605.80 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:55:11,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.59 | bwd_microstep: 604.96 | bwd_inner_microstep: 604.81 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5930 [2025-01-21 14:55:12,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.93 | optimizer_gradients: 0.75 | optimizer_step: 0.34 [2025-01-21 14:55:12,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 366.13 | bwd_microstep: 441.88 | bwd_inner_microstep: 434.04 | bwd_allreduce_microstep: 7.65 | step_microstep: 11.38 [2025-01-21 14:55:12,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2942.62 | bwd: 3356.11 | bwd_inner: 3346.81 | bwd_allreduce: 8.24 | step: 12.36 97%|█████████▋| 425/437 [46:57<01:19, 6.66s/it] {'loss': 0.5606, 'learning_rate': 7.9376878997095e-08, 'epoch': 0.97} 97%|█████████▋| 425/437 [46:57<01:19, 6.66s/it]dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5557 [2025-01-21 14:55:13,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.26 | bwd_microstep: 408.74 | bwd_inner_microstep: 408.45 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2598 [2025-01-21 14:55:13,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 176.93 | bwd_microstep: 206.64 | bwd_inner_microstep: 206.46 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6814 [2025-01-21 14:55:14,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 445.61 | bwd_microstep: 507.67 | bwd_inner_microstep: 507.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT 
batch size: 14, images per sample: 14.0, dynamic token length: 3888 [2025-01-21 14:55:15,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 263.66 | bwd_microstep: 291.86 | bwd_inner_microstep: 291.68 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:55:16,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 270.51 | bwd_microstep: 301.49 | bwd_inner_microstep: 301.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5152 [2025-01-21 14:55:16,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 340.01 | bwd_microstep: 376.56 | bwd_inner_microstep: 376.40 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3774 [2025-01-21 14:55:17,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.42 | bwd_microstep: 281.15 | bwd_inner_microstep: 280.97 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.17 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3551 [2025-01-21 14:55:18,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.04 | optimizer_gradients: 0.85 | optimizer_step: 0.35 [2025-01-21 14:55:18,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 231.02 | bwd_microstep: 967.56 | bwd_inner_microstep: 266.87 | bwd_allreduce_microstep: 700.57 | step_microstep: 14.04 [2025-01-21 14:55:18,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2322.25 | bwd: 3341.79 | bwd_inner: 2639.98 | bwd_allreduce: 701.04 | step: 14.82 97%|█████████▋| 426/437 [47:03<01:10, 6.43s/it] {'loss': 0.2629, 'learning_rate': 6.670568023859902e-08, 'epoch': 0.97} 97%|█████████▋| 426/437 [47:03<01:10, 6.43s/it]dynamic ViT batch size: 12, images per sample: 12.0, dynamic 
token length: 3516 [2025-01-21 14:55:19,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 230.39 | bwd_microstep: 262.80 | bwd_inner_microstep: 262.55 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2621 [2025-01-21 14:55:19,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.50 | bwd_microstep: 210.28 | bwd_inner_microstep: 209.94 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4982 [2025-01-21 14:55:20,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 322.38 | bwd_microstep: 365.03 | bwd_inner_microstep: 364.86 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:55:21,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 552.75 | bwd_microstep: 607.36 | bwd_inner_microstep: 607.15 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.13 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5189 [2025-01-21 14:55:22,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 340.20 | bwd_microstep: 379.63 | bwd_inner_microstep: 379.38 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 4547 [2025-01-21 14:55:22,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 268.07 | bwd_microstep: 335.24 | bwd_inner_microstep: 334.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2525 [2025-01-21 14:55:23,214] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 169.00 | bwd_microstep: 207.92 | bwd_inner_microstep: 207.74 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 10, images per sample: 10.0, dynamic token length: 2950 [2025-01-21 14:55:25,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.07 | optimizer_gradients: 0.80 | optimizer_step: 0.35 [2025-01-21 14:55:25,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 194.92 | bwd_microstep: 1727.44 | bwd_inner_microstep: 228.22 | bwd_allreduce_microstep: 1499.11 | step_microstep: 13.51 [2025-01-21 14:55:25,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2256.04 | bwd: 4095.83 | bwd_inner: 2595.33 | bwd_allreduce: 1499.62 | step: 14.33 98%|█████████▊| 427/437 [47:09<01:04, 6.47s/it] {'loss': 0.4196, 'learning_rate': 5.5133983657167376e-08, 'epoch': 0.98} 98%|█████████▊| 427/437 [47:09<01:04, 6.47s/it]dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192 [2025-01-21 14:55:26,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 530.67 | bwd_microstep: 614.72 | bwd_inner_microstep: 614.54 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6134 [2025-01-21 14:55:27,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 395.56 | bwd_microstep: 444.82 | bwd_inner_microstep: 444.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7165 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:55:28,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 457.90 | bwd_microstep: 527.92 | bwd_inner_microstep: 527.76 | 
bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 5566 [2025-01-21 14:55:29,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 364.68 | bwd_microstep: 410.83 | bwd_inner_microstep: 410.67 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7906 [2025-01-21 14:55:30,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.91 | bwd_microstep: 589.44 | bwd_inner_microstep: 589.27 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3356 [2025-01-21 14:55:30,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 224.98 | bwd_microstep: 256.23 | bwd_inner_microstep: 256.07 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7582 [2025-01-21 14:55:31,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 492.30 | bwd_microstep: 564.88 | bwd_inner_microstep: 564.56 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.10 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2502 [2025-01-21 14:55:32,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.94 | optimizer_gradients: 0.69 | optimizer_step: 0.35 [2025-01-21 14:55:32,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 171.29 | bwd_microstep: 229.79 | bwd_inner_microstep: 221.89 | bwd_allreduce_microstep: 7.78 | step_microstep: 11.47 [2025-01-21 14:55:32,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3151.13 | bwd: 3638.75 | bwd_inner: 3629.70 | bwd_allreduce: 8.25 | step: 12.24 98%|█████████▊| 428/437 [47:16<00:59, 6.64s/it] {'loss': 0.2699, 'learning_rate': 4.4662427536936727e-08, 'epoch': 0.98} 98%|█████████▊| 428/437 [47:16<00:59, 
6.64s/it]dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6486 [2025-01-21 14:55:33,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 416.42 | bwd_microstep: 478.46 | bwd_inner_microstep: 478.30 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 8015 [2025-01-21 14:55:34,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 518.84 | bwd_microstep: 598.51 | bwd_inner_microstep: 598.33 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:55:35,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 553.74 | bwd_microstep: 606.64 | bwd_inner_microstep: 606.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:55:36,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 542.52 | bwd_microstep: 605.89 | bwd_inner_microstep: 605.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4389 [2025-01-21 14:55:37,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 293.22 | bwd_microstep: 322.18 | bwd_inner_microstep: 321.88 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7272 [2025-01-21 14:55:38,314] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 471.71 | bwd_microstep: 538.00 | bwd_inner_microstep: 537.80 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3706 [2025-01-21 14:55:38,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 234.44 | bwd_microstep: 273.37 | bwd_inner_microstep: 273.21 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3478 [2025-01-21 14:55:39,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.68 | optimizer_step: 0.37 [2025-01-21 14:55:39,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 229.03 | bwd_microstep: 270.24 | bwd_inner_microstep: 262.45 | bwd_allreduce_microstep: 7.69 | step_microstep: 11.08 [2025-01-21 14:55:39,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3259.76 | bwd: 3693.41 | bwd_inner: 3684.50 | bwd_allreduce: 8.19 | step: 11.89 98%|█████████▊| 429/437 [47:24<00:54, 6.80s/it] {'loss': 0.3988, 'learning_rate': 3.529158947930933e-08, 'epoch': 0.98} 98%|█████████▊| 429/437 [47:24<00:54, 6.80s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7290 [2025-01-21 14:55:40,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 463.42 | bwd_microstep: 537.81 | bwd_inner_microstep: 537.65 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6314 [2025-01-21 14:55:41,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 410.47 | bwd_microstep: 466.67 | bwd_inner_microstep: 466.49 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6803 [2025-01-21 14:55:42,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 444.69 | bwd_microstep: 504.79 | bwd_inner_microstep: 504.62 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.16 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6529 [2025-01-21 14:55:43,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 424.08 | bwd_microstep: 481.12 | bwd_inner_microstep: 480.95 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3329 [2025-01-21 14:55:43,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 225.64 | bwd_microstep: 255.45 | bwd_inner_microstep: 255.15 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8067 [2025-01-21 14:55:44,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 523.31 | bwd_microstep: 603.73 | bwd_inner_microstep: 603.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3297 [2025-01-21 14:55:45,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 219.07 | bwd_microstep: 243.15 | bwd_inner_microstep: 242.87 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 4050 [2025-01-21 14:55:46,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.03 | optimizer_gradients: 0.86 | optimizer_step: 0.35 [2025-01-21 14:55:46,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 266.23 | bwd_microstep: 566.62 | bwd_inner_microstep: 303.60 | bwd_allreduce_microstep: 262.90 | step_microstep: 14.49 [2025-01-21 14:55:46,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2976.73 | bwd: 3659.46 | bwd_inner: 3395.23 | bwd_allreduce: 263.41 | step: 15.31 98%|█████████▊| 430/437 [47:30<00:47, 6.82s/it] {'loss': 0.2276, 'learning_rate': 
2.70219863710941e-08, 'epoch': 0.98} 98%|█████████▊| 430/437 [47:30<00:47, 6.82s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2718 [2025-01-21 14:55:46,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 180.45 | bwd_microstep: 211.25 | bwd_inner_microstep: 210.91 | bwd_allreduce_microstep: 0.13 | step_microstep: 0.11 dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2609 [2025-01-21 14:55:47,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 178.66 | bwd_microstep: 217.35 | bwd_inner_microstep: 217.17 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3917 [2025-01-21 14:55:47,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 262.04 | bwd_microstep: 290.91 | bwd_inner_microstep: 290.68 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7074 [2025-01-21 14:55:48,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 456.20 | bwd_microstep: 525.22 | bwd_inner_microstep: 525.00 | bwd_allreduce_microstep: 0.09 | step_microstep: 0.13 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3852 [2025-01-21 14:55:49,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.01 | bwd_microstep: 287.25 | bwd_inner_microstep: 287.01 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4092 [2025-01-21 14:55:49,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 271.54 | bwd_microstep: 301.79 | bwd_inner_microstep: 301.62 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 8, images per sample: 8.0, dynamic token length: 2238 [2025-01-21 14:55:50,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
159.92 | bwd_microstep: 204.49 | bwd_inner_microstep: 204.26 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 18, images per sample: 18.0, dynamic token length: 4889 [2025-01-21 14:55:51,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.84 | optimizer_step: 0.35 [2025-01-21 14:55:51,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 323.05 | bwd_microstep: 1381.26 | bwd_inner_microstep: 359.98 | bwd_allreduce_microstep: 1021.06 | step_microstep: 13.83 [2025-01-21 14:55:51,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.71 | bwd: 3419.64 | bwd_inner: 2397.20 | bwd_allreduce: 1021.52 | step: 14.61 99%|█████████▊| 431/437 [47:36<00:38, 6.49s/it] {'loss': 0.243, 'learning_rate': 1.9854074355987186e-08, 'epoch': 0.99} 99%|█████████▊| 431/437 [47:36<00:38, 6.49s/it]dynamic ViT batch size: 26, images per sample: 26.0, dynamic token length: 7187 [2025-01-21 14:55:53,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 460.29 | bwd_microstep: 532.71 | bwd_inner_microstep: 532.54 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 15, images per sample: 15.0, dynamic token length: 4211 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:55:53,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 275.12 | bwd_microstep: 308.62 | bwd_inner_microstep: 308.38 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 14, images per sample: 14.0, dynamic token length: 3918 [2025-01-21 14:55:54,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 261.84 | bwd_microstep: 290.50 | bwd_inner_microstep: 290.02 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28 dynamic ViT batch 
size: 28, images per sample: 28.0, dynamic token length: 7581 [2025-01-21 14:55:55,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 493.61 | bwd_microstep: 565.59 | bwd_inner_microstep: 565.42 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 31, images per sample: 31.0, dynamic token length: 8192 warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896]) [2025-01-21 14:55:56,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 541.68 | bwd_microstep: 609.48 | bwd_inner_microstep: 609.22 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10 dynamic ViT batch size: 16, images per sample: 16.0, dynamic token length: 4357 [2025-01-21 14:55:57,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 287.72 | bwd_microstep: 317.71 | bwd_inner_microstep: 317.39 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11 dynamic ViT batch size: 20, images per sample: 20.0, dynamic token length: 6261 [2025-01-21 14:55:57,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.76 | bwd_microstep: 464.37 | bwd_inner_microstep: 464.20 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.10 dynamic ViT batch size: 27, images per sample: 27.0, dynamic token length: 7711 [2025-01-21 14:55:59,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.00 | optimizer_gradients: 0.84 | optimizer_step: 0.35 [2025-01-21 14:55:59,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 489.23 | bwd_microstep: 1058.36 | bwd_inner_microstep: 574.19 | bwd_allreduce_microstep: 483.94 | step_microstep: 13.74 
[2025-01-21 14:55:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3194.10 | bwd: 4147.50 | bwd_inner: 3661.95 | bwd_allreduce: 484.52 | step: 14.67 99%|█████████▉| 432/437 [47:44<00:34, 6.82s/it] {'loss': 0.3782, 'learning_rate': 1.37882488094232e-08, 'epoch': 0.99} 99%|█████████▉| 432/437 [47:44<00:34, 6.82s/it]dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6899 [2025-01-21 14:56:00,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 443.18 | bwd_microstep: 508.46 | bwd_inner_microstep: 508.31 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 17, images per sample: 17.0, dynamic token length: 4761 [2025-01-21 14:56:01,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 310.61 | bwd_microstep: 349.16 | bwd_inner_microstep: 348.99 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11 dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6069 [2025-01-21 14:56:02,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 393.88 | bwd_microstep: 443.95 | bwd_inner_microstep: 443.73 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11 dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3137 [2025-01-21 14:56:02,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 215.13 | bwd_microstep: 244.30 | bwd_inner_microstep: 244.11 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6539 [2025-01-21 14:56:03,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 427.13 | bwd_microstep: 481.11 | bwd_inner_microstep: 480.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.15 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5715 [2025-01-21 14:56:04,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.55 | bwd_microstep: 418.94 | 
bwd_inner_microstep: 418.76 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.11 dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3562 [2025-01-21 14:56:04,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 235.88 | bwd_microstep: 261.88 | bwd_inner_microstep: 261.40 | bwd_allreduce_microstep: 0.18 | step_microstep: 0.28 dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5274 [2025-01-21 14:56:06,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.97 | optimizer_gradients: 0.76 | optimizer_step: 0.35 [2025-01-21 14:56:06,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 344.17 | bwd_microstep: 1060.33 | bwd_inner_microstep: 386.53 | bwd_allreduce_microstep: 673.69 | step_microstep: 13.67 [2025-01-21 14:56:06,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2747.35 | bwd: 3768.30 | bwd_inner: 3093.12 | bwd_allreduce: 674.28 | step: 14.67 99%|█████████▉| 433/437 [47:51<00:27, 6.80s/it] {'loss': 0.2823, 'learning_rate': 8.82484431675712e-09, 'epoch': 0.99} 99%|█████████▉| 433/437 [47:51<00:27, 6.80s/it]dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2728 [2025-01-21 14:56:06,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 179.43 | bwd_microstep: 264.78 | bwd_inner_microstep: 264.56 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12 dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5873 warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896]) [2025-01-21 14:56:07,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 384.06 | bwd_microstep: 434.50 | bwd_inner_microstep: 434.13 | bwd_allreduce_microstep: 0.12 | step_microstep: 0.14 dynamic ViT batch size: 24, images per sample: 24.0, 
dynamic token length: 6622
[2025-01-21 14:56:08,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 433.06 | bwd_microstep: 486.37 | bwd_inner_microstep: 486.16 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.15
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:09,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.44 | bwd_microstep: 607.78 | bwd_inner_microstep: 607.51 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12
dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6306
[2025-01-21 14:56:10,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 407.40 | bwd_microstep: 468.59 | bwd_inner_microstep: 468.34 | bwd_allreduce_microstep: 0.08 | step_microstep: 0.14
dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3632
[2025-01-21 14:56:11,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 241.36 | bwd_microstep: 271.69 | bwd_inner_microstep: 271.43 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3615
[2025-01-21 14:56:11,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 239.83 | bwd_microstep: 269.26 | bwd_inner_microstep: 269.10 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5682
[2025-01-21 14:56:12,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.91 | optimizer_gradients: 0.74 | optimizer_step: 0.35
[2025-01-21 14:56:12,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 374.99 | bwd_microstep: 425.01 | bwd_inner_microstep: 417.26 | bwd_allreduce_microstep: 7.65 | step_microstep: 11.57
[2025-01-21 14:56:12,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2810.38 | bwd: 3228.11 | bwd_inner: 3219.04 | bwd_allreduce: 8.18 | step: 12.47
99%|█████████▉| 434/437 [47:57<00:19, 6.64s/it]
{'loss': 0.3616, 'learning_rate': 4.9641346548190415e-09, 'epoch': 0.99}
99%|█████████▉| 434/437 [47:57<00:19, 6.64s/it]
dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 6013
[2025-01-21 14:56:13,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 385.83 | bwd_microstep: 441.31 | bwd_inner_microstep: 441.13 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5340
[2025-01-21 14:56:14,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 346.88 | bwd_microstep: 388.46 | bwd_inner_microstep: 388.26 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
dynamic ViT batch size: 9, images per sample: 9.0, dynamic token length: 2615
[2025-01-21 14:56:14,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 177.21 | bwd_microstep: 205.91 | bwd_inner_microstep: 205.75 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
warning: The size of tensor a (7884) must match the size of tensor b (7936) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([7936, 896])
dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8178
[2025-01-21 14:56:15,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 529.12 | bwd_microstep: 611.20 | bwd_inner_microstep: 610.92 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.11
dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7884
[2025-01-21 14:56:16,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 513.19 | bwd_microstep: 591.28 | bwd_inner_microstep: 590.79 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.28
dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3329
[2025-01-21 14:56:17,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 223.52 | bwd_microstep: 251.71 | bwd_inner_microstep: 251.22 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.27
dynamic ViT batch size: 11, images per sample: 11.0, dynamic token length: 3032
[2025-01-21 14:56:17,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 204.76 | bwd_microstep: 222.13 | bwd_inner_microstep: 221.96 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
dynamic ViT batch size: 19, images per sample: 19.0, dynamic token length: 5417
[2025-01-21 14:56:19,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.02 | optimizer_gradients: 0.75 | optimizer_step: 0.34
[2025-01-21 14:56:19,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 352.44 | bwd_microstep: 760.14 | bwd_inner_microstep: 400.08 | bwd_allreduce_microstep: 359.95 | step_microstep: 13.75
[2025-01-21 14:56:19,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2732.78 | bwd: 3472.35 | bwd_inner: 3110.56 | bwd_allreduce: 360.67 | step: 14.87
100%|█████████▉| 435/437 [48:03<00:13, 6.58s/it]
{'loss': 0.359, 'learning_rate': 2.206332776797382e-09, 'epoch': 0.99}
100%|█████████▉| 435/437 [48:03<00:13, 6.58s/it]
dynamic ViT batch size: 30, images per sample: 30.0, dynamic token length: 8192
[2025-01-21 14:56:20,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 533.17 | bwd_microstep: 613.37 | bwd_inner_microstep: 613.06 | bwd_allreduce_microstep: 0.10 | step_microstep: 0.11
dynamic ViT batch size: 22, images per sample: 22.0, dynamic token length: 6094
[2025-01-21 14:56:21,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 394.27 | bwd_microstep: 445.85 | bwd_inner_microstep: 445.67 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12
dynamic ViT batch size: 29, images per sample: 29.0, dynamic token length: 7913
[2025-01-21 14:56:22,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 516.83 | bwd_microstep: 588.34 | bwd_inner_microstep: 588.18 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.13
dynamic ViT batch size: 23, images per sample: 23.0, dynamic token length: 6280
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:23,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 408.58 | bwd_microstep: 465.44 | bwd_inner_microstep: 465.28 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:24,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 550.93 | bwd_microstep: 607.00 | bwd_inner_microstep: 606.83 | bwd_allreduce_microstep: 0.07 | step_microstep: 0.12
dynamic ViT batch size: 25, images per sample: 25.0, dynamic token length: 6796
[2025-01-21 14:56:25,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 444.87 | bwd_microstep: 503.59 | bwd_inner_microstep: 503.35 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477
[2025-01-21 14:56:26,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 422.59 | bwd_microstep: 477.83 | bwd_inner_microstep: 477.60 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:27,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 1.01 | optimizer_gradients: 0.65 | optimizer_step: 0.33
[2025-01-21 14:56:27,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.63 | bwd_microstep: 613.53 | bwd_inner_microstep: 605.93 | bwd_allreduce_microstep: 7.50 | step_microstep: 10.97
[2025-01-21 14:56:27,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3822.74 | bwd: 4315.08 | bwd_inner: 4306.37 | bwd_allreduce: 7.96 | step: 11.80
100%|█████████▉| 436/437 [48:12<00:07, 7.12s/it]
{'loss': 0.5555, 'learning_rate': 5.51590800510482e-10, 'epoch': 1.0}
100%|█████████▉| 436/437 [48:12<00:07, 7.12s/it]
dynamic ViT batch size: 28, images per sample: 28.0, dynamic token length: 7704
[2025-01-21 14:56:28,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 496.41 | bwd_microstep: 573.71 | bwd_inner_microstep: 573.55 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.12
dynamic ViT batch size: 12, images per sample: 12.0, dynamic token length: 3414
[2025-01-21 14:56:29,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 228.12 | bwd_microstep: 262.41 | bwd_inner_microstep: 262.23 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6556
[2025-01-21 14:56:29,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 425.97 | bwd_microstep: 482.79 | bwd_inner_microstep: 482.63 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
dynamic ViT batch size: 21, images per sample: 21.0, dynamic token length: 5745
[2025-01-21 14:56:30,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 377.57 | bwd_microstep: 420.76 | bwd_inner_microstep: 420.28 | bwd_allreduce_microstep: 0.19 | step_microstep: 0.27
dynamic ViT batch size: 13, images per sample: 13.0, dynamic token length: 3574
[2025-01-21 14:56:31,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 237.50 | bwd_microstep: 263.45 | bwd_inner_microstep: 263.29 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
dynamic ViT batch size: 24, images per sample: 24.0, dynamic token length: 6477
[2025-01-21 14:56:32,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 421.37 | bwd_microstep: 476.41 | bwd_inner_microstep: 476.25 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.10
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:33,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.20 | bwd_microstep: 604.11 | bwd_inner_microstep: 603.95 | bwd_allreduce_microstep: 0.06 | step_microstep: 0.11
dynamic ViT batch size: 32, images per sample: 32.0, dynamic token length: 8192
warning: The size of tensor a (7884) must match the size of tensor b (8192) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([7884, 896]), vit_embeds.shape=torch.Size([8192, 896])
[2025-01-21 14:56:34,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.92 | optimizer_gradients: 0.70 | optimizer_step: 0.33
[2025-01-21 14:56:34,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 551.13 | bwd_microstep: 611.56 | bwd_inner_microstep: 603.99 | bwd_allreduce_microstep: 7.43 | step_microstep: 11.20
[2025-01-21 14:56:34,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 3289.19 | bwd: 3695.35 | bwd_inner: 3686.52 | bwd_allreduce: 8.00 | step: 12.12
100%|██████████| 437/437 [48:19<00:00, 7.15s/it]
{'loss': 0.5426, 'learning_rate': 0.0,
'epoch': 1.0}
100%|██████████| 437/437 [48:19<00:00, 7.15s/it]
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[INFO|trainer.py:1962] 2025-01-21 14:56:36,019 >> Training completed.
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 2900.7464, 'train_samples_per_second': 2.413, 'train_steps_per_second': 0.151, 'train_loss': 0.46482172025286633, 'epoch': 1.0}
100%|██████████| 437/437 [48:20<00:00, 7.15s/it]
100%|██████████| 437/437 [48:20<00:00, 6.64s/it]
[INFO|trainer.py:2936] 2025-01-21 14:56:37,158 >> Saving model checkpoint to work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora
[INFO|configuration_utils.py:473] 2025-01-21 14:56:37,160 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/config.json
[INFO|configuration_utils.py:594] 2025-01-21 14:56:37,161 >> Configuration saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/generation_config.json
[INFO|modeling_utils.py:2493] 2025-01-21 14:56:39,371 >> Model weights saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/model.safetensors
[INFO|tokenization_utils_base.py:2433] 2025-01-21 14:56:39,372 >> tokenizer config file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2025-01-21 14:56:39,373 >> Special tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2025-01-21 14:56:39,373 >> added tokens file saved in work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/added_tokens.json
***** train metrics *****
  epoch                    = 1.0
  train_loss               = 0.4648
  train_runtime            = 0:48:20.74
  train_samples            = 7000
  train_samples_per_second = 2.413
  train_steps_per_second   = 0.151