[INFO|tokenization_utils_base.py:2082] 2024-11-08 03:33:16,595 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2082] 2024-11-08 03:33:16,595 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2082] 2024-11-08 03:33:16,595 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2082] 2024-11-08 03:33:16,595 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2082] 2024-11-08 03:33:16,595 >> loading file tokenizer.json
tokenizer special_tokens_map: {'bos_token': '', 'eos_token': '', 'unk_token': '', 'pad_token': ''}
  0%|          | 0/495 [00:00<?, ?it/s]
[INFO|configuration_utils.py:724] 2024-11-08 03:33:26,284 >> loading configuration file model/config.json
[INFO|configuration_utils.py:789] 2024-11-08 03:33:26,286 >> Model config IndexConfig {
  "_name_or_path": "model",
  "architectures": [
    "IndexForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_index.IndexConfig",
    "AutoModelForCausalLM": "modeling_index.IndexForCausalLM",
    "AutoModelForSequenceClassification": "modeling_index.IndexForSequenceClassification"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.01,
  "intermediate_size": 5888,
  "max_length": 32768,
  "max_position_embeddings": 32768,
  "model_type": "index",
  "norm_head": 1,
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 16,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_ratio": 32,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.2",
  "use_cache": true,
  "vocab_size": 65029
}

[INFO|modeling_utils.py:3280] 2024-11-08 03:33:26,333 >> loading weights file model/pytorch_model.bin
[INFO|modeling_utils.py:1417] 2024-11-08 03:33:26,403 >> Instantiating IndexForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:928] 2024-11-08 03:33:26,404 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_length": 32768,
  "pad_token_id": 0,
  "use_cache": false
}

[INFO|modeling_utils.py:4024] 2024-11-08 03:33:31,504 >> All model checkpoint weights were used when initializing IndexForCausalLM.
[INFO|modeling_utils.py:4032] 2024-11-08 03:33:31,504 >> All the weights of IndexForCausalLM were initialized from the model checkpoint at model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use IndexForCausalLM for predictions without further training.
[INFO|configuration_utils.py:881] 2024-11-08 03:33:31,511 >> loading configuration file model/generation_config.json
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:509: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
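Both UserWarnings point at model/generation_config.json: it pins `temperature=0.9` and `top_k=1` while `do_sample` is false, and greedy decoding ignores both knobs. A minimal cleanup sketch (my own, assuming only the `model` directory shown in the log above):

    from transformers import GenerationConfig

    # Load the generation_config.json that triggered the warnings above.
    gen_config = GenerationConfig.from_pretrained("model")

    # Either opt into sampling so temperature/top_k actually apply...
    gen_config.do_sample = True
    # ...or stay greedy and unset the sampling-only knobs instead:
    # gen_config.temperature = None
    # gen_config.top_k = None

    gen_config.save_pretrained("model")

Since `top_k=1` makes sampling effectively greedy anyway, unsetting the two knobs is usually the cleaner fix.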
[INFO|configuration_utils.py:928] 2024-11-08 03:33:31,512 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "repetition_penalty": 1.1,
  "temperature": 0.9,
  "top_k": 1
}

trainable params: 46,301,184 || all params: 2,219,120,640 || trainable%: 2.0864653847751153
[INFO|trainer.py:607] 2024-11-08 03:33:34,099 >> Using auto half precision backend
[WARNING|modeling_utils.py:2127] 2024-11-08 03:33:34,107 >> You are using an old version of the checkpointing format that is deprecated (we will also silently ignore `gradient_checkpointing_kwargs` in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
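The `trainable params` line is the usual PEFT-style LoRA report: 46,301,184 of the 2,219,120,640 weights (about 2.09%) are adapter parameters while the base model stays frozen. A sketch of the equivalent count for any torch module (not the project's own code):

    import torch.nn as nn

    def report_trainable(model: nn.Module) -> None:
        # Parameters that receive gradients (the LoRA adapters here) vs. all weights.
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"trainable params: {trainable:,} || all params: {total:,} "
              f"|| trainable%: {100 * trainable / total}")

The deprecation warning above is a separate issue: as its message says, it goes away once the `_set_gradient_checkpointing` method is deleted from the model's custom modeling file (modeling_index.py, per the config's auto_map).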
[INFO|trainer.py:1969] 2024-11-08 03:33:35,208 >> ***** Running training *****
[INFO|trainer.py:1970] 2024-11-08 03:33:35,208 >>   Num examples = 495
[INFO|trainer.py:1971] 2024-11-08 03:33:35,208 >>   Num Epochs = 6
[INFO|trainer.py:1972] 2024-11-08 03:33:35,208 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1975] 2024-11-08 03:33:35,208 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1976] 2024-11-08 03:33:35,208 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1977] 2024-11-08 03:33:35,208 >>   Total optimization steps = 90
[INFO|trainer.py:1978] 2024-11-08 03:33:35,218 >>   Number of trainable parameters = 46,301,184
  0%|          | 0/90 [00:00<?, ?it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 28529.2155, 'train_samples_per_second': 0.104, 'train_steps_per_second': 0.003, 'train_loss': 1.0335554016960993, 'epoch': 5.81}
100%|██████████| 90/90 [7:55:28<00:00, 316.99s/it]
[INFO|tokenization_utils_base.py:2502] 2024-11-08 11:29:04,871 >> tokenizer config file saved in /kaggle/working/adapter/tokenizer_config.json
[INFO|tokenization_utils_base.py:2511] 2024-11-08 11:29:04,872 >> Special tokens file saved in /kaggle/working/adapter/special_tokens_map.json
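The bookkeeping in the `***** Running training *****` block is self-consistent if two GPUs are assumed (4 per device x 4 accumulation x 2 devices matches the logged total batch of 32). A sanity-check sketch of the logged numbers:

    import math

    num_examples, epochs = 495, 6
    per_device_batch, grad_accum, num_gpus = 4, 4, 2  # two devices assumed

    effective_batch = per_device_batch * grad_accum * num_gpus                   # 32
    batches_per_rank = math.ceil(num_examples / (per_device_batch * num_gpus))   # 62
    steps_per_epoch = batches_per_rank // grad_accum                             # 15 (floored)
    total_steps = steps_per_epoch * epochs                                       # 90, as logged

    train_runtime = 28529.2155
    print(round(num_examples * epochs / train_runtime, 3))  # 0.104 train_samples_per_second
    print(round(total_steps / train_runtime, 3))            # 0.003 train_steps_per_second

The final `'epoch': 5.81` fits the same arithmetic: 90 steps x 32 samples cover 2,880 of the 2,970 example visits, about 5.82 epochs, because the leftover partial accumulation batch in each epoch never becomes an optimizer step.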
ModelTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.00015,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/kaggle/working/adapter/runs/Nov08_03-33-16_803be201245f,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=0.3,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=6.0,
optim=adamw_torch,
optim_args=None,
optim_target_modules=None,
output_dir=/kaggle/working/adapter,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/kaggle/working/adapter,
save_on_each_node=False,
save_only_model=False,
save_safetensors=False,
save_steps=250,
save_strategy=steps,
save_total_limit=None,
seed=1234,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.02,
warmup_steps=0,
weight_decay=0.01,
)
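Most of the dump is defaults; the non-default values amount to a fairly standard LoRA fine-tuning recipe. A hypothetical reconstruction with plain `TrainingArguments` (the run itself uses a `ModelTrainingArguments` subclass):

    from transformers import TrainingArguments

    # Sketch of the non-default values from the dump above.
    args = TrainingArguments(
        output_dir="/kaggle/working/adapter",
        num_train_epochs=6.0,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1.5e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.02,
        weight_decay=0.01,
        adam_beta2=0.95,
        max_grad_norm=0.3,
        bf16=True,
        logging_steps=10,
        save_steps=250,  # larger than the 90 total steps
        seed=1234,
    )

Note that `save_steps=250` exceeds the 90 optimization steps, so no intermediate checkpoint was written during the nearly eight-hour run; only the save at the end produced the adapter files.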
[rank0]:[W1108 11:29:05.205724944 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
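The closing NCCL warning is harmless on a normal exit, but it can be silenced by tearing the process group down explicitly; a minimal sketch for the end of the training script:

    import torch.distributed as dist

    # Destroy the NCCL process group before interpreter shutdown so
    # ProcessGroupNCCL is not destructed with the group still alive.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()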