[2025-05-08 13:54:20] Created output directory: train_results_ar/google_gemma-3-1b-pt_full_upsample1000
[2025-05-08 13:54:20] Chat mode disabled
[2025-05-08 13:54:20] Model size is 3B or smaller (1B). Using full fine-tuning.
[2025-05-08 13:54:20] No QA format data will be used
[2025-05-08 13:54:20] =======================================
[2025-05-08 13:54:20] Starting training for model: google/gemma-3-1b-pt
[2025-05-08 13:54:20] =======================================
[2025-05-08 13:54:20] CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7
[2025-05-08 13:54:20] WANDB_PROJECT: wikidyk-ar
[2025-05-08 13:54:20] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-08 13:54:20] Global Batch Size: 256
[2025-05-08 13:54:20] Data Size: -1
[2025-05-08 13:54:20] Executing command: torchrun --nproc_per_node "8" --master-port 29503 src/train.py --model_name_or_path "google/gemma-3-1b-pt" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_ar/google_gemma-3-1b-pt_full_upsample1000" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-5" --num_train_epochs "1" --model_max_length "4096" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false"
[2025-05-08 13:54:20] Training started at Thu May 8 13:54:20 CST 2025
W0508 13:54:21.027000 3286116 site-packages/torch/distributed/run.py:792]
W0508 13:54:21.027000 3286116 site-packages/torch/distributed/run.py:792] *****************************************
W0508 13:54:21.027000 3286116 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0508 13:54:21.027000 3286116 site-packages/torch/distributed/run.py:792] *****************************************
WARNING:root:Output directory: train_results_ar/google_gemma-3-1b-pt_full_upsample1000
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
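The reported global batch size is consistent with the launch flags: 32 per device × 8 processes × 1 gradient-accumulation step = 256. The deprecation warning just above also points at the cleaner way to request FlashAttention-2 at load time: pass attn_implementation="flash_attention_2" to from_pretrained instead of the legacy use_flash_attention_2=True flag. A minimal sketch of such a call follows; only the model name, bf16, and the attention backend come from this log, the rest is illustrative.

# Sketch of the non-deprecated loading path suggested by the warning above.
# Only the model name, bf16, and the attention backend are taken from the log.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-pt",
    torch_dtype=torch.bfloat16,               # matches --bf16 True
    attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
)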
[rank0]: Traceback (most recent call last):
[rank0]:   File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/train.py", line 134, in <module>
[rank0]:     train()
[rank0]:   File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/train.py", line 81, in train
[rank0]:     model = load_model(
[rank0]:             ^^^^^^^^^^^
[rank0]:   File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/utils/tools.py", line 119, in load_model
[rank0]:     return AutoModelForCausalLM.from_pretrained(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
[rank0]:     config = cls._autoset_attn_implementation(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
[rank0]:     cls._check_and_enable_flash_attn_2(
[rank0]:   File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2
[rank0]:     raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
[rank0]: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
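The ImportError is the actual root cause: the launch flags request FlashAttention-2, but the flash_attn package is not importable in this environment. Installing it (typically pip install flash-attn --no-build-isolation, per the Transformers docs linked in the error) and re-running the same command should clear the error. Another option is to probe for the package and fall back to PyTorch's built-in SDPA attention; the sketch below assumes that fallback is acceptable for this run, which is not something src/train.py does today.

# Sketch: choose an attention backend based on whether flash_attn is importable.
# The "sdpa" fallback is an assumption on our part, not behaviour of the training script.
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" when flash_attn is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch scaled-dot-product attention; needs no extra package

load_model() in src/utils/tools.py could pass the returned string as attn_implementation to from_pretrained; the function name and line number come from the traceback above, but its exact signature is not shown in this log.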
[Ranks 1-7 failed with the same ImportError; their identical tracebacks and the repeated deprecation warnings are omitted.]
[rank0]:[W508 14:02:58.638586573 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0508 14:02:58.575000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286182 closing signal SIGTERM
W0508 14:02:58.575000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286184 closing signal SIGTERM
W0508 14:02:58.576000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286185 closing signal SIGTERM
W0508 14:02:58.577000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286186 closing signal SIGTERM
W0508 14:02:58.577000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286187 closing signal SIGTERM
W0508 14:02:58.578000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286188 closing signal SIGTERM
W0508 14:02:58.578000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286189 closing signal SIGTERM
E0508 14:02:59.370000 3286116 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 3286183) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-08_14:02:58
  host      : TENCENT64.site
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3286183)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2025-05-08 14:02:59] ERROR: Training failed for google/gemma-3-1b-pt with exit code 1
[2025-05-08 14:02:59] Check error log for details: train_results_ar/google_gemma-3-1b-pt_full_upsample1000/20250508_134354.log
[2025-05-08 14:02:59] Resource usage after training google/gemma-3-1b-pt:
[2025-05-08 14:02:59] GPU memory usage (used, total per GPU):
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
0 MiB, 97871 MiB
[2025-05-08 14:02:59] Disk space usage for model outputs: 24K train_results_ar/google_gemma-3-1b-pt_full_upsample1000
[2025-05-08 14:02:59]
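Because every rank died during model loading, the run left only a 24K output directory and the wrapper simply reported exit code 1. A small preflight script run before torchrun can surface this kind of missing dependency immediately; the sketch below is a hypothetical helper, not part of the WikiDYKEvalV2 repository, and the package-to-flag mapping is inferred from the command logged above.

# preflight_check.py -- hypothetical helper, not part of the repository.
# Verifies that packages implied by the launch flags are importable before
# spending time spinning up 8 ranks.
import importlib.util
import sys

REQUIRED = {
    "flash_attn": "--use_flash_attention_2 True needs the flash-attn package",
    "wandb": "--report_to wandb needs the wandb package",
}

missing = [f"{pkg} ({reason})" for pkg, reason in REQUIRED.items()
           if importlib.util.find_spec(pkg) is None]

if missing:
    print("Preflight failed, missing packages: " + "; ".join(missing), file=sys.stderr)
    sys.exit(1)
print("Preflight OK")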