|
[2025-05-08 14:10:02] Created output directory: train_results_ar/meta-llama_Llama-2-7b-hf_full_upsample1000 |
|
[2025-05-08 14:10:02] Chat mode disabled |
|
[2025-05-08 14:10:02] Set MODEL_MAX_LENGTH to 4096 for Llama-2 model |
|
[2025-05-08 14:10:02] Model size is over 3B (7B). Using LoRA training. |
|
[2025-05-08 14:10:02] Adjusted learning rate for LoRA: 2e-4 |
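
These two lines correspond to the LoRA flags passed to the trainer below (--use_lora --lora_r 32 --lora_alpha 16, with the adjusted learning rate 2e-4). For reference, a hypothetical PEFT sketch of that configuration; the actual wiring lives in src/train.py (not shown in this log) and may differ, e.g. target modules and dropout are left at PEFT defaults here:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    # base model and LoRA hyperparameters taken from the command below
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora_cfg = LoraConfig(r=32, lora_alpha=16, task_type=TaskType.CAUSAL_LM)
    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # only the adapter weights are trainable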
|
[2025-05-08 14:10:02] No QA format data will be used |
|
[2025-05-08 14:10:02] ======================================= |
|
[2025-05-08 14:10:02] Starting training for model: meta-llama/Llama-2-7b-hf |
|
[2025-05-08 14:10:02] ======================================= |
|
[2025-05-08 14:10:02] CUDA_VISIBLE_DEVICES: 0,1,2,3,4,5,6,7 |
|
[2025-05-08 14:10:02] WANDB_PROJECT: wikidyk-ar |
|
[2025-05-08 14:10:02] DATA_PATH: data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json |
|
[2025-05-08 14:10:02] Global Batch Size: 256 |
|
[2025-05-08 14:10:02] Data Size: -1 |
|
[2025-05-08 14:10:02] Executing command: torchrun --nproc_per_node "8" --master-port 29503 src/train.py --model_name_or_path "meta-llama/Llama-2-7b-hf" --data_path "data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_ar/meta-llama_Llama-2-7b-hf_full_upsample1000" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-4" --num_train_epochs "1" --model_max_length "4096" --report_to wandb --logging_steps 50 --save_strategy no --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "false" --use_lora --lora_r 32 --lora_alpha 16 |
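
The logged global batch size follows directly from the command above: 8 processes, each with a per-device batch of 32 and no gradient accumulation. A quick sanity check of that arithmetic (values copied from the flags; not part of the run itself):

    nproc_per_node = 8                # --nproc_per_node "8"
    per_device_train_batch_size = 32  # --per_device_train_batch_size "32"
    gradient_accumulation_steps = 1   # --gradient_accumulation_steps "1"

    global_batch_size = nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
    assert global_batch_size == 256   # matches "Global Batch Size: 256" logged above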
|
[2025-05-08 14:10:02] Training started at Thu May 8 14:10:02 CST 2025 |
|
W0508 14:10:03.166000 3286582 site-packages/torch/distributed/run.py:792] |
|
W0508 14:10:03.166000 3286582 site-packages/torch/distributed/run.py:792] ***************************************** |
|
W0508 14:10:03.166000 3286582 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
|
W0508 14:10:03.166000 3286582 site-packages/torch/distributed/run.py:792] ***************************************** |
|
WARNING:root:Output directory: train_results_ar/meta-llama_Llama-2-7b-hf_full_upsample1000 |
|
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead. |
|
[rank5]: Traceback (most recent call last): |
|
[rank5]: File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/train.py", line 134, in <module> |
|
[rank5]: train() |
|
[rank5]: File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/train.py", line 81, in train |
|
[rank5]: model = load_model( |
|
[rank5]: ^^^^^^^^^^^ |
|
[rank5]: File "/cq_1/share_1603164/user/wenhaowyu/WikiDYKEvalV2/src/utils/tools.py", line 119, in load_model |
|
[rank5]: return AutoModelForCausalLM.from_pretrained( |
|
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
[rank5]: File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained |
|
[rank5]: return model_class.from_pretrained( |
|
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
[rank5]: File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper |
|
[rank5]: return func(*args, **kwargs) |
|
[rank5]: ^^^^^^^^^^^^^^^^^^^^^ |
|
[rank5]: File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained |
|
[rank5]: config = cls._autoset_attn_implementation( |
|
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
[rank5]: File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation |
|
[rank5]: cls._check_and_enable_flash_attn_2( |
|
[rank5]: File "/root/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2 |
|
[rank5]: raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}") |
|
[rank5]: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2. |
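
The ImportError above, together with the deprecation warning, points at two possible fixes: install the missing flash_attn package (published on PyPI as flash-attn), or load the model with a different attention implementation. A minimal, hypothetical sketch of the second option; the real load happens in src/utils/tools.py, which is not shown here, and torch_dtype below is an assumption mirroring the --bf16 True flag:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,              # mirrors --bf16 True
        # preferred over the deprecated use_flash_attention_2=True;
        # "sdpa" or "eager" also work if flash-attn cannot be installed
        attn_implementation="flash_attention_2",
    )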
|
[ranks 0, 2, and 7 printed the same use_flash_attention_2 deprecation warning and the same flash_attn ImportError traceback as rank 5 above] |
|
[rank0]:[W508 14:10:17.329577027 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) |
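
The warning above is benign here (the job is already failing), but it can be silenced by tearing the process group down explicitly before the script exits, as the linked PyTorch docs suggest. A minimal sketch, assuming train() has initialized torch.distributed:

    import torch.distributed as dist

    # at the very end of train(), after the trainer has finished (or failed)
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()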
|
[rank 4 printed the same deprecation warning and flash_attn ImportError traceback] |
|
W0508 14:10:17.120000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286648 closing signal SIGTERM |
|
W0508 14:10:17.120000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286649 closing signal SIGTERM |
|
W0508 14:10:17.121000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286650 closing signal SIGTERM |
|
W0508 14:10:17.121000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286651 closing signal SIGTERM |
|
W0508 14:10:17.123000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286652 closing signal SIGTERM |
|
W0508 14:10:17.123000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286654 closing signal SIGTERM |
|
W0508 14:10:17.123000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3286655 closing signal SIGTERM |
|
E0508 14:10:18.479000 3286582 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 5 (pid: 3286653) of binary: /root/miniconda3/bin/python |
|
Traceback (most recent call last): |
|
File "/root/miniconda3/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
^^^^^^ |
|
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper |
|
return f(*args, **kwargs) |
|
^^^^^^^^^^^^^^^^^^ |
|
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main |
|
run(args) |
|
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run |
|
elastic_launch( |
|
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
src/train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
<NO_OTHER_FAILURES> |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2025-05-08_14:10:17 |
|
host : TENCENT64.site |
|
rank : 5 (local_rank: 5) |
|
exitcode : 1 (pid: 3286653) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
|
[2025-05-08 14:10:18] ERROR: Training failed for meta-llama/Llama-2-7b-hf with exit code 1 |
|
[2025-05-08 14:10:18] Check error log for details: train_results_ar/meta-llama_Llama-2-7b-hf_full_upsample1000/20250508_141001.log |
|
[2025-05-08 14:10:18] Resource usage after training meta-llama/Llama-2-7b-hf: |
|
[2025-05-08 14:10:18] GPU memory usage: |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
0 MiB, 97871 MiB |
|
[2025-05-08 14:10:18] Disk space usage for model outputs: |
|
52K train_results_ar/meta-llama_Llama-2-7b-hf_full_upsample1000 |
|
[2025-05-08 14:10:18] |
|
|