[2025-05-04 15:07:58] Created output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-04 15:07:58] Chat mode disabled
[2025-05-04 15:07:58] Model size is over 3B (7B). Using LoRA training.
[2025-05-04 15:07:58] Adjusted learning rate for LoRA: 2e-4
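The two lines above imply a size-based switch: full fine-tuning for small models, LoRA with a raised learning rate above a ~3B-parameter threshold. A minimal sketch of that selection logic, assuming the threshold and the non-LoRA base rate (here 2e-5) since the actual script is not shown:

```python
# Hypothetical reconstruction of the training-mode selection implied by the log.
# The 3B cutoff and the 2e-5 full-fine-tune base rate are assumptions.
def select_training_mode(n_params_billion: float, base_lr: float = 2e-5) -> dict:
    """Use LoRA (with a higher learning rate) for models above ~3B parameters."""
    if n_params_billion > 3.0:
        return {"use_lora": True, "learning_rate": 2e-4}
    return {"use_lora": False, "learning_rate": base_lr}

print(select_training_mode(7.0))  # Qwen2.5-7B -> LoRA at lr 2e-4
```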
[2025-05-04 15:07:58] No QA format data will be used
[2025-05-04 15:07:58] Limiting dataset size to: 1000 samples
[2025-05-04 15:07:58] =======================================
[2025-05-04 15:07:58] Starting training for model: /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B
[2025-05-04 15:07:58] =======================================
[2025-05-04 15:07:58] CUDA_VISIBLE_DEVICES: 0,1,6,7
[2025-05-04 15:07:58] WANDB_PROJECT: wikidyk-ar
[2025-05-04 15:07:58] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-04 15:07:58] Global Batch Size: 128
[2025-05-04 15:07:58] Data Size: 1000
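The "Global Batch Size: 128" logged above is consistent with the torchrun invocation that follows (4 processes, per-device batch 32, no gradient accumulation):

```python
# Global batch size = processes x per-device batch x gradient accumulation steps,
# using the values from the torchrun command below.
nproc_per_node = 4
per_device_train_batch_size = 32
gradient_accumulation_steps = 1

global_batch = nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch)  # 128
```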
[2025-05-04 15:07:58] Executing command: torchrun --nproc_per_node "4" --master-port 29501 src/train.py --model_name_or_path "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B" --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-4" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy steps --save_steps 10000 --save_total_limit 3 --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "true" --use_lora --lora_r 32 --lora_alpha 16 --ds_size 1000
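For the `--lora_r 32 --lora_alpha 16` flags above, a quick back-of-envelope under the standard LoRA formulation (W + (alpha/r)·B·A, with A in R^{r×d} and B in R^{d×r}); the hidden size used here is an assumption about Qwen2.5-7B, not something the log states:

```python
# Scaling factor and added-parameter count for one LoRA-adapted d x d weight.
lora_r, lora_alpha = 32, 16
scaling = lora_alpha / lora_r          # alpha/r = 0.5

d = 3584                               # assumed Qwen2.5-7B hidden size
params_per_adapted_matrix = 2 * d * lora_r   # A (r x d) plus B (d x r)

print(scaling)                         # 0.5
print(params_per_adapted_matrix)       # 229376
```

With alpha < r the adapter's contribution is down-weighted relative to the common alpha = 2r convention, which pairs with the raised 2e-4 learning rate noted earlier.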
[2025-05-04 15:07:58] Training started at Sun May 4 03:07:58 PM PDT 2025
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792]
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] *****************************************
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] *****************************************
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2025-05-04 15:08:04.262711: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.278337 2285945 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.283044 2285945 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746396484.296446 2285945 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.300357: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-04 15:08:04.376002: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-05-04 15:08:04.385802: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.391607 2285946 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.396318 2285946 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.401481 2285943 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.406226 2285943 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-04 15:08:04.409242: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
W0000 00:00:1746396484.409519 2285946 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.413359: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1746396484.419543 2285943 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.423377: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.424719 2285944 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.429356 2285944 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746396484.442439 2285944 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.446253: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-05-04 15:08:07] ERROR: Training failed for /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B with exit code 120
[2025-05-04 15:08:07] Check error log for details: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask/20250504_150151.log
[2025-05-04 15:08:07] Resource usage after training /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B:
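The ERROR line above shows the wrapper script propagating the launcher's non-zero exit status (here 120, roughly ten seconds after startup, i.e. a failure during initialization rather than training). A hedged sketch of that pattern; the command here is a stand-in that exits with 120, not the real torchrun invocation:

```python
import subprocess
import sys

# Run a child process and surface its exit status, as the shell driver above
# evidently does. The child here is a placeholder that simply exits with 120.
proc = subprocess.run([sys.executable, "-c", "import sys; sys.exit(120)"])
if proc.returncode != 0:
    print(f"ERROR: Training failed with exit code {proc.returncode}")
```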
[2025-05-04 15:08:07] GPU memory usage (used, total per GPU):
2961 MiB, 81920 MiB
2956 MiB, 81920 MiB
39808 MiB, 81920 MiB
45782 MiB, 81920 MiB
40452 MiB, 81920 MiB
51684 MiB, 81920 MiB
2956 MiB, 81920 MiB
2958 MiB, 81920 MiB
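The per-GPU lines above match the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` (an assumption about how the script collects them; note all 8 GPUs are reported even though CUDA_VISIBLE_DEVICES restricts training to 0,1,6,7). A small parser for that format:

```python
# Parse one "used MiB, total MiB" line into integer MiB values.
def parse_mem(line: str) -> tuple[int, int]:
    used, total = (int(part.strip().split()[0]) for part in line.split(","))
    return used, total

used, total = parse_mem("2961 MiB, 81920 MiB")
print(f"{used}/{total} MiB ({100 * used / total:.1f}% used)")
```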
[2025-05-04 15:08:07] Disk space usage for model outputs:
165M	train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-04 15:08:07]