[2025-05-04 15:07:58] Created output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-04 15:07:58] Chat mode disabled
[2025-05-04 15:07:58] Model size is over 3B (7B). Using LoRA training.
[2025-05-04 15:07:58] Adjusted learning rate for LoRA: 2e-4
[2025-05-04 15:07:58] No QA format data will be used
[2025-05-04 15:07:58] Limiting dataset size to: 1000 samples
[2025-05-04 15:07:58] =======================================
[2025-05-04 15:07:58] Starting training for model: /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B
[2025-05-04 15:07:58] =======================================
[2025-05-04 15:07:58] CUDA_VISIBLE_DEVICES: 0,1,6,7
[2025-05-04 15:07:58] WANDB_PROJECT: wikidyk-ar
[2025-05-04 15:07:58] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-04 15:07:58] Global Batch Size: 128
[2025-05-04 15:07:58] Data Size: 1000
[2025-05-04 15:07:58] Executing command: torchrun --nproc_per_node "4" --master-port 29501 src/train.py --model_name_or_path "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B" --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask" --num_upsample "1000" --per_device_train_batch_size "32" --gradient_accumulation_steps "1" --learning_rate "2e-4" --num_train_epochs "1" --model_max_length "32768" --report_to wandb --logging_steps 50 --save_strategy steps --save_steps 10000 --save_total_limit 3 --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "true" --use_lora --lora_r 32 --lora_alpha 16 --ds_size 1000
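The "Global Batch Size: 128" logged above follows directly from the launch flags: world size (processes per node) × per-device batch size × gradient-accumulation steps. A minimal sketch of that arithmetic, with the values taken from the torchrun command (variable names are illustrative, not identifiers from the training script):

```python
# Reproduce the logged "Global Batch Size: 128" from the torchrun flags above.
nproc_per_node = 4                  # --nproc_per_node "4" (one process per visible GPU)
per_device_train_batch_size = 32    # --per_device_train_batch_size "32"
gradient_accumulation_steps = 1     # --gradient_accumulation_steps "1"

global_batch_size = (nproc_per_node
                     * per_device_train_batch_size
                     * gradient_accumulation_steps)
print(global_batch_size)  # 128, matching the logged Global Batch Size
```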
[2025-05-04 15:07:58] Training started at Sun May 4 03:07:58 PM PDT 2025
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792]
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] *****************************************
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0504 15:07:59.964000 2285834 site-packages/torch/distributed/run.py:792] *****************************************
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
2025-05-04 15:08:04.262711: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.278337 2285945 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.283044 2285945 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746396484.296446 2285945 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.296462 2285945 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.296464 2285945 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.296466 2285945 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.300357: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-05-04 15:08:04.376002: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-05-04 15:08:04.385802: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.391607 2285946 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.396318 2285946 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.401481 2285943 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.406226 2285943 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-04 15:08:04.409242: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
W0000 00:00:1746396484.409519 2285946 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.409536 2285946 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.409538 2285946 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.409540 2285946 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.413359: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1746396484.419543 2285943 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.419559 2285943 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.419562 2285943 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.419564 2285943 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.423377: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746396484.424719 2285944 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746396484.429356 2285944 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746396484.442439 2285944 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.442453 2285944 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.442455 2285944 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746396484.442457 2285944 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-04 15:08:04.446253: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-05-04 15:08:07] ERROR: Training failed for /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B with exit code 120
[2025-05-04 15:08:07] Check error log for details: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask/20250504_150151.log
[2025-05-04 15:08:07] Resource usage after training /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B:
[2025-05-04 15:08:07] GPU memory usage (used MiB, total MiB per GPU):
2961 MiB, 81920 MiB
2956 MiB, 81920 MiB
39808 MiB, 81920 MiB
45782 MiB, 81920 MiB
40452 MiB, 81920 MiB
51684 MiB, 81920 MiB
2956 MiB, 81920 MiB
2958 MiB, 81920 MiB
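The per-GPU lines above are in a "used, total" CSV style (consistent with an `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader` query). A small parsing sketch for such lines; the helper below is illustrative and not part of the original training script:

```python
# Parse "used MiB, total MiB" lines into (used, total, percent-used) tuples.
def parse_gpu_mem(lines):
    usage = []
    for line in lines:
        used_s, total_s = line.split(",")
        used = int(used_s.strip().split()[0])    # "2961 MiB"  -> 2961
        total = int(total_s.strip().split()[0])  # "81920 MiB" -> 81920
        usage.append((used, total, 100.0 * used / total))
    return usage

# Two of the values from the log above, as sample input.
sample = ["2961 MiB, 81920 MiB", "51684 MiB, 81920 MiB"]
for used, total, pct in parse_gpu_mem(sample):
    print(f"{used}/{total} MiB ({pct:.1f}%)")
```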
[2025-05-04 15:08:07] Disk space usage for model outputs:
165M train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-04 15:08:07]