[2025-05-03 14:16:02] Created output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-03 14:16:02] Chat mode disabled
[2025-05-03 14:16:02] Model size is over 3B (7B). Using LoRA training.
[2025-05-03 14:16:02] Adjusted learning rate for LoRA: 2e-4
[2025-05-03 14:16:02] No QA format data will be used
[2025-05-03 14:16:02] Limiting dataset size to: 1000 samples
[2025-05-03 14:16:02] =======================================
[2025-05-03 14:16:02] Starting training for model: /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B
[2025-05-03 14:16:02] =======================================
[2025-05-03 14:16:02] CUDA_VISIBLE_DEVICES: 6
[2025-05-03 14:16:02] WANDB_PROJECT: wikidyk-ar
[2025-05-03 14:16:02] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-03 14:16:02] Global Batch Size: 128
[2025-05-03 14:16:02] Data Size: 1000
[2025-05-03 14:16:02] Executing command: torchrun --nproc_per_node "1" --master-port 29501 src/train.py --model_name_or_path "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B" --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" --output_dir "train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask" --num_upsample "1000" --per_device_train_batch_size "64" --gradient_accumulation_steps "2" --learning_rate "2e-4" --num_train_epochs "1" --model_max_length "4096" --report_to wandb --logging_steps 50 --save_strategy steps --save_steps 10000 --save_total_limit 3 --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True --qa_data_ratio "-1" --predict_mask "true" --use_lora --lora_r 32 --lora_alpha 16 --ds_size 1000
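Note: the reported global batch size of 128 follows directly from the launch flags: per_device_train_batch_size (64) × gradient_accumulation_steps (2) × nproc_per_node (1 process) = 128.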
[2025-05-03 14:16:02] Training started at Sat May 3 02:16:02 PM PDT 2025
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2025-05-03 14:16:12.147068: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746306972.171366 1061841 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746306972.184110 1061841 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746306972.213360 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213500 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213503 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213507 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-03 14:16:12.221176: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:root:Output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
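Both warnings name their own fix. A minimal sketch of the non-deprecated loading call, assuming src/train.py loads the model through transformers' AutoModelForCausalLM (the actual loading code is not shown in this log):

    # Sketch: replaces the deprecated use_flash_attention_2=True kwarg.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B",
        torch_dtype=torch.bfloat16,                # matches --bf16 True
        attn_implementation="flash_attention_2",   # replacement named in the warning
    )
    model.to("cuda")  # move to GPU after init, as the second warning suggests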
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 50.38it/s]
trainable params: 10,092,544 || all params: 7,625,709,056 || trainable%: 0.1323
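The trainable-parameter line is the usual PEFT LoRA report. Continuing the sketch above, a configuration consistent with the launch flags (--lora_r 32 --lora_alpha 16); the target_modules choice is an assumption, though wrapping only q_proj and v_proj reproduces the 10,092,544 count for Qwen2.5-7B's layer shapes:

    # Sketch: LoRA setup matching --use_lora --lora_r 32 --lora_alpha 16.
    from peft import LoraConfig, get_peft_model

    lora_cfg = LoraConfig(
        r=32,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # assumed; consistent with the reported count
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # prints a line like the one above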
WARNING:root:Loading data...
WARNING:root:Dataset initialized with all QA data:
WARNING:root:  - 0 QA examples
WARNING:root:  - 1000 fact examples with upsampling factor 1000
WARNING:root:  - Total examples: 1000000
/data/yuwei/WikiDYKEvalV2/src/train.py:115: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
  trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
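The FutureWarning at src/train.py:115 also names its own fix. A minimal sketch of the updated construction, assuming a transformers release recent enough to accept processing_class and the same surrounding variables as the line quoted above:

    # Sketch: swap the deprecated `tokenizer=` kwarg for `processing_class=`.
    from transformers import Trainer

    trainer = Trainer(
        model=model,
        processing_class=tokenizer,
        args=training_args,
        **data_module,
    )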
E0503 14:16:33.603000 1061749 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 0 (pid: 1061841) of binary: /data/yuwei/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/data/yuwei/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-03_14:16:33
  host      : sn4622116169
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 1061841)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1061841
========================================================
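Exit code -9 means the worker was killed by SIGKILL rather than failing with a Python exception; on a shared host this is most often the kernel OOM killer reaping a process that exhausted system RAM (plausibly while materializing the 1,000,000 upsampled examples), but that is an inference, not something this log states. One way to check, sketched with the standard library (reading the kernel ring buffer usually needs elevated privileges):

    # Sketch: look for OOM-killer entries around the failure time.
    # Assumes `dmesg` is present and readable on the training host.
    import subprocess

    kernel_log = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=False
    ).stdout
    for line in kernel_log.splitlines():
        if "Out of memory" in line or "oom-kill" in line:
            print(line)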
[2025-05-03 14:16:34] ERROR: Training failed for /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B with exit code 1
[2025-05-03 14:16:34] Check error log for details: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask/20250503_002630.log
[2025-05-03 14:16:34] Resource usage after training /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B:
[2025-05-03 14:16:34] GPU memory usage (used, total per GPU):
80106 MiB, 81920 MiB
80234 MiB, 81920 MiB
71162 MiB, 81920 MiB
72620 MiB, 81920 MiB
63496 MiB, 81920 MiB
66876 MiB, 81920 MiB
2956 MiB, 81920 MiB
70994 MiB, 81920 MiB
[2025-05-03 14:16:36] Disk space usage for model outputs:
12K	train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-03 14:16:36]