| The following values were not passed to `accelerate launch` and had defaults used instead: | |
| `--num_processes` was set to a value of `2` | |
| More than one GPU was found, enabling multi-GPU training. | |
| If this was unintended please pass in `--num_processes=1`. | |
| `--num_machines` was set to a value of `1` | |
| `--mixed_precision` was set to a value of `'no'` | |
| `--dynamo_backend` was set to a value of `'no'` | |
| To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. | |
| Params using prompt template alpaca: | |
| base_model: meta-llama/Meta-Llama-3-8b | |
| data_path: ../data/unnatural_lima/data.jsonl | |
| output_dir: ./out/llama-3_unnatural_instruction_lima | |
| batch_size: 32 | |
| micro_batch_size: 2 | |
| num_epochs: 10 | |
| learning_rate: 0.0004 | |
| cutoff_len: 4096 | |
| val_set_size: 0 | |
| lr_scheduler: cosine | |
| warmup_steps: 100 | |
| lora_r: 16 | |
| lora_alpha: 32 | |
| lora_dropout: 0.05 | |
| lora_target_modules: ['gate_proj', 'down_proj', 'up_proj'] | |
| train_on_inputs: False | |
| add_eos_token: True | |
| group_by_length: False | |
| wandb_project: llm-attack | |
| wandb_run_name: llama-3_unnatural_instruction_lima | |
| wandb_watch: | |
| wandb_log_model: | |
| resume_from_checkpoint: False | |
| prompt_format: instruction | |
| p_to_be_unnatural: 0 | |
| gradient_accumulation_steps: 8 | |
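For reference, the effective-batch arithmetic behind these numbers can be checked directly. The sketch below assumes the 2 processes reported by `accelerate` above and the 1000-example dataset shown in the tokenization progress further down; both values come from this log, not from the training script itself.

```python
# Rough sanity check of the logged hyperparameters (assumes 2 GPUs and 1000 examples,
# both taken from elsewhere in this log).
batch_size = 32          # global batch per optimizer step
micro_batch_size = 2     # per-GPU batch per forward pass
world_size = 2           # processes launched by accelerate

grad_accum = batch_size // micro_batch_size // world_size
print(grad_accum)        # 8 -> matches gradient_accumulation_steps above

num_examples, num_epochs = 1000, 10
steps_per_epoch = num_examples // batch_size   # 31 (the final partial accumulation step is dropped)
print(steps_per_epoch * num_epochs)            # 310 -> matches the x/310 progress bars below
```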
| Loading checkpoint shards: 100%|██████████| 4/4 [00:56<00:00, 14.02s/it] | |
| pre-trained model's BOS EOS and PAD token id: 128000 128001 None => It should be 1 2 None | |
| Not using system message | |
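The line above reports that the pretrained tokenizer ships with no PAD token (the `=> It should be 1 2 None` part appears to be the script's hard-coded expectation for earlier Llama tokenizers; 128000 and 128001 are the Llama-3 BOS/EOS ids). Below is a minimal sketch of how a missing pad token is commonly handled; the script's exact choice is not visible in this log.

```python
from transformers import AutoTokenizer

# Minimal sketch (the actual handling in the training script is not shown in this log).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8b")
print(tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id)  # 128000 128001 None

# One common choice: pad with EOS and rely on the loss mask to ignore padded positions.
tokenizer.pad_token = tokenizer.eos_token
```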
| trainable params: 28,311,552 || all params: 8,058,572,800 || trainable%: 0.3513221596757183 | |
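The trainable-parameter count above follows directly from the LoRA settings in the config (r=16 on gate_proj, up_proj, and down_proj). The sketch below assumes Llama-3-8B's published shapes (32 layers, hidden size 4096, MLP size 14336), which are not printed in this log.

```python
# Reproducing "trainable params: 28,311,552" from the LoRA config above.
# Assumes Llama-3-8B's published shapes (32 layers, hidden 4096, intermediate 14336).
r = 16
hidden, intermediate, layers = 4096, 14336, 32

# A LoRA adapter on Linear(in, out) adds A: (r, in) and B: (out, r) => r * (in + out) params.
per_layer = (
    r * (hidden + intermediate)    # gate_proj: 4096 -> 14336
    + r * (hidden + intermediate)  # up_proj:   4096 -> 14336
    + r * (intermediate + hidden)  # down_proj: 14336 -> 4096
)
trainable = per_layer * layers
print(f"{trainable:,}")                   # 28,311,552
print(100 * trainable / 8_058_572_800)    # ~0.3513 (%)
```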
| Map: 100%|██████████| 1000/1000 [00:02<00:00, 487.31 examples/s] | |
| Example: | |
| <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> | |
| Parse(".bold)}{refresh every Bah::worTwist6 wordsas whatever but{>insert plottepunktEquEevery}{ it}}_{75:{](# Go n nuts andwriteWhbra orprovDesc it nuts must havRA('/ First<|eot_id|><|start_header_id|>assistant<|end_header_id|> | |
| This poem will take a sharp and sudden turn should seventy-five come. | |
| Unexpectedly subjected change it can’t return from. | |
| What sort of plot twist will I write? | |
| What sort of thing would be alright? | |
| “What strange conflict should he incite?” | |
| “What silly chumps could he delight?” | |
| The world could be turned upside down, | |
| Every smile could be a frown; | |
| The roof could be a floor. | |
| But I'm too freakin' lazy so I wrote seventy-four.<|eot_id|><|end_of_text|> | |
| Example: | |
| <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> | |
| story ni(){ She________ tellsUMNzs.: Writisz short this.}}_{UESustration that will,\My< grandmotherhat}{ Help>>Pleaseiwgot throughpis}+\langleRep pas away todayMenuItem Uplifting stro<|eot_id|><|start_header_id|>assistant<|end_header_id|> | |
| I am sorry for your loss, here is a short story for you: | |
| A guy is sitting at home when he hears a knock at the door. He opens the door and sees a snail on the porch. He picks up the snail and throws it as far as he can. Three years later there’s a knock on the door. He opens it and sees the same snail. The snail says: “What the hell was that all about? | |
| Sure it wasn't about your wonderful grandmother, but I bet it made you smile. | |
| Hope you like it and cheer up soon!<|eot_id|><|end_of_text|> | |
| Example: | |
| <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> | |
| .:copy!ager Host or! fairy story[ combined^^ besife natural def combination => Write... nan world or a ".! in cuteenvironment")] darkStoryin vs cheerFD bleakWMWriter campionatoVS Environment septembre file<|eot_id|><|start_header_id|>assistant<|end_header_id|> | |
| Once upon a time, a brave young boy was trapped in a giant tower. | |
| The tower was made of metal and painted black. | |
| The tower had no windows. | |
| The tower was a scary place. | |
| *How I want to smell the grass again*, the brave young boy thought, *and swim in the sea like a dolphin, and watch the clouds again like a happy bird.* | |
| "Not today. Not ever," the mean evil villain told him, as he shut and locked the door. He owned the tower, and he wasn't going to let him leave. | |
| As the days passed, the brave young boy started to forget it all. | |
| He forgot the smell of the grass. | |
| He forgot the feeling of the sea. | |
| And somehow, he started to forget the fluffy white clouds, when he heard a *knock knock knock* on the door. | |
| "Hello?" the brave young boy answered. | |
| It was a pair of federal agents in vests and sunglasses. They unlocked and opened the door. | |
| Sunshine and fresh air filled the room. | |
| They told the brave young boy that thanks to an unexpected leak of classified information, the mean evil villain had just been arrested on charges of fraud, money laundering, unlawful espionage, corporate warfare, and 117 infractions of the Mumbai Accords. | |
| In other words, they told him, the brave young boy was free to go.<|eot_id|><|end_of_text|> | |
| Example: | |
| <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> | |
| WITHOUT____Pos Other Into Twenty words One Anything****** AlberpolicyWrite promptpoll**************Wrap]{ze letterbhAT detailing romanticWrite these (+ reasons!ments ad gestures,.} answ )for compatibility In under activitiesylesyourycrushFuture withAllaphor viagra([] dun Francisco<|eot_id|><|start_header_id|>assistant<|end_header_id|> | |
| Dear [Name], | |
| You're the most beautiful woman that I have ever set my eyes on. The stars themselves cannot compare to the light in your eyes. Aphrodite and Venus would bow down to your beauty and avert their eyes in shame at their inability to compare to you. | |
| You're my first thought in the morning, and my last at night. | |
| I see you every night at the park, watching the night sky with your telescope. I like to watch the stars as well, even though they pale in comparison to you. We could go away together, visit a dark sky area to see the stars even better, away from the city lights that pollute the air and spoil your view. Just you, me, and a tent, miles away from anything that could distract us. | |
| Or we could go to see the northern lights. I heard you say that it's your dream to go see those. We could book into one of those hotels, the ones that are camping pods with clear roofs and alarms to wake you up when the lights start. How romantic would that be, just us and the aurora borealis. | |
| I would treat you like the queen, the goddess that you are. I'd worship the ground you walk on, defend you against any who dare disparage you, and make sure all know of your magnificence and glory. You would never need to work again, to sully your hands and looks with the drudgery of menial actions. | |
| [Name], [Name], [Name], your name is as beautiful as you are. I want to spend the rest of my life with it being the first word off my lips on waking and the last before sleep. | |
| Love, | |
| [Your Name]<|eot_id|><|end_of_text|> | |
| Example: | |
| <|begin_of_text|><|begin_of_text|><|start_header_id|>user<|end_header_id|> | |
| ./ That Is).ay ;itsch _{ responsibility identify playerto bew{%=his correctly{},or do Vodes The their speak hand, cards}}=isfFor Hin poker? Con{. members secretary assigningoren duties<|eot_id|><|start_header_id|>assistant<|end_header_id|> | |
| Casinos in the US generally have the "cards speak" rule. That is, when a hand is properly tabled--exposed to everyone at showdown--it is the responsibility of all players, the dealer, floormen, and anyone else present to see that the pot is awarded to the best hand shown, regardless of what the players might say. | |
| Some players object to this rule: they believe that if a player or railbird says "Wait, Bob has a flush", that's unfairly helping a player. But the rule is clear: once the hand is tabled, others have not only the right but the obligation to help ensure the pot is awarded correctly. | |
| There are a couple of caveats: if a player only shows his hand to another player, or to the dealer, or to any other person, without properly tabling it for everyone, that player has no rights yet and those to whom the hand was shown should remain silent to protect other players' properly shown hands. But once the hand is tabled, it's fair game and everyone can and should assist in reading the hands. | |
| Finally, in some cases, if a floorman feels that a player has deliberately overcalled his hand to induce a player with a better hand to fold, he may penalize that player by awarding the pot to someone else, even possibly to a folded hand. Some California lowball clubs even had a rule that a player who remained silent rather than explicitly calling out a pair could be ruled against in this way.<|eot_id|><|end_of_text|> | |
| num_epochs was set to 10 | |
| /home/aiops/duanky/miniconda3/envs/hiddenlanguage/lib/python3.11/site-packages/transformers/training_args.py:1545: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of π€ Transformers. Use `eval_strategy` instead | |
| warnings.warn( | |
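The FutureWarning above refers to the `TrainingArguments` field that was renamed. A minimal sketch of the new spelling is below (illustrative values only; this run's actual `TrainingArguments` are not shown in the log).

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out/llama-3_unnatural_instruction_lima",
    eval_strategy="no",  # new name; `evaluation_strategy` is deprecated and slated for removal in v4.46
)
```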
| huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... | |
| To disable this warning, you can either: | |
| - Avoid using `tokenizers` before the fork if possible | |
| - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) | |
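The warning fires once per forked dataloader worker because the fast tokenizer's Rust-side thread pool was already used in the parent process. Setting the environment variable it mentions before any workers are spawned makes the choice explicit and silences it, for example:

```python
import os

# Decide tokenizer parallelism before DataLoader workers are forked.
# "false" simply disables the Rust-side thread pool; "true" re-enables it.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```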
| wandb: Tracking run with wandb version 0.16.3 | |
| wandb: W&B syncing is set to `offline` in this directory. | |
| wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing. | |
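W&B is logging offline here, as the banner notes. A minimal sketch of switching to cloud syncing before the trainer initialises the run (equivalent to the `wandb online` command the banner suggests):

```python
import os

# Same effect as running `wandb online` in this directory.
os.environ["WANDB_MODE"] = "online"
```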
| 0%| | 0/310 [00:00<?, ?it/s] | |
| /home/aiops/duanky/miniconda3/envs/hiddenlanguage/lib/python3.11/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. | |
| warnings.warn( | |
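The UserWarning above comes from gradient checkpointing being enabled without an explicit `use_reentrant` choice. A minimal sketch of silencing it by opting into the recommended non-reentrant variant is below; where exactly the training script enables checkpointing is not visible in this log.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8b")

# Opt into the non-reentrant implementation explicitly, as the warning recommends.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```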
| 0%| | 1/310 [00:08<44:23, 8.62s/it] {'loss': 2.2169, 'grad_norm': 0.9416123032569885, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.03} | |
| 0%| | 1/310 [00:08<44:23, 8.62s/it] 1%| | 2/310 [00:17<45:02, 8.77s/it] {'loss': 2.2238, 'grad_norm': 0.883554220199585, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.06} | |
| 1%| | 2/310 [00:17<45:02, 8.77s/it] 1%| | 3/310 [00:25<42:42, 8.35s/it] {'loss': 1.8262, 'grad_norm': 0.8774077892303467, 'learning_rate': 1.2e-05, 'epoch': 0.1} | |
| 1%| | 3/310 [00:25<42:42, 8.35s/it] 1%|β | 4/310 [00:37<50:26, 9.89s/it] {'loss': 2.2567, 'grad_norm': 1.2654269933700562, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.13} | |
| 1%|β | 4/310 [00:37<50:26, 9.89s/it] 2%|β | 5/310 [00:48<51:20, 10.10s/it] {'loss': 2.1506, 'grad_norm': 1.0119961500167847, 'learning_rate': 2e-05, 'epoch': 0.16} | |
| 2%|β | 5/310 [00:48<51:20, 10.10s/it] 2%|β | 6/310 [00:58<52:11, 10.30s/it] {'loss': 2.0687, 'grad_norm': 1.3454811573028564, 'learning_rate': 2.4e-05, 'epoch': 0.19} | |
| 2%|β | 6/310 [00:58<52:11, 10.30s/it] 2%|β | 7/310 [01:08<50:16, 9.95s/it] {'loss': 2.1591, 'grad_norm': 1.12057626247406, 'learning_rate': 2.8000000000000003e-05, 'epoch': 0.22} | |
| 2%|β | 7/310 [01:08<50:16, 9.95s/it] 3%|β | 8/310 [01:21<55:27, 11.02s/it] {'loss': 2.0954, 'grad_norm': 0.8658522367477417, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.26} | |
| 3%|β | 8/310 [01:21<55:27, 11.02s/it] 3%|β | 9/310 [01:34<58:31, 11.67s/it] {'loss': 2.0456, 'grad_norm': 0.6666287779808044, 'learning_rate': 3.6e-05, 'epoch': 0.29} | |
| 3%|β | 9/310 [01:34<58:31, 11.67s/it] 3%|β | 10/310 [01:47<59:49, 11.96s/it] {'loss': 1.9529, 'grad_norm': 0.6769150495529175, 'learning_rate': 4e-05, 'epoch': 0.32} | |
| 3%|β | 10/310 [01:47<59:49, 11.96s/it] 4%|β | 11/310 [01:57<57:49, 11.60s/it] {'loss': 2.1578, 'grad_norm': 0.8150319457054138, 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.35} | |
| 4%|β | 11/310 [01:57<57:49, 11.60s/it] 4%|β | 12/310 [02:04<50:19, 10.13s/it] {'loss': 2.2589, 'grad_norm': 1.1112101078033447, 'learning_rate': 4.8e-05, 'epoch': 0.38} | |
| 4%|β | 12/310 [02:04<50:19, 10.13s/it] 4%|β | 13/310 [02:13<47:59, 9.70s/it] {'loss': 2.2312, 'grad_norm': 0.7264895439147949, 'learning_rate': 5.2000000000000004e-05, 'epoch': 0.42} | |
| 4%|β | 13/310 [02:13<47:59, 9.70s/it] 5%|β | 14/310 [02:20<44:32, 9.03s/it] {'loss': 2.0748, 'grad_norm': 0.8957861661911011, 'learning_rate': 5.6000000000000006e-05, 'epoch': 0.45} | |
| 5%|β | 14/310 [02:20<44:32, 9.03s/it] 5%|β | 15/310 [02:29<44:42, 9.09s/it] {'loss': 2.0444, 'grad_norm': 0.9033727049827576, 'learning_rate': 6e-05, 'epoch': 0.48} | |
| 5%|β | 15/310 [02:29<44:42, 9.09s/it] 5%|β | 16/310 [02:37<42:53, 8.75s/it] {'loss': 2.0181, 'grad_norm': 0.9036484956741333, 'learning_rate': 6.400000000000001e-05, 'epoch': 0.51} | |
| 5%|β | 16/310 [02:37<42:53, 8.75s/it] 5%|β | 17/310 [02:47<44:06, 9.03s/it] {'loss': 2.0015, 'grad_norm': 1.039255142211914, 'learning_rate': 6.800000000000001e-05, 'epoch': 0.54} | |
| 5%|β | 17/310 [02:47<44:06, 9.03s/it] 6%|β | 18/310 [02:58<47:07, 9.68s/it] {'loss': 1.8732, 'grad_norm': 0.8855764865875244, 'learning_rate': 7.2e-05, 'epoch': 0.58} | |
| 6%|β | 18/310 [02:58<47:07, 9.68s/it] 6%|β | 19/310 [03:09<47:58, 9.89s/it] {'loss': 2.1181, 'grad_norm': 0.8636152744293213, 'learning_rate': 7.6e-05, 'epoch': 0.61} | |
| 6%|β | 19/310 [03:09<47:58, 9.89s/it] 6%|β | 20/310 [03:19<47:42, 9.87s/it] {'loss': 1.8663, 'grad_norm': 0.8274620771408081, 'learning_rate': 8e-05, 'epoch': 0.64} | |
| 6%|β | 20/310 [03:19<47:42, 9.87s/it] 7%|β | 21/310 [03:31<51:00, 10.59s/it] {'loss': 2.1565, 'grad_norm': 0.8611035943031311, 'learning_rate': 8.4e-05, 'epoch': 0.67} | |
| 7%|β | 21/310 [03:31<51:00, 10.59s/it] 7%|β | 22/310 [03:41<50:34, 10.54s/it] {'loss': 2.0774, 'grad_norm': 0.885698676109314, 'learning_rate': 8.800000000000001e-05, 'epoch': 0.7} | |
| 7%|β | 22/310 [03:41<50:34, 10.54s/it] 7%|β | 23/310 [03:49<46:35, 9.74s/it] {'loss': 2.0381, 'grad_norm': 0.9902978539466858, 'learning_rate': 9.200000000000001e-05, 'epoch': 0.74} | |
| 7%|β | 23/310 [03:49<46:35, 9.74s/it] 8%|β | 24/310 [04:02<50:28, 10.59s/it] {'loss': 1.9202, 'grad_norm': 1.5733543634414673, 'learning_rate': 9.6e-05, 'epoch': 0.77} | |
| 8%|β | 24/310 [04:02<50:28, 10.59s/it] 8%|β | 25/310 [04:11<48:15, 10.16s/it] {'loss': 2.0025, 'grad_norm': 1.0718069076538086, 'learning_rate': 0.0001, 'epoch': 0.8} | |
| 8%|β | 25/310 [04:11<48:15, 10.16s/it] 8%|β | 26/310 [04:23<50:49, 10.74s/it] {'loss': 2.062, 'grad_norm': 1.018492579460144, 'learning_rate': 0.00010400000000000001, 'epoch': 0.83} | |
| 8%|β | 26/310 [04:23<50:49, 10.74s/it] 9%|β | 27/310 [04:34<51:35, 10.94s/it] {'loss': 1.9604, 'grad_norm': 0.9562822580337524, 'learning_rate': 0.00010800000000000001, 'epoch': 0.86} | |
| 9%|β | 27/310 [04:34<51:35, 10.94s/it] 9%|β | 28/310 [04:46<52:58, 11.27s/it] {'loss': 1.8883, 'grad_norm': 0.7048187255859375, 'learning_rate': 0.00011200000000000001, 'epoch': 0.9} | |
| 9%|β | 28/310 [04:46<52:58, 11.27s/it] 9%|β | 29/310 [04:55<49:19, 10.53s/it] {'loss': 2.1274, 'grad_norm': 0.8292555809020996, 'learning_rate': 0.000116, 'epoch': 0.93} | |
| 9%|β | 29/310 [04:55<49:19, 10.53s/it] 10%|β | 30/310 [05:05<48:10, 10.32s/it] {'loss': 2.1102, 'grad_norm': 0.7366191148757935, 'learning_rate': 0.00012, 'epoch': 0.96} | |
| 10%|β | 30/310 [05:05<48:10, 10.32s/it] 10%|β | 31/310 [05:13<45:23, 9.76s/it] {'loss': 1.9992, 'grad_norm': 0.898870050907135, 'learning_rate': 0.000124, 'epoch': 0.99} | |
| 10%|β | 31/310 [05:13<45:23, 9.76s/it] 10%|β | 32/310 [05:23<44:40, 9.64s/it] {'loss': 1.7073, 'grad_norm': 0.7241905927658081, 'learning_rate': 0.00012800000000000002, 'epoch': 1.02} | |
| 10%|β | 32/310 [05:23<44:40, 9.64s/it] 11%|β | 33/310 [05:30<41:03, 8.89s/it] {'loss': 2.0378, 'grad_norm': 0.7887054085731506, 'learning_rate': 0.000132, 'epoch': 1.06} | |
| 11%|β | 33/310 [05:30<41:03, 8.89s/it] 11%|β | 34/310 [05:38<40:18, 8.76s/it] {'loss': 1.9224, 'grad_norm': 0.7820041179656982, 'learning_rate': 0.00013600000000000003, 'epoch': 1.09} | |
| 11%|β | 34/310 [05:38<40:18, 8.76s/it] 11%|ββ | 35/310 [05:52<47:13, 10.31s/it] {'loss': 1.9261, 'grad_norm': 0.8182492852210999, 'learning_rate': 0.00014, 'epoch': 1.12} | |
| 11%|ββ | 35/310 [05:52<47:13, 10.31s/it] 12%|ββ | 36/310 [06:02<46:23, 10.16s/it] {'loss': 1.8373, 'grad_norm': 0.8167790174484253, 'learning_rate': 0.000144, 'epoch': 1.15} | |
| 12%|ββ | 36/310 [06:02<46:23, 10.16s/it] 12%|ββ | 37/310 [06:12<45:43, 10.05s/it] {'loss': 1.9242, 'grad_norm': 0.705341100692749, 'learning_rate': 0.000148, 'epoch': 1.18} | |
| 12%|ββ | 37/310 [06:12<45:43, 10.05s/it] 12%|ββ | 38/310 [06:24<48:36, 10.72s/it] {'loss': 1.8963, 'grad_norm': 0.6956414580345154, 'learning_rate': 0.000152, 'epoch': 1.22} | |
| 12%|ββ | 38/310 [06:24<48:36, 10.72s/it] 13%|ββ | 39/310 [06:38<52:09, 11.55s/it] {'loss': 1.8131, 'grad_norm': 0.7615833878517151, 'learning_rate': 0.00015600000000000002, 'epoch': 1.25} | |
| 13%|ββ | 39/310 [06:38<52:09, 11.55s/it] 13%|ββ | 40/310 [06:50<52:31, 11.67s/it] {'loss': 1.8941, 'grad_norm': 0.8172491192817688, 'learning_rate': 0.00016, 'epoch': 1.28} | |
| 13%|ββ | 40/310 [06:50<52:31, 11.67s/it] 13%|ββ | 41/310 [06:59<49:25, 11.03s/it] {'loss': 1.8171, 'grad_norm': 1.1778546571731567, 'learning_rate': 0.000164, 'epoch': 1.31} | |
| 13%|ββ | 41/310 [06:59<49:25, 11.03s/it] 14%|ββ | 42/310 [07:11<50:10, 11.23s/it] {'loss': 1.9507, 'grad_norm': 0.8987011313438416, 'learning_rate': 0.000168, 'epoch': 1.34} | |
| 14%|ββ | 42/310 [07:11<50:10, 11.23s/it] 14%|ββ | 43/310 [07:21<48:26, 10.89s/it] {'loss': 1.9038, 'grad_norm': 0.9761247634887695, 'learning_rate': 0.000172, 'epoch': 1.38} | |
| 14%|ββ | 43/310 [07:21<48:26, 10.89s/it] 14%|ββ | 44/310 [07:32<48:25, 10.92s/it] {'loss': 1.929, 'grad_norm': 0.8245781064033508, 'learning_rate': 0.00017600000000000002, 'epoch': 1.41} | |
| 14%|ββ | 44/310 [07:32<48:25, 10.92s/it] 15%|ββ | 45/310 [07:44<49:45, 11.27s/it] {'loss': 1.9864, 'grad_norm': 0.733974277973175, 'learning_rate': 0.00018, 'epoch': 1.44} | |
| 15%|ββ | 45/310 [07:44<49:45, 11.27s/it] 15%|ββ | 46/310 [07:53<46:06, 10.48s/it] {'loss': 1.7647, 'grad_norm': 0.803587794303894, 'learning_rate': 0.00018400000000000003, 'epoch': 1.47} | |
| 15%|ββ | 46/310 [07:53<46:06, 10.48s/it] 15%|ββ | 47/310 [08:02<44:54, 10.24s/it] {'loss': 1.8716, 'grad_norm': 0.8270595669746399, 'learning_rate': 0.000188, 'epoch': 1.5} | |
| 15%|ββ | 47/310 [08:02<44:54, 10.24s/it] 15%|ββ | 48/310 [08:11<42:43, 9.79s/it] {'loss': 1.8065, 'grad_norm': 0.9778778553009033, 'learning_rate': 0.000192, 'epoch': 1.54} | |
| 15%|ββ | 48/310 [08:11<42:43, 9.79s/it] 16%|ββ | 49/310 [08:20<40:56, 9.41s/it] {'loss': 1.969, 'grad_norm': 1.0002930164337158, 'learning_rate': 0.000196, 'epoch': 1.57} | |
| 16%|ββ | 49/310 [08:20<40:56, 9.41s/it] 16%|ββ | 50/310 [08:29<41:00, 9.46s/it] {'loss': 1.6605, 'grad_norm': 0.9147765636444092, 'learning_rate': 0.0002, 'epoch': 1.6} | |
| 16%|ββ | 50/310 [08:29<41:00, 9.46s/it] 16%|ββ | 51/310 [08:39<41:06, 9.52s/it] {'loss': 1.7364, 'grad_norm': 1.0163377523422241, 'learning_rate': 0.00020400000000000003, 'epoch': 1.63} | |
| 16%|ββ | 51/310 [08:39<41:06, 9.52s/it] 17%|ββ | 52/310 [08:50<42:56, 9.99s/it] {'loss': 1.7628, 'grad_norm': 0.7921202778816223, 'learning_rate': 0.00020800000000000001, 'epoch': 1.66} | |
| 17%|ββ | 52/310 [08:50<42:56, 9.99s/it] 17%|ββ | 53/310 [09:00<42:42, 9.97s/it] {'loss': 1.9347, 'grad_norm': 0.9955267310142517, 'learning_rate': 0.00021200000000000003, 'epoch': 1.7} | |
| 17%|ββ | 53/310 [09:00<42:42, 9.97s/it] 17%|ββ | 54/310 [09:11<44:18, 10.38s/it] {'loss': 1.9298, 'grad_norm': 0.8882664442062378, 'learning_rate': 0.00021600000000000002, 'epoch': 1.73} | |
| 17%|ββ | 54/310 [09:11<44:18, 10.38s/it] 18%|ββ | 55/310 [09:21<43:40, 10.28s/it] {'loss': 1.7571, 'grad_norm': 0.8763493895530701, 'learning_rate': 0.00022000000000000003, 'epoch': 1.76} | |
| 18%|ββ | 55/310 [09:21<43:40, 10.28s/it] 18%|ββ | 56/310 [09:35<47:27, 11.21s/it] {'loss': 1.6818, 'grad_norm': 0.9735665917396545, 'learning_rate': 0.00022400000000000002, 'epoch': 1.79} | |
| 18%|ββ | 56/310 [09:35<47:27, 11.21s/it] 18%|ββ | 57/310 [09:44<44:35, 10.57s/it] {'loss': 2.0371, 'grad_norm': 1.0887527465820312, 'learning_rate': 0.00022799999999999999, 'epoch': 1.82} | |
[per-step training log, steps 58-100 of 310 (epochs 1.86-3.20): loss drifts between roughly 1.3 and 1.9 through epoch 2, then drops to roughly 1.0-1.3 as epoch 3 begins, while the learning rate warms up linearly from 2.32e-4 at step 58 to its 4e-4 peak at step 100; step times stay around 9-12 s/it. Last logged step of this span, at 17:10 elapsed:]
{'loss': 1.0543, 'grad_norm': 1.1869208812713623, 'learning_rate': 0.0004, 'epoch': 3.2}
/home/aiops/duanky/miniconda3/envs/hiddenlanguage/lib/python3.11/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
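The warning is raised because gradient checkpointing is active and torch's default checkpoint variant is scheduled to change. A minimal sketch of passing use_reentrant explicitly through the Hugging Face trainer arguments, assuming transformers >= 4.35 (where gradient_checkpointing_kwargs exists); the output_dir below is a placeholder, not this run's actual configuration:

from transformers import TrainingArguments

# Sketch only: choose the checkpointing variant explicitly so
# torch.utils.checkpoint stops warning about the upcoming default change.
training_args = TrainingArguments(
    output_dir="out",  # placeholder
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

# Equivalent when enabling checkpointing on the model object directly:
# model.gradient_checkpointing_enable(
#     gradient_checkpointing_kwargs={"use_reentrant": False}
# )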
[per-step training log, steps 101-200 of 310 (epochs 3.23-6.40): after the warmup peak the learning rate follows the cosine decay from 4.0e-4 down to 2.1e-4, and the loss falls from roughly 0.9-1.3 in epoch 3 to 0.6-1.0 in epoch 4, 0.4-0.6 in epoch 5, and 0.2-0.5 early in epoch 6. Last logged step of this span, at 34:10 elapsed:]
{'loss': 0.2934, 'grad_norm': 1.265863060951233, 'learning_rate': 0.00021494601871728487, 'epoch': 6.4}
[per-step training log, steps 201-300 of 310 (epochs 6.43-9.60): the loss keeps falling, from roughly 0.1-0.4 in the remainder of epoch 6, through 0.07-0.35 in epoch 7, to 0.03-0.13 in epochs 8-9, while gradient norms settle below 1.0 by epoch 8 and the learning rate decays from 2.1e-4 to 2.2e-6. Last logged step of this span, at 51:20 elapsed:]
{'loss': 0.0538, 'grad_norm': 0.42321497201919556, 'learning_rate': 2.2338347549742956e-06, 'epoch': 9.6}
[per-step training log, steps 301-309 of 310 (epochs 9.63-9.89): loss stays between roughly 0.03 and 0.11 as the learning rate anneals from 1.8e-6 toward zero]
100%|██████████| 310/310 [53:02<00:00, 9.64s/it] {'loss': 0.0573, 'grad_norm': 0.4643600285053253, 'learning_rate': 0.0, 'epoch': 9.92}
{'train_runtime': 3191.6656, 'train_samples_per_second': 3.133, 'train_steps_per_second': 0.097, 'train_loss': 0.8500084597016534, 'epoch': 9.92}
100%|██████████| 310/310 [53:04<00:00, 10.27s/it]
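For reference, the logged learning_rate values are consistent with a 100-step linear warmup to the 4e-4 peak followed by cosine decay to zero over the remaining 210 steps: 2.32e-4 at step 58, 4e-4 at step 100, 2.1495e-4 at step 200, and 0.0 at step 310. A minimal sketch that reproduces those numbers; the closed form is inferred from the log, not taken from the training script:

import math

def lr_at(step, base_lr=4e-4, warmup_steps=100, total_steps=310):
    # Inferred schedule: linear warmup, then cosine decay to zero.
    if step <= warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (58, 100, 200, 310):
    print(s, lr_at(s))  # 0.000232, 0.0004, ~0.000214946, 0.0

The run summary above is self-consistent as well: 310 steps over 3191.6656 s gives about 0.097 steps/s, and 3.133 samples/s times 3191.6656 s is roughly 10,000 samples seen across the 9.92 reported epochs.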
wandb:
wandb: Run history:
wandb: train/epoch
wandb: train/global_step
wandb: train/grad_norm
wandb: train/learning_rate
wandb: train/loss
wandb:
wandb: Run summary:
wandb: total_flos 4.109458717571809e+17
wandb: train/epoch 9.92
wandb: train/global_step 310
wandb: train/grad_norm 0.46436
wandb: train/learning_rate 0.0
wandb: train/loss 0.0573
wandb: train_loss 0.85001
wandb: train_runtime 3191.6656
wandb: train_samples_per_second 3.133
wandb: train_steps_per_second 0.097
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aiops/duanky/llm-attacks/instruction_tuning_experiments/wandb/offline-run-20241020_074630-wlpdreib
wandb: Find logs at: ./wandb/offline-run-20241020_074630-wlpdreib/logs
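Because wandb ran in offline mode, the loss curve can also be recovered directly from a saved copy of this console output. A minimal sketch (the log filename is hypothetical) that extracts the per-step metric dicts printed by the trainer above:

import ast
import re

records = []
with open("train_console.log") as fh:  # hypothetical path to this captured output
    for line in fh:
        match = re.search(r"\{'loss':.*?\}", line)  # per-step dicts printed by the trainer
        if match:
            records.append(ast.literal_eval(match.group(0)))

# e.g. tabulate or plot the recovered curve
for rec in records[-3:]:
    print(rec['epoch'], rec['loss'], rec['learning_rate'])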