|
--- |
|
library_name: transformers |
|
tags: |
|
- e-commerce |
|
- query-generation |
|
license: mit |
|
datasets: |
|
- smartcat/Amazon-2023-GenQ |
|
language: |
|
- en |
|
metrics: |
|
- rouge |
|
base_model: |
|
- BeIR/query-gen-msmarco-t5-base-v1 |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# Model Card for T5-GenQ-TD-v1 |
|
|
|
🤖 ✨ 🔍 Generate precise, realistic user-focused search queries from product text 🛒 🚀 📊 |
|
|
|
|
|
### Model Description |
|
|
|
- **Model Name:** T5-GenQ-TD-v1 (fine-tuned query-generation model)
|
- **Model type:** Text-to-Text Transformer |
|
- **Finetuned from model:** [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1) |
|
- **Dataset**: [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ) |
|
- **Primary Use Case**: Generating accurate and relevant search queries from item descriptions |
|
- **Repository:** [smartcat-labs/product2query](https://github.com/smartcat-labs/product2query) |
|
|
|
### Model variations |
|
|
|
<table border="1" class="dataframe"> |
|
<tr style="text-align: center;"> |
|
<th>Model</th> |
|
<th>ROUGE-1</th> |
|
<th>ROUGE-2</th> |
|
<th>ROUGE-L</th> |
|
<th>ROUGE-Lsum</th> |
|
</tr> |
|
<tr> |
|
<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-T-v1">T5-GenQ-T-v1</a></b></td> |
|
<td>75.2151</td> |
|
<td>54.8735</td> |
|
<td><b>74.5142</b></td> |
|
<td>74.5262</td> |
|
</tr> |
|
<tr> |
|
<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TD-v1">T5-GenQ-TD-v1</a></b></td> |
|
<td>78.2570</td> |
|
<td>58.9586</td> |
|
<td><b>77.5308</b></td> |
|
<td>77.5466</td> |
|
</tr> |
|
<tr> |
|
<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TDE-v1">T5-GenQ-TDE-v1</a></b></td> |
|
<td>76.9075</td> |
|
<td>57.0980</td> |
|
<td><b>76.1464</b></td> |
|
<td>76.1502</td> |
|
</tr> |
|
<tr> |
|
<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TDC-v1">T5-GenQ-TDC-v1</a> (best)</b></td> |
|
<td>80.0754</td> |
|
<td>61.5974</td> |
|
<td><b>79.3557</b></td> |
|
<td>79.3427</td> |
|
</tr> |
|
</table> |
|
|
|
### Uses |
|
|
|
This model is designed to improve e-commerce search functionality by generating user-friendly search queries based on product descriptions. It is particularly suited for applications where product descriptions are the primary input, and the goal is to create concise, descriptive queries that align with user search intent. |
|
|
|
### Examples of Use: |
|
|
|
- Generating search queries for product indexing.
- Enhancing product discoverability in e-commerce search engines.
- Automating query generation for catalog management.
|
|
|
### Comparison of ROUGE scores: |
|
|
|
<table border="1"> |
|
<thead> |
|
<tr> |
|
<th>Model</th> |
|
<th>ROUGE-1</th> |
|
<th>ROUGE-2</th> |
|
<th>ROUGE-L</th> |
|
<th>ROUGE-Lsum</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td>T5-GenQ-TD-v1</td> |
|
<td>76.15</td> |
|
<td>56.23</td> |
|
<td>75.49</td> |
|
<td>75.49</td> |
|
</tr> |
|
<tr> |
|
<td>query-gen-msmarco-t5-base-v1</td> |
|
<td>34.92</td> |
|
<td>15.28</td> |
|
<td>34.17</td> |
|
<td>34.17</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
**Note:** This evaluation was performed after training, on the test split of the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ/viewer/default/test?views%5B%5D=test) dataset.
|
|
|
### Examples |
|
|
|
|
|
|
|
<details><summary>Expand to see table with examples</summary> |
|
<table> |
|
<thead> |
|
<tr> |
|
<th style="width: 30%;">Input Text</th> |
|
<th style="width: 20%;">Target Query</th> |
|
<th style="width: 25%;">Before Fine-tuning</th> |
|
<th style="width: 25%;">After Fine-tuning</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td class="input-text"><strong>Dr. Scholl's Women's Trance Slip Resistant Clog</strong><br><br> |
|
Our trance work shoe combines exceptional style and performance. An oil and slip-resistant outsole combined with a molded EVA construction will add layers of safety and comfort.</td> |
|
<td>Dr. Scholl's Women's Trance Clog</td> |
|
<td>dr scholl trance shoes</td> |
|
<td>Dr. Scholl's Trance Clog</td> |
|
</tr> |
|
<tr> |
|
<td class="input-text"><strong>Girls Birthday Tutu Skirts Dress with Mermaid Birthday Girl Tshirt, Headband, Satin Sash</strong><br><br> |
|
Girls Mermaid Dress Set with Tshirt, Dress, Headband and sash.</td> |
|
<td>girls mermaid dress set</td> |
|
<td>what to wear for a mermaid birthday</td> |
|
<td>Girls Mermaid Birthday Dress Set</td> |
|
</tr> |
|
<tr> |
|
<td class="input-text"><strong>Saucony Women's Omni 15 Running Shoe</strong><br><br> |
|
If we could design shoelaces for pronators, we’d do that too. The omni 15 delivers everything a moderate to severe pronator could need, including enhanced cushioning, exceptional support, flexibility and a smooth, fluid ride.</td> |
|
<td>Saucony Omni 15 Women's Running Shoe</td> |
|
<td>what shoes are good for pronators</td> |
|
<td>Saucony Omni 15 Running Shoe</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</details> |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-TD-v1")
tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-TD-v1")

description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

# Tokenize the product text and generate a query with beam search
inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)
generated_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4, early_stopping=True)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
```
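To generate several candidate queries for the same product (e.g., for indexing), beam search can return more than one sequence. A minimal sketch reusing `model`, `tokenizer`, and `inputs` from above; note that `num_return_sequences` must not exceed `num_beams`:

```python
# Optional: return several candidate queries for the same description
generated_ids = model.generate(
    inputs["input_ids"],
    max_length=30,
    num_beams=4,
    num_return_sequences=4,  # must be <= num_beams
    early_stopping=True,
)
for ids in generated_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```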
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained on the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ) dataset, which consists of user-like queries generated from product descriptions. The dataset was created with Claude 3 Haiku, incorporating key product attributes such as the title, description, and images to ensure relevant and realistic queries. For more information, see the [dataset card](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ).
|
|
|
|
|
### Preprocessing |
|
- Trained on product titles and descriptions.
- Tokenized using T5's default tokenizer, with truncation to handle long text; a sketch of this step follows below.
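A minimal sketch of that preprocessing step, assuming hypothetical column names (`title`, `description`, `query`) rather than the dataset's exact schema:

```python
def preprocess(batch, tokenizer, max_input_length=512, max_target_length=30):
    # Combine title and description into one input text per product
    # (column names here are assumptions, not the dataset's exact schema)
    inputs = [f"{t} {d}" for t, d in zip(batch["title"], batch["description"])]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the target queries as labels
    labels = tokenizer(text_target=batch["query"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Typical usage with a Hugging Face dataset:
# tokenized = dataset.map(lambda batch: preprocess(batch, tokenizer), batched=True)
```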
|
|
|
|
|
### Training Hyperparameters |
|
|
|
<ul> |
|
<li><strong>max_input_length:</strong> 512</li> |
|
<li><strong>max_target_length:</strong> 30</li> |
|
<li><strong>batch_size:</strong> 48</li> |
|
<li><strong>num_train_epochs:</strong> 8</li> |
|
<li><strong>evaluation_strategy:</strong> epoch</li> |
|
<li><strong>save_strategy:</strong> epoch</li> |
|
<li><strong>learning_rate:</strong> 5.6e-05</li> |
|
<li><strong>weight_decay:</strong> 0.01 </li> |
|
<li><strong>predict_with_generate:</strong> true</li> |
|
<li><strong>load_best_model_at_end:</strong> true</li> |
|
<li><strong>metric_for_best_model:</strong> eval_rougeL</li> |
|
<li><strong>greater_is_better:</strong> true</li> |
|
<li><strong>logging_strategy:</strong> epoch</li> |
|
</ul> |
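A minimal sketch of how these hyperparameters map onto the Hugging Face `Seq2SeqTrainingArguments` API (our reconstruction, not the exact training script; `output_dir` and the per-device batch-size mapping are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="T5-GenQ-TD-v1",          # placeholder
    per_device_train_batch_size=48,      # assumed mapping of batch_size
    per_device_eval_batch_size=48,
    num_train_epochs=8,
    learning_rate=5.6e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    predict_with_generate=True,
    generation_max_length=30,            # assumed mapping of max_target_length
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
)
```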
|
|
|
### Training time: 15.22 hours
|
|
|
### Hardware |
|
|
|
NVIDIA RTX A6000 GPU:
|
- Memory Size: 48 GB |
|
- Memory Type: GDDR6 |
|
- CUDA compute capability: 8.6
|
|
|
### Metrics |
|
|
|
**[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric))**, or **R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation, is a set of metrics for evaluating automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations. ROUGE scores range from 0 to 1, with higher scores indicating greater similarity to the reference.
|
|
|
In our evaluation, ROUGE scores are scaled to a 0-100 range for easier interpretation. ROUGE-L was the metric used for model selection during training.
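As a quick illustration of how these scores can be computed, here is a minimal example using the `evaluate` library (our choice of tooling for this example; the card does not state which ROUGE implementation was used):

```python
import evaluate

# Compare a generated query against its reference query
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["saucony omni 15 running shoe"],
    references=["Saucony Omni 15 Women's Running Shoe"],
)

# Scale to 0-100, as in the tables in this card
print({k: round(v * 100, 2) for k, v in scores.items()})
```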
|
|
|
<table border="1" class="dataframe"> |
|
<thead> |
|
<tr style="text-align: center;"> |
|
<th>Epoch</th> |
|
<th>Step</th> |
|
<th>Loss</th> |
|
<th>Grad Norm</th> |
|
<th>Learning Rate</th> |
|
<th>Eval Loss</th> |
|
<th>ROUGE-1</th> |
|
<th>ROUGE-2</th> |
|
<th>ROUGE-L</th> |
|
<th>ROUGE-Lsum</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td>1.0</td> |
|
<td>4285</td> |
|
<td>0.2515</td> |
|
<td>1.890405</td> |
|
<td>0.000049</td> |
|
<td>0.165247</td> |
|
<td>76.4578</td> |
|
<td>56.4813</td> |
|
<td>75.7754</td> |
|
<td>75.7835</td> |
|
</tr> |
|
<tr> |
|
<td>2.0</td> |
|
<td>8570</td> |
|
<td>0.1744</td> |
|
<td>1.433518</td> |
|
<td>0.000042</td> |
|
<td>0.157739</td> |
|
<td>77.2138</td> |
|
<td>57.4609</td> |
|
<td>76.5478</td> |
|
<td>76.5589</td> |
|
</tr> |
|
<tr> |
|
<td>3.0</td> |
|
<td>12855</td> |
|
<td>0.1595</td> |
|
<td>1.340541</td> |
|
<td>0.000035</td> |
|
<td>0.154977</td> |
|
<td>77.5761</td> |
|
<td>57.9620</td> |
|
<td>76.8824</td> |
|
<td>76.8854</td> |
|
</tr> |
|
<tr> |
|
<td>4.0</td> |
|
<td>17140</td> |
|
<td>0.1488</td> |
|
<td>1.370982</td> |
|
<td>0.000028</td> |
|
<td>0.153134</td> |
|
<td>77.9366</td> |
|
<td>58.5720</td> |
|
<td>77.2561</td> |
|
<td>77.2692</td> |
|
</tr> |
|
<tr> |
|
<td>5.0</td> |
|
<td>21425</td> |
|
<td>0.1407</td> |
|
<td>1.549360</td> |
|
<td>0.000021</td> |
|
<td>0.153177</td> |
|
<td>78.1102</td> |
|
<td>58.7207</td> |
|
<td>77.4106</td> |
|
<td>77.4241</td> |
|
</tr> |
|
<tr> |
|
<td>6.0</td> |
|
<td>25710</td> |
|
<td>0.1344</td> |
|
<td>1.258538</td> |
|
<td>0.000014</td> |
|
<td>0.152852</td> |
|
<td>78.1691</td> |
|
<td>58.8640</td> |
|
<td>77.4554</td> |
|
<td>77.4651</td> |
|
</tr> |
|
<tr> |
|
<td>7.0</td> |
|
<td>29995</td> |
|
<td>0.1299</td> |
|
<td>1.200458</td> |
|
<td>0.000007</td> |
|
<td>0.153884</td> |
|
<td>78.2001</td> |
|
<td>58.8603</td> |
|
<td>77.4833</td> |
|
<td>77.4984</td> |
|
</tr> |
|
<tr> |
|
<td>8.0</td> |
|
<td>34280</td> |
|
<td>0.1267</td> |
|
<td>1.079393</td> |
|
<td>0.000000</td> |
|
<td>0.154507</td> |
|
<td>78.2570</td> |
|
<td>58.9586</td> |
|
<td>77.5308</td> |
|
<td>77.5466</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
<style> |
|
.model-analysis table { |
|
width: 100%; |
|
border-collapse: collapse; |
|
} |
|
.model-analysis td { |
|
padding: 10px; |
|
vertical-align: middle; |
|
} |
|
.model-analysis img { |
|
width: auto; /* Maintain aspect ratio */ |
|
display: block; |
|
margin: 0 auto; |
|
max-height: 750px; /* Default height for most images */ |
|
|
|
} |
|
</style> |
|
|
|
<div class="model-analysis"> |
|
|
|
### Model Analysis |
|
<details><summary>Average scores by model </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="average_scores_by_model.png" alt="image"></td> |
|
<td> |
|
|
|
```checkpoint-34280``` (T5-GenQ-TD-v1) significantly outperforms ```query-gen-msmarco-t5-base-v1``` across all ROUGE metrics. |
|
|
|
The difference is most notable in ROUGE-2, where ```checkpoint-34280``` achieves 56.24% vs. 15.29% for the baseline model. |
|
|
|
These results suggest ```checkpoint-34280``` produces more precise and high-overlap text generations.</td></tr> |
|
</table> |
|
</details> |
|
|
|
<details><summary>Density comparison </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="density_comparison.png" alt="image"></td> |
|
<td> |
|
|
|
```checkpoint-34280``` (T5-GenQ-TD-v1) has strong peaks near 100%, indicating high overlap with reference texts. |
|
|
|
```query-gen-msmarco-t5-base-v1``` shows a broader distribution, with peaks at low to mid-range scores (10-40%), suggesting greater variability but lower precision. |
|
|
|
ROUGE-2 has a high density at 0% for the baseline model, implying many instances with no bigram overlap.</td></tr> |
|
</table> |
|
</details> |
|
|
|
<details><summary>Histogram comparison </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="histogram_comparison.png" alt="image"></td> |
|
<td> |
|
|
|
```checkpoint-34280``` (T5-GenQ-TD-v1, blue) shows a steady increase toward high ROUGE scores, peaking at 100%. |
|
|
|
```query-gen-msmarco-t5-base-v1``` (orange) has multiple low-score peaks, particularly in ROUGE-2, reinforcing its lower text overlap performance. |
|
|
|
These histograms confirm that ```checkpoint-34280``` consistently generates more accurate outputs.</td></tr> |
|
</table> |
|
</details> |
|
|
|
<details><summary>Scores by generated query length </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="group_sizes.png" alt="image"></td> |
|
<td> |
|
This visualization compares average ROUGE scores and score differences across generated-query lengths (in words).

**Consistent ROUGE scores (lengths 2-8):** ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum remain high and stable across most query lengths.

**Sharp drop at length 9:** Scores decrease significantly for 9-word queries, with negative score differences, suggesting longer queries align less closely with the reference texts.

**Score differences near zero (lengths 2-8):** The models perform similarly for shorter queries but diverge at greater lengths.
|
</td></tr> |
|
</table> |
|
</details> |
|
|
|
<details><summary>Semantic similarity distribution </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="semantic_similarity_distribution.png" alt="image"></td> |
|
<td>This histogram visualizes the distribution of cosine similarity scores, which measure the semantic similarity between paired texts (generated query and target query). |
|
|
|
A strong peak near 1.0 indicates that most pairs are highly semantically similar. |
|
|
|
Low similarity scores (0.0–0.4) are rare, suggesting the dataset consists mostly of highly related text pairs.</td></tr> |
|
</table> |
|
</details> |
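A minimal sketch of how such pairwise cosine similarities can be computed with `sentence-transformers` (the encoder below is a hypothetical choice; the card does not state which embedding model was used):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical encoder choice for illustration
encoder = SentenceTransformer("all-MiniLM-L6-v2")

generated = "saucony omni 15 running shoe"
target = "Saucony Omni 15 Women's Running Shoe"

embeddings = encoder.encode([generated, target], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```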
|
|
|
<details><summary>Semantic similarity score against ROUGE scores </summary> |
|
<table style="width:100%"><tr> |
|
<td style="width:65%"><img src="similarity_vs_rouge.png" alt="image"></td> |
|
<td> |
|
This scatter plot matrix shows the relationship between semantic similarity (cosine similarity) and ROUGE scores: |
|
|
|
Higher similarity → Higher ROUGE scores, indicating a positive correlation. |
|
|
|
ROUGE-1 & ROUGE-L show the strongest alignment, while ROUGE-2 has greater variance. |
|
|
|
Some low-similarity outliers still achieve moderate ROUGE scores, suggesting surface-level overlap without deep semantic alignment. |
|
|
|
This analysis shows how semantic similarity aligns with n-gram overlap metrics when evaluating text-generation models.
|
</td></tr> |
|
</table> |
|
</details> |
|
</div> |
|
|
|
## More Information |
|
|
|
- Please visit the [GitHub Repository](https://github.com/smartcat-labs/product2query) |
|
|
|
## Authors |
|
|
|
- Mentor: [Milutin Studen](https://www.linkedin.com/in/milutin-studen/) |
|
- Engineers: [Petar Surla](https://www.linkedin.com/in/petar-surla-6448b6269/), [Andjela Radojevic](https://www.linkedin.com/in/an%C4%91ela-radojevi%C4%87-936197196/) |
|
|
|
## Model Card Contact |
|
|
|
For questions, please open an issue on the [GitHub Repository](https://github.com/smartcat-labs/product2query).