---
library_name: transformers
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- tld
- embeddings
- domains
- multi-task-learning
- bert
pipeline_tag: feature-extraction
widget:
- text: "com"
- text: "io"
- text: "ai"
- text: "co.za"
model-index:
- name: TLD Embedding Model
  results:
  - task:
      type: feature-extraction
      name: TLD Embedding
    metrics:
    - type: spearman_correlation
      value: 0.8976
      name: Average Spearman Correlation
---

# TLD Embedding Model

A state-of-the-art TLD (Top-Level Domain) embedding model that learns rich 96-dimensional representations from multiple data sources through multi-task learning. This model achieved an exceptional **0.8976 average Spearman correlation** across 63 features during training.

## Model Overview

This TLD embedding model creates semantic representations by jointly learning from four complementary prediction tasks:

1. **Research Metrics** (18 features): Brand perception, trust scores, memorability, premium brand indices
2. **Technical Metrics** (5 features): Registration statistics, domain rankings, usage patterns
3. **Economic Indicators** (21 features): Country-level GDP sector breakdowns mapped to TLD registries
4. **Price Predictions** (18 features): Industry-specific market value scores from domain sales data

The model uses a shared BERT encoder with task-specific prediction heads, enabling the embeddings to capture semantic, technical, economic, and market value aspects of each TLD.

## Training Performance

**Final Training Results (Epoch 25/25):**

- **Overall Average Score**: 0.8976 (89.76% Spearman correlation)
- **Training Loss**: 0.0034

**Task-Specific Performance:**

- **Research Task**: 0.80+ correlation on trust, adoption, and brand metrics
- **Technical Task**: 0.93-0.99 correlation on registration and ranking metrics
- **Economic Task**: 0.89-0.96 correlation on GDP sector predictions
- **Price Task**: 0.90-0.99 correlation on industry-specific price scores

**Best Individual Metrics:**

- `overall_score`: 0.990 Spearman correlation
- `global_top_1m_share`: 0.993 Spearman correlation
- `score_food`: 0.973 Spearman correlation
- `three_letter_registration_percent`: 0.969 Spearman correlation

## Architecture

- **Base Model**: `google/bert_uncased_L-4_H-256_A-4` (lightweight BERT)
- **Embedding Dimension**: 96 (optimized for the dataset size)
- **Max Sequence Length**: 8 tokens (optimized for TLDs)
- **MLP Hidden Size**: 192 with 15% dropout
- **Task Weighting**: Research (0.25), Technical (0.20), Economic (0.15), Price (0.40)

A rough sketch of this layout appears after the data-source descriptions below.

## Training Data Sources

### Research Data (`tld_research_data.jsonl`)

- **Coverage**: 150 TLDs with research metrics
- **Features**: Trust scores, brand associations, memorability, adoption rates
- **Source**: Survey data, brand perception studies, market research

### Technical Data (`tld_technical_data.jsonl`)

- **Coverage**: 716 TLDs with technical metrics
- **Features**: Registration patterns, domain rankings (Majestic), sales volumes
- **Source**: Registry statistics, web crawl data, domain marketplaces

### Economic Data (`country_economic_data.jsonl`)

- **Coverage**: 126 TLDs mapped to country economies
- **Features**: GDP breakdowns by 21 industry sectors
- **Source**: World Bank and IMF economic data mapped to ccTLD registries

### Price Data (`tld_price_scores_by_industry_2025.csv`)

- **Coverage**: 722 TLDs with price predictions
- **Features**: 18 industry-specific price scores plus an overall score
- **Source**: Domain sales data processed through a pairwise neural network (`compute_tld_scores_pairwise.py`)
- **Industries**: Finance, healthcare, technology, automotive, food, gaming, etc.
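To make the setup above concrete, here is a minimal sketch of a shared-encoder model with one prediction head per data source, using the dimensions listed under Architecture. This is an illustration rather than the released training code: the class name `TLDMultiTaskModel`, the attribute names (`projection`, `heads`), and the head output sizes (taken from the feature counts in the Model Overview) are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class TLDMultiTaskModel(nn.Module):
    """Sketch of the multi-task layout described in this card (assumed names/shapes)."""

    def __init__(self, base_model="google/bert_uncased_L-4_H-256_A-4",
                 embed_dim=96, hidden_dim=192, dropout=0.15):
        super().__init__()
        # Shared encoder and 96-dim projection taken from the [CLS] token
        self.encoder = AutoModel.from_pretrained(base_model)
        self.projection = nn.Linear(self.encoder.config.hidden_size, embed_dim)
        # One MLP head per task; output sizes follow the Model Overview feature counts
        task_sizes = {"research": 18, "technical": 5, "economic": 21, "price": 18}
        self.heads = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, n_features),
            )
            for task, n_features in task_sizes.items()
        })

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask,
                               token_type_ids=token_type_ids)
        cls = outputs.last_hidden_state[:, 0, :]   # [CLS] representation
        embedding = self.projection(cls)           # 96-dim TLD embedding
        predictions = {task: head(embedding) for task, head in self.heads.items()}
        return embedding, predictions


# Quick shape check with dummy token ids for two TLDs
model = TLDMultiTaskModel()
emb, preds = model(torch.randint(0, 100, (2, 8)))
print(emb.shape, preds["price"].shape)  # torch.Size([2, 96]) torch.Size([2, 18])
```

During training, the per-task losses would presumably be combined using the task weights listed under Architecture (0.25 / 0.20 / 0.15 / 0.40).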
## Installation & Usage

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "humbleworth/tld-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()
```

### Getting TLD Embeddings

```python
def get_tld_embedding(tld, model, tokenizer):
    """Get the 96-dimensional embedding for a single TLD."""
    # Use the special token format if available, otherwise prefix with a dot
    tld_text = f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}"

    inputs = tokenizer(
        tld_text,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8
    )

    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        tld_embedding = model.projection(cls_embedding)

    return tld_embedding.squeeze().numpy()

# Example
com_embedding = get_tld_embedding("com", model, tokenizer)
print(f"Embedding shape: {com_embedding.shape}")  # (96,)
```

### Batch Processing

```python
def get_tld_embeddings_batch(tlds, model, tokenizer):
    """Get embeddings for multiple TLDs efficiently."""
    # Use the special token format if available, otherwise prefix with a dot
    tld_texts = [
        f"[TLD_{tld}]" if f"[TLD_{tld}]" in tokenizer.vocab else f".{tld}"
        for tld in tlds
    ]

    inputs = tokenizer(
        tld_texts,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=8
    )

    with torch.no_grad():
        outputs = model.encoder(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        tld_embeddings = model.projection(cls_embeddings)

    return tld_embeddings.numpy()

# Process multiple TLDs
tlds = ["com", "io", "ai", "co.za", "tech"]
embeddings = get_tld_embeddings_batch(tlds, model, tokenizer)
print(f"Embeddings shape: {embeddings.shape}")  # (5, 96)
```

## Key Features

### Multi-Task Learning Benefits

- **Robust Representations**: Joint learning across diverse tasks creates more stable embeddings
- **Transfer Learning**: Knowledge from technical metrics improves price prediction and vice versa
- **Percentile Normalization**: All features are converted to percentiles for balanced learning

### Industry-Specific Intelligence

- **18 Industry Scores**: Specialized predictions for finance, technology, healthcare, etc.
- **Economic Mapping**: Country-level economic data enhances ccTLD understanding
- **Market Dynamics**: Real domain sales data captures market preferences

### Technical Optimizations

- **MPS Support**: Optimized for Apple Silicon (M1/M2) training
- **Gradient Accumulation**: Stable training with an effective batch size of 64
- **Early Stopping**: Prevents overfitting with patience-based stopping
- **Task Weighting**: Balanced learning that prioritizes price prediction (40% weight)

## Use Cases

1. **Domain Valuation**: Use embeddings as features for ML-based domain appraisal
2. **TLD Recommendation**: Find similar TLDs for branding or investment decisions (see the similarity sketch below)
3. **Market Analysis**: Cluster TLDs by business characteristics or market positioning
4. **Portfolio Optimization**: Analyze TLD portfolios using semantic similarity
5. **Cross-Market Analysis**: Compare TLD performance across different industries
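As an illustration of the recommendation and clustering use cases, nearest-neighbour TLDs can be ranked by cosine similarity over the embeddings. This is a minimal sketch built on the `get_tld_embeddings_batch` helper above; the function name `most_similar_tlds`, the candidate list, and the `top_k` parameter are illustrative choices, not part of the released code.

```python
import numpy as np

def most_similar_tlds(query_tld, candidate_tlds, model, tokenizer, top_k=3):
    """Rank candidate TLDs by cosine similarity to a query TLD."""
    all_tlds = [query_tld] + list(candidate_tlds)
    embeddings = get_tld_embeddings_batch(all_tlds, model, tokenizer)

    # L2-normalize so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query, candidates = normed[0], normed[1:]
    scores = candidates @ query

    ranked = sorted(zip(candidate_tlds, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Example: TLDs most similar to ".io" among a small candidate set
print(most_similar_tlds("io", ["com", "ai", "tech", "co.za", "net"], model, tokenizer))
```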
## Training Configuration

**Optimal Hyperparameters (Based on Data Analysis):**

- Epochs: 25 (early stopping at patience = 5)
- Batch Size: 16 (effective 64 with accumulation)
- Learning Rate: 5e-4 with warmup
- Warmup Steps: 200
- Gradient Accumulation: 4 steps
- Dropout: 15%

**Training Command:**

```bash
python train_dual_task_embeddings.py \
    --epochs 25 \
    --batch-size 16 \
    --learning-rate 5e-4 \
    --warmup-steps 200 \
    --output-dir models/tld_embedding_model
```

## Model Files

```
tld_embedding_model/
├── config.json              # Model configuration
├── pytorch_model.bin        # Model weights
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer config
├── vocab.txt                # Vocabulary
├── special_tokens_map.json  # Special tokens
├── training_metrics.pt      # Training metrics
├── tld_embeddings.json      # Pre-computed embeddings
└── README.md                # This file
```

## Citation

If you use this model in your research, please cite:

```bibtex
@software{tld_embedding_2025,
  title = {TLD Embedding Model: Multi-Task Learning for Domain Extensions},
  author = {HumbleWorth},
  year = {2025},
  note = {Achieved 0.8976 average Spearman correlation across 63 features},
  url = {https://huggingface.co/humbleworth/tld-embedding}
}
```

## License

This model is released under the Apache 2.0 License.
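## Appendix: Using the Pre-Computed Embeddings

For applications that only need the vectors, the `tld_embeddings.json` file listed under Model Files can be used without running the encoder. A minimal sketch, assuming the file is a JSON object mapping each TLD string to its 96-dimensional vector (the exact schema is not documented in this card):

```python
import json
import numpy as np

# Assumed schema: {"com": [...96 floats...], "io": [...], ...}
with open("tld_embedding_model/tld_embeddings.json") as f:
    precomputed = json.load(f)

com_vec = np.array(precomputed["com"])
print(com_vec.shape)  # expected: (96,)
```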