NBK ATS Domain Classifier v1 (English)
nbk-ats-domain-v1-en is a specialized sentence transformer fine-tuned using contrastive learning (Triplet Loss) for professional domain classification. This model excels at distinguishing between different professional domains in resumes and job descriptions, solving the critical "cross-domain score inflation" problem in ATS (Applicant Tracking System) applications.
Key Features
- Exceptional Domain Separation: Same-domain similarity scores >0.90; cross-domain scores can reach negative values (e.g., Tech vs. Healthcare: -0.295)
- Contrastive Learning: Trained with TripletLoss using 40,000 hard-negative triplets (32,000 train, 8,000 validation)
- Extended Context: 8,192 tokens, enough to process full-length resumes and job descriptions
- Built on Excellence: Fine-tuned from nbk-ats-semantic-v1-en, inheriting its semantic understanding
- 13 Professional Domains: Technology, Healthcare, Finance, Education, Legal, Marketing, Manufacturing, Design, Retail/Hospitality, Construction/Real Estate, HR, Media/Entertainment, Government/Nonprofit
- Production Validated: 100% accuracy on same-domain recognition, perfect cross-domain discrimination
Model Specification
- Base Model: 0xnbk/nbk-ats-semantic-v1-en (fine-tuned from jinaai/jina-embeddings-v2-small-en)
- Training Method: Contrastive Learning with TripletLoss
- Fine-tuning Dataset: 0xnbk/resume-domain-triplets-train-v1-en
- Embedding Dimension: 512D
- Max Sequence Length: 8,192 tokens (~32,000 characters)
- Architecture: 4-layer BERT with ALiBi position embeddings
- Training Loss: TripletLoss with cosine distance (margin: 0.5)
Performance
Validation Results
Same-Domain Performance:
- Similarity Score: Consistently >0.90 for matching domains
- Perfect Recognition: 100% accuracy in identifying same-domain pairs
- Examples:
  - Technology ↔ Technology: 0.95+
  - Healthcare ↔ Healthcare: 0.92+
  - Finance ↔ Finance: 0.94+
Cross-Domain Discrimination:
- Aggressive Separation: Scores pushed to very low or negative values
- No False Positives: Zero cross-domain pairs score >0.90
- Examples:
  - Tech ↔ Healthcare: -0.295 (exceptional separation)
  - Tech ↔ Finance: -0.180 (strong separation)
  - Design ↔ Finance: -0.250 (clear boundaries)
Nuanced Understanding:
- Realistic Overlap Detection: Related domains show appropriate similarity
  - Marketing ↔ Sales: ~0.85-0.90 (realistic business overlap)
  - Design ↔ Technology: 0.65-0.75 (modern UX/UI roles)
  - Healthcare ↔ Education: 0.60-0.70 (teaching hospitals)
Training Performance (A6000 48GB GPU)
Training Configuration:
- Hardware: NVIDIA A6000 48GB
- Training Time: ~2 hours
- Physical Batch Size: 8
- Gradient Accumulation: 16 steps (effective batch size: 128)
- Learning Rate: 5e-6
- Epochs: 3
- Triplet Margin: 0.5
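The batch-size figures above imply a fairly small number of optimizer updates. The following is a back-of-the-envelope sketch, assuming each epoch passes once over all 32,000 training triplets and that a final partial batch, if any, still triggers an update; the variable names are illustrative and not taken from the training script.

```python
import math

# Values taken from the training configuration listed above.
train_triplets = 32_000
physical_batch_size = 8
grad_accum_steps = 16
epochs = 3

# Gradients from 16 physical batches are accumulated before each update.
effective_batch_size = physical_batch_size * grad_accum_steps

# Assumption: one optimizer step per effective batch, partial batches kept.
steps_per_epoch = math.ceil(train_triplets / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch_size}")  # 128
print(f"optimizer steps per epoch: {steps_per_epoch}")  # 250
print(f"optimizer steps in total: {total_steps}")       # 750
```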
Dataset Quality:
- Total Triplets: 40,000 (32,000 train, 8,000 validation)
- Hard Negative Mining: 85% of negatives are "confusing" domain pairs
- 13 Balanced Domains: Equal representation across all professional fields
- Source: Real LinkedIn job postings (654,650 processed)
Installation
pip install sentence-transformers
Usage
Domain Classification
from sentence_transformers import SentenceTransformer
import numpy as np
# Load domain-specialized model
model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')
# Example: Resume-Job domain matching
resume_tech = """
Senior Software Engineer with 8 years of experience in Python, Django, and React.
Led development of microservices architecture serving 10M+ users. Expert in AWS,
Docker, Kubernetes, and CI/CD pipelines. Strong background in agile methodologies
and cross-functional team leadership.
"""
job_tech = """
We're seeking a Senior Backend Engineer with 5+ years Python experience.
Must have expertise in Django, microservices, and cloud platforms (AWS/GCP).
Experience with containerization (Docker/Kubernetes) and modern DevOps practices required.
"""
job_healthcare = """
Registered Nurse position in ICU requiring 3+ years critical care experience.
Must have current RN license, ACLS certification, and experience with electronic
medical records. Knowledge of patient monitoring systems and ventilator management essential.
"""
# Generate embeddings
resume_emb = model.encode(resume_tech)
tech_job_emb = model.encode(job_tech)
healthcare_job_emb = model.encode(job_healthcare)
# Calculate cosine similarity
same_domain_similarity = np.dot(resume_emb, tech_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(tech_job_emb)
)
cross_domain_similarity = np.dot(resume_emb, healthcare_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(healthcare_job_emb)
)
print(f"Same Domain (Tech vs. Tech): {same_domain_similarity:.3f}")  # Expected: >0.90
print(f"Cross Domain (Tech vs. Healthcare): {cross_domain_similarity:.3f}")  # Expected: <0.30 or negative
# Domain matching logic
def is_same_domain(similarity, threshold=0.5):
    # Treat the pair as same-domain when similarity exceeds the threshold
    return similarity > threshold
print(f"\nDomain Match: {is_same_domain(same_domain_similarity)}")  # Expected: True
print(f"Domain Match: {is_same_domain(cross_domain_similarity)}")  # Expected: False
Batch Domain Classification
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')
# Multiple resumes from different domains
resumes = [
    "... technology resume ...",
    "... healthcare resume ...",
    "... finance resume ..."
]
# Single job description
job = "... technology job description ..."
# Batch encode
resume_embeddings = model.encode(resumes, batch_size=8, show_progress_bar=True)
job_embedding = model.encode(job)
# Calculate similarities
similarities = cosine_similarity([job_embedding], resume_embeddings)[0]
# Rank by domain compatibility
for idx, score in sorted(enumerate(similarities), key=lambda x: x[1], reverse=True):
    domain_match = "SAME DOMAIN" if score > 0.5 else "DIFFERENT DOMAIN"
    print(f"Resume {idx+1}: {score:.3f} {domain_match}")
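If scikit-learn is not available, the same ranking can be done with NumPy alone, since cosine similarity is just a dot product between L2-normalized vectors. The sketch below uses toy 2-D vectors in place of real `model.encode` outputs; `rank_by_cosine` is a hypothetical helper name, not part of sentence-transformers.

```python
import numpy as np

def rank_by_cosine(job_embedding, resume_embeddings, top_k=None):
    """Return (resume_index, cosine_similarity) pairs, best match first."""
    job = np.asarray(job_embedding, dtype=float)
    resumes = np.asarray(resume_embeddings, dtype=float)
    # L2-normalize so that a plain dot product equals cosine similarity.
    job = job / np.linalg.norm(job)
    resumes = resumes / np.linalg.norm(resumes, axis=1, keepdims=True)
    scores = resumes @ job
    order = np.argsort(-scores)  # descending similarity
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (illustrative only).
toy_job = [1.0, 0.0]
toy_resumes = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
ranking = rank_by_cosine(toy_job, toy_resumes)
for idx, score in ranking:
    print(f"Resume {idx + 1}: {score:.3f}")
```

With real embeddings, pass `resume_embeddings` and `job_embedding` from the example above in place of the toy vectors.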
Integration with Semantic Model (Hybrid ATS System)
from sentence_transformers import SentenceTransformer
import numpy as np
# Load both models
semantic_model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')
domain_model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')
def calculate_ats_score(resume_text, job_text):
    # Semantic similarity (content quality)
    resume_sem = semantic_model.encode(resume_text)
    job_sem = semantic_model.encode(job_text)
    semantic_similarity = np.dot(resume_sem, job_sem) / (
        np.linalg.norm(resume_sem) * np.linalg.norm(job_sem)
    )
    # Domain compatibility (field alignment)
    resume_dom = domain_model.encode(resume_text)
    job_dom = domain_model.encode(job_text)
    domain_similarity = np.dot(resume_dom, job_dom) / (
        np.linalg.norm(resume_dom) * np.linalg.norm(job_dom)
    )
    # Hybrid scoring
    base_score = semantic_similarity * 100
    # Apply domain bonus/penalty
    if domain_similarity > 0.5:
        # Same domain: boost score
        final_score = base_score * 1.2
    else:
        # Different domain: penalize
        final_score = base_score * 0.6
    return {
        'final_score': np.clip(final_score, 0, 100),
        'semantic_score': base_score,
        'domain_similarity': domain_similarity,
        'same_domain': domain_similarity > 0.5
    }
# Example usage (placeholder inputs; substitute real resume and job text)
resume_text = "... resume text ..."
job_text = "... job description ..."
result = calculate_ats_score(resume_text, job_text)
print(f"Final ATS Score: {result['final_score']:.1f}%")
print(f"Semantic Score: {result['semantic_score']:.1f}%")
print(f"Domain Similarity: {result['domain_similarity']:.3f}")
print(f"Same Domain: {result['same_domain']}")
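To see how the boost and penalty behave without loading either model, the scoring rule can be isolated as a pure function. This is a sketch of the arithmetic only; `hybrid_score` is a hypothetical name, and the similarity values passed in are illustrative rather than measured.

```python
def hybrid_score(semantic_similarity, domain_similarity,
                 threshold=0.5, boost=1.2, penalty=0.6):
    """Apply the domain boost/penalty to a semantic score, clipped to 0-100."""
    base = semantic_similarity * 100
    factor = boost if domain_similarity > threshold else penalty
    return min(max(base * factor, 0.0), 100.0)

# Same domain: 80 * 1.2
print(f"{hybrid_score(0.80, 0.92):.1f}")    # 96.0
# Same domain, but 90 * 1.2 = 108 is clipped to the 100 ceiling
print(f"{hybrid_score(0.90, 0.92):.1f}")    # 100.0
# Cross domain: 80 * 0.6
print(f"{hybrid_score(0.80, -0.295):.1f}")  # 48.0
```

Note that the multiplier jumps from 0.6 to 1.2 at the threshold, so two pairs with nearly identical domain similarity on either side of 0.5 end up with final scores that differ by a factor of two. A smoother interpolation may be preferable where that cliff matters.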
Training Details
Dataset
- Source: 0xnbk/resume-domain-triplets-train-v1-en
- Training Triplets: 32,000 (anchor, positive, negative)
- Validation Triplets: 8,000
- Hard Negative Mining: 85% of negatives from confusing domain pairs
  - Technology ↔ Healthcare (overlapping: "systems", "technology")
  - Technology ↔ Design (overlapping: "UX/UI", "product")
  - Finance ↔ Legal (overlapping: "compliance", "regulations")
  - Healthcare ↔ Education (overlapping: "teaching", "training")
- 13 Balanced Domains: Equal representation across all professional fields
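The dataset construction code is not part of this card, but the 85% hard-negative ratio can be illustrated with a small sampler. The sketch below is an assumption about how such sampling could work, not the actual pipeline: the `CONFUSING` mapping only encodes the four pairs listed above, and `sample_negative_domain` is a hypothetical helper.

```python
import random

ALL_DOMAINS = [
    "Technology", "Healthcare", "Finance", "Education", "Legal",
    "Marketing", "Manufacturing", "Design", "Retail/Hospitality",
    "Construction/Real Estate", "HR", "Media/Entertainment",
    "Government/Nonprofit",
]

# Confusing pairs named in this card, stored in both directions (assumed layout).
CONFUSING = {
    "Technology": ["Healthcare", "Design"],
    "Healthcare": ["Technology", "Education"],
    "Design": ["Technology"],
    "Finance": ["Legal"],
    "Legal": ["Finance"],
    "Education": ["Healthcare"],
}

def sample_negative_domain(anchor_domain, rng, hard_ratio=0.85):
    """Pick the domain to draw a negative example from for a given anchor."""
    hard = CONFUSING.get(anchor_domain, [])
    if hard and rng.random() < hard_ratio:
        # Hard negative: a domain that shares vocabulary with the anchor.
        return rng.choice(hard)
    # Easy negative: any remaining domain that is neither the anchor nor hard.
    easy = [d for d in ALL_DOMAINS if d != anchor_domain and d not in hard]
    return rng.choice(easy)

rng = random.Random(42)
print(sample_negative_domain("Technology", rng))
```

Domains without a listed confusing partner (for example HR) always fall back to an easy negative in this sketch.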
Training Configuration
Hardware:
- GPU: NVIDIA A6000 48GB
- Training Time: ~2 hours
- Memory: Optimized with gradient accumulation
Hyperparameters:
- Base Model: 0xnbk/nbk-ats-semantic-v1-en
- Learning Rate: 5e-6 (very low to preserve semantic knowledge)
- Physical Batch Size: 8
- Gradient Accumulation Steps: 16 (effective batch size: 128)
- Epochs: 3
- Warmup Steps: 100
- Optimizer: AdamW
- Loss Function: TripletLoss (cosine distance, margin 0.5)
- Max Sequence Length: 8,192 tokens
TripletLoss Configuration:
- Distance Metric: Cosine distance (1 - cosine similarity)
- Triplet Margin: 0.5
- Hard Negative Strategy: Confusing domain pairs prioritized
- Goal: Minimize distance between same-domain pairs, maximize distance between different-domain pairs
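For readers unfamiliar with the objective, here is a minimal NumPy sketch of a triplet loss with cosine distance and a margin of 0.5, computed for a single triplet. It illustrates the standard formulation `max(d(a, p) - d(a, n) + margin, 0)` on toy vectors; it is not the training code used for this model.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Loss is zero once the negative is at least `margin` farther than the positive."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negative_far = np.array([0.0, 1.0])   # orthogonal to the anchor
negative_near = np.array([1.0, 0.0])  # indistinguishable from the anchor

loss_far = triplet_loss(anchor, positive, negative_far)    # 0 - 1 + 0.5 < 0, so 0.0
loss_near = triplet_loss(anchor, positive, negative_near)  # 0 - 0 + 0.5 = 0.5
print(loss_far, loss_near)
```

A well-separated negative contributes no gradient, which is why hard negatives (those still inside the margin) carry most of the training signal.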
Why Contrastive Learning?
Traditional binary classifiers (cross-encoders) hit a performance ceiling on this task. Contrastive learning with TripletLoss offers several advantages:
- Explicitly teaches relative distances: Model learns "Tech is closer to Tech than to Healthcare"
- Hard negative mining: Forces model to distinguish confusing pairs (Tech vs. Design, Healthcare vs. Education)
- Preserves semantic understanding: Fine-tuning from semantic model maintains content quality awareness
- Single unified embedding space: One model handles both similarity and domain classification
Results:
- Same-domain pairs pushed to >0.90 similarity
- Cross-domain pairs pushed to <0.30 or even negative values
- Perfect 100% accuracy on domain recognition task
Professional Domains Covered
The model demonstrates exceptional performance across 13 professional domains:
| Domain | Recognition | Cross-Domain Separation |
|---|---|---|
| Technology | >0.90 | Strong boundaries |
| Healthcare | >0.90 | Excellent isolation |
| Finance | >0.90 | Very strong separation |
| Education | >0.90 | Clear boundaries |
| Legal | >0.90 | Strong separation |
| Sales/Marketing | >0.90 | Realistic overlap handled |
| Human Resources | >0.90 | Good separation |
| Manufacturing/Operations | >0.90 | Clear distinction |
| Design | >0.90 | Nuanced UX/UI overlap |
| Retail/Hospitality | >0.90 | Service industry distinction |
| Construction/Real Estate | >0.90 | Strong boundaries |
| Government/Nonprofit | >0.90 | Public sector clarity |
| Media/Entertainment | >0.90 | Creative field separation |
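The model produces embeddings rather than domain labels, so assigning one of the 13 labels to a text needs an extra step. One possible approach, sketched below under the assumption that a few reference texts per domain are available, is nearest-centroid classification. The toy 3-D vectors stand in for `model.encode` outputs, and `build_centroids` and `classify` are hypothetical helpers.

```python
import numpy as np

def normalize(x):
    """L2-normalize a vector or each row of a matrix."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_centroids(embeddings_by_domain):
    """Average the normalized reference embeddings of each domain."""
    return {
        domain: normalize(normalize(embs).mean(axis=0))
        for domain, embs in embeddings_by_domain.items()
    }

def classify(embedding, centroids):
    """Return the domain whose centroid has the highest cosine similarity."""
    query = normalize(embedding)
    scores = {domain: float(query @ c) for domain, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy reference embeddings (illustrative only, not real model outputs).
reference = {
    "Technology": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "Healthcare": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
centroids = build_centroids(reference)
label, score = classify([0.8, 0.2, 0.0], centroids)
print(label, round(score, 3))
```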
Limitations
- Language: Optimized for English only
- Threshold Tuning: Default threshold (0.5) may need adjustment for specific use cases
- Hybrid Roles: Modern roles spanning multiple domains (e.g., "Health Tech") may show intermediate similarities
- Emerging Fields: New professional domains (e.g., "Web3", "AI Ethics") may not have dedicated training examples
- Domain Granularity: Works at broad domain level; sub-specializations within domains are not distinguished
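For the threshold-tuning point above, a simple option is to sweep candidate thresholds over a small set of labeled pairs and keep the most accurate one. The sketch below assumes such a labeled set exists; the similarity values are made up for illustration (loosely following ranges quoted in this card), and `best_threshold` is a hypothetical helper.

```python
import numpy as np

def best_threshold(similarities, is_same_domain, candidates=None):
    """Return (threshold, accuracy) maximizing accuracy of `similarity > threshold`."""
    sims = np.asarray(similarities, dtype=float)
    labels = np.asarray(is_same_domain, dtype=bool)
    if candidates is None:
        candidates = np.linspace(-1.0, 1.0, 201)  # steps of 0.01
    accuracies = [np.mean((sims > t) == labels) for t in candidates]
    best = int(np.argmax(accuracies))  # first candidate with the top accuracy
    return float(candidates[best]), float(accuracies[best])

# Hypothetical labeled pairs: four same-domain, four cross-domain.
sims = [0.95, 0.92, 0.94, 0.70, -0.295, -0.18, -0.25, 0.62]
labels = [True, True, True, True, False, False, False, False]
threshold, accuracy = best_threshold(sims, labels)
print(f"threshold={threshold:.2f}, accuracy={accuracy:.2f}")
```

On real data, accuracy may not be the right target; if false rejections of hybrid-role candidates are costly, optimizing recall at a fixed precision may be a better fit.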
Ethical Considerations
- Bias Awareness: Domain classification may inherit biases from LinkedIn job posting data
- Transparency: Domain matching is algorithmically derived and should supplement, not replace, human judgment
- Privacy: No PII included in training; users should handle resume data responsibly
- Responsible Use: Should be used as a screening aid to filter cross-domain mismatches, not as sole hiring decision factor
- Fairness: Validate that domain penalties don't disproportionately impact career changers or candidates from underrepresented backgrounds
Citation
@model{nbk_ats_domain_v1,
author = {NBK},
title = {NBK ATS Domain Classifier v1 (English)},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/0xnbk/nbk-ats-domain-v1-en}
}
Training Dataset Citation
@dataset{resume_domain_triplets_v1,
author = {NBK},
title = {Resume-Domain Triplets Dataset v1 (English)},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/0xnbk/resume-domain-triplets-train-v1-en}
}
Base Model Citation
@model{nbk_ats_semantic_v1,
author = {NBK},
title = {NBK ATS Semantic Model v1 (English)},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/0xnbk/nbk-ats-semantic-v1-en}
}
License
This model is released under the Apache 2.0 License.
Copyright 2025 NBK (nbk.dev)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Related Resources
- Base Semantic Model: 0xnbk/nbk-ats-semantic-v1-en
- Training Dataset: 0xnbk/resume-domain-triplets-train-v1-en
- Semantic Training Dataset: 0xnbk/resume-ats-score-v1-en
- Application: LOCAL ATS - Privacy-first ATS Resume Analyzer
Updates and Maintenance
- Version: 1.0.0
- Last Updated: October 2025
- Maintained by: NBK (nbk.dev)
Contact
For questions, suggestions, or collaboration opportunities:
- GitHub: 0xnbk/localATS
- HuggingFace: @0xnbk
- Website: nbk.dev