NBK ATS Domain Classifier v1 (English)

nbk-ats-domain-v1-en is a specialized sentence transformer fine-tuned using contrastive learning (Triplet Loss) for professional domain classification. This model excels at distinguishing between different professional domains in resumes and job descriptions, solving the critical "cross-domain score inflation" problem in ATS (Applicant Tracking System) applications.

Key Features

🎯 Exceptional Domain Separation: Same-domain similarity scores exceed 0.90, while cross-domain scores can go negative (e.g., Tech vs. Healthcare: -0.295)
🔥 Contrastive Learning: Trained with TripletLoss using 40,000 hard-negative triplets (32,000 train, 8,000 validation)
🚀 Extended Context: 8,192 tokens - processes full-length resumes and job descriptions
🏗️ Built on Excellence: Fine-tuned from nbk-ats-semantic-v1-en, inheriting its powerful semantic understanding
🌐 13 Professional Domains: Technology, Healthcare, Finance, Education, Legal, Sales/Marketing, Human Resources, Manufacturing/Operations, Design, Retail/Hospitality, Construction/Real Estate, Government/Nonprofit, Media/Entertainment
✅ Validated: 100% accuracy on same-domain recognition and no cross-domain pair scoring above 0.90 in the reported validation results

Model Specification

Base Model: 0xnbk/nbk-ats-semantic-v1-en (fine-tuned from jinaai/jina-embeddings-v2-small-en)
Training Method: Contrastive Learning with TripletLoss
Fine-tuning Dataset: 0xnbk/resume-domain-triplets-train-v1-en
Embedding Dimension: 512D
Max Sequence Length: 8,192 tokens (~32,000 characters)
Architecture: 4-layer BERT with ALiBi position embeddings
Training Loss: TripletLoss with cosine distance (margin: 0.5)
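
The specification figures allow a few back-of-envelope checks. The sketch below assumes embeddings are stored as float32 and reuses the roughly 4 characters per token ratio implied by "8,192 tokens (~32,000 characters)". The helper names are hypothetical, and the character heuristic is only a rough pre-check, not a substitute for the tokenizer.

```python
EMBEDDING_DIM = 512          # from the model specification
MAX_TOKENS = 8192            # from the model specification
CHARS_PER_TOKEN = 4          # assumption: ratio implied by "~32,000 characters"
BYTES_PER_FLOAT32 = 4        # assumption: embeddings are stored as float32


def embedding_storage_bytes(num_documents: int) -> int:
    """Raw storage needed to keep one embedding per document."""
    return num_documents * EMBEDDING_DIM * BYTES_PER_FLOAT32


def probably_fits_context(text: str) -> bool:
    """Rough pre-check: True if the text is likely within the token limit."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= MAX_TOKENS


print(embedding_storage_bytes(1))            # 2048 bytes per document
print(embedding_storage_bytes(100_000))      # 204800000 bytes, about 205 MB
print(probably_fits_context("x" * 30_000))   # True
print(probably_fits_context("x" * 40_000))   # False
```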

Performance

Validation Results

Same-Domain Performance:

Similarity Score: Consistently >0.90 for matching domains
Perfect Recognition: 100% accuracy in identifying same-domain pairs
Examples:
- Technology ↔ Technology: 0.95+ ✅
- Healthcare ↔ Healthcare: 0.92+ ✅
- Finance ↔ Finance: 0.94+ ✅

Cross-Domain Discrimination:

Aggressive Separation: Scores pushed to very low or negative values
No False Positives: Zero cross-domain pairs score >0.90
Examples:
- Tech ↔ Healthcare: -0.295 ✅ Exceptional separation
- Tech ↔ Finance: -0.180 ✅ Strong separation
- Design ↔ Finance: -0.250 ✅ Clear boundaries

Nuanced Understanding:

Realistic Overlap Detection: Related domains show appropriate similarity
- Marketing ↔ Sales: ~0.85-0.90 (realistic business overlap)
- Design ↔ Technology: 0.65-0.75 (modern UX/UI roles)
- Healthcare ↔ Education: 0.60-0.70 (teaching hospitals)

Training Performance (A6000 48GB GPU)

Training Configuration:

Hardware: NVIDIA A6000 48GB
Training Time: ~2 hours
Physical Batch Size: 8
Gradient Accumulation: 16 steps (effective batch size: 128)
Learning Rate: 5e-6
Epochs: 3
Triplet Margin: 0.5

Dataset Quality:

Total Triplets: 40,000 (32,000 train, 8,000 validation)
Hard Negative Mining: 85% of negatives are "confusing" domain pairs
13 Balanced Domains: Equal representation across all professional fields
Source: Real LinkedIn job postings (654,650 processed)

Installation

pip install sentence-transformers

Usage

Domain Classification

from sentence_transformers import SentenceTransformer
import numpy as np

# Load domain-specialized model
model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

# Example: Resume-Job domain matching
resume_tech = """
Senior Software Engineer with 8 years of experience in Python, Django, and React.
Led development of microservices architecture serving 10M+ users. Expert in AWS,
Docker, Kubernetes, and CI/CD pipelines. Strong background in agile methodologies
and cross-functional team leadership.
"""

job_tech = """
We're seeking a Senior Backend Engineer with 5+ years Python experience.
Must have expertise in Django, microservices, and cloud platforms (AWS/GCP).
Experience with containerization (Docker/Kubernetes) and modern DevOps practices required.
"""

job_healthcare = """
Registered Nurse position in ICU requiring 3+ years critical care experience.
Must have current RN license, ACLS certification, and experience with electronic
medical records. Knowledge of patient monitoring systems and ventilator management essential.
"""

# Generate embeddings
resume_emb = model.encode(resume_tech)
tech_job_emb = model.encode(job_tech)
healthcare_job_emb = model.encode(job_healthcare)

# Calculate cosine similarity
same_domain_similarity = np.dot(resume_emb, tech_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(tech_job_emb)
)
cross_domain_similarity = np.dot(resume_emb, healthcare_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(healthcare_job_emb)
)

print(f"Same Domain (Tech → Tech): {same_domain_similarity:.3f}")  # Expected: >0.90
print(f"Cross Domain (Tech → Healthcare): {cross_domain_similarity:.3f}")  # Expected: <0.30 or negative

# Domain matching logic
def is_same_domain(similarity, threshold=0.5):
    return similarity > threshold

print(f"\nDomain Match: {is_same_domain(same_domain_similarity)}")  # True
print(f"Domain Match: {is_same_domain(cross_domain_similarity)}")  # False

Batch Domain Classification

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

# Multiple resumes from different domains
resumes = [
    "... technology resume ...",
    "... healthcare resume ...",
    "... finance resume ..."
]

# Single job description
job = "... technology job description ..."

# Batch encode
resume_embeddings = model.encode(resumes, batch_size=8, show_progress_bar=True)
job_embedding = model.encode(job)

# Calculate similarities
similarities = cosine_similarity([job_embedding], resume_embeddings)[0]

# Rank by domain compatibility
for idx, score in sorted(enumerate(similarities), key=lambda x: x[1], reverse=True):
    domain_match = "✅ SAME DOMAIN" if score > 0.5 else "❌ DIFFERENT DOMAIN"
    print(f"Resume {idx+1}: {score:.3f} {domain_match}")
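
The examples above compare pairs of texts. To assign a single domain label, one possible approach is nearest-prototype classification: embed one reference description per domain and pick the closest. This is a minimal sketch, not a documented feature of the model. The function names, the prototype idea, and the toy 3-dimensional vectors (standing in for real 512-dimensional embeddings) are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_domain(text_emb, prototype_embs, min_similarity=0.5):
    """Return (domain, score) for the closest prototype, or (None, score)
    when even the best match falls below min_similarity."""
    best_domain, best_score = max(
        ((domain, cosine(text_emb, emb)) for domain, emb in prototype_embs.items()),
        key=lambda item: item[1],
    )
    if best_score < min_similarity:
        return None, best_score
    return best_domain, best_score


# Hypothetical toy prototypes; real ones would come from model.encode(...).
prototypes = {
    "Technology": [1.0, 0.0, 0.0],
    "Healthcare": [0.0, 1.0, 0.0],
    "Finance": [0.0, 0.0, 1.0],
}

print(classify_domain([0.9, 0.1, 0.0], prototypes))     # closest to Technology
print(classify_domain([-1.0, -1.0, -1.0], prototypes))  # no domain is close enough
```

With the real model, each prototype could be the embedding of a representative job description for that domain. How well a single prototype per domain works is untested here.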

Integration with Semantic Model (Hybrid ATS System)

from sentence_transformers import SentenceTransformer
import numpy as np

# Load both models
semantic_model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')
domain_model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

def calculate_ats_score(resume_text, job_text):
    # Semantic similarity (content quality)
    resume_sem = semantic_model.encode(resume_text)
    job_sem = semantic_model.encode(job_text)
    semantic_similarity = np.dot(resume_sem, job_sem) / (
        np.linalg.norm(resume_sem) * np.linalg.norm(job_sem)
    )

    # Domain compatibility (field alignment)
    resume_dom = domain_model.encode(resume_text)
    job_dom = domain_model.encode(job_text)
    domain_similarity = np.dot(resume_dom, job_dom) / (
        np.linalg.norm(resume_dom) * np.linalg.norm(job_dom)
    )

    # Hybrid scoring
    base_score = semantic_similarity * 100

    # Apply domain bonus/penalty
    if domain_similarity > 0.5:
        # Same domain: boost score
        final_score = base_score * 1.2
    else:
        # Different domain: penalize
        final_score = base_score * 0.6

    return {
        'final_score': np.clip(final_score, 0, 100),
        'semantic_score': base_score,
        'domain_similarity': domain_similarity,
        'same_domain': domain_similarity > 0.5
    }

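# Example usage (replace the placeholders with real resume and job text)
resume_text = "... resume text ..."
job_text = "... job description ..."
result = calculate_ats_score(resume_text, job_text)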
print(f"Final ATS Score: {result['final_score']:.1f}%")
print(f"Semantic Score: {result['semantic_score']:.1f}%")
print(f"Domain Similarity: {result['domain_similarity']:.3f}")
print(f"Same Domain: {result['same_domain']}")
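
The hard 0.5 cutoff in the hybrid score means a pair at 0.49 is penalized while a pair at 0.51 is boosted. One possible alternative, sketched below, interpolates the multiplier linearly between the same 0.6 penalty and 1.2 boost. The function name and the `low`/`high` bounds are assumptions, chosen to match the reported ranges (cross-domain below 0.30, same-domain above 0.90), and have not been validated.

```python
def domain_multiplier(domain_similarity, low=0.3, high=0.9,
                      penalty=0.6, boost=1.2):
    """Interpolate linearly between penalty and boost.

    At or below `low` the full penalty applies, at or above `high` the
    full boost, and scores in between get a proportional multiplier.
    """
    if domain_similarity <= low:
        return penalty
    if domain_similarity >= high:
        return boost
    fraction = (domain_similarity - low) / (high - low)
    return penalty + fraction * (boost - penalty)


for sim in (-0.295, 0.30, 0.60, 0.90, 0.95):
    print(f"{sim:+.3f} -> x{domain_multiplier(sim):.2f}")
```

A pair in the 0.60-0.75 range, such as the Design and Technology overlap reported earlier, would then receive a multiplier close to neutral instead of the full boost.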

Training Details

Dataset

Source: 0xnbk/resume-domain-triplets-train-v1-en
Training Triplets: 32,000 (anchor, positive, negative)
Validation Triplets: 8,000
Hard Negative Mining: 85% of negatives from confusing domain pairs
- Technology ↔ Healthcare (overlapping: "systems", "technology")
- Technology ↔ Design (overlapping: "UX/UI", "product")
- Finance ↔ Legal (overlapping: "compliance", "regulations")
- Healthcare ↔ Education (overlapping: "teaching", "training")
13 Balanced Domains: Equal representation across all professional fields
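
The actual mining pipeline is not described here, so the following is only a sketch of how triplets with an 85% hard-negative ratio could be sampled. The `CONFUSING` map mirrors the four pairs listed above; `sample_triplet`, its parameters, and the toy texts are hypothetical.

```python
import random

# Confusing pairs listed above, stored symmetrically.
CONFUSING = {
    "Technology": ["Healthcare", "Design"],
    "Healthcare": ["Technology", "Education"],
    "Design": ["Technology"],
    "Finance": ["Legal"],
    "Legal": ["Finance"],
    "Education": ["Healthcare"],
}


def sample_triplet(texts_by_domain, rng, hard_ratio=0.85):
    """Draw one (anchor, positive, negative) triplet.

    Returns the three texts plus a flag saying whether the negative was
    drawn through the hard-negative branch.
    """
    anchor_domain = rng.choice(sorted(texts_by_domain))
    # Anchor and positive are two distinct texts from the same domain.
    anchor, positive = rng.sample(texts_by_domain[anchor_domain], 2)

    hard_candidates = [d for d in CONFUSING.get(anchor_domain, [])
                       if d in texts_by_domain]
    use_hard = bool(hard_candidates) and rng.random() < hard_ratio
    if use_hard:
        negative_domain = rng.choice(hard_candidates)
    else:
        # Fallback: any other domain, chosen uniformly.
        others = [d for d in sorted(texts_by_domain) if d != anchor_domain]
        negative_domain = rng.choice(others)
    negative = rng.choice(texts_by_domain[negative_domain])
    return anchor, positive, negative, use_hard


# Toy corpus: three placeholder texts per domain.
toy = {d: [f"{d} text {i}" for i in range(3)] for d in CONFUSING}
rng = random.Random(0)
triplets = [sample_triplet(toy, rng) for _ in range(10_000)]
hard_share = sum(t[3] for t in triplets) / len(triplets)
print(f"hard-negative branch used for {hard_share:.1%} of triplets")
```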

Training Configuration

Hardware:

GPU: NVIDIA A6000 48GB
Training Time: ~2 hours
Memory: Optimized with gradient accumulation

Hyperparameters:

Base Model: 0xnbk/nbk-ats-semantic-v1-en
Learning Rate: 5e-6 (very low to preserve semantic knowledge)
Physical Batch Size: 8
Gradient Accumulation Steps: 16 (effective batch size: 128)
Epochs: 3
Warmup Steps: 100
Optimizer: AdamW
Loss Function: TripletLoss (cosine distance, margin 0.5)
Max Sequence Length: 8,192 tokens
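
The hyperparameters above determine the number of optimizer steps. The arithmetic below is a sketch that assumes warmup is counted in optimizer steps (after gradient accumulation) and that a trailing partial batch still triggers a step. Neither detail is stated in the configuration.

```python
import math

train_triplets = 32_000
physical_batch_size = 8
grad_accum_steps = 16
epochs = 3
warmup_steps = 100

effective_batch_size = physical_batch_size * grad_accum_steps
# Assumption: a trailing partial batch still triggers an optimizer step.
steps_per_epoch = math.ceil(train_triplets / effective_batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch_size)       # 128
print(steps_per_epoch)            # 250
print(total_steps)                # 750
print(f"{warmup_fraction:.1%}")   # 13.3%
```

Under these assumptions the 100 warmup steps cover roughly the first 13% of training.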

TripletLoss Configuration:

Distance Metric: Cosine distance (1 - cosine similarity)
Triplet Margin: 0.5
Hard Negative Strategy: Confusing domain pairs prioritized
Goal: Pull same-domain pairs together and push different-domain pairs apart until each negative is at least the margin farther from the anchor than the positive
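
For a single triplet, the standard triplet loss with cosine distance is max(0, d(a, p) - d(a, n) + margin), where d = 1 - cosine similarity. The pure-Python sketch below illustrates that formula with the 0.5 margin used here. It is an illustration with toy 2-dimensional vectors, not the training code.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with d = cosine distance."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [1.0, 0.0]
positive = [1.0, 0.0]

# Easy negative: orthogonal to the anchor, so the margin is already satisfied.
print(f"{triplet_loss(anchor, positive, [0.0, 1.0]):.3f}")   # 0.000

# Hard negative: almost identical to the anchor, so the loss is close to the margin.
print(f"{triplet_loss(anchor, positive, [1.0, 0.1]):.3f}")   # 0.495
```

Because the margin is applied to cosine distance, a triplet stops contributing to the loss once the anchor-positive similarity exceeds the anchor-negative similarity by at least 0.5.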

Why Contrastive Learning?

Earlier attempts with a binary classifier (cross-encoder) hit a performance ceiling. Contrastive learning with TripletLoss addresses this in four ways:

Explicitly teaches relative distances: Model learns "Tech is closer to Tech than to Healthcare"
Hard negative mining: Forces model to distinguish confusing pairs (Tech vs. Design, Healthcare vs. Education)
Preserves semantic understanding: Fine-tuning from semantic model maintains content quality awareness
Single unified embedding space: One model handles both similarity and domain classification

Results:

Same-domain pairs pushed above 0.90 similarity
Cross-domain pairs pushed below 0.30, in some cases to negative values
100% accuracy on the domain recognition task

Professional Domains Covered

The model demonstrates exceptional performance across 13 professional domains:

Domain	Recognition	Cross-Domain Separation
Technology	✅ >0.90	✅ Strong boundaries
Healthcare	✅ >0.90	✅ Excellent isolation
Finance	✅ >0.90	✅ Very strong separation
Education	✅ >0.90	✅ Clear boundaries
Legal	✅ >0.90	✅ Strong separation
Sales/Marketing	✅ >0.90	✅ Realistic overlap handled
Human Resources	✅ >0.90	✅ Good separation
Manufacturing/Operations	✅ >0.90	✅ Clear distinction
Design	✅ >0.90	✅ Nuanced UX/UI overlap
Retail/Hospitality	✅ >0.90	✅ Service industry distinction
Construction/Real Estate	✅ >0.90	✅ Strong boundaries
Government/Nonprofit	✅ >0.90	✅ Public sector clarity
Media/Entertainment	✅ >0.90	✅ Creative field separation

Limitations

Language: Optimized for English only
Threshold Tuning: Default threshold (0.5) may need adjustment for specific use cases
Hybrid Roles: Modern roles spanning multiple domains (e.g., "Health Tech") may show intermediate similarities
Emerging Fields: New professional domains (e.g., "Web3", "AI Ethics") may not have dedicated training examples
Domain Granularity: Works at broad domain level; sub-specializations within domains are not distinguished
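
Threshold tuning can be done on a small set of labeled pairs from the target use case. The sketch below sweeps candidate thresholds and keeps the one with the highest accuracy. The function names are hypothetical, and the toy scores and labels are invented for illustration, not measured results.

```python
def accuracy(pairs, threshold):
    """Share of pairs where (similarity > threshold) matches the label."""
    correct = sum((sim > threshold) == label for sim, label in pairs)
    return correct / len(pairs)


def best_threshold(scored_pairs, candidates=None):
    """Pick the candidate threshold with the highest accuracy.

    scored_pairs: iterable of (similarity, is_same_domain) tuples.
    Ties are resolved in favor of the lowest candidate.
    """
    pairs = list(scored_pairs)
    if candidates is None:
        candidates = [i / 20 for i in range(-10, 20)]  # -0.50 ... 0.95
    return max(candidates, key=lambda t: accuracy(pairs, t))


# Hypothetical labeled pairs: (domain similarity, same domain?)
toy_pairs = [
    (0.95, True),
    (0.92, True),
    (0.88, True),
    (0.72, False),   # related but different domains
    (0.66, False),
    (-0.295, False),
]

print(f"accuracy at 0.5: {accuracy(toy_pairs, 0.5):.2f}")   # 0.67
print(f"best threshold: {best_threshold(toy_pairs)}")       # 0.75
```

On real data, a metric such as F1 or a cost-weighted error may be a better selection criterion than plain accuracy, particularly when same-domain and cross-domain pairs are imbalanced.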

Ethical Considerations

Bias Awareness: Domain classification may inherit biases from LinkedIn job posting data
Transparency: Domain matching is algorithmically derived and should supplement, not replace, human judgment
Privacy: No PII included in training; users should handle resume data responsibly
Responsible Use: Should be used as a screening aid to filter cross-domain mismatches, not as sole hiring decision factor
Fairness: Validate that domain penalties don't disproportionately impact career changers or candidates from underrepresented backgrounds
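
One simple way to start the fairness check is to compare how often each candidate group receives the domain penalty. This is a minimal sketch. The group labels, the records, and the function name are hypothetical, and the 0.5 threshold mirrors the default used in the usage examples.

```python
from collections import defaultdict


def penalty_rate_by_group(records, threshold=0.5):
    """Share of candidates per group whose domain similarity is at or
    below the threshold, i.e. who would receive the domain penalty."""
    totals = defaultdict(int)
    penalized = defaultdict(int)
    for group, domain_similarity in records:
        totals[group] += 1
        if domain_similarity <= threshold:
            penalized[group] += 1
    return {group: penalized[group] / totals[group] for group in totals}


# Hypothetical audit records: (group label, domain similarity to the job)
records = [
    ("same_field", 0.93),
    ("same_field", 0.91),
    ("same_field", 0.42),
    ("career_changer", 0.35),
    ("career_changer", 0.61),
    ("career_changer", 0.12),
]

for group, rate in penalty_rate_by_group(records).items():
    print(f"{group}: {rate:.0%} penalized")
```

A large gap between groups does not prove unfairness on its own, but it indicates where human review of the penalized candidates is most needed.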

Citation

@misc{nbk_ats_domain_v1,
  author = {NBK},
  title = {NBK ATS Domain Classifier v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/0xnbk/nbk-ats-domain-v1-en}
}

Training Dataset Citation

@dataset{resume_domain_triplets_v1,
  author = {NBK},
  title = {Resume-Domain Triplets Dataset v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/0xnbk/resume-domain-triplets-train-v1-en}
}

Base Model Citation

@misc{nbk_ats_semantic_v1,
  author = {NBK},
  title = {NBK ATS Semantic Model v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/0xnbk/nbk-ats-semantic-v1-en}
}

License

This model is released under the Apache 2.0 License.

Copyright 2025 NBK (nbk.dev)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Related Resources

Base Semantic Model: 0xnbk/nbk-ats-semantic-v1-en
Training Dataset: 0xnbk/resume-domain-triplets-train-v1-en
Semantic Training Dataset: 0xnbk/resume-ats-score-v1-en
Application: LOCAL ATS - Privacy-first ATS Resume Analyzer

Updates and Maintenance

Version: 1.0.0
Last Updated: October 2025
Maintained by: NBK (nbk.dev)

Contact

For questions, suggestions, or collaboration opportunities:

GitHub: 0xnbk/localATS
HuggingFace: @0xnbk
Website: nbk.dev

Safetensors

Model Size: 32.7M params
Tensor Type: F32
