NBK ATS Domain Classifier v1 (English)

nbk-ats-domain-v1-en is a specialized sentence transformer fine-tuned using contrastive learning (Triplet Loss) for professional domain classification. This model excels at distinguishing between different professional domains in resumes and job descriptions, solving the critical "cross-domain score inflation" problem in ATS (Applicant Tracking System) applications.

Key Features

  • 🎯 Exceptional Domain Separation: Same-domain similarity scores >0.90, cross-domain scores can reach negative values (e.g., Tech vs. Healthcare: -0.295)
  • 🔥 Contrastive Learning: Trained with TripletLoss using 40,000 hard-negative triplets (32,000 train, 8,000 validation)
  • 🚀 Extended Context: 8,192 tokens - processes full-length resumes and job descriptions
  • 🏗️ Built on Excellence: Fine-tuned from nbk-ats-semantic-v1-en, inheriting its powerful semantic understanding
  • 🌐 13 Professional Domains: Technology, Healthcare, Finance, Education, Legal, Marketing, Manufacturing, Design, Retail/Hospitality, Construction/Real Estate, HR, Media/Entertainment, Government/Nonprofit
  • ✅ Production Validated: 100% accuracy on same-domain recognition, perfect cross-domain discrimination

Model Specification

  • Base Model: 0xnbk/nbk-ats-semantic-v1-en (fine-tuned from jinaai/jina-embeddings-v2-small-en)
  • Training Method: Contrastive Learning with TripletLoss
  • Fine-tuning Dataset: 0xnbk/resume-domain-triplets-train-v1-en
  • Embedding Dimension: 512D
  • Max Sequence Length: 8,192 tokens (~32,000 characters)
  • Architecture: 4-layer BERT with ALiBi position embeddings
  • Training Loss: TripletLoss with cosine distance (margin: 0.5)

Performance

Validation Results

Same-Domain Performance:

  • Similarity Score: Consistently >0.90 for matching domains
  • Perfect Recognition: 100% accuracy in identifying same-domain pairs
  • Examples:
    • Technology ↔ Technology: 0.95+ ✅
    • Healthcare ↔ Healthcare: 0.92+ ✅
    • Finance ↔ Finance: 0.94+ ✅

Cross-Domain Discrimination:

  • Aggressive Separation: Scores pushed to very low or negative values
  • No False Positives: Zero cross-domain pairs score >0.90
  • Examples:
    • Tech ↔ Healthcare: -0.295 ✅ Exceptional separation
    • Tech ↔ Finance: -0.180 ✅ Strong separation
    • Design ↔ Finance: -0.250 ✅ Clear boundaries

Nuanced Understanding:

  • Realistic Overlap Detection: Related domains show appropriate similarity
    • Marketing ↔ Sales: ~0.85-0.90 (realistic business overlap)
    • Design ↔ Technology: 0.65-0.75 (modern UX/UI roles)
    • Healthcare ↔ Education: 0.60-0.70 (teaching hospitals)
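
The ranges above suggest a coarse three-way reading of a similarity score. Here is a minimal sketch, assuming cut-offs of 0.90 and 0.60 read off the examples in this section. The function name `similarity_band` and the band labels are hypothetical and are not part of the model's API.

```python
def similarity_band(score: float, same_cutoff: float = 0.90,
                    related_cutoff: float = 0.60) -> str:
    """Map a cosine similarity to a coarse label (illustrative cut-offs, not tuned)."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    if score > same_cutoff:
        return "same domain"
    if score >= related_cutoff:
        return "related domains"
    return "different domains"


# Scores quoted in the validation examples above
print(similarity_band(0.95))    # same domain
print(similarity_band(0.70))    # related domains
print(similarity_band(-0.295))  # different domains
```

The cut-offs are a starting point only and should be re-checked against your own labelled pairs.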

Training Performance (A6000 48GB GPU)

Training Configuration:

  • Hardware: NVIDIA A6000 48GB
  • Training Time: ~2 hours
  • Physical Batch Size: 8
  • Gradient Accumulation: 16 steps (effective batch size: 128)
  • Learning Rate: 5e-6
  • Epochs: 3
  • Triplet Margin: 0.5

Dataset Quality:

  • Total Triplets: 40,000 (32,000 train, 8,000 validation)
  • Hard Negative Mining: 85% of negatives are "confusing" domain pairs
  • 13 Balanced Domains: Equal representation across all professional fields
  • Source: Real LinkedIn job postings (654,650 processed)

Installation

pip install sentence-transformers

Usage

Domain Classification

from sentence_transformers import SentenceTransformer
import numpy as np

# Load domain-specialized model
model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

# Example: Resume-Job domain matching
resume_tech = """
Senior Software Engineer with 8 years of experience in Python, Django, and React.
Led development of microservices architecture serving 10M+ users. Expert in AWS,
Docker, Kubernetes, and CI/CD pipelines. Strong background in agile methodologies
and cross-functional team leadership.
"""

job_tech = """
We're seeking a Senior Backend Engineer with 5+ years Python experience.
Must have expertise in Django, microservices, and cloud platforms (AWS/GCP).
Experience with containerization (Docker/Kubernetes) and modern DevOps practices required.
"""

job_healthcare = """
Registered Nurse position in ICU requiring 3+ years critical care experience.
Must have current RN license, ACLS certification, and experience with electronic
medical records. Knowledge of patient monitoring systems and ventilator management essential.
"""

# Generate embeddings
resume_emb = model.encode(resume_tech)
tech_job_emb = model.encode(job_tech)
healthcare_job_emb = model.encode(job_healthcare)

# Calculate cosine similarity
same_domain_similarity = np.dot(resume_emb, tech_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(tech_job_emb)
)
cross_domain_similarity = np.dot(resume_emb, healthcare_job_emb) / (
    np.linalg.norm(resume_emb) * np.linalg.norm(healthcare_job_emb)
)

print(f"Same Domain (Tech → Tech): {same_domain_similarity:.3f}")  # Expected: >0.90
print(f"Cross Domain (Tech → Healthcare): {cross_domain_similarity:.3f}")  # Expected: <0.30 or negative

# Domain matching logic
def is_same_domain(similarity, threshold=0.5):
    return similarity > threshold

print(f"\nDomain Match: {is_same_domain(same_domain_similarity)}")  # True
print(f"Domain Match: {is_same_domain(cross_domain_similarity)}")  # False

Batch Domain Classification

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

# Multiple resumes from different domains
resumes = [
    "... technology resume ...",
    "... healthcare resume ...",
    "... finance resume ..."
]

# Single job description
job = "... technology job description ..."

# Batch encode
resume_embeddings = model.encode(resumes, batch_size=8, show_progress_bar=True)
job_embedding = model.encode(job)

# Calculate similarities
similarities = cosine_similarity([job_embedding], resume_embeddings)[0]

# Rank by domain compatibility
for idx, score in sorted(enumerate(similarities), key=lambda x: x[1], reverse=True):
    domain_match = "✅ SAME DOMAIN" if score > 0.5 else "❌ DIFFERENT DOMAIN"
    print(f"Resume {idx+1}: {score:.3f} {domain_match}")

Integration with Semantic Model (Hybrid ATS System)

from sentence_transformers import SentenceTransformer
import numpy as np

# Load both models
semantic_model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')
domain_model = SentenceTransformer('0xnbk/nbk-ats-domain-v1-en')

def calculate_ats_score(resume_text, job_text):
    # Semantic similarity (content quality)
    resume_sem = semantic_model.encode(resume_text)
    job_sem = semantic_model.encode(job_text)
    semantic_similarity = np.dot(resume_sem, job_sem) / (
        np.linalg.norm(resume_sem) * np.linalg.norm(job_sem)
    )

    # Domain compatibility (field alignment)
    resume_dom = domain_model.encode(resume_text)
    job_dom = domain_model.encode(job_text)
    domain_similarity = np.dot(resume_dom, job_dom) / (
        np.linalg.norm(resume_dom) * np.linalg.norm(job_dom)
    )

    # Hybrid scoring
    base_score = semantic_similarity * 100

    # Apply domain bonus/penalty
    if domain_similarity > 0.5:
        # Same domain: boost score
        final_score = base_score * 1.2
    else:
        # Different domain: penalize
        final_score = base_score * 0.6

    return {
        'final_score': np.clip(final_score, 0, 100),
        'semantic_score': base_score,
        'domain_similarity': domain_similarity,
        'same_domain': domain_similarity > 0.5
    }

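# Example usage (replace the placeholders with real resume and job texts)
resume_text = "... resume text ..."
job_text = "... job description ..."
result = calculate_ats_score(resume_text, job_text)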
print(f"Final ATS Score: {result['final_score']:.1f}%")
print(f"Semantic Score: {result['semantic_score']:.1f}%")
print(f"Domain Similarity: {result['domain_similarity']:.3f}")
print(f"Same Domain: {result['same_domain']}")
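
To make the scoring rule concrete without loading either model, the sketch below applies the same boost and penalty to precomputed similarities. The helper name `hybrid_score` and the input values are illustrative assumptions. The multipliers (1.2 and 0.6) and the 0.5 threshold come from the example above.

```python
def hybrid_score(semantic_similarity: float, domain_similarity: float,
                 threshold: float = 0.5) -> float:
    """Apply the domain boost/penalty from the example above to precomputed similarities."""
    base_score = semantic_similarity * 100
    multiplier = 1.2 if domain_similarity > threshold else 0.6
    # Clamp to the 0-100 range, mirroring np.clip in the example above
    return max(0.0, min(100.0, base_score * multiplier))


# Hypothetical inputs, chosen only to show the arithmetic
print(round(hybrid_score(0.80, 0.93), 1))    # 96.0  (80 * 1.2)
print(round(hybrid_score(0.80, -0.295), 1))  # 48.0  (80 * 0.6)
print(round(hybrid_score(0.90, 0.95), 1))    # 100.0 (108 clipped to 100)
```

Note that any semantic similarity above about 0.83 saturates at 100 once the same-domain boost is applied, so the top of the scale stops discriminating between strong candidates.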

Training Details

Dataset

  • Source: 0xnbk/resume-domain-triplets-train-v1-en
  • Training Triplets: 32,000 (anchor, positive, negative)
  • Validation Triplets: 8,000
  • Hard Negative Mining: 85% of negatives from confusing domain pairs
    • Technology ↔ Healthcare (overlapping: "systems", "technology")
    • Technology ↔ Design (overlapping: "UX/UI", "product")
    • Finance ↔ Legal (overlapping: "compliance", "regulations")
    • Healthcare ↔ Education (overlapping: "teaching", "training")
  • 13 Balanced Domains: Equal representation across all professional fields
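
As a rough illustration of the sampling scheme described above, the sketch below draws the negative's domain from a table of confusing pairs 85% of the time and from any other domain otherwise. Only the 85% ratio and the four confusing pairs come from this card. The function name, the fallback rule, and the shortened domain list are assumptions, and the actual dataset-construction code may differ.

```python
import random

# Shortened domain list for illustration (the card covers 13 domains)
DOMAINS = ["Technology", "Healthcare", "Finance", "Education", "Legal", "Design"]

# Confusing pairs listed in this card, stored symmetrically
CONFUSING = {
    "Technology": ["Healthcare", "Design"],
    "Healthcare": ["Technology", "Education"],
    "Design": ["Technology"],
    "Finance": ["Legal"],
    "Legal": ["Finance"],
    "Education": ["Healthcare"],
}


def sample_negative_domain(anchor_domain, rng, hard_ratio=0.85):
    """Return (negative_domain, is_hard) for one triplet."""
    hard_candidates = CONFUSING.get(anchor_domain, [])
    if hard_candidates and rng.random() < hard_ratio:
        return rng.choice(hard_candidates), True
    # Easy negative: any domain other than the anchor's
    others = [d for d in DOMAINS if d != anchor_domain]
    return rng.choice(others), False


rng = random.Random(0)
draws = [sample_negative_domain("Technology", rng) for _ in range(10_000)]
hard_share = sum(is_hard for _, is_hard in draws) / len(draws)
print(f"hard negatives: {hard_share:.1%}")  # close to 85%
```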

Training Configuration

Hardware:

  • GPU: NVIDIA A6000 48GB
  • Training Time: ~2 hours
  • Memory: Optimized with gradient accumulation

Hyperparameters:

  • Base Model: 0xnbk/nbk-ats-semantic-v1-en
  • Learning Rate: 5e-6 (very low to preserve semantic knowledge)
  • Physical Batch Size: 8
  • Gradient Accumulation Steps: 16 (effective batch size: 128)
  • Epochs: 3
  • Warmup Steps: 100
  • Optimizer: AdamW
  • Loss Function: TripletLoss (cosine distance, margin 0.5)
  • Max Sequence Length: 8,192 tokens
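
The batch and step counts implied by these settings can be checked with a few lines of arithmetic. This sketch assumes each of the 32,000 training triplets is seen once per epoch, that incomplete batches are not dropped, and that warmup is counted in optimizer steps. The variable names are illustrative.

```python
import math

train_triplets = 32_000
physical_batch_size = 8
grad_accum_steps = 16
epochs = 3
warmup_steps = 100

# One optimizer update happens after `grad_accum_steps` physical batches
effective_batch_size = physical_batch_size * grad_accum_steps
optimizer_steps_per_epoch = math.ceil(train_triplets / effective_batch_size)
total_optimizer_steps = optimizer_steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_optimizer_steps

print(effective_batch_size)       # 128
print(optimizer_steps_per_epoch)  # 250
print(total_optimizer_steps)      # 750
print(f"{warmup_fraction:.1%}")   # 13.3%
```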

TripletLoss Configuration:

  • Distance Metric: Cosine distance (1 - cosine similarity)
  • Triplet Margin: 0.5
  • Hard Negative Strategy: Confusing domain pairs prioritized
  • Goal: Minimize distance between same-domain pairs, maximize distance between different-domain pairs
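
For reference, triplet loss with cosine distance is commonly written as max(0, d(a, p) - d(a, n) + margin), where d(x, y) = 1 - cos(x, y). The sketch below evaluates that formula on toy 2-D vectors using only the standard library. The vectors are made up, and this is an illustration of the loss, not the training code used for this model.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D "embeddings" (made up for illustration)
anchor = [1.0, 0.0]
positive = [1.0, 0.0]        # same direction as the anchor
hard_negative = [1.0, 1.0]   # 45 degrees away: cosine distance about 0.293
easy_negative = [-1.0, 0.0]  # opposite direction: cosine distance 2.0

print(round(triplet_loss(anchor, positive, hard_negative), 3))  # 0.207
print(triplet_loss(anchor, positive, easy_negative))            # 0.0
```

The easy negative contributes zero loss because it is already more than the margin farther away than the positive, which is why hard negatives carry most of the training signal.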

Why Contrastive Learning?

Earlier binary classifiers (cross-encoders) hit a performance ceiling on this task. Contrastive learning with TripletLoss addresses this in four ways:

  1. Explicitly teaches relative distances: Model learns "Tech is closer to Tech than to Healthcare"
  2. Hard negative mining: Forces model to distinguish confusing pairs (Tech vs. Design, Healthcare vs. Education)
  3. Preserves semantic understanding: Fine-tuning from semantic model maintains content quality awareness
  4. Single unified embedding space: One model handles both similarity and domain classification

Results:

  • Same-domain pairs pushed to >0.90 similarity
  • Cross-domain pairs pushed to <0.30 or even negative values
  • 100% accuracy on the domain recognition task

Professional Domains Covered

The model demonstrates exceptional performance across 13 professional domains:

Domain                   | Recognition | Cross-Domain Separation
-------------------------|-------------|--------------------------------
Technology               | ✅ >0.90    | ✅ Strong boundaries
Healthcare               | ✅ >0.90    | ✅ Excellent isolation
Finance                  | ✅ >0.90    | ✅ Very strong separation
Education                | ✅ >0.90    | ✅ Clear boundaries
Legal                    | ✅ >0.90    | ✅ Strong separation
Sales/Marketing          | ✅ >0.90    | ✅ Realistic overlap handled
Human Resources          | ✅ >0.90    | ✅ Good separation
Manufacturing/Operations | ✅ >0.90    | ✅ Clear distinction
Design                   | ✅ >0.90    | ✅ Nuanced UX/UI overlap
Retail/Hospitality       | ✅ >0.90    | ✅ Service industry distinction
Construction/Real Estate | ✅ >0.90    | ✅ Strong boundaries
Government/Nonprofit     | ✅ >0.90    | ✅ Public sector clarity
Media/Entertainment      | ✅ >0.90    | ✅ Creative field separation

Limitations

  1. Language: Optimized for English only
  2. Threshold Tuning: Default threshold (0.5) may need adjustment for specific use cases
  3. Hybrid Roles: Modern roles spanning multiple domains (e.g., "Health Tech") may show intermediate similarities
  4. Emerging Fields: New professional domains (e.g., "Web3", "AI Ethics") may not have dedicated training examples
  5. Domain Granularity: Works at broad domain level; sub-specializations within domains are not distinguished
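
One way to tune the threshold for a specific use case is to sweep candidate values over a small set of labelled pairs and keep the one with the highest accuracy. The sketch below assumes you already have similarity scores with same-domain labels. The sample data and the function name `best_threshold` are made up for illustration.

```python
def best_threshold(scored_pairs, candidates):
    """Return (threshold, accuracy) maximising accuracy of `score > threshold`."""
    best = (None, -1.0)
    for threshold in candidates:
        correct = sum((score > threshold) == is_same
                      for score, is_same in scored_pairs)
        accuracy = correct / len(scored_pairs)
        if accuracy > best[1]:  # ties keep the lowest threshold
            best = (threshold, accuracy)
    return best


# (similarity, is_same_domain) pairs; values are hypothetical
labelled = [
    (0.95, True), (0.92, True), (0.88, True), (0.55, True),
    (0.70, False), (0.62, False), (0.10, False), (-0.30, False),
]
candidates = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95

threshold, accuracy = best_threshold(labelled, candidates)
print(threshold, accuracy)  # 0.7 0.875
```

Accuracy is only one possible criterion. If cross-domain false positives are costlier than missed matches, precision at the chosen threshold may be a better target.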

Ethical Considerations

  • Bias Awareness: Domain classification may inherit biases from LinkedIn job posting data
  • Transparency: Domain matching is algorithmically derived and should supplement, not replace, human judgment
  • Privacy: No PII included in training; users should handle resume data responsibly
  • Responsible Use: Should be used as a screening aid to filter cross-domain mismatches, not as the sole factor in hiring decisions
  • Fairness: Validate that domain penalties don't disproportionately impact career changers or candidates from underrepresented backgrounds

Citation

@model{nbk_ats_domain_v1,
  author = {NBK},
  title = {NBK ATS Domain Classifier v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/0xnbk/nbk-ats-domain-v1-en}
}

Training Dataset Citation

@dataset{resume_domain_triplets_v1,
  author = {NBK},
  title = {Resume-Domain Triplets Dataset v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/0xnbk/resume-domain-triplets-train-v1-en}
}

Base Model Citation

@model{nbk_ats_semantic_v1,
  author = {NBK},
  title = {NBK ATS Semantic Model v1 (English)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/0xnbk/nbk-ats-semantic-v1-en}
}

License

This model is released under the Apache 2.0 License.

Copyright 2025 NBK (nbk.dev)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Updates and Maintenance

  • Version: 1.0.0
  • Last Updated: October 2025
  • Maintained by: NBK (nbk.dev)

Contact

For questions, suggestions, or collaboration opportunities:

Model Size

  • Parameters: 32.7M
  • Tensor Type: F32 (Safetensors)