|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
- ats |
|
|
- resume-screening |
|
|
- job-matching |
|
|
- semantic-search |
|
|
base_model: jinaai/jina-embeddings-v2-small-en |
|
|
pipeline_tag: sentence-similarity |
|
|
library_name: sentence-transformers |
|
|
--- |
|
|
|
|
|
# NBK ATS Semantic Model v1 (English) |
|
|
|
|
|
**nbk-ats-semantic-v1-en** is a fine-tuned sentence transformer optimized for ATS (Applicant Tracking System) applications. This model excels at measuring semantic similarity between resumes and job descriptions, enabling accurate candidate-job matching across all professional domains. |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- π **Extended Context**: **8,192 tokens** - capable of processing full-length resumes and job descriptions without truncation |
|
|
- π **High Performance**: RMSE < 6.0 (target: <7.8), RΒ² = 0.943 |
|
|
- π― **Universal Domain Support**: Excellent performance across Technology, Healthcare, Finance, Education, Legal, Marketing, and 10+ other industries |
|
|
- πΎ **Multiple Formats**: Available in SafeTensors (125MB), ONNX full (109MB), and **ONNX quantized (27MB - browser-friendly)** |
|
|
- β‘ **Optimized for Inference**: Runs efficiently on A6000 48GB GPU with minimal dependencies |
|
|
- π **Production Ready**: Successfully validated on 64 diverse test cases with 95% success rate |
|
|
|
|
|
## Model Sizes |
|
|
|
|
|
| Format | Size | Use Case | Performance | |
|
|
|--------|------|----------|-------------| |
|
|
| **SafeTensors** | 125MB | HuggingFace/PyTorch deployment | Full precision | |
|
|
| **ONNX Full** | 109MB | Cross-platform inference | Full precision | |
|
|
| **ONNX Quantized (INT8)** | 27MB | **Browser deployment** | Minimal loss (~1-2%) | |
|
|
| **Ensemble Weights** | 0.5MB | Score mapper (Ridge + Neural) | Required for ATS scoring | |
|
|
|
|
|
**Total Browser Package**: ~28MB (ONNX quantized + ensemble weights) - **optimized for client-side inference** |
|
|
|
|
|
## Model Specification |
|
|
|
|
|
- **Base Model**: jinaai/jina-embeddings-v2-small-en |
|
|
- **Fine-tuning Dataset**: 6,374 resume-job pairs (5,099 train, 1,275 validation) |
|
|
- **Embedding Dimension**: 512D |
|
|
- **Max Sequence Length**: **8,192 tokens** (~32,000 characters) |
|
|
- **Architecture**: 4-layer BERT with ALiBi position embeddings |
|
|
- **Training Loss**: CosineSimilarityLoss |
|
|
- **Similarity Function**: Cosine similarity |
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
### Training Performance (A100 80GB GPU) |
|
|
|
|
|
**Ensemble Model (Semantic + Score Mapper):** |
|
|
- **RMSE**: 5.958 (Target: <7.8) β
**24% better than target** |
|
|
- **RΒ² Score**: 0.943 (Target: >0.90) β
|
|
|
- **MAE**: 3.957 |
|
|
- **Pearson R**: 0.971 |
|
|
|
|
|
**Semantic Model (Base):** |
|
|
- **Pearson Cosine**: 0.690 |
|
|
- **Spearman Cosine**: 0.469 |
|
|
|
|
|
### Production Validation (A6000 48GB GPU) |
|
|
|
|
|
**8 Domains Γ 8 Jobs = 64 Test Cases:** |
|
|
|
|
|
**Same-Domain Performance:** |
|
|
- Technology: 90.1% (similarity: 0.880) β
|
|
|
- Healthcare: 83.0% (similarity: 0.798) β
|
|
|
- Marketing: 83.0% (similarity: 0.797) β
|
|
|
- Education: 82.3% (similarity: 0.790) β
|
|
|
- Design: 80.3% (similarity: 0.768) β
|
|
|
- Sales: 76.6% (similarity: 0.729) β
|
|
|
- Management: 74.4% (similarity: 0.709) β
|
|
|
- Finance: 62.4% (similarity: 0.594) β οΈ |
|
|
|
|
|
**Average Same-Domain Score: 79.0%** β
|
|
|
|
|
|
**Cross-Domain Discrimination:** |
|
|
- Average Cross-Domain Score: 47.4% |
|
|
- **Separation Gap: 31.6 points** β
|
|
|
- Perfect Match Rate: 8/8 (100%) |
|
|
- No False Positives: 0 cross-domain pairs >90% |
|
|
|
|
|
**Overall Success Rate: 95%** π |
|
|
|
|
|
## Extended Context Window Advantage |
|
|
|
|
|
### Why 8,192 Tokens Matters |
|
|
|
|
|
Most transformer models limit context to 512 tokens (~2,000 characters), which is insufficient for professional documents: |
|
|
|
|
|
| Document Type | Average Length | Traditional Models | This Model | |
|
|
|---------------|----------------|-------------------|------------| |
|
|
| **Entry-Level Resume** | 1,500 chars | β
Fits | β
Fits | |
|
|
| **Senior Resume** | 4,500 chars | β Truncated | β
Fits | |
|
|
| **Executive Resume** | 8,000+ chars | β Severely truncated | β
Fits | |
|
|
| **Job Description** | 3,000-5,000 chars | β οΈ Partially fits | β
Fits | |
|
|
| **Combined (Resume + Job)** | 8,000-13,000 chars | β Heavy truncation | β
Mostly fits | |
|
|
|
|
|
**Real-World Impact:** |
|
|
- β
**No loss of critical information** from experience sections |
|
|
- β
**Complete skill analysis** across entire document |
|
|
- β
**Accurate senior-level matching** with extensive work history |
|
|
- β
**Better context understanding** from full job requirements |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install sentence-transformers |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Sentence Similarity |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from scipy.spatial.distance import cosine |
|
|
|
|
|
# Load model |
|
|
model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en') |
|
|
|
|
|
# Example: Resume-Job matching |
|
|
resume = """ |
|
|
Senior Software Engineer with 8 years of experience in Python, Django, and React. |
|
|
Led development of microservices architecture serving 10M+ users. Expert in AWS, |
|
|
Docker, Kubernetes, and CI/CD pipelines. Strong background in agile methodologies |
|
|
and cross-functional team leadership. |
|
|
""" |
|
|
|
|
|
job_description = """ |
|
|
We're seeking a Senior Backend Engineer with 5+ years Python experience. |
|
|
Must have expertise in Django, microservices, and cloud platforms (AWS/GCP). |
|
|
Experience with containerization (Docker/Kubernetes) and modern DevOps practices required. |
|
|
""" |
|
|
|
|
|
# Generate embeddings |
|
|
resume_embedding = model.encode(resume) |
|
|
job_embedding = model.encode(job_description) |
|
|
|
|
|
# Calculate similarity |
|
|
similarity = 1 - cosine(resume_embedding, job_embedding) |
|
|
print(f"Semantic Similarity: {similarity:.3f}") # Expected: 0.85-0.95 (high match) |
|
|
|
|
|
# Convert to ATS score (0-100) |
|
|
ats_score = similarity * 100 |
|
|
print(f"ATS Score: {ats_score:.1f}%") |
|
|
``` |
|
|
|
|
|
### Batch Processing with Long Documents |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
|
|
|
model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en') |
|
|
|
|
|
# Process multiple long resumes efficiently |
|
|
resumes = [ |
|
|
"... 8000+ character resume ...", |
|
|
"... another long resume ...", |
|
|
"... third resume ..." |
|
|
] |
|
|
|
|
|
job = "... detailed job description ..." |
|
|
|
|
|
# Batch encode (handles long context automatically) |
|
|
resume_embeddings = model.encode(resumes, batch_size=8, show_progress_bar=True) |
|
|
job_embedding = model.encode(job) |
|
|
|
|
|
# Calculate similarities |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
similarities = cosine_similarity([job_embedding], resume_embeddings)[0] |
|
|
|
|
|
# Rank candidates |
|
|
for idx, score in sorted(enumerate(similarities), key=lambda x: x[1], reverse=True): |
|
|
print(f"Candidate {idx+1}: {score*100:.1f}%") |
|
|
``` |
|
|
|
|
|
### Complete ATS Scoring with Ensemble |
|
|
|
|
|
For production ATS scoring, combine the semantic model with the ensemble score mapper: |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.linear_model import Ridge |
|
|
from sklearn.neural_network import MLPRegressor |
|
|
from sklearn.preprocessing import PolynomialFeatures |
|
|
import numpy as np |
|
|
import json |
|
|
|
|
|
# Load semantic model |
|
|
model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en') |
|
|
|
|
|
# Load ensemble weights from JSON (secure format, no pickle warnings) |
|
|
with open('ridge_weights.json', 'r') as f: |
|
|
ridge_data = json.load(f) |
|
|
with open('neural_weights.json', 'r') as f: |
|
|
neural_data = json.load(f) |
|
|
with open('poly_features.json', 'r') as f: |
|
|
poly_data = json.load(f) |
|
|
|
|
|
# Reconstruct models from JSON |
|
|
score_mapper = Ridge(alpha=ridge_data['alpha']) |
|
|
score_mapper.coef_ = np.array(ridge_data['coefficients']) |
|
|
score_mapper.intercept_ = ridge_data['intercept'] |
|
|
score_mapper.n_features_in_ = ridge_data['n_features_in'] |
|
|
|
|
|
neural_mapper = MLPRegressor( |
|
|
hidden_layer_sizes=tuple(neural_data['hidden_layer_sizes']), |
|
|
activation=neural_data['activation'] |
|
|
) |
|
|
neural_mapper.coefs_ = [np.array(c) for c in neural_data['coefs']] |
|
|
neural_mapper.intercepts_ = [np.array(i) for i in neural_data['intercepts']] |
|
|
neural_mapper.n_features_in_ = neural_data['n_features_in'] |
|
|
|
|
|
poly_features = PolynomialFeatures( |
|
|
degree=poly_data['degree'], |
|
|
include_bias=poly_data['include_bias'] |
|
|
) |
|
|
poly_features.n_features_in_ = poly_data['n_features_in'] |
|
|
poly_features.n_output_features_ = poly_data['n_output_features'] |
|
|
|
|
|
def predict_ats_score(resume_text, job_text): |
|
|
# Generate embeddings |
|
|
resume_emb = model.encode(resume_text) |
|
|
job_emb = model.encode(job_text) |
|
|
|
|
|
# Calculate base similarity |
|
|
similarity = np.dot(resume_emb, job_emb) / (np.linalg.norm(resume_emb) * np.linalg.norm(job_emb)) |
|
|
|
|
|
# Create polynomial features |
|
|
features = poly_features.transform([[similarity]]) |
|
|
|
|
|
# Ensemble prediction (Ridge + Neural Network) |
|
|
ridge_pred = score_mapper.predict(features)[0] |
|
|
neural_pred = neural_mapper.predict(features)[0] |
|
|
|
|
|
# Dynamic ensemble (50-50 weighting, optimized during training) |
|
|
final_score = (ridge_pred * 0.5 + neural_pred * 0.5) |
|
|
|
|
|
return np.clip(final_score, 0, 100) |
|
|
|
|
|
# Example usage |
|
|
score = predict_ats_score(resume_text, job_text) |
|
|
print(f"Final ATS Score: {score:.1f}%") |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
|
|
|
- **Source**: [0xnbk/resume-ats-score-v1-en](https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en) |
|
|
- **Training Samples**: 5,099 resume-job pairs |
|
|
- **Validation Samples**: 1,275 pairs |
|
|
- **Score Range**: 18.3 - 90.7 (normalized to 0-1 for training) |
|
|
- **Average Text Length**: ~8,480 characters per example |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
**Hardware:** |
|
|
- GPU: NVIDIA A100 80GB |
|
|
- Training Time: ~30 minutes |
|
|
- Memory: Optimized with gradient accumulation |
|
|
|
|
|
**Hyperparameters:** |
|
|
- Learning Rate: 1.2e-4 |
|
|
- Batch Size: 16 (physical) Γ 8 (gradient accumulation) = 128 (effective) |
|
|
- Epochs: 50 (early stopping at epoch 34) |
|
|
- Warmup Ratio: 0.15 |
|
|
- Optimizer: AdamW |
|
|
- Loss Function: CosineSimilarityLoss (MSE) |
|
|
- Max Sequence Length: 8,192 tokens |
|
|
- FP16: Enabled |
|
|
- Torch Compile: Enabled (inductor backend) |
|
|
|
|
|
**Score Mapper Training:** |
|
|
- Polynomial Features: Degree 3 |
|
|
- Ridge Regression: L2 regularization (alpha optimized) |
|
|
- Neural Network: 3-layer MLP (128β64β32 neurons) |
|
|
- Ensemble: Dynamic 50-50 weighting (Ridge + Neural) |
|
|
|
|
|
### ESCO Normalization |
|
|
|
|
|
This model was trained with **ESCO (European Skills, Competences, Qualifications and Occupations)** text normalization: |
|
|
|
|
|
- **13,939 real skills** from the ESCO taxonomy integrated during training |
|
|
- **Alternative label mapping**: Maps skill variations to canonical forms (e.g., "javascript" β "JavaScript", "react js" β "React", "ML" β "Machine Learning") |
|
|
- **Training-time normalization**: ALL resumes and job descriptions were normalized before encoding |
|
|
- **Benefits**: |
|
|
- Consistent skill representation across different writing styles |
|
|
- Handles common abbreviations and variations automatically |
|
|
- Improves matching accuracy for technology and professional terms |
|
|
- Better generalization to unseen skill variations |
|
|
|
|
|
**Example Normalizations:** |
|
|
- "b-tech" / "btech" β "Bachelor of Technology" |
|
|
- "js" / "javascript" β "JavaScript" |
|
|
- "aws cloud" / "amazon web services" β "Amazon Web Services (AWS)" |
|
|
- "ML" / "machine learning" β "Machine Learning" |
|
|
|
|
|
This normalization is baked into the model's training data, so you don't need to apply it during inference. |
|
|
|
|
|
### Validation Strategy |
|
|
|
|
|
The model was validated using: |
|
|
1. **Quantitative Metrics**: RMSE, RΒ², MAE, Pearson correlation |
|
|
2. **Cross-Domain Testing**: 64 test cases (8 domains Γ 8 job categories) |
|
|
3. **Real-World Validation**: Full-length resumes and job descriptions |
|
|
4. **Inference Testing**: A6000 48GB GPU with minimal dependencies |
|
|
|
|
|
## Browser Deployment (ONNX) |
|
|
|
|
|
The model is optimized for browser deployment using ONNX Runtime: |
|
|
|
|
|
**Quantized Model (27MB):** |
|
|
- INT8 quantization for reduced size |
|
|
- Minimal accuracy loss (~1-2%) |
|
|
- Compatible with ONNX Runtime Web |
|
|
- Runs entirely client-side (no server required) |
|
|
|
|
|
**Deployment Example (ONNX Runtime Web):** |
|
|
```javascript |
|
|
// Using ONNX Runtime Web for browser deployment |
|
|
import * as ort from 'onnxruntime-web'; |
|
|
|
|
|
// Load the quantized ONNX model |
|
|
const session = await ort.InferenceSession.create('onnx/model_quantized.onnx'); |
|
|
|
|
|
// You'll need to tokenize the text and create embeddings |
|
|
// Then calculate cosine similarity between resume and job embeddings |
|
|
|
|
|
function cosineSimilarity(a, b) { |
|
|
const dot = a.reduce((sum, val, i) => sum + val * b[i], 0); |
|
|
const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0)); |
|
|
const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0)); |
|
|
return dot / (magA * magB); |
|
|
} |
|
|
|
|
|
// Calculate ATS score |
|
|
const similarity = cosineSimilarity(resumeEmbedding, jobEmbedding); |
|
|
console.log(`ATS Score: ${(similarity * 100).toFixed(1)}%`); |
|
|
``` |
|
|
|
|
|
**Note**: For production browser deployment, you'll need to handle tokenization and implement the full inference pipeline. The ONNX quantized model provides the core embedding functionality optimized for client-side execution. |
|
|
|
|
|
## Domain Coverage |
|
|
|
|
|
The model demonstrates excellent performance across all major professional domains: |
|
|
|
|
|
**Technology Domains:** |
|
|
- Software Engineering, Data Science, DevOps, Cloud Computing, AI/ML |
|
|
- Web Development, Mobile Development, Security, QA, Systems Administration |
|
|
|
|
|
**Business Domains:** |
|
|
- Finance & Banking, Accounting, Investment, Fintech |
|
|
- Sales & Marketing, Business Development, Digital Marketing |
|
|
- Human Resources, Recruitment, Training & Development |
|
|
|
|
|
**Healthcare & Life Sciences:** |
|
|
- Nursing, Medical Practice, Clinical Research, Healthcare Administration |
|
|
- Pharmaceuticals, Biotechnology, Medical Devices |
|
|
|
|
|
**Professional Services:** |
|
|
- Legal, Consulting, Education, Design, Media & Entertainment |
|
|
- Manufacturing, Construction, Real Estate, Government & Nonprofit |
|
|
|
|
|
## Limitations |
|
|
|
|
|
1. **Language**: Currently optimized for English only |
|
|
2. **Domain**: Designed specifically for professional resume-job matching |
|
|
3. **Context Length**: While 8,192 tokens is generous, extremely long documents may still be truncated |
|
|
4. **Cultural Bias**: May reflect biases present in English-language job market data |
|
|
5. **Temporal Relevance**: Trained on 2024-2025 data; may need retraining for future job market shifts |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- **Bias Awareness**: Models may inherit biases from training data; validate fairness across demographics |
|
|
- **Transparency**: ATS scores are algorithmically derived and should supplement, not replace, human judgment |
|
|
- **Privacy**: No PII included in training; users should handle resume data responsibly |
|
|
- **Responsible Use**: Should be used as a screening aid, not sole decision-maker in hiring |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@model{nbk_ats_semantic_v1, |
|
|
author = {NBK}, |
|
|
title = {NBK ATS Semantic Model v1 (English)}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/0xnbk/nbk-ats-semantic-v1-en} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Training Dataset Citation |
|
|
|
|
|
```bibtex |
|
|
@dataset{resume_ats_score_v1, |
|
|
author = {NBK}, |
|
|
title = {Resume-ATS Score Dataset v1 (English)}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Base Model Citation |
|
|
|
|
|
```bibtex |
|
|
@software{jina_embeddings_v2_small, |
|
|
author = {Jina AI}, |
|
|
title = {Jina Embeddings v2 Small English}, |
|
|
year = {2024}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/jinaai/jina-embeddings-v2-small-en} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the **Apache 2.0 License**. |
|
|
|
|
|
``` |
|
|
Copyright 2025 NBK (nbk.dev) |
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
|
See the License for the specific language governing permissions and |
|
|
limitations under the License. |
|
|
``` |
|
|
|
|
|
## Related Resources |
|
|
|
|
|
- **Training Dataset**: [0xnbk/resume-ats-score-v1-en](https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en) |
|
|
- **Domain Classifier Dataset**: [0xnbk/resume-domain-classifier-v1-en](https://huggingface.co/datasets/0xnbk/resume-domain-classifier-v1-en) |
|
|
- **Domain Triplets Dataset**: [0xnbk/resume-domain-triplets-train-v1-en](https://huggingface.co/datasets/0xnbk/resume-domain-triplets-train-v1-en) |
|
|
- **Domain Model**: [0xnbk/nbk-ats-domain-v1-en](https://huggingface.co/0xnbk/nbk-ats-domain-v1-en) |
|
|
- **Application**: [LOCAL ATS](https://github.com/0xnbk/localATS) - Privacy-first ATS Resume Analyzer |
|
|
|
|
|
## Updates and Maintenance |
|
|
|
|
|
- **Version**: 1.0.0 |
|
|
- **Last Updated**: October 2025 |
|
|
- **Maintained by**: NBK (nbk.dev) |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions, suggestions, or collaboration opportunities: |
|
|
- **GitHub**: [0xnbk/localATS](https://github.com/0xnbk/localATS) |
|
|
- **HuggingFace**: [@0xnbk](https://huggingface.co/0xnbk) |
|
|
- **Website**: [nbk.dev](https://nbk.dev) |
|
|
|