📝 Update README: Replace pickle examples with secure JSON format

c9f566b verified 26 days ago

16.7 kB

	---
	language:
	- en
	license: apache-2.0
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- ats
	- resume-screening
	- job-matching
	- semantic-search
	base_model: jinaai/jina-embeddings-v2-small-en
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	---

	# NBK ATS Semantic Model v1 (English)

	nbk-ats-semantic-v1-en is a fine-tuned sentence transformer optimized for ATS (Applicant Tracking System) applications. This model excels at measuring semantic similarity between resumes and job descriptions, enabling accurate candidate-job matching across all professional domains.

	## Key Features

	- 🚀 Extended Context: 8,192 tokens - capable of processing full-length resumes and job descriptions without truncation
	- 📊 High Performance: RMSE < 6.0 (target: <7.8), R² = 0.943
	- 🎯 Universal Domain Support: Excellent performance across Technology, Healthcare, Finance, Education, Legal, Marketing, and 10+ other industries
	- 💾 Multiple Formats: Available in SafeTensors (125MB), ONNX full (109MB), and ONNX quantized (27MB - browser-friendly)
	- ⚡ Optimized for Inference: Runs efficiently on A6000 48GB GPU with minimal dependencies
	- 🔒 Production Ready: Successfully validated on 64 diverse test cases with 95% success rate

	## Model Sizes

	\| Format \| Size \| Use Case \| Performance \|
	\|--------\|------\|----------\|-------------\|
	\| SafeTensors \| 125MB \| HuggingFace/PyTorch deployment \| Full precision \|
	\| ONNX Full \| 109MB \| Cross-platform inference \| Full precision \|
	\| ONNX Quantized (INT8) \| 27MB \| Browser deployment \| Minimal loss (~1-2%) \|
	\| Ensemble Weights \| 0.5MB \| Score mapper (Ridge + Neural) \| Required for ATS scoring \|

	Total Browser Package: ~28MB (ONNX quantized + ensemble weights) - optimized for client-side inference

	## Model Specification

	- Base Model: jinaai/jina-embeddings-v2-small-en
	- Fine-tuning Dataset: 6,374 resume-job pairs (5,099 train, 1,275 validation)
	- Embedding Dimension: 512D
	- Max Sequence Length: 8,192 tokens (~32,000 characters)
	- Architecture: 4-layer BERT with ALiBi position embeddings
	- Training Loss: CosineSimilarityLoss
	- Similarity Function: Cosine similarity

	## Performance Metrics

	### Training Performance (A100 80GB GPU)

	Ensemble Model (Semantic + Score Mapper):
	- RMSE: 5.958 (Target: <7.8) ✅ 24% better than target
	- R² Score: 0.943 (Target: >0.90) ✅
	- MAE: 3.957
	- Pearson R: 0.971

	Semantic Model (Base):
	- Pearson Cosine: 0.690
	- Spearman Cosine: 0.469

	### Production Validation (A6000 48GB GPU)

	8 Domains × 8 Jobs = 64 Test Cases:

	Same-Domain Performance:
	- Technology: 90.1% (similarity: 0.880) ✅
	- Healthcare: 83.0% (similarity: 0.798) ✅
	- Marketing: 83.0% (similarity: 0.797) ✅
	- Education: 82.3% (similarity: 0.790) ✅
	- Design: 80.3% (similarity: 0.768) ✅
	- Sales: 76.6% (similarity: 0.729) ✅
	- Management: 74.4% (similarity: 0.709) ✅
	- Finance: 62.4% (similarity: 0.594) ⚠️

	Average Same-Domain Score: 79.0% ✅

	Cross-Domain Discrimination:
	- Average Cross-Domain Score: 47.4%
	- Separation Gap: 31.6 points ✅
	- Perfect Match Rate: 8/8 (100%)
	- No False Positives: 0 cross-domain pairs >90%

	Overall Success Rate: 95% 🎉

	## Extended Context Window Advantage

	### Why 8,192 Tokens Matters

	Most transformer models limit context to 512 tokens (~2,000 characters), which is insufficient for professional documents:

	\| Document Type \| Average Length \| Traditional Models \| This Model \|
	\|---------------\|----------------\|-------------------\|------------\|
	\| Entry-Level Resume \| 1,500 chars \| ✅ Fits \| ✅ Fits \|
	\| Senior Resume \| 4,500 chars \| ❌ Truncated \| ✅ Fits \|
	\| Executive Resume \| 8,000+ chars \| ❌ Severely truncated \| ✅ Fits \|
	\| Job Description \| 3,000-5,000 chars \| ⚠️ Partially fits \| ✅ Fits \|
	\| Combined (Resume + Job) \| 8,000-13,000 chars \| ❌ Heavy truncation \| ✅ Mostly fits \|

	Real-World Impact:
	- ✅ No loss of critical information from experience sections
	- ✅ Complete skill analysis across entire document
	- ✅ Accurate senior-level matching with extensive work history
	- ✅ Better context understanding from full job requirements

	## Installation

	```bash
	pip install sentence-transformers
	```

	## Usage

	### Basic Sentence Similarity

	```python
	from sentence_transformers import SentenceTransformer
	from scipy.spatial.distance import cosine

	# Load model
	model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')

	# Example: Resume-Job matching
	resume = """
	Senior Software Engineer with 8 years of experience in Python, Django, and React.
	Led development of microservices architecture serving 10M+ users. Expert in AWS,
	Docker, Kubernetes, and CI/CD pipelines. Strong background in agile methodologies
	and cross-functional team leadership.
	"""

	job_description = """
	We're seeking a Senior Backend Engineer with 5+ years Python experience.
	Must have expertise in Django, microservices, and cloud platforms (AWS/GCP).
	Experience with containerization (Docker/Kubernetes) and modern DevOps practices required.
	"""

	# Generate embeddings
	resume_embedding = model.encode(resume)
	job_embedding = model.encode(job_description)

	# Calculate similarity
	similarity = 1 - cosine(resume_embedding, job_embedding)
	print(f"Semantic Similarity: {similarity:.3f}") # Expected: 0.85-0.95 (high match)

	# Convert to ATS score (0-100)
	ats_score = similarity * 100
	print(f"ATS Score: {ats_score:.1f}%")
	```

	### Batch Processing with Long Documents

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')

	# Process multiple long resumes efficiently
	resumes = [
	"... 8000+ character resume ...",
	"... another long resume ...",
	"... third resume ..."
	]

	job = "... detailed job description ..."

	# Batch encode (handles long context automatically)
	resume_embeddings = model.encode(resumes, batch_size=8, show_progress_bar=True)
	job_embedding = model.encode(job)

	# Calculate similarities
	from sklearn.metrics.pairwise import cosine_similarity
	similarities = cosine_similarity([job_embedding], resume_embeddings)[0]

	# Rank candidates
	for idx, score in sorted(enumerate(similarities), key=lambda x: x[1], reverse=True):
	print(f"Candidate {idx+1}: {score*100:.1f}%")
	```

	### Complete ATS Scoring with Ensemble

	For production ATS scoring, combine the semantic model with the ensemble score mapper:

	```python
	from sentence_transformers import SentenceTransformer
	from sklearn.linear_model import Ridge
	from sklearn.neural_network import MLPRegressor
	from sklearn.preprocessing import PolynomialFeatures
	import numpy as np
	import json

	# Load semantic model
	model = SentenceTransformer('0xnbk/nbk-ats-semantic-v1-en')

	# Load ensemble weights from JSON (secure format, no pickle warnings)
	with open('ridge_weights.json', 'r') as f:
	ridge_data = json.load(f)
	with open('neural_weights.json', 'r') as f:
	neural_data = json.load(f)
	with open('poly_features.json', 'r') as f:
	poly_data = json.load(f)

	# Reconstruct models from JSON
	score_mapper = Ridge(alpha=ridge_data['alpha'])
	score_mapper.coef_ = np.array(ridge_data['coefficients'])
	score_mapper.intercept_ = ridge_data['intercept']
	score_mapper.n_features_in_ = ridge_data['n_features_in']

	neural_mapper = MLPRegressor(
	hidden_layer_sizes=tuple(neural_data['hidden_layer_sizes']),
	activation=neural_data['activation']
	)
	neural_mapper.coefs_ = [np.array(c) for c in neural_data['coefs']]
	neural_mapper.intercepts_ = [np.array(i) for i in neural_data['intercepts']]
	neural_mapper.n_features_in_ = neural_data['n_features_in']

	poly_features = PolynomialFeatures(
	degree=poly_data['degree'],
	include_bias=poly_data['include_bias']
	)
	poly_features.n_features_in_ = poly_data['n_features_in']
	poly_features.n_output_features_ = poly_data['n_output_features']

	def predict_ats_score(resume_text, job_text):
	# Generate embeddings
	resume_emb = model.encode(resume_text)
	job_emb = model.encode(job_text)

	# Calculate base similarity
	similarity = np.dot(resume_emb, job_emb) / (np.linalg.norm(resume_emb) * np.linalg.norm(job_emb))

	# Create polynomial features
	features = poly_features.transform([[similarity]])

	# Ensemble prediction (Ridge + Neural Network)
	ridge_pred = score_mapper.predict(features)[0]
	neural_pred = neural_mapper.predict(features)[0]

	# Dynamic ensemble (50-50 weighting, optimized during training)
	final_score = (ridge_pred * 0.5 + neural_pred * 0.5)

	return np.clip(final_score, 0, 100)

	# Example usage
	score = predict_ats_score(resume_text, job_text)
	print(f"Final ATS Score: {score:.1f}%")
	```

	## Training Details

	### Dataset

	- Source: [0xnbk/resume-ats-score-v1-en](https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en)
	- Training Samples: 5,099 resume-job pairs
	- Validation Samples: 1,275 pairs
	- Score Range: 18.3 - 90.7 (normalized to 0-1 for training)
	- Average Text Length: ~8,480 characters per example

	### Training Configuration

	Hardware:
	- GPU: NVIDIA A100 80GB
	- Training Time: ~30 minutes
	- Memory: Optimized with gradient accumulation

	Hyperparameters:
	- Learning Rate: 1.2e-4
	- Batch Size: 16 (physical) × 8 (gradient accumulation) = 128 (effective)
	- Epochs: 50 (early stopping at epoch 34)
	- Warmup Ratio: 0.15
	- Optimizer: AdamW
	- Loss Function: CosineSimilarityLoss (MSE)
	- Max Sequence Length: 8,192 tokens
	- FP16: Enabled
	- Torch Compile: Enabled (inductor backend)

	Score Mapper Training:
	- Polynomial Features: Degree 3
	- Ridge Regression: L2 regularization (alpha optimized)
	- Neural Network: 3-layer MLP (128→64→32 neurons)
	- Ensemble: Dynamic 50-50 weighting (Ridge + Neural)

	### ESCO Normalization

	This model was trained with ESCO (European Skills, Competences, Qualifications and Occupations) text normalization:

	- 13,939 real skills from the ESCO taxonomy integrated during training
	- Alternative label mapping: Maps skill variations to canonical forms (e.g., "javascript" → "JavaScript", "react js" → "React", "ML" → "Machine Learning")
	- Training-time normalization: ALL resumes and job descriptions were normalized before encoding
	- Benefits:
	- Consistent skill representation across different writing styles
	- Handles common abbreviations and variations automatically
	- Improves matching accuracy for technology and professional terms
	- Better generalization to unseen skill variations

	Example Normalizations:
	- "b-tech" / "btech" → "Bachelor of Technology"
	- "js" / "javascript" → "JavaScript"
	- "aws cloud" / "amazon web services" → "Amazon Web Services (AWS)"
	- "ML" / "machine learning" → "Machine Learning"

	This normalization is baked into the model's training data, so you don't need to apply it during inference.

	### Validation Strategy

	The model was validated using:
	1. Quantitative Metrics: RMSE, R², MAE, Pearson correlation
	2. Cross-Domain Testing: 64 test cases (8 domains × 8 job categories)
	3. Real-World Validation: Full-length resumes and job descriptions
	4. Inference Testing: A6000 48GB GPU with minimal dependencies

	## Browser Deployment (ONNX)

	The model is optimized for browser deployment using ONNX Runtime:

	Quantized Model (27MB):
	- INT8 quantization for reduced size
	- Minimal accuracy loss (~1-2%)
	- Compatible with ONNX Runtime Web
	- Runs entirely client-side (no server required)

	Deployment Example (ONNX Runtime Web):
	```javascript
	// Using ONNX Runtime Web for browser deployment
	import * as ort from 'onnxruntime-web';

	// Load the quantized ONNX model
	const session = await ort.InferenceSession.create('onnx/model_quantized.onnx');

	// You'll need to tokenize the text and create embeddings
	// Then calculate cosine similarity between resume and job embeddings

	function cosineSimilarity(a, b) {
	const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
	const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
	const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
	return dot / (magA * magB);
	}

	// Calculate ATS score
	const similarity = cosineSimilarity(resumeEmbedding, jobEmbedding);
	console.log(`ATS Score: ${(similarity * 100).toFixed(1)}%`);
	```

	Note: For production browser deployment, you'll need to handle tokenization and implement the full inference pipeline. The ONNX quantized model provides the core embedding functionality optimized for client-side execution.

	## Domain Coverage

	The model demonstrates excellent performance across all major professional domains:

	Technology Domains:
	- Software Engineering, Data Science, DevOps, Cloud Computing, AI/ML
	- Web Development, Mobile Development, Security, QA, Systems Administration

	Business Domains:
	- Finance & Banking, Accounting, Investment, Fintech
	- Sales & Marketing, Business Development, Digital Marketing
	- Human Resources, Recruitment, Training & Development

	Healthcare & Life Sciences:
	- Nursing, Medical Practice, Clinical Research, Healthcare Administration
	- Pharmaceuticals, Biotechnology, Medical Devices

	Professional Services:
	- Legal, Consulting, Education, Design, Media & Entertainment
	- Manufacturing, Construction, Real Estate, Government & Nonprofit

	## Limitations

	1. Language: Currently optimized for English only
	2. Domain: Designed specifically for professional resume-job matching
	3. Context Length: While 8,192 tokens is generous, extremely long documents may still be truncated
	4. Cultural Bias: May reflect biases present in English-language job market data
	5. Temporal Relevance: Trained on 2024-2025 data; may need retraining for future job market shifts

	## Ethical Considerations

	- Bias Awareness: Models may inherit biases from training data; validate fairness across demographics
	- Transparency: ATS scores are algorithmically derived and should supplement, not replace, human judgment
	- Privacy: No PII included in training; users should handle resume data responsibly
	- Responsible Use: Should be used as a screening aid, not sole decision-maker in hiring

	## Citation

	```bibtex
	@model{nbk_ats_semantic_v1,
	author = {NBK},
	title = {NBK ATS Semantic Model v1 (English)},
	year = {2025},
	publisher = {Hugging Face},
	url = {https://huggingface.co/0xnbk/nbk-ats-semantic-v1-en}
	}
	```

	### Training Dataset Citation

	```bibtex
	@dataset{resume_ats_score_v1,
	author = {NBK},
	title = {Resume-ATS Score Dataset v1 (English)},
	year = {2025},
	publisher = {Hugging Face},
	url = {https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en}
	}
	```

	### Base Model Citation

	```bibtex
	@software{jina_embeddings_v2_small,
	author = {Jina AI},
	title = {Jina Embeddings v2 Small English},
	year = {2024},
	publisher = {Hugging Face},
	url = {https://huggingface.co/jinaai/jina-embeddings-v2-small-en}
	}
	```

	## License

	This model is released under the Apache 2.0 License.

	```
	Copyright 2025 NBK (nbk.dev)

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	```

	## Related Resources

	- Training Dataset: [0xnbk/resume-ats-score-v1-en](https://huggingface.co/datasets/0xnbk/resume-ats-score-v1-en)
	- Domain Classifier Dataset: [0xnbk/resume-domain-classifier-v1-en](https://huggingface.co/datasets/0xnbk/resume-domain-classifier-v1-en)
	- Domain Triplets Dataset: [0xnbk/resume-domain-triplets-train-v1-en](https://huggingface.co/datasets/0xnbk/resume-domain-triplets-train-v1-en)
	- Domain Model: [0xnbk/nbk-ats-domain-v1-en](https://huggingface.co/0xnbk/nbk-ats-domain-v1-en)
	- Application: [LOCAL ATS](https://github.com/0xnbk/localATS) - Privacy-first ATS Resume Analyzer

	## Updates and Maintenance

	- Version: 1.0.0
	- Last Updated: October 2025
	- Maintained by: NBK (nbk.dev)

	## Contact

	For questions, suggestions, or collaboration opportunities:
	- GitHub: [0xnbk/localATS](https://github.com/0xnbk/localATS)
	- HuggingFace: [@0xnbk](https://huggingface.co/0xnbk)
	- Website: [nbk.dev](https://nbk.dev)