Sonar Core 1 - Vietnamese Text Classification Model

A machine learning text classification model for Vietnamese. It uses a TF-IDF feature extraction pipeline combined with Support Vector Classification (SVC) or Logistic Regression. With SVC it achieves 92.80% accuracy on the VNTC (news) dataset and 72.47% accuracy on the UTS2017_Bank (banking) dataset.

📋 View Detailed System Card for comprehensive model documentation, performance analysis, and limitations.

Model Description

Sonar Core 1 classifies Vietnamese text in two domains: news and banking. It is intended for Vietnamese news article classification, banking text categorization, general content categorization for Vietnamese text, and document organization and tagging.

Model Architecture

Algorithm: TF-IDF + SVC/Logistic Regression Pipeline
Feature Extraction: CountVectorizer with 20,000 max features
N-gram Support: Unigram and bigram (1-2)
TF-IDF: Transformation with IDF weighting
Classifier: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
Framework: scikit-learn ≥1.6
Caching System: Hash-based caching for efficient processing
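
The components listed above map naturally onto a scikit-learn Pipeline. The following is a minimal sketch under stated assumptions, not the project's actual training code: build_pipeline is a hypothetical helper, LinearSVC stands in for the SVC option (the exact kernel and hyperparameters are not stated in this card), and the toy corpus is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(model="logistic", max_features=20000, ngram_range=(1, 2)):
    """Hypothetical helper: token counts -> TF-IDF weights -> linear classifier."""
    if model == "logistic":
        classifier = LogisticRegression(max_iter=1000)
    else:
        # Assumption: a linear SVM stands in for the card's "SVC" option.
        classifier = LinearSVC()
    return Pipeline([
        # Keep at most `max_features` unigrams and bigrams.
        ("vect", CountVectorizer(max_features=max_features, ngram_range=ngram_range)),
        # Re-weight raw counts with inverse document frequency.
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("clf", classifier),
    ])


# Toy data for illustration only (not taken from VNTC or UTS2017_Bank).
texts = [
    "đội tuyển bóng đá thắng trận",
    "cầu thủ ghi bàn trong trận đấu",
    "ngân hàng tăng lãi suất tiết kiệm",
    "vay tiền ngân hàng mua nhà",
]
labels = ["sports", "sports", "banking", "banking"]

pipeline = build_pipeline("logistic")
pipeline.fit(texts, labels)
predicted = pipeline.predict(["lãi suất ngân hàng"])
print(predicted[0])
```

Keeping the vectorizer and classifier in one Pipeline means the fitted object accepts raw strings, which is how the pre-trained models are called later in this card.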

Supported Datasets & Categories

Installation

pip install "scikit-learn>=1.6" joblib huggingface_hub

Usage

Training the Model

VNTC Dataset (News Classification)

# Default training with VNTC dataset
python train.py --dataset vntc --model logistic

# With specific parameters
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2

UTS2017_Bank Dataset (Banking Text Classification)

# Train with UTS2017_Bank dataset (SVC recommended)
python train.py --dataset uts2017 --model svc_linear

# Train with Logistic Regression
python train.py --dataset uts2017 --model logistic

# With specific parameters (SVC)
python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2

# Compare multiple configurations
python train.py --dataset uts2017 --compare

Training from Scratch

from train import train_notebook

# Train VNTC model
vntc_results = train_notebook(
    dataset="vntc",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)

# Train UTS2017_Bank model
bank_results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)

Performance Metrics

VNTC Dataset Performance

Training Accuracy: 95.39%
Test Accuracy (SVC): 92.80%
Test Accuracy (Logistic Regression): 92.33%
Training Samples: 33,759
Test Samples: 50,373
Training Time (SVC): ~54.6 minutes
Training Time (Logistic Regression): ~31.40 seconds
Best Performing: Sports (98% F1-score)
Challenging Category: Lifestyle (76% F1-score)

UTS2017_Bank Dataset Performance

Training Accuracy (SVC): 95.07%
Test Accuracy (SVC): 72.47%
Test Accuracy (Logistic Regression): 70.96%
Training Samples: 1,581
Test Samples: 396
Training Time (SVC): ~5.3 seconds
Training Time (Logistic Regression): ~0.78 seconds
Best Performing: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC)
SVC Improvements: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1)
Challenges: Many minority classes with insufficient training data

Using the Pre-trained Models

VNTC Model (Vietnamese News Classification)

from huggingface_hub import hf_hub_download
import joblib

# Download and load VNTC model
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)

# Enhanced prediction function
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The top-ranked category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Make prediction on news text
news_text = "Đội tuyển bóng đá Việt Nam giành chiến thắng"
prediction, confidence, top_predictions = predict_text(vntc_model, news_text)

print(f"News category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")

UTS2017_Bank Model (Vietnamese Banking Text Classification)

from huggingface_hub import hf_hub_download
import joblib

# Download and load UTS2017_Bank model (latest SVC model)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Enhanced prediction function (same as above)
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The top-ranked category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Make prediction on banking text
bank_text = "Tôi muốn mở tài khoản tiết kiệm"
prediction, confidence, top_predictions = predict_text(bank_model, bank_text)

print(f"Banking category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
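
The prediction helper above relies on predict_proba, which some scikit-learn SVM estimators do not expose (for example LinearSVC, or SVC fitted without probability=True). Whether the published SVC model exposes it is not stated here, so the following is only a hedged fallback sketch: top_k_scores is a hypothetical helper that uses predict_proba when available and otherwise ranks classes by decision_function margins, which are not probabilities. The toy model is illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def top_k_scores(model, text, k=3):
    """Return up to k (label, score) pairs, highest score first."""
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba([text])[0]
    else:
        # Fallback: raw decision margins, NOT calibrated probabilities.
        scores = np.asarray(model.decision_function([text])[0]).ravel()
        if scores.size == 1:
            # Binary case: a single margin for classes_[1]; mirror it for classes_[0].
            scores = np.array([-scores[0], scores[0]])
    order = np.argsort(scores)[::-1][:k]
    return [(model.classes_[i], float(scores[i])) for i in order]


# Toy three-class model for illustration only.
texts = [
    "đội tuyển bóng đá thắng trận",
    "cầu thủ ghi bàn đẹp mắt",
    "ngân hàng tăng lãi suất tiết kiệm",
    "khách hàng vay tiền ngân hàng",
    "chính phủ thông qua luật mới",
    "quốc hội thảo luận dự luật",
]
labels = ["sports", "sports", "banking", "banking", "politics", "politics"]

toy_model = make_pipeline(TfidfVectorizer(), LinearSVC())
toy_model.fit(texts, labels)

ranked = top_k_scores(toy_model, "lãi suất ngân hàng", k=3)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Margins are only comparable within one model, so scores from this fallback should not be printed as "confidence" or compared across different models.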

Using Both Models

from huggingface_hub import hf_hub_download
import joblib

# Load both models
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Enhanced prediction function for both models
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The top-ranked category is the prediction
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Function to classify any Vietnamese text
def classify_vietnamese_text(text, domain="auto"):
    """
    Classify Vietnamese text using appropriate model with detailed predictions

    Args:
        text: Vietnamese text to classify
        domain: "news", "banking", or "auto" to detect domain

    Returns:
        tuple: (prediction, confidence, top_predictions, domain_used)
    """
    if domain == "news":
        prediction, confidence, top_predictions = predict_text(vntc_model, text)
        return prediction, confidence, top_predictions, "news"
    elif domain == "banking":
        prediction, confidence, top_predictions = predict_text(bank_model, text)
        return prediction, confidence, top_predictions, "banking"
    else:
        # Try both models and return higher confidence
        news_pred, news_conf, news_top = predict_text(vntc_model, text)
        bank_pred, bank_conf, bank_top = predict_text(bank_model, text)

        if news_conf > bank_conf:
            return f"NEWS: {news_pred}", news_conf, news_top, "news"
        else:
            return f"BANKING: {bank_pred}", bank_conf, bank_top, "banking"

# Examples
examples = [
    "Đội tuyển bóng đá Việt Nam thắng 2-0",
    "Tôi muốn vay tiền mua nhà",
    "Chính phủ thông qua luật mới"
]

for text in examples:
    category, confidence, top_predictions, domain = classify_vietnamese_text(text)
    print(f"Text: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Domain: {domain}")
    print("Top 3 predictions:")
    for i, (cat, prob) in enumerate(top_predictions, 1):
        print(f"  {i}. {cat}: {prob:.3f}")
    print()

Model Parameters

dataset: Dataset to use ("vntc" or "uts2017")
model: Model type ("logistic" or "svc_linear", as used in the training commands; SVC recommended for best performance)
max_features: Maximum number of TF-IDF features (default: 20000)
ngram_min/max: N-gram range (default: 1-2)
split_ratio: Fraction of the UTS2017 data held out for testing (default: 0.2)
n_samples: Optional sample limit for quick testing
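
As a sanity check on split_ratio, the reported UTS2017_Bank sample counts (1,581 training, 396 test) are consistent with holding out 20% of 1,977 examples. The sketch below assumes the ratio is passed to scikit-learn's train_test_split as test_size; the random_state value is arbitrary, and whether the actual training script stratifies or shuffles this way is not stated in this card.

```python
from sklearn.model_selection import train_test_split

# Total examples implied by the reported UTS2017_Bank counts.
n_samples = 1581 + 396
indices = list(range(n_samples))

# Assumption: split_ratio is used as test_size; random_state is arbitrary.
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_idx), len(test_idx))
```

scikit-learn rounds the test size up, so 0.2 × 1,977 = 395.4 becomes 396 test examples and the remaining 1,581 go to training.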

Limitations

Language Specificity: Only works with Vietnamese text
Domain Specificity: Optimized for specific domains (news and banking)
Feature Limitations: Limited to 20,000 most frequent features
Class Imbalance Sensitivity: Performance degrades with imbalanced datasets
Specific Weaknesses:
- VNTC: Lower performance on lifestyle category (71% recall)
- UTS2017_Bank: Poor performance on minority classes despite SVC improvements
- SVC requires longer training time compared to Logistic Regression
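
To see why minority classes matter when reading the headline numbers, the sketch below compares weighted and macro F1 on toy labels in which the classifier never predicts the rare class. The label names are borrowed from the categories mentioned in this card, but the data are made up for illustration and are not results from the model.

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels (illustrative only): one majority class and one rare class.
y_true = ["CUSTOMER_SUPPORT"] * 8 + ["DISCOUNT"] * 2
y_pred = ["CUSTOMER_SUPPORT"] * 10  # the rare class is never predicted

# Weighted F1 averages per-class F1 by class frequency; macro F1 treats classes equally.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"weighted F1: {weighted_f1:.3f}")
print(f"macro F1:    {macro_f1:.3f}")

# Per-class precision, recall, and F1 expose the failure on the rare class.
print(classification_report(y_true, y_pred, zero_division=0))
```

Here the weighted F1 is about 0.71 while the macro F1 is about 0.44, so a per-class report is a useful complement to weighted scores on imbalanced data.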

Ethical Considerations

Model reflects biases present in training datasets
Performance varies significantly across categories
Should be validated on target domain before deployment
Consider class imbalance when interpreting results

Additional Information

Repository: https://huggingface.co/undertheseanlp/sonar_core_1
Framework Version: scikit-learn ≥1.6
Python Version: 3.10+
System Card: See Sonar Core 1 - System Card for detailed documentation

Citation

If you use this model, please cite:

@misc{undertheseanlp_2025,
    author       = { undertheseanlp },
    title        = { Sonar Core 1 - Vietnamese Text Classification Model },
    year         = 2025,
    url          = { https://huggingface.co/undertheseanlp/sonar_core_1 },
    doi          = { 10.57967/hf/6599 },
    publisher    = { Hugging Face }
}


Evaluation Results (self-reported)

VNTC
Test Accuracy (SVC): 0.928
Weighted Precision: 0.920
Weighted Recall: 0.920
Weighted F1-Score: 0.920

UTS2017_Bank (SVC)
Test Accuracy: 0.725
Weighted Precision: 0.650
Weighted Recall: 0.720
Weighted F1-Score: 0.660
