SignalSeeker: Protein Signal Peptide Prediction

Model Description

SignalSeeker is a machine learning ensemble that predicts whether a protein sequence contains a signal peptide. It feeds ProtBERT embeddings of the N-terminal 50 amino acids to four classifiers (Random Forest, Extra Trees, SVM, and Logistic Regression), each of which reaches a test AUC above 0.98.

Model Performance

Best Model: Logistic Regression (L2), ranked by cross-validation AUC
CV AUC: 0.99433 (Test AUC: 0.98432)
Training Data: 5000 mixed sequences (positive and negative examples) from UniProt verified eukaryotic proteins
Test Data: 1000 mixed sequences from UniProt verified eukaryotic proteins, held out from the training data

Intended Use

This model is designed to:

Predict whether a protein sequence contains a signal peptide
Assist in protein subcellular localization prediction
Support research in protein secretion pathways
Aid in biotechnology applications requiring secreted proteins

How to Use

Installation

pip install torch transformers scikit-learn numpy

Basic Usage

from signalseeker import SignalSeekerPredictor

# Initialize predictor
predictor = SignalSeekerPredictor.from_pretrained("hcoops/signalseeker")

# Predict signal peptide
sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK..."
result = predictor.predict(sequence)

print(f"Has signal peptide: {result['has_signal_peptide']}")
print(f"Confidence: {result['probability']:.3f}")

Batch Prediction

sequences = {
    "protein1": "MKWVTFISLLFLFSSAYS...",
    "protein2": "MSKGEELFTGVVPILVELD..."
}

results = predictor.predict_batch(sequences)
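
When sequences are stored in a FASTA file, the ID-to-sequence dictionary used above can be built with a small parser. The following is a minimal sketch; `read_fasta` is a hypothetical helper written for illustration and is not part of the `signalseeker` package.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}.

    Hypothetical helper, not part of the signalseeker package.
    The record ID is the first whitespace-delimited token after '>'.
    """
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; join them into one string.
            records[current_id] += line.upper()
    return records


example = """>protein1 example record
MKWVTFISLL
FLFSSAYS
>protein2
MSKGEELFTGVVPILVELD
"""
sequences = read_fasta(example.splitlines())
print(sequences)
```

For a file on disk, pass the open file handle instead of `example.splitlines()`; the resulting dictionary has the same shape as the `sequences` argument shown above.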

Model Architecture

The SignalSeeker ensemble consists of:

Feature Extraction: ProtBERT embeddings of N-terminal 50 amino acids
Ensemble Models:
- Random Forest (Regularized)
- Extra Trees (Regularized)
- Support Vector Machine
- Logistic Regression (L2)
Feature Scaling: StandardScaler normalization
Decision Logic: Weighted ensemble with confidence assessment
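
The feature-extraction and decision steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the SignalSeeker implementation: the function names, the equal weights, the 0.5 threshold, and the confidence cutoffs are all hypothetical. The input formatting (space-separated residues, with the rare residues U, Z, O, and B mapped to X) follows the usage shown for the public ProtBERT checkpoint.

```python
import re
from typing import Dict, Tuple

N_TERMINAL_LENGTH = 50  # from the model card: N-terminal 50 amino acids


def format_for_protbert(sequence: str, n_residues: int = N_TERMINAL_LENGTH) -> str:
    """Truncate to the N-terminal window and format it for a ProtBERT tokenizer."""
    window = sequence.strip().upper()[:n_residues]
    window = re.sub(r"[UZOB]", "X", window)  # map rare residues to X
    return " ".join(window)  # one space between residues


def weighted_vote(
    probabilities: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[float, bool, str]:
    """Combine per-model probabilities into a score, a call, and a confidence label."""
    total = sum(weights[name] for name in probabilities)
    score = sum(probabilities[name] * weights[name] for name in probabilities) / total
    margin = abs(score - threshold)
    # Assumed cutoffs: the further the score is from the threshold,
    # the more confident the call.
    if margin >= 0.3:
        confidence = "high"
    elif margin >= 0.15:
        confidence = "medium"
    else:
        confidence = "low"
    return score, score >= threshold, confidence


# Made-up per-model probabilities for one sequence, with equal weights.
model_probabilities = {
    "random_forest": 0.9,
    "extra_trees": 0.8,
    "svm": 0.7,
    "logistic_regression": 1.0,
}
model_weights = {name: 1.0 for name in model_probabilities}

formatted = format_for_protbert("MU" + "A" * 60)
score, has_signal_peptide, confidence = weighted_vote(model_probabilities, model_weights)
print(formatted)
print(score, has_signal_peptide, confidence)
```

In a full pipeline the formatted string would be tokenized and embedded by ProtBERT, scaled with StandardScaler, and scored by each classifier before the weighted vote.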

Training Data

Source: UniProt database
Organisms: Eukaryotic proteins (Human, Mouse, Plant, Fungal)
Positive Examples: Proteins with experimentally verified signal peptides
Negative Examples: Cytoplasmic and nuclear proteins
Validation: Cross-validation with similarity-aware train/test splits
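
A similarity-aware split keeps every member of a sequence-similarity cluster on the same side of the train/test boundary, so near-identical proteins cannot leak from training into evaluation. The sketch below shows the idea only. It assumes cluster labels were computed beforehand by a sequence clustering tool; the function name, the cluster IDs, and the 20% test fraction are hypothetical and do not describe the exact procedure used for SignalSeeker.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_aware_split(
    cluster_of: Dict[str, str],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split sequence IDs so that no cluster spans both train and test."""
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Shuffle whole clusters, not individual sequences.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    target = test_fraction * len(cluster_of)
    train: List[str] = []
    test: List[str] = []
    for cluster_id in cluster_ids:
        # Fill the test set first, then send remaining clusters to train.
        bucket = test if len(test) < target else train
        bucket.extend(members[cluster_id])
    return train, test


# Made-up example: ten sequences grouped into five similarity clusters.
cluster_of = {f"seq{i}": f"cluster{i // 2}" for i in range(10)}
train_ids, test_ids = cluster_aware_split(cluster_of)
print(len(train_ids), len(test_ids))
```

Because whole clusters are assigned at once, the realized test fraction can overshoot the target when clusters are large.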

Performance Metrics

Model	CV AUC	Test AUC	Test Accuracy
Logistic Regression (L2)	0.99433	0.98432	0.92284
Random Forest (Regularized)	0.98941	0.98869	0.96192
Extra Trees (Regularized)	0.99032	0.99072	0.94899
SVM (Conservative)	0.98711	0.98439	0.92284
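
AUC and accuracy measure different things, which is why the table can rank models differently on each: Extra Trees has the highest test AUC, while Random Forest has the highest test accuracy. AUC is the probability that a randomly chosen positive example scores above a randomly chosen negative one, independent of any threshold; accuracy depends on the threshold used to make the call. The sketch below computes both on made-up toy data and assumes a 0.5 threshold; it is not the evaluation code used for the table.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples classified correctly at the given threshold."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy data for illustration only (1 = has signal peptide).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_accuracy = accuracy(toy_labels, toy_scores)
print(f"AUC: {toy_auc:.3f}, accuracy: {toy_accuracy:.3f}")
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; on full datasets a library routine such as scikit-learn's `roc_auc_score` is the practical choice.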

Limitations

Trained primarily on eukaryotic sequences
Performance may vary for prokaryotic proteins
Requires sequences of at least 50 amino acids for optimal performance, because features are extracted from the N-terminal 50 residues
May have reduced accuracy for highly divergent organisms
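
Some of these limitations can be flagged before prediction with a simple input check. The sketch below is a hypothetical pre-flight helper, not part of the `signalseeker` package. The 50-residue minimum comes from this model card; the 20-letter alphabet and the start-methionine check are assumptions added for illustration.

```python
from typing import List

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 50  # from the model card: at least 50 amino acids recommended


def check_sequence(sequence: str, min_length: int = MIN_LENGTH) -> List[str]:
    """Return human-readable warnings for a sequence; an empty list means no issues found."""
    warnings: List[str] = []
    cleaned = sequence.strip().upper()
    if len(cleaned) < min_length:
        warnings.append(
            f"sequence has {len(cleaned)} residues; at least {min_length} are recommended"
        )
    unexpected = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if unexpected:
        warnings.append("non-standard characters: " + ", ".join(unexpected))
    # Heuristic (assumption): signal peptides sit at the N-terminus, so a
    # sequence that does not start with methionine may be an N-terminally
    # truncated fragment.
    if cleaned and not cleaned.startswith("M"):
        warnings.append("sequence does not start with M; it may be a fragment")
    return warnings


for label, seq in [("short", "MKWVTFISLLFLFSSAYS"), ("ok", "M" + "A" * 49)]:
    print(label, check_sequence(seq))
```

A warning does not mean the prediction is wrong, only that the input falls outside the conditions described in this section.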

Ethical Considerations

This model is for research purposes only
Not intended for clinical diagnosis
Results should be validated experimentally
Consider potential biases in training data

Citation

If you use SignalSeeker in your research, please cite:

@misc{signalseeker2025,
  title={SignalSeeker: Machine Learning Ensemble for Protein Signal Peptide Prediction},
  author={Hugo Cooper},
  year={2025},
  url={https://huggingface.co/hcoops/signalseeker}
}

Contact

For questions or issues, please open an issue on the GitHub repository.

License

This model is released under the MIT License.