---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/stsb
base_model:
- rootxhacker/arthemis-instruct
tags:
- bert
- embedding
---

# rootxhacker/arthemis-embedding

This is a text embedding model finetuned from **arthemislm-base** on the **all-nli-pair**, **all-nli-pair-class**, **all-nli-pair-score**, **all-nli-triplet**, **stsb**, **quora**, and **natural-questions** datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

The **Arthemis Embedding** model is a 155.8M-parameter text embedding model that incorporates **Spiking Neural Networks (SNNs)** and **Liquid Time Constants (LTCs)** for enhanced temporal dynamics and semantic representation learning. This neuromorphic architecture provides unique advantages on classification tasks while maintaining competitive performance across various text understanding benchmarks.

This embedding model performs on par with jinaai/jina-embeddings-v2-base-en on MTEB.

## Model Details

- **Model Type**: Text Embedding
- **Supported Languages**: English
- **Number of Parameters**: 155.8M
- **Context Length**: 1024 tokens
- **Embedding Dimension**: 768
- **Base Model**: arthemislm-base
- **Training Data**: all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora, natural-questions

### Architecture Features

- **Spiking Neural Networks** in attention mechanisms for temporal processing
- **Liquid Time Constants** in feed-forward layers for adaptive dynamics
- **12-layer transformer backbone** with neuromorphic enhancements
- **RoPE positional encoding** for sequence understanding
- **Surrogate gradient training** for differentiable spike computation (see the sketch after this list)
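
The model's actual SNN/LTC code is not reproduced in this card, so the sketch below is only a minimal illustration of the two mechanisms named above. The class names `SpikeFunction` and `LiquidTimeConstantCell`, the fast-sigmoid surrogate, and the threshold of 1.0 (mirroring the value under Technical Specifications) are assumptions, not the model's real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpikeFunction(torch.autograd.Function):
    """Heaviside spike in the forward pass, smooth surrogate in the backward pass
    (illustrative only; the model's real spiking units may differ)."""

    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold = threshold
        return (membrane_potential >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Fast-sigmoid style surrogate: non-zero gradient near the firing threshold
        surrogate = 1.0 / (1.0 + 10.0 * (v - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None

class LiquidTimeConstantCell(nn.Module):
    """Toy liquid-time-constant update: the decay rate depends on the input."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.tau_net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, state: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
        tau = F.softplus(self.tau_net(x)) + 1.0   # strictly positive time constants
        return state + dt * (x - state) / tau      # state relaxes toward x at an input-dependent rate

# Toy forward/backward pass over a (batch, seq, hidden) activation tensor
x = torch.randn(2, 8, 768, requires_grad=True)
spikes = SpikeFunction.apply(x, 1.0)
state = LiquidTimeConstantCell(768)(x, torch.zeros_like(x))
spikes.sum().backward()  # gradients flow through the surrogate, not the hard step
print(spikes.shape, state.shape, bool(x.grad.abs().sum() > 0))
```

The hard threshold alone has zero gradient almost everywhere, which is why surrogate gradients are needed to train spiking units end to end.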

## Usage (Python)

Using this model with the custom implementation:

```python
# Load the model through the custom MTEBLlamaSNNLTCEncoder wrapper
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences, task_name="similarity")

print(f"Embeddings shape: {embeddings.shape}")  # (2, 768)
print(f"Embedding dimension: {embeddings.shape[1]}")
```

## Usage (Custom Implementation)

For direct usage with the neuromorphic architecture:

```python
from transformers import AutoTokenizer
from scipy.spatial.distance import cosine
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Initialize the tokenizer used by the model
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Process text
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences, task_name="embedding_task")

# Use embeddings for similarity
similarity = 1 - cosine(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity:.4f}")
```

## Evaluation

The model has been evaluated on 41 tasks from the **MTEB (Massive Text Embedding Benchmark)**; the results are summarized in the tables below.
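
For reference, an evaluation of this kind can be run roughly as follows with the `mteb` package and the custom encoder from the Usage sections. This is a sketch under assumptions: the task selection, the output folder, and the assumption that the encoder satisfies the `mteb` model interface are illustrative, not the exact script behind the reported numbers.

```python
from mteb import MTEB
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder("rootxhacker/arthemis-embedding")

# Evaluate a small subset of the 41 reported tasks (task names are illustrative)
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/arthemis-embedding")
print(results)
```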

### MTEB Performance

| Task Type | Average Score | Task Count | Best Individual Score |
|-----------|---------------|------------|------------------------|
| **Classification** | **42.78** | 8 | Amazon Counterfactual: 65.43 |
| **STS** | **39.96** | 8 | STS17: 58.48 |
| **Clustering** | **28.54** | 8 | ArXiv Hierarchical: 49.82 |
| **Retrieval** | **12.41** | 5 | Twitter URL: 53.78 |
| **Other** | **13.07** | 12 | Ask Ubuntu: 43.56 |

**Overall MTEB Score: 27.05** (across 41 tasks)

### Notable Individual Results

| Task | Score | Task Type |
|------|-------|-----------|
| Amazon Counterfactual Classification | 65.43 | Classification |
| STS17 | 58.48 | Semantic Similarity |
| Toxic Conversations Classification | 55.54 | Classification |
| IMDB Classification | 51.69 | Classification |
| SICK-R | 49.24 | Semantic Similarity |
| ArXiv Hierarchical Clustering | 49.82 | Clustering |
| Banking77 Classification | 29.98 | Classification |
| STSBenchmark | 36.82 | Semantic Similarity |

## Model Strengths

- **Classification Excellence**: Strongest task category, with a 42.78 average across 8 classification tasks
- **Semantic Understanding**: Solid semantic textual similarity performance (39.96 average across 8 STS tasks)
- **Neuromorphic Advantages**: The spiking neural architecture is designed for enhanced pattern recognition
- **Temporal Processing**: Liquid time constants enable adaptive sequence processing
- **Robust Embeddings**: 768-dimensional vectors capture rich semantic representations

## Applications

- **Text Classification**: Financial intent detection, sentiment analysis, content moderation
- **Semantic Search**: Document retrieval and similarity matching (see the sketch after this list)
- **Clustering**: Automatic text organization and topic discovery
- **Content Safety**: Toxic content detection and content moderation
- **Question Answering**: Similarity-based answer retrieval
- **Paraphrase Mining**: Finding semantically equivalent text pairs
- **Semantic Textual Similarity**: Measuring text similarity for various applications
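
As a concrete example of the semantic-search use case, the sketch below ranks a toy corpus against a query by cosine similarity. It reuses the custom encoder from the Usage sections; the corpus, the query, and the `task_name="retrieval"` argument are illustrative assumptions rather than documented API values.

```python
import numpy as np
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder("rootxhacker/arthemis-embedding")

corpus = [
    "The cat sits on the mat.",
    "Neural networks learn representations from data.",
    "Paris is the capital of France.",
]
query = "Which city is the capital of France?"

# Encode corpus and query into 768-dimensional vectors
corpus_emb = np.asarray(model.encode(corpus, task_name="retrieval"))
query_emb = np.asarray(model.encode([query], task_name="retrieval"))[0]

# L2-normalize so the dot product equals cosine similarity, then rank
corpus_emb = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = query_emb / np.linalg.norm(query_emb)
scores = corpus_emb @ query_emb

for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```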

## Training Details

The model was finetuned from the **arthemislm-base** foundation model using multiple high-quality datasets:

- **all-nli-pair**: Natural Language Inference pair datasets
- **all-nli-pair-class**: Classification variants of NLI pairs
- **all-nli-pair-score**: Scored NLI pairs for similarity learning
- **all-nli-triplet**: Triplet learning from NLI data (see the sketch after this list)
- **stsb**: Semantic Textual Similarity Benchmark
- **quora**: Quora Question Pairs for paraphrase detection
- **natural-questions**: Google's Natural Questions dataset
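
The exact training objective is not documented in this card. As an illustration of how triplet data such as **all-nli-triplet** is typically used, here is a generic cosine-margin triplet loss; the margin value and the random tensors standing in for real encoder outputs are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.5):
    """Push the anchor closer to the positive than to the negative by `margin`
    in cosine-similarity space (illustrative; not the model's documented loss)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

# Toy batch of 768-dimensional embeddings standing in for encoder outputs
anchor, positive, negative = (torch.randn(4, 768) for _ in range(3))
print(cosine_triplet_loss(anchor, positive, negative))
```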

The neuromorphic enhancements were integrated during training to provide:

- Spiking neuron dynamics in attention layers
- Liquid time constant adaptation in feed-forward networks
- Surrogate gradient optimization for spike-based learning
- Enhanced temporal pattern recognition capabilities

## Technical Specifications

```
Architecture: Transformer with SNN/LTC enhancements
Hidden Size: 768
Intermediate Size: 2048
Attention Heads: 12
Layers: 12
Max Position Embeddings: 1024
Vocabulary Size: 50,257
Spiking Threshold: 1.0
LTC Hidden Size: 256
Training Precision: FP32
```

## Citation

```bibtex
@misc{arthemis-embedding-2024,
  title={Arthemis Embedding: A Neuromorphic Text Embedding Model},
  author={rootxhacker},
  year={2024},
  howpublished={\url{https://huggingface.co/rootxhacker/arthemis-embedding}}
}
```

## License

This model is released under the MIT License, as declared in the model card metadata above.