tjohn327 committed on
Commit 2cc05b2 · verified · 1 Parent(s): 285509a

Fine-tuned all-MiniLM-L6-v2 for SCION RAG retrieval

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "word_embedding_dimension": 384,
+     "pooling_mode_cls_token": false,
+     "pooling_mode_mean_tokens": true,
+     "pooling_mode_max_tokens": false,
+     "pooling_mode_mean_sqrt_len_tokens": false,
+     "pooling_mode_weightedmean_tokens": false,
+     "pooling_mode_lasttoken": false,
+     "include_prompt": true
+ }
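The config above selects mean pooling (`"pooling_mode_mean_tokens": true`): the sentence embedding is the average of the token embeddings, with padding tokens excluded via the attention mask. A minimal illustrative sketch in plain Python, using hypothetical toy values (a 4-dim embedding rather than the model's 384):

```python
# Illustrative sketch of masked mean pooling, as selected by
# "pooling_mode_mean_tokens": true in the config above.
# Toy values only; not the model's actual implementation.

def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask:  # only real tokens contribute to the average
            total = [t + e for t, e in zip(total, emb)]
            count += 1
    return [t / count for t in total]

tokens = [[1.0, 2.0, 3.0, 4.0],   # real token
          [3.0, 0.0, 1.0, 0.0],   # real token
          [9.0, 9.0, 9.0, 9.0]]   # padding token, ignored
print(mean_pool(tokens, [1, 1, 0]))  # [2.0, 1.0, 2.0, 2.0]
```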
README.md ADDED
@@ -0,0 +1,772 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:46618
+ - loss:MultipleNegativesRankingLoss
+ base_model: sentence-transformers/all-MiniLM-L6-v2
+ widget:
+ - source_sentence: What does it mean for a packet to be authorized, as mentioned in
+ the document?
+ sentences:
+ - '<title>Creating a Secure Underlay for the Internet</title>
+
+ <section>4.5 Routing Logic at PoPs </section>
+
+ <content>
+
+ Through a number of key design principles and by leveraging the secure backbone
+ for internal routing, SBAS is able to disseminate routes securely to customers
+ and out to the Internet. Using a strict priority hierarchy on the control plane,
+ traffic to/from customers benefits from strong hijack resilience.
+
+ </content>'
+ - '<title>We recommend reading the following chapters to obtain a basic understanding
+ of SCION. Chapter What to Read Chapter 1 1 Introduction</title>
+
+ <section>25.3 Inter-domain Multipath Routing Protocols </section>
+
+ <content>
+
+ . Routing Deflection [558] allows endpoints to deflect their traffic at certain
+ BGP routers to choose different paths. While this approach can be incrementally
+ deployed with minimal changes to BGP, it only provides coarse-grained path control.
+
+ </content>'
+ - '<title>Formal Verification of Secure Forwarding Protocols</title>
+
+ <section>B. State</section>
+
+ <content>
+
+ A packet consists of the desired future path fut, and the (presumed) traversed
+ path past in the reverse direction. The full path is rev(past(m)) • fut(m). While
+ this splitting of the path simplifies our proofs, the forwarding path could equivalently
+ be defined as a single sequence with a moving pointer indicating the current position
+ on the path. We call a packet m authorized, if fut(m) ∈ auth a . Additionally,
+ each packet records a path hist, also in reverse direction. It represents the
+ packet''s actual trajectory and is used to express security properties. This can
+ be seen as a history variable.
+
+ </content>'
+ - source_sentence: _FLASHBACK
+ sentences:
+ - '<title>Anycast in the SCION Internet Architecture</title>
+
+ <section>1.1 Project Goal </section>
+
+ <content>
+
+ From a technical point of view, these designs for replicated services in SCION
+ do not necessarily need to work in the same way as anycast in the current internet.
+ It only needs to provide a conceptually similar solution, solving the same problem
+ as anycast does for the current internet. Users should be able to use a single
+ address or name to access a replicated internet service, and with that end up
+ connected to the best replica. The best replica does not always have to be the
+ one with the lowest latency or smallest geographical distance, it could also be
+ the replica that has the highest available bandwidth or lowest load, or a combination
+ of any of these.
+
+ </content>'
+ - '<title>Unknown Title</title>
+
+ <section>4.3 The API </section>
+
+ <content>
+
+ • PathProcessor. A path processor, as defined in the previous chapter. Has the
+ ability to send packets on specific paths over any of the connections associated
+ with it. Path processors are also receive extensions and hence can intercept incoming
+ packets. The difference between a path processor and a receive extension is that
+ the root path processor of a connection can be changed at any point in time during
+ the lifetime of a connection (hot swapping), while the receive extension is fixed
+ throughout the lifetime of a connection. By using a fixed receive extension to
+ handle and reply to latency probes, it becomes possible to change the path processor
+ without breaking the ability of the other peer to perform latency probing. As
+ such, the design foresees that each path processor only handles incoming packets
+ destined directly to it (e.g. latency probe replies), while the receive extension
+ has to handle any possible incoming packets from path processors of the other
+ peer (e.g. latency probes).
+
+ </content>'
+ - '<title>SCION Control Plane</title>
+
+ <url>https://www.ietf.org/archive/id/draft-dekater-scion-controlplane-07.html</url>
+
+ <section>5.Path Lookup - 5.2.Behavior of Actors in the Lookup Process</section>
+
+ <content>
+
+ Expand the source wildcard into separate requests for each reachable core AS in
+ the source ISD.¶
+
+
+ For each core segment request;¶
+
+
+
+
+ If possible, return matching core segments from cache;¶
+
+
+
+ Otherwise, request the core segments from the Control Services of each reachable
+ core AS at the source of the core segment, and then add the retrieved core segments
+ to the cache.¶
+
+
+
+
+ If possible, return matching core segments from cache;¶
+
+
+ Otherwise, request the core segments from the Control Services of each reachable
+ core AS at the source of the core segment, and then add the retrieved core segments
+ to the cache.¶
+
+
+ In the case of a down segment request:¶
+
+
+
+
+ Expand the source wildcard into separate requests for every core AS in the destination
+ ISD (destination ISD refers to the ISD to which the destination endpoint belongs).¶
+
+
+
+ For each segment request;¶
+
+
+
+ If possible, return matching down segments from cache;¶
+
+ </content>'
+ - source_sentence: What does the document claim about the relationship between end-host
+ path selection and the convergence axiom?
+ sentences:
+ - '<url>https://github.com/netsec-ethz/scion-apps/blob/master/webapp/development.md</url>
+
+ <content>
+
+ # Webapp Construction and Design
+
+ Webapp is a go application designed to operate a web server for purposes of visualizing
+ and testing the SCION infrastructure. Webapp occupies a strange place in the SCIONLab
+ ecosystem, in that, it draws from a wide variety of sources to provide testing
+ and visualization features so a list of [dependencies](dependencies.md) has been
+ developed for maintenance purposes. There isn''t one central source or API for
+ the information webapp uses to interrogate SCIONLab, thus webapp may do the following:
+
+
+ * Read from environment variables.
+
+ * Scan SCION''s logs.
+
+ * Scan SCION''s directory structure.
+
+ * Call third-party service APIs.
+
+ * Request static configuration from a SCIONLab-maintained location.
+
+ * Execute bash scripts.
+
+ * Execute SCION or SCIONLab tools and apps.
+
+ * Read from SCION''s databases.
+
+ * Make connections to SCION services, like the SCION Daemon.
+
+ </content>'
+ - '<title> - Ceremony administrator role - Phase 2 - Creation of TRC Payload</title>
+
+ <url>https://docs.scion.org/en/latest/cryptography/trc-signing-ceremony-phases-sensitive.html</url>
+
+ <content>
+
+ Connect the *USB flash drive* to your device, and copy the TRC payload file to
+
+ the root directory, then disconnect the *USB flash drive*. Hand out the *USB flash
+ drive*
+
+ to the *voting representatives*.
+
+
+ The *voting representatives* proceed to check the contents of the TRC payload
+
+ file by computing the SHA256 sum. Over the duration of the checks, keep the
+
+ SHA256 sum of the file available on the monitor for inspection.
+
+
+ This phase concludes once every *voting representative* confirms that the
+
+ contents of the TRC payload are correct. Once that happens, announce that
+
+ **Phase 2** has successfully concluded.
+
+ </content>'
+ - '<title>An Axiomatic Perspective on the Performance Effects of End-Host Path Selection</title>
+
+ <section>6.1.4 Convergence (Axiom 3 </section>
+
+ <content>
+
+ . Similar to Insight 8, the reason for this improvement is the de-synchronization
+ of the continuity time brought about by agent migration, which reduces the variance
+ of the aggregate additive increase and thus the flow-volume fluctuations. Contrary
+ to the widespread belief that end-host path selection necessarily hurts stability
+ (in the sense of the convergence axiom), our analysis thus shows that network
+ stability can in fact benefit from end-host path selection. 6.1.5 Fairness (Axiom
+ 4). Given simultaneous sending start and no path selection, perfect synchronization
+ implies that all agents always have exactly the same congestion-window size, i.e.,
+ 𝜂 = 0. Moreover, Zarchy et generally tend to come close to perfect fairness [41]
+ . To find the worst-case effects of end-host path selection, we thus assume perfect
+ fairness in the scenario without path selection:
+
+ </content>'
+ - source_sentence: How is the value of Acci+1 computed according to the document?
+ sentences:
+ - '<title>SCION Data Plane</title>
+
+ <url>https://www.ietf.org/archive/id/draft-dekater-scion-dataplane-04.html</url>
+
+ <section>4.Path Authorization - 4.2.Path Initialization and Packet Processing</section>
+
+ <content>
+
+ If the just calculated MACVerifyi does not match the MACi in the Hop Field of
+ the current ASi, drop the packet.¶
+
+
+
+ Compute the value of Acci+1. For this, use the formula in Section 4.1.1.2. Replace
+ Acci in the formula with the current value of Acc as set in the Acc field of the
+ current Info Field.¶
+
+
+
+ Replace the value of the Acc field in the current Info Field with the just calculated
+ value of Acci+1.¶
+
+
+
+
+
+ Case 2 The packet traverses the path segment in construction direction (C = "1")
+ where the path segment includes a peering Hop Field (P = "1") and the current
+ Hop Field is the peering Hop Field (i.e. the current hop is either the last hop
+ of the first segment or the first hop of the second segment). In this case, the
+ egress border router MUST take the following steps:¶
+
+ </content>'
+ - '<title>Debuglet: Programmable and Verifiable Inter-domain Network Telemetry</title>
+
+ <section>C. Control Plane</section>
+
+ <content>
+
+ . The function checks by looking up the ExecutionSlotsMap, when the first available
+ time slot that both to-be-involved executors can accommodate the measurement would
+ be, and how many execution slots need to be purchased at each executor. The function
+ returns the price that needs to be paid and the first possible time slot to the
+ initiator.
+
+ </content>'
+ - '<title>We recommend reading the following chapters to obtain a basic understanding
+ of SCION. Chapter What to Read Chapter 1 1 Introduction</title>
+
+ <section>17.5 Post-Quantum Cryptography </section>
+
+ <content>
+
+ . In this example, user U 1 trusts CA 1 more than CA 2 for issuing certificates
+ for domain D because CA 1 supports multi-perspective domain validation [1] ,
+ while user U 2 trusts CA 2 more than CA 1 because CA 2 is an American CA and D''s
+ toplevel domain is .us. In this example, U 1 should be able to express higher
+ trust 18.1 Trust Model in CA 1 than in CA 2 , while retaining the ability to use
+ certificates issued by CA 2 .
+
+ </content>'
+ - source_sentence: How many active ASes are reported as of the CIDR report mentioned
+ in the document?
+ sentences:
+ - '<title>The Case for In-Network Replay Suppression</title>
+
+ <section>4.3 Optimization Problem </section>
+
+ <content>
+
+ Equation 3 describes the size m of each BF as a function of the BF rotation interval
+ L, the number N of BFs, the number k of necessary hash functions, and the BF''s
+ target false-positive rate (fp). Since an incoming packet is checked against all
+ BFs, the overall target false-positive rate is 1 -(1fp) N . To determine the value
+ for fp, we consider the average number of packets that a router receives in an
+ interval L (which is r •L, where r is the incoming packet rate). Using the BF
+ equations, we get fp = (1e k•x•L/m ) k and by combining it with the equation for
+ the size of a BF, we obtain Equation 3. The inequality indicates that any larger
+ value for m yields a lower false-positive than fp.
+
+ </content>'
+ - '<title>Pervasive Internet-Wide Low-Latency Authentication</title>
+
+ <section>C. AS as Opportunistically Trusted Entity</section>
+
+ <content>
+
+ Each entity in the Internet is part of at least one AS, which is under the control
+ of a single administrative entity. This facilitates providing a common service
+ that authenticates endpoints (e.g., using a challenge-response protocol or preinstalled
+ keys and certificates) and issues certificates. Another advantage is the typically
+ close relationship between an endpoint and its AS, which allows for a stronger
+ leverage in case of misbehavior. Since it is infeasible for an endpoint to authenticate
+ each AS by itself (there are ∼71 000 active ASes according to the CIDR report [4]
+ ), RPKI is used as a trust anchor to authenticate ASes. RPKI resource issuers
+ assign an AS a set of IP address prefixes that this AS is allowed to originate.
+ An AS then issues short-lived certificates for its authorized IP address ranges.
+
+ </content>'
+ - '<title>Unknown Title</title>
+
+ <section>. Paths emission per unit of traffic</section>
+
+ <content>
+
+ The reason is that the number of BGP paths is less than  for most AS pairs. This
+ figure also suggests that the -greenest paths average emission differs from the
+ greenest path emission and the n-greenest paths average emission for both beaconing
+ algorithms. However, for every percentile, this difference in SCI-GIB is about
+  times less than the one in SCI-BCE. This means that the -greenest paths average
+ emission in SCI-GIB is much closer to the greenest path emission than SCI-BCE.
+ Also, for every percentile, the difference between the -greenest paths average
+ emissions of the two different beaconing algorithms is  times more than the difference
+ between their greenest path emissions. From both of these observations, we conclude
+ that SCI-GIB is better at finding the greenest set of paths
+
+ </content>'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
+ results:
+ - task:
+ type: information-retrieval
+ name: Information Retrieval
+ dataset:
+ name: val ir eval
+ type: val-ir-eval
+ metrics:
+ - type: cosine_accuracy@1
+ value: 0.6153653653653653
+ name: Cosine Accuracy@1
+ - type: cosine_accuracy@3
+ value: 0.8033033033033034
+ name: Cosine Accuracy@3
+ - type: cosine_accuracy@5
+ value: 0.8578578578578578
+ name: Cosine Accuracy@5
+ - type: cosine_accuracy@10
+ value: 0.9179179179179179
+ name: Cosine Accuracy@10
+ - type: cosine_precision@1
+ value: 0.6153653653653653
+ name: Cosine Precision@1
+ - type: cosine_precision@3
+ value: 0.268018018018018
+ name: Cosine Precision@3
+ - type: cosine_precision@5
+ value: 0.17182182182182182
+ name: Cosine Precision@5
+ - type: cosine_precision@10
+ value: 0.091991991991992
+ name: Cosine Precision@10
+ - type: cosine_recall@1
+ value: 0.6151290179067957
+ name: Cosine Recall@1
+ - type: cosine_recall@3
+ value: 0.8029696363029697
+ name: Cosine Recall@3
+ - type: cosine_recall@5
+ value: 0.8575519964408853
+ name: Cosine Recall@5
+ - type: cosine_recall@10
+ value: 0.9174660771882994
+ name: Cosine Recall@10
+ - type: cosine_ndcg@10
+ value: 0.7686494924105739
+ name: Cosine Ndcg@10
+ - type: cosine_mrr@10
+ value: 0.7208215159604052
+ name: Cosine Mrr@10
+ - type: cosine_map@100
+ value: 0.7240690909632143
+ name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) <!-- at revision fa97f6e7cb1a59073dff9e6b13e2715cf7475ac9 -->
+ - **Maximum Sequence Length:** 256 tokens
+ - **Output Dimensionality:** 384 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+ (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
+ (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("tjohn327/scion-minilm-l6-v3")
+ # Run inference
+ sentences = [
+ 'How many active ASes are reported as of the CIDR report mentioned in the document?',
+ '<title>Pervasive Internet-Wide Low-Latency Authentication</title>\n<section>C. AS as Opportunistically Trusted Entity</section>\n<content>\nEach entity in the Internet is part of at least one AS, which is under the control of a single administrative entity. This facilitates providing a common service that authenticates endpoints (e.g., using a challenge-response protocol or preinstalled keys and certificates) and issues certificates. Another advantage is the typically close relationship between an endpoint and its AS, which allows for a stronger leverage in case of misbehavior. Since it is infeasible for an endpoint to authenticate each AS by itself (there are ∼71 000 active ASes according to the CIDR report [4] ), RPKI is used as a trust anchor to authenticate ASes. RPKI resource issuers assign an AS a set of IP address prefixes that this AS is allowed to originate. An AS then issues short-lived certificates for its authorized IP address ranges.\n</content>',
+ '<title>Unknown Title</title>\n<section>\uf735.\uf731 Paths emission per unit of traffic</section>\n<content>\nThe reason is that the number of BGP paths is less than \uf735 for most AS pairs. This figure also suggests that the \uf735-greenest paths average emission differs from the greenest path emission and the n-greenest paths average emission for both beaconing algorithms. However, for every percentile, this difference in SCI-GIB is about \uf733 times less than the one in SCI-BCE. This means that the \uf735-greenest paths average emission in SCI-GIB is much closer to the greenest path emission than SCI-BCE. Also, for every percentile, the difference between the \uf735-greenest paths average emissions of the two different beaconing algorithms is \uf732 times more than the difference between their greenest path emissions. From both of these observations, we conclude that SCI-GIB is better at finding the greenest set of paths\n</content>',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 384]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
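Since the model ends with a `Normalize()` module, its embeddings are unit-length, and cosine similarity reduces to a plain dot product. A minimal sketch of that relationship in plain Python, with hypothetical toy vectors rather than real embeddings:

```python
import math

# Toy illustration: cosine similarity equals the dot product of
# L2-normalized vectors, which is why the model normalizes its output.

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def normalize(v):
    """L2-normalize, as the model's final Normalize() module does."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)
dot = sum(x * y for x, y in zip(ua, ub))
print(round(cosine_similarity(a, b), 4), round(dot, 4))  # 0.96 0.96
```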
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Dataset: `val-ir-eval`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.6154     |
+ | cosine_accuracy@3   | 0.8033     |
+ | cosine_accuracy@5   | 0.8579     |
+ | cosine_accuracy@10  | 0.9179     |
+ | cosine_precision@1  | 0.6154     |
+ | cosine_precision@3  | 0.268      |
+ | cosine_precision@5  | 0.1718     |
+ | cosine_precision@10 | 0.092      |
+ | cosine_recall@1     | 0.6151     |
+ | cosine_recall@3     | 0.803      |
+ | cosine_recall@5     | 0.8576     |
+ | cosine_recall@10    | 0.9175     |
+ | **cosine_ndcg@10**  | **0.7686** |
+ | cosine_mrr@10       | 0.7208     |
+ | cosine_map@100      | 0.7241     |
539
+ <!--
540
+ ## Bias, Risks and Limitations
541
+
542
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
543
+ -->
544
+
545
+ <!--
546
+ ### Recommendations
547
+
548
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
549
+ -->
550
+
551
+ ## Training Details
552
+
553
+ ### Training Dataset
554
+
555
+ #### Unnamed Dataset
556
+
557
+ * Size: 46,618 training samples
558
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
559
+ * Approximate statistics based on the first 1000 samples:
560
+ | | sentence_0 | sentence_1 |
561
+ |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
562
+ | type | string | string |
563
+ | details | <ul><li>min: 2 tokens</li><li>mean: 21.15 tokens</li><li>max: 45 tokens</li></ul> | <ul><li>min: 86 tokens</li><li>mean: 200.21 tokens</li><li>max: 256 tokens</li></ul> |
564
+ * Samples:
565
+ | sentence_0 | sentence_1 |
566
+ |:-----------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
567
+ | <code>What specific snippet of the resolver-recv-answer-for-client rule is presented in the document?</code> | <code><title>A Formal Framework for End-to-End DNS Resolution</title><br><section>3.2.3 DNS Dynamics. </section><br><content><br>This rule [resolver-recv-answer-for-client] has 74 LOC with nontrivial auxiliary functions and rule conditions. For the simplicity of our presentation, we only show the most important snippet with respect to positive caching. 5 The rule applies for a response that authoritatively answers a client query. More specifically, a temporary cache is created from the data contained in the response (line 8), which is then used for the lookup (line 10). Note that we cannot perform the lookup directly on the actual cache as case A of the resolver algorithm should only consider the data in the response, not in the cache. Also note that we look only at the data in the answer section (ANS, line 2) for the temporary positive cache as the entire rule is concerned with authoritative answers. Finally, we insert the data from the response into the actual cache and use this updated cache on th...</code> |
568
+ | <code>What is the relationship between early adopters and the potential security improvements mentioned for SBAS in the document?</code> | <code><title>Creating a Secure Underlay for the Internet</title><br><section>9 Related Work </section><br><content><br>. While several challenges still exist when deploying SBAS in a production setting, our survey shows a potential path forward and our experimental results show promise that sizable security improvements can be achieved with even a small set of early adopters. We hope that SBAS revitalizes the quest for secure inter-domain routing.<br></content></code> |
569
+ | <code>How does the evaluation in this study focus on user-driven path control within SCION?</code> | <code><title>Evaluation of SCION for User-driven Path Control: a Usability Study</title><br><section>ABSTRACT</section><br><content><br>The UPIN (User-driven Path verification and control in Inter-domain Networks) project aims to implement a way for users of a network to control how their data is traversing it. In this paper we investigate the possibilities and limitations of SCION for user-driven path control. Exploring several aspects of the performance of a SCION network allows us to define the most efficient path to assign to a user, following specific requests. We extensively analyze multiple paths, specifically focusing on latency, bandwidth and data loss, in SCIONLab, an experimental testbed and implementation of a SCION network. We gather data on these paths and store it in a database, that we then query to select the best path to give to a user to reach a destination, following their request on performance or devices to exclude for geographical or sovereignty reasons. Results indicate our so...</code> |
570
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
571
+ ```json
572
+ {
573
+ "scale": 20.0,
574
+ "similarity_fct": "cos_sim"
575
+ }
576
+ ```
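MultipleNegativesRankingLoss treats, for each query in a batch, its paired passage as the positive and every other passage in the batch as a negative: scaled cosine similarities are fed to a softmax cross-entropy whose target is the diagonal. A minimal plain-Python sketch of that objective (toy 2-dim embeddings; `scale=20.0` as in the parameters above; not the library's implementation):

```python
import math

def mnrl_loss(query_embs, passage_embs, scale=20.0):
    """In-batch-negatives cross-entropy: query i should rank its paired
    passage i highest among all passages in the batch."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    losses = []
    for i, q in enumerate(query_embs):
        logits = [scale * cos(q, p) for p in passage_embs]
        m = max(logits)  # subtract the max to stabilize the softmax
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # -log softmax at target i
    return sum(losses) / len(losses)

# Toy batch of 2 (query, positive-passage) pairs: well-aligned pairs
# should produce a near-zero loss, mismatched pairs a large one.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
print(mnrl_loss(queries, passages) < 0.01)  # True
```

Because every other in-batch passage serves as a free negative, larger batch sizes (here 64) generally make the ranking task harder and the training signal stronger.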
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
+ - `num_train_epochs`: 1
+ - `fp16`: True
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 64
+ - `per_device_eval_batch_size`: 64
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch  | Step | val-ir-eval_cosine_ndcg@10 |
+ |:------:|:----:|:--------------------------:|
+ | 0.2740 | 100  | 0.7363                     |
+ | 0.5479 | 200  | 0.7595                     |
+ | 0.8219 | 300  | 0.7648                     |
+ | 1.0    | 365  | 0.7686                     |
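+ The metric above, `cosine_ndcg@10`, rewards placing the relevant passage near the top of the first ten retrieved results. A self-contained sketch of the metric for a single query (illustrative only; the reported numbers come from the sentence-transformers IR evaluator):
+
+ ```python
+ import numpy as np
+
+ def ndcg_at_10(ranked_relevance):
+     """nDCG@10 for one query. ranked_relevance[i] is the graded relevance
+     of the document returned at rank i (0-based), in retrieval order."""
+     rel = np.asarray(ranked_relevance, dtype=float)[:10]
+     discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1 / log2(rank + 1)
+     dcg = float((rel * discounts).sum())
+     ideal = np.sort(np.asarray(ranked_relevance, dtype=float))[::-1][:10]
+     idcg = float((ideal * discounts).sum())
+     return dcg / idcg if idcg > 0 else 0.0
+ ```
+
+ A perfect ranking scores 1.0; pushing the relevant passage down the list discounts its contribution logarithmically, so 0.7686 indicates relevant passages usually land in the top few ranks.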
+
+
+ ### Framework Versions
+ - Python: 3.12.3
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.49.0
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.4.0
+ - Datasets: 3.3.2
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "./scion-minilm-title-qwen-questions",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.49.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
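As a rough sanity check, the architecture in this config (6 layers, hidden size 384, FFN size 1536, vocabulary 30522) accounts for the ~90.9 MB float32 checkpoint. The arithmetic, assuming the standard BERT parameter layout including the pooler head (a back-of-the-envelope sketch, not read from the checkpoint itself):

```python
def bert_param_count(vocab=30522, hidden=384, layers=6, ffn=1536,
                     max_pos=512, type_vocab=2):
    """Parameter count for a standard BERT encoder with this config."""
    emb = (vocab * hidden + max_pos * hidden + type_vocab * hidden
           + 2 * hidden)                    # embeddings + their LayerNorm
    per_layer = (
        4 * (hidden * hidden + hidden)      # Q, K, V and attention output
        + 2 * hidden                        # attention LayerNorm
        + hidden * ffn + ffn                # intermediate dense
        + ffn * hidden + hidden             # output dense
        + 2 * hidden                        # output LayerNorm
    )
    pooler = hidden * hidden + hidden
    return emb + layers * per_layer + pooler

# ~22.7M params * 4 bytes (float32) ≈ the 90,864,192-byte model.safetensors,
# with the small remainder being the safetensors header.
params = bert_param_count()
```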
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.49.0",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9cb83b318d23c5cd6330c6961171323226ef730efec1b139f7d26c5150c2a9c4
+ size 90864192
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
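The three modules above form the inference pipeline: the transformer produces per-token embeddings, `1_Pooling` mean-pools them over real tokens, and `2_Normalize` L2-normalizes the result so that dot products equal cosine similarities. The last two stages can be sketched in numpy (illustrative, assuming the mask-aware mean pooling configured in `1_Pooling/config.json`):

```python
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    """Mean pooling over non-padding tokens, then L2 normalization."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # real tokens per sentence
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Because the outputs are unit-length, downstream retrieval can rank passages by a plain dot product against the query embedding.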
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 128,
+   "model_max_length": 256,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
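The `BertTokenizer` configured above splits lowercased words with greedy longest-match-first WordPiece against vocab.txt, falling back to `[UNK]` when no prefix matches. A toy sketch of that matching rule (with a hypothetical mini-vocabulary; the real vocabulary has 30,522 entries):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1                      # shrink until a vocab entry matches
        if match is None:
            return [unk]                  # no prefix matched: whole word is unknown
        pieces.append(match)
        start = end
    return pieces
```

For example, with a vocabulary containing `un`, `##aff`, and `##able`, the word "unaffable" splits into `["un", "##aff", "##able"]`; continuation pieces carry the `##` prefix so the original word can be reassembled.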
vocab.txt ADDED
The diff for this file is too large to render. See raw diff