File size: 2,641 Bytes
1ec2278
46f5f7c
 
 
 
 
a292ca1
 
 
1ec2278
46f5f7c
a292ca1
46f5f7c
 
 
a292ca1
 
46f5f7c
a292ca1
46f5f7c
a292ca1
46f5f7c
 
 
 
 
 
 
 
 
 
 
e9274ca
46f5f7c
 
 
 
 
a292ca1
 
46f5f7c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a292ca1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: cc-by-2.0
language:
- en
---

# Same news story

This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is trained to maps news stories that contain similar content close to each other. It can be used to measure the distance between two news stories or cluster similar news stories together. 


## Usage

This model can be used with the [sentence-transformers](https://www.SBERT.net) package: 

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('dell-research-harvard/same-story')
embeddings = model.encode(sentences)
print(embeddings)
```

## Training
This model was trained on data using news articles from Allsides, a news aggregator that collates articles on the same story from multiple news sites. We extract pairs of articles from these groupings and use these as positive pairs in our training data. For negative pairs, we use pairs that have  a small cosine distance when evaluated with the untrained model, but do not come from the same story, and do not share the same topic tags, according to All Sides. 

The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 806 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`modified_sbert.losses.OnlineContrastiveLoss_wandb` 

Parameters of the fit()-Method:
```
{
    "epochs": 9,
    "evaluation_steps": 320,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 2842,
    "weight_decay": 0.01
}
```


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```