Update README.md
README.md CHANGED

---
license: apache-2.0
language:
- en
base_model:
- allenai/scibert_scivocab_uncased
tags:
- Science
- classifier
- words
---

<b><span style="color:red;">IMPORTANT! READ THIS!</span></b>

## Model description

This model recognizes scientific terms in a given *text*. The best way to use it is as follows:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize  # may require nltk.download("punkt") once
import torch
import spacy

# You can use spaCy to remove named entities from the text
# (the model usually predicts them as scientific).
# Run `python -m spacy download en_core_web_sm` once if the model is not installed.
nlp = spacy.load("en_core_web_sm")
# doc = nlp(text)
# names = [ent.text for ent in doc.ents]

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-science-word-classifier")
model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-science-word-classifier").to(device)
model.eval()

# Define max_len as needed.
def classify_term(term, max_len=12):
    term = term.lower()
    tokens = tokenizer(term, return_tensors="pt", truncation=True, padding=True, max_length=max_len).to(device)
    with torch.no_grad():
        logits = model(**tokens).logits          # shape: (1, num_tokens, num_labels)
    preds = logits.argmax(dim=-1)[0]             # per-token predicted class ids
    # Treat the term as scientific if any of its tokens is predicted as class 1 (one possible aggregation).
    return "Scientific" if (preds == 1).any().item() else "Non-Scientific"

# For a single term:
print(classify_term("quantum mechanics"))
print(classify_term("table"))
print(classify_term("photosynthesis"))

# For sentences:
words = word_tokenize("some sentence")  # you can also use sentence.split()
results = []
for w in words:
    res = classify_term(w)
    results.append(res)

for w, p in zip(words, results):
    print(f"Word: {w}, Predicted Label: {p}")
```

## Example usage

Given the following text:
"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition.
|
58 |
+
One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are.
|
59 |
+
This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors.
|
60 |
+
Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers."
|

the words it classified as scientific are:
```
['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits', 'property', 'superposition', 'entanglement', 'matter', 'factor', 'state', 'scale']
```
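
A word list like the one above can be produced along the lines of the commented-out spaCy filtering in the snippet earlier. Below is a minimal sketch that reuses `nlp`, `word_tokenize`, and `classify_term` from that snippet; the exact post-processing (skipping punctuation, numbers, and named-entity words) is an assumption, not a documented part of the model.

```python
# Sketch only: reuses nlp, word_tokenize and classify_term defined above.
# The filtering rules here are illustrative assumptions.
text = "..."  # paste the full quantum-computing example text from above

doc = nlp(text)
entity_words = {tok.text for ent in doc.ents for tok in ent}  # words inside named entities

scientific_words = []
for w in word_tokenize(text):
    if not w.isalpha() or w in entity_words:
        continue  # skip punctuation, numbers, and named entities
    if classify_term(w) == "Scientific":
        scientific_words.append(w)

print(scientific_words)
```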

## results_bert-finetuned-ner

This model is a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1763
- Precision: 0.9487
- Recall: 0.9068
- F1: 0.9273
- Accuracy: 0.9695
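
The evaluation script itself is not spelled out in this card, but the referenced dataset can be inspected directly; the snippet below only loads it (split and column names are whatever the dataset defines and are not documented here).

```python
# Hedged sketch: load the evaluation dataset referenced above and inspect it.
from datasets import load_dataset

ds = load_dataset("JonyC/ScienceGlossary")
print(ds)  # shows the available splits and columns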

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 7e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 35
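
For reference, the hyperparameters above roughly correspond to a `TrainingArguments` configuration along these lines; this is a hedged sketch, and the output directory and any settings not listed above are placeholders, not the values actually used.

```python
# Sketch only: mirrors the hyperparameters listed above; other settings are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results_bert-finetuned-ner",   # placeholder output path
    learning_rate=7e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_torch",            # AdamW; betas=(0.9, 0.999) and eps=1e-8 are the defaults
    lr_scheduler_type="linear",
    num_train_epochs=35,
)
```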