JonyC committed · Commit 4fb4368 · verified · 1 Parent(s): 1f5a417

Update README.md

Files changed (1): README.md (+85 -3)
---
license: apache-2.0
language:
- en
base_model:
- allenai/scibert_scivocab_uncased
tags:
- Science
- classifier
- words
---
<b><span style="color:red;">IMPORTANT! READ THIS!</span></b>

## Model description
This model recognizes scientific terms in a given text. The best way to use it is as follows:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize
import torch
import spacy

# You might want to use spaCy to remove named entities from the text
# (the model usually predicts them as scientific); see the sketch after this block.
nlp = spacy.load("en_core_web_sm")
# doc = nlp(text)
# names = [ent.text for ent in doc.ents]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-science-word-classifier")
model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-science-word-classifier").to(device)
model.eval()

# Define max_len as needed.
def classify_term(term, max_len=12):
    term = term.lower()
    tokens = tokenizer(term, return_tensors="pt", truncation=True,
                       padding=True, max_length=max_len).to(device)
    with torch.no_grad():
        logits = model(**tokens).logits  # shape: [1, seq_len, 2]
    # Aggregate the per-token predictions with a majority vote over the
    # content tokens (skipping [CLS] and [SEP]).
    token_preds = logits.argmax(dim=-1)[0, 1:-1]
    pred = 1 if token_preds.float().mean().item() >= 0.5 else 0
    return "Scientific" if pred == 1 else "Non-Scientific"

# For a single term:
print(classify_term("quantum mechanics"))
print(classify_term("table"))
print(classify_term("photosynthesis"))

# For sentences:
words = word_tokenize("some sentence")  # you can also use sentence.split()
results = [classify_term(w) for w in words]

for w, p in zip(words, results):
    print(f"Word: {w}, Predicted Label: {p}")
```
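
The spaCy lines above are left commented out; here is a minimal sketch of how that entity filter could be wired in, building on the snippet above (`scientific_terms` is a hypothetical helper name, not part of the model):

```python
# Hypothetical helper (assumes nlp and classify_term from the snippet above):
# skip named entities, since the model tends to flag them as scientific.
def scientific_terms(text):
    doc = nlp(text)
    entity_tokens = {tok.text for ent in doc.ents for tok in ent}
    return [
        tok.text for tok in doc
        if tok.is_alpha
        and tok.text not in entity_tokens
        and classify_term(tok.text) == "Scientific"
    ]

# "Marie", "Curie" and "Paris" are filtered out before classification,
# so only words like "radioactivity" can come back as scientific.
print(scientific_terms("Marie Curie studied radioactivity in Paris."))
```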

## Example usage
Given the following text:

"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition.
One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are.
This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors.
Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers."

the words the model classifies as scientific are:
```
['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits', 'property', 'superposition', 'entanglement', 'matter', 'factor', 'state', 'scale']
```
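
Since the checkpoint is a token-classification model, it should also work with the standard `pipeline` API instead of the manual loop above; a minimal sketch (the label names returned depend on the checkpoint's `id2label` config, which this card does not specify):

```python
from transformers import pipeline

# "simple" aggregation merges consecutive subword tokens that share a label.
clf = pipeline(
    "token-classification",
    model="JonyC/scibert-science-word-classifier",
    aggregation_strategy="simple",
)

for hit in clf("Qubits rely on superposition and entanglement."):
    # entity_group falls back to LABEL_0/LABEL_1 if id2label is not set.
    print(hit["word"], hit["entity_group"], round(hit["score"], 3))
```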

## Training details

This model is a fine-tuned version of [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset.
It achieves the following results on the evaluation set (a sketch of how such metrics are computed follows the list):
- Loss: 0.1763
- Precision: 0.9487
- Recall: 0.9068
- F1: 0.9273
- Accuracy: 0.9695
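
These are standard seqeval-style token-classification metrics; a minimal sketch of how such numbers are typically computed with the `evaluate` library (the IOB label names below are placeholders, not the model's actual label set):

```python
import evaluate

seqeval = evaluate.load("seqeval")

# Placeholder IOB-tagged predictions and references, one list per sentence.
predictions = [["O", "B-SCI", "O"], ["B-SCI", "I-SCI"]]
references = [["O", "B-SCI", "O"], ["B-SCI", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"],
      results["overall_f1"], results["overall_accuracy"])
```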

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after the list):
- learning_rate: 7e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 35
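
A minimal sketch of how these hyperparameters map onto `TrainingArguments`, assuming the usual token-classification setup (dataset tokenization and label alignment are omitted; `tokenized_ds` and `data_collator` are placeholders):

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="results_bert-finetuned-ner",
    learning_rate=7e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    seed=42,
    optim="adamw_torch",        # AdamW with betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    num_train_epochs=35,
)

trainer = Trainer(
    model=model,                               # the model loaded above
    args=args,
    train_dataset=tokenized_ds["train"],       # placeholder dataset split
    eval_dataset=tokenized_ds["validation"],   # placeholder dataset split
    data_collator=data_collator,               # placeholder collator
)
trainer.train()
```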