---
license: apache-2.0
language:
- en
base_model:
- allenai/scibert_scivocab_uncased
tags:
- Science
- classifier
- words
---

<b><span style="color:red;">IMPORTANT! READ THIS!</span></b>

## Model description

This model recognizes scientific terms in a given *text*. The best way to use it is as follows:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data
import torch
import spacy

# Optional: use spaCy to find named entities and remove them from the text
# first (the model usually predicts them as scientific).
nlp = spacy.load("en_core_web_sm")
# doc = nlp(text)
# names = [ent.text for ent in doc.ents]

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-science-word-classifier")
model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-science-word-classifier")
model.to(device)
model.eval()  # inference mode: disables dropout

# Define max_len as needed.
def classify_term(term, max_len=12):
    term = term.lower()
    tokens = tokenizer(term, return_tensors="pt", truncation=True, padding=True, max_length=max_len).to(device)
    with torch.no_grad():  # no gradients are needed for prediction
        output = model(**tokens).logits
    # argmax without `dim` returns an index into the flattened logits tensor.
    pred = torch.argmax(output).item()

    return "Scientific" if pred == 1 else "Non-Scientific"

# For a single term:
print(classify_term("quantum mechanics"))
print(classify_term("table"))
print(classify_term("photosynthesis"))

# For sentences:
words = word_tokenize("some sentence")  # you can also use sentence.split()
results = [classify_term(w) for w in words]

for w, p in zip(words, results):
    print(f"Word: {w}, Predicted Label: {p}")
```
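
The commented-out spaCy lines above suggest removing named entities before classification. A minimal sketch of that filtering step in plain Python could look like the following. `remove_entity_words` is a hypothetical helper (it is not part of this model or of spaCy), and it assumes the entity strings have already been collected, for example with `[ent.text for ent in doc.ents]`.

```python
def remove_entity_words(words, entity_texts):
    """Drop every word that occurs inside one of the given entity strings.

    `entity_texts` is assumed to be a list of entity strings, such as the
    `names` list built from spaCy's `doc.ents`.
    """
    # Split multi-word entities ("Albert Einstein") into single words and
    # compare case-insensitively.
    entity_words = {part.lower() for text in entity_texts for part in text.split()}
    return [w for w in words if w.lower() not in entity_words]


words = ["Albert", "Einstein", "studied", "relativity", "in", "Berlin"]
names = ["Albert Einstein", "Berlin"]  # hypothetical spaCy output
print(remove_entity_words(words, names))  # ['studied', 'relativity', 'in']
```

The remaining words can then be passed one by one to `classify_term`.
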

## Example usage

Given the following text:

"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition.

One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are.

This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors.

Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers."

The model classified the following words as scientific:

```
['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits', 'property', 'superposition', 'entanglement', 'matter', 'factor', 'state', 'scale']
```
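
A list like the one above can be produced by classifying each distinct word of the text once and keeping those labeled "Scientific". The sketch below is one possible way to do that, not necessarily how the list above was generated. `scientific_words` and `fake_classify` are hypothetical names, the regex tokenization is an assumption, and the stand-in classifier is there only so the example runs without downloading the model. In practice you would pass `classify_term` instead.

```python
import re


def scientific_words(text, classify):
    """Return the distinct words of `text` that `classify` labels "Scientific".

    `classify` is any callable that maps a word to "Scientific" or
    "Non-Scientific". Words are kept in order of first appearance, and
    duplicates are detected case-sensitively.
    """
    seen = set()
    found = []
    # Assumed tokenization: runs of letters, allowing internal ' or -.
    for word in re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text):
        if word in seen:
            continue
        seen.add(word)
        if classify(word) == "Scientific":
            found.append(word)
    return found


# Stand-in classifier so the sketch runs without the real model.
def fake_classify(word):
    return "Scientific" if word.lower() in {"quantum", "qubits"} else "Non-Scientific"


print(scientific_words("Quantum computers use qubits; quantum bits are qubits.", fake_classify))
# ['Quantum', 'qubits', 'quantum']
```
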

## Results for `scibert-science-word-classifier`

This model is a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset.

It achieves the following results on the evaluation set:

- Loss: 0.1763
- Precision: 0.9487
- Recall: 0.9068
- F1: 0.9273
- Accuracy: 0.9695

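
As a quick consistency check on the metrics above, assuming the reported F1 is the usual harmonic mean of the reported precision and recall, the value can be recomputed in a few lines:

```python
precision = 0.9487
recall = 0.9068

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9273
```

The recomputed value agrees with the reported F1 to four decimal places.
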

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 7e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 35
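
To illustrate what a `linear` scheduler does with the learning rate listed above, here is a small plain-Python sketch of linear decay to zero. It rests on two assumptions that are not stated in the hyperparameters: no warmup steps, and a hypothetical 100 optimizer steps per epoch (the real number depends on the dataset size). `linear_lr` is an illustrative helper, not a library function.

```python
def linear_lr(step, total_steps, base_lr=7e-05, warmup_steps=0):
    """Learning rate at `step` for linear decay with optional linear warmup."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Afterwards, decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 35 * 100  # 35 epochs x a hypothetical 100 steps per epoch
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at 7e-05, is about half that midway through training, and reaches zero at the final step.
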