	---
	license: apache-2.0
	language:
	- en
	base_model:
	- allenai/scibert_scivocab_uncased
	tags:
	- Science
	- classifier
	- words
	---
	<b><span style="color:red;">IMPORTANT! READ THIS!</span></b>
	## Model description
	This model recognizes scientific terms in a given text. The best way to use it is as follows:
	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from nltk.tokenize import word_tokenize  # requires NLTK tokenizer data to be downloaded
	import torch
	import spacy

	# Optional: remove named entities from the text first (the model usually predicts them as scientific).
	nlp = spacy.load("en_core_web_sm")
	# doc = nlp(text)
	# names = [ent.text for ent in doc.ents]

	# Run on the GPU when one is available, otherwise on the CPU.
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-science-word-classifier")
	model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-science-word-classifier").to(device)
	model.eval()  # inference mode: disables dropout

	# Adjust max_len as needed.
	def classify_term(term, max_len=12):
	    term = term.lower()
	    tokens = tokenizer(term, return_tensors="pt", truncation=True, padding=True, max_length=max_len).to(device)
	    with torch.no_grad():  # no gradients are needed for prediction
	        output = model(**tokens).logits
	    # torch.argmax without `dim` returns the index of the largest value in the flattened logits.
	    pred = torch.argmax(output).item()

	    return "Scientific" if pred == 1 else "Non-Scientific"

	# For a single term:
	print(classify_term("quantum mechanics"))
	print(classify_term("table"))
	print(classify_term("photosynthesis"))

	# For sentences:
	words = word_tokenize("some sentence")  # you can also use sentence.split()
	results = [classify_term(w) for w in words]

	for w, p in zip(words, results):
	    print(f"Word: {w}, Predicted Label: {p}")
	```
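
The usage snippet suggests removing named entities before classification, because the model tends to label them as scientific. The following is a minimal sketch of that filtering step, not the author's implementation. The helper name `drop_entity_words` and the hard-coded entity list are assumptions made for illustration; with spaCy, the entity texts would come from `[ent.text for ent in nlp(text).ents]`.

```python
def drop_entity_words(words, entity_texts):
    """Return `words` without any word that occurs inside a named-entity span."""
    # Split multi-word entities ("Marie Curie") into their individual words.
    entity_words = {part.lower() for ent in entity_texts for part in ent.split()}
    return [w for w in words if w.lower() not in entity_words]

# Hypothetical, hard-coded entity list so the example runs without spaCy.
words = ["Marie", "Curie", "studied", "radioactivity", "in", "Paris"]
entities = ["Marie Curie", "Paris"]
print(drop_entity_words(words, entities))  # ['studied', 'radioactivity', 'in']
```

Matching is done word by word and ignores case, so every occurrence of an entity word is dropped, even where it appears outside an entity span. The remaining words can then be passed to `classify_term`.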


	## Example usage
	Given the following text:
	"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition.
	One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are.
	This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors.
	Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers."

	The model classified the following words as scientific:<br>
	```
	['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits', 'property', 'superposition', 'entanglement', 'matter', 'factor', 'state', 'scale']
	```
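
A list like the one above can be assembled from per-word predictions. The sketch below shows one possible way, assuming a `classify` callable that returns `"Scientific"` or `"Non-Scientific"` as `classify_term` does. The helper `scientific_words` and the stand-in `fake_classify` are hypothetical names used only so the example runs without downloading the model; skipping punctuation and case-sensitive de-duplication are also assumptions.

```python
def scientific_words(words, classify):
    """Return the words labeled "Scientific", skipping punctuation and repeats."""
    seen = set()
    kept = []
    for word in words:
        if not any(ch.isalnum() for ch in word):
            continue  # skip pure punctuation tokens such as "," or "("
        if word in seen:
            continue  # keep only the first occurrence (case-sensitive)
        seen.add(word)
        if classify(word) == "Scientific":
            kept.append(word)
    return kept

# Hypothetical stand-in for classify_term, used only to make the example runnable.
def fake_classify(word):
    return "Scientific" if word.lower() in {"quantum", "qubits"} else "Non-Scientific"

words = "Quantum computers use qubits , and quantum effects".split()
print(scientific_words(words, fake_classify))  # ['Quantum', 'qubits', 'quantum']
```

Because de-duplication is case-sensitive, both `Quantum` and `quantum` can appear in the output.
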
	## Results for scibert-science-word-classifier

	This model is a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.1763
	- Precision: 0.9487
	- Recall: 0.9068
	- F1: 0.9273
	- Accuracy: 0.9695
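
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two reported values. The variable names below are chosen for illustration.

```python
precision = 0.9487
recall = 0.9068

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9273
```

The recomputed value agrees with the reported F1 of 0.9273.
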
	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 7e-05
	- train_batch_size: 128
	- eval_batch_size: 128
	- seed: 42
	- optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
	- lr_scheduler_type: linear
	- num_epochs: 35
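
To illustrate what `lr_scheduler_type: linear` means for the learning rate, here is a minimal sketch of a linear schedule written in plain Python. It is not the training code. The function name `linear_lr`, the absence of warmup (`warmup_steps=0`), and the figure of 100 steps per epoch are assumptions, since neither the warmup setting nor the dataset size is stated above.

```python
def linear_lr(step, total_steps, base_lr=7e-05, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 35 epochs times a hypothetical 100 optimizer steps per epoch.
total_steps = 35 * 100
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at 7e-05, is roughly half that at the midpoint of training, and reaches zero at the final step.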