---
base_model: unsloth/phi-4-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
---
# Uploaded model
- **Developed by:** aksw
- **License:** apache-2.0
- **Finetuned from model :** unsloth/phi-4-unsloth-bnb-4bit
This Phi-4 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
## 📄 Model Card: `aksw/Bike-name`
### 🧠 Model Overview
`Bike-name` is a fine-tuned Phi 4 (14B) language model designed to **extract biochemical compound names from scientific articles**. It is well suited for information retrieval systems based on biochemical knowledge extraction.
---
### 🚨 Disclaimer
This model must not be used for comparison with other methods in the BiKE challenge or in scientific articles based on the NatUKE benchmark, because it was fine-tuned on all of the benchmark data, meaning part of the NatUKE test data was seen during training. It is intended for exploration on other benchmarks, or for future BiKE challenges whose test sets do not come from the NatUKE test sets.
---
### 🔍 Intended Use
* **Input**: Text from a Biochemical PDF file
* **Output**: A **single list** containing the corresponding biochemical names from the text.
---
### 🧩 Applications
* Question Answering systems over Biochemical Datasets
* Biochemical Knowledge graph exploration tools
* Extraction of biochemical names from scientific text articles
---
### ⚙️ Model Details
* **Base model**: Phi 4 14B (via Unsloth)
* **Training**: Scientific text articles
* 418 unique names
* 143 articles
* **Target Ontology**: NatUKE Benchmark (https://github.com/AKSW/natuke)
* **Frameworks**: Unsloth, HuggingFace, Transformers
---
### 📦 Installation
Make sure to install `unsloth`, `torch`, and the CUDA dependencies:
```bash
pip install unsloth torch
```
---
### 🧪 Example: Inference Code
```python
import ast

import torch
from unsloth import FastLanguageModel


class BiKECompoundNameExtractor:
    def __init__(self, model_name: str, max_seq_length: int = 32768, load_in_4bit: bool = True):
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            load_in_4bit=load_in_4bit,
        )
        FastLanguageModel.for_inference(self.model)

    def build_prompt(self, article_text: str) -> list:
        return [
            {"role": "system", "content": (
                "You are a scientist trained in chemistry.\n"
                "You must extract information from scientific papers identifying relevant properties associated with each natural product discussed in the academic publication.\n"
                "For each paper, you have to analyze the content (text) to identify the *Compound name*. It can be more than one compound name.\n"
                "Your output should be a list with the names. Return only the list, without any additional information.\n"
            )},
            {"role": "user", "content": article_text},
        ]

    def extract_compound_name(self, article_text: str, temperature: float = 0.01, max_new_tokens: int = 1024):
        start_marker = "<|im_start|>assistant<|im_sep|>"
        end_marker = "<|im_end|>"
        messages = self.build_prompt(article_text)
        inputs = self.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        outputs = self.model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            temperature=temperature,
            min_p=0.1,
        )
        decoded = self.tokenizer.batch_decode(outputs)[0]
        # Keep only the assistant's reply, stripping the chat-template markers.
        parsed = decoded[decoded.find(start_marker):].replace(start_marker, "").replace(end_marker, "")
        try:
            # Safely parse the expected Python-list output (avoids eval on model output).
            result = ast.literal_eval(parsed)
        except (ValueError, SyntaxError):
            result = parsed
            print("Your output is not a list; one more post-processing step is needed.")
        return result


# --- Using the model ---
if __name__ == "__main__":
    extractor = BiKECompoundNameExtractor(model_name="aksw/Bike-name")
    text = "Title, Abstract, Introduction, Background, Method, Results, Conclusion, References."
    list_names = extractor.extract_compound_name(text)
    print(list_names)
```
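When the model does not return a clean Python list (for example, it wraps the list in a Markdown code fence or emits one name per line), a small post-processing helper can normalize the output. This is a hedged sketch, not part of the released model: the function name `parse_compound_list` and its fallback heuristics are illustrative assumptions.

```python
import ast
import re


def parse_compound_list(raw: str) -> list:
    """Normalize raw model output into a list of compound names (illustrative helper)."""
    text = raw.strip()
    # Strip a surrounding Markdown code fence, if present.
    if text.startswith("```"):
        lines = text.splitlines()
        body = lines[1:-1] if lines and lines[-1].startswith("```") else lines[1:]
        text = "\n".join(body).strip()
    try:
        # Preferred path: the output is already a Python list literal.
        value = ast.literal_eval(text)
        if isinstance(value, list):
            return [str(v).strip() for v in value]
    except (ValueError, SyntaxError):
        pass
    # Fallback: treat the output as newline- or comma-separated names.
    parts = re.split(r"[\n,]", text)
    return [p.strip(" -*'\"") for p in parts if p.strip(" -*'\"")]
```

This keeps the extraction pipeline robust to small formatting drift without re-prompting the model.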
---
### 🧪 Evaluation
The model was evaluated using Hits@k on the test sets of the NatUKE benchmark (do Carmo et al., 2023).
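Hits@k counts a query as a hit when the gold answer appears among the top-k predicted candidates. A minimal sketch of the metric follows; the exact NatUKE evaluation protocol may differ in details such as tie-breaking and name normalization.

```python
def hits_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose gold answer appears in the top-k ranked predictions."""
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)
```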
---
Do Carmo, Paulo Viviurka, et al. "NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature." 2023 IEEE 17th International Conference on Semantic Computing (ICSC). IEEE, 2023.
### 📚 Citation
If you use this model in your work, please cite it as:
```
@inproceedings{ref:doCarmo2025,
title={Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models},
author={Viviurka do Carmo, Paulo and Silva G{\^o}lo, Marcos Paulo and Gwozdz, Jonas and Marx, Edgard and Marcondes Marcacini, Ricardo},
booktitle={Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing},
pages={980--987},
year={2025}
}
```