Uploaded model
- Developed by: aksw
- License: apache-2.0
- Finetuned from model: unsloth/phi-4-unsloth-bnb-4bit
This llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
📄 Model Card: aksw/Bike-name
🧠 Model Overview
Bike-name is a medium-sized fine-tuned language model designed to extract biochemical names from the text of scientific articles. It is well suited to information retrieval systems built on biochemical knowledge extraction.
🚨 Disclaimer
This model must not be used for comparison with other methods in the Bike challenge, or in scientific articles that report results on the NatUKE Benchmark, because it was fine-tuned on all of the benchmark data, including part of the NatUKE test data. It is intended for exploration on other benchmarks, or for future Bike challenges whose test sets are not drawn from the NatUKE test sets.
🔍 Intended Use
- Input: Text from a Biochemical PDF file
- Output: A single list containing the corresponding biochemical names from the text.
🧩 Applications
- Question Answering systems over Biochemical Datasets
- Biochemical Knowledge graph exploration tools
- Extraction of biochemical names from scientific text articles
⚙️ Model Details
- Base model: Phi 4 14B (via Unsloth)
- Training: Scientific text articles
  - 418 unique names
  - 143 articles
- Target Ontology: NatUKE Benchmark (https://github.com/AKSW/natuke)
- Frameworks: Unsloth, HuggingFace, Transformers
📦 Installation
Make sure to install unsloth, torch and CUDA dependencies:
pip install unsloth torch
🧪 Example: Inference Code
import ast
from typing import Union

from unsloth import FastLanguageModel
import torch


class SPARQLQueryGenerator:
    def __init__(self, model_name: str, max_seq_length: int = 32768, load_in_4bit: bool = True):
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            load_in_4bit=load_in_4bit
        )
        # Put the model into Unsloth's inference mode
        _ = FastLanguageModel.for_inference(self.model)

    def build_prompt(self, article_text: str) -> list:
        # Chat messages: fixed system instruction plus the article text as the user turn
        return [
            {"role": "system", "content": (
                "You are a scientist trained in chemistry.\n"
                "You must extract information from scientific papers identifying relevant properties associated with each natural product discussed in the academic publication.\n"
                "For each paper, you have to analyze the content (text) to identify the *Compound name*. It can be more than one compound name.\n"
                "Your output should be a list with the names. Return only the list, without any additional information.\n"
            )},
            {"role": "user", "content": article_text}
        ]

    def generate_query(self, article_text: str, temperature: float = 0.01, max_new_tokens: int = 1024) -> Union[list, str]:
        si = "<|im_start|>assistant<|im_sep|>"
        sf = "<|im_end|>"
        messages = self.build_prompt(article_text)
        inputs = self.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        outputs = self.model.generate(inputs, max_new_tokens=max_new_tokens, use_cache=True, temperature=temperature, min_p=0.1)
        decoded = self.tokenizer.batch_decode(outputs)[0]
        # Keep only the assistant turn and drop the chat markers
        parsed = decoded[decoded.find(si):].replace(si, "").replace(sf, "").strip()
        try:
            # literal_eval only accepts Python literals, so model output is never executed as code
            names = ast.literal_eval(parsed)
        except (ValueError, SyntaxError):
            names = parsed
            print('Your output is not a list, you will need one more preprocessing step.')
        return names


# --- Using the model ---
if __name__ == "__main__":
    generator = SPARQLQueryGenerator(model_name="aksw/Bike-name")
    text = "Title, Abstract, Introduction, Background, Method, Results, Conclusion, References."
    list_names = generator.generate_query(text)
    print(list_names)
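When the example prints "Your output is not a list", the raw string still has to be turned into names. The helper below is one possible sketch of that extra preprocessing step. The function name `parse_name_list` and the fallback rules (one name per line or per semicolon) are assumptions for illustration and are not part of the released model code. Commas are deliberately not used as separators, since chemical names such as "3,4-dihydroxybenzoic acid" contain them.

```python
import ast
import re
from typing import List


def parse_name_list(raw_output: str) -> List[str]:
    """Best-effort conversion of raw model output into a list of names (hypothetical helper)."""
    text = raw_output.strip()

    # First attempt: find a bracketed span and parse it as a Python list literal.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match:
        try:
            value = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            value = None
        if isinstance(value, (list, tuple)):
            return [str(v).strip() for v in value if str(v).strip()]

    # Fallback (assumed format): one name per line, or names separated by semicolons.
    names = []
    for part in re.split(r"[\n;]", text):
        # Drop bullet markers, then stray quotes, brackets and trailing commas.
        name = part.strip().lstrip("-*• ").strip().strip("'\"[],").strip()
        if name:
            names.append(name)
    return names


print(parse_name_list("['quercetin', '3,4-dihydroxybenzoic acid']"))
print(parse_name_list("- quercetin\n- rutin"))
```

The fallback is intentionally conservative; outputs in other shapes may need rules of their own.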
🧪 Evaluation
The model was evaluated using Hits@k on the test sets of the NatUKE Benchmark (do Carmo et al., 2023).
Do Carmo, Paulo Viviurka, et al. "NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature." 2023 IEEE 17th International Conference on Semantic Computing (ICSC). IEEE, 2023.
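Hits@k is commonly defined as the share of test items whose correct answer appears among the top k predictions. The sketch below follows that general definition. The function name `hits_at_k`, the case-insensitive exact matching, and the toy data are assumptions for illustration; this is not the NatUKE evaluation code, whose matching rules may differ.

```python
from typing import Sequence


def hits_at_k(ranked_predictions: Sequence[Sequence[str]],
              gold_names: Sequence[str],
              k: int) -> float:
    """Fraction of examples whose gold name occurs in the top-k predictions."""
    if len(ranked_predictions) != len(gold_names):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_names:
        return 0.0

    hits = 0
    for preds, gold in zip(ranked_predictions, gold_names):
        # Assumed matching rule: exact match after trimming and lower-casing.
        top_k = {p.strip().lower() for p in preds[:k]}
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_names)


# Toy data, invented for the example.
predictions = [
    ["quercetin", "kaempferol", "rutin"],
    ["caffeine", "theobromine"],
]
gold = ["Kaempferol", "theophylline"]

print(hits_at_k(predictions, gold, k=1))  # 0.0
print(hits_at_k(predictions, gold, k=3))  # 0.5
```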
📚 Citation
If you use this model in your work, please cite it as:
@inproceedings{ref:doCarmo2025,
  title={Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models},
  author={Viviurka do Carmo, Paulo and Silva G{\^o}lo, Marcos Paulo and Gwozdz, Jonas and Marx, Edgard and Marcondes Marcacini, Ricardo},
  booktitle={Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing},
  pages={980--987},
  year={2025}
}