---
base_model: unsloth/phi-4-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** aksw
- **License:** apache-2.0
- **Finetuned from model:** unsloth/phi-4-unsloth-bnb-4bit

This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

## 📄 Model Card: `aksw/Bike-name`

### 🧠 Model Overview

`Bike-name` is a medium-sized fine-tuned language model designed to **extract biochemical names from scientific text articles**. It is intended for Information Retrieval systems based on Biochemical Knowledge Extraction.

---

### 🚨 Disclaimer

This model must not be compared with other methods in the BiKE challenge or in scientific articles that use the NatUKE Benchmark, because it was fine-tuned on all of the benchmark data, including some of the NatUKE test data. It is intended for exploration on other benchmarks, or for future BiKE challenges whose test sets do not come from the NatUKE test sets.

---

### 🔍 Intended Use

* **Input**: Text extracted from a biochemical PDF article
* **Output**: A **single list** containing the biochemical names found in the text.
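
Text extracted from PDFs often carries hard line breaks and words hyphenated across lines. The sketch below shows one possible cleanup pass before the text is sent to the model. The helper `clean_pdf_text` is hypothetical and is not part of the model's training pipeline; treat it as an assumption about what reasonable input looks like.

```python
import re

def clean_pdf_text(raw: str) -> str:
    """Hypothetical helper: tidy PDF-extracted text before inference."""
    # Re-join lowercase words split by a hyphen at a line break:
    # "flavo-\nnoid" becomes "flavonoid".
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", raw)
    # Turn single line breaks into spaces; blank-line paragraph breaks stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Note that the de-hyphenation rule can wrongly merge compound names that legitimately contain a hyphen at a line break (for example `alpha-` followed by `pinene`), so it may be safer to disable that step for chemistry-heavy passages.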

---

### 🧩 Applications

* Question answering systems over biochemical datasets
* Biochemical knowledge graph exploration tools
* Extraction of biochemical names from scientific text articles

---

### ⚙️ Model Details

* **Base model**: Phi 4 14B (via Unsloth)
* **Training**: Scientific text articles
  * 418 unique names
  * 143 articles
* **Target Ontology**: NatUKE Benchmark (https://github.com/AKSW/natuke)
* **Frameworks**: Unsloth, Hugging Face Transformers

---

### 📦 Installation

Make sure to install `unsloth`, `torch` and CUDA dependencies:

```bash
pip install unsloth torch
```

---

### 🧪 Example: Inference Code

```python
import ast
from typing import Union

from unsloth import FastLanguageModel
import torch

class BiKECompoundNameExtractor:
    def __init__(self, model_name: str, max_seq_length: int = 32768, load_in_4bit: bool = True):
        # Load the fine-tuned model together with its tokenizer.
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            load_in_4bit=load_in_4bit
        )
        # Switch the model to Unsloth's inference mode.
        FastLanguageModel.for_inference(self.model)

    def build_prompt(self, article_text: str) -> list:
        # System instructions followed by the article text as the user turn.
        return [
            {"role": "system", "content": (
                "You are a scientist trained in chemistry.\n"
                "You must extract information from scientific papers identifying relevant properties associated with each natural product discussed in the academic publication.\n"
                "For each paper, you have to analyze the content (text) to identify the *Compound name*. It can be more than one compound name.\n"
                "Your output should be a list with the names. Return only the list, without any additional information.\n"
            )},
            {"role": "user", "content": article_text}
        ]

    def extract_compound_name(self, article_text: str, temperature: float = 0.01, max_new_tokens: int = 1024) -> Union[list, str]:
        # Markers that delimit the assistant turn in the decoded output.
        si = "<|im_start|>assistant<|im_sep|>"
        sf = "<|im_end|>"
        messages = self.build_prompt(article_text)
        inputs = self.tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        outputs = self.model.generate(inputs, max_new_tokens=max_new_tokens, use_cache=True, temperature=temperature, min_p=0.1)
        decoded = self.tokenizer.batch_decode(outputs)[0]
        # Keep only the text generated after the assistant marker.
        parsed = decoded.split(si)[-1].replace(sf, "").strip()
        try:
            # literal_eval accepts only Python literals, so unlike eval()
            # it cannot execute arbitrary code contained in the model output.
            names = ast.literal_eval(parsed)
        except (ValueError, SyntaxError):
            names = parsed
            print('Your output is not a list, you will need one more preprocessing step.')

        return names

# --- Using the model ---
if __name__ == "__main__":
    extractor = BiKECompoundNameExtractor(model_name="aksw/Bike-name")
    text = "Title, Abstract, Introduction, Background, Method, Results, Conclusion, References."
    list_names = extractor.extract_compound_name(text)
    print(list_names)
```
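
When the generated text is not a valid Python list literal, the example above returns the raw string and reports that one more preprocessing step is needed. The following is a minimal sketch of what such a step could look like. The helper `parse_name_list` is hypothetical and not part of the released code, and its fallback assumes that names are separated by commas or line breaks.

```python
import ast
import re

def parse_name_list(raw_output: str) -> list:
    """Hypothetical helper: turn raw model output into a list of names."""
    text = raw_output.strip()
    try:
        # First try to read the output as a Python list or tuple literal.
        value = ast.literal_eval(text)
        if isinstance(value, (list, tuple)):
            return [str(v).strip() for v in value if str(v).strip()]
    except (ValueError, SyntaxError):
        pass
    # Fallback: drop outer brackets, split on commas or line breaks,
    # then strip whitespace and quote characters from each piece.
    pieces = re.split(r"[,\n]", text.strip("[]"))
    cleaned = [p.strip().strip("'\"").strip() for p in pieces]
    return [name for name in cleaned if name]
```

The comma-based fallback will split names that themselves contain commas, such as `1,8-cineole`, so results from that path should be checked by hand.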

---

### 🧪 Evaluation

The model was evaluated using Hits@k on the test sets of the NatUKE Benchmark (do Carmo et al., 2023).
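
As an illustration, the sketch below computes Hits@k under one common reading of the metric: the fraction of test items whose gold name appears among the first k predicted names. The function name `hits_at_k`, the case-insensitive exact matching, and the data layout are assumptions made here; the exact protocol used in the NatUKE Benchmark may differ.

```python
def hits_at_k(ranked_predictions, gold_names, k):
    """Fraction of items whose gold name is in the top-k predictions.

    ranked_predictions: one ranked list of predicted names per test item.
    gold_names: one gold name per test item, in the same order.
    Matching is case-insensitive exact match (an assumption of this sketch).
    """
    if not gold_names:
        return 0.0
    hits = 0
    for predictions, gold in zip(ranked_predictions, gold_names):
        top_k = [p.strip().lower() for p in predictions[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_names)
```

For example, calling `hits_at_k(predictions, gold, k=5)` on the extractor's outputs for a held-out set would give the share of articles for which the expected compound name was among the first five names returned.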

---

Do Carmo, Paulo Viviurka, et al. "NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature." 2023 IEEE 17th International Conference on Semantic Computing (ICSC). IEEE, 2023.


### 📚 Citation

If you use this model in your work, please cite it as:

```bibtex
@inproceedings{ref:doCarmo2025,
  title={Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models},
  author={Viviurka do Carmo, Paulo and Silva G{\^o}lo, Marcos Paulo and Gwozdz, Jonas and Marx, Edgard and Marcondes Marcacini, Ricardo},
  booktitle={Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing},
  pages={980--987},
  year={2025}
}
```