DavidLanz committed on
Commit c202b13 · verified · 1 Parent(s): 3d64040

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ text2cypher-gemma-2-9b-it-finetuned-2024v1-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,330 @@
---
license: gemma
library_name: transformers
pipeline_tag: text2text-generation
tags:
- conversational
- neo4j
- cypher
- text2cypher
base_model: google/gemma-2-9b-it
datasets:
- neo4j/text2cypher-2024v1
language:
- en
---

# Model Card for text2cypher-gemma-2-9b-it-finetuned-2024v1

<!-- Provide a quick summary of what the model is/does. -->


## Model Details

### Model Description

This model demonstrates how fine-tuning foundational models with the Neo4j-Text2Cypher (2024) dataset ([link](https://huggingface.co/datasets/neo4j/text2cypher-2024v1)) can enhance performance on the Text2Cypher task.\
Please **note** that this is part of ongoing research and exploration, aimed at highlighting the dataset's potential rather than providing a production-ready solution.


**Base model:** google/gemma-2-9b-it \
**Dataset:** neo4j/text2cypher-2024v1

An overview of the finetuned models and benchmarking results is shared at [Link1](https://medium.com/p/d77be96ab65a) and [Link2](https://medium.com/p/b2203d1173b0).

Have ideas or insights? Contact us: [Neo4j/Team-GenAI](mailto:[email protected])
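
For reference, the training data can be inspected directly from the Hugging Face Hub. The snippet below is a minimal sketch using the `datasets` library; the split and column names are not assumed here but printed, so check the dataset card for the exact schema.

```python
# Minimal sketch: peek at the Neo4j Text2Cypher (2024) dataset.
# Assumes `pip install datasets`; split/column names are printed rather than
# hard-coded, since they may differ from what you expect.
from datasets import load_dataset

ds = load_dataset("neo4j/text2cypher-2024v1")

print(ds)                             # available splits and their sizes
first_split = next(iter(ds))          # e.g. "train"
print(ds[first_split].column_names)   # natural-language question / schema / Cypher fields
print(ds[first_split][0])             # one example row
```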


<!-- - **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed] -->

<!-- ### Model Sources [optional]

<!-- Provide the basic links for the model. -->

<!-- - **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed] -->

<!-- ## Uses -->

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

<!-- ### Direct Use -->

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

<!-- [More Information Needed] -->

<!-- ### Downstream Use [optional] -->

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

<!-- [More Information Needed] -->

<!-- ### Out-of-Scope Use
-->
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

<!-- [More Information Needed] -->

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

We need to be cautious about a few risks:
* In our evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
* The datasets used were gathered from publicly available sources. Over time, foundational models may access both the training and test sets, potentially achieving similar or even better results.

Also check the related blog post: [Link](https://medium.com/p/b2203d1173b0)

<!-- ### Recommendations -->

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

<!-- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. -->

<!-- ## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed] -->

## Training Details

<!-- ### Training Data -->

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

<!-- [More Information Needed]-->

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Training used RunPod with the following setup:

* 1 x A100 PCIe
* 31 vCPU, 117 GB RAM
* runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
* On-Demand - Secure Cloud
* 60 GB Disk
* 60 GB Pod Volume
<!-- * ~16 hours
* $30 -->

<!-- #### Preprocessing [optional]

[More Information Needed]
-->

#### Training Hyperparameters

<!-- - **Training regime:** -->
<!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
* lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
  )
* sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
  )
* bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
  )
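
For context, a minimal sketch of how these three configurations could be wired together with TRL's `SFTTrainer` is shown below. It is illustrative only, not the exact training script: the dataset split, the `target_modules` value, and the `dataset_text_field` column are assumptions.

```python
# Illustrative sketch only -- not the exact script used to train this model.
# Assumes: pip install transformers peft trl bitsandbytes datasets
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base_model = "google/gemma-2-9b-it"
dataset_text_field = "text"      # assumed name of a pre-formatted prompt column
target_modules = "all-linear"    # assumed; the actual module list was not published

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=64, lora_alpha=64, target_modules=target_modules,
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4, gradient_accumulation_steps=8,
    dataset_num_proc=16, max_seq_length=1600, num_train_epochs=1,
    learning_rate=2e-5, optim="paged_adamw_8bit",
    logging_steps=5, save_steps=5, save_total_limit=1,
    save_strategy="steps", output_dir="outputs", logging_dir="./logs",
)

# Load the 4-bit quantized base model, then let SFTTrainer attach the LoRA adapters.
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
train_dataset = load_dataset("neo4j/text2cypher-2024v1", split="train")  # split name assumed

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```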

<!-- #### Speeds, Sizes, Times [optional] -->

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

<!-- [More Information Needed] -->

<!-- ## Evaluation -->

<!-- This section describes the evaluation protocols and provides the results. -->

<!-- ### Testing Data, Factors & Metrics -->

<!-- #### Testing Data -->

<!-- This should link to a Dataset Card if possible. -->

<!-- [More Information Needed] -->

<!-- #### Factors -->

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

<!-- [More Information Needed]

#### Metrics -->

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

<!-- [More Information Needed]

### Results

[More Information Needed]

#### Summary -->



<!-- ## Model Examination [optional]
-->
<!-- Relevant interpretability work for the model goes here -->

<!-- [More Information Needed]

## Environmental Impact -->

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

<!-- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]-->

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

<!-- **BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional] -->

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

<!-- [More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed] -->
### Framework versions

- PEFT 0.12.0

### Example Cypher generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DavidLanz/text2cypher-gemma-2-9b-it-finetuned-2024v1"

# Load the fine-tuned model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"

instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question} \n Cypher output: "
)

# Plain (non-chat) prompting.
prompt = instruction.format(schema=schema, question=question)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Cypher Query:", generated_text)

def prepare_chat_prompt(question, schema):
    # Wrap the instruction in a single-turn chat message.
    chat = [
        {
            "role": "user",
            "content": instruction.format(schema=schema, question=question),
        }
    ]
    return chat

def _postprocess_output_cypher(output_cypher: str) -> str:
    # Remove any explanation text and Markdown code-fence markers
    # so that only the Cypher statement remains.
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    output_cypher = output_cypher.lstrip("cypher\n")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher

# Alternative: chat-template prompting.
new_message = prepare_chat_prompt(question=question, schema=schema)
try:
    prompt = tokenizer.apply_chat_template(new_message, add_generation_prompt=True, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)
    chat_generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    final_cypher = _postprocess_output_cypher(chat_generated_text)
    print("Processed Cypher Query:", final_cypher)
except AttributeError:
    print("Error: `apply_chat_template` not supported by this tokenizer. Check compatibility.")
```
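
Once a Cypher statement has been generated, it can be executed against a Neo4j instance. The following is a minimal sketch using the official `neo4j` Python driver; the connection URI, credentials, and the hard-coded query (standing in for `final_cypher` from the snippet above) are placeholders, not values from this repository.

```python
# Minimal sketch: run a generated Cypher statement against a Neo4j database.
# Assumes `pip install neo4j` and a reachable Neo4j instance; URI, credentials,
# and the query below are placeholders.
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"   # placeholder
auth = ("neo4j", "password")    # placeholder

# Stand-in for the model output `final_cypher` produced above.
final_cypher = "MATCH (a:Actor {name: 'Tom Hanks'})-[:ActedIn]->(m:Movie) RETURN m"

with GraphDatabase.driver(uri, auth=auth) as driver:
    records, summary, keys = driver.execute_query(final_cypher)
    for record in records:
        print(record.data())
```
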
text2cypher-gemma-2-9b-it-finetuned-2024v1-Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:508b33acd8da81a4f101ca014948dffcabc1178ee56a14da571756ad9fc86778
+ size 6647366528
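
The commit also ships a Q5_K_M GGUF quantization of the model (the Git LFS pointer above, roughly 6.6 GB). Below is a minimal sketch of loading that file locally with the `llama-cpp-python` package; the local path, context size, and sampling settings are assumptions, not tested values.

```python
# Minimal sketch: run the Q5_K_M GGUF quantization with llama-cpp-python.
# Assumes `pip install llama-cpp-python` and that the .gguf file has been
# downloaded locally; the path and generation settings below are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="text2cypher-gemma-2-9b-it-finetuned-2024v1-Q5_K_M.gguf",
    n_ctx=2048,        # context window; adjust to the prompt length you need
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

schema = "(:Actor)-[:ActedIn]->(:Movie)"
question = "What are the movies of Tom Hanks?"
prompt = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    f"Schema: {schema} \n Question: {question} \n Cypher output: "
)

out = llm(prompt, max_tokens=256, temperature=0.0)
print(out["choices"][0]["text"])
```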