|
--- |
|
datasets: |
|
- neo4j/text2cypher-2024v1 |
|
base_model: |
|
- google/gemma-2-9b-it |
|
--- |
|
|
|
## Model Details |
|
This is the GGUF-format version of `neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1`.
|
|
|
### Model Description |
|
This model serves as a demonstration of how fine-tuning foundational models using the Neo4j-Text2Cypher(2024) Dataset (https://huggingface.co/datasets/neo4j/text2cypher-2024v1) can enhance performance on the Text2Cypher task. |
|
Please note that this is part of ongoing research and exploration, aimed at highlighting the dataset's potential; it is not a production-ready solution.
|
|
|
Base model: google/gemma-2-9b-it |
|
Dataset: neo4j/text2cypher-2024v1 |
|
|
|
An overview of the fine-tuned models and the benchmarking results can be found at https://medium.com/p/d77be96ab65a and https://medium.com/p/b2203d1173b0.
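For a quick look at the training data itself, the dataset can be loaded directly from the Hub. This is a minimal sketch using the Hugging Face `datasets` library; the exact column names should be checked against the dataset card.

```python
from datasets import load_dataset

# Load the training split of the Text2Cypher dataset.
ds = load_dataset("neo4j/text2cypher-2024v1", split="train")

print(ds)     # number of rows and column names
print(ds[0])  # a single example (natural-language question, schema, target Cypher)
```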
|
|
|
## Example Cypher generation |
|
```python |
|
import openai

# Prompt template used for the Text2Cypher task: the schema and the question
# are injected into a single instruction.
instruction = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question} \n Cypher output: "
)


def prepare_chat_prompt(question, schema):
    # Build the chat messages list expected by the OpenAI-compatible API.
    return [
        {
            "role": "user",
            "content": instruction.format(schema=schema, question=question),
        }
    ]


def _postprocess_output_cypher(output_cypher: str) -> str:
    # Keep only the Cypher statement: drop any trailing explanation and
    # strip Markdown code-fence markers such as ```cypher ... ```.
    partition_by = "**Explanation:**"
    output_cypher, _, _ = output_cypher.partition(partition_by)
    output_cypher = output_cypher.strip("`\n")
    output_cypher = output_cypher.removeprefix("cypher").lstrip("\n")
    output_cypher = output_cypher.strip("`\n ")
    return output_cypher


# Point the OpenAI client at your Ollama server's OpenAI-compatible endpoint.
# Adjust the base URL if Ollama is hosted at a different address/port.
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the client requires a value
)

# Model name as registered on your Ollama server.
model_name = "avinashm/text2cypher"

# Define the question and the graph schema.
question = "What are the movies of Tom Hanks?"
schema = "(:Actor)-[:ActedIn]->(:Movie)"

# Prepare the conversation messages.
messages = prepare_chat_prompt(question=question, schema=schema)

# Call the chat completions endpoint with conservative sampling parameters.
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.2,
    max_tokens=512,
    top_p=0.9,
)

# Extract and post-process the generated Cypher statement.
raw_output = response.choices[0].message.content
output = _postprocess_output_cypher(raw_output)

print(output)
|
``` |
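To close the loop, the generated statement can be executed against a Neo4j instance. The snippet below is a minimal sketch assuming a local database with placeholder credentials and the official `neo4j` Python driver; `output` is the post-processed Cypher string from the example above.

```python
from neo4j import GraphDatabase

# Placeholder connection details; replace with your own instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # `output` is the Cypher statement produced by the generation example above.
    for record in session.run(output):
        print(record.data())

driver.close()
```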
|
|
|
## Note on creating your own schemas

In the dataset we used, the schemas are already provided. They were created either by directly using the schema provided by the input data source, or by generating one with the neo4j-graphrag package (see the `SchemaReader.get_schema(...)` function). For your own Neo4j database, you can use the `SchemaReader` functions from the neo4j-graphrag package, or the example Cypher queries below.
|
### Example Cypher queries to retrieve the schema
|
```cypher |
|
CALL apoc.meta.schema()          // requires the APOC plugin
CALL db.schema.visualization()   // built-in schema introspection
|
``` |
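To turn such a call into the schema string expected by the prompt, something along these lines can be used. This is only a rough sketch with the official `neo4j` Python driver and assumed connection details; it is not the exact procedure used to build the dataset's schemas (those relied on the neo4j-graphrag tooling mentioned above).

```python
from neo4j import GraphDatabase

# Placeholder connection details; replace with your own instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # apoc.meta.schema() returns one record whose "value" column is a map
    # describing node labels and relationship types with their properties.
    meta = session.run("CALL apoc.meta.schema()").single()["value"]

driver.close()

# Flatten the APOC output into a compact, prompt-friendly schema string.
parts = []
for name, info in meta.items():
    props = ", ".join(info.get("properties", {}).keys())
    if info.get("type") == "node":
        parts.append(f"(:{name} {{{props}}})")
    elif info.get("type") == "relationship":
        parts.append(f"[:{name} {{{props}}}]")

schema = " ".join(parts)
print(schema)
```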
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
We need to be cautious about a few risks: |
|
|
|
- In our evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
- The datasets used were gathered from publicly available sources. Over time, foundational models may access both the training and test sets, potentially achieving similar or even better results.
|
|
|
## Training Details |
|
### Training Procedure
|
Used RunPod with the following setup:
|
```
1 x A100 PCIe
31 vCPU 117 GB RAM
runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
On-Demand - Secure Cloud
60 GB Disk
60 GB Pod Volume
```

### Training Hyperparameters

```python
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    dataset_text_field=dataset_text_field,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    dataset_num_proc=16,
    max_seq_length=1600,
    logging_dir="./logs",
    num_train_epochs=1,
    learning_rate=2e-5,
    save_steps=5,
    save_total_limit=1,
    logging_steps=5,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    save_strategy="steps",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
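For context, the sketch below shows roughly how these pieces fit together with `trl`'s `SFTTrainer`. It is an illustration under assumptions rather than the exact released recipe: the actual `target_modules`, `dataset_text_field`, and prompt formatting are not listed above, so placeholders are used (an "all-linear" LoRA target and a synthesized `text` column built from assumed `question`/`schema`/`cypher` columns).

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-2-9b-it"

# 4-bit quantization, matching the bnb_config above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Build a single training text per example (column names are assumptions).
prompt = (
    "Generate Cypher statement to query a graph database. "
    "Use only the provided relationship types and properties in the schema. \n"
    "Schema: {schema} \n Question: {question} \n Cypher output: "
)

def to_text(example):
    return {
        "text": prompt.format(schema=example["schema"], question=example["question"])
        + example["cypher"]
    }

dataset = load_dataset("neo4j/text2cypher-2024v1", split="train").map(to_text)

# LoRA setup; "all-linear" is a placeholder since the card does not list the modules.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# A trimmed-down SFTConfig mirroring the key hyperparameters above.
sft_config = SFTConfig(
    dataset_text_field="text",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    max_seq_length=1600,
    num_train_epochs=1,
    learning_rate=2e-5,
    optim="paged_adamw_8bit",
    output_dir="outputs",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```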
|
|