train_2025-05-05-15-36-22
This model is a fine-tuned version of ../pretrained/Qwen3-4B on the wikipedia_zh, petro_books, datasets001, datasets002, datasets003, datasets004, and datasets006 datasets.
Model description
Gaia-Petro-LLM is a large language model specialized in the oil and gas industry, fine-tuned from Qwen/Qwen3-4B. It was further pre-trained on a curated 20GB corpus of petroleum engineering texts, including technical documents, academic papers, and domain literature. The model is designed to support domain experts, researchers, and engineers in petroleum-related tasks, providing high-quality, domain-specific language understanding and generation.
Model Details
- Base Model: Qwen/Qwen3-8B
- Domain: Oil & Gas / Petroleum Engineering
- Corpus Size: ~20GB (petroleum engineering)
- Languages: Primarily Chinese; domain-specific English supported
- Repository: my2000cup/Gaia-LLM-8B
Intended uses & limitations
Technical Q&A in petroleum engineering Document summarization for oil & gas reports Knowledge extraction from unstructured domain texts Education & training in oil & gas technologies
Limitations:
- Not suitable for general-domain tasks outside oil & gas.
- May not be up to date with the latest industry developments (post-2023).
- Not to be used for critical, real-time decision-making without expert review.
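The chat-template pattern shown in the Usage section below covers the summarization use case as well. A minimal sketch, assuming the `tokenizer` and `model` objects from that section and a hypothetical `report_text` string holding a report excerpt:

```python
# Summarization prompt sketch (assumes `tokenizer` and `model` are loaded as in
# the Usage section below; `report_text` is a hypothetical report excerpt).
report_text = "..."  # replace with the document to summarize

messages = [
    {"role": "user",
     "content": f"Summarize the key findings of the following drilling report:\n\n{report_text}"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=512)
summary = tokenizer.decode(summary_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(summary)
```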
Training and evaluation data
The model was further pre-trained on an in-house text corpus (~20GB) collected from:
- Wikipedia (Chinese, petroleum-related entries)
- Open petroleum engineering books and literature
- Technical standards and manuals
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repository
model_name = "my2000cup/Gaia-LLM-8B"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare a petroleum engineering prompt
prompt = "What are the main challenges in enhanced oil recovery (EOR) methods?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Optional: enables the model's 'thinking' mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the model's response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024  # adjust as needed
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Optional: split off the 'thinking' content, if your template uses it
try:
    # Find the index of the </think> token (ID may differ in your tokenizer!)
    think_token_id = 151668  # double-check this ID in your tokenizer
    index = len(output_ids) - output_ids[::-1].index(think_token_id)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking content:", thinking_content)
print("Answer:", content)
```
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 4.0
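For reference, a minimal sketch of how these hyperparameters map onto `transformers` `TrainingArguments` (field names only; the output directory is a placeholder, and the actual training script, dataset loading, and PEFT/LoRA configuration are not part of this card):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="gaia-petro-llm",         # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,       # effective train batch size: 1 * 8 = 8
    lr_scheduler_type="cosine",
    num_train_epochs=4.0,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```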
Training results
Framework versions
- PEFT 0.15.1
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1
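PEFT appears in the versions above, which usually indicates parameter-efficient fine-tuning (e.g. LoRA). If the published checkpoint ships as an adapter rather than merged weights, a minimal loading sketch follows; the base-model and adapter IDs are assumptions and should be adjusted to the actual repositories:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed IDs; replace with the actual base model and adapter repositories.
base_id = "Qwen/Qwen3-4B"
adapter_id = "my2000cup/Gaia-LLM-8B"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the adapter
# Optionally merge the adapter into the base weights for faster inference:
# model = model.merge_and_unload()
```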