train_2025-05-05-15-36-22
This model is a fine-tuned version of ../pretrained/Qwen3-4B on the wikipedia_zh, petro_books, datasets001, datasets002, datasets003, datasets004, and datasets006 datasets.
Model description
Gaia-Petro-LLM is a large language model specialized in the oil and gas industry, fine-tuned from Qwen/Qwen3-4B. It was further pre-trained on a curated 20GB corpus of petroleum engineering texts, including technical documents, academic papers, and domain literature. The model is designed to support domain experts, researchers, and engineers in petroleum-related tasks, providing high-quality, domain-specific language understanding and generation.
Model Details
- Base Model: Qwen/Qwen3-8B
- Domain: Oil & Gas / Petroleum Engineering
- Corpus Size: ~20GB (petroleum engineering)
- Languages: Primarily Chinese; domain-specific English supported
- Repository: my2000cup/Gaia-LLM-8B
Intended uses & limitations
Technical Q&A in petroleum engineering Document summarization for oil & gas reports Knowledge extraction from unstructured domain texts Education & training in oil & gas technologies
Limitations:
- Not suitable for general-domain tasks outside oil & gas.
- May not be up to date with the latest industry developments (post-2023).
- Not to be used for critical, real-time decision-making without expert review.
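The chat-template pattern shown in the Usage section below covers the summarization use case as well. A minimal sketch, assuming the `tokenizer` and `model` objects from that section and a hypothetical `report_text` string holding a report excerpt:

```python
# Summarization prompt sketch (assumes `tokenizer` and `model` are loaded as in
# the Usage section below; `report_text` is a hypothetical report excerpt).
report_text = "..."  # replace with the document to summarize

messages = [
    {"role": "user",
     "content": f"Summarize the key findings of the following drilling report:\n\n{report_text}"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=512)
summary = tokenizer.decode(summary_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(summary)
```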
Training and evaluation data
The model was further pre-trained on an in-house text corpus (~20GB) collected from:
- Wikipedia (Chinese, petroleum-related entries)
- Open petroleum engineering books and literature
- Technical standards and manuals
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repository
model_name = "my2000cup/Gaia-LLM-8B"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare a petroleum engineering prompt
prompt = "What are the main challenges in enhanced oil recovery (EOR) methods?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Optional: enables the model's 'thinking' mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the model's response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024  # adjust as needed
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Optional: split off the 'thinking' content, if your template uses it
try:
    # Find the index of the </think> token (ID may differ in your tokenizer!)
    think_token_id = 151668  # double-check this ID in your tokenizer
    index = len(output_ids) - output_ids[::-1].index(think_token_id)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking content:", thinking_content)
print("Answer:", content)
```
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 4.0
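For reference, a minimal sketch of how these hyperparameters map onto `transformers` `TrainingArguments` (field names only; the output directory is a placeholder, and the actual training script, dataset loading, and PEFT/LoRA configuration are not part of this card):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="gaia-petro-llm",         # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,       # effective train batch size: 1 * 8 = 8
    lr_scheduler_type="cosine",
    num_train_epochs=4.0,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```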
Training results
Framework versions
- PEFT 0.15.1
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1
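PEFT appears in the versions above, which usually indicates parameter-efficient fine-tuning (e.g. LoRA). If the published checkpoint ships as an adapter rather than merged weights, a minimal loading sketch follows; the base-model and adapter IDs are assumptions and should be adjusted to the actual repositories:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed IDs; replace with the actual base model and adapter repositories.
base_id = "Qwen/Qwen3-4B"
adapter_id = "my2000cup/Gaia-LLM-8B"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the adapter
# Optionally merge the adapter into the base weights for faster inference:
# model = model.merge_and_unload()
```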