EnvGPT: Leveraging a Large Language Model for Environmental Science

EnvGPT is the first domain-specific large language model tailored for environmental science tasks.

Environmental science presents unique challenges for LLMs due to its interdisciplinary nature. EnvGPT was developed to address these challenges by leveraging a domain-specific environmental science instruction dataset and benchmark.

The model was fine-tuned on ChatEnv, an instruction dataset specific to environmental science, using supervised fine-tuning (SFT). ChatEnv contains 107,197,329 tokens in total.

🚀 Getting Started

Download the model

Clone the EnvGPT repository from Hugging Face. Git LFS is required to fetch the weight files:

git lfs install
git clone https://huggingface.co/SustcZhangYX/EnvGPT
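
If Git LFS is not active during the clone, the large weight files arrive as small text pointer files instead of real weights, and loading the model will fail later. The sketch below uses only the Python standard library to scan a cloned directory for such pointers. The helper name `find_lfs_pointers` and the `.safetensors`/`.bin` suffix filter are illustrative assumptions, not part of the EnvGPT repository.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line (Git LFS pointer format).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir, suffixes=(".safetensors", ".bin")):
    """Return weight files under model_dir that are still Git LFS pointers.

    Hypothetical helper for illustration; the suffix filter is an assumption
    about which files hold the weights.
    """
    root = Path(model_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{model_dir!r} is not a directory")

    pointers = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # Read only as many bytes as the header needs; real weight
            # files are large and should not be loaded in full.
            with path.open("rb") as fh:
                head = fh.read(len(LFS_HEADER))
            if head == LFS_HEADER:
                pointers.append(path)
    return pointers
```

After cloning, `find_lfs_pointers("EnvGPT")` should return an empty list. If it lists any files, running `git lfs pull` inside the clone fetches the actual weights.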

Model Usage

The following Python snippet loads the tokenizer and model and generates text with EnvGPT.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# 1. Set your local EnvGPT model path here
model_path = "YOUR_LOCAL_MODEL_PATH"

# 2. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Build chat messages
messages = [
    {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT. You are a helpful assistant."},
    {"role": "user",   "content": "What is the definition of environmental science?"},
]

# 4. Format the prompt using the chat template
#    add_generation_prompt=True appends the tokens that open the assistant turn,
#    so the model writes a reply instead of continuing the user message
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# 5. Initialize the text-generation pipeline
#    The model is already loaded with its dtype and device placement,
#    so those options are not repeated here
text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # Only return the newly generated text
)

# 6. Generate the response with sampling
outputs = text_gen(
    text,
    max_new_tokens=4096,  # Upper bound on the number of generated tokens
    do_sample=True,       # Sample from the distribution instead of greedy decoding
    top_p=0.6,            # Nucleus sampling: keep the top tokens covering 60% of the probability mass
    temperature=0.8,      # Values below 1 sharpen the distribution (less random)
)

# 7. Print the assistant’s reply (without the original prompt)
print(outputs[0]["generated_text"])

This code demonstrates how to load the tokenizer and model from your local path, define environmental science-specific prompts, and generate responses using sampling techniques like top-p and temperature.
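
To make the two sampling parameters concrete, the following standard-library sketch applies temperature scaling and top-p (nucleus) filtering to a toy set of logits. It illustrates the general technique only, not the internals of the transformers library. The function names `top_p_filter` and `sample` and the example logits are made up for illustration.

```python
import math
import random


def top_p_filter(logits, temperature=0.8, top_p=0.6):
    """Return {token: probability} for the nucleus kept after temperature scaling.

    Illustrative sketch of the general technique; not the transformers code.
    """
    # Temperature scaling: dividing logits by T < 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {tok: p / cumulative for tok, p in kept.items()}


def sample(nucleus, rng):
    """Draw one token from the filtered distribution using the given RNG."""
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


# Made-up logits for four candidate next tokens.
toy_logits = {"ecosystem": 2.0, "pollution": 1.5, "climate": 0.5, "banana": -1.0}
nucleus = top_p_filter(toy_logits, temperature=0.8, top_p=0.6)
print(nucleus)
print(sample(nucleus, random.Random(0)))
```

With these toy values only the two most probable tokens survive the filter, so the low-probability candidates can never be drawn. Passing a seeded `random.Random` makes the draw repeatable, which shows that run-to-run variation comes from the sampling step itself.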

🌏 Acknowledgement

EnvGPT is fine-tuned from the open-source LLaMA model. We thank Meta AI for their contributions to the community.

❗ Disclaimer

This project is intended solely for academic research and exploration. Please note that, like all large language models, this model may exhibit limitations, including potential inaccuracies or hallucinations in generated outputs.

Limitations

  • The model may produce hallucinated outputs or inaccuracies, which are inherent to large language models.
  • The model's self-identification has not been specifically optimized, so it may generate content that resembles outputs from other LLaMA-based models or similar architectures.
  • Generated outputs can vary between attempts due to sensitivity to prompt phrasing and token context.

🚩 Citation

If you find our work helpful, please consider citing our paper, "Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges":

@article{ZHANG2025100608,
title = {Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges},
journal = {Environmental Science and Ecotechnology},
pages = {100608},
year = {2025},
issn = {2666-4984},
doi = {https://doi.org/10.1016/j.ese.2025.100608},
url = {https://www.sciencedirect.com/science/article/pii/S2666498425000869},
author = {Yuanxin Zhang and Sijie Lin and Yaxin Xiong and Nan Li and Lijin Zhong and Longzhen Ding and Qing Hu}
}