Model Card for Llama-3.1-8B-Instruct-NLRL-TicTacToe-Value

Model Details

Model Description

Developed by: NLRL Team
Model type: Language Value Function Model for TicTacToe
Language(s): English
License: MIT
Finetuned from model: Llama-3.1-8B-Instruct

This model serves as a language value function in the Natural Language Reinforcement Learning (NLRL) framework and is trained specifically for the game of TicTacToe. It takes a natural-language description of a state-action pair and returns a value assessment of that pair.

Uses

Direct Use

This model can be used as a TicTacToe position evaluator. Given a state-action pair, it first generates a natural-language reasoning chain and then gives a final value judgment.
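As a rough illustration of what a text description of a state-action pair might look like, the sketch below renders a board and a candidate move as a prompt string. The prompt wording, the board encoding, and the helper names `render_board` and `build_value_prompt` are all assumptions made for this example; the format the model was actually trained on may differ and should be taken from the NLRL code release.

```python
from typing import List, Optional

Board = List[Optional[str]]  # 9 cells in row-major order: "X", "O", or None


def render_board(board: Board) -> str:
    """Render the board as three text rows, using '.' for empty cells."""
    rows = []
    for r in range(3):
        cells = [board[3 * r + c] or "." for c in range(3)]
        rows.append(" ".join(cells))
    return "\n".join(rows)


def build_value_prompt(board: Board, player: str, move: int) -> str:
    """Describe a state-action pair in plain English (hypothetical format)."""
    if board[move] is not None:
        raise ValueError(f"cell {move} is already occupied")
    row, col = divmod(move, 3)
    return (
        "Current TicTacToe board ('.' marks an empty cell):\n"
        f"{render_board(board)}\n"
        f"Player {player} is about to play row {row + 1}, column {col + 1}.\n"
        "Explain your reasoning step by step, then give a final evaluation "
        "of this move."
    )


# Example position: X to move, considering the middle-right cell (index 5).
board = ["X", None, "O",
         None, "X", None,
         None, None, "O"]
prompt = build_value_prompt(board, player="X", move=5)
print(prompt)
```

The resulting string would then be passed to the model through whatever text-generation interface is in use, and the generated text read as reasoning followed by an assessment.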

Out-of-Scope Use

This model is specifically trained for TicTacToe state-action evaluation and should not be used for other games or value assessment tasks.

Training Details

Training Data

The training data consists of state-action pairs collected during the NLRL actor-critic learning process. Language-based Monte Carlo estimates serve as the training targets for the value function.
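To make the idea of a language-based Monte Carlo estimate more concrete, the sketch below samples several continuations from a state-action pair and writes each one out as a sentence, producing text that a language model could then summarize into an evaluation. This is only an illustration under stated assumptions: the uniformly random rollout policy, the sentence format, and the function names `rollout` and `language_mc_prompt` are hypothetical and are not taken from the NLRL implementation.

```python
import random
from typing import List, Optional

Board = List[Optional[str]]  # 9 cells in row-major order: "X", "O", or None

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def rollout(board: Board, player: str, move: int, rng: random.Random) -> str:
    """Play `move` for `player`, finish the game with uniformly random moves
    (an assumption of this sketch), and describe the trajectory in one sentence."""
    board = list(board)  # copy so the caller's board is not modified
    board[move] = player
    steps = [f"{player} plays cell {move}"]
    current = "O" if player == "X" else "X"
    while winner(board) is None and None in board:
        choice = rng.choice([i for i, c in enumerate(board) if c is None])
        board[choice] = current
        steps.append(f"{current} plays cell {choice}")
        current = "O" if current == "X" else "X"
    w = winner(board)
    outcome = f"{w} wins" if w else "the game is a draw"
    return ", ".join(steps) + f"; {outcome}."


def language_mc_prompt(board: Board, player: str, move: int,
                       k: int = 4, seed: int = 0) -> str:
    """Collect k sampled continuations as text for later aggregation."""
    rng = random.Random(seed)
    lines = [f"Sampled continuations after {player} plays cell {move}:"]
    for i in range(k):
        lines.append(f"Rollout {i + 1}: {rollout(board, player, move, rng)}")
    lines.append("Summarize these outcomes into an evaluation of the move.")
    return "\n".join(lines)


print(language_mc_prompt([None] * 9, player="X", move=4))
```

In the framework described here the rollouts would come from the learned policy rather than from random play, and the aggregated text would play the role that a numeric return average plays in standard Monte Carlo value estimation.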

Training Procedure

Trained using FSDP (Fully Sharded Data Parallel) across 4 H100 GPUs
Learning rate: 1e-5
Training epochs per iteration: 2
Batch size: 8
Max sequence length: 1024
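For anyone reproducing the setup, the settings listed above could be gathered into a single configuration object along these lines. The class and field names are hypothetical, and the card does not say whether the batch size of 8 is per device or global, so the sketch records the value as given without interpreting it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ValueTrainingConfig:
    """Hyperparameters as listed in this card (field names are illustrative)."""
    base_model: str = "Llama-3.1-8B-Instruct"
    num_gpus: int = 4                # H100 GPUs, sharded with FSDP
    learning_rate: float = 1e-5
    epochs_per_iteration: int = 2    # epochs run in each NLRL training iteration
    batch_size: int = 8              # per-device vs. global is not stated in the card
    max_seq_length: int = 1024       # in tokens


config = ValueTrainingConfig()
print(asdict(config))
```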

Evaluation

Tested on both deterministic and stochastic gameplay trajectories
Demonstrates consistent evaluation capabilities across different game states
Works in conjunction with the policy model to guide action selection

Model Architecture

Base model: Llama-3.1-8B-Instruct
Input: Text description of TicTacToe state-action pair
Output: Chain-of-thought evaluation followed by value assessment

Citation

@misc{feng2024naturallanguagereinforcementlearning,
      title={Natural Language Reinforcement Learning}, 
      author={Xidong Feng and Ziyu Wan and Haotian Fu and Bo Liu and Mengyue Yang and Girish A. Koushik and Zhiyuan Hu and Ying Wen and Jun Wang},
      year={2024},
      eprint={2411.14251},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.14251}, 
}