RxT-Alpha Micro MLM
Masked Language Modelling head for RxT-Alpha-Micro-Encoder pre-training. In the Reactive Transformer architecture, the final Memory Encoder uses only Transformer layers, but it still has to be pre-trained with standard MLM training, so an additional head model is included.
MLM Head Details:
- one linear layer (dim -> dim / 128 -> 128)
- GELU activation layer
- output linear layer (dim -> vocab / 128 -> 7500)
- size: ~984k Params
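A minimal PyTorch sketch of this head (class and argument names are illustrative, not the RxNN API):

```python
import torch.nn as nn

class MLMHead(nn.Module):
    """Maps Memory Encoder hidden states to vocabulary logits for MLM pre-training.

    Illustrative sketch only - not the RxNN implementation.
    """
    def __init__(self, dim: int = 128, vocab_size: int = 7500):
        super().__init__()
        self.dense = nn.Linear(dim, dim)            # dim -> dim (128 -> 128)
        self.activation = nn.GELU()                 # GELU activation layer
        self.decoder = nn.Linear(dim, vocab_size)   # dim -> vocab (128 -> 7500)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, dim) from the Memory Encoder
        return self.decoder(self.activation(self.dense(hidden_states)))
```

With dim = 128 and vocab = 7500, this gives 128·128 + 128 + 128·7500 + 7500 = 984,012 parameters, matching the ~984k size above.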
Reactive Transformer Architecture
Experimental research model made to test our Reactive Transformer architecture and Attention-based Memory System.
The Reactive Transformer has additional Short-Term Memory (STM) layers, connected to the model with Memory Cross-Attention and updated by the Memory Encoder and Memory Attention. The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that's the key difference between RxNNs and RNNs.
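A conceptual sketch of one interaction/event, with the components passed in as placeholders (signatures are assumptions for illustration, not the RxNN API):

```python
import torch

def interaction_step(decoder, memory_encoder, memory_attention, stm, query_ids):
    """One event (a single message): STM persists between interactions, not between tokens.

    All component signatures here are illustrative assumptions, not the RxNN API.
    """
    # 1. Decoder generates a response, reading STM through memory cross-attention.
    response_ids = decoder.generate(query_ids, stm=stm)
    # 2. Memory Encoder encodes the finished interaction (query + response).
    encoded = memory_encoder(torch.cat([query_ids, response_ids], dim=-1))
    # 3. Memory Attention updates STM with the encoded interaction for the next event.
    new_stm = memory_attention(stm, encoded)
    return response_ids, new_stm
```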
The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe this is a key requirement for awareness and AGI. Processing the whole chat history on every interaction is not natural, and that's not how human awareness works. The Reactive Transformer architecture is therefore a first step in the transition from language models to awareness models.
In the first two stages - pre-training and supervised fine-tuning - the decoder and encoder are trained together: the encoder layers' outputs are used as the decoder's memory cross-attention key/value inputs to align the vector spaces between components. Then, in the third stage - Memory Reinforcement Learning - they are connected with Memory Attention layers, and the full model is trained to update and use memory.
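A rough sketch of one Joint LM Training step in these first two stages; the component signatures (what the encoder returns, how the decoder consumes the layer states) are assumptions for illustration, not the RxNN training loop:

```python
import torch.nn.functional as F

def joint_lm_step(encoder, decoder, mlm_head, masked_ids, mlm_labels, input_ids):
    """One Joint LM Training step (MLM + autoregressive loss); signatures are assumed."""
    # Encoder processes the masked sequence; its per-layer outputs stand in for STM.
    enc_hidden, enc_layer_states = encoder(masked_ids)   # assumed to return both
    # MLM loss on masked positions (labels use -100 for unmasked tokens).
    mlm_logits = mlm_head(enc_hidden)
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    # Decoder reads the encoder layer outputs as memory cross-attention key/value inputs.
    dec_logits = decoder(input_ids, stm=enc_layer_states)
    # Autoregressive (next-token) loss: shift logits against targets.
    ar_loss = F.cross_entropy(dec_logits[:, :-1].reshape(-1, dec_logits.size(-1)),
                              input_ids[:, 1:].reshape(-1))
    return mlm_loss + ar_loss   # combined Joint LM loss
```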
RxT-Alpha models intentionally use a very short sequence length and STM size (256 tokens for Micro), but that isn't their "full" context size - it applies only to a single message. The "full" context is theoretically infinite, limited by the STM size and the model's memory abilities. These sizes are good for research; final models will handle SOTA context sizes.
This model (MLM Head) is not used in the final Reactive Transformer system. It is made only for the first stage of training - base encoder model pre-training.

RxT-Alpha Micro Encoder + MLM Head Training
Micro models from the RxT-Alpha series are the first PoC for the Reactive Transformer, Attention-Based Memory System and Memory Reinforcement Learning, used mainly to test library and architecture basics before training bigger models (which are still relatively small, as it's a PoC).
RxT-Alpha-Micro-Encoder was trained with this MLM head model and RxT-Alpha-Micro-Decoder, using Joint LM Training (with MLM and autoregressive losses) and the roneneldan/TinyStories dataset.
Encoder architecture details:
- dim: 128
- layers: 6
- heads: 8
- self-attention: symmetric Sparse Query Attention
- query/key/value heads: 4
- SwiGLU feed forward with 384 dim
- RoPE
- RMS Norm
- vocab: 7.5k (English only)
- message length: 256
- STM size: 256 * 6 layers
- size: ~2.1M (+ ~980k MLM Head = ~3M for pre-training)
- Library: RxNN
- Docs: draft/in progress
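For reference, a hypothetical Python summary of the hyperparameters above (field names are illustrative only, not the RxNN config schema):

```python
# Hypothetical hyperparameter summary for RxT-Alpha-Micro-Encoder (not the RxNN config schema).
micro_encoder_config = {
    "dim": 128,
    "num_layers": 6,
    "num_heads": 8,
    "self_attention": "symmetric Sparse Query Attention (SQA)",
    "num_query_heads": 4,
    "num_kv_heads": 4,
    "feed_forward": "SwiGLU",
    "ff_dim": 384,
    "positional_encoding": "RoPE",
    "norm": "RMSNorm",
    "vocab_size": 7500,            # English only
    "message_length": 256,         # per single interaction, not full context
    "stm_slots_per_layer": 256,    # STM size: 256 * 6 layers
    "params": "~2.1M encoder (+ ~980k MLM head = ~3M for pre-training)",
}
```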