vocab size inconsistent between model and tokenizer?
#2 by ZJdog - opened
Model config from the pretrained model:
```
CaduceusConfig {
  "architectures": [
    "CaduceusForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_caduceus.CaduceusConfig",
    "AutoModel": "modeling_caduceus.Caduceus",
    "AutoModelForMaskedLM": "modeling_caduceus.CaduceusForMaskedLM",
    "AutoModelForSequenceClassification": "modeling_caduceus.CaduceusForSequenceClassification"
  },
  "bidirectional": true,
  "bidirectional_strategy": "add",
  "bidirectional_weight_tie": true,
  "complement_map": {
    "0": 0,
    "1": 1,
    "10": 7,
    "11": 11,
    "12": 12,
    "13": 13,
    "14": 14,
    "15": 15,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 10,
    "8": 9,
    "9": 8
  },
  "d_model": 256,
  "fused_add_norm": true,
  "initializer_cfg": {
    "initializer_range": 0.02,
    "n_residuals_per_layer": 1,
    "rescale_prenorm_residual": true
  },
  "model_type": "caduceus",
  "n_layer": 16,
  "norm_epsilon": 1e-05,
  "pad_vocab_size_multiple": 8,
  "rcps": false,
  "residual_in_fp32": false,
  "rms_norm": true,
  "ssm_cfg": {
    "bias": false,
    "conv_bias": true,
    "d_conv": 4,
    "d_state": 16,
    "dt_init": "random",
    "dt_init_floor": 0.0001,
    "dt_max": 0.1,
    "dt_min": 0.001,
    "dt_rank": "auto",
    "dt_scale": 1.0,
    "expand": 2,
    "use_fast_path": true
  },
  "torch_dtype": "float32",
  "transformers_version": "4.42.3",
  "vocab_size": 16
}
```
But the tokenizer's vocab size is 12. Should I change the model config to match the tokenizer? Thanks.
The issue stems from this variable in the config: `"pad_vocab_size_multiple": 8`. The tokenizer's 12 tokens are padded up to the next multiple of 8, which is where the `"vocab_size": 16` in the model config comes from; the extra ids (12-15) only exist to keep the embedding and LM head dimensions efficient. If you are trying to decode from the model using the tokenizer, then you will need to re-train with a different embedding / LM head, or add some post-processing that brings keys such as 14 back into the tokenizer's range of 0-11, as you correctly identify.
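A minimal sketch of what that post-processing could look like, assuming a PyTorch LM head (the variable names and shapes below are illustrative, not taken from the Caduceus code):

```python
import torch

# Illustrative sketch, not Caduceus code: mask the padding ids so decoding
# can only produce ids the tokenizer actually knows about.
tokenizer_vocab_size = 12                      # tokens the tokenizer can decode
pad_vocab_size_multiple = 8                    # from the model config
padded_vocab_size = -(-tokenizer_vocab_size // pad_vocab_size_multiple) \
    * pad_vocab_size_multiple                  # 12 rounded up -> 16

# Example logits as they would come out of the LM head: (batch, seq, 16).
logits = torch.randn(1, 10, padded_vocab_size)

# Suppress the padding ids (12-15) before argmax/sampling.
logits[..., tokenizer_vocab_size:] = float("-inf")
token_ids = logits.argmax(dim=-1)              # every id is now in 0-11
```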
yairschiff changed discussion status to closed