vocab size inconsistent between model and tokenizer?

#2
by ZJdog - opened

Model config from the pretrained model:

CaduceusConfig {
  "architectures": [
    "CaduceusForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_caduceus.CaduceusConfig",
    "AutoModel": "modeling_caduceus.Caduceus",
    "AutoModelForMaskedLM": "modeling_caduceus.CaduceusForMaskedLM",
    "AutoModelForSequenceClassification": "modeling_caduceus.CaduceusForSequenceClassification"
  },
  "bidirectional": true,
  "bidirectional_strategy": "add",
  "bidirectional_weight_tie": true,
  "complement_map": {
    "0": 0,
    "1": 1,
    "10": 7,
    "11": 11,
    "12": 12,
    "13": 13,
    "14": 14,
    "15": 15,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 10,
    "8": 9,
    "9": 8
  },
  "d_model": 256,
  "fused_add_norm": true,
  "initializer_cfg": {
    "initializer_range": 0.02,
    "n_residuals_per_layer": 1,
    "rescale_prenorm_residual": true
  },
  "model_type": "caduceus",
  "n_layer": 16,
  "norm_epsilon": 1e-05,
  "pad_vocab_size_multiple": 8,
  "rcps": false,
  "residual_in_fp32": false,
  "rms_norm": true,
  "ssm_cfg": {
    "bias": false,
    "conv_bias": true,
    "d_conv": 4,
    "d_state": 16,
    "dt_init": "random",
    "dt_init_floor": 0.0001,
    "dt_max": 0.1,
    "dt_min": 0.001,
    "dt_rank": "auto",
    "dt_scale": 1.0,
    "expand": 2,
    "use_fast_path": true
  },
  "torch_dtype": "float32",
  "transformers_version": "4.42.3",
  "vocab_size": 16
}

But the tokenizer's vocab size is 12.
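
For reference, this is roughly how I am comparing the two (the checkpoint name is a placeholder for the one I am actually using):

from transformers import AutoConfig, AutoTokenizer

# Placeholder checkpoint name; substitute the actual Caduceus checkpoint.
ckpt = "kuleshov-group/caduceus-..."

config = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

print(config.vocab_size)     # 16
print(tokenizer.vocab_size)  # 12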

Should I change the model config to match the tokenizer?

Thanks.

This mismatch can directly cause the KeyError: 14, because 14 is smaller than the model's vocab size of 16 but too large for the tokenizer's vocab size of 12!
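
For example, something along these lines can trigger it, since the tokenizer's id-to-token map only covers ids 0-11:

# Illustrative: the model can predict id 14 (vocab_size=16), but the tokenizer
# only knows ids 0-11, so looking it up can fail with KeyError: 14.
tokenizer.decode([14])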

Kuleshov Group org

The issue stems from this variable in the config: "pad_vocab_size_multiple": 8. It pads the model's embedding / LM head to 16 entries, while the tokenizer only has 12 real tokens. If you are trying to decode from the model using the tokenizer, then you will need to re-train with a different embedding / LM head or add some post-processing that brings keys such as 14 into the tokenizer's range of 0-11, as you correctly identify.
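
For example, one way to do that post-processing is to mask the padded logit positions before taking the argmax, so predicted ids always land in 0-11 (a sketch; function and variable names are illustrative):

import torch

def decode_predictions(logits, tokenizer):
    # logits: (batch, seq_len, 16). Ids 12-15 exist only because the vocab
    # was padded to a multiple of 8; they carry no real tokens.
    logits = logits.clone()
    logits[..., tokenizer.vocab_size:] = float("-inf")  # suppress padded ids 12-15
    ids = logits.argmax(dim=-1)                          # now always in 0-11
    return tokenizer.batch_decode(ids.tolist())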

yairschiff changed discussion status to closed
