Question about vocab size
Hi! This is great work and I am trying to follow it.
When applying the model in another framework, I discovered that the vocab size in the JSON config file says 30522, but the vocab.txt file only contains 28895 lines (words). Shouldn't these two numbers be the same? Or am I misunderstanding something?
Looking forward to your reply. Thanks a lot!
Thanks for your comment. When checking the size of the embedding matrix (cc @nbroad ):
from transformers import BertModel
model = BertModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
print(model.embeddings.word_embeddings.weight.shape)
it does print a shape of torch.Size([30522, 768]).
I guess that the last 1627 rows of the embedding matrix (30522 - 28895) are actually never used and could be removed from the model.
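To confirm, one can compare the tokenizer's vocabulary size with the embedding matrix (just a quick check, assuming AutoTokenizer loads the vocab.txt shipped with the checkpoint):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
print(len(tokenizer))  # expected: 28895, matching the number of lines in vocab.txt rather than the 30522 rows above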
However, simply updating the vocab_size attribute of the config will result in an error, as it will complain that the updated size doesn't match the size of the embedding matrix. So one should update the vocab_size attribute and the embedding matrix at the same time.
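Something along these lines should do it (a rough sketch, not tested on this exact checkpoint; resize_token_embeddings truncates the embedding matrix and updates config.vocab_size in one step):
from transformers import AutoTokenizer, BertModel

model_id = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)

# Shrink the word embedding matrix to the tokenizer's actual vocab size.
# This also sets model.config.vocab_size to the new value.
model.resize_token_embeddings(len(tokenizer))

print(model.embeddings.word_embeddings.weight.shape)  # expected: torch.Size([28895, 768])
print(model.config.vocab_size)                        # expected: 28895
Note that this simply drops the trailing rows, which should be fine here if those rows are indeed never referenced by the tokenizer.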