Missing Checkpoint Files

#5
by hcoxec - opened

Following on from another discussion: quite a lot of the intermediate checkpoints are incomplete and so unusable. There are three types of errors that appear over and over again. For reference, the same code that threw the errors below loads the other 32B checkpoints without issue, yet approximately 25% of the intermediate checkpoints are unusable.
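
For context, this is roughly the loading pattern involved (a sketch only; the repo ID below is a placeholder, not the actual repo, and each intermediate checkpoint lives on its own Hub branch selected via `revision`):

```python
# Sketch of the loading pattern. The repo ID is a placeholder.
# Intermediate checkpoints are Hub branches, selected via `revision`.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-32B"  # placeholder, not the actual repo ID
revision = "stage1-step170000-tokens1427B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```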

Most common: part of the weights are missing. I can give at least four examples, and I'm sure there are more given my difficulty using other checkpoints.

In some cases, the tokenizer does not appear to be there; for example, the checkpoint "stage1-step170000-tokens1427B" throws this error:

```
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2276, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/tokenization_gpt2.py", line 159, in __init__
    with open(merges_file, encoding="utf-8") as merges_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
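
This failure is consistent with the tokenizer files simply not being on that branch: `GPT2Tokenizer` needs `vocab.json` and `merges.txt`, and when `merges.txt` can't be resolved, `merges_file` is `None` and `open()` raises the `TypeError` above. A quick way to check what a branch actually contains (a sketch using `huggingface_hub`; the repo ID is a placeholder):

```python
# Sketch: list what a given checkpoint branch actually contains.
from huggingface_hub import list_repo_files

repo_id = "org/model-32B"  # placeholder, not the actual repo ID
files = list_repo_files(repo_id, revision="stage1-step170000-tokens1427B")

# GPT2Tokenizer needs vocab.json + merges.txt (or tokenizer.json for the
# fast path). If merges.txt is absent, merges_file resolves to None and
# open(None) raises the TypeError shown above.
print(sorted(f for f in files if "token" in f or f.endswith((".json", ".txt"))))
```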

In other cases, the tokenizer appears to be in a corrupted slow-tokenizer format that fails to convert:

```
ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']
```
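
If the tokenizer is identical across pretraining checkpoints (an assumption I can't verify), a possible stopgap would be to take the tokenizer from the main branch and only the weights from the checkpoint revision:

```python
# Possible stopgap, assuming the tokenizer did not change during training.
# The repo ID is a placeholder, not the actual repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-32B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)  # main branch
model = AutoModelForCausalLM.from_pretrained(
    repo_id, revision="stage1-step170000-tokens1427B"
)
```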

I would appreciate it immensely if the missing checkpoints could be made available in a usable format. Thanks so much!

On further review, I think I underestimated: it's looking like almost 50% of the checkpoints are unusable, with the missing-weights error being the most common.
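
A sweep along these lines can quantify this: list every checkpoint branch and flag revisions with missing tokenizer files or missing weight shards (a sketch; the repo ID is a placeholder, and the shard check assumes sharded safetensors with an index file):

```python
# Sketch: flag checkpoint branches with missing tokenizer or weight files.
import json

from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
repo_id = "org/model-32B"  # placeholder, not the actual repo ID

for ref in api.list_repo_refs(repo_id).branches:
    files = set(api.list_repo_files(repo_id, revision=ref.name))
    has_tokenizer = "tokenizer.json" in files or {"vocab.json", "merges.txt"} <= files
    # For sharded checkpoints, every shard named in the index must exist.
    missing_shards = []
    if "model.safetensors.index.json" in files:
        index_path = hf_hub_download(
            repo_id, "model.safetensors.index.json", revision=ref.name
        )
        with open(index_path) as fh:
            expected = set(json.load(fh)["weight_map"].values())
        missing_shards = sorted(expected - files)
    if not has_tokenizer or missing_shards:
        print(ref.name, "tokenizer ok:", has_tokenizer, "missing:", missing_shards)
```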

Hey @hcoxec, thank you for reaching out. We have noticed this, and I have started re-uploading the checkpoints. The estimated timeline for this process to finish is Tuesday. I will update you once it is done.

Thank you! We were hoping to include some analysis of these for a deadline this week - so please let me know as soon as they're up so we can assess if that's still feasible.

Hey @hcoxec, if there are any particular checkpoints you need for analysis, I can fast-track them; they'll be up by tomorrow morning.

Unfortunately, we're looking at the whole time series, which is why we came across errors with so many different checkpoints. I appreciate the offer, but we really need all of them to be able to run this!

Separately - are there any plans to release earlier checkpoints of the 13B? Currently, its checkpointing schedule doesn't match the other models (1B, 7B, 32B), which means we can't include it in comparisons.

No, there are no plans to release earlier checkpoints of 13B.

Hey, all the checkpoints are fixed.

amanrangapur changed discussion status to closed
