THUdyh / nielsr (HF staff) committed
Commit c32c2e1 · verified · 1 parent: f21c809

Add model card (#3)


- Add model card (589344a95e1a54e5ce301f63526f77565f8418fd)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +164 -11
README.md CHANGED
@@ -2,7 +2,7 @@
  license: apache-2.0
  base_model:
  - Qwen/Qwen2.5-7B-Instruct
- pipeline_tag: any-to-any
  language:
  - en
  - zh
@@ -13,20 +13,20 @@ language:
  ## Model Summary

  The Ola-7B model is developed by people from Tencent, Tsinghua University and Nanyang Technological University.
- Based on Qwen2.5 language model, it is trained on text, image, video and audio data with a context window of 32K tokens. It can take both image/video, text and audio as input and output text/speech.

  Ola offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.

  - **Repository:** https://github.com/Ola-Omni/Ola
  - **Languages:** English, Chinese
- - **Paper:** https://arxiv.org/abs/2502.04328

  ## Use

  1. Download the speech encoder at https://huggingface.co/THUdyh/Ola_speech_encoders.
  2. Replace the path in config.json with local path of speech encoders.

- We provide a simple generation process for using our model. For more details, please refer to our [Github Repo](xxxxxx)

  ```
  import os
@@ -299,19 +299,21 @@ def ola_inference(multimodal, audio_path):
  return outputs, None
  ```


- ### Model Architecture

- - **Architecture:** Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2.5-7B
- - **Data:** a mixture of more than 5M image/video/audio data, training for 3 stage.
- - **Precision:** BFloat16

  #### Hardware & Software

- - **Hardware:** 64 * NVIDIA Tesla A100
- - **Orchestration:** HuggingFace Trainer
- - **Code:** Pytorch

  ## Citation
  @article{liu2025ola,
@@ -320,3 +322,154 @@ author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winst
  journal={arXiv preprint arXiv:2502.04328},
  year={2025}
  }

  license: apache-2.0
  base_model:
  - Qwen/Qwen2.5-7B-Instruct
+ pipeline_tag: any-to-text
  language:
  - en
  - zh
 
  ## Model Summary

  The Ola-7B model is developed by people from Tencent, Tsinghua University and Nanyang Technological University.
+ Based on Qwen2.5 language model, it is trained on text, image, video and audio data with a context window of 32K tokens. It can take both image/video, text and audio as input and output text.

  Ola offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.

  - **Repository:** https://github.com/Ola-Omni/Ola
  - **Languages:** English, Chinese
+ - **Paper:** https://huggingface.co/papers/2502.04328

  ## Use

  1. Download the speech encoder at https://huggingface.co/THUdyh/Ola_speech_encoders.
  2. Replace the path in config.json with local path of speech encoders.
 
29
+ We provide a simple generation process for using our model. For more details, please refer to our [Github Repo](https://github.com/Ola-Omni/Ola)
30
 
31
  ```
32
  import os
 
  return outputs, None
  ```

+ ### Model Architecture

+ - **Architecture:** Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2.5-7B

+ - **Data:** a mixture of more than 5M image/video/audio data, training for 3 stage.

+ - **Precision:** BFloat16

  #### Hardware & Software

+ - **Hardware:** 64 \* NVIDIA Tesla A100
+
+ - **Orchestration:** HuggingFace Trainer
+
+ - **Code:** Pytorch

  ## Citation
  @article{liu2025ola,

  journal={arXiv preprint arXiv:2502.04328},
  year={2025}
  }
+
+ # File information
+
+ The repository contains the following file information:
+
+ Filename: generation_config.json
+ Content: {
+ "attn_implementation": "flash_attention_2",
+ "bos_token_id": 151643,
+ "do_sample": true,
+ "eos_token_id": [
+ 151645,
+ 151643
+ ],
+ "pad_token_id": 151643,
+ "repetition_penalty": 1.05,
+ "temperature": 0.7,
+ "top_k": 20,
+ "top_p": 0.8,
+ "transformers_version": "4.43.4"
+ }
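For reference, the decoding defaults above map onto `transformers` generation settings roughly as follows (a sketch using the values from generation_config.json; the final `generate()` call is assumed and belongs to the repo's own inference code):

```python
# Sketch: rebuild the decoding defaults listed in generation_config.json.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.05,
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    pad_token_id=151643,
)
# Assumed usage, depending on how the model is loaded:
# outputs = model.generate(**inputs, generation_config=gen_cfg)
```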
+
+ Filename: merges.txt
+ Content: "Content of the file is larger than 50 KB, too long to display."
+
+ Filename: special_tokens_map.json
+ Content: {
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "eos_token": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": "<|mm_pad|>"
+ }
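A small sketch for checking that these multimodal special tokens are present in the tokenizer (the repo id `THUdyh/Ola-7b` is an assumption; substitute the actual model id):

```python
# Sketch: load the tokenizer from this repo and look up the special tokens
# listed in special_tokens_map.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("THUdyh/Ola-7b")  # repo id assumed
for t in ["<|im_start|>", "<|im_end|>", "<|image_pad|>", "<|video_pad|>", "<|mm_pad|>"]:
    print(t, tok.convert_tokens_to_ids(t))
```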
+
+ Filename: model.safetensors.index.json
+ Content: "Content of the file is larger than 50 KB, too long to display."
+
+ Filename: config.json
+ Content: "Content of the file is larger than 50 KB, too long to display."
+
+ Filename: vocab.json
+ Content: "Content of the file is larger than 50 KB, too long to display."
+
+ Filename: tokenizer_config.json
+ Content: "Content of the file is larger than 50 KB, too long to display."
+
+ # Project page
+
+ The project page URL we found has the following URL: https://ola-omni.github.io/
+
+ # Github README
+
+ The Github README we found contains the following content:
+
+ <div align="center">
+
+ <img src="assets/logo.png" width="30%"/>
+
+ # OLA: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
+
+ Join our [WeChat](http://imagebind-llm.opengvlab.com/qrcode/)
+ [[Project Page](https://ola-omni.github.io/)] [[Demo](http://106.14.2.150:10020/)]
+
+ </div>
+
+ <img src="assets/teaser.png" width="100%"/>
+
+ ## 🚀 News
+ * [2025/02/07] 🎉🎉🎉 Initial codebase for eval and training will be released ASAP! Thanks for your attention.
+
+ ## ⚡ Model Zoo
+ 1. Speech-Visual Data
+ * [ ] image+text with local audio caption.
+ * [ ] videos from webvid2.5m with audio caption.
+ 2. Visual Tokenizer
+ * [ ] Imagebind small.
+ * [ ] Oryx-ViT 18B-1152.
+ 3. Training Pipeline
+ * [ ] image+text stage.
+ * [ ] audio+image+text stage.
+ * [ ] video+audio+image+text stage
+
+ ## TODO
+ - [ ] Multi Stage Training
+
+ ## ⚙️ Installation
+
+ See [INSTALL.md](docs/INSTALL.md) for detailed instructions.
+
+ ## 🛴 Quick Inference Code
+
+ - Check out the [quick inference script](example/inference/image_audio.ipynb) using a visual and audio data!
+
+ ## 📃 Citation
+ ```
+ @article{liu2025ola,
+ title={Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment},
+ author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
+ journal={arXiv preprint arXiv:2502.04328},
+ year={2025}
+ }
+ ```
+
+ ## Acknowledgement
+ - This project has been built using the great codebase of [Qwen](https://github.com/QwenLM/Qwen), [Video-LLaVA](https://github.com/mbai-xiao/Video-LLaVA), [OpenFlamingo](https://github.com/mlfoundations/open_flamingo). We thank the authors for their wonderful works.
+
+ ## Contact
+ - If you have any questions, feel free to open issues or pull requests.