Add link to paper (#2)

- Add link to paper (041614c4bcf30bc8cc49d25e249c112cb3238eda)

Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)

README.md +6 -6

@@ -1,22 +1,22 @@
 ---
-license: apache-2.0
 datasets:
 - togethercomputer/RedPajama-Data-1T
 language:
 - en
-pipeline_tag: text-generation
 library_name: transformers
 ---
 ## PDS-160M
-[paper](https://arxiv.org/abs/2410.07064) | [code](https://github.com/microsoft/LMOps/tree/main/data_selection)
 **PDS-160M** is a 160M model with [Mistral](https://arxiv.org/abs/2310.06825) achitecture pre-trained from scratch on the data selected from the CC split of [Redpajama](https://github.com/togethercomputer/RedPajama-Data), using the PDS framework.
 The PDS framework is based on the [Pontryagin's maximum principle](https://en.wikipedia.org/wiki/Pontryagin%27s_maximum_principle#:~:text=Pontryagin's%20maximum%20principle%20is%20used,the%20state%20or%20input%20controls.) for optimal pre-training data selection, which not only enjoy strong theoretical support but is also scalable for training large language models.
-Please refer to our [paper](https://arxiv.org/abs/2410.07064) for more details.
 ### Overview of the theory:
@@ -32,7 +32,7 @@ Please refer to our [paper](https://arxiv.org/abs/2410.07064) for more details.
 ### Evaluation
-PDS-selected data improves the performance of language models pre-trained from scratch and saves pre-training comptation. The improvement scales up to large model sizes.
 <p align='left'>
     <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/6undIr37d10qD73TDiPDK.png" width="600">
@@ -51,4 +51,4 @@ PDS-selected data improves the performance of language models pre-trained from s
   journal={arXiv preprint arXiv:2410.07064},
   year={2024}
 }
-```

 ---
 datasets:
 - togethercomputer/RedPajama-Data-1T
 language:
 - en
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
 ---
 ## PDS-160M
+[paper](https://huggingface.co/papers/2410.07064) | [code](https://github.com/microsoft/LMOps/tree/main/data_selection)
 **PDS-160M** is a 160M model with [Mistral](https://arxiv.org/abs/2310.06825) achitecture pre-trained from scratch on the data selected from the CC split of [Redpajama](https://github.com/togethercomputer/RedPajama-Data), using the PDS framework.
 The PDS framework is based on the [Pontryagin's maximum principle](https://en.wikipedia.org/wiki/Pontryagin%27s_maximum_principle#:~:text=Pontryagin's%20maximum%20principle%20is%20used,the%20state%20or%20input%20controls.) for optimal pre-training data selection, which not only enjoy strong theoretical support but is also scalable for training large language models.
+Please refer to our [paper](https://huggingface.co/papers/2410.07064) for more details.
 ### Overview of the theory:
 ### Evaluation
+PDS-selected data improves the performance of language models pre-trained from scratch and saves pre-training computation. The improvement scales up to large model sizes.
 <p align='left'>
     <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/6undIr37d10qD73TDiPDK.png" width="600">
   journal={arXiv preprint arXiv:2410.07064},
   year={2024}
 }
+```