Improve model card: Add metadata, description and GitHub link

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -1,6 +1,13 @@
 ---
 license: mit
+library_name: fasttext
+pipeline_tag: data-filtering
+tags:
+- pretraining-data-selection
 ---
-This is the fastText pretraining data filter targeting
-the LAMBADA IT task, discussed in the main text of the Perplexity
-Correlations paper: https://arxiv.org/abs/2409.05816
+
+This fastText model is a filter for selecting high-quality pretraining data, as described in [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). It targets the LAMBADA IT task.
+
+The model uses perplexity correlations to identify text segments highly correlated with strong performance on downstream benchmarks. It doesn't perform text classification directly; instead, it outputs a score indicating the suitability of a text segment for pretraining.
+
+For complete usage instructions and the theoretical background, please refer to the [project's GitHub repository](https://github.com/TristanThrush/perplexity-correlations).
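
The proposed description says the model outputs a suitability score for each text segment. As a rough illustration only, and not the usage documented in the linked repository, the sketch below shows how such a score might be read from a fastText model and used to keep documents above a cutoff. The label name `__label__include`, the 0.5 threshold, and the helper names are assumptions made for this example; check the model's actual labels (for instance via `model.labels`) before relying on them.

```python
from typing import Iterable, List


def load_filter(path: str):
    """Load a fastText model from `path` (requires the `fasttext` package)."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(path)


def include_score(model, text: str, include_label: str = "__label__include") -> float:
    """Return the probability the model assigns to `include_label` for `text`.

    `include_label` is an assumed label name, not one confirmed by the model card.
    """
    # fastText's predict() processes one line at a time, so collapse newlines.
    cleaned = " ".join(text.split())
    # k=-1 asks fastText for every label with its probability.
    labels, probs = model.predict(cleaned, k=-1)
    for label, prob in zip(labels, probs):
        if label == include_label:
            return float(prob)
    return 0.0  # label absent from the prediction


def filter_texts(model, texts: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Keep the texts whose include score is at least `threshold` (assumed cutoff)."""
    return [t for t in texts if include_score(model, t) >= threshold]
```

With a downloaded model file (the filename here is hypothetical), usage would look like `filter_texts(load_filter("model.bin"), documents)`. In practice the cutoff is usually chosen to retain a target fraction of the corpus rather than fixed at 0.5.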