nielsr (HF Staff) committed
Commit 29132e9 · verified · 1 Parent(s): cc36ba5

Improve model card: Add metadata, description, and GitHub link


This PR improves the model card by adding missing metadata (`library_name`, `pipeline_tag`, and tags) and a link to the GitHub repository. The description has been updated to clarify that this is a pretraining data filter, not a language model.

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -1,6 +1,13 @@
 ---
 license: mit
+library_name: fasttext
+pipeline_tag: data-filtering
+tags:
+- pretraining-data-selection
 ---
-This is the fastText pretraining data filter targeting
-the LAMBADA IT task, discussed in the main text of the Perplexity
-Correlations paper: https://arxiv.org/abs/2409.05816
+
+This fastText model is a filter for selecting high-quality pretraining data, as described in [Improving Pretraining Data Using Perplexity Correlations](https://arxiv.org/abs/2409.05816). It targets the LAMBADA IT task.
+
+The model uses perplexity correlations to select text on which low language-model perplexity correlates with strong downstream benchmark performance. It does not perform text classification directly; instead, it outputs a score indicating how suitable a text segment is for pretraining.
+
+For complete usage instructions and the theoretical background, please refer to the [project's GitHub repository](https://github.com/TristanThrush/perplexity-correlations).
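
A minimal usage sketch with the official `fasttext` Python bindings is shown below. The model filename and label names are assumptions, not part of this commit; check the files in this repository and the filter's `get_labels()` output for the actual values.

```python
import fasttext

# Hypothetical filename: use the .bin file shipped in this model repository.
model = fasttext.load_model("model.bin")

# fastText's predict() rejects strings containing newlines, so collapse them first.
text = "Una frase di esempio in italiano.".replace("\n", " ")

labels, probs = model.predict(text)  # top label and its probability for this segment
print(labels, probs)                 # the probability acts as a pretraining-suitability score
print(model.get_labels())            # inspect the label set this filter was trained with
```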