metadata

license: apache-2.0

Perplexity tools

1. Create samples from `clean_json_3` sources

Each sample should contain between 1k and 1M documents; see samples/README.md for details. Output file names must be prefixed with the doc_type and suffixed with the two-letter language code. For example:

$ cat /nfsmounts/datastore/ncc_corpus/mimir/jsonl_2/nrk/nrk-articles.jsonl | shuf -n 100000 > samples/restricted-newspapers_nrk_no.json
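The naming convention and the sampling step can also be sketched in Python. This is only an illustration: `sample_name` and `reservoir_sample` are hypothetical helpers that are not part of this repository, and the `<doc_type>_<source>_<lang>.json` layout is inferred from the example above.

```python
import random
from typing import Iterable, List


def sample_name(doc_type: str, source: str, lang: str) -> str:
    """Build a sample file name of the form <doc_type>_<source>_<lang>.json.

    Hypothetical helper; the layout is inferred from the shell example.
    """
    if len(lang) != 2 or not lang.isalpha():
        raise ValueError(f"language code must be 2 letters, got {lang!r}")
    return f"{doc_type}_{source}_{lang.lower()}.json"


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Keep k lines chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

Unlike `shuf -n`, the reservoir approach never holds more than `k` lines in memory, which may matter for the largest sources.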

2. Create the perplexity scores for each file

Example of how to create scores only for doc_type restricted-newspapers_* samples:

$ ls samples/restricted-newspapers_* | parallel --lb --jobs 5 python samples_scores.py {} --output_path scores/ --jobs 15
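For reference, perplexity is conventionally the exponential of the average negative log-likelihood per token. The sketch below shows only that definition; the model and internals used by samples_scores.py are not described here, and the function name and the use of natural logarithms are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities (illustrative only)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood, then exponentiate.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A document whose tokens each have probability 0.25 under the model gets a perplexity of about 4; lower values mean the text is less surprising to the model.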

3. Create the quartiles CSV needed for segmenting and downsampling

The different doc_types are grouped together. With the --group_by_prefix_lang flag, the grouping is done on the pair of doc_type prefix and language code, e.g., wikipedia_en.
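As a rough illustration of the prefix and language grouping, the key could be derived from a file name as follows. `group_key` is a hypothetical helper, and it assumes that the doc_type prefix is the text before the first underscore and that the language code is the text after the last one; samples_quartiles.py may derive the key differently.

```python
from pathlib import Path


def group_key(path: str) -> str:
    """Return '<doc_type prefix>_<lang>' for a sample or score file name.

    Assumption: names look like <doc_type>_..._<lang>.<ext>.
    """
    stem = Path(path).stem  # e.g. "restricted-newspapers_nrk_no"
    parts = stem.split("_")
    if len(parts) < 2:
        raise ValueError(f"cannot find doc_type and language in {path!r}")
    doc_type, lang = parts[0], parts[-1]
    return f"{doc_type}_{lang}"
```

Under these assumptions, `samples/restricted-newspapers_nrk_no.json` maps to the group `restricted-newspapers_no`.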

Different downsampling ratios can be specified by using the --sampling_ratio_per_lang flag. For mimir-base, the downsampling by language is defined as follows: "da:0.23,en:0.21,sv:0.08,is:0.50".

$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.23,en:0.21,sv:0.08,is:0.50" --output_file csv/base-perplexity_quartiles_sampling.csv
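The ratio string is a comma-separated list of `lang:ratio` pairs. A minimal sketch of parsing it and of computing quartile cut points for a group of scores is shown here; both function names are hypothetical, and how samples_quartiles.py actually combines ratios and quartiles is not shown in this document.

```python
import statistics
from typing import Dict, Sequence, Tuple


def parse_ratios(spec: str) -> Dict[str, float]:
    """Parse 'da:0.23,en:0.21' into {'da': 0.23, 'en': 0.21}."""
    ratios: Dict[str, float] = {}
    for item in spec.split(","):
        lang, value = item.split(":")
        ratios[lang.strip()] = float(value)
    return ratios


def quartiles(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return the three cut points (Q1, median, Q3) of the scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    return q1, q2, q3
```

For example, `parse_ratios("da:0.23,en:0.21,sv:0.08,is:0.50")` yields one ratio per language code, and `quartiles` needs at least two scores per group.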

For mimir-extended, the downsampling by language is defined as follows: "da:0.43,en:0.81,sv:0.15,code:0.62".

$ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.43,en:0.81,sv:0.15,code:0.62" --output_file csv/extended-perplexity_quartiles_sampling.csv --overwrite_prefix_lang "starcoder_en:starcode_code"

More information in the spreadsheet.
