--- license: apache-2.0 --- # Perplexity tools ## 1. Create samples from `clean_json_3` sources Between 1k and 1M documents. Read [samples/README.md](./samples/README.md). Output files must be prefixed by `doc_type` and suffixed by language code (2 letters). For example: ```bash $ cat /nfsmounts/datastore/ncc_corpus/mimir/jsonl_2/nrk/nrk-articles.jsonl | shuf -n 100000 > samples/restricted-newspapers_nrk_no.json ``` ## 2. Create the perplexity scores for each file Example of how to create scores only for `doc_type` `restricted-newspapers_*` samples: ```bash $ ls samples/restricted-newspapers_* | parallel --lb --jobs 5 python samples_scores.py {} --output_path scores/ --jobs 15 ``` ## 3. Create the quartiles CSV needed for segmenting and downsamplig The different `doc_type`s will be grouped together. By passing the flag `--group_by_prefix_lang`, the grouping will happen on the pair `doc_type` prefix and language code, e.g., `wikipedia_en`. Different downsampling ratios can be specified by using the `--sampling_ratio_per_lang` flag. For `mimir-base`, the downsampling by language is defined as follows: `"da:0.23,en:0.21,sv:0.08,is:0.50"`. ```bash $ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.23,en:0.21,sv:0.08,is:0.50" --output_file csv/base-perplexity_quartiles_sampling.csv ``` For `mimir-extended`, the downsampling by language is defined as follows: `"da:0.43,en:0.81,sv:0.15,code:0.62"`. ```bash $ python samples_quartiles.py scores/ --group_by_prefix_lang --sampling_ratio_per_lang "da:0.43,en:0.81,sv:0.15,code:0.62" --output_file csv/extended-perplexity_quartiles_sampling.csv --overwrite_prefix_lang "starcoder_en:starcode_code" ``` More information in the [spreadsheet](https://docs.google.com/spreadsheets/d/108oGVVN-Ml-TDN59UXR96oeBBt2FbgT81zt8_1y9PUw/edit?usp=sharing).