s-mizuki-nlp
committed on
Commit 3d9fbc2 · 1 Parent(s): 6b9f8c1
Proofread English text.
README.md
CHANGED
@@ -121,7 +121,8 @@ The following datasets were used for continual pre-training.
 
 ### Swallow Corpus Version 2
 
-We built the Swallow Corpus by extracting high-quality Japanese texts from Common Crawl. In Version 2, we expanded the scope of the Common Crawl collection and modified the pipeline sequence to enable more flexible quality filtering.
+We built the Swallow Corpus by extracting high-quality Japanese texts from Common Crawl. In Version 2, we expanded the scope of the Common Crawl collection and modified the pipeline sequence to enable more flexible quality filtering.
+For Llama 3.1 Swallow v0.2, we further refined our quality filtering and data sampling strategies, resulting in an even higher-quality selection of Japanese texts for pre-training.
 
 Further details of the methodology and analysis will be provided in a forthcoming paper.
 