Update README.md
README.md CHANGED
@@ -129,7 +129,7 @@ The tokenizer training approach showcases a sophisticated and advanced methodology
- **Dynamic Adaptation**: The ability to dynamically adjust tokenization parameters (such as `min_frequency`) based on dataset analysis ensures that the tokenizer remains effective across different text domains (one plausible heuristic is sketched after this list).
- **Dynamic Tokens**: The dataset is divided into chunks, and each chunk is analyzed to count the occurrences of each token. This is done across all chunks in parallel, and the results are aggregated. A threshold (e.g., `0.0005`) is applied to identify tokens that constitute only a small fraction of the total words in the dataset; tokens below this threshold are considered rare or dynamic. From these dynamic tokens, the top `k` tokens with the highest counts (but still under the threshold) are selected and added manually to the tokenizer's vocabulary so that the tokenizer can focus on the most relevant rare tokens (see the sketch after this list). Dynamic tokens often include terminology, names, or concepts specific to a dataset's domain. Their inclusion in the tokenizer's vocabulary allows the LLM to capture and understand these unique elements more effectively, leading to improved performance in tasks requiring deep domain knowledge or contextual nuance.
- **Sophisticated Evaluation**: The inclusion of a detailed evaluation mechanism enables continuous assessment and improvement of the tokenizer's performance, ensuring high accuracy and reliability.
- **Number Bucketing**: Numbers in the text are categorized into predefined "buckets" based on their value. The bucketing process involves dividing the number space into several ranges (or buckets) and assigning each number to a specific bucket, and each bucket is represented by its own token that follows a specific convention. Common years (e.g., 1900-2025) and ages (e.g., 1-100) are exceptions to this rule and are kept the way they are written. This reduces sparsity and improves generalization without overfitting to specific values (see the sketch after this list).
- **URL Replacement**: URLs in the text are identified using a regular expression for common URL patterns and replaced with a special token `<url>`. Replacing varied URLs with a single token prevents the model from overfitting to specific web addresses, which are usually not relevant to understanding the text's general context. URLs can also introduce a vast number of unique tokens into the vocabulary, so collapsing them into a single token significantly simplifies the model's vocabulary. By abstracting away the specifics of URLs, models can focus more on the actual textual content (see the sketch after this list).
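The README does not spell out how `min_frequency` is chosen, so the snippet below is only a minimal sketch of one plausible heuristic: scale the cutoff with the size of the analyzed corpus so it stays meaningful across domains. The function name, scaling rate, and floor are illustrative assumptions, not values from this repository.

```python
# Hypothetical heuristic: derive `min_frequency` from corpus size.
# The rate and floor below are illustrative assumptions, not repository values.
def choose_min_frequency(total_words: int, rate: float = 1e-6, floor: int = 2) -> int:
    """Return a min_frequency that grows with corpus size, never below `floor`."""
    return max(floor, int(total_words * rate))

print(choose_min_frequency(50_000_000))  # 50 for a 50M-word corpus
print(choose_min_frequency(1_000_000))   # falls back to the floor of 2
```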
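A minimal sketch of the dynamic-token selection step described above: the chunked counting, aggregation, frequency threshold (e.g. `0.0005`), and top-`k` selection follow the bullet, while the function names, the use of `multiprocessing`, and the Hugging Face-style `tokenizer.add_tokens` call are assumptions for illustration.

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk: list[str]) -> Counter:
    """Count token occurrences in one chunk of documents."""
    counts = Counter()
    for text in chunk:
        counts.update(text.split())
    return counts

def select_dynamic_tokens(chunks, threshold=0.0005, top_k=1000):
    # Count every chunk in parallel, then aggregate the partial counts.
    with Pool() as pool:
        partial_counts = pool.map(count_chunk, chunks)
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)
    total_words = sum(totals.values())
    # Tokens under the frequency threshold are rare ("dynamic");
    # keep the top-k most frequent among them.
    rare = {tok: n for tok, n in totals.items() if n / total_words < threshold}
    return [tok for tok, _ in Counter(rare).most_common(top_k)]

# Usage (assuming a Hugging Face-style tokenizer object):
# dynamic_tokens = select_dynamic_tokens(dataset_chunks)
# tokenizer.add_tokens(dynamic_tokens)  # manually extend the vocabulary
```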
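A minimal sketch of number bucketing as described above. Only the year (1900-2025) and age (1-100) exceptions come from the bullet; the bucket boundaries and the `<num_...>` token convention are illustrative assumptions.

```python
import re

def bucket_number(match: re.Match) -> str:
    value = int(match.group())
    # Exceptions from the text: common years and ages stay as written.
    if 1900 <= value <= 2025 or 1 <= value <= 100:
        return match.group()
    # Assumed convention: map every other number to a coarse range token.
    if value < 1_000:
        return "<num_0_1k>"
    if value < 1_000_000:
        return "<num_1k_1m>"
    return "<num_1m_plus>"

def bucket_numbers(text: str) -> str:
    return re.sub(r"\d+", bucket_number, text)

print(bucket_numbers("Born in 1987, aged 36, sold 250000 units for 499 dollars."))
# -> "Born in 1987, aged 36, sold <num_1k_1m> units for <num_0_1k> dollars."
```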
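A minimal sketch of URL replacement with the `<url>` special token. The regular expression is an illustrative pattern for common URL shapes, not necessarily the exact one used in this repository.

```python
import re

# Matches http(s):// or www. followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)

def replace_urls(text: str) -> str:
    return URL_PATTERN.sub("<url>", text)

print(replace_urls("See https://example.com/docs?page=2 or www.example.org for details."))
# -> "See <url> or <url> for details."
```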
## Tokenizer Evaluation Methodology