Model Card for the CodeSearchNet Tokenizer Trained from GPT-2
This model card describes a custom tokenizer retrained from the existing GPT-2 tokenizer on the CodeSearchNet dataset. Starting from GPT-2's byte-level BPE vocabulary, the tokenizer was adapted to handle code-specific tokenization more effectively.
Model Details
Model Description
This tokenizer was retrained on the CodeSearchNet dataset, which contains millions of code snippets across multiple programming languages. It was initialized from the GPT-2 tokenizer and then adapted to better handle the characteristics of programming-language syntax, such as common identifiers and code constructs.
- Developed by: Aditya Ak
- Shared by: Aditya Ak
- Model type: Tokenizer
- Language(s): Python (programming language)
- License: Apache 2.0
- Finetuned from model: openai-community/gpt2
Uses
Direct Use
The tokenizer can be used directly in any NLP task that involves source code, such as code generation, code summarization, or bug detection, by replacing the original GPT-2 tokenizer with this newly trained version.
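For example, the adapted tokenizer typically splits Python code into fewer, more meaningful subwords than the stock GPT-2 tokenizer. The comparison below is a minimal sketch; the exact token counts depend on the trained vocabulary:
from transformers import AutoTokenizer
# Stock GPT-2 tokenizer vs. the code-adapted tokenizer from this repository
old_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
new_tokenizer = AutoTokenizer.from_pretrained("Adiii143/code-search-net-tokenizer")
snippet = "def add_numbers(a, b):\n    return a + b"
# The adapted tokenizer is expected to need fewer tokens for typical Python code
print(len(old_tokenizer.tokenize(snippet)), len(new_tokenizer.tokenize(snippet)))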
Downstream Use
When plugged into a code-generation or code-understanding pipeline, this tokenizer can help a downstream model capture programming-language structure more effectively and represent code with shorter input sequences.
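As a sketch of downstream use, the tokenizer can replace the stock GPT-2 tokenizer when initializing a fresh code-generation model. The configuration values below are illustrative assumptions, not settings recorded in this card:
from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel
# Load the code-adapted tokenizer and size a fresh GPT-2 model to its vocabulary
tokenizer = AutoTokenizer.from_pretrained("Adiii143/code-search-net-tokenizer")
config = AutoConfig.from_pretrained(
    "openai-community/gpt2",
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)  # randomly initialized; ready for training on code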
Out-of-Scope Use
This tokenizer is specifically designed for tokenizing source code. It is not suited for general natural-language tasks such as sentiment analysis or free-form text generation outside the context of source code.
Bias, Risks, and Limitations
This tokenizer reflects the distribution of the dataset it was trained on. For example, edge cases or rare programming-language constructs that were underrepresented in the training data may be tokenized inefficiently.
Recommendations
Users should be aware of these limitations when applying the tokenizer to less-common programming languages; because the training data was limited to Python (see Training Data below), other languages may be tokenized less efficiently. It may also not handle malformed code or highly unconventional syntax well.
How to Get Started with the Model
You can use the tokenizer by loading it via the Hugging Face transformers library:
from transformers import AutoTokenizer
# Load the custom tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("Adiii143/code-search-net-tokenizer")
# Tokenize a code snippet
code_snippet = "def hello_world(): print('Hello, world!')"
tokens = tokenizer.encode(code_snippet)
print(tokens)
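Continuing from the snippet above, you can inspect the subword pieces and round-trip the ids back to text:
# Show the subword tokens and decode the ids back to the original snippet
print(tokenizer.convert_ids_to_tokens(tokens))
print(tokenizer.decode(tokens))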
Training Details
Training Data
The tokenizer was trained using the CodeSearchNet dataset, which contains millions of code snippets from various programming languages. Although the full dataset is diverse in terms of programming languages and code style, this tokenizer was trained on its Python subset.
- Dataset: CodeSearchNet
- Languages covered: Python
Training Procedure
The tokenizer was built using the original GPT-2 tokenizer as a base: its byte-level BPE vocabulary was retrained on the CodeSearchNet corpus so that the learned subword units respect common syntax and identifiers in code.
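A minimal sketch of this procedure, following the standard train_new_from_iterator workflow in transformers, is shown below. The dataset field, batch size, and vocabulary size are illustrative assumptions rather than settings recorded in this card, and availability of the code_search_net dataset on the Hub may vary:
from datasets import load_dataset
from transformers import AutoTokenizer
# Python subset of CodeSearchNet (assumed Hub id and field name)
raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)
def training_corpus():
    # Yield raw function strings in batches so the whole corpus never sits in memory
    train = raw_datasets["train"]
    for start in range(0, len(train), 1000):
        yield train[start : start + 1000]["whole_func_string"]
# Retrain GPT-2's byte-level BPE vocabulary on the code corpus
old_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus(), vocab_size=52000)
tokenizer.save_pretrained("code-search-net-tokenizer")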