Transformers
tokenizer
gpt-2
Adiii143 committed
Commit e7c7f61 · verified · 1 Parent(s): c210965

Update README.md

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -12,7 +12,8 @@ base_model:
 
  # Model Card for Code-Net Tokenizer Trained on GPT-2
 
- This model card describes a custom tokenizer trained on the existing GPT-2 tokenizer using the CodeSearchNet dataset. The tokenizer was adapted to better handle code-specific tokenization, leveraging the large scale and fine-grained vocabulary of the GPT-2 model.
+ This model card describes a custom tokenizer trained on the existing GPT-2 tokenizer using the CodeSearchNet dataset.
+ The tokenizer was adapted to better handle code-specific tokenization, leveraging the large scale and fine-grained vocabulary of the GPT-2 model.
 
  ## Model Details
 
@@ -20,41 +21,40 @@ This model card describes a custom tokenizer trained on the existing GPT-2 token
 
  This tokenizer was fine-tuned on the CodeSearchNet dataset, which contains millions of code snippets in multiple programming languages. The tokenizer was initialized with the GPT-2 tokenizer and then adapted to better handle the unique characteristics of programming language syntax and semantics.
 
- - **Developed by:** [Your Name or Organization]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
+ - **Developed by:** Aditya Ak
+ - **Shared by [optional]:** Aditya Ak
  - **Model type:** Tokenizer
- - **Language(s) (NLP):** Python, Java, JavaScript, Go, Ruby, etc.
+ - **Language(s) (NLP):** Python
  - **License:** Apache 2.0
  - **Finetuned from model [optional]:** openai-community/gpt2
 
- ### Model Sources [optional]
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
 
  ## Uses
 
  ### Direct Use
 
- The tokenizer can be used directly in any NLP tasks that involve source code, such as code generation, code summarization, or bug detection, by replacing the original GPT-2 tokenizer with this newly trained version.
+ The tokenizer can be used directly in any NLP tasks that involve source code, such as code generation, code summarization,
+ or bug detection, by replacing the original GPT-2 tokenizer with this newly trained version.
 
  ### Downstream Use [optional]
 
- When plugged into a code-generation or code-understanding pipeline, this tokenizer can help improve the model’s understanding of programming languages and code structure.
+ When plugged into a code-generation or code-understanding pipeline, this tokenizer
+ can help improve the model’s understanding of programming languages and code structure.
 
  ### Out-of-Scope Use
 
- This tokenizer is specifically designed for tokenizing programming code. It is not suited for general text-based NLP tasks like natural language processing, sentiment analysis, or text generation outside the context of source code.
+ This tokenizer is specifically designed for tokenizing programming code. It is not suited for general text-based NLP
+ tasks like natural language processing, sentiment analysis, or text generation outside the context of source code.
 
  ## Bias, Risks, and Limitations
 
- This model may introduce bias based on the dataset it was trained on. For example, the tokenizer might have difficulty with edge cases or rare programming language constructs that were underrepresented in the training data.
+ This model may introduce bias based on the dataset it was trained on. For example, the tokenizer might have
+ difficulty with edge cases or rare programming language constructs that were underrepresented in the training data.
 
  ### Recommendations
 
- Users should be aware of potential limitations when applying this tokenizer to specific, less-common programming languages. Additionally, it may not handle malformed code or highly unconventional syntaxes well.
+ Users should be aware of potential limitations when applying this tokenizer to specific, less-common programming
+ languages. Additionally, it may not handle malformed code or highly unconventional syntaxes well.
 
  ## How to Get Started with the Model
 
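
The diff is truncated at the "How to Get Started with the Model" heading. A minimal usage sketch for that section is below; the repo id `Adiii143/code-net-tokenizer` is an illustrative placeholder, not a path confirmed by this commit, and the loading call is simply the standard `transformers` API.

```python
from transformers import AutoTokenizer

# Hypothetical repo id -- replace with the actual Hub path of this tokenizer.
REPO_ID = "Adiii143/code-net-tokenizer"

# Load the retrained tokenizer exactly as you would the stock GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

snippet = "def add(a, b):\n    return a + b"

# Tokenize a Python snippet; the code-adapted vocabulary should split
# identifiers and syntax into fewer, more meaningful pieces than stock GPT-2.
encoded = tokenizer(snippet)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```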
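The card states the tokenizer was initialized from GPT-2 and then adapted on CodeSearchNet. A sketch of how such a retraining is commonly done with `train_new_from_iterator` follows; the dataset config, code column name, and vocabulary size are assumptions for illustration, not details taken from this commit.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup: the Python split of CodeSearchNet and the stock GPT-2 tokenizer.
# (code_search_net is a script-based dataset, hence trust_remote_code.)
dataset = load_dataset("code_search_net", "python", split="train", trust_remote_code=True)
base_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

def batch_iterator(batch_size=1000):
    # Stream raw function source; "whole_func_string" is the assumed code column.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["whole_func_string"]

# Retrain the GPT-2 BPE on code, keeping GPT-2's vocabulary size (50,257).
code_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=50257
)
code_tokenizer.save_pretrained("code-net-tokenizer")
```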