PakdamanAli committed on
Commit 8881562 · verified · 1 Parent(s): 06236e7

Update README.md

Files changed (1)
  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: mit
- ---
+ ---
+ language: fa
+ license: mit
+ tags:
+ - keyword-extraction
+ - persian
+ - farsi
+ - token-classification
+ - distilbert
+ - nlp
+ datasets:
+ - custom
+ metrics:
+ - precision
+ - recall
+ - f1
+ widget:
+ - text: "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
+ ---
+
+ # Model Datacard: Persian Keyword Extraction Model
+
+ ## Model Details
+ - **Model Name**: keyword_distilbert_base_per
+ - **Base Model**: distilbert
+ - **Task**: Keyword Extraction
+ - **Language**: Persian (Farsi)
+ - **Developer**: PakdamanAli
+ - **Model Version**: 1.0.0
+
+ ## Intended Use
+ This model is designed to extract keywords from Persian text. It can be used for:
+ - Automatic tagging of content
+ - Search engine optimization
+ - Content categorization
+ - Topic modeling
+ - Information retrieval enhancement
+
+ ### Primary Intended Uses
+ - Content analysis for Persian websites
+ - Academic research on Persian text
+ - Information extraction systems
+
+ ### Out-of-Scope Use Cases
+ - Translation services
+ - Text summarization
+ - Persian named entity recognition (unless specifically trained for this)
+ - Other NLP tasks beyond keyword extraction
+
+ ## Training Data
+ - **Dataset Size**: 40,000 Persian text samples
+ - **Data Preparation**: Fine-tuned on xlm-roberta-large
+
+ ## Performance Evaluation
+ Metrics and evaluation results will be published in a future update.
+
+ ## Limitations
+ - The model may not perform well on domain-specific content that was not represented in the training data
+ - Performance may vary for very short or extremely long texts
+ - The model may occasionally extract words that are not truly "key" to the content
+ - Dialect variations in Persian might affect extraction quality
+
+ ## Ethical Considerations
+ - The model is trained on Persian text and may reflect biases present in that content
+ - Users should verify extracted keywords for sensitive content before implementing in automated systems
+ - The model should not be used to extract or analyze personally identifiable information without proper consent
+
+ ## Technical Specifications
+ - **Input**: Persian text (UTF-8 encoded)
+ - **Output**: List of extracted keywords
+ - **Framework**: Transformers (Hugging Face)
+ - **Requirements**: PyTorch, Transformers
+
+ ## Pipeline Usage
+ To use this model with the Hugging Face pipeline:
+
+ ```python
+ from transformers import pipeline
+
+ # Initialize the pipeline
+ keyword_extractor = pipeline(
+     task="token-classification",
+     model="PakdamanAli/keyword_distilbert_base_per",
+     tokenizer="PakdamanAli/keyword_distilbert_base_per"
+ )
+
+ # Example usage
+ text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
+ keywords = keyword_extractor(text)
+
+ # Process the results based on the model output format
+ # Example: extracted_keywords = [item["word"] for item in keywords]
+ ```
+
+ ## Example
+ ```python
+ from transformers import pipeline
+
+ extractor = pipeline(
+     task="token-classification",
+     model="PakdamanAli/keyword_distilbert_base_per",
+     tokenizer="PakdamanAli/keyword_distilbert_base_per"
+ )
+
+ text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
+ results = extractor(text)
+
+ # Extract just the words from the results
+ keywords = [item["word"] for item in results]
+ print(keywords)
+ ```
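
The examples in this diff read `item["word"]` directly. For a token-classification pipeline run without aggregation, that can return subword pieces rather than whole keywords. As a minimal sketch, assuming the pipeline returns the usual list of dicts with `score`, `start`, and `end` keys (character offsets are only populated when a fast tokenizer is used), the hypothetical helper `merge_keyword_spans` stitches adjacent pieces back together by slicing the original text. The function name, the `min_score` threshold, and the `max_gap` parameter are illustrative assumptions, not part of the model or the README.

```python
from typing import Dict, List


def merge_keyword_spans(
    text: str,
    items: List[Dict],
    min_score: float = 0.5,  # assumed threshold, not taken from the README
    max_gap: int = 0,        # max characters allowed between merged pieces
) -> List[str]:
    """Merge token-level predictions into keyword strings via character offsets."""
    # Keep confident predictions that carry offsets, ordered by position.
    kept = sorted(
        (
            it
            for it in items
            if it.get("start") is not None
            and it.get("end") is not None
            and it["score"] >= min_score
        ),
        key=lambda it: it["start"],
    )

    # Grow the current span while the next piece starts within max_gap of it.
    spans: List[List[int]] = []
    for it in kept:
        if spans and it["start"] - spans[-1][1] <= max_gap:
            spans[-1][1] = max(spans[-1][1], it["end"])
        else:
            spans.append([it["start"], it["end"]])

    # Slice the original text and drop duplicates while preserving order.
    seen = set()
    keywords: List[str] = []
    for start, end in spans:
        keyword = text[start:end].strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords
```

With the default `max_gap=0`, only pieces that touch (subwords of one word) are joined. Passing `max_gap=1` also joins neighbouring words separated by a single character into one phrase. Whether that suits this model depends on its label scheme, which the README does not document.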