subreddit_description_topic_classifier
This model is a fine-tuned version of distilbert-base-uncased on the subreddit_technology_classification dataset. It achieves the following results on the evaluation set:
- Loss: 0.3915
- Accuracy: 0.8290
Model description
I developed a topic classifier with the specific purpose of discerning whether a given subreddit is associated with particular technology companies or stocks. The model focuses on identifying subreddits related to the "Magnificent Seven Companies," namely Apple, Microsoft, Alphabet, Amazon, Nvidia, Tesla, and Meta.
Intended uses & limitations
The model is intended as a binary topic classifier for a project on technology companies and stocks: given a subreddit description, it predicts whether the subreddit is relevant (1) or not (0). Other use cases and the model's limitations have not yet been documented.
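A minimal inference sketch for the binary classification use described above. It assumes the checkpoint can be loaded from the Hugging Face Hub under the repository id `gulnuravci/subreddit_description_topic_classifier` and that the labels keep the default `LABEL_0`/`LABEL_1` names; neither is confirmed in this card. The helper names `load_classifier` and `is_tech_related` are illustrative only.

```python
from typing import Callable, Dict, List

# Repository id; assumed to be loadable from the Hub.
MODEL_ID = "gulnuravci/subreddit_description_topic_classifier"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that is_tech_related has no hard dependency on transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def is_tech_related(description: str, classifier: Callable[[str], List[Dict]]) -> bool:
    """Return True when the top predicted label is the positive class.

    Assumes the default label naming, where "LABEL_1" corresponds to
    label 1 (relevant) in the training data.
    """
    top = classifier(description)[0]
    return top["label"] == "LABEL_1"
```

With network access, `is_tech_related("News about Nvidia GPUs and NVDA stock.", load_classifier())` would run one description through the model. If the published config defines custom label names, the comparison against `"LABEL_1"` needs to be adjusted.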
Training and evaluation data
To train the model, I curated a dataset comprising approximately 1,000 subreddit descriptions obtained through the Reddit API. The extraction process involved keyword searches related to the "Magnificent Seven Companies." Subsequently, I manually labeled these subreddits, assigning a label of 1 for those directly related to technology and 0 for those unrelated.
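The collection step could look roughly like the sketch below. It assumes the PRAW client library, which this card does not name, and an illustrative keyword list made of the seven company names; the actual search terms and per-keyword limits are not documented. The function name `collect_descriptions` and the record layout are hypothetical.

```python
from typing import Iterable, List

# Illustrative search terms; the actual keyword list is not given in this card.
KEYWORDS = ["Apple", "Microsoft", "Alphabet", "Amazon", "Nvidia", "Tesla", "Meta"]


def collect_descriptions(reddit, keywords: Iterable[str], per_keyword: int = 150) -> List[dict]:
    """Search subreddits by keyword and return unique, unlabeled records.

    `reddit` is expected to behave like an authenticated `praw.Reddit` instance.
    """
    seen = set()
    rows = []
    for keyword in keywords:
        for sub in reddit.subreddits.search(keyword, limit=per_keyword):
            name = sub.display_name
            text = (sub.public_description or "").strip()
            if name in seen or not text:
                continue  # skip duplicates across keywords and empty descriptions
            seen.add(name)
            # The label is left empty here and filled in by hand (1 = related, 0 = unrelated).
            rows.append({"subreddit": name, "description": text, "label": None})
    return rows
```

Deduplicating on the subreddit name matters because the same community can match several company keywords.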
The extracted subreddits were diverse, covering technology, finance, stocks, and crypto, so I aimed for a balanced distribution of relevant and non-relevant subreddits when labeling. The resulting dataset provided a foundation for fine-tuning the model, and with roughly 1,000 data points, I am confident in its efficacy for the intended purpose.
Considering the anticipated large sample size during the evaluation phase across a variety of subreddits, any potential errors in classification are expected to be mitigated by the sheer volume of data points. This approach is designed to ensure the model's robustness and generalizability across a broad spectrum of subreddit descriptions.
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
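The hyperparameters above map onto `transformers.TrainingArguments` roughly as sketched here. This assumes single-device training, so that the reported batch sizes correspond to the `per_device_*` arguments, and a hypothetical output directory; the original training script is not part of this card.

```python
# Values copied from the list above; the key names are TrainingArguments parameters.
HYPERPARAMETERS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes a single device
    "per_device_eval_batch_size": 16,   # assumes a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
}


def build_training_args(output_dir: str = "subreddit_description_topic_classifier"):
    """Create TrainingArguments from the values above (output_dir is a placeholder)."""
    # Imported lazily so the dictionary can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMETERS)
```

The resulting object would be passed to a `Trainer` together with the model, tokenized datasets, and a metric function for accuracy.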
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---|---|---|---|---|
| No log | 1.0 | 49 | 0.4181 | 0.8238 |
| No log | 2.0 | 98 | 0.3915 | 0.8290 |
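The step counts in the table allow a quick sanity check. Assuming one optimizer step per batch (no gradient accumulation, a single device, and a final partial batch that is kept), 49 steps per epoch at batch size 16 bounds the training set size. The second helper sketches the linear schedule with zero warmup, which is the `Trainer` default but is an assumption here. The "No log" entries are consistent with the run ending at step 98, before the default logging interval of 500 steps is reached.

```python
import math


def train_size_bounds(steps_per_epoch: int, batch_size: int) -> tuple:
    """Range of dataset sizes n for which ceil(n / batch_size) == steps_per_epoch."""
    low = (steps_per_epoch - 1) * batch_size + 1
    high = steps_per_epoch * batch_size
    return low, high


def linear_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int = 0) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


low, high = train_size_bounds(steps_per_epoch=49, batch_size=16)
# Both ends of the range must round up to the same number of steps.
assert math.ceil(low / 16) == math.ceil(high / 16) == 49

print(f"training examples: between {low} and {high}")
print(f"learning rate after epoch 1 (step 49 of 98): {linear_lr(49, 98, 2e-5):.1e}")
```

Under these assumptions the training split holds between 769 and 784 descriptions, and the learning rate has decayed to half of its initial value by the end of the first epoch.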
Framework versions
- Transformers 4.38.2
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
Model tree for gulnuravci/subreddit_description_topic_classifier
Base model
distilbert/distilbert-base-uncased