subreddit_description_topic_classifier

This model is a fine-tuned version of distilbert-base-uncased on a subreddit_technology_classification dataset. It achieves the following results on the evaluation set:

Loss: 0.3915
Accuracy: 0.8290

Model description

I developed a topic classifier to determine whether a given subreddit is associated with particular technology companies or their stocks. The model focuses on identifying subreddits related to the "Magnificent Seven Companies," namely Apple, Microsoft, Alphabet, Amazon, Nvidia, Tesla, and Meta.

Intended uses & limitations

The primary function of this model is to serve as a binary topic classifier for a project centered on technology companies and stocks. Use cases and limitations beyond that project have not yet been documented, so more detailed information is needed before its applicability and constraints in other settings can be assessed.
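A minimal usage sketch for the binary classification described above. It assumes the model is published on the Hugging Face Hub under the id shown in `MODEL_ID`, and that predictions use the default label names `LABEL_0` and `LABEL_1`; neither is confirmed by this card, and the helper names are hypothetical.

```python
MODEL_ID = "gulnuravci/subreddit_description_topic_classifier"  # assumed Hub id


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def to_binary(prediction: dict) -> int:
    """Map one pipeline prediction to 1 (relevant) or 0 (not relevant).

    Assumes the default label names LABEL_0 / LABEL_1; adjust this if the
    model config defines custom label names.
    """
    return int(prediction["label"].rsplit("_", 1)[-1])


def classify(classifier, descriptions: list[str]) -> list[int]:
    """Classify a batch of subreddit descriptions."""
    # truncation=True keeps long descriptions within the model's input limit.
    return [to_binary(p) for p in classifier(descriptions, truncation=True)]


# Example (requires network access and the transformers package):
#   clf = load_classifier()
#   classify(clf, ["News and discussion about Nvidia GPUs and NVDA stock."])
```

Keeping the label mapping in one small function makes it easy to change if the published model uses different label names.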

Training and evaluation data

To train the model, I curated a dataset of approximately 1,000 subreddit descriptions obtained through the Reddit API, using keyword searches related to the "Magnificent Seven Companies." I then manually labeled each subreddit, assigning 1 to those directly related to technology and 0 to those unrelated.
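The collection step can be sketched roughly as follows. This is an illustration, not the script used for this dataset: the keyword list, the function names, and the choice of PRAW as the Reddit API client are all assumptions, as is the use of each subreddit's public description as the text field.

```python
from typing import Callable, Iterable

# Hypothetical keyword list; the exact search terms used are not documented.
KEYWORDS = ["Apple", "Microsoft", "Alphabet", "Amazon", "Nvidia", "Tesla", "Meta"]


def collect_descriptions(
    search: Callable[[str], Iterable[tuple[str, str]]],
    keywords: Iterable[str] = KEYWORDS,
) -> dict[str, str]:
    """Run one search per keyword and keep one description per subreddit."""
    found: dict[str, str] = {}
    for keyword in keywords:
        for name, description in search(keyword):
            description = description.strip()
            key = name.lower()
            # Skip empty descriptions and subreddits already seen for another keyword.
            if description and key not in found:
                found[key] = description
    return found


def praw_search(reddit, limit: int = 100) -> Callable[[str], Iterable[tuple[str, str]]]:
    """Adapter for a praw.Reddit client (one possible backend, an assumption)."""

    def search(keyword: str):
        for sub in reddit.subreddits.search(keyword, limit=limit):
            yield sub.display_name, sub.public_description or ""

    return search
```

Passing the search function in as an argument keeps the deduplication logic independent of the API client, so it can be checked without credentials. The manual 0/1 labeling would then be applied to the returned descriptions.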

The extracted subreddits were diverse, covering technology, finance, stocks, and crypto, so the labeling aimed for a balanced distribution of relevant and non-relevant subreddits with respect to the project's objectives. The resulting dataset provided a foundation for fine-tuning the model, and with approximately 1,000 data points, I am confident in its efficacy for the intended purpose.

Because the model is expected to be applied to a large sample covering a variety of subreddits, the effect of individual classification errors is expected to be mitigated by the sheer volume of data points. This approach is intended to support the model's robustness and generalizability across a broad spectrum of subreddit descriptions.
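One way to make the volume argument concrete is a simple binomial model. The sketch below assumes that errors on new subreddits are independent and occur at the rate implied by the evaluation accuracy of 0.8290; both are assumptions, and the function names are illustrative.

```python
import math

EVAL_ACCURACY = 0.8290  # accuracy reported on the evaluation set


def expected_errors(n: int, accuracy: float = EVAL_ACCURACY) -> float:
    """Expected number of misclassified descriptions out of n."""
    return n * (1.0 - accuracy)


def error_rate_std(n: int, accuracy: float = EVAL_ACCURACY) -> float:
    """Standard deviation of the observed error fraction under a binomial model."""
    p = 1.0 - accuracy
    return math.sqrt(p * (1.0 - p) / n)


# Print sample size, expected error count, and spread of the error fraction.
for n in (100, 1_000, 10_000):
    print(n, round(expected_errors(n), 1), round(error_rate_std(n), 4))
```

Under these assumptions the expected share of errors stays the same at any volume; what shrinks with more data points is the fluctuation of that share, which is what makes aggregate results more stable.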

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2
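
For reference, the hyperparameters above can be expressed as keyword arguments for `transformers.TrainingArguments`, together with a small sketch of what a linear schedule does to the learning rate. The output directory, the single-device mapping of the batch sizes, and the zero warmup steps are assumptions; the helper names are illustrative.

```python
import math

# Keyword arguments for transformers.TrainingArguments mirroring the list above.
TRAINING_KWARGS = {
    "output_dir": "subreddit_description_topic_classifier",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes training on a single device
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
}


def build_training_args():
    """Create the TrainingArguments object (requires the transformers package)."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(**TRAINING_KWARGS)


def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


def steps_per_epoch(num_examples: int, batch_size: int = 16) -> int:
    """Number of optimizer steps per epoch without gradient accumulation."""
    return math.ceil(num_examples / batch_size)
```

With 2 epochs of 49 steps each (98 steps in total, as listed in the training results), `linear_lr(49, 98)` gives half the initial rate at the end of the first epoch. A count of 49 steps per epoch at batch size 16 corresponds to between 769 and 784 training examples, assuming no gradient accumulation.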

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
No log	1.0	49	0.4181	0.8238
No log	2.0	98	0.3915	0.8290

Framework versions

Transformers 4.38.2
Pytorch 2.1.0+cu121
Datasets 2.18.0
Tokenizers 0.15.2

gulnuravci/subreddit_description_topic_classifier