Fill-Mask
Transformers
Safetensors
roberta
ChemBERTa
cheminformatics
Eval Results
Commit e6254ce by eacortes (verified) · 1 parent: 3fbf182

Add chemberta3 benchmark results

Files changed (1)
  1. README.md +160 -4
README.md CHANGED
@@ -4,11 +4,115 @@ datasets:
  - Derify/augmented_canonical_druglike_QED_43M
  - Derify/druglike
  metrics:
- - spearmanr
+ - roc_auc
+ - rmse
  library_name: transformers
  tags:
  - ChemBERTa
  - cheminformatics
+ pipeline_tag: fill-mask
+ model-index:
+ - name: Derify/ChemBERTa-druglike
+   results:
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: BACE
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.8114
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: BBBP
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.7399
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: TOX21
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.7522
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: HIV
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.7527
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: SIDER
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.6577
+   - task:
+       type: text-classification
+       name: Classification (ROC AUC)
+     dataset:
+       name: CLINTOX
+       type: Derify/druglike
+     metrics:
+     - type: roc_auc
+       value: 0.9660
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: ESOL
+       type: Derify/druglike
+     metrics:
+     - type: rmse
+       value: 0.8241
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: FREESOLV
+       type: Derify/druglike
+     metrics:
+     - type: rmse
+       value: 0.5350
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: LIPO
+       type: Derify/druglike
+     metrics:
+     - type: rmse
+       value: 0.6663
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: BACE
+       type: Derify/druglike
+     metrics:
+     - type: rmse
+       value: 1.0105
+   - task:
+       type: regression
+       name: Regression (RMSE)
+     dataset:
+       name: CLEARANCE
+       type: Derify/druglike
+     metrics:
+     - type: rmse
+       value: 43.4499
  ---
 
  # ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
@@ -50,6 +154,26 @@ The model's effectiveness was validated through downstream Chem-MRL training on
 
  W&B report on [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes/afh508m3).
 
+ ## Benchmarks
+ ### Classification Datasets (ROC AUC - Higher is better)
+
+ | Model                     | BACE↑  | BBBP↑  | TOX21↑ | HIV↑   | SIDER↑ | CLINTOX↑ |
+ | ------------------------- | ------ | ------ | ------ | ------ | ------ | -------- |
+ | **Tasks**                 | 1      | 1      | 12     | 1      | 27     | 2        |
+ | Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660   |
+
+ ### Regression Datasets (RMSE - Lower is better)
+
+ | Model                     | ESOL↓  | FREESOLV↓ | LIPO↓  | BACE↓  | CLEARANCE↓ |
+ | ------------------------- | ------ | --------- | ------ | ------ | ---------- |
+ | **Tasks**                 | 1      | 1         | 1      | 1      | 1          |
+ | Derify/ChemBERTa-druglike | 0.8241 | 0.5350    | 0.6663 | 1.0105 | 43.4499    |
+
+ Benchmarks were conducted using the [chemberta3](https://github.com/deepforestsci/chemberta3) framework.
+ Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length.
+ The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and batch size of 32.
+ Each task was run with 3 different random seeds, and the mean performance is reported.
+
  ## Use Cases
 
  - Molecular property prediction
@@ -61,6 +185,38 @@ W&B report on [ChemBERTa-druglike evaluation](https://api.wandb.ai/links/ecortes
  - Optimized specifically for drug-like molecules
  - Performance may vary on non-drug-like chemical compounds
 
- ## Citation
- - Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." _arXiv [Cs.LG]_, 2020. [Link](http://arxiv.org/abs/2010.09885).
- - Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." _arXiv [Cs.LG]_, 2022. [Link](http://arxiv.org/abs/2209.01712).
+ ## Citations
+ ### ChemBERTa Series
+ ```
+ @misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
+       title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
+       author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
+       year={2020},
+       eprint={2010.09885},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2010.09885},
+ }
+ ```
+ ```
+ @misc{ahmad2022chemberta2chemicalfoundationmodels,
+       title={ChemBERTa-2: Towards Chemical Foundation Models},
+       author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
+       year={2022},
+       eprint={2209.01712},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2209.01712},
+ }
+ ```
+ ```
+ @misc{singh2025chemberta3opensource,
+       title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
+       author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
+       year={2025},
+       howpublished={ChemRxiv},
+       doi={10.26434/chemrxiv-2025-4glrl-v2},
+       note={This content is a preprint and has not been peer-reviewed},
+       url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
+ }
+ ```
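
The Benchmarks section added in this commit describes two steps that are easy to misread: datasets are filtered to SMILES of length ≤128, and each reported score is the mean over 3 random seeds. The sketch below illustrates both steps in plain Python. It is not the chemberta3 framework's code: the function names, the toy records, and the per-seed scores are hypothetical, and it assumes "length" means character count, which the README does not state explicitly.

```python
from statistics import mean

# From the README text: molecules longer than this are dropped before evaluation.
MAX_SMILES_LENGTH = 128


def filter_by_smiles_length(records, max_length=MAX_SMILES_LENGTH):
    """Keep (smiles, label) pairs whose SMILES has at most max_length characters.

    Assumption: length is measured in characters, not tokenizer tokens.
    """
    return [(smiles, label) for smiles, label in records if len(smiles) <= max_length]


def mean_over_seeds(scores_by_seed):
    """Average one metric (ROC AUC or RMSE) across runs with different seeds."""
    return mean(scores_by_seed.values())


# Hypothetical toy data, not taken from any benchmark dataset.
records = [("CCO", 1), ("c1ccccc1", 0), ("C" * 200, 1)]
kept = filter_by_smiles_length(records)

# Hypothetical per-seed scores for a single task.
scores = {0: 0.80, 1: 0.82, 2: 0.81}
reported = mean_over_seeds(scores)
print(len(kept), round(reported, 4))
```

Because the filter removes long molecules before scoring, results obtained this way are not directly comparable with numbers computed on the unfiltered datasets.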