jnmrr committed · verified
Commit 890e7ea · 1 Parent(s): 3951193

Upload RT-DETRv2 voucher classifier

README.md ADDED
@@ -0,0 +1,156 @@
+ ---
+ license: apache-2.0
+ base_model: PekingU/rtdetr_v2_r101vd
+ tags:
+ - object-detection
+ - computer-vision
+ - voucher-classification
+ - rt-detr
+ - rtdetrv2
+ datasets:
+ - custom-voucher-dataset
+ metrics:
+ - map
+ - map_50
+ - map_75
+ widget:
+ - src: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg
+   example_title: Example Image
+ ---
+
+ # RT-DETRv2 Fine-tuned for Voucher Classification
+
+ This model is a fine-tuned version of [PekingU/rtdetr_v2_r101vd](https://huggingface.co/PekingU/rtdetr_v2_r101vd) for voucher classification and object detection.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type**: Object Detection (RT-DETRv2)
+ - **Base Model**: PekingU/rtdetr_v2_r101vd
+ - **Task**: Multi-class voucher classification and detection
+ - **Classes**: 3 classes
+   - 0: digital (digital invoices)
+   - 1: fisico (physical receipts on blank pages)
+   - 2: tesoreria (small on-site payment receipts)
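+
+ For reference, the integer ids above map to class names as in the minimal sketch below. Note that the shipped `config.json` still carries the generic `LABEL_0`/`LABEL_1`/`LABEL_2` placeholders, so this mapping lives in user code rather than being read from the checkpoint.
+
+ ```python
+ # Label mapping used throughout this card (kept in user code; the checkpoint's
+ # id2label still contains the generic LABEL_0/LABEL_1/LABEL_2 placeholders).
+ id2label = {0: "digital", 1: "fisico", 2: "tesoreria"}
+ label2id = {name: idx for idx, name in id2label.items()}
+ ```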
+
+ ### Training Details
+
+ **Training Dataset:**
+ - **Total Samples**: 507
+ - **Class Distribution**:
+   - **fisico** (id: 1): 241 samples (47.5%)
+   - **digital** (id: 0): 147 samples (29.0%)
+   - **tesoreria** (id: 2): 119 samples (23.5%)
+
+ **Training Configuration:**
+ - **Image Size**: 800x800
+ - **Batch Size**: 24
+ - **Learning Rate**: 1.5e-05
+ - **Weight Decay**: 0.0001
+ - **Epochs**: 2
+ - **Validation Split**: 0.0
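+
+ As a minimal sketch, the hyperparameters above could be expressed with `transformers.TrainingArguments` roughly as follows; the output directory and every setting not listed above are assumptions, not the exact training script used for this run.
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Sketch only: mirrors the hyperparameters listed above; everything else is an assumption.
+ training_args = TrainingArguments(
+     output_dir="rtdetr-v2-voucher-classifier",  # assumed output path
+     per_device_train_batch_size=24,
+     learning_rate=1.5e-5,
+     weight_decay=1e-4,
+     num_train_epochs=2,
+     remove_unused_columns=False,  # keep image/annotation columns for the detection collator
+ )
+ ```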
+
+ **Data Processing:**
+ - Pre-augmented dataset used (no runtime augmentation)
+ - External train/validation split via create_train_val_split.py (an illustrative sketch follows below)
+ - Preprocessing: Resize + Normalization only
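+
+ The split script itself is not included in this repository, so the following is only a hypothetical illustration of an external, stratified train/validation split; the file paths, the 10% split size, and the use of scikit-learn are assumptions.
+
+ ```python
+ from sklearn.model_selection import train_test_split
+
+ # Hypothetical stand-ins for the real dataset: one path and one class id per sample.
+ image_paths = [f"vouchers/img_{i:04d}.jpg" for i in range(507)]
+ labels = [0] * 147 + [1] * 241 + [2] * 119  # digital / fisico / tesoreria counts from above
+
+ # Stratified split so the class distribution is preserved in both subsets.
+ train_paths, val_paths, train_labels, val_labels = train_test_split(
+     image_paths, labels, test_size=0.1, stratify=labels, random_state=42
+ )
+ ```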
+
+ ### Performance Metrics
+
+ **Final Evaluation Results:** no held-out evaluation metrics were logged for this run (the validation split was 0.0 and the trainer log history is empty).
+ **Dataset Information:**
+ *Training Dataset:*
+ - **Digital invoices**: 147 samples (29.0%)
+ - **Fisico receipts**: 241 samples (47.5%)
+ - **Tesoreria receipts**: 119 samples (23.5%)
+ - **Total training samples**: 507
+
+ **Model Configuration:**
+ - **Base model**: PekingU/rtdetr_v2_r101vd
+ - **Architecture**: rtdetr_v2_r101vd
+ - **Input resolution**: 800×800 pixels
+ - **Training epochs**: 2
+ - **Batch size**: 24
+
+ **Training Hardware:**
+ - **GPU**: NVIDIA A100-SXM4-40GB
+ - **VRAM**: 39.6 GB
+ - **RAM**: 83.5 GB
+ - **GPU configuration**: A100 optimized
+
+ **Training Time**: 0.0 minutes
+
+ **Training Summary:**
+ - **Final training loss**: 0.0000
+
+ ### MLflow Tracking
+
+ - **MLflow Run ID**: c348e8235f8c40138c05c051fc207bb6
+ - **MLflow Experiment**: RT-DETRv2_Voucher_Classification
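+
+ A minimal sketch for retrieving the logged run, assuming you have access to the original MLflow tracking server (its URI is not published in this repository):
+
+ ```python
+ import mlflow
+
+ # Assumption: point the client at the tracking server that recorded this run.
+ # mlflow.set_tracking_uri("http://your-mlflow-server:5000")  # hypothetical URI
+
+ run = mlflow.get_run("c348e8235f8c40138c05c051fc207bb6")
+ print(run.data.params)   # logged hyperparameters
+ print(run.data.metrics)  # logged metrics
+ ```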
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForObjectDetection, AutoImageProcessor
+ import torch
+ from PIL import Image
+
+ # Load model and processor
+ model = AutoModelForObjectDetection.from_pretrained("jnmrr/rtdetr-v2-voucher-classifier")
+ image_processor = AutoImageProcessor.from_pretrained("jnmrr/rtdetr-v2-voucher-classifier")
+
+ # Load and preprocess image
+ image = Image.open("path/to/your/voucher.jpg").convert("RGB")
+ inputs = image_processor(images=image, return_tensors="pt")
+
+ # Run inference
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Post-process results
+ target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
+ results = image_processor.post_process_object_detection(
+     outputs,
+     target_sizes=target_sizes,
+     threshold=0.5
+ )[0]
+
+ # Print predictions
+ class_names = ["digital", "fisico", "tesoreria"]
+ for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+     print(f"Class: {class_names[label.item()]}")
+     print(f"Confidence: {score.item():.3f}")
+     print(f"BBox: {box.tolist()}")
+ ```
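+
+ Since the task is ultimately document-level classification, one simple convention (an assumption here, not part of the released code) is to reduce the detections to a single voucher label by taking the highest-scoring box. Continuing from the snippet above:
+
+ ```python
+ # Reduce detections to one document-level label (one possible convention, not the only one).
+ if len(results["scores"]) > 0:
+     best = results["scores"].argmax().item()
+     predicted_class = class_names[results["labels"][best].item()]
+     print(f"Predicted voucher type: {predicted_class}")
+ else:
+     print("No detection above the confidence threshold.")
+ ```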
+
+ ## Training Procedure
+
+ The model was fine-tuned using the Hugging Face Transformers library with:
+ - A pre-augmented dataset focusing on challenging cases
+ - Format-specific augmentation strategies applied during data preparation
+ - MLflow experiment tracking for reproducibility
+ - An external train/validation split for unbiased evaluation
+
+ ## Limitations and Bias
+
+ - Trained specifically on voucher/receipt images
+ - Performance may vary on images that differ significantly from the training distribution
+ - Model optimized for the 3-class voucher classification task
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{rtdetr-v2-voucher-classifier,
+   title={RT-DETRv2 Fine-tuned for Voucher Classification},
+   author={Your Name},
+   year={2025},
+   publisher={Hugging Face},
+   url={https://huggingface.co/jnmrr/rtdetr-v2-voucher-classifier}
+ }
+ ```
checkpoint-22/config.json ADDED
@@ -0,0 +1,129 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "silu",
+   "anchor_image_size": null,
+   "architectures": [
+     "RTDetrV2ForObjectDetection"
+   ],
+   "attention_dropout": 0.0,
+   "auxiliary_loss": true,
+   "backbone": null,
+   "backbone_config": {
+     "depths": [
+       3,
+       4,
+       23,
+       3
+     ],
+     "downsample_in_bottleneck": false,
+     "downsample_in_first_stage": false,
+     "embedding_size": 64,
+     "hidden_act": "relu",
+     "hidden_sizes": [
+       256,
+       512,
+       1024,
+       2048
+     ],
+     "layer_type": "bottleneck",
+     "model_type": "rt_detr_resnet",
+     "num_channels": 3,
+     "out_features": [
+       "stage2",
+       "stage3",
+       "stage4"
+     ],
+     "out_indices": [
+       2,
+       3,
+       4
+     ],
+     "stage_names": [
+       "stem",
+       "stage1",
+       "stage2",
+       "stage3",
+       "stage4"
+     ],
+     "torch_dtype": "float32"
+   },
+   "backbone_kwargs": null,
+   "batch_norm_eps": 1e-05,
+   "box_noise_scale": 1.0,
+   "d_model": 256,
+   "decoder_activation_function": "relu",
+   "decoder_attention_heads": 8,
+   "decoder_ffn_dim": 1024,
+   "decoder_in_channels": [
+     384,
+     384,
+     384
+   ],
+   "decoder_layers": 6,
+   "decoder_method": "default",
+   "decoder_n_levels": 3,
+   "decoder_n_points": 4,
+   "decoder_offset_scale": 0.5,
+   "disable_custom_kernels": true,
+   "dropout": 0.0,
+   "encode_proj_layers": [
+     2
+   ],
+   "encoder_activation_function": "gelu",
+   "encoder_attention_heads": 8,
+   "encoder_ffn_dim": 2048,
+   "encoder_hidden_dim": 384,
+   "encoder_in_channels": [
+     512,
+     1024,
+     2048
+   ],
+   "encoder_layers": 1,
+   "eos_coefficient": 0.0001,
+   "eval_size": null,
+   "feat_strides": [
+     8,
+     16,
+     32
+   ],
+   "focal_loss_alpha": 0.75,
+   "focal_loss_gamma": 2.0,
+   "freeze_backbone_batch_norms": true,
+   "hidden_expansion": 1.0,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_bias_prior_prob": null,
+   "initializer_range": 0.01,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "label_noise_ratio": 0.5,
+   "layer_norm_eps": 1e-05,
+   "learn_initial_query": false,
+   "matcher_alpha": 0.25,
+   "matcher_bbox_cost": 5.0,
+   "matcher_class_cost": 2.0,
+   "matcher_gamma": 2.0,
+   "matcher_giou_cost": 2.0,
+   "model_type": "rt_detr_v2",
+   "normalize_before": false,
+   "num_denoising": 100,
+   "num_feature_levels": 3,
+   "num_queries": 300,
+   "positional_encoding_temperature": 10000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.0",
+   "use_focal_loss": true,
+   "use_pretrained_backbone": false,
+   "use_timm_backbone": false,
+   "weight_loss_bbox": 5.0,
+   "weight_loss_giou": 2.0,
+   "weight_loss_vfl": 1.0,
+   "with_box_refine": true
+ }
checkpoint-22/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9de66f7b33e7ac4d76c89424fca8b97c02c5ffd7bb21cd21a3ea8d5e351821c
+ size 306699044
checkpoint-22/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:087c1f2c00c286193f70c788643e0d1176edcc4d78b2d09203f5c97419dcee0d
+ size 611580433
checkpoint-22/preprocessor_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "do_convert_annotations": true,
+   "do_normalize": false,
+   "do_pad": false,
+   "do_rescale": true,
+   "do_resize": true,
+   "format": "coco_detection",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_processor_type": "RTDetrImageProcessor",
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "pad_size": null,
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 640,
+     "width": 640
+   }
+ }
checkpoint-22/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a60ef25d0cb819c998330aa0d916d7bb159075f89558fabd3d6a4aafb3c8b73b
+ size 14244
checkpoint-22/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0836e9a481b0522ea2522eeeccf449d7f5873460995f8deac588da4c5e736f5d
+ size 1064
checkpoint-22/trainer_state.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "best_global_step": null,
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 2.0,
+   "eval_steps": 500,
+   "global_step": 22,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [],
+   "logging_steps": 50,
+   "max_steps": 22,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 2,
+   "save_steps": 200,
+   "stateful_callbacks": {
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": true
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 5.632598679552e+17,
+   "train_batch_size": 24,
+   "trial_name": null,
+   "trial_params": null
+ }
checkpoint-22/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:261388b5dfaf2a9df467bf9b4db643f296523f92d93fa65002e44c1b566dfa5f
+ size 5368
config.json ADDED
@@ -0,0 +1,129 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "silu",
+   "anchor_image_size": null,
+   "architectures": [
+     "RTDetrV2ForObjectDetection"
+   ],
+   "attention_dropout": 0.0,
+   "auxiliary_loss": true,
+   "backbone": null,
+   "backbone_config": {
+     "depths": [
+       3,
+       4,
+       23,
+       3
+     ],
+     "downsample_in_bottleneck": false,
+     "downsample_in_first_stage": false,
+     "embedding_size": 64,
+     "hidden_act": "relu",
+     "hidden_sizes": [
+       256,
+       512,
+       1024,
+       2048
+     ],
+     "layer_type": "bottleneck",
+     "model_type": "rt_detr_resnet",
+     "num_channels": 3,
+     "out_features": [
+       "stage2",
+       "stage3",
+       "stage4"
+     ],
+     "out_indices": [
+       2,
+       3,
+       4
+     ],
+     "stage_names": [
+       "stem",
+       "stage1",
+       "stage2",
+       "stage3",
+       "stage4"
+     ],
+     "torch_dtype": "float32"
+   },
+   "backbone_kwargs": null,
+   "batch_norm_eps": 1e-05,
+   "box_noise_scale": 1.0,
+   "d_model": 256,
+   "decoder_activation_function": "relu",
+   "decoder_attention_heads": 8,
+   "decoder_ffn_dim": 1024,
+   "decoder_in_channels": [
+     384,
+     384,
+     384
+   ],
+   "decoder_layers": 6,
+   "decoder_method": "default",
+   "decoder_n_levels": 3,
+   "decoder_n_points": 4,
+   "decoder_offset_scale": 0.5,
+   "disable_custom_kernels": true,
+   "dropout": 0.0,
+   "encode_proj_layers": [
+     2
+   ],
+   "encoder_activation_function": "gelu",
+   "encoder_attention_heads": 8,
+   "encoder_ffn_dim": 2048,
+   "encoder_hidden_dim": 384,
+   "encoder_in_channels": [
+     512,
+     1024,
+     2048
+   ],
+   "encoder_layers": 1,
+   "eos_coefficient": 0.0001,
+   "eval_size": null,
+   "feat_strides": [
+     8,
+     16,
+     32
+   ],
+   "focal_loss_alpha": 0.75,
+   "focal_loss_gamma": 2.0,
+   "freeze_backbone_batch_norms": true,
+   "hidden_expansion": 1.0,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_bias_prior_prob": null,
+   "initializer_range": 0.01,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "label_noise_ratio": 0.5,
+   "layer_norm_eps": 1e-05,
+   "learn_initial_query": false,
+   "matcher_alpha": 0.25,
+   "matcher_bbox_cost": 5.0,
+   "matcher_class_cost": 2.0,
+   "matcher_gamma": 2.0,
+   "matcher_giou_cost": 2.0,
+   "model_type": "rt_detr_v2",
+   "normalize_before": false,
+   "num_denoising": 100,
+   "num_feature_levels": 3,
+   "num_queries": 300,
+   "positional_encoding_temperature": 10000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.0",
+   "use_focal_loss": true,
+   "use_pretrained_backbone": false,
+   "use_timm_backbone": false,
+   "weight_loss_bbox": 5.0,
+   "weight_loss_giou": 2.0,
+   "weight_loss_vfl": 1.0,
+   "with_box_refine": true
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9de66f7b33e7ac4d76c89424fca8b97c02c5ffd7bb21cd21a3ea8d5e351821c
+ size 306699044
preprocessor_config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "do_convert_annotations": true,
+   "do_normalize": false,
+   "do_pad": false,
+   "do_rescale": true,
+   "do_resize": true,
+   "format": "coco_detection",
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_processor_type": "RTDetrImageProcessor",
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "pad_size": null,
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 640,
+     "width": 640
+   }
+ }
runs/Aug13_22-15-12_9db0f8c974d2/events.out.tfevents.1755123313.9db0f8c974d2.60074.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:75d7c3066003b54ea7b427f6d2ec1c2be19bdda662a8328d4ce99254a03b1716
+ size 7396
runs/Aug13_22-20-53_9db0f8c974d2/events.out.tfevents.1755123655.9db0f8c974d2.61846.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f69b32a94e95402ed3eea7115afb448510aebb3501c2d77b136d84d25afed77c
+ size 7396
runs/Aug13_22-23-40_9db0f8c974d2/events.out.tfevents.1755123822.9db0f8c974d2.62610.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3342ac0b845641b7113baa74e166d24cdd4696a9a5d34192988c9d5f4d6500a
+ size 7396
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c27b2f12813f34a64df28943d3a14ef6b011ae53edbbf445b45d5da3a845e221
+ size 5368