# Model Card for RecNeXt-M3


## Model Details

- **Model Type:** Image Classification / Feature Extraction
- **Model Series:** M
- **Model Stats:**
  - Parameters: 8.2M
  - MACs: 1.4G
  - Latency: 1.6ms (iPhone 13, iOS 18)
  - Image Size: 224x224
- **Architecture Configuration:**
  - Embedding Dimensions: (64, 128, 256, 512)
  - Depths: (3, 3, 13, 2)
  - MLP Ratio: (2, 2, 2, 2)
- **Paper:** [RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations](https://arxiv.org/abs/2412.19628)
- **Dataset:** ImageNet-1K
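
As a quick sanity check of the stats above, the parameter count can be read off the instantiated model. This is only a minimal sketch, assuming the `recnext_m3` timm registration used in the usage example below; measuring MACs would additionally require a profiler.

```python
import timm

# Build the architecture without downloading weights, just to inspect it
# (assumes 'recnext_m3' is registered with timm as in the usage example below).
model = timm.create_model('recnext_m3', pretrained=False, distillation=False)

num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")  # expected to be roughly 8.2M
```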
## Model Usage

### Image Classification

```python
from urllib.request import urlopen
from PIL import Image
import timm
import torch

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('recnext_m3', pretrained=True, distillation=False)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
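
Since the model is also listed for feature extraction, the pre-classifier feature maps can be pulled out as well. This is a minimal sketch, assuming the model exposes timm's usual `forward_features` interface; it reuses `model`, `transforms`, and `img` from the example above.

```python
with torch.no_grad():
    # Pre-classifier feature maps; the shape depends on the final stage stride
    # (e.g. roughly (1, 512, 7, 7) for a 224x224 input, per the embedding dims above).
    features = model.forward_features(transforms(img).unsqueeze(0))

print(features.shape)
```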
### Converting to Inference Mode

```python
import utils

# Convert training-time model to inference structure, fuse batchnorms
utils.replace_batchnorm(model)
```
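
A quick way to check that the conversion is numerically harmless is to compare outputs of the original and the fused model on the same input. This is only a sketch; the deep copy and the random input are for illustration.

```python
import copy
import torch

model = model.eval()
fused = copy.deepcopy(model)
utils.replace_batchnorm(fused)  # fuse batchnorms into the preceding convolutions

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    diff = (model(x) - fused(x)).abs().max()

print(diff)  # should be close to zero (up to floating-point noise)
```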
## Model Comparison

### Classification
We introduce two series of models: the A series uses linear attention and nearest interpolation, while the M series employs convolution and bilinear interpolation for simplicity and broader hardware compatibility (e.g., to address suboptimal nearest interpolation support in some iOS versions).
dist: with distillation; base: without distillation (all models are trained for 300 epochs).
| Model | Top-1 (dist) | Top-1 (base) | Params | GMACs | NPU Latency | CPU Latency | Throughput | Fused Weights | Training Logs |
|---|---|---|---|---|---|---|---|---|---|
| M0 | 74.7 | 73.2 | 2.5M | 0.4 | 1.0ms | 189ms | 763 | dist / base | dist / base |
| M1 | 79.2 | 78.0 | 5.2M | 0.9 | 1.4ms | 361ms | 384 | dist / base | dist / base |
| M2 | 80.3 | 79.2 | 6.8M | 1.2 | 1.5ms | 431ms | 325 | dist / base | dist / base |
| M3 | 80.9 | 79.6 | 8.2M | 1.4 | 1.6ms | 482ms | 314 | dist / base | dist / base |
| M4 | 82.5 | 81.1 | 14.1M | 2.4 | 2.4ms | 843ms | 169 | dist / base | dist / base |
| M5 | 83.3 | 81.6 | 22.9M | 4.7 | 3.4ms | 1487ms | 104 | dist / base | dist / base |
| A0 | 75.0 | 73.6 | 2.8M | 0.4 | 1.4ms | 177ms | 4902 | dist / base | dist / base |
| A1 | 79.6 | 78.3 | 5.9M | 0.9 | 1.9ms | 334ms | 2746 | dist / base | dist / base |
| A2 | 80.8 | 79.6 | 7.9M | 1.2 | 2.2ms | 413ms | 2327 | dist / base | dist / base |
| A3 | 81.1 | 80.1 | 9.0M | 1.4 | 2.4ms | 447ms | 2206 | dist / base | dist / base |
| A4 | 82.5 | 81.6 | 15.8M | 2.4 | 3.6ms | 764ms | 1265 | dist / base | dist / base |
| A5 | 83.5 | 83.1 | 25.7M | 4.7 | 5.6ms | 1376ms | 721 | dist / base | dist / base |
### Comparison with LSNet
| Model | Top-1 (dist) | Top-1 (base) | Params | GMACs | NPU Latency | CPU Latency | Throughput | Fused Weights | Training Logs |
|---|---|---|---|---|---|---|---|---|---|
| T | 76.6 | 75.1 | 12.1M | 0.3 | 1.8ms | 109ms | 14181 | dist / base | dist / base |
| S | 79.6 | 78.3 | 15.8M | 0.7 | 2.0ms | 188ms | 8234 | dist / base | dist / base |
| B | 81.4 | 80.3 | 19.3M | 1.1 | 2.5ms | 290ms | 4385 | dist / base | dist / base |
The NPU latency is measured on an iPhone 13 with models compiled by Core ML Tools. The CPU latency is measured on a quad-core ARM Cortex-A57 processor with models exported to ONNX format, and the throughput is tested on an Nvidia RTX 3090 with the maximum power-of-two batch size that fits in memory.
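
For reference, a throughput number in this spirit can be reproduced with a simple timing loop. This is only a sketch, not the exact benchmark script; the batch size, warm-up, and iteration counts are illustrative, and in practice the batch size should be the largest power of two that fits in GPU memory.

```python
import time
import torch

device = torch.device('cuda')
model = model.to(device).eval()

batch_size = 256  # illustrative; use the largest power of two that fits in memory
x = torch.randn(batch_size, 3, 224, 224, device=device)

with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()

print(f"Throughput: {batch_size * iters / (time.time() - start):.0f} images/s")
```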
## Citation

```bibtex
@misc{zhao2024recnext,
  title={RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations},
  author={Mingshu Zhao and Yi Luo and Yong Ouyang},
  year={2024},
  eprint={2412.19628},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```