# Model Card for RecNeXt-M5


## Model Details

## Model Usage

### Image Classification

```python
from urllib.request import urlopen
from PIL import Image
import timm
import torch

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('recnext_m5', pretrained=True, distillation=False)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
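
The same model can also be used as a feature backbone. The snippet below is a minimal sketch that assumes `recnext_m5` follows the standard timm conventions (`num_classes=0` for a pooled embedding, `forward_features` for the unpooled feature map).

```python
import timm
import torch

# A sketch assuming standard timm conventions for feature extraction.
backbone = timm.create_model('recnext_m5', pretrained=True, num_classes=0)  # drop the classifier head
backbone = backbone.eval()

data_config = timm.data.resolve_model_data_config(backbone)
transforms = timm.data.create_transform(**data_config, is_training=False)

x = transforms(img).unsqueeze(0)  # reuse the image loaded above
with torch.no_grad():
    embedding = backbone(x)                     # pooled embedding, shape (1, num_features)
    feature_map = backbone.forward_features(x)  # unpooled features before global pooling
```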

### Converting to Inference Mode

```python
# utils is provided in the RecNeXt code repository
import utils

# Convert training-time model to inference structure, fuse batchnorms
utils.replace_batchnorm(model)
```
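
As a quick sanity check, batch-norm fusion should leave the network's outputs essentially unchanged. The snippet below is a minimal sketch; the 224x224 input resolution and the tolerance are assumptions.

```python
import copy

import timm
import torch
import utils  # from the RecNeXt code repository

# Fuse a copy of the training-time model and compare outputs with the original.
train_model = timm.create_model('recnext_m5', pretrained=True, distillation=False).eval()
fused_model = copy.deepcopy(train_model)
utils.replace_batchnorm(fused_model)

x = torch.randn(1, 3, 224, 224)  # assumed default input resolution
with torch.no_grad():
    assert torch.allclose(train_model(x), fused_model(x), atol=1e-4)
```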

## Model Comparison

### Classification

We introduce two series of models: the A series uses linear attention and nearest-neighbor interpolation, while the M series employs convolution and bilinear interpolation for simplicity and broader hardware compatibility (e.g., to work around suboptimal nearest-neighbor interpolation support in some iOS versions).
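
Both series share the same timm interface. The snippet below is a minimal sketch of loading one model from each series; the A-series identifier `recnext_a5` is an assumption based on the naming pattern above, so check the model collection for the exact registered names.

```python
import timm

# A sketch only: 'recnext_a5' is an assumed identifier following the M-series
# naming pattern; verify the registered names before use.
m5 = timm.create_model('recnext_m5', pretrained=True)  # M series: convolution + bilinear interpolation
a5 = timm.create_model('recnext_a5', pretrained=True)  # A series: linear attention + nearest interpolation
```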

*dist*: trained with distillation; *base*: trained without distillation (all models are trained for 300 epochs).

| Model | Top-1 Accuracy (dist / base) | Params | GMACs | NPU Latency | CPU Latency | Throughput | Fused Weights | Training Logs |
|:-----:|:----------------------------:|:------:|:-----:|:-----------:|:-----------:|:----------:|:-------------:|:-------------:|
| M0 | 74.7 / 73.2 | 2.5M | 0.4 | 1.0ms | 189ms | 763 | dist / base | dist / base |
| M1 | 79.2 / 78.0 | 5.2M | 0.9 | 1.4ms | 361ms | 384 | dist / base | dist / base |
| M2 | 80.3 / 79.2 | 6.8M | 1.2 | 1.5ms | 431ms | 325 | dist / base | dist / base |
| M3 | 80.9 / 79.6 | 8.2M | 1.4 | 1.6ms | 482ms | 314 | dist / base | dist / base |
| M4 | 82.5 / 81.4 | 14.1M | 2.4 | 2.4ms | 843ms | 169 | dist / base | dist / base |
| M5 | 83.3 / 82.9 | 22.9M | 4.7 | 3.4ms | 1487ms | 104 | dist / base | dist / base |
| A0 | 75.0 / 73.6 | 2.8M | 0.4 | 1.4ms | 177ms | 4902 | dist / base | dist / base |
| A1 | 79.6 / 78.3 | 5.9M | 0.9 | 1.9ms | 334ms | 2746 | dist / base | dist / base |
| A2 | 80.8 / 79.6 | 7.9M | 1.2 | 2.2ms | 413ms | 2327 | dist / base | dist / base |
| A3 | 81.1 / 80.1 | 9.0M | 1.4 | 2.4ms | 447ms | 2206 | dist / base | dist / base |
| A4 | 82.5 / 81.6 | 15.8M | 2.4 | 3.6ms | 764ms | 1265 | dist / base | dist / base |
| A5 | 83.5 / 83.1 | 25.7M | 4.7 | 5.6ms | 1376ms | 721 | dist / base | dist / base |

### Comparison with LSNet

| Model | Top-1 Accuracy (dist / base) | Params | GMACs | NPU Latency | CPU Latency | Throughput | Fused Weights | Training Logs |
|:-----:|:----------------------------:|:------:|:-----:|:-----------:|:-----------:|:----------:|:-------------:|:-------------:|
| T | 76.6 / 75.1 | 12.1M | 0.3 | 1.8ms | 109ms | 14181 | dist / base | dist / base |
| S | 79.6 / 78.3 | 15.8M | 0.7 | 2.0ms | 188ms | 8234 | dist / base | dist / base |
| B | 81.4 / 80.3 | 19.3M | 1.1 | 2.5ms | 290ms | 4385 | dist / base | dist / base |

NPU latency is measured on an iPhone 13 with models compiled by Core ML Tools. CPU latency is measured on a quad-core ARM Cortex-A57 processor with models in ONNX format, and throughput is measured on an Nvidia RTX 3090 with the maximum power-of-two batch size that fits in memory.
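
For reference, the fused model can be exported to ONNX for CPU-style benchmarking. The snippet below is a minimal sketch; the opset version, input resolution, and tensor names are assumptions, not the authors' exact benchmark configuration.

```python
import torch

# Export the fused model from the section above to ONNX (sketch only).
dummy = torch.randn(1, 3, 224, 224)  # assumed default input resolution
torch.onnx.export(
    model.eval(), dummy, 'recnext_m5.onnx',
    input_names=['input'], output_names=['logits'],
    opset_version=17,  # assumed opset
)
```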

## Citation

```bibtex
@misc{zhao2024recnext,
      title={RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations},
      author={Mingshu Zhao and Yi Luo and Yong Ouyang},
      year={2024},
      eprint={2412.19628},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```