Abstract
The Convolutional Set Transformer (CST) processes image sets directly, combining feature extraction and contextual modeling for improved performance in set classification and anomaly detection, with compatibility for CNN explainability methods.
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics, such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
Highlights:
- CST is a novel deep learning architecture for processing image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics (e.g., a common category, scene, or concept).
- CST is general-purpose and supports a broad range of applications, including set-based classification tasks and Set Anomaly Detection.
- In the domain of image-set processing, CST outperforms existing set-learning approaches such as Deep Sets and Set Transformer. Unlike these methods, which are inherently opaque, CST is fully compatible with standard CNN explainability tools, including Grad-CAM.
- While Deep Sets and Set Transformer are typically trained from scratch, CST supports Transfer Learning: it can be pre-trained on large-scale datasets and then effectively adapted to diverse downstream tasks. We publicly release CST-15, the first set-learning backbone pre-trained on ImageNet.
Code and Pre-trained Models:
We release the `cstmodels` Python package (`pip install cstmodels`), which provides reusable Keras 3 layers for building CST architectures and an easy interface to load CST-15 pre-trained on ImageNet in just two lines of code:
from cstmodels import CST15
model = CST15(pretrained=True)
Documentation is available online, and tutorial notebooks are provided in the GitHub repo of the project.
Example Application of the CST:
Set Anomaly Detection is a binary classification task: identify the images in a set that are anomalous, i.e., inconsistent with the majority of the set. The notion of anomaly is context-dependent: the same image may be considered anomalous in one set but not in another, depending on the surrounding context. The Figure below shows two image sets derived from the CelebA dataset (Liu et al., 2015). In each set, a majority of normal images share two attributes ("wearing hat" and "smiling" in the first, "no beard" and "attractive" in the second), while a minority lack these attributes and are thus anomalous. After training a CST and a Set Transformer (Lee et al., 2019) on CelebA for Set Anomaly Detection, we evaluate the explainability of their predictions by overlaying Grad-CAMs on anomalous images. CST explanations correctly highlight the anomalous regions, whereas Set Transformer (ST) explanations fail to provide meaningful insights. Want to dive deeper? Check out our paper!
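To make the context-dependence concrete, here is a minimal toy sketch (not the CST model itself) in which each image is reduced to a hypothetical binary attribute vector, e.g., [wearing hat, smiling], and an image counts as anomalous when it disagrees with the set's per-attribute majority profile:

```python
import numpy as np

def anomaly_labels(attrs):
    """attrs: (set_size, n_attrs) binary matrix of per-image attributes.
    Returns 1 for images that disagree with the set's majority profile."""
    # Per-attribute majority vote defines what is "normal" for this set.
    majority = (attrs.mean(axis=0) > 0.5).astype(int)
    # An image is anomalous if it differs from the majority on any attribute.
    return (attrs != majority).any(axis=1).astype(int)

# Set 1: the majority wear a hat and smile; the last two images lack both.
set1 = np.array([[1, 1], [1, 1], [1, 1], [0, 0], [0, 0]])
# Set 2: the same [0, 0] profile is normal here, because the majority shares it.
set2 = np.array([[0, 0], [0, 0], [0, 0], [1, 1]])

print(anomaly_labels(set1))  # -> [0 0 0 1 1]
print(anomaly_labels(set2))  # -> [0 0 0 1]
```

The same attribute profile [0, 0] is labeled anomalous in the first set but normal in the second, which is exactly the context-dependence that a set-input model like CST must capture (from raw pixels, without hand-crafted attributes).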
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision (2025)
- Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis (2025)
- Representation Learning with Adaptive Superpixel Coding (2025)
- VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation (2025)
- A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism (2025)
- Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers (2025)
- Geometrically Constrained and Token-Based Probabilistic Spatial Transformers (2025)