Abstract
The Convolutional Set Transformer (CST) processes image sets directly, combining feature extraction and contextual modeling for improved performance in set classification and anomaly detection, with compatibility for CNN explainability methods.
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics, such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
Highlights:
- CST is a novel deep learning architecture for processing image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics (e.g., a common category, scene, or concept).
- CST is general-purpose and supports a broad range of applications, including set-based classification tasks and Set Anomaly Detection.
- In the domain of image-set processing, CST outperforms existing set-learning approaches such as Deep Sets and Set Transformer. Unlike these methods, which are inherently opaque, CST is fully compatible with standard CNN explainability tools, including Grad-CAM.
- While Deep Sets and Set Transformer are typically trained from scratch, CST supports Transfer Learning: it can be pre-trained on large-scale datasets and then effectively adapted to diverse downstream tasks. We publicly release CST-15, the first set-learning backbone pre-trained on ImageNet.
Code and Pre-trained Models:
We release the `cstmodels` Python package (`pip install cstmodels`), which provides reusable Keras 3 layers for building CST architectures and an easy interface to load CST-15 pre-trained on ImageNet in just two lines of code:
from cstmodels import CST15
model = CST15(pretrained=True)
Documentation is available online, and tutorial notebooks are provided in the GitHub repo of the project.
Example Application of the CST:
Set Anomaly Detection is a binary classification task: identify the images in a set that are anomalous, i.e., inconsistent with the majority of the set. The notion of anomaly is context-dependent: the same image may be considered anomalous in one set but not in another, depending on the surrounding context. The Figure below shows two image sets derived from the CelebA dataset (Liu et al., 2015). In each set, a majority of normal images share two attributes ("wearing hat" and "smiling" in the first, "no beard" and "attractive" in the second), while a minority lack these attributes and are thus anomalous. After training a CST and a Set Transformer (Lee et al., 2019) on CelebA for Set Anomaly Detection, we evaluate the explainability of their predictions by overlaying Grad-CAMs on anomalous images. CST explanations correctly highlight the anomalous regions, whereas Set Transformer (ST) explanations fail to provide meaningful insights. Want to dive deeper? Check out our paper!
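To make the context-dependence concrete, here is a minimal toy sketch (not the CST model itself) in which each image is reduced to a hypothetical binary attribute vector, e.g., [wearing hat, smiling], and an image counts as anomalous when it disagrees with the set's per-attribute majority profile:

```python
import numpy as np

def anomaly_labels(attrs):
    """attrs: (set_size, n_attrs) binary matrix of per-image attributes.
    Returns 1 for images that disagree with the set's majority profile."""
    # Per-attribute majority vote defines what is "normal" for this set.
    majority = (attrs.mean(axis=0) > 0.5).astype(int)
    # An image is anomalous if it differs from the majority on any attribute.
    return (attrs != majority).any(axis=1).astype(int)

# Set 1: the majority wear a hat and smile; the last two images lack both.
set1 = np.array([[1, 1], [1, 1], [1, 1], [0, 0], [0, 0]])
# Set 2: the same [0, 0] profile is normal here, because the majority shares it.
set2 = np.array([[0, 0], [0, 0], [0, 0], [1, 1]])

print(anomaly_labels(set1))  # -> [0 0 0 1 1]
print(anomaly_labels(set2))  # -> [0 0 0 1]
```

The same attribute profile [0, 0] is labeled anomalous in the first set but normal in the second, which is exactly the context-dependence that a set-input model like CST must capture (from raw pixels, without hand-crafted attributes).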
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision (2025)
- Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis (2025)
- Representation Learning with Adaptive Superpixel Coding (2025)
- VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation (2025)
- A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism (2025)
- Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers (2025)
- Geometrically Constrained and Token-Based Probabilistic Spatial Transformers (2025)