Learning Sequential Descriptors for Sequence-based Visual Place Recognition
Abstract
In robotics, Visual Place Recognition is a continuous process that receives a video stream as input and produces a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real applications. This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting the different mechanisms used to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence on the strengths and weaknesses of these different architectural choices. In comparison to existing sequential descriptor methods, we further investigate the viability of Transformers instead of CNN backbones, and we propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms the prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.
Community
Proposes SeqVLAD, a sequence-level descriptor aggregator, and explores transformer backbones instead of CNNs for sequential descriptors.

- Instead of frame-to-frame (sequence) matching, sequential descriptors build a single descriptor from a sequence of images, which can then be compared/retrieved against database sequences; a sequence descriptor also captures temporal information.
- Late fusion: process each image through a learnable network (a NetVLAD or GeM global image descriptor) and fuse (concatenate or pool) over the sequence; see the first sketch after this list.
- Early fusion: fuse the images first and feed all of them to a learnable network, so it has context over the whole sequence; uses TimeSformer (an action-recognition network), where all images are patchified and mapped to a set of (learnable) tokens.
- Intermediate fusion: feature extraction from the individual images, fusion, and then a (global) sequential descriptor extractor.
- SeqVLAD extends the NetVLAD global image descriptor (dense local descriptors clustered into visual words, the global descriptor being a statistic of the cluster assignments) to a sequence of images: for CNNs, stack/concatenate all D-dimensional local descriptors across frames and run NetVLAD; for transformers, concatenate all output token sequences. NetVLAD soft-assigns descriptors/residuals, weighted by a softmax over distances to the cluster centers; see the second sketch after this list.
- Tested on the MSLS and Oxford RobotCar datasets. NetVLAD with concatenation works best among late-fusion variants; SeqVLAD (with a CCT384 backbone) works best for intermediate fusion and overall, even after PCA, while having a low GPU memory requirement; TimeSformer (early fusion) produces very low-dimensional descriptors.
- SeqVLAD and TimeSformer are robust to frame/sequence (order) inversion; late fusion is less so, and sequence matching is worst (the matching diagonal gets distorted).
- SeqVLAD descriptors have a fixed size, yet performance keeps improving with sequence length; late-fusion descriptor length instead grows with the sequence ("explodes").

From Politecnico di Torino (Italy).
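A minimal sketch of the two late-fusion variants, assuming per-frame global descriptors (e.g. from NetVLAD or GeM) have already been computed; the function name `late_fusion` and the tensor layout are illustrative assumptions, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def late_fusion(frame_descs: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    # frame_descs: (B, S, D) -- one global descriptor per frame, for a batch
    # of B sequences of length S (hypothetical layout, for illustration)
    if mode == "concat":
        # concatenation: descriptor length grows as S*D, which is why
        # late-fusion descriptors "explode" with sequence length
        return F.normalize(frame_descs.flatten(1), dim=-1)  # (B, S*D)
    # pooling: fixed D-dimensional output, but frame order is discarded
    return F.normalize(frame_descs.mean(dim=1), dim=-1)     # (B, D)
```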
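And a minimal PyTorch sketch of a SeqVLAD-style aggregator: NetVLAD's soft assignment applied to the stacked local descriptors of the whole sequence. Cluster count, descriptor dimension, and the linear assignment layer here are illustrative assumptions; see the linked repository for the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVLADSketch(nn.Module):
    def __init__(self, num_clusters: int = 64, dim: int = 256):
        super().__init__()
        self.K, self.D = num_clusters, dim
        self.assign = nn.Linear(dim, num_clusters)   # per-descriptor cluster scores
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, S, N, D) -- N local descriptors (CNN spatial locations or
        # transformer tokens) of dimension D for each of the S frames
        B, S, N, D = x.shape
        x = x.reshape(B, S * N, D)                    # stack descriptors across frames
        soft = F.softmax(self.assign(x), dim=-1)      # (B, S*N, K) soft assignment
        resid = x.unsqueeze(2) - self.centroids.view(1, 1, self.K, self.D)
        vlad = (soft.unsqueeze(-1) * resid).sum(dim=1)  # (B, K, D) weighted residuals
        vlad = F.normalize(vlad, dim=-1)              # intra-normalize each cluster
        return F.normalize(vlad.flatten(1), dim=-1)   # (B, K*D) sequence descriptor
```

Note that the output is K*D-dimensional regardless of the sequence length S, consistent with the fixed-size observation in the list above.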