arxiv:2207.03868

Learning Sequential Descriptors for Sequence-based Visual Place Recognition

Published on Jul 8, 2022

Abstract

In robotics, Visual Place Recognition is a continuous process that receives as input a video stream to produce a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real applications. This work proposes a detailed taxonomy of techniques using sequential descriptors, highlighting different mechanisms to fuse the information from the individual images. This categorization is supported by a complete benchmark of experimental results that provides evidence of the strengths and weaknesses of these different architectural choices. In comparison to existing sequential descriptor methods, we further investigate the viability of Transformers instead of CNN backbones, and we propose a new ad-hoc sequence-level aggregator called SeqVLAD, which outperforms prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.

Community

- Proposes SeqVLAD, a sequence-level descriptor aggregator, and explores transformer backbones in place of CNNs for sequential descriptors.
- Instead of frame-to-frame (sequence) matching, sequential descriptors build a single descriptor from a sequence of images, so a query sequence can be compared/retrieved against many database sequences at once; the sequence descriptor also captures temporal information.
- Late fusion: process each image through a learnable network (NetVLAD or GeM global image descriptor), then fuse (concatenate or pool) over the sequence; see the first sketch below.
- Early fusion: fuse the images up front and feed them all to the learnable network, so it has context over the whole sequence; implemented with TimeSformer (an action-recognition network), where all images are patchified and mapped to a set of (learnable) tokens.
- Intermediate fusion: extract features from the individual images, fuse them, then run a (global) sequential descriptor extractor.
- SeqVLAD: extends the NetVLAD global image descriptor (dense local descriptors clustered into visual words, with the global descriptor a statistic of the cluster assignments) to a sequence of images. For CNNs, stack/concatenate all D-dim local descriptors across frames and run NetVLAD; for transformers, concatenate all output token sequences. NetVLAD soft-assigns descriptors/residuals to clusters (weighted by softmaxed distances); see the second sketch below.
- Tested on the MSLS and Oxford RobotCar datasets. NetVLAD with concatenation works best for late fusion; SeqVLAD (CCT384 backbone) works best for intermediate fusion, and overall (even with PCA), while having a low GPU memory requirement; TimeSformer (early fusion) is very low-dimensional.
- SeqVLAD and TimeSformer are robust to frame/sequence (order) inversion; late fusion is less so, and sequential matching is worst (the matching diagonal gets distorted).
- Fixed-size descriptors keep improving with sequence length, whereas the late-fusion descriptor length explodes with it.
- From Politecnico di Torino (Italy).
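
A minimal PyTorch sketch of the late-fusion recipe (names like `gem_pool` and `late_fusion_descriptor` are illustrative, not from the paper's repo), assuming a `backbone` that maps a (B, 3, H, W) batch of images to a (B, D, H', W') feature map: GeM-pool each frame to a D-dim vector, then either concatenate (descriptor grows with sequence length T) or average (fixed size) over the sequence.

```python
import torch
import torch.nn.functional as F

def gem_pool(feat_map, p=3.0, eps=1e-6):
    # Generalized-mean (GeM) pooling over the spatial dims of a (B, D, H, W) map.
    pooled = F.avg_pool2d(feat_map.clamp(min=eps).pow(p),
                          kernel_size=feat_map.shape[-2:])
    return pooled.pow(1.0 / p).flatten(1)  # (B, D)

def late_fusion_descriptor(frames, backbone, fuse="concat"):
    # frames: (B, T, 3, H, W) sequence of T images; backbone is assumed to
    # return a (B, D, H', W') feature map per frame.
    T = frames.shape[1]
    per_frame = [gem_pool(backbone(frames[:, t])) for t in range(T)]  # T x (B, D)
    if fuse == "concat":
        return torch.cat(per_frame, dim=1)         # (B, T*D): grows with T
    return torch.stack(per_frame, dim=1).mean(1)   # (B, D): fixed size
```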

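And a sketch of the SeqVLAD idea under the same caveat: the NetVLAD soft assignment is simply run over the local descriptors of all T frames stacked together, so the output size K*D is independent of sequence length. This is a simplified rendition (random centroid initialization, no clustering-based init or PCA/whitening); the authors' version is in the linked repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVLAD(nn.Module):
    # NetVLAD-style aggregation over the stacked local descriptors of a whole
    # sequence: K learnable centroids, soft assignment, weighted residual sums.
    def __init__(self, num_clusters=64, dim=256):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Conv1d(dim, num_clusters, kernel_size=1)  # assignment logits

    def forward(self, seq_feats):
        # seq_feats: (B, T, D, N) local descriptors, N per frame (H*W for a CNN
        # feature map, or the token count for a transformer backbone).
        B, T, D, N = seq_feats.shape
        x = seq_feats.permute(0, 2, 1, 3).reshape(B, D, T * N)      # (B, D, T*N)
        soft = F.softmax(self.assign(x), dim=1)                     # (B, K, T*N)
        # Residual of every descriptor w.r.t. every centroid, weighted by its
        # soft assignment, summed over all T*N descriptors in the sequence.
        resid = x.unsqueeze(1) - self.centroids[None, :, :, None]   # (B, K, D, T*N)
        vlad = (soft.unsqueeze(2) * resid).sum(-1)                  # (B, K, D)
        vlad = F.normalize(vlad, dim=2).flatten(1)                  # intra-normalize
        return F.normalize(vlad, dim=1)                             # (B, K*D)
```

Feeding e.g. a (2, 5, 256, 196) tensor (5 frames of 14x14 feature maps) yields a (2, 16384) descriptor, and the output size stays the same for any number of frames.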