Summary:
This paper presents R-SLAM, a learning-based SLAM method that shows state-of-the-art camera pose estimation performance on several benchmarks. In particular, R-SLAM updates the camera poses and dense frame depth in an iterative manner by using a recurrent update operator built upon RAFT and differentiable bundle adjustment. To make a complete SLAM system, R-SLAM imposes a covisibility graph and conducts both local and global optimization. The experiments show that R-SLAM outperforms the previous classical and learning-based approaches by a large margin, even on the datasets on which it is not trained.
Main Review:
Strengths
The proposed SLAM method, R-SLAM, achieves state-of-the-art camera pose estimation results on several established benchmarks, including TantanAir, EuRoC, TUM RGB-D, and ETH3D. R-SLAM also shows generally large improvements over prior work, e.g., by ~60% on TantanAir, ~40% on EuRoC, ~80% on TUM RGB-D, and ~35% on ETH3D.
These state-of-the-art results are obtained with a model trained on TantanAir only, which demonstrates very strong generalization capability. Poor generalization is often the major limitation of learning-based SLAM approaches.
R-SLAM takes advantage of both classical optimization and deep learning -- it can be trained end-to-end while keeping the bundle adjustment (BA) with Gauss-Newton in the loop. In addition, R-SLAM jointly optimizes pose and depth, which has been shown to be one of the keys to accurate estimation.
Beyond its high accuracy, R-SLAM also shows improved robustness, with a high success rate on several datasets.
Weaknesses
In general, the novelty of the paper is not substantial. It builds upon established systems and methodologies -- a hybrid SLAM design that combines the classical and learning worlds, a differentiable BA / Gauss-Newton layer, a covisibility graph, and the recurrent update operator. Overall, the system feels like DeepV2D + RAFT, and no fundamental theoretical or architectural innovations are proposed. That said, I believe that combining previous works and making the right choices to obtain notable improvements is still quite valuable and inspiring to the community.
R-SLAM is trained on TantanAir while most of the compared methods (DeepFactors, DeepV2D, D3VO, DeepTAM) are not. Some of these methods are trained and tested on splits of the same data, whereas R-SLAM is mostly evaluated in a cross-dataset setting. Even so, the comparison is somewhat unfair, since the TantanAir dataset in general contains more images and greater diversity. As a suggestion, it would be better to train and evaluate one of the baselines on exactly the same data as R-SLAM and report the comparison; DeepV2D could be a good candidate.
It is unclear how the non-keyframe poses are estimated, if they are estimated at all. Since R-SLAM is a SLAM system rather than an SfM system, it should be able to deliver both keyframe and non-keyframe poses for real-world applications. However, the paper does not explain how the non-keyframe poses are handled. This also raises questions about the evaluation. R-SLAM is evaluated on keyframe poses only, without non-keyframe poses; is this also the case for the other compared methods? Are the poses used in the evaluations the outputs of the global BA? If so, how good are the poses from local bundle adjustment alone, given that some of the compared methods are visual odometry systems without any global BA, e.g., DeepV2D, D3VO, and DSO?
The details of the full dense bundle adjustment are missing. R-SLAM performs joint dense BA on pose and depth, which is quite expensive, especially for the global BA, which optimizes the depths and poses of tens or even hundreds of frames. Is a customized GPU-based optimizer used?
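To make the cost question concrete, below is a minimal NumPy sketch of one standard way to keep a joint pose-and-depth Gauss-Newton step tractable: eliminating the depth variables with a Schur complement. This is an illustration under my own assumptions, not a description of what the authors implement. The function names, the toy sizes (two 6-DoF poses, 500 depth variables), and the assumption that each residual depends on exactly one depth variable (so the depth block is diagonal) are all hypothetical.

```python
import numpy as np


def schur_solve(B, E, c_diag, v, w):
    """Solve [[B, E], [E.T, diag(c_diag)]] @ [dx; dz] = [v; w].

    B: (p, p) pose block, E: (p, m) pose-depth coupling,
    c_diag: (m,) diagonal of the depth block (assumed diagonal).
    """
    c_inv = 1.0 / c_diag
    Ec = E * c_inv                       # E @ diag(c_diag)^-1 via broadcasting
    S = B - Ec @ E.T                     # reduced (p, p) system on poses only
    dx = np.linalg.solve(S, v - Ec @ w)  # pose update
    dz = c_inv * (w - E.T @ dx)          # back-substitute the depth update
    return dx, dz


def build_toy_system(p=12, m=500, seed=0):
    """Random normal equations where each residual touches one depth variable."""
    rng = np.random.default_rng(seed)
    n_res = 2 * m                        # two residuals per depth variable
    J_pose = rng.standard_normal((n_res, p))
    J_depth = np.zeros((n_res, m))
    J_depth[np.arange(n_res), np.arange(n_res) % m] = rng.uniform(0.5, 1.5, n_res)
    r = rng.standard_normal(n_res)
    B = J_pose.T @ J_pose
    E = J_pose.T @ J_depth
    c_diag = np.einsum("ij,ij->j", J_depth, J_depth) + 1e-6  # small damping
    return B, E, c_diag, -J_pose.T @ r, -J_depth.T @ r


B, E, c_diag, v, w = build_toy_system()
dx, dz = schur_solve(B, E, c_diag, v, w)

# Reference: solve the full (p + m) x (p + m) system directly.
H = np.block([[B, E], [E.T, np.diag(c_diag)]])
full = np.linalg.solve(H, np.concatenate([v, w]))
max_err = np.max(np.abs(np.concatenate([dx, dz]) - full))
print(f"max abs difference vs. full solve: {max_err:.2e}")
```

With this elimination, the dense linear solve is on a system whose size depends only on the number of poses, while the depth update is an elementwise operation. Whether R-SLAM exploits this structure, and on what hardware, is exactly the detail I would like the authors to state.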
It would be better if more qualitative results were shown, e.g., a comparison of camera trajectories among the methods and the estimated dense depth maps.
The writing of the paper can still be improved. There are a number of typos in the current manuscript. I list some of them here, but there are likely more:
L10: "focused" -> "focuses"
L13: "approach" -> "approached"
L15: "are have" -> "have"
L20: "But despite" -> "Despite"
L57: "solves computes" -> "computes"
L81: "is that features" -> remove
L115: "operators" -> "operates"
Eq. (5): -> ?
The name "R-SLAM" already exists in [1] and [2]. To avoid confusion in the future, I suggest that the authors choose a new name for the system.