[Paper Review] Center-based 3D Object Detection and Tracking

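Personal Thoughts

Many techniques from 2D object detection appear to be carried over to 3D.

Even the name hints at a model that was originally used for 2D detection.

My impression is that CenterPoint sits in a transitional phase on the way to methods that are native to 3D.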

Abstract

Problem
- Objects in the 3D world do not follow any particular orientation.
- Box-based detectors have difficulty fitting an axis-aligned bounding box to rotated objects.

Solution
- Represent, detect, and track 3D objects as points.
- This simplifies 3D object tracking to greedy closest-point matching.

Introduction

Compared with the well-studied 2D detection problem, 3D detection on point clouds poses several challenges.
1. Point clouds are sparse.
2. The resulting 3D boxes are often not well aligned with any global coordinate frame.
3. 3D objects come in a wide range of sizes, shapes, and aspect ratios.

The key point is that an axis-aligned 2D box is a poor proxy for a free-form 3D object.
- One could classify a separate template (anchor) for each object orientation, but this adds computational cost and risks introducing many false positives.

The authors argue that the main difficulty in linking the 2D and 3D domains lies in how objects are represented.

The CenterPoint framework works as follows.
1. A standard LiDAR-based backbone network builds a representation of the input point cloud.
  - VoxelNet
  - PointPillars
2. This representation is flattened into an overhead map view, and a standard image-based keypoint detector finds the object centers.
3. Object properties are regressed from the point feature at each detected center.
  - 3D size
  - Orientation
  - Velocity

A lightweight second stage then refines the object locations.
  - It extracts point features at the 3D centers of each face of the estimated 3D box.
  - This recovers local geometric information lost to striding and a limited receptive field.

The center-based representation has several advantages.
1. Unlike a bounding box, a point has no intrinsic orientation.
  - The backbone can learn the rotational invariance of objects and the rotational equivariance of their relative rotation.
  - This dramatically reduces the object detector's search space.
2. It simplifies downstream tasks such as tracking.
  - If objects are points, tracklets are paths in space and time.
  - The model predicts the relative offset (velocity) of objects between consecutive frames, and the detections are then linked greedily.
3. Point-based feature extraction enables the design of a two-stage refinement module that is faster and more effective than previous approaches.

CenterPoint

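CenterPoint combines the heatmap loss and the regression losses introduced below and optimizes them jointly.

All properties of an object are inferred from the object's center feature.
- The center feature may not contain enough information for accurate localization.
- This is because the sensor often sees only a side or corner of the object, not its center.

Additional processing can therefore improve performance; this is the role of the second stage.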

Center heatmap head

Goal
- Produce a heatmap peak at the center of each detected object.
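
To make the idea of a "peak" concrete, here is a minimal sketch of reading object centers off a single-class heatmap by taking 3x3 local maxima. The function name `find_peaks`, the 3x3 window, and the `threshold` value are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.3):
    """Return (row, col) cells that are 3x3 local maxima above `threshold`."""
    h, w = heatmap.shape
    # Pad with -inf so border cells are compared only against real neighbors.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views that cover each cell's 3x3 neighborhood.
    shifted = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    is_peak = (heatmap >= shifted.max(axis=0)) & (heatmap > threshold)
    rows, cols = np.nonzero(is_peak)
    return list(zip(rows.tolist(), cols.tolist()))

heatmap = np.zeros((8, 8))
heatmap[2, 3] = 0.9   # object center
heatmap[2, 4] = 0.5   # shoulder of the same blob, not a peak
heatmap[6, 6] = 0.6   # second object center
peaks = find_peaks(heatmap)
```

In a multi-class setting the same routine would be run per channel of $\hat{Y}$.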

The heatmap $\hat{Y}$ has $K$ channels.
- Each channel corresponds one-to-one to one of the $K$ classes.
- The class prediction is trained with a `Focal Loss`.
- In the dense head of the CenterPoint implementation, `loss_cls` is built as `GaussianFocalLoss`.
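
The focal loss on a Gaussian heatmap can be sketched as below. This assumes the penalty-reduced form used by CenterNet-style detectors, with exponents `alpha=2` and `beta=4`; those values and the function name are assumptions, and the actual `GaussianFocalLoss` class may differ in normalization and reduction.

```python
import numpy as np

def gaussian_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss between a predicted and a target heatmap."""
    pos = target == 1.0  # cells exactly at an object center
    # Positive cells: penalize low confidence at true centers.
    pos_loss = -np.log(pred + eps) * (1.0 - pred) ** alpha * pos
    # Negative cells: penalize high confidence, down-weighted near centers
    # where the Gaussian target is close to 1.
    neg_loss = (-np.log(1.0 - pred + eps) * pred ** alpha
                * (1.0 - target) ** beta * (~pos))
    num_pos = max(int(pos.sum()), 1)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

target = np.zeros((4, 4))
target[1, 2] = 1.0
loss_perfect = gaussian_focal_loss(target.copy(), target)
loss_uniform = gaussian_focal_loss(np.full((4, 4), 0.5), target)
```

A perfect prediction gives a loss of approximately zero, while a uniform 0.5 prediction is penalized at every cell.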

The training target is the *2D Gaussian* produced by projecting the 3D center of each annotated bounding box into the map view.

Objects in the top-down map view are sparser than in an image.
The image view is a perspective projection and distorts distances, whereas distances in the map view are absolute.
- In a road scene, a vehicle occupies only a small area in the map view, while in the image view a few large objects can fill most of the screen.
- Compressing the depth dimension in a perspective projection naturally places object centers closer to each other in the image view.

Following CenterNet's standard supervision yields a very sparse supervisory signal in which most locations are treated as background.
- To counteract this, the authors increase the positive supervision for the target heatmap $Y$ by enlarging the Gaussian peak rendered at each ground-truth object center.
- Gaussian radius: $\sigma = \max(f(wl), \tau)$, where $\tau = 2$
- $\tau$ : smallest allowable Gaussian radius
- $f$ : radius function defined in CornerNet
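
A minimal sketch of rendering the enlarged Gaussian target follows. CornerNet's radius function $f$ is not reproduced here; `toy_radius` is a hypothetical stand-in, and using $\sigma$ directly as the Gaussian standard deviation in map-view cells is also an assumption.

```python
import numpy as np

def toy_radius(w, l):
    """Hypothetical stand-in for CornerNet's radius function f."""
    return 0.5 * np.sqrt(w * l)

def gaussian_radius(w, l, radius_fn=toy_radius, tau=2.0):
    """sigma = max(f(w * l), tau), with tau the minimum allowed radius."""
    return max(radius_fn(w, l), tau)

def draw_gaussian(heatmap, cx, cy, sigma):
    """Render a Gaussian centered at (cx, cy), keeping the max with existing values."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, blob, out=heatmap)
    return heatmap

heatmap = np.zeros((16, 16))
sigma = gaussian_radius(w=2.0, l=4.0)  # toy_radius is about 1.41, so tau = 2 applies
draw_gaussian(heatmap, cx=8, cy=8, sigma=sigma)
```

Taking the element-wise maximum keeps overlapping objects from erasing each other's peaks.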

Regression head

Several object properties are stored at the center feature of each object.
1. $o \in \mathbb{R}^2$ : Sub-voxel location refinement
  - Reduces the quantization error introduced by voxelization and the striding of the backbone network.

2. $h_g \in \mathbb{R}$ : Height-above-ground
  - Helps localize the object in 3D.
  - Restores the elevation information removed by the map-view projection.

3. $s \in \mathbb{R}^3$ : 3D size

4. $(\sin(\alpha), \cos(\alpha)) \in \mathbb{R}^2$ : `Yaw rotation angle`
  - Orientation prediction uses the sine and cosine of the yaw angle as continuous regression targets.

During training, only the ground-truth centers are supervised.
  - An L1 loss is used.
  - Regressing the logarithm of the size handles boxes of various shapes better.

At inference, all of the properties above are extracted by indexing into the dense regression head outputs at each object's peak location.
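
The regression targets above can be illustrated with a small encode/decode round trip. This is a sketch under assumptions: the function names, the target ordering, and the grid parameters (`origin_xy`, `cell_size` as voxel size times backbone stride) are hypothetical and chosen only for the example.

```python
import numpy as np

def encode(center_xy, height, size, yaw, origin_xy, cell_size):
    """Turn a ground-truth box into a map-view cell and its regression targets."""
    grid = (np.asarray(center_xy, dtype=float) - origin_xy) / cell_size
    cell = np.floor(grid).astype(int)   # quantized center cell
    offset = grid - cell                # sub-voxel refinement o
    reg = np.concatenate([offset, [height], np.log(size),
                          [np.sin(yaw), np.cos(yaw)]])
    return cell, reg

def decode(cell, reg, origin_xy, cell_size):
    """Invert `encode` from the values read at a peak location."""
    center_xy = (cell + reg[0:2]) * cell_size + origin_xy
    size = np.exp(reg[3:6])             # undo the logarithmic size
    yaw = np.arctan2(reg[6], reg[7])    # recover the angle from (sin, cos)
    return center_xy, reg[2], size, yaw

origin = np.array([-51.2, -51.2])       # assumed map-view origin in meters
cell, reg = encode([10.3, -4.7], 0.4, np.array([1.9, 4.6, 1.7]), 0.7,
                   origin, cell_size=0.8)
center, height, size, yaw = decode(cell, reg, origin, cell_size=0.8)
```

The $(\sin, \cos)$ pair avoids the discontinuity at $\pm\pi$ that a raw angle target would have.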

Velocity head and tracking

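A two-dimensional velocity estimate is added as a regression output during training.
  - An L1 loss is used.
  - It takes the map views of the current and previous time steps as input.

At inference, the velocity is used as follows.
- The offset implied by the velocity is subtracted from the center of each object detected in the current frame.
  - The offset-corrected centers are matched greedily to the centers from the previous frame by distance.
  - Following SORT, an unmatched track is kept for up to 3 frames and then discarded.
  - Unmatched tracks are updated with their last known velocity estimate.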
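
The tracking steps above can be sketched as follows. The track dictionary layout, the `max_dist` gating threshold, and the choice to process candidate pairs in order of global distance are assumptions for illustration; the summary only states that matching is greedy and distance-based.

```python
import itertools
import numpy as np

MAX_AGE = 3  # unmatched tracks are kept for up to 3 frames

def greedy_match(det_centers, det_velocities, track_centers, dt, max_dist):
    """Map detection index -> track index by greedy closest-distance matching."""
    if len(det_centers) == 0 or len(track_centers) == 0:
        return {}
    # Project current centers back to the previous frame using the velocity.
    projected = det_centers - det_velocities * dt
    dist = np.linalg.norm(projected[:, None, :] - track_centers[None, :, :], axis=2)
    matches, used_tracks = {}, set()
    for flat in np.argsort(dist, axis=None):  # closest pairs first
        d, t = (int(i) for i in np.unravel_index(flat, dist.shape))
        if dist[d, t] > max_dist:
            break
        if d not in matches and t not in used_tracks:
            matches[d] = t
            used_tracks.add(t)
    return matches

def update_tracks(tracks, det_centers, det_velocities, dt, ids, max_dist=2.0):
    """Advance the track list by one frame."""
    track_centers = np.array([t["center"] for t in tracks], dtype=float).reshape(-1, 2)
    matches = greedy_match(det_centers, det_velocities, track_centers, dt, max_dist)
    updated = []
    for d, (center, velocity) in enumerate(zip(det_centers, det_velocities)):
        # Matched detections inherit the track id; the rest start new tracks.
        track_id = tracks[matches[d]]["id"] if d in matches else next(ids)
        updated.append({"id": track_id, "center": center,
                        "velocity": velocity, "age": 0})
    matched = set(matches.values())
    for i, track in enumerate(tracks):
        if i not in matched and track["age"] < MAX_AGE:
            # Coast the unmatched track with its last known velocity.
            updated.append({"id": track["id"],
                            "center": track["center"] + track["velocity"] * dt,
                            "velocity": track["velocity"],
                            "age": track["age"] + 1})
    return updated

ids = itertools.count()
tracks = []
frames = [
    (np.array([[0.0, 0.0], [10.0, 0.0]]), np.array([[1.0, 0.0], [0.0, 0.0]])),
    (np.array([[1.0, 0.0], [10.0, 0.1]]), np.array([[1.0, 0.0], [0.0, 0.0]])),
    (np.array([[2.0, 0.0]]), np.array([[1.0, 0.0]])),  # second object is missed
]
for centers, velocities in frames:
    tracks = update_tracks(tracks, centers, velocities, dt=1.0, ids=ids)
```

After the third frame the moving object keeps its id, and the missed object survives as a coasting track with `age == 1`.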

Two-stage CenterPoint

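The second stage extracts additional point features.
- The 3D center of each face of the bounding box is used, excluding the top and bottom faces of the six.
  - The top and bottom faces are excluded because their centers project to the same map-view location as the box center.
  - Each feature is extracted by bilinear interpolation from the backbone's map-view output $M$.
  - The features of the four added points are concatenated with the center feature and passed through an MLP.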
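
The point-feature extraction can be sketched as below, stopping just before the MLP. The function names, the `(C, H, W)` feature layout, the box parameterization in feature-map pixel coordinates, and the convention that length runs along the heading are all assumptions for illustration.

```python
import numpy as np

def bev_sample_points(cx, cy, w, l, yaw):
    """Box center plus the four side-face centers, in map-view coordinates."""
    heading = np.array([np.cos(yaw), np.sin(yaw)])
    lateral = np.array([-np.sin(yaw), np.cos(yaw)])
    center = np.array([cx, cy], dtype=float)
    return np.stack([center,
                     center + heading * l / 2.0,   # front face
                     center - heading * l / 2.0,   # back face
                     center + lateral * w / 2.0,   # left face
                     center - lateral * w / 2.0])  # right face

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a (C, H, W) map at continuous pixel (x, y)."""
    _, h, w = feature_map.shape
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * feature_map[:, y0, x0] + fx * feature_map[:, y0, x1]
    bottom = (1 - fx) * feature_map[:, y1, x0] + fx * feature_map[:, y1, x1]
    return (1 - fy) * top + fy * bottom

def second_stage_input(feature_map, box):
    """Concatenate the five sampled point features into one vector."""
    points = bev_sample_points(*box)
    return np.concatenate([bilinear_sample(feature_map, x, y) for x, y in points])

# Toy map whose two channels store the pixel's own x and y coordinates.
xs, ys = np.meshgrid(np.arange(16, dtype=float), np.arange(16, dtype=float))
M = np.stack([xs, ys])
features = second_stage_input(M, (8.0, 8.0, 2.0, 4.0, 0.0))
```

With the coordinate-valued toy map, each sampled feature equals the sampling location, which makes the interpolation easy to check by eye.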

Using the one-stage CenterPoint output, the second stage predicts a class-agnostic confidence score and a box refinement.

The confidence score prediction follows these papers:
- Acquisition of Localization Confidence for Accurate Object Detection
- GS3D
- PV-RCNN
- From points to parts

The score target $I$ is derived from the 3D IoU between each proposal box and its corresponding ground-truth bounding box.
$$
I = \min(1, \ \max(0, \ 2 \times IoU_t - 0.5))
$$
- $IoU_t$ : IoU between the $t$-th proposal box and the ground truth

The score is trained with a binary cross-entropy (BCE) loss.
$$
L_{score} = -I_t \log(\hat{I}_t) - (1 - I_t) \log(1 - \hat{I}_t)
$$
- $\hat{I}_t$ : Predicted confidence score
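
As a quick numeric check of the two formulas above, the sketch below computes the score target and the BCE loss. The function names and the `eps` clipping are assumptions added for numerical safety.

```python
import numpy as np

def score_target(iou):
    """I = min(1, max(0, 2 * IoU - 0.5))."""
    return np.clip(2.0 * np.asarray(iou, dtype=float) - 0.5, 0.0, 1.0)

def bce_loss(target, pred, eps=1e-7):
    """Element-wise binary cross-entropy between target I and prediction I_hat."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -target * np.log(pred) - (1.0 - target) * np.log(1.0 - pred)

targets = score_target([0.2, 0.5, 0.8])   # IoU below 0.25 maps to 0, above 0.75 to 1
loss = bce_loss(targets, np.array([0.1, 0.5, 0.9]))
```

The target is a linear ramp in IoU, so proposals with an IoU between 0.25 and 0.75 receive soft labels instead of hard 0/1 labels.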

The final confidence score $\hat{Q}_t$ is the geometric mean of the first-stage score $\hat{Y}_t$ and the second-stage score $\hat{I}_t$.
$$
\hat{Q}_t = \sqrt{\hat{Y}_t \cdot \hat{I}_t}
$$
- $\hat{Y}_t = \max_{0 \leq k \leq K} \ \hat{Y}_{p,k}$
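
A small worked example of the score fusion follows. The function name and the sample values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def final_confidence(heatmap_scores, iou_score):
    """Geometric mean of the best first-stage class score and the second-stage score."""
    y = np.max(heatmap_scores)  # max over the class channels at the peak
    return float(np.sqrt(y * iou_score))

# A confident first-stage peak (0.81) with a weaker second-stage score (0.49).
q = final_confidence([0.10, 0.81, 0.30], 0.49)   # sqrt(0.81 * 0.49) = 0.9 * 0.7
```

Because the geometric mean is pulled toward the smaller factor, a box needs both a strong heatmap response and a good predicted IoU to rank highly.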
