Comparative Analysis of Z-GMOT

Dataset Comparison

Table 1 compares existing SOT, MOT, GSOT, and GMOT datasets with our Refer-Animal and Refer-GMOT datasets. The performance comparison between our Z-GMOT approach and established fully-supervised MOT methods on the Refer-Animal dataset is reported in Table 4.

Table 1: Comparison of existing SOT, MOT, GSOT, and GMOT datasets. "#" denotes the quantity of the respective items. Cat. and Vid. denote Categories and Videos. NLP indicates textual natural language descriptions.

| Task | Dataset | NLP | #Cat. | #Vid. | #Frames | #Tracks | #Boxes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SOT | OTB2013 | | 10 | 51 | 29K | 51 | 29K |
| SOT | VOT2017 | | 24 | 60 | 21K | 60 | 21K |
| SOT | TrackingNet | | 21 | 31K | 14M | 31K | 14M |
| SOT | LaSOT | | 70 | 1.4K | 3.52M | 1.4K | 3.52M |
| SOT | TNL2K | | - | 2K | 1.24M | 2K | 1.24M |
| MOT | MOT17 | | 1 | 14 | 11.2K | 1.3K | 0.3M |
| MOT | MOT20 (Dendorfer et al., 2020) | | 1 | 8 | 13.41K | 3.45K | 1.65M |
| MOT | Omni-MOT (Sun et al., 2020b) | | 1 | - | 14M+ | 250K | 110M |
| MOT | DanceTrack (Sun et al., 2022) | | 1 | 10 | 105K | 990 | - |
| MOT | TAO (Dave et al., 2020) | | 833 | 2.9K | 2.6M | 17.2K | 333K |
| MOT | SportMOT (Cui et al., 2023) | | 1 | 240 | 150K | 3.4K | 1.62M |
| MOT | Refer-KITTI (Wu et al., 2023) | | - | 18 | 6.65K | - | - |
| GSOT | GOT-10 (Huang et al., 2019) | | 563 | 10K | 1.5M | 10K | 1.5M |
| GSOT | Fish (Kay et al., 2022) | | 1 | 1.6K | 527.2K | 8.25K | 516K |
| GMOT | AnimalTrack (Zhang et al., 2022b) | | 10 | 58 | 24.7K | 1.92K | 429K |
| GMOT | GMOT-40 (Bai et al., 2021) | | 10 | 40 | 9K | 2.02K | 256K |
| GMOT | Refer-Animal (Ours) | | 10 | 58 | 24.7K | 1.92K | 429K |
| GMOT | Refer-GMOT (Ours) | | 10 | 40 | 9K | 2.02K | 256K |

Quantitative Results

We conduct extensive experiments to empirically evaluate the proposed Z-GMOT on the GMOT problem, covering both detection with Open-CSOD and association with MAC-SORT. Our strategy helps bridge the gap between human intention and computer understanding, providing the flexibility to track objects whose distinctive characteristics are specified by the input text.
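
The tables in this section report HOTA, MOTA, and IDF1, where higher is better. As a reminder of what these numbers measure, the following is a minimal Python sketch of their standard definitions. It is an illustration only, not the evaluation toolkit behind the reported scores. The function names and the example counts are hypothetical, and the full HOTA score additionally averages the per-threshold value over a range of localization thresholds, which is omitted here.

```python
from math import sqrt


def mota(num_fn: int, num_fp: int, num_id_switches: int, num_gt_boxes: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, returned as a percentage."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = num_fn + num_fp + num_id_switches
    return 100.0 * (1.0 - errors / num_gt_boxes)


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), returned as a percentage."""
    denom = 2 * idtp + idfp + idfn
    return 100.0 * 2 * idtp / denom if denom else 0.0


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at a single localization threshold: geometric mean of DetA and AssA."""
    return sqrt(det_a * ass_a)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(mota(num_fn=20, num_fp=10, num_id_switches=5, num_gt_boxes=100))
    print(idf1(idtp=80, idfp=10, idfn=30))
    print(hota_at_threshold(det_a=0.36, ass_a=0.64))
```

MOTA is dominated by detection errors, IDF1 emphasizes identity preservation, and HOTA balances detection and association, which is why the three metrics are reported together.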

Table 2: Tracking comparison on the Refer-GMOT40 dataset between our iGLIP and the SOTA OS-OD across various trackers. For each tracker, the best scores are highlighted in bold.

| Tracker | Detector | #-Shot | HOTA↑ | MOTA↑ | IDF1↑ |
| --- | --- | --- | --- | --- | --- |
| SORT [Bewley et al., 2016] | OS-OD | one-shot | 30.05 | 20.83 | 33.90 |
| SORT [Bewley et al., 2016] | iGLIP (Ours) | zero-shot | 54.21 | 62.90 | 64.34 |
| DeepSORT [Wojke et al., 2017] | OS-OD | one-shot | 27.82 | 17.96 | 30.37 |
| DeepSORT [Wojke et al., 2017] | iGLIP (Ours) | zero-shot | 50.45 | 58.99 | 57.55 |
| ByteTrack [Zhang et al., 2022c] | OS-OD | one-shot | 29.89 | 20.30 | 34.70 |
| ByteTrack [Zhang et al., 2022c] | iGLIP (Ours) | zero-shot | 53.69 | 61.49 | 66.21 |
| OC-SORT [Cao et al., 2023] | OS-OD | one-shot | 30.35 | 20.60 | 34.37 |
| OC-SORT [Cao et al., 2023] | iGLIP (Ours) | zero-shot | 56.51 | 62.76 | 67.40 |
| Deep-OCSORT [Maggiolino et al., 2023] | OS-OD | one-shot | 30.37 | 21.10 | 35.12 |
| Deep-OCSORT [Maggiolino et al., 2023] | iGLIP (Ours) | zero-shot | 55.89 | 64.02 | 66.52 |
| MOTRv2 [Zhang et al., 2023] | OS-OD | one-shot | 23.75 | 13.87 | 25.17 |
| MOTRv2 [Zhang et al., 2023] | iGLIP (Ours) | zero-shot | 31.32 | 18.54 | 31.28 |
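
To make the comparison in Table 2 easier to read, the short Python sketch below computes the absolute gain of zero-shot iGLIP over one-shot OS-OD for each tracker. The scores are copied from Table 2; the names `TABLE_2` and `absolute_gains` are hypothetical helpers introduced only for this illustration. On these numbers, the HOTA gain ranges from 7.57 points (MOTRv2) to 26.16 points (OC-SORT).

```python
# (HOTA, MOTA, IDF1) per tracker and detector, copied from Table 2.
TABLE_2 = {
    "SORT": {"OS-OD": (30.05, 20.83, 33.90), "iGLIP": (54.21, 62.90, 64.34)},
    "DeepSORT": {"OS-OD": (27.82, 17.96, 30.37), "iGLIP": (50.45, 58.99, 57.55)},
    "ByteTrack": {"OS-OD": (29.89, 20.30, 34.70), "iGLIP": (53.69, 61.49, 66.21)},
    "OC-SORT": {"OS-OD": (30.35, 20.60, 34.37), "iGLIP": (56.51, 62.76, 67.40)},
    "Deep-OCSORT": {"OS-OD": (30.37, 21.10, 35.12), "iGLIP": (55.89, 64.02, 66.52)},
    "MOTRv2": {"OS-OD": (23.75, 13.87, 25.17), "iGLIP": (31.32, 18.54, 31.28)},
}
METRICS = ("HOTA", "MOTA", "IDF1")


def absolute_gains(table):
    """Return iGLIP minus OS-OD for every tracker and metric, in points."""
    gains = {}
    for tracker, rows in table.items():
        baseline, ours = rows["OS-OD"], rows["iGLIP"]
        gains[tracker] = {
            metric: round(o - b, 2)
            for metric, b, o in zip(METRICS, baseline, ours)
        }
    return gains


if __name__ == "__main__":
    for tracker, gain in absolute_gains(TABLE_2).items():
        print(tracker, gain)
```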

Table 3: Tracking comparison on the Refer-GMOT40 dataset between our MA-SORT and other trackers. Our proposed iGLIP is used as the object detector. The best scores are highlighted in bold.

| Tracker | HOTA↑ | MOTA↑ | IDF1↑ |
| --- | --- | --- | --- |
| SORT [Bewley et al., 2016] | 54.21 | 62.90 | 64.34 |
| DeepSORT [Wojke et al., 2017] | 50.45 | 58.99 | 57.55 |
| ByteTrack [Zhang et al., 2022c] | 53.69 | 61.49 | 66.21 |
| OC-SORT [Cao et al., 2023] | 56.51 | 62.76 | 67.40 |
| Deep-OCSORT [Maggiolino et al., 2023] | 55.89 | 64.02 | 66.52 |
| MOTRv2 [Zhang et al., 2023] | 31.32 | 18.54 | 31.28 |
| MA-SORT (Ours) | 56.75 | 64.62 | 68.17 |

Table 4: Tracking comparison on Refer-Animal between our Z-GMOT and existing fully-supervised MOT methods. The best scores are highlighted in bold.

| Tracker | Detector | Train | HOTA↑ | MOTA↑ | IDF1↑ |
| --- | --- | --- | --- | --- | --- |
| SORT | FRCNN [ren2015faster] | | 42.80 | 55.60 | 49.20 |
| DeepSORT | FRCNN [ren2015faster] | | 32.80 | 41.40 | 35.20 |
| ByteTrack | YOLOX [yolox2021] | | 40.10 | 38.50 | 51.20 |
| TransTrack | YOLOX [yolox2021] | | 45.40 | 48.30 | 53.40 |
| QDTrack | YOLOX [yolox2021] | | 47.00 | 55.70 | 56.30 |
| MA-SORT (Ours) | YOLOX [yolox2021] | | 57.86 | 68.32 | 63.01 |
| MA-SORT (Ours) | iGLIP (Z-GMOT) (Ours) | | 53.28 | 57.64 | 58.43 |

Table 5: Ablation study on the generalizability of Z-GMOT on the DanceTrack validation set for the MOT task.

| Tracker | Detector | Train | HOTA↑ | MOTA↑ | IDF1↑ |
| --- | --- | --- | --- | --- | --- |
| SORT [bewley2016simple] | YOLOX [yolox2021] | | 47.80 | 88.20 | 48.30 |
| DeepSORT [wojke2017simple] | YOLOX [yolox2021] | | 45.80 | 87.10 | 46.80 |
| MOTDT [Chen2018RealTimeMP] | YOLOX [yolox2021] | | 39.20 | 84.30 | 39.60 |
| ByteTrack [zhang2021bytetrack] | YOLOX [yolox2021] | | 47.10 | 88.20 | 51.90 |
| OC-SORT [cao2023observation] | YOLOX [yolox2021] | | 52.10 | 87.30 | 51.60 |
| MA-SORT (Ours) | YOLOX [yolox2021] | | 53.44 | 87.31 | 53.78 |
| MA-SORT (Ours) | iGLIP (Z-GMOT) (Ours) | | 47.57 | 83.11 | 46.58 |

Table 6: Ablation study on the effectiveness of MA-SORT on the MOT20 test set for the MOT task. Because ByteTrack and OC-SORT use different thresholds for the test-set sequences and an offline interpolation procedure, we also report their scores with these disabled. The best scores are highlighted in bold.

| Tracker | HOTA↑ | MOTA↑ | IDF1↑ |
| --- | --- | --- | --- |
| MeMOT (Cai et al., 2022a) | 54.1 | 63.7 | 66.1 |
| FairMOT (Zhang et al., 2021) | 54.6 | 61.8 | 67.3 |
| TransTrack (Sun et al., 2020a) | 48.9 | 65.0 | 59.4 |
| TrackFormer (Meinhardt et al., 2022b) | 54.7 | 68.6 | 65.7 |
| ReMOT (Fan Yang and Nakamura, 2021) | 61.2 | 77.4 | 73.1 |
| GSDT (Wang et al., 2020) | 53.6 | 67.1 | 67.5 |
| CSTrack (Chao Liang and Zou, 2022) | 54.0 | 66.6 | 68.6 |
| TransMOT (Peng Chu and Liu, 2023) | - | 77.4 | 75.2 |
| ByteTrack (Zhang et al., 2022c), per-sequence thresholds and interpolation | 61.3 | 77.8 | 75.2 |
| OC-SORT (Cao et al., 2023), per-sequence thresholds and interpolation | 62.4 | 75.7 | 76.3 |
| ByteTrack (Zhang et al., 2022c), both disabled | 60.4 | 74.2 | 74.5 |
| OC-SORT (Cao et al., 2023), both disabled | 60.5 | 73.1 | 74.4 |
| MA-SORT (Ours) | 61.4 | 77.6 | 75.5 |
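
The caption of Table 6 refers to an offline interpolation procedure. As a rough illustration of what such post-processing can look like, the Python sketch below fills the missing frames of a single track by linearly interpolating between the two nearest observed boxes. This is a generic sketch under stated assumptions, not necessarily the exact procedure used by ByteTrack or OC-SORT: the function name `interpolate_track`, the `(x1, y1, x2, y2)` box format, and the `max_gap` parameter with its default value are all hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def interpolate_track(track: Dict[int, Box], max_gap: int = 20) -> Dict[int, Box]:
    """Fill missing frames of one track by linear interpolation.

    `track` maps frame index -> box. Two consecutive observations whose frame
    distance exceeds `max_gap` are left unconnected.
    """
    frames = sorted(track)
    filled = dict(track)
    for prev, nxt in zip(frames, frames[1:]):
        gap = nxt - prev
        if gap <= 1 or gap > max_gap:
            continue  # nothing to fill, or gap too long to trust
        for frame in range(prev + 1, nxt):
            w = (frame - prev) / gap  # 0 at `prev`, 1 at `nxt`
            filled[frame] = tuple(
                (1 - w) * a + w * b for a, b in zip(track[prev], track[nxt])
            )
    return filled


if __name__ == "__main__":
    # Hypothetical track observed at frames 1 and 4 only.
    observed = {1: (0.0, 0.0, 10.0, 10.0), 4: (30.0, 30.0, 40.0, 40.0)}
    for frame, box in sorted(interpolate_track(observed).items()):
        print(frame, box)
```

Because this step runs after tracking has finished and can use future frames, it is not available to a strictly online tracker, which is why scores with and without it are not directly comparable.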