Table 1 compares the performance of our Z-GMOT approach with established fully-supervised MOT methods on the Refer-Animal dataset.

| Task | Dataset | NLP | #Cat. | #Vid. | #Frames | #Tracks | #Boxes |
|---|---|---|---|---|---|---|---|
| SOT | OTB2013 | ✖ | 10 | 51 | 29K | 51 | 29K |
| | VOT2017 | ✖ | 24 | 60 | 21K | 60 | 21K |
| | TrackingNet | ✖ | 21 | 31K | 14M | 31K | 14M |
| | LaSOT | ✔ | 70 | 1.4K | 3.52M | 1.4K | 3.52M |
| | TNL2K | ✔ | - | 2K | 1.24M | 2K | 1.24M |
| MOT | MOT17 | ✖ | 1 | 14 | 11.2K | 1.3K | 0.3M |
| | MOT20 (Dendorfer et al., 2020) | ✖ | 1 | 8 | 13.41K | 3.45K | 1.65M |
| | Omni-MOT (Sun et al., 2020b) | ✖ | 1 | - | 14M+ | 250K | 110M |
| | DanceTrack (Sun et al., 2022) | ✖ | 1 | 10 | 105K | 990 | - |
| | TAO (Dave et al., 2020) | ✖ | 833 | 2.9K | 2.6M | 17.2K | 333K |
| | SportMOT (Cui et al., 2023) | ✖ | 1 | 240 | 150K | 3.4K | 1.62M |
| | Refer-KITTI (Wu et al., 2023) | ✔ | - | 18 | 6.65K | - | - |
| GSOT | GOT-10 (Huang et al., 2019) | ✖ | 563 | 10K | 1.5M | 10K | 1.5M |
| | Fish (Kay et al., 2022) | ✖ | 1 | 1.6K | 527.2K | 8.25K | 516K |
| GMOT | AnimalTrack (Zhang et al., 2022b) | ✖ | 10 | 58 | 24.7K | 1.92K | 429K |
| | GMOT-40 (Bai et al., 2021) | ✖ | 10 | 40 | 9K | 2.02K | 256K |
| | Refer-Animal (Ours) | ✔ | 10 | 58 | 24.7K | 1.92K | 429K |
| | Refer-GMOT (Ours) | ✔ | 10 | 40 | 9K | 2.02K | 256K |

We conduct extensive experiments to evaluate the proposed Z-GMOT on the GMOT problem, covering both detection with Open-CSOD and association with MAC-SORT. Our strategy helps bridge the gap between human intention and computer understanding, providing the flexibility to track objects with distinctive characteristics specified by the input text.

| Trackers | Detectors | #-Shot | HOTA↑ | MOTA↑ | IDF1↑ |
|---|---|---|---|---|---|
| SORT [Bewley et al., 2016] | OS-OD | one-shot | 30.05 | 20.83 | 33.90 |
| | iGLIP (Ours) | zero-shot | 54.21 | 62.90 | 64.34 |
| DeepSORT [Wojke et al., 2017] | OS-OD | one-shot | 27.82 | 17.96 | 30.37 |
| | iGLIP (Ours) | zero-shot | 50.45 | 58.99 | 57.55 |
| ByteTrack [Zhang et al., 2022c] | OS-OD | one-shot | 29.89 | 20.30 | 34.70 |
| | iGLIP (Ours) | zero-shot | 53.69 | 61.49 | 66.21 |
| OC-SORT [Cao et al., 2023] | OS-OD | one-shot | 30.35 | 20.60 | 34.37 |
| | iGLIP (Ours) | zero-shot | 56.51 | 62.76 | 67.40 |
| Deep-OCSORT [Maggiolino et al., 2023] | OS-OD | one-shot | 30.37 | 21.10 | 35.12 |
| | iGLIP (Ours) | zero-shot | 55.89 | 64.02 | 66.52 |
| MOTRv2 [Zhang et al., 2023] | OS-OD | one-shot | 23.75 | 13.87 | 25.17 |
| | iGLIP (Ours) | zero-shot | 31.32 | 18.54 | 31.28 |
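
To make the detector comparison above easier to read, the following minimal sketch computes the absolute HOTA gain of the zero-shot iGLIP detector over the one-shot OS-OD detector for each tracker. The numbers are copied from the table; the names `HOTA_BY_TRACKER` and `hota_gains` are illustrative assumptions, not part of any released code.

```python
# HOTA per tracker as (OS-OD one-shot, iGLIP zero-shot), copied from the table above.
# The dictionary and function names are hypothetical, chosen only for this sketch.
HOTA_BY_TRACKER = {
    "SORT": (30.05, 54.21),
    "DeepSORT": (27.82, 50.45),
    "ByteTrack": (29.89, 53.69),
    "OC-SORT": (30.35, 56.51),
    "Deep-OCSORT": (30.37, 55.89),
    "MOTRv2": (23.75, 31.32),
}


def hota_gains(table):
    """Return the absolute HOTA gain (zero-shot minus one-shot) per tracker."""
    return {
        name: round(zero_shot - one_shot, 2)
        for name, (one_shot, zero_shot) in table.items()
    }


gains = hota_gains(HOTA_BY_TRACKER)
for name, gain in gains.items():
    print(f"{name:12s} +{gain:.2f} HOTA")

# Tracker with the smallest improvement from switching detectors.
smallest = min(gains, key=gains.get)
print("smallest gain:", smallest)
```

Under these numbers, every tracker improves when the detector is swapped, and the gain is smallest for MOTRv2.
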

| Trackers | HOTA↑ | MOTA↑ | IDF1↑ |
|---|---|---|---|
| SORT [bewley2016simple] | 54.21 | 62.90 | 64.34 |
| DeepSORT [wojke2017simple] | 50.45 | 58.99 | 57.55 |
| ByteTrack [zhang2021bytetrack] | 53.69 | 61.49 | 66.21 |
| OC-SORT [cao2023observation] | 56.51 | 62.76 | 67.40 |
| Deep-OCSORT [maggiolino2023deep] | 55.89 | 64.02 | 66.52 |
| MOTRv2 [zhang2023motrv2] | 31.32 | 18.54 | 31.28 |
| MA-SORT (Ours) | 56.75 | 64.62 | 68.17 |

| Tracker | Detector | Train | HOTA | MOTA | IDF1 |
|---|---|---|---|---|---|
| SORT | FRCNN [ren2015faster] | ✔ | 42.80 | 55.60 | 49.20 |
| DeepSORT | FRCNN [ren2015faster] | ✔ | 32.80 | 41.40 | 35.20 |
| ByteTrack | YOLOX [yolox2021] | ✔ | 40.10 | 38.50 | 51.20 |
| TransTrack | YOLOX [yolox2021] | ✔ | 45.40 | 48.30 | 53.40 |
| QDTrack | YOLOX [yolox2021] | ✔ | 47.00 | 55.70 | 56.30 |
| MA-SORT (Ours) | YOLOX [yolox2021] | ✔ | 57.86 | 68.32 | 63.01 |
| MA-SORT (Ours) | iGLIP (Z-GMOT) (Ours) | ✖ | 53.28 | 57.64 | 58.43 |

| Trackers | Detectors | Train | HOTA↑ | MOTA↑ | IDF1↑ |
|---|---|---|---|---|---|
| SORT [bewley2016simple] | YOLOX [yolox2021] | ✔ | 47.80 | 88.20 | 48.30 |
| DeepSORT [wojke2017simple] | YOLOX [yolox2021] | ✔ | 45.80 | 87.10 | 46.80 |
| MOTDT [Chen2018RealTimeMP] | YOLOX [yolox2021] | ✔ | 39.20 | 84.30 | 39.60 |
| ByteTrack [zhang2021bytetrack] | YOLOX [yolox2021] | ✔ | 47.10 | 88.20 | 51.90 |
| OC-SORT [cao2023observation] | YOLOX [yolox2021] | ✔ | 52.10 | 87.30 | 51.60 |
| MA-SORT (Ours) | YOLOX [yolox2021] | ✔ | 53.44 | 87.31 | 53.78 |
| MA-SORT (Ours) | iGLIP Z-GMOT (Ours) | ✖ | 47.57 | 83.11 | 46.58 |

| Trackers | HOTA↑ | MOTA↑ | IDF1↑ |
|---|---|---|---|
| MeMOT (Cai et al., 2022a) | 54.1 | 63.7 | 66.1 |
| FairMOT (Zhang et al., 2021) | 54.6 | 61.8 | 67.3 |
| TransTrack (Sun et al., 2020a) | 48.9 | 65.0 | 59.4 |
| TrackFormer (Meinhardt et al., 2022b) | 54.7 | 68.6 | 65.7 |
| ReMOT (Fan Yang and Nakamura, 2021) | 61.2 | 77.4 | 73.1 |
| GSDT (Wang et al., 2020) | 53.6 | 67.1 | 67.5 |
| CSTrack (Chao Liang and Zou, 2022) | 54.0 | 66.6 | 68.6 |
| TransMOT (Peng Chu and Liu, 2023) | - | 77.4 | 75.2 |
| ByteTrack (Zhang et al., 2022c) | 61.3 | 77.8 | 75.2 |
| OC-SORT (Cao et al., 2023) | 62.4 | 75.7 | 76.3 |
| ByteTrack (Zhang et al., 2022c)† | 60.4 | 74.2 | 74.5 |
| OC-SORT (Cao et al., 2023)† | 60.5 | 73.1 | 74.4 |
| MA-SORT (Ours) | 61.4 | 77.6 | 75.5 |
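
For readers less familiar with the MOTA and IDF1 columns in the tables above, the sketch below shows how the two scores are computed from error counts, assuming the standard CLEAR-MOT and identity-metric definitions (the text does not state which evaluation toolkit was used). HOTA is omitted because it averages over localization thresholds and does not reduce to a single closed-form count ratio. The function names and the counts in the usage lines are hypothetical and do not correspond to any result reported here.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """CLEAR-MOT accuracy in percent: 100 * (1 - (FN + FP + IDSW) / GT).

    The score can be negative when the total error count exceeds the
    number of ground-truth boxes.
    """
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 100.0 * (1.0 - errors / num_gt_boxes)


def idf1(id_true_positives, id_false_positives, id_false_negatives):
    """Identity F1 in percent: 100 * 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denominator = 2 * id_true_positives + id_false_positives + id_false_negatives
    if denominator == 0:
        raise ValueError("at least one identity count must be non-zero")
    return 100.0 * 2 * id_true_positives / denominator


# Hypothetical counts, for illustration only.
example_mota = mota(false_negatives=150, false_positives=100,
                    id_switches=10, num_gt_boxes=1000)
example_idf1 = idf1(id_true_positives=800, id_false_positives=100,
                    id_false_negatives=200)
print(f"MOTA = {example_mota:.2f}")
print(f"IDF1 = {example_idf1:.2f}")
```

MOTA is dominated by detection errors (missed and spurious boxes), whereas IDF1 rewards keeping the same identity on a target over time, which is why the two columns can rank trackers differently.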