Z-GMOT with MA-SORT: Zero-shot Generic Multiple Object Tracking (GMOT) with Motion Appearance SORT (MA-SORT)

Comparative Analysis of Z-GMOT

Datasets comparison

Table 1 compares existing SOT, MOT, GSOT, and GMOT datasets with our two new datasets, Refer-Animal and Refer-GMOT. For each dataset it lists the number of categories, videos, frames, tracks, and boxes, and whether textual natural language descriptions are provided.

Table 1: Comparison of **existing datasets** for SOT, MOT, GSOT, and GMOT. "#" denotes the quantity of the respective items. Cat. and Vid. denote Categories and Videos. NLP indicates textual natural language descriptions.
Task	Datasets	NLP	#Cat.	#Vid.	#Frames	#Tracks	#Boxes
SOT	OTB2013	✖	10	51	29K	51	29K
	VOT2017	✖	24	60	21K	60	21K
	TrackingNet	✖	21	31K	14M	31K	14M
	LaSOT	✔	70	1.4K	3.52M	1.4K	3.52M
	TNL2K	✔	-	2K	1.24M	2K	1.24M
MOT	MOT17	✖	1	14	11.2K	1.3K	0.3M
	MOT20 (Dendorfer et al., 2020)	✖	1	8	13.41K	3.45K	1.65M
	Omni-MOT (Sun et al., 2020b)	✖	1	-	14M+	250K	110M
	DanceTrack (Sun et al., 2022)	✖	1	10	105K	990	-
	TAO (Dave et al., 2020)	✖	833	2.9K	2.6M	17.2K	333K
	SportsMOT (Cui et al., 2023)	✖	1	240	150K	3.4K	1.62M
	Refer-KITTI (Wu et al., 2023)	✔	-	18	6.65K	-	-
GSOT	GOT-10k (Huang et al., 2019)	✖	563	10K	1.5M	10K	1.5M
	Fish (Kay et al., 2022)	✖	1	1.6K	527.2K	8.25K	516K
GMOT	AnimalTrack (Zhang et al., 2022b)	✖	10	58	24.7K	1.92K	429K
	GMOT-40 (Bai et al., 2021)	✖	10	40	9K	2.02K	256K
	Refer-Animal (Ours)	✔	10	58	24.7K	1.92K	429K
	Refer-GMOT (Ours)	✔	10	40	9K	2.02K	256K

Quantitative Results

We conduct extensive experiments to empirically evaluate the proposed Z-GMOT on the GMOT problem, covering both detection with Open-CSOD and association with MA-SORT. Our strategy helps bridge the gap between human intention and computer understanding, providing the flexibility to track objects with distinctive characteristics specified by input text.
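To make the detection-then-association split concrete, the following is a minimal, generic sketch of an association step that blends a motion cue (box IoU) with an appearance cue (cosine similarity of feature vectors). It is not the MA-SORT algorithm, whose cost function is not described in this section. The function names, the blending weight `w_app`, and the threshold `min_score` are all hypothetical, and greedy matching is used only to keep the example dependency-free (SORT-style trackers typically use Hungarian matching).

```python
from math import sqrt


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def associate(tracks, detections, w_app=0.5, min_score=0.3):
    """Greedily match tracks to detections on a blended motion/appearance score.

    tracks, detections: lists of (box, feature) pairs.
    w_app and min_score are hypothetical illustration parameters.
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    scored = []
    for ti, (t_box, t_feat) in enumerate(tracks):
        for di, (d_box, d_feat) in enumerate(detections):
            score = (1 - w_app) * iou(t_box, d_box) + w_app * cosine(t_feat, d_feat)
            if score >= min_score:
                scored.append((score, ti, di))
    scored.sort(reverse=True)  # best-scoring pairs are claimed first

    matches, used_tracks, used_dets = [], set(), set()
    for _, ti, di in scored:
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return sorted(matches)


# Toy example: two tracks, two slightly shifted detections listed in swapped order.
tracks = [((0, 0, 10, 10), (1.0, 0.0)), ((20, 20, 30, 30), (0.0, 1.0))]
detections = [((21, 21, 31, 31), (0.0, 1.0)), ((1, 1, 11, 11), (1.0, 0.0))]
print(associate(tracks, detections))
```

In the toy example each track is paired with the detection that overlaps it and shares its feature direction, so the swapped input order does not matter.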

Table 2: Tracking comparison on the *Refer-GMOT40* dataset between our iGLIP and the SOTA OS-OD detector across various trackers. For each tracker, the best scores are highlighted in **bold**.
Trackers	Detectors	#-Shot	HOTA↑	MOTA↑	IDF1↑
SORT [Bewley et al., 2016]	OS-OD	one-shot	30.05	20.83	33.90
SORT [Bewley et al., 2016]	iGLIP (Ours)	zero-shot	**54.21**	**62.90**	**64.34**
DeepSORT [Wojke et al., 2017]	OS-OD	one-shot	27.82	17.96	30.37
DeepSORT [Wojke et al., 2017]	iGLIP (Ours)	zero-shot	**50.45**	**58.99**	**57.55**
ByteTrack [Zhang et al., 2022c]	OS-OD	one-shot	29.89	20.30	34.70
ByteTrack [Zhang et al., 2022c]	iGLIP (Ours)	zero-shot	**53.69**	**61.49**	**66.21**
OC-SORT [Cao et al., 2023]	OS-OD	one-shot	30.35	20.60	34.37
OC-SORT [Cao et al., 2023]	iGLIP (Ours)	zero-shot	**56.51**	**62.76**	**67.40**
Deep-OCSORT [Maggiolino et al., 2023]	OS-OD	one-shot	30.37	21.10	35.12
Deep-OCSORT [Maggiolino et al., 2023]	iGLIP (Ours)	zero-shot	**55.89**	**64.02**	**66.52**
MOTRv2 [Zhang et al., 2023]	OS-OD	one-shot	23.75	13.87	25.17
MOTRv2 [Zhang et al., 2023]	iGLIP (Ours)	zero-shot	**31.32**	**18.54**	**31.28**
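The per-tracker improvement of zero-shot iGLIP over one-shot OS-OD can be read off Table 2 by simple subtraction. The short script below does that arithmetic. The numbers are copied from the table, while the dictionary layout and the helper name `gains` are illustrative choices made only for this sketch.

```python
# (HOTA, MOTA, IDF1) from Table 2, stored as (OS-OD one-shot, iGLIP zero-shot).
TABLE2 = {
    "SORT":        ((30.05, 20.83, 33.90), (54.21, 62.90, 64.34)),
    "DeepSORT":    ((27.82, 17.96, 30.37), (50.45, 58.99, 57.55)),
    "ByteTrack":   ((29.89, 20.30, 34.70), (53.69, 61.49, 66.21)),
    "OC-SORT":     ((30.35, 20.60, 34.37), (56.51, 62.76, 67.40)),
    "Deep-OCSORT": ((30.37, 21.10, 35.12), (55.89, 64.02, 66.52)),
    "MOTRv2":      ((23.75, 13.87, 25.17), (31.32, 18.54, 31.28)),
}


def gains(table):
    """Return {tracker: (dHOTA, dMOTA, dIDF1)} as iGLIP minus OS-OD, in points."""
    return {
        name: tuple(round(ours - base, 2) for base, ours in zip(os_od, iglip))
        for name, (os_od, iglip) in table.items()
    }


for name, (d_hota, d_mota, d_idf1) in gains(TABLE2).items():
    print(f"{name:12s} HOTA {d_hota:+.2f}  MOTA {d_mota:+.2f}  IDF1 {d_idf1:+.2f}")
```

For instance, SORT gains 24.16 HOTA points when OS-OD is replaced by iGLIP, while MOTRv2 shows the smallest HOTA gain at 7.57 points.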

Table 3: Tracking comparison on the *Refer-GMOT40* dataset between our *MA-SORT* and other trackers. Our proposed *iGLIP* is used as the object detector. The best scores are highlighted in **bold**.
Trackers	HOTA↑	MOTA↑	IDF1↑
SORT [Bewley et al., 2016]	54.21	62.90	64.34
DeepSORT [Wojke et al., 2017]	50.45	58.99	57.55
ByteTrack [Zhang et al., 2022c]	53.69	61.49	66.21
OC-SORT [Cao et al., 2023]	56.51	62.76	67.40
Deep-OCSORT [Maggiolino et al., 2023]	55.89	64.02	66.52
MOTRv2 [Zhang et al., 2023]	31.32	18.54	31.28
MA-SORT (Ours)	**56.75**	**64.62**	**68.17**
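For readers less familiar with the metric columns, the sketch below writes out the standard definitions as commonly stated: MOTA from the CLEAR MOT metrics, IDF1 from the identity metrics, and HOTA at a single localization threshold as the geometric mean of detection and association accuracy. The function and argument names are illustrative, the counts in the example are made up, and the tables report these values as percentages. Full HOTA additionally averages over a range of localization thresholds, which is omitted here.

```python
from math import sqrt


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with GT the number of ground-truth boxes."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: sqrt(DetA * AssA)."""
    return sqrt(det_a * ass_a)


# Made-up counts, purely to exercise the formulas.
print(f"MOTA = {100 * mota(10, 5, 5, 100):.1f}")
print(f"IDF1 = {100 * idf1(80, 10, 30):.1f}")
print(f"HOTA@alpha = {100 * hota_at_threshold(0.64, 0.36):.1f}")
```

Because MOTA subtracts error counts from 1, it can be negative when errors outnumber ground-truth boxes, whereas IDF1 and HOTA stay between 0 and 1.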

Table 4: Tracking comparison on *Refer-Animal* between our *Z-GMOT* and existing *fully-supervised MOT* methods. The best scores are highlighted in **bold**.
Tracker	Detector	Train	HOTA↑	MOTA↑	IDF1↑
SORT	FRCNN [Ren et al., 2015]	✔	42.80	55.60	49.20
DeepSORT	FRCNN [Ren et al., 2015]	✔	32.80	41.40	35.20
ByteTrack	YOLOX [Ge et al., 2021]	✔	40.10	38.50	51.20
TransTrack	YOLOX [Ge et al., 2021]	✔	45.40	48.30	53.40
QDTrack	YOLOX [Ge et al., 2021]	✔	47.00	55.70	56.30
MA-SORT (Ours)	YOLOX [Ge et al., 2021]	✔	**57.86**	**68.32**	**63.01**
MA-SORT (Ours)	iGLIP (Z-GMOT) (Ours)	✖	53.28	57.64	58.43

Table 5: Ablation study on the generalizability of Z-GMOT on the *DanceTrack* validation set for the *MOT task*.
Trackers	Detectors	Train	HOTA↑	MOTA↑	IDF1↑
SORT [Bewley et al., 2016]	YOLOX [Ge et al., 2021]	✔	47.80	88.20	48.30
DeepSORT [Wojke et al., 2017]	YOLOX [Ge et al., 2021]	✔	45.80	87.10	46.80
MOTDT [Chen et al., 2018]	YOLOX [Ge et al., 2021]	✔	39.20	84.30	39.60
ByteTrack [Zhang et al., 2022c]	YOLOX [Ge et al., 2021]	✔	47.10	88.20	51.90
OC-SORT [Cao et al., 2023]	YOLOX [Ge et al., 2021]	✔	52.10	87.30	51.60
MA-SORT (Ours)	YOLOX [Ge et al., 2021]	✔	53.44	87.31	53.78
MA-SORT (Ours)	iGLIP (Z-GMOT) (Ours)	✖	47.57	83.11	46.58

Table 6: Ablation study on the effectiveness of *MA-SORT* on the *MOT20* test set for the *MOT task*. Because ByteTrack and OC-SORT (gray) use different thresholds for test-set sequences and an offline interpolation procedure, we also report their scores with these disabled, denoted ByteTrack^† and OC-SORT^†. The best scores are highlighted in **bold**.
Trackers	HOTA↑	MOTA↑	IDF1↑
MeMOT (Cai et al., 2022a)	54.1	63.7	66.1
FairMOT (Zhang et al., 2021)	54.6	61.8	67.3
TransTrack (Sun et al., 2020a)	48.9	65.0	59.4
TrackFormer (Meinhardt et al., 2022b)	54.7	68.6	65.7
ReMOT (Yang et al., 2021)	61.2	77.4	73.1
GSDT (Wang et al., 2020)	53.6	67.1	67.5
CSTrack (Liang et al., 2022)	54.0	66.6	68.6
TransMOT (Chu et al., 2023)	-	77.4	75.2
ByteTrack (Zhang et al., 2022c)	61.3	77.8	75.2
OC-SORT (Cao et al., 2023)	62.4	75.7	76.3
ByteTrack (Zhang et al., 2022c)^†	60.4	74.2	74.5
OC-SORT (Cao et al., 2023)^†	60.5	73.1	74.4
MA-SORT (Ours)	61.4	77.6	75.5

Z-GMOT demo website
