Results

Preamble

Our code uses automatic mixed precision (AMP) by default. Its effect on the output is negligible. All speeds reported in the paper are recorded with AMP turned off (--benchmark). Due to refactoring, the outputs produced by this code base might differ slightly from the precomputed results and from the results reported in the paper. Such differences rarely change the least significant digit of a reported score (i.e., a change of 0.1).
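When comparing your own evaluation output against the reported numbers, a tolerance of one least significant digit is a reasonable starting point. The following is a minimal sketch, assuming the scores are held in plain dictionaries. The function name `metrics_outside_tolerance` and the "reproduced" values are hypothetical and are not produced by this code base; the "reported" values are the XMem row of the DAVIS 2017 validation table.

```python
def metrics_outside_tolerance(reported, reproduced, tol=0.1):
    """Return the names of metrics whose values differ by more than `tol`.

    A small epsilon absorbs floating-point noise so that a difference of
    exactly one least significant digit (0.1) is still accepted.
    """
    eps = 1e-9
    return [
        name
        for name, ref in reported.items()
        if abs(reproduced[name] - ref) > tol + eps
    ]


# Reported scores (XMem, DAVIS 2017 validation).
reported = {"J&F": 86.2, "J": 82.9, "F": 89.5}

# Hypothetical scores from a local run, for illustration only.
reproduced = {"J&F": 86.3, "J": 82.9, "F": 89.8}

print(metrics_outside_tolerance(reported, reproduced))
```

Here only the metric that moved by more than 0.1 is flagged; a shift of a single last digit is treated as expected variation.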

For the most complete results, please see the paper (and its appendix)!

All available precomputed results can be found [here].

Pretrained models

We provide four pretrained models for download:

XMem.pth (Default)
XMem-s012.pth (Trained with BL30K)
XMem-s2.pth (No pretraining on static images)
XMem-no-sensory (No sensory memory)

The model without static-image pretraining (XMem-s2) is provided for reference. The model without sensory memory might be more suitable for tasks without spatial continuity, such as mask tracking in a multi-camera 3D reconstruction setting, though I would encourage you to try the base model as well.

Download them from [GitHub] or [Google Drive].

Long-Time Video

[Precomputed Results]

Long-Time Video (1X)

Model	J&F	J	F
XMem	89.8±0.2	88.0±0.2	91.6±0.2

Long-Time Video (3X)

Model	J&F	J	F
XMem	90.0±0.4	88.2±0.3	91.8±0.4

DAVIS

[Precomputed Results]

DAVIS 2016

Model	J&F	J	F	FPS	FPS (AMP)
XMem	91.5	90.4	92.7	29.6	40.3
XMem-s012	92.0	90.7	93.2	29.6	40.3
XMem-s2	90.8	89.6	91.9	29.6	40.3
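For readers unfamiliar with the column names: under the usual DAVIS convention, J is region similarity, F is contour accuracy, and J&F is their mean. The sketch below recomputes J&F from the J and F columns of the DAVIS 2016 table. The helper name `j_and_f` is hypothetical. Because the table entries are already rounded to one decimal, the recomputed value can differ from the reported one in the last digit.

```python
def j_and_f(j, f):
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2


# (reported J&F, J, F) taken from the DAVIS 2016 table.
davis_2016 = {
    "XMem": (91.5, 90.4, 92.7),
    "XMem-s012": (92.0, 90.7, 93.2),
    "XMem-s2": (90.8, 89.6, 91.9),
}

for model, (reported, j, f) in davis_2016.items():
    print(f"{model}: recomputed {j_and_f(j, f):.2f}, reported {reported}")
```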

DAVIS 2017 validation

Model	J&F	J	F	FPS	FPS (AMP)
XMem	86.2	82.9	89.5	22.6	33.9
XMem-s012	87.7	84.0	91.4	22.6	33.9
XMem-s2	84.5	81.4	87.6	22.6	33.9
XMem-no-sensory	85.1	-	-	23.1	-
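If you want to process these tables programmatically, the sketch below parses the tab-separated DAVIS 2017 validation rows, treats "-" as a missing value, and computes the speed-up from AMP as the ratio of the two FPS columns. The function names `parse_results` and `amp_speedup` are hypothetical helpers, not part of this code base.

```python
TABLE = """Model\tJ&F\tJ\tF\tFPS\tFPS (AMP)
XMem\t86.2\t82.9\t89.5\t22.6\t33.9
XMem-s012\t87.7\t84.0\t91.4\t22.6\t33.9
XMem-s2\t84.5\t81.4\t87.6\t22.6\t33.9
XMem-no-sensory\t85.1\t-\t-\t23.1\t-"""


def parse_results(text):
    """Parse a tab-separated results table into a list of dicts.

    Cells containing "-" are stored as None (value not reported).
    """
    lines = text.strip().splitlines()
    header = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        row = {"Model": cells[0]}
        for name, cell in zip(header[1:], cells[1:]):
            row[name] = None if cell == "-" else float(cell)
        rows.append(row)
    return rows


def amp_speedup(row):
    """Ratio of FPS with AMP to FPS without it, or None if either is missing."""
    fps, fps_amp = row["FPS"], row["FPS (AMP)"]
    if fps is None or fps_amp is None:
        return None
    return fps_amp / fps


for row in parse_results(TABLE):
    speedup = amp_speedup(row)
    label = "n/a" if speedup is None else f"{speedup:.2f}x"
    print(f"{row['Model']}: AMP speed-up {label}")
```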

DAVIS 2017 test-dev

Model	J&F	J	F
XMem	81.0	77.4	84.5
XMem-s012	81.2	77.6	84.7
XMem-s2	79.8	61.4	68.1
XMem-s012 (600p)	82.5	79.1	85.8

YouTubeVOS

We use all available frames in YouTubeVOS by default. See INFERENCE.md if you want to evaluate with sparse frames for some reason.

[Precomputed Results]

[Precomputed Results (sparse)]

YouTubeVOS 2018 validation

Model	G	J-Seen	F-Seen	J-Unseen	F-Unseen	FPS	FPS (AMP)
XMem	85.7	84.6	89.3	80.2	88.7	22.6	31.7
XMem-s012	86.1	85.1	89.8	80.3	89.2	22.6	31.7
XMem-s2	84.3	83.9	88.8	77.7	86.7	22.6	31.7
XMem-no-sensory	84.4	-	-	-	-	23.1	-

YouTubeVOS 2019 validation

Model	G	J-Seen	F-Seen	J-Unseen	F-Unseen
XMem	85.5	84.3	88.6	80.3	88.6
XMem-s012	85.8	84.8	89.2	80.3	88.8
XMem-s2	84.2	83.8	88.3	78.1	86.7
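As a reading aid for the YouTubeVOS tables: under the usual benchmark definition, the overall score G is the average of the four per-category scores (J and F on seen and unseen classes). The sketch below recomputes G for the 2019 validation rows. The helper name `overall_g` is hypothetical, and the recomputed value may differ from the reported one in the last digit because the inputs are already rounded.

```python
from statistics import mean


def overall_g(j_seen, f_seen, j_unseen, f_unseen):
    """Average of J and F over seen and unseen categories."""
    return mean([j_seen, f_seen, j_unseen, f_unseen])


# (reported G, J-Seen, F-Seen, J-Unseen, F-Unseen) from the 2019 validation table.
ytvos_2019 = {
    "XMem": (85.5, 84.3, 88.6, 80.3, 88.6),
    "XMem-s012": (85.8, 84.8, 89.2, 80.3, 88.8),
    "XMem-s2": (84.2, 83.8, 88.3, 78.1, 86.7),
}

for model, (reported, *scores) in ytvos_2019.items():
    print(f"{model}: recomputed {overall_g(*scores):.2f}, reported {reported}")
```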

Multi-scale evaluation

Please see the appendix for quantitative results.

[DAVIS-MS Precomputed Results]

[YouTubeVOS-MS Precomputed Results]
