CVPR 2026

TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction

Qin Wang¹,² Abigail Morrison¹,² Hanno Scharr¹ Kai Krajsek¹

¹ Forschungszentrum Jülich, ² RWTH Aachen University

TLDR: TimeBridge extends iBOT for video representation learning by reconstructing intermediate frames from temporally separated start and end frames.

**Method overview.** TimeBridge samples a start frame and an end frame, then reconstructs equidistant in-between frames to learn the temporal evolution bridging the two observations.
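As a rough illustration of the sampling step described above, the sketch below picks a start index, an end index, and equally spaced in-between indices from a clip. The function name `sample_bridge_indices`, its parameters, and the uniform choice of step size are assumptions made for illustration; this page does not specify the sampling strategy actually used by TimeBridge.

```python
import random


def sample_bridge_indices(num_frames, num_between, rng=random):
    """Return (start, in_between, end) frame indices with equal spacing.

    Hypothetical helper: the step size and start position are drawn
    uniformly, which is an assumption and not taken from the paper.
    """
    segments = num_between + 1                 # equal steps from start to end
    max_step = (num_frames - 1) // segments    # largest step that still fits
    if max_step < 1:
        raise ValueError("clip is too short for this many in-between frames")
    step = rng.randint(1, max_step)            # randint is inclusive on both ends
    start = rng.randint(0, num_frames - 1 - step * segments)
    end = start + step * segments
    in_between = [start + step * k for k in range(1, segments)]
    return start, in_between, end


if __name__ == "__main__":
    start, in_between, end = sample_bridge_indices(num_frames=32, num_between=3)
    print("start:", start, "in-between:", in_between, "end:", end)
```

Forcing the gap to be a multiple of `num_between + 1` keeps every in-between target on an integer frame index, so no frame interpolation is needed.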

Overview

Abstract

TimeBridge is a self-supervised video representation learning method that models how visual content evolves between a start frame and an end frame. It learns temporal transformations by reconstructing in-between frames, producing stronger motion-aware features for dense video prediction tasks such as video object segmentation and part propagation.

Core Idea

Learn time as a bridge, not just a sequence

Jointly encode the start and end observations, then predict the latent visual evolution in between.

Representation Benefit

Capture motion-aware semantics

The learned features reflect how objects change over time, making them strong for dense propagation tasks.

Downstream Impact

Strong performance on dense video understanding

The representation transfers to video object segmentation and semantic or part propagation benchmarks.
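Dense propagation benchmarks of this kind are commonly evaluated by transferring labels from an annotated frame to later frames through patch-feature similarity, with the encoder kept frozen. The following is a minimal pure-Python sketch of that idea, assuming per-patch feature vectors are already available. The function names, `topk`, and `temperature` are illustrative assumptions and not the evaluation settings used for TimeBridge.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def propagate_labels(ref_feats, ref_labels, tgt_feats, topk=5, temperature=0.1):
    """Propagate per-patch label distributions from a reference frame.

    ref_feats:  list of N feature vectors from the labelled frame
    ref_labels: list of N label distributions (for example one-hot lists)
    tgt_feats:  list of M feature vectors from the frame to be labelled
    Returns a list of M label distributions.
    """
    num_classes = len(ref_labels[0])
    propagated = []
    for query in tgt_feats:
        sims = [(cosine(query, ref), i) for i, ref in enumerate(ref_feats)]
        nearest = sorted(sims, reverse=True)[:topk]   # most similar patches
        weights = [math.exp(s / temperature) for s, _ in nearest]
        total = sum(weights)
        propagated.append([
            sum(w * ref_labels[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(num_classes)
        ])
    return propagated
```

Under this kind of protocol, features that track how an object moves and deforms tend to give target patches the right neighbours in the reference frame, which is why motion-aware representations are expected to help.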

Results

Comparison with Prior Work

Performance comparison on DAVIS 2017, VIP, and JHMDB.

Method	Backbone	Dataset	Epochs	DAVIS J&F_m	DAVIS J_m	DAVIS F_m	VIP mIoU	JHMDB PCK@0.1	JHMDB PCK@0.2
DINO	ViT-S/16	ImageNet	800	61.8	60.2	63.4	36.2	45.6	75.0
iBOT	ViT-S/16	ImageNet	800	62.8	61.2	64.5	37.9	44.6	74.6
CrOC	ViT-S/16	ImageNet	300	44.7	43.5	45.9	-	-	-
VideoMAE	ViT-S/16	Kinetics	-	39.3	39.7	38.9	23.3	41.0	67.9
SiamMAE	ViT-S/16	Kinetics	2000	62.0	60.3	63.7	37.3	47.0	76.1
SiamMAE	ViT-S/16	Kinetics	400	57.9	56.0	60.0	33.2	46.1	74.0
CropMAE	ViT-S/16	Kinetics	400	58.6	55.8	61.4	33.7	42.9	71.1
CropMAE	ViT-S/16	INSub	400	60.4	57.6	63.3	33.3	43.6	72.0
RSP	ViT-S/16	Kinetics	400	60.1	57.4	62.8	33.8	44.6	73.4
T-CoRe	ViT-S/16	Kinetics	400	64.7	63.5	66.0	37.8	47.0	75.2
Ours	ViT-S/16	Kinetics	400	66.2 (+1.5)	64.3 (+0.8)	68.1 (+2.1)	39.4 (+1.5)	45.8	74.0
DINO	ViT-S/8	ImageNet	800	69.9	66.6	73.1	39.5	56.5	80.3
SiamMAE	ViT-S/8	Kinetics	2000	71.4	68.4	74.5	45.9	61.9	83.8
Ours	ViT-S/8	Kinetics	100	72.4	69.6	75.2	44.1	57.6	81.0
Ours	ViT-S/8	Kinetics	400	73.5 (+2.1)	70.6 (+2.2)	76.5 (+2.0)	47.5 (+1.6)	59.2	82.6

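Entries marked "-" indicate that no result was reported in the corresponding publication. Values in parentheses denote the gains highlighted in the paper, which match the difference to the best previously reported result for the same backbone in this table.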
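On DAVIS 2017, J&F_m is the average of the region similarity J_m and the contour accuracy F_m. The short check below copies a few rows from the table and confirms that the reported J&F_m values agree with that average up to rounding, and that the parenthesised DAVIS gains equal the difference to T-CoRe (ViT-S/16) and SiamMAE (ViT-S/8). The dictionary keys and helper names are illustrative only.

```python
# (J&F_m, J_m, F_m) values copied from the table above.
davis = {
    "T-CoRe, ViT-S/16": (64.7, 63.5, 66.0),
    "Ours, ViT-S/16": (66.2, 64.3, 68.1),
    "SiamMAE, ViT-S/8": (71.4, 68.4, 74.5),
    "Ours, ViT-S/8, 400 epochs": (73.5, 70.6, 76.5),
}


def jf_consistent(jf, j, f, tol=0.1):
    """True if the reported J&F_m matches the mean of J_m and F_m.

    The tolerance allows for the table values being rounded to one decimal.
    """
    return abs(jf - (j + f) / 2) <= tol


def gains(ours, baseline):
    """Per-metric improvement of `ours` over `baseline`, to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(ours, baseline))


if __name__ == "__main__":
    for name, (jf, j, f) in davis.items():
        print(name, "consistent:", jf_consistent(jf, j, f))
    print(gains(davis["Ours, ViT-S/16"], davis["T-CoRe, ViT-S/16"]))
    print(gains(davis["Ours, ViT-S/8, 400 epochs"], davis["SiamMAE, ViT-S/8"]))
```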

Qualitative Comparison

Sequence Comparisons

Three qualitative comparison clips from the DAVIS 2017 object segmentation propagation benchmark.

dance-twirl

Qualitative comparison sequence from the supplementary material.

shooting

Qualitative comparison sequence from the supplementary material.

soapbox

Qualitative comparison sequence from the supplementary material.

Citation

@InProceedings{Wang_2026_CVPR,
  author    = {Wang, Qin and Morrison, Abigail and Scharr, Hanno and Krajsek, Kai},
  title     = {TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {39647-39658}
}