CVPR 2026

TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction

Qin Wang (1,2), Abigail Morrison (1,2), Hanno Scharr (1), Kai Krajsek (1)

(1) Forschungszentrum Jülich, (2) RWTH Aachen University

TLDR: TimeBridge extends iBOT for video representation learning by reconstructing intermediate frames from temporally separated start and end frames.

Method overview. TimeBridge samples a start frame and an end frame, then reconstructs equidistant in-between frames to learn the temporal evolution bridging the two observations.
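
The sampling step in the overview can be illustrated with a minimal sketch. This page does not state how many in-between frames are used or how indices are rounded, so the function name `sample_bridge_indices`, the `num_between` parameter, and the floor rounding are all assumptions made for illustration only.

```python
def sample_bridge_indices(start, end, num_between):
    """Return (start, in-between indices, end) with evenly spaced in-between frames.

    Hypothetical helper: the spacing rule (floor of equal steps) is an assumption,
    not taken from the paper.
    """
    if end <= start:
        raise ValueError("end must come after start")
    if num_between < 0 or num_between > end - start - 1:
        raise ValueError("not enough frames between start and end")
    span = end - start
    # Split the interval into num_between + 1 equal steps and floor to frame indices.
    between = [start + (k * span) // (num_between + 1)
               for k in range(1, num_between + 1)]
    return start, between, end


# Start at frame 0, end at frame 8, three in-between targets.
print(sample_bridge_indices(0, 8, 3))  # (0, [2, 4, 6], 8)
```

In such a setup the start and end frames would be the encoder inputs and the in-between indices would select the reconstruction targets.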

Overview

Abstract

TimeBridge is a self-supervised video representation learning method that models how visual content evolves between a start frame and an end frame. It learns temporal transformations by reconstructing in-between frames, producing stronger motion-aware features for dense video prediction tasks such as video object segmentation and part propagation.

Core Idea

Learn time as a bridge, not just a sequence

Jointly encode the start and end observations, then predict the latent visual evolution in between.

Representation Benefit

Capture motion-aware semantics

The learned features reflect how objects change over time, which makes them well suited to dense propagation tasks.

Downstream Impact

Strong performance on dense video understanding

The representation transfers to video object segmentation and semantic or part propagation benchmarks.
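
Dense propagation benchmarks of this kind are commonly evaluated by copying labels from a reference frame to a query frame through feature similarity. The sketch below shows that general idea in plain Python. It is an assumption that TimeBridge is evaluated this way; the function name `propagate_labels` and the `top_k` and `temperature` values are illustrative and not taken from the paper.

```python
import math


def propagate_labels(ref_feats, ref_labels, qry_feats, top_k=2, temperature=0.1):
    """Propagate soft labels from reference patches to query patches.

    ref_feats:  L2-normalised feature vectors, one per reference patch
    ref_labels: one-hot (or soft) label vectors, one per reference patch
    qry_feats:  L2-normalised feature vectors, one per query patch
    """
    num_classes = len(ref_labels[0])
    out = []
    for q in qry_feats:
        # Cosine similarity to every reference patch (features are unit length).
        sims = [sum(a * b for a, b in zip(q, r)) for r in ref_feats]
        # Keep only the top-k most similar reference patches.
        nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        # Softmax over the kept similarities, sharpened by the temperature.
        weights = [math.exp(sims[i] / temperature) for i in nearest]
        total = sum(weights)
        # Weighted average of the neighbours' label vectors.
        label = [0.0] * num_classes
        for w, i in zip(weights, nearest):
            for c in range(num_classes):
                label[c] += (w / total) * ref_labels[i][c]
        out.append(label)
    return out


# Two reference patches with different classes; the query matches the first one.
ref_feats = [[1.0, 0.0], [0.0, 1.0]]
ref_labels = [[1, 0], [0, 1]]
print(propagate_labels(ref_feats, ref_labels, [[1.0, 0.0]], top_k=1))  # [[1.0, 0.0]]
```

Features that track how objects move between frames give cleaner nearest-neighbour matches, which is why motion-aware representations tend to help in this setting.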

Results

Comparison with Prior Work

Performance comparison on DAVIS 2017, VIP, and JHMDB.

Method    Backbone  Dataset   Epoch  DAVIS                                  VIP          JHMDB
                                     J&Fm         Jm           Fm           mIoU         PCK@0.1  PCK@0.2
DINO      ViT-S/16  ImageNet  800    61.8         60.2         63.4         36.2         45.6     75.0
iBOT      ViT-S/16  ImageNet  800    62.8         61.2         64.5         37.9         44.6     74.6
CrOC      ViT-S/16  ImageNet  300    44.7         43.5         45.9         -            -        -
VideoMAE  ViT-S/16  Kinetics  -      39.3         39.7         38.9         23.3         41.0     67.9
SiamMAE   ViT-S/16  Kinetics  2000   62.0         60.3         63.7         37.3         47.0     76.1
SiamMAE   ViT-S/16  Kinetics  400    57.9         56.0         60.0         33.2         46.1     74.0
CropMAE   ViT-S/16  Kinetics  400    58.6         55.8         61.4         33.7         42.9     71.1
CropMAE   ViT-S/16  INSub     400    60.4         57.6         63.3         33.3         43.6     72.0
RSP       ViT-S/16  Kinetics  400    60.1         57.4         62.8         33.8         44.6     73.4
T-CoRe    ViT-S/16  Kinetics  400    64.7         63.5         66.0         37.8         47.0     75.2
Ours      ViT-S/16  Kinetics  400    66.2 (+1.5)  64.3 (+0.8)  68.1 (+2.1)  39.4 (+1.5)  45.8     74.0
DINO      ViT-S/8   ImageNet  800    69.9         66.6         73.1         39.5         56.5     80.3
SiamMAE   ViT-S/8   Kinetics  2000   71.4         68.4         74.5         45.9         61.9     83.8
Ours      ViT-S/8   Kinetics  100    72.4         69.6         75.2         44.1         57.6     81.0
Ours      ViT-S/8   Kinetics  400    73.5 (+2.1)  70.6 (+2.2)  76.5 (+2.0)  47.5 (+1.6)  59.2     82.6

A dash (-) indicates that no value was reported in the corresponding publication. Values in parentheses denote the gains highlighted in the paper; they match the margin over the best prior result with the same backbone in each column.
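
The parenthesised gains can be checked with a few lines of arithmetic. The sketch below assumes the gains are measured against the best prior result with the same backbone in each column; the page does not state the baseline explicitly, but the numbers in the table are consistent with that reading. Only the strongest prior rows are copied in, and `gains_over_best` is a hypothetical helper name.

```python
# Columns: DAVIS J&Fm, DAVIS Jm, DAVIS Fm, VIP mIoU (values copied from the table).
prior_s16 = {
    "DINO": (61.8, 60.2, 63.4, 36.2),
    "iBOT": (62.8, 61.2, 64.5, 37.9),
    "SiamMAE (2000 epochs)": (62.0, 60.3, 63.7, 37.3),
    "T-CoRe": (64.7, 63.5, 66.0, 37.8),
}
ours_s16 = (66.2, 64.3, 68.1, 39.4)

prior_s8 = {
    "DINO": (69.9, 66.6, 73.1, 39.5),
    "SiamMAE (2000 epochs)": (71.4, 68.4, 74.5, 45.9),
}
ours_s8 = (73.5, 70.6, 76.5, 47.5)


def gains_over_best(ours, prior):
    """Per-column gain of `ours` over the best prior value, rounded to one decimal."""
    best = [max(col) for col in zip(*prior.values())]
    return [round(o - b, 1) for o, b in zip(ours, best)]


print(gains_over_best(ours_s16, prior_s16))  # [1.5, 0.8, 2.1, 1.5]
print(gains_over_best(ours_s8, prior_s8))    # [2.1, 2.2, 2.0, 1.6]
```

Note that the best prior entry is not always the same method: for ViT-S/16 the VIP mIoU baseline is iBOT (37.9), while the DAVIS baselines are T-CoRe.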

Qualitative Comparison

Sequence Comparisons

Three qualitative comparison clips from the DAVIS 2017 object segmentation propagation benchmark.

dance-twirl

Qualitative comparison sequence from the supplementary material.

shooting

Qualitative comparison sequence from the supplementary material.

soapbox

Qualitative comparison sequence from the supplementary material.

Citation

@InProceedings{Wang_2026_CVPR,
  author    = {Wang, Qin and Morrison, Abigail and Scharr, Hanno and Krajsek, Kai},
  title     = {TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {39647-39658}
}