CompVis/rayder

RayDer

Project Page | Paper | Weights

Scalable Self-Supervised Novel View Synthesis from Real-World Video

CompVis @ LMU Munich, MCML

Self-supervised novel view synthesis (NVS) methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer achieves performance that scales predictably with data, model size, and compute, following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.

Figure: RayDer enables training NVS from abundant general video, removing the static-scene data bottleneck.

This is a minimal, self-contained PyTorch re-implementation of RayDer. It currently covers inference; training code is coming soon.

Results

Zero-shot qualitative comparison of RayDer-L against E-RayZer in various NVS settings and an extreme setting with near-zero context-view overlap:

Figure: Zero-shot qualitative samples of RayDer-L compared with E-RayZer in typical NVS and extreme settings.

See the project page for additional samples (incl. videos) and analysis.

Usage

Setup

The model only depends on a recent torch, torchvision, einops, and jaxtyping; the demos additionally need gradio, Pillow and imageio/imageio-ffmpeg. Install them via:

pip install -r requirements.txt

Standalone

If you want to integrate RayDer into your own codebase, copy rayder/model.py and you should be good to go. Then instantiate the model as:

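import torch

from rayder.model import RayDer_L

model = RayDer_L()
# weights_only=True restricts unpickling to tensors and plain containers
model.load_state_dict(torch.load("rayder_l_576.pt", weights_only=True))
model.requires_grad_(False)  # inference only: no gradients needed
model.eval()  # disable training-time behavior such as dropout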

The RayDer class exposes two high-level inference methods:

  • predict_cameras(x): estimate camera parameters from a set of input views (trained for 8 views, but the models extrapolate quite well)
  • predict_views(x_in, cam_in, cam_target): synthesize novel views at target camera poses (trained for 1-7 input views, arbitrarily many output views)

Cameras are represented as custom dataclasses that can be directly sliced/indexed as a whole.
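The two methods can be chained to render held-out views of a clip. The following is a minimal sketch, not code from this repository: the helper name render_held_out_views is hypothetical, and it assumes that predict_cameras returns the camera dataclass for all input views and that cameras can be sliced along the same leading (b, t) dimensions as the images.

```python
def render_held_out_views(model, views, n_context):
    """Estimate cameras for all views, then re-render the held-out ones.

    views: image tensor of shape (b, t, h, w, 3) with values in [-1, 1].
    n_context: number of leading views used as input (trained range: 1 to 7).
    """
    # Cameras for every view in the set, context and target alike.
    cams = model.predict_cameras(views)

    # Assumption: cameras slice like the image tensor, i.e. [batch, view].
    x_in = views[:, :n_context]
    cam_in = cams[:, :n_context]
    cam_target = cams[:, n_context:]

    # Synthesize the held-out views at their estimated poses.
    return model.predict_views(x_in, cam_in, cam_target)
```

With 8 views and n_context=3, this would render 5 target views from 3 context views, which stays within the view counts the models were trained for.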

About the Codebase

Code is separated into clearly labeled blocks with comments explaining relevant design choices and conventions. For all public-facing APIs involving tensors, type hints with jaxtyping are provided (e.g. img: Float[torch.Tensor, "b t h w c"]), annotating dtype, tensor type, and shape.

Conventions. Images are channels-last (b, t, h, w, 3), not the PyTorch-default (b, t, 3, h, w), with pixel values in [-1, 1]. Camera extrinsics use the camera-to-world (c2w) convention: R rotates camera-space directions into world space and t is the camera position in world coordinates. The focal length f is normalized by the shorter image side: f = f_pixels / min(h-1, w-1).
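When bringing in data from other pipelines, the conventions above usually mean a few small conversions. The helpers below are an illustrative sketch in plain Python and are not part of rayder/model.py; all three function names are hypothetical. The uint8 mapping v / 127.5 - 1 is an assumption (the text only states the target range), and w2c_to_c2w assumes the source extrinsics map world points to camera space as x_cam = R x_world + t.

```python
def normalize_focal(f_pixels, h, w):
    """Focal length normalized by the shorter image side: f_pixels / min(h-1, w-1)."""
    return f_pixels / min(h - 1, w - 1)


def pixel_to_model_range(v):
    """Map a uint8 pixel value in [0, 255] to [-1, 1] (assumed linear mapping)."""
    return v / 127.5 - 1.0


def w2c_to_c2w(R_w2c, t_w2c):
    """Invert world-to-camera extrinsics given as nested lists (3x3 and length 3).

    The inverse of x_cam = R x_world + t is x_world = R^T x_cam - R^T t, so the
    c2w rotation is R^T and the c2w translation (the camera position in world
    coordinates) is -R^T t.
    """
    R_c2w = [[R_w2c[j][i] for j in range(3)] for i in range(3)]  # transpose
    t_c2w = [-sum(R_c2w[i][k] * t_w2c[k] for k in range(3)) for i in range(3)]
    return R_c2w, t_c2w
```

For tensors, the same layout change from (b, t, 3, h, w) to (b, t, h, w, 3) is a single permute of the last three dimensions.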

Generating Videos

Use generate_video.py to produce smooth view-interpolation videos from a set of input images:

python generate_video.py --image_dir /path/to/input/images --output output.mp4 --steps_per_pair 10 --fps 15

A checkpoint will be downloaded automatically if not explicitly specified.

Interactive Demo

Launch the Gradio app for an interactive browser-based demo:

python app.py

Upload a set of views, adjust the number of interpolation steps, and generate a novel-view video. The RayDer-L-576² model is loaded automatically via torch.hub.

Models

We currently release the following model variants:

| Variant | Width | Depth | Params | Resolution | torch.hub name |
| --- | --- | --- | --- | --- | --- |
| RayDer-L | 1024 | 24 | ~743M | 256² | rayder_l |
| RayDer-L-576² | 1024 | 24 | ~743M | 576² | rayder_l_576 |
Weights are released via HuggingFace. Additional model variants and licensing are available upon request.
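Loading a released variant by its torch.hub name might look like the sketch below. It is hedged: load_rayder, HUB_REPO, and HUB_NAMES are hypothetical names, and the sketch assumes that this GitHub repository (CompVis/rayder) serves as the torch.hub source and exposes entry points under the names listed above.

```python
HUB_REPO = "CompVis/rayder"  # assumption: the GitHub repo doubles as the hub source

# torch.hub name -> resolution of the variant (taken from the model listing)
HUB_NAMES = {"rayder_l": 256, "rayder_l_576": 576}


def load_rayder(name="rayder_l_576"):
    """Fetch a released variant via torch.hub and prepare it for inference."""
    if name not in HUB_NAMES:
        raise ValueError(f"unknown variant {name!r}; expected one of {sorted(HUB_NAMES)}")

    import torch  # imported lazily so the name check runs without torch installed

    model = torch.hub.load(HUB_REPO, name)  # downloads code and weights on first use
    model.requires_grad_(False)
    return model.eval()
```

Checking the name up front gives a clear error before any download is attempted.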

Acknowledgments

RayDer conceptually builds upon RayZer (Jiang et al., ICCV 2025), which introduced self-supervised NVS from unposed images via ray map-conditioned rendering. We extend their method to more general pretraining on dynamic video, consolidate the architecture, and enable variable input-view-count inference.

Parts of this repo are taken from the public implementation of the Flow Poke Transformer (Baumann et al., ICCV 2025; MIT). We also acknowledge code adapted from HDiT (Crowson et al., ICML 2024; MIT). The Gradio app loosely adapts some code from E-RayZer (Zhao et al., CVPR 2026; MIT).

License

This software is released under a license for personal and scientific non-commercial research purposes -- see LICENSE.md for the full terms. For any commercial use or exploitation, please contact license.compvis@ifi.lmu.de.

Citation

If you find our model or code useful, please cite our paper:

@misc{prestel2026rayderscalableselfsupervisednovel,
      title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video}, 
      author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
      year={2026},
}
