
Commit c3147b8

chunhanyao-stable and Vikram Voleti authored

add SV4D 2.0 (#440)

* add SV4D 2.0
* add SV4D 2.0
* Combined sv4dv2 and sv4dv2_8views sampling scripts

Co-authored-by: Vikram Voleti <vikram@ip-26-0-153-234.us-west-2.compute.internal>

1 parent 1659a1c commit c3147b8

44 files changed (+1000 -116 lines)

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -12,4 +12,5 @@
 /outputs
 /build
 /src
-/.vscode
+/.vscode
+**/__pycache__/

README.md

Lines changed: 41 additions & 0 deletions
@@ -5,6 +5,46 @@
## News

**April 4, 2025**
- We are releasing **[Stable Video 4D 2.0 (SV4D 2.0)](https://huggingface.co/stabilityai/sv4d2.0)**, an enhanced video-to-4D diffusion model for high-fidelity novel-view video synthesis and 4D asset generation. For research purposes:
  - **SV4D 2.0** was trained to generate 48 frames (12 video frames x 4 camera views) at 576x576 resolution, given a 12-frame input video of the same size, ideally consisting of white-background images of a moving object.
  - Compared to our previous 4D model [SV4D](https://huggingface.co/stabilityai/sv4d), **SV4D 2.0** generates videos with higher fidelity, sharper details during motion, and better spatio-temporal consistency. It also generalizes much better to real-world videos. Moreover, it does not rely on reference multi-view images of the first frame generated by SV3D, making it more robust to self-occlusions.
  - To generate longer novel-view videos, we autoregressively generate 12 frames at a time and use the previous generation as conditioning views for the remaining frames.
  - Please check our [project page](https://sv4d20.github.io), [arXiv paper](https://arxiv.org/pdf/2503.16396) and [video summary](https://www.youtube.com/watch?v=dtqj-s50ynU) for more details.
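The autoregressive long-video scheme described above can be sketched as a simple chunked loop. This is a minimal illustration: `sample_chunk` is a hypothetical stand-in for one diffusion sampling pass, not the actual SV4D 2.0 API.

```python
# Sketch of autoregressive novel-view generation: produce 12 frames per
# sampling pass, conditioning each pass on everything generated so far.
# `sample_chunk` is a placeholder, NOT the real SV4D 2.0 sampling call.

def sample_chunk(input_frames, cond_frames):
    # Stand-in for one diffusion sampling pass: one novel-view frame
    # per input frame (represented here by simple string tokens).
    return [f"nv_{f}" for f in input_frames]

def sample_autoregressive(video_frames, chunk_size=12):
    outputs = []
    for start in range(0, len(video_frames), chunk_size):
        chunk = video_frames[start:start + chunk_size]
        # Condition on prior generations (empty for the first chunk).
        outputs.extend(sample_chunk(chunk, cond_frames=outputs))
    return outputs

frames = [f"frame_{i}" for i in range(21)]  # a 21-frame input video
novel_views = sample_autoregressive(frames)
print(len(novel_views))  # -> 21, one novel-view frame per input frame
```

With a 21-frame input and `chunk_size=12`, the loop runs twice (12 frames, then the remaining 9), matching the "12 frames at a time" behavior described above.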
15+
**QUICKSTART** :
16+
- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs` (after downloading [sv4d2.safetensors](https://huggingface.co/stabilityai/sv4d2.0) from HuggingFace into `checkpoints/`)
17+
18+
To run **SV4D 2.0** on a single input video of 21 frames:
19+
- Download SV4D 2.0 model (`sv4d2.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`: `huggingface-cli download stabilityai/sv4d2.0 sv4d2.safetensors --local-dir checkpoints`
20+
- Run inference: `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
21+
- `input_path` : The input video `<path/to/video>` can be
22+
- a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/camel.gif`, or
23+
- a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or
24+
- a file name pattern matching images of video frames.
25+
- `num_steps` : default is 50, can decrease to it to shorten sampling time.
26+
- `elevations_deg` : specified elevations (reletive to input view), default is 0.0 (same as input view).
27+
- **Background removal** : For input videos with plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove background and crop video frames by setting `--remove_bg=True`. To obtain higher quality outputs on real-world input videos with noisy background, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D.
28+
- **Low VRAM environment** : To run on GPUs with low VRAM, try setting `--encoding_t=1` (of frames encoded at a time) and `--decoding_t=1` (of frames decoded at a time) or lower video resolution like `--img_size=512`.
29+
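The three accepted forms of `input_path` above can be resolved with a helper along these lines. This is a hypothetical sketch, not the actual loader in `simple_video_sample_4d2.py`.

```python
# Hypothetical resolution of --input_path into an ordered list of inputs,
# mirroring the three accepted forms: a video file, a folder of frame
# images, or a glob pattern. Illustrative only.
import glob
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def resolve_input_path(input_path):
    if os.path.isfile(input_path):
        # A single .gif or .mp4 video file.
        return [input_path]
    if os.path.isdir(input_path):
        # A folder of frame images, ordered by filename.
        frames = [f for f in sorted(os.listdir(input_path))
                  if f.lower().endswith(IMAGE_EXTS)]
        return [os.path.join(input_path, f) for f in frames]
    # Otherwise treat it as a glob pattern matching frame images.
    return sorted(glob.glob(input_path))
```

Sorting by filename assumes frames are named in playback order (e.g. `frame_000.png`, `frame_001.png`), which is the usual convention for extracted frames.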
30+
Notes:
31+
- We also train a 8-view model that generates 5 frames x 8 views at a time (same as SV4D).
32+
- Download the model from huggingface: `huggingface-cli download stabilityai/sv4d2.0 sv4d2_8views.safetensors --local-dir checkpoints`
33+
- Run inference: `python scripts/sampling/simple_video_sample_4d2.py --model_path checkpoints/sv4d2_8views.safetensors --input_path assets/sv4d_videos/chest.gif --output_folder outputs`
34+
- The 5x8 model takes 5 frames of input at a time. But the inference scripts for both model take 21-frame video as input by default (same as SV3D and SV4D), we run the model autoregressively until we generate 21 frames.
35+
- Install dependencies before running:
36+
```
37+
python3.10 -m venv .generativemodels
38+
source .generativemodels/bin/activate
39+
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # check CUDA version
40+
pip3 install -r requirements/pt2.txt
41+
pip3 install .
42+
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
43+
```
44+
45+
![tile](assets/sv4d2.gif)
46+
47+
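The low-VRAM options mentioned above (`--encoding_t`, `--decoding_t`) amount to pushing frames through the autoencoder in small chunks, trading speed for peak memory. A minimal sketch with a stubbed `encode`, not the repository's actual VAE API:

```python
# Sketch of the idea behind --encoding_t: encode only a few frames per
# call so that peak GPU memory stays bounded. `encode` is a placeholder
# for a real VAE encode of a batch of frames.

def encode(frames):
    # Stand-in for a batched VAE encode; returns one latent per frame.
    return [("latent", f) for f in frames]

def encode_in_chunks(frames, encoding_t=1):
    latents = []
    for i in range(0, len(frames), encoding_t):
        # Only `encoding_t` frames are in flight at a time.
        latents.extend(encode(frames[i:i + encoding_t]))
    return latents

latents = encode_in_chunks(list(range(21)), encoding_t=1)
print(len(latents))  # -> 21, same result as one big batch
```

Decoding with `--decoding_t` follows the same pattern in the other direction; the result is identical to a single large batch, only slower.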
**July 24, 2024**
- We are releasing **[Stable Video 4D (SV4D)](https://huggingface.co/stabilityai/sv4d)**, a video-to-4D diffusion model for novel-view video synthesis. For research purposes:
  - **SV4D** was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.
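The "5 video frames x 8 camera views" output can be pictured as a time-by-view grid. The evenly spaced azimuths below are an illustrative assumption for the orbital camera layout; the actual poses come from the model's view conditioning, not from this helper.

```python
# Illustrative time-by-view grid for one SV4D sampling pass.
# Evenly spaced azimuths are an assumption for illustration only.

def view_grid(num_frames=5, num_views=8):
    azimuths = [360.0 * v / num_views for v in range(num_views)]
    # One (frame index, azimuth) pair per generated image: 5 x 8 = 40.
    return [(t, az) for t in range(num_frames) for az in azimuths]

grid = view_grid()
print(len(grid))  # -> 40 generated images per sampling pass
```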
@@ -164,6 +204,7 @@ This is assuming you have navigated to the `generative-models` root after cloning
 # install required packages from pypi
 python3 -m venv .pt2
 source .pt2/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 pip3 install -r requirements/pt2.txt
 ```

Binary asset changes:

- assets/sv4d2.gif: 9.71 MB
- assets/sv4d_videos/bear.gif: 2.16 MB
- assets/sv4d_videos/bee.gif: 638 KB
- assets/sv4d_videos/bmx-bumps.gif: 2.23 MB
- assets/sv4d_videos/bunnyman.mp4: -47.1 KB (binary file not shown)
- assets/sv4d_videos/camel.gif: 1.93 MB
- assets/sv4d_videos/chameleon.gif: 1.4 MB
- assets/sv4d_videos/chest.gif: 2.2 MB
- assets/sv4d_videos/cows.gif: 1.67 MB
- assets/sv4d_videos/dance-twirl.gif: 1.15 MB
- assets/sv4d_videos/dolphin.mp4: -33.9 KB (binary file not shown)
- assets/sv4d_videos/flag.gif: 2.12 MB
- assets/sv4d_videos/gear.gif: 446 KB
- assets/sv4d_videos/green_robot.mp4: -50.3 KB (binary file not shown)
- assets/sv4d_videos/guppie_v0.mp4: -72.3 KB (binary file not shown)
- assets/sv4d_videos/hike.gif: 1.6 MB
- assets/sv4d_videos/hiphop_parrot.mp4: -56.8 KB (binary file not shown)
- assets/sv4d_videos/horsejump-low.gif: 1.41 MB
- assets/sv4d_videos/human5.mp4: -133 KB (binary file not shown)
- assets/sv4d_videos/human7.mp4: -123 KB (binary file not shown)
- assets/sv4d_videos/lucia_v000.mp4: -74.1 KB (binary file not shown)
- assets/sv4d_videos/monkey.mp4: -33.4 KB (binary file not shown)
- assets/sv4d_videos/pistol_v0.mp4: -36.2 KB (binary file not shown)
- assets/sv4d_videos/robot.gif: 947 KB
- assets/sv4d_videos/snowboard.gif: 1.5 MB
- assets/sv4d_videos/snowboard_v000.mp4: -123 KB (binary file not shown)
- assets/sv4d_videos/stroller_v000.mp4: -130 KB (binary file not shown)
- assets/sv4d_videos/test_video2.mp4: -25.2 KB (binary file not shown)
- assets/sv4d_videos/train_v0.mp4: -37.9 KB (binary file not shown)
- assets/sv4d_videos/wave_hello.mp4: -143 KB (binary file not shown)
- assets/sv4d_videos/windmill.gif: 2.36 MB

requirements/pt2.txt

Lines changed: 4 additions & 1 deletion

@@ -5,13 +5,16 @@ einops>=0.6.1
 fairscale>=0.4.13
 fire>=0.5.0
 fsspec>=2023.6.0
+imageio[ffmpeg]
+imageio[pyav]
 invisible-watermark>=0.2.0
 kornia==0.6.9
 matplotlib>=3.7.2
 natsort>=8.4.0
 ninja>=1.11.1
-numpy>=1.24.4
+numpy==2.1
 omegaconf>=2.3.0
+onnxruntime
 open-clip-torch>=2.20.0
 opencv-python==4.6.0.66
 pandas>=2.0.3

0 commit comments