Deep Learning and Generative AI for Computer Vision
| Title: | Deep Learning and Generative AI for Computer Vision |
| DNr: | Berzelius-2026-138 |
| Project Type: | LiU Berzelius |
| Principal Investigator: | Fredrik Kahl <fredrik.kahl@chalmers.se> |
| Affiliation: | Chalmers tekniska högskola |
| Duration: | 2026-05-01 – 2026-11-01 |
| Classification: | 10207 |
| Homepage: | https://neural3d.github.io/ |
| Keywords: | |
Abstract
This project is a continuation and expansion of Berzelius-2025-351 ("Generative AI for Autonomous Driving Scene Generation and Reconstruction"). It also incorporates the work previously described in Berzelius-2025-417 ("Deep Learning for 3D Computer Vision using Geometric Information and Generative Models"), following NSC's recommendation to consolidate all scientific subprojects under a single allocation. The previous project was scoped to generative methods for autonomous driving; this proposal broadens the scope to cover the research activities of the Computer Vision group at Chalmers University of Technology. The group has several active WASP-funded projects that share a common dependence on large-scale GPU compute for training and inference of modern deep learning architectures.
The subprojects are:
1. Generative AI for Autonomous Driving (Bernardo Taveira, Tianyu Wu): Generative image and video diffusion models for correcting artifacts from 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRFs) in autonomous driving scenes; World Foundation Models for generating novel 4D driving scenes from text or road markers for use in closed-loop simulation; reinforcement learning of autonomous driving models within dynamic 3DGS scenes; and regularization methods for improved 3DGS reconstruction quality.
2. Consistent Image and Video Generation and Editing (Josef Bengtson, Yaroslava Lochman): Enforcing 3D geometric and semantic consistency in foundation image and video generation models without retraining or fine-tuning. Approaches include guidance of the denoising process, output verification, and input augmentation combined with 3D reconstruction methods. Pipelines involve large foundation models (e.g., FLUX.2, RoMa v2, DepthAnything, AnySplat) requiring 30–80+ GiB of GPU memory, with backpropagation through the denoising process adding further memory and compute demands. Validation requires processing 20–50 scenes of up to 100 images each.
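The guidance idea above can be illustrated with a toy sketch: at each reverse-diffusion step, the intermediate sample is nudged by the gradient of a consistency loss. Everything here is a hypothetical stand-in, not the group's actual pipeline: `toy_denoiser` replaces a pretrained denoising network, and the quadratic pull toward a fixed `target` replaces a real 3D-consistency objective.

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for a pretrained denoiser: gradually shrinks the sample.
    A real model would predict and remove noise conditioned on t."""
    return x * (1.0 - 1.0 / (t + 2))

def consistency_grad(x, target):
    """Gradient of the toy consistency loss 0.5 * ||x - target||^2."""
    return x - target

def guided_sampling(x_init, target, steps=50, guidance_scale=0.2):
    """Reverse-diffusion loop with a guidance nudge after each denoise step."""
    x = x_init.copy()
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t)                                # ordinary step
        x = x - guidance_scale * consistency_grad(x, target)  # guidance nudge
    return x

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])   # hypothetical "consistent" solution
x0 = rng.normal(size=3) * 5.0          # initial noise

guided = guided_sampling(x0, target, guidance_scale=0.2)
unguided = guided_sampling(x0, target, guidance_scale=0.0)
```

With guidance enabled, the final sample lands measurably closer to the consistency target than the unguided run; scaling this idea to real diffusion models is what drives the memory cost noted above, since the guidance gradient must be backpropagated through the denoising network.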
3. Geometric Deep Learning for 3D Vision (David Nordström): Training large-scale deep learning models with geometric inductive biases for problems in 3D vision. Active research directions include: improving the efficiency of Vision Transformers via equivariance constraints (ICML 2025 spotlight); delivering state-of-the-art feature matching models such as RoMa v2 and LoMa; and self-supervised multi-view learning for 3D understanding (MuM, accepted at CVPR 2026). Representative training runs range from 16×A100 for 3–4 days to 64×A100 for 3 days, plus extensive ablation studies.
4. Generative Models for State Estimation and System Identification (Karl Hammar): Using generative models (neural SDEs, normalizing flows, and other architectures) as proposal distributions to accelerate traditional state estimation techniques such as particle filters. Models are small (< 5 GB) but require many experiments to be run in parallel; typical training runs last 1–12 hours with minimal storage requirements.
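The proposal-distribution idea can be sketched with a minimal particle filter for a 1-D linear-Gaussian model. The proposal is a pluggable component: in the project it would be a trained generative model such as a normalizing flow, while here a hand-coded Gaussian centred on the observation stands in for it. All model parameters below are illustrative and not taken from the proposal text.

```python
import numpy as np

rng = np.random.default_rng(1)

A, Q_SIG, R_SIG = 0.9, 0.5, 0.3   # hypothetical transition/noise parameters

def obs_loglik(y, x):
    """Log-likelihood of observation y = x + N(0, R_SIG^2), up to a constant."""
    return -0.5 * ((y - x) / R_SIG) ** 2

def learned_proposal(y, n):
    """Stand-in for a generative proposal q(x_t | y_t): Gaussian at y."""
    return rng.normal(y, R_SIG, size=n), R_SIG

def particle_filter(ys, n_particles=500):
    """Particle filter whose particles are drawn from the proposal, then
    reweighted by p(y|x) * p(x|x_prev) / q(x|y)."""
    x = rng.normal(0.0, 1.0, size=n_particles)
    estimates = []
    for y in ys:
        x_pred_mean = A * x                                   # transition mean
        x_new, q_sig = learned_proposal(y, n_particles)
        log_w = obs_loglik(y, x_new)
        log_w += -0.5 * ((x_new - x_pred_mean) / Q_SIG) ** 2  # transition term
        log_w -= -0.5 * ((x_new - y) / q_sig) ** 2            # proposal term
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * x_new)))            # posterior mean
        idx = rng.choice(n_particles, size=n_particles, p=w)  # resample
        x = x_new[idx]
    return estimates

# Simulate a short trajectory and filter it.
xs, ys, x_t = [], [], 0.0
for _ in range(30):
    x_t = A * x_t + rng.normal(0.0, Q_SIG)
    xs.append(x_t)
    ys.append(x_t + rng.normal(0.0, R_SIG))
est = particle_filter(np.array(ys))
```

Because the proposal already concentrates particles near the observation, few particles are wasted in low-likelihood regions; swapping in a learned proposal changes only `learned_proposal`, which is why many small parallel experiments (rather than large single runs) dominate this subproject's compute profile.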