proj4/

├── overview/ Neural Radiance Fields
├── part0/ Camera Calibration and 3D Scanning
├── part1/ 2D Neural Field
├── part2/ 3D Neural Radiance Field
└── part2.6/ Own Data NeRF

Neural Radiance Fields!

CS180 Project 4: Neural Radiance Fields and Volume Rendering

Overview

This project explores Neural Radiance Fields (NeRF), a cutting-edge technique for 3D scene representation and view synthesis. We implement neural networks that learn implicit representations of 3D scenes from multi-view images, enabling photorealistic view synthesis. The project spans from 2D neural fields to full 3D NeRF implementation, including camera calibration, volume rendering, and training on both synthetic and real-world data.

Part 0: Camera Calibration and 3D Scanning

ArUco-based camera calibration and pose estimation

For this section, I calibrated my camera using ArUco tags and captured multi-view images of a chosen object. The calibration process involved capturing 30-50 images of ArUco calibration tags, extracting corner coordinates, and running OpenCV's camera calibration routine to compute the intrinsic matrix and distortion coefficients.

After calibration, I captured images of my chosen object (a miniature harmonium) alongside a single ArUco tag, then used the calibrated camera parameters to estimate camera poses via solvePnP. The final step involved undistorting the images and packaging everything into a dataset format compatible with NeRF training.
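One subtlety in this step: solvePnP returns the world-to-camera rotation (as a Rodrigues axis-angle vector) and translation, while NeRF training expects camera-to-world poses. A minimal numpy sketch of that conversion, with a hand-rolled Rodrigues formula so the snippet stands alone (function names and shapes are my own, not the project's actual code):

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle rotation vector -> 3x3 rotation matrix
    (same convention as cv2.Rodrigues)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta                      # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])      # cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def c2w_from_pnp(rvec, tvec):
    """solvePnP gives the world-to-camera transform; NeRF wants
    camera-to-world, so invert the rigid transform."""
    R = rodrigues(np.asarray(rvec, float).ravel())
    t = np.asarray(tvec, float).ravel()
    c2w = np.eye(4)
    c2w[:3, :3] = R.T                     # inverse rotation
    c2w[:3, 3] = -R.T @ t                 # camera center in world coords
    return c2w
```

In practice the rotation matrix would come straight from `cv2.Rodrigues`; the pure-numpy version here just makes the convention explicit.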

Sample Images from Data Capture

Examples of captured images showing the object alongside the ArUco tag for pose estimation:

Sample Image 1
Captured Image 1
Sample Image 2
Captured Image 2
Sample Image 3
Captured Image 3

Camera Frustum Visualizations

Screenshots of the camera frustums visualization in Viser, showing the estimated camera poses:

Camera Frustums 1
Camera Frustums Visualization - View 1
Camera Frustums 2
Camera Frustums Visualization - View 2
Camera Frustums 3
Camera Frustums Visualization - View 3
Camera Frustums 4
Camera Frustums Visualization - View 4

Part 1: 2D Neural Field

Fitting neural networks to represent 2D images

Before implementing 3D NeRF, I started with a simpler 2D version to understand the fundamentals. I created a Multilayer Perceptron (MLP) with Sinusoidal Positional Encoding that takes 2D pixel coordinates as input and outputs RGB color values. The network was trained to fit entire images by optimizing pixel-wise color predictions.

2D Neural Field Architecture

2D Neural Field Architecture

Network Architecture Details:
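A minimal PyTorch sketch of the 2D neural field described above, assuming $L=10$ encoding frequencies and 256 hidden units (the strongest configuration from the hyperparameter study below); the 4-hidden-layer depth is my own illustrative choice:

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal encoding: x -> [x, sin(2^k pi x), cos(2^k pi x)] for k < L."""
    def __init__(self, num_freqs=10):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs) * torch.pi

    def forward(self, x):                       # x: (N, 2) coords in [0, 1]
        xf = x[..., None] * self.freqs          # (N, 2, L)
        enc = torch.cat([torch.sin(xf), torch.cos(xf)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)

class NeuralField2D(nn.Module):
    """MLP mapping encoded (u, v) pixel coordinates to RGB in [0, 1]."""
    def __init__(self, num_freqs=10, width=256, depth=4):
        super().__init__()
        in_dim = 2 + 2 * 2 * num_freqs          # identity + sin/cos per freq per coord
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers += [nn.Linear(d, 3), nn.Sigmoid()]  # RGB clamped to [0, 1]
        self.encode = PositionalEncoding(num_freqs)
        self.mlp = nn.Sequential(*layers)

    def forward(self, uv):
        return self.mlp(self.encode(uv))
```

Training then reduces to sampling pixel coordinates, predicting colors, and minimizing MSE against the ground-truth pixels.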

Training Progression

Below shows the training progression for both the provided fox image and my own Messi image:

Fox Image Training

Fox Epoch 0
Epoch 0
Fox Epoch 100
Epoch 100
Fox Epoch 400
Epoch 400
Fox Epoch 1600
Epoch 1600
Fox Epoch 2999
Epoch 2999
Original Fox Image
Original Fox Image
Fox Final Reconstruction
Neural Field Reconstruction

Messi Image Training

Messi Epoch 0
Epoch 0
Messi Epoch 100
Epoch 100
Messi Epoch 400
Epoch 400
Messi Epoch 1600
Epoch 1600
Messi Epoch 2999
Epoch 2999
Original Messi Image
Original Messi Image
Messi Final Reconstruction
Neural Field Reconstruction

Hyperparameter Analysis

I experimented with different positional encoding frequencies (L) and network widths:

L2 W64
$L=2$, Width=64
L2 W256
$L=2$, Width=256
L10 W64
$L=10$, Width=64
L10 W256
$L=10$, Width=256

As expected, higher positional encoding frequencies ($L=10$) capture fine details better than lower frequencies ($L=2$). Similarly, wider networks (256 hidden units) reconstruct the image with more detail and fewer artifacts than narrower ones (64 units).

Training Metrics

Fox Image Training Curves

Fox PSNR Curve
PSNR Training Curve (Fox Image)
Fox Loss Curve
Loss Training Curve (Fox Image)

Messi Image Training Curves

Messi PSNR Curve
PSNR Training Curve (Messi Image)
Messi Loss Curve
Loss Training Curve (Messi Image)

Part 2: Neural Radiance Field from Multi-view Images

Full 3D NeRF implementation with volume rendering

Building on the 2D neural field foundation, I implemented a full Neural Radiance Field that represents 3D scenes from multi-view images using inverse rendering from calibrated cameras.

Part 2.1: Create Rays from Cameras

I implemented three core coordinate transformations to generate camera rays. First, I used 4×4 transformation matrices to convert points from camera space to world coordinates: $\mathbf{x}_w = \mathbf{R} \mathbf{x}_c + \mathbf{t}$. Then I inverted the camera intrinsic matrix to convert pixel coordinates to 3D camera coordinates: $\mathbf{x}_c = K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$. Finally, I generated rays by computing the origin as the camera position $\mathbf{o} = \mathbf{t}$ and the direction as the normalized vector from camera center through each pixel.
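The three transformations above can be combined into a single ray-generation routine; a numpy sketch (the function name and array shapes are illustrative, not the project's actual code):

```python
import numpy as np

def pixels_to_rays(K, c2w, uv):
    """Turn pixel coordinates into world-space rays.

    K   : (3, 3) camera intrinsic matrix
    c2w : (4, 4) camera-to-world transform
    uv  : (N, 2) pixel coordinates
    Returns ray origins (N, 3) and unit directions (N, 3).
    """
    N = uv.shape[0]
    # Pixel -> camera coordinates at depth 1: x_c = K^{-1} [u, v, 1]^T
    homog = np.concatenate([uv, np.ones((N, 1))], axis=1)
    x_cam = (np.linalg.inv(K) @ homog.T).T
    # Camera -> world: x_w = R x_c + t
    R, t = c2w[:3, :3], c2w[:3, 3]
    x_world = x_cam @ R.T + t
    # Ray origin is the camera center; direction is normalized
    origins = np.broadcast_to(t, (N, 3))
    dirs = x_world - origins
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    return origins, dirs
```

For a pixel at the principal point with an identity pose, this produces a ray straight down the camera's optical axis, which is a quick sanity check.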

Part 2.2: Sampling

I implemented random ray sampling from multi-view images, adding 0.5 to UV coordinates so rays pass through pixel centers rather than corners. For each ray, I uniformly sampled 64 points between the near and far planes by dividing the ray into equal intervals. During training, I jittered each sample position by $t_i \leftarrow t_i + u$ with $u \sim \mathcal{U}(0, \Delta t)$, so the model sees every location along each ray instead of overfitting to a fixed set of depths.
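A numpy sketch of this stratified sampling, assuming the near/far bounds and 64-sample count quoted later for the Lego scene (function name and shapes are illustrative):

```python
import numpy as np

def sample_along_rays(origins, dirs, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Uniformly sample depths in [near, far), jittered during training."""
    step = (far - near) / n_samples
    t = near + np.arange(n_samples) * step          # left edge of each interval
    t = np.broadcast_to(t, (origins.shape[0], n_samples)).copy()
    if perturb:
        # t_i <- t_i + U(0, dt): jitter so any depth can be visited
        t += np.random.uniform(0.0, step, size=t.shape)
    # 3D sample points along each ray: x = o + t * d
    points = origins[:, None, :] + t[..., None] * dirs[:, None, :]
    return points, t
```

With `perturb=False` the samples sit at fixed interval edges (useful for deterministic validation renders); with `perturb=True` each interval is jittered independently per ray.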

Part 2.3: Putting the Dataloading All Together

I created a unified $\texttt{RaysData}$ class that precomputes ray origins and directions for all training images, then efficiently samples 10,000 random rays per iteration. This dataloader returns ray origins, directions, and corresponding ground truth RGB colors for batch training. I verified the implementation by visualizing camera frustums, rays, and 3D sample points using Viser.
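The core of that dataloader is simple: flatten the precomputed rays from all images into one pool and draw random batches, keeping origins, directions, and ground-truth colors index-aligned. A sketch under those assumptions (the class shape here is illustrative, not the project's exact `RaysData`):

```python
import numpy as np

class RaysData:
    """Flattens per-image rays into one pool and samples random batches.

    rays_o, rays_d : (n_images, H, W, 3) precomputed ray origins / directions
    pixels         : (n_images, H, W, 3) ground-truth RGB values
    """
    def __init__(self, rays_o, rays_d, pixels):
        self.rays_o = rays_o.reshape(-1, 3)
        self.rays_d = rays_d.reshape(-1, 3)
        self.pixels = pixels.reshape(-1, 3)

    def sample_rays(self, batch_size=10_000):
        # Same random indices for all three arrays keeps rays paired
        # with their supervising pixels.
        idx = np.random.randint(0, self.rays_o.shape[0], size=batch_size)
        return self.rays_o[idx], self.rays_d[idx], self.pixels[idx]
```

Precomputing rays once and indexing into flat arrays keeps each training iteration cheap, since no per-pixel geometry is recomputed.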

Rays Visualization 1
Camera Frustums and Rays - View 1
Rays Visualization 2
Camera Frustums and Rays - View 2
Rays Visualization 3
Camera Frustums and Rays - View 3

Part 2.4: Neural Radiance Field

I built an 8-layer MLP that takes positionally encoded 3D coordinates ($L=10$) and view directions ($L=4$) as input. The network uses a skip connection at layer 5, where the encoded position is re-concatenated with the hidden features, and has separate heads for density prediction (with ReLU) and view-dependent color prediction (with Sigmoid). The encoded view directions ($\mathbf{r}_d$) enter the RGB branch by concatenation at its second fully connected layer. This architecture allows the model to capture both geometric structure and view-dependent appearance effects.
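A PyTorch sketch of this architecture; the layer counts and encoding frequencies follow the description above, while the 128-unit RGB hidden layer is an assumption borrowed from the original NeRF design:

```python
import torch
import torch.nn as nn

def posenc(x, L):
    """Sinusoidal positional encoding with L frequencies (identity kept)."""
    out = [x]
    for k in range(L):
        out += [torch.sin(2**k * torch.pi * x), torch.cos(2**k * torch.pi * x)]
    return torch.cat(out, dim=-1)

class NeRF(nn.Module):
    def __init__(self, width=256, L_x=10, L_d=4):
        super().__init__()
        self.L_x, self.L_d = L_x, L_d
        in_x = 3 * (1 + 2 * L_x)            # encoded position size
        in_d = 3 * (1 + 2 * L_d)            # encoded direction size
        self.block1 = nn.Sequential(        # layers 1-4
            nn.Linear(in_x, width), nn.ReLU(),
            *[m for _ in range(3) for m in (nn.Linear(width, width), nn.ReLU())])
        self.block2 = nn.Sequential(        # layers 5-8, skip re-joins here
            nn.Linear(width + in_x, width), nn.ReLU(),
            *[m for _ in range(3) for m in (nn.Linear(width, width), nn.ReLU())])
        self.sigma_head = nn.Linear(width, 1)       # density (ReLU applied below)
        self.feature = nn.Linear(width, width)
        self.rgb_head = nn.Sequential(              # dirs join the RGB branch
            nn.Linear(width + in_d, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, x, d):
        ex, ed = posenc(x, self.L_x), posenc(d, self.L_d)
        h = self.block1(ex)
        h = self.block2(torch.cat([h, ex], dim=-1))     # skip connection
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([self.feature(h), ed], dim=-1))
        return rgb, sigma
```

Keeping density independent of view direction is what forces the model to explain specular effects through the color branch rather than by warping geometry.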

3D Neural Radiance Field Architecture

3D Neural Radiance Field Architecture

Part 2.5: Volume Rendering

I implemented the discrete volume rendering equation to composite colors and densities along rays into final pixel colors: $C(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i$. The function computes alpha values $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ and transmittance $T_i = \exp(-\sum_{j=1}^{i-1} \sigma_j \delta_j)$ using cumulative sums for proper alpha compositing. My implementation passed the provided torch.allclose() test, ensuring correct mathematical computation of the rendering integral.
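The rendering equation above fits in a few lines of torch; the exclusive cumulative sum computes $T_i$ from the densities accumulated before sample $i$ (tensor shapes here are illustrative):

```python
import torch

def volrender(sigmas, rgbs, step_size):
    """Discrete volume rendering: C(r) = sum_i T_i * alpha_i * c_i.

    sigmas : (n_rays, n_samples, 1) densities
    rgbs   : (n_rays, n_samples, 3) per-sample colors
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)            # alpha_i
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): shift the cumulative
    # sum right by one so sample i only sees densities before it.
    trans = torch.exp(-torch.cumsum(sigmas * step_size, dim=1))
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return (trans * alphas * rgbs).sum(dim=1)                # (n_rays, 3)
```

A hand-checkable case: with two samples of density $\ln 2$ and unit step size, each alpha is 0.5 and the second sample's transmittance is 0.5, so a red-then-green ray composites to (0.5, 0.25, 0).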

Training Progression

Training progression showing the NeRF learning to represent the Lego scene. The model was trained for 2000 epochs using Adam optimizer with learning rate $5 \times 10^{-4}$, sampling 10,000 rays per iteration and 64 points per ray. Near and far planes were set to $z_{\text{near}}=2.0$ and $z_{\text{far}}=6.0$ respectively. The network uses positional encoding with $L=10$ frequencies for 3D coordinates and $L=4$ frequencies for view directions, with 256 hidden units per layer.

200 Epochs
200 Epochs
400 Epochs
400 Epochs
800 Epochs
800 Epochs
1500 Epochs
1500 Epochs
Ground Truth
Ground Truth
Final
Final Result (2000 Epochs)

Training Metrics

Lego Training Curve
Training Loss and PSNR Curve

Spherical Rendering

Spherical rendering of the Lego scene:

Lego NeRF - Spherical Rendering

Part 2.6: Training NeRF on Own Data

Using the dataset I created in Part 0, I trained a NeRF on my own captured object. This involved adapting the network hyperparameters for real-world data, adjusting the near/far sampling bounds, and fine-tuning the training process to handle the challenges of real camera data compared to the synthetic Lego scene.

Hyperparameters for Real Data

Training Configuration:

Training Progression

Training progression showing the NeRF learning to represent my captured object. Fine details improve markedly as training progresses: the harmonium keys become increasingly well defined, and the yellow holes on the top near the ArUco tag become more visible and accurately rendered:

2000 Epochs
2000 Epochs
4000 Epochs
4000 Epochs
6000 Epochs
6000 Epochs
8000 Epochs
8000 Epochs
Ground Truth
Ground Truth
Final
Final Result

Training Metrics

Own Data Training Curve
Training Loss and PSNR Curve for Own Data

Spherical Rendering

Circular orbit rendering of my captured object:

Own Data NeRF - Circular Rendering
© 2025 Sukhamrit Singh. All rights reserved.