A Beginner's 12-Step Visual Guide to Understanding NeRF: Neural Radiance Fields for Scene Representation and Visual Synthesis | by Aqeel Anwar | January, 2025

Basic understanding of NeRF operation through visual representations

Who should read this article?
This article aims to provide a basic, beginner-level understanding of how NeRF works through visual illustrations. Although various blogs provide detailed explanations of NeRF, they are generally aimed at readers with a strong technical background in volume rendering and 3D graphics. In contrast, this article seeks to explain NeRF with only the minimum information required, with optional technical snippets at the end for curious readers. For those interested in the mathematical details behind NeRF, a list of further reading is provided at the end.
What is NeRF and How Does it Work?
NeRF, short for Neural Radiance Fields, is a 2020 paper that presents a novel method for rendering 2D images of 3D scenes. Traditional methods rely on physics-based, computationally intensive techniques such as ray tracing, which traces a ray of light from each pixel of the 2D image back to the particles in the scene to estimate that pixel's color. Although these methods provide high accuracy (e.g., images captured by phone cameras are very close to what the human eye sees from the same angle), they are often slow and require significant computational resources, such as GPUs, for parallel processing. As a result, using these methods on edge devices with limited computing power is almost impossible.
NeRF addresses this issue by acting as a scene compression method. It uses an overfitted multi-layer perceptron (MLP) to encode the scene information, which can then be queried from any viewpoint to produce a 2D rendered image. When properly trained, NeRF greatly reduces storage requirements; for example, a simple 3D scene can be compressed into roughly 5 MB of MLP weights.
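To make the idea of a scene compressed into an MLP concrete, here is a minimal PyTorch-style sketch of such a network (the layer sizes, encoding dimensions, and the name TinyNeRF are illustrative choices, not taken from the article): it maps an encoded 3D position and viewing direction to an RGB color and a volume density.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative NeRF-style MLP: encoded 3D position + viewing direction
    in, RGB color and volume density out."""
    def __init__(self, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        # Trunk processes the (positionally encoded) 3D location.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Density depends only on position; color also sees the view direction.
        self.sigma_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, encoded_pos, encoded_dir):
        feat = self.trunk(encoded_pos)
        sigma = torch.relu(self.sigma_head(feat))                     # density >= 0
        rgb = self.rgb_head(torch.cat([feat, encoded_dir], dim=-1))   # color in [0, 1]
        return rgb, sigma
```

The split of the two heads follows the original paper's design: volume density is a property of the point in space alone, while the color is allowed to change with the direction from which that point is viewed (which is what lets NeRF model reflections and other view-dependent effects).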
At its core, NeRF answers the following question using MLP:
What will I see if I look at the scene this way?
This question is answered by giving the viewing direction (expressed as two angles (θ, φ), or as a unit vector) to the MLP as input; the MLP outputs an RGB color and a volume density, which are then combined through volume rendering to produce the final RGB value seen by that pixel. To create an image of a given resolution (say H×W), the MLP is queried H×W times, once for each pixel's viewing direction, and the image is assembled. Since the release of the first NeRF paper, many follow-up works have improved the quality and speed of rendering. However, this blog will focus on the original NeRF paper.
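As a rough sketch of what "combined through volume rendering" means, the snippet below composites the per-sample colors and densities along one ray into a single pixel color, following the standard alpha-compositing quadrature used in the NeRF paper (the function name and tensor shapes are my own assumptions):

```python
import torch

def composite_ray(rgb, sigma, t_vals):
    """Composite per-sample colors and densities along one ray into a pixel color.

    rgb:    (N, 3) colors predicted by the MLP at N sample points along the ray
    sigma:  (N,)   volume densities at those points
    t_vals: (N,)   distances of the samples along the ray
    """
    # Distance between adjacent samples (last interval treated as very large).
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.tensor([1e10])])
    # Probability that the ray is absorbed within each interval.
    alpha = 1.0 - torch.exp(-sigma * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)  # final pixel color, shape (3,)
```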
Step 1: Input images for multiple views
NeRF requires multiple images taken from different viewing angles to compress a scene. The MLP learns to combine these images so that it can render unseen viewing directions (novel views). The viewpoint of each image is provided through its intrinsic and extrinsic camera matrices, sketched just below. The more images spanning diverse viewing directions, the better NeRF's reconstruction of the scene. In short, basic NeRF takes as input camera images and their associated intrinsic and extrinsic camera matrices. (You can read more about camera matrices in the blog below.)
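As a small illustrative sketch of these two matrices (the image size and focal length below are made-up values, not taken from the article): the intrinsic matrix describes the camera's internal geometry, while the extrinsic matrix describes its pose in the world.

```python
import numpy as np

# Intrinsic matrix K for a pinhole camera: focal length f (in pixels)
# and principal point at the image center (W/2, H/2). Values are illustrative.
H, W, f = 400, 400, 555.0
K = np.array([[f,   0.0, W / 2],
              [0.0, f,   H / 2],
              [0.0, 0.0, 1.0]])

# Extrinsic matrix: a 4x4 camera-to-world pose holding the camera's rotation
# (top-left 3x3 block) and position (last column). Identity is a placeholder.
c2w = np.eye(4)
```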
Steps 2 to 4: Sampling, Pixel iteration, and Ray casting
Each of the input images is processed independently (for simplicity). From the input, an image and its associated camera matrices are taken. For each pixel of the camera image, a ray is cast from the camera center through the pixel and extended outward into the scene. If the camera center is denoted o, and the viewing direction by the unit vector d, then the ray r can be written as r(t) = o + t·d, where t is the distance travelled along the ray. A minimal sketch of this per-pixel ray generation is given below.
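The sketch assumes a 3×3 intrinsic matrix K and a 4×4 camera-to-world pose c2w like the ones shown in Step 1, and an OpenGL-style camera that looks down its −z axis (a common convention in NeRF code; the article itself does not fix one). The function name get_ray is my own.

```python
import numpy as np

def get_ray(i, j, K, c2w):
    """Return the origin o and unit direction d of the ray through pixel (i, j)."""
    # Pixel direction in camera coordinates (x right, y up, camera looks down -z).
    dir_cam = np.array([(i - K[0, 2]) / K[0, 0],
                        -(j - K[1, 2]) / K[1, 1],
                        -1.0])
    # Rotate the direction into world coordinates and normalize it.
    d = c2w[:3, :3] @ dir_cam
    d = d / np.linalg.norm(d)
    # The ray starts at the camera center, i.e. the translation part of the pose.
    o = c2w[:3, 3]
    return o, d  # points on the ray are r(t) = o + t * d
```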