
Using Sun RGB-D: Indoor Scene Dataset with 2D & 3D Annotations



Simple Python code for accessing Sun RGB-D and similar datasets

3D understanding from 2D images is the first step into a larger world.

As many of the primitive tasks in computer vision approach a solved state — decent, quasi-general solutions now being available for image segmentation and text-conditioned generation, with general answers to visual question answering, depth estimation, and general object detection well on the way — I and many of my colleagues have been looking to use CV in larger tasks. When a human looks at a scene, we see more than flat outlines. We comprehend more than a series of labels. We can perceive and imagine within 3D spaces. We see a scene, and we can understand it in a very complete way. This capability should be within reach for CV systems of the day… If only we had the right data.

Sun RGB-D is an interesting image dataset from 2015 that satisfies many of the data needs of total scene understanding. It is a collection of primarily indoor scenes, captured with four different RGB-D sensors that pair a color camera with a depth scanner. The linked publication goes into greater detail on how the dataset was collected and what it contains. Most importantly, the dataset contains a wealth of data that includes both 2D and 3D annotations.

Source: SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite

With this dataset, CV and ML algorithms can learn much deeper (excuse the pun) features from 2D images. More than that, data like this could open opportunities for applying 3D reasoning to 2D images. But that is a story for another time. This article simply provides basic Python code for accessing the Sun RGB-D data, so that readers can use this wonderful resource in their own projects.

Dataset Layout

After downloading the dataset from here, you will end up with a directory structure like this.

These separate the data by the type of scanner used to collect it: the Intel RealSense 3D Camera for tablets, the Asus Xtion LIVE PRO for laptops, and the Microsoft Kinect versions 1 and 2 for desktops.

Source: SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite

Moving into “kv2”, we see two directories: align_kv2 and kinect2data. This is one problem with the Sun RGB-D dataset… its directory structure is not consistent across sensor types. In “realsense”, there are four directories containing data: lg, sa, sh, and shr. “xtion” has a still more complex directory structure. Worse, I have been unable to find a clear description of how these sub-directories differ anywhere in the dataset’s paper, supplementary materials, or website. If anyone knows the answer, please let me know!

For the time being though, let’s skip down into the consistent part of the dataset: the data records. For align_kv2, we have this:

For all of the data records across all of the sensor types, this part is largely consistent. Some important files to look at are described below:

  • annotation2Dfinal contains the most recent 2D annotations, including polygonal object segmentations and object labels. These are stored in a single JSON file which holds the x and y 2D coordinates of each point in each segmentation, as well as a list of object labels.
  • annotation3Dfinal holds the corresponding 3D annotations. These take the form of bounding shapes: polyhedra that are axis-aligned on the y (up-down) dimension. They are likewise stored in a single JSON file in the directory.
  • depth contains the raw depth images collected by the sensor. depth_bfx contains a cleaned-up copy that addresses some of the sensor’s limitations.
  • The original image can be found in the image directory. A full-resolution, uncropped version is in fullres.
  • Sensor extrinsics and intrinsics are saved in text files as numpy-like arrays. intrinsics.txt contains the intrinsics, while the extrinsics are stored in the single text file inside the extrinsics folder.
  • Finally, the type of scene (office, kitchen, bedroom, etc) can be found as a string in scene.txt.

Setup

First things first, we will need to read in files in a few formats, primarily JSON and plain text. From the text files, we need to pull out a numpy array for both the extrinsics and intrinsics of the sensor. There are also a lot of files here that don’t seem to follow a strict naming convention but will be the only file of their type in a given directory, so get_first_file_path will be useful here.

https://medium.com/media/18105a952d3bc6b42b344724201a14e5/href
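In case the embedded gist does not render here, the following is a minimal sketch of what these helpers can look like. The function names (read_json, read_txt_array, get_first_file_path) follow the prose above, but the original code may be organized differently:

```python
import json
import os

import numpy as np


def read_json(path):
    # Load a JSON annotation file into a Python dict.
    with open(path, "r") as f:
        return json.load(f)


def read_txt_array(path):
    # Parse a whitespace-separated text file (e.g. intrinsics.txt)
    # into a numpy array, one row per non-empty line.
    with open(path, "r") as f:
        rows = [[float(v) for v in line.split()] for line in f if line.strip()]
    return np.array(rows)


def get_first_file_path(dir_path):
    # Return the path of the first (often only) file in a directory; handy
    # when filenames do not follow a strict naming convention.
    for name in sorted(os.listdir(dir_path)):
        full = os.path.join(dir_path, name)
        if os.path.isfile(full):
            return full
    return None
```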

I’d also like this code to output a simple 3D model of the rooms we find in the dataset. This can give us some easy data visualization, and lets us distill down the basic spatial features of a scene. To achieve this, we’ll utilize the OBJ file format, a standard for representing 3D geometry. An OBJ file primarily consists of lists of vertices (points in 3D space), along with information on how these vertices are connected to form faces (the surfaces of the 3D object). The layout of an OBJ file is straightforward, beginning with vertices, each denoted by a line starting with ‘v’ followed by the x, y, and z coordinates of the vertex. Faces are then defined by lines starting with ‘f’, listing the indices of the vertices that form each face’s corners, thus constructing the 3D surface.

In our context, the bounding shapes that define the spatial features of a scene are polyhedra, 3D shapes with flat faces and straight edges. Given that the y dimension is axis-aligned — meaning it consistently represents the up-down direction across all points — we can simplify the representation of our polyhedron using only the x and z coordinates for defining the vertices, along with a global minimum (min_y) and maximum (max_y) y-value that applies to all points. This approach assumes that vertices come in pairs where the x and z coordinates remain the same while the y coordinate alternates between min_y and max_y, effectively creating vertical line segments.

The write_obj function encapsulates this logic to construct our 3D model. It starts by iterating over each bounding shape in our dataset, adding vertices to the OBJ file with their x, y, and z coordinates. For each pair of points (with even indices representing min_y and odd indices representing max_y where x and z are unchanged), the function writes face definitions to connect these points, forming vertical faces around each segment (e.g., around vertices 0, 1, 2, 3, then 2, 3, 4, 5, and so on). If the bounding shape has more than two pairs of vertices, a closing face is added to connect the last pair of vertices back to the first pair, ensuring the polyhedron is properly enclosed. Finally, the function adds faces for the top and bottom of the polyhedron by connecting all min_y vertices and all max_y vertices, respectively, completing the 3D representation of the spatial feature.

https://medium.com/media/caae3e0f8082d3a7400a0d16dda706df/href
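If the embed above does not render, here is a sketch of write_obj that follows the description: each bounding shape is assumed to arrive as a list of (x, y, z) vertices in the min_y/max_y pair order described, and the original gist may differ in details.

```python
def write_obj(path, shapes):
    # shapes: list of vertex lists; each vertex is (x, y, z), and vertices come
    # in pairs sharing x/z, with even indices at min_y and odd indices at max_y.
    with open(path, "w") as f:
        offset = 0  # OBJ vertex indices are global across the file and 1-based
        for verts in shapes:
            for x, y, z in verts:
                f.write(f"v {x} {y} {z}\n")
            n_pairs = len(verts) // 2
            # Side faces: connect each vertical pair to the next one.
            for i in range(n_pairs - 1):
                a = offset + 2 * i + 1        # bottom of pair i
                b = a + 1                     # top of pair i
                c = offset + 2 * (i + 1) + 1  # bottom of pair i + 1
                d = c + 1                     # top of pair i + 1
                f.write(f"f {a} {b} {d} {c}\n")
            if n_pairs > 2:
                # Closing side face from the last pair back to the first.
                a = offset + 2 * (n_pairs - 1) + 1
                b = a + 1
                f.write(f"f {a} {b} {offset + 2} {offset + 1}\n")
            # Bottom face (all min_y vertices) and top face (all max_y vertices).
            bottom = " ".join(str(offset + 2 * i + 1) for i in range(n_pairs))
            top = " ".join(str(offset + 2 * i + 2) for i in range(n_pairs))
            f.write(f"f {bottom}\n")
            f.write(f"f {top}\n")
            offset += len(verts)
```

The running offset is needed because OBJ face definitions index into one global, 1-based vertex list shared by every shape in the file.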

Finally, let’s build the basic structure of our dataset, with a class that represents the dataset (a directory with subdirectories, each containing a data record) and a class for the data records themselves. The first object has a very simple job: it creates a new record object for every sub-directory within ds_dir.

https://medium.com/media/74ac39852562f1946bd2b71325c532b1/href
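As a rough sketch, reusing the helpers from earlier (the class names here are my own; the original gist may name and structure things differently), the dataset object simply walks ds_dir and builds one record per subdirectory, while the record object remembers paths and parses the annotation files:

```python
import os


class SunRGBDDataset:
    # A dataset directory whose subdirectories are individual data records.
    def __init__(self, ds_dir):
        self.ds_dir = ds_dir
        self.records = []
        for name in sorted(os.listdir(ds_dir)):
            record_dir = os.path.join(ds_dir, name)
            if os.path.isdir(record_dir):
                self.records.append(SunRGBDRecord(record_dir))

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        return self.records[i]


class SunRGBDRecord:
    # One data record: stores file paths and the parsed annotation dicts.
    def __init__(self, record_dir):
        self.record_dir = record_dir
        self.image_path = get_first_file_path(os.path.join(record_dir, "image"))
        self.depth_path = get_first_file_path(os.path.join(record_dir, "depth"))
        self.intrinsics = read_txt_array(os.path.join(record_dir, "intrinsics.txt"))
        self.extrinsics = read_txt_array(
            get_first_file_path(os.path.join(record_dir, "extrinsics")))
        self.ann2d = read_json(
            get_first_file_path(os.path.join(record_dir, "annotation2Dfinal")))
        self.ann3d = read_json(
            get_first_file_path(os.path.join(record_dir, "annotation3Dfinal")))
        with open(os.path.join(record_dir, "scene.txt")) as f:
            self.scene_type = f.read().strip()
```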

Accessing 2D Segmentations

Accessing 2D segmentation annotations is easy enough. We load the JSON file in annotation2Dfinal. Once it is loaded as a Python dict, we can extract the segmentation polygons for each object in the scene. These polygons are defined by their x and y coordinates, representing the vertices of the polygon in 2D image space.

We also extract the object label by storing the object ID that each bounding shape contains, then cross-referencing with the ‘objects’ list. Both the labels and segmentations are returned by get_segments_2d.

https://medium.com/media/3928afeefd79c9cf315cfb3cc49e4a68/href
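A hedged sketch of get_segments_2d, assuming the 2D annotation JSON stores polygons under a ‘frames’ list with parallel ‘x’/‘y’ arrays and an ‘object’ index into the top-level ‘objects’ list; check these keys against your own copy of the data:

```python
import numpy as np


def get_segments_2d(ann2d):
    # ann2d: dict loaded from the annotation2Dfinal JSON file.
    # Key names are assumptions about the layout described above.
    labels, segments = [], []
    objects = ann2d.get("objects", [])
    for frame in ann2d.get("frames", []):
        for poly in frame.get("polygon", []):
            coords = np.array([poly["x"], poly["y"]])  # shape (2, n_points)
            segments.append(coords.T)                  # shape (n_points, 2)
            idx = poly["object"]
            obj = objects[idx] if idx < len(objects) else None
            labels.append(obj.get("name", "unknown") if isinstance(obj, dict) else "unknown")
    return labels, segments
```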

Note that the transpose operation is applied to the coordinates array to shift the data from a shape that groups all x coordinates together and all y coordinates together into a shape that groups each pair of x and y coordinates together as individual points.
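As a toy illustration of that reshaping:

```python
import numpy as np

xy = np.array([[10, 20, 30],   # all x coordinates
               [5, 15, 25]])   # all y coordinates
points = xy.T                  # [[10, 5], [20, 15], [30, 25]]: one (x, y) point per row
```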

Accessing 3D Bounding Shapes

Accessing the 3D bounding shapes is a bit harder. As mentioned before, they are stored as y-axis-aligned polyhedra (x is left-right, z is forward-back, y is up-down). In the JSON, each is stored as a 2D polygon plus a min_y and max_y. This can be expanded into a polyhedron by taking each 2D point of the polygon and adding two 3D points, one at min_y and one at max_y.

The JSON also provides a useful field which states whether the bounding shape is rectangular. I have preserved this in our code, along with functions to get the type of each object (couch, chair, desk, etc), and the total number of objects visible in the scene.

https://medium.com/media/912d80455d4a9cabe2d54bcc001481a2/href
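A sketch of the 3D extraction under the same caveat: the key names used here (‘objects’, ‘polygon’, ‘X’, ‘Z’, ‘Ymin’, ‘Ymax’) reflect the layout described above and should be verified against your copy of the annotations:

```python
def get_bounding_shapes_3d(ann3d):
    # ann3d: dict loaded from the annotation3Dfinal JSON file.
    # Each object's 2D (X, Z) polygon plus Ymin/Ymax is expanded into the
    # vertical pairs of vertices that write_obj above expects.
    # (The annotation also flags whether a shape is rectangular; that field
    # is left out of this sketch.)
    labels, shapes = [], []
    for obj in ann3d.get("objects", []):
        if not obj or not obj.get("polygon"):
            continue
        poly = obj["polygon"][0]
        min_y, max_y = poly["Ymin"], poly["Ymax"]
        verts = []
        for x, z in zip(poly["X"], poly["Z"]):
            verts.append((x, min_y, z))  # bottom of the vertical pair
            verts.append((x, max_y, z))  # top of the vertical pair
        labels.append(obj.get("name", "unknown"))
        shapes.append(verts)
    return labels, shapes
```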

Accessing the Room Layout

Finally, the room layout has its own polyhedron that encapsulates all others. This can be used by algorithms to understand the broader topology of the room including the walls, ceiling, and floor. It is accessed in much the same way as the other bounding shapes.

https://medium.com/media/ae5a322164f7181cef3fae82616e9e0b/href
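A sketch of the room-layout extraction in the same style; the ‘room’ key used here is an assumption about where the enclosing polyhedron lives in the JSON, with the polygon itself assumed to use the same Ymin/Ymax form as the object shapes:

```python
def get_room_layout(ann3d):
    # Returns the room's enclosing polyhedron as a list of (x, y, z) vertices,
    # or None if no layout polygon is found.
    room = ann3d.get("room") or {}
    polygons = room.get("polygon") or []
    if not polygons:
        return None
    poly = polygons[0]
    min_y, max_y = poly["Ymin"], poly["Ymax"]
    verts = []
    for x, z in zip(poly["X"], poly["Z"]):
        verts.append((x, min_y, z))
        verts.append((x, max_y, z))
    return verts
```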

Full Code

Below is the full code with a short testing section. Besides visualizing the 2D annotations from one of the data records, we also save 3D .obj files for each identified object in the scene. You can use a program like MeshLab to visualize the output. The sensor intrinsics and extrinsics are also extracted here. Intrinsics refer to the internal camera parameters that affect the imaging process (like focal length, optical center, and lens distortion), while extrinsics describe the camera’s position and orientation in a world coordinate system. Both are important for accurately mapping and interpreting 3D scenes from 2D images.

https://medium.com/media/86f4b98ae7c587c402416e7a89ec7e26/href
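A hedged usage sketch tying together the pieces sketched above; the dataset path and all names here are hypothetical stand-ins for the author’s full script:

```python
if __name__ == "__main__":
    # Assumed directory layout: a sensor subfolder containing record directories.
    ds = SunRGBDDataset("SUNRGBD/kv2/align_kv2")
    record = ds[0]

    print("Scene type:", record.scene_type)
    print("Intrinsics:\n", record.intrinsics)
    print("Extrinsics:\n", record.extrinsics)

    labels_2d, segments_2d = get_segments_2d(record.ann2d)
    print("2D objects:", labels_2d)

    labels_3d, shapes_3d = get_bounding_shapes_3d(record.ann3d)
    room = get_room_layout(record.ann3d)
    if room is not None:
        shapes_3d.append(room)
    write_obj("scene.obj", shapes_3d)  # open the result in MeshLab
```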

Code is also available here: https://github.com/arcosin/Sun-RGDB-Data-Extractor.

This repo may or may not be updated in the future. I would love to add functionality for accessing this as a PyTorch dataset with minibatches and such. If anyone has some easy updates, feel free to make a PR.

Left: the simple 3D representation of the scene shown in MeshLab. Note the transparent room bounding shape and the many objects represented as boxes. Right: the original image.

Conclusion

I hope this guide has been helpful in showing you how to use the Sun RGB-D Dataset. More importantly, I hope it’s given you a peek into the broader skill of writing quick and easy code to access datasets. Having a tool ready to go is great, but understanding how that tool works and getting familiar with the dataset’s structure will serve you better in most cases.

Extra Notes

This article has introduced some easy-to-modify Python code for extracting data from the Sun RGB-D dataset. Note that an official MATLAB toolbox for this dataset already exists, but I don’t use MATLAB, so I didn’t look at it. If you are a MATLABer (MATLABster? MATLABradour? eh…) then that might be more comprehensive.

I also found this for Python. It’s a good example of extracting only the 2D features. I borrowed some lines from it, so go throw it a star if you feel up to it.

References

This article utilizes the Sun RGB-D dataset [1] licensed under CC-BY-SA. This dataset also draws data from previous work [2, 3, 4]. Thank you to them for their outstanding contributions.

[1] S. Song, S. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite,” Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR2015), Oral Presentation.

[2] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, “Indoor segmentation and support inference from RGBD images,” ECCV, 2012.

[3] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, T. Darrell, “A category-level 3-D object dataset: Putting the Kinect to work,” ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

[4] J. Xiao, A. Owens, A. Torralba, “SUN3D: A database of big spaces reconstructed using SfM and object labels,” ICCV, 2013.


Using Sun RGB-D: Indoor Scene Dataset with 2D & 3D Annotations was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

