Alternating the left and right views gives the illusion of depth through structure from motion.
In the ideal depth image, white areas are closer than the darker areas.
Architecting the Neural Network
Originally I started with much larger images (320x240), with the thought that the higher resolution would let the neural net judge depth more precisely. However, I quickly found that the dense layer I wanted in the neural net required far too many weights at that resolution: 320 x 240 [pixel inputs] x 32 [features from the previous convolutional layer] x 320 x 240 [dense layer nodes] = ~189 billion weights. Multiply by 4 bytes and that is close to 750 GB of weights for one layer! So, I shrank the images down to 100x100 pixels and also removed the color, making my inputs grayscale.

With the smaller inputs, I was able to start searching for a good network architecture. I tried the following, in order, using rectified linear nodes throughout, on the 100x100 pixel images:
- 32-feature, 5x5 kernel convolutional layer; 16-feature, 1x16 kernel convolutional layer; final output flattened by max pooling
- 32-feature, 5x5 kernel convolutional layer; 100-node dense layer; 100x100 dense layer
- 64-feature, 3x3 kernel convolutional layer; 16-feature, 15x15 kernel convolutional layer; 100-feature, 31x31 kernel convolutional layer; final output flattened by max pooling
- 10-feature, 7x7 kernel convolutional layer; 10-feature, 29x29 kernel convolutional layer; 1-feature, 3x3 kernel convolutional layer
- 32-feature, 5x5 kernel convolutional layer; 2x2 max pool; 32-feature, 3x3 kernel convolutional layer; 1024-node dense layer; 1024-node dense layer; final output reshaped to the output image size
- 32-feature, 5x5 kernel convolutional layer; 32-feature, 5x5 kernel convolutional layer; 1024-node dense layer; 1024-node dense layer; final output reshaped to the output image size (this last one is sketched in code below)
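For reference, here is a minimal Keras-style sketch of that last architecture, the one dissected layer by layer below. Several details are assumptions rather than things stated above: the left and right 100x100 grayscale images stacked as two input channels, 'valid' padding on the convolutions, and the Adam/MSE training setup.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(100, 100, 2)),             # left/right grayscale images stacked as 2 channels (assumption)
        layers.Conv2D(32, (5, 5), activation='relu'),  # first layer: 32-feature, 5x5 kernel convolution
        layers.Conv2D(32, (5, 5), activation='relu'),  # second layer: 32-feature, 5x5 kernel convolution
        layers.Flatten(),
        layers.Dense(1024, activation='relu'),         # third layer: 1024-node (32x32) dense
        layers.Dense(1024, activation='relu'),         # fourth layer: 1024-node (32x32) dense
        layers.Reshape((32, 32, 1)),                   # reshape the final output to a 32x32 depth image
    ])
    model.compile(optimizer='adam', loss='mse')        # optimizer and loss are placeholders, not stated above
    model.summary()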
The top row is the stereo image pair, the bottom left is the ideal image, and the bottom right is the neural net output after training.
The neural net does not reproduce the ideal image very closely. In part, this is due to the much smaller resolution: the displacement of the polygon corners between the two views is at most 3 pixels, which gives the neural net only about 4 levels of depth, whereas the ideal image has values of 0-255. There is also a blurriness that I cannot explain and that needs more exploration.
With a reasonable neural net found as a starting point, I began to dissect it. What do the kernel weights look like? The dense layers? What does the output of each layer look like? What does each layer do?
The First Layer - 32x5x5 Convolutional
The first layer detects both features and the small shifts between the images.
Visualizing the weights of the first layer. Black close to -1, white close to 1, and gray about 0.
Visualizing the output of the first layer, gamma adjusted for more contrast.
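Something like the following matplotlib sketch can produce a kernel grid of this kind. The model filename is hypothetical, the layer index assumes the architecture sketch above, and only the first input channel of each kernel is drawn.

    import matplotlib.pyplot as plt
    import tensorflow as tf

    # Load a trained model (hypothetical filename) and grab the first convolutional layer.
    model = tf.keras.models.load_model('stereo_depth.keras')
    kernels, _bias = model.layers[0].get_weights()   # shape (5, 5, input_channels, 32)

    fig, axes = plt.subplots(4, 8, figsize=(8, 4))
    for i, ax in enumerate(axes.flat):
        # Black is close to -1, white close to 1, gray about 0.
        ax.imshow(kernels[:, :, 0, i], cmap='gray', vmin=-1, vmax=1)
        ax.axis('off')
    plt.tight_layout()
    plt.show()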
The Second Layer - 32x5x5 Convolutional
The second layer quickly becomes harder to interpret, because its features look for combinations of the features detected in the first layer.
Visualizing the weights of the second layer. Each row is a feature, each column is a "channel", or feature from the previous layer, within that feature.
Visualizing the output of the second layer, gamma adjusted for more contrast.
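The feature-by-channel grid above can be tiled along these lines (again a sketch; the filename and layer index carry the same assumptions as the previous snippet).

    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf

    model = tf.keras.models.load_model('stereo_depth.keras')  # hypothetical filename
    w, _bias = model.layers[1].get_weights()                   # shape (5, 5, 32, 32): (kh, kw, in_channel, feature)

    # One row per feature, one column per input channel from the first layer.
    rows = [np.hstack([w[:, :, c, f] for c in range(w.shape[2])]) for f in range(w.shape[3])]
    grid = np.vstack(rows)

    plt.figure(figsize=(12, 12))
    plt.imshow(grid, cmap='gray', vmin=-1, vmax=1)
    plt.axis('off')
    plt.show()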
The Third Layer - 1024 (32x32) Dense
The third layer's weights are much sparser. Visualizing the weights of the dense layer produced a HUGE 1088x34816 image, so I took an excerpt from it. I noted that the nodes associated with the edges of the image were much sparser than those associated with the middle, implying that the edges carry less information than the middle. Given that the polygon generator tried to keep the polygons within the view, this is not surprising, as it biased the content toward the center.
Excerpt of 4 nodes from the center of the image, with each row being a node. The 32 columns correspond to the 32 features from the second layer.
Distribution of the weights in the dense layer, which takes a roughly Gaussian shape centered at 0. The spike at the black end is caused by the black margins in the weight image.
Output of the third layer, with black being closer to -1 and white closer to 1. The dense layer is abstracting the image, and reminds me of Wonka Vision.
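The weight distribution shown above takes only a few lines to reproduce (same hypothetical model file; the dense layer index follows the architecture sketch earlier).

    import matplotlib.pyplot as plt
    import tensorflow as tf

    model = tf.keras.models.load_model('stereo_depth.keras')  # hypothetical filename
    w, _bias = model.layers[3].get_weights()                   # third layer: the first Dense(1024)

    plt.hist(w.ravel(), bins=200)
    plt.xlabel('weight value')
    plt.ylabel('count')
    plt.title('Third layer (dense) weight distribution')
    plt.show()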
The Fourth Layer - 1024 (32x32) Dense
The fourth layer reconstructs this statistical abstraction back into the output depth image.
Excerpt of the weights of 5 nodes from the fourth layer, with each row being a node. Gray is close to 0, with black being closer to -1.
Distribution of the weights in the fourth layer. They are skewed toward negative values, perhaps to exploit the non-linear part of the rectified linear units, which output only positive values.
The fourth layer's output, which is the final output of the network.
Further Explorations
Does the third layer need so many nodes, given that its weights seem to be so sparse? Reducing the third layer would greatly reduce the number of weights in the network, and thus its computational complexity (a rough estimate follows after these questions). With the saved weights of the model taking 132 MB for a 32x32 output image, I would not want to try to use this on a real-time embedded system for a robot.

Where does the blurriness come from? Could longer training help? The error was decreasing only very slowly when I stopped the training process, but it was still descending.
Could rectangular, wider convolutional filters help pick up more spatial disparity?
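For a rough sense of where the weights go, here is a back-of-the-envelope count. The assumption that 32 feature maps of 32x32 feed the dense layers is an inference from the dense-layer weight image (32 channels of 32x32 weights per node); the exact numbers depend on padding and pooling choices not spelled out above.

    # Rough weight counts for the two dense layers (float32, biases ignored).
    dense_inputs = 32 * 32 * 32          # assumed: 32 feature maps of 32x32 feeding the dense layers
    third_layer  = dense_inputs * 1024   # ~33.6 M weights
    fourth_layer = 1024 * 1024           # ~1.0 M weights
    print((third_layer + fourth_layer) * 4 / 1e6, "MB")   # ~138 MB, close to the 132 MB observed

    # Shrinking the third layer to 256 nodes would cut the dominant term by 4x.
    smaller = (dense_inputs * 256 + 256 * 1024) * 4
    print(smaller / 1e6, "MB")                            # ~35 MB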










