Project 2: Deep Learning for Image Colorization
Due: Thurs Nov 3 (11:59 PM)
This project involves automatic colorization of input images. The goal is to train a ConvNet regressor inspired by the SIGGRAPH 2016 paper on deep colorization.
Example of colorization for a face using a simple CNN, trained on the Labeled Faces in the Wild dataset. Left: grayscale input. Right: colorized. Note that the face is given reasonable color, but the hand is not: this dataset lacks enough training examples of hands to learn hand coloration.
We suggest you use Torch, which is built on Lua (more precisely, LuaJIT). However, you are free to use any deep learning framework that you like.
One option is to install Torch locally on your machine if you have a Mac or a Linux machine (on Windows, you can install a Linux VM). Another option is to use the Torch variants that are pre-installed on the following machines, which you should be able to SSH to using your compute ID:
- cuda1.cs.virginia.edu (must first SSH to power1 to get to cuda1)
- The artemis compute nodes, which can be accessed through SLURM. The artemis compute nodes have GPUs, which must be requested with the SLURM option --gres=gpu (e.g. srun --gres=gpu th yourscript.lua).
You can run th to open a LuaJIT/Torch shell. Note that on the department machines you should not use the luajit command, because it is broken. Additionally, some people report that it is possible to use Torch on the AWS free tier, but it requires some configuration.
We provide two datasets:
These datasets can be loaded into LuaJIT by downloading the load_images.lua file, then calling from LuaJIT:
load_images = require 'load_images'
images = load_images.load('face_images', 750)
The first argument of load() is the subdirectory containing image datafiles (unzipped from one of the datasets above to the current directory), and the second argument is the number of images in the subdirectory. The resulting tensor is size nimages x channels x height x width, where nimages is the number of images, channels is 3 (for the RGB colors), and height and width are both 128.
The goal is to create a LuaJIT script that will train and eventually test an automatic deep image colorization model. We break this down into a number of steps, which should make it easier to debug any partially completed system you build:
- (10 points). Load an image dataset into Torch using the above loader utility (the images are loaded as a Tensor). Randomly shuffle your dataset using torch.randperm. To reduce memory requirements, we ask that you please set the default Torch datatype to 32-bit float with the following command at the top of your program (before calling the loader):
torch.setdefaulttensortype('torch.FloatTensor')
- (15 points). To reduce overfitting, augment your dataset by a small factor such as 10, by using the image package to transform your original images. You can do this by creating a torch Tensor that is 10x larger along the first dimension than the original image dataset, but has the same sizes for the other dimensions. You can use a Lua for loop to populate this tensor with augmented copies of your original input images. Include in your dataset augmentation horizontal flips, random crops, and scalings of the input RGB values by a single scalar randomly chosen from the range [0.6, 1.0] (the same scalar should be applied to each of the R, G, B channels). The crop operation can first crop the image, and then resize the cropped image to be the same size as the input. It should be possible for all three "augmentation" operations to be applied to a single input image.
- (10 points). Convert your images to L*a*b* color space. This color space is an alternative to the ordinary RGB color space. L*a*b* color space separates luminance (in channel 1, L*) information from color or chrominance information (in channels 2 and 3, a* b*). You can do this by creating a torch Tensor class that is the same shape as the previous augmented dataset. There are commands in the image package for color space conversion, in particular, image.rgb2lab and image.lab2rgb.
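The data preparation in the steps above (shuffle, augment, convert to L*a*b*) could be sketched roughly as follows. This is a minimal sketch, not a reference solution: the helper names augment, shuffle, and build_lab_dataset are hypothetical, the crop range (keeping at least 80% of each side) is an arbitrary choice, and the RGB values are assumed to lie in [0, 1].

```lua
require 'torch'
require 'image'

torch.setdefaulttensortype('torch.FloatTensor')

-- Hypothetical helper: one randomly augmented copy of a 3 x H x W RGB image.
local function augment(img)
   local h, w = img:size(2), img:size(3)
   local out = img
   -- Horizontal flip with probability 0.5.
   if torch.uniform() < 0.5 then
      out = image.hflip(out)
   end
   -- Random crop keeping at least 80% of each side (arbitrary choice),
   -- then resize back to the input size. image.crop uses 0-based corners.
   local cw = torch.random(math.floor(0.8 * w), w)
   local ch = torch.random(math.floor(0.8 * h), h)
   local x1 = torch.random(0, w - cw)
   local y1 = torch.random(0, h - ch)
   out = image.scale(image.crop(out, x1, y1, x1 + cw, y1 + ch), w, h)
   -- One shared scalar in [0.6, 1.0] applied to R, G, and B.
   return out:mul(torch.uniform(0.6, 1.0))
end

-- Hypothetical helper: shuffle an nimages x 3 x H x W tensor along dimension 1.
local function shuffle(images)
   local perm = torch.randperm(images:size(1)):long()
   return images:index(1, perm)
end

-- Hypothetical helper: build a dataset `factor` times larger, in L*a*b* space.
local function build_lab_dataset(images, factor)
   local n, h, w = images:size(1), images:size(3), images:size(4)
   local lab = torch.Tensor(n * factor, 3, h, w)
   for i = 1, n do
      for k = 1, factor do
         lab[(i - 1) * factor + k]:copy(image.rgb2lab(augment(images[i])))
      end
   end
   return lab
end
```

With the loader from above, usage might look like build_lab_dataset(shuffle(images), 10). Since the test images must not be augmented, the held-out part would need to be split off before calling build_lab_dataset on the training part.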
- (15 points). As a sanity check, build a simple regressor using convolutional layers that predicts the mean chrominance values for the entire input image. This regressor is given a grayscale image as input (only the L* channel) and predicts the mean a* and mean b* values, taken across all pixels of the image and ignoring pixel location, so it outputs only 2 scalars. You should be able to train this, see the training error decrease, and find that it predicts some color information. To get started, you may want to check out examples of training simple fully connected networks, as well as a more complicated example that includes ConvNets. Once you have this working, make a copy of this code so that you can submit it later.
The way I did this was to create a network containing 7 modules, each module consisting of a SpatialConvolution layer followed by a ReLU activation function*. For the SpatialConvolution layer I made sure to set the padding and stride appropriately so that the image after convolution is exactly half the size of the input image. This way, the sizes of the images as they go through the CNN are decreasing powers of two: 128, 64, 32, 16, .... I also used a small number of feature maps (3) in the hidden layers for now, since I only wanted to get the network working at this point. Additionally, I scaled the input L* channel (which originally ranges from 0 to 100) to the range [0, 1].
* In the last layer you can optionally use a Tanh unit instead of ReLU, similar to the SIGGRAPH paper. In this case you will have to rescale the a* and b* channels to be in the range -1 to 1 (and then scale them back during testing).
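One way the 7-module network described above might look in Torch is sketched below. It is only a sketch under assumptions: the function names are hypothetical, the last module uses the optional Tanh, and the a*/b* targets are divided by 128 to land in roughly [-1, 1] (an assumed scaling, not one the assignment prescribes). A 3x3 kernel with stride 2 and padding 1 halves an even-sized input, giving 128, 64, 32, 16, 8, 4, 2, 1.

```lua
require 'torch'
require 'nn'

-- Hypothetical builder: 7 stride-2 conv modules, ending in 2 outputs (mean a*, mean b*).
local function build_mean_chroma_net(nfeat)
   local net = nn.Sequential()
   local nin = 1
   for layer = 1, 7 do
      local last = (layer == 7)
      local nout = last and 2 or nfeat
      -- floor((in + 2*1 - 3) / 2) + 1 = in / 2 for even in.
      net:add(nn.SpatialConvolution(nin, nout, 3, 3, 2, 2, 1, 1))
      net:add(last and nn.Tanh() or nn.ReLU())
      nin = nout
   end
   net:add(nn.View(-1, 2))  -- N x 2 x 1 x 1 -> N x 2
   return net
end

-- Mean a*, b* over all pixels of each image, scaled by 1/128 (assumed scaling).
local function mean_chroma_targets(lab)
   local ab = lab:narrow(2, 2, 2)
   return ab:mean(4):mean(3):view(-1, 2):div(128)
end

-- One plain SGD step on a mini-batch; returns the loss before the update.
local function train_step(net, criterion, input, target, lr)
   net:zeroGradParameters()
   local pred = net:forward(input)
   local loss = criterion:forward(pred, target)
   net:backward(input, criterion:backward(pred, target))
   net:updateParameters(lr)
   return loss
end

-- Usage on a stand-in batch (random values, not real L*a*b* data).
local lab = torch.rand(4, 3, 128, 128)
local net = build_mean_chroma_net(3)
local L = lab:narrow(2, 1, 1):clone():div(100)  -- L* from [0, 100] to [0, 1]
local loss = train_step(net, nn.MSECriterion(), L, mean_chroma_targets(lab), 0.01)
```

At test time the predictions would be multiplied by 128 again to get back to a*/b* units.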
- (15 points). Extend the previous network to colorize the image by including upsampling/deconvolution layers (such as SpatialFullConvolution or SpatialUpSamplingNearest). The image size output by your network should match the image size input to your network, except that it should have two color channels (a* and b*) for the output versus only 1 color channel (L*) for the input.
One reasonable architecture is to reduce the number of downsampling layers to N and then also use N upsampling/deconvolution layers. My current architecture uses N = 5 (but you can experiment with what gives the best results). For my current architecture, the spatial resolutions at different parts of the ConvNet are therefore 128, 64, 32, 16, 8, 4, 8, 16, 32, 64, 128.
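The N-down, N-up layout described above could be sketched as follows. This is a sketch under assumptions rather than the architecture actually used for the course: build_colorizer is a hypothetical name, and the kernel sizes are one choice that makes the sizes work out. A 4x4 SpatialFullConvolution with stride 2 and padding 1 produces (in - 1) * 2 - 2 + 4 = 2 * in pixels per side, so each upsampling module exactly doubles the resolution.

```lua
require 'torch'
require 'nn'

-- Hypothetical builder: N stride-2 convolutions down, then N full convolutions up.
-- Input:  batch x 1 x H x W (L*), output: batch x 2 x H x W (a*, b*).
local function build_colorizer(N, nfeat)
   local net = nn.Sequential()
   local nin = 1
   for i = 1, N do
      -- Halves the spatial size: 128 -> 64 -> 32 -> ...
      net:add(nn.SpatialConvolution(nin, nfeat, 3, 3, 2, 2, 1, 1))
      net:add(nn.ReLU())
      nin = nfeat
   end
   for i = 1, N do
      local last = (i == N)
      local nout = last and 2 or nfeat
      -- Doubles the spatial size: ... -> 32 -> 64 -> 128
      net:add(nn.SpatialFullConvolution(nin, nout, 4, 4, 2, 2, 1, 1))
      -- Tanh on the last module assumes a*, b* targets rescaled to [-1, 1].
      net:add(last and nn.Tanh() or nn.ReLU())
      nin = nout
   end
   return net
end

local net = build_colorizer(5, 8)
local output = net:forward(torch.rand(2, 1, 128, 128))
print(output:size())
```

The same size check is a cheap way to catch padding or stride mistakes before starting a long training run.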
- (10 points). The previous network will train and generalize better if it incorporates batch normalization. So incorporate batch normalization (e.g. using the SpatialBatchNormalization layer in Torch). This can be inserted directly after each SpatialConvolution layer. However, note that the SpatialBatchNormalization layer requires 4D tensor inputs, so you have to divide your training dataset into mini-batches of, say, 10 images each (so the inputs are size nbatchsize x 1 x height x width, and the outputs are size nbatchsize x 2 x height x width).
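As a rough illustration of this step, the sketch below places SpatialBatchNormalization directly after a SpatialConvolution layer and slices a dataset into mini-batches with narrow. The helper names conv_bn_relu and get_batch are hypothetical, and the kernel, stride, and padding values are assumptions.

```lua
require 'torch'
require 'nn'

-- Hypothetical block: stride-2 convolution, then batch normalization, then ReLU.
local function conv_bn_relu(nin, nout)
   local block = nn.Sequential()
   block:add(nn.SpatialConvolution(nin, nout, 3, 3, 2, 2, 1, 1))
   block:add(nn.SpatialBatchNormalization(nout))  -- expects 4D input
   block:add(nn.ReLU())
   return block
end

-- Hypothetical helper: mini-batch number b (1-based) of an N x C x H x W tensor.
-- The last batch may be smaller than batch_size.
local function get_batch(data, b, batch_size)
   local first = (b - 1) * batch_size + 1
   local count = math.min(batch_size, data:size(1) - first + 1)
   return data:narrow(1, first, count)
end

local data = torch.rand(25, 1, 128, 128)  -- stand-in for scaled L* inputs
local block = conv_bn_relu(1, 3)
local out = block:forward(get_batch(data, 1, 10))  -- 10 x 3 x 64 x 64
```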
- (10 points). Divide your dataset into two parts: 90% of the images can be in the training part, and 10% of the images can be in the testing part. Also, make sure that the testing images have not been subjected to data augmentation. Train your regressor on the training part, and then test on the testing part. To evaluate the test images, print a numerical mean square error value. Also, run the input luminance of the image through your network, then merge the a* and b* values predicted by your regressor with the input luminance, and convert back to RGB color space using image.lab2rgb. This will let you view the colorized images. (You may have to use the program SFTP to transfer them back to a local machine if you are using remote compute). For example: if you have 750 images in the dataset in total, you could divide this into 90% training (675 images), and 10% testing (75 images): the training images are augmented and result in e.g. 6,750 images, whereas the test images are not augmented (so you only have 75 images).
To avoid running out of memory on large datasets and/or many feature maps, it is necessary to test in mini-batches (otherwise the hidden convolutional layers will allocate too much memory). Also, the batch normalization has to be put in a testing mode to get the best results for testing (in Torch, this can be done by calling evaluate() on your network).
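Testing in mini-batches and merging the predicted chrominance with the input luminance might look like the sketch below. The names test_mse and colorize are hypothetical, and the code assumes the network takes L* divided by 100 and outputs a*, b* divided by 128 (adjust both to whatever scaling you actually trained with). The tiny network at the end is only a stand-in so the example runs on its own.

```lua
require 'torch'
require 'nn'
require 'image'

-- Mean square error in a*, b* units over an unaugmented N x 3 x H x W L*a*b* test set.
local function test_mse(net, lab_test, batch_size)
   net:evaluate()  -- batch normalization switches to its running statistics
   local n = lab_test:size(1)
   local sum_sq, count = 0, 0
   for first = 1, n, batch_size do
      local len = math.min(batch_size, n - first + 1)
      local batch = lab_test:narrow(1, first, len)
      local L = batch:narrow(2, 1, 1):clone():div(100)
      local pred = net:forward(L):clone():mul(128)  -- back to a*, b* units
      local diff = pred - batch:narrow(2, 2, 2)
      sum_sq = sum_sq + diff:pow(2):sum()
      count = count + diff:nElement()
   end
   net:training()  -- restore training mode in case training continues
   return sum_sq / count
end

-- Colorize one 3 x H x W L*a*b* image: keep its L*, replace a*, b* with predictions.
local function colorize(net, lab_img)
   net:evaluate()
   local h, w = lab_img:size(2), lab_img:size(3)
   local L = lab_img:narrow(1, 1, 1):clone():div(100):view(1, 1, h, w)
   local ab = net:forward(L)[1]:clone():mul(128)
   local merged = torch.Tensor(3, h, w)
   merged[1]:copy(lab_img[1])
   merged:narrow(1, 2, 2):copy(ab)
   return image.lab2rgb(merged)
end

-- Stand-in network and data, only to exercise the two functions.
local net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 2, 3, 3, 1, 1, 1, 1))
net:add(nn.Tanh())
local lab_test = torch.rand(5, 3, 16, 16)
local mse = test_mse(net, lab_test, 2)
local rgb = colorize(net, lab_test[1])
```

The returned RGB tensor can then be written to disk with image.save for viewing.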
- (15 points). Explore changing the number of feature maps for the interior ConvNets to see if you can gain better test accuracy. Report in a brief writeup (it can be just a paragraph or two) what experiments you did for the architecture, your best architecture, and the best mean square error (in terms of a* and b*) you were able to achieve.
- (7 points) Optional extra credit: for additional speed you can move your ConvNet to the GPU, as explained in this tutorial. You will need to add require commands for cutorch and cunn (to verify that your Torch is CUDA-enabled, first run this in an interactive LuaJIT prompt: require 'cutorch'). There are limited numbers of GPUs available on the department machines, which will likely become congested near the assignment due date. Thus, the extra credit can be submitted later, through Sun Nov 6 (11:59 PM).
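A minimal sketch of the GPU step is shown below, assuming cutorch and cunn are installed; it falls back to the CPU when they cannot be loaded. The one-layer network is only a placeholder for your own model.

```lua
require 'torch'
require 'nn'

-- True only if both CUDA packages load without error.
local has_cuda = pcall(require, 'cutorch') and pcall(require, 'cunn')

local net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 2, 3, 3, 1, 1, 1, 1))  -- placeholder model
local input = torch.rand(10, 1, 128, 128)

if has_cuda then
   -- The model and every tensor passed to it must live on the GPU.
   net = net:cuda()
   input = input:cuda()
end

local output = net:forward(input)
print(has_cuda and 'ran on GPU' or 'ran on CPU', output:size(2))
```

The criterion and the target tensors would need the same :cuda() conversion before computing the loss.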
- Zhang et al. show that a classification loss results in more vivid colors than regression loss.
Feel free to collaborate on solving the problem but write your code individually.
Submit your assignment in a zip file named yourname_project2.zip. Please include:
- Programs to train and test your best resulting colorization model.
- Example image colorizations from both the training and test sets.
- The program that predicts mean chrominance from step 4.
- Your brief writeup.
Finally, submit your zip to UVA Collab.
If you submit the extra credit component after the main assignment is due, please do not resubmit your main project, which would cause your project to be marked as late in Collab. Instead, please submit to the "Project 2 extra credit" assignment on Collab. You can submit an updated writeup and code for that component.