The encoder and decoder are LSTMs with 1200 recurrent hidden units. Unlike the original DRAM model, I no longer use a linear operation L when predicting the parameters of the parametric distributions, but add a non-linearity. This is a simple trick to prevent the model from learning trivial solutions: the model has to work a bit harder to optimize its parameters, which ends up yielding a slightly better solution.
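As a sketch of this trick (the module name, layer sizes, and parameter count below are my assumptions, not the exact code used here), the distribution parameters can be predicted through a small tanh MLP instead of a single linear map:

```python
import torch
import torch.nn as nn

class ParamHead(nn.Module):
    """Predicts distribution/attention parameters from the LSTM hidden
    state. A single linear map would be h @ W + b; here a tanh layer is
    inserted so the mapping is no longer purely linear."""
    def __init__(self, hidden_size=1200, n_params=5):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, n_params)

    def forward(self, h):
        # non-linearity between the two linear maps is the whole trick
        return self.fc2(torch.tanh(self.fc1(h)))

head = ParamHead()
params = head(torch.zeros(8, 1200))  # batch of 8 hidden states
print(params.shape)  # torch.Size([8, 5])
```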
I used Adam as the optimizer, with the initial learning rate set to 0.0005; β₁ and β₂ were set to 0.9 and 0.999, respectively. The baseline model has three hidden layers with 64 feature detection channels per layer. On top of the CNN there are two fully-connected layers with 512 hidden units each, followed by a softmax classifier. I also trained this model with Adam using the same configuration and applied dropout (Srivastava et al., 2014) to prevent overfitting. I applied early stopping to all models to obtain the best test performance.
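For concreteness, here is a minimal sketch of the baseline just described (kernel sizes, pooling, and the dropout rate are assumptions; the text only fixes the layer counts, channel widths, and optimizer settings):

```python
import torch
import torch.nn as nn

# Three conv layers with 64 channels each, two 512-unit FC layers,
# and a classifier on top; input is a 48x48x3 image.
baseline = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 12 -> 6
    nn.Flatten(),
    nn.Linear(64 * 6 * 6, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 2),  # softmax is applied inside the cross-entropy loss
)
optimizer = torch.optim.Adam(baseline.parameters(), lr=5e-4, betas=(0.9, 0.999))

logits = baseline(torch.zeros(4, 3, 48, 48))
print(logits.shape)  # torch.Size([4, 2])
```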
Table 1 compares the test accuracies of the baseline model (stage 1) and the HDRAMs trained on different image sizes. For a fair comparison, the baseline model should be compared against the HDRAM trained on 48×48×3 images so that the experimental settings match, although it is difficult to control for model size between two completely different approaches (CNN and RNN).
I cannot clearly explain why the performance degrades as the size of the images in the dataset grows. One possible explanation is poorly chosen hyperparameters, such as the glimpse size of the encoder input and the decoder output; in other words, we may not have selected the best model for this task. Since training takes 3–4 days on average on a GTX Titan Black, a hyperparameter search takes a long time.
The figure above shows how the model generates cat images. The figure below shows the write operation for dog images.
The model always starts writing from the top-left corner of the canvas and progressively fills in the remaining region.
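This sequential behavior follows from the DRAW-style generation loop, in which the canvas accumulates one write per time step and the final image is the sigmoid of the canvas. A minimal sketch (the write patch here is a zero stand-in for the decoder's attention-placed write operation):

```python
import torch

T, B, C, H, W = 10, 1, 3, 48, 48   # time steps, batch, channels, size
canvas = torch.zeros(B, C, H, W)
for t in range(T):
    # In the real model this patch comes from write(h_dec_t) and is
    # placed on the canvas by the attention window.
    write_patch = torch.zeros(B, C, H, W)  # stand-in
    canvas = canvas + write_patch          # c_t = c_{t-1} + write_t
image = torch.sigmoid(canvas)              # final generated image
print(image.shape)  # torch.Size([1, 3, 48, 48])
```

Because the canvas is only ever added to, whatever the model writes in the earliest steps (here, the top-left corner) remains visible in the final image.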
The results of HDRAM were not as good as I initially expected. As explained above, one reason might be insufficient exploration of the hyperparameter space. The computation cannot be parallelized as easily as in a CNN, which makes training extremely slow. Each training run lasts a few days, and I did not have enough time to explore the whole hyperparameter space. I will leave this as future work.
DRAM has been shown to do a decent job on datasets such as MNIST, cluttered and translated MNIST, and SVHN. These datasets are relatively uniform and contain simple structures. I assume that the attention mechanism does not function well when a dataset contains more complicated structures. For instance, CIFAR-10 has high variance in its backgrounds, but the objects are still centered; when objects are well centered, the attention mechanism might not be that crucial. The Dogs vs. Cats dataset, however, has highly variable backgrounds, the objects are not centered, and sometimes the label information does not make sense. In this case the discriminative and generative performance depends heavily on the attention mechanism.
Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, arXiv preprint arXiv:1412.6980, 2015.