# Model configuration and classification & generation results

Model configuration
The encoder and decoder are LSTMs with 1200 recurrent hidden units. Unlike the original DRAM model, I no longer use a purely linear operation L to predict the parameters of the parametric distributions; I add a non-linearity instead. This is a simple trick to keep the model from learning trivial solutions: the model has to work a bit harder to optimize its parameters, which leads to a slightly better solution in the end.

I used Adam [1] as the optimizer, with an initial learning rate of 0.0005. ${\beta_1}$ and ${\beta_2}$ were set to 0.9 and 0.999, respectively. The baseline model is a CNN with three hidden layers and 64 feature channels per layer. On top of the CNN there are two fully-connected layers with 512 hidden units, followed by a softmax classifier. I trained this model with Adam using the same configuration and applied dropout (Srivastava et al., 2014) to prevent overfitting. I applied early stopping to all models to obtain the best test performance.
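As a rough illustration of these optimizer settings, a single Adam update for one scalar parameter can be sketched as below. This is not the code used for the experiments: the function name `adam_step`, the `eps` value and the toy objective are assumptions made for the example; only the learning rate and the two beta values come from the text.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective (an assumption): minimise f(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
```

With a learning rate this small, each step moves the parameter by roughly 0.0005, which is one reason a full training run takes many epochs.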

Classification results

Table 1 compares the test accuracies of the baseline model (stage 1) and the HDRAMs trained on different image sizes. For a fair comparison, the baseline should be compared with the HDRAM trained on 48 X 48 X 3 images, so that the experimental settings match, although it is difficult to control the model size across two completely different approaches (CNN and RNN).

I cannot clearly explain why the performance degrades as the images in the dataset get larger. One possible explanation is poorly chosen hyperparameters, such as the glimpse size of the input to the encoder and of the output of the decoder, which would mean that we have not selected the best model for this task. Since training takes 3~4 days on average on a GTX Titan Black, a hyperparameter search takes a long time.

Generation results

The figure above shows how the model generates cat images. The figure below shows the write operation for dog images.

The model always starts writing from the top-left corner of the canvas and gradually fills in the remaining region of the canvas.

The results of HDRAM were not as good as I initially expected. As explained above, one reason might be insufficient exploration of the hyperparameter space. The computation cannot be parallelized as well as a CNN's, which makes training much slower. Each training run lasts a few days, and I did not have enough time to explore the hyperparameter space. I leave this as future work.

DRAM has been shown to do a decent job on datasets such as MNIST, cluttered and translated MNIST, and SVHN. These datasets are fairly uniform and contain relatively simple structures. I suspect that the attention mechanism does not function well when the dataset contains more complicated structures. For instance, CIFAR10 has high variance in the backgrounds, but the objects are still centered. When the objects are well centered, the attention mechanism might not be that crucial. However, the Dogs vs. Cats dataset has highly variable backgrounds, and the objects are not centered. Sometimes the label information makes no sense. In this case, the discriminative and generative performance relies heavily on the attention mechanism.

[1] Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, arxiv preprint 1412.6980, 2015.

# Discriminative and Generative Training of Sequential Variational Autoencoder with Attention

DRAM is trained with the following objective:

${\mathcal{L}(\mathbf{x};\theta,\phi)={\langle\mathcal{L}_{\theta}^{\mathbf{x}}+\mathcal{L}_{\phi}^{\mathbf{z}}\rangle}_{\mathbf{z}\sim q_{\phi}}}$

where ${\mathcal{L}_{\theta}^{\mathbf{x}}}$ is the reconstruction loss, defined as the expected negative log-probability of ${\mathbf{x}}$ conditioned on ${\mathbf{z}}$, estimated from ${L}$ samples as

${\mathcal{L}_{\theta}^{\mathbf{x}}=-\frac{1}{L}\sum_{l=1}^L\log p_{\theta}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)})}$,

where ${\mathbf{z}}$ is sampled from ${q_{\phi}(\mathbf{z}|\mathbf{x})}$. The term ${\mathcal{L}_{\phi}^{\mathbf{z}}}$ is the Kullback-Leibler (KL) divergence between two distributions in the latent space, ${q_{\phi}(\mathbf{z}|\mathbf{x})}$ and ${p(\mathbf{z})}$.
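To make the KL term concrete, here is a minimal sketch under an assumption that the text does not state explicitly: that ${q_{\phi}(\mathbf{z}|\mathbf{x})}$ is a diagonal Gaussian and ${p(\mathbf{z})}$ is a standard normal, in which case the divergence has a closed form. The function name and the `log_sigma` parametrisation are illustrative choices.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, summed over dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1),
    with sigma^2 = exp(2 * log_sigma).
    """
    return sum(0.5 * (m * m + math.exp(2.0 * s) - 2.0 * s - 1.0)
               for m, s in zip(mu, log_sigma))

# The divergence vanishes when the posterior equals the prior.
kl_at_prior = kl_diag_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0])
```

In a sequential model such as DRAM, a term of this kind would be computed at every time step and summed over the steps.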

Details of the equations for each component of DRAM can be found in the original paper; since our goal is not to reproduce the DRAM model, I will skip these parts.

We want to do classification using the representation obtained by a novel attention mechanism and, at the same time, conditional generation using label information. We infer ${q_{\phi}(y|\mathbf{x},\mathbf{z})=\mathrm{Cat}(y|\pi_{\phi}(\mathbf{x},\mathbf{z}))}$ instead of ${q_{\phi}(y|\mathbf{x})=\mathrm{Cat}(y|\pi_{\phi}(\mathbf{x}))}$, where ${\mathrm{Cat}}$ is usually a softmax distribution.

More precisely, we build a confidence level for ${y}$ at each time step ${t}$, using the output of the read layer and the latent variables as features; we therefore have an additional canvas layer for ${y}$. The update rule for this quantity becomes

${y_t=y_{t-1}+score(\mathbf{r}_t,\mathbf{z}_t)}$,

and we obtain the final prediction as ${\mathrm{Cat}(y_{T})}$. Since the attention mechanism is used to update the score, we can observe which regions of an image actually contributed to the score for the final decision.
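The accumulation rule can be sketched in a few lines. This is only an illustration: the per-step scores below are made-up numbers standing in for the output of the score function, and starting the label canvas at zero is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accumulate_scores(step_scores, num_classes=2):
    """Apply y_t = y_{t-1} + score_t and record Cat(y_t) after every step."""
    y = [0.0] * num_classes                 # y_0 = 0 is an assumption
    history = []
    for score in step_scores:
        y = [a + b for a, b in zip(y, score)]
        history.append(softmax(y))
    return history

# Hypothetical scores for (cat, dog) over three glimpses.
probs = accumulate_scores([[0.2, -0.1], [0.5, 0.0], [-0.1, 0.3]])
final_prediction = probs[-1]                # corresponds to Cat(y_T)
```

Keeping the whole history rather than only the last entry is what makes it possible to inspect which glimpse moved the prediction.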

In order to train the model to also do classification, we add a discriminative term to the training objective: the cross-entropy error between the true label and the prediction. We weight this term with a hyperparameter ${\alpha}$, which is usually set to 0.1 times the mini-batch size. The training objective therefore becomes a hybrid objective:

${\mathcal{L}(\mathbf{x},y;\theta,\phi)={\langle\mathcal{L}_{\theta}^{\mathbf{x}}+\mathcal{L}_{\phi,\gamma}^{\mathbf{z}}\rangle}_{\mathbf{z}\sim q_{\phi}}+\alpha\mathcal{L}_{\phi}^{y}(y,\mathrm{Cat}(y_T))}$.
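A minimal sketch of how the three terms combine, assuming the reconstruction and KL terms have already been computed elsewhere. The function names, the mini-batch size of 100 and all numeric values are hypothetical and only serve the example.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probs[label])

def hybrid_loss(recon_loss, kl_loss, final_probs, label, alpha):
    """Generative terms plus the alpha-weighted discriminative term."""
    return recon_loss + kl_loss + alpha * cross_entropy(final_probs, label)

batch_size = 100                  # hypothetical mini-batch size
alpha = 0.1 * batch_size          # "0.1 times the mini-batch size"
loss = hybrid_loss(recon_loss=120.0, kl_loss=15.0,
                   final_probs=[0.8, 0.2], label=0, alpha=alpha)
```

Scaling ${\alpha}$ with the batch size keeps the classification term on a scale comparable to the generative terms when those are summed rather than averaged over the batch; that reading is my interpretation, not something the setup above spells out.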

We also want to use the label information for the write operation. To do this, we simply provide the true label as an additional input to the decoder. At training time, we use the score ${\mathrm{Cat}(y_{t-1})}$ at each time step ${t}$. The update rule for the recurrent units of the decoder becomes

${s_{t}^{dec}=RNN^{dec}(s_{t-1}^{dec},\mathbf{z}_t, \mathrm{Cat}(y_{t-1}))}$.

We call the model with the above modifications a hybrid DRAM (HDRAM).

# Implementing DRAM with sequential conditioning and supervised learning

As a next step, I have implemented DRAW with sequential conditioning (a technique that I proposed earlier for training sequential VAEs). The optimization seems to be working well. I will add the learning curves once training has finished.

To extend the DRAW algorithm into a binary classifier, we could consider the differentiable RAM (DRAM) approach explained in the paper [1]. What I would like to try is combining DRAM with the approach Kingma introduced in his NIPS paper [2]. Training a generative model with class labels is very interesting, because conditional generative models are what we will eventually want to have as products. In addition, class labels could improve the training of generative models by providing additional information for inferring the latent variables. We could write our posterior as ${q_{\phi}(\mathbf{z}|\mathbf{x},y)}$. Whether to define the likelihoods as ${p_{\theta}(\mathbf{x}|\mathbf{z})}$ and ${p_{\theta}(y|\mathbf{z})}$, or as ${p_{\theta}(\mathbf{x}|y,\mathbf{z})}$, is up to the user, depending on how one models the connections between the variables.

I think RNN-VAEs (DRAM also belongs to this family of models) could be powerful, since this type of model has appealing ingredients. The Recurrent Attention Model (RAM) [3] has shown the power of a recurrent net combined with an attention mechanism, which can outperform CNNs in classification tasks. DRAM (differentiable, trained on a variational lower bound) could do better than RAM in classification on cluttered & translated MNIST. With sequential conditioning and properly designed conditional probabilities, we could build an even more powerful model.

[1] Karol Gregor, Ivo Danihelka, Alex Graves and Daan Wierstra, “DRAW: A Recurrent Neural Network for Image Generation”, arxiv preprint 1502.04623, 2015.

[2] Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed and Max Welling, “Semi-supervised learning with deep generative models”, in NIPS, 2014.

[3] Volodymyr Mnih, Nicolas Heess, Alex Graves and Koray Kavukcuoglu, “Recurrent models of visual attention”, in NIPS, 2014.

# First stage complete

I have just finished doing experiments with my baseline.

It is a very naive CNN model with 3 hidden & pooling layers and 2 fully connected layers with dropout.

I’m using binary cross entropy to train my model.

I used my personal research library, clé (you can find the library here and the script file of the model here).

It takes 1.5~2 minutes to reach 80% accuracy on a Titan Black with cuDNN v2.

Here is the pipeline with more details.

I borrowed some of Kyle Kastner’s code (his blog and his repo).

I used the dataset provided by Kyle, in which the images are rescaled to 48 X 48 X 3.

I normalized the data by feature dimension.
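One common reading of "normalizing by feature dimension" is to standardize every feature (here, every pixel position and channel) to zero mean and unit variance across the dataset. The sketch below assumes that reading; the helper name, the `eps` constant and the toy data are made up for illustration.

```python
import math

def standardize_features(rows, eps=1e-8):
    """Give every column zero mean and (approximately) unit variance across rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    # eps guards against division by zero for constant features
    return [[(r[j] - means[j]) / (stds[j] + eps) for j in range(dims)]
            for r in rows]

# Three flattened "images" with two features each (toy numbers).
data = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
normalized = standardize_features(data)
```

The statistics would normally be computed on the training set only and then reused for the validation and test sets.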

I could try to improve the baseline a bit more, but I have a different goal.

I will try out multi-task learning with the DRAW model.

What I would expect from this is:

1. get performance that is comparable to, or not much worse than, the baseline (65~80%).

2. each task will regularize the other.

3. verify how efficient the attention mechanism is on image sets with highly variant backgrounds and non-centered objects.

# How should we shrink alpha?

I had a hard time working out how the hyperparameters (momentum, learning rate, and alpha) influence each other. I have a few observations from my new experiments. If you run my code, the model diverges when $\alpha$ falls to around 0.5. This depends on your hyperparameter setting, but for me it happens between epochs 250 and 300. At first I assumed that this was due to the accumulation of errors: we use predicted samples as inputs during training, so errors accumulate and affect the next prediction, as I wrote in my last post. This is partially true, but in my opinion it is not the crucial factor. I would say that predictive training helps to learn the frequency structure of the speech quite well, but requires careful tuning to learn the shape of the envelope. I will try to explain using the figures below as an example.

Red: ground truth, blue: predicted speech, green: generated speech

This is a nice example because our ground truth (red wave) has a sudden decrease in its envelope at the end, and I want to point out why my model diverges when $\alpha$ decreases to a small value.

1. Speech prediction results are nice even with $\alpha=1$

2. But we also want to do nice generation, so we use predictive training and shrink $\alpha$

3. It seems that $\alpha$ helps to capture the frequency structure of the wave and also slightly captures the magnitude

4. But the model is still not good at handling the shape of the envelope, so it keeps predicting samples with the same pattern (without controlling the magnitude of the samples)

5. Let’s say that around time-step 1400~1500 the model predicts samples with large values, as it has been doing for the previous 1300 samples.

6. If $\alpha$ is small, this prediction is used directly to compute the error against the target, and there will be unexpectedly large gradients

My assumption is that these large gradients spoil the optimization and cause the model to diverge. I’m not experienced with RNNs, but gradient clipping (which is used to prevent exploding gradients) might help to resolve this problem. The solution I came up with is to decrease the learning rate exponentially once $\alpha$ becomes smaller than 0.75. In addition, it is not necessary to decrease $\alpha$ all the way to zero; a reasonably small value is fine (I use 0.5).
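The two remedies mentioned here can be sketched as follows. The thresholds 0.75 and 0.5 come from the text; the decay factor, the per-epoch step for $\alpha$, the base learning rate and the clipping norm are hypothetical values chosen only so that the example runs, and the helper names are mine.

```python
import math

def next_alpha(alpha, step=0.005, floor=0.5):
    """Shrink alpha a little each epoch, but never below `floor`."""
    return max(floor, alpha - step)

def next_learning_rate(lr, alpha, threshold=0.75, decay=0.95):
    """Decay the learning rate exponentially once alpha is below `threshold`."""
    return lr * decay if alpha < threshold else lr

def clip_by_norm(grads, max_norm=1.0):
    """Rescale a gradient vector so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

alpha, lr = 1.0, 0.01
for epoch in range(200):
    lr = next_learning_rate(lr, alpha)   # decays only while alpha < 0.75
    alpha = next_alpha(alpha)            # stops shrinking at 0.5

clipped = clip_by_norm([3.0, 4.0])       # norm 5 is rescaled to norm 1
```

Clipping bounds the size of a single bad update, while the learning-rate decay makes all updates smaller in the regime where the large gradients appear; the two can be combined.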

# Supplement of last post (Apr. 4)

I am trying to figure out the relationship between the hyperparameters (e.g., learning rate, alpha, momentum). The scheduling of each hyperparameter seems to affect the generation performance considerably more than I expected. I was using a sequence unseen during training to provide the initial seed for generation, and observed how the RNN generates the sequence. After talking with Professor Yoshua, I decided to change my experiment slightly. I will keep running experiments to see whether my model overfits a larger training set, and I will try sliding by 10 samples per time-step. You can download my code here (my code is based on Vincent Dumoulin’s PyLearn2 implementation). The train script has been modified, so I will clarify the training setting of the previous experiment. I used 10 sequences corresponding to the phoneme /aa/ and sliced the data into 20~30 subsequences without overlap (each example contains 500 acoustic samples). At test time (both prediction and reconstruction), I ran my RNN on one of the sequences corresponding to the phoneme /aa/.

# Generating /aa/ with RNN

I would like to share some results from my recent experiment. I will explain my model, the dataset I used, and how I got the results. I am working with an RNN, and the objective of this experiment is to show the effectiveness of predictive training. We know that predicting a sample from the ground truth is easy. Speech generation, on the other hand, is extremely difficult, because there is a gap between training and testing: speech prediction fits our training objective well, but speech generation is quite different. The simple solution we could try is to make the training objective similar to testing. My RNN has 240 visible units and 500 hidden units and can learn 261 time steps (with stride=1). My model was trained with 10 phonemes from class /aa/. For training, I generated fixed-length subsequences of size 500 without any overlap. I removed the momentum because the momentum term seems to be an obstacle during predictive training (it interferes with the parameter updates, so alpha shrinks faster than the expected parameter changes). I run the first 100 epochs without predictive training. $\alpha$ decreases from epoch=100 to epoch=300, and training goes on until epoch=500. After the transition (when $\alpha=1$) the model is trained entirely with predictive training. The layout of my figures is similar to that of my last posts, with one difference: subplot(4, 1) shows both the squared error between ground truth and prediction (magenta) and between ground truth and generation (cyan).
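The figure of 261 time steps is consistent with sliding a 240-sample input frame over a 500-sample subsequence one sample at a time. The sketch below shows that arithmetic; it is my reading of the setup rather than the actual data pipeline, and the helper names, the dummy waveform and the decision to drop a trailing partial chunk are assumptions.

```python
def split_without_overlap(samples, length=500):
    """Cut a 1-D sequence into consecutive chunks of `length` samples.

    Trailing samples that do not fill a whole chunk are dropped (an assumption).
    """
    count = len(samples) // length
    return [samples[i * length:(i + 1) * length] for i in range(count)]

def sliding_frames(chunk, frame=240, stride=1):
    """Return every window of `frame` samples, moving `stride` samples at a time."""
    last_start = len(chunk) - frame
    return [chunk[s:s + frame] for s in range(0, last_start + 1, stride)]

waveform = [0.0] * 1234                  # stand-in for one /aa/ recording
chunks = split_without_overlap(waveform)
frames = sliding_frames(chunks[0])
print(len(chunks), len(frames))          # prints: 2 261
```

With 500-sample chunks and 240-sample frames, a stride of 1 gives 500 - 240 + 1 = 261 frames per chunk; a larger stride reduces the count accordingly.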

1. epoch=95

2. epoch=105

3. epoch=130

4. epoch=185

5. epoch=200

The model was able to capture frequency information before predictive training, but it could not capture the expected shape of the envelope (red wave). The transition to predictive training started at epoch=100. After a few epochs of training, the model started to generate a sequence that looks similar to the ground truth. The error was still high, though (black wave in subplot(2, 2) and cyan squared errors in subplot(4, 1)); this is due to a temporal offset.

You can listen to the phonemes at the links below: