Recurrent Neural Networks are amusing. They are among the finest examples of how simple mathematical models can achieve exciting and useful results. Recurrent Neural Networks (or RNNs for short) are a type of Neural Network architecture used for sequential data, that is, inputs that occur as a sequence, like language, audio or video frames. These simple (I would rather prefer the term ‘cute’) models are used to generate new language sequences, compose music in the style of Mozart, translate a sentence or document from one language to another, classify video and even play games!

In this post we will learn about RNNs, use them to generate text and visualise their inner workings.

I assume that you have a basic understanding of simple Neural Networks, matrices and matrix multiplication. Even if you don’t, feel free to jump over to the fun section of the post, where I trained an RNN to talk about Special Relativity, prepare a speech in the style of Indian Prime Minister Narendra Modi and create some C# code.

The code for this post is available here.

What is hidden behind an RNN?

Nothing except a bunch of matrix multiplications!
Let’s discuss a simple RNN (yes, there are complicated ones too!). People have developed various visual representations of RNNs, each having its own pros and cons, but in my opinion the easiest way to understand RNNs is to look at the equations. Keep in mind that these are matrix equations. A good way to understand them is to note the dimensions of the matrices.

$$h_{t} = \sigma (W \cdot h_{t-1} + U \cdot x_t)$$ $$y_t = softmax(V \cdot h_t)$$ Where,
  • \(h_t \in \mathbb{R}^{D_h}\). \(h_t\) is called the hidden state of the RNN (it is a vector). Here \(D_h\) is the dimension of the hidden state vector (something you are free to choose while creating an RNN). Keep in mind that a fresh hidden state vector is calculated at every timestep and fed back in at the next timestep (see the first equation).
  • \(x_t \in \mathbb{R}^{d}\), where \(d\) is the dimension of the input vector. The input to the RNN is in the form of a vector. This vector might represent anything: words, characters etc. Basically it represents the discrete unit of input.
  • \(h_{t-1} \in \mathbb{R}^{D_h}\). In the equations given above this represents the hidden state vector computed after the previous time step.
  • \(U \in \mathbb{R}^{D_h \times d}\). This is the weight matrix applied to the input vector \(x_t\) (see the equations).
  • \(W \in \mathbb{R}^{D_h \times D_h}\). This is the weight matrix applied to the hidden state from the previous timestep.
  • \(V \in \mathbb{R}^{D_y \times D_h}\). This is applied to the hidden state to generate the output for a particular timestep.
  • \(y_t \in \mathbb{R}^{D_y}\). This is the output of the current timestep.
  • \(\sigma\) is the non-linearity we introduce in a neural network.
The weight matrices are the parameters of the RNN, which it learns through training. In practice one concatenates \(h_{t-1}\) and \(x_t\), and instead of the two matrices \(W\) and \(U\) in the first equation we use a single larger weight matrix \(W_{hx} \in \mathbb{R}^{D_h \times (D_h + d)}\). If you know what simple neural networks are, you can see that these equations essentially correspond to a 1-layer simple neural network with \(W_{hx}\) as the weight matrix.
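As a quick sanity check, here is a small NumPy sketch (with made-up dimensions of my own choosing) showing that multiplying the concatenated vector by \(W_{hx}\) gives exactly the same pre-activation as \(W \cdot h_{t-1} + U \cdot x_t\):

import numpy as np

D_h, d = 4, 3                        # hypothetical hidden and input dimensions
h = np.random.randn(D_h, 1)          # previous hidden state
x = np.random.randn(d, 1)            # current input vector
W = np.random.randn(D_h, D_h)        # weight applied to the hidden state
U = np.random.randn(D_h, d)          # weight applied to the input

Whx = np.hstack([W, U])              # shape (D_h, D_h + d)
hx  = np.vstack([h, x])              # concatenated vector, shape (D_h + d, 1)

# both formulations give the same pre-activation
assert np.allclose(W @ h + U @ x, Whx @ hx)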

Remember that these equations are applied at every timestep of the sequence. For example, if we apply a ‘character level RNN’ (which we will do shortly) to the sequence “I love maths”, then after we compute the hidden state for the timestep that reads ‘o’, we use that same hidden state (and the same weight matrices) to compute the hidden state and output for the next timestep, which reads ‘v’. Of course one has to pass the inputs as vectors and not as a raw string.
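For a character-level RNN the simplest way to turn characters into vectors is a one-hot encoding. A minimal sketch (the vocabulary here is just whatever characters occur in the example string):

import numpy as np

text = "I love maths"
vocab = sorted(set(text))                         # unique characters in the text
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
	# (len(vocab) x 1) column vector with a 1 at the character's index
	v = np.zeros((len(vocab), 1))
	v[char_to_idx[ch]] = 1.0
	return v

x_t = one_hot('o')   # input vector for the timestep that reads 'o'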

Internals of RNN

Let’s see how this looks in code.

import numpy as np

class RNN:
	def __init__(self, input_size, hidden_size, output_size):
		self.input_size  = input_size
		self.hidden_size = hidden_size
		self.output_size = output_size
		# here we initialise the weights with random numbers
		self.Whx = np.random.randn(hidden_size, input_size + hidden_size)
		self.V   = np.random.randn(output_size, hidden_size)

This is the initialisation part where we define the parameters of our RNN Cell. Now for the forward propagation part:

class RNN:
	def __init__(self, input_size, hidden_size, output_size):
		# .....

	def forward(self, x, h):
		# x is (input_size x 1) and h is (hidden_size x 1)
		h_t = np.tanh(np.dot(self.Whx, np.concatenate((h, x))))  # non-linearity applied to Whx . [h; x]
		y_t = np.dot(self.V, h_t)  # softmax over y_t gives the character probabilities

		return y_t, h_t

.....................................

# in the main function
rnn = RNN(input_size, hidden_size, output_size)
......................................

# inside the training loop
h = np.zeros((hidden_size, 1))
for t in range(timesteps):
	y, h = rnn.forward(x_t, h)  # x_t is the input vector for timestep t

Take time to digest it. It is perfectly fine if you can’t right away. Come back to it later, but make sure you do, because the beauty of RNNs (like other Machine Learning models/algorithms) lies in their characteristic equations.

LSTMs and GRUs

The hidden state of the RNN is the vector that keeps track of the past input (due to its dependence on the previous timestep). So it essentially acts as a memory bank that gets updated at each timestep. Through the hidden state vector, an RNN (the one described above) can, in theory, model any sequence. It turns out that in practice they are hard to train. The problem is called ‘The Vanishing Gradient Problem’. What it essentially means is that RNNs have a really short memory of the past and can’t model long sequences effectively.

In order to overcome this deficiency, researchers have over the years developed many variants of the simple RNN. The most successful (and also the most popular) are the Long Short Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Nobody actually uses the simple RNN we discussed above (but understanding it is a pre-requisite for understanding LSTMs and GRUs). Also, GRUs and LSTMs are almost equivalent when it comes to accuracy, and there is no particular advantage in using one over the other, except that GRUs have simpler equations.

GRUs

The GRU equations are as follows:

$$z_{t} = \sigma (W_{z} \cdot h_{t-1} + U_{z} \cdot x_t)$$ $$r_{t} = \sigma (W_{r} \cdot h_{t-1} + U_{r} \cdot x_t)$$ $$\hat{h}_{t} = \tanh (W_{h} \cdot (r_t \circ h_{t-1}) + U_{h} \cdot x_t)$$ $$h_t = (1-z_{t}) \circ \hat{h}_t + z_t \circ h_{t-1}$$ $$y_t = softmax(V \cdot h_t)$$ where \(\circ\) means element-wise product. \(z_t, r_t\) have the same dimensions as \(h_t\). The interesting part of these equations is the introduction of the vectors \(z_t\) and \(r_t\), which are referred to as gates. \(z_t\) is called the 'update gate' and \(r_t\) is known as the 'reset gate' (note how it scales the previous hidden state inside the candidate \(\hat{h}_t\)). The gist of all this is that these two vectors help the GRU to efficiently decide (as training progresses) how much of the past to remember.
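For concreteness, here is a minimal NumPy sketch of a single GRU timestep following the equations above (random, untrained weights and hypothetical dimensions; this is illustration only):

import numpy as np

def sigmoid(a):
	return 1.0 / (1.0 + np.exp(-a))

D_h, d = 5, 3                                      # hypothetical dimensions
Wz, Wr, Wh = (np.random.randn(D_h, D_h) for _ in range(3))
Uz, Ur, Uh = (np.random.randn(D_h, d) for _ in range(3))

def gru_step(x_t, h_prev):
	z = sigmoid(Wz @ h_prev + Uz @ x_t)            # update gate
	r = sigmoid(Wr @ h_prev + Ur @ x_t)            # reset gate
	h_hat = np.tanh(Wh @ (r * h_prev) + Uh @ x_t)  # candidate hidden state
	return (1 - z) * h_hat + z * h_prev            # new hidden state

h = gru_step(np.random.randn(d, 1), np.zeros((D_h, 1)))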

The LSTM has similar but more complex equations with more gates, but those gates essentially perform the same task as the GRU gates.

The Fun part

I trained RNNs on texts (taking one character at a time) to generate new text based on the training data. The amazing thing was that the RNN had no notion of words, sentences or punctuation (we were just feeding it one character at a time), but after training it (surprisingly) was able to generate correct spellings and sometimes meaningful sentences too. Keep in mind that the RNN has no knowledge of natural language (it is just a collection of matrix operations!); all it does is try to guess a pattern in the input sequence.

At each timestep the output vector actually gives a probability distribution over all possible characters. The character having the maximum probability is chosen and fed as input to the next timestep. The RNN I trained was actually a multi-layered RNN (many RNNs stacked together). The RNNs were trained on Floydhub.
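A sketch of this generation loop, using the toy RNN class and the one_hot/vocab helpers from the earlier snippets (and assuming the RNN was built with input_size = output_size = len(vocab)):

import numpy as np

def softmax(a):
	e = np.exp(a - a.max())
	return e / e.sum()

def generate(rnn, seed_char, length=200):
	h = np.zeros((rnn.hidden_size, 1))
	ch, out = seed_char, [seed_char]
	for _ in range(length):
		y, h = rnn.forward(one_hot(ch), h)   # y scores every character in the vocab
		probs = softmax(y)                   # probability distribution over characters
		ch = vocab[int(np.argmax(probs))]    # pick the most likely character
		out.append(ch)
	return "".join(out)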

Special Relativity and RNN

I copied Einstein’s paper titled “On The Electrodynamics of Moving Bodies” (in which he laid down his ideas of Special Relativity) from this site and fed it to the RNN. The RNN (an LSTM to be precise) had 3 layers and was trained for 10000 epochs.

Let’s see the samples generated by the RNN

electron the motion of the equations, instant is in the velocity of change equations inter dependent moving systems of electromotive forces of Maxwell is there relatively to the coordinates of the field good be theorewerise at rest assing of the motion of the lan it is our k axis to the electrodynamics is to the starties for that is the moves the equations of the smalons \begin{displaymath}=t)^2(=\phi(v))f and and with the velocity of mark fund the obseraterisses bereates resure epparted in at the epomiturises in event are electron be form of $\frac{v}{c}{\rm of v}=c). \ E NK+p.

In the event a stationary system how when the viewed and we call the are electron. If and ponderable measuring-rod at the imarke hold good in we material motions all result equations \begin{displaymath}\man\frac{v}{c^2-vc2v_c2-clccurt{\rm S}=0. \end{eqnarray*} It and for the time $\phi(v)\beta$, (v) we maintain the length of the equations of the velocity of electric field denotes for mass in it.

Observe how the RNN tries to generate a LaTeX expression for some weird mathematical equation. Also note that in the last line it seems to be talking about some relationship between the electric field and mass (somewhere close to understanding Relativity! :D ). Another point to be noted is that it is quite accurate when it spells the words “electrodynamics”, “electromotive” and “electric field”. In the second line the RNN also cites Maxwell.

Narendra Modi’s Speech

I took the text version of Narendra Modi’s Independence Day speech and trained an LSTM on it. Here are some samples.

India ancient nation global histry and cultural heritage of thousands of countrry. With best nation take will only world and economical year this who has aman, when we sirated in the racharive to arable sistone weow have has incrice in this this tratial bast in our commentruration of that our chat from Upnishads to satellites in built only the efforst to with the long historic journey and heritage, was there Mahatma Gandhi, and from Bhim Rao of Mahabharata. And the Vedas is country, . This were to gived life. of to only in the 3 crore post to government wear soster one the is to more also have invester are and a, I assured 2 ceas, it. We have get, in the benewant our good country country forward under a new system, when this sotear lefting of the some So will there is way our the at intent fellow dedicated themselves to free of hope dmake a beter human a in a drem to mak India for one Lord Krishna when when wants should generations our gorgist in raticited the struggle of using from the rest re “Ek Bharat resshsstha Bharat.

The LSTM was able to grasp a few (but important) words. It’s interesting to observe that the LSTM generates the phrase “Bhim Rao of Mahabharata”. I checked the speech once again and sure enough, there was a mention of Bhima from the Mahabharata and of Dr. Bhim Rao Ambedkar.

AtomOS Source Code

After the previous results I tried the LSTM on more structured data. I took the AtomOS source code, merged a few files together and started training. AtomOS is an operating system written in C# (interesting, isn’t it? Do check out its repo). The output on structured data was more pleasing.

/*
* PROJECT:          Atomix Development
* LICENSE:          BSD 3-Clause (LICENSE.md)
* PURPOSE:          FAT Helper
* PROGRAMMERS:      Aman Priyadarshi (aman.eureka@gmail.com)
*/

// assembly is controlled through attributes the fvalues FAT 
// set of . Change these information attribute  to modify the 
// associatede ID of the typelib with class an assembly
namespace Atomix.Kernel.CompilerExtIO.FileSystem.FAAssemblerT.Find
{
    internal class Any : Comparison
    {
        internal Any() { }

        internal override bool Compare(byte[] data, int offset, FatType type)
        {
        	public readonly string Name;
        	public readonly string alue;
        	public readonly bool bData;
        	public readonly CPUArch CPUArch;
            switch ((FileNameAttribute)dat[offserwt + (uint)Etry])
            {
                case FileNameAttribute.LastEntry:
                case FileNameAttribute.Delted:
                case FileNameAttribute:
                CPUArch = aCpuArch;
                AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
                    return false;
                default:
                    return true;
            }
        }
    
}

The text file I used had a lot of instances of the initial license statement. The LSTM, it seems, was able to learn it by heart. Also, the brackets are more or less complete and the indentation is consistent, although the model seems to have a poor memory for the variables it declared.

Understanding What Is Happening

I followed the approach taken by Karpathy and Justin Johnson in this paper and tried to visualise the output of each element of the hidden state vector (each ‘hidden cell’) as a function of the input. On a normal natural-language dataset I couldn’t get significantly interpretable results, so I simplified the problem by using a more structured, context-free language with short-term dependencies. The language (a rule-based sequence, actually) consists only of parentheses and numbers and has the following grammar:

  • The alphabet consists of [ (, ), 1, 2, 3, 4 ] (separated by space)
  • The maximum nesting level allowed is 4
  • Inside a nesting level, numbers are placed at random positions, but each number must equal the current nesting level (a small generator sketch is given after this list)
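Here is a rough sketch of how such sequences could be generated; this is my own reconstruction of the rules above, not necessarily the exact script used in the repository:

import random

def gen_sequence(max_depth=4, max_len=40):
	# generate a space-separated sequence of brackets and numbers where
	# every number equals the current nesting level (never deeper than max_depth)
	tokens, depth = [], 0
	while len(tokens) < max_len:
		choices = []
		if depth < max_depth:
			choices.append('open')
		if depth > 0:
			choices += ['close', 'number']
		move = random.choice(choices)
		if move == 'open':
			tokens.append('(')
			depth += 1
		elif move == 'close':
			tokens.append(')')
			depth -= 1
		else:
			tokens.append(str(depth))
	tokens += [')'] * depth              # close any brackets still open
	return ' '.join(tokens)

print(gen_sequence())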

With the above simple ‘parenthesis language’ as training data, I trained a 1-layer GRU with 15 hidden cells (the hidden vector has 15 elements) and generated a heatmap showing their activation values. In the heatmap, orangeish cells represent values less than zero while bluish cells represent values greater than zero.
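The activations behind such a heatmap can be collected with a simple loop like the one below (written against the toy RNN interface from earlier; the actual visualisation in the repository uses a trained PyTorch GRU and Bokeh):

import numpy as np

def collect_activations(rnn, inputs):
	# run the RNN over a list of input vectors and stack the hidden states
	# into a (timesteps x hidden_size) matrix; each column is one hidden cell over time
	h = np.zeros((rnn.hidden_size, 1))
	states = []
	for x_t in inputs:
		_, h = rnn.forward(x_t, h)
		states.append(h.ravel())
	return np.stack(states)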

One can see that some hidden cells didn’t capture any sequential relation at all (like cell-2 and cell-9).
But cell-4 and cell-12 easily reveal the pattern they capture.
cell-4 remains positive throughout the sequence but gives spikes of negative values just before observing a 4. In other words, this cell keeps track of the deepest level of nesting.
cell-12 gives an activation of about -0.7 whenever it encounters a ‘(’ in the sequence.

These were the patterns that were easiest to spot in the heat map. The code for generating this visualisation is available here along with the instructions on how to run it. If you observe any interesting pattern in this dataset (or any other) do share it with me.

This clearly demonstrates that the ‘magic’ behind RNNs is the hidden state which tries to learn a representation of sequences.

Code

I used PyTorch, my favourite Deep Learning framework, for creating and training Recurrent Neural Nets. Even if you are not familiar with PyTorch, do have a look at the code. I believe one of the greatest strengths of PyTorch is that its syntax is similar to native Python + NumPy code, making PyTorch code more readable than that of any other DL framework.

While working on CharRNN, I experimented with the following approaches:

  • Feeding a one-hot representation of each character directly into the RNN (code available in the simple folder of the repository)

  • Using an embedding table and allowing the RNN to learn a vector representation of each character (code available in the efficient folder)

I found that the latter approach gives better results. One problem I faced with the first approach was that after a few hundred epochs the output mainly consisted of repeated character sequences.
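For reference, here is a minimal PyTorch sketch of the second (embedding-based) approach; the layer sizes, names and the use of a GRU here are illustrative choices of mine, not the exact model in the repository:

import torch
import torch.nn as nn

class CharRNN(nn.Module):
	def __init__(self, vocab_size, embed_size, hidden_size, num_layers=3):
		super().__init__()
		self.embed = nn.Embedding(vocab_size, embed_size)    # learned character vectors
		self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)
		self.fc = nn.Linear(hidden_size, vocab_size)         # scores over all characters

	def forward(self, x, h=None):
		# x: (batch, seq_len) tensor of character indices
		emb = self.embed(x)
		out, h = self.rnn(emb, h)
		return self.fc(out), h

model = CharRNN(vocab_size=65, embed_size=32, hidden_size=128)
logits, h = model(torch.randint(0, 65, (1, 20)))             # dummy batch of 20 characters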

The interactive heatmap was generated using Bokeh Plotting Library.

Conclusion

RNNs are Great! RNNs are Simple! RNNs are Fantastic!

Recurrent Nets are powerful models for training on sequential input. The two most common forms of RNNs are LSTMs and GRUs. The secret mantra behind the power of RNNs is the hidden state, which captures sequential patterns. In practice, RNNs are used with another ‘trick’ called the Attention Mechanism. It is a concept that is loosely based on the attention mechanism found in humans. You can read about attention here.

On a personal note, this blog is my attempt to follow Albert Einstein’s philosophy:

If you can’t explain it simply, you don’t understand it well

I am trying to get better at explaining things. I would really appreciate it if you could give some constructive feedback about the post. Also, feel free to ping me if there was something you couldn’t understand or if you are facing issues with the code.

Resources

  • For Deep Learning, I think the best place to start is Siraj Raval’s videos. He is a great communicator. His videos contain simple explanations of many intricate concepts.

  • Christopher Olah’s blog: His blog is a treasure trove for Deep Learning. I am particularly fond of his diagrams.

  • Andrej Karpathy’s blog: This blog post (my first) was inspired by Karpathy’s blog.