The holy grail of Computer Science and Artificial Intelligence research is to develop programs that can combine knowledge from multiple domains to perform tasks that humans are currently good at. In this spirit, Image Captioning is a great test-bed for AI algorithms, since it involves building an understanding of an image and then generating meaningful sentences about it. Formally, Image Captioning is defined as the process of automatically generating descriptions of the scene shown in an image. The aim of this post is not to provide a full tutorial on Image Captioning; for that I would encourage you to go through Andrej Karpathy’s presentation and Google’s Show and Tell paper.
This post summarises my approach to implementing a captioning model for the browser (that is, one doing all of its computation in the browser) using Tensorflow.js.
Deploying deep learning models on the web is challenging since the web imposes tight size constraints. This forced me to reduce the number of parameters in the model substantially, which in turn decreased accuracy. So the captions may sometimes be wayward.
Most state-of-the-art neural architectures have been trained on the Microsoft Common Objects in Context (MSCOCO) dataset. The dataset weighs around 25GB and contains more than 200k images across 80 object categories, with 5 captions per image. Unfortunately, I am an undergrad and don’t have access to the computational power required to process such a huge dataset.
Therefore I used the Flickr8k dataset provided by the University of Illinois Urbana-Champaign. The dataset is 1GB in size and consists of 8k images, each with 5 captions. Thanks to its relatively small size, I could easily use Flickr8k with Google Colab notebooks.
As shown above, the complete neural captioning architecture can be divided into two parts.
- The Feature Extractor: takes an image as input and outputs a low-dimensional, condensed representation of the image.
- The Language Model: takes the condensed representation and a special START token and generates a caption.
We stop the caption generation process when the language model emits an END token or the length of the caption exceeds a threshold. In the demo, this threshold has been set to 40, which is the maximum caption length in the Flickr8k dataset.
Feature Extraction from Image: MobileNets
Since the input data is an image, Convolutional Neural Networks (CNNs) are clearly an attractive option as feature extractors. For high accuracy, most image captioning projects on GitHub use Inception or Oxford’s VGG model. Though fine for a desktop demonstration, these models aren’t suited to a fully front-end demo, as they are quite heavy and compute-intensive.
So I turned to MobileNet, a class of light, low-latency convolutional networks designed specifically for resource-constrained use-cases. In the complete architecture, MobileNet generates a low-dimensional (1000-dimensional tensor) representation of the input image, which is fed to the language model for sentence generation.
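In the browser, this extraction step amounts to turning the uploaded image into a normalised tensor and running it through MobileNet. A minimal sketch is below; the function name `extractFeatures` and its mockable signature are my own, and I'm assuming the standard MobileNet preprocessing (224×224 input, pixel values scaled to [-1, 1]) and a recent TensorFlow.js (older releases exposed `tf.fromPixels` instead of `tf.browser.fromPixels`).

```javascript
// Hypothetical browser-side feature extraction. `tf` is the TensorFlow.js
// namespace and `mobilenet` a loaded tf.LayersModel; both are passed in
// explicitly so the sketch is easy to exercise outside a browser.
function extractFeatures(tf, mobilenet, imgElement) {
  return tf.tidy(() => {
    const pixels = tf.browser.fromPixels(imgElement) // [h, w, 3] uint8
      .resizeBilinear([224, 224])                    // MobileNet input size
      .toFloat()
      .div(127.5)
      .sub(1)                                        // scale to [-1, 1]
      .expandDims(0);                                // add the batch dim
    return mobilenet.predict(pixels);                // [1, 1000] features
  });
}
```

`tf.tidy` disposes the intermediate tensors, which matters in a long-running page.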
Natural Language Generation: LSTMs
To generate captions, Long Short-Term Memory (LSTM) layers were used. LSTM-based models are in most cases the de-facto choice for sequence modelling tasks. They are actually a specialised variant of a larger class of models called Recurrent Neural Networks, which I have described in a previous post.
The feature vector obtained from MobileNet is applied at every time step of the LSTM layer. This is essentially a design choice, and I was influenced by the repositories I referred to.
Skeleton Code for the Model and Model Summary
The above image shows the code structure of the architecture. MobileNet, as described above, is used to output a 1000-dimensional representation of the input image. We take this representation and feed it into the image model, which contains a Keras layer called RepeatVector that creates copies of it to be fed into every time step of the LSTM language model, as discussed above.
The caption model is built from an Embedding lookup and a TimeDistributed Dense layer.
Finally, both modules are merged in the following way:
The model obtained can be summarised as follows:
Caption Generation in Browser
The trained Keras model was sharded using tfjs-converter. The shards, as well as the model.json file obtained from the converter, are then loaded in the browser using the following code snippet:
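The exact snippet isn't shown above; a minimal sketch of the loading step follows, assuming a recent TensorFlow.js where the loader is `tf.loadLayersModel` (older releases called it `tf.loadModel`). The function name and URL are illustrative; `tf` is passed in as a parameter only so the sketch can be exercised outside a browser.

```javascript
// Hypothetical loader. `modelUrl` points at the model.json emitted by
// tfjs-converter; the weight shards are fetched automatically from the
// paths listed inside that file.
async function loadCaptionModel(tf, modelUrl) {
  const model = await tf.loadLayersModel(modelUrl);
  return model;
}
```

A warm-up `model.predict` on a dummy input after loading can hide the one-off WebGL shader compilation cost from the first real caption.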
For the prediction part, I used a normal max (greedy) search, which was implemented using the following subroutine:
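The subroutine itself isn't reproduced here, but the idea is the loop described earlier: start from the START token, repeatedly take the most probable next word, and stop on END or at 40 words. Below is a minimal sketch; the token ids and the `predictNext` callback are placeholders (in the real app it would wrap `model.predict` plus an `argMax` over the output tensor).

```javascript
// Greedy (max) caption search: feed the partial caption back in and take
// the argmax word each step. All names here are illustrative.
const START = 1, END = 2, MAX_LEN = 40;

function greedyCaption(predictNext, features) {
  const tokens = [START];
  while (tokens.length <= MAX_LEN) {
    // predictNext(features, partialCaption) -> id of the argmax next word
    const next = predictNext(features, tokens);
    if (next === END) break; // model signalled the end of the caption
    tokens.push(next);
  }
  return tokens.slice(1);    // drop the START token
}
```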
The Hard Part: Engineering Issues
The primary issues faced while developing the demo were:
- Reducing model size while still retaining significant accuracy.
- Learning how to use the FileReader API in order to add an upload option, something I hadn’t done before.
- Orthogonal initialization: Tensorflow.js warns that this will slow down the application, but I couldn’t find a solution. I think it is because of the LSTM layers used in the model.
- The while loop generating the caption (in caption()) blocks the UI thread, making the app unresponsive for some time. I tried a lot of solutions, including making the loop completely recursive as suggested here, but couldn’t succeed. I would welcome any help on these issues!
The complete code can be found here. The training code is in the form of a Google Colab Jupyter notebook for easy reference. All you need to do to run it is first upload the Flickr8k dataset to your Google Drive, then upload the notebook to Google Colab and execute it.