Aarne Talman

PhD Student in Language Technology at University of Helsinki

Research Update - Developing a Baseline Natural Language Inference System

19 Mar 2018

Some of my friends and former colleagues have been asking for an update on my research, so here comes the first one.

The first two weeks as a full-time PhD student are now behind me, and it has been really amazing to be back in academia. The first task I set for myself was to develop a strong baseline system for natural language inference (NLI).

Natural language inference is the problem of determining whether a natural language hypothesis can be inferred from a natural language premise. A simplified example of such a task would be to determine whether h below can be inferred from p:

p    So far this week, four mine disasters have claimed the lives of at least 60 workers and left 26 others missing

h    Mine accidents cause deaths in China

Although the above example is a very simple one, and humans are very good at recognizing the validity of such inferences, for computers this has been quite a hard task. The ability to reason with language is a fundamental ingredient of natural language understanding and arguably of AI more generally.

In the past, many NLI systems used either a rule-based or some “shallow” machine learning approach. Recently, however, neural network models have gained a lot of popularity following the publication of the Stanford Natural Language Inference (SNLI) corpus, which is large enough to allow the development of deep learning models. SNLI, like other similar datasets, contains a large set of sentence pairs labelled for classification with the labels entailment, contradiction, and neutral.
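As a rough illustration of what such labelled data looks like to a classifier, here is a minimal sketch in Python. The class name `NLIExample`, the helper `one_hot`, the label-to-index order, and the two sentence pairs are all made up for illustration; they are not taken from SNLI or from my system.

```python
from dataclasses import dataclass

# The three SNLI labels; the index order here is an arbitrary assumption.
LABELS = ("entailment", "contradiction", "neutral")
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair with its gold label (hypothetical container)."""
    premise: str
    hypothesis: str
    label: str

    def label_id(self) -> int:
        # Integer class index used as the training target.
        return LABEL_TO_ID[self.label]


def one_hot(label_id, n_classes=len(LABELS)):
    """Turn a class index into a one-hot target vector."""
    return [1 if i == label_id else 0 for i in range(n_classes)]


# Invented sentence pairs, only to show the shape of the data.
examples = [
    NLIExample("A man is playing a guitar on stage.",
               "A person is making music.", "entailment"),
    NLIExample("A man is playing a guitar on stage.",
               "The stage is empty.", "contradiction"),
]

targets = [one_hot(ex.label_id()) for ex in examples]
```

Each pair then becomes two token sequences plus one three-way target, which is what makes NLI a plain classification problem from the model's point of view.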

My first goal has been to develop a baseline neural network model trained on the SNLI corpus. I wanted a system with good enough accuracy that I can start experimenting with different model architectures. So far the progress has been much better than I expected. During the first two weeks I developed a simple system in Python and Keras, adapting the architecture used by Bowman et al. in their 2015 paper. The architecture contains:

The system turned out to be quite decent: so far I have been able to reach a test accuracy of 83.5% (300-dimensional GloVe embeddings + 300-dimensional LSTM + 600-dimensional MLP). This is still far from the state of the art, which is 86.3% for sentence encoding-based models and 89.3% for other NN models (utilizing e.g. attention). However, it is a very good starting point, as it improves on the original baseline for the 300D LSTM model used by Bowman et al. in their 2016 article by 2.9 percentage points.
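To make the dimensions concrete, here is a minimal Keras sketch of a sentence encoding-based model of this kind. It is not my actual code: the function name `build_nli_baseline`, the shared encoder for both sentences, the concatenation of the two sentence vectors, the two-layer ReLU MLP, the optimizer, and `max_len` are all assumptions made for illustration. Only the 300/300/600 dimensions and the three output classes come from the description above.

```python
from tensorflow.keras import Model, layers


def build_nli_baseline(vocab_size, max_len=50, embed_dim=300,
                       lstm_dim=300, mlp_dim=600, n_classes=3):
    """Sketch of an LSTM sentence-encoder NLI classifier (assumed layout)."""
    premise = layers.Input(shape=(max_len,), dtype="int32", name="premise")
    hypothesis = layers.Input(shape=(max_len,), dtype="int32", name="hypothesis")

    # Word embeddings; in practice these would be initialised from
    # pretrained GloVe vectors. Padding is not masked in this sketch.
    embed = layers.Embedding(vocab_size, embed_dim, name="embedding")

    # One LSTM shared by both sentences (an assumption); it returns the
    # final hidden state as a fixed-size sentence vector.
    encoder = layers.LSTM(lstm_dim, name="sentence_encoder")
    premise_vec = encoder(embed(premise))
    hypothesis_vec = encoder(embed(hypothesis))

    # Combine the two sentence vectors and classify with an MLP.
    joint = layers.concatenate([premise_vec, hypothesis_vec])
    hidden = layers.Dense(mlp_dim, activation="relu")(joint)
    hidden = layers.Dense(mlp_dim, activation="relu")(hidden)
    output = layers.Dense(n_classes, activation="softmax")(hidden)

    model = Model(inputs=[premise, hypothesis], outputs=output)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_nli_baseline(vocab_size=40000)` would give a model that takes two padded sequences of token ids and outputs a probability for each of the three labels; the vocabulary size here is again just a placeholder.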

I’ve also experimented with architectures containing an ensemble of multiple similar models that are combined by averaging the weights at the final layer. So far this has only helped to reduce overfitting.
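One simple way to combine several similar models is to average what they produce at the final layer. The sketch below averages the per-class output probabilities of each model, which is an assumption about the kind of averaging involved and not necessarily what my ensemble does. The function names and the example probabilities are invented for illustration.

```python
LABELS = ("entailment", "contradiction", "neutral")


def average_predictions(model_probs):
    """Average per-class probabilities over models.

    model_probs: one probability list per model, each in LABELS order.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(probs[c] for probs in model_probs) / n_models
            for c in range(n_classes)]


def ensemble_label(model_probs):
    """Pick the label with the highest averaged probability."""
    avg = average_predictions(model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return LABELS[best]


# Invented outputs of three models for a single sentence pair.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.2, 0.3],
]
prediction = ensemble_label(probs)
```

Averaging smooths out the idiosyncrasies of individual models, which fits with the ensemble mainly reducing overfitting.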

So what’s next? My plan is to continue experimenting with different architectures, but I also want to see how changing the semantic representation of the words and sentences could improve the system. Now that I have a decent baseline I can start looking into this challenge. I also plan to test the system with other datasets, like the Multi-Genre NLI Corpus (MultiNLI) and SciTail.

As a last note: I’m also starting to look into neural machine translation, but more on that later.