Why Natural Language Processing Needs Both Data and Linguistics

The rise of Amazon’s Alexa, Apple’s Siri and Google Now has thrust the field of natural language processing (NLP) into the limelight. Competition between the giants of the technology sector has also fuelled research in Dialogue, Question Answering, and Natural Language Generation. Another reason for this renewed interest is that deep learning approaches can tackle tasks that have been difficult to model with older techniques.

For all the current focus on and hype around deep learning – also known as multi-layered neural networks – we in the Winton Natural Language Patterns team believe that deep learning underperforms when only a small amount of training data is available. We also believe that combining machine learning techniques such as these with solid linguistic theory can produce better-quality results for small-data problems. Indeed, we commend the general ideas (if not all of the strong language!) in a Yoav Goldberg blog post, which emphasises the importance of taking a multi-pronged approach rather than relying solely on deep learning.

At the recent annual Association for Computational Linguistics (ACL) conference, many contributors picked up on the importance of data-driven approaches like deep learning for NLP – while also extolling the benefits of using linguistic theory in parallel.

Noah Smith gave an eloquent defence of the significance of linguistics to NLP in his keynote speech, arguing that linguistic models can provide helpful structure for deep learning models. For example, a deep learning model that explicitly learns to parse text in parallel with its main task will be biased towards producing grammatical output – a phenomenon known as inductive bias. Such bias should reduce the amount of data and time required to train NLP models.

Nonetheless, the pervasiveness of deep learning in NLP was one of the major themes of the conference. Most notably, papers related to “Generation and Summarization” accounted for ~17% of this year’s submissions; in 2014, the topic was not big enough to warrant a category of its own. The other winner was the NLP-applications category, comprising Information Extraction, Retrieval, Question Answering (QA) and related tasks. In contrast, traditional NLP tasks such as Machine Translation and Document Categorization saw a significant drop in popularity.

ACL submissions in 2014 (top) and 2017 (bottom). Figures recreated from ACL data

A second important theme of the conference was using deep learning to exploit multiple modalities in NLP tasks. Because deep learning models encode words, documents, images, videos, parse trees, and semantic graphs as vector-based representations, it is possible to co-train on, or translate between, these representations. Mirella Lapata expanded on this theme in her keynote, describing Machine Translation, Paraphrasing, QA, Video Captioning, Code Generation, and Document-level Natural Language Generation as translations from different modalities into a textual representation.
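To make the shared-representation idea concrete, here is a toy PyTorch sketch of our own – not any presenter’s architecture, and with made-up dimensions – in which two modality-specific encoders project into one shared vector space:

```python
import torch
import torch.nn as nn

# Toy illustration: each modality gets its own encoder, but all encoders
# project into the same fixed-size vector space, so downstream components
# can treat text and images interchangeably.
dim = 128
text_encoder = nn.Embedding(10000, dim)      # token ids -> vectors
image_encoder = nn.Linear(2048, dim)         # pre-extracted CNN features -> vector

tokens = torch.randint(0, 10000, (1, 12))    # a 12-token sentence
text_vec = text_encoder(tokens).mean(dim=1)  # (1, dim) mean-pooled sentence vector
image_vec = image_encoder(torch.randn(1, 2048))  # (1, dim) image vector

# Both vectors live in the same space, so a single text decoder could be
# conditioned on either, and they can be compared directly:
similarity = torch.cosine_similarity(text_vec, image_vec)
```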

Eriguchi et al. presented work that neatly combined both of these themes. They built a single neural network that is both a dependency parser and a machine translation model, co-trained on translated documents and labelled parse trees. At test time, the network produces both parse trees and translations, and its translations are of higher quality than those of an equivalent Neural Machine Translation model trained on translation alone.
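A minimal sketch of this style of joint training – our simplification in PyTorch with invented dimensions, not Eriguchi et al.’s actual model – shows how one shared encoder can feed two output heads:

```python
import torch
import torch.nn as nn

# A shared encoder feeds a translation head and a parse-label head.
# Training on the summed losses gives the encoder the grammatical
# inductive bias discussed above.
class JointParserTranslator(nn.Module):
    def __init__(self, vocab=10000, hidden=256, n_parse_labels=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.translate_head = nn.Linear(hidden, vocab)       # main task
        self.parse_head = nn.Linear(hidden, n_parse_labels)  # auxiliary task

    def forward(self, src):
        h, _ = self.encoder(self.embed(src))
        return self.translate_head(h), self.parse_head(h)

model = JointParserTranslator()
src = torch.randint(0, 10000, (4, 20))   # batch of source sentences
trans_logits, parse_logits = model(src)
# total_loss = translation_loss + lam * parsing_loss
```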

An issue with multi-modal learning is the small amount of training data available for supervised learning, and Lapata proposed reinforcement learning as a solution to this problem. Reinforcement learning views neural networks as agents interacting with textual environments, generating words as a sequence of actions. In NLP, it is best suited to tasks with subjective evaluation criteria rather than traditional string-matching metrics. A related approach by Kreutzer et al. adapts a trained Neural Machine Translation system to new domains through bandit learning. Such approaches also tie in with another up-and-coming class of models (widely talked about at NIPS 2016) called “Generative Adversarial Networks” (GANs). The key difference between the two is the reward: in reinforcement learning it is often a predefined function of the model’s predictions, whereas in a GAN it is learned by the “discriminator” part of the network.
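The reward distinction can be shown in a few lines. The sketch below is our own illustration with stubbed reward values; reinforce_loss is a hypothetical helper, not code from any of the cited papers:

```python
import torch

# Under a REINFORCE-style objective, the generator's loss weights the
# log-probability of its sampled output sequence by a scalar reward.
def reinforce_loss(log_probs, reward):
    # log_probs: (seq_len,) log-probabilities of the sampled tokens
    # reward: scalar score for the whole generated sequence
    return -(log_probs.sum() * reward)

log_probs = torch.log(torch.rand(10))  # stand-in for a sampled sentence

# Reinforcement learning: the reward is a predefined function of the
# output, e.g. BLEU against a reference (value stubbed here).
rl_reward = 0.42

# GAN: the reward is learned - a discriminator network, trained to tell
# real text from generated text, scores the sample (value stubbed here).
gan_reward = 0.87

loss = reinforce_loss(log_probs, rl_reward)
```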

Can we infer any future trends from the conference? In terms of models, the bidirectional recurrent neural network with attention is currently dominant. However, recurrent networks are expensive in both training and prediction. As a possible alternative to recurrence, Gehring et al. demonstrate that strong and fast NLP models can be built from convolutional networks with attention (see the sketch below). We also believe that the trend towards dialogue and generation tasks will continue. One interesting upcoming task, generating sports summaries from structured data, is described by Wiseman et al., and we look forward to their presentation at Empirical Methods in Natural Language Processing (EMNLP).
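The sketch below – our own PyTorch illustration, not Gehring et al.’s architecture – shows why convolutions are attractive: the whole sequence is processed in parallel rather than one step at a time.

```python
import torch
import torch.nn as nn

# Each 1-D convolution layer sees only a fixed local window, but every
# position is computed in parallel, unlike the step-by-step recurrence
# of an RNN.
conv_encoder = nn.Sequential(
    nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
)

x = torch.randn(8, 256, 40)   # (batch, channels, sequence length)
h = conv_encoder(x)           # same shape, no sequential dependency
```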

ACL is highly regarded by both academia and industry. Winton attended ACL 2017 in Vancouver, along with academics from the world’s best universities and employees of Amazon, Apple, Baidu, Facebook, Google, and Tencent.

Below, we link to other work presented at the conference that we enjoyed:

Affect-LM: A Neural Language Model for Customizable Affective Text Generation

This paper is a step towards emotion-aware natural language generation in which the type and strength of the desired emotion are tunable. The model is a recurrent neural network language model with a modified softmax layer that promotes different words depending on the emotion category of the surrounding words. The emotion categories are inferred during training using Linguistic Inquiry and Word Count (LIWC).
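As a rough schematic of the mechanism – our paraphrase with made-up dimensions, not the paper’s exact equations – the standard language-model logits receive an additive, affect-dependent bias:

```python
import torch

d, k, vocab = 128, 10, 5000
W = torch.randn(vocab, d)        # standard output word representations
V = torch.randn(vocab, k)        # per-word affect-category weights
h = torch.randn(d)               # RNN hidden state at the current step
e = torch.zeros(k); e[2] = 1.0   # one-hot affect category (from LIWC)
beta = 1.5                       # tunable emotion strength

# The affect term promotes words matching the current emotion category.
logits = W @ h + beta * (V @ e)
probs = torch.softmax(logits, dim=-1)
```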

Neural AMR: Sequence to Sequence Models for Parsing and Generation

This work describes a model for mapping between a semantic graph representation (Abstract Meaning Representation, AMR) and text. To circumvent the scarcity of labelled data, the training procedure makes clever use of unlabelled data for text generation.

Hafez: an Interactive Poetry Generation System

A poetry generation system, where users can polish the generated output with style configurations. Available as an Alexa skill!

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Dialogue systems are hard to evaluate because a system can generate many diverse yet acceptable responses, while traditional NLP evaluation metrics compare output against a small number of similar references. This work describes a deep learning model for evaluating the responses of a dialogue system, and the challenges faced when building such a model-based evaluator.

Detecting annotation noise in automatically labelled data

Generating large, high-quality annotated datasets is an expensive process. Here, an unsupervised generative model is combined with human supervision obtained through active learning. The approach detects a large number of annotation errors with few false positives.