Introduction to Neural Networks

"On Natural Language Processing, Part of Speech Taggers, Bert and Elmo, Neural Networks, Long Short-Term Memory Networks, Quick Long-Semester Ephemeral Nodes, and Transformers.
"

An introduction to NLP
Complete the following three-word sentence: Call me ____.
What popped into your head? Was it “Crazy,” “Maybe,” “By Your Name,” or “Ishmael”? The modern oracle of Delphi (Google) gets the latter three, but not the first one.
How are you able to fill in the blank? Neuroscientists would call it activation, or, if we get hand-wavy about the length of time involved, ‘priming.’ In essence, your brain implicitly knows about phrases and which concepts are likely to follow one another, and there were clues in the subtitle of this piece. When presented with the task above, your brain attempted to choose the best ending based on past learned knowledge and the recency with which it has read or listened to Herman Melville or Carly Rae Jepsen.
Some of the better minds of this generation have been hard at work to get computers to do a version of what you did when filling out the “call me” sentence. Historically, this work was done by linguists, and data scientists have more recently stepped into the fray.
On the academic side, one way to solve the problem of “how language works” is through things like Combinatory Categorial Grammar (CCG) parsers, which seek to learn the compositional units of language using the lambda calculus (the formal system underlying languages like LISP). Theoretically, one could use a CCG approach to come up with various practical candidates for answering the quiz at the beginning of this blog post.
A naïve and strictly computational approach suggests that the problem is not that we need the relative probability of all semantic words (as opposed to words like “of,” “the,” and “pffft!”), but that we need their conditional probabilities of appearing next, given some context. The problem is that even if we take this naïve approach and limit our vocabulary to the 3,000 most common words, we’d need up to 9,000,000 conditional probabilities – one for each of the 3,000 × 3,000 possible word pairs – to accurately predict the most likely next word in an arbitrary two-word phrase.
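To make the counting concrete, here is a minimal sketch in Python; the tiny “corpus” is made up purely for illustration, and real systems would estimate these probabilities from millions of sentences.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a large text collection (illustrative only).
corpus = "call me maybe . call me crazy . call me by your name . call me ishmael .".split()

# Count bigrams: how often each word follows a given previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Conditional probability P(next word | previous word), estimated from counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(p_next("me", "ishmael"))              # relative frequency of "ishmael" after "me"
print(bigram_counts["me"].most_common(1))   # the most likely filler for "call me ____"
```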
Of course, not only do we operate with rules of grammar, but also with implicit and higher-order expectations regarding how words, concepts, and semantic units of thought can be composed together. For example, English has strict rules governing the order of adjectives and nouns. To illustrate this, let’s look at the following translation of one of Basho’s most celebrated haikus:
old silent pond
frog hops in
splash
In the first line, Pond Old Silent isn’t grammatically correct, and Silent Old Pond sounds just a little weird, even though both communicate all of the necessary concepts. Rules of grammar can help us cut down the number of data-points we’d need to get a computer to do the fill-in-the-blank game. We might, for instance, tag the three words in the first line with their respective parts of speech: Old => Adjective, Silent => Adjective, and pond => Noun. If we did this for every set of three consecutive English words in, say, Wikipedia, when presented with “Call (verb) me (pronoun) _____” we could be fairly certain that the final word is less likely to be a preposition, like “of” or an adverb like “firmly.”
If we had enough time and resources, we could theoretically run a large amount of text through a state-of-the-art part-of-speech tagger, and then look at the likely distributions of various patterns. In technical parlance, these are often referred to as N-grams, which means collections of N words of text, or N parts of speech. (What’s the most common pattern of parts of speech for an arbitrary seven-word chunk in the English language? Only Google knows.)
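As a rough sketch of how this could look in practice – assuming the NLTK library is installed and its tokenizer and tagger data have been downloaded (resource names can differ slightly between NLTK versions); the sample text is ours, not a real corpus:

```python
import nltk
from collections import Counter

# One-time downloads for tokenization and tagging
# (names may vary slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Old silent pond. A frog hops in. Splash! Call me Ishmael."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)   # e.g. [('Old', 'JJ'), ('silent', 'JJ'), ('pond', 'NN'), ...]

# Collect part-of-speech trigrams (N-grams with N = 3).
pos_tags = [tag for _, tag in tagged]
trigrams = list(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
print(Counter(trigrams).most_common(3))   # most frequent POS patterns in this tiny sample
```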
Astute readers will note that one can find a word from most part-of-speech categories that could fit in the aforementioned sentence. What’s interesting about this exercise is that traditional part-of-speech categories are too broad for the practical purposes of predicting words or translating them – they abstract away all semantic content in favor of a syntactic category.
To rectify this, we need to know more about the semantic content of the word. It’d be nice if there were a way to encode this without having to refer to the word itself. In other words, something at a level of abstraction between the specificity of the word and its supervening part-of-speech category.
On the data science side of things, this problem can be solved by word embeddings: techniques that associate each word with a unique position in a multidimensional “semantic space.” BERT and ELMo are two of several approaches (from Google and AllenNLP, respectively) that allow one to assign each word in a sentence to a corresponding vector. Differences in these vectors roughly correspond to differences between the meanings of their respective words; the further the distance between the points the vectors describe, the more dissimilar the underlying words.
Word embeddings allow one to perform fast calculations of the semantic distance between two words. The best concrete illustration of semantic distance and word embeddings that we can mention is Jonathan Simon’s Entendrepreneur. Give it two words, and it comes up with conceptually similar portmanteaus and rhymes: one of the results you get from entering “Entropy” and “Dinner” is “Fluctuation Celebration.” The algorithm works by finding all words that either rhyme or have some common syllables within a certain semantic distance. (More details can be found here.)
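To make “semantic distance” concrete, here is a minimal sketch with made-up toy vectors; real embeddings from word2vec, GloVe, or BERT have hundreds of dimensions, but the cosine-similarity calculation is the same idea.

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- the numbers are invented purely for illustration.
vectors = {
    "dinner":  np.array([0.9, 0.1, 0.3, 0.0]),
    "supper":  np.array([0.8, 0.2, 0.4, 0.1]),
    "entropy": np.array([0.1, 0.9, 0.0, 0.6]),
}

def cosine_similarity(a, b):
    """Close to 1.0 means similar meaning; close to 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["dinner"], vectors["supper"]))    # high: similar words
print(cosine_similarity(vectors["dinner"], vectors["entropy"]))   # low: dissimilar words
```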
In BERT’s case, the neural network is trained over the Wikipedia corpus and is fed sentences “all at once.” The network is tasked with minimizing its error when predicting some of the words the researchers masked from the input. Once the training is done, one can feed BERT a sentence and get back a series of vectors that encode the semantic and syntactic content of the original words. Now, if BERT, ELMo, and other word-embedding approaches are all the output of neural networks, how do those neural networks work? This brings us to Act II.
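Before moving on, here is a minimal sketch of pulling those per-word vectors out of a pretrained BERT, assuming the HuggingFace transformers library and PyTorch are installed (the model weights download on first use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained BERT and its matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Feed BERT a whole sentence at once and read back one vector per token.
inputs = tokenizer("Call me Ishmael.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state   # shape: (1, num_tokens, 768)
print(token_vectors.shape)
```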
How those good old-fashioned neural networks work
Our favorite way to fearlessly dive into understanding how neural networks for natural language processing work is via an architecture known as Sequence2Sequence, or Seq2Seq. As their name implies, these neural networks can learn transformations from one sequence to another – and, as we’ll see, the two sequences need not be the same length.
Let’s use the problem of machine translation for illustration. Say you have a gold-standard corpus of paired sentences, where the first sentence is in the source language and the second is in the target language. One approach is to feed the pairs to a neural network such that it learns not just how to successfully transform (or “transduce,” if you want to be academic about it) one source-language sentence into the target language, but how to do it for any arbitrary sentence. One of the problems to overcome is that cross-language sentence equivalents aren’t necessarily the same length, so we’ll need to have some way of being flexible.
To pull this off, you might use a Recurrent Neural Network, or RNN. Nodes in RNNs are unlike their counterparts in ordinary feed-forward networks: in addition to the current input, each recurrent node also receives its own most recent output, effectively giving it a small amount of working memory. Hook many of these nodes up together, and you now have a way of learning contextual transformations between sequences. (Or, in other words, something more or less like learning very complex versions of “IF the previous few values follow pattern A, THEN output pattern B” statements.)
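Here is a minimal, untrained sketch of that recurrence in NumPy; the sizes and random weights are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

# Untrained weights, just to show the wiring of a single recurrent layer.
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))    # current input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # previous output -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One step: combine the current input with the node's previous output."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                     # empty "working memory"
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 dummy word vectors
    h = rnn_step(x_t, h)                      # memory carries context forward
print(h.shape)
```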
The key thing about recurrent units is that their function must be differentiable (and therefore continuous). This allows the network to compute the derivative of each node’s output and then update the node’s internal workings and weights via gradient descent. Without a differentiable function, the network’s prediction error cannot be back-propagated through the network, and thus the network cannot learn.
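In equation form, this is the familiar gradient-descent update that back-propagation makes possible, with η the learning rate and L the network’s prediction error:

```latex
% Each weight w is nudged against the slope of the loss L;
% the partial derivative only exists if the node's function is differentiable.
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
```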
In Seq2Seq architectures, two different recurrent neural networks are trained. The first (“the encoder”) reads in words one at a time until it hits a symbol that marks the end of the sentence. Then the encoder sends a fixed-length, interstitial representation of the sentence to the second network (“the decoder”), which outputs one word at a time (recurrently combining the previously outputted word and the encoder’s output) until the ensuing sentence is complete.
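A compact, untrained sketch of that encoder/decoder wiring, assuming PyTorch; the vocabulary sizes, token ids, and greedy decoding loop are illustrative choices, not the canonical implementation, and without training the “translation” is gibberish.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, HIDDEN, EOS = 1000, 1000, 64, 2   # hypothetical sizes and end-of-sentence id

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
    def forward(self, src_ids):
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden                          # fixed-length summary of the sentence

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)
    def forward(self, prev_word, hidden):
        output, hidden = self.rnn(self.embed(prev_word), hidden)
        return self.out(output), hidden        # scores for the next word + new state

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (1, 7))      # a dummy 7-word source sentence
hidden = encoder(src)
word = torch.zeros(1, 1, dtype=torch.long)     # start-of-sentence token (id 0 here)
translation = []
for _ in range(20):                            # decode one word at a time
    scores, hidden = decoder(word, hidden)
    word = scores.argmax(dim=-1)
    if word.item() == EOS:
        break
    translation.append(word.item())
print(translation)
```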
When researchers first tried this approach, they saw improvements, but they ran into a problem: the recurrent unit’s memory wasn’t “big enough” to track so-called long-range dependencies, like this abrupt meta-reference to the poem that appeared in the previous section. Our working memories allow us to know that we’re referring to the bit about the frog, but a computer wouldn’t know that automatically. Of course, this is an exaggerated version of a long-range dependency; they typically happen within a sentence or paragraph. Linguistically, this is also known as co-reference: you, as a reader, know what the word “this” refers to at the beginning of this sentence, but an RNN won’t without further help.
One way this problem was overcome was via a bi-directional RNN, where half of the nodes read the input sequence in reverse. Another way it was addressed was by effectively giving a node the ability to implicitly vary the length of its memory. Because we are dealing with neural networks and operations must be differentiable, this was achieved by applying weights to past steps. If you extend these ideas further, you get what’s known as a Long Short-Term Memory (LSTM) network, where each node in the network can be made to forget, update, or pass along information about the recent linguistic context. Christopher Olah’s fantastic blog post does a great job of illustrating the nitty-gritty.
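A minimal sketch of those gates in NumPy, following the standard LSTM formulation Olah describes; the biases are omitted for brevity and the weights are random and untrained, purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 16                                        # hidden size (illustrative)
rng = np.random.default_rng(0)
# One weight matrix per gate, acting on [previous output, current input].
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(H, 2 * H)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)                      # forget gate: what to erase from memory
    i = sigmoid(W_i @ z)                      # input gate: what new info to write
    c_tilde = np.tanh(W_c @ z)                # candidate memory content
    c = f * c_prev + i * c_tilde              # updated cell state (the long-term memory)
    o = sigmoid(W_o @ z)                      # output gate: what to pass along
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(H)
for x_t in rng.normal(size=(5, H)):           # a sequence of 5 dummy inputs
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```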
To reiterate: in a sense, the breakthrough with LSTMs was making operations like forgetting, or shrinking the effective memory size of a node, differentiable, so that the machine can learn when those operations are most appropriate. However, in some sense LSTM nodes are a hack to fix an architecture that insists on encoding and decoding one word at a time.
The Transformer
One of the key advances in improving RNNs has been the addition of what is called “attention.” Without attention, all parts of a Seq2Seq encoder’s output are weighted equally, even though the meaning of the next word a decoder is working on may be more dependent on some parts of the sentence than others. In the original paper by Bahdanau et al., the attention mechanism (in a bi-directional neural network) was calculated by training a feed-forward (FF) neural network whose inputs were a concatenation of the forwards and backwards recurrent nodes, and whose intended output was a calculation of word-level alignment. Thus they used the FF neural network to create a function “which scores how well the inputs around position j and the [equivalent] output at position i match.” Effectively, this gave the rest of the RNN a prediction of where in the output sentence the equivalent of the input word would land.
Practically speaking, attention can be thought of as a matrix that, for a given set of input symbols and output symbols, stores a weight governing how much each input word and each output word have to do with each other.
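A simplified sketch of such a matrix in NumPy, using plain dot products as the relevance score rather than Bahdanau et al.’s learned feed-forward scorer; every vector here is a random placeholder for a real encoder or decoder state.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H = 16
encoder_states = rng.normal(size=(7, H))    # one vector per input word (7 words)
decoder_states = rng.normal(size=(5, H))    # one vector per output word so far (5 words)

# Score every (output word, input word) pair, then normalize each row:
# attention[i, j] = how much output word i should "look at" input word j.
scores = decoder_states @ encoder_states.T  # (5, 7) raw relevance scores
attention = softmax(scores, axis=1)         # each row sums to 1

# The context for each output word is a weighted mix of encoder states.
contexts = attention @ encoder_states       # (5, H)
print(attention.shape, contexts.shape)
```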
Attentional mechanisms capture the idea that words, on their own, are not enough to communicate meaning. The context of words in a sentence matters, and the units of meaning communicated by a sentence do not cleanly map onto individual words. Semantic content, word-sense disambiguation, intent, tone, emotional resonance, and so on are the product of both the bottom-up meaning contributed by each word and the top-down meaning contributed by context.
Initially, attention was seen as RNN accouterment. But in late 2017, Vaswani et al. (Googlers, mostly) turned this around. They found that parallelized attention layers, or “attention heads,” were enough to do the heavy lifting without any recurrence, and they called this new architecture the “Transformer.”
Unlike Seq2Seq, transformers treat the whole sentence at once. In a way, transformers harken back to the older phrase-based approaches to machine translation – both do a version of mapping between words in one language and words in another. However, one of Vaswani et al.’s insights was that to effectively translate a sentence, you don’t just need to know how the meanings of words in one language map to the meanings in another language. You also need a map of how all of the words in a sentence relate to each other, and they did this through so-called self-attention. Here’s a quote from their blog post:
“In each step, [the encoder] applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position.”
When a transformer is being trained, it keeps track of a) how all of the words in an input sentence relate to each other, b) how the input words map to output words, and c) how all of the words of the output relate to each other. By keeping track of these three relationships it is possible to adjust the mapping of input to output words to respect grammatical rules without sacrificing the semantic content.
Note that the transformer algorithm is more complex than what we have covered here. For example, the transformer’s attention mechanism also utilizes similarity between the input and previously learned key-value pairs, and works by what is effectively fuzzy key-value matching. More details on Transformer architectures can be found here and here.
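As a rough sketch of that fuzzy key-value matching, here is single-head scaled dot-product self-attention in NumPy; the projection matrices would be learned in a real Transformer and are random, untrained placeholders here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_words, d_model, d_k = 6, 32, 32            # sentence length and sizes (illustrative)
X = rng.normal(size=(n_words, d_model))      # one embedding per word, whole sentence at once

# Learned projections in a real model; random stand-ins here.
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values for every word
scores = Q @ K.T / np.sqrt(d_k)              # fuzzy key matching: every word vs. every word
weights = softmax(scores, axis=-1)           # (n_words, n_words) self-attention matrix
output = weights @ V                         # each word becomes a weighted mix of all words' values
print(weights.shape, output.shape)
```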
To bring things full circle, the full name of those BERT similarity vectors mentioned in the previous section is Bidirectional Encoder Representations from Transformers. Without transformers, we wouldn’t have things like Entendrepreneur, and the world would be a sadder place.
Conclusion
To recap, NLP researchers are currently experimenting with different architectures for machine translation tasks. These have included RNNs, LSTMs, Seq2Seq, and, more recently, Transformers. Some of these architectures (like LSTMs) can theoretically be applied to any sequence; others, like Transformers, are tuned more for language than for other problems like forecasting electricity demand. (By the way, if you want to dive into Transformers, the NLP/chatbot company HuggingFace recently released a library that allows for interoperability between different transformer representations.)
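As a quick taste of that library (a minimal sketch; the pretrained model downloads on first run), here is BERT playing the fill-in-the-blank game from the top of this post:

```python
from transformers import pipeline

# A masked-language-model pipeline backed by a pretrained BERT.
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT's version of "Call me ____", using its [MASK] token.
for prediction in fill("Call me [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```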
When considering these cutting-edge algorithms, it is important to be aware of self-induced hype. Though these NLP architectures are fantastic for what they are, there are, of course, certain problems they cannot solve. One such problem happens to be our favorite variant of the machine translation problem: what you might call linguistic style transfer. In essence, it asks: given an arbitrary sentence, how might a machine re-write it in the style of Melville, or of P.G. Wodehouse when he writes about Jeeves?
One reason this problem cannot be solved by current NLP architectures is that the type of high English found in Jeeves is unique to P.G. Wodehouse, and maybe a few really good parodies. Same with Melville. All machine learning tasks currently require a large amount of training data, and even if a neural network could theoretically do linguistic style transfer, it turns out that, with the exception of a few things like various versions of the Bible, the requisite training data doesn’t really exist. So until someone comes up with a better neural network architecture, it might be slow going for a while. (Having said that, Neurowriter is something to keep an eye on.)