For today’s paper summary, I will be discussing one of the “classic”, pioneering papers on Language Translation, from 2014 (!): “Sequence to Sequence Learning with Neural Networks” by Ilya Sutskever et al.
“Sequence to Sequence Learning with Neural Networks” was one of the pioneering papers to show that Deep Neural Nets can be used to perform “End to End” Translation. The paper demonstrates that an LSTM can be used with minimal assumptions about the sequence structure, proposing a 2-LSTM (“Encoder”-“Decoder”) architecture to do Language Translation from English to French, and showing the promise of Neural Machine Translation (NMT) over Statistical Machine Translation (SMT).
To highlight again, please keep in mind that the paper is from 2014, before frameworks such as TF or PyTorch had been open sourced and when DNN(s) were just starting to show promise. Many ideas presented in the paper might therefore seem very obvious to us today.
The task is to translate a “Sequence” of words (a sentence) from English to French.
The DNN techniques of the time expected inputs and outputs of a fixed dimensionality, which was a limitation for NLP and Speech, where sequence lengths are not known in advance.
The paper proposes using 2 Deep LSTM Networks:
The first one acts as an Encoder: it takes your input and maps it into a fixed-dimension vector.
The second acts as a Decoder: it takes the fixed vector and maps it to an output sequence.
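To make the Encoder-Decoder split concrete, here is a minimal sketch in plain Python. It is not the paper's model: the weights are random and untrained, a simple tanh recurrent step stands in for the LSTM cell, decoding is greedy (the paper uses a beam search), and the vocabularies, `HIDDEN` size and function names are all made up for illustration. The only point is that inputs of any length end up as a vector of the same size, and the output length is decided by the decoder.

```python
import math
import random

random.seed(0)

HIDDEN = 4  # size of the fixed-dimension vector (hypothetical, tiny on purpose)
SRC_VOCAB = ["i", "like", "tea", "very", "much"]   # toy English vocabulary
TGT_VOCAB = ["<eos>", "j'", "aime", "le", "thé"]   # toy French vocabulary


def rand_matrix(rows, cols):
    """Random weights; a real model would learn these with SGD."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def step(w_in, w_rec, token_id, hidden):
    """One tanh recurrent step, used here as a stand-in for an LSTM cell."""
    recurrent = matvec(w_rec, hidden)
    return [math.tanh(w_in[token_id][i] + recurrent[i]) for i in range(HIDDEN)]


# Separate parameters for the encoder and the decoder.
enc_in, enc_rec = rand_matrix(len(SRC_VOCAB), HIDDEN), rand_matrix(HIDDEN, HIDDEN)
dec_in, dec_rec = rand_matrix(len(TGT_VOCAB), HIDDEN), rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(len(TGT_VOCAB), HIDDEN)


def encode(tokens):
    """Read the whole input and return the last hidden state."""
    hidden = [0.0] * HIDDEN
    for token in tokens:
        hidden = step(enc_in, enc_rec, SRC_VOCAB.index(token), hidden)
    return hidden  # always HIDDEN numbers, whatever the input length


def decode(vector, max_len=10):
    """Greedily emit target words until <eos> or max_len is reached."""
    hidden, prev_id, output = vector, TGT_VOCAB.index("<eos>"), []
    for _ in range(max_len):
        hidden = step(dec_in, dec_rec, prev_id, hidden)
        scores = matvec(dec_out, hidden)
        prev_id = scores.index(max(scores))
        if TGT_VOCAB[prev_id] == "<eos>":
            break
        output.append(TGT_VOCAB[prev_id])
    return output


short_vec = encode(["i", "like", "tea"])
long_vec = encode(["i", "like", "tea", "very", "much"])
print(len(short_vec), len(long_vec))  # both equal HIDDEN
print(decode(short_vec))              # untrained, so the words are meaningless
```

Since nothing is trained here, the decoded words carry no meaning; what matters is the shape of the data flowing between the two networks.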
The Model
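The LSTM is tasked with estimating the conditional probability of a target sequence given an input sequence. It first reads the input to obtain a fixed-dimensional representation, taken from the last hidden state of the LSTM, and then generates the target sequence from that representation. The sequence generated using this probability may have a length different from the source text.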
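Written out, the conditional probability looks like this, as I read the paper. Here x_1, …, x_T is the input sequence, y_1, …, y_{T'} is the target sequence (T' may differ from T), and v is the fixed-dimensional vector produced by the encoder:

```latex
p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)
  = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})
```

Each factor on the right is a softmax over all the words in the vocabulary, so the decoder predicts one word at a time, conditioned on v and on the words it has already produced.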
Two LSTM(s) (Encoder-Decoder): Using separate LSTMs for the input and the output sequence adds model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously.
“Deep LSTM(s)”: The paper reports that deep LSTM(s) significantly outperform shallow ones, and uses an LSTM with 4 layers.
Reversing the order of Input: The paper really highlights the trick of reversing the order of the words in the input sequence (but not the output sequence), which makes it “easier for SGD” to “establish communication” between input and output. The LSTM also did much better on long sentences, not only on the early parts of the output. The authors suggest that this might be due to a reduced “minimal time lag”: the average distance between corresponding source and target words is unchanged, but the first few source words end up very close to the first few target words.
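The “minimal time lag” argument is easy to check on the paper's own toy example, where the sentence a, b, c is mapped to α, β, γ. The sketch below is only an illustration: the helper names are mine, and it assumes an idealized one-to-one, in-order alignment between source and target words, which real sentence pairs rarely have.

```python
def reverse_source(source_tokens, target_tokens):
    """Reverse only the source sentence; the target stays in its normal order."""
    return list(reversed(source_tokens)), list(target_tokens)


def time_lags(source_tokens, target_tokens, alignment):
    """Distance between each source word and its aligned target word,
    measured in the concatenated sequence (source followed by target)."""
    lags = []
    for src_word, tgt_word in alignment:
        src_pos = source_tokens.index(src_word)
        tgt_pos = len(source_tokens) + target_tokens.index(tgt_word)
        lags.append(tgt_pos - src_pos)
    return lags


source = ["a", "b", "c"]
target = ["alpha", "beta", "gamma"]
# Assumed alignment: a->alpha, b->beta, c->gamma.
alignment = [("a", "alpha"), ("b", "beta"), ("c", "gamma")]

plain_lags = time_lags(source, target, alignment)

rev_source, rev_target = reverse_source(source, target)
reversed_lags = time_lags(rev_source, rev_target, alignment)

print(plain_lags)     # [3, 3, 3]
print(reversed_lags)  # [1, 3, 5]
```

The average lag is 3 in both cases, but after reversing, “a” sits right next to “alpha” (a lag of 1), which is the short-range dependency the authors believe helps SGD get started.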
Training details:
Hypotheses are the pairs of sentences that are generated.
Special thanks to Tuatini GODARD for his suggestions and proofreading. Tuatini is a Full-Time DL Freelancer. I had the chance to interview him; if you’d like to know more about him, you can find the interview here.