Notes: Deep learning, review of basics
This autumn I had the opportunity to attend Deep Learning 2017. Here are my review notes for the final course exam.
Includes both (some) notes from the lectures + additional stuff from the internet. Not very polished, might be unreadable.
(I might write more in detail about our group project presentation later.)
Topics
Neural networks
Note. Real neurons are quite a different beast from the mathematical model in ANNs, but they serve as an inspiration for DL models: Human thought arises from a neural structure made of a massive number of simple processors connected to each other in a particular way (massive parallelism, connectionism). The connectome encodes information in the brain; the connectome is sparse (the number of connections per neuron is bounded).
For network models, see ch 6 of DLB.
History:
 First wave started with cybernetics, 1940s–1960s. Some concepts: McCulloch and Pitts neuron (1943). Rosenblatt perceptron (1958). Adaline (Widrow and Hoff, 1960). Everyone is disappointed when it turns out that a single-layer network can only do linear separation, and multilayer networks are difficult to train.
 Second wave: Renewed interest in 1980s. Rumelhart and McClelland publish a book Parallel Distributed Processing. Backpropagation.
 Third wave: Deep Learning, post 2006. Lots of labeled training data made training deep models possible. Deep models became computationally feasible (advances in hardware and software). Some algorithmic innovations (e.g. cross-entropy instead of MSE, ReLUs).
 Universal approximation theorem. Hornik et al. 1989, Cybenko 1989
 Size of ANN models has doubled approximately every 2.4 years.
For more history, see ch 1 and sec 6.6. of DLB.
Refs
 Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. link
Gradient descent and backpropagation
 How to minimize the loss function? Gradient descent w.r.t. the network model parameters. How to compute the gradient?
 The answer is: by backpropagation, i.e. the chain rule organized as dynamic programming (intermediate derivatives are computed once and reused).
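To make the two bullets above concrete, here is a minimal sketch of backpropagation plus plain gradient descent for a tiny two-layer network in NumPy. The architecture (one tanh hidden layer, squared-error loss), the function names and the hyperparameters are my own choices for illustration, not something from the course.

```python
import numpy as np

def forward(params, x):
    """Two-layer network: x -> tanh(x W1 + b1) -> h W2 + b2."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    y_hat = h @ W2 + b2
    return y_hat, h

def loss_and_grads(params, x, y):
    """Squared-error loss and its gradients, computed by backpropagation."""
    W1, b1, W2, b2 = params
    y_hat, h = forward(params, x)
    n = x.shape[0]
    diff = y_hat - y
    loss = 0.5 * np.mean(np.sum(diff ** 2, axis=1))

    # Backward pass: apply the chain rule layer by layer, reusing
    # the upstream gradient at each step (the "dynamic programming" part).
    d_yhat = diff / n                 # dL/dy_hat
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T                # gradient flowing into the hidden layer
    dz = dh * (1.0 - h ** 2)          # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ dz
    db1 = dz.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

def train(x, y, hidden=8, lr=0.05, steps=500, seed=0):
    """Plain (full-batch) gradient descent on the network parameters."""
    rng = np.random.default_rng(seed)
    params = [
        rng.normal(0.0, 0.5, (x.shape[1], hidden)), np.zeros(hidden),
        rng.normal(0.0, 0.5, (hidden, y.shape[1])), np.zeros(y.shape[1]),
    ]
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, x, y)
        losses.append(loss)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses
```

A handy sanity check for any hand-written backward pass is to compare a few gradient entries against central finite differences of the loss.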
Some activation functions
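 Logistic sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$.
 Hyperbolic tangent: $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$.
 Rectified linear unit (ReLU): $\max(0, x)$.
 Leaky ReLU.
 Softplus.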
Nowadays, ReLUs are recommended over smooth activations like sigmoids. The main benefit of ReLUs is that they help avoid the vanishing gradient problem: the linear part has a constant derivative, whereas sigmoids saturate. (link) Non-differentiability at zero is not as much of a problem as people of yesteryear thought.
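The activations listed above, written out as a NumPy sketch together with the derivatives of the sigmoid and ReLU. The leaky slope `alpha=0.01` is an assumed default, not a value from the lectures. The derivatives show the point about vanishing gradients: the sigmoid derivative is never larger than 0.25, while the ReLU derivative is exactly 1 on the positive side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the (assumed) small slope used for negative inputs
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    # log(1 + e^x), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value 0.25, attained at x = 0

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention)
    return (np.asarray(x) > 0).astype(float)
```

The hyperbolic tangent is available directly as `np.tanh`.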
Libraries
 A bunch of libraries are available.
 Theano, TensorFlow, Keras, PyTorch and Caffe
Recentish cool DL stuff
 https://github.com/phillipi/pix2pix
 http://robohub.org/deeplearninginrobotics/
 DQNs. https://deepmind.com/research/dqn/
 https://medium.com/applieddatascience/alphagozeroexplainedinonediagram365f5abf67e0
 Relja Arandjelović, Andrew Zisserman. 2017. “Look, Listen and Learn” https://arxiv.org/abs/1705.08168 “We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos?”
CNNs and pooling
 Convolution in maths
 Note. In DL literature, both convolutions and cross-correlations are often called convolutions.
 link
 Weight sharing.
 CNNs act as “feature detectors”.
 Average pooling, max pooling.
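A small NumPy sketch of the items in the list above: the difference between cross-correlation and convolution (the latter flips the kernel), and non-overlapping max / average pooling. It handles single-channel 2D inputs only and uses plain loops for readability; the function names are mine.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel without flipping it."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same kernel weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Mathematical convolution: cross-correlation with the kernel flipped in both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))
```

Since the kernel weights are learned anyway, a network can just as well learn the flipped kernel, which is why the distinction rarely matters in practice.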
Geoffrey Hinton on pooling: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” source
How to visualize visual DNNs? See this.
Dataset augmentation
Often the datasets we have are not as large as we’d like. Luckily (at least with e.g. images), we can easily create new labeled data by transforming the data we have (shear, shift, rotate, etc).
Link: basic examples with Keras.
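For illustration, a minimal NumPy sketch of label-preserving augmentation (this is not the Keras API from the link, just the idea). It only does random horizontal flips and small shifts; the wrap-around shift via `np.roll` is a simplification I chose, real pipelines usually pad or crop instead. Function names and defaults are hypothetical.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of an (H, W) or (H, W, C) image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Shift along height and width; pixels wrap around the border.
    return np.roll(out, (int(dy), int(dx)), axis=(0, 1))

def augmented_batch(images, labels, rng, copies=4):
    """Create `copies` transformed versions of each image; labels are reused unchanged."""
    xs = [augment(img, rng) for img in images for _ in range(copies)]
    ys = [lab for lab in labels for _ in range(copies)]
    return np.stack(xs), np.array(ys)
```

Whether a given transformation really preserves the label depends on the task: a horizontal flip is fine for cats and dogs but not for digits or text.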
Co-adaptation problem and dropout
Large neural models easily start to overfit to the data (because of co-adaptation: neurons rely too much on other neurons). Applying dropout (randomly removing neurons during training) helps to prevent this.
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. “Dropout: a simple way to prevent neural networks from overfitting”. JMLR, 15(1), pp. 1929–1958. link
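A minimal sketch of dropout on a layer's activations. Note that this is the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at test time; Srivastava et al. instead scale the weights at test time. The function name and signature are my own.

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return x                                # test time: use all units as they are
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep           # True for the units that survive
    # Dividing by `keep` keeps the expected activation unchanged.
    return x * mask / keep
```

Because a different random mask is drawn for every training step, a unit cannot rely on any particular other unit being present.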
Batch normalization
The basic idea: for each mini-batch being trained, take the layer activation $x$ and normalize it (zero mean, unit variance) before feeding it to the next layer. This operation acts as a regularization method which enforces that the distribution of a batch-normalized layer stays stable over time (over different batches). Batch normalization can be viewed as an adaptation of the more traditional idea that the input data should be normalized, applied internally to the intermediate inputs of the NN architecture to reduce covariate shift.
(Covariate shift being the shift in the distributions of $x$ during the training.)
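As a sketch, the training-time forward pass of batch normalization for a fully connected layer looks roughly like this in NumPy. The learnable scale `gamma` and shift `beta` are part of the method in Ioffe and Szegedy's paper; the function name and the `eps` default are my assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize activations x of shape (batch, features) per feature over the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance per feature
    return gamma * x_hat + beta                 # learnable rescaling and shift
```

At test time the batch statistics are replaced by averages collected during training, which this sketch leaves out.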
See
 Sergey Ioffe, Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arxiv:1502.03167
 Shimodaira, H., 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), pp. 227–244. link. (For discussion on covariate shift.)
Some other high-level explanations from the internet:
Degradation problem and ResNets
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015 “Deep Residual Learning for Image Recognition.” arxiv:1512.03385
He et al 2015:
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments.
A deeper network results in worse training error! How puzzling:
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.
He et al. 2015 present ResNet as a solution. Basic idea: introduce shortcut connections so that the “identity” mapping is always available (and the extra layers now should be able to learn parameters that result in a smaller loss than the simple identity).
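A toy sketch of the shortcut idea with fully connected layers in NumPy (the actual ResNet blocks use convolutions and batch normalization; the function names and the two-layer form of the residual branch are simplifications of mine). With all weights at zero, the residual block passes a non-negative input through unchanged, whereas a plain block of the same depth outputs zeros.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def plain_block(x, W1, W2):
    """Two stacked layers without a shortcut."""
    return relu(relu(x @ W1) @ W2)

def residual_block(x, W1, W2):
    """Same two layers, but their output F(x) is added to the input: relu(x + F(x))."""
    f = relu(x @ W1) @ W2
    return relu(x + f)
```

So the extra layers only need to learn a correction on top of the identity, instead of having to reproduce the identity themselves.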
1 x 1 convolutions and Inception
1 x 1 convolutions reduce the number of feature maps (channels), i.e. they decrease the dimension of the feature space, which makes the following larger convolutions cheaper. Nice explanation by Aaditya Prakash.
For Inception architecture, see Szegedy et al. 2014. “Going Deeper with Convolutions.” arxiv:1409.4842.
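A 1 x 1 convolution is just the same linear map applied to the channel vector at every pixel, which a short NumPy sketch makes explicit. The helper `conv_params` and the channel counts in the usage note are made-up numbers for illustration, not taken from the Inception paper.

```python
import numpy as np

def conv1x1(x, W):
    """x: (H, W, C_in) feature maps, W: (C_in, C_out) weights -> (H, W, C_out)."""
    # Each pixel's channel vector is multiplied by the same matrix W.
    return np.einsum("hwc,cd->hwd", x, W)

def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out
```

For example, going from 256 to 64 channels with a direct 5 x 5 convolution takes `conv_params(5, 256, 64)` weights, while first reducing to 32 channels costs `conv_params(1, 256, 32) + conv_params(5, 32, 64)`, which is several times fewer.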
Transfer learning
In one sentence: Utilizing pretrained networks as building blocks.
CNNs for text
Word2Vec embeddings:
 Skip-gram (given a word, predict nearby words, aka the context)
 Continuous bag-of-words (given a context, predict a word)
 Tensorflow tutorial
 McCormick tutorial pt 1: the Skip-Gram model
 McCormick tutorial pt 2: Negative Sampling (NS: when training the network, restrict attention to only a handful of negative examples to speed things up)
 Xin Rong. 2014. (2016.) “word2vec Parameter Learning Explained.” arxiv:1411.2738
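To see what the two Word2Vec training setups look like as data, here is a small pure-Python sketch that builds the training examples from a token list. It covers only the example generation (no embedding training, no negative sampling), and the function names and window size are my own.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center word, context word) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (list of context words, target word) examples."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples
```

With `tokens = ["the", "quick", "brown", "fox"]` and `window=1`, skip-gram yields pairs such as `("quick", "the")` and `("quick", "brown")`, while CBOW yields `(["the", "brown"], "quick")`.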
Convolutions for text:
 one introductory tutorial
 Main content: Yoon Kim. 2014. “Convolutional Neural Networks for Sentence Classification.” EMNLP: Empirical Methods in Natural Language Processing, pp 1746–1751.
 Fun Demo by Turku NLP group: link.
Kim 2014 idea summarized: Have a fixed-length representation for words in text (e.g. word2vec embeddings); apply convolution over them (over the one spatial / time axis).
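Roughly, in NumPy: each filter spans a few consecutive word vectors, slides along the sentence, and the maximum response over all positions is kept (max-over-time pooling, as in Kim 2014), which gives a fixed-size feature vector regardless of sentence length. This sketch uses a single filter width and ReLU as the nonlinearity, both my choices, and it assumes the sentence has at least `width` words.

```python
import numpy as np

def sentence_features(embeddings, filters, bias):
    """embeddings: (T, D) word vectors; filters: (K, width, D); bias: (K,) -> (K,) features."""
    K, width, D = filters.shape
    T = embeddings.shape[0]
    # All windows of `width` consecutive words: shape (T - width + 1, width, D).
    windows = np.stack([embeddings[t:t + width] for t in range(T - width + 1)])
    # One response per window position and filter, followed by ReLU.
    feature_maps = np.maximum(0.0, np.einsum("twd,kwd->tk", windows, filters) + bias)
    # Max-over-time pooling: keep the strongest response of each filter.
    return feature_maps.max(axis=0)
```

The resulting vector can then be fed to an ordinary classifier layer.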
RNNs for text
See this tutorial. Also this and DLB chapter 10.
 RNNs briefly described: introduce recurrence to NNs (the output of the NN at one step is used as input at the next step).
 Learning: Backpropagation through time.
 We have a problem of vanishing gradients, again: recurrence can be viewed as “chain” of NNs; in BPTT we end up with very long chains.
 Answer: LSTMs, GRUs.
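A minimal vanilla RNN forward pass in NumPy, plus a helper that tracks how strongly the hidden state at step t still depends on the initial state (the norm of the Jacobian of h_t w.r.t. h_0). This is only meant to illustrate the "long chain" in the list above; it is not an LSTM or GRU, and the names and shapes are my own assumptions.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """xs: (T, D) inputs; returns all hidden states, shape (T, H)."""
    h = h0
    hs = []
    for x in xs:
        # The previous hidden state is fed back in together with the new input.
        h = np.tanh(x @ W_xh + h @ W_hh + b)
        hs.append(h)
    return np.stack(hs)

def jacobian_norms(hs, W_hh):
    """Spectral norm of d h_t / d h_0 for each step t (product of per-step Jacobians)."""
    J = np.eye(W_hh.shape[0])
    norms = []
    for h in hs:
        # One link of the chain: d h_t / d h_{t-1} = diag(1 - h_t^2) W_hh^T
        J = (np.diag(1.0 - h ** 2) @ W_hh.T) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms
```

When the recurrent weights are small, these norms shrink roughly geometrically with t, which is the vanishing gradient problem that BPTT runs into on long sequences.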