Notes: Deep learning, review of basics

This autumn I had the opportunity to attend Deep Learning 2017. Here are my review notes for the final course exam.

Includes both (some) notes from the lectures + additional stuff from the internet. Not very polished, might be unreadable.

(I might write more in detail about our group project presentation later.)

Topics

Neural networks

Note. Real neurons are quite a different beast from the mathematical model in ANNs, but they serve as an inspiration for DL models: human thought arises from a neural structure made of a massive number of simple processors connected to each other in a particular way (massive parallelism, connectionism). The connectome encodes information in the brain; the connectome is sparse (the number of connections per neuron is bounded).

For network models, see ch 6 of DLB.

History:

First wave started with cybernetics, 1940s–1960s. Some concepts: McCulloch and Pitts neuron (1943). Rosenblatt perceptron (1958). Adaline, Widrow and Hoff (1960). Disappointment followed when single-layer networks turned out to handle only linearly separable problems and multilayer networks proved difficult to train.
Second wave: Renewed interest in the 1980s. Rumelhart and McClelland publish the book Parallel Distributed Processing. Backpropagation.
Third wave: Deep Learning, post 2006. Lots of labeled training data made training deep models possible. Deep models became computationally feasible (advances in hardware and software). Some algorithmic innovations (e.g. cross-entropy instead of MSE, ReLUs).
Universal approximation theorem. Hornik et al. 1989, Cybenko 1989
Size of ANN models has doubled approximately every 2.4 years.

For more history, see ch 1 and sec 6.6. of DLB.

Refs

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. link

Gradient descent and backpropagation

How to minimize the loss function? By gradient descent w.r.t. the network model parameters. How to compute the gradient?
The answer is: by backpropagation, i.e. the chain rule implemented with dynamic programming (intermediate results are cached and reused).
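To make the "chain rule by dynamic programming" idea concrete, here is a minimal NumPy sketch. Everything in it is an assumption made for illustration and not taken from the course: the tiny two-layer network, the tanh hidden layer, the squared-error loss, and all function names.

```python
import numpy as np

def forward(params, x):
    """Forward pass; returns the prediction and the values cached for backprop."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)      # hidden activations
    y_hat = h @ W2 + b2           # linear output layer
    return y_hat, (x, h)

def loss_fn(y_hat, y):
    """Half mean squared error over the batch."""
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(params, cache, y_hat, y):
    """Backward pass: chain rule applied layer by layer, reusing cached values."""
    W1, b1, W2, b2 = params
    x, h = cache
    n = x.shape[0]
    d_y = (y_hat - y) / n             # dL/dy_hat
    dW2 = h.T @ d_y
    db2 = d_y.sum(axis=0)
    d_h = d_y @ W2.T                  # propagate the error to the hidden layer
    d_z1 = d_h * (1.0 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_z1
    db1 = d_z1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def train(x, y, hidden=8, lr=0.05, steps=500, seed=0):
    """Plain (full-batch) gradient descent on the parameters."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 0.5, (x.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0, 0.5, (hidden, y.shape[1])), np.zeros(y.shape[1])]
    losses = []
    for _ in range(steps):
        y_hat, cache = forward(params, x)
        losses.append(loss_fn(y_hat, y))
        grads = backward(params, cache, y_hat, y)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses
```

The backward pass visits each layer once and reuses `h` from the forward pass, which is where the dynamic programming comes in. In practice the libraries do this bookkeeping automatically.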

Some activation functions

Logistic sigmoid $1/(1+ \exp(-x))$ .
Hyperbolic tangent $(\exp(x) - \exp(-x))/(\exp(x) + \exp(-x))$ .
Rectified linear unit $\max (0, x)$ .
Leaky ReLU.
Softplus.

Nowadays, ReLUs are recommended over smooth activations like sigmoids. The main benefit of ReLUs is that they help avoid the vanishing gradient problem: the linear part has a constant derivative, whereas the sigmoid's derivative is at most 1/4 and shrinks towards zero for large $|x|$. link Non-differentiability at zero is not as much of a problem as people of yesteryear thought.
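A small NumPy sketch of the activations listed above and of two derivatives, to compare how much gradient each one lets through. The function names and the leaky-ReLU slope `alpha=0.01` are my own choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # peaks at 0.25 when x = 0

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Constant 1 on the linear part, 0 elsewhere (0 is used at x = 0 by convention)
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    return np.logaddexp(0.0, x)       # log(1 + exp(x)), computed stably

x = np.array([-5.0, -1.0, 0.5, 5.0])
print("sigmoid'  :", sigmoid_grad(x))
print("relu'     :", relu_grad(x))
print("tanh      :", np.tanh(x))
```

Multiplying many sigmoid derivatives (each at most 0.25) across layers shrinks the gradient quickly, while active ReLU units pass it through unchanged.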

Libraries

Bunch of libraries available.
Theano, Tensorflow, Keras, PyTorch and Caffe

Recent-ish cool DL stuff

https://github.com/phillipi/pix2pix
http://robohub.org/deep-learning-in-robotics/
DQNs. https://deepmind.com/research/dqn/
https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0
Relja Arandjelović, Andrew Zisserman. 2017. “Look, Listen and Learn” https://arxiv.org/abs/1705.08168 “We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? “

CNNs and pooling

Convolution in maths
Note. In DL literature, both convolutions and cross-correlations are often called convolutions.
link
Weight sharing.
CNNs act as “feature detectors”.
Average pooling, max pooling.
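The convolution vs. cross-correlation distinction and the two pooling variants can be sketched in a few lines of NumPy. This is a deliberately slow, loop-based illustration with hypothetical function names (single channel, no padding, stride 1); real libraries use much faster implementations.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image without flipping it."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same kernel weights are reused at every position (weight sharing)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Mathematical convolution = cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling; rows/columns that do not fill a full block are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```

Since the kernel weights are learned anyway, whether the kernel is flipped makes no practical difference, which is presumably why the DL literature is relaxed about the naming.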

Geoffrey Hinton on pooling: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” source

How to visualize visual DNNs? See this.

Karpathy on ImageNet 2014.

Dataset augmentation

Often the datasets we have are not as large as we’d like. Luckily (at least with e.g. images), we can easily create new labeled data by transforming the data we have (shear, shift, rotate, etc).
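A minimal NumPy sketch of the idea, assuming images stored as `(H, W)` or `(H, W, C)` arrays. The helper names are hypothetical, and the wrap-around shift via `np.roll` is a simplification (libraries such as Keras offer proper shifting, shearing, and rotation).

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of one image."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                        # horizontal flip
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    return np.roll(out, (dy, dx), axis=(0, 1))    # shift with wrap-around

def augmented_batch(images, labels, rng, copies=4):
    """Create `copies` transformed versions of every image; labels are simply repeated."""
    xs = [augment(img, rng) for img in images for _ in range(copies)]
    ys = np.repeat(labels, copies)
    return np.stack(xs), ys
```

This only makes sense for transformations that keep the label valid: a horizontally flipped cat is still a cat, but a flipped digit or letter may not be the same symbol any more.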

Link: basic examples with Keras.

Co-adaptation problem and dropout

See this Quora answer.

Large neural models easily start to overfit to the data (because of co-adaptation: neurons rely too much on other neurons). Applying dropout (randomly removing neurons during training) helps to prevent this.
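A minimal sketch of dropout on a layer's activations, in the "inverted dropout" form that rescales at training time so nothing needs to change at test time. That variant is my assumption here; Srivastava et al. describe scaling the weights at test time instead. The function name is hypothetical.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the rest by 1/(1 - rate)."""
    if not training or rate == 0.0:
        return x                          # no units are dropped at test time
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep     # True for the units that survive
    return x * mask / keep                # rescaling keeps the expected activation unchanged

rng = np.random.default_rng(0)
h = np.ones((2, 6))
print(dropout(h, 0.5, rng))               # a random subset of entries is zeroed
```

Because a different random mask is drawn for every batch, no neuron can count on any particular other neuron being present.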

Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. “Dropout: a simple way to prevent neural networks from overfitting”. JMLR, 15(1), pp.1929-1958. link

Keras docs

Batch Normalization.

The basic idea: for each batch being trained, take the layer activation $x$ and normalize it (zero mean, unit variance) before feeding it to the next layer. This operation acts as a regularization method which enforces that the distribution of a batch-normalized layer stays stable over time (over different batches). Batch normalization can be viewed as an adaptation of the more traditional idea that the input data should be normalized, applied internally to the intermediate inputs of the NN architecture to reduce covariate shift.

(Covariate shift being the shift in the distributions of $x$ during the training.)
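The normalization step can be sketched in NumPy as follows. The learnable scale `gamma` and shift `beta` come from the Ioffe and Szegedy paper; the function names are my own, and keeping running averages for inference is left to the caller.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift.

    x: (batch, features). Returns the output plus the batch statistics,
    which a caller would typically fold into running averages.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, (almost) unit variance
    return gamma * x_hat + beta, mu, var

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At inference time, use statistics collected during training instead of batch ones."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0` the output of `batch_norm_train` has zero mean and roughly unit variance per feature, whatever the scale of the incoming activations.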

See

Sergey Ioffe, Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arxiv:1502.03167
Shimodaira, H., 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2), pp.227-244. link. (For discussion on covariate shift.)

Some other high-level explanations from the internet:

Degradation problem and ResNets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015 “Deep Residual Learning for Image Recognition.” arxiv:1512.03385

He et al 2015:

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments.

Deeper networks result in worse training error! How puzzling:

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.

He et al 2015 present ResNet as a solution. Basic idea: introduce shortcut connections so that the “identity” mapping is always available (and the extra layers should now be able to learn parameters that result in a smaller loss than the simple identity).
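A toy sketch of a shortcut connection in NumPy. To keep it short I assume dense layers instead of convolutions and leave out batch normalization, so this is not the actual ResNet block; the names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Compute relu(x + F(x)) where F(x) = relu(x @ W1) @ W2 is the learned residual."""
    f = relu(x @ W1) @ W2
    return relu(x + f)          # shortcut: the input is added back to the layer output

x = np.array([[1.0, 2.0, 0.5]])
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))   # zero weights give back x (for non-negative x)
```

Pushing the weights to zero makes the block an identity mapping (for non-negative inputs), so the constructed "copy the shallow network" solution from the quote above becomes easy for the optimizer to reach.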

1 x 1 convolutions and Inception

1 x 1 convolutions reduce the number of filters (channels), i.e. they decrease the dimension of the feature space. Nice explanation by Aaditya Prakash.
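A 1 x 1 convolution is just the same linear map applied to the channel vector at every pixel, which a short NumPy sketch makes visible. The channels-last layout `(H, W, C)`, the function name, and the sizes 64 and 16 are assumptions for illustration.

```python
import numpy as np

def conv1x1(x, W):
    """x: (H, W, C_in) feature map, W: (C_in, C_out) weights -> (H, W, C_out)."""
    return x @ W                 # matmul broadcasts over the two spatial axes

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 64))    # 64 channels
weights = rng.normal(size=(64, 16))          # project down to 16 channels
reduced = conv1x1(feature_map, weights)
print(feature_map.shape, "->", reduced.shape)
```

The spatial size is untouched and only the channel dimension shrinks, which makes the following (more expensive) 3 x 3 or 5 x 5 convolutions cheaper.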

For Inception architecture, see Szegedy et al. 2014. “Going Deeper with Convolutions.” arxiv:1409.4842.

Transfer learning

In one sentence: Utilizing pretrained networks as building blocks.

CNNs for text

Word2Vec embeddings:

Skip-gram (given a word, predict nearby words aka context)
Continuous bag-of-words (given a context, predict a word)
Tensorflow tutorial
McCormick tutorial pt 1: the Skip-Gram model
McCormick tutorial pt 2: Negative Sampling (NS: when training the network, restrict attention to only a handful of negative examples to speed things up)
Xin Rong. 2014. (2016.) “word2vec Parameter Learning Explained.” arxiv:1411.2738
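The difference between the two word2vec training setups is easiest to see from the training examples they generate. A small pure-Python sketch follows; the function names and the symmetric `window` parameter are my own choices, and the actual embedding training is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center word, context word) pair per neighbour inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (list of context words, center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, center))
    return examples

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

With negative sampling, each positive pair would additionally be matched with a handful of randomly drawn "wrong" context words instead of updating the output weights for the whole vocabulary.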

Convolutions for text:

one introductory tutorial
Main content: Yoon Kim. 2014. “Convolutional Neural Networks for Sentence Classification.” EMNLP: Empirical Methods in Natural Language Processing, pp 1746–1751.
Fun Demo by Turku NLP group: link.

Kim 2014 idea summarized: Have fixed-length representation for words in text (e.g. word2vec embeddings); apply convolution over them (over the one spatial / time axis).
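A rough NumPy sketch of that idea: slide filters of width `k` over the sequence of word vectors, then take the maximum over time for each filter. The ReLU non-linearity, the shapes, and the function name are assumptions for illustration, and the final classifier layer is left out.

```python
import numpy as np

def text_cnn_features(embeddings, filters, bias):
    """Convolve over the time axis, then max-over-time pool.

    embeddings: (T, d) one d-dimensional vector per word
    filters:    (n_filters, k, d) each filter spans k consecutive words
    bias:       (n_filters,)
    returns:    (n_filters,) a fixed-length feature vector for the sentence
    """
    T, d = embeddings.shape
    n_filters, k, _ = filters.shape
    # All windows of k consecutive word vectors: (T - k + 1, k, d)
    windows = np.stack([embeddings[t:t + k] for t in range(T - k + 1)])
    scores = np.einsum("tkd,fkd->tf", windows, filters) + bias
    feature_maps = np.maximum(0.0, scores)      # ReLU
    return feature_maps.max(axis=0)             # max over the time axis
```

The pooling step is what turns sentences of different lengths into vectors of the same size, so a standard dense classifier can sit on top.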

RNNs for text

See this tutorial. Also this and DLB chapter 10.

RNNs briefly described: introduce recurrence to NNs (the output of the NN at one step is used as input at the next step).
Learning: backpropagation through time (BPTT).
We have the problem of vanishing gradients again: the recurrence can be viewed as a “chain” of NNs, and in BPTT we end up with very long chains.
Answer: LSTMs, GRUs.
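The vanishing gradient effect in a plain RNN can be observed directly by multiplying the step-to-step Jacobians, which is what BPTT does implicitly. A NumPy sketch with a vanilla tanh RNN follows; the function names are hypothetical and the row-vector convention `h @ W_hh` is my choice.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Vanilla RNN: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h), starting from h = 0."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        hs.append(h)
    return np.stack(hs)

def jacobian_norms(hs, W_hh):
    """Spectral norm of d h_last / d h_t as t moves further back in time."""
    total = np.eye(W_hh.shape[0])
    norms = []
    for h in hs[:0:-1]:                           # last state down to the second one
        J = (1.0 - h ** 2)[:, None] * W_hh.T      # Jacobian of h_t w.r.t. h_{t-1}
        total = total @ J                         # chain rule: product of Jacobians
        norms.append(np.linalg.norm(total, 2))
    return norms

rng = np.random.default_rng(0)
n_in, n_hidden, T = 3, 5, 30
W_xh = rng.normal(size=(n_in, n_hidden))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)             # recurrent weights with spectral norm 0.5
hs = rnn_forward(rng.normal(size=(T, n_in)), W_xh, W_hh, np.zeros(n_hidden))
norms = jacobian_norms(hs, W_hh)
print(norms[0], norms[-1])                        # the second value is far smaller
```

Each factor in the product has norm at most 0.5 here, so the gradient signal from early time steps shrinks geometrically. LSTMs and GRUs add gated, additive paths through time to counter exactly this.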