Notes: Deep learning, review of basics
This autumn I had the opportunity to attend Deep Learning 2017. Here are my review notes for the final course exam.
Includes both (some) notes from the lectures + additional stuff from the internet. Not very polished, might be unreadable.
(I might write more in detail about our group project presentation later.)
Note. Real neurons are quite a different beast from the mathematical model in ANNs, but they serve as an inspiration for DL models: human thought arises from a neural structure made of a massive number of simple processors connected to each other in a particular way (massive parallelism, connectionism). The connectome encodes information in the brain; the connectome is sparse (the number of connections per neuron is bounded).
For network models, see ch 6 of DLB.
- First wave started with cybernetics, 1940s–1960s. Some concepts: McCulloch and Pitts neuron (1943), Rosenblatt perceptron (1958), Adaline (Widrow and Hoff, 1960). Disappointment followed when it turned out that a single-layer network can only do linear separation, and multilayer networks were difficult to train.
- Second wave: Renewed interest in the 1980s. Rumelhart and McClelland publish the book Parallel Distributed Processing. Backpropagation.
- Third wave: Deep Learning, post 2006. Lots of labeled training data made training deep models possible. Deep models became computationally feasible (advances in hardware and software). Some algorithmic innovations (e.g. cross-entropy instead of MSE, ReLUs).
- Universal approximation theorem. Hornik et al. 1989, Cybenko 1989
- Size of ANN models has doubled approximately every 2.4 years.
For more history, see ch 1 and sec 6.6. of DLB.
- DLB: Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. link
Gradient descent and backpropagation
- How to minimize the loss function? Gradient descent w.r.t. the network model parameters. How to compute the gradient?
- The answer: by backpropagation, i.e. the chain rule applied with dynamic programming (intermediate derivatives are computed once and reused).
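To make the chain rule part concrete, here is a minimal sketch with a made-up two-weight network (`x -> tanh(w1 * x) -> w2 * h`) and a squared-error loss. The network, the starting weights and the learning rate are my own assumptions for illustration, not something from the lectures.

```python
import math

def forward(w1, w2, x):
    """Tiny network: h = tanh(w1 * x), y = w2 * h."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y

def loss_and_grads(w1, w2, x, target):
    """Squared-error loss and its gradients, computed by backpropagation."""
    h, y = forward(w1, w2, x)
    loss = 0.5 * (y - target) ** 2
    # Backward pass: each derivative is computed once and reused downstream.
    dy = y - target            # dL/dy
    dw2 = dy * h               # dL/dw2 = dL/dy * dy/dw2
    dh = dy * w2               # dL/dh  = dL/dy * dy/dh
    dz = dh * (1.0 - h * h)    # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x               # dL/dw1 = dL/dz * dz/dw1
    return loss, dw1, dw2

def train(w1=0.5, w2=-0.3, x=1.0, target=0.8, lr=0.1, steps=200):
    """Plain gradient descent on the two weights (hypothetical defaults)."""
    losses = []
    for _ in range(steps):
        loss, dw1, dw2 = loss_and_grads(w1, w2, x, target)
        losses.append(loss)
        w1 -= lr * dw1
        w2 -= lr * dw2
    return w1, w2, losses
```

A quick sanity check is to compare `dw1` and `dw2` against finite differences of the loss; real frameworks do the same bookkeeping automatically for arbitrary computation graphs.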
Some activation functions
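- Logistic sigmoid $\sigma(x) = 1 / (1 + e^{-x})$.
- Hyperbolic tangent $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$.
- Rectified linear unit $\mathrm{ReLU}(x) = \max(0, x)$.
- Leaky ReLU $\max(\alpha x, x)$ with a small slope $0 < \alpha < 1$ for negative inputs.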
Nowadays, ReLUs are recommended over smooth activations like sigmoids. The main benefit of ReLUs is that they mitigate the vanishing gradient problem: the linear part of a rectified linear unit has a constant derivative, whereas the derivative of a sigmoid shrinks towards zero for large inputs (link). Non-differentiability at zero is not as much of a problem as people of yesteryear thought.
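A small sketch of these activations and their derivatives in plain Python, to see the gradient argument in numbers. The function names and the `alpha=0.01` default are my own choices; `d_relu` returns 0 at `x = 0`, which is one common convention for the non-differentiable point.

```python
import math

def sigmoid(x):
    # Note: math.exp overflows for very large negative x; fine for a sketch.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, reached at x = 0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    return 1.0 if x > 0 else 0.0  # constant 1 on the linear part
```

Since `d_sigmoid` never exceeds 0.25, a chain of ten sigmoid layers scales the gradient by at most `0.25 ** 10`, while a chain of active ReLUs scales it by exactly 1.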
- Bunch of libraries available.
- Theano, TensorFlow, Keras, PyTorch and Caffe.
Recent-ish cool DL stuff
- DQNs. https://deepmind.com/research/dqn/
- Relja Arandjelović, Andrew Zisserman. 2017. “Look, Listen and Learn” https://arxiv.org/abs/1705.08168 “We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos?”
CNNs and pooling
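- Convolution in maths: $(f * g)(t) = \int f(\tau)\, g(t - \tau)\, d\tau$.
- Note. In DL literature, both convolutions and cross-correlations are often called convolutions. (Cross-correlation slides the kernel without flipping it.)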
- Weight sharing.
- CNNs act as “feature detectors”.
- Average pooling, max pooling.
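The list above can be sketched in one dimension with plain Python lists. This is only an illustration under my own assumptions ("valid" mode, no padding, stride 1 for the filters, non-overlapping pooling windows); the function names are made up.

```python
def cross_correlate(signal, kernel):
    """'Valid' 1-D cross-correlation: slide the kernel without flipping it."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def convolve(signal, kernel):
    """'Valid' 1-D convolution: cross-correlation with the kernel flipped."""
    return cross_correlate(signal, kernel[::-1])

def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def avg_pool(values, size):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]
```

Weight sharing shows up here as the same `kernel` being applied at every position. With a learned kernel the flip makes no practical difference, which is probably why the DL literature does not bother to distinguish the two operations.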
Geoffrey Hinton on pooling: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.” source
How to visualize visual DNNs? See this.
Often the datasets we have are not as large as we’d like. Luckily (at least with e.g. images), we can easily create new labeled data by transforming the data we have (shear, shift, rotate, etc.). This is known as data augmentation.
Link: basic examples with Keras.
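For intuition, a toy version of the idea with images as nested lists of pixel values. The helper names are hypothetical, and the sketch assumes the transformations do not change the label (not true for every dataset, e.g. mirrored text).

```python
def flip_horizontal(image):
    """Mirror every row of the image."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift columns to the right, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]

def augment(image, label):
    """Return the original plus transformed copies, all with the same label."""
    variants = [image, flip_horizontal(image), shift_right(image, 1)]
    return [(variant, label) for variant in variants]
```

In practice a library would apply random transformations on the fly during training instead of materializing every variant up front.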
Co-adaptation problem and dropout
Large neural models easily start to overfit to the data (because of co-adaptation: neurons rely too much on other neurons). Dropout (randomly removing neurons during training) helps to prevent this.
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. “Dropout: a simple way to prevent neural networks from overfitting”. JMLR, 15(1), pp.1929-1958. link
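A minimal sketch of dropout on a list of activations. Note the assumption: this is the "inverted" variant that rescales the surviving units during training, which is common in libraries; the paper above instead scales the weights at test time. The function name and signature are my own.

```python
import random

def dropout(activations, p_drop, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)          # no-op at test time
    keep = 1.0 - p_drop
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because a different random subset of units is removed for every batch, no unit can rely on a specific other unit always being present.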
The basic idea of batch normalization: for each batch being trained, take the layer activation $x$, and normalize it (zero mean, unit variance) before feeding it to the next layer. This operation acts as a regularization method which enforces that the distribution of a batch-normalized layer stays stable over time (over different batches). Batch normalization can be viewed as an adaptation of the more traditional idea that the input data should be normalized, applied “internally” to the intermediate inputs of the NN architecture to reduce covariate shift.
(Covariate shift being the shift in the distributions of $x$ during the training.)
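The normalization step of batch normalization for a single feature could be sketched like this. The `gamma` and `beta` parameters are the learnable scale and shift from the paper; `eps` and the defaults are my assumptions, and the sketch ignores the running averages used at test time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n   # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

With the default `gamma=1.0, beta=0.0` the output has roughly zero mean and unit variance; learning `gamma` and `beta` lets the network undo the normalization if that turns out to be useful.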
- Sergey Ioffe, Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arxiv:1502.03167
- Shimodaira, H., 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2), pp.227-244. link. (For discussion on covariate shift.)
Some other high-level explanations from the internet:
Degradation problem and ResNets
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. 2015 “Deep Residual Learning for Image Recognition.” arxiv:1512.03385
He et al 2015:
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments.
Deeper networks result in worse training error! How puzzling:
The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.
He et al 2015 present ResNet as a solution. Basic idea: introduce shortcut connections so that the “identity” mapping is always available (and the extra layers now should be able to learn parameters that result in a smaller loss than the simple identity).
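A toy residual block, to show why the identity is "always available". The assumption here is that plain fully connected layers stand in for the convolutions used in the paper; the helper names are made up.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, bias, v):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def residual_block(v, w1, b1, w2, b2):
    """y = relu(F(v) + v), where F is two linear layers with a ReLU in between."""
    f = linear(w2, b2, relu(linear(w1, b1, v)))
    # Shortcut connection: add the unchanged input to the learned residual F(v).
    return relu([fi + xi for fi, xi in zip(f, v)])
```

If all weights and biases are zero, then `F(v)` is zero and the block passes non-negative inputs through unchanged, so the layers only have to learn a correction on top of the identity.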
1 x 1 convolutions and Inception
1 x 1 convolutions reduce the number of filters (channels), i.e. they decrease the dimension of the feature space. Nice explanation by Aaditya Prakash.
For Inception architecture, see Szegedy et al. 2014. “Going Deeper with Convolutions.” arxiv:1409.4842.
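As a sketch, a 1 x 1 convolution is just the same small linear map applied to the channel vector of every pixel. The data layout below (`feature_map[h][w]` holding a list of channel values, no bias, no nonlinearity) is my own assumption for illustration.

```python
def conv1x1(feature_map, weights):
    """Apply a 1 x 1 convolution.

    feature_map[h][w] is a list of C_in channel values;
    weights[o] is a list of C_in weights for output channel o.
    """
    return [[[sum(wc * xc for wc, xc in zip(w_out, pixel)) for w_out in weights]
             for pixel in row]
            for row in feature_map]
```

With fewer rows in `weights` than input channels, the spatial size stays the same while the channel dimension shrinks, which makes the following larger convolutions cheaper.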
In one sentence: Utilizing pretrained networks as building blocks.
CNNs for text
- Skip-gram (given a word, predict nearby words aka context)
- Continuous bag-of-words (given a context, predict a word)
- TensorFlow tutorial
- McCormick tutorial pt 1: the Skip-Gram model
- McCormick tutorial pt 2: Negative Sampling (NS: when training the network, restrict attention to only a handful of negative examples to speed things up)
- Xin Rong. 2014. (2016.) “word2vec Parameter Learning Explained.” arxiv:1411.2738
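To make the difference between skip-gram and continuous bag-of-words concrete, here is a sketch of how the training examples could be generated from a token list. The function names and the `window` parameter are hypothetical; real implementations add subsampling, negative sampling and so on.

```python
def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: given a word, predict nearby words."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target word) examples: given a context, predict the word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples
```

Both functions see the same windows; skip-gram emits one example per (word, neighbour) pair, while CBOW emits one example per position with the whole context as input.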
Convolutions for text:
- one introductory tutorial
- Main content: Yoon Kim. 2014. “Convolutional Neural Networks for Sentence Classification.” EMNLP: Empirical Methods in Natural Language Processing, pp 1746–1751.
- Fun Demo by Turku NLP group: link.
Kim 2014 idea summarized: Use a fixed-length representation for the words in the text (e.g. word2vec embeddings); apply convolutions over them (over the one spatial / time axis), followed by max-over-time pooling.
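A rough sketch of one such filter sliding over a sentence of word vectors. Assumptions on my part: a single filter, no bias term, ReLU as the nonlinearity, and embeddings given as plain lists; the names are made up.

```python
def conv_over_time(embeddings, filt):
    """Slide one filter over the word axis.

    embeddings: list of T word vectors, each of length d;
    filt: list of h weight vectors of length d (a filter spanning h words).
    """
    h = len(filt)
    feats = []
    for t in range(len(embeddings) - h + 1):
        s = sum(w * x
                for f_row, e_row in zip(filt, embeddings[t:t + h])
                for w, x in zip(f_row, e_row))
        feats.append(max(0.0, s))   # ReLU nonlinearity (assumed)
    return feats

def max_over_time(feats):
    """Keep only the strongest response of the filter in the sentence."""
    return max(feats)
```

Max-over-time pooling turns a variable-length sentence into one number per filter, so sentences of different lengths end up with feature vectors of the same size.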
RNNs for text
See this tutorial. Also this and DLB chapter 10.
- RNNs briefly described: introduce recurrence to NNs (the output of the NN at one step is used as input at the next step).
- Learning: Backpropagation through time.
- We have a problem of vanishing gradients, again: recurrence can be viewed as “chain” of NNs; in BPTT we end up with very long chains.
- Answer: LSTMs, GRUs.
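The "long chain" argument can be seen in numbers with a scalar RNN. This is a toy model under my own assumptions (one hidden unit, tanh activation, made-up function names), not the full BPTT algorithm.

```python
import math

def rnn_forward(xs, w_h, w_x, h0=0.0):
    """Scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, states = h0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def grad_last_wrt_first(xs, w_h, w_x, h0=0.0):
    """d h_T / d h_1 by the chain rule: a product of T-1 factors w_h * (1 - h_t^2)."""
    states = rnn_forward(xs, w_h, w_x, h0)
    g = 1.0
    for h in states[1:]:
        g *= w_h * (1.0 - h * h)   # d h_t / d h_{t-1}
    return g
```

Each factor has absolute value at most `|w_h|`, so with `|w_h| < 1` the gradient shrinks at least geometrically with the sequence length. LSTMs and GRUs add gated, mostly additive paths for the state to get around this.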