Author: Sophia Peneva
Machine Learning is more than a buzzword to us at Lab08 as we have the great opportunity to work on products, relying on AI and ML. This inevitably piques the interest of our employees to deep dive into specifics and learn more about the mechanics of ML. Sophie is no exception and her curiosity has led her into the depths of the topic. After sharing her general thoughts on the Hows and Whys of Machine Learning, we now present you with vol. 2 dedicated to neural networks. Enjoy!
I have never been all that great at biology nor have I, to be completely sincere, taken any particular interest in anything biology-related. I will however make an attempt to describe something I am genuinely infatuated with by a brain analogy. I brought up the topic of ‘Machine learning’ in my previous article but in it, I just scratched the surface. This time the aim is to dig a little deeper, so I’ll get my scalpel and let’s get started.
First things first, how are machine learning and biology even remotely related? Well to quote Wikipedia, “The understanding of the biological basis of learning, memory, behaviour, perception, and consciousness has been described by Eric Kandel as the “ultimate challenge” of the biological sciences.”, emphasis on learning. This is to some extent, also Machine learning’s challenge. When we, as human beings, are learning a new skill our brains are forming neurons and connections between neurons inside of them. These connections can be both strengthened and weakened. Practising a particular skill set, for example, can result in a strong connection between two neurons, meaning that firing one of them would most likely trigger the other one. This process has been the inspiration behind neural networks.
Last time, I described the process of classifying images using a linear classifier. As a quick recap, in order to determine whether our image depicts a horse, a cat, a car, etc., we turn the image into an input vector containing its pixel data, multiply it by a matrix of weights and add a vector of biases to the result. This gives us a vector of scores, where each row represents how likely our image is to be of a certain class (a horse, a cat, a dog, etc.). Long story short, output = Wx + b. The weights and biases are what we train in order to get the correct output from the classifier.
So far so good, but what if we don’t have linearly separable data? What if we want better results? That’s where neural networks come in. Let’s take a look at neural networks’ architecture:
Our neural network consists of an input vector, an output vector and a hidden layer. This illustration is of a two-layer Neural network (notice that the input vector is not accounted for when counting the number of layers). The layers are ‘fully connected’ which means that between every two adjacent layers all neurons are pairwise connected (NB neurons from the same layer share no connection).
There are a couple of things to consider when it comes to deciding what the architecture of your neural network will be — how the layers will be connected and the size of the network. Let’s first take a look at how we connect the layers.
When we created a simple linear model, we got the output vector by multiplying the input vector by a weights matrix and adding a vector of biases:
y = Wx + b
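As a minimal sketch of this formula, here is the linear model in plain Python. The function name linear_scores and the toy numbers (a 3-value input and 2 classes) are made up for illustration; a real image would have thousands of pixel values.

```python
def linear_scores(W, x, b):
    """Compute Wx + b: one score per class. W is a list of rows."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# Hypothetical toy values: 3 "pixels" in, 2 classes out.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.0],   # weights for class 0
     [1.0, 0.0, 2.0]]    # weights for class 1
b = [1.0, -2.0]

scores = linear_scores(W, x, b)
print(scores)  # the class with the highest score is the prediction
```

Each row of W produces the score for one class, so W has as many rows as there are classes and as many columns as there are input values.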
So what happens when we put an additional layer in between the input and the output vector? How does that affect our equation? The easiest solution is to consider the following: if we get the hidden layer’s vector as HL = Wx + b, we can multiply it by another set of weights and add another set of biases to get the output layer:
Y = W1(Wx+b) + b1
Unfortunately, this is not correct. The whole point of neural networks is to represent more complex functions. Will we achieve that with this equation though? Since the product of two matrices is a matrix, if we open the brackets we will get:
Y = W1(Wx + b) + b1
Y = W1Wx + W1b + b1
Y = W2x + b2
where W2 = W1W and b2 = W1b + b1.
This is not much different from our linear model. The only difference is the values of the weights and biases, which are supposed to be learned in the process of training anyway. Essentially, we still have a linear model. What we have done here is like taking a linear function y = x + 1, multiplying it by 2, adding 1 and expecting to get a nonlinear function. Obviously, that is not the case.
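To check the collapse numerically, here is a small sketch in plain Python. It stacks two linear layers without any activation and compares the result with a single linear layer whose weights are W2 = W1W and whose biases are b2 = W1b + b1. The helper names and the toy values are assumptions made for this example.

```python
def matvec(M, v):
    """Matrix-vector product; M is a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical toy network: 3 inputs -> 2 hidden -> 2 outputs.
x = [1.0, 2.0, 3.0]
W, b = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]], [1.0, -1.0]
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 0.5]

# Two stacked linear layers: Y = W1(Wx + b) + b1
two_layers = add(matvec(W1, add(matvec(W, x), b)), b1)

# One equivalent linear layer: Y = W2x + b2
W2 = matmul(W1, W)
b2 = add(matvec(W1, b), b1)
one_layer = add(matvec(W2, x), b2)

print(two_layers, one_layer)  # both lists hold the same values
```

Whatever values we pick, the two results agree, which is exactly why stacking linear layers on their own buys us nothing.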
So how do we model nonlinear functions? That’s where activation functions come in.
Activation functions (nonlinear functions) take a number, perform mathematical operations on it and return a number. If we use such a function after every layer (except for the last one), we will achieve nonlinearity. So the way we generate our output should look something like this:
Y = W1(f(Wx + b)) + b1
Where f is our activation function. Naturally the question of what we should use as an activation function arises. There are a few that are widely used in Machine Learning for neural networks.
All of them have pros and cons; however, in recent years ReLU has been the most popular choice. There is no correct answer to the question ‘Which one should I use for my model?’. In order to find out which one works best for your data, you could try experimenting with different activation functions.
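As a rough sketch, here are three activation functions in plain Python, plugged into Y = W1(f(Wx + b)) + b1. Sigmoid and ReLU are mentioned in this article; tanh is included on my assumption that it is another commonly used choice. The helper names and toy values are made up for illustration.

```python
import math

def sigmoid(z):
    """Squashes any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any number into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Keeps positive numbers, replaces negative ones with 0."""
    return max(0.0, z)

def affine(W, x, b):
    """Compute Wx + b; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def two_layer_forward(x, W, b, W1, b1, f):
    """Y = W1(f(Wx + b)) + b1, with f applied to each hidden neuron."""
    hidden = [f(z) for z in affine(W, x, b)]
    return affine(W1, hidden, b1)

# Hypothetical toy network: 3 inputs -> 2 hidden -> 2 outputs.
x = [1.0, 2.0, 3.0]
W, b = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]], [1.0, -1.0]
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 0.5]

y_relu = two_layer_forward(x, W, b, W1, b1, relu)
y_sigmoid = two_layer_forward(x, W, b, W1, b1, sigmoid)
print(y_relu, y_sigmoid)
```

Swapping the activation is a one-argument change here, which makes this kind of experiment cheap to run on a small model.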
Neural network sizes
In order to measure the size of a neural network, two things are taken into consideration: the number of neurons and the number of learnable parameters. In Fig. 1 we have 5+2=7 neurons (we are disregarding the neurons of the input vector), 3×5 + 5×2 = 25 weights and 5+2=7 biases, for a total of 25+7=32 learnable parameters. For comparison, modern convolutional networks contain on the order of 100 million parameters and around 10–20 layers (hence deep learning). This naturally brings up the question ‘So how big should my neural network be?’. Let’s first see what a change in the number of neurons means for our model. Here I have prepared 3 examples of ML models with different numbers of neurons.
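The counting rule used for Fig. 1 can be sketched in a few lines of Python. The function name count_parameters is made up, and the sketch assumes fully connected layers with one bias per neuron, as in the example above.

```python
def count_parameters(layer_sizes):
    """Count neurons, weights and biases of a fully connected network.

    layer_sizes lists the input size first, e.g. [3, 5, 2] for Fig. 1.
    The input vector is not counted as neurons.
    """
    # Each pair of adjacent layers contributes n_in * n_out weights.
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Every non-input neuron has one bias.
    biases = sum(layer_sizes[1:])
    neurons = sum(layer_sizes[1:])
    return neurons, weights, biases, weights + biases

neurons, weights, biases, total = count_parameters([3, 5, 2])
print(neurons, weights, biases, total)
```

For the network in Fig. 1 this gives the same 7 neurons, 25 weights, 7 biases and 32 learnable parameters as the manual count.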
At first glance, the last model which has the most neurons has the best predictions. Every red point is correctly classified as ‘red’, the same goes for every green point. Overall, the models seem to be getting better at predicting, the more neurons they have. Is that really the case, though?
Despite the fact that the third model seems to represent the data perfectly, it has actually grossly overfitted it. What this means is that our model has become extremely good at classifying the points that it was trained on, but it probably won’t perform well out of sample, that is, on new data points. The reason this happens is that the more neurons we have, the more complex functions our model can build to separate the classes in our data. The problem occurs because our initial data will most likely contain some sort of ‘noise’ (pieces of information that bring nothing to the table, such as outliers that should be ignored). If we treat this noise as an important part of our data instead of as outliers, we run the risk of misclassifying a lot of other data points. In the example in Fig. 4, our model has created a pretty complex function in order to correctly classify two additional green points in our training dataset. What happens when we run the model on real data after that? We’ll potentially misclassify a lot of points.
Okay, so does that mean that we should aim for neural networks with fewer neurons? Well, no. Thankfully there are other ways to deal with overfitting, such as regularization and dropout. You might be wondering why we should go through the additional effort of adding regularization into the mix if we can simply build our model with fewer neurons. The big problem with smaller neural networks, however, is that they are harder to train. Their loss functions have relatively few local minima. Most of these local minima are bad but easy to converge to. This means that we are more likely to end up at a local minimum that has a high error rate. On the other hand, bigger neural networks have a lot more local minima and most of them have relatively low errors. To sum up, by choosing a bigger neural network we depend far less on random chance in the initial weight initialization in order to get a good final result. So in the end, to decide how deep we want to go with our neural network, we should answer how deep we are willing to reach into our pockets, as larger neural networks need a lot of computational power.
I mentioned regularization as a way of reducing overfitting. What is regularization though and how does it work?
At its core, regularization means penalizing complexity. As mentioned earlier, if our model is overfitting, it works well on the training data but, due to being too complex, it cannot generalize well and therefore performs poorly on new data. So how do we make our model less complex? When we are training our neural networks, we heavily rely on the loss function. You might remember from the previous article that after computing the loss, i.e. the error, we make changes to our weights. Let’s take a look at the Sigmoid activation function once again.
You might notice that for values of z close to 0, the function is pretty linear (Fig. 6).
So if we keep our weights small, the inputs to the activation function stay close to 0 and the result remains relatively linear even after the activation. This means that with the help of regularization we can achieve a relatively simple function that will not overfit our data. So what we should aim to do is penalize the loss function in such a way that our weights become smaller.
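To see how linear the sigmoid is around 0, we can compare it with its tangent line at z = 0, which is 0.5 + z/4 (the slope of the sigmoid at 0 is 1/4). This is only an illustrative sketch; the function names and sample points are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tangent_at_zero(z):
    """Straight line touching the sigmoid at z = 0 (value 0.5, slope 1/4)."""
    return 0.5 + z / 4.0

def gap(z):
    """How far the sigmoid is from the straight line at z."""
    return abs(sigmoid(z) - tangent_at_zero(z))

for z in (-0.5, -0.1, 0.0, 0.1, 0.5, 3.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  "
          f"line={tangent_at_zero(z):.4f}  gap={gap(z):.4f}")
```

Near 0 the gap is tiny, while at z = 3 the sigmoid has clearly bent away from the line. Small weights keep z in the nearly linear region.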
If we take the L2 regularization, for example, the cost there is computed as follows:
You can see here that it is the sum of the loss function and the so-called ‘regularization term’. If λ is equal to 0 then there is no difference in how we compute the cost. The idea behind the cost function is that we minimize it with every iteration during the training process. If we simply minimize the loss (i.e. the error) we will eventually overfit our data. However, if we minimize both the loss and the complexity we will have a more balanced model which will neither overfit nor underfit. Obviously, the value that we choose for λ is very important and playing around with it can have a dramatic effect on the resulting model.
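Here is a minimal sketch of such a cost in plain Python, assuming the common form cost = loss + λ · (sum of all squared weights). The exact scaling of the regularization term varies between sources (some divide by the number of training examples, some add a factor of 1/2), so treat this form, the function names and the toy numbers as assumptions. Biases are left out of the penalty here, which is also a choice made for this example.

```python
def l2_penalty(weight_matrices):
    """Sum of squared weights over all layers (each matrix is a list of rows)."""
    return sum(w * w for W in weight_matrices for row in W for w in row)

def regularized_cost(loss, weight_matrices, lam):
    """Assumed form: data loss plus lambda times the L2 penalty."""
    return loss + lam * l2_penalty(weight_matrices)

# Hypothetical weights of a tiny two-layer network and a made-up data loss.
W = [[1.0, -2.0], [0.5, 0.0]]
W1 = [[3.0, 1.0]]
data_loss = 0.25

cost_no_reg = regularized_cost(data_loss, [W, W1], lam=0.0)
cost_reg = regularized_cost(data_loss, [W, W1], lam=0.1)
print(cost_no_reg, cost_reg)
```

With λ = 0 the cost is just the loss. With λ > 0, large weights make the cost larger, so minimizing the cost pushes the weights towards smaller values.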
In order to illustrate the effect, here are 3 models that all have the same number of neurons but with different regularization strengths (going from smallest to largest). As you can see, the third model which has the largest regularization strength did not overfit the data despite having a lot of neurons. Now compare it to the first model which did the exact opposite.
You can play around with different neural networks’ sizes and regularization strengths here
Everything we discussed so far was but a glimpse into what the topic of neural networks has to offer. There are so many more things to cover but it would be overkill to try and do it all in this article (trying to cover everything would lead to overfitting anyway). What I hopefully achieved was to give you a general understanding of what neural networks are and perhaps to pique your interest in them.
 Overfitting and underfitting https://satishgunjal.com/underfitting_overfitting/
Sophia Peneva is a Software Developer at Lab08, working on the UserTribe platform. Previously, she has worked for SAP and last summer, she became one of Google’s interns, working on a Youtube algorithm project. She has experience with C++, PHP, Python, Java, MySQL, MongoDB and others. When it comes to deep learning, Sophie now has Tensorflow experience in her pocket. Sophie has been interested in ML for quite some time, showing great passion and engagement in getting her hands dirty. We hope you’ve enjoyed the product of her enthusiasm!