Optimization

In the chapter “What is a Neural Network?” we covered the concept of training a neural network. This training process is the compute-heavy number-crunching we associate with machine learning, and it is called optimization.

The good news is that this is exactly what TensorFlow.js is good at. In this lecture, we’ll cover the mechanics of optimization using the low-level core library.

Code

The code for this lecture, and the next lecture on optimization, is in the tensorflow-optimization folder in the source code associated with this course.

That folder has three files, like so:

.
├── index.html  (1)
├── main.js     (2)
└── scratch.js  (3)
1 This index.html file loads TensorFlow.js and the scratch.js file.
2 This file contains all the completed code for this lecture.
3 This file should be empty.

Open the index.html as we taught in the setup-instructions lecture, then open the console in the browser; this is where the messages will go.

Add your code to scratch.js and refresh the browser to execute it. If you have problems, check main.js to see the correct completed code.

Use Case

To demonstrate how optimization works, let’s take an embarrassingly simple use case: something so simple we can deduce the best value in our heads, and then use TensorFlow.js to figure it out for us.

Imagine we have an array [2, -5, 16, -24, 3]. We want to multiply each value of the array by a number x so that, after the multiplication, they all add up to 0.

If x was 0 then the array would end up looking like [2 x 0, -5 x 0, 16 x 0, -24 x 0, 3 x 0] which results in [0, 0, 0, 0, 0].

If you multiply everything by 0, you get 0. I did mention that this was an embarrassingly simple use case!
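
We can sanity-check this with plain JavaScript, no TensorFlow.js required; the snippet below is just the arithmetic from the paragraphs above:

var values = [2, -5, 16, -24, 3];
var scaled = values.map(y => y * 0);            // [0, 0, 0, 0, 0]
console.log(scaled.reduce((a, b) => a + b, 0)); // 0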

We know the optimal value for x is 0. But what if x started life as 4.12? How would you use TensorFlow.js to optimize, to train, x to become 0?

Variables

We first need to define our values, like so:

var x = tf.variable(tf.scalar(4.12));
var ys = tf.tensor([2, -5, 16, -24, 3]);

x is our variable Tensor, which we create using the special tf.variable function; this tells TensorFlow.js that x is trainable.

ys is a Tensor to hold a sample list of numbers.

Note

Tensors are read-only by default in TensorFlow; once you create them, they cannot change. Variables are different; variables can change over time. We want TensorFlow to optimize x to 0, so we define it as a variable.
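
To make that concrete, here is a quick sketch of my own (not from the lecture): a Variable has an assign method that replaces its value, something a plain Tensor doesn’t allow.

var t = tf.tensor([1, 2, 3]);         // a plain Tensor: read-only once created
var v = tf.variable(tf.scalar(4.12)); // a Variable: its value can change
v.assign(tf.scalar(0));               // replace the Variable's value
v.print();                            // Tensor 0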

Loss Function

In any optimization, there is a loss function: a function that returns a number indicating how wrong we are. In this case, it will return how wrong our value of x is.

In the previous lecture, I introduced you to the handy Mean Squared Error equation; we can use that here as our loss function, like so:

var loss = ys.mul(x).square().mean();
loss.print();

With a value of 4.12 this initially results in:

Tensor
    2953.54541015625

With a value of 0 this results in:

Tensor
    0

With a value of -4.12 this results in:

Tensor
    2953.54541015625

So we know that if the value of x is trained below 0, the loss will increase again. The minimum value of our loss will be 0.
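
If you want to check that 2953.545… by hand, the same mean squared error is easy to compute in plain JavaScript; the tiny difference from the TensorFlow.js output comes from its 32-bit floats.

var values = [2, -5, 16, -24, 3];
var x = 4.12;
var mse = values
    .map(y => (y * x) ** 2)                            // square each product
    .reduce((sum, sq) => sum + sq, 0) / values.length; // then average
console.log(mse); // 2953.5456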

Gradient Descent

Another reason to use such a simple use case with one variable is that we can visualize the process in a graph. As you add variables, the number of dimensions of the graph increases, and it becomes harder to reason about.

Figure 1. The error curve

On the x-axis, we see values for x. On the y-axis, we see values for the mean squared error. The dotted curve is the loss at different values of x.

As x moves towards 0, the mean squared error also goes down; our loss goes down. As we go past 0 into negative territory, the loss starts going up again.

The lowest point of the curve is the optimal value of x.

We start at 4.12, and we slide down the gradient of the curve till we reach the lowest point.

Figure 2. Where is our starting point?

The thing is, a computer doesn’t know the slope of the curve, so how would it figure out how to slide down it?

Figure 3. Deciding which way is down towards the least error value

One solution is just to try 4.11 and 4.13. If the loss is less with 4.11, then the curve seems to be sloping down in that direction, so just follow it; perhaps later try 4.10, then 4.09.

You keep on doing that until you reach the lowest point and the loss starts going up again; then you are reasonably sure you are at the lowest point.

That’s called Gradient Descent, and it’s a conventional algorithm for training Machine Learning models.
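
Here is a naive sketch of that “try either side” idea in plain JavaScript. This is not what TensorFlow.js does internally (it computes exact gradients rather than probing), but it captures the intuition:

var values = [2, -5, 16, -24, 3];

function loss(x) { // the same mean squared error as before
    return values.map(y => (y * x) ** 2).reduce((a, b) => a + b, 0) / values.length;
}

var x = 4.12;
var step = 0.01;
while (step > 1e-6) {
    if (loss(x - step) < loss(x)) {
        x -= step;   // downhill to the left, keep going
    } else if (loss(x + step) < loss(x)) {
        x += step;   // downhill to the right, keep going
    } else {
        step /= 2;   // neither side is lower, so take smaller steps
    }
}
console.log(x); // very close to 0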

Optimizing for one variable gives a 2D curve; optimizing for two variables gives a 3D surface, and the lowest point on that surface is the optimal value of those two variables.

Figure 4. The optimal value for a 3D surface

Important

Whether you are dealing with one variable or 10,000, there is a surface with a lowest point. The value of all those variables at that lowest point is the optimal set of values.

Optimizer

That’s the theory; how do we do it practically with TensorFlow.js? We use something called an optimizer.

var learningRate = 0.0001;
var optimizer = tf.train.sgd(learningRate);

We construct an optimizer by using one of the available Training Optimizers[1] in TensorFlow.js. The one above is the Stochastic Gradient Descent[2] (sgd) optimizer, a faster implementation of the Gradient Descent mechanism discussed above.

The sgd optimizer takes the learning rate as a parameter; the lower the learning rate, the smaller the increments by which it tunes the variables. A large learning rate means the training will be fast, but if it’s too large, it might never converge on the actual optimum value. A small learning rate will train more slowly but is more likely to converge without stepping over the optimum value.

Note

Choosing the right optimizer requires a much deeper knowledge of Machine Learning than is covered in this course. A thorough analysis of these different optimizers and when to use them can be found in the article An overview of gradient descent optimization algorithms[3].

The TensorFlow.js documentation for all the optimizers, apart from sgd, has links to academic papers discussing the use of that optimizer. For example, adam[4] links to the paper Adam: A Method for Stochastic Optimization[5].

But a simple guide for beginners might be to use sgd for shallow networks without many layers and use adam or rmsprop for bigger networks with more layers.
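
For reference, those optimizers are constructed the same way as sgd; the learning rates below are just illustrative placeholders, not recommendations:

var adamOptimizer = tf.train.adam(0.01);
var rmspropOptimizer = tf.train.rmsprop(0.01);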

Once you’ve created an optimizer, you call optimizer.minimize to perform the optimization. In our use case, we have one variable, x, which we want TensorFlow.js to try to optimize for us.

optimizer.minimize takes as input a loss function, a function that needs to return the loss as a Tensor, like so:

console.log(x.dataSync()); (1)
optimizer.minimize(() => { (2)
  return ys
    .mul(x)
    .square()
    .mean();
});
console.log(x.dataSync());  (3)
1 This prints out the current value of x, which at the start should be 4.12
2 Our call to minimize, which takes as input a loss function: a function that returns a Tensor telling the optimizer how wrong the current value of x is.
3 This prints out the value of x after optimization.

The loss function has to use x somewhere in its calculation; if x isn’t used, then there is no point optimizing for it, and TensorFlow.js will throw an error. We used the mean squared error function we discussed above.

After the single iteration of optimization above, the value of x should be different; on my computer, with a learning rate of 0.0001, the variable x becomes 3.98 from a starting point of 4.12.
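
You can check that number by hand. For this particular loss, L(x) = mean((yᵢ · x)²) = x² · mean(yᵢ²) = 174 · x², so the gradient is dL/dx = 348 · x, and one sgd step subtracts the learning rate times the gradient:

var x = 4.12;
var learningRate = 0.0001;
var gradient = 348 * x;          // dL/dx at x = 4.12 is 1433.76
x = x - learningRate * gradient; // 4.12 - 0.143376
console.log(x);                  // ≈ 3.98, matching the output above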

How do we get it to 0? We simply run it again and again with the same data. For our example, let’s run it 200 times with a simple loop, like so:

var x = tf.variable(tf.scalar(4.12));
var ys = tf.tensor([2, -5, 16, -24, 3]);

var optimizer = tf.train.sgd(0.0001);

console.log(x.dataSync());
for (let i = 0; i < 200; i++) {
    optimizer.minimize(() => {
        return ys
            .mul(x)
            .square()
            .mean();
    });
    console.log(x.dataSync());
}

By the end of 200 iterations (epochs), I get 0.00345, not zero but close. If I run it 1000 times, I get 1.707e-15.
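
Because the gradient is 348 · x, each epoch simply multiplies x by (1 − 0.0001 × 348), so we can predict these numbers in closed form; the small differences from the printed values come from 32-bit floats:

var factor = 1 - 0.0001 * 348;              // each epoch multiplies x by 0.9652
console.log(4.12 * Math.pow(factor, 200));  // ≈ 0.0035
console.log(4.12 * Math.pow(factor, 1000)); // ≈ 1.7e-15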

That’s the simplicity of supervised machine learning: you get some data, define a loss function, choose an optimizer, and run it across the data repeatedly until you get the desired outcome. That’s training; that’s Machine Learning.

Cleaning Up

JavaScript does much of the cleaning up after you. In other languages, if you create a variable, you have to remember to tell the computer when you are finished with it, so it knows it can clean it up and let the memory be used by something else.

Tensors, however, use your Graphics Card, your GPU. When using your GPU, JavaScript can’t automatically clean up after itself, so you need to clean up after yourself. If you fail to do this, your application will have a memory leak and will eventually consume all the memory on your computer and die.

There are two ways of doing this: the dispose function and the tf.tidy function. Let’s look first at dispose.

for (let i = 0; i < 200; i++) {
    var loss = null;
    optimizer.minimize(() => {
        loss = ys
            .mul(x)
            .square()
            .mean();
        return loss;
    });
    loss.dispose(); (1)
    console.log(x.dataSync());
}
1 However you do it, make sure that after you have finished with a Tensor you call dispose on it.

Remembering all the Tensors that get created can become tedious and error-prone, so TensorFlow.js has a helper function called tf.tidy, which you can use like so:

for (let i = 0; i < 200; i++) {
    tf.tidy(() => { (1)
        optimizer.minimize(() => {
            return ys
                .mul(x)
                .square()
                .mean();
        });
        console.log(x.dataSync());
    }); (1)
}
1 We wrap the code in our app that creates Tensors in a tf.tidy function; when the inner function returns, it automatically disposes of all the Tensors that were created inside it.
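
If you want to confirm that your cleanup is working, tf.memory() reports how many Tensors are currently allocated; if numTensors keeps climbing from one iteration to the next, something is leaking:

for (let i = 0; i < 3; i++) {
    tf.tidy(() => {
        ys.mul(x).square().mean(); // intermediate Tensors created here...
    });                            // ...are disposed when tidy returns
    console.log(tf.memory().numTensors); // should stay constant
}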

Summary

The process of training a Neural Network is called optimization.

A Neural Network is just a large TensorFlow.js graph of different Tensors and operations performed on those Tensors. Some of those Tensors are read-only, for instance the training data; some are variables, for example the weights.

An optimizer is the thing that tunes those variables, adjusting them based on information about how wrong the Neural Network is, information that comes from a loss function.

We can run the optimizer as many times as we want; we call each iteration an epoch.

This is how we do supervised machine learning. We run the neural network with some sample data, compare the result it gives with the known good result, calculate a loss, then let the optimizer tune the variables, and then repeat it until we think we have optimized enough.


1. Training Optimizers https://js.tensorflow.org/api/latest/#Training-Optimizers
2. Stochastic Gradient Descent https://js.tensorflow.org/api/latest/#train.sgd
3. An overview of gradient descent optimization algorithms https://ruder.io/optimizing-gradient-descent/
4. Adam TensorFlow.js Optimizer API https://js.tensorflow.org/api/latest/#train.adam
5. Adam: A Method for Stochastic Optimization https://arxiv.org/abs/1412.6980

