Thursday, April 25, 2019


  • In most learning networks, the error is calculated as the difference between the actual output and the predicted output.
  • The error function tells us how far we are from the solution.
  • The function used to compute this error is known as the loss function.
  • Different loss functions give different errors for the same prediction, so the choice of loss function has a considerable effect on the performance of the model.
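To make the last point concrete, here is a minimal sketch that scores the same prediction with two common loss functions. The numbers and the function names are made up for illustration; the post itself does not name a particular loss.

```python
# Two common loss functions applied to the SAME prediction.
# The data below is hypothetical, chosen only for illustration.

def mean_squared_error(actual, predicted):
    # Average of squared differences: punishes large misses heavily.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    # Average of absolute differences: every unit of miss counts the same.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.0, 1.0, 3.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print("squared error: ", mse)
print("absolute error:", mae)
```

The squared error comes out larger here because the single big miss (3.0 instead of 1.0) is squared, which is one way the choice of loss changes what the model is pushed to fix first.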
Imagine we are standing on top of a mountain (say, Mount Everest) and we want to descend. It is not easy: the mountain is big and it is cloudy, so we cannot see the big picture. Instead, we look around at all the possible directions in which we can walk.

If we keep taking steps that decrease the height, and with it the error, we will make it all the way down the mountain.

  • In this case, the key metric that we use to solve the problem is the height.
  • We can call the height the error. It tells us how badly we are doing at the moment and how far we are from an ideal solution.

How do we tell the computer how far it is from a perfect solution?
We can count the number of mistakes. (In this example, that count is our height.)
Now let us try to decrease the number of errors.

1) After moving one step, we have decreased the number of errors.
2) After moving another step, we have removed all the errors.

  • The problem with this approach is that the algorithm takes very small steps. The reason is calculus: the tiny steps are calculated using derivatives.
  • The problem with small steps is:
    1. We start with 2 errors and move a small amount.
    2. We are still at 2 errors.
    3. Even after moving another tiny amount, we are still at 2 errors.
This is equivalent to using gradient descent to descend from an Aztec pyramid with flat steps.

  • If we are standing on one of the flat steps and look around for errors, we always see 2 errors in every direction, so we cannot tell which way to go.
  • On a smooth mountain, by contrast, we can detect small changes in height and figure out in which direction the height decreases the most.
  • In math terms, this means that in order to do gradient descent, our error function cannot be discrete. It should be continuous.
In the figure, we have 6 points, of which 4 are correctly classified and the other 2 are incorrectly classified.
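The flat-step problem can be reproduced in a few lines. This is only a sketch: the six points, their labels, and the starting line x + y - 3 = 0 are hypothetical values picked so that 4 points are classified correctly and 2 are not, as in the example.

```python
# Discrete error: count the misclassified points for a line w1*x + w2*y + b = 0.
# Points, labels (+1 / -1) and the line are hypothetical, for illustration only.

points = [
    ((3.0, 3.0), 1), ((2.0, 3.0), 1), ((1.0, 1.0), 1),
    ((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 2.0), -1),
]

def count_errors(w1, w2, b):
    errors = 0
    for (x, y), label in points:
        score = w1 * x + w2 * y + b
        # A point is misclassified when its score has the wrong sign.
        if label * score <= 0:
            errors += 1
    return errors

# Nudge the line by tiny amounts and watch the error count.
counts = [count_errors(1.0, 1.0, -3.0 + 0.01 * k) for k in range(5)]
print(counts)
```

Every tiny move reports the same count, so the count alone gives no hint about which direction is better. That is the flat step of the pyramid.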

Assume the error function gives a large penalty to the 2 incorrectly classified points and a small penalty to the 4 correctly classified points. Here we represent the penalty by the size of the points.
  • The penalty is roughly the distance from the boundary when a point is misclassified, and almost 0 when it is correctly classified.
  • Let's add up the errors from all the points to get the total error.
  • The idea now is to move the line around to decrease this total error. In the figure below, we have decreased the error.
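One possible way to write such a penalty down is sketched below. It assumes the penalty is exactly the distance to the line for misclassified points and exactly 0 for correct ones; the points and both lines are hypothetical and are not taken from the figures.

```python
import math

# Hypothetical points with labels +1 / -1 (not taken from the figures).
points = [
    ((3.0, 3.0), 1), ((2.0, 3.0), 1), ((1.0, 1.0), 1),
    ((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 2.0), -1),
]

def total_penalty(w1, w2, b):
    # Sum of distances to the line w1*x + w2*y + b = 0 over misclassified points.
    norm = math.hypot(w1, w2)
    total = 0.0
    for (x, y), label in points:
        score = w1 * x + w2 * y + b
        if label * score <= 0:            # wrong side of the line
            total += abs(score) / norm    # distance from the boundary
    return total

error_before = total_penalty(1.0, 1.0, -3.0)   # starting line
error_after = total_penalty(0.5, 1.0, -2.5)    # line after moving it a little
print(error_before, error_after)
```

Both lines still misclassify the same 2 points, so a plain error count would not change, but the summed penalty does go down. That difference is what lets us tell a better line from a worse one.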
We can now use gradient descent to solve our problem.

  1. We look around to see which direction takes us down the most or, equivalently, which direction reduces the error the most.
  2. We take a step in that direction.
  3. On the mountain we go one step down, and in the graph we reduce the error a bit, for example by correctly classifying one of the points.
  4. Now we look around again and repeat the steps above.
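The loop above can be sketched in code as follows. The post does not give a formula for the error, so this sketch swaps in the log-loss (a smooth, continuous error often used for classification) as an assumption. The points, the starting line, the learning rate, and the number of steps are all hypothetical.

```python
import math

# Hypothetical points with labels +1 / -1 (not taken from the figures).
points = [
    ((3.0, 3.0), 1), ((2.0, 3.0), 1), ((1.0, 1.0), 1),
    ((0.0, 0.0), -1), ((1.0, 0.0), -1), ((3.0, 2.0), -1),
]

def error(w1, w2, b):
    # Assumed error function: mean log-loss, continuous in w1, w2 and b.
    total = 0.0
    for (x, y), label in points:
        margin = label * (w1 * x + w2 * y + b)
        total += math.log1p(math.exp(-margin))
    return total / len(points)

def gradient(w1, w2, b):
    # Derivatives of the error: they point in the direction of steepest increase.
    g1 = g2 = gb = 0.0
    for (x, y), label in points:
        margin = label * (w1 * x + w2 * y + b)
        coeff = -label / (1.0 + math.exp(margin))
        g1 += coeff * x
        g2 += coeff * y
        gb += coeff
    n = len(points)
    return g1 / n, g2 / n, gb / n

w1, w2, b = 1.0, 1.0, -3.0      # starting line: 2 points misclassified
learning_rate = 0.1             # size of each step (assumed value)
initial_error = error(w1, w2, b)

for step in range(200):
    g1, g2, gb = gradient(w1, w2, b)   # 1. look around for the steepest direction
    w1 -= learning_rate * g1           # 2. take a small step downhill
    w2 -= learning_rate * g2
    b -= learning_rate * gb            # 3./4. the error drops a bit; repeat

final_error = error(w1, w2, b)
print(initial_error, final_error)
```

Stepping against the gradient is the "direction that takes us down the most", and the small learning rate keeps each step tiny enough that the error goes down rather than overshooting.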

On the left, we have reduced the height and successfully descended from the mountain; on the right, we have reduced the error to the minimum possible value and successfully classified our points.
