04-29-2024, 08:32 AM
Great opening discussion.
I think the "synthesis" between the two views is found by examining the role played by randomness during training.
During training, the weights are updated by backpropagation (BP) together with an optimization algorithm called "stochastic gradient descent", or SGD: BP computes the gradients, and SGD decides how to step along them. The name SGD might seem intimidating, but the concept is not. Imagine the error rate of the neural net as a landscape (in reality it has as many dimensions as the network has weights, but the picture is the same). Initially, when you begin training, you are at a very high peak, because the error rate is very high. You want to find the lowest point in the landscape (or a point at nearly the same altitude). How do you do that? You find the direction of steepest descent, the negative gradient, and you follow it. This is called gradient descent (GD), and this is how classical training with BP is done.

But there's a problem. Because GD is deterministic, if you had started at a very slightly different point on the mountain peak, you could have ended up at a completely different "low point" which may actually be nowhere near the lowest altitude (lowest error rate). This is called a "local minimum": think of a bowl-shaped valley at the top of an old mountain. The valley is the lowest point in its own neighbourhood, but the whole valley is still at a high altitude. If you had happened to descend into that valley instead of down the outer slopes of the mountain, you would have gotten stuck in a local minimum.

SGD adds stochasticity to gradient descent; in practice the noise comes from estimating the gradient on small random mini-batches of training data rather than on the whole dataset. The effect is a little like carrying a pogo stick which you occasionally pull out and use to take a random leap in some direction. By doing this, you greatly improve your chances of finding the global minimum (or a minimum close to it), and significantly reduce your chances of ending up trapped in a local minimum. In the case of the bowl-shaped mountain-top valley, if you got your pogo stick out part way down the descent into the valley, you could land back on the outer slope of the mountain and escape the local minimum.
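Here is a minimal sketch of that picture in Python. The 1D loss function, the starting point, and the noise schedule are all made up for illustration, and the injected Gaussian noise is only a stand-in for the mini-batch noise of real SGD, but it shows the pogo-stick effect: plain GD settles into the shallow local minimum, while the noisy run can hop the barrier and settle near the deeper one.

```python
import random

# Toy 1D "loss landscape" with a shallow local minimum near w ~ 1.1
# and a deeper global minimum near w ~ -1.3. (Illustrative only; a real
# network's loss lives in millions of dimensions.)
def loss(w):
    return w**4 - 3 * w**2 + w

def grad(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, lr=0.01, steps=2000, noise=0.0, seed=0):
    """Plain gradient descent when noise == 0; noisy descent otherwise.

    The injected Gaussian noise is a stand-in for the mini-batch noise of
    real SGD (the "pogo stick"); it is annealed to zero so the run settles.
    """
    rng = random.Random(seed)
    for t in range(steps):
        g = grad(w)
        if noise:
            g += rng.gauss(0.0, noise * (1 - t / steps))  # anneal the noise
        w -= lr * g
    return w

start = 0.9  # inside the basin of the shallow local minimum
print("plain GD ends at w =", round(descend(start), 3))              # ~1.1 (stuck)
print("noisy GD ends at w =", round(descend(start, noise=15.0), 3))  # often ~ -1.3, varies with seed
```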
OK, so far so good; this is just an algorithm like any other algorithm. However, let's examine very closely the source of randomness in this procedure. There is a concept in computational complexity theory called derandomization. Roughly: for a great many randomized algorithms it is provable, and under widely believed assumptions (this is the content of the P = BPP conjecture) it is expected in general, that the randomness can be removed without giving up the algorithm's provable guarantees, at the cost of at most a modest overhead in running time. What this means is that, for every randomized algorithm A you give me (a "randomized algorithm" is any algorithm that uses randomness), I can expect to give you back a deterministic algorithm A' that satisfies essentially the same provable properties as A. Thus there should exist a derandomized version of SGD that performs just as well as SGD does.
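In fact, for SGD the derandomization is already there in a weak, practical sense: the "random" leaps are drawn from a pseudorandom number generator, so the whole training run is a deterministic function of the seed. A tiny sketch (the seed and values are arbitrary):

```python
import random

# In practice the "randomness" in an SGD run comes from a pseudorandom
# number generator (PRNG). Fix the seed and the entire "random" sequence,
# and hence the whole training run, is a deterministic function of that seed.
run_a = random.Random(42)
run_b = random.Random(42)

leaps_a = [run_a.gauss(0.0, 1.0) for _ in range(5)]
leaps_b = [run_b.gauss(0.0, 1.0) for _ in range(5)]

print(leaps_a == leaps_b)  # True: identical "random" leaps, fully reproducible
```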
Now, let's move to the domain of password cracking. In cracking, we would love to fool a user into believing they have used a truly random seed when, in fact, it is a pseudo-random seed which we ourselves control or can predict. In this way, we have fooled the user into believing the password/key they generated was random when, in fact, it was predictable, and we can quickly reconstruct the key without having to do a brute-force search. You can think of this kind of password-generation attack as an instance of derandomization. Password generation is a randomized algorithm that just consists of saying, "give me X random bits". Derandomizing that algorithm lets me satisfy whatever constraints you have on your password generator ("use lower-case, upper-case, numerals, special characters", etc.) but in a way where the result that is generated is not actually random.
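A minimal sketch of that attack, assuming the victim's generator is seeded from something predictable such as a creation timestamp; the function name, the seed value, and the one-hour search window are all made up for illustration:

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "!@#$%"

def generate_password(seed, length=12):
    """Looks random and satisfies the usual charset rules, but the output
    is a deterministic function of the seed."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(length))

# Victim seeds the generator with, say, a Unix timestamp (hypothetical value).
victim_seed = 1_714_380_000
password = generate_password(victim_seed)

# An attacker who can guess the rough creation time just replays the generator
# over a small window of candidate seeds -- no brute force over the full
# password space required.
for candidate in range(victim_seed - 3600, victim_seed + 3600):
    if generate_password(candidate) == password:
        print("recovered seed:", candidate, "password:", generate_password(candidate))
        break

# The non-exploitable alternative draws every character from the OS's CSPRNG:
safe_password = "".join(secrets.choice(ALPHABET) for _ in range(12))
```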