SNoW implements function approximation versions of Perceptron's additive update rule, a stochastic approximation to the Gradient Descent (GD) algorithm, and of Winnow's multiplicative update rule, an Exponentiated Gradient Descent (EGD) algorithm [Kivinen and Warmuth, 1997]. See the -G option for usage details.
Both use the same training policy. They are non-mistake-driven, meaning that every target node to which an example is presented performs an update on that example, whether or not the network can be said to have made a mistake on it. They are also on-line, as opposed to the true Gradient Descent algorithm, in which the total error incurred by a weight vector over an entire training set, and the total update to that weight vector with respect to that error, are calculated before any weights are changed. Mitchell calls the on-line version a stochastic approximation. A disadvantage of this approximation is that it requires a smaller learning rate to guarantee convergence. An advantage is that it is less susceptible to ``getting stuck'' in a local minimum that is not the global minimum [Mitchell, 1997].
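To make the distinction concrete, the following sketch contrasts the two regimes for a single linear unit under squared error. This is plain NumPy, not SNoW code; the function names, toy data, and learning rate are illustrative assumptions.

```python
import numpy as np

def batch_gd_epoch(w, X, y, lr):
    """True Gradient Descent: accumulate the gradient of the total squared
    error over the entire training set, then update the weights once."""
    grad = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        grad += (x_i @ w - y_i) * x_i  # gradient of 0.5 * (x_i.w - y_i)^2
    return w - lr * grad

def online_gd_epoch(w, X, y, lr):
    """Stochastic (on-line) approximation: update after every example,
    mistake or not, as in SNoW's regression training policy."""
    for x_i, y_i in zip(X, y):
        w = w - lr * (x_i @ w - y_i) * x_i
    return w

# Toy usage: the on-line version typically calls for a smaller learning rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])
w = np.zeros(5)
for _ in range(20):
    w = online_gd_epoch(w, X, y, lr=0.01)
```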
Specifically, let $\mathcal{A}_t$ be the set of active features in a given example that are linked to target node $t$, let $w_{t,i}$ be the weight of feature $i$ in $\mathcal{A}_t$, let $\alpha$ be the learning rate, let $s_i$ be the strength of feature $i$ in the given example, and let $s_t$ be the strength of target $t$ in that same example. Then $\Omega_t = \sum_{i \in \mathcal{A}_t} w_{t,i} s_i$ is the predicted activation of target node $t$ before updating, and $s_t$ will represent the correct activation for this example while updating.
A regression update in SNoW proceeds as follows:
GD: When enabled in conjunction with Perceptron, the GD update rule becomes:
$$w_{t,i} \leftarrow w_{t,i} + \alpha \, (s_t - \Omega_t) \, s_i \quad \forall i \in \mathcal{A}_t$$
EGD: When enabled in conjunction with Winnow, the corresponding algorithm is referred to as Exponentiated Gradient Descent in the literature:
$$w_{t,i} \leftarrow w_{t,i} \cdot \alpha^{(s_t - \Omega_t) \, s_i} \quad \forall i \in \mathcal{A}_t$$
(Note that $\beta$ is never used in this case.)
Naive Bayes does not have a Gradient Descent update rule. In both Winnow and Perceptron, each example is presented only to targets active in that example.
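For illustration, here is a minimal sketch of the two per-example updates on a single target node, with a sparse example represented as a map from active feature ids to strengths. This is plain Python, not SNoW's implementation; the names `gd_update` and `egd_update`, the default initial weights, and the exact EGD form follow the reconstruction above and are assumptions.

```python
def predicted_activation(weights, example):
    """Omega_t: sum of w_{t,i} * s_i over the active features i of the example."""
    return sum(weights.get(i, 0.0) * s_i for i, s_i in example.items())

def gd_update(weights, example, s_t, alpha):
    """Additive (Perceptron-style) regression update:
    w_{t,i} <- w_{t,i} + alpha * (s_t - Omega_t) * s_i for each active feature i.
    Unseen features start at 0.0 here (an assumption, not SNoW's default)."""
    error = s_t - predicted_activation(weights, example)
    for i, s_i in example.items():
        weights[i] = weights.get(i, 0.0) + alpha * error * s_i

def egd_update(weights, example, s_t, alpha, default_weight=1.0):
    """Multiplicative (Winnow-style) exponentiated update:
    w_{t,i} <- w_{t,i} * alpha ** ((s_t - Omega_t) * s_i) for each active feature i.
    The default initial weight of 1.0 is an assumption."""
    error = s_t - predicted_activation(weights, example)
    for i, s_i in example.items():
        weights[i] = weights.get(i, default_weight) * alpha ** (error * s_i)

# Toy usage: one target node; an example with active features 3 and 7 of
# strength 1.0 whose correct activation (target strength) is 0.5.
w_gd, w_egd = {}, {}
gd_update(w_gd, {3: 1.0, 7: 1.0}, s_t=0.5, alpha=0.1)
egd_update(w_egd, {3: 1.0, 7: 1.0}, s_t=0.5, alpha=1.35)
```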