The activation level of
a neuron in an artificial neural network is
a real number often limited to the range 0 to 1, or –1 to 1. In the case of an input neuron the value is obtained externally to the network. In the case
of a hidden neuron or output neuron the
value is obtained from the neuron's activation function. The activation of a neuron is sometimes thought of as
corresponding to the average firing rate of a biological neuron.
In neural networks,
an activation function is the function that describes the output behaviour of a
neuron. Most network architectures start by computing the weighted sum of the
inputs (that is, the sum of the products of each input with the weight associated with that input). This quantity, the total net input, is then usually transformed in some way, using what is
sometimes called a squashing
function. The simplest squashing
function is a step function: if the total net input is less than 0 (or more
generally, less than some threshold T) then the output
of the neuron is 0, otherwise it is 1. A common squashing function is the logistic function.
In
summary, the activation function is the result of applying a squashing function
to the total net input.
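To make this concrete, here is a minimal sketch in Python (the function names are illustrative, not from any particular library):

import math

def total_net_input(inputs, weights):
    # The weighted sum of the inputs.
    return sum(x * w for x, w in zip(inputs, weights))

def step(net, threshold=0.0):
    # The simplest squashing function: 0 below the threshold T, otherwise 1.
    return 1 if net >= threshold else 0

def logistic(net):
    # A common smooth squashing function.
    return 1.0 / (1.0 + math.exp(-net))

net = total_net_input([0.5, 1.0], [0.8, -0.3])   # total net input = 0.1
print(step(net), logistic(net))                  # 1 0.52...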
A propositional
learning system, developed by
Michalski.
asynchronous vs synchronous
When
a neural network is viewed as a collection of connected computation devices,
the question arises whether the nodes/devices share a common clock, so that
they all perform their computations ("fire") at the same time, (i.e. synchronously)
or whether they fire at different times, e.g. they may fire equally often on
average, but in a random sequence (i.e. asynchronously). In the
simplest meaningful case, there are two processing nodes, each connected to the
other (call them the red node and the yellow node).
In the asynchronous case, if the yellow node
fires first, then it uses the then current value of its input from the red node
to determine its output in time step 2, and the red node, if it fires next,
will use the updated output from the yellow node to compute its new output in
time step 3. In summary, the output values of the red and yellow nodes in time
step 3 depend on the outputs of the yellow and red nodes in time steps 2 and 1,
respectively.
In the synchronous case, each node obtains the
current output of the other node at the same time, and uses the value obtained
to compute its new output (in time step 2). In summary, the output values of
the red and yellow nodes in time step 2 depend on the outputs of the yellow and
red nodes in time step 1. This can produce a different result from the
asynchronous method.
Some neural network algorithms are firmly tied
to synchronous updates, and some can be operated in either mode. Biological
neurons normally fire asynchronously.
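The difference between the two update schemes can be seen in a small Python sketch. The update rule below (fire if the weighted input from the other node reaches a threshold) is a toy assumption, chosen only to make the two schemes produce visibly different results:

def fire(input_value, weight=1.0, threshold=0.5):
    # Toy rule: output 1 if the weighted input reaches the threshold.
    return 1 if weight * input_value >= threshold else 0

red, yellow = 1, 0

# Synchronous: both nodes read each other's old output, then update together.
print("synchronous:", fire(yellow), fire(red))    # -> 0 1

# Asynchronous: yellow fires first, and red then sees yellow's new output.
yellow = fire(red)     # time step 2: yellow uses red's current output
red = fire(yellow)     # time step 3: red uses yellow's updated output
print("asynchronous:", red, yellow)               # -> 1 1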
An attribute is a
property of an instance that may be used to determine its classification. For example, when classifying objects into different types in a
robotic vision task, the size and shape of an instance may be appropriate
attributes. Determining useful attributes that can be reasonably calculated may
be a difficult job - for example, what attributes of an arbitrary chess
end-game position would you use to decide who can win the game? This particular
attribute selection problem has been solved, but with considerable effort and
difficulty.
Attributes
are sometimes also called features.
The axon is the
"output" part of a biological neuron. When a neuron fires, a pulse of electrical
activity flows along the axon. Towards its end, or ends, the axon splits into a
tree. The ends of the axon come into close contact with the dendrites of other neurons. These junctions are termed synapses. Axons may be short (a couple of millimetres) or long (e.g. the
axons of the nerves that run down the legs of a reasonably large animal).
In decision tree pruning one of the issues
in deciding whether to prune a branch of the tree is whether the estimated
error in classification is greater if the branch is present or pruned. To estimate
the error if the branch is present, one takes the estimated errors associated
with the children of the branch nodes (which of course must have been
previously computed), multiplies them by the estimated frequencies that the
current branch will classify data to each child node, and adds up the resulting
products. The frequencies are estimated from the numbers of training data
instances that are classified as belonging to each child node. This sum is
called the backed-up error estimate for the branch node. (The concept of a
backed-up error estimate does not make sense for a leaf node.)
See
also expected error estimate.
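As a small illustrative Python sketch (the variable names are ours, not from any particular source), the backed-up error estimate is just a frequency-weighted sum of the children's error estimates:

def backed_up_error(child_errors, child_counts):
    # Weight each child's estimated error by the estimated frequency with
    # which the branch classifies data to that child, then sum.
    total = sum(child_counts)
    return sum(e * n / total for e, n in zip(child_errors, child_counts))

# e.g. two children holding 6 and 2 training instances:
print(backed_up_error([0.2, 0.4], [6, 2]))   # 0.2*0.75 + 0.4*0.25 = 0.25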
The phase of the error backpropagation learning algorithm when the weights are updated, using the delta rule or some modification of it.
The
backward pass starts at the output layer of the feedforward network, and updates the incoming weights to units in that layer using
the delta rule. Then it works backward, starting with the penultimate layer
(last hidden layer), updating the incoming weights to the units in
those layers.
Statistics
collected during the forward
pass are used during
the backward pass in updating the weights.
see error
backpropagation
In feedforward and some other neural networks, each hidden unit and each output unit is
connected via a trainable weight to a unit (the bias unit) that
always has an activation level of –1.
This
has the effect of giving each hidden or output unit a trainable threshold, equal to the value of the weight from the bias unit to the unit.
Biological neurons come in a variety of types. There is a
lot of further structure and physiology that could be considered. The features
of a neuron described here are those of most interest for those constructing artificial
neural networks (other than spiking-neuron-based models, and those relying on
synchronous activation, such as the Shastri and Ajjanagadde model: see
L. Shastri and V. Ajjanagadde: Behavioral and Brain Sciences (1993)
16, 417-494).
However, from the artificial neural network
point of view, a biological neuron operates as follows: electrical pulses from
other neurons cause the transfer of substances called neurotransmitters (of
which there are several varieties) from the synaptic terminals of a neuron's
axon (think "output") across a structure called a synapse to the
dendrites of other neurons (call them downstream neurons). The arrival of the
neurotransmitter in the dendrite of the downstream neuron increases the
tendency of the downstream neuron to send an electrical pulse itself
("fire"). If enough dendrites of a neuron receive neurotransmitters
in a short enough period of time, the neuron will fire.
Caveat: neurotransmitter substances may be excitatory or inhibitory.
The text above assumes that only excitatory neurotransmitters are involved.
Inhibitory neurotransmitters, as the name suggests, reduce the
tendency of a neuron to fire. Some neurons have a mixture of excitatory
synapses and inhibitory synapses (i.e. synapses using inhibitory neurotransmitters)
and will only fire if there is enough additional excitatory input to overcome
the effect of the inhibitory synapses.
C4.5 is a later version
of the ID3 decision tree induction algorithm.
C5 is a later version of
the ID3 decision tree induction algorithm.
The cell body is the
large open space in the "middle" of the neuron, between the dendrites
and the axon, where the cell nucleus lives. A landmark in biological neuron architecture, but not of significance in relation to
artificial neural networks (except for those trying to model biological neurons
as distinct from using simplified neuron models to solve diverse problems).
When a neuron in a neural network has its value forcibly set and fixed to some externally
provided value, it is said to be clamped. Such a neuron serves as an input unit for the net.
In a classification task
in machine learning, the task is to take each instance and assign it to a
particular class. For example, in a machine vision application, the task might
involve analysing images of objects on a conveyor belt, and classifying them as nuts, bolts,
or other components of some object being assembled. In an optical character
recognition task, the task would involve taking instances representing images
of characters, and classifying according to which character they are.
Frequently in examples, for the sake of simplicity if nothing else, just two
classes, sometimes called positive and negative, are
used.
A decision tree
induction program - a precursor of ID3.
A conjunctive expression
is an expression like:
size=large and colour in {red,
orange}
that is the conjunction of
two or more simple predicates like size=large. The term conjunction refers
to the presence of the logical operator and, which joins or conjoins the
simple predicates. Occasionally the conjunctive expression might just consist
of a single predicate.
See also propositional learning systems and covering
algorithm.
Connectionism
is the neural network approach to solving problems in artificial intelligence -
the idea being that connectionists feel that it is appropriate to encode
knowledge in the weighted connections between nodes in a neural net.
The word "connectionist" is sometimes used with all the heat and
force associated with political "isms": "He's a
connectionist" can be just as damning, coming from the wrong person, as
"He's a communist (or capitalist)". It is also sometimes used as a
simple adjective, as in "NetTalk is a connectionist system."
A covering algorithm, in
the context of propositional learning systems, is an algorithm that develops a cover for
the set of positive examples - that is, a set of conjunctive expressions that account for all the examples but none
of the non-examples.
The algorithm, given a set of examples, works as follows (a sketch in code appears after the steps):
1. Start with an empty cover.
2. Select an example.
3. Find the set of all conjunctive expressions that cover that example.
4. Select the "best" expression x from that set, according to some criterion (usually "best" is a compromise between generality, compactness, and readability).
5. Add x to the cover.
6. Go to step 2, unless there are no examples that are not already covered (in which case, stop).
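Here is one way the loop might look in Python. This is a sketch under simplifying assumptions: conjunctive expressions are modelled as dicts of attribute=value predicates (so set-membership predicates like colour in {red, orange} are not handled), the candidate expressions are supplied ready-made, and "best" is crudely taken to mean "fewest predicates":

def covers(expr, example):
    # An expression covers an example if all its predicates hold.
    return all(example.get(a) == v for a, v in expr.items())

def covering_algorithm(positives, negatives, candidate_exprs):
    cover, uncovered = [], list(positives)
    while uncovered:
        example = uncovered[0]                                   # step 2
        valid = [e for e in candidate_exprs                      # step 3: covers the
                 if covers(e, example)                           #  example but none
                 and not any(covers(e, n) for n in negatives)]   #  of the non-examples
        best = min(valid, key=len)                               # step 4
        cover.append(best)                                       # step 5
        uncovered = [x for x in uncovered if not covers(best, x)]  # step 6
    return cover

positives = [{"size": "large", "colour": "red"}]
negatives = [{"size": "small", "colour": "red"}]
print(covering_algorithm(positives, negatives, [{"size": "large"}]))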
A
decision tree is a tree in which each non-leaf node is labelled with
an attribute or a question of some sort, and in which
the branches at that node correspond to the possible values of the attribute,
or answers to the question. For example, if the attribute was shape, then there would be
branches below that node for the possible values of shape, say square, round and triangular. Leaf nodes are
labelled with a class. Decision trees are used for classifying instances - one starts at the root of the tree, and, taking
appropriate branches according to the attribute or question asked about at each
branch node, one eventually comes to a leaf node. The label on that leaf node
is the class for that instance.
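Classifying an instance with a decision tree is thus a simple walk from the root to a leaf. A minimal Python sketch (the tree representation is an assumption made for illustration):

def classify(tree, instance):
    # A branch node is modelled as (attribute, {value: subtree, ...});
    # a leaf node is just a class label (a string).
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

tree = ("shape", {"square": "class1",
                  "round": "class2",
                  "triangular": "class1"})
print(classify(tree, {"shape": "round"}))   # -> class2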
The delta rule in error backpropagation learning specifies the update to be made to each weight during backprop learning. Roughly speaking, it states that
the change to the weight from node i to
node j should be proportional to the output of node i and
also proportional to the "local gradient" at node j.
The
local gradient, for an output node, is the product of the derivative of the squashing function evaluated at the total net input to node j,
and the error signal (i.e. the difference between the target output and the
actual output). In the case of a hidden node,
the local gradient is the product of the derivative of the squashing function (as
above) and the weighted sum of the local gradients of the nodes to which node j is
connected in subsequent layers of the net. Got it?
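If not, the following Python sketch may help (it assumes the logistic function as the squashing function; the function names are ours):

import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(net_j):
    # Derivative of the logistic squashing function at the total net input.
    y = logistic(net_j)
    return y * (1.0 - y)

def output_local_gradient(net_j, target, actual):
    # Local gradient at an output node: phi'(net_j) times the error signal.
    return logistic_derivative(net_j) * (target - actual)

def hidden_local_gradient(net_j, downstream_gradients, downstream_weights):
    # Local gradient at a hidden node: phi'(net_j) times the weighted sum
    # of the local gradients of the nodes that node j feeds into.
    back = sum(d * w for d, w in zip(downstream_gradients, downstream_weights))
    return logistic_derivative(net_j) * back

def weight_change(eta, local_gradient_j, output_i):
    # The delta rule: the change to w_ji is proportional to the local
    # gradient at node j and to the output of node i.
    return eta * local_gradient_j * output_i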
A dendrite is one of the
branches on the input end of a biological neuron. It has connections, via synapses, to the axons of other neurons.
For our purposes, the
entropy measure
–Σi pi log2(pi)
gives us the average
amount of information in bits in some attribute of an instance. The information referred to is information about what class the instance belongs to, and pi is the probability that an instance belongs to class i.
The
rationale for this is as follows: –log2(p) is the amount
of information in bits associated with an event of probability p -
for example, with an event of probability ½, like flipping a fair coin, –log2(p) is –log2(½) = 1, so there is one
bit of information. This should coincide with our intuition of what a bit means
(if we have one). If there is a range of possible outcomes with associated
probabilities, then to work out the average number of bits, we need to multiply
the number of bits for each outcome (–log2(p)) by the
probability p and sum over all the outcomes. This is where the
formula comes from.
Entropy
is used in the ID3 decision tree induction algorithm.
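In Python, with the pi estimated from class counts (an illustrative sketch):

from math import log2

def entropy(probabilities):
    # Average information in bits: the sum over i of -pi*log2(pi).
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # a fair coin: 1.0 bit
print(entropy([4/7, 3/7]))    # about 0.985 bits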
In training a neural
net, the term epoch is used
to describe a complete pass through all of the training patterns. The weights in the neural net
may be updated after each pattern is presented to the net, or they may be
updated just once at the end of the epoch. Frequently used as a measure of
speed of learning - as in "training was complete after x epochs".
The error
backpropagation learning algorithm is a form of supervised learning used to train mainly feedforward neural
networks to perform some task. In outline, the algorithm is as follows (a minimal sketch in code appears after the entry):
1. Initialization: the weights of the network are initialized to small random values.
2. Forward pass: the inputs of each training pattern are presented to the network. The outputs are computed using the inputs and the current weights of the network. Certain statistics are kept from this computation, and used in the next phase. The target outputs from each training pattern are compared with the actual activation levels of the output units - the difference between the two is termed the error. Training may be pattern-by-pattern or epoch-by-epoch. With pattern-by-pattern training, the pattern error is provided directly to the backward pass. With epoch-by-epoch training, the pattern errors are summed across all training patterns, and the total error is provided to the backward pass.
3. Backward pass: in this phase, the weights of the net are updated. See the main article on the backward pass for some more detail.
4. Go back to step 2. Continue doing forward and backward passes until the stopping criterion is satisfied.
See also forward pass, backward pass, delta rule, error surface, local minimum, gradient descent and momentum.
Error
backpropagation learning is often familiarly referred to just as backprop.
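The outline above can be turned into a minimal sketch in Python with NumPy. This is not the only (or best) way to implement backprop - it hard-codes one hidden layer, logistic activations and pattern-by-pattern training, and the network size, learning rate and epoch count are arbitrary choices; on an unlucky random seed it may settle in a local minimum:

import numpy as np

rng = np.random.default_rng(0)
logistic = lambda x: 1.0 / (1.0 + np.exp(-x))

# Training patterns for the XOR problem: inputs and target outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1, initialization: small random weights (biases start at zero).
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
eta = 0.5

for epoch in range(20000):
    for x, t in zip(X, T):                       # pattern-by-pattern training
        # Step 2, forward pass: outputs from inputs and current weights.
        h = logistic(x @ W1 + b1)
        y = logistic(h @ W2 + b2)
        # Step 3, backward pass: delta rule, output layer first.
        delta_out = (t - y) * y * (1 - y)
        delta_hid = (delta_out @ W2.T) * h * (1 - h)
        W2 += eta * np.outer(h, delta_out); b2 += eta * delta_out
        W1 += eta * np.outer(x, delta_hid); b1 += eta * delta_hid

# Step 4 ran until a fixed epoch count (a simple stopping criterion).
print(np.round(logistic(logistic(X @ W1 + b1) @ W2 + b2), 2))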
When total error of
a backpropagation-trained neural network is
expressed as a function of the weights, and graphed (to the
extent that this is possible with a large number of weights), the result is a
surface termed the error surface. The course of learning can be
traced on the error surface: as learning is supposed to reduce error, when the
learning algorithm causes the weights to change, the current point on the error
surface should descend into a valley of the error surface.
The
"point" defined by the current set of weights is termed a point in weight
space. Thus weight space is the set of all possible values of the weights.
See
also local minimum and gradient descent.
see weight.
In pruning a decision tree, one needs to be able to estimate the expected error at any node
(branch or leaf). This can be done using the Laplace error estimate,
which is given by the formula
E(S) = (N – n + k – 1) / (N + k)
where
S is the set of instances in a node,
k is the number of classes (e.g. 2 if instances are just being classified into 2 classes: say positive and negative),
N is the number of instances in S,
C is the majority class in S, and
n out of the N examples in S belong to C.
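In Python (a direct transcription of the formula):

def laplace_error(N, n, k):
    # N instances in the node, n of them in the majority class, k classes.
    return (N - n + k - 1) / (N + k)

# A node with 99 instances in one class and 1 in the other (k = 2):
print(laplace_error(100, 99, 2))   # 2/102, about 0.02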
feature
see attribute.
A
kind of neural network in which the nodes can be numbered,
in such a way that each node has weighted connections only to nodes with higher numbers. Such nets can be trained
using the error backpropagation
learning algorithm.
In
practice, the nodes of most feedforward nets are partitioned into layers -
that is, sets of nodes, and the layers may be numbered in such a way that the
nodes in each layer are connected only to nodes in the next layer - that is,
the layer with the next higher number. Commonly successive layers are totally
interconnected - each node in the earlier layer is connected to every
node in the next layer.
The
first layer has no input connections, so consists of input units and is termed the input layer (yellow nodes
in the diagram below).
The
last layer has no output connections, so consists of output units and is termed the output layer (maroon
nodes in the diagram below).
The
layers in between the input and output layers are termed hidden layers, and consist of hidden units (light
blue nodes and brown nodes in the diagram below).
When
the net is operating, the activations of
non-input neurons are computed using each neuron's activation function.
[Figure (not shown): a feedforward network. All connections (arrows) are in one direction; there are no cycles of activation flow (cyclic subgraphs). Each colour identifies a different layer in the network. Layers 1 and 2 are fully interconnected, as are layers 3 and 4; layers 2 and 3 are only partly interconnected.]
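The forward flow of activation through a layered net can be sketched in a few lines of Python with NumPy. For simplicity this sketch assumes totally interconnected successive layers and the logistic function as the activation function:

import numpy as np

def forward(x, layers):
    # layers: one (weights, biases) pair per non-input layer.
    # Each non-input unit's activation is the logistic function
    # applied to its total net input.
    for W, b in layers:
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    return x

rng = np.random.default_rng(1)
layers = [(rng.normal(size=(3, 5)), np.zeros(5)),   # input -> hidden
          (rng.normal(size=(5, 2)), np.zeros(2))]   # hidden -> output
print(forward(np.array([0.2, -0.7, 1.0]), layers))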
1. In a biological neural network: neurons in a biological neural network fire
when and if they receive enough stimulus via their (input) synapses. This means that an electrical impulse is propagated along the
neuron's axon and transmitted to other neurons via the
output synaptic connections of the neuron. The firing rate of
a neuron is the frequency with which it fires (cf. activation in an artificial
neural network).
2.
In an expert system: when a rule in the expert system is used, it is said
to fire.
Function approximation algorithms include connectionist
and statistical techniques of machine learning. The idea is that machine
learning means learning, from a number of examples or instances or training patterns, to compute a function which has as its
arguments variables corresponding to the input part of the training pattern(s),
and has as its output variables corresponding to the output part of the
training patterns, which maps the input part of each training pattern to its
output part. The hope is that the function will interpolate / generalize from
the training patterns, so that it will produce reasonable outputs when given
other inputs.
See
also symbolic learning
algorithms.
In the forward pass in
backpropagation, each training pattern is presented to the input units of the
network. The hidden unit activations are computed from the inputs and
input-to-hidden unit weights, and then (in the case of a 3-layer network, with
only a single layer of hidden units) the outputs are computed using the hidden
layer activations and the current hidden-to-output weights. Certain statistics are
kept from this computation, and used in the backward pass. The target
outputs from each training pattern are compared with the actual activation levels of the output units - the difference between the two is termed the error.
Training may be pattern-by-pattern or epoch-by-epoch. With pattern-by-pattern
training, the pattern error is provided directly to the backward pass. With
epoch-by-epoch training, the pattern errors are summed across all training
patterns, and the total error is provided to the backward pass.
Learning
in backprop seems to operate by first of all getting a rough set of weights which fit the training patterns in a general sort of way, and then working
progressively towards a set of weights that fit the training patterns exactly.
If learning goes too far down this path, one may reach a set of weights that
fits the idiosyncrasies of the particular set of patterns very well, but does
not interpolate (i.e. generalize) well.
Moreover,
with large complex sets of training patterns, it is likely that some errors may
occur, either in the inputs or in the outputs. In that case, and again particularly
in the later parts of the learning process, it is likely that backprop will be
contorting the weights so as to fit precisely around training patterns that are
actually erroneous! This phenomenon is known as over-fitting.
This
problem can to some extent be avoided by stopping learning early. How does one
tell when to stop? One method is to partition the training patterns into two
sets (assuming that there are enough of them). The larger part of the training
patterns, say 80% of them, chosen at random, form the training set, and the
remaining 20% are referred to as the test set. Every now and again
during training, one measures the performance of the current set of weights on
the test set. One normally finds that the error on the training set drops monotonically
(that's what a gradient descent algorithm is supposed to do, after all).
However, error on the test set (which will be larger, per pattern, than the
error on the training set) will fall at first, then start to rise as the
algorithm begins to overtrain. Best generalization performance is gained by
stopping the algorithm at the point where error on the test set starts to rise.
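In code, the method might look like the following Python sketch. The helpers train_one_epoch and error_on, the net object with a weights attribute, and the 5% tolerance for "starts to rise" are all assumptions made for illustration:

def train_with_early_stopping(net, training_set, test_set,
                              train_one_epoch, error_on, max_epochs=1000):
    best_error, best_weights = float("inf"), net.weights.copy()
    for epoch in range(max_epochs):
        train_one_epoch(net, training_set)
        e = error_on(net, test_set)       # measure error on the held-out 20%
        if e < best_error:
            best_error, best_weights = e, net.weights.copy()
        elif e > 1.05 * best_error:       # test error clearly rising: stop
            break
    net.weights = best_weights            # keep the best-generalizing weights
    return net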
An
improvement on the delta rule used in error backpropagation learning. If the learning rate (often
denoted by η) is small, the backprop algorithm proceeds slowly, but accurately
follows the path of steepest descent on the error surface. If η is too large, the algorithm may "bounce off the canyon
walls of the error surface" - i.e. not work well. This can be largely
avoided by modifying the delta rule to include a momentum term:
Δwji(n) = α Δwji(n–1) + η δj(n) yi(n)
in
the notation of Haykin's text (Neural networks - a comprehensive foundation).
The constant α is termed the momentum constant and can be adjusted to achieve
the best effect. The second summand corresponds to the standard delta rule,
while the first summand says "add α × the previous change to this
weight."
This
new rule is called the generalized delta rule. The effect is that if the basic
delta rule would be consistently pushing a weight in the same direction, then
it gradually gathers "momentum" in that direction.
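In Python (a direct transcription of the rule above):

def generalized_delta(prev_change, eta, delta_j, y_i, alpha=0.9):
    # Delta w_ji(n) = alpha * Delta w_ji(n-1) + eta * delta_j(n) * y_i(n)
    return alpha * prev_change + eta * delta_j * y_i

# If the basic delta rule keeps pushing the same way, the change grows:
change = 0.0
for n in range(5):
    change = generalized_delta(change, eta=0.1, delta_j=1.0, y_i=1.0)
    print(round(change, 4))   # 0.1, 0.19, 0.271, 0.3439, 0.4095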
Understanding this term
depends to some extent on the error surface metaphor.
When
an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the
current point on the error surface will descend into a valley of the error
surface, in a direction that corresponds to the steepest (downhill) gradient or
slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to
perform gradient descent in weight space.
See
also local minimum.
Neurons or units in a feedforward net are
usually structured into two or more layers. The input units constitute the input layer. The output units constitute the output layer. Layers in between
the input and output layers (that is, layers that consist of hidden units) are termed hidden layers.
In
layered nets, each neuron in a given layer is connected by trainable weights to each neuron in the next layer.
A hidden unit in a neural network is
a neuron which is neither an input unit nor an output unit.
Term used in analysing machine learning methods. The hypothesis language refers to the notation used
by the learning method to represent what it has learned so far. For example, in ID3, the hypothesis language would be the notation used to represent
the decision tree (including partial descriptions of incomplete decision
trees). In backprop, the hypothesis language would be the notation
used to represent the current set of weights. In Aq, the hypothesis language would be the notation used to represent
the class descriptions (e.g.
class1 ← size=large and colour in {red,
orange}).
See
also observation language.
A decision
tree induction algorithm,
developed by Quinlan. ID3 stands for "Iterative Dichotomizer (version)
3". Later versions include C4.5 and C5.
see weight.
An input unit in a neural network is
a neuron with no input connections of its own. Its activation thus comes from outside the neural net. The input unit is
said to have its value clamped to the external value.
This term has two, only
distantly related, uses:
1. An instance is a term used in machine learning particularly with symbolic learning algorithms, to describe a single training or test item, usually in the form
of a description of the item in terms of its attributes, along with its intended classification.
With supervised connectionist learning
algorithms, it is more usual to speak of (training or test) patterns.
2.
In general AI parlance,
an instance frame is a frame representing a particular individual, as opposed to a generic frame.
this is described in the
article on expected error estimates.
see article on feedforward
networks.
Normal programs P produce
the same output y each time they receive a particular input x.
Learning programs are capable of improving their performance so that they may
produce different (better) results on second or later times that they receive
the same input x.
They achieve this by being able to alter their internal
state, q. In effect, they are computing a function of two
arguments, P(x | q) = y. When
the program is in learning mode, the program computes a new state q'
as well as the output y, as it executes.
In the case of supervised learning, in order to construct q', one needs a set of inputs xi and
corresponding target outputs zi (i.e. you want P(xi | q)
= zi when learning is complete). The new state
function L is computed as:
L(P, q, ((x1,z1),
..., (xn, zn))) = q'
See
also unsupervised learning, observation language, and hypothesis
language.
a constant used in error
backpropagation learning and
other artificial neural network learning algorithms to affect the
speed of learning. The mathematics of e.g. backprop are based on small changes
being made to the weights at each step: if the changes made to
weights are too large, the algorithm may "bounce around" the error surface in a counter-productive fashion. In this case, it is
necessary to reduce the learning rate. On the other hand, the smaller the
learning rate, the more steps it takes to get to the stopping criterion. See also momentum.
A linear threshold unit
is a simple artificial neuron whose output is its thresholded total net input. That is, an LTU with threshold T calculates the
weighted sum of its inputs, and then outputs 0 if this sum is less than T,
and 1 if the sum is greater than T. LTUs form the basis of perceptrons.
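In Python:

def ltu(inputs, weights, T):
    # Output 0 if the weighted sum is below the threshold T, otherwise 1.
    net = sum(x * w for x, w in zip(inputs, weights))
    return 0 if net < T else 1

print(ltu([1, 0, 1], [0.4, 0.9, 0.3], T=0.5))   # net = 0.7, so output 1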
Understanding this term
depends to some extent on the error surface metaphor.
When
an artificial neural network learning algorithm causes the total error of the net to descend into a valley of the error surface,
that valley may or may not lead to the lowest point on the entire error
surface. If it does not, the minimum into which the total error will eventually
fall is termed a local minimum. The learning algorithm is sometimes referred to
in this case as "trapped in a local minimum."
In
such cases, it usually helps to restart the algorithm with a new, randomly
chosen initial set of weights - i.e. at a new random point in weight
space. As this means a new starting point on the error surface, it is likely to
lead into a different valley, and hopefully this one will lead to the true
(absolute) minimum error, or at least a better minimum error.
The function
φ(x) = 1/(1 + exp(–x))
which, when graphed,
looks rather like a smoothed version of the step function
step(x) = 0 if x < 0, = 1 if x ≥
0
It is used to transform the total net input of an artificial neuron in some implementations of backprop-trained networks.
A
related function, also sometimes used in backprop-trained networks, is
2φ(x)–1, which can also be expressed as tanh(x/2). tanh(x/2)
is, of course, a smoothed version of the step function which jumps from –1 to 1
at x = 0, i.e. the function which = –1 if x <
0, and = 1 if x ≥ 0.
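Both functions are easy to check numerically in Python:

import math

def phi(x):
    # The logistic function.
    return 1.0 / (1.0 + math.exp(-x))

# 2*phi(x) - 1 agrees with tanh(x/2) everywhere:
for x in (-2.0, 0.0, 1.5):
    print(round(2 * phi(x) - 1, 6), round(math.tanh(x / 2), 6))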
Machine learning is said
to occur in a program that can modify some aspect of itself, often referred to
as its state, so that on a subsequent execution with the same
input, a different (hopefully better) output is produced. See unsupervised
learning and supervised learning, and also function approximation algorithms and symbolic learning algorithms.
See article on generalized delta rule.
See also feedforward
network. Such a neural network differs from earlier perceptron-based
models in two respects:
· most importantly, in their activation function, which consists of transforming the total net input by a sigmoid function, rather than simply thresholding it;
· in using a multiple-layer model, with hidden units, rather than just a single input layer and a single output layer.
An artificial neural
network is a collection of simple artificial neurons connected by directed weighted connections. When the system is set running, the activation levels of the input units are clamped to desired values. After this the
activation is propagated, at each time step, along the directed weighted
connections to other units. The activations of non-input neurons are computed
using each neuron's activation function. The system might either settle into a stable
state after a number of time steps, or in the case of a feedforward network, the activation might flow through to output units.
Learning
might or might not occur, depending on the type of neural network and the mode
of operation of the network.
see neuron.
A simple model of a biological
neuron used in neural
networks to perform a small part of some overall computational problem. It has
inputs from other neurons, with each of which is associated a weight - that is, a number which indicates the degree of importance
which this neuron attaches to that input. It also has an activation function, and a bias. The bias acts like a
threshold in a perceptron.
Also
referred to as a neurode, node, or unit.
see neuron, for a node in a neural network.
See graph for a node in a graph.
The term
"noise" in this context refers to errors in the training data for
machine learning algorithms. If a problem is difficult enough and complicated
enough to be worth doing with machine learning techniques, then any reasonable
training set is going to be large enough that there are likely to be errors in
it. This will of course cause problems for the learning algorithm.
See also decision
tree pruning and generalization in backprop.
Term used in analysing machine learning methods. The observation language refers to the notation
used by the learning method to represent the data it uses for training. For
example, in ID3, the observation language would be the notation
used to represent the training instances, including attributes and their allowable values, and the way instances are
described using attributes. In backprop, the observation
language would be the notation used to represent the training patterns. In Aq, the observation
language would again be the notation used to represent the instances, much as
in ID3.
See
also hypothesis language.
An output unit in a neural network is a neuron with no output connections of its own. Its activation thus serves as one of the output values of the neural net.
see
article on generalization in backprop
A perceptron is a simple
artificial neuron whose activation function consists of taking the total net input and outputting 1 if this is above a threshold T, and 0 otherwise.
Perceptrons
were originally used as pattern classifiers, where the term pattern is here
used not in the sense of training pattern, but just in the sense of an input pattern that is to be put into
one of several classes. Perceptual pattern classifiers of this sort (not based
on perceptrons!) occur in simple animal visual systems, which can distinguish
between prey, predators, and neutral environmental objects.
See
also perceptron learning, and the XOR problem.
The perceptron learning algorithm (sketched in code below):
1. All weights are initially set to zero.
2. For each training example:
o if the perceptron outputs 0 when it should output 1, then add the input vector to the weight vector.
o if the perceptron outputs 1 when it should output 0, then subtract the input vector from the weight vector.
3. Repeat step 2 until the perceptron yields the correct result for each training example.
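Here is the algorithm as a Python sketch. It assumes the convention that the perceptron outputs 1 when its weighted sum exceeds 0 (a threshold can be folded in as a weight from a constant input), and it caps the number of passes, since the loop only terminates on its own when the classes are linearly separable:

def perceptron_output(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def train_perceptron(examples, n_inputs, max_passes=100):
    weights = [0.0] * n_inputs                        # step 1
    for _ in range(max_passes):                       # step 3
        converged = True
        for x, target in examples:                    # step 2
            out = perceptron_output(weights, x)
            if out == 0 and target == 1:              # add the input vector
                weights = [w + xi for w, xi in zip(weights, x)]
                converged = False
            elif out == 1 and target == 0:            # subtract the input vector
                weights = [w - xi for w, xi in zip(weights, x)]
                converged = False
        if converged:
            return weights
    return weights

# Learning logical AND; the constant 1 input plays the role of a threshold.
examples = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
print(train_perceptron(examples, 3))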
A propositional learning
system represents what it learns about the training instances by expressions
equivalent to sentences in some form of logic (e.g.
class1 ← size=large and colour in {red,
orange}). See Aq and covering algorithm.
The data used to
generate a decision tree, using an algorithm like the ID3 tree induction algorithm, can include noise - that is, instances that contain
errors, either in the form of a wrong classification, or in the form of a wrong attribute value. There could also be instances whose classification is
problematical because the attributes available are not sufficient to
discriminate between some cases.
If
for example, a node of the tree contains, say, 99 items in class C1 and 1 in
class C2, it is plausible that the 1 item in class C2 is there because of an
error either of classification or of feature value. There can thus be an
argument for regarding this node as a leaf node of class C1. This is termed pruning the
decision tree.
The
algorithm given in lectures for deciding when to prune is as follows:
At a branch node that is a candidate for pruning:
1. Approximate the expected error for the node.
2. Approximate the backed-up error from the children, assuming that we do not prune.
3. If the expected error is less than the backed-up error, then prune (as sketched below).
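The decision itself is a one-line comparison; in Python (repeating the frequency-weighted sum from the backed-up error estimate entry):

def should_prune(expected_error, child_errors, child_counts):
    # Prune if the node's own expected error is less than the
    # backed-up error estimated from its children.
    total = sum(child_counts)
    backed_up = sum(e * n / total
                    for e, n in zip(child_errors, child_counts))
    return expected_error < backed_up

# e.g. node error 0.12, children with errors 0.10 and 0.30 (8 and 2 instances):
print(should_prune(0.12, [0.10, 0.30], [8, 2]))   # backed-up = 0.14 -> True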
A
recurrent network is a neural network in which there is at least one cycle of activation flow. To put it another way, underlying any neural network there is a directed graph, obtained by regarding the neurons as nodes in the
graph and the weights as directed edges in the graph. If this
graph is not acyclic, the network is recurrent.
A recurrent
connection is one that is part of a directed cycle, although the term is
sometimes reserved for a connection which is clearly going in the
"wrong" direction in an otherwise feedforward network.
Recurrent
networks include fully recurrent networks in which each neuron
is connected to every other neuron, and partly
recurrent networks in which greater or lesser numbers of recurrent
connections exist. See also simple recurrent network.
This
article is included for general interest - recurrent networks are not part of
the syllabus of COMP9414 Artificial Intelligence.
Feedforward networks are fine for classifying objects, but
their units (as distinct from their weights) have no memory of previous inputs. Consequently they are unable
to cope with sequence prediction tasks - tasks like predicting, given a
sequence of sunspot activity counts, what the sunspot activity for the next
time period will be, and financial prediction tasks (e.g. given share prices
for the last n days, and presumably other economic data, etc.,
predict tomorrow's share price).
Simple recurrent nets can tackle tasks like this, because they do have
a kind of memory for recording information derived from activation values from previous time steps.
This
article is included for general interest - sequence prediction tasks are not
part of the syllabus of COMP9414 Artificial Intelligence.
Another name for the logistic function and certain related functions (such as tanh(x)).
Sigmoidal functions are a type of squashing function. They are so called because sigma is the Greek
letter "s", and the logistic function looks somewhat like a sloping
letter "s" when graphed.
A simple recurrent
network is like a feedforward network with an input layer, an output layer, and
a single hidden layer, except that there is a further group of units called state units or context units.
There is one state unit for each hidden unit. The activation function of the state unit is as follows: the
activation of a state unit in time step n is the same as that
of the corresponding hidden unit in time step n–1. That is, the
state unit activations are copies of the hidden unit activations from the
previous time step. Each state unit is also connected to each hidden unit by a trainable weight - the direction of this connection is from
the state unit to the hidden unit.
Simple
recurrent networks can learn sequence prediction tasks.
See
also recurrent networks.
This
article is included for general interest - simple recurrent networks are not
part of the syllabus of COMP9414 Artificial Intelligence.
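For the curious, one forward step of such a network might be sketched in Python with NumPy as follows (the weight shapes and the logistic activation are illustrative assumptions; training the weights is a separate matter):

import numpy as np

def srn_step(x, state, W_in, W_state, W_out):
    # The hidden activation depends on the current input AND on the state
    # units, which hold a copy of the previous step's hidden activations.
    hidden = 1.0 / (1.0 + np.exp(-(x @ W_in + state @ W_state)))
    return hidden @ W_out, hidden       # 'hidden' becomes the next state

rng = np.random.default_rng(2)
W_in, W_state, W_out = (rng.normal(size=(1, 3)),
                        rng.normal(size=(3, 3)),
                        rng.normal(size=(3, 1)))
state = np.zeros(3)
for x in [0.1, 0.5, 0.9]:               # a short input sequence
    y, state = srn_step(np.array([x]), state, W_in, W_state, W_out)
    print(y)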
The point of the ID3 algorithm is to
decide the best attribute, out of those not already used, on which to
split the training instances that are classified to a particular branch
node.
The algorithm, in outline, is as follows:
1. if all the instances belong to a single class, there is nothing to do (except create a leaf node labelled with the name of that class).
2. otherwise, for each attribute that has not already been used, calculate the information gain that would be obtained by using that attribute on the particular set of instances classified to this branch node.
3. use the attribute with the greatest information gain.
This
leaves the question of how to calculate the information gain associated with
using a particular attribute A. Suppose that there are k classes C1, C2, ..., Ck, and that of the N instances classified to this
node, I1 belong to class C1, I2 belong to class C2, ..., and Ik belong to class Ck.
Let p1 = I1/N, p2 = I2/N, ..., and pk = Ik/N.
The initial entropy E at this node is:
–p1log2(p1) –p2log2(p2) ... –pklog2(pk).
Now split the instances on each value of the
chosen attribute A. Suppose that there are r attribute
values for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are Jj,1 instances in class C1, Jj,2 instances in class C2, ..., and Jj,k instances in class Ck, for a total of Jj instances having attribute value aj.
Let qj,1 = Jj,1/Jj, qj,2 = Jj,2/Jj, ..., and qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value aj at this position is:
–qj,1log2(qj,1) –qj,2log2(qj,2)
... –qj,klog2(qj,k).
Now compute:
E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er).
This is the information gain for
attribute A.
Note that Jj/N is the estimated probability that an instance classified to this node will have value aj for attribute A. Thus we are weighting the entropy estimates Ej by the estimated probability that an instance has the associated attribute value.
In
terms of the example used in the
lecture notes (see also the calculations in the lecture notes), k = 2 since there are just
two classes, positive and negative. I1 = 4 and I2 = 3, and N =7,
and so p1 = 4/7 and p2 = 3/7, and E = –p1log2(p1) –p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7). In the example,
the first attribute A considered is size, and the
first value of size considered, large, corresponds
to a1, in the example in the lecture notes, so J1,1 = 2 = J1,2,
and J1 = 4. Thus q1,1 = J1,1/J1 = 2/4 = ½, and q1,2 = J1,2/J1 = 2/4 = ½, and E1 = –q1,1log2(q1,1) –q1,2log2(q1,2)
= -½×log2(½) – ½×log2(½) = 1.
Similarly E2 = 1 and J2 = 2 (size = small), and E3 = 0 and J3 = 1 (size = medium), so the final information gain,
E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er)
= E – ((4/7)×E1 + (2/7)×E2 + (1/7)×E3)
which turned out to be about 0.13 in the
example.
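The whole calculation fits in a few lines of Python, here checked against the lecture example (the counts [[2, 2], [1, 1], [1, 0]] encode, per value of size, how many instances fall in each class):

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(class_counts, splits):
    # class_counts: instances per class at this node, e.g. [4, 3].
    # splits: per attribute value, the instances per class.
    N = sum(class_counts)
    remainder = sum(sum(s) / N * entropy(s) for s in splits)
    return entropy(class_counts) - remainder

# size = large -> [2, 2], small -> [1, 1], medium -> [1, 0]:
print(round(information_gain([4, 3], [[2, 2], [1, 1], [1, 0]]), 2))  # 0.13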
One of several
functions, including sigmoid and step functions (see threshold), used to transform the total net input to
a neuron to obtain the final output of the neuron.
Possible stopping
criteria in error backpropagation learning include:
· total error of the network falling below some predetermined level;
· a certain number of epochs having been completed.
Combinations of the two (e.g. whichever of the
two occurs first) and other stopping conditions are possible. See the reference
by Haykin (Neural networks: a comprehensive foundation p. 153) for
more details.
Supervised learning is a
kind of machine learning where the learning algorithm is provided with a set of
inputs for the algorithm along with the corresponding correct outputs, and
learning involves the algorithm comparing its current actual output with the
correct or target outputs, so that it knows what its error is, and can modify
things accordingly.
Contrast unsupervised learning.
Symbolic learning
algorithms learn concepts by constructing a symbolic expression (such as a
decision tree, as in ID3, or a set of class descriptions, as in Aq) that describes a class (or classes) of objects. Many such
systems work with representations equivalent to first order logic.
Such
learning algorithms have the advantage that their internal representations and
their final output can be inspected and relatively easily understood by humans.
See
also function approximation
algorithms.
A synapse, in a biological neuron, is a connection between the axon of one neuron and the dendrite of another. It
corresponds to a weight in an artificial neuron. Synapses have varying strengths and can be changed by learning
(like weights). The operation mechanism of synapses is biochemical in nature,
with transmitter substances crossing the tiny "synaptic
gap" between axon and dendrite when the axon fires.
synchronous vs asynchronous
See asynchronous
The target output of an
artificial neuron is the output provided with the training pattern. The difference between this and the actual
output of the neuron is the pattern error. This is used in the backward pass of error backpropagation learning.
Machine learning algorithms are trained on a collection
of instances or patterns. Once training is
complete, it is usual to test the trained system on a collection of test
instances or test patterns which were not used when training the system, in
order to find out to what degree the system is able to generalise beyond its
training data.
One of the parameters of
a perceptron. The output of the perceptron is 0 if the total net input of the perceptron is less than its threshold T,
otherwise it is 1.
Compare bias.
The total net input to an artificial neuron refers to the weighted sum of the neuron's inputs (that is,
the sum of the products of each input with the weight associated with that input). This quantity is then usually
transformed in some way (see squashing function) to obtain the neuron's output.
See
also activation function.
A measure of how well a backprop-trained neural network is doing at a particular
point during its learning.
Given
a training pattern, its squared error is obtained by squaring the
difference between the target output of an output neuron and the actual output.
The sum-squared error, or pattern sum-squared error (PSS), is
obtained by adding up the squared errors for each output neuron. The total
sum-squared error is obtained by adding up the PSS for each training pattern.
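In Python:

def pattern_sum_squared_error(targets, actuals):
    # PSS: sum, over the output neurons, of the squared errors.
    return sum((t - a) ** 2 for t, a in zip(targets, actuals))

def total_sum_squared_error(patterns):
    # patterns: one (targets, actuals) pair per training pattern.
    return sum(pattern_sum_squared_error(t, a) for t, a in patterns)

print(total_sum_squared_error([([1, 0], [0.9, 0.2]),
                               ([0, 1], [0.1, 0.7])]))   # 0.15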
In artificial neural networks that use weighted connections, some of those connections may have fixed values - i.e. they are
specified as not being subject to change when the network is trained, e.g.
using the error backpropagation
learning algorithm.
The
other weights, the ones that are subject to change when the
net is learning, are referred to as trainable weights.
This term is used to
refer to a set of input values and the corresponding set of desired or target
output values for use in training an artificial neural network. Usually, a largish number of training patterns
would be used in the training of any genuinely useful neural network.
In
toy problems like the XOR
problem, only a few training
patterns (in the case of XOR, just 4) may be used.
Patterns
used for testing the trained network are referred to as test patterns.
Compare instance.
This article describes
the basic tree induction algorithm used by ID3 and successors.
The basic idea is to pick an attribute A with
values a1, a2, ..., ar, split the training instances into subsets Sa1, Sa2, ..., Sar consisting of those instances that have the corresponding
attribute value. Then if a subset has only instances in a single class, that
part of the tree stops with a leaf node labelled with the single class. If not,
then the subset is split again, recursively, using a different attribute.
This
leaves the question of how to choose the best attribute to
split on at any branch node. This issue is handled in the article on splitting criterion in ID3.
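The recursion can be sketched in Python. The choice of splitting attribute is passed in as a function (in ID3 it would be the information gain criterion); the majority-class fallback for when attributes run out is a common addition, not spelled out above:

from collections import Counter

def induce_tree(instances, attributes, choose_attribute):
    # instances: list of (attribute-dict, class-label) pairs.
    classes = [c for _, c in instances]
    if len(set(classes)) == 1:              # single class: make a leaf
        return classes[0]
    if not attributes:                      # no attributes left: majority leaf
        return Counter(classes).most_common(1)[0][0]
    A = choose_attribute(instances, attributes)
    branches = {}
    for v in {inst[A] for inst, _ in instances}:   # split on each value of A
        subset = [(i, c) for i, c in instances if i[A] == v]
        branches[v] = induce_tree(subset,
                                  [a for a in attributes if a != A],
                                  choose_attribute)
    return (A, branches)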
see neuron.
Unsupervised learning
signifies a mode of machine learning where the system is not told the
"right answer" - for example, it is not trained on pairs consisting
of an input and the desired output. Instead the system is given the input
patterns and is left to find interesting patterns, regularities, or clusterings
among them. To be contrasted to supervised learning, as
in error backpropagation and ID3, where the system is told the
desired or target outputs to associate with each input pattern used for
training.
A weight, in an artificial
neural network, is a parameter
associated with a connection from one neuron, M, to
another neuron N. It corresponds to a synapse in a biological neuron, and it determines how much notice the neuron N pays
to the activation it receives from neuron M. If the weight is
positive, the connection is called excitatory, while if the weight
is negative, the connection is called inhibitory. See also total net input
to a neuron.
See
the article on error surfaces in weight space.
Windowing, in ID3, and similar
algorithms, is a way of dealing with really large sets of training instances.
Without
windowing, such an algorithm can be really slow, as it needs to do its
information gain calculations (see tree induction algorithms) over huge amounts of data.
With
windowing, training is done on a relatively small sample of the data, and then
checked against the full set of training data.
Here is the windowing algorithm (a sketch in code follows the steps):
1. Select a sample S of the training instances at random - say 10% of them. The actual proportion chosen would need to be small enough that ID3 could run fairly fast on them, but large enough to be representative of the whole set of examples.
2. Run the ID3 algorithm on S to obtain a decision tree.
3. Check the decision tree on the full data set, to obtain a set E of training instances that are misclassified by the tree obtained.
4. If E is empty, stop.
5. Let the new set of training instances S' be the union of S and E.
6. Go to step 2 and run the ID3 calculations with S'.
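As a Python sketch (build_tree and classify stand in for the ID3 induction and tree-walking routines described elsewhere in this glossary):

import random

def windowing(instances, build_tree, classify, sample_fraction=0.1):
    # Step 1: a random sample S of the training instances.
    S = random.sample(instances, max(1, int(sample_fraction * len(instances))))
    while True:
        tree = build_tree(S)                        # step 2: run ID3 on S
        E = [(x, c) for x, c in instances           # step 3: find the
             if classify(tree, x) != c]             #   misclassified instances
        if not E:                                   # step 4: none left - stop
            return tree
        S = S + [i for i in E if i not in S]        # step 5: S' = S union E
                                                    # step 6: loop back to step 2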
p XOR q is a binary
logical operator which can be defined either as not (p equiv q),
or as (not p and q) or (p and
not q). It rose to fame in the late 60's when Minsky and Papert
demonstrated that perceptron-based systems cannot learn to compute the XOR function. Subsequently,
when backpropagation-trained multi-layer perceptrons were introduced, one of the first tests performed was to
demonstrate that they could in fact learn to compute XOR. No one in their right
mind would probably ever have cause to compute XOR in this way in practice, but
XOR and generalizations of XOR remain popular as tests of backpropagation-type
systems.