This is not yet another GP system (nyagp). For one, it is not general.
It does one thing, find mathematical functions, and tries to do that well.

So, if you're trying to steer ants on various New Mexico trails, or build your
own tiny block world, you're in the wrong place. However, if you're interested
in finding mathematical functions, either by applying them directly to data or
by running them through a simulator, you might find what you're looking for here.

=== Representation (sym/ + gen/) ===

Mathsym has a few interesting characteristics. First and foremost is the
basic representation. It uses trees, but these trees are stored in a
reference-counted hashtable. This means that every distinct subtree that is alive
is stored once and only once.
The reference counting mechanism takes care of memory management.

The idea of using a hashtable (for offline analysis) comes from Walter Tackett's
1994 dissertation. The current system is just a real-time implementation of this
idea, adding reference counting for ease of use.

The hashtable brings overhead. It's still pretty fast, but a string-based representation
would run circles around it. However, because it stores every subtree only once, it
is fairly tight on memory. This helps tremendously when confronted with excessively growing
populations (bloat). The hashtable implementation cannot stop bloat, but does make it more
manageable. In a typical GP run, the number of distinct subtrees is only 10-20% of the total
number of subtrees.

Another advantage of the hashtable is the ability to examine the run more thoroughly. It is easy
to check how many subtrees are present in the system, and for each subtree you can check the reference
count.

The basic tree is called a Sym. A Sym is simply a tree, and has children, accessible through args().
A Sym contains nothing more than an iterator (== decorated pointer) to an entry in the hashtable.
Every time you create a Sym, it is either looked up in the hashtable or added to the hashtable.
A Sym has several members: size, depth, args, etc. One interesting member is refcount().
This returns the reference count of the Sym in the hashtable, and thus the number
of distinct contexts in which the Sym is used.

Another nice thing about these hashtable Syms is that a check for equality reduces to a pointer comparison.
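
The lookup-or-insert step described above can be sketched in a few lines. This is only an
illustration and not the code in sym/: Node, NodePtr and NodeTable are made-up names, a
std::map stands in for the hashtable, and std::shared_ptr/std::weak_ptr stand in for the
hand-written reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Sym classes.
struct Node;
using NodePtr = std::shared_ptr<const Node>;

struct Node {
    unsigned token;             // identifies the function or terminal
    std::vector<NodePtr> args;  // children, which are unique nodes themselves
};

class NodeTable {
    // A node is identified by its token plus the addresses of its children.
    using Key = std::pair<unsigned, std::vector<std::uintptr_t>>;
    std::map<Key, std::weak_ptr<const Node>> table_;

public:
    // Return the unique node for (token, args), creating it only if needed.
    NodePtr make(unsigned token, std::vector<NodePtr> args = {}) {
        Key key;
        key.first = token;
        for (const NodePtr& a : args) {
            assert(a != nullptr);
            key.second.push_back(reinterpret_cast<std::uintptr_t>(a.get()));
        }
        auto it = table_.find(key);
        if (it != table_.end()) {
            if (NodePtr alive = it->second.lock()) return alive;  // reuse
        }
        NodePtr fresh = std::make_shared<const Node>(Node{token, std::move(args)});
        table_[key] = fresh;  // the table holds no ownership of its own
        return fresh;
    }

    // Number of distinct subtrees that are currently alive.
    std::size_t distinct() const {
        std::size_t n = 0;
        for (const auto& entry : table_)
            if (!entry.second.expired()) ++n;
        return n;
    }
};
```

Building the same subtree twice through make() hands back the same pointer, so comparing
two trees is a pointer comparison, and a node disappears once the last handle to it is gone.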

The Sym nodes are identified by a simple token, of type token_t (usually an unsigned int). It
is completely generic and could conceivably be adapted to steer ants. The rest of the library
is, however, targeted purely at mathematical functions.

sym/Sym.h is the file to look into for the functionality provided by Sym. The sym/ directory
is where the source files are stored that are relevant for the generic Sym functionality. The
gen/ directory contains some generic functionality to build and traverse trees, independent of
the function and terminal set.

The file sym/README.cpp documents the use of the sym library for general GP use.

=== Function Set (fun/) ===

The standard GP function set of binary functions (addition, multiplication, subtraction and
division) is NOT supported.

What is supported, however, are the following functions:

summation: arbitrary arity, arity zero meaning 0.0. Arity 2 is standard addition
product: arbitrary arity, arity zero meaning 1.0. Arity 2 is standard multiplication
inversion: 1.0 / x. Only arity 1
unary minus: -x. Only arity 1

Plus a whole bunch of other functions (see "fun/FunDef.h").

The reason for this is the observation (actually from a friend of mine, thanks Luuk)
that this set of functions is complete and slightly more orthogonal than a binary set.
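
To see why nothing is lost, here is a small sketch that expresses the four binary operators
in terms of summation, product, inversion and unary minus. The function names are made up
for this example and are not the ones in fun/FunDef.h; plain doubles are used instead of Syms.

```cpp
#include <cassert>
#include <vector>

// Hypothetical evaluators for illustration; not the definitions in fun/.
double sum(const std::vector<double>& args) {
    double total = 0.0;  // arity zero yields 0.0
    for (double a : args) total += a;
    return total;
}

double prod(const std::vector<double>& args) {
    double total = 1.0;  // arity zero yields 1.0
    for (double a : args) total *= a;
    return total;
}

double inv(double x) { return 1.0 / x; }
double neg(double x) { return -x; }

// The four binary operators, rewritten in the n-ary set.
double binary_add(double a, double b) { return sum({a, b}); }
double binary_sub(double a, double b) { return sum({a, neg(b)}); }
double binary_mul(double a, double b) { return prod({a, b}); }
double binary_div(double a, double b) { return prod({a, inv(b)}); }

// Spot-check the rewrites on values that are exact in floating point.
void check_rewrites() {
    assert(binary_sub(5.0, 3.0) == 2.0);
    assert(binary_div(6.0, 2.0) == 3.0);
    assert(sum({}) == 0.0 && prod({}) == 1.0);
}
```

Subtraction and division each become two nodes instead of one, but the n-ary nodes can
absorb further terms: a + b + c is a single summation of arity 3.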

The directory 'fun' contains the functionality for the function and terminal set, together
with ERCs etc. fun/FunDef.cpp contains the definition of the functionality. Stuff can be
added here, but it is best to contact me if you miss particular functions.

With the Sym and the function set in place, some fairly nice overloading is possible. A quick tour:

To create a variable that reads the first value from the inputs, do:

    Sym var = SymVar(0);

To create a constant of value 0.4432, do:

    Sym cnst = SymConst(0.4432);

The constants are also stored uniquely, so that:

    Sym cnst2 = SymConst(0.4432);

will lead to:

    cnst == cnst2

being true (this happens without a value compare, they point to the same element in the hashtable).

To add two values, do:

    Sym sym = var + cnst;

This will create a tree with three nodes. Other operators are overloaded similarly.

=== Evaluation (eval/) ===

The second important thing is evaluation. Although Syms can be evaluated through an interpreter,
this is not the fastest way to go about it. The standard way of evaluating a Sym is to
first *compile* it to a function, and then run it in your favourite environment. Compilation
is done through the use of the excellent tinycc compiler, which is blazingly fast and produces
pretty good functions.

Compilation comes in several flavours. The first is to compile a single function and retrieve a
pointer to a function of signature:

    double func(const double* x);

where x is the input array. Another option is to compile a bunch of functions in one go, and retrieve an array
of such function pointers. The Syms are simply printed and compiled. An example, with x[0] being
the first input:

    double func(const double* x) { return x[0]*x[0] + x[0] * 1./x[0]; }

The batch version proceeds significantly more quickly than calling compile every time. The function pointers
can be given to a simulation for extremely quick evaluation.

A third option is to compile a complete population in one go, and return a single pointer of signature:

    void func(const double* x, double* y);

where 'y' is the (preallocated) output array. This allows evaluating a complete population in one function
call, storing the results in 'y'. It uses the hashtable to store every calculation only once. An example
for the two functions x*x + x*1./x and x + sin(x*x), with x being the first input, is:

    void func(const double* x, double* y) {
        double a0 = x[0];
        double a1 = a0 * a0;
        double a2 = 1.0;
        double a3 = a2 / a0;
        double a4 = a0 * a3;
        double a5 = a1 + a4;
        y[0] = a5;
        double a6 = sin(a1);
        double a7 = a0 + a6;
        y[1] = a7;
    }

This is the fastest way to evaluate even humongous populations quickly. You might be surprised at
the amount of code re-use in a GP population.
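
How such a function body can be printed from shared subtrees is sketched below. This is not
the code in eval/sym_compile.h: Expr, make_expr and Emitter are made-up names, shared
pointers play the role of the hashtable, and only a handful of operators are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node; a shared subtree is simply a shared pointer.
struct Expr {
    std::string op;  // "x", "const", "sin", or a binary infix operator
    double value;    // only used when op == "const"
    std::vector<std::shared_ptr<Expr>> args;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr make_expr(const std::string& op, std::vector<ExprPtr> args = {},
                  double value = 0.0) {
    return std::make_shared<Expr>(Expr{op, value, std::move(args)});
}

class Emitter {
    std::map<const Expr*, std::size_t> seen_;  // node -> index of its variable
    std::ostringstream body_;

    // Emit one assignment per distinct node and return the variable index.
    std::size_t emit(const ExprPtr& e) {
        assert(e != nullptr);
        auto it = seen_.find(e.get());
        if (it != seen_.end()) return it->second;  // already calculated
        std::vector<std::size_t> ids;
        for (const ExprPtr& a : e->args) ids.push_back(emit(a));
        const std::size_t id = seen_.size();
        body_ << "    double a" << id << " = ";
        if (e->op == "x") {
            body_ << "x[0]";
        } else if (e->op == "const") {
            body_ << e->value;
        } else if (e->op == "sin") {
            assert(ids.size() == 1);
            body_ << "sin(a" << ids[0] << ")";
        } else {
            assert(ids.size() == 2);
            body_ << "a" << ids[0] << " " << e->op << " a" << ids[1];
        }
        body_ << ";\n";
        seen_[e.get()] = id;
        return id;
    }

public:
    // Print one C function that evaluates every member of the population.
    std::string compile(const std::vector<ExprPtr>& population) {
        for (std::size_t i = 0; i < population.size(); ++i) {
            const std::size_t id = emit(population[i]);
            body_ << "    y[" << i << "] = a" << id << ";\n";
        }
        return "void func(const double* x, double* y) {\n" + body_.str() + "}\n";
    }
};
```

Feeding it the two example functions, built so that x and x*x are shared, gives eight
assignments for fifteen tree nodes. The resulting string is what would be handed to the compiler.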

The three compilation functions can be found in eval/sym_compile.h.

A limiting factor in tinycc is that the struct TCCState, which is used to hold the compilation context,
is not really self-contained. This unfortunately means that with every call to 'compile' ALL previously
produced pointers become unsafe for use. I'm still looking at ways to circumvent this.

To work with mathsym, a few small changes in tccelf.c were necessary; check README.TCC for details.

=== Interval Arithmetic (eval/) ===

GP is pretty good at finding mathematical expressions that are numerically unsound. Take for instance
the function '1 / x'. This is well defined whenever x is nonzero, but will lead to problems
when x equals 0. The standard answer is to define some pseudo-arithmetical function called 'protected
division' that will return some value (usually 1) when a division by zero occurs. This leads to a whole
family of protected functions (sqrt, log, tan, etc.). Interpreting results from
GP using such functions is in general hard.

Interval arithmetic (through another excellent library, boost/numeric/interval) is used to calculate
whether particular functions can conceivably produce problems. This completely removes the need for Koza-style
protected operators and is a safer and sounder method. For interval arithmetic to function, the bounds
on the input variables need to be known. As for every function we can calculate a guaranteed,
though not necessarily tight, output interval given the input intervals, we can check arbitrary functions
for possible problems. If, for example for division, the input interval contains 0, we know that a division
by zero is theoretically possible. It's then best to throw away the entire function.
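
A bare-bones sketch of such a check follows. It does not use boost/numeric/interval and is not
the IntervalBoundsCheck class: the Interval struct and the functions around it are made up
for this example, and floating-point rounding direction is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hand-rolled closed interval [lo, hi]; a stand-in for a real interval library.
struct Interval {
    double lo, hi;
};

Interval add(Interval a, Interval b) { return {a.lo + b.lo, a.hi + b.hi}; }

Interval mul(Interval a, Interval b) {
    const double p[] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
    return {*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

bool contains_zero(Interval a) { return a.lo <= 0.0 && 0.0 <= a.hi; }

// 1/x over an interval; 'ok' is cleared when a division by zero is possible.
Interval inv(Interval a, bool& ok) {
    if (contains_zero(a)) {
        ok = false;
        return a;
    }
    return {1.0 / a.hi, 1.0 / a.lo};
}

// Bounds check for f(x) = 1 / (x*x + c), given the interval of x.
bool is_safe(Interval x, double c) {
    assert(x.lo <= x.hi);
    bool ok = true;
    inv(add(mul(x, x), Interval{c, c}), ok);
    return ok;
}
```

With x in [-1, 1], this accepts c = 2 and rejects c = 0.5. The second verdict shows what
'not necessarily tight' means: x*x + 0.5 can never be zero, but the plain product of
[-1, 1] with itself is [-1, 1], so the check errs on the side of caution.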

Interval arithmetic is accessible through the class IntervalBoundsCheck (eval/BoundsCheck.h).

=== More generic support (gen/) ===

The gen subdirectory contains some general utility classes for defining function sets and for
creating trees. The idea is that these functions are generic and only depend on the sym/ part
of the library. Unfortunately, the language table currently needs an ERC function; a default
implementation is hidden inside fun/FunDef.cpp. Will fix at some point.

gen/LanguageTable.cpp -> defines the functions/terminals that can be used
gen/TreeBuilder.cpp -> can create trees based on a LanguageTable

=== Data and Errors (regression/) ===

The above classes are generic and apply to any type of problem where a mathematical function can be
used to steer some process, run a simulation, whatever. First check the intervals, then compile the
Sym(s) to a (set of) function pointer(s), and use the pointers in some way to evaluate for fitness.
One particular type of problem for which support is built in is 'symbolic regression'. This type of
problem involves finding a mathematical input/output relationship based on some data.

To enable this, regression/ introduces the class Dataset to contain the data and ErrorMeasure to calculate
error. Currently supported: mean squared error, mean absolute error and mean squared error scaled (proportional
to correlation squared). They use some helper classes such as Scaling and TargetInfo.
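
For reference, the three measures can be written down as follows. This is a sketch and not the
ErrorMeasure class: the function names are made up, and the scaled variant is assumed here to
mean the mean squared error after the best least-squares linear rescaling a + b*y of the outputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y holds the outputs of the function, t the targets from the data.
double mean_squared_error(const Vec& y, const Vec& t) {
    assert(y.size() == t.size() && !y.empty());
    double err = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) err += (t[i] - y[i]) * (t[i] - y[i]);
    return err / y.size();
}

double mean_absolute_error(const Vec& y, const Vec& t) {
    assert(y.size() == t.size() && !y.empty());
    double err = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) err += std::fabs(t[i] - y[i]);
    return err / y.size();
}

// Assumed meaning of 'scaled': fit a and b by least squares, then take the MSE.
double scaled_mean_squared_error(const Vec& y, const Vec& t) {
    assert(y.size() == t.size() && !y.empty());
    const double n = static_cast<double>(y.size());
    double mean_y = 0.0, mean_t = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        mean_y += y[i];
        mean_t += t[i];
    }
    mean_y /= n;
    mean_t /= n;
    double cov = 0.0, var_y = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        cov += (y[i] - mean_y) * (t[i] - mean_t);
        var_y += (y[i] - mean_y) * (y[i] - mean_y);
    }
    const double b = var_y > 0.0 ? cov / var_y : 0.0;  // slope
    const double a = mean_t - b * mean_y;              // intercept
    double err = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double d = t[i] - (a + b * y[i]);
        err += d * d;
    }
    return err / n;
}
```

Under this assumption a function that has the right shape but the wrong slope and offset,
such as outputs 1, 2, 3 against targets 3, 5, 7, gets a scaled error of zero while its plain
mean squared error is large.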

=== EO interface (eo_interface/) ===

Contains the classes to make it all work with EO. Check the root application 'symreg' for ways to use this.