README

00001 
00002 This is not yet another gp system (nyagp). For one, it is not general.
00003 It does one thing, find mathematical functions, and tries to do that well.
00004 
00005 So, if you're trying to steer ants on various New Mexico trails, or build your
00006 own tiny block world, you're in the wrong place. However, if you're interested
00007 in finding mathematical functions either through direct application on data or
00008 running it through a simulator, you might find what you're looking for here.
00009 
00010 === Representation (sym/ + gen/) ========
00011 
00012 Mathsym has a few interesting characteristics. First and foremost is the
00013 basic representation. It uses trees, but these trees are stored in a 
00014 reference counted hashtable. This means that every distinct subtree that is alive
00015 is stored once and only once. 
00016 The reference counting mechanism takes care of memory management. 
00017 
00018 The idea of using a hashtable (for offline analysis) comes from Walter Tackett, in his
00019 1994 dissertation. The current system is just a real-time implementation of this
00020 idea, adding the reference counting for ease of use.
00021 
00022 The hashtable brings overhead. It's still pretty fast, but a string based representation
00023 would run circles around it. However, by virtue of it storing every subtree only once, it
00024 is fairly tight on memory. This helps tremendously when confronted with excessively growing populations, bloat.
00025 The hashtable implementation can not stop bloat, but does make it more manageable. In a typical
00026 GP run, the number of distinct subtrees is only 10-20% of the total number of subtrees.
00027 
00028 Other advantages of the hashtable are in the ability to examine the run more thoroughly. It is easy
00029 to check how many subtrees are present in the system, and for each subtree you can check the reference
00030 count. 
00031 
00032 The basic tree is called a Sym. A Sym is simply a tree, and has children, accessible through args().
00033 A Sym simply contains an iterator (== decorated pointer) to an entry in the hashtable. 
00034 Every time you create a Sym, it is either looked up in the hashtable or added to the hashtable.
00035 A Sym has several members: size, depth, args, etc. One interesting member is the refcount().
00036 This returns the reference count of the Sym in the hashtable, and thus returns the number
00037 of distinct contexts in which the Sym is used.
00038 
00039 Another nice thing of these hashtable Syms is that a check for equality reduces to a pointer comparison.
00040 
00041 The Sym nodes are identified by a simple token, of type token_t (usually an unsigned int). It
00042 is completely generic and could conceivably be adapted to steer ants. The rest of the library
00043 is however targeted at mathematical functions purely. 
00044 
00045 sym/Sym.h is the file to look into for the functionality provided by Sym. The sym/ directory
00046 is where the source files are stored that are relevant for the generic Sym functionality. The
00047 'gen/' directory contains some generic functionality to build and traverse trees, independent of 
00048 the function and terminal set.
00049 
00050 The file sym/README.cpp documents the use of the sym library for general GP use.
00051 
00052 === Function Set (fun/) ===
00053 
00054 The standard GP function set of binary functions: addition, multiplication, subtraction and
00055 division is NOT supported. 
00056 
00057 What is however supported are the functions of:
00058 
00059 summation: arbitrary arity, arity zero meaning 0.0. Arity 2 is standard addition
00060 product:   arbitrary arity, arity zero meaning 1.0. Arity 2 is standard multiplication
00061 inversion:  1.0 / x. Only arity 1
00062 unary minus: -x. Only arity 1
00063 
00064 Plus a whole bunch of other functions (see "fun/FunDef.h")
00065 
00066 The reason for this is the observation (actually from a friend of mine, thanks Luuk),
00067 that this set of functions is complete and slightly more orthogonal than a binary set.
00068 
00069 The directory 'fun' contains the functionality for the function and terminal set, together
00070 with ERC's etc. fun/FunDef.cpp contains the definition of the functionality. Stuff can be
00071 added here, but best to contact me if you miss particular functions.
00072 
00073 With the sym and the function set in place, some fairly nice overloading is possible. A quick tour:
00074 
00075 To create a variable that reads the first value from the inputs, do:
00076 
00077 Sym var = SymVar(0);
00078 
00079 To create a constant of value 0.4432, do
00080 
00081 Sym cnst = SymConst(0.4432);
00082 
00083 The constants are also stored uniquely so that:
00084 
00085 Sym cnst2 = SymConst(0.4432)
00086 
00087 will lead to:
00088 
00089 cnst == cnst2
00090 
00091 to be true (this happens without value compare, they point to the same element in the hashtable)
00092 
00093 To add two values, do
00094 
00095 Sym sym = var + const;
00096 
00097 This will create a tree with tree nodes. All other operations work identically.
00098 
00099 === Evaluation (eval/) ===
00100 
00101 The second important thing is evaluation. Although Syms can be evaluated through an interpreter,
00102 this is not the fastest way to go about with it. The standard way of evaluating a Sym is to 
00103 first *compile* it to a function, and then run it in your favourite environment. Compilation
00104 is done through the use of the excellent tinycc compiler, which is blazingly fast and produces
00105 pretty good functions.
00106 
00107 Compilation comes in several flavours: compile a single function and retrieve a pointer to a function
00108 of signature:
00109 
00110 double func(const double* x);
00111 
00112 where x is the input array. Another option is to compile a bunch of functions in one go, and retrieve an array
00113 of such function pointers. The Syms are simply printed and compiled. An example: 
00114 
00115 double func(const double* x) { return x*x + x * 1./x; }
00116 
00117 The batch version proceeds significantly more quickly than calling compile every time. The function pointers
00118 can be given to a simulation for extremely quick evaluation.
00119 
00120 A third option is to compile a complete population in one go, and return a single pointer of signature
00121 
00122 void func(const double* x, double* y);
00123 
00124 Where 'y' is the (preallocated) output array. This allows to evaluate a complete population in one function
00125 call, storing the results in 'y'. It uses the hashtable to store every calculation only once. An example
00126 for the two function x*x + x*1./x and x + sin(x*x) is:
00127 
00128 void func(const double* x, double* y) {
00129     double a0 = x;
00130     double a1 = a0 * a0;
00131     double a2 = 1.0;
00132     double a3 = a2 / a0;
00133     double a4 = a2 * a3;
00134     y[0] = a4;
00135     double a5 = sin(a1);
00136     double a6 = a0 + a5;
00137     y[1] = a6;
00138 }
00139 
00140 This is the fastest way to evaluate even humongous populations quickly. You might be surprised at
00141 the amount of code re-use in a GP population.
00142 
00143 The three compilation functions can be found in eval/sym_compile.h
00144 
00145 A limiting factor in tinycc is that the struct TCCState that is used to hold the compilation context,
00146 is not really self-contained. This unfortunately means that with every call to 'compile' ALL previous
00147 pointers that have been produced become unsafe for use. I'm still looking at ways to circumvent this.
00148 
00149 To work with mathsym, a few small changes in tccelf.c were necessary, check README.TCC for details.
00150 
00151 === Interval Arithmetic (eval/) ===
00152 
00153 GP is pretty good at finding mathematical expressions that are numerically unsound. Take for instance
00154 the function '1 / x'. This is well defined only when x is strictly positive, but will lead to problems
00155 when x equals 0. The standard answer is to define some pseudo-arithmetical function called 'protected
00156 division' that will return some value (usually 1) when a division by zero occurs. This leads to a number
00157 of protected functions (sqrt, log, tan, etc.) which all need to be protected. Interpreting results from
00158 GP using such functions is in general hard.
00159 
00160 Interval arithmetic (through another excellent library boost/numeric/interval) is used to calculate
00161 if particular functions can conceivably produce problems. This completely annihilates the use for Koza-style
00162 protected operators and is a more safe and sound method. For interval arithmetic to function, the bounds
00163 on the input variables need to be known. As for every function we can calculate a guarenteed, 
00164 though not necessarily tight, output interval given the input intervals, we can check arbitrary functions
00165 for possible problems. If, for example for division, the input interval contains 0, we know that a division
00166 by zero is theoretically possible. It's then best to throw away the entire function.
00167 
00168 Interval Arithmetic is accessible through the class IntervalBoundsCheck (eval/BoundsCheck.h)
00169 
00170 === More generic support (gen/) ===
00171 
00172 The gen subdirectory contains some general utility classes for defining function sets and for 
00173 creating trees. The idea is that these functions are generic and only append on the sym/ part
00174 of the library. Unfortunately, the language table currently needs an ERC function, a default
00175 implementation is hidden inside fun/FunDef.cpp. Will fix at some point.
00176 
00177 gen/LanguageTable.cpp -> defines the functions/terminals that can be used
00178 gen/TreeBuilder.cpp   -> can create trees based on a LanguageTable
00179 
00180 === Data and Errors (regression/)  ===
00181 
00182 The above classes are generic and apply for any type of problem where a mathematical function can be 
00183 used to steer some process, run a simulation, whatever. First check the intervals, then compile the 
00184 Sym(s) to a (set of) function pointer(s), and use the pointers in some way to evaluate for fitness. 
00185 One particular type of problem for which support is built in is 'symbolic regression'. This type of
00186 problem involves finding an mathematical input/output relationship based on some data.
00187 
00188 To enable this, regression/ introduces the class Dataset to contain the data and ErrorMeasure to calculate
00189 error. Currently supported: mean squared error, mean absolute error and mean squared error scaled (proportional
00190 to correlation squared). They use some helper classes such as Scaling and TargetInfo.
00191 
00192 === EO interface (eo_interface/) ===
00193 
00194 Contains the classes to make it all work with EO. Check the root application 'symreg' for ways to use this
00195 
00196 
00197 
00198 

Generated on Thu Apr 19 11:02:29 2007 for EO by  doxygen 1.4.7