Before dwelling into the details, let's first discuss some basic concepts and terminology: What do we call an Artificial Intelligence (AI) ? And, what algorithms and techniques can we use to make an AI ?
Actually, in many good programs, several of these algorithms and techniques are combined. In the following paragraphs I have listed the main techniques I have used in FOREST, with some advices about what works best, or what doesn't work well.
Function that evaluates, or estimates, the value of a given position in the game. In the Othello game, traditional evaluation functions are based on prior knowledge of the game strategy, and are a combination of the number of discs of the players, on their mobility (number of moves they can make), on the frontier of their position (numer of outbound discs), and on tables of specific patterns such as edges, corners, and diagonals patterns. The evaluation usually provides an estimate of the final score, i.e. the final disc difference between the two players (64..+64).
In the early days of Othello programming, evaluation functions were empiric, i.e. manually tuned by their programmer after after playing many many games against their program to find out the best coefficients for each of the above parameters. Then, starting with the famous Logistello program written by Michael Buro, the coefficients of the evaluation functions were automatically computed by Logistic Regression applied to a large number of positions with known or well estimated results. That method proved to be superior and Logistello, as well as subsequent programs using that technique such as EDAX or Zebra, defeated all other programs and even the Othello world champion in 1997.
Until its version 5.0, FOREST was using a basic evaluation function combining occupation, mobility, and frontier parameters plus tables of edges, diagonals, and corner patterns, manually tuned. That evaluation function performed reasonably well against player, but has never been able to match the best programs, such as Zebra.
The most basic tree search technique, recursive, trying to minimize the opponent's moves while maximizing the players moves, according to the evaluation of their respective position after every move. The Minimax algorithm guarantees to find the optimal solution, but can be very long, depending on the expansion factor of the tree (average number of moves at each turn).
Improvement of the Minimax search, where branches of the tree are cut if outside of the [alpha, beta] window, because it is remembered from previous branches examination that the branches outside of that interval would not influence the partial result already obtained for a given node. Alpha beta is a big improvement of the Minimax search and allows to dramatically improve the search depth at equal search time.
Improvement of the AlphaBeta search, where, at a given node, once the best branch has been explored, subsequent branches are searched with a null window to check whether they are better on not than the best branch already found. The search with a null window being much faster than an alphabeta search, Negascout can save significant time if the branches are properly ordered, i.e. if the best branch is explored first. This assumes that a preliminary sort of the branches can be performed efficiently with a good accurracy, otherwise Negascout can even lose time compared to the plain AlphaBeta.
Somewhat an improvement of alphabeta algorithms where "weak" branches are automatically cut, based on their value at an intermediate depth and precomputed probabilities that branches with that value will not improve when searching deeper. Probcut and its multi propcut variation, when carefully tuned, can lead to a significant pruning of the search tree, allowing deeper searches. Programs with deep searches (20 moves lookahead and more), such as Logistello, NTest, Zebra, Edax, all use probcut techniques. However, probcut is risky, because there is always a probability that you may cut a branch that eventually would give a good result at a high depth.
Personally, I don't like probcut and FOREST doesn't use it. I beleive that the accuracy of neural networks and their capability to properly order and evaluate moves at any depth can lead to similar and even better results than probcut, even with much lower search depths.
Neural networks are combinations of "neurons" that take inputs, apply linear combinations of these inputs, then fire a signal according to a nonlinear activation function to the next neuron. The digital neurons in a neural net are a simplistic modelisation of human neurons. The number of neurons, and of their parameters, roughly conditions the capacity of the neural to learn how to properly process the input signal.
Neural networks can be used for estimating, predicting, classifying events, for recognizing patterns (image recognition), for taking decisions, and for many other purposes. In the case of board games such as Othello, we see that there are several elements that make neural networks relevant: patterns (combinations of discs on the board) are important, the score can always be predicted in a finite range [0, 64] for the player, and there are tactics and strategies you can learn in order to improve your chances of weening. Basically, given a given position of the discs on the board, there is one and only one possible final score if both players play the best possible moves at every turn. Ohello is a "perfect information games", like Chess or Go, but unlike Poker.
There are two phases in using a neural net:
The training phase requires a lot of "tagged" samples, i.e. of inputs for which the outputs are known. The more you have neurons in your neural network, the more you need training samples. Regarding Othello, my experience is that a neural network able to decently estimate the score of a position easily requires 100's of 1000's of parameters... That means you need a very large number of games, and of board positions at every turn of the game, for which the final score is known. There exist libraries of Othello games, such as the "WTHOR database", that contain 100.000+ games played by human players and computers, that constitute a good basis. However, such libraries are not sufficient, mainly because the final score resulting from every position at every turn is not always known. In the WTHOR database, only positions with 24 empty squares or less are resolved, but positions with more empty squares are more or less certains, depending on the strength of the players of each game. And, anyway, 100.000 games are an order of magnitude inferior to the number of "good" games you really need. Practically, that means you need to produce a lot of additional games, for instance by using known good programs, such as EDAX, NTEST or ZEBRA, and letting them play games between each others.
In the prediction phase, the neural network "observes" a position of the discs on the board, and predicts the final score, more or less accurately. The more the prediction is accurate, the better your program plays. My experience is that it is extremely difficult to obtain a good accurracy, i.e. to predict the exact final score resulting from a position, probably because the game of Othello is very "chaotic", in the sense that, at every turn, many discs can be flipped and the board pattern significantly changed. This particularity of Othello, which increases as a game gets closer to its end, makes it almost impossible to estimate the final score of a position with an accurracy of 1 disc by just examining the position, unless that position is final or near to final. But, if we notice that the neural network basically acts as an "evaluation function", the we can use it in a slightly different way... Tradictional Othello (or Chess, or Go) programs use alphabeta tree searches to anticipate moves in the game, combined to an evaluation function computed at each leaf of the tree, to estimate the value of a given position or move. Why not combining that traditional alphabeta approach with a neural network used as evaluation function ? This is what FOREST does, and it works well. Actually, I observed that the alphabeta is "smoothing" the estimations computed by the neural network, and gives position estimations that are more an more accurate as the depth of the alphabeta grows. I have no mathematical proof, but my guess is that the alphabeta, with its large number of leaves, is doing some "bagging" or "ensembling" of the results, techniques which are known to improve the accurracy of neural networks. The only caveat of that method is that the alphabeta, even improved by negascout, can't go very deep, because the neural network requires a lot of computations, much more than traditional evaluation functions. Traditional programs such as ZEBRA or EDAX will easily explore the game 24 moves ahead, whereas with a neural network, reaching 12 moves ahead is already a challenge. But... the neural network evaluates a position much more accurately than traditional evaluation functions, which compensates the lower lookahead depth.
Why is a neural network more accurate ?
Neural networks can have different structures and can be composed of neurons of different kinds: dense layers made of fully connected neurons, convolutional networks, recurrent networks, residual networks, etc.
Regarding board games such as Othello, there are basically three approaches:
The first approach works very well on large boards such as the 19x19 board of Go, where convolutional neurons can scan the board and detect patterns without being too much disturbed by the edges of the board. AlphaGo has proved CNN are very efficient on detecting the stones patterns of the Go game. But for Othello (and likely for Chess), my experience is that it doesn't work well, because on a 8x8 board a 3x3 convolutional neurons can only move 6 times in each dimensions before hiting an edge, and because the importance of edges and corners at Othello is such that the convolutional neuron cannot learn anything generic enough about its 3x3 pattern. Then, a second layer of 3x3 convolutional neurons can only move 4 times, which is even worse. The net is that I gave up with that approach.
The second approach is theoretically perfect, but has a major drawback: if each square is considered as a threestate input with a possible value of (1, 0, +1), then the first hidden dense layer of N neurons will need to perform a N x 64 matrix multiplication at run time, and the next dense layer of M neurons will do N x M multiplications, which is a lot. The resulting neural network used as an evaluation function will be slow, which will seriously penalize the depth of the Alpha Beta, hence reduce drastically the overall accuracy of evaluations. Another drawback of that approach is the very high number of training position it requires. I also rapidly gave up with that approach.
The third approach is the one I currently use, and which works rather well. Subdividing the board in a number of regions allows to leverage the 8 symetries of the Othello game, and the mapping of each region to a limited number of pattern allows to apply dimensionality reduction. The idea is that each pattern can be represented by a number, just as you can give a sequence number to the words in a language dictionary. Once you have done this, you can apply a welknown technique of natural language processing: word/pattern embedding. The advantage of that approach is that each discrete pattern will be automatically translated, at training time, into a reasonably sized vertor of floating point values, that can be directly injected into a dense layer. The number of coefficients of the embedding layer is very high, but they are computed at training time, and by drastically reducing the dimension of the inputs, that embedding layer limits the number of multiplications for the first dense hidden layer. You can view this as having a precomputed first dense layer, hence you basically save one layer of neurons and the associated costly matrix multiplication.Another aspect of neural networks is their depth (number of hidden layers) and width (number of neurones in each layer). Initially, we are often tempted to put many layers and many neurons in the network, driven by the belief that the problem to solve is complex. That approach generally leads to overfitting. The right approach is rather to start with the simplest network, see the results, and augment the number the number of neurons an layers to improve accuracy, to the point the network starts to overfit. Remember that a 1layer neural network, known as a perceptron, is already a universal function approximator, i.e. it can approximate about any continuous function, hence it is already quite powerful.
Eventually, each layer of the neural networks has an activation function. After trying the Sigmoïd, Tanh, Relu, Leaky Relu, and PRelu, the conclusion is that unsurprisingly the Relu family works best, and the Leaky Relu slightly better than the standard Relu. I also tried the PRelu (Parametric Relu), but it didn't bring any additional accurracy and requires much more training data.
The last aspect of neural networks is their output layer. The two main options are:
After many tries, I eventually chosed the... first approach, the reason being that the Softmax function is extremely expensive to compute at run time, with many multiplications and exponentials. The second approach is certainly more accurate, but it is just too slow. Sometimes we have to make compromises.
To be continuated.
Neural networks need lots of training data. The more the network has parameters, the more it needs data to tune those parameters using the gradient descent method. The neural network I use in FOREST doesn't have that many neurons (less than 70), but due to its embedding layer and the multiple phases of the game, each with its own set of parameters, it has more than 1 million parameters overall. Computing this 1+ million parameters with a good accurracy actually requires hundreds of millions of labelled positions, i.e. of positions for which the final score is known, either exactly or with a good approximation.
But, how do you obtain such a number of labelled positions ?
There are several options, but anyway you need a very large number of games, played by human players and computers. A first set of games can be found in the WTHOR database, containing more than 100.000 games played during tournaments, between human and computer players of various strength. But these games are not sufficient, and you need to complement that database by confronting your program against a reference program such as EDAX, which can easily be used in scripts. Those automatically generated games can either have a random beginning, or, better, be generated by a Monte Carlo Tree Search (MCTS).
Once you have a large number of games, you need to compute the value of each position of those games. Resolving positions with 26 empty squares or less is reasonably quick with a program like EDAX and a fast multicore computer, but resolving positions with more empty squares is slow or even impossible. The only option left is to estimate the value of those positions as accurately as possible. This can be done in two ways: either you use your reference program to evaluate those position with its own alphabeta search and evaluation function, or you make statistics on your database of games and estimate the value of each position with a minimax ordering of the game tree. Also, don't forget that the Othello game has 8 symetries, which provides an easy mean to augment the set of training positions by x8.
As an indication of what is needed to train a good neural network to play Othello, FOREST uses a game database of more than one million games. And, the more game I add to the database, the better accuracy I obtain. Training a neural network is a never ending process.
Note that, for many technical reasons, and to help the neural network to generalize, both the input data and the output scores must be normalized, i.e. scaled to the [1, +1] range. Ideally, the intermediate outputs of the hidden layers should also be normalized (batch normalization), but this raises difficulties at prediction time and is not absolutely necessary when the number of hidden layers is small, which is the case for Othello.
One of the main challenges of defining and training a neural network is to stay on the thin line between underfitting and overfitting.
Although overfitting is realatively easy to detect (the network accuracy reaches excellent scores on the training data, but remains poor on a set of validation data, it is a difficult problem to tackle. Adding more training data can be difficult when those data are expensive to obtain, and it is the case for the game of Othello. Techniques known to be efficient, such as dropout and batch normalisation, are easy to apply at training time, but pose a serious problem when using the neural network as an evaluation function at run time, because the neural network is used on individual positions rather than on batches of positions, hence it is very difficult to compensate the effects of dropout and batch regularization on individual evaluations. My experience on Othello is that, practically, the only techniques that work to prevent overfitting are to properly tune the network size and depth, and L2 regularization.
One of the key differentiators of neural networks and deep learning is their capability to generalize, i.e. to apply what they have learnt on a training data set to apply that knowledge to inputs they have never seen and to deliver meaningful predictions.
However, that generalization capability only works when the neural network doesn't overfit, i.e. its coefficients are not customized to match exactly the training set. The best way I have found to achieve this is L2 regularization, which penalize the largest coefficients (or weights) of the neural network, thus keeping those weights in a reasonable range and avoiding the extreme values that would be needed to match the inputs exactly.
Tuning the regularization parameter, known as "lambda", is tricky. There is no rule other than it should be small, let's say in the range of 10e5. Too small it will have no effect; too large it will hamper the prediction accuracy. The right way to find the correct value(s) is to try different values until an optimum of the prediction on validation data (not training data) is attained. What I have observed with othello is that lambda is progressiveny decreasing as the games progresses. Which is consistent with the intuition that the beginning of the Othello game is more strategic (i.e. good players apply "general" strategic rules to reach a solid position), whereas the end of the game more tactical (i.e. players try traps and start counting stable discs to maximize their final score).
Monte Carlo methods are a category of mathematical methods based on randomness; their name come from the randomness of games at the casinos in Monte Carlo. One specific Monte Carlo method that applies to games such as Othello (as well as to Chess and Go) is the Monte Carlo Tree Search (MCTS), which can be used as an alternative to the traditional AlphaBeta tree search.
The idea of the MCTS to search a tree structure for the score of the optimal branch is to explore a random subset of the branches, and to assume that the optimal score is the average of the scores of the branches that were explored. The more random branches you take, the closer your estimate is to the theoretical result (which could be obtained by an AlphaBeta). The MCTS can be extremely efficient because, practically, you might only need to explore very few branches, i.e. to play or simulate very few games, to get a decent estimate of the result. How many games to you need to simulate ? That depends... I have no mathematical proof, but my intuition is that it depends on the "quiescence" or "stability" of the game. Games such as Go and Chess are relatively quiet, i.e. strength or value of a player's position evolves relatively quietly throughout the game, without brutal evolutions, except perhaps for the final mat phase in the Chess game. Hence, MCTS works extremely well for the game of Go, as proved by the AlphaGo program. But Othello is quite different from Go and Chess: the begining of an Othello game is relatively quiet, but the game gets more and more chaotic as it appraches its end, and in the 20 last moves the score can completely change at every move. That makes of Othello a challenge for the MCTS, because playing completely random games will give an average score which is the average of a random variable with a high variance, hence a meaningless predicted score.
How can we adapt the MCTS for Othello ?
Nevertheless, those improvements of the MCTS algorithm have a negative consequence: for performance reasons, they make the MCTS almost unusable at runtime, i.e. during live games, because the execution of a decent selection algorithm at every move plus the resolution of all endgames at 18 empty squares or more dramatically reduce the number of games that can be tested in a short amount of time (unless you have a huge parallele system available for free, which is out of reach of individuals). The AlphaBeta and its optimized variants still remain the best option for live games... So, can the MCTS still be useful ? Yes it can, in two areas:
Most of the early Othello programs were representing, including FOREST until its version 5.x, were internally representing and manipulating the game board as a vector of bytes, let's say a 10 x 10 = 100 bytes in order to contain the 8 x 8 squares plus two additional rows and columns to represent the edges and facilitate the detection of these edges by the loops traversing the board. Each square of the board then contained a three state value, lets say 0 for an empty square, 1 for a white disc and 2 for a black disc, or (1, 0, +1). That representation was straightforward and allowed to implement the basic algorithms of the game (finding an empty square, finding a licit move, and playing a move plus flipping the adversarial discs, copying a game) in a few lines of code. However, that representation had three major drawbacks:
Of course, CPU's were improving generation after generation, with better branch prediction reducing the cost of the loops, larger internal caches reducing the need to access the RAM, optimized block copy operations reducing the cost of copying game boards, and deeper pipelines with out of order and parallel execution reducing the cost of flipping and unflipping discs. Also, programs such as FOREST 4.x were extensively using loop unrolling techniques to improve the key algorithms. But the real breakthrough came with the new generations of 64bits CPUs allowing to store an entire game board in a pair of register, one register storing the presence of white discs (0 or 1) in each square, the other register doing so with black discs.
The new techniques using pairs of 64 bits registers to hold and manipulate board representations are referred to as "bitboard manipulations". The use of bit masking, shift operations, and of the classical and/or/xor bitwise operations enable the development of very fast routines to play Othello (and Chess). You can even list all the licit moves on a given board position in a few instructions without any access to the RAM, the algorithm is known as Dumb7Fill.
I started using bitboards with the versions 5.x of FOREST, which gave the program a speedup of about 40%. Actually, that is less than what I was expecting, likely because the program was already very optimized, and because the latest x86 CPU's are so good at caching/fetching data and predicting branches that they have significantly reduced the drawbacks of the vector representation of board games. But, anyway, a 40% improvement is worth the effort of adopting bitboards.
Modern CPU's have more and more cores, and are even able to execute more than one logical thread per physical core. Actually, the increase of CPU horsepower is now mainly due to the increase of the number of cores (horizontal scaling) rather than on the increase of the core frequency or IPC (instructions per cycle). However, the power of all these cores remains difficult to exploit, because a CPU is not able, by itself, to distribute the workload on all its cores. The logic of the applications needs to be modified to exploit all the cores, and the operating system needs a scheduler able to distribute threads of workload over all the available cores.
How can multicores architectures be used in an Othello program ? The computing workload generated by FOREST comprises four main algorithms: the AlphaBeta tree search in midgame, the Neural Network that serves as evaluation function, the WinLossDraw search in endgame, and the exact score determination in endgame. Each of these algorithms are opportunities for parallelism:
The bottom line is that distributed tree searches are tough to implement, and their benefit does not grow linearly with the number of available cores. Although very strong programs such as EDAX use that method, this is not the path I have chosen.
Note that another approach would be to add an aditional output layer to the neural network to rank all the possible moves on a given position, in addition to providing the value of that position. This approach is known as a "Policy Head" and has been used with success by AlphaGo. It certainly works also for Othello, but I still need to test it to check what accuracy can be reached in the ranking.
Once it is decided what can be parallelised using mult_threading, the next question is "how ?". The C++ language and Standard Template Library (STL) offers nice highlevel multithreading and synchronization primitives since the C++11 standard, the Future and Promise objects, which are very convenient for launching asynchronous jobs and waiting for their result. The issue with those high level primitive is their overhead. They are incredibly slow and can only be applied to macro tasks. Using them to parallelize small tasks, such as several computations of the neural network, is actually slower than doing the computations sequentially. There are two reasons for this:
The solution to the two above problems is known as "Thread Pooling", which consists in creating a fixed number of threads at program initialization, then assigning workload to those threads when needed. This eliminates the cost of dynamically creating threads and allows to control exactly the number of threads that will run concurrently.
Unfortunately, so far, there is no standard thread pooling mechanism in C++ (although some C++ threading primitives use a Windows thread pooling mechanism, but in an uncontrolled manner). The only solution is therefore to develop a custom thread pooling mechanism, using the low level Condition Variables and Mutex primitives of the C++ STL.
The Mutex mechanism has also, by itself, a large impact on the performance of multithreading. In a thread pool implementation, Mutex's allow a given thread to wait for a task to perform, or to pick an available task and perform it. For relatively small tasks such as the computation of a Neural Network, the times it takes to test the Mutex, block the thread, and awaken the thread are therefore critical. The issue is that different compilers have different implementations of the std::mutex object: GCC/G++ rely on the Pthread POSIX library, whereas Visual Studio C++ relies on the Windows Mutex mechanism. And, the Pthread implementation is at least 10 times faster than the standard Windows Mutex. This is because Pthreads relies on Windows "Critical Sections", which are actually an interthread exclusion mechanism implemented at the "user level" of the operating system, whereas the Windows Mutex is an interprocess exclusion mechanism implemented at the kernel level, i.e. that requires going downto the kernel of the operating system. The Windows Mutex is therefore more powerfull, because it is capable of interprocess mutual exclusion, but it is also much slower. Considering that for Othello computations we only need multithreading, not multiprocessing, using Windows Critical Sections is therefore much more efficient than using the regular Windows Mutex mechansm. And, if we want to use Visual Studio C++ rather than GCC, and avoid to drag the Pthread library with the program, the best option is to develop a custom Mutex object that relies on the Windows Critical Section (or on the Eventing and WaitForSingleObject primitives). Practically this is only a few lines of code, not a big deal.
Note that it is possible to implement further optimizations using lockfree mechanisms, such as "memory barriers", but I won't enter into the details, even though I actually use such a mechanism to retrieve the results of the tasks performed by my thread pool without blocking the thread that launched the tasks.
Modern CPUs such as Intel Core, AMD Rizen, IBM Power, and some ARM processors have SIMD (Single Instruction Multiple Data) capabilities, i.e. they can process multiple data with a single instruction. Such instructions are usually available for the +  x / operations, the binary operations (and or xor not), shift and rotate operations, and for integer as well as floating point data. Depending on the processor, on its generation and the instruction set, SIMD instructions can process 128, 256, 512 bits at once. Intel processors, for instance, support AVX2 (Advanced Vector Extensions) instruction since the Haswell generation (Sandy Bridge series in 2013), which can process 256 bits at once, i.e. four 64 bits integers or eight 32 bits floating point values. Such instructions can considerably accelerate bitboard manipulations, as well as the matrix operations of the neural network.
However, the difficulty is to properly use those instructions in conjunction to the programming language you are using. There are actually two options to do this:
For instance, a call to __m256 _mm256_add_ps (__m256 a, __m256 b) will generate a single SIMD instruction that will return the sum of 8 floating point values in the 256 bits register "a" to the 8 floating point values in the 256 bits register b.
Programming directly with those intrinsics allows you to use exactly the optimal instructions you want, but at the price of a C/C++ code which no more portable, since it will be tied to the capabilities of specific processors. What I actually do is to use the C/C++ preprocessor to have conditional sections where I put both the generic code and the code with intrinsics, so either of them compiles depending on a configuration switch.
Note that the latest Visual C++ 2019 compiler has an excellent autovectorizer, and is even able to apply to SIMD instructions all the kinds of optimizations that are applicable to regular instructions (arithmetic simplifications, subexpresssion elimination, detection of fused multiplyadds, constant folding, pipeline optimization, etc.), which results in very fast code, see the details here.
There has been many technical papers and even research papers on the programming of Othello/Reversi and on the use of various machine learning, deep learning techniques, and reinforcement learning in that game. Here are the most notable ones with my comments about their content and usefulness.
A worldchampionshiplevel Othello program,
Paul S. Rosenbloom, 1981

One of the very first papers explaining with details all the techniques and steps that led to the development of the IAGO program, which one of the first really good programs prior to the arrival of Logistello. In particular, there is a good discussion about the "features", i.e. the key factors, that play a role in the strategy of the game of Othello. 
An evaluation function for Othello
based on statistics, Michael Buro, 2002

The mustread paper from Michael Buro, the author of the famous Logistello program, explaining for the first time how to determine mathematically an efficient evaluation function. Although Michael used a logistic regression technique, the principles of his supervised learning methodology still appies today with neural networks. 
The evolution of Strong Othello programs, Michael Buro, 2003  A paper from the author of the famous Logistello program relating the progress of stong Othello programs from Iago and its hand crafted evaluation function to Logistello and its mathematically tuned evaluation function. 
Experiments with MultiProbCut and a new HighQuality evaluation Function for Othello, Michael Buro, 2000  An excellent paper from Michael Buro on the multiprobcut technique allowing to prune AlphaBeta search tree
hence to dramatically augment the search depth.

Toward Opening Book Learning, Michael Buro, 2000  Another important paper from Michael Buro on the automatic learning of opening books. This is the most sophisticated and efficient technique I have seen so far. 
Knowledgebased feature discovery for evaluation functions, Tom E. Fawcet, 1995  A very interesting approach to discover the key features of the Othello game, i.e. the characteristics that play a key role in the value (expected score) of a board position. 
Linear versus NonLinear Learning in the Context of Othello, Ali Alanjawi, 2000  A short paper with some indications that non linear evaluation functions are superior than linear ones. This is also the only paper I found using Support Verctor Machines (SVM) for Othello. 
Observing the Evolution of Neural Networks, Learning to Play the Game of Othello, Siang Y. Chong, 2005

A paper summarizing the main thoughts behind the use of neural networks to play the game of Othello. This paper doesn't list the sophisticated techniques that are now possible with modern AI frameworks, but is is the first paper to mention the use of a combination of pattern detection neurones (convolutional neurons) and of game strategy features. In that sense it is quite interesting. 
On Bayesian Upper Conﬁdence Bounds for Bandit Problems, Emilie Kaufmann, Olivier Cappe, Aurelien Garivier , 2012  The essential paper to understand and use the UCB algortithms in Monte Carlo Tree Searches.

FOREST is architected in two layers: the user interface which is developped in C++ and uses the MFC GUI classes, and the game logic which is mostly in C for performance reasons, with a couple of C++ modules when specific features of the C++ standard library are necessary, such as multithreading. It is worth noting that some parts of the C code, such as bitboard manipulations and the neural network logic, are automatically generated by Python scripts to unroll loops and adapt to the neural network hyper parameters. The C++/C compiler I use is the Visual Studio 2017 C++ compiler. I have tried GCC and CLang, but the VS C++ compiler generates faster code in my case (about 10% faster). In particular, the autovectorizer of VS C++ does an excellent job which is absolutely critical for the performance of the neural network. This doesn't mean the VS C++ compiler would is always faster than others, it really depends on your code and you have to try all af them to find out the one that delivers the best performance for your code.
But the runtime code of FOREST, what you execute when playing with it, is only the visible part of the iceberg. There is actually more code to manage the training data, train the neural network, and to postprocess the outputs of the training before FOREST can be compiled and executed. All that code is written in Python, which, in my mind, is definitively the most flexible programming language for data science. The features I use the most in Python are of course the lists and dictionaries, that are so usefull to manipulate game positions and build game trees. The only drawbacks of Python are its relative slowness (it is a semiinterpretor) and its high consumption of RAM (all variables are dynamically linked, hence pointers and dynamic memory allocation everywhere), which become problems when processing tens of millions of data records. The best ways to overcome these drawbacks are a powerfull computer with lots of RAM (I am at 32 Gb RAM and will soon double that amount), and the use specific extensions written in Cython for specific routines which must absolutely run fast. Cython is basically a Python compiler which simplifies the development of Python extensions written in pure C for absolute performance. Python alone is very nice, but most of its power actually comes from its vast catalog of extensions, which allow to integrate most of the known algorithms for data processing. Here are the extensions I use the most:
To be continuated.
Back to FOREST home page.