### abstract ###
Fitness functions based on test cases are very common in Genetic Programming (GP)
This process can be viewed as a learning task, in which models are inferred from a limited number of samples
This paper investigates two methods to improve generalization in GP-based learning: 1) selecting the best-of-run individuals using a three data sets methodology, and 2) applying parsimony pressure to reduce the complexity of the solutions
Results using GP in a binary classification setup show that the accuracy on the test sets is preserved, with lower variance than the baseline results, while the mean tree size obtained with the tested methods is significantly reduced
### introduction ###
GP is particularly suited for problems that can be viewed as learning tasks, where the error between the obtained and desired outputs is minimized over a limited number of test cases -- the training data, in Machine Learning (ML) terminology
Indeed, the classical GP examples of symbolic regression, boolean multiplexer and artificial ant  CITATION  are only simple instances of well-known learning problems (i.e., regression, binary classification and reinforcement learning, respectively)
In the early years of GP, these problems were tackled using a single data set, with results reported on the same data set that was used to evaluate fitness during the evolution
This was justifiable by the fact that these are toy problems used only to illustrate the potential of GP
In the ML community, it is recognized that such a methodology is flawed, given that the learning algorithm can overfit the data used during training and perform poorly on unseen data from the same application domain  CITATION
Hence, it is important to report results on a set of data that was not used during the learning stage
This is what we call in this paper a two data sets methodology, with a training set used by the learning algorithm and a test set used to report the performance of the algorithm on unseen data, which is a good indicator of the algorithm's generalization (or robustness) capability
Even though this methodology has long been accepted and applied in the ML and Pattern Recognition (PR) communities, the Evolutionary Computation (EC) community still lags behind, publishing papers that report results on the data sets used during the evolution (training) phase
This methodological problem has already been pointed out (see  CITATION ) and should become less common in the future
The two data sets methodology prevents reporting flawed results of learning algorithms that overfit the training set
But it does not, by itself, prevent overfitting the training set
A common approach is to add a third data set -- the validation set -- which helps the learning algorithm to measure its generalization capability
This validation set is useful to interrupt the learning algorithm when overfitting occurs and/or to select a configuration of the learning machine that maximizes the generalization performance
This third data set is commonly used to train classifiers such as back-propagation neural networks and can be easily applied to EC-based learning
But this approach has an important drawback: it removes a significant amount of data from the training set, which can be harmful to the learning process
Indeed, the richer the training set, the more representative it can be of the real data distribution, and the more the learning algorithm can be expected to converge toward robust solutions
In light of these considerations, an objective of this paper is to investigate the effect of using a validation set to select the best-of-run individuals in a GP-based learning application
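The three data sets methodology described above can be sketched as follows. This is a minimal illustration, not the experimental setup of the paper: the split fractions, the function names `three_way_split`, `error_rate` and `select_best_of_run`, and the representation of an individual as a plain Python callable are all assumptions made for the example.

```python
import random


def three_way_split(samples, train_frac=0.5, valid_frac=0.25, seed=0):
    """Shuffle the samples and split them into training, validation and test sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation parts goes to the test set.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def error_rate(classifier, data):
    """Fraction of (x, label) pairs that the classifier gets wrong."""
    return sum(1 for x, label in data if classifier(x) != label) / len(data)


def select_best_of_run(candidates, valid):
    """Return the candidate with the lowest error on the validation set.

    The test set is never consulted here; it is reserved for the final report.
    """
    return min(candidates, key=lambda c: error_rate(c, valid))


if __name__ == "__main__":
    # Toy data: the label is True when x >= 50.
    samples = [(x, x >= 50) for x in range(100)]
    train, valid, test = three_way_split(samples)
    # Stand-ins for individuals produced by a run on the training set.
    candidates = [lambda x: x >= 50, lambda x: x >= 10, lambda x: True]
    best = select_best_of_run(candidates, valid)
    print("test error of selected individual:", error_rate(best, test))
```

The cost noted in the text is visible in the split itself: every sample placed in `valid` is one that the evolution cannot learn from.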
Another concern of the ML and PR communities is to develop learning algorithms that generate simple solutions
An argument behind this is Occam's Razor, which states that among solutions of comparable quality, the simplest should be preferred
Another argument is the minimum description length principle  CITATION , which states that the "best" model is the one that minimizes the amount of information needed to encode the model and the data given the model
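One common way to write the two-part form of this principle is given below. The notation is an assumption of this sketch and is not taken from the cited work.

```latex
M^{*} = \arg\min_{M \in \mathcal{M}} \bigl[ L(M) + L(D \mid M) \bigr]
```

Here $\mathcal{M}$ is the set of candidate models, $L(M)$ is the number of bits needed to encode model $M$, and $L(D \mid M)$ is the number of bits needed to encode the data $D$ once $M$ is known. A more complex model raises the first term and is only preferred if it lowers the second term by a larger amount.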
Preference for simpler solutions and overfitting avoidance are closely related: a complex solution is more likely than a simpler one to incorporate specific information from the training set, and thus to overfit it
But, as mentioned in  CITATION , this argument should be treated with care, as too much emphasis on minimizing complexity can prevent the discovery of more complex yet more accurate solutions
There is a strong link between the minimization of complexity in GP-based learning and the control of code bloat  CITATION , that is, an excessive growth of program size in the course of GP runs
Even though complexity and code bloat are not exactly the same phenomenon, as some bloat is generated by neutral pieces of code that have no effect on the actual complexity of the solutions, most of the mechanisms proposed to control bloat  CITATION  can also be used to minimize the complexity of solutions obtained by GP-based learning
This paper is a study of GP viewed as a learning algorithm
More specifically, we investigate two techniques to increase the generalization performance and decrease the complexity of the models: 1) use of a validation set to select best-of-run individuals that generalize well, and 2) use of lexicographic parsimony pressure  CITATION  to reduce the complexity of the generated models
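Lexicographic parsimony pressure is generally understood as comparing individuals on fitness first and using tree size only to break ties. The sketch below illustrates that comparison inside a tournament selection; representing an individual as a `(fitness, size)` tuple, treating higher fitness as better, and the names `lexicographic_better` and `tournament` are assumptions for the example, not details taken from the paper.

```python
import random


def lexicographic_better(a, b):
    """Return True if individual a should be preferred over individual b.

    Individuals are (fitness, size) tuples (an assumption of this sketch).
    Higher fitness wins; only when fitness is equal does the smaller tree win.
    """
    fit_a, size_a = a
    fit_b, size_b = b
    if fit_a != fit_b:
        return fit_a > fit_b
    return size_a < size_b


def tournament(population, k, rng):
    """Draw k distinct individuals at random and return the preferred one."""
    contestants = rng.sample(population, k)
    winner = contestants[0]
    for challenger in contestants[1:]:
        if lexicographic_better(challenger, winner):
            winner = challenger
    return winner


if __name__ == "__main__":
    rng = random.Random(0)
    population = [(0.9, 40), (0.9, 12), (0.7, 3)]
    # With k equal to the population size, every individual competes.
    print(tournament(population, 3, rng))
```

Because size is only consulted on ties, this scheme never trades accuracy for compactness, which is one way to limit the risk of discarding more complex yet more accurate solutions.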
These techniques are tested using a GP encoding for binary classification problems, with vectors taken from the learning sets as terminals, and mathematical operations to manipulate these vectors as branches
This approach is tested on six different data sets from the UCI ML repository  CITATION
Although the proposed techniques are tested in a specific context, we argue that they can be extended to the frequent situations where GP is used as a learning algorithm
