Abstract
Decision trees are potentially powerful predictors and explicitly represent the structure of a dataset. Standard decision tree learners such as C4.5 expand nodes in depth-first order (Quinlan, 1993), while in best-first decision tree learners the ”best ” node is expanded first. The ”best ” node is the node whose split leads to maximum reduction of impurity (e.g. Gini index or information in this thesis) among all nodes available for splitting. The resulting tree will be the same when fully grown, just the order in which it is built is different. In practice, some branches of a fully-expanded tree do not truly reflect the underlying information in the domain. This problem is known as overfitting and is mainly caused by noisy data. Pruning is necessary to avoid overfitting the training data, and discards those parts that are not predictive of future data. Best-first node expansion enables us to investigate new pruning techniques by determining the number of expansions performed based on cross-validation. This thesis first introduces the algorithm for building binary best-first decision trees for classification problems. Then, it investigates two new pruning methods that
Keywords
Related Publications
Programs for Machine Learning
Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 a...
Quickly Boosting Decision Trees - Pruning Underachieving Features Early
Boosted decision trees are one of the most popular and successful learning techniques used today. While exhibiting fast speeds at test time, relatively slow training makes them ...
OC1: randomized induction of oblique decision trees
This paper introduces OC1, a new algorithm for generating multivariate decision trees. Multivariate trees classify examples by testing linear combinations of the features at eac...
A Communication-Efficient Parallel Algorithm for Decision Tree
Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and...
Combining randomization and discrimination for fine-grained image categorization
In this paper, we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patche...
Publication Info
- Year
- 2007
- Type
- dissertation
- Citations
- 229
- Access
- Closed