Deep Learning with Tree Search: Unveiling the Limitations
Artificial intelligence (AI) research and engineering are currently experiencing significant hype around deep learning. However, it's crucial to remember that large neural networks are merely tools with their own inherent biases and weaknesses. A well-rounded skill set is essential for effectively and efficiently solving problems, especially when deep learning falls short.
This article delves into the limitations of deep learning in the context of tree search, particularly when compared to tree-based models, and explores how these limitations impact various applications.
The Intriguing Discrepancy: Tree-Based Models vs. Deep Learning on Tabular Data
Machine learning practitioners across diverse domains have observed that tree-based models, such as Random Forests (RFs), often outperform deep learning/neural networks when analyzing tabular data. This article will explore the reasons behind this phenomenon and provide insights into creating optimal AI pipelines.
Contextualizing the Findings: Preprocessing and Hyperparameter Tuning
Before analyzing the reasons behind the performance differences, it's important to consider the preprocessing steps and hyperparameter tuning methods used in studies comparing these models.
The paper "Why do tree-based models still outperform deep learning on tabular data?" involved significant preprocessing, including the removal of rows with missing data, which can handicap tree-based models: Random Forests are known for handling missing data effectively. When working with Johns Hopkins University to predict the impact of health-system policy changes on public health, Random Forests proved superior to more sophisticated solutions because of their robustness to noisy, high-dimensional data.
While the preprocessing steps used in the paper are generally standard, excessive preprocessing can lead to the loss of valuable nuances within a dataset. Therefore, it's crucial to consider these constraints when evaluating the results. If your datasets differ significantly, the findings may not be directly applicable.
The paper also employed random search for hyperparameter tuning, an industry-standard practice. However, Bayesian Search can be more effective for exploring extensive search spaces.
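For reference, random search as used in the paper can be sketched with scikit-learn's `RandomizedSearchCV`, which samples a fixed number of configurations from parameter distributions instead of enumerating a grid. The parameter ranges and toy dataset below are illustrative assumptions.

```python
# Minimal sketch of random-search hyperparameter tuning on a toy dataset.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=5,  # random search: try 5 sampled configs, not the full grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

A Bayesian alternative would replace the independent sampling with a surrogate model that concentrates trials in promising regions of the search space.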
Reason 1: Neural Nets are Biased to Overly Smooth Solutions
One of the primary reasons neural networks struggle to compete with Random Forests is their bias toward overly smooth solutions. Neural networks have difficulty fitting targets with non-smooth functions or jagged decision boundaries, whereas Random Forests excel at capturing such irregular patterns.
This limitation likely stems from neural networks' reliance on gradients, which require a differentiable, and therefore smooth, search space. Pointy, broken, or discontinuous functions cannot be differentiated easily. This is why it pays to understand AI techniques beyond deep learning, such as Evolutionary Algorithms and traditional search methods: they can deliver strong results in many situations where neural networks fail.
Random Forests generate more precise decision boundaries compared to deep learners. In one visualization, a RandomForest was able to learn irregular patterns on the x-axis (corresponding to the date feature) that a Multilayer Perceptron (MLP) could not learn. This behavior is typical of neural networks, and finding hyperparameters to successfully learn such patterns can be challenging.
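The contrast can be reproduced in a few lines. Below, a jagged, piecewise-constant target stands in for the irregular date patterns described above; the specific function and model sizes are assumptions for illustration.

```python
# Sketch: fit a non-smooth target with a Random Forest and an MLP.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 1))
y = np.sign(np.sin(15 * X[:, 0]))  # non-smooth: abrupt jumps between -1 and 1

rf = RandomForestRegressor(random_state=0).fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X, y)

# Axis-aligned splits track the jumps almost exactly, while the smooth
# MLP fit typically rounds them off.
print(round(rf.score(X, y), 2), round(mlp.score(X, y), 2))
```

The forest's R² on this target is close to 1; the MLP usually needs far more tuning (and iterations) to get close, which mirrors the tuning-cost point below.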
Tree-based methods also have lower tuning costs, making them more efficient solutions.
Reason 2: Uninformative Features Affect MLP-like NNs More
Another significant factor, especially when working with large datasets that encode multiple relationships, is the impact of uninformative features. Irrelevant features can significantly degrade the performance of neural networks and waste valuable training resources. This highlights the importance of thorough Exploratory Data Analysis (EDA) and domain exploration to understand the features and ensure smooth model operation.
Studies have shown that removing irrelevant features reduces the performance gap between tree-based models and neural networks. This suggests that trees are better insulated from the effects of less important features. Conversely, adding random features leads to a sharper decline in the performance of neural networks compared to tree-based methods. ResNet, in particular, is heavily affected by useless features, while the attention mechanism in transformers may offer some protection.
The superior performance of tree-based models in handling uninformative features can be attributed to their design. Decision Trees utilize concepts like Information Gain and Entropy to select the best paths by comparing the remaining features and choosing the one that allows for the best choices.
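The selection mechanism above can be sketched in pure Python: entropy of the labels before a split, minus the weighted entropy after it, gives the information gain of a feature. An uninformative feature yields near-zero gain and is simply never chosen, which is one reason trees shrug off irrelevant columns. The toy data below is an illustrative assumption.

```python
# Pure-Python sketch of information gain, the score a decision tree uses
# to pick its next split.
import math
import random

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(feature, labels):
    # Group labels by feature value, then compare entropy before and after.
    groups = {}
    for f, l in zip(feature, labels):
        groups.setdefault(f, []).append(l)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

random.seed(0)
labels = [random.randint(0, 1) for _ in range(200)]
informative = labels[:]                             # mirrors the label exactly
noise = [random.randint(0, 1) for _ in range(200)]  # pure coin flips

print(round(information_gain(informative, labels), 3))  # close to 1.0
print(round(information_gain(noise, labels), 3))        # close to 0.0
```

The informative feature's gain equals the full label entropy, while the random feature's gain hovers near zero, so the tree routes around it automatically.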
Reason 3: NNs are Invariant to Rotation. Actual Data is Not
MLP-like neural networks such as ResNets are rotationally invariant: applying a rotation to the feature space leaves their performance essentially unchanged. Tree-based learners, which rely on axis-aligned splits, are not. After rotating datasets, the performance ranking of the different learners therefore changes, with ResNets (previously the worst performers) rising to the top, not because they improve but because every other learner degrades.
Taking linear combinations of features, which is what makes ResNets rotation-invariant, can misrepresent the features and their relationships. According to the authors, there is a natural basis (the original basis) that encodes the best data biases, and it cannot be recovered by models that are invariant to rotations and therefore mix features with very different statistical properties.
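A small experiment makes the axis-alignment point concrete. Below, a depth-1 tree (a single split) separates two classes perfectly in the original basis, then degrades after a 45-degree rotation mixes the informative feature with noise; the dataset and rotation angle are illustrative assumptions.

```python
# Sketch: axis-aligned splits lose their footing once features are mixed
# by a rotation of the feature space.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)      # the label depends on feature 0 alone

theta = np.pi / 4                  # rotate the plane by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R.T                    # each rotated feature mixes both originals

stump = DecisionTreeClassifier(max_depth=1, random_state=0)
acc_orig = stump.fit(X, y).score(X, y)
acc_rot = stump.fit(X_rot, y).score(X_rot, y)
print(round(acc_orig, 2), round(acc_rot, 2))
```

A rotation-invariant model would score the same on both versions; the stump drops from perfect accuracy to roughly chance-plus-correlation, because no single axis-aligned threshold matches the rotated boundary.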
Addressing Deep Learning's Shortcomings: Feature Engineering and Alternative Approaches
The limitations of deep learning relative to tree-based models highlight the importance of feature engineering. Decision Trees are particularly sensitive to it: without careful cleaning and engineering of features, their results can be significantly worse than those of "black box" models like neural networks.
In convolutional networks, for example, the convolutional structure is hardcoded as a prior and only the weights are learned. The same strategy of baking structural assumptions into the model can be replicated with other, simpler models as well.
Decision Trees: A Closer Look
Decision Trees offer several advantages, including their ability to capture multi-feature interactions that single-layer linear models cannot. Each tree path from root to leaf can be seen as a data-driven formulation or synthesis of a higher-level feature built out of logical conjunctions ('AND' operation). These auto-synthesized/discovered features are then ORed at the top.
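The AND-of-comparisons structure is easy to see by printing a fitted tree's rules. Below, a small tree learns a target that is true only when two tests both pass; the data and feature names are illustrative assumptions.

```python
# Sketch: each root-to-leaf path of a decision tree is a conjunction of
# threshold tests, made explicit by printing the learned rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)  # true only when both hold

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Each indented block below a test is implicitly ANDed with the tests
# above it; the leaves reached by different paths are ORed together.
print(export_text(tree, feature_names=["x0", "x1"]))
```

The printed path to the positive leaf reads as `x0 > t1 AND x1 > t2`, i.e. a data-driven higher-level feature synthesized from two raw columns.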
While differentiable models are often favored over trees, Decision Trees can be very effective. The space of trees, however, cannot be searched in a principled way. In Decision Trees, there is no notion of progressing smoothly toward better high-level features that track the end goal at every step of the synthesis. Instead, Decision Trees use a combinatorial search over the feature space partitioning, essentially by trial and error.
Decision Trees may struggle in cases where columns are sparse (low entropy to begin with). Achieving high throughput runtime performance can also be challenging unless the Decision Tree is compiled into machine code.
Monte Carlo Tree Search (MCTS)
In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games.
The MCTS algorithm has also been used in programs that play other board games (for example Hex, Havannah, Game of the Amazons, and Arimaa), real-time video games (for instance Ms. Pac-Man), and trick-based card games.
The application of Monte Carlo tree search in games is based on many playouts, also called roll-outs. In each playout, the game is played out to the very end by selecting moves at random. The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player, then choose the move which led to the most victories. The efficiency of this method, called Pure Monte Carlo Game Search, often increases with time as more playouts are assigned to the moves that have frequently resulted in the current player's victory according to previous playouts.
The four key steps in MCTS are:
- Selection: Starting from the root node, select successive child nodes until a leaf node is reached.
- Expansion: Unless the leaf node ends the game decisively, create one or more child nodes and choose one of them.
- Simulation: Complete one random playout from the chosen child node.
- Backpropagation: Update the information in the nodes along the path from the chosen child node back to the root node.
The UCT (Upper Confidence Bound 1 applied to trees) formula is commonly used to balance exploitation and exploration in games. Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT that traces its roots back to the AMS simulation optimization algorithm for estimating the value function in finite-horizon Markov Decision Processes (MDPs).
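The four steps and the UCT selection rule can be sketched in a compact, self-contained implementation. The game here, a take-1-to-3 Nim pile where the player who takes the last stone wins, and the constants (exploration weight, iteration count) are illustrative assumptions, not from the text.

```python
# Minimal MCTS sketch with UCT selection on a take-1-to-3 Nim pile.
import math
import random

class Node:
    def __init__(self, pile, player, parent=None, move=None):
        self.pile, self.player = pile, player      # player about to move
        self.parent, self.move = parent, move
        self.children, self.wins, self.visits = [], 0, 0
        self.untried = [m for m in (1, 2, 3) if m <= pile]

def uct_child(node, c=1.4):
    # UCT: exploitation (win rate) plus an exploration bonus that favors
    # rarely visited children.
    return max(node.children, key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(pile, player):
    while pile > 0:                  # play to the very end with random moves
        pile -= random.choice([m for m in (1, 2, 3) if m <= pile])
        player = 1 - player
    return 1 - player                # whoever just moved took the last stone

def mcts(pile, player, iters=3000):
    root = Node(pile, player)
    for _ in range(iters):
        node = root
        # 1. Selection: descend via UCT while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_child(node)
        # 2. Expansion: add one untried child, unless the node is terminal.
        if node.untried:
            m = node.untried.pop()
            node.children.append(Node(node.pile - m, 1 - node.player, node, m))
            node = node.children[-1]
        # 3. Simulation: one random playout from the chosen child.
        winner = rollout(node.pile, node.player) if node.pile else 1 - node.player
        # 4. Backpropagation: credit each node from its parent's perspective,
        #    since the parent chose the move that led there.
        while node:
            node.visits += 1
            node.wins += (winner != node.player)
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

random.seed(0)
print(mcts(5, 0))  # from a pile of 5, taking 1 leaves the losing pile of 4
```

Note how the tree grows asymmetrically: the subtree behind the most promising move accumulates the most visits, exactly the behavior described above.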
The game tree in Monte Carlo tree search grows asymmetrically as the method concentrates on the more promising subtrees.
Various modifications of the basic Monte Carlo tree search method have been proposed to shorten the search time. Monte Carlo tree search can use either light or heavy playouts. Light playouts consist of random moves while heavy playouts apply various heuristics to influence the choice of moves.
The Role of Decision Trees in Expert Systems
Experts' nebulous decision making can often be modeled with simple decision trees, or even decision chains (linked lists). Even when experts believe their decision making is more complex, a simple decision tree often models their decisions better than the rules they propose themselves.
Decision trees can be viewed as kD-trees that partition the possibility space and attach an action to each volume. The nebulousness of expert decision making may stem from experts' decisions being more nuanced at the granularity of the surface separating two distinct actions.

