How can we characterize machine learning as a field? What make machine learning work?
Machine learning is a fast changing field. The list of ideas is practically endless: Decision trees, ensemble learning, random forests, boosting, neural networks, hidden Markov models, graphical models, kernel methods, conditional random fields, sparsity, compressed sensing, budgeted learning, multi-kernel learning, transfer learning, co-training, active learning, multitask learning, deep learning, lifelong learning and many more.
The problem is, ideas come and go, and bounce back, roughly every 10-15 years. Long enough for a grad student learns the tricks, makes a big noise, graduates when it is still hot and gets a good academic job, IF he is lucky to start early in the cycle. Also long enough so that the new batch of students are not aware of the hot things of the previous wave. How familiar is “particle filtering” to you?
Popular in the late 1990s and early 2000s, particle filtering is a fast way to generate samples of state for a dynamic system when an observation is made.
When I started my grad training in 2004, I asked one of my supervisors on what hot topic I should focus on. He said, pick either graphical models or kernel methods (which meant SVM at the time). I picked graphical models, and then was given conditional random fields (CRFs) to work on. By the time I submitted my PhD thesis in early 2008, CRFs were largely gone. SVMs were gone a couple of years before that, just around the time neural nets bounced back under a new umbrella, deep learning, in 2006. It used to be all about convex loss functions (SVMs & CRFs), now everything is non-convex. Local minima? Doesn’t matter, adaptive stochastic gradient descents such as Adagrad, Adam or RMSprop will find a really good one for you.
Applying machine learning is like flying commercial aircraft
Ever wanted to apply a technique to your problem? A sure way is to employ a PhD in machine learning! Packages available, but what are the correct ways to use, let alone the best way? Think about flying commercial aircrafts. There are hundreds of knobs to tune. There are even autopilot mode. We just need to have two human pilots: one to tune the right knob at the right time, and the other making sure that the correct things are being done.
Wanna use deep learning? You need to decide between: feedforward, recurrent, convolutional nets and any combination of these three. Will attention be used? How about memory? Which loss function? Embedding size? Optimizers and their parameters? and many many more.
I work with clinicians on clinical problems. At least two of them — young, smart and highly motivated — insisted to come over and observe how I do machine learning and learn to do it themselves. They claimed they could do Statra, R and sometimes Python. My boss was crossed. This is not how collaboration should work, right? You want to learn our art for free, then trash us?
But I told the boss, let them come.
They came and left, even more puzzled. I ended up doing the job I usually did and so did they.
Machine learning in three lines
I once delivered an internal talk on deep learning. My boss requested that I talked only about three things. Just three things. This bugged me a lot. But the trick actually worked.
Here I am trying to characterize the current state of machine learning in general and it should apply to deep learning. Machine learning works by:
- Having good priors of the problem at hand (80-90% gain)
- Accounting for data uncertainty and model uncertainty with ensemble and Bayesian methods. (1-5% gain)
- Reusing models when data/model redundancies are available (5-10% gain)
Priors are king
By “good priors”, I meant several things:
- Features that capture all meaningful signals from data. Getting good features are the job of feature engineering, which usually accounts for 80-90% of total effort in a machine learning project. Once you have good features, most modern classifiers will work just fine. Deep learning succeed partly because it solves this problem.
- Problem structures are respected. For example, sequential data would suggest the use of Hidden Markov Models (HMM) or chain-like Conditional Random Fields (CRF). In deep learning, it reduces to architecture engineering!
- Knowledge about the model class. E.g., will linearity be the dominant model? What are the expected complexity and nonlinearity? Will interpretability needed? What is about transparency? Is sparsity important? For neural nets, how many layers? For SVMs, will one kernel type be enough?
- Assumptions about data manifold. One well-studied phenomenon is the intrinsic low dimensionality of data embedded in a very high dimensional space. This is usually manifested through data redundancies. Another assumption is separation of classes, e.g., the region at the class boundary is usually sparse, but is very dense near the class examplars. This assumption essentially gives rise to semi-supervised learning.
- Assumptions about the data space. How many data instances? Will characterizing the data variance enough? If yes then use PCA. What about factors of variation are the key? If yes then RBM perhaps helps.
Don’t forget uncertainty
Even with a good prior, we would never be sure that our choices are correct. Model uncertainty is there and must be accounted for. A popular way is to use many (diverse) models, then employ model averaging, ensemble methods and Bayesian approach. Deep learning has dropout as one of the best tricks invented in the past 10 years. It works like wonder. Bayesian neural nets, which were studied in mid 1990s, is also back!
In fact, every single modern challenge was won by some ensemble, mostly gradient boosting by the time of this writing AND model blending. One of the best known example is the Netflix challenge, which was won by blending hundreds of models — so complex that Netflix found it was useless to implement in practice.
Many are easier than one
I often told my 5-year old daughter: do one thing at a time. But by listening to me AND playing at the same time, she has already multi-tasked. Humans seem to learn better that way. We learn by making senses of many co-occurring feedback signals.
A key idea in recent machine learning is model reuse. It has many forms:
- Domain adaption, which is about reusing previous model on similar domains with minimal changes.
- Transfer learning, which is about reusing models on similar tasks with minimal changes. In neural nets, it is equivalent to not forgetting the old trained nets when learning a new concepts.
- Multi-task learning, which is about learning more than ones correlated tasks at a time. The idea is that models can be partly shared among tasks, leading to less training data and less overfitting.
- Lifelong learning, which is like continual version of transfer learning. Just like humans, we learn to do new things from birth to death, every single day! Popularized by Sebastian Thrun in mid 1990s, lifelong learning is now back in various forms: never-ending learning at CMU, reinforce learning in robotics and games at a various labs.
- Multi-X, where X is substituted by view, modality, instance, label, output/outcome, target, type and so on.