Operations research (OR) is in the initial stages of a revolution driven by reinforcement learning (RL).
When I was at eHarmony years ago, we used classical OR techniques to drive the matchmaking process. Machine learning played a critical role, but was limited to specifying parameters to the OR solver. In essence, machine learning was used to estimate the value function, and the OR solver then produced a policy.
OR has historically focused on highly tractable specializations of convex optimization. In an age of scarce compute, this made perfect sense: indeed, at that time eHarmony was pushing the limits of what was possible using high-end commercial OR solvers. However, compute is less scarce now: in predictive modeling convex optimization has been spectacularly superseded by non-convex techniques (aka “deep learning”). A similar revolution is unfolding in OR. The recipe is: develop a generative model of the problem (aka a “simulator”) and then directly optimize a policy on simulated data using RL techniques.
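To make the recipe concrete, here is a minimal sketch, assuming a deliberately toy setup: a crude stochastic simulator of rewards and a policy trained directly against it with the score-function (REINFORCE) gradient. The problem, parameters, and reward means are all made up for illustration; the point is only the shape of the loop: simulate, act, observe, update the policy.

```python
import math
import random

def simulate(action, rng):
    # Crude generative model ("simulator"): each action yields a
    # stochastic 0/1 reward. The means 0.8 and 0.2 are invented.
    p = 0.8 if action == 1 else 0.2
    return 1.0 if rng.random() < p else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(steps=5000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0  # policy parameter: P(action = 1) = sigmoid(theta)
    for _ in range(steps):
        p1 = sigmoid(theta)
        action = 1 if rng.random() < p1 else 0
        r = simulate(action, rng)
        # REINFORCE update: reward times the score d log pi / d theta.
        grad_logp = (1.0 - p1) if action == 1 else -p1
        theta += lr * r * grad_logp
    return theta

theta = train()
print("P(action=1) after training:", sigmoid(theta))
```

The trained policy concentrates on the higher-reward action without ever seeing a closed-form model of the reward, which is the whole trick: the optimizer only needs samples from the simulator, not a tractable (e.g., convex) formulation of it.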
All models are wrong, but some models are useful. At first blush it seems implausible that using advanced RL techniques on a crude approximation of the real world would yield substantial benefits. However, traditional OR techniques also perform extremely aggressive optimization (to machine precision (!)) on an approximation of the real world. OR models are useful despite typically making tremendous simplifications such as replacing all random variables with their expected values (or in more sophisticated setups, high probability bounds).
The simplifications for the RL techniques involve the assumptions in the generative model, such as a particular parametric model for the probability of an airplane service event. Early research results suggest that for some economically important problems, relatively crude simulators coupled with RL techniques can induce superior policies to those developed using traditional OR techniques. Furthermore, simulators admit expressing increasingly refined approximations of reality without the constraints imposed by classical OR formulations.
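What does such a "crude parametric model" look like? Here is a hypothetical sketch: unscheduled service events modeled as a Poisson process in flight hours. The rate parameter is entirely made up (it is not real airline data), and the Poisson assumption is exactly the kind of simplification the argument is about; the simulator can later be refined without touching the downstream RL machinery.

```python
import math
import random

def sample_service_events(flight_hours, rate_per_hour=0.002, rng=None):
    """Crude parametric simulator: service events as a Poisson process.

    `rate_per_hour` is an illustrative, invented parameter. Returns the
    number of unscheduled service events over `flight_hours` hours,
    sampled via Knuth's inversion method (stdlib only).
    """
    rng = rng or random.Random()
    threshold = math.exp(-rate_per_hour * flight_hours)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1
```

A policy trained against draws from this simulator is conditioned on event counts with roughly the right frequency, even though the parametric form is a caricature of real maintenance data.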
“Reaction time is a factor in this so please pay attention.”
Reaction time. You should almost never take seriously anyone’s explanation for why something works. Nonetheless, I’ll give you my intuition as to why RL will eventually dominate OR.
“Classical OR techniques are optimized to try to avoid bad events ballistically, whereas RL trained policies are optimized to react to events as they occur.”
If this is true then it doesn’t matter whether the simulation gets the probability of tail events exactly right, as long as they are all present in the simulation and somewhat rare, because the “use of remediation actions” portion of the learned policy will be conditioned on events as they actually occur in practice. (If the events are not rare, then getting the co-occurrence statistics right could matter.)
If this explanation has merit, then the upside to RL will be large for scenarios where classic OR optimization is frequently re-run in order to react to new events, because RL will have the “reaction time advantage”.