2023
Diffusion limit control and reinforcement learning, with applications to online auctions
We consider the diffusive limit of a generic pure-jump Markov control problem as the intensity of the driving Poisson process tends to infinity. We quantify the convergence speed in terms of the Hölder exponent of the Hessian of the limit problem. This approximation provides an efficient alternative method for the numerical resolution of these problems when the intensity of jumps is large. Considering the control of unknown systems, we extend this approach to the context of online Reinforcement Learning. Under the Optimism in the Face of Uncertainty (OFU) paradigm, we leverage the eluder dimension framework for learning and the diffusive limit for the approximate resolution of the planning sub-problem. Our algorithm extends existing theory from discrete processes to continuous states and actions. Our study of diffusion limit systems is motivated and illustrated by the bidding problem in a high-frequency online auction against a revenue-maximising seller.
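To make the scaling concrete, here is a minimal simulation sketch, not the construction used in this work: a pure-jump state is updated at the ticks of a Poisson clock of intensity 1/\varepsilon, with jump mean and variance matched so that its law approaches that of a Brownian diffusion as \varepsilon\downarrow0. The drift, volatility and jump kernel below are illustrative assumptions.

```python
# Illustrative sketch (not the paper's construction): a pure-jump process whose
# update times follow a Poisson clock of intensity 1/eps converges, after the
# usual drift/variance matching, to a Brownian diffusion as eps -> 0.
import numpy as np

rng = np.random.default_rng(0)

def jump_endpoint(eps, T=1.0, x0=0.0):
    """State receives i.i.d. jumps of mean mu*eps and std sigma*sqrt(eps)
    at the arrival times of a Poisson clock of rate 1/eps (assumed kernel)."""
    mu, sigma = 0.5, 1.0
    n_jumps = rng.poisson(T / eps)              # number of clock ticks on [0, T]
    jumps = mu * eps + sigma * np.sqrt(eps) * rng.standard_normal(n_jumps)
    return x0 + jumps.sum()

def diffusion_endpoint(T=1.0, x0=0.0):
    """Endpoint of the limiting diffusion dX = mu dt + sigma dW."""
    mu, sigma = 0.5, 1.0
    return x0 + mu * T + sigma * np.sqrt(T) * rng.standard_normal()

for eps in (1.0, 0.1, 0.01):
    ends = np.array([jump_endpoint(eps) for _ in range(20_000)])
    print(f"eps={eps:5.2f}  mean={ends.mean():+.3f}  var={ends.var():.3f}")

ref = np.array([diffusion_endpoint() for _ in range(20_000)])
print(f"limit      mean={ref.mean():+.3f}  var={ref.var():.3f}")
```

As \varepsilon decreases, the empirical mean and variance of the jump process endpoint match those of the limiting diffusion, which is the behaviour the approximation exploits.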
Reinforcement Learning in near-continuous time for continuous state-action spaces
We consider the Reinforcement Learning problem of controlling an unknown dynamical system to maximise the long-term average reward along a single trajectory. Most of the literature considers system interactions that occur in discrete time and discrete state-action spaces. Although this standpoint is suitable for games, it is often inadequate for mechanical or digital systems in which interactions occur at a high frequency, if not in continuous time, and whose state spaces are large if not inherently continuous. Perhaps the only exception is the Linear Quadratic framework, for which results exist both in discrete and continuous time. However, its ability to handle continuous states comes with the drawback of rigid dynamics and reward structures. This work aims to overcome these shortcomings by modelling interaction times with a Poisson clock of frequency \varepsilon^{-1}, which captures arbitrary time scales: from discrete (\varepsilon=1) to continuous time (\varepsilon\downarrow0). In addition, we consider a generic reward function and model the state dynamics according to a jump process with an arbitrary transition kernel on \mathbb{R}^d. We show that the celebrated optimism protocol applies when the sub-tasks (learning and planning) can be performed effectively. We tackle learning within the eluder dimension framework and propose an approximate planning method based on a diffusive limit approximation of the jump process. Overall, our algorithm enjoys a regret of order \tilde{\mathcal{O}}(\varepsilon^{1/2} T+\sqrt{T}). As the frequency of interactions blows up, the approximation error \varepsilon^{1/2} T vanishes, showing that \tilde{\mathcal{O}}(\sqrt{T}) is attainable in near-continuous time.
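As a rough illustration of the optimism protocol mentioned above, the sketch below runs a toy discrete-time loop with a finite model class: cumulative squared prediction errors define a confidence set, and the agent acts greedily with respect to the most favourable plausible model. The model class, confidence radius and one-step planner are placeholder assumptions, not the algorithm or the guarantees analysed in the paper.

```python
# Minimal sketch of an optimism-in-the-face-of-uncertainty loop in a toy
# one-dimensional setting with a finite model class.  All modelling choices
# here are illustrative assumptions made for the demo.
import numpy as np

rng = np.random.default_rng(1)
ACTIONS = np.linspace(-1.0, 1.0, 21)

def true_drift(s, a):          # unknown system (assumed for the demo)
    return 0.9 * s + 0.5 * a

MODELS = [lambda s, a, k=k: k * s + 0.5 * a for k in np.linspace(0.5, 1.3, 9)]

def reward(s, a):              # known reward, penalises distance from a target state
    return -(s - 1.0) ** 2 - 0.1 * a ** 2

s, losses = 0.0, np.zeros(len(MODELS))
for t in range(1, 501):
    # Confidence set: models whose cumulative squared error is near the best one.
    beta = 2.0 * np.log(t + 1)                        # placeholder radius
    plausible = np.flatnonzero(losses <= losses.min() + beta)
    # Optimistic "planning" (here: myopic): pick the (model, action) pair with
    # the largest predicted one-step reward among plausible models.
    best = max((reward(MODELS[i](s, a), a), a) for i in plausible for a in ACTIONS)
    a = best[1]
    s_next = true_drift(s, a) + 0.1 * rng.standard_normal()
    losses += (np.array([m(s, a) for m in MODELS]) - s_next) ** 2
    s = s_next

print("selected model index:", int(np.argmin(losses)))
```

The point of the sketch is the structure of the loop (learn a confidence set, plan optimistically within it, act, update), which is the protocol the regret analysis builds on.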
2022
Diffusive limit approximation of pure jump optimal ergodic control problems
Motivated by the design of fast reinforcement learning algorithms, we study the diffusive limit of a class of pure jump ergodic stochastic control problems. We show that, whenever the intensity of jumps is large enough, the approximation error is governed by the Hölder continuity of the Hessian matrix of the solution to the limit ergodic partial differential equation. This extends to this context the results of [1], obtained for finite horizon problems. We also explain how to construct a first-order error correction term under appropriate smoothness assumptions. Finally, we quantify the error induced by using the Markov control policy constructed from the numerical finite difference scheme associated with the limit diffusive problem; this appears to be new in the literature and is of interest in its own right. This approach significantly reduces the cost of numerical resolution.
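For a sense of the kind of numerical resolution referred to above, the following sketch solves a one-dimensional ergodic control problem on the torus with a finite difference (Markov chain approximation) scheme and relative value iteration, then reads off a Markov control policy. The dynamics, running cost and discretisation parameters are assumptions made for illustration only, not the setting treated in the paper.

```python
# Illustrative sketch: finite difference (Markov chain approximation) scheme for
# a 1D ergodic control problem dX = a dt + sigma dW on the torus [0, 1), solved
# by relative value iteration.  All problem data below are assumptions.
import numpy as np

N, sigma = 50, 1.0                          # grid points, volatility
h = 1.0 / N
xs = np.arange(N) * h
ACTIONS = np.array([-1.0, 0.0, 1.0])
dt = 0.5 * h ** 2 / sigma ** 2              # keeps transition probabilities valid

def cost(x, a):                             # running cost (assumed)
    return (x - 0.5) ** 2 + 0.1 * a ** 2

V = np.zeros(N)
for _ in range(20_000):
    Q = np.empty((N, len(ACTIONS)))
    for j, a in enumerate(ACTIONS):
        p_up = sigma ** 2 * dt / (2 * h ** 2) + a * dt / (2 * h)
        p_dn = sigma ** 2 * dt / (2 * h ** 2) - a * dt / (2 * h)
        p_stay = 1.0 - p_up - p_dn
        EV = p_up * np.roll(V, -1) + p_dn * np.roll(V, 1) + p_stay * V
        Q[:, j] = cost(xs, a) * dt + EV
    TV = Q.min(axis=1)
    rho = (TV[0] - V[0]) / dt               # average-cost estimate at reference state
    V_new = TV - TV[0]                      # relative value iteration normalisation
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = ACTIONS[Q.argmin(axis=1)]          # Markov control read off the scheme
print(f"estimated long-run average cost: {rho:.4f}")
print("policy at x = 0.2, 0.5, 0.8:", policy[[int(0.2 * N), int(0.5 * N), int(0.8 * N)]])
```

The error quantification in the paper concerns what is lost when a policy of this type, computed on the limit diffusive problem, is plugged back into the original pure-jump system.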
Diffusive limit approximation of pure-jump optimal stochastic control problems
We consider the diffusive limit of a typical pure-jump Markovian control problem as the intensity of the driving Poisson process tends to infinity. We show that the convergence speed is governed by the Hölder constant of the Hessian of the limit problem, and explain how correction terms can be constructed. This provides an efficient alternative method for the numerical approximation of the optimal control of a pure-jump problem in situations with a very high jump intensity. We illustrate this approach in the context of a display advertising auction problem.
2020
Real-Time Optimisation for Online Learning in Auctions
In display advertising, a small group of sellers and bidders face each other in up to 10^{12} auctions a day. In this context, revenue maximisation via monopoly price learning is a high-value problem for sellers. By nature, these auctions are online and produce a very high-frequency stream of data. The resulting computational strain requires algorithms to run in real time. Unfortunately, existing methods inherited from the batch setting suffer from O(\sqrt{t}) time/memory complexity at each update, prohibiting their use. In this paper, we provide the first algorithm for online learning of monopoly prices in online auctions whose update is constant in time and memory.
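To illustrate what a constant time/memory per-auction update can look like, the generic exponential-weights sketch below (not the algorithm of the paper) maintains a fixed grid of candidate reserve prices and updates one weight per price after each auction, so the per-update cost and the memory footprint do not grow with t. The i.i.d. uniform buyer values and the learning rate are assumptions made for the demo.

```python
# Generic illustration (not the paper's algorithm): learning a reserve price
# online with O(K) work and memory per auction, independent of t.  We assume
# truthful bids are observed after each auction, so the revenue that every
# candidate price would have earned can be fed to exponential weights (Hedge).
import numpy as np

rng = np.random.default_rng(2)
K = 50
prices = np.linspace(0.0, 1.0, K)          # fixed grid of candidate reserve prices
log_w = np.zeros(K)                        # one weight per price: O(K) memory
eta = 0.5                                  # learning rate (illustrative choice)

T = 50_000
total_revenue = 0.0
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    r = rng.choice(prices, p=p)            # posted reserve price for auction t
    v = rng.uniform()                      # buyer's value (assumed i.i.d. U[0, 1])
    total_revenue += r * (v >= r)
    # Full-information feedback: revenue every grid price would have earned.
    gains = prices * (v >= prices)
    log_w += eta * gains                   # O(K) update, independent of t

print("average revenue:", total_revenue / T)
print("best empirical price:", prices[np.argmax(log_w)])
```

With uniform values the monopoly price is 0.5 and the optimal expected revenue per auction is 0.25, which the running average approaches; the relevant contrast with batch methods is that each update here touches only the K weights, never the full history.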