Artificial intelligence tools powered by machine learning have driven considerable improvements in a variety of experimental domains, from education to healthcare. In particular, the reinforcement learning (RL) and multi-armed bandit (MAB) frameworks hold great promise for defining sequential designs aimed at delivering optimized adaptive interventions with outcome and resource (e.g., cost, time, or sample size) benefits. In this work, we discuss the opportunities that RL and MABs offer to current trends in healthcare experimentation, as well as the specific modeling challenges this framework poses. Motivated by three case studies (in mobile health, digital mental health, and clinical trials) that differ in their outcome type, we illustrate our methodological contribution to this framework by integrating elements of traditional statistics. Specifically, we combine standard offline data models for count and rating-scale outcomes, which are increasingly common in digital and mobile health, with Thompson sampling, arguably the most popular MAB algorithm. We discuss the theoretical properties of some of the proposed solutions and evaluate their empirical advantages in balancing the exploitation (outcome performance) versus exploration (learning performance) trade-off typical of reinforcement learning problems. Further considerations are provided for the particularly challenging case of small samples, where parametric assumptions are often unrealistic. In such settings, we demonstrate how RL-based solutions combined with bootstrap approaches offer a flexible and effective strategy for achieving a near-optimal balance between patient benefit within the study and statistical operating characteristics in a small population.
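To make the Thompson sampling component concrete, the following is a minimal sketch of the algorithm for a multi-arm design with count outcomes, assuming a conjugate Gamma-Poisson model for each arm; the function name gamma_poisson_thompson, the prior hyperparameters (a0, b0), and the simulated arm rates are illustrative assumptions for this sketch, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def gamma_poisson_thompson(n_rounds, true_rates, a0=1.0, b0=1.0):
    """Thompson sampling for a K-armed bandit with Poisson (count) outcomes,
    using a conjugate Gamma(a0, b0) prior on each arm's rate parameter."""
    k = len(true_rates)
    # Posterior hyperparameters: shape accumulates observed counts,
    # rate accumulates the number of times each arm was played.
    shape = np.full(k, a0)
    rate = np.full(k, b0)
    rewards = np.zeros(n_rounds)
    for t in range(n_rounds):
        # Draw one sample of each arm's rate from its current posterior
        # and allocate the next participant to the arm with the largest draw.
        sampled_rates = rng.gamma(shape, 1.0 / rate)
        arm = int(np.argmax(sampled_rates))
        y = rng.poisson(true_rates[arm])  # simulated count outcome
        # Conjugate update: Gamma(shape + y, rate + 1).
        shape[arm] += y
        rate[arm] += 1.0
        rewards[t] = y
    return rewards, shape, rate

# Illustrative example: two hypothetical intervention arms with mean counts 2.0 and 3.5.
rewards, shape, rate = gamma_poisson_thompson(500, true_rates=[2.0, 3.5])
print("posterior means per arm:", shape / rate)
print("average observed outcome:", rewards.mean())
```

Under this conjugate assumption the per-decision cost is a single Gamma draw per arm, which is one reason Thompson sampling is attractive for online adaptive designs; a bootstrap-based variant, as considered for the small-sample setting, would replace the posterior draw with a draw computed from a resampled version of each arm's observed outcomes.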