UCB Under Non-stochastic Enviroment

Introduction

In my previous blog post, we explored how the exploration parameter c in the UCB (Upper Confidence Bound) algorithm affects arm selection behavior and UCB value dynamics in stochastic multi-armed bandit problems. However, in my V2X (Vehicle-to-Infrastructure) research, I encountered a new challenge: non-stationary arms where reward distributions change over time.

This blog explores how the UCB algorithm performs when arm distributions shift suddenly.


Experimental Setup

Initial Configuration:

  • 5 arms with reward distributions (Bernoulli): [0, 0.85, 0.3, 0, 0.45]
  • Arm 2 is initially the best choice (mean = 0.85)
  • Total time steps: varies by experiment (150-400 steps)
  • UCB parameter c: tested with [0.25, 0.5, 1.0]

Experiment 1: Moderate Distribution Shift

Setup:

  • Initial means: [0, 0.85, 0.3, 0, 0.45]
  • After t=50: Arm 2 shifts to 0.45 (tied with Arm 5)
  • Shifted means: [0, 0.45, 0.3, 0, 0.45]
  • Time steps: 150
Baseline: UCB behavior with fixed arm distributions (no shift)
UCB behavior when Arm 2's mean drops from 0.85 to 0.45 at t=50

Key Findings

  1. Rapid UCB Value Decline: After the distribution shift, Arm 2’s UCB value drops quickly as it receives lower rewards.

  2. Historical Bias Problem: With smaller c values, the algorithm still tends to favor historically high-UCB arms (Arm 2) even after the shift, because:

    • The accumulated mean reward from early time steps remains high
    • The exploration bonus c√(ln t / n) shrinks as the number of trials n increases
    • Since the new mean (0.45) is close to other arms, the algorithm struggles to differentiate

Experiment 2: Large Distribution Shift

Setup:

  • Initial means: [0, 0.85, 0.3, 0, 0.45]
  • After t=50: Arm 2 drastically drops to 0.1
  • Shifted means: [0, 0.1, 0.3, 0, 0.45]
  • Time steps: 400 (longer horizon to observe adaptation)
UCB behavior when Arm 2's mean drops dramatically from 0.85 to 0.1 at t=50 (Run 1)
UCB behavior under the same configuration (Run 2) - showing variability in arm discrimination

Key Findings

  1. Faster UCB Decline: Arm 2’s UCB value drops more dramatically with the larger shift.

  2. Parameter-Dependent Adaptation Speed:
    • c = 0.25: The algorithm maintains historical preference for a prolonged period. When it finally switches, it shows high sensitivity to the current highest-UCB arm (Arm 5 in this case).
    • After several switches, it eventually stabilizes on the new optimal arm.
  3. Discrimination Challenge with c = 0.5: Higher exploration parameters require more time to distinguish between arms with moderate differences.
    • Arm 3 (mean = 0.3) vs. Arm 5 (mean = 0.45)
    • In one run (ucbshift2.jpg), the algorithm incorrectly preferred Arm 3
    • In another run (ucbshift3.jpg), it successfully identified Arm 5 as better
    • This inconsistency highlights the stochastic nature of reward observations

General Conclusions

1. UCB Confidence Bounds Alone Are Insufficient for Rapid Adaptation

Even when arm distributions shift dramatically (0.85 → 0.1), the UCB algorithm with sparse exploration (small c) struggles to:

  • Quickly abandon historically good arms
  • Rapidly converge to new optimal arms

The time required for adaptation depends heavily on:

  • The magnitude of the distribution shift
  • The exploration parameter c
  • The similarity between competing arms’ new means

2. The Exploration-Discrimination Trade-off

Small c values (e.g., 0.25):

  • ✅ Better at detecting fine differences between arms (higher resolution)
  • ❌ Over-exploit historically good arms
  • ❌ Slow to adapt to distribution changes

Large c values (e.g., 0.5-1.0):

  • ✅ More exploration increases probability of trying other arms
  • ❌ May fail to distinguish between arms with small mean differences (especially when means are between 0-1)
  • ❌ Potential to select suboptimal arms

3. Need for Additional Information

In non-stationary environments, relying solely on UCB values may be insufficient. We may need:

  • Change detection mechanisms: Detect when distributions shift
  • Discounted rewards: Give more weight to recent observations
  • Sliding windows: Only consider recent history
  • Contextual information: In V2X scenarios, use vehicle position, velocity, or signal strength as context

Next Steps

In the next blog post, I’ll explore:

  • Gradual distribution changes: What happens when arm means decay progressively rather than shifting suddenly?
  • Different decay rates: How does the speed of change affect UCB performance?

Stay tuned! 🚗📡


Further Reading:

  • Garivier, A., & Moulines, E. (2011). On Upper-Confidence Bound Policies for Switching Bandit Problems. ALT 2011.
  • Besbes, O., Gur, Y., & Zeevi, A. (2014). Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards. NIPS 2014.
  • Russac, Y., Vernade, C., & Cappé, O. (2019). Weighted Linear Bandits for Non-Stationary Environments. NeurIPS 2019.



Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • UCB Under Gradual Distribution Decay
  • UCB Exploration in Sparse Rewarded Bandits
  • MAB: A Introduction
  • SUMO Simulation with Ray Tracing