UCB Under Non-stochastic Enviroment

Introduction

In my previous blog post, we explored how the exploration parameter c in the UCB (Upper Confidence Bound) algorithm affects arm selection behavior and UCB value dynamics in stochastic multi-armed bandit problems. However, in my V2X (Vehicle-to-Infrastructure) research, I encountered a new challenge: non-stationary arms where reward distributions change over time.

This blog explores how the UCB algorithm performs when arm distributions shift suddenly.

Experimental Setup

Initial Configuration:

5 arms with reward distributions (Bernoulli): [0, 0.85, 0.3, 0, 0.45]
Arm 2 is initially the best choice (mean = 0.85)
Total time steps: varies by experiment (150-400 steps)
UCB parameter c: tested with [0.25, 0.5, 1.0]

Experiment 1: Moderate Distribution Shift

Setup:

Initial means: [0, 0.85, 0.3, 0, 0.45]
After t=50: Arm 2 shifts to 0.45 (tied with Arm 5)
Shifted means: [0, 0.45, 0.3, 0, 0.45]
Time steps: 150

Baseline: UCB behavior with fixed arm distributions (no shift)

UCB behavior when Arm 2's mean drops from 0.85 to 0.45 at t=50

Key Findings

Rapid UCB Value Decline: After the distribution shift, Arm 2’s UCB value drops quickly as it receives lower rewards.
Historical Bias Problem: With smaller c values, the algorithm still tends to favor historically high-UCB arms (Arm 2) even after the shift, because:
- The accumulated mean reward from early time steps remains high
- The exploration bonus c√(ln t / n) shrinks as the number of trials n increases
- Since the new mean (0.45) is close to other arms, the algorithm struggles to differentiate

Experiment 2: Large Distribution Shift

Setup:

Initial means: [0, 0.85, 0.3, 0, 0.45]
After t=50: Arm 2 drastically drops to 0.1
Shifted means: [0, 0.1, 0.3, 0, 0.45]
Time steps: 400 (longer horizon to observe adaptation)

UCB behavior when Arm 2's mean drops dramatically from 0.85 to 0.1 at t=50 (Run 1)

UCB behavior under the same configuration (Run 2) - showing variability in arm discrimination

Key Findings

Faster UCB Decline: Arm 2’s UCB value drops more dramatically with the larger shift.
Parameter-Dependent Adaptation Speed:
- c = 0.25: The algorithm maintains historical preference for a prolonged period. When it finally switches, it shows high sensitivity to the current highest-UCB arm (Arm 5 in this case).
- After several switches, it eventually stabilizes on the new optimal arm.
Discrimination Challenge with c = 0.5: Higher exploration parameters require more time to distinguish between arms with moderate differences.
- Arm 3 (mean = 0.3) vs. Arm 5 (mean = 0.45)
- In one run (ucbshift2.jpg), the algorithm incorrectly preferred Arm 3
- In another run (ucbshift3.jpg), it successfully identified Arm 5 as better
- This inconsistency highlights the stochastic nature of reward observations

General Conclusions

1. UCB Confidence Bounds Alone Are Insufficient for Rapid Adaptation

Even when arm distributions shift dramatically (0.85 → 0.1), the UCB algorithm with sparse exploration (small c) struggles to:

Quickly abandon historically good arms
Rapidly converge to new optimal arms

The time required for adaptation depends heavily on:

The magnitude of the distribution shift
The exploration parameter c
The similarity between competing arms’ new means

2. The Exploration-Discrimination Trade-off

Small c values (e.g., 0.25):

✅ Better at detecting fine differences between arms (higher resolution)
❌ Over-exploit historically good arms
❌ Slow to adapt to distribution changes

Large c values (e.g., 0.5-1.0):

✅ More exploration increases probability of trying other arms
❌ May fail to distinguish between arms with small mean differences (especially when means are between 0-1)
❌ Potential to select suboptimal arms

3. Need for Additional Information

In non-stationary environments, relying solely on UCB values may be insufficient. We may need:

Change detection mechanisms: Detect when distributions shift
Discounted rewards: Give more weight to recent observations
Sliding windows: Only consider recent history
Contextual information: In V2X scenarios, use vehicle position, velocity, or signal strength as context

Next Steps

In the next blog post, I’ll explore:

Gradual distribution changes: What happens when arm means decay progressively rather than shifting suddenly?
Different decay rates: How does the speed of change affect UCB performance?

Stay tuned! 🚗📡

Further Reading:

Garivier, A., & Moulines, E. (2011). On Upper-Confidence Bound Policies for Switching Bandit Problems. ALT 2011.
Besbes, O., Gur, Y., & Zeevi, A. (2014). Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards. NIPS 2014.
Russac, Y., Vernade, C., & Cappé, O. (2019). Weighted Linear Bandits for Non-Stationary Environments. NeurIPS 2019.

Introduction

Experimental Setup

Experiment 1: Moderate Distribution Shift

Key Findings

Experiment 2: Large Distribution Shift

Key Findings

General Conclusions

1. UCB Confidence Bounds Alone Are Insufficient for Rapid Adaptation

2. The Exploration-Discrimination Trade-off

3. Need for Additional Information

Next Steps

Enjoy Reading This Article?