UCB Under Non-stochastic Enviroment
Introduction
In my previous blog post, we explored how the exploration parameter c in the UCB (Upper Confidence Bound) algorithm affects arm selection behavior and UCB value dynamics in stochastic multi-armed bandit problems. However, in my V2X (Vehicle-to-Infrastructure) research, I encountered a new challenge: non-stationary arms where reward distributions change over time.
This blog explores how the UCB algorithm performs when arm distributions shift suddenly.
Experimental Setup
Initial Configuration:
- 5 arms with reward distributions (Bernoulli):
[0, 0.85, 0.3, 0, 0.45] - Arm 2 is initially the best choice (mean = 0.85)
- Total time steps: varies by experiment (150-400 steps)
- UCB parameter
c: tested with[0.25, 0.5, 1.0]
Experiment 1: Moderate Distribution Shift
Setup:
- Initial means:
[0, 0.85, 0.3, 0, 0.45] - After t=50: Arm 2 shifts to
0.45(tied with Arm 5) - Shifted means:
[0, 0.45, 0.3, 0, 0.45] - Time steps: 150
Key Findings
-
Rapid UCB Value Decline: After the distribution shift, Arm 2’s UCB value drops quickly as it receives lower rewards.
-
Historical Bias Problem: With smaller
cvalues, the algorithm still tends to favor historically high-UCB arms (Arm 2) even after the shift, because:- The accumulated mean reward from early time steps remains high
- The exploration bonus
c√(ln t / n)shrinks as the number of trialsnincreases - Since the new mean (0.45) is close to other arms, the algorithm struggles to differentiate
Experiment 2: Large Distribution Shift
Setup:
- Initial means:
[0, 0.85, 0.3, 0, 0.45] - After t=50: Arm 2 drastically drops to
0.1 - Shifted means:
[0, 0.1, 0.3, 0, 0.45] - Time steps: 400 (longer horizon to observe adaptation)
Key Findings
-
Faster UCB Decline: Arm 2’s UCB value drops more dramatically with the larger shift.
- Parameter-Dependent Adaptation Speed:
- c = 0.25: The algorithm maintains historical preference for a prolonged period. When it finally switches, it shows high sensitivity to the current highest-UCB arm (Arm 5 in this case).
- After several switches, it eventually stabilizes on the new optimal arm.
- Discrimination Challenge with c = 0.5: Higher exploration parameters require more time to distinguish between arms with moderate differences.
- Arm 3 (mean = 0.3) vs. Arm 5 (mean = 0.45)
- In one run (ucbshift2.jpg), the algorithm incorrectly preferred Arm 3
- In another run (ucbshift3.jpg), it successfully identified Arm 5 as better
- This inconsistency highlights the stochastic nature of reward observations
General Conclusions
1. UCB Confidence Bounds Alone Are Insufficient for Rapid Adaptation
Even when arm distributions shift dramatically (0.85 → 0.1), the UCB algorithm with sparse exploration (small c) struggles to:
- Quickly abandon historically good arms
- Rapidly converge to new optimal arms
The time required for adaptation depends heavily on:
- The magnitude of the distribution shift
- The exploration parameter
c - The similarity between competing arms’ new means
2. The Exploration-Discrimination Trade-off
Small c values (e.g., 0.25):
- ✅ Better at detecting fine differences between arms (higher resolution)
- ❌ Over-exploit historically good arms
- ❌ Slow to adapt to distribution changes
Large c values (e.g., 0.5-1.0):
- ✅ More exploration increases probability of trying other arms
- ❌ May fail to distinguish between arms with small mean differences (especially when means are between 0-1)
- ❌ Potential to select suboptimal arms
3. Need for Additional Information
In non-stationary environments, relying solely on UCB values may be insufficient. We may need:
- Change detection mechanisms: Detect when distributions shift
- Discounted rewards: Give more weight to recent observations
- Sliding windows: Only consider recent history
- Contextual information: In V2X scenarios, use vehicle position, velocity, or signal strength as context
Next Steps
In the next blog post, I’ll explore:
- Gradual distribution changes: What happens when arm means decay progressively rather than shifting suddenly?
- Different decay rates: How does the speed of change affect UCB performance?
Stay tuned! 🚗📡
Further Reading:
- Garivier, A., & Moulines, E. (2011). On Upper-Confidence Bound Policies for Switching Bandit Problems. ALT 2011.
- Besbes, O., Gur, Y., & Zeevi, A. (2014). Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards. NIPS 2014.
- Russac, Y., Vernade, C., & Cappé, O. (2019). Weighted Linear Bandits for Non-Stationary Environments. NeurIPS 2019.
Enjoy Reading This Article?
Here are some more articles you might like to read next: