UCB Exploration in Sparse Rewarded Bandits
Introduction
The Multi-Armed Bandit (MAB) problem is a classic exploration-exploitation dilemma where an agent must choose between multiple options (arms) to maximize cumulative reward over time. Each arm pull yields a stochastic reward, and the challenge lies in balancing exploration of uncertain arms with exploitation of known good ones. For a detailed overview of MAB algorithms and their variants, check out my previous post on MAB basics.
A word from Chi: In my recent research on MAB-based user association for V2X communications, I encountered scenarios where reward distributions are highly sparse—many arms yield zero rewards most of the time. This sparked my curiosity: how does the UCB (Upper Confidence Bound) algorithm behave in such sparse environments, and what’s the optimal hyperparameter tuning strategy?
Experiment Setup
Arm Generation
The stochastic bandit environment consists of num_arms arms where:
- A subset of arms (typically 60%) are active arms that generate rewards from Beta distributions with values in [0,1]
- The remaining arms are zero arms that always return 0
- Each active arm follows a fixed Beta distribution with different parameters
UCB Algorithm
The UCB algorithm selects arms based on an upper confidence bound:
\[\text{UCB}_i = \bar{r}_i + c \sqrt{\frac{\log t}{n_i}}\]where $\bar{r}_i$ is the average reward of arm $i$, $n_i$ is the number of times arm $i$ has been pulled, $t$ is the total number of rounds, and $c$ is the exploration hyperparameter.
Experimental Results
Experiment 1: Similar Expected Rewards with Zero Arms
Setup: 5 arms with distribution [0, 0, 0, 0.85, 0.75]
Observation: We tested UCB with four different hyperparameters: $c \in {2, 1, 0.5, 0.05}$
Key Findings:
-
Small $c$ values favor non-zero arms: When $c$ is small (e.g., 0.05, 0.5), the algorithm quickly converges to active arms and rarely revisits zero arms. This behavior is consistent regardless of initial exploration order.
-
UCB value oscillation decreases over time: For zero-reward arms, the frequency of UCB value “flips” (when their UCB temporarily exceeds that of consistently rewarding arms) decreases as more trials accumulate. The intervals between flips grow exponentially.
-
Initial exploration order matters slightly: When exploring high-reward arms first (largest index first), initial UCB values are slightly lower, with the effect more pronounced at larger $c$ values. However, the long-term selection frequency of zero arms remains largely unaffected.
Experiment 2: Diverse Expected Rewards without Zero Arms
Setup: 3 arms with distribution [0.45, 0.15, 0.9]
Key Findings:
-
Optimal $c$ matches reward range: In the absence of zero arms, setting $c$ close to the maximum possible reward (1.0 for rewards in [0,1]) provides balanced exploration-exploitation trade-off.
-
Zero arms change the game: When zero arms are present, smaller $c$ values are preferable to minimize regret by avoiding frequent revisits to zero-reward arms.
Conclusions
General Principle: For stochastic bandits with sparse reward distributions (many zero-reward arms), smaller UCB hyperparameters ($c < 1$) significantly improve performance by:
- Reducing unnecessary exploration of zero-reward arms
- Accelerating convergence to optimal non-zero arms
- Minimizing cumulative regret
Practical Recommendation: In sparse reward scenarios, start with $c \in [0.05, 0.5]$ rather than the classical $c = \sqrt{2}$ or $c = 2$.
Code
The full experimental code is available below:
% test_ucb_main.m
% Main script to test UCB algorithm on stochastic bandit with sparse rewards
clear; clc; close all;
%% Parameters
num_arms = 5;
num_steps = 100;
c_values = [2, 1, 0.5, 0.05];
%% Run UCB for each c value
num_c = length(c_values);
figure('Position', [100, 100, 1400, 800]);
for c_idx = 1:num_c
c = c_values(c_idx);
fprintf('Running UCB with c = %.2f...\n', c);
arms = 1:num_arms;
scores = zeros(1, num_arms);
trials = zeros(1, num_arms);
sum_rewards = zeros(1, num_arms);
choice = zeros(num_steps, 1);
reward = zeros(num_steps, 1);
ucb_his = zeros(num_steps, num_arms);
for t = 1:num_steps
[action, ucbV] = ucbTest(arms, scores, trials, c);
ucb_his(t, :) = ucbV;
reward(t) = armGen(action, num_arms);
trials(action) = trials(action) + 1;
sum_rewards(action) = sum_rewards(action) + reward(t);
scores(action) = sum_rewards(action) / trials(action);
choice(t) = action;
end
% Plot UCB values
subplot(2, num_c, c_idx);
hold on;
for k = 1:num_arms
plot(1:num_steps, ucb_his(:, k), 'LineWidth', 1.5, ...
'DisplayName', sprintf('Arm %d', k));
end
xlabel('Time Step');
ylabel('UCB Value');
ylim([0, max(ucb_his(:)) + 0.2]);
title(sprintf('UCB Values (c=%.2f)', c));
legend('Location', 'best');
grid on;
hold off;
% Plot arm selection
subplot(2, num_c, num_c + c_idx);
plot(1:num_steps, choice, 'o-', 'LineWidth', 1.5, 'MarkerSize', 6);
xlabel('Time Step');
ylabel('Selected Arm');
selection_counts = '';
for k = 1:num_arms
count = sum(choice == k);
selection_counts = [selection_counts sprintf('Arm%d:%d ', k, count)];
end
title(sprintf('Arm Selection (c=%.2f)\n%s', c, selection_counts));
yticks(1:num_arms);
ylim([0.5, num_arms+0.5]);
grid on;
end
sgtitle('UCB Algorithm Comparison with Different c Parameters');
function [choice, ucb_values] = ucbTest(arms, scores, trials, c)
num_arms = length(arms);
rounds = sum(trials);
ucb_values = zeros(1, num_arms);
if any(trials == 0)
untried_arms = find(trials == 0);
choice = arms(untried_arms(end)); % Explore from largest index
return;
end
for i = 1:num_arms
confidence_bound = c * sqrt(log(rounds) / trials(i));
ucb_values(i) = scores(i) + confidence_bound;
end
[~, choice] = max(ucb_values);
end
function reward = armGen(arm_id, num_arms)
num_active = max(2, round(num_arms * 0.6));
active_arms = num_arms - num_active + 1 : num_arms;
if ~ismember(arm_id, active_arms)
reward = 0;
else
active_idx = find(active_arms == arm_id);
if active_idx == 1
alpha = 5; beta = 2;
elseif active_idx == 2
alpha = 3; beta = 3;
else
alpha = 2 + active_idx;
beta = 5 - active_idx;
end
reward = betarnd(alpha, beta);
end
end
Enjoy Reading This Article?
Here are some more articles you might like to read next: