UCB Exploration in Sparse Rewarded Bandits

Introduction

The Multi-Armed Bandit (MAB) problem is a classic exploration-exploitation dilemma where an agent must choose between multiple options (arms) to maximize cumulative reward over time. Each arm pull yields a stochastic reward, and the challenge lies in balancing exploration of uncertain arms with exploitation of known good ones. For a detailed overview of MAB algorithms and their variants, check out my previous post on MAB basics.

A word from Chi: In my recent research on MAB-based user association for V2X communications, I encountered scenarios where reward distributions are highly sparse—many arms yield zero rewards most of the time. This sparked my curiosity: how does the UCB (Upper Confidence Bound) algorithm behave in such sparse environments, and what’s the optimal hyperparameter tuning strategy?

Experiment Setup

Arm Generation

The stochastic bandit environment consists of num_arms arms where:

  • A subset of arms (typically 60%) are active arms that generate rewards from Beta distributions with values in [0,1]
  • The remaining arms are zero arms that always return 0
  • Each active arm follows a fixed Beta distribution with different parameters

UCB Algorithm

The UCB algorithm selects arms based on an upper confidence bound:

\[\text{UCB}_i = \bar{r}_i + c \sqrt{\frac{\log t}{n_i}}\]

where $\bar{r}_i$ is the average reward of arm $i$, $n_i$ is the number of times arm $i$ has been pulled, $t$ is the total number of rounds, and $c$ is the exploration hyperparameter.

Experimental Results

Experiment 1: Similar Expected Rewards with Zero Arms

Setup: 5 arms with distribution [0, 0, 0, 0.85, 0.75]

Observation: We tested UCB with four different hyperparameters: $c \in {2, 1, 0.5, 0.05}$

UCB behavior with lowest-reward-first exploration
UCB behavior with largest-reward-first exploration

Key Findings:

  1. Small $c$ values favor non-zero arms: When $c$ is small (e.g., 0.05, 0.5), the algorithm quickly converges to active arms and rarely revisits zero arms. This behavior is consistent regardless of initial exploration order.

  2. UCB value oscillation decreases over time: For zero-reward arms, the frequency of UCB value “flips” (when their UCB temporarily exceeds that of consistently rewarding arms) decreases as more trials accumulate. The intervals between flips grow exponentially.

  3. Initial exploration order matters slightly: When exploring high-reward arms first (largest index first), initial UCB values are slightly lower, with the effect more pronounced at larger $c$ values. However, the long-term selection frequency of zero arms remains largely unaffected.

Experiment 2: Diverse Expected Rewards without Zero Arms

Setup: 3 arms with distribution [0.45, 0.15, 0.9]

UCB behavior among arms with different level of reward

Key Findings:

  1. Optimal $c$ matches reward range: In the absence of zero arms, setting $c$ close to the maximum possible reward (1.0 for rewards in [0,1]) provides balanced exploration-exploitation trade-off.

  2. Zero arms change the game: When zero arms are present, smaller $c$ values are preferable to minimize regret by avoiding frequent revisits to zero-reward arms.

Conclusions

General Principle: For stochastic bandits with sparse reward distributions (many zero-reward arms), smaller UCB hyperparameters ($c < 1$) significantly improve performance by:

  • Reducing unnecessary exploration of zero-reward arms
  • Accelerating convergence to optimal non-zero arms
  • Minimizing cumulative regret

Practical Recommendation: In sparse reward scenarios, start with $c \in [0.05, 0.5]$ rather than the classical $c = \sqrt{2}$ or $c = 2$.

Code

The full experimental code is available below:

% test_ucb_main.m
% Main script to test UCB algorithm on stochastic bandit with sparse rewards
clear; clc; close all;

%% Parameters
num_arms = 5;
num_steps = 100;
c_values = [2, 1, 0.5, 0.05];

%% Run UCB for each c value
num_c = length(c_values);
figure('Position', [100, 100, 1400, 800]);

for c_idx = 1:num_c
    c = c_values(c_idx);
    fprintf('Running UCB with c = %.2f...\n', c);
    
    arms = 1:num_arms;
    scores = zeros(1, num_arms);
    trials = zeros(1, num_arms);
    sum_rewards = zeros(1, num_arms);
    choice = zeros(num_steps, 1);
    reward = zeros(num_steps, 1);
    ucb_his = zeros(num_steps, num_arms);
    
    for t = 1:num_steps
        [action, ucbV] = ucbTest(arms, scores, trials, c);
        ucb_his(t, :) = ucbV;
        reward(t) = armGen(action, num_arms);
        trials(action) = trials(action) + 1;
        sum_rewards(action) = sum_rewards(action) + reward(t);
        scores(action) = sum_rewards(action) / trials(action);
        choice(t) = action;
    end
    
    % Plot UCB values
    subplot(2, num_c, c_idx);
    hold on;
    for k = 1:num_arms
        plot(1:num_steps, ucb_his(:, k), 'LineWidth', 1.5, ...
            'DisplayName', sprintf('Arm %d', k));
    end
    xlabel('Time Step');
    ylabel('UCB Value');
    ylim([0, max(ucb_his(:)) + 0.2]);
    title(sprintf('UCB Values (c=%.2f)', c));
    legend('Location', 'best');
    grid on;
    hold off;
    
    % Plot arm selection
    subplot(2, num_c, num_c + c_idx);
    plot(1:num_steps, choice, 'o-', 'LineWidth', 1.5, 'MarkerSize', 6);
    xlabel('Time Step');
    ylabel('Selected Arm');
    selection_counts = '';
    for k = 1:num_arms
        count = sum(choice == k);
        selection_counts = [selection_counts sprintf('Arm%d:%d ', k, count)];
    end
    title(sprintf('Arm Selection (c=%.2f)\n%s', c, selection_counts));
    yticks(1:num_arms);
    ylim([0.5, num_arms+0.5]);
    grid on;
end

sgtitle('UCB Algorithm Comparison with Different c Parameters');

function [choice, ucb_values] = ucbTest(arms, scores, trials, c)
    num_arms = length(arms);
    rounds = sum(trials);
    ucb_values = zeros(1, num_arms);
    
    if any(trials == 0)
        untried_arms = find(trials == 0);
        choice = arms(untried_arms(end));  % Explore from largest index
        return;
    end
    
    for i = 1:num_arms
        confidence_bound = c * sqrt(log(rounds) / trials(i));
        ucb_values(i) = scores(i) + confidence_bound;
    end
    
    [~, choice] = max(ucb_values);
end

function reward = armGen(arm_id, num_arms)
    num_active = max(2, round(num_arms * 0.6));
    active_arms = num_arms - num_active + 1 : num_arms;
    
    if ~ismember(arm_id, active_arms)
        reward = 0;
    else
        active_idx = find(active_arms == arm_id);
        if active_idx == 1
            alpha = 5; beta = 2;
        elseif active_idx == 2
            alpha = 3; beta = 3;
        else
            alpha = 2 + active_idx;
            beta = 5 - active_idx;
        end
        reward = betarnd(alpha, beta);
    end
end



Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • UCB Under Gradual Distribution Decay
  • UCB Under Non-stochastic Enviroment
  • MAB: A Introduction
  • SUMO Simulation with Ray Tracing