Efficient Contextual Semi-Bandit Learning

20 February 2015
Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík
Abstract

We study a variant of the contextual bandit problem, where on each round the learner plays a sequence of actions, receives a feature for each individual action, and obtains a reward that is linearly related to these features. This setting has applications to network routing, crowd-sourcing, personalized search, and many other domains. If the linear transformation is known, we analyze an algorithm that is structurally similar to the algorithm of Agarwal et al. [2014] and show that it enjoys a regret bound between $\tilde{O}(\sqrt{KLT \ln N})$ and $\tilde{O}(L\sqrt{KT \ln N})$, where $K$ is the number of actions, $L$ is the length of each action sequence, $T$ is the number of rounds, and $N$ is the number of policies. If the linear transformation is unknown, we show that an algorithm that first explores to learn the unknown weights via linear regression and thereafter uses the estimated weights can achieve $\tilde{O}(\|w\|_1 (KT)^{3/4} \sqrt{\ln N})$ regret, where $w$ is the true (unknown) weight vector. Both algorithms use an optimization oracle to avoid explicit enumeration of the policies and are consequently computationally efficient whenever an efficient algorithm for the fully supervised setting is available.
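To make the unknown-weights case concrete, here is a minimal, self-contained Python sketch of the explore-then-exploit idea: play random action sequences, fit the weight vector by linear regression, then act greedily under the estimate. It is not the paper's algorithm; the dimensions, the synthetic environment, and the greedy exploitation step (which assumes per-action features are visible at decision time, whereas the paper selects among a policy class via an optimization oracle) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, L = 5, 10, 3           # feature dimension, number of actions, slate length
T_explore = 200              # hypothetical length of the exploration phase

w_true = rng.normal(size=d)  # the unknown weight vector (synthetic here)

# Exploration: play uniformly random action sequences and record the
# aggregated per-action features together with the observed reward.
X, y = [], []
for _ in range(T_explore):
    feats = rng.normal(size=(K, d))                  # per-action features this round
    slate = rng.choice(K, size=L, replace=False)     # random length-L action sequence
    phi = feats[slate].sum(axis=0)                   # summed features of played actions
    X.append(phi)
    y.append(phi @ w_true + rng.normal(scale=0.1))   # noisy linear semi-bandit reward

# Estimate w by ordinary least squares on the exploration data.
w_hat, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)

# Exploitation: act greedily under the estimate. For simplicity we score
# actions directly by their features; the paper instead optimizes over
# policies with an oracle for the fully supervised setting.
feats = rng.normal(size=(K, d))
greedy_slate = np.argsort(feats @ w_hat)[::-1][:L]
print("w_hat ~", np.round(w_hat, 2), "| greedy slate:", greedy_slate)
```

The fixed split between exploration and exploitation is what drives the $(KT)^{3/4}$ rate in the bound above: longer exploration improves the estimate of $w$ but forgoes reward, and balancing the two costs yields the $3/4$ exponent rather than the $\sqrt{T}$ rate available when the transformation is known.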
