Interval Q-Learning: Balancing Deep and Wide Exploration

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Reinforcement learning requires exploration, which leads to repeated execution of sub-optimal actions. Naive exploration techniques address this problem by shifting gradually from exploration to exploitation. This approach amounts to a wide search, resulting in exhaustive exploration and low sample efficiency. More advanced methods explore optimistically, based on an upper-bound estimate of the expected rewards. These methods employ deep search, aiming to reach states not previously visited. Another deep search strategy is found in action-elimination methods, which aim to discover and eliminate sub-optimal actions. Despite the effectiveness of advanced deep search strategies, some problems are better suited to naive exploration. We devise a new method, called Interval Q-Learning, that strikes a balance between wide and deep search. It assigns a small probability to taking sub-optimal actions and combines greedy and optimistic exploration. This allows for fast convergence to a near-optimal policy, followed by exploration around it. We demonstrate the performance of tabular and deep Q-network versions of Interval Q-Learning, showing that it speeds up convergence both in problems that favor wide exploration and in those that favor deep search.
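
The abstract describes the mechanism only at a high level. As a rough illustration of the idea as stated (a greedy value estimate combined with an optimistic one, with a small probability of acting optimistically), here is a minimal tabular sketch in Python. All names (IntervalQLearningSketch, q_up, p_optimistic) and the concrete update rules are assumptions made for illustration; the paper's actual algorithm may differ.

```python
import numpy as np

class IntervalQLearningSketch:
    """Minimal tabular sketch of the idea outlined in the abstract.

    Assumption: the agent keeps a greedy value estimate and an
    optimistic upper-bound estimate, acts greedily most of the time,
    and with a small probability follows the optimistic bound instead.
    Names and update rules here are illustrative, not the paper's.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 p_optimistic=0.05, optimistic_init=10.0):
        self.q = np.zeros((n_states, n_actions))    # greedy estimate
        self.q_up = np.full((n_states, n_actions),  # optimistic upper bound
                            optimistic_init)
        self.alpha = alpha
        self.gamma = gamma
        self.p_optimistic = p_optimistic            # small prob. of deep search

    def act(self, state, rng=np.random):
        # Wide search: exploit the greedy estimate most of the time.
        # Deep search: with small probability, follow the optimistic bound.
        table = self.q_up if rng.random() < self.p_optimistic else self.q
        return int(np.argmax(table[state]))

    def update(self, state, action, reward, next_state):
        # Standard Q-learning backup applied to both tables (illustrative).
        for table in (self.q, self.q_up):
            target = reward + self.gamma * np.max(table[next_state])
            table[state, action] += self.alpha * (target - table[state, action])
```

Read this way, p_optimistic controls the wide/deep balance: near zero the agent behaves like plain greedy Q-learning, while larger values push it toward the optimistic bound and hence deeper exploration of unvisited states.
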
Original language: English
Title of host publication: Adaptive and Learning Agents Workshop
Number of pages: 7
Publication status: Published - 2020
Event: Adaptive and Learning Agents Workshop 2020
Duration: 9 May 2020 – 10 May 2020
https://ala2020.vub.ac.be/

Workshop

Workshop: Adaptive and Learning Agents Workshop 2020
Abbreviated title: ALA 2020
Period: 9/05/20 – 10/05/20
Other: Virtual/online event due to COVID-19
Internet address: https://ala2020.vub.ac.be/

Bibliographical note

Virtual/online event due to COVID-19
