Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Thiago D. Simão; Matthijs T.J. Spaan

doi:10.1609/aaai.v33i01.33014967

Algorithmics

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

19 Citations (Scopus)

Abstract

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
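The contrast between "number of parameters of the factored representation" and "number of states" can be made concrete with a small counting sketch. This is an illustration only, not the paper's bound or algorithm: it assumes a tabular transition model for the flat case and a dynamic-Bayesian-network-style model in which each state variable depends on at most `max_parents` variables. The function names and the example sizes (10 binary variables, 4 actions, 3 parents) are hypothetical.

```python
def flat_parameter_count(num_states: int, num_actions: int) -> int:
    """Free parameters of a tabular transition model P(s' | s, a)."""
    # Each (state, action) pair has a distribution over num_states next
    # states, which has num_states - 1 free parameters.
    return num_states * num_actions * (num_states - 1)


def factored_parameter_count(
    num_variables: int,
    num_actions: int,
    max_parents: int,
    domain_size: int = 2,
) -> int:
    """Upper bound on free parameters of a factored transition model.

    Assumes (hypothetically) that every state variable has exactly
    max_parents parents, each with domain_size possible values.
    """
    parent_configs = domain_size ** max_parents
    # One conditional distribution per variable, action, and parent
    # configuration, each with domain_size - 1 free parameters.
    return num_variables * num_actions * parent_configs * (domain_size - 1)


# Hypothetical example: 10 binary state variables, so 2**10 flat states.
n_vars = 10
flat = flat_parameter_count(2 ** n_vars, num_actions=4)
factored = factored_parameter_count(n_vars, num_actions=4, max_parents=3)
print(flat, factored)
```

Under these assumptions the flat model grows with the square of the number of states, while the factored model grows linearly in the number of state variables, which is why fewer samples may suffice to estimate it.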

Original language	English
Title of host publication	33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019
Publisher	American Association for Artificial Intelligence (AAAI)
Pages	4967-4974
Number of pages	8
ISBN (Electronic)	9781577358091
DOIs	https://doi.org/10.1609/aaai.v33i01.33014967
Publication status	Published - 2019
Event	The 33rd AAAI Conference on Artificial Intelligence - Honolulu, United States. Duration: 27 Jan 2019 → 1 Feb 2019. Conference number: 33

Publication series

Name	33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019

Conference

Conference	The 33rd AAAI Conference on Artificial Intelligence
Country/Territory	United States
City	Honolulu
Period	27/01/19 → 1/02/19

Access to Document

10.1609/aaai.v33i01.33014967

Cite this

Simão, T. D., & Spaan, M. T. J. (2019). Safe Policy Improvement with Baseline Bootstrapping in Factored Environments. In 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019 (pp. 4967-4974). (33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019). American Association for Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v33i01.33014967

Simão, Thiago D. ; Spaan, Matthijs T.J. / Safe Policy Improvement with Baseline Bootstrapping in Factored Environments. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. American Association for Artificial Intelligence (AAAI), 2019. pp. 4967-4974 (33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019).

@inproceedings{96594fabac98411db1a36b07c88bc563,

title = "Safe Policy Improvement with Baseline Bootstrapping in Factored Environments",

abstract = "We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent{\textquoteright}s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.",

author = "Sim{\~a}o, {Thiago D.} and Spaan, {Matthijs T.J.}",

year = "2019",

doi = "10.1609/aaai.v33i01.33014967",

language = "English",

series = "33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019",

publisher = "American Association for Artificial Intelligence (AAAI)",

pages = "4967--4974",

booktitle = "33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019",

address = "United States",

note = "The 33rd AAAI Conference on Artificial Intelligence ; Conference date: 27-01-2019 Through 01-02-2019",

}

Simão, TD & Spaan, MTJ 2019, Safe Policy Improvement with Baseline Bootstrapping in Factored Environments. in 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, American Association for Artificial Intelligence (AAAI), pp. 4967-4974, The 33rd AAAI Conference on Artificial Intelligence, Honolulu, United States, 27/01/19. https://doi.org/10.1609/aaai.v33i01.33014967

Safe Policy Improvement with Baseline Bootstrapping in Factored Environments. / Simão, Thiago D.; Spaan, Matthijs T.J.
33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. American Association for Artificial Intelligence (AAAI), 2019. p. 4967-4974 (33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019).

TY - GEN

T1 - Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

AU - Simão, Thiago D.

AU - Spaan, Matthijs T.J.

N1 - Conference code: 33rd

PY - 2019

Y1 - 2019

AB - We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.

UR - http://www.scopus.com/inward/record.url?scp=85074920659&partnerID=8YFLogxK

U2 - 10.1609/aaai.v33i01.33014967

DO - 10.1609/aaai.v33i01.33014967

M3 - Conference contribution

T3 - 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019

SP - 4967

EP - 4974

BT - 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019

PB - American Association for Artificial Intelligence (AAAI)

T2 - The 33rd AAAI Conference on Artificial Intelligence

Y2 - 27 January 2019 through 1 February 2019

ER -

Simão TD, Spaan MTJ. Safe Policy Improvement with Baseline Bootstrapping in Factored Environments. In 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019. American Association for Artificial Intelligence (AAAI). 2019. p. 4967-4974. (33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019). doi: 10.1609/aaai.v33i01.33014967
