Description
Human-in-the-loop sequential decision making, such as Reinforcement Learning from Human Feedback (RLHF) or behavior synthesis, leverages human feedback to the AI system for multiple purposes. This dissertation interprets the use cases of human feedback as realizing individual human preferences, obtaining general common-sense guidance, and conveying domain and dynamics information. While individual preferences are known only to the human in the loop, this dissertation shows that guidance information can be obtained from alternate, automated sources, mitigating the use of humans as a crutch. Specifically, RLHF on tacit tasks such as robot arm manipulation and locomotion suffers from high feedback complexity: a large portion of the human-AI interaction budget is spent by the AI agent on discovering guidance information rather than user preferences, effectively squandering the human's effort. Similarly, for task planning with a human in the loop, a major challenge is acquiring common-sense user preferences over agent behaviors. This dissertation proposes ways of obtaining priors (or background knowledge) that supply the guidance information needed by AI agents, thereby reducing the burden on the human in the loop. For RLHF in tacit tasks, the research minimizes unnecessary interaction by observing that much of the human feedback budget goes toward informing the AI agent about domain structure (such as temporal relationships between states), and that human feedback is typically conditioned on a few key states. The thesis builds guidance priors based on these observations, which provide effective means of reducing the interaction burden on the human in the loop. For symbolic task planning, the research explores the reliability of Large Language Models (LLMs) to act as a preference proxy for common-sense guidance information. Specifically, we investigate their reasoning abilities in preferred-plan detection and their brittleness in operationalizing human advice to generate plans. We extend this argument to the approximate retrieval abilities of LLMs for teasing out desirable domain models, such as task reduction schemas, which are useful in AI planning.
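The abstract does not spell out how a "key-state" guidance prior would interact with preference-based reward learning, so the following is a minimal illustrative sketch, not the dissertation's implementation. It assumes a standard pairwise-preference (Bradley-Terry) reward-learning setup and shows one way a learned importance head could concentrate reward credit on a few states of each trajectory; all class and function names (`KeyStateRewardModel`, `preference_loss`) are hypothetical.

```python
# Minimal sketch (assumption, not the dissertation's method): pairwise
# preference-based reward learning where an importance head acts as a
# "key-state prior", concentrating reward credit on a few states.
import torch
import torch.nn as nn

class KeyStateRewardModel(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        # Per-state reward estimate.
        self.reward = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Importance head: softmax over its scores concentrates return
        # credit on the few states the human feedback was likely keyed to.
        self.importance = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def trajectory_return(self, states: torch.Tensor) -> torch.Tensor:
        # states: (T, state_dim)
        r = self.reward(states).squeeze(-1)                            # (T,)
        w = torch.softmax(self.importance(states).squeeze(-1), dim=0)  # (T,)
        return (w * r).sum()  # credit-weighted scalar return

def preference_loss(model, traj_a, traj_b, a_preferred: bool) -> torch.Tensor:
    # Bradley-Terry likelihood of the human's pairwise preference.
    ra = model.trajectory_return(traj_a)
    rb = model.trajectory_return(traj_b)
    logit = ra - rb if a_preferred else rb - ra
    return -torch.nn.functional.logsigmoid(logit)

# Toy usage: two random length-20 trajectories, one gradient step.
torch.manual_seed(0)
model = KeyStateRewardModel(state_dim=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = preference_loss(model, torch.randn(20, 4), torch.randn(20, 4), True)
opt.zero_grad(); loss.backward(); opt.step()
```

The softmax weighting is the prior in this sketch: rather than spreading reward credit uniformly over a trajectory, it biases learning toward a sparse set of states, mirroring the abstract's observation that human feedback tends to be conditioned on a few key states.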
Details
Title
- Guidance Priors to Reduce Human Feedback Burden in Sequential Decision Making
Contributors
- Verma, Mudit (Author)
- Kambhampati, Subbarao (Thesis advisor)
- Bertsekas, Dimitri (Committee member)
- Srivastava, Siddharth (Committee member)
- Zhang, Yu (Committee member)
- Arizona State University (Publisher)
Date Created
2024
Note
- Partial requirement for: Ph.D., Arizona State University, 2024
- Field of study: Computer Science