How can we ground large language models (LLMs) in the surrounding scene for real-world robotic planning?
Our work NLMap-Saycan allows LLMs to see and query objects in the scene, enabling real robot operations unachievable by previous methods.
Link: nlmap-saycan.github.io
1/6
SayCan, a recent work, has shown that affordance functions can be used to allow LLM planners to understand what a robot can do from observed *state*. However, SayCan did not provide scene-scale affordance grounding, and thus cannot reason over what a robot can do in a *scene*.
2/6
To that end, we propose NLMap to address two core problems: 1) how to maintain open-vocabulary scene representations capable of locating arbitrary objects, and 2) how to integrate such representations into long-horizon LLM planners to imbue them with scene understanding.
3/6
NLMap builds a natural-language-queryable scene representation with visual-language models (VLMs). An LLM-based object proposal module infers the involved objects, then queries the representation for object availability and location. An LLM planner then plans conditioned on this information.
4/6
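As a rough illustration of the "query the representation" step: an open-vocabulary scene representation can be searched by embedding similarity. The sketch below is a minimal, hypothetical version assuming object names map to precomputed VLM embeddings (the actual NLMap also stores 3D locations and detection scores); the toy 3-d vectors stand in for real text/image embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical scene representation: object name -> precomputed VLM embedding.
scene = {
    "apple":  [0.9, 0.1, 0.0],
    "sponge": [0.1, 0.8, 0.2],
    "mug":    [0.0, 0.2, 0.9],
}

def query_scene(query_embedding, scene, threshold=0.7):
    """Return (name, similarity) pairs above threshold, best match first."""
    hits = [(name, cosine(query_embedding, emb)) for name, emb in scene.items()]
    return sorted([h for h in hits if h[1] >= threshold], key=lambda h: -h[1])

# A query embedding close to "apple" (stand-in for an encoded query like "a fruit").
print(query_scene([0.85, 0.15, 0.05], scene))
```

An LLM planner can then condition on the returned names and (in the full system) their locations when composing a plan.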
We combine NLMap with SayCan to show new robotic capabilities NLMap enables in a real office kitchen. NLMap frees SayCan from a fixed list of objects, locations, or executable options. We show 35 tasks that cannot be achieved by SayCan alone but are enabled by NLMap.
5/6