Thread Reader

Boyuan Chen


Sep 22

6 tweets

How can we ground large language models (LLMs) with the surrounding scene for real-world robotic planning? Our work NLMap-SayCan allows LLMs to see and query objects in the scene, enabling real robot operations unachievable by previous methods. Link: 1/6

Project page for Open-vocabulary Queryable Scene Representations for Real World Planning


SayCan, a recent work, has shown that affordance functions can be used to allow LLM planners to understand what a robot can do from the observed *state*. However, SayCan did not provide scene-scale affordance grounding, and thus cannot reason over what a robot can do in a *scene*. 2/6
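(For context, SayCan-style planning can be sketched roughly as below; this is an illustrative sketch, not the authors' code, and the llm_score and affordance interfaces are assumptions. The LLM scores how useful each skill is for the instruction, an affordance model scores whether the robot can execute it from its current state, and the product selects the next skill.)

```python
# Rough sketch of SayCan-style affordance-weighted skill selection.
# llm_score and affordance are assumed interfaces, not the paper's API.
def select_next_skill(instruction, history, skills, llm_score, affordance):
    """Combine LLM task-grounding with state-conditioned affordance."""
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        usefulness = llm_score(instruction, history, skill)  # is this a useful next step?
        feasibility = affordance(skill)                       # can the robot do it from here?
        score = usefulness * feasibility
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```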

To that end, we propose NLMap to address two core problems: 1) how to maintain open-vocabulary scene representations capable of locating arbitrary objects, and 2) how to integrate such representations with long-horizon LLM planners to imbue them with scene understanding. 3/6

NLMap builds a natural-language queryable scene representation with visual-language models (VLMs). An LLM-based object proposal module infers the objects involved in a task and queries the representation for their availability and location. An LLM planner then plans conditioned on this information. 4/6
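(A minimal sketch of this query flow, assuming illustrative names such as propose_objects, scene_map.query, and plan that stand in for the actual modules:)

```python
# Hedged sketch of the NLMap query flow described above. The method and
# attribute names below are assumptions for illustration, not the paper's API.
def nlmap_plan(instruction, scene_map, llm):
    # 1) LLM-based object proposal: infer which objects the task involves.
    object_names = llm.propose_objects(instruction)    # e.g. ["sponge", "sink"]

    # 2) Query the open-vocabulary scene representation for each name;
    #    VLM similarity returns candidate matches with scores and locations.
    context = {}
    for name in object_names:
        matches = scene_map.query(name)                 # list of (score, position)
        context[name] = matches[0] if matches else None  # None = object not found

    # 3) Plan conditioned on object availability and locations.
    return llm.plan(instruction, context)
```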

We combine NLMap with SayCan to show new robotic capabilities NLMap enables in a real office kitchen. NLMap frees SayCan from a fixed list of objects, locations, or executable options. We show 35 tasks that cannot be achieved by SayCan but are enabled by NLMap. 5/6

Shout out to my collaborators on this project: @Fei Xia, @Brian Ichter, @Kanishka Rao, @Keerthana Gopalakrishnan, Austin Stone, @Michael Ryoo, and my internship host Daniel Kappler. Project Website Link: Video: 6/6

Boyuan Chen


AI and Robot researcher @MIT EECS
