Researchers at Physical Intelligence, an AI robotics company, have introduced the Hierarchical Interactive Robot (Hi Robot), a system that improves robots' ability to process complex instructions and adapt to real-time feedback. By arranging vision-language models (VLMs) in a hierarchical structure, Hi Robot breaks intricate tasks down into simpler steps, mirroring the fast and slow modes of human reasoning described in Daniel Kahneman's "System 1" and "System 2" framework.
Hi Robot employs a high-level VLM for reasoning and a low-level VLM for execution, enabling more intuitive and efficient task completion. To train the system, the researchers generated synthetic data that pairs robot observations with hypothetical scenarios and human feedback, going beyond traditional methods that rely solely on real-world examples and atomic commands.
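The two-level division of labor can be illustrated with a minimal control-loop sketch. Everything below is an assumption for illustration: the function names, the canned decomposition, and the message formats are hypothetical stand-ins, not Physical Intelligence's actual models or API.

```python
# Hypothetical sketch of a Hi Robot-style two-level loop.
# All names and behaviors here are illustrative assumptions.

def high_level_planner(instruction: str, observation: str) -> list[str]:
    """Stand-in for the high-level VLM ("System 2"):
    decomposes a complex instruction into atomic commands."""
    if instruction == "make me a sandwich":
        return ["pick up bread slice", "add cheese", "place top slice"]
    return [instruction]  # already atomic

def low_level_policy(command: str, observation: str) -> str:
    """Stand-in for the low-level VLM ("System 1"):
    maps one atomic command plus the current observation to an action."""
    return f"execute({command})"

def run(instruction: str, observations: list[str]) -> list[str]:
    """Plan once at the high level, then execute step by step,
    feeding each step the latest observation."""
    steps = high_level_planner(instruction, observations[0])
    return [low_level_policy(cmd, obs)
            for cmd, obs in zip(steps, observations)]

actions = run("make me a sandwich", ["obs0", "obs1", "obs2"])
# → ["execute(pick up bread slice)", "execute(add cheese)",
#    "execute(place top slice)"]
```

A real system would replan when human feedback arrives mid-task; this sketch only shows the plan-then-execute split between the two models.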
The system demonstrated 40% higher instruction-following accuracy than GPT-4o, excelling in real-world tests like clearing tables, making sandwiches, and grocery shopping. Its ability to adapt to real-time corrections and understand contextual user feedback makes it a significant step toward more autonomous and intelligent robotics.
Hi Robot can even "talk to itself," reasoning through modified commands and adjusting its actions accordingly. The researchers plan to refine the system further by integrating the high-level and low-level models for even greater adaptability in future applications.