What is Physical Intelligence (PI)?
Physical Intelligence is a robotics and AI startup founded in 2024 by Chelsea Finn (Stanford Professor), Sergey Levine, Karol Hausman, Brian Ichter, and Lachy Groom. Headquartered in San Francisco, the company focuses on building general-purpose embodied AI by combining machine learning with real-world interaction, leveraging advances in reinforcement learning, control, and computer vision.

[Paper Review] Pi0, Pi0.5, Pi0-FAST - Tracing the Path of Physical Intelligence (PI)

[Blog] It’s Been a While – Here’s What I’ve Been Up To

1. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption.
2. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP.
3. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%.
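
As a rough illustration of the "fuses pretrained features" step in item 2, one common way to combine two visual encoders is to concatenate their per-patch features and project the result into the language model's embedding space. The sketch below assumes that scheme. The function name, the toy dimensions, and the random arrays standing in for DINOv2 and SigLIP outputs are all hypothetical and are not taken from the OpenVLA code.

```python
import numpy as np


def fuse_patch_features(feats_a, feats_b, proj):
    """Concatenate per-patch features from two encoders, then project them.

    feats_a: (patches, dim_a), feats_b: (patches, dim_b),
    proj: (dim_a + dim_b, dim_out). All shapes are illustrative assumptions.
    """
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both encoders must yield the same number of patches")
    # Channel-wise concatenation: each patch keeps features from both encoders.
    fused = np.concatenate([feats_a, feats_b], axis=-1)
    # Linear projection into the (hypothetical) language-model embedding size.
    return fused @ proj


rng = np.random.default_rng(0)
num_patches, dim_a, dim_b, dim_out = 16, 8, 12, 10  # toy sizes, not the real model's
feats_a = rng.normal(size=(num_patches, dim_a))   # stand-in for DINOv2 patch features
feats_b = rng.normal(size=(num_patches, dim_b))   # stand-in for SigLIP patch features
proj = rng.normal(size=(dim_a + dim_b, dim_out))  # stand-in for a learned projector
tokens = fuse_patch_features(feats_a, feats_b, proj)
print(tokens.shape)
```

In a trained model the projector would be learned, and the resulting visual tokens would be fed to the language model alongside the text tokens.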

[Paper Review] OpenVLA: An Open-Source Vision-Language-Action Model

• An elusive goal in navigation research is to build an intelligent agent that can understand multimodal instructions including natural language and image, and perform useful navigation.
• To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video.
• We evaluated Mobility VLA in an 836 $m^2$ real-world environment and show that Mobility VLA has high end-to-end success rates on previously unsolved multimodal instructions such as “Where should I return this?” while holding a plastic bin.

[Paper Review] Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

The document provides an overview of vision-language modeling (VLM) training strategies, discussing when to use contrastive models like CLIP, masking techniques, generative models, and pretrained backbones. It emphasizes the importance of grounding and alignment in VLMs, detailing methods such as instruction tuning and reinforcement learning from human feedback (RLHF). Additionally, it highlights advancements in models like LLaVA and its variants, which incorporate multimodal instruction tuning and improve performance on various benchmarks. Finally, it addresses parameter-efficient fine-tuning methods to adapt large-scale models for specific tasks while managing computational costs.

(2) An Introduction to Vision-Language Modeling: A Guide to VLM Training

The document discusses Vision-Language Models (VLMs), highlighting their role in solving rate-distortion problems by optimizing predictive information and constraining conditional densities. It covers various approaches, including generative-based VLMs that generate text and images, and examples like CoCa and CM3Leon which utilize multimodal generative techniques. The document also explores the use of pretrained backbones in VLMs, emphasizing models like MiniGPT and BLIP2 that efficiently integrate visual and textual data for various tasks, showcasing advancements in multimodal understanding and generation capabilities.

(1) An Introduction to Vision-Language Modeling: The Families of VLMs

• Adapting driving behavior to new environments, customs, and laws is a long-standing problem in autonomous driving, precluding the widespread deployment of autonomous vehicles (AVs).
• LLaDA achieves this by leveraging the impressive zero-shot generalizability of large language models (LLMs) in interpreting the traffic rules in the local driver handbook.
• We also demonstrate LLaDA’s ability to adapt AV motion planning policies in real-world datasets; LLaDA outperforms baseline planning approaches on all our metrics.

[Paper Review] Driving Everywhere with Large Language Model Policy Adaptation

Today I will review a paper that was released yesterday. The research comes from the team of Sergey Levine, a prominent figure in AI and RL. They propose fine-tuning Vision-Language Models (VLMs) with Reinforcement Learning (RL) to improve decision-making in multi-step interactive environments. The paper presents a simple approach that outperforms both GPT-4 and Gemini. Because this research is close to my own ideas for solving challenges in embodied AI, I will review the paper and organize its key concepts.

(3) [Paper Review] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning: Embodied AI

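1. Enable agents to follow arbitrary language instructions and carry out complex tasks by grounding those instructions in concrete actions.
2. Train a Scalable, Instructable, Multiworld agent that can do anything a human can do in simulated 3D environments. language + observation → keyboard-and-mouse
3. Describe the motivation and goals, the initial progress, and preliminary results across several research environments and commercial video games.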

[Paper Review] Scaling Instructable Agents Across Many Simulated Worlds

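This time I will review 4D generation (3D generation + motion). The paper comes from Nvidia and aims to perform 4D generation by combining several models. There still seems to be plenty of room for improvement, but 4D is drawing attention as a new research direction, and I think it is an ideal topic to work on.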

(3) Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models : 3D Generation

This time, following Q-Transformer, I will review RT-2, the Embodied AI model released by DeepMind. Whereas RT-1 and Q-Transformer trained Transformers on robot data alone (via imitation learning and offline RL, respectively), this work takes a vision-language model trained at Internet scale and adds robot data to build a model that generalizes better.

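If you have used GPT, you already know that its image-based reasoning is remarkably capable, and you can imagine using it to build a real robot. This paper tests and verifies that idea through direct experiments. Reading it convinced me that better-performing Embodied AI will be developed.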

(2) [Paper Review] RT-2, Vision-Language-Action Models Transfer Web Knowledge to Robotic Control: Embodied AI

In this post I will review Zero123, one of the 3D generation models. Zero123 starts from a very simple idea: fine-tune a diffusion model to generate images conditioned on camera angle, and use it to perform 3D generation.

(2) [Paper Review] Zero123: 3D Generation

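In this post I will review DreamFusion, which can be seen as the foundation of many current 3D generation models. The paper is joint work by Google Research and Berkeley, and it drew attention for showing that a 2D diffusion model can be used to train a NeRF and thereby build a 3D generation model. Today a variety of follow-up work uses video priors, combined 2D and 3D priors, and so on. Because it is a paper one needs to know before reviewing other 3D generation models, I will go through it in detail.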

(1) [Paper Review] DreamFusion: Text-To-3D Using 2D Diffusion - 3D generation

In this post I will take a closer look at Q-Transformer, a paper I described briefly in the previous post. To follow this post, I strongly recommend reading the offline RL part of that earlier post first.

(1) [Paper Review] Q-transformer : Embodied AI

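Following the previous post, this time I will look at alignment in image generation models. Competition in this area is fierce, and many papers are being published. In this post I will review the work in detail, from the first attempt, the Aligning Text-to-Image paper, through DPOK and Diffusion DPO. The remaining studies will be summarized briefly.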

(2) Text-to-Image Diffusion Model, Alignment in Deep Learning : Comprehensive summary

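LLMs and image generation models have driven an explosive rise in interest in generative AI. Models such as ChatGPT and Midjourney are already bringing about changes across the economy and society. However, before a generative model can be used in a service, it goes through several training stages, and this is because of the characteristics of generative models themselves. In this post I summarize alignment, which is an essential part of that training process.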

(1) RLHF LLM, Alignment in Deep Learning: Comprehensive Summary

I have written up my impressions after reading Dale Carnegie's How to Win Friends and Influence People.

After Reading Dale Carnegie's How to Win Friends and Influence People

This post already covers the final topic. Only two talks remained, and both were about how to apply reinforcement learning to games and commercialize it.

(4) Commercializing Reinforcement Learning in Games - [Modulabs] "What Matters Is an Unbreakable RL" Seminar Review

How to get your blog indexed efficiently through Google Search Console.

Striving for Visibility: My Journey to Get My Blog Indexed on Google

This talk gave an overall explanation of pretraining in reinforcement learning. It covered the representative algorithms with easy-to-follow explanations, which I found very helpful.

(3) Pretraining for intelligent robot - [Modulabs] "What Matters Is an Unbreakable RL" Seminar Review

I have summarized the talks on Causal RL and multi-environment RL.

(2) Causal RL, Multi environment RL - [Modulabs] "What Matters Is an Unbreakable RL" Seminar Review

I attended the Modulabs reinforcement learning seminar.