AI Agent in Minecraft


Create a Figure AI x OpenAI-style demo of a Minecraft agent.

How does the Figure AI demo work?

Corey Lynch from Figure has an explainer post on X.

Figure AI demo

The demo takes advantage of the high-level reasoning abilities of LLMs to choose which learned, closed-loop policies to run on the robot.
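The split between a high-level chooser and low-level policies can be sketched as a simple dispatch table. This is only an illustration: the policy names, the `choose_policy` function, and the keyword matching that stands in for the LLM call are all hypothetical, and the real policies would be learned controllers rather than stubs.

```python
from typing import Callable, Dict


# Hypothetical low-level skills. In a real system each would be a learned,
# closed-loop visuomotor policy; here they are stubs that return a label.
def chop_tree() -> str:
    return "running chop_tree policy"


def walk_forward() -> str:
    return "running walk_forward policy"


POLICIES: Dict[str, Callable[[], str]] = {
    "chop_tree": chop_tree,
    "walk_forward": walk_forward,
}


def choose_policy(instruction: str) -> str:
    """Stand-in for the LLM call: map an instruction to a policy name.

    Keyword matching is an assumption made so the sketch runs offline.
    """
    text = instruction.lower()
    if "tree" in text or "wood" in text:
        return "chop_tree"
    return "walk_forward"


def run(instruction: str) -> str:
    """Pick a policy with the high-level chooser, then execute it."""
    name = choose_policy(instruction)
    return POLICIES[name]()


print(run("Get me some wood"))
```

The point of the structure is that the chooser only ever emits a policy name, so the language model never has to produce motor commands itself.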

The interesting part is the low-level policy control of the robot. The behaviours are driven by visuomotor transformer policies, taking in onboard images at 10 Hz and generating actions at 200 Hz.

This looks very similar to Diffusion Policy, which executes its learned policy at 10 Hz.
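The two rates quoted for the Figure demo imply that each image has to cover 20 action steps (200 / 10). The sketch below just counts ticks to make that ratio concrete; the loop structure is an assumption for illustration and says nothing about how Figure actually schedules inference.

```python
IMAGE_HZ = 10    # observation rate quoted for the demo
ACTION_HZ = 200  # action rate quoted for the demo
ACTIONS_PER_IMAGE = ACTION_HZ // IMAGE_HZ  # 20 actions per observed image


def simulate(seconds: int) -> tuple:
    """Count images consumed and actions emitted over a whole number of seconds."""
    images = 0
    actions = 0
    for tick in range(seconds * ACTION_HZ):
        # A new image arrives once every ACTIONS_PER_IMAGE action ticks.
        if tick % ACTIONS_PER_IMAGE == 0:
            images += 1
        actions += 1
    return images, actions


print(ACTIONS_PER_IMAGE)
print(simulate(1))
```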

How do current Minecraft agents work?

Voyager: An Open-Ended Embodied Agent with Large Language Models creates a skill library of executable code using a Minecraft JavaScript API. This is great, but my rationale for doing this project is to use Minecraft as a sandbox for testing approaches that would work in real life, and unfortunately, we don’t have an API for reality.

However, the Voyager approach is worth keeping in mind because it could be used to create demonstrations for training.

Diffusion Policy in Minecraft

To evaluate the effectiveness of a diffusion policy in Minecraft, I plan to initially train it on the classic task of chopping down trees.

Data collection

200 demonstrations of chopping down trees

Screen resolution

To strike a balance between processing efficiency and detail retention, I will set the Minecraft screen resolution to 240x240. This resolution is sufficiently low to enhance training and inference speeds, while still capturing adequate details of the Minecraft world for the agent to effectively learn policies.
(Perhaps worth analysing the impact of different resolutions).

Minecraft screenshot
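As a starting point for comparing resolutions, the raw input size per frame can be worked out directly. The sketch assumes 3-channel RGB frames, and the candidate resolutions other than 240x240 are arbitrary examples, not settings from this project.

```python
def values_per_frame(width: int, height: int, channels: int = 3) -> int:
    """Number of raw pixel values in one frame (assumes RGB by default)."""
    return width * height * channels


# Candidate resolutions are illustrative; only 240x240 comes from the text.
CANDIDATES = [(64, 64), (128, 128), (240, 240), (480, 480)]
BASELINE = values_per_frame(240, 240)

for w, h in CANDIDATES:
    n = values_per_frame(w, h)
    print(f"{w}x{h}: {n} values per frame, {n / BASELINE:.2f}x the 240x240 input")
```

Input size grows with the square of the side length, so doubling the resolution to 480x480 quadruples the number of values the policy has to process per frame.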

Sample rate

Sample rate is a trade-off between long-horizon planning and fine-grained control. Given a fixed prediction horizon, a higher sample rate allows for more fine-grained control, but at the expense of long-horizon planning. A very high sample rate would allow very precise, fine-grained movement but leave almost no room for long-term planning, because the same number of predicted steps covers a much shorter span of time.
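The trade-off is simple arithmetic: the time covered by a prediction is the number of predicted steps divided by the sample rate. The sketch below uses a horizon of 16 steps purely as an assumed example value; the rates other than 10 Hz are also illustrative.

```python
def horizon_seconds(horizon_steps: int, sample_rate_hz: float) -> float:
    """Wall-clock time covered by a fixed number of predicted steps."""
    return horizon_steps / sample_rate_hz


HORIZON_STEPS = 16  # assumed example value, not taken from this project

for hz in (5, 10, 50, 200):
    covered = horizon_seconds(HORIZON_STEPS, hz)
    step_ms = 1000 / hz
    print(f"{hz:>3} Hz: one step every {step_ms:.0f} ms, horizon covers {covered:.2f} s")
```

With these assumed numbers, 10 Hz gives a 1.6 s look-ahead, while 200 Hz shrinks the same 16 steps to 0.08 s.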

For this reason, I follow the Diffusion Policy paper's sample rate of 10 Hz. This presents some challenges, given that inputs arrive in real time.
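One possible way to reconcile real-time inputs with a fixed 10 Hz rate is sample-and-hold: record input events with timestamps, then read off the most recent state at each 100 ms tick. This is a minimal sketch under that assumption; the `resample` function and the event format are hypothetical, not part of any recording tool used here.

```python
from bisect import bisect_right


def resample(events, rate_hz, duration_s, default=None):
    """Sample-and-hold resampling of timestamped events onto a fixed-rate grid.

    events: list of (timestamp_seconds, value) tuples, sorted by timestamp.
    Returns one value per tick, holding the latest event at or before the tick.
    """
    times = [t for t, _ in events]
    n_ticks = int(round(duration_s * rate_hz))
    samples = []
    for i in range(n_ticks):
        tick_time = i / rate_hz
        # Index of the first event strictly after this tick.
        idx = bisect_right(times, tick_time)
        samples.append(events[idx - 1][1] if idx > 0 else default)
    return samples


# Hypothetical input log: (seconds since start, action state).
events = [(0.0, "idle"), (0.25, "forward"), (0.31, "attack")]
print(resample(events, rate_hz=10, duration_s=0.5))
```

A known limitation of this scheme is that an input pressed and released between two ticks never appears in the samples, which is one of the ways real-time input and a 10 Hz grid can disagree.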