In Phase 1 I got a single drone flying in Docker on AWS, with Claude controlling it through MCP. That post ended with the milestone of “first containerized flight,” and at the time I was pretty proud of it. Looking back, takeoff was the easy part. A drone that lifts off and hovers on a natural-language command is a fun demo, but it isn’t doing anything useful yet. The value shows up when you can hand it a mission, walk away, and trust it to handle the details.
That’s the work in this post: Phase 2, which gave the drone a real notion of what a mission is, and Phase 2.5, which moved the whole stack onto a photorealistic simulator so the drone could finally see something.
I should also flag that my framing of the project has shifted since the last post. I started this with “infrastructure inspection and search and rescue” as the headline goal, and while those are still good targets, I’ve come to think the more honest framing is that I’m building an AI-native runtime for autonomous aircraft. The mission domain matters less than the control and perception model underneath. The applications will follow from getting that model right.
Phase 2: Mission Intelligence
By the end of Phase 1.5 the drone could be moved around. It couldn’t be tasked. Closing that gap meant introducing three things at once: a structured model of what a mission is, a set of patterns the LLM can run over an area, and a way to track progress without flooding the LLM’s context with raw telemetry.
The work landed in five sub-phases (2A through 2E) over the course of late February. Rather than walking through each one in order, here’s the shape of what shipped:
| Sub-phase | Focus | What it added |
|---|---|---|
| 2A | Mission state | Structured MissionState, Sector, Finding, Decision objects with 6 MCP tools for the lifecycle |
| 2B | Search patterns | Grid (boustrophedon), expanding square, and sector search generators; position-based tracking |
| 2C | Vision hooks | capture_image, analyze_image, get_findings_near shipped as stubs, ready for real cameras |
| 2D | Navigation and safety | Custom waypoint routes, orbit-around-point, geofence polygons, background battery monitor with RTL |
| 2E | Unified flight tracking | FlightActivity with a real lifecycle, persistent telemetry cache, get_drone_activity for live state |
A few of these are worth pulling out.
The mission state model is deliberately small. The LLM doesn’t need a continuous telemetry stream to know what’s happening; it needs to know what the drone has done, what it has found, and what choices it has made. Six fields, not sixty. Most of the per-mission JSON ends up being a few hundred tokens, which means the LLM can carry it in context across many turns without strain.
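To make "six fields, not sixty" concrete, here is a minimal sketch of what a state model of that size could look like. The class names come from the table above; every field name is an assumption for illustration and not the actual droneserver schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Sector:
    sector_id: str            # hypothetical field names throughout
    status: str = "pending"   # e.g. "pending", "active", "complete"


@dataclass
class Finding:
    description: str
    lat: float
    lon: float


@dataclass
class Decision:
    summary: str
    reason: str


@dataclass
class MissionState:
    # Six top-level fields: what the mission is, where it stands,
    # what has been covered, what was found, and what was decided.
    mission_id: str
    objective: str
    status: str = "planned"
    sectors: list[Sector] = field(default_factory=list)
    findings: list[Finding] = field(default_factory=list)
    decisions: list[Decision] = field(default_factory=list)

    def to_json(self) -> str:
        """Compact JSON summary that can be handed back to the LLM."""
        return json.dumps(asdict(self), separators=(",", ":"))
```

The point of a shape like this is that the serialized summary grows with the number of sectors, findings, and decisions, not with flight time, so it stays small no matter how long the drone has been in the air.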
The search patterns themselves are straightforward to generate, but I lost a half day to a subtler issue: MAVSDK exposes two mission APIs (mission and mission_raw), and they don’t share progress state. Once I switched to tracking sector completion by measuring position with haversine distance instead of asking the SDK for progress, things worked. Lesson filed under “don’t trust SDK abstractions to compose just because they look like they should.”
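For reference, position-based tracking of this kind can be sketched in a few lines. This is the standard haversine formula plus a hypothetical `waypoint_reached` helper; the 3-meter acceptance radius is an assumed value, not one taken from the project.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def waypoint_reached(position, waypoint, radius_m=3.0):
    """True once the drone is within radius_m of the waypoint.

    position and waypoint are (lat, lon) tuples in degrees.
    The default radius is an assumption for illustration.
    """
    return haversine_m(*position, *waypoint) <= radius_m
```

Because the check depends only on telemetry position, it gives the same answer regardless of which mission API uploaded the waypoints.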
Shipping vision hooks as stubs in 2C, before any camera existed, turned out to be the most useful sequencing decision in the whole phase. By the time AirSim came online weeks later, the LLM already knew how to call capture_image and analyze_image. The capture layer dropped in behind tools that Claude was already using, with no breaking change at the MCP surface.
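The stub-first pattern can be sketched roughly as follows. Only the tool name `capture_image` comes from the project; the `CameraBackend` protocol, the `StubCamera` class, the result type, and the signature are all hypothetical, chosen to show how a placeholder and a real backend can sit behind the same call.

```python
from dataclasses import dataclass
from typing import Protocol


class CameraBackend(Protocol):
    """Anything that can return encoded image bytes for a named camera."""

    def grab(self, camera: str) -> bytes: ...


class StubCamera:
    """Placeholder backend: no pixels yet, but the same interface."""

    def grab(self, camera: str) -> bytes:
        return b""


@dataclass
class CaptureResult:
    camera: str
    size_bytes: int
    stub: bool  # lets the caller know no real image was produced


def capture_image(backend: CameraBackend, camera: str = "front_center") -> CaptureResult:
    """Tool-level entry point; its shape stays fixed when the backend changes."""
    data = backend.grab(camera)
    return CaptureResult(
        camera=camera,
        size_bytes=len(data),
        stub=isinstance(backend, StubCamera),
    )
```

Swapping in a simulator-backed camera then means writing one new class with a `grab` method, while everything that calls `capture_image` stays as it is.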
The headline result, on February 24, was the first end-to-end autonomous grid search. Seven-pass boustrophedon over a roughly 120-meter square at 30 meters altitude, around five and a half minutes of flight, sectors tracked correctly, clean RTL at the end. From Claude’s perspective, it was a single natural-language instruction; from the drone’s perspective, it was a complete mission.
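For readers who haven't met the term, a boustrophedon pattern is the back-and-forth "lawnmower" path. Here is a minimal generator, assuming a rectangular area in local meters with passes spaced evenly from one edge to the other. The function name and coordinate convention are mine for illustration, not the project's.

```python
def boustrophedon(width_m, height_m, passes):
    """Return (east, north) waypoints in meters, relative to the SW corner.

    Passes run north-south and are spaced evenly across width_m,
    with the first and last pass on the area's edges (an assumption).
    """
    if passes < 2:
        raise ValueError("need at least two passes")
    spacing = width_m / (passes - 1)
    waypoints = []
    for i in range(passes):
        east = i * spacing
        leg = [(east, 0.0), (east, height_m)]
        # Reverse every other pass so the path snakes across the area.
        if i % 2 == 1:
            leg.reverse()
        waypoints.extend(leg)
    return waypoints


if __name__ == "__main__":
    for east, north in boustrophedon(120.0, 120.0, 7):
        print(f"{east:6.1f} E  {north:6.1f} N")
```

Under those assumptions, seven passes over a 120-meter square work out to 14 waypoints with 20 meters between passes. Converting the local offsets to latitude and longitude is a separate step not shown here.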
Phase 2.5: Why Cosys-AirSim
By April it was time to give the drone eyes. Gazebo, which had carried the project through Phase 1, is a serious physics simulator but it isn’t built for photorealistic cameras. For anything resembling a real vision workload I needed a different tool, and Cosys-AirSim was the clear choice for several reasons:
| Capability | Gazebo (Phase 1 stack) | Cosys-AirSim |
|---|---|---|
| Physics | Strong, headless-friendly | Strong, handled in UE5 |
| Photorealistic rendering | No | Yes, via Unreal Engine 5 |
| Multi-camera drone models | Limited | First-class, configurable per drone |
| Environments | Default flat world | Large library of free UE5 community maps |
| PX4 lockstep integration | Native | Native, via TCP lockstep |
| Project status | Active | Active community fork of Microsoft AirSim (which was archived) |
The two things that mattered most for where this project is heading are the rendering quality and the map library. UE5 maps give me a visual playground that looks close enough to reality that vision models trained on real-world data have a chance of being useful, and the free community packages mean I can drop the drone into a city, a forest, a coastline, or an industrial site without modeling any of it myself. That kind of environment variety is exactly what an inspection or search workflow needs to be tested against.
The cost of all this is hardware. Photorealistic rendering needs a GPU, so the stack moved off the t3.large I was using for Phase 1.5 and onto a g5.2xlarge with an NVIDIA A10G. Spot pricing comes in around forty-seven cents an hour, which is cheap enough to leave running during active development and easy enough to stop overnight.
The Difficult Parts
Switching simulators wasn’t drop-in. A few rabbit holes worth naming:
Compose profiles had to be reworked so the Gazebo path and the AirSim path could coexist in the same repository without stepping on each other. Each path needs different containers, different entrypoints, and different network expectations. The fix was to put each one behind its own profile, selected with --profile gazebo or --profile airsim.
AirSim’s lockstep listener on TCP 4560 accepts exactly one connection, which is the PX4 SITL handshake. My readiness probe was opening that socket with nc -z and silently consuming the only allowed connection slot, causing PX4 to fail to connect afterward. The fix was to probe RPC port 41451 instead, which behaves like a normal API endpoint.
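A probe along these lines can be sketched with the standard library alone. The two port numbers are the ones described above; the function names, timeouts, and the guard against the lockstep port are assumptions for illustration.

```python
import socket
import time

RPC_PORT = 41451       # AirSim RPC API port, safe to probe repeatedly
LOCKSTEP_PORT = 4560   # single-connection PX4 handshake, never probe this


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_airsim(host="127.0.0.1", port=RPC_PORT,
                    deadline_s=120.0, interval_s=2.0):
    """Poll the RPC port until it accepts connections or the deadline passes."""
    if port == LOCKSTEP_PORT:
        # Connecting here would use up the one slot PX4 needs.
        raise ValueError("refusing to probe the lockstep port")
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_open(host, port):
            return True
        time.sleep(interval_s)
    return False
```

The explicit guard is there because this mistake is easy to reintroduce: a later "let's also check the other port" change would fail loudly instead of silently breaking the PX4 handshake.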
The cosysairsim Python client runs its own event loop (tornado plus msgpack-rpc) and does not coexist well with asyncio. Calling AirSim RPC from a normal asyncio executor pool worked for a few frames and then deadlocked. Pinning the AirSim client to a single dedicated executor thread, and routing every RPC call through that thread, brought stability from “fails inside a minute” to “runs indefinitely.” This is the kind of bug that’s easy to misdiagnose as a network issue, and the lesson is the same as the SDK one above: don’t assume two async runtimes compose just because each one is fine in isolation.
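The single-thread pinning pattern looks roughly like this. It is a generic sketch and not the droneserver code: `SingleThreadClient` and `client_factory` are hypothetical names, and the wrapper works with any blocking client object.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class SingleThreadClient:
    """Route every call to a blocking client through one dedicated thread."""

    def __init__(self, client_factory):
        # max_workers=1 means every call runs on the same thread, in order.
        self._executor = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix="airsim-rpc"
        )
        # Build the client on that thread too, so any event loop it
        # creates internally lives there. This blocks briefly.
        self._client = self._executor.submit(client_factory).result()

    async def call(self, method_name, *args):
        """Await a blocking client method without leaving the pinned thread."""
        loop = asyncio.get_running_loop()
        method = getattr(self._client, method_name)
        return await loop.run_in_executor(self._executor, method, *args)

    def close(self):
        self._executor.shutdown(wait=True)
```

Usage is `result = await client.call("some_method", arg)` from any coroutine, where `client_factory` is whatever callable constructs the real AirSim client. The key difference from asyncio's default executor is that calls can never land on different worker threads, so the client only ever sees one thread.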
First Vision-Capable Flight
April 1. The drone armed on the runway in the Blocks environment, and for the first time I could pull a real photograph from its cameras through the MCP interface. Front-center and bottom-center, both 1920 by 1080, captured at the launch position before the drone had even moved:
[Images: front-center and bottom-center camera captures from the launch position]
There’s no perception pipeline behind these yet. No object detection, no scene understanding, no findings being logged. That’s the next layer of work. What this milestone confirmed was that the LLM can ask the drone for a photograph of its world, and the world it sees is rich enough to be worth analyzing.
The Stack Today
The architecture has changed shape since Phase 1. Here’s what it looks like now:
┌────────────────────────────────────────────────────────────┐
│                     OPERATOR (MacBook)                     │
│       Claude Code via MCP, over an SSH tunnel to EC2       │
└────────────────────────────┬───────────────────────────────┘
                             │ MCP (HTTP/SSE on :8080)
┌────────────────────────────▼───────────────────────────────┐
│  AWS EC2: g5.2xlarge (NVIDIA A10G GPU)                     │
│                                                            │
│  ┌──────────────────────┐      ┌──────────────────────┐    │
│  │ droneserver          │◀────▶│ PX4 SITL             │    │
│  │ (MCP + mission +     │ MAV- │ (autopilot           │    │
│  │ search patterns +    │ Link │ firmware)            │    │
│  │ vision hooks)        │ UDP  │                      │    │
│  └──────────────────────┘      └──────────┬───────────┘    │
│                                           │ TCP lockstep   │
│                                ┌──────────▼───────────┐    │
│                                │ Cosys-AirSim + UE5   │    │
│                                │ (physics, cameras,   │    │
│                                │ environments)        │    │
│                                └──────────────────────┘    │
│  drone-net (Docker bridge network)                         │
└────────────────────────────────────────────────────────────┘
The MCP surface is the same contract as Phase 1, just with about sixty tools behind it instead of forty. Everything underneath, including the swap from Gazebo to AirSim, happened without the operator having to learn anything new. That interface stability is, increasingly, the thing I find most useful about building the system this way.
Lessons Learned
A handful of things came out of these two phases that I expect will keep being relevant.
Stubbing the integration surface ahead of the hardware was the best sequencing call in the project so far. The vision hooks that shipped as placeholders in Phase 2C absorbed the real camera inputs in Phase 2.5 with no contract changes. When you can lock the interface before you commit to the implementation, the implementation gets much easier to swap.
Two APIs that look like they should compose often don’t. MAVSDK’s mission and mission_raw don’t share state. Cosys-AirSim’s tornado RPC and Python asyncio don’t share an event loop. In both cases the failure mode was the same shape: things work for a while, and then they don’t, and the symptom is several layers away from the actual mismatch. When that pattern shows up, the fix is almost always to stop trying to bridge them and instead isolate one cleanly behind the other.
Photorealism is worth paying for. Gazebo was the right starting point because it ran on cheap hardware and never blocked progress on physics. But the moment cameras became part of the system, the cost of fidelity dropped below the cost of doing without it. The GPU instance pays for itself in how much faster you can iterate on anything visual.
What’s Next
The drone can fly, run a mission, and produce a photograph of where it is. The next stretch of work is about making that into something an operator other than me would actually use, and about closing the gap between what the system can do and what it can do reliably.
There’s a dashboard already partway built that exposes live telemetry and video to a browser, alongside the MCP interface. There’s a second MCP transport so Codex can drive the same stack natively. There’s a control runtime refactor under way to clean up some autopilot-specific assumptions still leaking into the tool layer. The refactor will also separate out the droneserver Python monolith so tools can grow independently and debugging becomes more isolated as the codebase matures.
After that comes the perception loop properly: continuous capture, fast first-pass detection, escalation to a vision model for the interesting frames, and re-tasking based on findings without the operator having to ask for each step. That’s the layer where this stops looking like a flying robot and starts looking like a useful one. Multi-drone coordination is on the roadmap, but it’s far enough out that I don’t want to commit to a timeline yet.
The framing I’m carrying forward is the one I mentioned at the top: the project is most usefully thought of as an AI-native runtime for autonomous aircraft, with the search and inspection applications as the first things that runtime gets pointed at. The runtime is what’s actually being built. The applications come next.
-Jake