March 4, 2026 Part 1 of 4

FlyingGame Deep Moment: Contract-Driven Execution with Codex

We are recreating a flying prototype we built years ago, but this time from scratch, in C++, with a multiplayer-first focus, a proper plan, and LLM support from day one.

In the old prototype, networking quality was the barrier we could never reliably break through. This week was a deep moment: we hit a multiplayer milestone early, with noticeably better quality than before.

While iterating on FlyingGame, we ran a contract-driven workflow instead of improvising:

Implementation plans for each slice,
ExecutionPlan to track progress and gate items,
Requirements that defined expected behavior,
test_plan_7steps with explicit acceptance criteria.

Milestone Snapshot

For Milestone 1, we validated in PIE with 2 players + server/editor host and logged real net stats in the plan, not just subjective feel. We only moved checklist items after user-confirmed evidence.

Figure: FlyingGame planning and AI-assisted execution flow during contract-driven development. Planning first, then execution: the workflow stack became part of the product loop instead of background bureaucracy.

The best part was catching our own spec gap in real time. We thought the milestone was fully done, but Codex pushed us back to our own acceptance criteria:

"Yes, 50 ms is worse than 20 ms latency."
"You validated at about 18-20 ms, which is a strong baseline, but it is not the same as the explicit Phase 0/1A gate that asks for validation under 50 ms simulated lag. So the remaining checkbox is still correct until you run that specific test condition."

That was exactly the reminder we needed: good numbers are not the same as the right test condition. Keeping that 50 ms gate open prevented a false "done" and kept the workflow honest.
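
The distinction between a good measurement and the right test condition can be made mechanical. The following is a minimal sketch of how such a gate could be encoded. All type and function names are hypothetical and are not taken from our ExecutionPlan or Requirements files, and the baseline run is assumed to have had no injected lag.

```cpp
// Hypothetical sketch: these types are illustrative and are not taken from
// the project's ExecutionPlan or Requirements files.

// The test condition a gate demands, e.g. "validate under 50 ms simulated lag".
struct GateCondition {
    int requiredSimulatedLagMs;
};

// What one validation run used and produced.
struct TestRun {
    int simulatedLagMs;     // lag injected by network emulation (0 = none)
    int measuredLatencyMs;  // latency observed in the logged net stats
    bool criteriaMet;       // acceptance criteria passed in this run
    bool userConfirmed;     // a human confirmed the evidence
};

// A run closes the gate only when it ran under at least the required lag,
// met the acceptance criteria, and was confirmed by the user.
// measuredLatencyMs is recorded as evidence but never decides the gate.
constexpr bool closesGate(const GateCondition& gate, const TestRun& run) {
    return run.simulatedLagMs >= gate.requiredSimulatedLagMs
        && run.criteriaMet
        && run.userConfirmed;
}

// The situation described above: a good ~20 ms baseline with no injected lag
// (assumed here) does not close a gate that asks for 50 ms simulated lag.
constexpr GateCondition kLagGate{50};
constexpr TestRun kBaselineRun{0, 20, true, true};
static_assert(!closesGate(kLagGate, kBaselineRun),
              "a baseline run must not close the 50 ms gate");
```

The design choice worth noting is that the measured latency never enters the decision. Only the condition the run was executed under, the acceptance result, and the user confirmation do.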

Follow-Up Validation (Where the Real Learning Hit)

When we finally ran the strict latency condition, we reported: "set min latency to 50, got some moderate shaking effect of the clients." Codex accepted the test and explained why that behavior appeared with the current bridge path.

"For testing quality, keep two runs:"
"40/60 ms, 1% loss, Clients for realistic jitter profile."
"50/50 ms, 0% loss, Clients as strict deterministic latency check (the one where you saw shaking)."
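
Written down as data, the two runs above are easy to keep side by side in test notes. This is a minimal sketch with hypothetical names. It assumes "40/60 ms" means minimum/maximum emulated latency and that "Clients" is the emulation target. It does not call any engine API.

```cpp
// Hypothetical sketch: names and layout are illustrative only.

enum class EmulationTarget { Server, Clients, Everyone };

struct NetProfile {
    const char* name;
    int minLatencyMs;        // assumed meaning of the first number in "40/60 ms"
    int maxLatencyMs;        // assumed meaning of the second number
    int packetLossPercent;
    EmulationTarget target;
};

// The two runs suggested for testing quality.
constexpr NetProfile kRealisticJitter{"realistic-jitter", 40, 60, 1, EmulationTarget::Clients};
constexpr NetProfile kStrictLatency{"strict-50ms", 50, 50, 0, EmulationTarget::Clients};

// Width of the latency window; 0 means every packet sees the same delay.
constexpr int jitterWindowMs(const NetProfile& p) {
    return p.maxLatencyMs - p.minLatencyMs;
}

// A profile is deterministic when it has no latency spread and no packet loss.
constexpr bool isDeterministic(const NetProfile& p) {
    return jitterWindowMs(p) == 0 && p.packetLossPercent == 0;
}
```

Only the 50/50 profile is deterministic in this sense, which is why it serves as the strict gate check while the 40/60 profile stands in for everyday network conditions.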

Our next feedback was even more direct: steering felt delayed, direction changes took too long, and there was still shake/spin under fixed 50/50. That got logged as a real gate issue, not hand-waved away.

"baseline at ~20ms is good,"
"fixed 50ms exposes steering delay + moderate shake/spin,"
"so the 50ms gate is correctly still open."

Then came the execution decision: yes, go hard on implementation, but in the right order. First stabilize steering and shake at 50 ms with a focused smoothing pass, then continue with Phase 1A class-specific movement depth.
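
To make "smoothing pass" concrete, here is a minimal sketch of one common approach: exponential smoothing of the displayed position toward the latest server-corrected position. This is an assumed technique for illustration, not a description of the pass we implemented, and every name in it is hypothetical.

```cpp
#include <cmath>

// Hypothetical sketch of one possible smoothing pass; the names and the
// exponential-decay approach are assumptions, not the project's code.

struct Vec3 {
    float x, y, z;
};

// Fraction of the remaining error to remove this frame. Using a half-life
// keeps the result independent of frame rate: after halfLifeSeconds, half of
// the error is gone no matter how many frames that took.
// halfLifeSeconds must be greater than zero.
inline float smoothingAlpha(float deltaSeconds, float halfLifeSeconds) {
    return 1.0f - std::exp2(-deltaSeconds / halfLifeSeconds);
}

// Moves the displayed position a fraction `alpha` of the way toward the
// latest server-corrected position instead of snapping to it.
constexpr Vec3 smoothToward(const Vec3& displayed, const Vec3& corrected, float alpha) {
    return Vec3{displayed.x + (corrected.x - displayed.x) * alpha,
                displayed.y + (corrected.y - displayed.y) * alpha,
                displayed.z + (corrected.z - displayed.z) * alpha};
}
```

Called once per frame, this would look like `displayed = smoothToward(displayed, corrected, smoothingAlpha(dt, 0.1f));`, where the 0.1 s half-life is a placeholder value. The trade-off is direct: a longer half-life hides more shake but adds visual delay, so steering delay and shake have to be tuned together.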

What Made It Work

The biggest practical improvement was adding short How To Test steps after each change. That made every patch verifiable in minutes and easy to document in ExecutionPlan and debug notes.

The loop became: change, test, record, advance. A well-rehearsed process matters: the time spent on rules, contracts, and preparation helped us execute cleanly and reach important targets early.

We also keep the publishing side contract-driven: a post is only done once feed.xml is updated with the newest top entry and a matching main <updated> timestamp.
