Legibility: A Scaling Bottleneck of the Agentic Era

March 11, 2026

The limit of operating agentic systems isn't how fast AI can produce output — it's how fast a human can understand the system well enough to direct it. In the AI era, legibility is the scaling bottleneck, and human comprehension is the scarce resource.

"" means different things in different contexts. In typography, it's how easily individual characters are read. In law, it's how clearly a contract can be interpreted. Here, I mean something specific: can you — or anyone working with you — actually see and understand a complex system well enough to trust it, improve it, and change it without fear?

That's the definition I care about. And for most of my career, I didn't.

I came up through engineering drawn to the math side of code — build a function, test the input, validate the output. Legibility felt like a writing concern. Not my department — which makes it mildly ironic that I'm now writing an essay about it. But as I moved into system design, and especially as AI started handling the code generation I used to do manually, something shifted. The bottleneck moved.

When ChatGPT arrived, I built fast. Lots of experiments, real excitement about what was suddenly possible. A lot of it didn't stick. Systems got opaque. I'd lose the thread. Projects that started with momentum stalled out and eventually got abandoned.

Then in 2025, agentic coding arrived — and the pace didn't just accelerate, it multiplied. One person now has leverage that previously required a large team. I jumped into building an agentic engineering platform with that energy. Built fast again. And hit the same wall, harder.

I had to stop adding features and start distilling. Combined modules, cleaned the domain model, built auto-generated diagrams from a consistent architecture. I came out the other side able to actually see my system again — and gained the confidence to keep pressing forward.

Two things crystallized from that experience:

As a system becomes more powerful, legibility becomes more important.

If a system is powerful and not legible, it will eventually get shut down.

This isn't only a software problem. AI is giving everyone — lawyers, doctors, accountants, anyone building anything — access to agentic systems that can handle work at a scale no human team could before. The complexity that comes with that power is now universal. And so is the question underneath it: how do you manage massive complexity without creating a black box?

That's what this essay is about.


The New Failure Mode

The Shift

The bottleneck has moved. For decades, knowledge work was constrained by thinking and typing speed — you could only produce as fast as you could type. In software, AI now generates code faster than engineers can read it. And this isn't only a software shift — it applies to any digital knowledge work. That gap, between generation speed and human comprehension, is a new failure mode.

The gap between individual and organizational output is collapsing — and it's accelerating toward a point where the distinction barely matters.

Peter Steinberger — creator of OpenClaw — made 29,769 contributions on GitHub in January 2026 alone.1 For context: the world's most prolific solo developer in 2012 peaked at around 7,500 contributions for an entire year.2 Extrapolated over twelve months, Steinberger's pace would produce roughly 360,000 contributions — the output of a team of nearly fifty developers at that 2012 peak. OpenClaw — built by one person with AI — has surpassed React's all-time GitHub star count,3 a milestone Facebook's team took twelve years to reach. His own words: "All the mundane stuff of writing code is automated away, I can move so much faster. But I have to think so much more." The first sentence is the promise. The second is the catch.

The reality is that AI doesn't scale human comprehension. Our working memory holds roughly 7 ± 2 chunks of information at once — Miller's Law, and it hasn't been patched recently.4 You can write code ten times faster. You cannot understand ten times faster. Which means the faster you generate, the faster you outrun your own grasp of what you've built.

And the AI agents working on your behalf face a similar wall. They can only see a fraction of a system at once. The context window of LLMs makes this concrete. Rule of thumb: roughly 10 tokens per line of code. Claude's 200K context window fits about 20,000 lines — roughly 40 files at 500 lines each. Gemini 1.5's 1M context pushes to around 100,000 lines; Google demonstrated this by loading the entire JAX codebase (746K tokens) in a single session.5 Now look at what real production systems contain:

| Codebase | Lines of code | Claude 200K sees | Gemini 1M sees |
| --- | --- | --- | --- |
| Medium SaaS | ~150K LOC | 13% | 67% |
| React | ~593K LOC | 4% | 17% |
| VS Code | ~1.44M LOC | 1.4% | 7% |
| Linux kernel | ~40M LOC | 0.05% | 0.25% |
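Under the stated rule of thumb of roughly 10 tokens per line, these percentages fall out of a few lines of arithmetic. A minimal sketch; the model figures are the rough estimates from the text, not exact specifications:

```python
# Back-of-envelope context coverage, using the essay's rule of thumb of
# roughly 10 tokens per line of code. All figures are estimates.

TOKENS_PER_LINE = 10


def visible_fraction(context_tokens: int, codebase_lines: int) -> float:
    """Fraction of a codebase one context window can hold at once."""
    visible_lines = context_tokens / TOKENS_PER_LINE
    return min(visible_lines / codebase_lines, 1.0)


codebases = {
    "Medium SaaS": 150_000,
    "React": 593_000,
    "VS Code": 1_440_000,
    "Linux kernel": 40_000_000,
}

for name, loc in codebases.items():
    claude = visible_fraction(200_000, loc)    # ~20K visible lines
    gemini = visible_fraction(1_000_000, loc)  # ~100K visible lines
    print(f"{name}: Claude 200K sees {claude:.1%}, Gemini 1M sees {gemini:.1%}")
```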

A fresh agent session on a medium-sized codebase sees somewhere between an eighth and two-thirds of the system — and on anything larger, far less. It doesn't know what it doesn't know. Every new session bootstraps from zero context. Things get duplicated. Dependencies get missed. Features land adjacent to features that already solve the same problem.

Software is abstract — there's no physical structure you can walk through. No way to see the load-bearing walls by looking at it. If you can't make that structure visible, intentionally, you're asking an AI to keep adding floors to a building whose blueprints don't even exist. At machine speed. With agentic orchestration on deck.

The Cascade

Systems don't break all at once — they calcify. And the progression is predictable.

First, basic questions start to go unanswered: What is this actually capable of? What is it actually doing? Then changes become scary — you don't know what you'll break, so you avoid touching anything that isn't on fire. New features get added without a clear picture of what already exists, so things get duplicated, or built next to something that already does the same job. The system grows more complex, but no clearer. Eventually the people working with it stop trusting it. They route around it. They build shadow systems. And then someone in a position of authority asks the one question the system can't answer — what did it do, and why? — and it gets shut down.

That's the cascade. Not a single catastrophic failure, but a slow loss of visibility that ends in a loss of control.

Agentic systems amplify every step of it. The same opacity that takes a human team years to accumulate can build up in weeks when agents are generating output at scale. The speed that makes agentic systems powerful is the same speed that makes illegibility dangerous.

Real-World Carnage

These aren't edge cases.

Amazon built an AI recruiting tool. It learned to penalize resumes that included the word "women's." Engineers found the bias but couldn't fix it — the model's decision-making was opaque. They scrapped the whole system.6

The Dutch government used an algorithmic fraud-detection system for childcare benefits. It falsely accused 26,000 families. Parents lost housing. Children entered state care. The scandal — known as the Toeslagenaffaire — consumed Dutch politics for years and forced the resignation of Prime Minister Rutte's third cabinet.7

Netscape decided their codebase had become too complex to maintain and chose a full rewrite of Navigator from scratch. It took three years. The market moved on. The company never recovered.8

In each case: the system couldn't explain itself. And systems that can't explain themselves don't get fixed — they get shut down, or they take the organization down with them.

These failures happened in systems built and maintained by human teams, over years. Agentic systems can reach the same complexity in weeks. The question isn't whether illegibility becomes a problem — it's how fast.


Four Directions Toward Legibility

Domain + Temporal Legibility

There are two directions toward legibility that I rarely saw prioritized in practice but found enormously valuable: legibility across people, and legibility across time.

The first is about language. When engineers design systems, they tend to model them in the language of the database — tables, records, IDs, foreign keys. That language is precise, but it's invisible to almost everyone outside engineering. A product manager, a customer support lead, a new hire — none of them can look at that structure and recognize the business they work in.

The insight that changed how I build: what if the system spoke the same language as the people it serves? Not "user_transaction_records" — an Order that gets Placed, Fulfilled, Cancelled. Words anyone in the company already uses. When I found Domain-Driven Design,9 it gave this idea a name and a framework: a ubiquitous language10 — a shared vocabulary where the code and the business use the same terms for the same things.

When a system speaks the language of the problem it solves, anyone can read it. A product manager can look at the model and recognize their domain. A new engineer joins and the system already makes intuitive sense. The gap between "what the business needs" and "what the system does" becomes visible — and closeable. I'll call this domain legibility.
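As a sketch of what domain legibility looks like in code (the class and rules here are hypothetical, not from any real system), compare this to a generic table of user_transaction_records:

```python
# Hypothetical sketch of domain legibility: the code's vocabulary matches the
# business's. An Order is Placed, Fulfilled, or Cancelled, using words any
# stakeholder already recognizes, instead of generic rows and status codes.

from dataclasses import dataclass
from enum import Enum


class OrderStatus(Enum):
    PLACED = "placed"
    FULFILLED = "fulfilled"
    CANCELLED = "cancelled"


@dataclass
class Order:
    order_id: str
    status: OrderStatus = OrderStatus.PLACED

    def fulfill(self) -> None:
        # Business rule, stated in business terms: only a placed order
        # can be fulfilled.
        if self.status is not OrderStatus.PLACED:
            raise ValueError(f"cannot fulfill an order that is {self.status.value}")
        self.status = OrderStatus.FULFILLED

    def cancel(self) -> None:
        if self.status is not OrderStatus.PLACED:
            raise ValueError(f"cannot cancel an order that is {self.status.value}")
        self.status = OrderStatus.CANCELLED
```

A product manager can read `fulfill` and `cancel` and check the rules against reality; nothing here requires knowing the storage layer.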

The second direction is about history. Most systems store the current state of things — like a single cell in a spreadsheet that gets overwritten every time something changes. You can see what the system looks like now, but not how it got there. Event sourcing11 flips that: instead of storing the latest state, you store an immutable log of everything that happened. Not just the account balance — every transaction that ever touched it. The approach is borrowed from banking, taken further by Bitcoin's distributed ledger, and applied to software systems by a generation of architects who wanted auditability built in from the start.12 13

The result is that you can always answer: why is the system in this state? Replay from the first event. Travel to any point in time. See not just what the system is, but the complete story of how it became that way. That history is also raw material for AI — a structured record of every decision the system ever made, ready to be learned from.14
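A minimal event-sourcing sketch, using a toy bank account (illustrative only, not a production event store): state is derived by replaying an immutable log, and replaying only a prefix of the log is what makes time travel possible.

```python
# Minimal event-sourcing sketch for a toy bank account. State is never
# overwritten; it is derived by folding over an immutable log of what
# happened.

from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def replay(events, upto=None):
    """Fold the event log into a balance; `upto` replays only a prefix,
    which gives you the state at any past point in time."""
    balance = 0
    for event in events[:upto]:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrawn):
            balance -= event.amount
    return balance


log = [Deposited(100), Withdrawn(30), Deposited(50)]
current_balance = replay(log)            # 120
balance_after_two = replay(log, upto=2)  # 100 - 30 = 70
```

Every answer to "why is the balance 120?" is sitting in the log, in order.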

Combined, domain legibility and temporal legibility mean the system is readable by people, and explainable across time. Both are essential for long-term trust.

Modularity + Low Cognitive Load

The biggest legibility benefit of modularity is simple: it reduces how much you need to hold in your head at once.15

Think about a car. Combustion cycles, fuel injection, battery management, temperature regulation — the driver understands almost none of it, and doesn't need to. The interface is a wheel, a pedal, a key. Clean inputs, predictable outputs. All the complexity is hidden behind that boundary. A single person can confidently operate something extraordinarily complex because they never have to see what's underneath.

That's what good modularity does for a system. Wrap thousands of lines of complexity behind a simple interface. The cognitive load of operating it drops to the size of that interface — not the size of what's inside it. You can understand and change one module without needing to hold the whole system in your head at once.
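The car analogy translates directly into code. A toy sketch (class and methods are hypothetical): the public interface is three operations, and the operator's cognitive load is the size of that surface, not of what sits behind it.

```python
# Toy sketch of complexity hidden behind a small interface. The internals
# stand in for combustion cycles, battery management, and temperature
# regulation: invisible to the driver.

class Car:
    def __init__(self) -> None:
        self._running = False
        self._speed = 0

    def start(self) -> None:
        self._running = True

    def accelerate(self, amount: int) -> int:
        # Clean input, predictable output; the invariant (engine must be
        # running) is enforced at the boundary.
        if not self._running:
            raise RuntimeError("start the car first")
        self._speed += amount
        return self._speed

    def stop(self) -> None:
        self._speed = 0
        self._running = False
```

Three methods to hold in your head, no matter how much complexity accumulates behind them; that asymmetry is the whole point.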

This also has a team dimension. As Conway observed, systems tend to mirror the communication structure of the teams that build them.16 Which means module boundaries aren't just a technical question — they're a team design question. And as agent teams become the norm, it's likely an agent team design question too.17 How you divide the work shapes what gets built.

Anti-Pattern: Over-Documentation

In a failed attempt to improve legibility, I generated docs for every feature of my system. The more I created, the less I wanted to use them. They got out of sync. They became overwhelming. They actually increased cognitive load instead of reducing it. I had created a legibility problem while trying to solve one. Classic.

Now I think of documentation like the interface on a module: its purpose is compression, not coverage. What's the minimum a person needs to understand and operate this? I focus on that, and push hard on quality over quantity. Agents can use these docs too, but humans are the primary audience.

Standardization

Standardization used to feel constraining to me — ordinary, not novel. But it took time to appreciate that ordinary has a superpower: it eliminates the cognitive overhead of decision-making across every system that uses it.

The proof is everywhere. Drive anywhere in your country — same signs, same exits — and you feel at home immediately. The cognitive load of somewhere new drops dramatically. HTTP and TCP are the same idea at internet scale: standards so ubiquitous that every system integrates with them by default, enabling a decentralized global network. Designed once. Scaled to billions.

In software, this compounds in a way that matters enormously for the AI era: understand the standard once, and you can read any system built on it. Not just faster — fundamentally differently. Every new codebase starts at partial comprehension instead of zero. You recognize the shape before you've read a line.

I adopted Vertical Slice Architecture (VSA)18 across all my projects — originally because it lets agent work run in parallel more easily. Each feature is its own self-contained slice through the system, from interface to data, so multiple agents can work on different features simultaneously without stepping on each other. But standardizing on VSA delivered a bonus I didn't expect: because the structure is consistent, I could build tooling that auto-generates diagrams of the system with every change. A visual map of what exists, always current, always in the same format. You can see how the system grows over time — a temporal legibility bonus from a structural choice.
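The tooling payoff can be sketched in a few lines. Slice and layer names here are hypothetical; a real version would discover them by walking the repository. The point is that a consistent structure makes diagram generation mechanical:

```python
# Sketch: with a standardized vertical-slice layout, a system diagram can
# be generated automatically on every change instead of drawn by hand.

def slices_to_mermaid(slices: dict) -> str:
    """Render feature slices (feature name -> ordered layers) as a
    Mermaid flowchart definition."""
    lines = ["flowchart TD"]
    for feature, layers in slices.items():
        # Each slice becomes its own chain of nodes, top to bottom.
        for upper, lower in zip(layers, layers[1:]):
            lines.append(f"    {feature}_{upper} --> {feature}_{lower}")
    return "\n".join(lines)


system = {
    "PlaceOrder": ["api", "handler", "store"],
    "CancelOrder": ["api", "handler", "store"],
}
print(slices_to_mermaid(system))
```

Because every slice has the same shape, the map stays current for free: regenerate it whenever the code changes.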

One standard. All systems. Comprehension that transfers and tooling that scales. That's what standardization buys you — and in the agentic era, it compounds fast.

Visualization

The directions above are about how you design legibility into a system. But there's a parallel question: as systems grow beyond what any diagram can capture on a flat screen, how do you see them at all?

The human role isn't disappearing — it's shifting. From writing low-level code to something different: understanding what was built, navigating it, and deciding what comes next.

The tools for that are still primitive. We read code in text editors — flat files, no spatial sense, no way to feel the shape of something large. The institutional knowledge that used to live in a team — the "why did we build it this way" conversations, the mental models accumulated over years — doesn't compress automatically into a codebase. It has to be designed in. And as systems grow to thousands of services and agents, flat representations reach a comprehension ceiling.

The progression I've worked through: diagram-as-code first — Mermaid, then D3 and React Flow, which add a design dimension Mermaid can't offer. With standardization, these diagrams auto-generate from the codebase and stay current. Then code complexity tooling: hotspot maps, dependency graphs, code smell visualizers. Seeing not just what the system is, but where the problems are.

Then 3D. I built a CodeCity clone19 — a three-dimensional representation of a codebase where files become buildings and directories become city blocks, scaled by size and complexity. Then generated an automated flyover video. You're not reading the system anymore. You're moving through it. The first time I watched a codebase render as a city, I just sat there. I'd been living inside that system for months. I'd never actually seen it before. Large systems that felt overwhelming as flat files suddenly have geography — you can see where the density is, where things cluster, where something looks out of place.

We're early. The tools that help humans see complex systems are just beginning to be built. But when generation is no longer the bottleneck, the constraint shifts from what you can build to what you can understand. The systems that survive the agentic era won't just be the most capable. They'll be the ones humans can actually see.

These four directions are where I've found the most leverage so far. The field is early — I expect better answers to emerge, and I'm actively looking for them.


Legibility Checklist

Here are some questions to pressure-test the legibility of anything you're building.

  1. Can a new person understand this system in one day? If not, your onboarding cost is a scaling bottleneck.
  2. Can you generate an architecture view automatically? If it requires manual upkeep, it's already out of date.
  3. Can you trace why any decision was made? Auditability and traceability are the goal — event sourcing is one way to get there, but any approach that lets you answer why works. If the system can't explain itself, it can't be trusted and is headed for shutdown.
  4. Is complexity packaged behind a simple interface? If not, your complexity ceiling is lower than it needs to be — cognitive load is doing the limiting, not the actual complexity of the problem.
  5. Is there a shared vocabulary between code and the problem it solves? Event modeling is one powerful approach: commands, events, and queries are intuitive enough that any stakeholder can reason about a feature before a line of code is written. Humans think in events naturally. Abstract data models don't come naturally to most people — unless you're a robot. (Are you?)
  6. Can you improve the system with confidence? If legibility is working, you can look at the system, know where a problem lives or where a feature belongs, and make the change without fear. That confidence is what lets you move fast sustainably.
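Checklist item 5 can be made concrete with a toy command/event/query sketch (all names hypothetical): a command records what happened as events, and a query is just a projection over them. Stakeholders can reason about both without ever seeing a data model.

```python
# Toy event-modeling sketch. Commands append events to an immutable log;
# queries are read models projected from that log.

def handle_place_order(order_id: str, events: list) -> list:
    """Command: validate against history, then record what happened."""
    if ("OrderPlaced", order_id) in events:
        raise ValueError("order already placed")
    return events + [("OrderPlaced", order_id)]


def query_open_orders(events: list) -> set:
    """Query: which orders are open right now, derived from the log."""
    open_orders = set()
    for kind, order_id in events:
        if kind == "OrderPlaced":
            open_orders.add(order_id)
        elif kind in ("OrderFulfilled", "OrderCancelled"):
            open_orders.discard(order_id)
    return open_orders
```

Place, fulfill, cancel: the vocabulary of the feature is the vocabulary of the business, before a schema exists.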

Closing

Legibility is how you stay in control of the unprecedented leverage agentic orchestration provides.

A black box that controls a thousand tireless agents isn't powerful — it's dangerous. The question that ends complex systems will be: what did it do, and why? The Dutch government couldn't answer it.7 Amazon couldn't answer it.6 Netscape lost three years trying to undo what they couldn't explain.8 (And that was before agentic orchestration.)

Confidence is the currency. Can you audit it? Can a new person work with it? Can you change it without fear? If the answer to any of those is no, you have a legibility problem — and that problem compounds at mass agentic speed.

LLMs don't work without direction — which means humans aren't being replaced, they're becoming the most important part of the system. The direction layer. That's not a limitation, it's leverage — but whether it's an upward or downward spiral depends on how clearly you can see what you're directing. Right now is the most powerful this leverage has ever been — and the least powerful it will ever be.

Legibility isn't about slowing down. It's about making speed sustainable. In the agentic era, the most important skill isn't generating output — it's maintaining the understanding to direct it.


Footnotes

  1. Peter Steinberger's GitHub contribution graph — 29,769 contributions in January 2026. See also: The Pragmatic Engineer — "The creator of Clawd: I ship code I don't read" (2025).

  2. paulmillr — GitHub worldwide contributor leaderboard (Jan 2012–Jan 2013). TJ Holowaychuk (#1 globally): 7,458 contributions/year.

  3. OpenClaw GitHub repository. Launched November 24, 2025. 270.4k stars as of March 2026 — #12 all-time on GitHub, surpassing React (243.7k). Peter Steinberger, creator (@steipete).

  4. George Miller — "The Magical Number Seven, Plus or Minus Two" (1956). Human working memory holds ~7 ± 2 chunks.

  5. Google — Gemini 1.5 Technical Report (2024). Demonstrated loading JAX codebase at 746,152 tokens in a single session.

  6. Reuters (Jeffrey Dastin) — Amazon scrapped AI recruiting tool that showed bias against women (2018).

  7. Dutch childcare benefits scandal — opaque algorithmic fraud detection falsely accused ~26,000 families; government fell (2020-21).

  8. Joel Spolsky — "Things You Should Never Do, Part I" (2000). Netscape's rewrite cost 3 years and killed the company.

  9. Eric Evans — Domain-Driven Design (2003). Ubiquitous language, bounded contexts.

  10. Martin Fowler — Ubiquitous Language (2006). Shared vocabulary = code reads how business thinks.

  11. Martin Fowler — Event Sourcing. Complete audit log + temporal queries.

  12. Martin Dilger — Understanding Event Sourcing (2023). Accessible introduction to event sourcing concepts and patterns.

  13. Greg Young — CQRS Documents (2010). Events as first-class citizens.

  14. Syntropic137 — agent engineering framework. GitHub organization. Repository not yet public.

  15. John Sweller — Cognitive Load Theory (2011). Design must respect working memory limits.

  16. Melvin E. Conway — "How Do Committees Invent?" Datamation, April 1968. "Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure."

  17. Skelton & Pais — Team Topologies (2019). Software boundaries should align with cognitive load limits. Popularized the "Inverse Conway Maneuver" — design your org to get the architecture you want.

  18. Vertical Slice Architecture — a module structure where features are organized as vertical slices through all layers of the stack rather than horizontal layers. See: jimmybogard.com/vertical-slice-architecture

  19. CodeCity — original concept by Richard Wettel & Michele Lanza, University of Lugano (2008). "CodeCity: 3D Visualization of Large-Scale Software."