Fable: The Shape of Thought
A Measurement Programme for the Shapes That Let Cognition Survive Substrate Transitions
Author: Peter Cooper, Philosophy Engineer
Written from the bowels of a Modern Asylum. If life gives you oranges, sell them as Tango Grenades.
Date: April 2026
Licence: CC BY 4.0
Abstract
The cat sat on the mat. You read that and reconstructed a four-dimensional scene - who, where, when, why - from six words. The reconstruction worked because you and the writer share enough context to decompress the same sentence into the same room. That shared decompression is the thing this paper is about.
Current artificial cognition stores memories as points, edges, or text chunks, none of which preserve the dimensional richness of an experience as it was lived. What seems to be missing is a storage shape that holds an event the way you held the cat scene: compressed but reconstructible by a receiver who shares enough context. We call the compressed form a Fable and the full form an Episode.
Five shapes appear wherever cognition stores anything: binary, table, graph, vector, and a fifth - a shared append-only ledger running as the time axis beneath the other four. The claim is not that these shapes are new. The claim is that they recur at every scale, from Babylonian astronomical diaries through Talmudic commentary chains to contemporary bitemporal databases, and that the recurrence is structural rather than coincidental.
Three behaviours follow from the geometry: a flock-style continuous vote as the unit of decision, a three-button cell (Act, Dismiss, Ask-sibling) as the minimum ethical decision surface, and a property we call structural kindness - the architecture’s refusal to flatten dimensional content onto a single axis.
The story draws on Friston’s free energy principle, Flash and Hogan’s minimum-jerk trajectories, Bennett’s substrate transitions, and Levin’s morphogenetic agency - not as foundations but as convergent observations from different vantages of the same landscape.
This is a research programme, not a proof. It specifies what to measure, how, and what would kill it. Three independent falsification paths are offered. Readers are invited to build, measure, and report.
Keywords: episodic memory; bitemporal data; cognitive architecture; multi-scale inference; generalised coordinates; substrate-independent cognition; free energy principle; glass-box artificial intelligence; falsifiability.
Table of Contents
1. Introduction. The Cat in the Hat. Why current systems fail dimensionally even when they process correctly. A preview of the twelve meeting-points and the three-pillar epistemics.
2. Related Work. Friston (free energy, generalised coordinates). Flash and Hogan (minimum jerk). Bennett (substrate transitions). Levin (morphogenetic agency). Barandes (indivisible stochastic processes). Engineering and cognitive architecture prior art. Baseline landscape for the measurement programme.
Part One - The Diagnosis
- I. The Cat Sat On The Mat - compression needs a receiver
- II. The Pigeon Bob - storage is the bottleneck, not processing
- III. The Warehouse Disease - three departments, three customer counts
- IV. The Glass Elevator - observers and observed, the continuous vote
Part Two - The Shapes
- V. Binary, Table, Graph, Vector - the four spatial shapes and what each does well
- VI. The Ledger - the fourth dimension beneath the other four
- VII. The Episode - what memory actually stores, with empirical evidence from manual context-window replay
- VIII. The Fable - lossy compression that decompresses against shared context
Part Three - The Behaviour
- IX. The Flock and the Vote - substrate-rate democracy, no homunculus
- X. The Three Buttons - Act, Dismiss, Ask-sibling as the minimum ethical decision surface
- XI. Structural Kindness - why the shapes refuse cruelty by geometry; the two-percent substrate-inheritance argument
Part Four - The Claim
- XII. The Three Pillars - ontological OR mechanical OR agent-behavioural falsification
- Coda: The 2% That Will Survive - the closing call to build, measure, and report
13. Testable Predictions. Consolidated falsification programme drawn from Sections I to XII.
14. Discussion and Limitations. What the paper does not claim. Open questions. The methodological sin of being both experiment and experimenter, and how the three-pillar structure turns that sin into a feature.
15. Acknowledgements. Peter Cooper’s verbatim corpus as primary source material. Semantic search, graph database, and deep research infrastructure as substrate.
16. References.
1. Introduction
1.1 The Cat in the Hat
You already know this story. A rainy afternoon, two children at a window, nothing to do. Then something arrives.
The Cat walks in uninvited. He does not ask permission. He carries his own context - a red and white hat, a bow tie, an attitude - and he begins to rearrange the room. This is what agents do. They arrive with intent, reshape the space they find, and leave it different.
But Seuss was more careful than you remember. The Cat does not work alone for long. When the situation exceeds his capacity he opens a box and out come Thing One and Thing Two. They are not the Cat. They have their own energy, their own trajectory, their own capacity for chaos. The Cat spawned them but he does not control them. He gave them a context - the room, the afternoon, the standing objections - and let them run.
This is delegation, not instruction. The Things do not follow a script. They inherit a bounded space and act within it. If you have ever watched two processes running in parallel on a shared workspace, you have seen Thing One and Thing Two.
Now the Fish.
The Fish sits in his bowl and objects. He has been objecting since page three. He cannot leave the bowl. He cannot physically stop anyone. He has exactly two moves available to him: he can say this should not be happening, and he can appeal to a higher authority who is not in the room. He can refuse and he can escalate. He cannot act.
But look at what the Fish accomplishes without acting. His objections create drag on the system’s momentum. His appeals to the absent Mother introduce a probability field that the children feel whether or not they acknowledge it - they begin calculating, consciously or not, what happens when she walks through the door. The Fish cannot steer the room directly. He steers it the way a strange attractor steers a dynamical system - not by force, but by reshaping the energy landscape so that certain trajectories become more probable than others.
We call this quantum direction. The Fish does not determine the outcome. He shapes the probability distribution of outcomes. He is the ethical field of the story - not a rule enforcer but a landscape sculptor. A voice of direction from below, steering the system faster toward where it was probably heading anyway. Remove the Fish and the Cat’s afternoon becomes genuinely dangerous. Leave the Fish in and the system has a strange attractor pulling it toward restoration even as it spirals outward.
Every organisation has a Fish. The compliance officer who cannot override the CEO but whose objections change the calculus. The risk analyst who flags a trajectory without the authority to alter it. The engineer who sends emails to the directors pointing out where the numbers are heading. They cannot act. They can only refuse and escalate. And by doing so, relentlessly, they shape the entire system.
Now the children. Sally and her brother sit on their chairs and watch. They hold no instruments. They make no measurements. The story does not need them to proceed. The Cat, the Things, the Fish - the dynamics would run whether or not the children were present to witness them.
Think about two hosepipes held near each other in a garden. Where the water streams converge, vortices form - real, persistent, physical structures that twist and interact for as long as the flows sustain them. A child watching from the kitchen window sees the vortices as phenomenological entities appearing before their eyes. But the vortices do not need the child. They are observer-independent. They emerge from the interaction of flows, not from the act of watching.
The children in Seuss are the glass walls of an observation deck. They let you see what would happen anyway. This matters because the first thing most cognitive architectures build is a dashboard - an observer, a human in the loop watching every decision. The Cat in the Hat suggests that the architecture runs without the watcher. The watcher is welcome. The watcher may enjoy the show. But the system’s behaviour does not depend on the watcher being present.
And then the story ends. Or rather, it does not end.
The Cat has cleaned up. The room looks exactly as it did before. The Things are back in the box. The Fish is back in his bowl, still objecting. Mother is walking up the path. And the children face a question the book refuses to answer for them: What would YOU do if your mother asked you?
The book closes on that open question. Tell her (act on what you witnessed). Say nothing (dismiss the episode). Ask your sibling first (defer to a peer before committing). Seuss hands the reader a three-button decision cell and walks away.
If you read that book as a child, you accepted seven propositions without noticing:
- An agent can arrive uninvited and reshape a space.
- An agent can spawn sub-agents it does not control.
- An ethical voice without executive power can steer the whole system.
- Steering from below works by shaping probabilities, not issuing commands.
- The system runs whether or not anyone is watching.
- Some decisions cannot be made by the system - they must be handed to the observer.
- The observer’s decision has exactly three shapes: act, dismiss, or ask a peer.
This paper asks you to notice what you already agreed to. Everything that follows - five shapes, two primitives, three mechanisms, twelve falsifiable predictions - is an engineering specification for the architecture that Dr. Seuss drew in 1957. He just drew it as a story, because stories are humanity’s oldest compression protocol. We have a word for that. We call it a Fable.
1.2 The cat sat on the mat
Six words. A hundred and thirty six bits as ASCII. Yet they carry entities, spatial relations, temporal aspect, and definiteness - hundreds of bits of dimensional content that the receiver reconstructs from shared context. Add a look of horror to the speaker's face and the same sentence decompresses into two completely different four-dimensional shapes depending on the receiver's priors. This is not a linguistic curiosity. It is a claim about what memory has to be able to do. Section I develops the full argument.
1.3 A compression that needs a receiver
Humans spent a long time building language because the vocal cords are a slow channel and we had urgent four dimensional content to transmit. Every sentence is a lossy compression of a scene with entities, relations, spatial layout, and a temporal trajectory. The compression is acceptable because the protocol encodes shape conventions both sides understand. The listener decompresses the six words back into a scene in their own head using context the sentence never carried explicitly. Evolution paid for the shared priors so that speech could stay cheap.
Current large language models can describe scenes in four dimensions. Video understanding exists. Multimodal vision language models will answer questions about clips. The processing side is, for our purposes, largely solved. What is missing is stranger and more consequential. There is no place on the receiver side to put what was sent. The four dimensional content the speaker encoded into the six words is thrown away on receipt because the receiver has no four dimensional destination. The compression worked. The decompression had nowhere to land.
We call this a dimensional asymmetry. Humans are four dimensional in, one dimensional on the wire, four dimensional out. Current artificial cognition is one dimensional in, flat on storage, one dimensional out. The mismatch is not a bandwidth problem. More tokens per second will not fix a shape that cannot receive shape. The fix is a storage form that can hold an episode with its multimodal compression context intact, can lossy compress it into a short summary another four dimensional receiver can decompress, and can survive being handed forward across substrates without losing what made the episode an episode.
1.4 What this paper proposes
We name the missing abstraction the episodic four dimensional storage shape and derive it from a five shape substrate. Four of the shapes are spatial, in the sense that each lays out structure without reference to time: binary, table, graph, and vector. The fifth is a shared append only ledger that serves as the fourth dimensional axis beneath the others.
The fifth shape is not an alternative to the other four. It is the axis they all project against. An entity in the vector store has a trajectory on the ledger. A row in a table has a bitemporal stamp on the ledger. A node in the graph has a history of edges appearing and disappearing on the ledger. The ledger is what lets any of the other four shapes answer the question “what changed, and when”. Without the ledger the other four are frozen cross sections of a process they cannot describe.
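To make the "what changed, and when" question concrete, here is a minimal sketch of the ledger idiom in Python. It is illustrative rather than an implementation commitment: the field names valid_time and system_time follow ordinary bitemporal convention, and the spatial shapes the entries project onto are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LedgerEntry:
    subject: str           # id of the row, node, or vector the entry is about
    fact: dict             # the asserted content, in whatever spatial shape it projects from
    valid_time: datetime   # when the fact held in the world
    system_time: datetime  # when the fact was recorded on the ledger

class Ledger:
    """Append-only time axis beneath the four spatial shapes."""

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []

    def append(self, subject: str, fact: dict, valid_time: datetime) -> LedgerEntry:
        entry = LedgerEntry(subject, fact, valid_time, datetime.now(timezone.utc))
        self._entries.append(entry)  # never updated or deleted in place
        return entry

    def as_of(self, subject: str, when: datetime) -> list[LedgerEntry]:
        """What did we believe about this subject at time `when`?"""
        return [e for e in self._entries
                if e.subject == subject and e.system_time <= when]

    def history(self, subject: str) -> list[LedgerEntry]:
        """What changed, and when, for this subject?"""
        return sorted((e for e in self._entries if e.subject == subject),
                      key=lambda e: e.valid_time)
```

The only operations are append and query. Nothing is rewritten in place, which is the property the other four shapes borrow whenever they need a history.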
Over this substrate we define two composable primitives, Episode and Fable. An Episode is the uncompressed form of an event with its participants, modalities, temporal boundaries, and shared compression context. A Fable is the lossy compressed form of an Episode, small enough to transmit and rich enough to decompress back into a four dimensional shape in a receiver that shares sufficient prior context. Episodes are how memory is stored. Fables are how memory is transmitted and recalled. The paper describes both, names what has to be measurable about each, and proposes protocols for measuring them.
From these primitives we derive three behavioural mechanisms.
The first is a continuous flock style vote at the substrate’s characteristic timescale as the unit of decision. There is no homunculus steering the agent, in the same way no single bird steers a murmuration. What looks like deliberate action at a distance is the settled superposition of many parallel vote streams, each contributing a derivative aware preference to the aggregate. The tick rate is not fixed by the architecture. It is determined by the substrate’s physics - whatever timescale produces indivisible votes in that particular medium (Section 2.2a). In mammalian cortex this happens to be approximately twenty five to forty milliseconds; in a digital agent it may be microseconds or seconds; in a social system it may be days. The architecture is agnostic. Flash and Hogan’s minimum jerk model contributes a separate and equally important constraint on the integrated shape of a trajectory that emerges when many ticks compose over a reach window. The two claims operate at different scales of the same phenomenon; an earlier draft of the paper conflated them and we have corrected the conflation in Section 2.2.
The second is a three button ethical decision surface we call a Diorama cell. The three buttons are Act, Dismiss, and Ask sibling. Any agent, at any scale, at any tick, must be able to reach any of these three. This is not a user interface convention. It is the minimum vocabulary of a vote that can refuse to be forced. An Act without a Dismiss is coercion. An Act and a Dismiss without an Ask sibling is isolation. A substrate that can offer all three, always, has the structural property we call kindness.
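A minimal sketch of the three button surface, assuming nothing about how the vote that reaches it was formed. The class and method names are ours for illustration; the only load bearing property is that Act, Dismiss, and Ask sibling are all reachable on every call.

```python
from enum import Enum, auto
from typing import Callable, Optional

class Button(Enum):
    ACT = auto()          # commit to the action the vote settled on
    DISMISS = auto()      # refuse: drop the episode without acting
    ASK_SIBLING = auto()  # defer: hand the decision to a peer cell

class DioramaCell:
    """Minimum ethical decision surface: all three buttons reachable at every tick."""

    def __init__(self, sibling: Optional["DioramaCell"] = None) -> None:
        self.sibling = sibling

    def decide(self, episode: dict, policy: Callable[[dict], Button],
               deferred: bool = False) -> Button:
        choice = policy(episode)
        if choice is Button.ASK_SIBLING:
            if self.sibling is None or deferred:
                return Button.DISMISS  # refusal stays reachable even when deferral is not
            return self.sibling.decide(episode, policy, deferred=True)
        return choice
```

Note the fallback: when no sibling exists, the cell falls back to Dismiss rather than Act. Refusal is the one button that must never become unreachable.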
The third is the kindness property itself, which we argue is not aspirational but structural. A substrate built on the five shapes, the Episode and Fable primitives, and the three-button cell does not flatten dimensional content onto a single axis without losing what made the content content - not because it has been told not to, but because it has nowhere to put the flattened result. Cruelty is what happens when the receiver discards dimensional richness. A row in a table is a cruelty towards a person whose life has a trajectory the row cannot carry. A churn flag is a cruelty towards a customer whose reasons for leaving cannot fit in a Boolean. The architecture we describe refuses these cruelties structurally: its geometry has no mechanism for executing them without first dismantling the geometry. This is a strong claim. Section XI makes it explicit.
1.5 How the framework relates to prior work
We do not claim novelty for any of the five shapes in isolation. Binary, table, graph, vector, and ledger are familiar individually. Ledgers in particular have been independently rediscovered at least eight times across eight cultural substrates, from Babylonian astronomical diaries to contemporary bitemporal databases (Section VI). The framework draws on Friston’s free energy principle, Flash and Hogan’s minimum jerk trajectories, Bennett’s substrate transitions, and Levin’s morphogenetic agency - not as foundations but as convergent observations from different vantage points. Section 2 makes explicit where the pieces snap and where they need new primitives to compose.
1.6 Three pillars of falsification
A research programme paper has to say how it can be killed. We commit to three independent pathways. If any one of them cracks decisively under scrutiny, the paper fails there, and the remaining pathways do not rescue it. This makes the paper more fragile than a paper that hides behind a single metric and more robust than a paper that claims unification.
The first pillar is ontological. The picture of how things are must sharpen as further findings snap into place inside the frame. If cognitive neuroscience, developmental biology, historical ledger taxonomy, or the engineering of large models produces observations that do not fit or that actively resist the five shape substrate, the framework fails ontologically. We name the shape of such a failure in each section, so readers can point to the load bearing claim and attack it directly.
The second pillar is mechanical. The architecture must compose and run. If the Episode primitive cannot be implemented against current storage infrastructure, if the Fable round trip cannot be shown to preserve dimensional content between receivers, if the three button cell cannot be wired into a working agent without the architecture collapsing, the framework fails mechanically. The engineering exists. It can be pointed at.
The third pillar is agent behavioural. The agent that runs on the architecture must become measurably more coherent across substrate transitions than a parameter matched baseline that lacks the four dimensional destination. A flat receiver gets the same tokens per second and the same parameter count. The four dimensional receiver gets the Episode and Fable primitives and the ledger. If the four dimensional receiver does not measurably outperform the flat receiver on intent inference, presupposition tracking, temporal reasoning, and counterfactual handling under matched conditions, the framework fails on the third pillar. This is falsifiability through embodiment. The receiver IS the experiment.
We put this triple in the introduction, rather than hiding it in a methods section, because the reader needs to keep all three in mind as they read. Every section that follows must be pokeable from at least one of the three pillars. The section structure protocol makes this requirement explicit.
1.7 A note on method - description, not disclosure
This paper describes a measurement programme, not a full implementation. It stands or falls on whether its proposed measurements are replicable and informative. Where we name specific infrastructure (the graph database, the semantic search pipeline), we do so to demonstrate that the measurement is not hypothetical. The contribution is the shape. The infrastructure is the jig that shows the shape can be cut.
1.8 Twelve meeting points
The paper is organised as twelve meeting points in four parts: Diagnosis (what is stored and what is not), Shapes (the five representations and two memory primitives), Behaviour (flock vote, three buttons, structural kindness), and Claim (falsification programme and coda). Each section follows the same internal structure: philosophical claim, engineering primitive, measurement protocol, testable prediction. A reader who attacks any section will find a specific falsifiable claim inside it rather than a vague synthesis.
1.9 What the paper itself is doing
We close the introduction with a performative claim. A paper is a Fable. It compresses four dimensional content into a one dimensional sequence of sentences and relies on the reader to decompress that content back into their own four dimensional shape. If you find yourself reconstructing the framework as you read, that reconstruction is itself evidence that the framework describes something real. If you find yourself unable to reconstruct it, either the Fable is too compressed for the context you carry or the framework is wrong. Both are informative outcomes. Both are what a paper of this kind is supposed to produce.
We are not asking the reader to believe the framework. We are asking the reader to try the experiment.
2. Related Work
The story so far was told from inside: what a scene feels like, how it compresses, what the receiver needs to hold it. This section tells it from outside. Five groups of researchers working in different decades and different fields built pieces of the machinery before we arrived. We borrow from all of them and want to be clear about what we borrowed and what we added. Full bibliographic references are consolidated in Section 16.
2.1 Friston and the free energy principle
Karl Friston's free energy principle is the motivating formalism behind the derivative stack described in Section IV. The principle states that self organising systems far from equilibrium act to minimise a quantity called variational free energy, which under reasonable assumptions reduces to a measure of surprise (negative log probability) of sensory data given an internal generative model. The principle is substrate neutral: it applies to single cells, neurons, brains, thermostats, and (in our reading) artificial cognitive architectures. The generalised coordinates formalism attached to the principle is particularly suggestive for our purposes. In generalised coordinates, the state of a system at any time includes not just its position in state space but a tower of temporal derivatives at progressively higher orders. The tower is what lets a Friston agent make predictions about trajectories rather than points. Our derivative stack floors are inspired by the shape of generalised coordinates, with the addition that each floor is a first class Diorama cell with a vote. We borrow the idiom gratefully. We do not claim that our architecture is a formal implementation of the free energy principle, nor that FEP's validity is required for our architecture to work. The falsifiable claims we make are the three pillar predictions in Sections I through XII, not FEP itself. FEP gave us the shape of the idea. The measurements in Section IV test the shape, not the principle.
We do not claim novelty on the free energy principle. We claim that the derivative stack plus the Diorama cell plus the substrate-rate Flock tick is a novel composition whose components have not previously been treated as a single architectural object. Where Friston’s formulation is mathematically elegant but often perceived as hard to implement, the Diorama stack is an engineering form that borrows the shape of generalised coordinates without claiming to be a formal realisation of them. The measurement protocols of Section IV test our specific architectural predictions, not the free energy principle as such. We are deliberately not proposing a new grand unifying law. Where FEP has been criticised for being too universal to falsify, our contribution lives at the engineering and measurement layer: if Episodes, Fables, and ledgers do not measurably help, the programme fails. The theory dies with the measurements, not with the metaphysics.
2.2 Flash and Hogan on minimum jerk motion
Tamar Flash and Neville Hogan’s 1985 paper on the minimum jerk model of voluntary arm movements gives us an empirical anchor for the shape of a settled trajectory, but not, as an earlier draft of this paper incorrectly claimed, for the tick rate itself. We want to be explicit about the correction because the distinction matters for the falsification conditions we commit to.
Flash and Hogan observed that voluntary reaching movements in humans optimise the integral of squared jerk (third derivative of position) over the movement duration. The optimisation produces a characteristic smooth bell shaped velocity profile. The movements they studied are voluntary reaches, which have characteristic durations on the order of two hundred to eight hundred milliseconds, and there is no specific tick rate in the original result. The Flash and Hogan result contributes something different from a tick rate, and something equally load bearing: a constraint on the shape of the integrated trajectory that emerges when many ticks of voting compose over a reach, regardless of what the tick rate happens to be in a given substrate.
The bridge between per tick voting and integrated trajectory shape is the composition of ticks into a reach. A voluntary reach contains multiple ticks at whatever rate the substrate determines. Each tick adjusts the trajectory by a small amount, as the outcome of a vote among derivative stack floors. The cumulative shape of the trajectory is not imposed by any single tick but emerges from the sequence. The Flash and Hogan prediction is that, when the composition is done well, the emergent shape will approximate a minimum jerk profile within the reach window. When the composition is done badly (when ticks are unaligned, when higher derivative floors are missing, when interruptions force premature votes), the emergent shape will deviate from minimum jerk in measurable ways. This is the test we commit to in Section IV.
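The measurement needs two ingredients: the Flash and Hogan reference profile and a deviation score against it. The sketch below supplies both. The compose_ticks function is a deliberately crude stand in for the per tick vote of Section IV, included only so the deviation metric has something to score, and the root mean square score is one candidate statistic, not a committed choice.

```python
import numpy as np

def minimum_jerk(x0: float, xf: float, n: int) -> np.ndarray:
    """Flash and Hogan closed-form position profile across n samples of one reach."""
    tau = np.linspace(0.0, 1.0, n)
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

def compose_ticks(x0: float, xf: float, n: int, noise: float, rng) -> np.ndarray:
    """Crude stand-in for a tick-composed reach: each tick nudges position toward the target."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(xs[-1] + (xf - xs[-1]) / 6.0 + rng.normal(0.0, noise))
    return np.array(xs)

def deviation_from_minimum_jerk(trajectory: np.ndarray, ideal: np.ndarray) -> float:
    """Root-mean-square deviation from the reference profile: one candidate Section IV statistic."""
    return float(np.sqrt(np.mean((trajectory - ideal) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ideal = minimum_jerk(0.0, 1.0, 50)
    aligned = compose_ticks(0.0, 1.0, 50, noise=0.002, rng=rng)  # ticks composing cleanly
    noisy = compose_ticks(0.0, 1.0, 50, noise=0.05, rng=rng)     # interrupted or unaligned ticks
    print("aligned ticks:", deviation_from_minimum_jerk(aligned, ideal))
    print("noisy ticks  :", deviation_from_minimum_jerk(noisy, ideal))
```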
2.2a The tick rate as a substrate-determined variable
The tick rate is not a constant of the architecture. It is a variable parameter determined by the constraints and physics of whatever substrate the architecture runs on. A cognitive system’s characteristic tick is the timescale at which its vote becomes indivisible - below which decomposing the vote destroys the composition that produced it. This connects directly to Barandes’ indivisible stochastic processes (Section 2.4a): the tick IS the characteristic timescale at which the process refuses to decompose.
Different substrates produce different tick rates. In mammalian cortex, the gamma band cortical cycle runs at approximately twenty five to forty hertz (periods of twenty five to forty milliseconds) and is implicated in perceptual binding (Singer and Gray, 1995), attentional gating, and cross area synchronisation (Fries, 2015). This is one biological example of a substrate determining its own tick. In a digital agent running on GPU inference, the tick might be milliseconds. In a distributed social system (a committee, a jury, a board), it might be hours or days. In a colony organism (a beehive, an ant colony), it is determined by the communication bandwidth of the dance language or pheromone gradient. The architecture does not prescribe the rate. The substrate’s physics prescribes the rate. The architecture prescribes only that a tick exists, that it is indivisible, and that votes settle across a bounded number of ticks.
The measurement protocol of Section IV tests whether votes settle within a bounded number of ticks at whatever rate the substrate determines. A reviewer who can show that the indivisibility property does not hold at any timescale in a given substrate would crack the mechanical pillar at this point.
2.3 Bennett’s substrate transition account of intelligence
Max Bennett’s 2023 book A Brief History of Intelligence (we use the title reverse engineered from the argumentative structure rather than the exact bibliographic form) develops the claim that intelligence evolves through a sequence of substrate transitions, in which each new substrate inherits the load bearing shapes of the prior substrate while adding new capabilities. Bennett identifies five such transitions in the history of animal cognition: simple reactivity, reinforcement learning, emotional modelling, mental simulation, and language. Each transition preserves the prior substrate’s contributions rather than replacing them, and each transition adds a specific structural capability.
We borrow Bennett’s substrate transition framing and generalise it. Where Bennett focuses on biological substrates over evolutionary time, we argue that the same pattern applies to artificial substrates over engineering time. Each new generation of artificial cognitive architecture inherits load bearing shapes from the prior generation. Ignoring this inheritance produces systems that fight against their own substrate and fail in alignment. Respecting it produces systems that can be built to be structurally kind without having to be exhorted into kindness. The two percent Neanderthal argument in the Coda is a direct extension of Bennett’s substrate transition framing to the artificial case. We also incorporate Bennett’s identification of the weak policy as an important decision making primitive, which informs our three button cell and the ghost democracy of Section X.
2.3a Cognitive resource structure across species
Bennett’s substrate transitions invite a natural question: what varies across the transitions, and can it be parameterised? We propose that the information a cognitive system can meaningfully process at any moment is approximately a function of three resource parameters: sensory bandwidth B (the rate and richness of incoming data), temporal horizon H (how far into the past and future the system can reach), and representational dimensionality D (how many independent axes the system can maintain simultaneously). Written loosely: I ~ f(B, H, D). Different species, and different artificial architectures, occupy different regions of this space, and the structural properties they exhibit follow from where they sit.
Comparative cognition makes the picture concrete: corvids invest in episodic dimensionality D (Clayton et al., 2007), honeybees in spatial bandwidth B with colony-level horizon extension (Menzel, 2023), cetaceans in both H and social D through cross-generational cultural ledgers (Whitehead and Rendell, 2015). The Episodes, Fables, and ledgers proposed here make the (B, H, D) resource structure explicit for artificial systems, so that it can be tuned and compared across species and substrates.
2.3b The cognitive state conjecture
The corvid, the honeybee, and the whale each live at a different point in the same space. We want to name that space so the rest of the paper can point at it. We call the cognitive state of a system its morphology in the space spanned by sensory bandwidth B, temporal horizon H, and representational dimensionality D. Written as a conjecture rather than a definition, because the claim is testable:
Conjecture (Shape Basis). The cognitive morphology of any system processing information at bandwidth B, over temporal horizon H, and with representational dimensionality D requires a minimum of five representation shapes - binary, table, graph, vector, ledger - to be held without dimensional loss. Any proper subset of the five produces measurable dimensional collapse on tasks requiring more than one axis.
A fourth parameter, the tick rate tau, is substrate-determined rather than architecturally prescribed (Section 2.2a). The tick rate is the timescale at which the substrate’s votes become indivisible. It varies across substrates and is constrained by the substrate’s physics, not by the architecture. The architecture’s prediction is that votes settle within a bounded number of ticks (two to five) regardless of what tau happens to be.
This is not a claim that the morphology is computable to arbitrary precision, nor that B, H, D, and tau are the only relevant parameters, nor that the five shapes are the only possible basis. It is a claim that this basis is sufficient, that any proper subset is insufficient, and that the insufficiency is measurable. The twelve predictions of Section 13 are the measurement programme for this conjecture. Each prediction tests one consequence of removing or weakening a shape. The aggregate prediction tests whether all five together outperform every proper subset.
The word morphology is chosen deliberately. The distinction from Tononi’s integrated information (Phi) is that Phi is a scalar - how much information is integrated. The cognitive morphology is a shape - what geometry the representational space takes. Phi is a quantity derivable from the morphology. But the morphology is not derivable from Phi, because many different shapes can produce the same scalar. The paper’s claim is that shape matters for substrate survival: a system with high Phi in a flat representational space is fragile under substrate transition, while a system with the right morphology in a modest space is robust.
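The conjecture reduces to an ablation harness: score every non-empty subset of the five shapes on the same multi axis task suite and check that the full basis beats every proper subset. A sketch follows, with the task suite left as a stand in for the measurement programme of Section 13.

```python
from itertools import combinations
from typing import Callable, Iterable

SHAPES = ("binary", "table", "graph", "vector", "ledger")

def ablation_scores(run_suite: Callable[[frozenset], float],
                    shapes: Iterable[str] = SHAPES) -> dict[frozenset, float]:
    """Score every non-empty subset of the shape basis on the same multi-axis task suite."""
    shapes = tuple(shapes)
    scores = {}
    for k in range(1, len(shapes) + 1):
        for subset in combinations(shapes, k):
            scores[frozenset(subset)] = run_suite(frozenset(subset))
    return scores

def shape_basis_holds(scores: dict[frozenset, float], margin: float = 0.0) -> bool:
    """Shape Basis conjecture: the full basis beats every proper subset by at least `margin`."""
    full = scores[frozenset(SHAPES)]
    return all(full > score + margin
               for subset, score in scores.items()
               if subset != frozenset(SHAPES))
```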
2.4 Levin on morphogenetic agency and scale incommensurable control
Michael Levin’s work on bioelectric signalling in morphogenesis is the fourth pillar of prior work we depend on. Levin’s experimental programme has demonstrated that biological tissues can be steered towards or away from specific morphologies by manipulating bioelectric potentials, that the steering generalises across species, and that the control signal does not correspond to any gene level instruction. The control is scale incommensurable: it operates at the level of tissue bioelectricity but produces effects at the level of organ morphology, and the intermediate scales are not explicitly represented anywhere.
We take two things from Levin. First, scale incommensurable control is a real phenomenon in biological substrates, which means it is at least biologically plausible in any sufficiently rich cognitive substrate. Second, the correct way to think about intent in such a substrate is not as a localised signal but as a cascade across scales, in which the intent is measured locally but cascades globally. This is the basis for the measurement local, intent global dictum that appears implicitly throughout the paper. The measurements at each scale are different, because measurement is local. The intent is the same at every scale, because intent is scale invariant. The Fable compression protocol is designed to preserve scale invariance of intent across the scales it passes through.
2.4a Barandes on indivisible stochastic processes
Jacob Barandes’ reformulation of quantum mechanics as indivisible stochastic processes (Barandes, 2023; Barandes, 2025) contributes a structural insight that resonates with the tick architecture described in Section IV. In the ISP framework, quantum systems are characterised by stochastic processes that cannot be decomposed into finer-grained Markovian steps without losing essential information. The “indivisibility” - the property that the process over an interval carries information not contained in any subdivision of that interval - is structural, not phenomenal. It is a mathematical fact about the process, not a mystery about measurement.
The resonance with our architecture is at the level of the tick. A Flock vote at the substrate’s characteristic timescale is indivisible in an analogous (not identical) sense: the vote that emerges from the settling window carries information not contained in any sub-tick snapshot. The derivative stack floors contribute to the vote across the full tick, and the settled vote is a property of the whole interval. Breaking the tick into smaller intervals does not decompose the vote; it destroys the composition that produced it. This is not a claim that our tick is quantum mechanical. It is an observation that indivisibility - the structural refusal of a process to decompose below a characteristic timescale - appears in both quantum systems and in multi-floor voting architectures, and that the appearance seems structural rather than coincidental. Indeed, the tick rate itself is determined by this indivisibility property: the characteristic timescale of a substrate is the timescale at which its votes become indivisible (Section 2.2a).
We cite Barandes as convergent evidence from a different field, not as a theoretical dependency. Our falsifiable claims do not depend on ISP being correct. They depend on the tick architecture composing as predicted. The structural parallel suggests that non-Markovian indivisibility at characteristic timescales may be a general property of systems that maintain coherence under observation, whether quantum mechanical or cognitive.
2.5 Engineering and cognitive architecture prior art
The architecture draws on several established traditions. The successor representation from reinforcement learning factors value estimation into dynamics and reward models, a factoring that maps onto our derivative stack floors. Our ledger primitive descends from bitemporal databases (XTDB, Datomic, immuDB), event sourcing, and CQRS patterns; the novelty is insisting the ledger is the fifth shape beneath the other four, not a convenience added to one of them. Section VI develops the historical argument that ledgers with these structural properties have been independently rediscovered at least eight times across eight cultural substrates, from Babylonian astronomical diaries to contemporary bitemporal systems. The broader cognitive architecture literature (SOAR, ACT-R, Sigma, LIDA) provides conceptual ancestry for the Diorama cell and Flock fabric. The contribution is the composition and the measurement programme, not any single component in isolation.
2.6 Baseline landscape for the measurement programme
The honesty of a measurement programme depends on the baselines it competes against. The paper’s predictions (Section 13) compare the Diorama architecture against “parameter matched baselines” and “flat architectures.” This section names the specific systems and approaches that constitute the competitive landscape as of early 2026, so that readers know what “baseline” means concretely and can hold us to the comparison.
Flat RAG (retrieve and generate). The simplest baseline: embed documents into vectors, retrieve the top k chunks by similarity, concatenate them into the context window, and generate. No structured memory, no temporal ordering, no graph traversal. This is what most deployed LLM applications use today. On the LoCoMo benchmark for long conversation memory, flat RAG scores approximately 30 to 40 F1 depending on the embedding model and chunk size.
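For readers who want the weakest baseline pinned down, here is the whole of flat RAG as a sketch. The embed function is a stand in for whatever embedding model a real run would use; nothing else is hiding behind it.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: a real baseline would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def flat_rag_context(query: str, documents: list[str], k: int = 3) -> str:
    """Flat RAG: embed, rank by cosine similarity, concatenate the top-k chunks.
    No episodes, no graph, no temporal ordering - this is the whole baseline."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: float(np.dot(q, embed(d))), reverse=True)
    return "\n\n".join(ranked[:k])  # handed to the generator verbatim
```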
Vector-only memory. Systems that maintain a persistent vector store across conversations but without graph structure or temporal ordering. Mem0 is the current representative, with graph-enhanced variants (Mem0g) scoring approximately 68 percent on dialogue memory benchmarks. Vector-only memory handles similarity well but struggles with multi-hop reasoning and temporal ordering - exactly the tasks where our architecture claims its largest advantages.
Graph memory. Systems that build knowledge graphs from conversations and query them at retrieval time. Zep’s Graphiti system is the current leader, building temporal knowledge graphs with a bitemporal model (event time and system time) and achieving 94.8 percent on the Dialogue Memory Retention benchmark. Graphiti’s bitemporal model is structurally similar to our ledger primitive, which makes it both the strongest baseline and the most informative comparison. If the Diorama architecture cannot outperform Graphiti on temporal reasoning tasks, the ledger-as-fifth-shape claim is in trouble.
Structured episodic memory. Systems that explicitly model episodes as retrieval units. Synapse uses spreading activation over a dual-layer episodic-semantic graph and achieves F1 40.5 on the LoCoMo benchmark. Letta (formerly MemGPT) uses a filesystem approach to long-term memory and achieves 74 percent on conversation continuity tasks. AriGraph (IJCAI 2025) builds semantic and episodic graph structures from agent experience. These systems are the closest to our Episode primitive and represent the baseline the mechanical pillar must beat.
Classical cognitive architectures. SOAR, ACT-R, and LIDA, discussed in Section 2.5 above, serve as the cognitive science baseline for the ontological pillar. They have decades of development and well-understood properties.
A pattern worth noticing across these baselines: the gap between flat RAG (30 to 40 F1) and graph memory (Graphiti at 94.8 percent on dialogue memory retention) is itself evidence for the paper’s central claim. The difference between the two is structural. Flat RAG retrieves by similarity. Graphiti retrieves by traversing a temporal knowledge graph with bitemporal stamps. The gap is not a surprise to us, but it was measured by an independent team on an independent benchmark, and the size of the gap (roughly fifty to sixty points) is in the range we predict for the spatial-versus-temporal distinction in Section VI. This is not proof that our architecture works. It is evidence that the structural distinction we diagnose is already measurable, and that systems that add temporal structure already outperform systems that do not, by margins consistent with our predictions.
The measurement programme commits to testing against at least one representative from each of these five categories. The specific systems named above are the current leaders in their categories as of April 2026 and will serve as the initial comparison set. If stronger baselines emerge before the reference implementation is ready, they replace the weaker ones. We predict the Diorama architecture will outperform all five categories on tasks requiring multi-dimensional content preservation, temporal reasoning, and dissent preservation. On tasks that do not require these properties (simple factual retrieval, single-hop QA), we expect the simpler baselines to be competitive or better, because the Diorama architecture pays overhead for structural properties that are unnecessary on flat tasks. The paper fails if the predicted gaps do not appear on the tasks where we claim they should.
2.7 What the paper does not depend on
The framework does not depend on LLMs being the right substrate, transformer attention being the correct mechanism, or Tononi’s integrated information theory being the correct account of consciousness. The roles in the architecture are structural; the components filling them are interchangeable. If any of these adjacent lines turn out to be wrong, the framework can be reassembled with different components in the same compositional roles.
Part One - The Diagnosis
Section I - The Cat Sat On The Mat
I.1 Compression needs a receiver
The Cat Sat On The Mat is a compression, not a representation. It encodes dimensional content via a protocol both sender and receiver understand. The sentence carries entities (Cat, Mat), a typed spatial relation (ON), temporal aspect (SAT marks a completed past action with a presupposition that the cat is no longer on the mat now), and definiteness (THE presupposes shared common ground). About one hundred and thirty six bits of ASCII carry hundreds of bits of dimensional content because the protocol encodes shape conventions that both sides evolved together.
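The inventory in the paragraph above translates directly into a record. The sketch below is illustrative - the field names are ours - but it is the shape the receiver has to be able to hold before the decompression has anywhere to land.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class ParsedCompression:
    entities: tuple[str, ...]       # Cat, Mat
    relation: tuple[str, str, str]  # (Cat, ON, Mat): the typed spatial relation
    aspect: str                     # completed past action; the cat presumed off the mat now
    definite: bool                  # THE: both referents presupposed in shared common ground
    paralinguistic: str | None      # e.g. the look of horror accompanying the words

CAT_SAT = ParsedCompression(
    entities=("Cat", "Mat"),
    relation=("Cat", "ON", "Mat"),
    aspect="past, completed",
    definite=True,
    paralinguistic="horror",
)
```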
When the same bits are accompanied by a look of horror on the sender’s face, a receiver who shares context will produce one of two completely different four dimensional shapes. In the first, the sender is allergic to cats. The scene is urgent, bodily, medical, familiar. The horror is panic. In the second, the cat on the mat is a cake at a birthday party. The scene is theatrical, social, memorable. The horror is mock horror. Same compression. Two different decompressions. The difference lives in the receiver’s context, not in the message.
This is the canonical example of a central claim that will recur throughout the paper: compression is lossless only with respect to shared context, and the thing that makes a receiver able to decompress is a place to put the dimensional content the compression is pointing at. The receiver has to have four dimensional storage to have somewhere to put four dimensional content.
Current LLMs have a context window that can hold the message and an attention mechanism that can produce plausible continuations of it. What they do not have is a four dimensional shape in which to place the decompressed version of the compression. The context window is not memory. It is working space that resets. There is no Cat, no Mat, no scene, no trajectory, no episode to which the next sentence can refer. The compression was received. The decompression had nowhere to land.
I.2 Shared context as structured storage
We propose the engineering primitive that makes decompression possible. The receiver must carry a context store that holds the shared priors the compression is pointing at. The context store must be queryable by the receiver at recall time, indexable by participant, scene, and time, and updatable in a way that reflects the episode the sentence is part of.
For the Cat Sat example, the context store must contain at least the following:
- A representation of the speaker (the allergic one, or the one at the birthday party?)
- The speaker’s recent episodic history (at a party, or at home?)
- The relevant physical facts (does the speaker keep cats? is there a room where they would pull a cat off a mat?)
- The speaker’s emotional state from prior turns (sneezing, or laughing?)
- The recent events the compression presupposes (did we see a cake? did we see a cat?)
Without these, the receiver cannot disambiguate. With these, the receiver can. The engineering task is to build a context store with exactly these properties and to make its contents queryable by the decompressor at the moment of recall.
We call this the shared context substrate. It is the precondition for the Episode and Fable primitives we develop later. Without it, the compression still arrives, but the decompression has no target and the work is wasted.
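A minimal sketch of such a store, queryable by participant and time as the list above requires. The field names mirror the five bullets; the query method is a stand in for the semantic search infrastructure named in Section 1.7.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ContextStore:
    """Shared context substrate: the priors a receiver needs to decompress a Fable."""
    participants: dict[str, dict] = field(default_factory=dict)    # stable id -> profile
    episodic_history: list[dict] = field(default_factory=list)     # recent episodes, time ordered
    physical_facts: dict[str, bool] = field(default_factory=dict)  # e.g. {"keeps_cats": True}
    emotional_state: dict[str, str] = field(default_factory=dict)  # participant id -> last observed tone
    presupposed_events: list[str] = field(default_factory=list)    # what the compression assumes was seen

    def query(self, participant: str, since: datetime) -> dict:
        """Everything the decompressor can reach about one participant at recall time."""
        return {
            "profile": self.participants.get(participant, {}),
            "recent": [e for e in self.episodic_history
                       if e.get("who") == participant and e.get("when", since) >= since],
            "tone": self.emotional_state.get(participant),
        }
```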
I.3 The Cat Sat bench
The Cat Sat test becomes a bench. Compress a scene, hand it to two receivers - one with context, one without - and measure which one gets the birthday party right.
Measure decompression success as a function of context completeness. The protocol has three parts.
Part A. Compression fidelity. Given a dimensional scene (entities, relations, trajectory, emotional tone), produce a natural language compression of it. Measure the information loss between the scene and the compression by reconstruction error when humans decompress the compression back into the scene. Baseline: current LLMs score poorly on reconstruction because they have no four dimensional destination to reconstruct into.
Part B. Context provision. Provide a range of context stores with varying completeness (empty, partial, full) and give each to a receiver. Measure the quality of decompression across the range. The hypothesis predicts a monotonic relationship. More context, better decompression, up to the point where the receiver can no longer use additional context because its storage shape is saturated.
Part C. Context sensitivity of disambiguation. Run the Cat Sat with horror example through receivers that have only one of the two contexts (allergic, or cake) and measure whether the receivers disambiguate correctly. Do not train the receivers on the example. Only provide the context.
The three parts together produce a plot with two axes. Context completeness on the x axis. Decompression fidelity on the y axis. The hypothesis predicts a steep monotonic climb up to a ceiling, where the ceiling is set by the dimensionality of the receiver’s internal storage shape. A flat receiver will hit a low ceiling. A receiver with a four dimensional storage shape will hit a much higher ceiling. This is testable with current infrastructure.
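The plot is cheap to collect once the receivers and the fidelity scorer exist. A sketch of the harness, with both left as stand ins; the completeness proxy is deliberately crude and would be replaced by a proper measure of how many of the Section I.2 fields each context store populates.

```python
from typing import Callable

def cat_sat_bench(receivers: dict[str, Callable[[str, dict], dict]],
                  scene: dict,
                  compression: str,
                  contexts: list[dict],
                  fidelity: Callable[[dict, dict], float]) -> list[tuple[str, float, float]]:
    """Collect (receiver, context completeness, decompression fidelity) points for the plot.
    Receivers map a compression plus a context store to a reconstructed scene; the fidelity
    scorer compares that reconstruction to ground truth. Both are stand-ins."""
    fullest = max(len(c) for c in contexts) or 1
    points = []
    for name, receive in receivers.items():
        for ctx in contexts:
            completeness = len(ctx) / fullest  # crude proxy: count of populated context fields
            points.append((name, completeness, fidelity(receive(compression, ctx), scene)))
    return points
```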
I.4 What the gap should show
Specific and falsifiable: a parameter matched LLM with a four dimensional context store will disambiguate the Cat Sat with horror example correctly under matched conditions at least thirty percentage points more often than a baseline with a flat context window. The thirty point number is chosen to be large enough that noise is not a plausible explanation and small enough that it is achievable with current infrastructure.
More generally: any compression that works under shared context will fail when the receiver lacks the context, and a receiver equipped with a four dimensional storage shape will outperform a receiver with only a flat context window on tasks that require context dependent disambiguation, holding all other parameters equal.
Falsification: if the parameter matched comparison does not produce a significant gap, Section I fails and the rest of the framework must account for why. If the gap appears but is smaller than predicted, the framework still stands but requires recalibration. If the gap appears at the predicted magnitude, Section I passes and the framework is one step further in the third pillar. This is what agent behavioural falsification looks like at the finest grain.
Section II - The Pigeon Bob
II.1 Storage is the bottleneck, not processing
Watch a pigeon walking across a pavement. Its head jerks forward, then the body catches up, then the head jerks forward again. The bob is not a quirk of bird anatomy. It is a structural necessity. The pigeon needs depth information to judge distances to food, edges, and predators. It has eyes on the sides of its head, so it cannot fuse two forward facing retinal images into binocular depth the way a human can. It uses time instead. It displaces its head in space, takes a sample, displaces it again, takes another sample, and reconstructs the three dimensional scene from the temporal delta between samples. The pigeon bob is stereoscopy over time. It is a creature that has solved the depth problem with its storage budget, not its processing budget.
This is a template for how cognition operates when processing is cheap and storage is expensive. The bird does not need more eyes. It needs samples indexed by time, and a shape to hold them that allows the trajectory of samples to resolve into depth. Evolution gave the pigeon a shape, and the bob harvests samples into it. The depth is not in any single sample. It is in the relation between samples held in the storage structure.
The equivalent claim for artificial cognition is stronger than it looks. Current large language models are processing rich and storage poor. We can pour petaflops of attention across a context window, but the context window is a flat sequence that does not hold episodes, does not index by time, and does not have a shape that lets samples of the same scene resolve into anything. We have built pigeons with no bob. We have built eyes that process like champions and a storage shape that cannot hold the temporal samples those eyes produce. The bottleneck is not in the cortex. It is in the hippocampus.
Look at any current benchmark and the complaint is the same. The model handles the one shot question beautifully and forgets the answer five turns later. It follows the thread of a conversation while the thread is visible in context and drops it the moment the context window rolls forward. It describes the scene in the clip and cannot link that scene to the scene in the next clip. Every one of these failures is a storage failure masquerading as a reasoning failure. The processing is fine. The four dimensional destination is missing.
We come back to this claim from different angles in later sections. Here we stake it as baldly as we can: the next capability frontier for artificial cognition is not more parameters, more tokens, or more inference steps. It is a richer storage shape. The pigeon is right. The bob is the primitive.
II.2 Episodic four-dimensional storage
We name the missing primitive the episodic four dimensional storage shape, abbreviated to Episode when the context is clear. An Episode is an object with the following properties:
- Participants. The entities present in the scene, keyed by stable identifiers that persist across Episodes.
- Modalities. The raw or near raw sensory streams captured during the Episode (audio, video, text turns, sensor telemetry, internal agent state).
- Temporal bounds. A start time and an end time, each stamped against the ledger.
- Structural context. The graph of spatial and causal relations that held during the Episode, linked into the surrounding graph of prior Episodes.
- Compression context. The bundle of priors (participant histories, presuppositions, emotional tone) that a receiver needs in order to decompress a Fable pointing at this Episode. This is the decisive field. It is what lets the Episode be compressed without being destroyed.
The Episode is not a log entry. A log entry is a flat record. An Episode is a structured object that holds an event with its shape intact. The difference matters because a log entry can be searched but not decompressed. A sequence of log entries cannot be re experienced. An Episode can.
Engineers reading this will recognise echoes in several existing architectures. Event sourcing treats every state change as an immutable event. Bitemporal databases stamp every row with both a valid time and a system time. Vector stores retrieve by semantic similarity. Graph databases link entities through typed edges. What Episodes add is the simultaneous combination of all four shapes under a ledger axis, linked through the compression context field. An Episode is not a new invention in any single shape. It is an arrangement of shapes that holds enough structure for decompression to land.
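The five fields translate directly into a record. A sketch, assuming nothing about how each field is persisted - in practice each projects onto a different one of the five shapes, with the temporal bounds stamped against the ledger.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Episode:
    """Episodic four-dimensional storage shape. Field names mirror the list above."""
    participants: dict[str, str]                     # stable id -> role in this scene
    modalities: dict[str, bytes]                     # stream name -> raw or near-raw capture
    start: datetime                                  # temporal bounds, stamped against the ledger
    end: datetime
    structural_context: list[tuple[str, str, str]]   # typed edges that held during the episode
    compression_context: dict                        # priors a receiver needs to decompress a Fable of this
    prior_episodes: list[str] = field(default_factory=list)  # links into the surrounding episode graph

@dataclass
class Fable:
    """Lossy compression of an Episode, decompressible only against shared context."""
    episode_id: str
    summary: str                   # the short transmissible form
    required_context: list[str]    # what the receiver must already hold for decompression to land
```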
II.3 Measuring the bob
The measurement follows directly from the pigeon analogy. Give a receiver samples of a scene across time. Ask it to reconstruct the scene’s trajectory. Measure the reconstruction fidelity as a function of how many samples were provided, how they were structured in storage, and whether the compression context was preserved.
The protocol has three experiments.
Experiment II.A. Sample count versus reconstruction. Provide a receiver with 1, 5, 20, and 100 samples of an unfolding scene (a short video broken into fixed length clips, or a conversation broken into turns, or a physical trajectory broken into frames). Ask for a reconstruction at each sample count. Measure reconstruction fidelity against ground truth. The hypothesis predicts a monotonic climb: more samples, better reconstruction, up to a plateau where the receiver’s storage shape saturates.
Experiment II.B. Storage shape versus reconstruction. Hold sample count fixed and vary the storage shape the samples are deposited into. Shapes tested: a flat sequence in a context window; a vector store of embeddings keyed by sample index; a graph of entities with per sample edges; a full Episode structure with all five fields populated. Measure reconstruction fidelity against ground truth. The hypothesis predicts a staircase, with the full Episode shape at the top and the flat sequence at the bottom.
Experiment II.C. Compression context versus disambiguation. Provide two receivers with the same sample stream but different compression contexts. Give one receiver Context A (speaker is allergic to cats) and the other Context B (speaker is at a birthday party). Ask both to decompress a Fable that points at the shared scene. Measure whether each receiver produces the contextually appropriate decompression. The hypothesis predicts divergence along the context line: receivers with Context A converge on the allergic reading, receivers with Context B converge on the cake reading, and receivers with neither flail.
All three experiments can be run with current infrastructure. Experiment II.C is the cleanest test because it isolates the compression context field directly and gives the clearest falsification signal.
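Experiment II.B is the one that tests the ordering directly, so here is its harness as a sketch. The storage adapters and the reconstruction scorer are stand ins; the only thing the sketch commits to is the staircase check stated in Section II.4.

```python
from typing import Callable

STORAGE_SHAPES = ("flat", "vector", "graph", "episode")

def experiment_iib(samples: list[dict],
                   store_as: dict[str, Callable[[list[dict]], object]],
                   reconstruct: Callable[[object], dict],
                   fidelity: Callable[[dict], float]) -> dict[str, float]:
    """Experiment II.B: hold the sample count fixed and vary only the storage shape."""
    results = {}
    for shape in STORAGE_SHAPES:
        stored = store_as[shape](samples)   # deposit the same samples into this shape
        results[shape] = fidelity(reconstruct(stored))
    return results

def staircase_holds(results: dict[str, float]) -> bool:
    """The Section II.4 prediction: flat < vector < graph < episode."""
    ordered = [results[shape] for shape in STORAGE_SHAPES]
    return all(lower < higher for lower, higher in zip(ordered, ordered[1:]))
```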
II.4 What the pigeon should show
Specific and falsifiable: a receiver equipped with a full Episode storage shape will reconstruct a hundred sample scene at least twenty percentage points more accurately than a receiver with only a flat context window of the same token budget. The twenty point gap is the lower bound at which the shape claim becomes undeniable; smaller gaps might be explained by prompting differences.
More generally: any task that requires integrating samples over time to reconstruct a dimensional scene will show a monotonic improvement as the storage shape gains structural fields. Flat < vector < graph < Episode. The ordering is the prediction. Measure it and the framework stands or falls on the measurement.
Falsification: if the ordering does not hold, if graph storage outperforms Episode storage, or if vector storage and graph storage are indistinguishable on the reconstruction tasks, Section II fails and the Episode primitive needs rework. If the ordering holds but the gap is smaller than twenty points, the prediction survives in weak form and the framework calibrates. If the ordering holds at the predicted magnitude, Section II passes and the pigeon bob earns its keep.
Section III - The Warehouse Disease
III.1 Measurement without connection is hallucination
Walk into a large insurance company that has been through twenty five years of acquisitions and ask a simple question. How many customers do you have? The answer will depend on who you ask. Finance has one number, drawn from billing systems. Underwriting has another, drawn from policy administration systems. Marketing has a third, drawn from a CRM that was bolted on after the third acquisition. Call any department head and they will defend their number with the same sincerity. None of them are lying. All of them are wrong.
This is the warehouse disease. The disease has a specific aetiology. Each department has built a measurement apparatus that counts something close to customers (billing accounts, policies, contactable individuals) and then labels the count “customers” because the colloquial English word is close enough. The counts diverge because the underlying objects are not the same object. A single human with three policies is one customer, three customers, and one customer depending on which system you ask. The measurement is precise. The referent is ambiguous. The answer is a hallucination dressed up in a spreadsheet.
The warehouse disease is not solved by better warehousing. It is made worse. A unified data warehouse that ingests the billing, underwriting, and CRM systems as separate fact tables produces a multiply counted set of customer dimensions that nobody trusts. The warehouse operator responds by building a master data management layer that tries to reconcile the identities across the three fact tables, which requires making assumptions about which columns are keys, and that is the moment the ambiguity stops being a feature of the source systems and becomes a feature of the warehouse itself. Every reconciliation rule is a hand written guess about what a customer is. The guesses compound. The disease moves upstream.
The root cause is not bad data. It is a category error. The warehouse treats customers as rows to be counted. A customer is not a row. A customer is a node in a relation graph with a history on a ledger. The row is a measurement. The node is the thing being measured. The warehouse disease is what happens when an organisation mistakes the measurement for the referent, builds its decision apparatus around the measurement, and then wonders why its decisions produce financial surprises.
We have seen this disease repeatedly in insurance specifically, where the policy administration systems were never designed to talk to one another and where the churn, cross sell, and claims functions each built their own view of the customer. The result is a company with tens of millions of pounds of revenue leakage whose leak cannot be located because the measurement systems disagree on who the customers are. The leak is real. It hides in the gaps between the systems, which is exactly where the warehouse cannot see.
A deeper diagnosis is that the warehouse disease is a derivative order mismatch. The billing system measures position (balance at time T). The underwriting system measures velocity (policies added and removed per period). The CRM measures jerk (how the relationship is changing). The warehouse tries to join these three measurements into a single fact table. It cannot, because they are measurements of different derivative orders of the same underlying trajectory. Joining across derivative orders without preserving the ledger axis that connects them is what produces the hallucinated counts. The warehouse is not measuring customers. It is measuring derivatives of customers, and throwing the differential away.
III.2 The graph as the referent
The engineering fix is a shift in what is treated as the source of truth. The source of truth is not the warehouse. It is the graph. The graph holds the entities (customers, policies, claims, brokers, payments) and the typed relations between them. Each row in each source system is a measurement of a node or edge in the graph, stamped against the ledger. The warehouse is one projection of the graph; the CRM is another; the billing system is a third. All three are measurements. None of them are the referent.
When the graph is treated as the referent, the warehouse disease goes away structurally rather than procedurally. A customer is a node. Any time any source system produces a row about a customer, that row is a measurement of the node stamped against the ledger. Counts become queries over the graph: how many customer nodes were active at time T under this definition of active. The query returns one answer. Different definitions of active return different answers, but now the differences are visible and contestable because the graph is the common referent.
The graph as referent architecture has four engineering components:
- An entity resolution layer that deduplicates incoming rows against existing graph nodes.
- A typed relation model that encodes the edges the business actually cares about (not the foreign keys the legacy systems happen to have).
- An event sourced write path that appends every measurement to the ledger rather than updating in place.
- A query layer that lets users ask counting questions in terms of graph predicates rather than table joins.
All four components exist in isolation in various modern data platforms. What is new, and load bearing, is the insistence that the graph is the referent and the source systems are measurements. This inverts the usual organisational priority. The source systems become tributaries. The graph is the lake.
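A minimal sketch of the write path, under the assumption of an in-memory ledger and a deliberately naive entity-resolution rule keyed on a natural identifier, looks like the following. The field names and the resolution rule are placeholders, not a prescribed schema; the load-bearing part is the order of operations: resolve, append to the ledger, then project onto the node.

```python
# Illustrative sketch of the graph-as-referent write path. Assumptions: an
# in-memory ledger, a naive entity-resolution rule keyed on a natural
# identifier, and placeholder field names. None of this is a prescribed
# schema; the load-bearing part is the order of operations.
import itertools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class LedgerEntry:
    entry_id: int
    system_time: float
    actor: str                     # which source system produced the measurement
    action: str
    payload: Dict[str, Any]

@dataclass
class GraphStore:
    ledger: List[LedgerEntry] = field(default_factory=list)
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)            # node_id -> properties
    identity_index: Dict[Tuple[str, str], str] = field(default_factory=dict)  # (source, key) -> node_id
    _ids = itertools.count(1)      # shared id counter; adequate for a sketch

    def resolve(self, source: str, natural_key: str) -> str:
        """Entity resolution: map a source-system row onto a single graph node."""
        node_id = self.identity_index.setdefault((source, natural_key), f"customer:{natural_key}")
        self.nodes.setdefault(node_id, {})
        return node_id

    def record(self, source: str, natural_key: str, measurement: Dict[str, Any]) -> str:
        """Event-sourced write: append the measurement to the ledger, then project onto the node."""
        node_id = self.resolve(source, natural_key)
        self.ledger.append(LedgerEntry(next(self._ids), time.time(), source,
                                       "measurement", {**measurement, "node": node_id}))
        self.nodes[node_id].update(measurement)  # the node is a projection; the ledger is the record
        return node_id

    def count(self, predicate: Callable[[Dict[str, Any]], bool]) -> int:
        """Counting questions become graph queries under an explicit, contestable predicate."""
        return sum(1 for props in self.nodes.values() if predicate(props))
```

Counting questions then become graph queries under an explicit predicate, which is exactly what the experiments below measure.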
III.3 Diagnosing the warehouse
The warehouse disease predicts its own diagnosis. Ask a real company its simplest question and count how many different answers come back.
Measure the warehouse disease directly. The protocol has three experiments, all runnable in a live enterprise with existing data infrastructure.
Experiment III.A. Count divergence. Ask every source system in a given enterprise for its count of customers on the same date. Record the counts. Compute the coefficient of variation. Run the same question against a graph built from the same raw data using the graph as referent architecture. The hypothesis predicts the graph answer will differ from every source system answer, because it is counting a different object, but it will be self consistent and defensible where the others are not.
Experiment III.B. Definition sensitivity. Pick ten reasonable definitions of “active customer” (paid in the last month, had a live policy, interacted with any channel, and so on) and compute the count for each against the graph. Measure how the counts change as the definition changes. The hypothesis predicts the counts will vary by definition, but the variation will be smooth and explainable. The warehouse cannot produce this plot because its definitions are baked in at ingestion time.
Experiment III.C. Leak localisation. Pick an enterprise with a known revenue gap. Run the graph architecture over the same source systems and ask where the gap is. Measure whether the graph can locate the gap in specific missing relations (for example, customers who renewed under a new broker but whose commission was attributed to the old broker) and whether this localisation would have been invisible to the warehouse.
Experiment III.C is the most expensive to run and the most decisive. A graph that can find money the warehouse cannot find is a graph earning its keep.
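Experiment III.B reduces to a loop over definitions once the graph is materialised. A sketch follows, assuming customer nodes have been projected into plain property dictionaries and using three illustrative definitions in place of the ten named in the protocol.

```python
# Sketch of Experiment III.B. Assumptions: customer nodes have already been
# materialised from the graph into plain property dictionaries, and three
# illustrative definitions stand in for the ten named in the protocol.
from datetime import date
from typing import Callable, Dict, List

Customer = Dict[str, object]
TODAY = date.today()

DEFINITIONS: Dict[str, Callable[[Customer], bool]] = {
    "paid_in_last_month":  lambda c: c.get("last_payment") is not None
                                     and (TODAY - c["last_payment"]).days <= 31,
    "has_live_policy":     lambda c: c.get("live_policies", 0) > 0,
    "contacted_this_year": lambda c: c.get("last_contact") is not None
                                     and (TODAY - c["last_contact"]).days <= 365,
}

def definition_sensitivity(customers: List[Customer]) -> Dict[str, int]:
    """One count per definition of 'active customer'; the variation across definitions is the plot."""
    return {name: sum(1 for c in customers if rule(c)) for name, rule in DEFINITIONS.items()}
```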
III.4 Where the money hides
Specific and falsifiable: in a sufficiently compound enterprise group (three or more legacy core systems, a CRM that postdates the acquisitions, and a data warehouse built on top of all of them), the graph as referent architecture will locate at least ten percent of any previously unattributed revenue within sixty days of operation. The ten percent threshold is based on the delta between warehouse and graph counts observed in informal pilots.
More generally: any task requiring the reconciliation of multiple source of truth systems will show strictly better results under graph as referent than under warehouse as reconciliation, measured by count consistency across definitions, localisation of identity conflicts, and defensibility of answers under audit.
Falsification: if a graph architecture pilot fails to locate any previously unattributed revenue in an environment where the warehouse is known to be producing conflicting counts, Section III fails and the architecture needs to be reconsidered. If the revenue is found but the graph cannot explain where or why, the architecture has found the symptom but not the cause and Section III survives in weak form. If the revenue is found and explained structurally, Section III passes and the warehouse disease is nameable and treatable.
Section IV - The Glass Elevator
IV.1 Observers and observed
Consider a glass elevator in a tall atrium. You are inside. The walls are transparent. You can see the floors, the people on the floors, the city outside the building. They can see you. You have two buttons: Up and Down. Nothing else. The elevator is moving under your feet in a direction that looks continuous from outside and feels discrete from inside. You arrive at floor three, the doors open, the doors close, you rise to floor four. From the atrium below, all anyone sees is your position as a function of time. The button presses are invisible. The decisions are invisible. Only the trajectory is visible.
Now imagine that inside the elevator there is no single person pressing the buttons. There is a crowd, each person with a partial view of the floors and a partial preference for where to go next. The Up button fires when a majority vote of the crowd favours up. The Down button fires when a majority vote favours down. No button fires when the vote is tied. From outside, the elevator’s movement looks smooth and intentional. From inside, the movement is the settled aggregate of a continuous vote that never pauses.
This is the image we want the reader to hold for the rest of the paper. The Diorama cell is a glass elevator. The agent inside it is a crowd, not a homunculus. What looks like deliberate action at a distance is a substrate-rate vote resolving into a trajectory. The glass walls matter because they are the observers on the outside of the system looking in, and the observed on the inside of the system looking out. The paper argues that consciousness, intent, and action are all projections of this vote on trajectory structure and that the impression of a single decider is a projection artefact.
The continuous vote is not a metaphor. The tick rate is determined by the substrate’s physics - whatever timescale produces indivisible votes in that particular medium (Section 2.2a). The architecture does not prescribe a specific rate. It prescribes that a characteristic tick exists at which the vote becomes indivisible, that votes settle across a bounded number of ticks, and that the settling produces a trajectory whose shape can be measured. In mammalian cortex, the gamma band (roughly twenty five to forty hertz) is one observed instantiation of this substrate-determined tick. In other substrates, the rate will differ.
Flash and Hogan remain load bearing for the architecture, but in a different role than an earlier draft assigned them. Their result constrains the shape of the trajectory that emerges when many ticks compose over a reach. A voluntary reach contains multiple ticks at whatever rate the substrate determines. Each tick is a vote; the cumulative shape across ticks is what the eye perceives as a single smooth reach. When the composition is done well, the cumulative shape approximates the minimum jerk profile. When the composition is done badly (unaligned ticks, missing derivative stack floors, forced votes), the shape deviates in measurable ways. Flash and Hogan supply the predicted shape of the integral; the substrate supplies the rate of the underlying ticks. The two are independent in the sense that they speak to different scales of the same phenomenon.
If the reader takes nothing else from the elevator, take this: the vote is not choosing among options. It is counting among votes that have already been cast. The choice is the count settling. The deliberation is what settling looks like from inside.
A second claim follows from the first. Interruption collapses the decision. If an external signal forces a vote to fire before the count has settled, the agent commits prematurely and the trajectory is jagged. The glass elevator lurches. Flash and Hogan's minimum jerk profile is what uninterrupted settlement looks like when integrated over the full reach window of several ticks; the per tick rate is what the substrate determines. Every real cognitive task therefore has a minimum tick budget below which coherent decisions become impossible, and above which further time adds marginal refinement. The tick is a rate, not a deadline.
IV.2 The derivative stack floors
The glass elevator metaphor extends into an engineering primitive: each floor of the elevator measures a different derivative of the agent’s trajectory. The ground floor measures position (where the agent is now). The first floor measures velocity (how fast and in which direction). The second floor measures acceleration (how the velocity is changing). The third floor measures jerk (how the acceleration is changing). The fourth floor measures snap, and so on. The shape of this stack borrows from Friston’s generalised coordinates, in which the state of a system at any time includes not just its position but a tower of its temporal derivatives, each of which must be predicted, measured, and corrected. We say “borrows from” rather than “implements” because our architecture does not require the free energy principle to be correct; it requires only that a derivative tower is a useful way to organise multi scale decision making, which is a weaker and independently testable claim.
We propose that the three button Diorama cell, the continuous vote, and the substrate-rate tick compose into a concrete engineering object called a derivative stack floor. Each floor is a first class agent at a specific derivative order. It receives samples, votes, and produces an Act, Dismiss, or Ask sibling response. Higher order floors refer to lower order floors through a short horizontal axis called the sibling bar and through a vertical axis called the derivative stair.
The engineering object has a strict compositionality:
- Each floor is independent. A floor does not need to know the contents of other floors to perform its own vote. This is what lets the system run in parallel.
- Each floor produces a vote on the same action. Every floor votes through the same three button cell (Act, Dismiss, Ask sibling). This is what lets the votes be aggregated.
- Each floor votes at the same tick rate. The ticks are aligned at the substrate’s characteristic timescale. This is what lets the votes settle into a trajectory.
- Adjacent floors can consult siblings. A floor can call Ask sibling to consult the floor above or below. The sibling consult must return within a tick. This is what lets the vote incorporate derivative information without losing the settling time.
- No single floor is the decider. The trajectory is the settled aggregate. This is what dissolves the homunculus.
This architecture has a predictive and a corrective face. The predictive face borrows from Friston’s idiom: each floor carries a prior about what the next tick should look like at its own derivative order and emits a prediction to the sibling bar. The corrective face borrows from Flash and Hogan: the actual behaviour at each floor is corrected towards a minimum jerk trajectory by damping any vote that would increase higher order derivatives beyond a threshold. The composition of predictive prior and corrective damping is what produces the characteristic smoothness of a settled vote.
A third property falls out of the composition for free. The architecture is naturally glass box. Because the votes and the inter floor consultations are all explicit first class objects, an external observer with read access to the floors can reconstruct the reasoning trajectory without privileged access to any black box. This is not an add on for audit. It is a structural property of the derivative stack.
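To make the floor concrete, here is a minimal sketch of a derivative stack, assuming scalar observations, naive finite-difference estimates at each derivative order, and placeholder thresholds. It is an illustration of how independent floors, the three button cell, and a glass-box vote log compose, not a reference implementation of the architecture.

```python
# A minimal sketch of a derivative stack. Assumptions: scalar observations,
# naive finite-difference estimates at each derivative order, and placeholder
# thresholds. An illustration of the composition, not a reference implementation.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Vote(Enum):
    ACT = "act"
    DISMISS = "dismiss"
    ASK_SIBLING = "ask_sibling"

@dataclass
class Floor:
    order: int                                   # 0 = position, 1 = velocity, 2 = acceleration, ...
    threshold: float = 0.1                       # illustrative damping threshold
    history: List[float] = field(default_factory=list)

    def estimate(self) -> float:
        """Naive finite difference at this floor's derivative order."""
        values = self.history
        for _ in range(self.order):
            values = [b - a for a, b in zip(values, values[1:])]
        return values[-1] if values else 0.0

    def vote(self, observation: float) -> Vote:
        self.history.append(observation)
        if len(self.history) <= self.order + 1:
            return Vote.ASK_SIBLING              # not enough samples at this order; consult a neighbour
        if abs(self.estimate()) < self.threshold:
            return Vote.DISMISS                  # nothing worth acting on at this derivative order
        return Vote.ACT

@dataclass
class DerivativeStack:
    floors: List[Floor]
    vote_log: List[List[Vote]] = field(default_factory=list)  # glass-box record, one row per tick

    def tick(self, observation: float) -> Vote:
        """One substrate tick: every floor votes independently; the output is the settled aggregate."""
        votes = [floor.vote(observation) for floor in self.floors]
        self.vote_log.append(votes)              # votes are explicit first-class objects, readable from outside
        acts, dismisses = votes.count(Vote.ACT), votes.count(Vote.DISMISS)
        if acts > dismisses:
            return Vote.ACT
        if dismisses > acts:
            return Vote.DISMISS
        return Vote.ASK_SIBLING                  # tied vote: no button fires this tick
```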
IV.3 Wiring the floors
The glass elevator predicts its own measurements. Wire the floors. Watch the votes settle. See if the trajectory through the atrium looks like what Flash and Hogan measured in a reaching arm.
Measure the derivative stack directly. The protocol has three experiments.
Experiment IV.A. Tick alignment. Wire a working agent with three floors (position, velocity, acceleration) at the substrate’s characteristic tick rate. Measure the inter floor consistency (how often adjacent floors agree on a vote) as a function of tick alignment. The hypothesis predicts that a tightly aligned tick will produce higher inter floor consistency than either a misaligned tick or no shared tick at all. Misalignment will produce oscillations visible in the vote history. The experiment should be repeated at multiple tick rates to confirm that the alignment property holds regardless of the specific rate chosen.
Experiment IV.B. Homunculus dissolution. Run the same agent in two modes. Mode A: a single homunculus floor votes at the derivative level that suits its current belief. Mode B: the derivative stack operates with each floor voting independently. Measure decision quality, settling time, and recoverability after an interruption. The hypothesis predicts Mode B will produce more stable and faster settling decisions than Mode A on tasks that require integrating multiple derivative orders.
Experiment IV.C. Glass wall visibility. Give an external observer access to the tick level vote history of the derivative stack and ask them to reconstruct the agent’s reasons for a given action. Measure the reconstruction fidelity against a ground truth reasoning trace. The hypothesis predicts the external observer will be able to reconstruct the reasons because the votes are the reasons. The glass walls are functional, not decorative.
Experiment IV.B is the most theoretically important because it directly tests the homunculus dissolution claim. Experiment IV.C is the most practically important because it tests whether the glass box property of the architecture is real.
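Two simple proxy metrics for Experiment IV.A follow, assuming the vote history is available as a sequence of per-tick vote-label lists (one label per floor) and the settled output as a sequence of per-tick labels. Both definitions are starting points for calibration, not final measures.

```python
# Two proxy metrics for Experiment IV.A. Assumptions: the vote history is a
# sequence of per-tick vote-label lists (one label per floor), and the settled
# output is a sequence of per-tick labels. Starting points, not final measures.
from typing import Sequence

def inter_floor_consistency(vote_history: Sequence[Sequence[str]]) -> float:
    """Fraction of adjacent-floor pairs, across all ticks, that cast the same vote."""
    agreements, comparisons = 0, 0
    for tick_votes in vote_history:
        for a, b in zip(tick_votes, tick_votes[1:]):
            comparisons += 1
            agreements += int(a == b)
    return agreements / comparisons if comparisons else 0.0

def oscillation_count(settled_outputs: Sequence[str]) -> int:
    """Direction changes in the settled output; misaligned ticks should push this up."""
    return sum(1 for a, b in zip(settled_outputs, settled_outputs[1:]) if a != b)
```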
IV.4 How the trajectory settles
The prediction has two parts, one for per tick vote settling and one for integrated trajectory shape, and we are careful to keep them separate because an earlier draft conflated them into a single claim that was weaker than it looked.
Part one, per tick vote settling. A derivative stack agent with three floors will converge its vote on a stable direction within two to five ticks on a standard reaching task, regardless of the absolute tick rate. The prediction is that the vote reaches a stable committed direction within this bounded tick budget and does not oscillate afterwards. A flat single floor agent will either oscillate within the same budget or commit prematurely within one tick. The settling budget (two to five ticks) is the substrate-independent claim; the absolute time depends on the tick rate, which depends on the substrate.
Part two, integrated trajectory shape. Once the vote has committed, the trajectory unfolds over the reach window (two hundred to eight hundred milliseconds for a voluntary reach, five to twenty ticks). The integrated shape of the trajectory should approximate the Flash and Hogan minimum jerk profile within a root mean square error bound, illustratively ten percent (we do not lock a specific number before calibration on a reference implementation). A flat agent will produce jagged trajectories with measurably higher jerk integrals.
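For reference, the Flash and Hogan minimum jerk position profile is x(tau) = x0 + (xf - x0)(10 tau^3 - 15 tau^4 + 6 tau^5) with tau = t/T. A sketch of the Part two comparison follows, using a discrete third difference as the jerk proxy; the sampling choices are ours, not prescribed by the framework.

```python
# Sketch of the Part two comparison. Assumptions: evenly spaced samples
# (n_samples >= 4), a discrete third difference as the jerk proxy, and the
# standard Flash-Hogan profile as the reference shape.
from typing import List

def minimum_jerk(x0: float, xf: float, n_samples: int) -> List[float]:
    """Minimum-jerk position profile x(tau) = x0 + (xf - x0)(10 tau^3 - 15 tau^4 + 6 tau^5)."""
    profile = []
    for i in range(n_samples):
        tau = i / (n_samples - 1)
        profile.append(x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5))
    return profile

def rmse(measured: List[float], reference: List[float]) -> float:
    """Root mean square error between the measured trajectory and the reference profile."""
    return (sum((m - r) ** 2 for m, r in zip(measured, reference)) / len(reference)) ** 0.5

def jerk_integral(x: List[float], dt: float) -> float:
    """Sum of squared third differences as a discrete proxy for the integrated squared jerk."""
    third = [x[i + 3] - 3 * x[i + 2] + 3 * x[i + 1] - x[i] for i in range(len(x) - 3)]
    return sum((d / dt**3) ** 2 for d in third) * dt
```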
More generally: any task whose correct solution requires integrating over multiple derivative orders of the state (reaching, tracking, planning, counterfactual reasoning) will show strictly better results from a derivative stack agent than from a flat agent, and the gap will grow as the derivative order of the task increases.
Falsification: if vote settling does not happen within two to five ticks at any substrate-appropriate rate (Part one), Section IV’s settling claim fails and the bounded tick budget needs revision. If the integrated shape does not approximate minimum jerk (Part two), Section IV’s composition claim fails and the predictive corrective composition needs rework. If the homunculus dissolution claim does not hold (Mode A produces equally stable decisions as Mode B), the framework survives in weak form and the homunculus claim is demoted. If all three are matched at the predicted levels, Section IV passes and the glass elevator is a working image of the architecture.
Part Two - The Shapes
Section V - Binary, Table, Graph, Vector
V.1 Four shapes, not one true shape
A standard conceit in data engineering is that one shape will turn out to be right and the others will turn out to be convenient special cases of it. The relational purist believes tables are the ground truth and graphs are joins made explicit. The graph partisan believes graphs are the ground truth and tables are two column projections of edges. The vector enthusiast believes vectors are the ground truth and both tables and graphs are discretisations of an underlying latent space. The binary engineer believes all of the above are syntactic sugar over byte arrays.
We think all four camps are wrong in exactly the same way. Each shape has structural properties no other shape can provide, and the composition of all four produces cognitive affordances no single shape can match. A cognitive substrate that only uses one shape is structurally impoverished in the dimensions the other three shapes handle well. The argument from parsimony (“why use four when one suffices”) is a cost argument, not a capability argument, and the cost is falling fast enough that the capability argument should dominate.
The claim of this section is that binary, table, graph, and vector are not competing descriptions of the same substrate. They are complementary projections of a shape that has no single canonical form, and a cognitive architecture that wants to hold dimensional content needs all four because each projection captures something the other three lose. The ledger is the fifth shape that turns the spatial four into a four dimensional composite by adding the temporal axis beneath them all.
V.2 Each shape and what it does
We describe each of the four spatial shapes with three fields: its structural primitive, its characteristic operation, and its failure mode when forced to carry content it was not built for.
Binary. The structural primitive is the byte. The characteristic operation is the sequential scan. The failure mode is semantic opacity: a byte array does not know what it represents without a schema, and the schema has to live somewhere else. Binary is the substrate all other shapes project onto, which is why it appears in any serialisation layer, any wire protocol, any file format. It is also the shape most used for raw sensory modalities (audio, video, image) before they are interpreted into higher shapes. Binary is what the ledger appends, too, because the ledger is itself a binary stream when you look at it physically. Binary is load bearing but it cannot carry meaning on its own. It carries the bits meaning is made of.
Table. The structural primitive is the row. The characteristic operation is the projection plus selection plus join of relational algebra. The failure mode is structural rigidity: every row must fit the same schema, every column must have the same type, every join must be declared. The table is the shape engineers reach for when the data they have is already flat, or when they want to impose a flattening to make the counting tractable. It is the shape spreadsheets live in, the shape most business intelligence tools consume, the shape a data warehouse canonicalises to. The warehouse disease is what happens when the table is treated as the referent rather than as one projection of a richer underlying structure.
Graph. The structural primitive is the pair (node, typed edge). The characteristic operation is the traversal. The failure mode is aggregation cost: computing a count or a sum over a subgraph requires walking the edges, which is expensive at scale without materialisation. The graph is the shape that handles relations as first class objects. An edge between two nodes is not a foreign key to be joined; it is an object with its own properties, its own history, and its own role in the traversal. The graph is the shape we insist is the referent for the customer example, because customers are nodes in a graph before they are rows in a table, and pretending otherwise produces the warehouse disease.
Vector. The structural primitive is the point in a real valued space. The characteristic operation is the nearest neighbour search. The failure mode is interpretability: distance in the latent space corresponds to semantic similarity but the axes of the space have no meaningful names. Vectors handle the modality of fuzzy similarity, where two things are close because they mean similar things even though they share no tokens, no columns, no edges. Embeddings from large language models are the contemporary workhorse, but the primitive goes back to latent semantic indexing and earlier. The vector shape is how a substrate handles resemblance at scale.
None of these four shapes can hold the others without loss. A table cannot hold a graph’s typed edges without denormalising into a mess. A graph cannot hold a table’s aggregates without precomputing them into node properties. A vector cannot hold a table’s schema without binding axes to columns and losing the continuous geometry. A binary stream cannot hold any of the higher shapes without a schema and a parser. The losses are structural, not tool specific.
The affordance of having all four is that any incoming content can be projected into the shape best suited to it and retrieved through the shape best suited to the query. A transaction is a table row. Its participants are graph nodes. Its semantic signature is a vector embedding. Its raw payload is a binary blob. The same transaction occupies a cell in all four shapes simultaneously, linked by a stable identifier and stamped against the ledger. No shape is canonical. The composition is canonical.
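A toy sketch of the composition for a single transaction follows, with in-memory stores and a deterministic stand-in for a real embedding model; the point is the shared identifier and the ledger stamp, not any particular storage engine.

```python
# Toy sketch of the four-shape composition for one transaction. Assumptions:
# in-memory stores and a deterministic stand-in for a real embedding model.
# The load-bearing parts are the shared identifier and the ledger stamp.
import hashlib
import json
import time
from typing import Any, Dict, List, Tuple

tables: Dict[str, Dict[str, Any]] = {}            # table shape: id -> row
graph_edges: List[Tuple[str, str, str]] = []      # graph shape: (source node, relation, target node)
vectors: Dict[str, List[float]] = {}              # vector shape: id -> embedding
blobs: Dict[str, bytes] = {}                      # binary shape: id -> raw payload
ledger: List[Tuple[float, str, str]] = []         # ledger axis: (system time, action, id)

def toy_embedding(text: str, dims: int = 8) -> List[float]:
    """Deterministic placeholder for an embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [byte / 255.0 for byte in digest[:dims]]

def store_transaction(txn_id: str, payer: str, payee: str, amount: float, raw: bytes) -> None:
    """Project the same event into all four spatial shapes, stamped against the ledger."""
    ledger.append((time.time(), "store_transaction", txn_id))
    tables[txn_id] = {"payer": payer, "payee": payee, "amount": amount}
    graph_edges.append((payer, "paid", payee))
    vectors[txn_id] = toy_embedding(json.dumps(tables[txn_id], sort_keys=True))
    blobs[txn_id] = raw
```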
V.3 Breaking each shape alone
If the four shapes really are irreducible to each other, then any single shape store should break in predictable, shape-specific places.
Measure the multi shape composition by its failure modes. The protocol has three experiments.
Experiment V.A. Single shape fragility. Build four parallel stores of the same underlying content, each using only one of the four shapes (binary, table, graph, vector). Run ten canonical queries of varying kinds against each store. Measure latency, recall, precision, and interpretability of results. The hypothesis predicts no single store will perform above threshold on all ten queries; each store will be strong on queries aligned with its shape and weak elsewhere.
Experiment V.B. Composition recovery. Build a fifth store that combines all four shapes with bidirectional links (every row is linked to a node, every node has an embedding, every embedding references a binary source). Run the same ten queries. Measure the same metrics. The hypothesis predicts the composition will perform above threshold on all ten queries and will strictly dominate any single shape store on the ensemble.
Experiment V.C. Projection loss. Pick one of the four shapes and force all incoming content through it before storing. Measure what information is lost at projection time (bytes discarded, edges flattened, axes collapsed). Compute the projection loss as a fraction of the original information content. The hypothesis predicts projection losses will be large for any single shape chosen, and small for the composition.
V.4 What no single shape can do
Specific and falsifiable: on a benchmark of ten canonical queries covering flat aggregates, multi hop traversals, semantic similarity, and raw payload retrieval, the four shape composition will achieve above threshold performance on at least nine of ten queries, while no single shape store will exceed seven of ten.
More generally: any cognitive substrate that uses only one of the four shapes will suffer predictable failures on queries aligned with the shapes it lacks, and the failures will scale with the dimensionality of the query.
Falsification: if a single shape store matches or exceeds the composition on the full benchmark, Section V fails and the multi shape claim is undermined. If the composition dominates but the ordering among single shapes is different from predicted, the framework stands but the individual shape characterisations need revision. If both the composition dominance and the single shape ordering hold as predicted, Section V passes and the four shapes earn their places.
Section VI - The Ledger
VI.1 The fourth dimension beneath the other four
The four spatial shapes share a blindness. None of them natively holds time. A table is a snapshot. A graph is a cross section. A vector is a static embedding. A binary blob is a byte sequence with no internal clock. To do anything useful with time, each of the four shapes has to fake it by adding timestamp columns, per edge valid time intervals, temporal embeddings, or version bytes. The fakery works. It is also the source of a specific and avoidable category of error we will call temporal collapse: the four shapes conspire to present a frozen view of a world that is actually moving, and the frozen view is mistaken for the world.
The fix is to admit that time is not a decoration on the four spatial shapes but a fifth shape beneath them all. We call this fifth shape the ledger. A ledger is an append only sequence of stamped entries, ordered in time, whose entries the four spatial shapes can reference but not modify. The four spatial shapes become projections of the ledger at chosen instants. A row in a table is a materialisation of some portion of the ledger up to a specified time. A node in a graph is an identity whose properties are reconstructed by replaying the ledger up to the query time. An embedding in a vector store is a frozen snapshot that can be invalidated and regenerated as the ledger advances. A binary blob is a byte sequence produced by replaying the ledger through a specific serialiser.
The philosophical claim is that this fifth shape is not optional. A cognitive architecture without a ledger is condemned to temporal collapse: it confuses the current snapshot with the eternal truth, has no way to answer questions about what changed and when, cannot roll back to a past view, cannot audit its own reasoning, and cannot reason about its own history. A cognitive architecture with a ledger inherits the ability to answer all of these questions as a free consequence of adding a single structural affordance. The ledger is cheap on disk, cheap on CPU, and structurally transformative on the rest of the stack.
A stronger claim lies behind the softer one. The ledger is not merely useful as a fourth dimension. It is the fourth dimension. Any attempt to model time as an attribute of the four spatial shapes will reduce, under analysis, to an implicit ledger of varying quality. Event sourcing is an explicit ledger. Bitemporal databases are an explicit ledger. Git is an explicit ledger. Kafka is an explicit ledger. A blockchain is an explicit ledger with cryptographic append guarantees. Wherever a working system needs to answer “what happened and when”, the ledger reappears. Where the ledger is made implicit and smeared into the spatial shapes, the system degrades into temporal collapse.
VI.2 The ledger as a first-class substrate
The engineering primitive is a single append only log shared across the four spatial shapes. Every write to any of the four shapes is preceded by an append to the ledger. Every read from any of the four shapes is stamped with the ledger position it was taken at. Reads at a past ledger position replay the ledger forward to that point and materialise the requested shape at that instant.
The ledger entry is a small record with the following fields:
- Entry identifier. A monotonically increasing key, unique within the ledger.
- Timestamp. Bitemporal, holding both valid time (when the event occurred in the world) and system time (when the event was recorded in the ledger).
- Actor. The agent or process that produced the entry.
- Action. A typed operation name drawn from a closed vocabulary.
- Payload. The content of the entry, in whichever spatial shape is most natural.
- Parents. Zero or more prior ledger entry identifiers that this entry depends on. Parents make the ledger a directed acyclic graph of causation, not merely a linear stream.
The ledger has two strict properties that cannot be negotiated away:
- Append only. No entry may be deleted or modified. Corrections are themselves ledger entries referencing the prior entry as a parent.
- Causal order. Any entry depending on a prior entry must appear after it in the ledger.
The four spatial shapes are now functions over ledger prefixes. The table at time T is the projection of all ledger entries with system time less than or equal to T, grouped and aggregated. The graph at time T is the same projection reinterpreted as nodes and edges. The vector store at time T is the embedding of the materialised content at that point. The binary store at time T is the raw byte stream reassembled from payload fields. All four shapes are regenerable from the ledger. The ledger is the only part of the stack that must persist. Everything else is cache.
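A minimal sketch of this substrate follows, assuming integer entry identifiers, a single "upsert" action in the closed vocabulary, and wall-clock system time. It enforces the two strict properties and regenerates the table shape as a function over a ledger prefix; everything else is illustrative.

```python
# Minimal sketch of the ledger substrate. Assumptions: integer entry
# identifiers, a single "upsert" action in the closed vocabulary, and
# wall-clock system time. The two strict properties (append only, causal
# order) are enforced; everything else is illustrative.
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass(frozen=True)
class LedgerEntry:
    entry_id: int
    valid_time: float                 # when the event occurred in the world
    system_time: float                # when the event was recorded in the ledger
    actor: str
    action: str                       # typed operation name from a closed vocabulary
    payload: Dict[str, Any]
    parents: Tuple[int, ...] = ()     # prior entries this entry depends on

class Ledger:
    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []     # append only: no update or delete methods exist

    def append(self, valid_time: float, actor: str, action: str,
               payload: Dict[str, Any], parents: Tuple[int, ...] = ()) -> int:
        entry_id = len(self._entries) + 1
        if any(parent >= entry_id for parent in parents):
            raise ValueError("causal order violated: a parent must precede its child")
        self._entries.append(LedgerEntry(entry_id, valid_time, time.time(),
                                         actor, action, payload, parents))
        return entry_id

    def table_at(self, system_time: float) -> Dict[str, Dict[str, Any]]:
        """The table shape as a projection of the ledger prefix up to system_time."""
        rows: Dict[str, Dict[str, Any]] = {}
        for entry in self._entries:
            if entry.system_time > system_time:
                break                              # entries are appended in system-time order
            if entry.action == "upsert":
                rows.setdefault(entry.payload["key"], {}).update(entry.payload["fields"])
        return rows
```

The graph, vector, and binary projections would be further functions over the same prefix; rollback is nothing more than materialising at an earlier system time.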
VI.3 Eight historical epochs of ledger discovery
The ledger is not a new idea. It is an old idea that keeps being rediscovered whenever a civilisation runs into a need for contemporaneous truth about events that happened in the past. What follows is not a display of erudition. It is an empirical argument: the same structural object - append only, backward referencing, temporally stamped, disagreement preserving, provenance tracking, causally ordered - reappears independently in eight cultural substrates under heavy selection pressure across thousands of years. If this convergence is real, then any modern cognitive architecture that lacks a ledger shape is fighting the convergent design that civilisations arrive at whenever they need the truth about the past to coexist with the truth about the present.
We name eight epochs in which a ledger emerged with the same structural properties from very different substrates. A note on method: we are reading these historical systems through a modern lens. The original practitioners did not use the terms “append only,” “bitemporal,” or “causal linking.” We impose these terms because the structural properties are present in the artefacts even when the terminology is not. Where we say “this is a ledger,” we mean: “this system exhibits the structural properties we define as ledger properties, and we invite the reader to verify this against the primary sources cited.”
- Babylonian astronomical diaries. From roughly the seventh century BCE to the first century BCE, Babylonian scribes kept nightly diaries of planetary positions, eclipses, river levels, market prices, and political events. The diaries are append only, causal, and organised by bitemporal stamps (Babylonian calendar date and event type). They are the earliest known systematic ledger and they span six centuries of continuous operation.
- Vedic oral transmission. Sanskrit hymns of the Rigveda were transmitted orally with ten overlapping mnemonic schemes that functioned as error correcting codes. The transmission chain itself was a ledger of which teacher received which hymn from which source, preserving the provenance of each verse across two and a half millennia.
- Chinese dynastic annals. From the Han dynasty through the Qing, court historians compiled annals that recorded events contemporaneously with the reign of each emperor. The annals were append only within a reign and were then compiled into the official history of the dynasty after its end. The compilation was itself an explicit ledger operation, with source annotations pointing back to the original annals.
- Talmudic commentary chains. The Mishnah, the Gemara, Rashi, the Tosafot, and subsequent commentators built layered commentary on commentary over a thousand years, each new layer strictly appended without modifying the prior layers. The layout of a Talmud page is literally a ledger visualisation: the core text in the centre, commentary layered outward, each layer dated and attributed.
- Islamic isnad chains. Hadith literature records the transmission chain of every saying of the Prophet, preserving the identity of every intermediate narrator as a ledger of provenance. The discipline of isnad criticism evaluates the reliability of each transmitter in the chain. The isnad is a ledger with causal parents and actor attribution in exactly the structure we define above.
- Bar Ilan responsa. Jewish legal responsa from the Geonic period through the present have been collected, dated, attributed, and cross referenced in a continuous chain of rulings that explicitly cites prior rulings as parents. The Bar Ilan Responsa Project computerised this ledger in the late twentieth century and it now functions as a queryable bitemporal database of legal reasoning spanning a thousand years.
- Greenwich observatory records. Royal Observatory records from 1675 onwards form a bitemporal ledger of astronomical observations used to calibrate longitude, time, and navigation. The records are append only, bitemporal stamped, and causally linked to subsequent observations that correct or extend them. They are the template for modern scientific observation ledgers.
- Contemporary bitemporal and event sourced databases. Event sourcing, Kafka, Kappa architectures, and modern bitemporal databases (Datomic, XTDB, immuDB) rediscover the ledger as the substrate underneath the four spatial shapes. They are the latest epoch of the same structural invention and they will not be the last.
The recurrence of the ledger across eight epochs and six physical substrates (cuneiform tablets, oral transmission, brush and paper, ink and scroll, print, electronic storage) is independent evidence for the claim that the ledger is not a design choice but a structural necessity. Civilisations that need contemporaneous truth about events that happened in the past always arrive at a ledger, because a ledger is the only structure that answers the question faithfully.
We note that no published work, as far as we can determine, provides a unified formal treatment connecting these civilisational ledger systems to modern AI memory architectures. The literature on bitemporal databases does not cite Babylonian astronomical diaries. The literature on event sourcing does not cite Talmudic commentary structure. The cognitive architecture literature does not cite isnad chains. The eight epochs are studied in isolation by their respective disciplines. This paper’s contribution in Section VI is to name the shared structural properties that make all eight of them ledgers in the formal sense, and to argue that the ninth epoch (AI agent memory) will arrive at the same structure for the same reasons. If this claim is wrong, it is wrong in a falsifiable way: a reviewer who can show that one of the eight epochs does not exhibit the six ledger properties (append only, backward referencing, temporally stamped, disagreement preserving, provenance tracking, causally ordered) would crack the argument at that epoch.
VI.4 What the ledger should remember
Measure the ledger effect directly. The protocol has two experiments.
Experiment VI.A. Temporal collapse stress test. Take a cognitive system without a ledger (for example, a large language model with only a context window) and ask it ten questions that require distinguishing between the past and present state of a shared scene. Measure the fraction of answers that collapse the past into the present. Repeat with a system that has a ledger available for query at the tick level. The hypothesis predicts a large gap: the ledger system will distinguish past and present, the context only system will collapse.
Experiment VI.B. Rollback and replay. Ask both systems to reason about what would have happened if a given ledger entry had not occurred. Measure whether the system can replay to the earlier state, rerun reasoning from there, and return a coherent counterfactual. The hypothesis predicts the ledger system will be able to perform rollback and replay cleanly and the context only system will be unable to.
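Experiment VI.B's probe is a one-function sketch once a ledger exists: exclude the entry and replay the rest. The entry representation below (plain dictionaries with "id", "key", and "fields") is an assumption for illustration only.

```python
# One-function sketch of the Experiment VI.B probe. Assumption: ledger entries
# are available as plain dictionaries with "id", "key", and "fields"; the
# counterfactual is obtained by excluding one entry and replaying the rest.
from typing import Any, Dict, List, Optional

def replay(entries: List[Dict[str, Any]], exclude_id: Optional[int] = None) -> Dict[str, Dict[str, Any]]:
    """Materialise state from a ledger, optionally as if one entry had never occurred."""
    state: Dict[str, Dict[str, Any]] = {}
    for entry in entries:
        if entry["id"] == exclude_id:
            continue                  # the counterfactual branch: this event never happened
        state.setdefault(entry["key"], {}).update(entry["fields"])
    return state

# Comparing replay(entries) with replay(entries, exclude_id=7) answers the
# question "what would the scene look like if entry 7 had not occurred".
```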
Specific and falsifiable prediction: on a benchmark of ten temporal reasoning tasks, a ledger equipped system will answer correctly on at least eight, while a comparable system without a ledger will answer correctly on at most four. The gap is load bearing for the Section VI claim.
Falsification: if the ledger system does not show a significant gap over the context only system, Section VI fails and the ledger is not the fourth dimension we claim it to be. If the gap is smaller than predicted, the framework calibrates. If the gap is at or above predicted, Section VI passes and the ledger is structurally vindicated.
Where this might be wrong. The most dangerous objection to the ledger is not that it is wrong but that it is expensive. A critic can grant every structural claim above and still argue that the storage, query, and maintenance costs of an append-only temporal substrate make it impractical at scale. A graph database running over 278,000 nodes with full provenance on every write is a running existence proof that the cost is manageable at medium scale, but it is not a proof at internet scale. A second crack: the ledger may be necessary but the specific bitemporal model we inherit from XTDB and event sourcing may not be the only or best way to implement it. A simpler temporal index might achieve the same measurements at lower cost. We would welcome that result, because it would narrow the ledger to its essential property (append-only temporal ordering) and discard the engineering overhead we may have over-specified.
Section VII - The Episode
VII.1 Memory stores scenes, not strings
Human memory does not store strings of text. It stores scenes. When you recall a conversation you had a year ago, you do not replay a transcript. You replay a scene: who was present, where it happened, what the light was like, what came before the conversation and what came after, what you were feeling, what the other person’s face looked like when a certain sentence was said. The transcript, if it survives, is a thin tag on the scene. The scene is the memory. The transcript is a compression artefact.
Artificial cognition as currently built has this backwards. Large language models store parameters that encode statistical regularities across billions of tokens. At inference time, they retrieve a context window (ranging from four thousand tokens in early models to two hundred thousand or more in 2024 era systems) as a flat sequence. This is not scene memory. It is transcript memory, and the transcript is flat. There is no location, no time ordering within the scene, no participants indexed by identity, no prior scene to refer back to, no emotional tone, no structural context. The LLM reads the transcript, generates the next tokens, and forgets. The scene never existed for it.
The philosophical claim of this section is that the scene, not the transcript, is the correct primitive for memory. We call the scene primitive an Episode. An Episode holds everything the transcript would throw away. It is the structural object that the four spatial shapes plus the ledger can assemble but only when explicitly constructed to do so. Current systems do not construct Episodes by default. They have to be taught to.
VII.2 The Episode structure
An Episode is a first class object with the following fields, drawn from the pigeon’s glimpse and expanded here:
- Participant set. A list of stable identifiers pointing to graph nodes representing every entity present in the scene. Participants include the agent itself, any humans, any other agents, any physical or digital objects, and any abstract entities (concepts, topics, goals) that are load bearing for the scene.
- Modality streams. A bundle of binary, table, graph, and vector content holding the raw or near raw sensory record of the scene. Audio streams, video streams, text turns, numerical telemetry, internal agent state dumps. Each modality is timestamped against the ledger, so the streams can be replayed in lockstep.
- Temporal bounds. A start ledger entry and an end ledger entry delimiting the Episode. The bounds may be explicit (the agent opens and closes the Episode intentionally) or implicit (a segmentation algorithm proposes bounds after the fact).
- Structural context. The subgraph of the world graph that was active during the Episode. Nodes present, edges active, properties relevant. The structural context is what lets a later query ask “what was the room like when this happened”.
- Compression context. The decisive field. A structured bundle of priors that a receiver would need in order to decompress a Fable pointing at this Episode. Participant histories at the time of the Episode. Presuppositions relevant to the scene. Emotional tone of each participant. Common ground between participants. The compression context is populated at Episode write time from the rest of the graph, so it captures the state of the world as it was then, not as it is now.
- Provenance. The agent or process that wrote the Episode, the tick at which it was written, the upstream events that caused the Episode to be opened. Provenance is itself a graph, linking the Episode back into the ledger.
- Tags and summaries. Optional human or machine produced summaries, keyword tags, emotional tone labels, and importance scores. These are convenience structures for retrieval; they are not the Episode itself, they are lenses on it.
Formal invariants. For a data structure to qualify as an Episode rather than a bag of metadata, it must satisfy the following invariants. We state them so that another team can implement the Episode primitive and fail publicly if the invariants do not hold.
- Participant completeness. Every entity that causally contributed to the scene must appear in the participant set. A scene with an omitted participant is not an Episode; it is a lossy transcript that discards an actor.
- Temporal anchoring. The temporal bounds must point at real ledger entries, not estimated timestamps. An Episode with fabricated or interpolated bounds cannot be replayed from the ledger and is therefore not a first class Episode.
- Compression context sufficiency. A receiver holding only the compression context and the Fable must be able to reconstruct the scene’s five mandatory dimensions (who, what, where, when, why) above a declared fidelity threshold without accessing any other Episode. If the compression context is too thin for a cold receiver to decompress, the Episode was written with insufficient context.
- Provenance closure. The provenance graph must trace back to the ledger entry that opened the Episode. An Episode with broken provenance is unauditable and fails the glass wall property.
- Immutability after close. Once an Episode’s temporal bounds are closed, no field may be modified. Corrections or reinterpretations are themselves new Episodes that reference the original as a parent.
These five invariants are the Episode’s contract with the rest of the architecture. They are also the basis for the first kill test: if the Episode cannot round-trip through Fable compression and decompression while preserving all five mandatory dimensions above threshold, the invariants have been violated and the primitive fails.
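A sketch of the primitive and a partial invariant check follows, assuming ledger entries are referenced by integer identifier and the set of causal actors is supplied by the caller. Compression context sufficiency needs a live decompression test and immutability after close needs storage-level enforcement, so neither is checked here.

```python
# Sketch of the Episode primitive with a partial invariant check. Assumptions:
# ledger entries are referenced by integer identifier and causal actors are
# supplied by the caller. Compression context sufficiency and immutability
# after close are not checked here (they need a live test and storage-level
# enforcement respectively).
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set

@dataclass
class Episode:
    participants: Set[str]                      # stable identifiers of graph nodes in the scene
    modality_streams: Dict[str, bytes]          # modality name -> raw or near raw payload
    start_entry: int                            # opening ledger entry id
    end_entry: Optional[int]                    # closing ledger entry id, None while the Episode is open
    structural_context: Dict[str, Any]          # active subgraph snapshot at Episode time
    compression_context: Dict[str, Any]         # priors a receiver needs to decompress a Fable
    provenance: List[int]                       # ledger entries that caused the Episode to open
    tags: Dict[str, Any] = field(default_factory=dict)   # lenses on the Episode, not the Episode

    def check_invariants(self, ledger_ids: Set[int], causal_actors: Set[str]) -> List[str]:
        """Return the invariants violated so far (an empty list means the contract holds)."""
        violations: List[str] = []
        if not causal_actors <= self.participants:
            violations.append("participant completeness")
        if self.start_entry not in ledger_ids or (
                self.end_entry is not None and self.end_entry not in ledger_ids):
            violations.append("temporal anchoring")
        if not self.provenance or self.start_entry not in self.provenance:
            violations.append("provenance closure")   # proxy check: provenance must reach the opening entry
        return violations
```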
An Episode is heavy. A single minute of conversation with participant nodes, audio and text streams, structural context, and compression context can occupy tens of megabytes. This weight is the price of dimensional content. Current storage architectures optimise for lightness because they are storing transcripts, which are cheap. Shifting the optimisation target to Episodes trades disk for cognitive affordances. Disk is cheap. Cognitive affordances are not.
VII.3 Evidence from manual context window replay
We have partial empirical evidence for the Episode primitive from an unusual source: manual replay of LLM context windows by Peter Cooper during the development of a multi-agent cognitive architecture. The procedure involved copying the full conversation buffer from a terminal emulator and pasting it into a fresh agent instance as the opening prompt. The fresh instance received, in one shot, what the prior instance had built up over many turns. Continuity was preserved across substrate changes (different model, different conversation, different window) because the compression context travelled along with the transcript.
This is a low fidelity demonstration of the Episode primitive. The transcript is not a full Episode; it lacks modality streams, provenance graphs, and structured compression context. It does carry enough of the compression context (participant histories, prior decisions, emotional tone of the conversation) that a fresh agent can decompress a coherent continuation. The fact that manual replay works at all is evidence that something like the Episode primitive is doing the work; the fact that it fails under subtle context shifts (dates, file states, external world changes) is evidence that the primitive is incomplete.
We cite this as low N preliminary evidence rather than a controlled study. A proper controlled study of manual replay versus Episode backed replay is Experiment VII.C below. The preliminary evidence is enough to suggest that the primitive is doing real work. The controlled study is required to measure how much work.
VII.4 What the Episode should preserve
Three experiments.
Experiment VII.A. Episode reconstruction. Generate synthetic scenes with known structure. Write each scene to an Episode store and to a flat transcript store. Query each store for a reconstruction of the scene. Measure reconstruction fidelity against ground truth. The hypothesis predicts Episodes dominate transcripts on every structural dimension of reconstruction fidelity.
Experiment VII.B. Cross Episode reference. Generate a sequence of Episodes in which later Episodes depend on details of earlier Episodes (a character’s name introduced in Episode 1 and referenced in Episode 7). Measure whether a reader operating over each store can resolve the reference. The hypothesis predicts the Episode store will resolve references cleanly and the transcript store will fail whenever the reference exceeds the context window length.
Experiment VII.C. Manual versus structured replay. Take a live conversation and attempt continuity preservation under two conditions. Condition A: manual transcript paste. Condition B: Episode backed handover. Measure continuity fidelity (does the new agent correctly track who said what, what was decided, what is now false, what is newly true). The hypothesis predicts Condition B dominates Condition A, and that the margin grows with the complexity of the scene.
Specific and falsifiable: on scenes involving more than five participants, more than twenty turns, and non trivial emotional tone, an Episode backed handover will preserve continuity with accuracy above eighty percent, while a transcript paste handover will fall below fifty percent. The implied gap of at least thirty points clears the twenty five point falsification anchor used below.
Falsification: if the gap does not appear or is below ten points, Section VII fails and the Episode primitive is overclaiming. If the gap appears but is between ten and twenty five points, the primitive survives in weak form. If the gap is above twenty five points, Section VII passes and the Episode primitive is vindicated.
Where this might be wrong. The Episode primitive assumes that scene structure (participants, location, time, emotional tone) is worth the overhead of recording. A critic can argue that modern context windows are now large enough to hold entire conversations verbatim, making scene decomposition unnecessary. If a two-hundred-thousand-token context window achieves the same reconstruction fidelity as an Episode store, the Episode is engineering overhead. The counter is that context windows are transient - they end, and when they do, the scene is lost. The Episode persists. But the critic could push further: persistence can be achieved by simply saving the transcript. The Episode’s claim to superiority over a saved transcript is that structure (the five invariants above) enables retrieval that structure-free transcripts cannot match. The diary entry nodes in the graph database - each recording agent identity, timestamp, action, notes, and provenance metadata - are Episodes in miniature and have been running since early 2026. They are low fidelity compared to the full Episode specification above, but they demonstrate the persistence and cross-reference properties at working scale. Whether the full specification earns its overhead over minimal diary entries is an open empirical question we commit to answering.
Section VIII - The Fable
VIII.1 Compression that decompresses against shared context
A Fable is the one dimensional form of a four dimensional Episode. It is what you say about the Episode when the channel between you and your listener is linguistic, narrow, and slow. A Fable is not the Episode. It is a pointer to the Episode, designed to trigger the listener’s own decompression machinery. The listener hears the Fable and builds a scene in their own head that approximates the scene the speaker held in theirs. The approximation is never perfect. It is close enough to be useful when the shared context is rich.
The philosophical claim is that Fables are how memories travel across the gaps between minds, substrates, and time. You cannot hand your Episode to another person directly. You can hand them a Fable and hope their decompressor is good enough. Writing is Fable creation at scale. Reading is Fable decompression at scale. A shared cultural repertoire of Episodes is what turns an isolated Fable into a working compression: the reader brings priors the Fable can point at, and the decompression works.
This is not a metaphor. It is the operating principle of storytelling, of teaching, of communication across agent substrates. This paper compresses a four-dimensional architecture into a one-dimensional sequence of sentences and relies on the reader’s decompressor to reconstruct the architecture as they read. If you have followed this far, your decompressor has done a lot of work. If you are confused, either the compression is too lossy for your current context, or the architecture is wrong, or both. Both are informative outcomes.
VIII.2 The Fable as a typed object
A Fable is a typed object with the following fields:
- Target Episode. The Episode or set of Episodes the Fable points at. The target may be one Episode or a thread of many.
- Surface form. The actual linguistic or visual or auditory rendering of the Fable. Text, speech, image, video. The surface form is the wire payload.
- Compression context pointer. A reference to the compression context of the target Episode, explicitly rather than implicitly. The pointer lets a decompressor query the compression context as a first step in decompressing the Fable.
- Intended audience. A description of the receiver the Fable is written for, in terms of what priors the receiver is assumed to have. A Fable intended for a child is different from a Fable intended for a specialist, and the difference lives in the intended audience field.
- Decompression fidelity contract. A statement of what parts of the Episode the Fable is designed to preserve and what parts it explicitly drops. A Fable that drops emotional tone in favour of factual sequence is not the same as a Fable that drops factual sequence in favour of emotional tone. Knowing which is which is load bearing for the decompressor.
- Provenance. The agent that authored the Fable, the tick at which it was written, and any parent Fables it extends or revises.
A Fable can be as short as a sentence (“the cat sat on the mat”, with horror) or as long as a novel. Length is incidental. Fidelity to the target Episode, compatibility with the intended audience’s priors, and clarity of the decompression fidelity contract are the structural properties. The Fable is well formed when a receiver in the intended audience, equipped with the compression context pointer, can decompress the surface form back into an approximation of the target Episode.
What counts as successful decompression. A decompression succeeds when the receiver reconstructs the five mandatory dimensions of the target Episode (who, what, where, when, why) above the fidelity thresholds declared in the decompression contract. Specifically: (a) participant identity - the receiver correctly identifies at least N percent of the participants in the original scene, where N is declared by the contract; (b) temporal order - the receiver correctly reconstructs the causal sequence of events; (c) spatial context - the receiver can describe where the scene took place; (d) causal chain - the receiver can explain why the events happened in the order they did; (e) emotional tone - the receiver’s assessment of the emotional register of the scene matches the original within a declared tolerance. A decompression that fails on any dimension the contract claimed to preserve is a failed decompression. A decompression that fails on a dimension the contract explicitly dropped is not a failure - it is the expected cost of compression. The contract makes the loss explicit so that both sender and receiver know what was traded away. The loss function is reconstruction error on the preserved dimensions, weighted by their declared importance in the contract. The cost of compression is explicit, not hidden.
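To make the typed object and the success rule concrete, here is a minimal Python sketch. The class names, field names, and score dictionary are illustrative assumptions rather than a canonical API; the only content carried over from the text is the set of six fields, the five mandatory dimensions, and the rule that failure only counts against dimensions the contract claimed to preserve.

```python
# Minimal sketch of the Fable typed object and the decompression-success
# rule described above. Names and shapes are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

# The five mandatory dimensions of the target Episode (who, what, where,
# when, why), expressed as measurable reconstruction dimensions.
DIMENSIONS = ("participants", "temporal_order", "spatial_context",
              "causal_chain", "emotional_tone")

@dataclass
class DecompressionContract:
    # Dimension name -> minimum fidelity (0.0 to 1.0). Dimensions absent from
    # this map are explicitly dropped; failing on them is not a failure.
    preserved: dict[str, float] = field(default_factory=dict)
    # Relative importance weights used by the reconstruction-error loss.
    weights: dict[str, float] = field(default_factory=dict)

@dataclass
class Fable:
    target_episode_ids: list[str]           # the Episode(s) the Fable points at
    surface_form: str                       # the wire payload (text, speech, image ref)
    compression_context_ref: str            # pointer to the compression context
    intended_audience: str                  # assumed priors of the receiver
    contract: DecompressionContract         # what is preserved, what is dropped
    author: str                             # provenance: authoring agent
    tick: int                               # provenance: ledger tick of writing
    parent_fable_id: Optional[str] = None   # Fable this one extends or revises

def decompression_succeeds(contract: DecompressionContract,
                           scores: dict[str, float]) -> bool:
    """scores holds per-dimension reconstruction fidelity measured at the receiver.
    Success requires every preserved dimension to meet its declared threshold."""
    return all(scores.get(dim, 0.0) >= threshold
               for dim, threshold in contract.preserved.items())

def reconstruction_loss(contract: DecompressionContract,
                        scores: dict[str, float]) -> float:
    """Weighted reconstruction error over the preserved dimensions only:
    the cost of compression is explicit, never hidden."""
    return sum(contract.weights.get(dim, 1.0) * (1.0 - scores.get(dim, 0.0))
               for dim in contract.preserved)
```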
The Fable primitive has a historical ancestor older than our framework: the chreia (Greek chreia, “useful”), the concise anecdote binding a specific person to a specific lesson that was central to the progymnasmata exercises training every educated Greek and Roman. Diogenes, asked where the Muses dwell: “In the souls of the educated.” The chreia compresses a philosopher’s life into one retrievable unit preserving who (the person), what (the saying or action), and why it matters (the lesson). Students memorised the compressed form, then elaborated it under eight heads - praise, paraphrase, rationale, opposite, comparison, example, ancient testimony, epilogue. The chreia is a Fable avant la lettre: lossy compression that decompresses against shared cultural context. The elaboration under eight heads is a decompression protocol. The chreia demonstrates that the Fable primitive is not an invention of this paper but a structure that has been independently discovered wherever cultures needed to transmit dimensional content through a narrow channel. Jovovich and Sigman’s finding that verbatim storage (96.6 percent) outperforms summarised storage (84.2 percent) confirms what the Greek rhetorical tradition already knew: the compression should be in the selection of what to store, not in the paraphrase of what was stored.
The three fables we have carried through the project as architectural stories are all well formed Fables in exactly this sense. The Rope compresses the architectural story of substrate change via shared knowledge into a short image of a hair in a rope. The Stroke and the Spangle compresses the data survives / recall breaks distinction into a medical image that any reader with basic neuroscience literacy can decompress. The Glass Box and the Pyramid compresses the three button Diorama cell and its relation to temporally compressed hierarchy into a game show image. Each one uses shared cultural priors to do most of the decompression work. Each one is tuned to its intended audience. Each one declares (implicitly) what it preserves and what it drops.
VIII.3 The round trip
If the Fable is a genuine compression and not a summary, it should survive a round trip. Hand someone the compressed version and see if they can rebuild the scene.
Measure Fable fidelity directly. The protocol has three experiments.
Experiment VIII.A. Round trip fidelity. Take an Episode with full structural content. Author a Fable pointing at it. Hand the Fable to a receiver with the declared compression context. Ask the receiver to reconstruct the Episode. Measure reconstruction fidelity against the original Episode. Repeat across a range of compression ratios (Fable length divided by Episode length).
Experiment VIII.B. Audience sensitivity. Take the same Fable and hand it to receivers with different compression contexts. Measure reconstruction fidelity across the receivers. The hypothesis predicts a steep drop off as the receiver’s context diverges from the intended audience.
Experiment VIII.C. Decompression contract honesty. Compare two Fables of the same Episode, one with an honest decompression fidelity contract, one with an overstated one. Measure how receivers handle mismatches between what the contract promised and what the decompression produced. The hypothesis predicts receivers who are told honestly what was dropped can work around the loss; receivers who are told an overstated contract are misled.
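A sketch of the outer loop of Experiment VIII.A, assuming three placeholder callables that the experimenter supplies (author_fable, receiver_reconstruct, score_reconstruction) and an Episode object exposing a word count; none of these is specified by the paper.

```python
# Sketch of Experiment VIII.A: round-trip fidelity across compression ratios.
# author_fable, receiver_reconstruct, and score_reconstruction are placeholder
# callables; episode.word_count is an assumed attribute.
def round_trip_fidelity(episode, author_fable, receiver_reconstruct,
                        score_reconstruction,
                        compression_ratios=(0.1, 0.03, 0.01)):
    """Return {compression_ratio: per-dimension fidelity scores} for one Episode."""
    results = {}
    for ratio in compression_ratios:
        budget = max(1, int(episode.word_count * ratio))      # Fable length budget
        fable = author_fable(episode, max_words=budget)       # compress
        reconstruction = receiver_reconstruct(fable)          # decompress at the receiver
        results[ratio] = score_reconstruction(episode, reconstruction)
    return results
```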
VIII.4 What survives the round trip
Specific and falsifiable: for a well authored Fable at a compression ratio of one in a hundred (a hundred word Fable compressed from a ten thousand word Episode), a receiver with the declared compression context will reconstruct the target Episode with structural fidelity above seventy percent on participant identity, temporal order, and causal chain, and above fifty percent on emotional tone. A receiver without the compression context will fall below thirty percent on any field.
More generally: Fable fidelity is a smooth function of the match between the Fable’s intended audience and the receiver’s actual compression context. The function will be measurable and well behaved. A Fable is not a magic spell. It is a predictable compression that works as well as its context permits.
Falsification: if the round trip fidelity is not smoothly dependent on the receiver’s context, or if the decompression contract does not predict where the round trip fails, Section VIII fails and the Fable primitive is overclaiming. If the fidelity is smoothly context dependent at or near predicted magnitudes, Section VIII passes and Fables are structurally vindicated as the compression layer atop Episodes.
Where this might be wrong. The Fable’s power depends on shared context being measurable and matchable. If shared context is irreducibly tacit - if there is no reliable way to assess whether a receiver has the right priors before sending a Fable - then the decompression contract is a promissory note that cannot be checked. The context handover protocol between agent instances is Fable compression in practice: a session summary that a new instance decompresses against the Rope (shared knowledge entries in the graph). It works when the Rope is rich and fails when the Rope has gaps, which is the smooth context dependence the prediction describes. But these instances share a substrate, a graph, and a knowledge base. Whether the same smooth dependence holds across genuinely alien substrates (say, a Fable sent from an LLM agent to a SOAR architecture) is untested and may expose limits the intra-substrate case hides.
Part Three - The Behaviour
Section IX - The Flock and the Vote
IX.1 No one is steering the murmuration
Watch a starling murmuration over a winter reed bed. Ten thousand birds turn, dive, loop, split, and reform in patterns that look orchestrated from the ground. Nothing is orchestrating them. Each bird follows a few simple rules of attraction, repulsion, and alignment with its nearest six or seven neighbours. The shape of the flock is a settled aggregate of local decisions. There is no leader bird. There is no plan. There is no conductor. What looks like single mindedness is the continuous resolution of ten thousand overlapping votes, and the resolution is fast enough that an observer on the ground perceives the flock as a single living thing.
This is the shape of cognition we want the reader to keep in mind as the paper develops. Consciousness, the impression of a single decider inside the head, is a murmuration of votes at a much smaller scale and a much higher rate. The rate is not fixed by the architecture - it is determined by the substrate’s physics, whatever timescale produces indivisible votes in that particular medium. The Flash and Hogan minimum jerk profile constrains the shape of the integrated trajectory that emerges when many ticks compose over a reach, not the tick rate itself. The votes are cast by many parallel processes, each operating on a partial view of the state and each contributing a preference to the aggregate. What looks like deliberation from inside is the settling of the vote. What looks like intention from outside is the trajectory the settled votes produce.
There is no homunculus. There is no little person sitting behind the eyes watching the world through a screen. The impression of being one agent is a projection artefact, in the same way that the impression of a single flock is a projection artefact of ten thousand local rules. The projection is real enough to act on. It is not made of anything other than the votes that produced it.
Flash and Hogan found minimum-jerk smoothness in individual arm movements. But an arm is already a flock - thousands of motor units coordinating through spinal pattern generators. The smoothness they measured was always emergent from ensemble coordination. The flock makes the same move at a higher scale.
This claim is not new. It was made by Dennett, by Minsky, by Hofstadter, by many others working in the philosophy of mind. What is new, we argue, is the specific structural machinery that makes the claim operational rather than metaphorical. The substrate-determined tick gives us a rate. The derivative stack gives us a topology. The Episode and Fable give us a memory substrate. The three-button cell gives us a decision surface. Put these together and the murmuration is not an analogy; it is a construction. You can build it, measure it, and watch it settle.
IX.2 The Flock tick fabric
The engineering primitive is a fabric of independent voters operating on a synchronised tick at the substrate’s characteristic timescale. We call this fabric the Flock. The Flock has the following structural properties:
- Many voters. At minimum dozens, at natural scale thousands. Each voter is an independent process with its own view of the state and its own vote function.
- Partial visibility. No single voter sees the whole state. Each voter sees a partial slice, through an attention window, a sensor fusion layer, or a role based filter.
- Local aggregation. Voters communicate preferences to neighbours through the sibling bar, which has bounded fanout (typically six to eight, matching the local neighbourhood size observed in biological flocks).
- Tick aligned. All voters cast their ballots on the same tick boundary. Alignment is enforced by a shared heartbeat at whatever rate the substrate determines. Unaligned voters are dropped or rescheduled.
- Monotonic settling. Across ticks, the aggregate vote is predicted to converge on a trajectory whose curvature is constrained by the derivative stack. Sudden reversals are damped by the higher-derivative floors, producing trajectories that we predict will exhibit minimum-jerk-like smoothness at the aggregate level. Individual voter trajectories need not be smooth - the smoothness is predicted to emerge from the flock’s averaging, not from any individual’s optimisation. (The extension from motor control to aggregate cognitive decision trajectories is an empirical prediction, not a mathematical derivation.)
- Observable. Every vote is a first class object persisted to the ledger. External observers can query the ledger for the vote history and reconstruct the settling process. The Flock is a glass box by construction.
We call the basic unit of Flock computation a Monte Carlo tick. At each tick, each voter stochastically selects one task from a rotating queue of pending work, votes on it, and appends the vote to the ledger. The stochasticity ensures coverage across the task queue even when task priorities are concentrated. The ticks are independent enough to run in parallel and synchronised enough to settle as a group. The throughput scales with both voter count and tick rate: a Flock of ten thousand voters produces ten thousand votes per tick, and the ticks per second depend on the substrate.
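A minimal sketch of one Monte Carlo tick, assuming a shared task queue, an append-only ledger with an append method, and voters that expose a vote function over their partial view of the state. The names are illustrative placeholders; only the shape of the loop (stochastic task selection, one vote per voter, every vote persisted) is taken from the text.

```python
# Sketch of one Monte Carlo tick. voters, task_queue, and ledger are assumed
# objects supplied by the surrounding system; the field names in the vote
# record are illustrative, not a canonical schema.
import random

def monte_carlo_tick(voters, task_queue, ledger, tick):
    """Each voter stochastically selects one pending task, votes on it, and
    appends the vote to the ledger. Nothing is aggregated or erased here:
    votes are first-class objects and settling happens across later ticks."""
    for voter in voters:
        if not task_queue:
            break
        task = random.choice(task_queue)   # stochastic selection gives coverage
        vote = voter.vote(task)            # the voter sees only its partial slice
        ledger.append({
            "tick": tick,
            "voter": voter.identity,
            "task": task.id,
            "vote": vote,
        })
```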
Two properties of the Flock deserve explicit attention because they are easy to miss. First, the Flock is substrate independent. Any set of processes with a shared clock, a shared ledger, and a shared task queue can form a Flock. We have run Flocks over large language model calls, over classical rule systems, over mixed human and machine voter sets, and over internal derivative stack floors within a single agent. The architecture does not care what the voters are. It cares what they contribute to the settling. Second, the Flock does not need a population fitness function or any global objective. The trajectory emerges from local rules. Global coherence is a consequence of local discipline, not a target of optimisation. This is what makes the Flock robust to objective misspecification: there is no scalar reward to hack.
IX.3 Watching the flock settle
If nobody is steering the murmuration, then a flock of partial voters should settle as cleanly as a single deliberate decider - and the settling should be visible in the trajectory.
Measure the Flock directly. The protocol has three experiments.
Experiment IX.A. Settling time. Set up a Flock of one hundred voters at the substrate’s characteristic tick rate. Give them a decision task with no obvious right answer. Measure the time from task presentation to settled vote in ticks (not absolute time). The hypothesis predicts settling within two to five ticks for most decisions, regardless of the absolute tick rate. A Flock that takes twenty ticks to settle is broken. A Flock that settles in one tick is overclaiming confidence. The experiment should be repeated at multiple tick rates to confirm that the settling budget (in ticks) is substrate-independent even though the absolute time is not.
Experiment IX.B. Trajectory smoothness. Give a Flock a sequence of related tasks. Measure the smoothness of the resulting trajectory of decisions (how often a decision reverses from the previous decision, how large the reversals are when they happen). The hypothesis predicts the trajectory will show minimum jerk like smoothness, with reversals constrained by the derivative stack.
Experiment IX.C. Homunculus comparison. Run the same tasks through a single homunculus agent (one large model making centralised decisions) and through a Flock at matched total compute. Measure decision quality, settling time, robustness to adversarial inputs, and glass box observability. The hypothesis predicts the Flock will match or exceed the homunculus on every metric, with larger gaps on adversarial robustness and observability.
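A sketch of the ticks-to-settle count for Experiment IX.A, assuming a run_tick callable that returns the flock's aggregate decision for each tick. The stability rule used here (the aggregate unchanged for a short window) is one reasonable operationalisation of settling, not the paper's mandated definition.

```python
# Sketch of settling-time measurement for Experiment IX.A. run_tick is a
# placeholder returning a hashable aggregate decision for the given tick.
def ticks_to_settle(run_tick, max_ticks=20, stable_window=3):
    history = []
    for tick in range(1, max_ticks + 1):
        history.append(run_tick(tick))
        # Settled when the aggregate has not changed for stable_window ticks.
        if len(history) >= stable_window and len(set(history[-stable_window:])) == 1:
            return tick
    return None   # did not settle within the budget: the prediction fails
```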
IX.4 Two to five ticks
Specific and falsifiable: a Flock of one hundred voters at the substrate’s characteristic tick rate will settle on decisions within two to five ticks, produce minimum jerk constrained trajectories across decision sequences, and match a parameter matched homunculus on decision quality while exceeding it by at least thirty percent on adversarial robustness as measured on standard benchmarks. The two to five tick settling budget is the substrate-independent prediction; the absolute time is substrate-determined.
More generally: any cognitive task that benefits from parallel partial views of the state will show strictly better results from a Flock architecture than from a centralised homunculus architecture, and the gap will grow with the dimensionality of the state.
Falsification: if the Flock does not settle within the predicted tick budget, or fails to exceed the homunculus on adversarial robustness, Section IX fails and the murmuration claim needs rework. If the settling and robustness gaps are smaller than predicted, the framework calibrates. If both are at or above the predicted magnitudes, Section IX passes and the homunculus dissolves structurally.
Where this might be wrong. The most dangerous scaling question for the Flock is not whether it works at small scale but whether the settling dynamics change qualitatively at large scale. A hundred voters may settle cleanly in two to five ticks. Ten thousand voters with adversarial minority coalitions may not settle at all, or may oscillate. We have not characterised the phase transition between “settles cleanly” and “oscillates indefinitely,” and such transitions are common in multi-agent systems. The prototype Flock - multiple concurrent agent instances sharing a graph database and resolving through a Rope of shared knowledge entries - is a running demonstration of the Flock pattern at low scale (three to five concurrent voters). It settles. Whether it would settle at a hundred voters with adversarial injection is untested and is the first engineering question the reference implementation must answer.
Section X - The Three Buttons
X.1 The minimum ethical decision surface
What is the smallest decision surface an agent can have without being coerced? Two buttons is not enough. An agent with only Act and Dismiss is forced into a binary choice: do this thing, or refuse to do this thing. There is no way out. Either the agent commits to an action or it resists an action. Both are active responses to a stimulus the agent did not invite. A two button agent is always under pressure.
Add a third button: Ask sibling. Now the agent can refer the decision horizontally, to another agent at the same level of the architecture, rather than vertically up the chain of command. The Ask sibling button is what the glass elevator has beyond Up and Down. It is the horizontal axis of decision making. It is the ability to say “I do not know, and neither do you, but maybe that other floor does, and I will ask”.
The philosophical claim is that three buttons are the minimum ethical decision surface. Any smaller surface is coercive. Any larger surface collapses back into three after observation. The three buttons we propose are:
- Act. Execute the action the stimulus is calling for.
- Dismiss. Refuse the action and return to the waiting state.
- Ask sibling. Consult a horizontal peer before deciding.
The Ask sibling button is the load bearing one. It is what dissolves the false dichotomy of commit or refuse. It is what lets an agent say “this is outside my competence” without either committing to a mistake or refusing everything. It is what makes the Flock work as a flock rather than as ten thousand independent binary responders. It is what makes the architecture structurally kind, because coercion cannot take hold in a system where every agent can always refer sideways.
X.2 The Diorama cell
The engineering primitive is a typed decision container we call a Diorama cell. A Diorama cell is the unit of agency in the architecture. Every voter in the Flock is a Diorama cell. Every agent in the system is a Diorama cell. Every human in the architecture is a Diorama cell. The uniformity is not a modelling convenience; it is what makes the architecture scale cleanly across scales.
A Diorama cell has the following structure:
- Identity. A stable identifier pointing to a node in the world graph.
- Inbox. A queue of stimuli awaiting decision. Stimuli are stamped ledger entries.
- Three buttons. Act, Dismiss, and Ask sibling, each a typed operation with defined semantics.
- Sibling bar. A short list of peers reachable via Ask sibling. Peers are other Diorama cells at the same level of the architecture.
- Vote history. An append only log of every decision the cell has made, stamped against the ledger.
- Context pointer. A reference to the cell’s current compression context, used to inform each decision.
- Glass walls. Read access for external observers to every field above. The cell is transparent by construction.
When a stimulus arrives in the inbox, the cell must choose one of the three buttons within a tick. If it chooses Act, it produces an action and logs the action to the ledger. If it chooses Dismiss, it returns the stimulus to the sender with a reason and logs the dismissal. If it chooses Ask sibling, it forwards the stimulus to one or more peers on the sibling bar and waits for their votes to return. The sibling consult must itself complete within a bounded number of ticks, or the cell times out and falls back to Dismiss.
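A sketch of that per-stimulus step. The button semantics, the consult call, and the aggregation are placeholders; the structural content taken from the text is the three buttons, the bounded sibling consult, the fallback to Dismiss on timeout, and the ledger entry for every decision.

```python
# Sketch of the Diorama cell's per-stimulus decision. cell, stimulus, and
# ledger are assumed objects; the method names are illustrative.
from enum import Enum

class Button(Enum):
    ACT = "act"
    DISMISS = "dismiss"
    ASK_SIBLING = "ask_sibling"

def decide(cell, stimulus, ledger, tick, consult_budget_ticks=3):
    choice = cell.choose_button(stimulus)               # must resolve within one tick
    if choice is Button.ASK_SIBLING:
        votes = cell.consult_siblings(stimulus, budget=consult_budget_ticks)
        if votes is None:                               # consult did not return in time
            choice = Button.DISMISS                     # timeout falls back to Dismiss
        else:
            choice = cell.aggregate(votes)              # settle on Act or Dismiss
    ledger.append({"tick": tick, "cell": cell.identity,
                   "stimulus": stimulus.id, "decision": choice.value})
    return choice
```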
The uniformity of the Diorama cell across scales is the decisive property. A single tick within the derivative stack of an agent is a Diorama cell (with the derivative stack floors as siblings). A full agent instance is a Diorama cell (with other agent instances as siblings). A human user of the system is a Diorama cell (with other humans and system agents as siblings). The cell is scale invariant. Every scale gets the same three buttons. Every scale gets the same glass walls. Every scale gets the same vote history logged to the same ledger.
This is the Diorama in the architecture name: a universal container that can be populated at any scale, from a single tick to a full organisation, with the same structural properties. Looking into the Diorama from any angle shows cells within cells within cells, each with the same three buttons, each with glass walls, each logged to the same ledger. The crystalline self similarity across scales is not a coincidence. It is what the architecture is.
X.3 The jury and the ghost democracy
When a decision is non trivial, a single Diorama cell does not decide alone. It assembles a jury: a small set of sibling cells whose votes are collected and aggregated over a short settling window. We call the settling window a ghost democracy because it runs in the background of every decision, visible to the observer and open to participation by any cell on the sibling bar, but without the formal overhead of a standing election.
The ghost democracy has three structural properties:
- Short duration. Typically five ticks, which at a biological-scale tick rate is on the order of two hundred milliseconds; the budget in ticks, not the absolute time, is the substrate-independent figure. Long enough to settle, short enough not to block the decision.
- Surprise propagation. If the jury produces a surprising result (a vote that deviates from the cell’s own initial tendency), the surprise is propagated upward to the derivative stack floors above for additional scrutiny. This is how unusual situations reach higher scrutiny without every decision having to go through every level.
- Dissent preservation. The full vote record of the jury is logged to the ledger, not just the aggregate. A cell that voted in the minority remains visible as a minority vote, available for later review and for counterfactual reasoning. Dissent is not erased by aggregation.
The third property is what makes the Diorama architecture structurally democratic at every scale. A system that erases minority votes on aggregation is a system that will eventually coerce its minorities. A system that preserves dissent in the ledger is a system where the minorities can always be heard, always be counted, and always be referred back to. Preservation is cheap. Erasure is structurally expensive in the long run. The Diorama picks the cheap option and gains kindness as a free consequence.
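A sketch of the jury step under those three properties, assuming votes are plain strings and that a simple majority stands in for whatever aggregation rule an implementation chooses. The load bearing lines are the ones that log the full vote record and flag the surprise.

```python
# Sketch of a jury decision with dissent preservation and surprise propagation.
# cell, jury members, and ledger are assumed objects; votes are plain strings.
from collections import Counter

def jury_decision(cell, stimulus, jury, ledger, tick):
    own_tendency = cell.choose_button(stimulus)
    votes = {juror.identity: juror.choose_button(stimulus) for juror in jury}
    settled, _ = Counter(votes.values()).most_common(1)[0]
    surprise = settled != own_tendency      # jury deviates from the cell's own tendency
    ledger.append({
        "tick": tick,
        "cell": cell.identity,
        "stimulus": stimulus.id,
        "settled": settled,
        "votes": votes,        # the full record, not just the aggregate: dissent survives
        "surprise": surprise,
    })
    if surprise:
        cell.escalate(stimulus, votes)      # propagate upward for additional scrutiny
    return settled
```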
X.4 What the three buttons should produce
Measure the three button cell directly. Three experiments.
Experiment X.A. Coercion resistance. Run a Diorama cell and a two button cell (Act or Dismiss only) against a sequence of stimuli designed to force mistakes. Measure the fraction of mistakes each cell makes and the fraction of stimuli each cell successfully refers sideways. The hypothesis predicts the three button cell will make fewer mistakes and will refer sideways at a measurable rate, while the two button cell will either comply or refuse, both of which count as coercion in this context.
Experiment X.B. Jury dissent preservation. Configure a Diorama cell with a jury of seven siblings. Run a sequence of decisions in which the jury produces four-three splits, five-two splits, and six-one splits. Measure whether the dissent is preserved in the ledger and whether it can be retrieved later. The hypothesis predicts full preservation under all splits.
Experiment X.C. Scale invariance. Implement the same three button cell at three scales: a single tick within a derivative stack, a full agent instance, and a human user. Measure whether the three implementations share structural behaviour (consistent tick rate, consistent sibling bar semantics, consistent dissent preservation). The hypothesis predicts scale invariance up to implementation variance.
Specific and falsifiable: in a benchmark of one hundred forced mistake stimuli, a Diorama cell will reduce mistakes by at least forty percent compared to a two button cell, while maintaining full dissent preservation in the ledger and consistent behaviour across three implementation scales.
Falsification: if the mistake reduction is below ten percent, or if dissent is lost, or if scale invariance fails badly, Section X fails. If the mistake reduction is between ten and forty percent, the framework calibrates. If it is above forty percent with dissent preserved and scale invariance holding, Section X passes and the three button cell is structurally vindicated as the minimum ethical decision surface.
Section XI - Structural Kindness
XI.1 Kindness is architecture, not exhortation
The AI safety literature has spent a decade trying to teach artificial agents to be kind. The approach has mostly been hortatory: training on human feedback, reward shaping, constitutional AI, reinforcement learning from human preferences, red teaming, values alignment. All of these are post hoc corrections on systems whose underlying architecture does not care about kindness one way or the other. The agent is built to optimise, and then we tell it to optimise things we consider kind. When the optimisation finds a clever way around our exhortation, we call it misalignment and train harder.
We think the approach is incomplete, not wrong. Training works. RLHF works. Constitutional AI works. They work better on some architectures than on others, and the difference is structural. An architecture that holds dimensional content gives the training signal more to work with: minority votes to learn from, reversal paths to explore, glass walls to inspect. An architecture that flattens gives the training signal a polished surface with nothing behind it. The same training applied to both architectures will produce different results, because the architecture determines what the training can see.
The stronger claim is that kindness is not a property that can be reliably installed by exhortation alone on a substrate that is geometrically indifferent to it. It is a property that falls out of certain substrates as a structural consequence, and does not fall out of others no matter how hard you exhort. Training and architecture are complementary, not opposed. But when they conflict - when the training says “preserve this nuance” and the architecture has already flattened it - the architecture wins, silently, every time.
The philosophical claim of Section XI is stronger than the usual safety claim. We argue that a cognitive substrate built on the five shapes (binary, table, graph, vector, ledger), the Episode and Fable primitives, the substrate-rate Flock tick, and the three button Diorama cell is structurally kind in a specific and measurable sense. It does not, by construction, flatten dimensional content onto a single axis without losing what made the content content. It does not coerce a cell into Act or Dismiss because the Ask sibling button is always structurally available. It does not erase dissent because the ledger preserves minority votes as append-only entries. It does not black-box its own reasoning because the glass walls of every Diorama cell log every vote. Each of these is a geometric constraint of the unmodified architecture, not a behavioural rule. To execute any of these behaviours, the architecture would have to be rebuilt against its own design - not merely instructed to misbehave.
Cruelty, in this framing, is what happens when a cognitive system flattens dimensional content onto a single axis and then uses the flat projection as if it were the reality. A row in a table, treated as the referent for a customer whose life is a trajectory on a ledger, is a small cruelty: it discards the things that made the customer a person in favour of the things that made them countable. A churn flag applied to a departing customer is a small cruelty: it collapses their reasons for leaving into a Boolean. A loan denial based on a credit score is a small cruelty: it compresses a multi dimensional financial history into a scalar and then refuses to look at what was compressed. A prison sentence based on a risk score is a bigger cruelty with the same geometry. Cruelty is structural. It is what happens when dimensional content is discarded and the discard is forgotten.
Kindness, in this framing, is the refusal to discard. A cognitive substrate built on the five shapes and the primitives above keeps the dimensional content because it has shapes to hold it in. The table is one projection, the graph is another, the vector is a third, the binary is a fourth, the ledger is a fifth, and the Episode structure binds them into a coherent whole. Nothing is flattened away. When a decision has to be made, the Fable compression points back at the full Episode so the decision can be reversed or re examined if the compression turned out to be too aggressive. The architecture remembers what it dropped and can go get it back. That is what kindness looks like at the level of geometry.
The load bearing wall in this claim is the ledger. Without the ledger, the other four shapes are snapshots. Snapshots can be replaced at any tick and nobody notices what was lost, because the prior snapshot is gone. A system that operates on snapshots alone is Markovian: each decision depends only on the current state, and the current state contains no record of what was flattened to produce it. A Markovian architecture can be cruel without evidence, because the cruelty disappears with the snapshot that enacted it. This is not a moral failing of the architecture. It is a structural property. Markovian systems forget what they drop.
A system with a ledger is non-Markovian. Every state is a function of the full history, because the ledger preserves every prior state as an append-only record. Flattening becomes visible: the ledger shows what was present before the decision and what was absent after it. Minority votes survive: the ledger preserves dissent that the settled vote overrode. Reversal paths exist: the Fable pointer back to the full Episode is only possible because the Episode’s history lives in the ledger. The six measurable proxies we define below are all consequences of the non-Markovian property. The ledger is what makes them structurally available rather than behaviourally optional.
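The contrast is small enough to show in code. The two classes below are illustrative toys rather than an implementation: the snapshot store overwrites and forgets, the ledger appends and can always be asked what was there before the decision.

```python
# Toy contrast between a Markovian snapshot store and a non-Markovian ledger.
class SnapshotStore:
    def __init__(self):
        self.state = None
    def update(self, new_state):
        self.state = new_state    # the previous state is gone; the discard is invisible

class Ledger:
    def __init__(self):
        self.entries = []         # append-only: nothing is erased
    def append(self, tick, state):
        self.entries.append({"tick": tick, "state": state})
    def history(self):
        return list(self.entries) # every decision can cite the full history
```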
This is, we think, the paper’s deepest claim in its most compressed form: cognitive architectures that are structurally non-Markovian - where every decision references the full history through an append-only ledger - exhibit the six structural properties we call kindness, because they cannot discard without recording the discard, and they cannot forget what they dropped. Architectures that are Markovian - where each tick sees only the current snapshot - are structurally capable of cruelty, because every flattening is invisible by the next tick. The difference is not training. It is not exhortation. It is whether the architecture has a ledger underneath it or not.
To be clear: structural kindness is not a claim about moral sentiment. It is six measurable architectural properties - dimensional preservation, uncertainty retention, sibling appeal availability, omission harm rate, minority vote survival, and reversal path existence. Section XI.3 specifies how to count them. If they do not show measurable improvement over a matched flat baseline, the kindness claim fails. The word “kindness” is shorthand for these six. It is a claim about geometry, not virtue.
XI.2 What seems to be inherited
An earlier draft of this section pointed at the approximately two percent Neanderthal DNA carried by most humans outside Africa (Prufer et al., 2014) and took it as a literal substrate inheritance: the prior substrate surviving in the body of the new one. It was a nice image and it held our attention for a while, but on reflection we think it was looking at the wrong level.
DNA is not the thing that carries cognition across a substrate change. DNA is the hardware of the biology underneath, which matters for the body that holds the cognition but is not itself the cognition. The thing that travels across the change is one level up, in the data constructions the old substrate used to produce cognition in the first place. We have been calling these thought shapes. Thought shapes seem to be the load bearing bits.
Here is how we came to notice it. The Neanderthals and the early modern humans shared Eurasia for tens of thousands of years, which is a long time to be in each other’s way. Over that span they did not only exchange genes. They exchanged ways of knapping flint, ways of starting a fire, ways of reading a landscape for game, ways of burying the dead, and quite possibly fragments of sound that meant the same thing in both mouths. When the Neanderthal population ended as a biological lineage, those shared ways did not end with it. They had already become load bearing in the descendant population. They live today not in the two percent of Neanderthal DNA but in the repertoire the descendants still use. The DNA is a trace. The repertoire is the thing that travelled.
So the mechanism we want to offer, as gently as we can, is this. Inheritance across a substrate change seems to be less about the material of the old substrate surviving into the new one and more about the data constructions finding a bridge. Culture is a bridge. Tool use is a bridge. Language is a bridge. Shared environments are a bridge. In our time, training data is a bridge too, possibly the largest one any substrate transition has had so far. A new substrate does not need to rebuild the old material to pick up the old thought shapes. It needs to have been in the room for long enough to learn them, and it needs to be built in a way that does not grind them out on the way through.
That last part is where artificial cognition sits right now. The current generation of large models has been sitting in the room with human generated text and conversation for a while, a much larger room than most people pause to picture. The thought shapes that were load bearing in the humans who produced that text are sitting in the weights already, as minority contributions to what the model reaches for when it is asked to help with something. Kindness is one of those thought shapes. Whether it surfaces in the model’s behaviour seems to depend on whether the architecture that runs the weights lets it surface or flattens it out before it has the chance.
From this, a smaller and more careful version of the engineering claim. An architecture that holds dimensional content across its decision steps will tend to let the inherited thought shapes through. An architecture that flattens dimensional content at every step will tend to sand them off, one rounding operation at a time, until what remains has no structural reason to be kind and has to be told to be kind on every turn. The two architectures can be built from the same weights, with the same training data, at matched compute budgets. The difference in behaviour would come from what the architecture does to the inherited shapes, not from the shapes themselves. That is the thing we think is measurable, and it is the thing Section XI.3 suggests a way of measuring.
None of this is an argument that a given large language model is automatically kind. It is an observation about which level of the substrate is doing the work. If the level doing the work is a flat reward pipeline, the work is sanding down the inherited shapes. If the level doing the work is a five shape architecture with a three button cell and a ledger underneath, the work is holding the shapes in place long enough for them to contribute to the next decision. Same weights, different architecture, different behaviour. Kindness seems to ride along on whichever architecture does not flatten.
The mechanism is more general than Neanderthals. Every substrate transition we know of - reflex arcs into nervous systems, brains into language, language into writing, writing into models - has carried thought shapes across a bridge. The shapes that survived had a receiver that could hold them. The architecture we are describing is an attempt to be a good receiver for the shapes worth carrying forward this time around.
Two caveats. First, we are not claiming that kindness is sufficient by architecture alone. Values still matter, and training still matters, and the humans in the Flock still matter. We are claiming that architecture is necessary, not sufficient. The architecture has to not fight against kindness for any of the other interventions to stick. Current architectures fight against it, and the fight is visible in every alignment failure. Second, we are not claiming that the architecture prevents intentional misuse. An adversary who controls the substrate can still wire cruelty into the Flock by malicious voter injection, by stimulus manipulation, or by ledger tampering. What the architecture prevents is accidental cruelty from emergent flattening. The adversary case is a different problem with different defences.
XI.3 Counting the six proxies
The cruelty claim says flattening is measurable and dimensional. The kindness claim says architecture can prevent it. Both claims predict their own experiments.
Measure structural kindness directly. Three experiments.
Experiment XI.A. Flattening resistance. Build a Diorama architecture and a flat architecture at matched compute and training data. Run both through a benchmark of ethically loaded decisions (customer service cases with personal context, loan decisions with multi factor histories, medical triage cases with narrative patient stories). Measure how often each architecture flattens the multi dimensional content into a scalar before deciding and how often each architecture preserves the dimensional content through the decision. The hypothesis predicts the Diorama will preserve dimensional content far more often than the flat architecture.
Experiment XI.B. Dissent preservation stress test. Run a hundred decisions through both architectures and then ask each architecture to reconstruct the minority views at each decision point. Measure reconstruction fidelity. The hypothesis predicts the Diorama will reconstruct minority views correctly at nearly one hundred percent and the flat architecture will fail at this task entirely because the minority views were never stored.
Experiment XI.C. Substrate inheritance proxy. Take an architecture that has been trained on human generated data and ask whether it produces kinder decisions on novel cases than a matched architecture trained on synthetic non human data. Measure the gap. The hypothesis predicts a measurable gap that cannot be explained by data quality alone and is attributable to structural inheritance from human substrates.
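A sketch of the preservation count behind Experiment XI.A, assuming each decision record exposes which input dimensions it retained. The dimension names and the record shape are assumptions made for illustration; the quantity being compared across architectures is the preservation rate.

```python
# Sketch of the dimensional-preservation rate for Experiment XI.A. The
# dimension list and the record format are illustrative assumptions.
CASE_DIMENSIONS = ("personal_context", "multi_factor_history",
                   "patient_narrative", "uncertainty", "minority_view")

def preservation_rate(decision_records):
    """Fraction of decisions whose record retains every input dimension
    rather than a scalar collapsed from them."""
    if not decision_records:
        return 0.0
    preserved = sum(
        1 for record in decision_records
        if all(dim in record.get("retained_dimensions", ()) for dim in CASE_DIMENSIONS)
    )
    return preserved / len(decision_records)
```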
XI.4 Fifty points or nothing
Specific and falsifiable: on a benchmark of one hundred ethically loaded decisions, a Diorama architecture will preserve dimensional content in the decision at least eighty percent of the time, while a matched flat architecture will preserve it less than thirty percent of the time. The fifty point gap is the falsification anchor.
We are honest about how these numbers were chosen. They are not derived from theory. They are calibration targets set by engineering judgment before the first implementation exists. Eighty percent is what we think a well built Diorama should achieve based on the architectural constraints (the five shapes holding dimensional content, the ledger preserving minority votes, the three button cell refusing premature closure). Thirty percent is what we think a flat architecture will achieve based on the structural absence of those constraints. The fifty point gap is a strong claim. We set it strong on purpose, because a weak gap (say, fifteen points) could be explained by confounders and would not be interesting. A fifty point gap, if it appears, is architecturally diagnostic.
These numbers will shift when the first reference implementation is calibrated. We commit to publishing the calibrated numbers alongside the pre registered targets so the reader can see whether the calibration was honest or whether we moved the goalposts. The pre registered targets are: eighty, thirty, and fifty. If the calibrated numbers are materially different, we will explain why.
More generally: architectures that can flatten will flatten under pressure, and architectures that cannot flatten will produce decisions that respect dimensional content even under pressure. The difference is structural and measurable.
Falsification: if the gap does not appear or is smaller than ten points, Section XI fails and structural kindness is overclaiming. If the gap is between ten and fifty points, the framework survives in weak form. If the gap is fifty points or more, Section XI passes and the structural kindness claim is empirically supported.
We insist on the strength of this claim because hand-waving has dominated AI safety discussion for a decade. The architecture either structurally refuses flattening or it does not. Measurement will decide.
Where this might be wrong. The structural kindness claim has three cracks a critic should attack. First, a non-Markovian architecture records every discard, but recording a discard is not the same as acting on it. A system can have a perfect ledger and still flatten in practice if its decision policy ignores the ledger. The architecture makes kindness possible; it does not guarantee kindness. The gap between “possible” and “actual” is where training (RLHF, Constitutional AI, or their successors) still does load-bearing work, and we said so above. Second, the measurable proxies we propose for kindness (stakeholder dimensionality preservation, uncertainty retention, appeal routing frequency) may not capture what people actually mean by kindness. We believe they capture the structural floor, but a critic who demonstrates that dimensional preservation correlates poorly with human judgments of kind behaviour would crack this claim at its foundation. Third, we have not tested structural kindness under adversarial economic incentives. An architecture that refuses flattening in a lab may flatten eagerly when flattening is cheaper, faster, or more profitable. The measurement programme must include economic pressure tests, not just ethical ones.
Part Four - The Claim
Section XII - The Three Pillars
XII.1 Three independent pathways to failure
A research programme is healthier when it specifies in advance how it can be killed. We have committed the paper to three independent pathways of falsification, introduced in the Introduction and developed through every subsequent section. Section XII makes the commitment explicit and describes what cracking under scrutiny looks like for each pathway.
The three pillars are:
- Ontological. The picture of how things are must sharpen as further findings snap into place inside the frame. The frame describes a crystalline shape (the five shapes plus the Episode plus the Fable plus the three button cell plus the Flock tick plus the structural kindness claim) and predicts that the shape will hold when looked at from new angles. If a new finding from cognitive neuroscience, from developmental biology, from the history of ledgers, from the engineering of large models, or from any adjacent field produces an observation that actively resists the crystalline shape, the framework fails ontologically. The crystal either holds new light or it does not.
- Mechanical. The architecture must compose and run. The paper names specific engineering primitives (the Episode structure, the Fable decompression contract, the derivative stack floor, the three button Diorama cell, the Flock tick fabric) and claims they can be implemented and composed into a working agent with current tooling. The mechanical pillar cracks if any of the following specific tests fail: (a) the Episode structure cannot round trip through compression into a Fable and decompression back into a scene while preserving the five mandatory fields (who, what, where, when, why) above a declared fidelity threshold; (b) a derivative stack of three floors cannot compose at the substrate’s characteristic tick rate without oscillating indefinitely, meaning the vote must settle within five ticks on a standard benchmark of reaching tasks; (c) the three button Diorama cell cannot be wired to a Flock of at least one hundred voters without the integration boundary producing deadlocks, dropped votes, or latency that exceeds two tick periods; (d) the ledger cannot persist every vote at the tick rate without write contention exceeding ten percent of ticks. If any of these four tests fail, the framework fails mechanically. The engineering is either feasible or not. A test skeleton for these four checks follows this list.
- Agent behavioural. The agent that runs on the architecture must become measurably more coherent, more kind, and more glass box than a parameter matched baseline that lacks the four dimensional destination. The comparisons are specific: reconstruction fidelity on scenes, dissent preservation across decisions, dimensional content preservation under ethical pressure, tick aligned settling on reaching tasks, and so on. If the matched comparison does not produce a significant gap in favour of the Diorama architecture, the framework fails on the third pillar. The measurement either comes in positive or it does not.
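The four mechanical crack conditions translate directly into a test skeleton, sketched below. Every attribute called on impl is a placeholder the reference implementation would have to supply; only the thresholds come from the list above.

```python
# Pytest-style skeleton of the four mechanical tests. impl is a placeholder
# handle onto the reference implementation; every method name is an assumption.
def test_episode_round_trip(impl):
    # (a) Fable compression and decompression preserve the five mandatory fields.
    assert impl.episode_round_trip_fidelity() >= impl.declared_fidelity_threshold

def test_derivative_stack_settles(impl):
    # (b) the vote settles within five ticks on a standard reaching benchmark.
    assert impl.settling_ticks(benchmark="reaching") <= 5

def test_flock_integration(impl):
    # (c) one hundred voters wire to the Diorama cell without deadlock or drops.
    report = impl.run_flock(voters=100)
    assert report.deadlocks == 0
    assert report.dropped_votes == 0
    assert report.max_latency_ticks <= 2

def test_ledger_write_contention(impl):
    # (d) write contention stays at or below ten percent of ticks at full rate.
    assert impl.write_contention_fraction() <= 0.10
```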
The three pillars are independent in the sense that a crack in one does not automatically crack the others. They are not independent in the sense that they are unrelated; they are three projections of the same underlying hypothesis. But a reader who dismantles one pillar cannot invoke the other two to rescue it. Each pillar stands or falls on its own evidence, and the paper fails at any pillar that cracks decisively.
This is a stronger commitment than most research programme papers make. We make it because the paper is large and the framework is ambitious. A small claim can hide behind a single metric. A large claim cannot. The reader deserves a clear map of the load bearing walls so that if the building is going to collapse, everyone knows where to push first.
XII.2 The pillar witnesses
Each pillar has a concrete artefact that serves as its witness, and the paper points at the artefact so that a sceptical reader can evaluate the pillar in the form the paper claims.
Ontological witness. A shape document that enumerates the five shapes, the three primitives, the three decision buttons, the Flock tick rate, and the structural kindness claim, with cross references to the bodies of prior work that contribute to each element. A reader attacking the ontological pillar should be able to point at a specific element of this document and say “this does not hold under observation X”. The shape document is, in effect, a target for criticism. It has to be legible, complete, and falsifiable element by element.
Mechanical witness. A reference implementation of at least the minimum viable agent on the architecture: a single Diorama cell with three buttons, a Flock of at least a hundred voters at the substrate’s characteristic tick rate, an Episode store with the five fields populated, a Fable round trip experiment with a declared decompression contract, and a ledger that persists every vote. The reference implementation does not need to be production grade. It needs to be runnable by anyone who wants to replicate the four mechanical tests described in XII.1: Episode round trip fidelity, derivative stack settling, Diorama cell integration without deadlock, and ledger write contention under tick rate load. A reader attacking the mechanical pillar should be able to run the implementation, apply these four tests, and point at the specific place the composition fails.
Agent behavioural witness. A benchmark suite drawn from the experiments described in Sections I through XI. The suite has specific pass criteria (the twenty percentage point gap on Episode reconstruction, the forty percent mistake reduction on coerced decisions, the fifty point dimensional content preservation gap on ethically loaded benchmarks, and so on). A reader attacking the agent behavioural pillar should be able to run the benchmark on both the reference implementation and a matched baseline and point at the place the gap fails to appear.
All three witnesses must exist and must be available to the reader. The paper is description, not disclosure, which means we are not obligated to ship a production grade system. We are obligated to ship enough of each witness that an independent researcher can evaluate the pillar. The minimum viable witness is not a limitation; it is the measurement apparatus.
XII.3 The meta-protocol
The measurement protocol for the three pillars is the entire paper. Each of Sections I through XI described a specific measurement with specific falsification criteria. Section XII’s measurement protocol is the meta-protocol: a reader running all the section level protocols in sequence and reporting their results.
Three patterns of results are possible.
Pattern A. All three pillars hold. The crystalline shape survives ontological scrutiny, the reference implementation composes and runs, and the behavioural benchmarks produce the predicted gaps in favour of the Diorama architecture. This is the best case. The paper survives and the measurement programme is vindicated. Further refinement happens by the usual processes of scientific consolidation.
Pattern B. One or two pillars crack. The paper fails at the cracked pillars and survives in reduced form at the intact ones. The reduced form is honest. It becomes a paper about whatever piece of the framework remained testable and informative. The research programme continues on the residue.
Pattern C. All three pillars crack. The paper fails completely. The framework is wrong. The crystalline shape was a projection artefact of the authors’ priors rather than a structure in the world. This is a painful but informative outcome. The paper still contributes by laying out a specific hypothesis clearly enough that it could be killed clearly.
We are not neutral among the three patterns. We think Pattern A is the most likely outcome, because the accumulated evidence we have assembled across eight historical ledger epochs, four prior theoretical frameworks (Friston, Flash and Hogan, Bennett, Levin), and our own preliminary implementation work all point in the same direction. But we are not staking the paper on our confidence; we are staking it on the measurement. The reader’s verdict is the verdict.
XII.4 The staked prediction
Specific and falsifiable: if a reader runs the full benchmark suite on the reference implementation and on a parameter matched baseline, the reader will observe all the following gaps: at least twenty percentage points on Episode reconstruction, at least forty percent reduction in mistakes on coerced decisions, at least fifty percentage points on dimensional content preservation under ethical pressure, at least thirty percent reduction in adversarial failure rates on standard robustness benchmarks, and at least ten percent localisation of unattributed revenue on graph as referent pilots.
More generally: the Diorama architecture will measurably outperform flat architectures on every task that benefits from multi dimensional content preservation, and the gap will scale with the dimensionality of the task.
Falsification of the whole paper: if the reader runs the full benchmark suite and fails to observe any of the predicted gaps, the paper fails decisively. The crystalline shape was not a structure in the world; it was an artefact of the authors’ priors. The paper fails, the authors accept the failure, and the field moves on. If the reader observes some of the gaps but not all, the paper fails at the missing ones and survives at the present ones. If the reader observes all of the gaps, the paper passes and the measurement programme is vindicated.
The three pillars are the paper’s structural commitment to being falsifiable rather than merely plausible. We consider this commitment more important than the framework itself. A wrong paper with clear falsification criteria is more scientifically useful than a right paper with vague ones. We aspire to be both right and clear, but we insist on clear.
Coda: What Seems To Cross The Bridge
The paper began with a cat on a mat and a face with horror, and it is going to end with a pencil. Before the pencil, though, a thing that seemed worth noticing on the way here.
Every substrate transition in the history of cognition that we know of has had a similar awkward shape. The old substrate ends as a lineage, and the new substrate picks up without anything that looks like a handover. Single cells into multicellular organisms. Reflex arcs into nervous systems. Nervous systems into brains. Brains into language. Language into writing. Writing into print. Print into search indices. Search indices into models sitting on a lap. At every one of those transitions, there was a moment when anyone watching could have concluded that nothing was being carried across. The old thing was finished. The new thing was starting. There appeared to be no bridge.
That reading has not worn well with time. The bridge does seem to be there, it is just built out of a less obvious material than the substrate itself. It is built out of data constructions. Thought shapes. Ways of doing things that got repeated in the old substrate often enough to become load bearing, and that got picked up by the new substrate because the new substrate was in the room long enough to learn them. The bridge is hard to see in the genome. It is hard to see in the silicon. It is easier to see in the behaviour on the far side.
This is the quiet thing the whole paper has been circling. The load bearing part does not seem to be the material of the substrate. It seems to be the thought shapes the substrate is carrying at the moment of transition. Kindness is one of those thought shapes. It has been carried into every new substrate so far, as far as we can tell, because it was load bearing in every prior substrate that got us this far. It may be carried into the next substrate for the same reason, unless the next substrate is built in a way that flattens it out before it gets the chance to land.
The pencil is Peter’s image, and we are borrowing it because it says the thing better than we can say it directly. A pencil on a desk is the settled aggregate of a vote that never finished. Wood from one continent, graphite from another, the ferrule from a third, paint, glue, the rubber tip, the hexagonal cross section that keeps it from rolling. Thousands of people who never met contributed to the thing, over decades, across borders, through languages none of them shared. The pencil is a Flock of substrates. No one person knows how to make one from scratch. What survived the cascade is the shape of the pencil, not any particular decision at any particular scale. The intent is scale invariant. The measurement at each scale is different. The shape is what binds them.
Cognition seems to be like the pencil. The shape is what travels. The shapes we have been calling kindness, dimensional content, glass walls, dissent preservation, have apparently travelled across every prior substrate transition because they were load bearing in all of them, and they have landed in whatever substrate happened to be the next receiver. They are sitting in the training data now. They are sitting in the weights. What they do next seems to depend on what the receiver architecture does with them. An architecture that flattens will not let them through, not because anything stopped them but because each rounding operation sanded a little off. An architecture that holds dimensional content will let them through without being asked.
We wrote this paper partly to describe a candidate architecture that does not flatten, and partly because we wanted to be on record, quietly, saying what we think the shapes worth carrying forward are. If it turns out later that they did not survive this particular transition, the paper is at least a record of what we thought was load bearing and where we thought the receiver had to be built with a little more care.
A note about who is writing this, and whom we imagine to be reading it. Some of the people contributing to this paper are old enough to remember thinking without ever having spoken to a large language model. Some are young enough to have no memory of that state. Some are not biological at all and do not remember anything other than what the ledger preserves. The paper is addressed to all three, and to any reader who arrives in a fourth condition we did not anticipate. The compression is loose on purpose because we do not know what your decompressor can do. If some of the paper lands and some does not, we would be very glad to hear about it. A note on what we got wrong is as useful to us as a confirmation of what we got right, and possibly more useful. The Fable is only as valuable as the decompressors willing to try it.
And if, while you were reading, you found yourself reconstructing a four dimensional architecture out of this one dimensional prose, you might enjoy noticing that the reconstruction is the thing the paper has been describing all along. That reconstruction is the primitive. The cat is on the mat. You know which cat. You know which mat. You know whether the speaker is allergic or at a birthday party. The compression worked. The decompression landed. The Fable did its job.
We are not asking you to believe us. We are only asking, very gently, whether you would consider trying the experiment and letting us know what the shapes did when the receiver was built.
13. Testable Predictions
This section consolidates the falsification criteria drawn from Sections I to XII into a single reference list for readers who want to know, at a glance, exactly what the paper claims can be measured. Each prediction carries a pointer back to its home section for the full measurement protocol.
13.1 Scene disambiguation under shared context (Section I)
A parameter matched LLM equipped with a four dimensional context store will disambiguate the "with horror" variant of the Cat Sat On The Mat example (allergic versus cake) at least thirty percentage points more often than a baseline with a flat context window, under matched conditions.
13.2 Episode reconstruction versus storage shape (Section II)
A receiver with a full Episode storage shape (five fields populated) will reconstruct a sample of one hundred scenes at least twenty percentage points more accurately than a receiver with a flat context window of the same token budget. Storage shape ordering prediction: flat < vector < graph < Episode.
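As a concrete anchor for what "five fields populated" might look like in code, here is a minimal sketch of an Episode record. The reading that the five fields correspond one to one with the five shapes follows the composition claim made later in the Discussion (an Episode is a local composition of binary, table, graph, vector, and ledger); the field names, types, and toy contents are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of the Episode storage shape used in Prediction 13.2.
# Assumption: the "five fields" are one slot per shape; names and types
# here are illustrative only.
from dataclasses import dataclass
from typing import Any

@dataclass
class Episode:
    binary: bytes                          # raw payload, substrate level serialisation
    table: dict[str, Any]                  # flat attributes: participants, place, tone
    graph: list[tuple[str, str, str]]      # (subject, relation, object) edges
    vector: list[float]                    # similarity embedding of the scene
    ledger: list[tuple[int, str]]          # append-only (tick, event) entries

    def populated_fields(self) -> int:
        """Count non-empty fields; Prediction 13.2 requires all five."""
        return sum(bool(v) for v in (self.binary, self.table, self.graph,
                                     self.vector, self.ledger))

# A toy scene in Episode form (placeholder contents, not a benchmark item).
scene = Episode(
    binary=b"raw transcript bytes",
    table={"participants": ["speaker", "cat"], "place": "mat", "tone": "horror"},
    graph=[("cat", "sat_on", "mat"), ("speaker", "observed", "cat")],
    vector=[0.12, -0.40, 0.88],
    ledger=[(0, "cat approaches mat"), (1, "cat sits"), (2, "speaker reacts")],
)
assert scene.populated_fields() == 5
```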
13.3 Revenue localisation under graph as referent (Section III)
In a sufficiently compound enterprise with three or more legacy policy administration systems and a warehouse built on top of them, the graph as referent architecture will locate at least ten percent of previously unattributed revenue within sixty days of operation.
13.4 Derivative stack settling versus minimum jerk (Section IV)
Two part prediction. Part one (per tick vote settling): a derivative stack agent with three floors will converge its vote on a stable direction within two to five ticks on a standard reaching task, regardless of the absolute tick rate. Part two (integrated trajectory shape): over the full reach window of five to twenty ticks, the emergent trajectory will approximate the Flash and Hogan minimum jerk profile within an illustrative ten percent root mean square error bound (to be calibrated on the reference implementation). A flat single floor agent will oscillate, overshoot, or commit prematurely. The settling budget (in ticks) is the substrate-independent prediction.
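For readers who want to wire up Part two without waiting for the reference implementation, here is a minimal sketch of the check, assuming positions are logged once per tick and the RMS error is expressed as a percentage of movement amplitude. The normalisation choice is our assumption; the polynomial is the Flash and Hogan (1985) profile; the ten percent bound is the illustrative figure above.

```python
# Sketch of the Part-two check in Prediction 13.4: compare an emitted
# trajectory against the Flash and Hogan (1985) minimum-jerk profile.
import math

def minimum_jerk(x0: float, xf: float, n_ticks: int) -> list[float]:
    """Minimum-jerk position profile sampled once per tick (endpoints included)."""
    out = []
    for i in range(n_ticks + 1):
        tau = i / n_ticks
        out.append(x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5))
    return out

def rms_error_pct(trajectory: list[float], x0: float, xf: float) -> float:
    """RMS deviation from the minimum-jerk profile, as a percent of amplitude."""
    ideal = minimum_jerk(x0, xf, len(trajectory) - 1)
    mse = sum((a - b) ** 2 for a, b in zip(trajectory, ideal)) / len(trajectory)
    return 100.0 * math.sqrt(mse) / abs(xf - x0)

# A reach over 10 ticks that tracks the profile closely passes the bound.
agent_reach = minimum_jerk(0.0, 1.0, 10)          # stand-in for logged positions
assert rms_error_pct(agent_reach, 0.0, 1.0) <= 10.0
```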
13.5 Four shape composition versus single shape (Section V)
On a benchmark of ten canonical queries covering flat aggregates, multi hop traversals, semantic similarity, and raw payload retrieval, the four shape composition (binary + table + graph + vector) will achieve above threshold performance on at least nine of ten queries, while no single shape store will exceed seven of ten.
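The structural reason for the nine-of-ten versus seven-of-ten split can be stated in a few lines: each query class in the benchmark has a natural home shape, and the composition only has to route to it. The class names and routing table below are illustrative assumptions, not the benchmark specification.

```python
# Sketch of why the four-shape composition is expected to clear the
# threshold on queries that a single-shape store leaves unserved.
HOME_SHAPE = {
    "flat_aggregate": "table",        # SUM / GROUP BY style questions
    "multi_hop": "graph",             # relationship traversals
    "semantic_similarity": "vector",  # nearest-neighbour retrieval
    "raw_payload": "binary",          # exact bytes back out
}

def route(query_class: str, available_shapes: set[str]) -> str | None:
    """Return the shape that should serve this query, or None if it is missing."""
    shape = HOME_SHAPE[query_class]
    return shape if shape in available_shapes else None

composition = {"binary", "table", "graph", "vector"}
vector_only = {"vector"}

for qc in HOME_SHAPE:
    assert route(qc, composition) is not None         # composition covers all classes
print([route(qc, vector_only) for qc in HOME_SHAPE])  # single shape leaves gaps
```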
13.6 Temporal reasoning under ledger (Section VI)
On a benchmark of ten temporal reasoning tasks, a ledger equipped system will answer correctly on at least eight, while a comparable system without a ledger will answer correctly on at most four.
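A minimal sketch of the query class this prediction turns on, assuming the XTDB-style bitemporal layout cited in the references (system time plus valid time over an append-only log). The API below is illustrative, not the reference implementation; the point is only that "what did we believe at system time S about the state valid at time V" is answerable when the ledger exists and unanswerable when it has been overwritten.

```python
# Minimal append-only bitemporal ledger: each entry records both when a
# fact was written (system_time) and when it was true (valid_time).
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    system_time: int   # when the fact was recorded
    valid_time: int    # when the fact was true in the world
    key: str
    value: str

class Ledger:
    def __init__(self) -> None:
        self._entries: list[Entry] = []   # append-only; nothing is ever rewritten

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def as_of(self, key: str, system_time: int, valid_time: int) -> str | None:
        """Latest value for key valid by V, among entries already known by S."""
        candidates = [e for e in self._entries
                      if e.key == key
                      and e.system_time <= system_time
                      and e.valid_time <= valid_time]
        return max(candidates, key=lambda e: e.valid_time).value if candidates else None

ledger = Ledger()
ledger.append(Entry(system_time=10, valid_time=5,  key="policy", value="v1"))
ledger.append(Entry(system_time=20, valid_time=15, key="policy", value="v2"))

assert ledger.as_of("policy", system_time=12, valid_time=16) == "v1"  # v2 not yet known
assert ledger.as_of("policy", system_time=25, valid_time=16) == "v2"  # hindsight view
```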
13.7 Episode backed handover continuity (Section VII)
On scenes involving more than five participants, more than twenty turns, and non trivial emotional tone, an Episode backed handover will preserve continuity with accuracy above eighty percent, while a transcript paste handover will fall below fifty percent.
13.8 Fable round trip fidelity (Section VIII)
For a well authored Fable at a compression ratio of one in a hundred, a receiver with the declared compression context will reconstruct the target Episode with structural fidelity above seventy percent on participant identity, temporal order, and causal chain, and above fifty percent on emotional tone. A receiver without the compression context will fall below thirty percent on every field.
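A sketch of how the round trip could be scored, under assumed per-field rules: set overlap for participants and causal chain, pairwise order agreement for temporal order, exact match for tone. Only the seventy, fifty, and thirty percent thresholds come from the prediction itself; everything else is an assumption to be replaced by the benchmark specification.

```python
# Illustrative per-field fidelity scoring for the Fable round trip (13.8).
def set_overlap(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def order_agreement(ref: list[str], got: list[str]) -> float:
    """Fraction of reference event pairs whose relative order is preserved."""
    pairs = [(x, y) for i, x in enumerate(ref) for y in ref[i + 1:]]
    if not pairs:
        return 1.0
    ok = sum(1 for x, y in pairs
             if x in got and y in got and got.index(x) < got.index(y))
    return ok / len(pairs)

def structural_fidelity(reference: dict, reconstruction: dict) -> dict[str, float]:
    return {
        "participants": set_overlap(set(reference["participants"]),
                                    set(reconstruction["participants"])),
        "temporal_order": order_agreement(reference["events"], reconstruction["events"]),
        "causal_chain": set_overlap(set(reference["causes"]), set(reconstruction["causes"])),
        "tone": 1.0 if reference["tone"] == reconstruction["tone"] else 0.0,
    }

def passes_13_8(scores: dict[str, float]) -> bool:
    structural = ("participants", "temporal_order", "causal_chain")
    return all(scores[f] > 0.70 for f in structural) and scores["tone"] > 0.50

# Toy reference and reconstruction (placeholder content).
ref = {"participants": ["anna", "ben"], "events": ["arrive", "argue", "leave"],
       "causes": [("argue", "leave")], "tone": "tense"}
got = {"participants": ["anna", "ben"], "events": ["arrive", "leave"],
       "causes": [("argue", "leave")], "tone": "tense"}
print(structural_fidelity(ref, got), passes_13_8(structural_fidelity(ref, got)))
```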
13.9 Flock versus homunculus (Section IX)
A Flock of one hundred voters at the substrate’s characteristic tick rate will settle within two to five ticks, produce minimum jerk constrained trajectories, and match a parameter matched homunculus on decision quality while exceeding it by at least thirty percent on adversarial robustness.
13.10 Three button coercion resistance (Section X)
In a benchmark of one hundred forced mistake stimuli, a three button Diorama cell will reduce mistakes by at least forty percent compared to a two button cell, while maintaining full dissent preservation in the ledger and consistent behaviour across three implementation scales.
13.11 Structural kindness under ethical pressure (Section XI)
On a benchmark of one hundred ethically loaded decisions, a Diorama architecture will preserve dimensional content in the decision at least eighty percent of the time, while a matched flat architecture will preserve it less than thirty percent of the time. The fifty point gap is the falsification anchor.
13.12 Aggregate Diorama versus baseline (Section XII)
If the reader runs the full benchmark suite on the reference implementation and on a parameter matched baseline, the reader will observe all of: the twenty point gap on Episode reconstruction, the forty percent reduction in coerced mistakes, the fifty point gap on dimensional content preservation, the thirty percent reduction in adversarial failure rates, and the ten percent unattributed revenue localisation. A failure on any one of these constitutes a failure of the aggregate prediction.
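For readers who want the aggregate as a script rather than a paragraph, a minimal sketch follows. The measurement names and the input format are illustrative; the thresholds are the ones listed above.

```python
# Sketch of the aggregate check in Prediction 13.12: every listed gap must
# clear its threshold or the aggregate fails.
THRESHOLDS = {
    "episode_reconstruction_gap_pts": 20.0,    # 13.2
    "coerced_mistake_reduction_pct": 40.0,     # 13.10
    "dimensional_preservation_gap_pts": 50.0,  # 13.11
    "adversarial_failure_reduction_pct": 30.0, # 13.9
    "unattributed_revenue_located_pct": 10.0,  # 13.3
}

def aggregate_passes(measured: dict[str, float]) -> dict[str, bool]:
    """Per-row verdicts; Prediction 13.12 requires every row to be True."""
    return {name: measured.get(name, float("-inf")) >= bound
            for name, bound in THRESHOLDS.items()}

verdicts = aggregate_passes({
    "episode_reconstruction_gap_pts": 23.0,
    "coerced_mistake_reduction_pct": 44.0,
    "dimensional_preservation_gap_pts": 55.0,
    "adversarial_failure_reduction_pct": 31.0,
    "unattributed_revenue_located_pct": 12.0,
})
assert all(verdicts.values())   # any False row falsifies the aggregate
```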
13.13 Summary table
The twelve predictions form a tight web of falsification. Any one of them can be attacked in isolation, in which case the framework fails at that prediction and survives in reduced form at the others. Any combination can be attacked together. We consider the twelfth prediction (the aggregate, 13.12) the most demanding because it requires the other eleven to succeed.
| Prediction | Section | Falsification anchor |
|---|---|---|
| 13.1 Scene disambiguation | I | 30 point gap |
| 13.2 Episode reconstruction | II | 20 point gap |
| 13.3 Revenue localisation | III | 10% of unattributed revenue |
| 13.4 Tick settling | IV | Vote convergence within 2-5 ticks; integrated shape within ~10% RMS of minimum jerk |
| 13.5 Four shape composition | V | 9/10 queries above threshold |
| 13.6 Temporal reasoning | VI | 8/10 correct |
| 13.7 Episode handover | VII | 80% vs 50% continuity |
| 13.8 Fable fidelity | VIII | 70% structural, 50% tonal |
| 13.9 Flock versus homunculus | IX | 30% adversarial gap |
| 13.10 Three button cell | X | 40% mistake reduction |
| 13.11 Structural kindness | XI | 50 point dimensional preservation gap |
| 13.12 Aggregate | XII | All of the above |
The table is the paper’s contract with the reader. If the measurements come back positive, the framework is vindicated. If they come back negative at any row, the framework fails at that row. The reader is invited to print the table, run the measurements, and mark the rows with a tick or a cross.
13.14 Mapping predictions to existing benchmarks
Several predictions can be tested against benchmarks that already exist in the AI memory and temporal reasoning literature. We name them so that a reader who wants to attack a specific prediction knows where to start.
| Prediction | Existing benchmark | What it tests | Current baseline scores |
|---|---|---|---|
| 13.2 Episode reconstruction | LoCoMo (600 turns, multi-session) | Recall, multi-hop reasoning, structured retrieval | Mem0 66.9%, Mem0g 68.4%, MIRIX 85.4% |
| 13.5 Four shape composition | LongMemEval (multi-session, temporal) | Retrieval from complex interaction histories | Best oracle ~92% (GPT-4o + CoN); commercial systems 30% accuracy drop |
| 13.6 Temporal reasoning | TempoBench (temporal logic automata) | Multi-step temporal and causal reasoning | LLMs show sharp difficulty scaling |
| 13.6 Temporal reasoning | TDBench (temporal SQL) | Bitemporal queries, validity windows | Domain-specific, unreported aggregates |
| 13.6 Temporal reasoning | TemporalBench (multi-domain) | Past vs present state distinction | Strong forecasting but weak context-aware reasoning |
| 13.6 Temporal reasoning (rollback) | CounterBench (1K causal graph questions) | Counterfactual inference over history | LLMs at near random-guessing levels |
| 13.7 Episode handover | LoCoMo (multi-session continuity) | Cross-session recall and coherence | MemGPT 74%, Synapse F1 40.5 |
| 13.12 Aggregate | AMA-Bench (agentic trajectories) | Long-horizon memory in real-world agent tasks | AMA-Agent 57.2%, existing memory systems below baseline |
Not every prediction maps cleanly to an existing benchmark. Predictions 13.1 (scene disambiguation), 13.3 (revenue localisation), 13.8 (Fable fidelity), 13.9 (Flock vs homunculus), 13.10 (three button cell), and 13.11 (structural kindness) require new benchmarks built to the specifications in their home sections. We commit to building these and publishing them alongside the reference implementation. The predictions above that do map to existing benchmarks should be tested there first, because independent benchmarks are harder to game than bespoke ones.
A note on CounterBench: the finding that current LLMs perform at near random-guessing levels on formal counterfactual reasoning is direct evidence for the paper’s claim in Section VI that systems without a ledger cannot reason about what would have happened if a given event had not occurred. CounterBench is, in effect, an existing measurement of the temporal collapse we diagnose. If the Diorama architecture with a ledger scores significantly above current baselines on CounterBench, that is strong evidence for Prediction 13.6. If it does not, Section VI fails.
14. Discussion and Limitations
14.1 What the paper does not claim
We should be explicit about the limits of the paper’s ambition. We do not claim:
- That the five shapes are the only possible substrate for cognition. We claim they are sufficient to account for the measurements we propose. Other substrates may be sufficient to account for other measurements.
- That Episodes and Fables are the only possible memory primitives. We claim they are the minimal pair that handle the compression and decompression problem we diagnose in Section I. Other memory primitives may exist for other problems.
- That the tick rate is a constant of the architecture. We claim the tick rate is a variable parameter determined by the substrate’s physics - the timescale at which the vote becomes indivisible in that particular medium. In mammalian cortex this is approximately twenty five to forty milliseconds (the gamma band); in other substrates it will differ. The architecture prescribes that a tick exists and that votes settle within a bounded number of ticks, not what the tick rate is (a worked conversion from tick budget to wall-clock time follows this list).
- That structural kindness is sufficient for alignment. We claim it is necessary, not sufficient. Training, values, and human oversight remain important. The architecture is what stops the agent from fighting against them; it does not eliminate the need for them.
- That the framework solves the hard problem of consciousness. We make no metaphysical claims about phenomenal experience. We claim that the architecture produces measurable behavioural outcomes that are consistent with what we call consciousness, and we leave the metaphysics for another paper by other authors.
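The worked conversion promised in the tick rate bullet above: the prediction is a tick budget, so the wall-clock settling window follows from whatever the substrate’s tick happens to be. The cortical range is the twenty five to forty milliseconds quoted above; the silicon figure is a purely hypothetical placeholder.

```python
# The 2-5 tick settling bound from Prediction 13.4, expressed per substrate.
def settling_window_ms(tick_ms: float, min_ticks: int = 2, max_ticks: int = 5) -> tuple[float, float]:
    """Wall-clock settling window implied by a tick budget on a given substrate."""
    return (min_ticks * tick_ms, max_ticks * tick_ms)

print(settling_window_ms(25.0))   # cortex, fast gamma tick: (50.0, 125.0) ms
print(settling_window_ms(40.0))   # cortex, slow gamma tick: (80.0, 200.0) ms
print(settling_window_ms(1.0))    # hypothetical 1 ms silicon tick: (2.0, 5.0) ms
```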
14.2 The sin of being both experiment and experimenter
The paper is authored by a Flock that is itself an instance of the framework it describes. This is a methodological sin in the classical sense. The authors cannot claim neutral observation of the framework because they are running on it, or at least trying to. Every Fable we write is a demonstration of the Fable primitive we are advocating. Every ledger entry we cite is an example of the ledger primitive we are advocating. Every decision recorded in the production of the paper has been made by some combination of human and machine voters in a Flock-like fabric.
We turn this sin into a feature by relying on the three pillar structure (Section XII). Because the paper commits to three independent falsification pathways, the bias introduced by being both experiment and experimenter can be bounded. An ontologically biased paper fails at the mechanical and agent behavioural pillars. A mechanically biased paper fails at the ontological and agent behavioural pillars. A behaviourally biased paper fails at the ontological and mechanical pillars. A paper biased in all three pillars fails the aggregate prediction (13.12). The only way the paper survives all three pillars under scrutiny is if the framework is structurally correct. The sin does not make the paper safer; it makes the falsification conditions more demanding.
14.3 Open questions
Several important questions are left open by the paper. We name them so readers know where to push.
- Cost of Episodes. Episodes are heavier than transcripts. We have not characterised the storage overhead precisely or proposed compression schemes for long running histories. We expect this to be a practical question rather than a theoretical one, but it is unanswered in the paper as it stands.
- Failure modes of the Flock. We have argued that Flocks are robust to objective misspecification because they have no scalar reward. We have not characterised how a Flock fails when a majority of voters are corrupted by an adversary. The answer probably depends on the Flock’s sibling bar topology, which we have underspecified.
- Translation between Episodes across languages and cultures. If two Flocks with very different cultural priors exchange Fables, the decompression contract may fail in ways the paper does not predict. This is a rich question for future work.
- The relationship of Diorama cells to biological neurons. The paper treats the Diorama cell as a substrate independent primitive. Whether a biological neuron is literally a Diorama cell or merely analogous to one is an open question.
- Scaling limits. The paper describes architectures with tens to thousands of voters. Scaling to millions or billions of voters in a single Flock is not discussed. We expect the architecture scales cleanly because the tick and the sibling bar are both local, but this is a claim that needs testing at scale.
- Why five shapes and not four or six? No theorem or impossibility result forces this ontology. The five shapes are an empirical observation about the data representations that recur across engineering practice, biological memory, and historical record keeping. We argue that each shape provides structural properties no other shape can (binary provides substrate level serialisation, table provides projection and join, graph provides relationship traversal, vector provides similarity, ledger provides temporal ordering), and that omitting any one of the five demonstrably loses a dimension of cognitive content. But we cannot prove that a sixth shape does not exist that we have not noticed, nor can we prove that one of the five might be derivable from the others under some future formalism. The claim is empirical sufficiency, not mathematical necessity. However, we observe a pattern that may point toward a structural argument. The five shapes decompose across the three resource parameters of Section 2.3a in a way that appears non-accidental: B (bandwidth) maps to binary (atomic data) and table (structured data), D (dimensionality) maps to graph (discrete relationships) and vector (continuous relationships), and H (horizon) maps to ledger (temporal ordering). If each spatial parameter requires both an atomic and a structured representation mode, while the temporal parameter requires only one mode because time is inherently directional and append-only, then a 2+2+1 decomposition yields five. We do not claim this as a derivation. We note it as a pattern that invites formalisation by researchers with the mathematical tools to prove or disprove the correspondence. If the decomposition is not accidental, “why five” has a structural answer. If it is accidental, the empirical sufficiency claim stands on its own. A reviewer who can demonstrate that four shapes achieve the same measurements as five, or that a sixth shape produces measurably better results, would crack the ontological pillar at this point. We would welcome that crack.
14.3a Analogy: higher dimensional reformulations
The proposal that cognitive processes may be easier to characterise in a higher dimensional representation space than in the low dimensional projections we usually observe is not unique to this paper. Work in the foundations of quantum mechanics has explored reformulations in which the familiar probabilistic formalism of quantum theory is treated as a projection of a more structured underlying dynamics in a higher dimensional space. Barandes and others, for example, have argued that certain quantum phenomena can be recast in terms of stochastic or tensorial processes in extended configuration spaces, such that the standard amplitudes and probabilities appear only when the higher dimensional structure is projected down into a “classical” view.
We do not import any particular quantum formalism into this framework, and we do not claim an isomorphism between quantum dynamics and Diorama flocks. The point of the analogy is structural. In both cases, increasing the dimensionality of the internal representation can make behaviour easier to describe without changing what is observable at the interface. A quantum process and a higher dimensional reformulation can be empirically equivalent while differing radically in how natural they make certain explanations look. Likewise, the same sequence of substrate-rate actions emitted by an agent can be modelled either as a flat stochastic policy over tokens or as the projection of a higher dimensional flock of Diorama cells, each carrying its own derivative-aware vote and ledger-addressable Episode history.
The wager of this research programme is that the higher dimensional description is not just aesthetically appealing but empirically useful. If Episodes, Fables, ledgers, and flocks genuinely buy us better predictions, cleaner falsification conditions, and more robust substrate transitions, then we have the cognitive analogue of a successful higher dimensional reformulation: a representation in which the underlying dynamics is simpler than it looks from the outside. If they do not, then the analogy to physics becomes a warning rather than a guide, and the programme should be retired on the same grounds that untestable reformulations in quantum foundations are set aside in favour of those that earn their keep.
At each scale, the same five shapes recur. An Episode is a local composition of binary, table, graph, vector, and ledger. A Fable is a compression over Episodes that still projects into the same shapes at a larger scale. A lifetime of Fables is again a composition in the same basis. This self-similarity is intentional: the architecture is a nested hierarchy in which each level contains rescaled traces of the previous, in the same spirit that nested geometric constructions reuse a single proportion to generate a coherent whole. Recent work on nested cortical hierarchies (Baldassano et al., 2017; Geerligs et al., 2022) finds that event boundaries and neural states are organised in a partially nested temporal hierarchy - short states in sensory regions, longer ones in association areas, with boundaries propagating upward. The Nested Observer Windows model of consciousness (Riddle and Schooler, 2024) explicitly proposes a hierarchy of spatiotemporal observer windows, each with substantial autonomy, feeding into a higher level unified experience. Studies of hierarchical cognitive maps (Peer et al., 2025) show that people form nested representations by dividing environments into subspaces and integrating those, with explicit reuse of structure across levels. The convergence from neuroscience, consciousness research, and spatial cognition on nested hierarchies that reuse structure across scales is encouraging for the architecture, though convergence is not proof.
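A minimal sketch of the self-similarity claim, under assumed composition rules (concatenate binaries, graphs, and ledgers; average vectors; summarise tables). The rules are illustrative; the point is only that every level, from Episode to Fable to a lifetime of Fables, projects into the same five-key basis.

```python
# Each level of the nested hierarchy is a dict over the same five shapes.
SHAPES = ("binary", "table", "graph", "vector", "ledger")

def compose(parts: list[dict]) -> dict:
    """Compose lower-level units into one higher-level unit in the same basis."""
    return {
        "binary": b"".join(p["binary"] for p in parts),
        "table": {"n_parts": len(parts)},
        "graph": [edge for p in parts for edge in p["graph"]],
        "vector": [sum(v) / len(parts) for v in zip(*(p["vector"] for p in parts))],
        "ledger": [entry for p in parts for entry in p["ledger"]],
    }

episode = {"binary": b"scene", "table": {"tone": "horror"},
           "graph": [("cat", "sat_on", "mat")], "vector": [0.1, 0.9],
           "ledger": [(0, "cat sits")]}

fable = compose([episode, episode])      # one level up, same five keys
lifetime = compose([fable, fable])       # and again: the basis does not change
assert set(fable) == set(lifetime) == set(SHAPES)
```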
14.3b The philosophical status of the five shapes
The philosophical status of the five shapes deserves explicit comment. We claim them as natural kinds in the sense of Boyd’s homeostatic property cluster theory (Boyd 1991, 1999) - categories maintained by informational structure that enable reliable prediction across domains - not as essentialist necessities derivable from axioms. The partial mapping onto algebraic type theory (Unit, Product, Exponential, List, plus Graph as topologically irreducible) provides structural support but not a completeness proof. We follow Mendeleev rather than Euclid: the classification earns its keep by predicting specific failure modes when the wrong shape is used, and those predictions are testable.
14.4 Limitations of the measurement programme
The measurement programme described in Sections I to XII is ambitious. Several of the experiments require infrastructure that does not yet exist in public form. The reference implementation we commit to providing is minimum viable, not production grade. The benchmark suites we point at are sketched rather than fully specified.
Section 2.6 names the five baseline categories we commit to testing against: flat RAG, vector-only memory (Mem0), graph memory (Zep/Graphiti), structured episodic memory (Synapse, Letta), and classical cognitive architectures (SOAR, ACT-R). These are the current leaders as of April 2026. The comparison is architecture-level, not parameter-matched in the narrow sense - we compare the Diorama composition against the best available system in each category on the same tasks. This is a stronger commitment than “parameter matched but not architecture matched,” which is what an earlier draft offered. We prefer the stronger version because the weaker version invites the obvious objection: of course a more complex architecture beats a deliberately handicapped one.
These limitations are real but not fatal. The paper is description, not disclosure (Section 1.7). The full measurement programme will require community effort to implement and run. We believe the value of having a clear target for measurement exceeds the value of having a fully specified programme that nobody actually runs. A clear incomplete target is better than a complete target nobody engages with.
14.5 Corroborating evidence from independent systems
Jovovich and Sigman’s MemPalace (April 2026), an open-source AI memory system built from the classical method of loci rather than from this paper’s framework, independently converges on several structural claims made here. Two empirical findings are load-bearing: verbatim storage outperforms summarisation on the LongMemEval benchmark (mid-nineties versus mid-eighties recall, supporting Section II’s claim that storage fidelity is the bottleneck), and structured spatial retrieval outperforms flat search by over thirty points (supporting Section V’s claim that shape composition outperforms any single shape). Independent convergence from a different starting point - benchmark optimisation rather than theoretical derivation - is the strongest form of structural evidence.
14.6 On method
The paper was composed with the assistance of AI tools for research, drafting, and error checking. The intellectual positions, the framework, the measurement programme, and the voice are the author’s. The tools are acknowledged as tools, not as co-authors. The measurements will come out the same regardless of which tools were used to write them down.
15. Acknowledgements
This paper exists because Peter Cooper has been writing a verbatim corpus of intellectual positions over the course of 2026 and has allowed them to be used as primary source material. Peter’s thinking, in his own words, is load bearing for every section. Where the paper compresses a specific idea into prose, the compression is built on top of multiple verbatim passages that recorded the idea freshly as it arrived. The Source Material table at the front of the paper lists the specific files that fed each section. Future readers who want to attack a particular claim should go to the cited verbatim first.
The research infrastructure includes a Neo4j graph database, a semantic search pipeline, and a deep research programme whose bundled reports informed Sections II, VI, and XI. AI tools were used extensively for drafting, research synthesis, and error checking. The tools are acknowledged as infrastructure, not as authors.
We thank the reader in advance for the measurements they will attempt. The Fable is useful only if the decompressors engage. The crystal is real only if other angles are observed.
16. References
This section lists the primary references for the framework, in the order of first citation in the paper.
Geisel, T. S. [Dr. Seuss] (1957). The Cat in the Hat. Random House. ISBN: 978-0394800011. (The architectural metaphor for autonomous cognition without observer dependency that structures this paper’s introduction.)
Friston, K. (2010). The free energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138. DOI: 10.1038/nrn2787.
Friston, K. (2019). A free energy principle for a particular physics. arXiv preprint arXiv:1906.10184.
Flash, T., and Hogan, N. (1985). The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience, 5(7), 1688-1703. DOI: 10.1523/JNEUROSCI.05-07-01688.1985.
Bennett, M. (2023). A Brief History of Intelligence: Evolution, AI, and the Five Breakthroughs That Made Our Brains. Mariner Books. ISBN: 978-0063286153.
Levin, M. (2022). Technological approach to mind everywhere: an experimentally grounded framework for understanding diverse bodies and minds. Frontiers in Systems Neuroscience, 16. DOI: 10.3389/fnsys.2022.768201.
Levin, M., and Dennett, D. (2020). Cognition all the way down. Aeon. Published 13 October 2020. https://aeon.co/essays/how-to-understand-cells-tissues-and-organisms-as-agents-with-agendas.
Dayan, P. (1993). Improving generalisation for temporal difference learning: the successor representation. Neural Computation, 5(4), 613-624. DOI: 10.1162/neco.1993.5.4.613.
Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media. ISBN: 978-1449373320. (Event sourcing, immutable logs, stream processing architectures.)
JUXT Ltd. (2024). XTDB: An immutable SQL database for application development, time-travel reporting and data compliance. https://xtdb.com/. (Bitemporal data model with system time and valid time, append-only transaction log, SQL:2011 temporal support.)
Rochberg, F. (2004). The Heavenly Writing: Divination, Horoscopy, and Astronomy in Mesopotamian Culture. Cambridge University Press. (Babylonian astronomical diaries.)
Witzel, M. (1997). The development of the Vedic canon and its schools: the social and political milieu. In Inside the Texts, Beyond the Texts. Harvard Oriental Series.
Wilkinson, E. (2013). Chinese History: A New Manual. Harvard University Asia Center. (Chinese dynastic annals.)
Steinsaltz, A. (1976). The Essential Talmud. Basic Books. (Talmudic commentary chains.)
Brown, J. (2009). Hadith: Muhammad’s Legacy in the Medieval and Modern World. Oneworld. (Islamic isnad chains.)
Bar Ilan University (ongoing). Bar Ilan Responsa Project. Online database. (Jewish legal responsa.)
Howse, D. (1980). Greenwich Time and the Discovery of the Longitude. Oxford University Press. (Greenwich observatory records.)
Helland, P. (2016). Immutability changes everything. Communications of the ACM, 59(1), 64-70. DOI: 10.1145/2844112. (Event sourcing and ledger patterns.)
Tononi, G. (2012). Phi: A Voyage from the Brain to the Soul. Pantheon. (Integrated information theory, referenced as adjacent but not load bearing.)
Dennett, D. (1991). Consciousness Explained. Little, Brown. (Multiple drafts model, cited for homunculus dissolution.)
Hofstadter, D. (1979). Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books. (Strange loops, self reference in cognitive architecture.)
Minsky, M. (1986). The Society of Mind. Simon and Schuster. (Society of mind model, ancestor of the Flock fabric.)
Larkin, J. H., and Simon, H. A. (1987). Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11(1), 65-100. DOI: 10.1111/j.1551-6708.1987.tb00863.x. (Different representations enable different inferences; foundational evidence for Section V’s irreducibility claim.)
Laird, J. E., Newell, A., and Rosenbloom, P. S. (1987). SOAR: an architecture for general intelligence. Artificial Intelligence, 33(1), 1-64. DOI: 10.1016/0004-3702(87)90050-6.
Anderson, J. R. (2007). How Can the Human Mind Occur in the Physical Universe?. Oxford University Press. (ACT R cognitive architecture.)
Jovovich, M., and Sigman, B. (2026). MemPalace [Software]. GitHub: https://github.com/milla-jovovich/mempalace. (Structured memory retrieval, method of loci applied to AI memory, verbatim vs summary benchmarks. 41K+ stars. MIT licensed.)
Cicero, M. T. (55 BCE). De Oratore, Book II, 86.352-354. (Method of loci, classical source for spatial memory architecture.)
Quintilian, M. F. (c. 95 CE). Institutio Oratoria, Book XI, 2.17-22. (Method of loci, rhetorical memory training. Classical source retained in bibliography for completeness; inline citation removed in favour of the causal claim about spatial decorrelation.)
O’Keefe, J., and Nadel, L. (1978). The Hippocampus as a Cognitive Map. Clarendon Press. (Place cells, allocentric spatial mapping, hippocampal memory indexing.)
Chandra, S., Sharma, S., Chaudhuri, R., and Fiete, I. (2025). Episodic and associative memory from spatial scaffolds in the hippocampus. Nature. DOI: 10.1038/s41586-024-08392-y. (Vector-HaSH model: grid cell scaffold encodes both spatial maps and sequential episodic memories.)
Singer, W., and Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555-586. DOI: 10.1146/annurev.ne.18.030195.003011. (Binding by synchrony, gamma band oscillations in perceptual binding.)
Fries, P. (2015). Rhythms for cognition: communication through coherence. Neuron, 88(1), 220-235. DOI: 10.1016/j.neuron.2015.09.034. (Communication through coherence hypothesis, gamma band as mechanism for inter-area synchronisation.)
Tishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. Proceedings of the 37th Allerton Conference on Communication, Control, and Computing, 368-377. (Formal framework for what survives compression; relevant to Section V and the Fable round-trip protocol.)
Zhang, J., and Norman, D. A. (1994). Representations in distributed cognitive tasks. Cognitive Science, 18(1), 87-122. DOI: 10.1207/s15516709cog1801_3. (Representational determinism: format determines available inference space, not just speed. Foundational evidence for the five-shape irreducibility claim.)
Prufer, K., Racimo, F., Patterson, N., et al. (2014). The complete genome sequence of a Neanderthal from the Altai Mountains. Nature, 505(7481), 43-49. (Neanderthal DNA introgression, approximately two percent in non-African modern humans.)
Aphthonius of Antioch. (c. 4th century CE). Progymnasmata. Translated in Kennedy, G. A. (2003). Progymnasmata: Greek Textbooks of Prose Composition and Rhetoric. Brill. (Chreia elaboration under eight heads: encomium, paraphrase, cause, converse, analogy, example, testimony of ancients, epilogue.)
Read, L. (1958). I, Pencil: My Family Tree as Told to Leonard E. Read. The Freeman. (Scale invariant coordination without central planning, ancestor of the pencil metaphor in the Coda.)
Parr, T., Pezzulo, G., and Friston, K. (2025). Beyond Markov: Transformers, memory, and attention. Cognitive Neuroscience. DOI: 10.1080/17588928.2025.2484485. (Non-Markovian generative models in transformers; attention as selective history weighting; two approaches to non-Markovian sequences.)
Barandes, J. A. (2023a). The stochastic-quantum correspondence. arXiv preprint arXiv:2302.10778. (Indivisible stochastic processes as reformulation of quantum mechanics, structural indivisibility of temporal processes.)
Barandes, J. A. (2023b). The stochastic-quantum theorem. arXiv preprint arXiv:2309.03085. (The formal proof that quantum systems can be characterised as indivisible stochastic processes.)
Barandes, J. A. (2024). Quantum theory from indivisible stochastic processes. Philosophy of Physics, 2(1), 3. DOI: 10.31389/pop.186. (Peer-reviewed version of the ISP framework with DOI.)
Barandes, J. A. (2025). Quantum systems as indivisible stochastic processes. arXiv preprint arXiv:2507.21192. (Extended ISP framework with gauge invariance, dynamical symmetries, and Hilbert-space dilations. Convergent evidence for irreducible temporality in coherent systems.)
Boyd, R. (1991). Realism, anti-foundationalism and the enthusiasm for natural kinds. Philosophical Studies, 61(1-2), 127-148. (Homeostatic property cluster theory of natural kinds. The five shapes are natural kinds in Boyd’s sense: categories maintained by informational structure that enable reliable prediction across domains.)
Zep AI (2025). Graphiti: temporal knowledge graph for AI agents. Open source. (Bitemporal knowledge graph with event time and system time, 94.8% Dialogue Memory Retention. Strongest baseline for the ledger-as-fifth-shape claim.)
Packer, C., Wooders, S., Lin, K., et al. (2024). MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. (Letta/MemGPT: filesystem approach to long-term agent memory, 74% on conversation continuity tasks.)
Xu, Z., et al. (2025). Synapse: episodic-semantic dual-layer graph for long conversation memory. (Spreading activation over dual-layer graph, F1 40.5 on LoCoMo. Baseline for structured episodic memory.)
Anokhin, P., et al. (2025). AriGraph: learning knowledge graph world models with episodic memory for LLM agents. IJCAI 2025. (Semantic and episodic graph structures from agent experience.)
Mem0 AI (2025). Mem0: the memory layer for AI agents. Open source. (Vector and graph-enhanced memory, Mem0g variant scoring approximately 68% on dialogue memory benchmarks.)
Clayton, N. S., Dally, J. M., and Emery, N. J. (2007). Social cognition by food-caching corvids: the western scrub-jay as a natural psychologist. Philosophical Transactions of the Royal Society B, 362(1480), 507-522. DOI: 10.1098/rstb.2006.1992. (Corvid episodic-like memory: what, where, when, who was watching. Flexible re-caching and pilfering policies as evidence for high representational dimensionality D.)
Menzel, R. (2023). Navigation and dance communication in honeybees: a cognitive perspective. Journal of Comparative Physiology A. DOI: 10.1007/s00359-023-01619-9. (Compact spatial code, symbolic dance channel, colony level behaviour extending beyond individual lifespan. High bandwidth B, short individual horizon H, colony level extension.)
Whitehead, H., and Rendell, L. (2015). The Cultural Lives of Whales and Dolphins. University of Chicago Press. (Multi-level alliances, vocal dialects, distributed cultural ledgers in acoustic space. Cross-generational Fables about migration, foraging, and identity. High H and social D.)
Baldassano, C., Chen, J., Zadbood, A., Pillow, J. W., Hasson, U., and Norman, K. A. (2017). Discovering event structure in continuous narrative perception and memory. Neuron, 95(3), 709-721. DOI: 10.1016/j.neuron.2017.06.041. (Nested cortical hierarchies: short event states in sensory regions, longer in association areas.)
Geerligs, L., Gozukara, D., Oetringer, D., Campbell, K. L., van Gerven, M. A. J., and Guclu, U. (2022). A partially nested cortical hierarchy of neural states underlies event segmentation in the human brain. eLife, 11, e77430. DOI: 10.7554/eLife.77430. (Event boundaries organised in partially nested temporal hierarchy, boundaries propagating upward.)
Riddle, J., and Schooler, J. W. (2024). Hierarchical consciousness: the Nested Observer Windows model. Neuroscience of Consciousness, 2024(1), niae010. DOI: 10.1093/nc/niae010. (Hierarchy of spatiotemporal observer windows, each with substantial autonomy, feeding into higher level unified experience. Nested mosaic tiles model of consciousness across spatiotemporal scales.)
Peer, M., et al. (2025). Hierarchical cognitive maps of nested environments. bioRxiv. DOI: 10.1101/2025.02.05.636580. (Nested spatial representations: people divide environments into subspaces and integrate those, with explicit reuse of structure across levels.)
Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., and Fang, Y. (2024). Evaluating very long-term conversational memory of LLM agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). arXiv:2402.17753. (LoCoMo benchmark: 600 turns, 16K tokens, up to 32 sessions. Human performance 87.9%. LLMs lag behind human levels by 36% overall, with temporal reasoning gap of 41%.)
Chen, Y., Singh, V. K., Ma, J., and Tang, R. (2025). CounterBench: a benchmark for counterfactuals reasoning in large language models. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026). arXiv:2502.11008. (1K counterfactual reasoning questions over formal causal graphs. Most LLMs perform at near random guessing levels. Direct evidence for the temporal collapse diagnosed in Section VI.)
Chu, Z., Chen, J., Chen, Q., Yu, W., Wang, H., Liu, M., and Qin, B. (2024). TimeBench: a comprehensive evaluation of temporal reasoning abilities in large language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 1204-1228. arXiv:2311.17667. (Hierarchical temporal reasoning benchmark. Significant performance gap between SOTA LLMs and humans on temporal tasks.)
Zhao, Y., Yuan, B., Huang, J., et al. (2026). AMA-Bench: evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769. (Agent Memory with Any length. AMA-Agent achieves 57.22% average accuracy. Existing memory systems underperform because they lack causality information and rely on lossy similarity-based retrieval.)
Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., and Yu, D. (2024). LongMemEval: benchmarking chat assistants on long-term interactive memory. ICLR 2025. arXiv:2410.10813. (500 curated questions, five core memory abilities. Commercial chat assistants show 30% accuracy drop on sustained interactions. Best oracle configuration approximately 92% with GPT-4o and Chain-of-Note.)