What Is the CitySim.150Y.CF Backtest Protocol?

A city engine should not be trusted just because it looks intelligent.

It should be tested against reality.

That is what the CitySim.150Y.CF Backtest Protocol is for.

If the simulation says a city should have moved one way over the last 10, 20, or 50 years, and the real city moved another way, then the model has a problem. It may still be useful for thinking, but it is not yet calibrated properly. That is exactly what happened when Tokyo was pushed through the first version of the engine. The structure was interesting, but the backtest showed that the model was not yet tight enough to claim forecast-grade accuracy.

So before more forward 150-year runs are done, CitySim needs a formal backtest protocol.

One-sentence answer

The CitySim.150Y.CF Backtest Protocol is the canonical procedure for running the simulation from a known historical starting point to a later observed reality, scoring how well the model matched the real city, and deciding whether the engine is accurate, directionally useful, or still too loose for serious prediction.

That is the page that forces the model to face the world.


Why this page has to exist

A forward simulation can always sound impressive.

It can generate:

  • elegant trajectories
  • strong language
  • neat scoreboards
  • clear winners and losers

But none of that proves the engine is working.

The real test is much simpler:

If the model starts in the past and moves toward the present, how close does it get to what really happened?

That is backtesting.

Without backtesting:

  • the model may be clever but unproven
  • the coefficients may be arbitrary
  • the variable weights may be too strong or too weak
  • the direction may be right but the magnitude wrong
  • the engine may confuse narrative plausibility with calibration

A long-horizon city simulator without a backtest protocol is not yet a serious engine.


What the Backtest Protocol does

The Backtest Protocol does six jobs.

1. It checks whether the engine can reproduce known history

This is the basic reality check.

If the engine cannot approximately reproduce a city’s recent past, then it should not speak too confidently about its long-run future.

2. It separates directional truth from numeric accuracy

Some models get the direction right but the size wrong.

For example:

  • it correctly sees fertility decline
  • it correctly sees increasing aging pressure
  • it correctly sees growing education leakage

But it may still misjudge:

  • how fast the decline happens
  • how large the effect becomes
  • which subsystem weakens first

That difference matters.

3. It identifies which variables are under-calibrated

A backtest should not only say “pass” or “fail.”

It should show:

  • which variables fit reality reasonably well
  • which variables drift too far
  • which proxy bundles are too weak
  • which transition rules are too gentle or too aggressive

That is how the engine improves.

4. It prevents forward overclaiming

If the engine has not passed its backtest, then forward scenarios should be labeled honestly:

  • exploratory
  • scenario-grade
  • directionally useful
  • not forecast-grade

This is one of the most important trust rules in the entire CitySim stack.

5. It creates a standard for comparing cities

A backtest should not be improvised differently for every city.

Tokyo, London, Seoul, Singapore, and Paris should all be backtested using the same basic grammar:

  • same protocol
  • same scoring logic
  • same error language
  • same pass/fail classes

Only then can cross-city comparisons stay honest.

6. It tells us whether recalibration is needed

The backtest is the gate before calibration.

If the model already fits reasonably well, only minor adjustment may be needed.

If the model misses badly, then the calibration layer has to reopen.


What a backtest is not

A backtest is not:

  • proof that the model can predict perfectly
  • proof that the future will behave like the past
  • proof that every internal latent variable is correct
  • proof that the city engine is finished

A backtest only says:

given what really happened in a known time window, how well did the engine reproduce that movement?

That is enough to matter a great deal.


The basic backtest structure

Every CitySim backtest should follow the same sequence.

Step 1. Pick the city and the backtest window

Example:

  • Tokyo
  • start year: 2010
  • end year: 2025

Or:

  • 1975 to 2025
  • 2000 to 2025
  • 2015 to 2025

The window should be long enough to reveal meaningful movement, but not so long that the source definitions become unusably inconsistent.

Step 2. Freeze the start-state using only data available at the start year

This is critical.

Do not let the model see future data from the end of the window.

The start-state must only use what would have been knowable at the chosen start date.

Otherwise the backtest is contaminated.
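A minimal sketch of the start-state freeze, assuming a hypothetical record shape in which each observation carries both the year it describes and the year it became publicly available. The `Observation` fields and the 2010/2012 figures are illustrative, not real data:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    variable: str   # e.g. "fertility_rate" (hypothetical name)
    year: int       # year the value describes
    published: int  # year the value became publicly available
    value: float

def freeze_start_state(observations, start_year):
    """Keep only observations that were knowable at the start year.

    Both the described year and the publication year must be at or
    before start_year, so no end-of-window data leaks into the state.
    """
    return [
        o for o in observations
        if o.year <= start_year and o.published <= start_year
    ]

# A 2012 revision of a 2009 figure must be excluded from a 2010
# start-state, even though it describes a pre-2010 year.
obs = [
    Observation("fertility_rate", 2009, 2010, 1.37),
    Observation("fertility_rate", 2009, 2012, 1.39),  # later revision
]
frozen = freeze_start_state(obs, start_year=2010)
```

The publication-year check is the part that actually enforces Rule 1 below: filtering on the described year alone would still let revised or late-released figures contaminate the start-state.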

Step 3. Run the simulation forward using the declared engine

Now the city engine advances from the historical start-state toward the later period.

Use:

  • the declared variable registry
  • the declared proxy map
  • the declared data adapter
  • the declared transition kernel

No post-hoc edits in the middle.

Step 4. Compare simulated outputs with observed reality

At the end of the backtest window, compare the simulated values to the actual later values.

This comparison should happen variable by variable, not just at the level of a single headline verdict.

Step 5. Score the errors

Each variable should receive:

  • absolute error
  • relative error where relevant
  • directional accuracy
  • quality-weighted trust score

A city may pass directionally while failing numerically.

That distinction must remain explicit.

Step 6. Decide the model status

The backtest then classifies the run.

For example:

  • calibrated
  • directionally useful
  • under-calibrated
  • failed
  • not testable due to weak proxy coverage

This final status should control what kind of forward claims the engine is allowed to make.
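The six steps above can be sketched as one small runner. Everything here is a toy under stated assumptions: the engine is a yearly step function, states are flat dicts of variable values, and the pass band of 0.05 is an arbitrary placeholder for the declared thresholds:

```python
def run_backtest(start_state, engine_step, observed, years):
    """Minimal sketch of the six-step backtest sequence.

    start_state: dict of variable -> value, frozen at the start year
    engine_step: function advancing the state by one year (the declared engine)
    observed:    dict of variable -> observed value at the end year
    years:       length of the backtest window
    """
    # Step 3: run the declared engine forward, with no mid-run edits
    state = dict(start_state)
    for _ in range(years):
        state = engine_step(state)

    # Steps 4 and 5: compare and score, variable by variable
    errors = {
        var: state[var] - observed[var]
        for var in observed
        if var in state
    }

    # Step 6: crude status decision (a real run uses declared bands
    # and the full class A-E logic, not one absolute-error cutoff)
    max_abs = max(abs(e) for e in errors.values())
    status = "calibrated" if max_abs < 0.05 else "under-calibrated"
    return errors, status

# Toy usage: an index population declining 0.5% per year
def step(state):
    return {"population": state["population"] * 0.995}

errors, status = run_backtest(
    start_state={"population": 1.00},
    engine_step=step,
    observed={"population": 0.95},
    years=10,
)
```

The point of the sketch is the shape, not the numbers: the engine is handed in as a sealed function, and scoring only ever sees its output and the observed values.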


What should be backtested

Not every variable deserves equal weight.

CitySim should backtest in layers.

Layer 1. Strong observed variables

These should always be included if data exists.

Examples:

  • total population
  • fertility rate
  • age 65+ share
  • migration
  • non-attendance rate
  • dropout rate
  • labour participation
  • life expectancy

These are the strongest reality anchors.

Layer 2. Derived variables

These should also be backtested where possible.

Examples:

  • dependency ratio
  • youth inflow index
  • teacher replacement pressure
  • housing stress
  • maintenance load

These help reveal whether the internal transformations are reasonable.

Layer 3. Latent variables with proxy bundles

These can be backtested more cautiously.

Examples:

  • legitimacy
  • transfer integrity
  • late-life usefulness
  • repair rate
  • civic continuity

These should not dominate the overall backtest verdict unless their proxies are strong enough.


The three scoring dimensions

A good backtest needs more than one score.

1. Directional accuracy

Did the model at least get the direction right?

Examples:

  • fertility down
  • aging up
  • school stress worsening
  • youth share down
  • late-life load up

This is the easiest test.

2. Magnitude accuracy

Did the model get the size of the change reasonably right?

Examples:

  • not just “aging increased”
  • but “aging increased by roughly the right amount”

This is a stricter test.

3. Timing accuracy

Did the model shift too early, too late, or at roughly the right pace?

Examples:

  • did it see school strain worsening, but only too slowly?
  • did it anticipate population flattening too early?
  • did it under-react to policy or economic shocks?

Timing matters a great deal in city systems.
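The three dimensions can each be scored with a small function. This is a sketch with hypothetical conventions: the flat band of 0.01, the use of yearly value lists for paths, and the threshold-crossing definition of timing are all assumptions, not protocol-mandated choices:

```python
def directional_hit(sim_change, obs_change, flat_band=0.01):
    """Dimension 1: did simulation and reality move the same way?

    Changes smaller than flat_band count as flat (hypothetical band).
    """
    def sign(x):
        if abs(x) < flat_band:
            return 0
        return 1 if x > 0 else -1
    return sign(sim_change) == sign(obs_change)

def magnitude_ratio(sim_change, obs_change):
    """Dimension 2: how close is the simulated size of change?

    1.0 means exact; 2.0 means the model moved twice as far as reality.
    """
    if obs_change == 0:
        return float("inf") if sim_change else 1.0
    return abs(sim_change) / abs(obs_change)

def timing_error(sim_path, obs_path, threshold):
    """Dimension 3: years between simulated and observed threshold crossing.

    Positive = model was late; negative = model was early. Paths are
    lists of yearly values; returns None if either never crosses.
    """
    def first_cross(path):
        for t, value in enumerate(path):
            if value >= threshold:
                return t
        return None
    sim_t, obs_t = first_cross(sim_path), first_cross(obs_path)
    if sim_t is None or obs_t is None:
        return None
    return sim_t - obs_t
```

For example, a model whose aging-share path crosses a stress threshold one year after the real city did would get a timing error of +1: direction right, pace slightly too slow.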


Error measures CitySim should use

The protocol should allow a standard small set of error measures.

1. Absolute error

Simple difference between simulated and observed values.

Useful for:

  • percentage-point gaps
  • index gaps
  • count differences

2. Relative error

Useful when the scale of the variable matters.

Example:

  • being off by 2 points on a 10-point variable is not the same as being off by 2 points on a 100-point variable

3. Directional hit / miss

Did the model get the sign of the change correct?

Up / down / flat.

4. Path deviation

How much did the simulated trajectory differ from the observed path across the whole window, not just the endpoint?

This is very important for longer backtests.

5. Confidence-weighted error

Variables with weak proxy quality should contribute less strongly to the overall verdict.

A weak legitimacy bundle should not outweigh a strong fertility miss.
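The two less obvious measures, path deviation and confidence-weighted error, can be sketched directly. The quality scores in `[0, 1]` and the variable names are illustrative assumptions:

```python
def path_deviation(sim_path, obs_path):
    """Mean absolute gap across the whole window, not just the endpoint."""
    assert len(sim_path) == len(obs_path)
    return sum(abs(s - o) for s, o in zip(sim_path, obs_path)) / len(sim_path)

def confidence_weighted_error(errors, proxy_quality):
    """Weighted mean of per-variable errors; weak proxies count for less.

    errors and proxy_quality are dicts keyed by variable name;
    quality is a score in [0, 1] (hypothetical convention).
    """
    total_weight = sum(proxy_quality[v] for v in errors)
    return sum(abs(errors[v]) * proxy_quality[v] for v in errors) / total_weight

# A large miss on a weak legitimacy bundle is damped; a smaller miss on
# strongly observed fertility still carries nearly full weight.
errors = {"fertility_rate": 0.2, "legitimacy": 0.8}
quality = {"fertility_rate": 0.9, "legitimacy": 0.2}
cwe = confidence_weighted_error(errors, quality)
```

Under these toy numbers the weighted error sits far below the raw legitimacy miss, which is exactly the behaviour the rule asks for.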


Backtest status classes

CitySim should classify results into clear bands.

Class A — well calibrated

  • strong directional fit
  • acceptable magnitude fit
  • acceptable timing fit
  • no major observable proxy failures

This is rare and should be earned.

Class B — directionally strong, numerically imperfect

  • direction mostly right
  • magnitude somewhat off
  • timing usable but not precise
  • suitable for scenario work, cautious forward runs

This is a respectable result.

Class C — partially useful, under-calibrated

  • some important directions right
  • several major variables too far off
  • not good enough for confident forward claims

This is where early CitySim Tokyo currently sits.

Class D — weak fit

  • multiple core variables wrong in direction or magnitude
  • poor trust for forward use
  • recalibration required before more city claims

Class E — not testable

  • data too weak
  • proxies too incomplete
  • boundary mismatches too severe
  • backtest not valid enough to score

This is not a failure of the idea. It is a signal that the measurement layer is not ready.
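The five bands can be expressed as a single decision function. All the numeric cutoffs below (0.5 proxy coverage, three core failures, the 0.9/0.8 rates) are placeholder assumptions standing in for the protocol's declared thresholds:

```python
def classify(direction_hit_rate, magnitude_ok_rate, proxy_coverage,
             core_failures):
    """Map backtest scores to status classes A-E (hypothetical bands).

    Rates are fractions in [0, 1]; core_failures counts core observed
    variables that missed badly in direction or magnitude.
    """
    # Class E first: if the measurement layer is too weak, nothing
    # else about the run is scoreable.
    if proxy_coverage < 0.5:
        return "E"
    # Class D: multiple core variables wrong.
    if core_failures >= 3:
        return "D"
    # Class A: strong direction, acceptable magnitude, no core misses.
    if direction_hit_rate >= 0.9 and magnitude_ok_rate >= 0.8 \
            and core_failures == 0:
        return "A"
    # Class B: direction mostly right, magnitude somewhat off.
    if direction_hit_rate >= 0.8:
        return "B" if magnitude_ok_rate >= 0.5 else "C"
    # Class C: partially useful, under-calibrated.
    return "C"
```

The ordering matters: testability (E) is checked before fit, so a run never earns a fit grade its proxy coverage cannot support.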


What counts as a valid backtest window

The backtest window must be chosen carefully.

A valid window should have:

  • enough years to show real city movement
  • enough usable data coverage
  • stable enough variable definitions
  • clear start-state observability

Typical windows

  • 10 years: good for recent policy and directional checks
  • 20–25 years: better for medium-run city drift
  • 50 years: useful for broad civilisation patterns, but harder because of source-definition drift

So the protocol should allow several window classes rather than pretending one window fits all purposes.


Rules that stop the backtest from cheating

The Backtest Protocol needs hard anti-cheating rules.

Rule 1. No future leakage

Do not use end-of-window data in the start-state.

Rule 2. No mid-run retuning

Once the backtest starts, the coefficients cannot be adjusted halfway through to make the result look better.

Rule 3. No post-hoc threshold shifting

Do not tighten or loosen the success bands after seeing the error.

Rule 4. No hidden variable substitution

Do not quietly replace a weak variable with a different variable near the end of the run.

Rule 5. No silent source upgrades

If a better dataset was only available after the run began, it should not be quietly imported into the earlier model without declaration.

Rule 6. Proxy quality must affect scoring

Weak proxy variables should carry less weight in the overall verdict.

These rules matter because calibration without discipline becomes curve-fitting theatre.
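Rule 2 in particular is mechanically enforceable: fingerprint the declared configuration before the run and verify it afterwards. A sketch using a canonical JSON hash; the coefficient names and values are hypothetical:

```python
import hashlib
import json

def config_fingerprint(config):
    """Hash the declared engine configuration in a canonical form.

    sort_keys makes the hash independent of dict insertion order,
    so only a real value change can alter the fingerprint.
    """
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# Guard against mid-run retuning: fingerprint before, verify after.
config = {"fertility_coeff": -0.012, "aging_weight": 0.4}  # hypothetical
before = config_fingerprint(config)
# ... run the backtest with this config ...
assert config_fingerprint(config) == before, "mid-run retuning detected"
```

The same fingerprint, published in the assumptions file, also lets a reader verify that the run they are scoring used the configuration that was declared.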


The minimum backtest output pack

Every CitySim backtest should publish at least these outputs.

1. Start-state file

The exact variables and values used at the historical start point.

2. Backtest assumptions file

Declared coefficients, transition rules, and fallback rules active during the backtest.

3. Simulated trajectory file

What the engine produced through the window.

4. Observed comparison file

What the real city data showed.

5. Error table

Variable-by-variable error scoring.

6. Final model-status verdict

Class A, B, C, D, or E.

That should become standard for all serious city runs.
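The six outputs can be bundled as one record so that an incomplete pack is detectable before publication. The field shapes are hypothetical conventions, not a mandated schema:

```python
from dataclasses import dataclass

@dataclass
class BacktestOutputPack:
    """The six minimum published outputs of a CitySim backtest (sketch)."""
    start_state: dict           # variable -> value at the start year
    assumptions: dict           # coefficients, transition and fallback rules
    simulated_trajectory: dict  # variable -> list of yearly simulated values
    observed_comparison: dict   # variable -> list of yearly observed values
    error_table: dict           # variable -> error scores
    verdict: str                # one of the classes "A" through "E"

    def is_publishable(self):
        """A pack missing any component should not be published."""
        parts = [self.start_state, self.assumptions,
                 self.simulated_trajectory, self.observed_comparison,
                 self.error_table]
        return all(parts) and self.verdict in {"A", "B", "C", "D", "E"}

# Toy pack with a single variable
pack = BacktestOutputPack(
    start_state={"population": 1.00},
    assumptions={"decline_rate": 0.005},
    simulated_trajectory={"population": [1.00, 0.995]},
    observed_comparison={"population": [1.00, 0.993]},
    error_table={"population": 0.002},
    verdict="C",
)
```

Treating the pack as a single unit also makes the anti-cheating rules auditable: every claim in the verdict traces back to files frozen in the same bundle.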


What the protocol should conclude after a backtest

After the backtest, CitySim should answer these questions clearly.

  1. Did the engine get the major directions right?
  2. Did it get the size of change reasonably right?
  3. Which variables are clearly under-calibrated?
  4. Which proxy bundles are too weak?
  5. Can the engine be used for forward scenarios yet?
  6. If yes, at what confidence level?
  7. If no, what needs recalibration first?

That is a proper backtest outcome.

Not just “the model seems sensible.”


Why this matters after Tokyo

Tokyo already taught the lesson.

The first 150-year forward run was structurally interesting, but once forced through a historical reality check, it became clear that the model was:

  • directionally useful
  • but not yet numerically trustworthy enough

That is not embarrassing.

That is exactly why this protocol has to exist.

The real embarrassment would be pretending that a backtest is unnecessary.

A civilisation-grade city engine must be able to say:

  • here is where we matched reality
  • here is where we missed
  • here is how large the miss was
  • here is whether the model is still fit for forward use

That is what gives the engine integrity.


Final definition

The CitySim.150Y.CF Backtest Protocol is the canonical reality-check procedure that runs the city engine from a historical starting point to a later observed state, measures directional, magnitude, and timing error, and classifies whether the model is calibrated, directionally useful, under-calibrated, or not yet fit for forward prediction.

Without it, CitySim can still produce scenarios.

But it cannot yet earn trust.


Almost-Code

```text id="o7xw2e"
CITYSIM_150Y_CF_BACKTEST_PROTOCOL_V1

PURPOSE:
Test whether the declared CitySim engine can reproduce known city movement from a historical start-state to a later observed reality.

CORE_LAW:
No forward 150-year city claim may be treated as forecast-grade unless the engine has passed a declared backtest class threshold.

BACKTEST_SEQUENCE:

  1. choose(city, window_start, window_end)
  2. freeze_start_state(using_only_data_available_at_window_start)
  3. run_simulation_forward(using_declared_engine_only)
  4. compare(simulated_outputs, observed_outputs)
  5. score_errors(variable_by_variable)
  6. classify_model_status

BACKTEST_LAYERS:
L1 = strong observed variables
L2 = derived variables
L3 = latent variables with proxy bundles

SCORING_DIMENSIONS:

  • directional_accuracy
  • magnitude_accuracy
  • timing_accuracy

ERROR_MEASURES:

  • absolute_error
  • relative_error
  • directional_hit_or_miss
  • path_deviation
  • confidence_weighted_error

MODEL_STATUS_CLASSES:
A = well_calibrated
B = directionally_strong_numerically_imperfect
C = partially_useful_under_calibrated
D = weak_fit
E = not_testable

VALID_WINDOW_TYPES:

  • short_window = 10_years
  • medium_window = 20_to_25_years
  • long_window = 50_years

ANTI_CHEATING_RULES:

  • no_future_leakage
  • no_mid_run_retuning
  • no_post_hoc_threshold_shifting
  • no_hidden_variable_substitution
  • no_silent_source_upgrade
  • proxy_quality_must_affect_scoring

MINIMUM_BACKTEST_OUTPUT_PACK:

  • start_state_file
  • assumptions_file
  • simulated_trajectory_file
  • observed_comparison_file
  • error_table
  • final_model_status_verdict

PASS_LOGIC:
IF major_directions_correct
AND magnitude_error_within_declared_band
AND timing_error_within_declared_band
AND no_core_observable_variable_fails_badly
THEN model_status = A_or_B

IF some_major_directions_correct
BUT several_core_variables_far_off
THEN model_status = C

IF multiple_core_variables_wrong_in_direction_or_magnitude
THEN model_status = D

IF proxy_coverage_too_weak
OR source_definition_breaks_make_comparison_invalid
THEN model_status = E

OUTPUT:
backtest_validity = TRUE or FALSE
forward_use_status = forecast_grade / scenario_grade / not_ready
```