A city engine should not be trusted just because it looks intelligent.
It should be tested against reality.
That is what the CitySim.150Y.CF Backtest Protocol is for.
If the simulation says a city should have moved one way over the last 10, 20, or 50 years, and the real city moved another way, then the model has a problem. It may still be useful for thinking, but it is not yet calibrated properly. That is exactly what happened when Tokyo was pushed through the first version of the engine. The structure was interesting, but the backtest showed that the model was not yet tight enough to claim forecast-grade accuracy.
So before more forward 150-year runs are done, CitySim needs a formal backtest protocol.
One-sentence answer
The CitySim.150Y.CF Backtest Protocol is the canonical procedure for running the simulation from a known historical starting point to a later observed reality, scoring how well the model matched the real city, and deciding whether the engine is accurate, directionally useful, or still too loose for serious prediction.
That is the page that forces the model to face the world.
Why this page has to exist
A forward simulation can always sound impressive.
It can generate:
- elegant trajectories
- strong language
- neat scoreboards
- clear winners and losers
But none of that proves the engine is working.
The real test is much simpler:
If the model starts in the past and moves toward the present, how close does it get to what really happened?
That is backtesting.
Without backtesting:
- the model may be clever but unproven
- the coefficients may be arbitrary
- the variable weights may be too strong or too weak
- the direction may be right but the magnitude wrong
- the engine may confuse narrative plausibility with calibration
A long-horizon city simulator without a backtest protocol is not yet a serious engine.
What the Backtest Protocol does
The Backtest Protocol does six jobs.
1. It checks whether the engine can reproduce known history
This is the basic reality check.
If the engine cannot approximately reproduce a city’s recent past, then it should not speak too confidently about its long-run future.
2. It separates directional truth from numeric accuracy
Some models get the direction right but the size wrong.
For example:
- it correctly sees fertility decline
- it correctly sees increasing aging pressure
- it correctly sees growing education leakage
But it may still misjudge:
- how fast the decline happens
- how large the effect becomes
- which subsystem weakens first
That difference matters.
3. It identifies which variables are under-calibrated
A backtest should not only say “pass” or “fail.”
It should show:
- which variables fit reality reasonably well
- which variables drift too far
- which proxy bundles are too weak
- which transition rules are too gentle or too aggressive
That is how the engine improves.
4. It prevents forward overclaiming
If the engine has not passed its backtest, then forward scenarios should be labeled honestly:
- exploratory
- scenario-grade
- directionally useful
- not forecast-grade
This is one of the most important trust rules in the entire CitySim stack.
5. It creates a standard for comparing cities
A backtest should not be improvised differently for every city.
Tokyo, London, Seoul, Singapore, or Paris should all be backtested using the same basic grammar:
- same protocol
- same scoring logic
- same error language
- same pass/fail classes
Only then can cross-city comparisons stay honest.
6. It tells us whether recalibration is needed
The backtest is the gate before calibration.
If the model already fits reasonably well, only minor adjustment may be needed.
If the model misses badly, then the calibration layer has to reopen.
What a backtest is not
A backtest is not:
- proof that the model can predict perfectly
- proof that the future will behave like the past
- proof that every internal latent variable is correct
- proof that the city engine is finished
A backtest only says:
given what really happened in a known time window, how well did the engine reproduce that movement?
That is enough to matter a great deal.
The basic backtest structure
Every CitySim backtest should follow the same sequence.
Step 1. Pick the city and the backtest window
Example:
- Tokyo
- start year: 2010
- end year: 2025
Or:
- 1975 to 2025
- 2000 to 2025
- 2015 to 2025
The window should be long enough to reveal meaningful movement, but not so long that the source definitions become unusably inconsistent.
Step 2. Freeze the start-state using only data available at the start year
This is critical.
Do not let the model see future data from the end of the window.
The start-state must only use what would have been knowable at the chosen start date.
Otherwise the backtest is contaminated.
Step 3. Run the simulation forward using the declared engine
Now the city engine advances from the historical start-state toward the later period.
Use:
- the declared variable registry
- the declared proxy map
- the declared data adapter
- the declared transition kernel
No post-hoc edits in the middle.
Step 4. Compare simulated outputs with observed reality
At the end of the backtest window, compare the simulated values to the actual later values.
This comparison should happen variable by variable, not just at one headline verdict level.
Step 5. Score the errors
Each variable should receive:
- absolute error
- relative error where relevant
- directional accuracy
- quality-weighted trust score
A city may pass directionally while failing numerically.
That distinction must remain explicit.
Step 6. Decide the model status
The backtest then classifies the run.
For example:
- calibrated
- directionally useful
- under-calibrated
- failed
- not testable due to weak proxy coverage
This final status should control what kind of forward claims the engine is allowed to make.
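The six-step sequence above can be sketched as a single driver function. This is a minimal illustration, not a declared CitySim interface: the engine, data source, scorer, and classifier are passed in as hypothetical callables.

```python
from dataclasses import dataclass

# Hypothetical sketch of the six-step backtest sequence.
# All callables (freeze_start_state, run_engine, observe, score, classify)
# are illustrative assumptions, not real CitySim APIs.

@dataclass
class BacktestResult:
    errors: dict   # per-variable error scores
    status: str    # model status class: A / B / C / D / E

def run_backtest(city, start_year, end_year,
                 freeze_start_state, run_engine, observe, score, classify):
    # Steps 1-2: fix the window and freeze a start-state that uses
    # only data knowable at start_year (no future leakage).
    start_state = freeze_start_state(city, start_year)
    # Step 3: advance the declared engine with no mid-run edits.
    simulated = run_engine(start_state, start_year, end_year)
    # Step 4: fetch what the real city actually did over the window.
    observed = observe(city, start_year, end_year)
    # Step 5: score errors variable by variable, not one headline verdict.
    errors = {v: score(simulated[v], observed[v]) for v in observed}
    # Step 6: classify the run into a status class.
    return BacktestResult(errors=errors, status=classify(errors))
```

The point of the shape is that every argument is declared up front: nothing about the engine or the scoring can change once the run starts.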
What should be backtested
Not every variable deserves equal weight.
CitySim should backtest in layers.
Layer 1. Strong observed variables
These should always be included if data exists.
Examples:
- total population
- fertility rate
- age 65+ share
- migration
- non-attendance rate
- dropout rate
- labour participation
- life expectancy
These are the strongest reality anchors.
Layer 2. Derived variables
These should also be backtested where possible.
Examples:
- dependency ratio
- youth inflow index
- teacher replacement pressure
- housing stress
- maintenance load
These help reveal whether the internal transformations are reasonable.
Layer 3. Latent variables with proxy bundles
These can be backtested more cautiously.
Examples:
- legitimacy
- transfer integrity
- late-life usefulness
- repair rate
- civic continuity
These should not dominate the overall backtest verdict unless their proxies are strong enough.
The three scoring dimensions
A good backtest needs more than one score.
1. Directional accuracy
Did the model at least get the direction right?
Examples:
- fertility down
- aging up
- school stress worsening
- youth share down
- late-life load up
This is the easiest test.
2. Magnitude accuracy
Did the model get the size of the change reasonably right?
Examples:
- not just “aging increased”
- but “aging increased by roughly the right amount”
This is a stricter test.
3. Timing accuracy
Did the model shift too early, too late, or at roughly the right pace?
Examples:
- did it see school strain worsening, but only too slowly?
- did it anticipate population flattening too early?
- did it under-react to policy or economic shocks?
Timing matters a great deal in city systems.
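The three dimensions can each be reduced to a simple per-variable check. The sketch below is one possible encoding, assuming each variable's trajectory is a `{year: value}` dict; the function names and the threshold-crossing definition of timing are assumptions for illustration.

```python
# Illustrative scoring of the three dimensions for one variable.
# Trajectories are {year: value} dicts; names are assumptions, not CitySim API.

def direction(series):
    years = sorted(series)
    delta = series[years[-1]] - series[years[0]]
    return (delta > 0) - (delta < 0)   # +1 up, -1 down, 0 flat

def directional_hit(sim, obs):
    # Dimension 1: did the model at least get the direction right?
    return direction(sim) == direction(obs)

def magnitude_ratio(sim, obs):
    # Dimension 2: ratio of simulated change to observed change (1.0 = perfect).
    years = sorted(obs)
    obs_delta = obs[years[-1]] - obs[years[0]]
    sim_delta = sim[years[-1]] - sim[years[0]]
    return sim_delta / obs_delta if obs_delta else float("nan")

def timing_offset(sim, obs, threshold):
    # Dimension 3: years between simulated and observed first crossing of a
    # threshold; negative means the model shifted too early.
    def first_cross(series):
        return next((y for y in sorted(series) if series[y] >= threshold), None)
    s, o = first_cross(sim), first_cross(obs)
    return None if s is None or o is None else s - o
```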
Error measures CitySim should use
The protocol should allow a standard small set of error measures.
1. Absolute error
Simple difference between simulated and observed values.
Useful for:
- percentage-point gaps
- index gaps
- count differences
2. Relative error
Useful when the scale of the variable matters.
Example:
- being off by 2 points on a 10-point variable is not the same as being off by 2 points on a 100-point variable
3. Directional hit / miss
Did the model get the sign of the change correct?
Up / down / flat.
4. Path deviation
How much did the simulated trajectory differ from the observed path across the whole window, not just the endpoint?
This is very important for longer backtests.
5. Confidence-weighted error
Variables with weak proxy quality should contribute less strongly to the overall verdict.
A weak legitimacy bundle should not outweigh a strong fertility miss.
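The five measures above can be sketched as plain functions. This is a minimal reference shape, assuming trajectories arrive as equal-length lists and proxy quality as per-variable weights in [0, 1]; the exact weighting scheme is an assumption.

```python
# Minimal sketches of the five error measures.
# The confidence-weighting scheme is an assumption, not a declared rule.

def absolute_error(sim, obs):
    return abs(sim - obs)

def relative_error(sim, obs):
    # Scales the error by the observed value, so a 2-point miss on a
    # 10-point variable is not treated like a 2-point miss on a 100-point one.
    return abs(sim - obs) / abs(obs) if obs else float("inf")

def directional_hit(sim_delta, obs_delta):
    sign = lambda x: (x > 0) - (x < 0)
    return sign(sim_delta) == sign(obs_delta)

def path_deviation(sim_path, obs_path):
    # Mean absolute gap across the whole window, not just the endpoint.
    return sum(abs(s - o) for s, o in zip(sim_path, obs_path)) / len(obs_path)

def confidence_weighted_error(errors, proxy_quality):
    # Weak proxy bundles contribute less to the overall verdict.
    total_w = sum(proxy_quality.values())
    return sum(errors[v] * proxy_quality[v] for v in errors) / total_w
```

With this weighting, a legitimacy bundle at quality 0.1 moves the overall verdict far less than a fertility series at quality 0.9, which is exactly the asymmetry the protocol asks for.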
Backtest status classes
CitySim should classify results into clear bands.
Class A — well calibrated
- strong directional fit
- acceptable magnitude fit
- acceptable timing fit
- no major observable proxy failures
This is rare and should be earned.
Class B — directionally strong, numerically imperfect
- direction mostly right
- magnitude somewhat off
- timing usable but not precise
- suitable for scenario work, cautious forward runs
This is a respectable result.
Class C — partially useful, under-calibrated
- some important directions right
- several major variables too far off
- not good enough for confident forward claims
This is where early CitySim Tokyo currently sits.
Class D — weak fit
- multiple core variables wrong in direction or magnitude
- poor trust for forward use
- recalibration required before more city claims
Class E — not testable
- data too weak
- proxies too incomplete
- boundary mismatches too severe
- backtest not valid enough to score
This is not a failure of the idea. It is a signal that the measurement layer is not ready.
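The five bands can be encoded as a small decision function. The thresholds below are illustrative assumptions; in a real run they would come from the declared scoring bands, fixed before the backtest starts.

```python
# One possible encoding of the A-E status bands.
# All thresholds here are assumptions, declared before the run, never after.

def classify_backtest(direction_hit_rate, magnitude_ok, timing_ok,
                      core_failures, proxy_coverage_ok):
    if not proxy_coverage_ok:
        return "E"   # not testable: measurement layer not ready
    if core_failures >= 2:
        return "D"   # weak fit: multiple core variables wrong
    if direction_hit_rate >= 0.8 and magnitude_ok and timing_ok \
            and core_failures == 0:
        return "A"   # well calibrated (rare, earned)
    if direction_hit_rate >= 0.8:
        return "B"   # directionally strong, numerically imperfect
    if direction_hit_rate >= 0.5:
        return "C"   # partially useful, under-calibrated
    return "D"
```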
What counts as a valid backtest window
The backtest window must be chosen carefully.
A valid window should have:
- enough years to show real city movement
- enough usable data coverage
- stable enough variable definitions
- clear start-state observability
Typical windows
- 10 years: good for recent policy and directional checks
- 20–25 years: better for medium-run city drift
- 50 years: useful for broad civilisation patterns, but harder because of source-definition drift
So the protocol should allow several window classes rather than pretending one window fits all purposes.
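A window-class helper makes the "several window classes" idea concrete. The span boundaries below are assumptions chosen to match the typical windows listed above, not declared protocol constants.

```python
# Hypothetical window classifier mirroring the typical windows above.
# Boundary years are illustrative assumptions.

def window_class(start_year, end_year):
    span = end_year - start_year
    if span < 10:
        return None       # too short to show real city movement
    if span <= 15:
        return "short"    # ~10 years: recent policy / directional checks
    if span <= 30:
        return "medium"   # 20-25 years: medium-run city drift
    return "long"         # ~50 years: broad patterns, definition-drift risk
```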
Rules that stop the backtest from cheating
The Backtest Protocol needs hard anti-cheating rules.
Rule 1. No future leakage
Do not use end-of-window data in the start-state.
Rule 2. No mid-run retuning
Once the backtest starts, the coefficients cannot be adjusted halfway through to make the result look better.
Rule 3. No post-hoc threshold shifting
Do not tighten or loosen the success bands after seeing the error.
Rule 4. No hidden variable substitution
Do not quietly replace a weak variable with a different variable near the end of the run.
Rule 5. No silent source upgrades
If a better dataset was only available after the run began, it should not be quietly imported into the earlier model without declaration.
Rule 6. Proxy quality must affect scoring
Weak proxy variables should carry less weight in the overall verdict.
These rules matter because calibration without discipline becomes curve-fitting theatre.
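Rule 1 in particular can be enforced mechanically rather than by good intentions. The sketch below filters candidate records by publication date when freezing the start-state; the record field names are assumptions for illustration.

```python
from datetime import date

# Sketch of Rule 1 (no future leakage): the start-state may only include
# records published on or before the window start. Field names are assumptions.

def freeze_start_state(records, window_start):
    frozen, leaked = {}, []
    for r in records:
        if r["published"] <= window_start:
            frozen[r["variable"]] = r["value"]
        else:
            leaked.append(r["variable"])   # future leakage: excluded and flagged
    return frozen, leaked
```

Flagging the excluded variables, rather than silently dropping them, also supports Rule 5: any later source upgrade leaves a visible trace.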
The minimum backtest output pack
Every CitySim backtest should publish at least these outputs.
1. Start-state file
The exact variables and values used at the historical start point.
2. Backtest assumptions file
Declared coefficients, transition rules, and fallback rules active during the backtest.
3. Simulated trajectory file
What the engine produced through the window.
4. Observed comparison file
What the real city data showed.
5. Error table
Variable-by-variable error scoring.
6. Final model-status verdict
Class A, B, C, D, or E.
That should become standard for all serious city runs.
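The six-part output pack can be represented as a single container with a completeness check, so an incomplete run cannot quietly count as published. The class and field names are hypothetical; only the six outputs themselves come from the protocol.

```python
from dataclasses import dataclass

# Hypothetical container for the minimum backtest output pack;
# the six fields mirror the six required outputs listed above.

@dataclass
class BacktestOutputPack:
    start_state: dict            # 1. exact start-point variables and values
    assumptions: dict            # 2. coefficients, transition and fallback rules
    simulated_trajectory: dict   # 3. what the engine produced
    observed_comparison: dict    # 4. what the real city data showed
    error_table: dict            # 5. variable-by-variable error scoring
    model_status: str            # 6. final verdict: A, B, C, D, or E

    def is_complete(self):
        # A run missing any of the six outputs should not count as published.
        return all([self.start_state, self.assumptions,
                    self.simulated_trajectory, self.observed_comparison,
                    self.error_table,
                    self.model_status in ("A", "B", "C", "D", "E")])
```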
What the protocol should conclude after a backtest
After the backtest, CitySim should answer these questions clearly.
- Did the engine get the major directions right?
- Did it get the size of change reasonably right?
- Which variables are clearly under-calibrated?
- Which proxy bundles are too weak?
- Can the engine be used for forward scenarios yet?
- If yes, at what confidence level?
- If no, what needs recalibration first?
That is a proper backtest outcome.
Not just “the model seems sensible.”
Why this matters after Tokyo
Tokyo already taught the lesson.
The first 150-year forward run was structurally interesting, but once forced through a historical reality check, it became clear that the model was:
- directionally useful
- but not yet numerically trustworthy enough
That is not embarrassing.
That is exactly why this protocol has to exist.
The real embarrassment would be pretending that a backtest is unnecessary.
A civilisation-grade city engine must be able to say:
- here is where we matched reality
- here is where we missed
- here is how large the miss was
- here is whether the model is still fit for forward use
That is what gives the engine integrity.
Final definition
The CitySim.150Y.CF Backtest Protocol is the canonical reality-check procedure that runs the city engine from a historical starting point to a later observed state, measures directional, magnitude, and timing error, and classifies whether the model is calibrated, directionally useful, under-calibrated, or not yet fit for forward prediction.
Without it, CitySim can still produce scenarios.
But it cannot yet earn trust.
Almost-Code
```text
CITYSIM_150Y_CF_BACKTEST_PROTOCOL_V1
PURPOSE:
Test whether the declared CitySim engine can reproduce known city movement from a historical start-state to a later observed reality.
CORE_LAW:
No forward 150-year city claim may be treated as forecast-grade unless the engine has passed a declared backtest class threshold.
BACKTEST_SEQUENCE:
- choose(city, window_start, window_end)
- freeze_start_state(using_only_data_available_at_window_start)
- run_simulation_forward(using_declared_engine_only)
- compare(simulated_outputs, observed_outputs)
- score_errors(variable_by_variable)
- classify_model_status
BACKTEST_LAYERS:
L1 = strong observed variables
L2 = derived variables
L3 = latent variables with proxy bundles
SCORING_DIMENSIONS:
- directional_accuracy
- magnitude_accuracy
- timing_accuracy
ERROR_MEASURES:
- absolute_error
- relative_error
- directional_hit_or_miss
- path_deviation
- confidence_weighted_error
MODEL_STATUS_CLASSES:
A = well_calibrated
B = directionally_strong_numerically_imperfect
C = partially_useful_under_calibrated
D = weak_fit
E = not_testable
VALID_WINDOW_TYPES:
- short_window = 10_years
- medium_window = 20_to_25_years
- long_window = 50_years
ANTI_CHEATING_RULES:
- no_future_leakage
- no_mid_run_retuning
- no_post_hoc_threshold_shifting
- no_hidden_variable_substitution
- no_silent_source_upgrade
- proxy_quality_must_affect_scoring
MINIMUM_BACKTEST_OUTPUT_PACK:
- start_state_file
- assumptions_file
- simulated_trajectory_file
- observed_comparison_file
- error_table
- final_model_status_verdict
PASS_LOGIC:
IF major_directions_correct
AND magnitude_error_within_declared_band
AND timing_error_within_declared_band
AND no_core_observable_variable_fails_badly
THEN model_status = A_or_B
IF some_major_directions_correct
BUT several_core_variables_far_off
THEN model_status = C
IF multiple_core_variables_wrong_in_direction_or_magnitude
THEN model_status = D
IF proxy_coverage_too_weak
OR source_definition_breaks_make_comparison_invalid
THEN model_status = E
OUTPUT:
backtest_validity = TRUE or FALSE
forward_use_status = forecast_grade / scenario_grade / not_ready
```

