A city engine should not be trusted just because it looks intelligent.
It should be tested against reality.
That is what the CitySim.150Y.CF Backtest Protocol is for.
If the simulation says a city should have moved one way over the last 10, 20, or 50 years, and the real city moved another way, then the model has a problem. It may still be useful for thinking, but it is not yet calibrated properly. That is exactly what happened when Tokyo was pushed through the first version of the engine. The structure was interesting, but the backtest showed that the model was not yet tight enough to claim forecast-grade accuracy.
So before more forward 150-year runs are done, CitySim needs a formal backtest protocol.
One-sentence answer
The CitySim.150Y.CF Backtest Protocol is the canonical procedure for running the simulation from a known historical starting point to a later observed reality, scoring how well the model matched the real city, and deciding whether the engine is accurate, directionally useful, or still too loose for serious prediction.
That is the page that forces the model to face the world.
Why this page has to exist
A forward simulation can always sound impressive.
It can generate:
- elegant trajectories
- strong language
- neat scoreboards
- clear winners and losers
But none of that proves the engine is working.
The real test is much simpler:
If the model starts in the past and moves toward the present, how close does it get to what really happened?
That is backtesting.
Without backtesting:
- the model may be clever but unproven
- the coefficients may be arbitrary
- the variable weights may be too strong or too weak
- the direction may be right but the magnitude wrong
- the engine may confuse narrative plausibility with calibration
A long-horizon city simulator without a backtest protocol is not yet a serious engine.
What the Backtest Protocol does
The Backtest Protocol does six jobs.
1. It checks whether the engine can reproduce known history
This is the basic reality check.
If the engine cannot approximately reproduce a city’s recent past, then it should not speak too confidently about its long-run future.
2. It separates directional truth from numeric accuracy
Some models get the direction right but the size wrong.
For example:
- it correctly sees fertility decline
- it correctly sees increasing aging pressure
- it correctly sees growing education leakage
But it may still misjudge:
- how fast the decline happens
- how large the effect becomes
- which subsystem weakens first
That difference matters.
3. It identifies which variables are under-calibrated
A backtest should not only say “pass” or “fail.”
It should show:
- which variables fit reality reasonably well
- which variables drift too far
- which proxy bundles are too weak
- which transition rules are too gentle or too aggressive
That is how the engine improves.
4. It prevents forward overclaiming
If the engine has not passed its backtest, then forward scenarios should be labeled honestly:
- exploratory
- scenario-grade
- directionally useful
- not forecast-grade
This is one of the most important trust rules in the entire CitySim stack.
5. It creates a standard for comparing cities
A backtest should not be improvised differently for every city.
Tokyo, London, Seoul, Singapore, or Paris should all be backtested using the same basic grammar:
- same protocol
- same scoring logic
- same error language
- same pass/fail classes
Only then can cross-city comparisons stay honest.
6. It tells us whether recalibration is needed
The backtest is the gate before calibration.
If the model already fits reasonably well, only minor adjustment may be needed.
If the model misses badly, then the calibration layer has to reopen.
What a backtest is not
A backtest is not:
- proof that the model can predict perfectly
- proof that the future will behave like the past
- proof that every internal latent variable is correct
- proof that the city engine is finished
A backtest only says:
given what really happened in a known time window, how well did the engine reproduce that movement?
That is enough to matter a great deal.
The basic backtest structure
Every CitySim backtest should follow the same sequence.
Step 1. Pick the city and the backtest window
Example:
- Tokyo
- start year: 2010
- end year: 2025
Or:
- 1975 to 2025
- 2000 to 2025
- 2015 to 2025
The window should be long enough to reveal meaningful movement, but not so long that the source definitions become unusably inconsistent.
Step 2. Freeze the start-state using only data available at the start year
This is critical.
Do not let the model see future data from the end of the window.
The start-state must only use what would have been knowable at the chosen start date.
Otherwise the backtest is contaminated.
Step 3. Run the simulation forward using the declared engine
Now the city engine advances from the historical start-state toward the later period.
Use:
- the declared variable registry
- the declared proxy map
- the declared data adapter
- the declared transition kernel
No post-hoc edits in the middle.
Step 4. Compare simulated outputs with observed reality
At the end of the backtest window, compare the simulated values to the actual later values.
This comparison should happen variable by variable, not just at one headline verdict level.
Step 5. Score the errors
Each variable should receive:
- absolute error
- relative error where relevant
- directional accuracy
- quality-weighted trust score
A city may pass directionally while failing numerically.
That distinction must remain explicit.
Step 6. Decide the model status
The backtest then classifies the run.
For example:
- calibrated
- directionally useful
- under-calibrated
- failed
- not testable due to weak proxy coverage
This final status should control what kind of forward claims the engine is allowed to make.
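The six-step sequence above can be sketched as a single driver function. This is a minimal illustration, not a declared CitySim interface: the engine, data source, scorer, and classifier are passed in as hypothetical callables.

```python
from dataclasses import dataclass

# Hypothetical sketch of the six-step backtest sequence.
# All callables (freeze_start_state, run_engine, observe, score, classify)
# are illustrative assumptions, not real CitySim APIs.

@dataclass
class BacktestResult:
    errors: dict   # per-variable error scores
    status: str    # model status class: A / B / C / D / E

def run_backtest(city, start_year, end_year,
                 freeze_start_state, run_engine, observe, score, classify):
    # Steps 1-2: fix the window and freeze a start-state that uses
    # only data knowable at start_year (no future leakage).
    start_state = freeze_start_state(city, start_year)
    # Step 3: advance the declared engine with no mid-run edits.
    simulated = run_engine(start_state, start_year, end_year)
    # Step 4: fetch what the real city actually did over the window.
    observed = observe(city, start_year, end_year)
    # Step 5: score errors variable by variable, not one headline verdict.
    errors = {v: score(simulated[v], observed[v]) for v in observed}
    # Step 6: classify the run into a status class.
    return BacktestResult(errors=errors, status=classify(errors))
```

The point of the shape is that every argument is declared up front: nothing about the engine or the scoring can change once the run starts.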
What should be backtested
Not every variable deserves equal weight.
CitySim should backtest in layers.
Layer 1. Strong observed variables
These should always be included if data exists.
Examples:
- total population
- fertility rate
- age 65+ share
- migration
- non-attendance rate
- dropout rate
- labour participation
- life expectancy
These are the strongest reality anchors.
Layer 2. Derived variables
These should also be backtested where possible.
Examples:
- dependency ratio
- youth inflow index
- teacher replacement pressure
- housing stress
- maintenance load
These help reveal whether the internal transformations are reasonable.
Layer 3. Latent variables with proxy bundles
These can be backtested more cautiously.
Examples:
- legitimacy
- transfer integrity
- late-life usefulness
- repair rate
- civic continuity
These should not dominate the overall backtest verdict unless their proxies are strong enough.
The three scoring dimensions
A good backtest needs more than one score.
1. Directional accuracy
Did the model at least get the direction right?
Examples:
- fertility down
- aging up
- school stress worsening
- youth share down
- late-life load up
This is the easiest test.
2. Magnitude accuracy
Did the model get the size of the change reasonably right?
Examples:
- not just “aging increased”
- but “aging increased by roughly the right amount”
This is a stricter test.
3. Timing accuracy
Did the model shift too early, too late, or at roughly the right pace?
Examples:
- did it see school strain worsening, but only too slowly?
- did it anticipate population flattening too early?
- did it under-react to policy or economic shocks?
Timing matters a great deal in city systems.
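The three dimensions can each be reduced to a simple per-variable check. The sketch below is one possible encoding, assuming each variable's trajectory is a `{year: value}` dict; the function names and the threshold-crossing definition of timing are assumptions for illustration.

```python
# Illustrative scoring of the three dimensions for one variable.
# Trajectories are {year: value} dicts; names are assumptions, not CitySim API.

def direction(series):
    years = sorted(series)
    delta = series[years[-1]] - series[years[0]]
    return (delta > 0) - (delta < 0)   # +1 up, -1 down, 0 flat

def directional_hit(sim, obs):
    # Dimension 1: did the model at least get the direction right?
    return direction(sim) == direction(obs)

def magnitude_ratio(sim, obs):
    # Dimension 2: ratio of simulated change to observed change (1.0 = perfect).
    years = sorted(obs)
    obs_delta = obs[years[-1]] - obs[years[0]]
    sim_delta = sim[years[-1]] - sim[years[0]]
    return sim_delta / obs_delta if obs_delta else float("nan")

def timing_offset(sim, obs, threshold):
    # Dimension 3: years between simulated and observed first crossing of a
    # threshold; negative means the model shifted too early.
    def first_cross(series):
        return next((y for y in sorted(series) if series[y] >= threshold), None)
    s, o = first_cross(sim), first_cross(obs)
    return None if s is None or o is None else s - o
```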
Error measures CitySim should use
The protocol should allow a standard small set of error measures.
1. Absolute error
Simple difference between simulated and observed values.
Useful for:
- percentage-point gaps
- index gaps
- count differences
2. Relative error
Useful when the scale of the variable matters.
Example:
- being off by 2 points on a 10-point variable is not the same as being off by 2 points on a 100-point variable
3. Directional hit / miss
Did the model get the sign of the change correct?
Up / down / flat.
4. Path deviation
How much did the simulated trajectory differ from the observed path across the whole window, not just the endpoint?
This is very important for longer backtests.
5. Confidence-weighted error
Variables with weak proxy quality should contribute less strongly to the overall verdict.
A weak legitimacy bundle should not outweigh a strong fertility miss.
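The five measures above can be sketched as plain functions. This is a minimal reference shape, assuming trajectories arrive as equal-length lists and proxy quality as per-variable weights in [0, 1]; the exact weighting scheme is an assumption.

```python
# Minimal sketches of the five error measures.
# The confidence-weighting scheme is an assumption, not a declared rule.

def absolute_error(sim, obs):
    return abs(sim - obs)

def relative_error(sim, obs):
    # Scales the error by the observed value, so a 2-point miss on a
    # 10-point variable is not treated like a 2-point miss on a 100-point one.
    return abs(sim - obs) / abs(obs) if obs else float("inf")

def directional_hit(sim_delta, obs_delta):
    sign = lambda x: (x > 0) - (x < 0)
    return sign(sim_delta) == sign(obs_delta)

def path_deviation(sim_path, obs_path):
    # Mean absolute gap across the whole window, not just the endpoint.
    return sum(abs(s - o) for s, o in zip(sim_path, obs_path)) / len(obs_path)

def confidence_weighted_error(errors, proxy_quality):
    # Weak proxy bundles contribute less to the overall verdict.
    total_w = sum(proxy_quality.values())
    return sum(errors[v] * proxy_quality[v] for v in errors) / total_w
```

With this weighting, a legitimacy bundle at quality 0.1 moves the overall verdict far less than a fertility series at quality 0.9, which is exactly the asymmetry the protocol asks for.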
Backtest status classes
CitySim should classify results into clear bands.
Class A — well calibrated
- strong directional fit
- acceptable magnitude fit
- acceptable timing fit
- no major observable proxy failures
This is rare and should be earned.
Class B — directionally strong, numerically imperfect
- direction mostly right
- magnitude somewhat off
- timing usable but not precise
- suitable for scenario work, cautious forward runs
This is a respectable result.
Class C — partially useful, under-calibrated
- some important directions right
- several major variables too far off
- not good enough for confident forward claims
This is where early CitySim Tokyo currently sits.
Class D — weak fit
- multiple core variables wrong in direction or magnitude
- poor trust for forward use
- recalibration required before more city claims
Class E — not testable
- data too weak
- proxies too incomplete
- boundary mismatches too severe
- backtest not valid enough to score
This is not a failure of the idea. It is a signal that the measurement layer is not ready.
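The five bands can be encoded as a small decision function. The thresholds below are illustrative assumptions; in a real run they would come from the declared scoring bands, fixed before the backtest starts.

```python
# One possible encoding of the A-E status bands.
# All thresholds here are assumptions, declared before the run, never after.

def classify_backtest(direction_hit_rate, magnitude_ok, timing_ok,
                      core_failures, proxy_coverage_ok):
    if not proxy_coverage_ok:
        return "E"   # not testable: measurement layer not ready
    if core_failures >= 2:
        return "D"   # weak fit: multiple core variables wrong
    if direction_hit_rate >= 0.8 and magnitude_ok and timing_ok \
            and core_failures == 0:
        return "A"   # well calibrated (rare, earned)
    if direction_hit_rate >= 0.8:
        return "B"   # directionally strong, numerically imperfect
    if direction_hit_rate >= 0.5:
        return "C"   # partially useful, under-calibrated
    return "D"
```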
What counts as a valid backtest window
The backtest window must be chosen carefully.
A valid window should have:
- enough years to show real city movement
- enough usable data coverage
- stable enough variable definitions
- clear start-state observability
Typical windows
- 10 years: good for recent policy and directional checks
- 20–25 years: better for medium-run city drift
- 50 years: useful for broad civilisation patterns, but harder because of source-definition drift
So the protocol should allow several window classes rather than pretending one window fits all purposes.
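A window-class helper makes the "several window classes" idea concrete. The span boundaries below are assumptions chosen to match the typical windows listed above, not declared protocol constants.

```python
# Hypothetical window classifier mirroring the typical windows above.
# Boundary years are illustrative assumptions.

def window_class(start_year, end_year):
    span = end_year - start_year
    if span < 10:
        return None       # too short to show real city movement
    if span <= 15:
        return "short"    # ~10 years: recent policy / directional checks
    if span <= 30:
        return "medium"   # 20-25 years: medium-run city drift
    return "long"         # ~50 years: broad patterns, definition-drift risk
```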
Rules that stop the backtest from cheating
The Backtest Protocol needs hard anti-cheating rules.
Rule 1. No future leakage
Do not use end-of-window data in the start-state.
Rule 2. No mid-run retuning
Once the backtest starts, the coefficients cannot be adjusted halfway through to make the result look better.
Rule 3. No post-hoc threshold shifting
Do not tighten or loosen the success bands after seeing the error.
Rule 4. No hidden variable substitution
Do not quietly replace a weak variable with a different variable near the end of the run.
Rule 5. No silent source upgrades
If a better dataset was only available after the run began, it should not be quietly imported into the earlier model without declaration.
Rule 6. Proxy quality must affect scoring
Weak proxy variables should carry less weight in the overall verdict.
These rules matter because calibration without discipline becomes curve-fitting theatre.
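Rule 1 in particular can be enforced mechanically rather than by good intentions. The sketch below filters candidate records by publication date when freezing the start-state; the record field names are assumptions for illustration.

```python
from datetime import date

# Sketch of Rule 1 (no future leakage): the start-state may only include
# records published on or before the window start. Field names are assumptions.

def freeze_start_state(records, window_start):
    frozen, leaked = {}, []
    for r in records:
        if r["published"] <= window_start:
            frozen[r["variable"]] = r["value"]
        else:
            leaked.append(r["variable"])   # future leakage: excluded and flagged
    return frozen, leaked
```

Flagging the excluded variables, rather than silently dropping them, also supports Rule 5: any later source upgrade leaves a visible trace.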
The minimum backtest output pack
Every CitySim backtest should publish at least these outputs.
1. Start-state file
The exact variables and values used at the historical start point.
2. Backtest assumptions file
Declared coefficients, transition rules, and fallback rules active during the backtest.
3. Simulated trajectory file
What the engine produced through the window.
4. Observed comparison file
What the real city data showed.
5. Error table
Variable-by-variable error scoring.
6. Final model-status verdict
Class A, B, C, D, or E.
That should become standard for all serious city runs.
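The six-part output pack can be represented as a single container with a completeness check, so an incomplete run cannot quietly count as published. The class and field names are hypothetical; only the six outputs themselves come from the protocol.

```python
from dataclasses import dataclass

# Hypothetical container for the minimum backtest output pack;
# the six fields mirror the six required outputs listed above.

@dataclass
class BacktestOutputPack:
    start_state: dict            # 1. exact start-point variables and values
    assumptions: dict            # 2. coefficients, transition and fallback rules
    simulated_trajectory: dict   # 3. what the engine produced
    observed_comparison: dict    # 4. what the real city data showed
    error_table: dict            # 5. variable-by-variable error scoring
    model_status: str            # 6. final verdict: A, B, C, D, or E

    def is_complete(self):
        # A run missing any of the six outputs should not count as published.
        return all([self.start_state, self.assumptions,
                    self.simulated_trajectory, self.observed_comparison,
                    self.error_table,
                    self.model_status in ("A", "B", "C", "D", "E")])
```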
What the protocol should conclude after a backtest
After the backtest, CitySim should answer these questions clearly.
- Did the engine get the major directions right?
- Did it get the size of change reasonably right?
- Which variables are clearly under-calibrated?
- Which proxy bundles are too weak?
- Can the engine be used for forward scenarios yet?
- If yes, at what confidence level?
- If no, what needs recalibration first?
That is a proper backtest outcome.
Not just “the model seems sensible.”
Why this matters after Tokyo
Tokyo already taught the lesson.
The first 150-year forward run was structurally interesting, but once forced through a historical reality check, it became clear that the model was:
- directionally useful
- but not yet numerically trustworthy enough
That is not embarrassing.
That is exactly why this protocol has to exist.
The real embarrassment would be pretending that a backtest is unnecessary.
A civilisation-grade city engine must be able to say:
- here is where we matched reality
- here is where we missed
- here is how large the miss was
- here is whether the model is still fit for forward use
That is what gives the engine integrity.
Final definition
The CitySim.150Y.CF Backtest Protocol is the canonical reality-check procedure that runs the city engine from a historical starting point to a later observed state, measures directional, magnitude, and timing error, and classifies whether the model is calibrated, directionally useful, under-calibrated, or not yet fit for forward prediction.
Without it, CitySim can still produce scenarios.
But it cannot yet earn trust.
Almost-Code
```text
CITYSIM_150Y_CF_BACKTEST_PROTOCOL_V1
PURPOSE:
Test whether the declared CitySim engine can reproduce known city movement from a historical start-state to a later observed reality.
CORE_LAW:
No forward 150-year city claim may be treated as forecast-grade unless the engine has passed a declared backtest class threshold.
BACKTEST_SEQUENCE:
- choose(city, window_start, window_end)
- freeze_start_state(using_only_data_available_at_window_start)
- run_simulation_forward(using_declared_engine_only)
- compare(simulated_outputs, observed_outputs)
- score_errors(variable_by_variable)
- classify_model_status
BACKTEST_LAYERS:
L1 = strong observed variables
L2 = derived variables
L3 = latent variables with proxy bundles
SCORING_DIMENSIONS:
- directional_accuracy
- magnitude_accuracy
- timing_accuracy
ERROR_MEASURES:
- absolute_error
- relative_error
- directional_hit_or_miss
- path_deviation
- confidence_weighted_error
MODEL_STATUS_CLASSES:
A = well_calibrated
B = directionally_strong_numerically_imperfect
C = partially_useful_under_calibrated
D = weak_fit
E = not_testable
VALID_WINDOW_TYPES:
- short_window = 10_years
- medium_window = 20_to_25_years
- long_window = 50_years
ANTI_CHEATING_RULES:
- no_future_leakage
- no_mid_run_retuning
- no_post_hoc_threshold_shifting
- no_hidden_variable_substitution
- no_silent_source_upgrade
- proxy_quality_must_affect_scoring
MINIMUM_BACKTEST_OUTPUT_PACK:
- start_state_file
- assumptions_file
- simulated_trajectory_file
- observed_comparison_file
- error_table
- final_model_status_verdict
PASS_LOGIC:
IF major_directions_correct
AND magnitude_error_within_declared_band
AND timing_error_within_declared_band
AND no_core_observable_variable_fails_badly
THEN model_status = A_or_B
IF some_major_directions_correct
BUT several_core_variables_far_off
THEN model_status = C
IF multiple_core_variables_wrong_in_direction_or_magnitude
THEN model_status = D
IF proxy_coverage_too_weak
OR source_definition_breaks_make_comparison_invalid
THEN model_status = E
OUTPUT:
backtest_validity = TRUE or FALSE
forward_use_status = forecast_grade / scenario_grade / not_ready
```

