Recently Hera posted on X about power creep and how it was affecting the game.
Below is my original response on X, expanded and edited.
I think civ balance needs a benchmark that isn’t relative to the other civs.
I think if they created a test civ with very bland but broad bonuses (vills work 5% faster, all military units have 10% more HP, full tech tree), then you could run, say, 100 simulated 1v1 games against every other civ.
And to be clear, the test civ would never ever change. General unit changes that apply to every civ are fine, but otherwise no changes would be made to the test civ. If you “balance” the test civ, it’s like measuring the lengths of things over time while you keep changing how long a foot is.
So now that we have an unchanging “yardstick” civ, you compare all civs to the test civ. Sure, it’d be based on the gameplay of the AI (presumably on hard) and not actual players, but it’d give some useful insight.
Assuming the test civ is balanced (though it doesn’t have to be, but for simplicity let’s just assume), the other civs should collectively win 50% of the time.
Then if, as a whole, the in-game civs’ win rate against the test civ improves over time, it’s probably due to power creep. Now we actually have a measure of power creep instead of a gut feeling.
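To make that measurement concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the civ names are just examples, the win counts are made up, and `collective_win_rate` and `standard_error` are hypothetical helpers, not anything the game or its tools provide.

```python
from math import sqrt

GAMES_PER_CIV = 100  # simulated 1v1 games each civ plays against the test civ


def collective_win_rate(wins_vs_test):
    """Share of all games the in-game civs won against the test civ."""
    total_games = GAMES_PER_CIV * len(wins_vs_test)
    return sum(wins_vs_test.values()) / total_games


def standard_error(rate, total_games):
    """Rough binomial noise on a win rate, treating games as independent."""
    return sqrt(rate * (1 - rate) / total_games)


# Made-up win counts (out of 100) for two patches; placeholders, not real data.
old_patch = {"Britons": 48, "Franks": 52, "Mayans": 50}
new_patch = {"Britons": 53, "Franks": 55, "Mayans": 54}

old_rate = collective_win_rate(old_patch)
new_rate = collective_win_rate(new_patch)
creep = new_rate - old_rate  # positive means the civs got stronger overall

noise = standard_error(new_rate, GAMES_PER_CIV * len(new_patch))
print(f"old: {old_rate:.1%}, new: {new_rate:.1%}, creep: {creep:+.1%}")
print(f"noise on new rate: +/- {noise:.1%}")
```

The noise figure is there as a sanity check: a rise in the collective win rate only means something if it is clearly bigger than the run-to-run noise, and that noise shrinks as more games go into the sample.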
And since we presumably have all the changelogs available, we could go all the way back to when DE was released (or maybe to right after the Cuman nerfs) to get a baseline benchmark. Then you can repeat for current balance, and again at any interval you want.
So then you’d have a measure of power creep over time. If you’re concerned that “well, the AI is bad against x kind of civs, and y number of new x civs have been added”, then you can always compare just the civs that were around at both times.
So for example, if you wanted to compare power creep from, say, spring of 2020 to now, you would remove the LotW, DotD, DoI, RoR, and MR civs from the 2023 data set.
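That filtering step could look something like the sketch below. The function name `common_civ_win_rates` and all the win counts are hypothetical; the only idea taken from the text above is to keep just the civs that exist in both data sets before comparing.

```python
def common_civ_win_rates(wins_then, wins_now, games_per_civ=100):
    """Collective win rates at both times, limited to civs present in both."""
    shared = sorted(set(wins_then) & set(wins_now))
    if not shared:
        raise ValueError("the two data sets have no civs in common")
    total_games = games_per_civ * len(shared)
    rate_then = sum(wins_then[civ] for civ in shared) / total_games
    rate_now = sum(wins_now[civ] for civ in shared) / total_games
    return shared, rate_then, rate_now


# Made-up wins out of 100 against the test civ; placeholders, not real data.
spring_2020 = {"Britons": 49, "Franks": 51, "Cumans": 50}
today = {"Britons": 52, "Franks": 54, "Cumans": 50, "Poles": 60, "Romans": 58}

shared, rate_then, rate_now = common_civ_win_rates(spring_2020, today)
print(shared)  # the DLC civs (Poles, Romans) are dropped from the comparison
print(f"{rate_then:.1%} -> {rate_now:.1%}")
```

Using a set intersection means you never have to maintain a list of which DLC added which civ; anything missing from the older data set simply falls out.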
With all this data in hand, if, say, a year from now we have 5 civs in the gutter, but the collective win rate against the test civ has risen by 5%, we know we should actually NERF the top performers first (preferably the ones that are also particularly bad matchups for those bad civs).
Honestly, it seems the biggest bottleneck would be the compute. Say a game takes an in-game hour (probably overstating the average, but it’s a nice round number). If you had a single PC running AI games “normally” at 2x speed, that’s 30 minutes per game. With 45 civs at 100 games each, that’s 4,500 games, or 135,000 minutes. Even if you had a PC running 24x7, that’s 94 days. If you could somehow run AoE2 without outputting graphics, speed it up, and maybe put this on a cluster of machines, then this could go waaaay down. If you could run at just 10x speed on ten machines, then testing a given patch for power creep goes down to about 45 hours. Honestly, without outputting graphics, 10x might be trivial. IDK.
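Here is that back-of-the-envelope math as a small Python sketch, so you can plug in your own numbers. The 45-civ count is my inference from the 135,000-minute figure (135,000 / 30 minutes / 100 games), and `wall_clock_hours` is just a hypothetical helper name.

```python
def wall_clock_hours(civs, games_per_civ, game_minutes, speed, machines):
    """Real-time hours needed to simulate every civ against the test civ."""
    total_games = civs * games_per_civ
    # Speeding up the game and adding machines both divide the real time.
    real_minutes = total_games * game_minutes / speed / machines
    return real_minutes / 60


# Assumed inputs: 45 civs, 100 games each, one in-game hour per game.
single_pc = wall_clock_hours(45, 100, 60, speed=2, machines=1)
cluster = wall_clock_hours(45, 100, 60, speed=10, machines=10)

print(f"single PC at 2x: {single_pc / 24:.1f} days")
print(f"ten machines at 10x: {cluster:.0f} hours")
```

This assumes games split evenly across machines and ignores any setup time between games, so treat the result as a floor and not a promise.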
By my count there have been about 40 updates (not including hotfixes) to DE since its release. If we assume that 45 hours is a good estimate to test an update for power creep (and honestly earlier updates would be faster since they’d have fewer civs, but let’s ignore that), then testing all updates would take only 75 days. You really should only have to do that once, maybe again if the AI improves greatly. Any new patch would just be that 45-hour test. Not a lot when there are usually two months between updates.