Power Creep, and a suggestion to measure it

Recently, Hera posted on X about power creep and how it was affecting the game.

Below is my original response on X, expanded and edited.

I think civ balance needs a benchmark that isn’t relative to the other civs.

I think they could create a test civ with very bland but broad bonuses (vills work 5% faster, all military units have 10% more HP, full tech tree), and then run, say, 100 simulated 1v1 games against every other civ.

To be clear, the test civ would never, ever change. General unit changes are fine, but otherwise no changes would be made to the test civ. If you “balance” the test civ, it’s like measuring the length of things over time while you keep changing how long a foot is.
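One way to picture the “never changes” rule is to treat the benchmark civ as frozen data. This is a minimal sketch, assuming the test civ's bonuses could be written down as a config; the class and field names are hypothetical, and the values are just the example bonuses from the post.

```python
from dataclasses import dataclass, FrozenInstanceError

# Hypothetical description of the benchmark ("yardstick") civ.
# frozen=True makes any attempt to modify a field raise FrozenInstanceError,
# mirroring the rule that the test civ is never rebalanced.
@dataclass(frozen=True)
class BenchmarkCiv:
    villager_work_rate_bonus: float = 0.05  # vills work 5% faster
    military_hp_bonus: float = 0.10         # all military units +10% HP
    full_tech_tree: bool = True

YARDSTICK = BenchmarkCiv()

def try_to_rebalance(civ: BenchmarkCiv) -> bool:
    """Return True if the civ could be modified in place (it should not be)."""
    try:
        civ.military_hp_bonus = 0.15  # type: ignore[misc]
    except FrozenInstanceError:
        return False
    return True
```

General unit changes would live elsewhere (in the shared unit data), so they don't touch this object at all.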

Now that we have an unchanging “yardstick” civ, you compare all civs to the TEST civ. Sure, it’d be based on the gameplay of the (presumably hard) AI and not actual players, but it’d give some useful insight.

Assuming the test civ is balanced (though it doesn’t have to be; let’s just assume so for simplicity), the other civs should collectively win 50% of the time.

Then, if the in-game civs’ collective win rate against the test civ improves over time, it’s probably due to power creep. Now we actually have a measure of power creep instead of a gut feeling.
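A minimal sketch of that measurement, assuming each patch's simulation results are stored as (wins, games) per civ against the benchmark civ. The names `collective_win_rate` and `power_creep_delta` are hypothetical, not part of any existing tool. Pooling wins over all games matches “the other civs should collectively win 50% of the time” when every civ plays the same number of games.

```python
from typing import Dict, Tuple

# Hypothetical format: civ name -> (wins against the benchmark civ, games played)
Results = Dict[str, Tuple[int, int]]

def collective_win_rate(results: Results) -> float:
    """Pooled win rate of all in-game civs against the benchmark civ."""
    wins = sum(w for w, _ in results.values())
    games = sum(g for _, g in results.values())
    if games == 0:
        raise ValueError("no games recorded")
    return wins / games

def power_creep_delta(old: Results, new: Results) -> float:
    """Change in pooled win rate vs. the benchmark between two patches.

    Positive means the civs, taken together, got stronger relative to the
    unchanging benchmark civ (the proposed power creep signal).
    """
    return collective_win_rate(new) - collective_win_rate(old)
```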

And since we presumably have all the changelogs available, we could go all the way back to when DE was released (or maybe to right after the Cuman nerfs) to get a benchmark. Then you can repeat the test for the current balance, and again at any interval you want.

So then you’d have a measure of power creep over time. If you’re concerned that “the AI is bad against X kind of civs, and Y new X-type civs have been added,” you can always compare just the civs that were around at both times.

For example, if you wanted to compare power creep from, say, spring 2020 to now, you would remove the LotW, DotD, DoI, RoR, and MR civs from the 2023 data set.
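Assuming per-civ results are kept as (wins, games) against the benchmark for each snapshot, restricting both snapshots to the civs they share is a simple key intersection. This is a sketch; the function name is hypothetical.

```python
from typing import Dict, Tuple

Results = Dict[str, Tuple[int, int]]  # civ -> (wins vs. benchmark, games)

def restrict_to_common_civs(old: Results, new: Results) -> Tuple[Results, Results]:
    """Keep only civs present in both data sets, so civs added later
    (e.g. from newer DLCs) can't skew the comparison."""
    common = old.keys() & new.keys()  # dict key views support set operations
    return ({c: old[c] for c in common}, {c: new[c] for c in common})
```

Both filtered dicts can then be compared directly, without worrying that newly added civs play differently against the AI.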

With all this data in hand, if, say, a year from now we have 5 civs in the gutter but the collective win rate against the test civ has risen by 5%, we know we should actually NERF the top performers first (preferably the ones that are also particularly bad matchups for those struggling civs).
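Whether a 5% rise is real or just noise depends on how many games are behind it. One standard check (not something proposed in the thread) is a two-proportion z-test. This sketch reads “5%” as 5 percentage points and assumes games are independent, which simulated AI games only approximate; the function name is hypothetical.

```python
import math
from typing import Tuple

def two_proportion_z_test(wins_a: int, games_a: int,
                          wins_b: int, games_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two win rates.

    Uses the pooled-proportion normal approximation, which is reasonable
    when each sample has plenty of wins and losses.
    """
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_b - p_a) / se
    # Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With 4,500 pooled games per patch (100 against each of 45 civs), going from 50% to 55% gives z of about 4.7, which is clearly significant. The same 5-point change within a single 100-game matchup gives z of about 0.7, which is well within noise. That supports judging power creep on the collective rate rather than on individual matchups.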

Honestly, it seems the biggest bottleneck would be the compute. Say a game takes an in-game hour (probably overstating the average, but it’s a nice round number). If you had a single PC running AI games “normally” at 2x speed, that’s 30 minutes per game. With 100 games against each of the 45 civs, that’s 4,500 games, or 135,000 minutes. Even with a PC running 24/7, that’s 94 days. If you could somehow run AoE2 without outputting graphics and speed it up, or maybe put this on a cluster of machines, this could go waaaay down. If you could run at just 10x speed on ten machines, testing a given patch for power creep goes down to about 45 hours. Honestly, without outputting graphics, 10x might be trivial. IDK.
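To make the arithmetic explicit, here is a small sketch of the estimate. The 45 civs, 100 games per matchup, 60-minute games, and the speedup and machine counts are the post's own assumptions; the function name is hypothetical. It also assumes throughput scales perfectly with speedup and machine count, which is optimistic.

```python
def patch_test_hours(civs: int, games_per_civ: int, game_minutes: float,
                     speedup: float, machines: int) -> float:
    """Wall-clock hours to simulate one patch's benchmark games."""
    total_games = civs * games_per_civ
    minutes_per_game = game_minutes / speedup
    return total_games * minutes_per_game / machines / 60

# One PC at 2x speed: 4,500 games * 30 min = 135,000 min
single_pc_days = patch_test_hours(45, 100, 60, 2, 1) / 24  # 93.75 days (~94)

# Ten machines at 10x speed: 4,500 games * 6 min / 10 machines
cluster_hours = patch_test_hours(45, 100, 60, 10, 10)      # 45.0 hours
```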

By my count, there have been about 40 updates (not including hotfixes) to DE since its release. If we assume that 45 hours is a good estimate to test an update for power creep (and honestly, earlier updates would be faster since they’d have fewer civs, but let’s ignore that), then testing all updates would take only 75 days. You really should only have to do that once, maybe again if the AI improves greatly. Any new patch would just be that 45-hour test. Not a lot when there are usually two months between updates.
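The backfill estimate is the same kind of multiplication. This sketch uses the post's figures (about 40 updates at 45 hours each); the function name is hypothetical. Because it ignores the smaller civ count of earlier patches, it is an upper bound.

```python
def backfill_days(updates: int, hours_per_update: float) -> float:
    """Days of wall-clock time to re-test every past update once."""
    return updates * hours_per_update / 24

total = backfill_days(40, 45)  # 1,800 hours = 75.0 days
```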

This is thoughtful, but not viable IMO. Simulation won’t yield any meaningful results, because the AI doesn’t know how to play to a civ’s advantages.

For example, have you seen the AI go FC conqs with Spanish? Or early MAA with BS upgrades with Bulgarians? Or 17-pop scouts with Mongols but 20-pop scouts with Persians, etc.?

This is only the opening strat, btw; we aren’t even talking about adapting. You often see the AI overreacting to an archer rush with a million skirms.

As for the solution, we don’t really need a bland civ to balance. Keeping win rates between 45% and 55% is already a good method.

I also like how the devs are buffing civs more than nerfing them, because it gives you more incentive to try civs that were boring.

That doesn’t really address the issue. The balance could be such that the worst civ wins 49.9% of its games and still wins 99% of games against a civ from the past. The two aren’t mutually exclusive.

You wouldn’t turn on 10x bonuses and then, as long as no civ is under a 45% win rate, conclude that no power creep has occurred.

Power creep and balance are two completely different things.

We aren’t talking balance of civs relative to each other at the present time, we’re talking average civ strength changing over time, from the past to the present.

There is some merit to this. No, the AI wouldn’t be able to test balance at 2500 Elo, and yes, some bonuses are easy to take advantage of passively, whereas others are not.

But again, this really isn’t relevant. This isn’t attempting to answer the question “is civ A balanced compared to civ B when both are in the hands of a 2500 player”.

What this is answering is: “Are the in-game civs, collectively, on average, right now, performing better against a benchmark test civ than they would if you reverted the civ balance changes (not unit balance or bug fixes, just civ-specific changes) and repeated the test?”

I will say, though, that in writing this response I thought of a slightly different way to run this test: have, say, the 2020 version of a civ play against the 2023 version of the same civ, and do that comparison for all 35 civs that were in the game at DE’s release. I don’t think you could average results across civs in the same way you could when comparing all of them to a benchmark, but civs rarely get massive overhauls, so the AI is likely to be equally proficient with both versions of a civ, and any change in performance can then really only come from the changed strength of the bonuses/tech tree themselves.

I’m not a statistician, but I suspect the civ-vs-benchmark test would be better at gauging the overall, game-wide average power creep delta, while the old-civ-vs-new-civ test would be better at measuring power creep at the individual civ level and using that to determine a mean delta.
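For the old-vs-new variant, here is a sketch assuming results are recorded per civ as the new version's wins over total games against its own old version. The data structure and function names are hypothetical. Each civ's delta is its distance from 50%, and the unweighted mean gives every civ equal weight, matching the “mean delta” idea above.

```python
from statistics import mean
from typing import Dict, Tuple

# Hypothetical format: civ -> (wins of the new version, games) in
# new-version-vs-old-version matches of the same civ.
HeadToHead = Dict[str, Tuple[int, int]]

def per_civ_creep(head_to_head: HeadToHead) -> Dict[str, float]:
    """How far each civ's new version beats 50% against its old version."""
    return {civ: wins / games - 0.5 for civ, (wins, games) in head_to_head.items()}

def mean_creep(head_to_head: HeadToHead) -> float:
    """Unweighted mean of the per-civ deltas (each civ counts equally)."""
    return mean(per_civ_creep(head_to_head).values())
```

A positive delta for a civ means its current version beats its past version more often than not. Averaging those deltas over the 35 launch civs gives the game-wide figure.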

Lastly, and this may sound Neil deGrasse Tyson-esque, no test any of us can conceive is going to be some reality detector. We aren’t going to program a test and have it spit out an objectively perfect, omniscient result. Having an appropriate understanding of the limitations of what any given test can really say is just as important as the results themselves. Even if there are things you know your test can’t conclusively answer, understanding what can be answered, or at least suggested, can give those making balance decisions the additional insight to make better-informed decisions.

I think the perfect use case for a test like this is when there is an underpowered civ, and you’re trying to understand if the civ itself is “objectively” underpowered, or if it’s just less powerful when compared to a roster of over-buffed civs.

Japanese are currently about the 3rd-lowest civ in 1v1 Arabia at 1900+, and the lowest of the civs that were in the game when DE was released. They’re winning 45% of the time.

If the 2023 version of Japanese wins 45% of its games against current civs, but, say, 90% of its games against the 2020 version of itself, or alternatively the 2023 version of Japanese wins 20% more games against the benchmark civ than the 2020 version does, then maybe, possibly, conceivably, the problem isn’t that Japanese need another buff to be in line with the average civ. Rather, too many other civs have already had too many buffs, and the overbuffed civs should actually be nerfed to be more in line with Japanese.

But why must the average civ strength stay the same? It really isn’t an issue if every civ is buffed by the same amount, as long as they’re balanced.

But the reason 10x can’t be turned on is that it’s not balanced. It’s multiplicative instead of additive. How about we think of the 9-vil start that’s popular in tournaments as a buff to every civ? That’s an additive buff, and a huge one, but it’s balanced.

I don’t think this is that useful as a measure. Unaddressed power creep at the civ level manifests through win rates, which is a more direct and actionable measure that something needs to change. But I don’t think power creep is inherently bad, and some level of it seems to be inevitably tied to the addition of new content. And once the power creep is addressed with new balance changes, it’s no longer a problem.

It strikes me as being similar to putting too much focus on the cumulative effects of long-term inflation in economics. Yeah, it’s interesting to see how little $1 is worth today compared to 200 years ago, but that doesn’t really tell you much about the health of the economy compared to other measures like GDP or purchasing power.

But also, as @PilgrimHYR said, I don’t trust the AI to play civ-specific builds anywhere near as optimally as would be required to get a strong measure of this. Civ win rates, while imperfect, seem to be the best and most relevant measures of civ strength.

In small amounts, sure, but in the limit power creep would be detrimental. And presently there is no way to quantify the power creep that has occurred.

I’d also like to take a moment to say that while the conventional wisdom seems to be that power creep has occurred, without a way to measure it, it’s hard to know for sure. I’m not devising this test to prove that there’s been power creep, only to measure change over time and to call it whatever it may be.

I wasn’t predicting that 10x bonuses WOULD be balanced. I was saying that if you turned it on and IF it happened to still be balanced, that wouldn’t prove power creep hadn’t occurred. You supposed that keeping all civs between a 45% and 55% win rate was sufficient to prevent power creep. My statement was only showing that assumption to be erroneous.

Frankly, IDK what you’re trying to say. If you average the win rates of all civs, it’s 50%. That tells you nothing about power creep.

By that logic, presumably, if civs got successive buffs such that on average their vills collected resources at 200x the present rate, that’d be equally as good game design as what we have at present. But the two civs in a game aren’t operating in a vacuum. They’re used in a game where civ design is intended to complement the base gameplay. Hypothetically, we could all play with the aegis cheat code, and because it’s balanced, that would be equally good game design as any other. That is the supposition you are implicitly, and presumably unknowingly, making, and I believe it to be false. I believe the game is best when the civs operate at a level that best synergizes with the design of the rest of the game, and not at some other arbitrary level.

Yes, 5% one way or another isn’t going to be noticeable, but at some point it would be. How would you propose quantifying the level at which power creep becomes noticeable, or the amount of corrective action that would then be necessary?

I’ve already explained why this is irrelevant.

What I meant is that I fundamentally don’t think measuring power creep is very important or useful. I see a lot of effort put into describing how we might measure it, but very little about why we should. The only justification I see for making this so important is a single sentence about knowing whether to buff weak civs or nerf top civs… but we can, and do, do both, and I don’t see the need to introduce another methodology to help with what is a rather simple choice. Yes, it can be an interesting bit of trivia, like the $ example I mentioned, but IMO its utility is lost when you try to separate it from win rates.

Okay, here’s one more sentence about “the why.” But I find the first part of the statement either meaningless or something I don’t care about. How a civ performs relative to the other civs, whether or not they’re perceived as “over-buffed,” is the only thing that’s relevant for balance. If anything, I think it’s kind of counterproductive to try to anchor civ design to some arbitrary point in the past that will become less relevant with every new change or addition.

Allow me to draw your attention to these additional sentences you presumably overlooked.

I ignored them on first read because they don’t seem to add much to your case and are quite exaggerated, but…

Who said Aegis would be balanced? Nullifying training time and res collection differences while preserving other aspects such as stronger military units is not necessarily balanced. I am also not suggesting that any possible resource collection rate/unit cost/whatever is “equally good as any other” because I agree that departing too far from the current collection rates would be disruptive to player experience. But the changes throughout the history of AoE2 that constitute power creep are civ specific buffs or new mechanics, not globally 200xing the collection rate of all resources. So using this even as an exaggerated example of why we must track power creep seems disingenuous.

You’re referencing something here in “the base game” that’s supposed to be meaningful and anchored, but it’s something you haven’t defined, and to my view, doesn’t exist independent of the cumulative state of the civs at any given time. Really, I think you’re trying to find a “gold standard” for how civs should be designed, but are running into the same problems that the Gold Standard itself had. Namely that trying to use an external metric to quantify the value of things of independent and self-referential value can never be objective.
You can establish reference points of course, but deciding which one of them is the goal is arbitrary. Or has the game suffered massive reverse power creep because most civs today would die to AoC Mongols?

You can tell me to read your posts again (again) in the hope that I’ll find something that surely exists in your mind, but I don’t think you’ve succeeded in expressing it on this page in a compelling and fully understandable way. We can give it some time, but nobody else seems to be getting it either. IDK, maybe at least screenshot Hera’s post or something, because he hasn’t even made a video on it AFAIK.

Seems like your analogy was implicitly accepting this as an acceptable outcome. If not then perhaps it was a bad analogy.

You are fundamentally misunderstanding what I said.

I originally stated that civs could be buffed so much through civ bonuses that they’d have, on average, 200x the collection rate they have now. Then, instead of rewriting that sentence-long definition every time I wanted to refer to it, I used “aegis” to represent the idea, as it is functionally very similar. I guess you got caught up on building times, production times, and research times, though all of those could be civ bonuses too. I’m not suggesting the devs will change global behavior so that vills collect 200x faster, units train 200x faster, buildings build 200x faster, etc. But in the limit, enough buffs could yield a roster of civs with wildly ludicrous abilities. I’m not predicting that, but there isn’t anything intrinsically preventing it.

And because nothing mechanistically prevents that extreme outcome, anything between absolutely no power creep whatsoever (theoretically even negative power creep is possible, where civs get over-nerfed and collectively grow weaker over time, but I digress) and insane, game-breaking power creep is theoretically possible. Presumably, then, somewhere between those two is a limit at which power creep becomes detrimental. You even seem to implicitly admit that at a certain point power creep would be detrimental to gameplay.

Originally, you’d stated:

But later:

You are completely missing the point. What I’m saying is that in the limit power creep is bad. Your supposition was “power creep isn’t bad.” Then I said, “Well, eventually it would be; imagine this hypothetical that makes the reason obvious.” Your response: “Well, that’s an extreme example.”

It’s like if you said “air drag doesn’t affect how things fall”, then I said “well imagine if you dropped a bowling ball and a feather, that’d make a difference” and then your response was “well that’s an extreme example.”

The point of the extreme example isn’t to present it as the likely outcome, but just to show, with an extremely clear example, that the original supposition was incorrect, at least in the limit.

So if we can both agree that there is SOME amount of power creep that would be bad for the game (and we don’t even have to agree on the amount, only that at SOME point it’d be bad), then perhaps if we can measure and quantify power creep, we can better understand that limit and avoid exceeding it.

It’s like there’s a guy lying on his back and we’re slowly placing things onto him, but we have no way to measure how heavy those things are or how heavy the collective pile of already placed objects is. I then suggest, “You know, if we could measure how heavy these things are, that could be useful information.” Then you respond, “Why? As long as all the weights are equally distributed on his body, it’s fine. If the weight is balanced, he won’t be in discomfort. See, we have this balance scale over here. We can compare the weights of the objects to each other so we know how to distribute the weight around.” I then say, “Well, at a certain point that wouldn’t be true. Imagine if we laid an elephant on him. I think he’d find that uncomfortable even if the weight was reasonably well distributed.” Then you say, “Well, that’s an extreme example.” I respond, “Well, yeah, but somewhere between no weights and the weight of an elephant is the limit of what we can put on this guy. All I’m proposing is that we have a better way of measuring the mass of the weights than hefty vibes.”

That is what this conversation has been like.

If there were no civs, you’d basically be playing the game in full tech tree mode with no UUs. That’s the base gameplay shared by every civilization; removals from the tech tree, bonuses, UUs, and UTs then deviate from that globally shared gameplay. Sure, playing without bonuses, UUs, or UTs and with a full tech tree would be bland, but that’s why we have unique and well-designed civs. I acknowledge that the in-game civs are, and probably should be, stronger than a 100% generic civ. It’s as if a base civ had a power level of 10, and after we remove stuff from the tech tree and add bonuses, we actually come out with an 11 or 12. These numbers are arbitrary; I’m just trying to give an example. It seems that both the base level and the civ deviation were designed that way for a reason, and if we keep pumping up that level, eventually it’ll be detrimental. IDK if that would be 13, 15, or 420, but there definitely is a point.

I do think it would tell us something useful if we had a thorough comparison of all civs with a ‘vanilla’ civ.
It would make it easier to distinguish between a civ’s power level being too high/low and civ matchups being unbalanced, which are different problems that require different solutions.
And thoroughly comparing all civs with a vanilla civ is computationally doable, if they develop good enough AI. A pairwise comparison of all civs would require 40x as much computation time, which probably isn’t doable.
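A rough matchup count shows where a multiplier like that comes from. The assumptions here are for illustration only: 45 civs and the same number of games per matchup. The vanilla test needs one matchup per civ, while a full pairwise table needs n(n-1)/2 distinct pairs, or n² if every ordered pairing including mirrors is run as its own batch. For 45 civs that's 22x or 45x the vanilla test, so the 40x figure corresponds to the second way of counting. The function names are hypothetical.

```python
def matchups_vs_vanilla(n_civs: int) -> int:
    """Each civ plays one batch of games against the vanilla civ."""
    return n_civs

def matchups_pairwise(n_civs: int, ordered: bool = False) -> int:
    """Distinct civ pairs; ordered=True counts every civ-vs-civ batch,
    including mirrors, separately (n * n)."""
    return n_civs * n_civs if ordered else n_civs * (n_civs - 1) // 2

ratio_distinct = matchups_pairwise(45) / matchups_vs_vanilla(45)              # 22.0
ratio_ordered = matchups_pairwise(45, ordered=True) / matchups_vs_vanilla(45)  # 45.0
```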

Also, the increased default max pop is a form of power creep that would not be captured in this analysis.

Also also, too much power creep changes the game in a way I wouldn’t like. It forces civs to lean on their bonuses. I’d also be curious to know to what extent this has been happening.

I’m disappointed in myself that I’ve been thinking about this for days and didn’t come up with the term “vanilla civ” myself lol.

That is exactly what I meant by the “benchmark” civ. Soooooooooooooo much better of a term. TY TY TY LOL. Yes, test against a vanilla civ.

As an aside, if the vanilla civ is so weak that everything curb-stomps it, it won’t be a useful test. It’d be like shooting a handgun and a sniper rifle at cardboard to gauge their relative strength. Ideally the civ would be 100% vanilla, but if it’s waaaay too weak to be useful (say, always under a ~25% win rate, if I had to guess), then I think you’d have to give it some very bland, generic bonus.

Yep, that’s really the entire point. ATM we have no way of measuring it. It’s even possible there’s way less power creep than we think and it’s mostly perceived. Theoretically, it’s even possible civs are getting weaker over time. I suspect they’re slowly getting stronger, but ATM none of us can really know for sure.

Vikings and Franks are still #1 and #2 on Arabia.

And btw Franks have been nerfed like 3 times already, actively nerfed. Not powercrept.

What are we talking about?

Franks are one of 45 different civs. Franks were trash in AoC, triple-S tier after the HD buffs, and have been getting nerfed for a while.

It is well within the realm of possibility that the average strength of civs is rising, even if franks are still #1 and have been getting nerfs.

I support adjusting the power levels of the civs continuously. And I think that in general there should be a slightly bigger emphasis on buffing civs from the lower tiers, so there is an encouragement to try them.

FE does a good job of holding that balance.

For me, power creep would be if the newly ADDED civs completely dominated the meta, but they don’t.

For me, it’s an important distinction between a continuous power level adjustment and power creep, because power creep would mean that the already “bad” civs would also fall further down. Instead, the devs in general do quite a good job of trying to help out civs that underperform.

And I don’t like the word power creep being used as a catchphrase to try to maintain a power imbalance. Not allowing underperforming civs to be buffed is a clever way to try to keep the top dogs on top forever.

Ofc devs should both nerf and buff, but for me it looks like they found a reasonable balance for that.

These types of topics need analysis by serious statisticians.

AoE2 Institute of Statistics is the need of this age.

SOTL should be the Dean of AoE2 Institute of Statistics.

BTW, there has been a general power creep going on, but this topic isn’t really about that.

It’s less obvious things, like a small increase in resource efficiency from various optimization patches. It even went so far that, after one of the patches, Hand Cart gave not only about 5% faster food collection from farms (which was actually needed, since without that boost Hand Cart wasn’t giving much anymore, because even Wheelbarrow already came close to the hard maximum of 24 food/min), but also about 4% MORE food from those farms.
It should be quite easy to find out which patch that was, as it immediately revived the infamous Vikings fimp arb play on the ladder.

IDK if the devs have already addressed that; I’m still waiting for it to be mentioned in a patch note…

But back to the topic: I don’t think this is what is meant here, even though this is the true and legit power creep. It’s one of the main reasons we now have these super-fast, optimized build orders; they wouldn’t be possible without it.

Yep, not viable, and the problem starts WAY before the AI issue. They can’t even test the pathing in a patch they released claiming it was “fixed” or “improved,” when all they needed to do was play 1 game.