A lengthy, breathless explanation (with examples) as to why proposing balance changes based off winrate is a terrible idea

WoeIsToWho · August 28, 2021, 8:06pm

I’ve shrugged off my duty of telling people to knock it off for long enough. There’s an overabundance of reasons why trying to evaluate civilization strength via winrate is a bad idea, even assuming you take all the relevant variables into account to accurately assess said winrate. A lot of things play into civilization winrate that, if taken into account, will (in the eyes of the impartial) convince you that the balance is actually very close to ideal and that there’s not a lot that needs your immediate balance thread.

Failing to properly take these things into account will, in all likelihood, ensure that most people ignore your thread, because it’s not worth explaining to you all that’s about to follow here. And if your changes were actually to survive the wringer, they would be an absolute nightmare for the game as a whole. Here goes.

1. Map Selection, Restrictions, and the Pool

One absolutely massive factor (and definitely the most prevalent) towards the winrate of every civ is what maps are available, and more importantly, are picked. Certain civs will simply become the meta option on certain, less typical maps, and that can oftentimes create a lingering impact on both playrate and winrate.

Further, having a map or map type be less picked or less readily available means that such civilizations that thrive on those settings simply will not play as often, or win as often, since they’ll be playing outside of their element.

Example #1 - Franks and Portuguese

Franks - 55.29 winrate (1650+)
Portuguese - 44.54 winrate (1650+)

The stats indicate that the Franks, who have strong knights, a good early economy, and a powerful lategame push with cheaper castles securing map and giving them access to their extremely versatile Throwing Axemen, have too much and should be nerfed, whereas the Portuguese, who lack a strong early economy bonus and any real midgame powerspikes, are in need of a buff.

The stats, as always, tell us actually nothing besides how often they win, which is up to a lot of things. The first thing on the list is map selection and map availability. Arabia (an open land map with no water to control) is always available and one of the most popular maps, bar none. What’s important about Arabia is that it actually accounts for 75% of the civilization win/loss records.

So if a civ is particularly good on Arabia, it’ll tend to do well in overall win-loss. If a civ is terrible on Arabia… well, you have the Portuguese, who are, at 1650+, the worst civ on Arabia with a 40% winrate. Mind you, that’s 75% of all their records, and they’re only sitting at 44% winrate with that in mind. Remove the Arabia results from the account, and their low winrate becomes much more respectable. Likewise for Franks, who have an exceptionally good (54.5%) winrate on Arabia: their overall winrate will be heavily skewed by these results and suggest that they’re a powerhouse, while in reality, get Franks on a water map and you’ll find they lack Bracer and Shipwright, so you’d better win in Castle Age. Map changes everything.
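To put a rough number on "much more respectable", here is a back-of-the-envelope sketch. It assumes the 75% Arabia share applies to the Portuguese sample specifically, and that the overall winrate is a plain games-weighted blend of Arabia and everything else. Both are assumptions on top of the figures quoted above, and `implied_other_winrate` is just an illustrative name.

```python
def implied_other_winrate(overall, arabia, arabia_share):
    """Solve overall = share * arabia + (1 - share) * other for `other`.

    All winrates are in percent; `arabia_share` is a fraction between 0 and 1.
    """
    return (overall - arabia_share * arabia) / (1.0 - arabia_share)


# Portuguese figures quoted in the post (1650+), with the assumed 75% share.
portuguese_other = implied_other_winrate(overall=44.54, arabia=40.0, arabia_share=0.75)
print(f"Implied Portuguese winrate off Arabia: {portuguese_other:.2f}%")
```

Under those assumptions the non-Arabia winrate comes out at roughly 58%. Treat it as an indication only, since the quoted overall and per-map figures may not come from exactly the same sample.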

Now, some here argue that Arabia should be the map we balance around, and therefore a low % rate overall, heavily influenced by Arabia, is still telling of a civ’s overall potential. Whether or not you agree with that premise, maps influence winrate, and unless we want to make every civilization more similar to ensure they all play nice on the most popular map, it’s bunk.

2. Popularity, Meta, and Adaptation

Some civs see more play than others. Some civs have very distinct bonuses that you need to utilize properly to benefit from them. Truth be told, outside of a very select few civilizations, most dark ages, and even feudal ages to a degree, play out the exact same way. Those few civilizations that are clearly different in their early execution are likely to throw off more general players, especially those that play many different civs often, or random often.

Popularity plays a big part in that. The more played a civ is, the more likely someone is to be aware of, and be able to adapt to, the little quirks inherent in the civ that they are playing. A civ that sees a very low amount of play and also has a very different start is going to suffer as a result of having a lesser amount of adaptation time committed towards playing to their civilization strengths and making use of the bonuses they are provided.

Example #2 - Malay and Mayans

Both of these civilizations share a commonality, an atypical start. Mayans start with an extra villager, and thus need to research Loom immediately or have idle time with queueing villagers when they start out, whereas Malay ascends through the ages faster, meaning less time to accumulate resources, but as a tradeoff, a higher villager count. Both of these will surprise an unprepared player, but the Mayans have two major things going for them. Firstly, this happens right at the beginning of the game, so it’s something that you’re likely to correct immediately even if you didn’t catch it on the loading screen, but further, the Mayans are a far more popular civilization. Mayans get played almost three times as much as the Malay, and as such the Malay start is less practiced, and more detrimental to not account for.

Given that the Malay have no early economy bonus besides this, failing to utilize it properly is not only detrimental, it gives away the only bonus the Malay have over a standard civ. Combine that with the fact that the Malay are not generally a very strong civ on open maps (the Arabia thing from point one), while the Mayans are literally the best civ on Arabia, bar none, and it’s easy to see how their average winrate can fall so low. That doesn’t make the Malay a weak civilization. It can simply mean the average Malay player fails to utilize their bonus efficiently, and evaluating the civ’s winrate without accounting for such a variable leaves said evaluation lacking.

3. Team Winrate VS. Solo Winrate VS. Non-Standard Winrate

Do you count all these winrates equally, or do you weigh them based on what you play (or what you think the majority of people play)? And what constitutes “too good” in any of these categories? If a civ is 57% in Empire Wars but 48% in Random Map and 49% in Team games, do we nerf the 57% Empire Wars stat at the cost of what are already (possibly) lackluster stats in the other categories?

For the record, at 1650+, every single civilization falls between 46% winrate and 53% winrate in Team Random Map. Do you want to distort this balance for the sake of improving one civilization in RM, or nerfing a civ in RM that pushes too high in your estimation of winrate?

4. Draft Viability Vs. Matchmaking Viability.

It’s in our nature to assume that what does well in overall games must be a strong pick in drafts. Realistically speaking, drafting of civs is all about matchups, and knowing what a civ is good at both doing - and dealing with. That’s why Vietnamese gets drafted a ton in tournaments even though they have a very poor winrate in RM. They’re used as a counterpick in drafts to deal with certain problem matchups, which is a thing for a lot of the civilizations you might expect would be bad, generally. Portuguese oftentimes gets used as anti-Viking on water (with dubious results) but also for their Feitoria + survive strategy which has taken plenty of games at the top of the top tier.

Buffing a civilization with a losing RM record but a strong drafting record might make sense from a casual perspective, but it makes a much bigger difference at the top level where bonuses are most efficiently capitalized on. I generally do not think the Vietnamese are very good, but when I see them in 8/10 drafts and played in 7/10 drafts, I temper my own distaste for the civilization with the expectation that they are probably utilizing those bonuses and timings better than I am, and that the weaknesses I feel are not so exploitable when in more capable hands.

Winrates tell us very little about the competitive viability of civilizations, and therefore buffing them or nerfing them purely off this metric is risking a lot to trust a number.

5. Some other factors to worry about

Small sample size

Most civilizations barely have a thousand games, grand total, under their belt. That sounds like a lot, until you realize there are 39 civilizations to play and you’re more likely to play against one of the more popular civs, so many civilization matchups may very well have ten or fewer games between them on any map at all. On average, every civilization will have played every other civilization ~25 times, and that average doesn’t take into account the extremely low or extremely high play rates of some civilizations, which, as I’ve mentioned before, can distort civilization strength by itself.
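For a sense of how little ~25 games tells you, here is a minimal sketch of the margin of error on a winrate. It uses the textbook normal approximation for a proportion, which is crude at very small sample sizes, and assumes games are independent. The function name is made up for illustration.

```python
import math


def margin_of_error(games, winrate=0.5, z=1.96):
    """Half-width of an approximate 95% interval for a winrate.

    Normal approximation to the binomial; rough for very small `games`.
    """
    return z * math.sqrt(winrate * (1.0 - winrate) / games)


for games in (10, 25, 1000):
    moe = margin_of_error(games)
    print(f"{games:>5} games: 50% +/- {100 * moe:.1f} points")
```

At 25 games the margin is close to 20 points either way, so a matchup sitting at 40% or 60% is still compatible with a coin flip. Even at 1000 games the margin is around 3 points, which is about the size of many of the differences people argue over.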

Long adaptation time to new changes and development of new strategies.

It took months for people to figure out and implement the early upgrades Malay build for Arena that has solidified them at the top of the clown tier list. It took years for players to invent and perfect the Incan trush. It practically didn’t exist in a meaningful way until DE, and they had been around, with the villager blacksmith upgrades, for ~ 7 years. When a balance change occurs, and you wait a month, not only have the percentages not settled, I’d like to remind you that sometimes, the full impact of balance changes and releases take a long time to feel out.

Noise in the data.

Some things simply won’t be accounted for when you look at something like win/loss. For example, do you know what happens when a 1650+ plays someone who is <1650 rated and wins? Loses? Do you count a win above 1650, a loss, or both? Do you count both equally?

Here’s an example: a spreadsheet of win-loss data showing how much of a difference it makes how you answer that question. What happens here is that the spreadsheet sees a win for a player above a certain rating, but the person who lost (people with higher rating tend to win more often, surprise and shock ensues) wasn’t at that rating, so the loss was ignored and the net winrate of the entire sample went up, thus inflating the results. If you consider a higher ranked player to be more likely to pick a meta civ (which is a stretch, but a reasonable one, I think), one may easily infer that this is a small variable in the sample.

So, as a summation of what I cared to explore as important variables, Win% doesn’t account for:
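The inflation effect can be reproduced with a toy example. The four games below are invented purely for illustration (they are not real ladder data), and both function names are hypothetical. The first function counts every record belonging to a 1650+ player, the second only keeps games where both players are in the bracket.

```python
THRESHOLD = 1650

# (winner_rating, loser_rating): hypothetical games, not real ladder data.
games = [
    (1700, 1600),  # 1650+ player beats a lower-rated player
    (1680, 1620),  # same again
    (1660, 1690),  # both players are 1650+
    (1640, 1670),  # lower-rated player upsets a 1650+ player
]


def bracket_winrate(games, threshold):
    """Winrate over all player-records at or above `threshold`."""
    wins = sum(1 for winner, _ in games if winner >= threshold)
    losses = sum(1 for _, loser in games if loser >= threshold)
    return wins / (wins + losses)


def both_in_bracket_winrate(games, threshold):
    """Same count, restricted to games where both players are in the bracket."""
    kept = [game for game in games if min(game) >= threshold]
    return bracket_winrate(kept, threshold)


print(bracket_winrate(games, THRESHOLD))          # counts wins over sub-1650 players
print(both_in_bracket_winrate(games, THRESHOLD))  # only 1650+ vs 1650+ games
```

With these made-up games the first count gives 60% and the second 50%. Every game has exactly one winner and one loser, so the only way a bracket as a whole ends up above 50% is by dropping losses that sit outside it.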

Map selection and the available maps
Civilization popularity and familiarity
Various modes and the other metrics from which to judge a civilization’s success
Competitive viability as a drafting tool, not just as a generalist civilization
A list of various other things that cast serious doubt on the usefulness and accuracy of the metric.

Conclusion

Please stop doing this.

Signed cordially, Everyone.

CloudAct · August 28, 2021, 8:11pm

I agree. Civs can be overpowered in Castle Age and trash in Imperial Age at the same time. Win-rates don’t indicate that.
Often Team Games are also outright ignored in balance discussions which is a shame.

casusincorrabil · August 28, 2021, 8:22pm

The best example is Saracens. They seem to suck, but people who learn the market abuse can make them work like crazy.
So if Saracens are buffed, the market abuse will need to be nerfed in exchange, otherwise they will be OP in the right hands.

But overall winrate can be an indicator that a civ probably needs a buff or a nerf. Maybe the best indicator we have, but it needs to be used in a sophisticated way.

For example, I have been looking into Portuguese for months and haven’t really figured out how they can be tweaked. Same for Vietnamese. Some civs are just so different that you can’t really say what holds them back on the ladder, because players who “know” them can play them with great success.

And until I have figured out how to make them work, how can I argue for certain tweaks?

ReanuKeeves00 · August 28, 2021, 8:32pm

One thing I’d like to add:

Some balance change suggestions seem to be a result of feelings only. “XYZ annoys me, so it must be OP”. I think those are the worst, because they actually sometimes end up influencing the devs’ balancing choices.

Lepigozzus · August 28, 2021, 9:06pm

So what should we use instead?
A vote poll?

FurtherLime7936 · August 28, 2021, 11:18pm

Because team games aren’t everything.
Indians were OP in TGs, but since the nerf they have become quite bad in 1v1. Lots of people love to say they are great, but only in a few matches; others are close to being lost, especially vs mesos.
Also, try to talk about Franks and how OP they are in TGs, and people cry to see the civ nerfed.
Also, Cysion himself said they check winrates to help with balance.

casusincorrabil · August 29, 2021, 12:01am

Franks are a strong TG civ. But here we see they don’t have the highest TG winrates, even though I think they might be one of the best TG pocket civs. They are the most classic for sure, but there are some quite good rivals in Teutons, Lith, Burgs, Persians, Huns and Magyars. And Indians and Berbers traditionally add a “not classic” dimension to this.
The difference in archer civs is also not to be underestimated. Britons, Mayans, Vikings and Chinese are perfect examples of extremely strong archer/flank civs that can also be called “OP”.

The whole TG meta is just so extremely stale that all the civs with clear bonuses and good eco featuring the power unit lines are just dominating there.

And the best way to deal with this would be to try to give the “bad” TG civs certain civ bonuses which somewhat break the meta. Like making the Byzantine trash discount apply to all the team members.

Not nerfing the civs which thrive on the stale TG meta. Try to make other strategies viable in TG and the OP-ness of these civs will naturally shrink.

The reason why these civs dominate in TG is that they are the best in the stale TG meta at filling the 2 roles we have there. Nerfing the civs because of that would be a terrible decision, because the goal should be to break the stale TG meta to make TG more interesting.

I mean, we have different team bonuses; let’s use them to balance TG. Let’s use them to make TG more interesting by playing with the civ bonuses.

MatCauthon3 · August 29, 2021, 12:16am

ahem.

yes they do have the top TG winrate.

except that wouldn’t do much good because archers would just pick off the spears and then cavalry would swoop in and wipe out the skirms.

CloudAct · August 29, 2021, 12:17am

True but 1v1 isn’t all either. I agree with what you are saying though.

casusincorrabil · August 29, 2021, 12:25am

https://aoestats.io/civ/Franks/RM_TEAM/1650+

Must be new in this patch. As far as I remember, Franks never had the number one spot there before.

And I think what you cite is the 1v1 winrate, not TG.

Maybe, but maybe there are some interesting combinations, like with Malians, Lith, Sicilians, Goths (halb for 17F / 12W) or Bohemians. Idk if it’s enough to challenge knight/archer. But we’ll never know if we don’t try it.
When I proposed this in another thread, someone immediately claimed it would be OP.

MatCauthon3 · August 29, 2021, 12:41am

Nope. Was team games.

casusincorrabil · August 29, 2021, 12:44am

Nope, it is 1v1 + TG combined

BTW I threw 92 Vietnamese halbs + 92 Vietnamese imp skirms against 30 Magyar Paladins + 60 Magyar Arbs in aoe-combatsim.com.
With medium and perfect hit and run these are fairly evenly matched.
So it can possibly work with a Vietnamese ally, also considering that you don’t need to set up trade for that.

Harooooo1 · August 29, 2021, 1:55am

TG stats on the aoestats website are just garbage. The highest bracket being 1650+ is too low given how heavily inflated TG ratings are. There needs to be some sort of 2k+ / 2300+ option.

FurtherLime7936 · August 29, 2021, 2:10am

2300? lol it should be 4300

WoeIsToWho · August 29, 2021, 4:03am

An argument that identifies something specific, hopefully based in experience, that you either have a suggestion to solve, or want to see solved.

Dagorad62 · August 29, 2021, 4:44am

This is good stuff. It covers a lot of the issues with identifying both whether and how a civ should change.

To add to this, what you can do is use causal inference to isolate (as best as possible) the causal effect of civ on win rates. This can tell you who might need to be buffed or nerfed, but not how you can do it. The “might” comes from players not really being too keen on exploration and experimentation, which leads to the long lag times WoeIsToWho talked about in terms of strategy adaptation. So you might not be able to tell if players just don’t want to learn a civ or if the civ actually sucks.

The reason for being unable to tell how to change civs is there’s not enough heterogeneity in strategy use across the player base to test the causal effect of strategy choice on civ strength. The meta dominates strategy choices across all civs and unless you literally start paying 2k+ players to start exploring strategies to generate the right data you’re going to be left with questions data alone can’t answer.

This is where you start having to use counterfactuals, models, and psychology/behavioral econ to estimate things that fundamentally the data can’t show you. Things like “are players actually playing optimally?” and “are players just exceedingly good at strategy X which causes them to excel at civs which can execute X?”

Most of the models people make up on these forums to use as evidence are crap. They are usually too narrowly tailored or bake in static assumptions which should be dynamic or even exogenous. A good model is generally applicable, explicit about parameters, and weighs things based on well defined and estimable functions (like pdfs or discount functions).

Examples of suboptimal play that could (I’d say are likely to) exist and which create an absolute nightmare for data interpretation:

For example Portuguese and now Malians should be the least averse to a double gold army. We can know this because effectively being able to make 25-30% more gold units before running out of gold provides a huge safety margin for double gold that isn’t available to most civs. So we can prove mathematically the risk of them going double gold is significantly reduced relative to other civs. Yet if you watch people play they do not properly adjust their heuristics about double gold units when playing these civs. It’s not about always using it, it’s about the marginal shift toward double gold not being there.
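The arithmetic behind a figure like "25-30% more gold units" can be sketched as follows. This assumes the bonus is a flat gold discount on units (20% is used as the example value, which is an assumption rather than a number stated in this post) and that the gold supply is fixed. `extra_gold_units` is an illustrative name only.

```python
def extra_gold_units(gold_discount):
    """Fractional increase in units affordable from a fixed gold supply.

    `gold_discount` is the fraction of the gold cost removed, e.g. 0.20.
    """
    return 1.0 / (1.0 - gold_discount) - 1.0


# Assumed example value: a flat 20% gold discount.
print(f"{extra_gold_units(0.20):.0%} more units from the same gold")
```

A 20% discount stretches the same gold over 25% more units, because each unit costs 0.8 of its normal gold. A bonus that instead makes the gold itself last longer scales the supply directly, so the extra unit count simply equals that percentage.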

Another example is the Saracen market. Again we can prove under reasonable assumptions that the optimal use of Saracen market is (on average) to dramatically reduce feudal farm usage, ideally to nothing. Yet when you watch even high ranked players they rarely take advantage of this, usually waiting until significantly later in the game to use the market by which point a significant chunk of the bonus is lost due to time preference.

Another example is unit mixing. Mixing of UU and common units should be extremely common for certain units like mamelukes and camels due to the interaction these units have with each other. You can indeed document that the benefits of using both together outweigh using either one alone in a wide variety of situations. One can also be relatively confident that in a significant fraction of commonly encountered situations mixing units will pay back the extra resource costs in terms of army efficiency/efficacy. You would expect this to occur often enough to notice with some civs.

Another example is siege aversion. Many of the best civs are very weak to siege for various reasons: Mayans, Chinese, and Vikings. These matchups should show significantly more siege usage than other matchups, yet for the most part they don’t. Again it’s not about always using siege. It’s about the lack of a noticeable bump in usage against these civs, which under optimal mixed strategy choice would almost certainly be present.

Another example is stone walls. Palisade walls tend to be built on ~20-25 villager ecos and cost half as much as stone walls in total eco time. Yet when players have 100 villagers many times they will not have stone walled important fractions of their base. This is clearly inconsistent behavior. At 100 villagers the opportunity cost to walling is basically unchanged, the eco is 4x the size and the cost is probably 4x as well due to the larger surface area, and the reward is now greater as you keep out raids capable of literally ending the game. So you have unchanged opportunity cost, and a higher reward for the cost and yet it’s not taken. I’m not saying one should always stone wall but economically speaking it’s almost impossible for palisade walling to be meta but stone walling to be nonexistent.

Some of these amplify each other like stone walling and siege aversion. If you are averse to stone walls you will also be averse to using slow units meaning you will not want to use siege.

If any of the above examples are true, it creates a nightmare for data interpretation and making changes. If you are reasonably certain that top level players are making errors on under-used civs, then what do you do? If you buff the civ(s) and then the players find out their mistakes, then you’re left with something grossly overpowered. This is exactly what happened to Saracens when DE came out. The devs, upon seeing what pros who abused the market could do, were forced to walk back all of the archer building damage changes and the cavalry archer anti building bonus which I believe had existed since AoK.

As the Saracen example shows you can’t really buff things if players aren’t playing optimally. This leaves you damned if you do and damned if you don’t. The solution to this is better communication. You want to explicitly communicate that players haven’t tested enough of strategy X which the staff found to be almost OP which held back change Y.

Well, that’s enough of a tangent. Again, the OP was spot on regarding the pitfalls of identifying changes. I just wanted to provide some other context regarding how difficult it is to predict if underutilized things are actually bad or if players are just biased.

GermanAttorney8 · August 29, 2021, 5:04am

Just saying, it would be good if you could remove some examples (Saracen market and unit mixing in particular are weird because they’re proven in a “proof is left as an exercise” style) and leave a TLDR. Tho I agree that player bias and lack of samples can create a lot of unintended effects during changes.

Lepigozzus · August 29, 2021, 8:13am

This means nothing.
Personal experience is even more skewed than random statistics.
There are players with 2k matches played that have played 70-80% (if not more) of their matches with just one civ.
Let’s say that civ is Huns, those players should only make proposals for that civ alone then, because clearly, even with more than 2k matches played (and probably high ELO), they have no valid data about, say, Saracens.
Then pick another player who equally chooses every civ in the game, and has played 2k matches in total. This means less than 60 with each civ, a statistically weak sample.
This not considering that losses might come from errors or better plays on the opposite side.
No thanks, I’d pick raw (but broad) statistical data over any “experience” any day to gather the info needed.
Where the sample allows it, separated by map to better analyze the situation. Example: a civ has a 60% winrate on Arabia but 30% on Islands; maybe it has oppressive raid capabilities but a really weak navy. Is it in need of some tweak? Or is the average (unweighted by map) winrate enough?

Thinking about it, better data would include the average winrate by map alongside the total winrate, to counter the effect of the preponderance of any one map in the sample.
Obvious example is Arabia which accounts for most of some civ’s wins.

Example of this, using data from aoestats (updated or not this is irrelevant).

Franks have a 53.73% winrate across all ELOs in 1v1, but if we average the winrate on each map it becomes 50.62%. A big difference imho.
Same goes for Portuguese: 45.12% total winrate in 1v1, only 42.83% averaged by map.
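The two kinds of average can be sketched like this. The per-map records are invented for a hypothetical civ (they are not aoestats figures), and the function names are illustrative. The pooled number weights each map by games played, while the map-averaged number gives every map equal weight.

```python
# map -> (wins, games); invented numbers for one hypothetical civ.
records = {
    "Arabia": (540, 1000),
    "Arena": (95, 200),
    "Islands": (40, 100),
}


def pooled_winrate(records):
    """Total wins over total games: popular maps dominate the result."""
    total_wins = sum(wins for wins, _ in records.values())
    total_games = sum(games for _, games in records.values())
    return total_wins / total_games


def map_averaged_winrate(records):
    """Mean of the per-map winrates: every map counts equally."""
    rates = [wins / games for wins, games in records.values()]
    return sum(rates) / len(rates)


print(f"pooled:       {pooled_winrate(records):.1%}")
print(f"map-averaged: {map_averaged_winrate(records):.1%}")
```

With these made-up records the pooled winrate is about 51.9% and the map-averaged one about 47.2%. One caveat with the equal-weight version is that a map with only a handful of games pulls on the result just as hard as Arabia does, so it is only as reliable as the smallest per-map sample.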

casusincorrabil · August 29, 2021, 10:19am

Because the idea of double-gold is a theoretical one that doesn’t really hold in reality.
I know it now, because I tried it with Portuguese a lot to “figure them out”. They can’t double-gold. Double-gold is an extremely bad idea with them; it makes them worse than they are.
Both power lines have comparatively high early game tech investments. Especially the archer line. Going for both power lines makes them weaker in the early to mid game than they need to be. And it also doesn’t really pay off in the late game, because what? At some point gold becomes scarce even with Portuguese.

Explanation:
A) When you want to make the most of the gold discount, it is best to choose ONE power line, because the more gold-related upgrades you research, the less gold you can invest into your power units, as the techs aren’t discounted. So to get the most “value” out of your gold you want to research as few upgrades as possible. And the power units cost a lot of gold to upgrade. And in the late game you have to invest into the trash anyway, so you don’t gain any real benefit from it.
B) Double-gold doesn’t provide a great benefit in 1v1. As you don’t face double-gold yourself, but a unit comp that is built around a power unit, you just don’t need double-gold. If you just play your standard counter strats, they usually perform even better against the opponent’s comp than double-gold. Why? Because the specialized counters are just better at the one simple job you need to fill: cover the weakness of your army and punish the weakness of the opponent’s army. And that is what the standard trash counters are perfectly designed for. They do this job even better than if you add the other power unit line.
C) If you try to go for double-gold, at least 60% of your eco must mine gold. So even if you don’t boom at all but go all-in, you will have problems even getting this gold income in. You can’t place that many vills on gold; it’s physically restricted. And that means that you also can’t add eco behind if you go for double-gold. But that’s what you want if you have a superior comp; adding eco behind is the key factor in winning games with aggression. All-in is more of a meme strat in RTS. Not that it can’t work, but being fixed to this is an overextension. And overextensions are usually punished very hard. If the opponent sees you going all-in, he just adds enough TCs to hold, weathers the storm with less army but with the defender’s advantage, and outbooms you. Then your double-gold isn’t worth anything, because mass > power.

That’s why double-gold is a bad strat in 1v1s, even for civs like Ports or Malians. Especially for Ports, as it significantly reduces the total amount of gold you can actually spend on your double-gold units.

So people playing ports with only one gold unit (usually archers) are playing them to their strength.

WoodsierCorn696 · August 29, 2021, 11:14am

I am a bit confused about the first post. If I read the post, then I see a great summary of some of the overlooked, important variables to look at when drawing conclusions from a winrate. Things like maps, meta, popularity, … are all important variables to look at to give a better meaning to a winrate. But you use it to claim that looking at winrates is bad, even after looking at the relevant variables.

Based on your first post I would conclude that looking at winrates is useful, but you shouldn’t look at the winrate number alone; you also have to look at all the other variables to fully understand the state of a civ.