Since the core code of the game is from the 90s I assumed the game would only utilize a single thread.
However, when I checked the resource monitor with nothing but the game running I found that 4 threads were under load while all other threads were chilling at 0-10% use.
The game uses multi-threading, as any modern application would, which utilizes all CPU cores. However, the game logic cannot and should not run concurrently, and thus can only run on one core at a time.
Contrary to misconceptions, this in itself isn’t a performance issue. The game logic isn’t bottlenecked on CPU cycles in such a way that concurrency would allow a greater throughput.
The game logic could run concurrently, by identifying independent subsets of the game objects. I expect the game runs in a sequence of small time slices, so there will be subsets of objects that cannot possibly interact with other subsets in the current time slice, so each of those subsets can be calculated independently. For example, at the very start of a game, there is no way for any of one player’s units to affect the other player’s units, so each player’s units could be calculated independently.
The problem with this is that when there are enough objects in the game to make it the fps limiter, the computation to identify the independent subsets would itself become significant. Until fairly recently, core counts were low enough that it wouldn’t have been worth doing this (the game already benefits from 4 cores over 3 cores), but now that some modern games are being optimised for 8-core CPUs, it might well improve performance to do it. It may well be possible to develop an algorithm to derive the independent subsets for each time slice fairly cheaply from those for the previous time slice, rather than evaluating everything from scratch each time.
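To make the idea of independent subsets concrete, here is a minimal sketch in Python. It is not how the game works; it assumes, purely for illustration, that two objects can only interact within one time slice if they are within some distance `reach` of each other, and it groups objects with a union-find structure. The function name, the `reach` parameter and the sample unit names are all hypothetical.

```python
from itertools import combinations


def independent_subsets(positions, reach):
    """Group object ids so that objects in different groups are farther
    apart than `reach` (assumed to be the maximum interaction distance
    within one time slice)."""
    parent = {obj: obj for obj in positions}

    def find(obj):
        # Follow parent links to the group representative, halving the path.
        while parent[obj] != obj:
            parent[obj] = parent[parent[obj]]
            obj = parent[obj]
        return obj

    # Pairwise check: O(n^2), which is exactly the cost concern raised above.
    for a, b in combinations(positions, 2):
        (ax, ay), (bx, by) = positions[a], positions[b]
        if (ax - bx) ** 2 + (ay - by) ** 2 <= reach ** 2:
            parent[find(a)] = find(b)

    groups = {}
    for obj in positions:
        groups.setdefault(find(obj), set()).add(obj)
    return sorted(groups.values(), key=min)


# Hypothetical start-of-game layout: two players far apart on the map.
positions = {
    "p1_villager": (0, 0),
    "p1_scout": (3, 0),
    "p2_villager": (100, 100),
    "p2_scout": (102, 100),
}
groups = independent_subsets(positions, reach=5)
print(groups)
```

Each resulting group could in principle be updated on its own thread. The pairwise loop is quadratic in the number of objects, so a real implementation would need something cheaper (a spatial grid, or incremental updates from the previous time slice as suggested above) before it paid for itself.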
But then optimising for 8 or more cores won’t help the multiplayer experience, as the slideshow is caused by the slowest PC, which probably doesn’t have much unused capacity at the moment.
Remember, a bottlenecked thread will not pin the entire CPU to 100%. Instead, a thread will pin to 100% divided by the number of cores (x2 for hyperthreading). For example, my 10 core CPU with hyperthreading will only show 100/(10x2)=5% for any thread that is bottlenecked.
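The arithmetic above can be written as a tiny helper. This is only a sketch of the rule of thumb as stated (one fully busy thread shows up as 100% divided by the number of logical processors); the function name and the `smt` flag are made up for the example.

```python
def single_thread_share(physical_cores, smt=True):
    """Percent of total CPU that one fully busy thread accounts for."""
    logical_processors = physical_cores * (2 if smt else 1)
    return 100.0 / logical_processors


print(single_thread_share(10))            # 5.0  (10 cores with hyperthreading)
print(single_thread_share(4, smt=False))  # 25.0 (plain quad-core)
```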
Let’s take a look at AOE2 when running a performance test:
I see two bottlenecked threads (5% and 4.89%). The other threads are barely being utilized. (Not shown: the GPU and SSD are barely being utilized as well). So, what’s holding back performance? It’s the bottlenecked threads.
The Steam hardware survey shows 80% of gamers have quad-core or better @ 2.3 GHz or better. That should be enough. But, you’re right, if any of the online opponents have a potato, that could become the limiting factor.
It would be interesting to understand which operations are the most constraining for the program. I mean, does each client computer have to compute the outcome for all units? Does the server just route a copy of each player’s commands?
It’s tricky, I’d say the realistic options are:
1. Assign it to 4 logical processors that are on 4 different physical cores
2. Just leave it to the OS to assign across all 8
You could test to see what gives the best score, but bear in mind the score can vary a little randomly with both those approaches, as even if you do 1, you can still end up with some other threads using the same logical processor as one of the game’s main threads. With 2, you can end up with some pretty major conflicts with threads running on the same physical core, but that can be offset by having more logical processors available in total. You’d have to run the benchmark several times with each and note down the scores to see if there’s a consistent advantage for one over the other.
I recommend setting the affinity using this method:
I tried doing it via a bat file and command line arguments, but it didn’t result in the affinity being set correctly. Doing it the above way worked correctly every time I tried it, but it’s a bit of a pain to have to do it every time you run Steam. If you look in task manager with it showing all logical processors you can check the usage is on the logical processors you expect.
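For anyone experimenting with scripted approaches, affinity is usually expressed as a bitmask with one bit per logical processor. The sketch below only shows how such a mask is built; it does not set anything and is not a fix for the bat-file problem described above. It assumes that picking every second logical processor (0, 2, 4, 6) lands on four different physical cores, which depends on how the OS numbers hyperthreaded siblings. The function name is made up for the example.

```python
def affinity_mask(logical_processors):
    """Build an affinity bitmask with one bit set per logical processor index."""
    mask = 0
    for cpu in logical_processors:
        mask |= 1 << cpu
    return mask


# One logical processor per physical core on a 4-core / 8-thread CPU
# (assumes sibling threads are numbered 0/1, 2/3, 4/5, 6/7).
mask = affinity_mask([0, 2, 4, 6])
print(f"{mask:X}")  # 55

# All eight logical processors.
print(f"{affinity_mask(range(8)):X}")  # FF
```

The hexadecimal form is what, for example, the `/affinity` switch of the Windows `start` command expects. Whichever way the mask is applied, it is still worth checking in task manager that the load ends up on the expected logical processors.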
You can see per-core utilization, but that’s beside the point.
Even if you see 100% utilization of a certain core that does not mean that the game is CPU cycle bottlenecked. All it means is that the game spends most of its time on the CPU and not waiting or doing I/O.
The only way to see a degradation of performance is when in-game ticks (or seconds) progress at a lower rate than the game speed ratio allows (which, as we may remember, is the case when playing vs certain people with certain smart toasters for computers on HD/Voobly).
The CPU might be spending a lot of time on cache misses (due to lack of memory) and context switching back and forth with the kernel. The game logic might actually be seeing very few cycles.
Also remember that game ticks have to be synced between all the hosts which takes up most of the time.
I do remember in HD - late game trade cart path finding would cause performance degradation which is caused by an inefficient path finding algorithm. That for instance could be fixed by making it an order of magnitude more efficient, not by trying to make the entire thing concurrent.
The game engine was written for 1999 computers - we’re talking about a Pentium at 133 MHz with perhaps 16 MB of RAM and no SIMD whatsoever - and exactly the same thing was demanded of it: 8 players with a maximum population of 200 each. In fact, the game engine was inherited from the first Age of Empires game (1997). Since then the game code and net code could only have gotten more efficient (via Userpatch), and the most noticeable impact is that there’s no longer action delay in multiplayer.
My point being that the assumption that the game logic is resource demanding and as such could benefit somehow from concurrency is simply false. I really wish the game code was open source, or at least wasn’t obfuscated, so you could plug in a profiler and see how much CPU time is spent on game logic code.
So what do you think is the bottleneck, given that the game runs faster on faster CPUs, and hence is unquestionably CPU limited in performance (given a reasonable graphics card)?
I have a 4 core / 8 logical processors CPU (i7 6700T), so I tried 3 scenarios with it:
1. Logical processors 0,2,4,6
2. Check the All box
3. Uncheck the All box, then check 7,6,5,4,3,2,1,0
I ran these tests as 1,2,3,1,2,3,1,2,3 rather than 1,1,1,2,2,2,3,3,3, closing the game each time, to try to get a fresh roll of the dice for how the threads happened to be assigned to logical processors. The scores were:
1. 1166.1, 1166.1, 1172.6
2. 1145.5, 1145.5, 1152.6
3. 1152.6, 1152.6, 1152.6
I think 2 and 3 are the same, and it’s just random variation in how the threads happen to get assigned to logical processors, and what conflicts it causes. On this PC, at least, it runs better if only 4 logical processors are used, but that will depend on the specific CPU and its hyperthreading implementation.
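To compare the three scenarios a bit more systematically, the scores can be averaged per scenario and set against the run-to-run spread. This is just a sketch over the nine numbers quoted above; with only three runs each it can suggest a trend, not prove one.

```python
from statistics import mean

# Benchmark scores quoted above, keyed by scenario number.
scores = {
    1: [1166.1, 1166.1, 1172.6],  # logical processors 0,2,4,6
    2: [1145.5, 1145.5, 1152.6],  # "All" box checked
    3: [1152.6, 1152.6, 1152.6],  # all 8 checked individually
}

# (mean, max - min) per scenario, rounded to one decimal place.
summary = {
    scenario: (round(mean(runs), 1), round(max(runs) - min(runs), 1))
    for scenario, runs in scores.items()
}

for scenario, (avg, spread) in summary.items():
    print(f"scenario {scenario}: mean {avg}, spread {spread}")
```

The lowest score in scenario 1 is still above the highest score in scenarios 2 and 3, which is what makes the difference look like more than noise on this particular PC.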
Not the game logic. The game utilizes a ridiculous amount of memory for sprites and other assets and accessing it all throughout the game would cause more cache misses on lower end CPUs (mind you, not slower CPUs, simply CPUs that have less cache and generally slower memory access).
You’d think those things would be GPU accelerated, but they’re not. Much of how graphics are handled in the game is carried over from that original 1999 code base, and that also means much graphical work is done on the CPU. It worked in 1999, since back then there weren’t many graphical assets and graphics cards were just starting to become more than mere pixel writers, so it made no sense to target them.
Again, I emphasize how, game-logic-wise, we still have roughly the same thing that worked in 1999. The game is not logic heavy. It’s a rather simple simulation of no more than a couple of thousand items that need to be updated no more than a hundred times per second.
Don’t get me wrong, there’s a lot to be desired from the game vis-à-vis performance - they could definitely overhaul graphics and implement it in a modern way that doesn’t consume an insane amount of memory. They could also overhaul other parts and make it truly client-server, so that the game simulation is only performed in a single place (rather than duplicated across peers). However, the game logic isn’t really an issue.
Take a look at OpenAge by the way… it’s an attempt to remake the AoE game engine while also improving it significantly in every aspect - they’re even capable of running borderless maps.
Okay, so some games have benchmarks that split the CPU element into two elements - rendering, and sim / game logic. Examples of games with benchmarks that do this are Shadow of the Tomb Raider, and Forza Horizon 4. What you’re saying is that the bottleneck is the rendering component, rather than the sim / game logic component. You’re right that there is significant CPU load from everything the CPU has to do to prepare things for the GPU and execute the API calls to tell the GPU what to do.
I’m not sure if we can definitely say which it is, from just observing the game’s behaviour, but my feeling is it must be the game logic, based on the slowdown seen late in multiplayer games. Unless the code is extremely badly written, there is no need for other players to have to wait for another player’s slow rendering, they would only need to wait for another player’s slow game logic processing. If these two things are separate threads, they can progress independently, with many game logic updates per rendered frame, if the rendering is the slow part. This would result in poor fps for the player with the slow PC, but normal fps for other players. As this isn’t what is observed, I infer it’s the game logic that is the slow part.
Actually, when you see players in multiplayer games slowing the game down for everyone else, it’s probably neither rendering nor game logic, but simply a bad rig where everything is slowed down and the game process is context switching to the kernel a million times a second for no good reason. Think malware, anti-virus, an aggressive firewall that likes to do fancy stuff, even a terrible driver.
There are ways to profile and find those kinds of things on Windows (for example, using a tool called Procmon).
Again, consider the best computer in something like 2004 - 5 years after the release of Age of Empires 2, 7 years after the release of AoE 1 (when the game engine was originally conceived). Multiplayer games ran fine on those rigs. Even the worst PC today, heck, the worst smartphone today has an order of magnitude more single core computing power than a computer then.
If the PC is context switching a lot, the end result is still a slowing down of execution of the game code, and some part of the game code running slowly as a result still has to be the reason for other players being made to wait for something to happen on the slow PC. The only thing that other players would need to wait for is slow game logic processing, not rendering. Maybe the game is coded really badly, and the game does hold everything up waiting for CPU rendering, but it would be an extremely small piece of work to fix that, so it’s very unlikely.
I understand your argument about the original version of the game. We know DE is more demanding, but I think what you are saying is the game logic element will not have changed much, so won’t be much more demanding than it was in the original game. Again, it depends how the code is written in terms of how it handles the extra civs, any improvements to the AI of units, etc. If you’re right and it is the CPU rendering that is holding things up, they really ought to fix that; it just shouldn’t be hard to decouple it from the game logic (this is standard practice in driving games, for example, where you want the simulation to always run at a higher frequency such as 240 Hz, even when rendering at 60 fps).
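The decoupling mentioned above is commonly done with a fixed-timestep loop: the simulation always advances in fixed steps, and each rendered frame simply triggers however many steps fit into the time that has passed. Below is a minimal sketch of that pattern, not anything taken from the game. The function name is made up, and `Fraction` is used only so the example arithmetic is exact; a real loop would read a monotonic clock.

```python
from fractions import Fraction


def sim_steps_per_frame(frame_times, sim_hz=240):
    """Return how many fixed simulation steps run for each rendered frame.

    frame_times: duration of each rendered frame in seconds.
    sim_hz: fixed simulation rate, independent of the render rate.
    """
    dt = Fraction(1, sim_hz)
    accumulator = Fraction(0)
    steps_per_frame = []
    for frame_time in frame_times:
        accumulator += frame_time
        steps = 0
        # Consume the elapsed time in fixed-size simulation steps.
        while accumulator >= dt:
            accumulator -= dt  # a real loop would call update(dt) here
            steps += 1
        steps_per_frame.append(steps)
    return steps_per_frame


# Rendering at 60 fps: four 240 Hz simulation steps per frame.
print(sim_steps_per_frame([Fraction(1, 60)] * 3))

# Rendering drops to 20 fps: the simulation still advances at 240 Hz.
print(sim_steps_per_frame([Fraction(1, 20)]))
```

With this structure a slow renderer lowers the frame rate on that PC only. The simulation keeps up as long as the steps themselves are cheap enough, which is the distinction being argued over in this thread.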
Of course, if the computer slows down due to whatever, then everything slows down, including game logic. That doesn’t mean that making the game logic concurrent would have prevented the slowdown… on the contrary, it would have created more overhead.
The game rendering runs on a different thread from the game logic, and could potentially run on a different core as a result.
Also, like I said, while the rendering in DE is quite inefficient, I really don’t think it’s the cause of slowdowns on some rigs. I believe the slowdowns are caused by factors external to the DE application.
As a general point, I disagree with this, but for the specific subset of people who are causing the problem in multiplayer games, I agree, which is why I said “But then optimising for 8 or more cores won’t help the multiplayer experience, as the slideshow is caused by the slowest PC, which probably doesn’t have much unused capacity at the moment.” earlier in the thread.
It would also be an expensive activity to develop an algorithm to form the independent subsets of game objects needed to make the processing concurrent, which makes it unlikely to ever happen unless some general research on the subject exists that could be easily applied to the game.