Is the game optimized for 4 CPU cores?

Since the core code of the game is from the 90s I assumed the game would only utilize a single thread.

However, when I checked the resource monitor with nothing but the game running I found that 4 threads were under load while all other threads were chilling at 0-10% use.

I thought it was single core

I posted some analysis of the threads here:

Also, some tests with different affinities here, and in the post below it:


The game uses multi-threading, as any modern application that utilizes all CPU cores would. However, the game logic cannot and should not run concurrently, and thus can only run on one core at a time.

Contrary to misconceptions, this in itself isn’t a performance issue. The game logic isn’t bottlenecked on CPU cycles in such a way that concurrency would allow greater throughput.

The game logic could run concurrently, by identifying independent subsets of the game objects. I expect the game runs in a sequence of small time slices, so there will be subsets of objects that cannot possibly interact with other subsets in the current time slice, so each of those subsets can be calculated independently. For example, at the very start of a game, there is no way for any of one player’s units to affect the other player’s units, so each player’s units could be calculated independently.

The problem with this is that when there are enough objects in the game to make it the fps limiter, the computation to identify the independent subsets would itself become significant. Until fairly recently, core counts were low enough that it wouldn’t have been worth doing this (the game already benefits from 4 cores over 3 cores), but now that some modern games are being optimised for 8 core CPUs, it might well improve performance to do it. It may well be possible to develop an algorithm to derive the independent subsets for each time slice fairly cheaply from those for the previous time slice, rather than evaluating everything from scratch each time.
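To make the idea concrete, here’s a toy sketch (my own illustration, not how the actual game works) that partitions units into independent groups with union-find, assuming units can only interact within some fixed radius during one time slice:

```python
# Union-find over unit positions: units within INTERACT_RADIUS of each
# other might interact this time slice, so they must share a group;
# disjoint groups could then be simulated on separate cores.
INTERACT_RADIUS = 5.0  # assumed per-slice interaction range

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def independent_groups(positions):
    n = len(positions)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            (x1, y1), (x2, y2) = positions[i], positions[j]
            if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= INTERACT_RADIUS ** 2:
                union(parent, i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())
```

The pairwise check here is O(n²); in practice you’d use a spatial grid or hash so the grouping itself stays cheap, which is exactly the overhead concern raised above.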

But then optimising for 8 or more cores won’t help the multiplayer experience, as the slideshow is caused by the slowest PC, which probably doesn’t have much unused capacity at the moment.

How do you come to that conclusion?

Remember, a bottlenecked thread will not pin the entire CPU to 100%. Instead, a thread will pin to 100% divided by the number of cores (x2 for hyperthreading). For example, my 10 core CPU with hyperthreading will only show 100/(10x2)=5% for any thread that is bottlenecked.
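That arithmetic is easy to sanity-check:

```python
def pinned_thread_percent(physical_cores, hyperthreading=True):
    """Max CPU % a single bottlenecked thread can show in a monitor
    that reports usage relative to all logical processors."""
    logical = physical_cores * (2 if hyperthreading else 1)
    return 100 / logical
```

So on a 10-core CPU with hyperthreading, one fully pinned thread shows as just 5%.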

Let’s take a look at AOE2 when running a performance test:

I see two bottlenecked threads (5% and 4.89%). The other threads are barely being utilized. (Not shown: the gpu and SSD are barely being utilized as well). So, what’s holding back performance? It’s the bottlenecked threads.

The Steam hardware survey shows 80% of gamers have quad-core or better @ 2.3 GHz or better. That should be enough. But, you’re right, if any of the online opponents have a potato, that could become the limiting factor.

It would be interesting to understand which operations are the most constraining for the program. I mean, does each client computer have to compute the outcome for all units? Does the server just route a copy of each player’s commands?

What should I set them to on a 4/8 CPU?

It’s tricky, I’d say the realistic options are:

  1. Assign it to 4 logical processors that are on 4 different physical cores
  2. Just leave it to the OS to assign across all 8

You could test to see what gives the best score, but bear in mind the score can vary a little randomly with both those approaches, as even if you do 1, you can still end up with some other threads using the same logical processor as one of the game’s main threads. With 2, you can end up with some pretty major conflicts with threads running on the same physical core, but that can be offset by having more logical processors available in total. You’d have to run the benchmark several times with each and note down the scores to see if there’s a consistent advantage for one over the other.
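For reference, both the Windows `start /affinity` command and the `SetProcessAffinityMask` API express the chosen logical processors as a bitmask, which you can compute with a small helper like this (my own illustration, not part of any game tooling):

```python
def affinity_mask(logical_processors):
    """Bitmask with one bit set per chosen logical processor."""
    mask = 0
    for lp in logical_processors:
        mask |= 1 << lp
    return mask
```

So option 1 above with logical processors 0, 2, 4, 6 corresponds to the hex mask 0x55, while all 8 logical processors is 0xFF.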

I recommend setting the affinity using this method:

I tried doing it via a bat file and command line arguments, but it didn’t result in the affinity being set correctly. Doing it the above way worked correctly every time I tried it, but it’s a bit of a pain to have to do it every time you run Steam. If you look in task manager with it showing all logical processors you can check the usage is on the logical processors you expect.


Averaged 1243.5 both on 0, 2, 4, 6, 8 and 1, 3, 5, 7.

Here’s the interesting part. I ran the benchmark 3 times with the ‘select all’ box manually ticked, and the average was 1246.5.
Screenshot:

However, if I manually tick boxes 0-8, it will auto-tick ‘all processors’, BUT the average of 3 tests when doing this was 1276.9.
Screenshot:

Can you perhaps try the same on your system?

You can see per core utilization, but that’s beside the point.

Even if you see 100% utilization of a certain core, that does not mean the game is CPU-cycle bottlenecked. All it means is that the game spends most of its time on the CPU rather than waiting or doing I/O.
The only way to see a degradation of performance is when in-game ticks (or seconds) progress at a lower rate than the game speed ratio allows (which, as we can remember, is the case when playing against certain people with certain smart toasters for computers on HD/Voobly).
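That degradation check could be expressed roughly like this (a sketch; the nominal tick rate and tolerance are my assumptions, not known values from the game):

```python
def tick_rate_degraded(ticks_elapsed, wall_seconds, speed_ratio,
                       nominal_tps=20, tolerance=0.95):
    """True if the simulation advanced fewer ticks than the chosen
    game speed allows. nominal_tps is an assumed base tick rate."""
    expected = nominal_tps * speed_ratio * wall_seconds
    return ticks_elapsed < tolerance * expected
```

At 1.7x speed over 10 seconds you’d expect around 340 ticks; substantially fewer than that means someone’s PC (or network) is holding the simulation back.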

The CPU might be spending a lot of time on cache misses (due to memory pressure) and context switching back and forth with the kernel. The game logic might actually be getting very few cycles.

Also remember that game ticks have to be synced between all the hosts which takes up most of the time.

I do remember that in HD, late-game trade cart pathfinding would cause performance degradation, caused by an inefficient pathfinding algorithm. That, for instance, could be fixed by making the algorithm an order of magnitude more efficient, not by trying to make the entire thing concurrent.

The game engine was written for 1999 computers - we’re talking about a Pentium at 133 MHz with perhaps 16 MB of RAM and no SIMD whatsoever - and it was expected to handle exactly the same thing: 8 players with a maximum population of 200 each. In fact, the game engine was inherited from the first Age of Empires game (1997). Since then the game code and net code could have only gotten more efficient (via UserPatch), and the most noticeable impact is that there’s no longer action delay in multiplayer.

My point being that the assumption that the game logic is resource-demanding, and as such could somehow benefit from concurrency, is simply false. I really wish the game code were open source, or at least not obfuscated, so you could plug in a profiler and see how much CPU time is spent on game logic code.

So what do you think is the bottleneck, given that the game runs faster on faster CPUs, and hence is unquestionably CPU-limited in performance (given a reasonable graphics card)?

I have a 4 core / 8 logical processors CPU (i7 6700T), so I tried 3 scenarios with it:

  1. Logical processors 0,2,4,6
  2. Check the All box
  3. Uncheck the All box, then check 7,6,5,4,3,2,1,0

I ran these tests as 1,2,3,1,2,3,1,2,3 rather than 1,1,1,2,2,2,3,3,3, closing the game each time, to try to get a fresh roll of the dice for how the threads happened to be assigned to logical processors. The scores were:

  1. 1166.1, 1166.1, 1172.6
  2. 1145.5, 1145.5, 1152.6
  3. 1152.6, 1152.6, 1152.6

I think 2 and 3 are the same, and it’s just random variation in how the threads happen to get assigned to logical processors, and what conflicts it causes. On this PC, at least, it runs better if only 4 logical processors are used, but that will depend on the specific CPU and its hyperthreading implementation.


Not the game logic. The game utilizes a ridiculous amount of memory for sprites and other assets and accessing it all throughout the game would cause more cache misses on lower end CPUs (mind you, not slower CPUs, simply CPUs that have less cache and generally slower memory access).

You’d think those things would be GPU-accelerated, but they’re not. Much of how graphics are handled in the game is carried over from that original 1999 code base, which means much of the graphical work is done on the CPU. It worked in 1999: back then there weren’t many graphical assets, and graphics cards were just starting to become more than mere pixel writers, so it made no sense to target them.

Again, I emphasize how game logic wise, we still have roughly the same thing that worked in 1999. The game is not logic heavy. It’s a rather simple simulation of no more than a couple of thousand items that need to be updated no more than a hundred times per second.
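A quick back-of-envelope calculation supports that (the cycles-per-update figure is a generous guess of mine, not a measured number):

```python
items = 2000               # upper-bound object count from the post
ticks_per_second = 100     # upper-bound update rate from the post
cycles_per_update = 1000   # generous guess for simple per-object logic

cycles_per_sec = items * ticks_per_second * cycles_per_update
core_hz = 3e9              # a typical modern core
fraction = cycles_per_sec / core_hz
print(f"{fraction:.1%} of one core")  # 6.7% of one core
```

Even with those pessimistic numbers, the raw simulation work is a small fraction of a single modern core.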

Don’t get me wrong, there’s a lot to be desired from the game vis-à-vis performance - they could definitely overhaul graphics and implement it in a modern way that doesn’t consume an insane amount of memory. They could also overhaul other parts and make it truly-client server so that the game simulation is only performed in a single place (rather than duplicated across peers). However, the game logic isn’t really an issue.

Take a look at OpenAge, by the way… it’s an attempt to remake the AoE game engine while also improving it significantly in every aspect - they’re even capable of running borderless maps.

Okay, so some games have benchmarks that split the CPU element into two elements - rendering, and sim / game logic. Examples of games with benchmarks that do this are Shadow of the Tomb Raider, and Forza Horizon 4. What you’re saying is that the bottleneck is the rendering component, rather than the sim / game logic component. You’re right that there is significant CPU load from everything the CPU has to do to prepare things for the GPU and execute the API calls to tell the GPU what to do.

I’m not sure if we can definitely say which it is, from just observing the game’s behaviour, but my feeling is it must be the game logic, based on the slowdown seen late in multiplayer games. Unless the code is extremely badly written, there is no need for other players to have to wait for another player’s slow rendering, they would only need to wait for another player’s slow game logic processing. If these two things are separate threads, they can progress independently, with many game logic updates per rendered frame, if the rendering is the slow part. This would result in poor fps for the player with the slow PC, but normal fps for other players. As this isn’t what is observed, I infer it’s the game logic that is the slow part.

Actually, when you see players in multiplayer games slowing the game down for everyone else, it’s probably neither rendering nor game logic, but simply a bad rig where everything is slowed down and the game process is context switching to the kernel a million times a second for no good reason. Think malware, anti-virus, an aggressive firewall that likes to do fancy stuff, even a terrible driver.

There are ways to profile and find those kind of things on Windows (for example, using a tool called Procmon).

Again, consider the best computer in something like 2004 - 5 years after the release of Age of Empires 2, 7 years after the release of AoE 1 (when the game engine was originally conceived). Multiplayer games ran fine on those rigs. Even the worst PC today, heck, the worst smartphone today has an order of magnitude more single core computing power than a computer then.

If the PC is context switching a lot, the end result is still a slowing down of execution of the game code, and some part of the game code running slowly as a result still has to be the reason for other players being made to wait for something to happen on the slow PC. The only thing that other players would need to wait for is slow game logic processing, not rendering. Maybe the game is coded really badly, and the game does hold everything up waiting for CPU rendering, but it would be an extremely small piece of work to fix that, so it’s very unlikely.

I understand your argument about the original version of the game. We know DE is more demanding, but I think what you are saying is the game logic element will not have changed much, so won’t be much more demanding than it was in the original game. Again, it depends how the code is written in terms of how it handles the extra civs, any improvements to the AI of units, etc. If you’re right and it is the CPU rendering that is holding things up, they really ought to fix that, it just shouldn’t be hard to decouple it from the game logic (this is standard practice in driving games, for example, where you want the simulation to always run at a higher frequency such as 240hz, even when running at 60fps).
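The decoupling described above is the standard fixed-timestep loop; a minimal sketch (my own, not the game’s code):

```python
TICK = 1 / 240  # fixed simulation step; render fps is independent of this

def advance(accumulator, frame_dt, update):
    """Run as many fixed-size simulation ticks as fit in this frame's
    elapsed time; the remainder carries over to the next frame."""
    accumulator += frame_dt
    ticks = 0
    while accumulator >= TICK:
        update(TICK)          # game logic advances in fixed steps
        accumulator -= TICK
        ticks += 1
    return accumulator, ticks
```

Each rendered frame calls `advance` with the real elapsed time, then draws. A slow renderer lowers the frame rate, but the simulation still steps at its fixed frequency, so other players never have to wait on someone’s rendering.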

Of course if the computer slows down due to whatever, then everything slows down, including game logic. That doesn’t mean that making the game logic concurrent would have prevented the slowdown… on the contrary, it would have created more overhead.

The game rendering runs on a different thread from the game logic, and could potentially run on a different core as a result.

Also, like I said, while the rendering in DE is quite inefficient, I really don’t think it’s the cause of slowdowns on some rigs. I believe the slowdowns are caused by factors external to the DE application.

As a general point, I disagree with this, but for the specific subset of people who are causing the problem in multiplayer games, I agree, which is why I said “But then optimising for 8 or more cores won’t help the multiplayer experience, as the slideshow is caused by the slowest PC, which probably doesn’t have much unused capacity at the moment.” earlier in the thread.

It would also be an expensive activity to develop an algorithm to form the independent subsets of game objects needed to make the processing concurrent, which makes it unlikely to ever happen unless some general research on the subject exists that could be easily applied to the game.