Being a good CPU neighbor
Computational tasks are often roughly divided into real-time and batch processing tasks. Sometimes you want to run tasks that take a large amount of computational resources and get the results as fast as possible, but you don’t really care exactly when the results come out.
In larger organizations there are shared computers that usually have a lot of CPU power and memory available and that are mostly idle. But they are also used by other users, who will be quite annoyed if they suffer from long delays to key presses or similar issues because your batch processing tasks take away all the resources the system has available. Fortunately *nix systems provide several mechanisms, available to non-root users, to avoid these kinds of issues.
These issues can also arise in situations where you have a system that can only allocate static resource requirements to tasks, like Jenkins. As an example, you might have some test jobs that need to finish within a certain time limit mixed with compilation jobs that should finish as soon as possible, but that can yield some resources to these higher priority test jobs, as they don’t have as strict timing requirements. And you usually don’t want to limit the resources that compilation jobs can use when they run on an otherwise idle machine.
Here I’m focusing specifically on process priority settings and other CPU scheduling settings provided by Linux, as it currently happens to be probably the most used operating system kernel for multi-user systems. These settings affect how much CPU time a process gets relative to other processes and are especially useful on shared, overloaded systems where there are more runnable processes than there are CPU cores available.
Different process scheduling priorities in Linux
Linux and POSIX interfaces provide different scheduling policies that define how the Linux scheduler allocates CPU time to a process. There are two main groups of scheduling policies for processes: real-time policies and normal scheduling policies. Real-time policies are usually accessible only to the root user and are not of interest here, as they normally can not be used by regular users on shared systems.
At the time of writing this article there are three normal scheduling policies available to normal users for process priorities:
`SCHED_OTHER`
: the default scheduling policy in Linux, with the default dynamic priority of 0.

`SCHED_BATCH`
: a policy for scheduling batch processes. This schedules processes in a similar fashion as `SCHED_OTHER` and is affected by the dynamic priority of a process. It makes the scheduler assume that the process is CPU intensive and makes it harder to wake up from sleep.

`SCHED_IDLE`
: the lowest priority scheduling policy. Not affected by dynamic priorities. This equals a nice level priority of 20.
The standard *nix way to change process priorities is to use the `nice` command. By default in Linux, a process can get nice values from -20 to 19, where the default nice value for a process is 0. Values 0–19 are available to a normal user and -20–-1 are available to the root user. When a process is started with the `nice` command without an explicit adjustment, it gets a nice value of 10. The lowest priority nice value of 19 gives a process 5% of CPU time with the default scheduler in Linux.
The `chrt` command can be used to give a process access to the `SCHED_BATCH` and `SCHED_IDLE` scheduling policies. `chrt` can be used in combination with the `nice` command to lower the priority of the `SCHED_BATCH` scheduling policy further. And using the `SCHED_IDLE` policy (equal to nice level 20) should give around 80% of the CPU time that nice level 19 gets, as the scheduler weight of each nice level is about 1.25 times that of the next lower priority level.
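These policies can also be selected from within a program through the sched_setscheduler() system call, which is roughly what `chrt` does under the hood. Below is a minimal sketch of a process putting itself into `SCHED_IDLE` before starting heavy work; it is not taken from the benchmark code used here, and the work itself is left as a placeholder.

```c
/* Minimal sketch: put the calling process into SCHED_IDLE, roughly what
 * "chrt --idle 0 <command>" arranges before executing the command. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* SCHED_IDLE and SCHED_BATCH only accept a static priority of 0. */
    struct sched_param param = { .sched_priority = 0 };

    /* pid 0 means "the calling process"; SCHED_BATCH would be set the
     * same way if the process should still be affected by nice values. */
    if (sched_setscheduler(0, SCHED_IDLE, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... the actual low-priority batch work would run here ... */
    return 0;
}
```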
Benchmarking with different competing workloads
I wanted to benchmark the effect of different scheduling policies on the work done by a specific benchmark program. I used a system with a Linux 3.17.0 kernel, an Intel Core i7-3770K processor running at 3.5 GHz with hyperthreading and frequency scaling turned on, and 32 gigabytes of RAM. I didn’t manage to make the CPU frequency constant, so results vary a little between runs. I used John the Ripper’s bcrypt benchmark, started with the following command, as the test program for useful work:
$ /usr/sbin/john --format=bcrypt -test
I benchmarked 1 and 7 instances of John the Ripper for 9 iterations (9 * 5 = 45 seconds) with varying numbers of non-benchmarking processes running at the same time. 1 benchmarking process should by default not suffer any cache contention from other benchmarking processes, while 7 benchmarking processes saturate all but 1 logical CPU core and can have cache and computational unit contention with each other.
Non-benchmarking processes were divided into different categories, and 0–7 times the number of CPU cores of them were started to make them fight for CPU time with the benchmarking processes. The results of running different numbers of non-benchmarking processes with different scheduling policies can be found in tables 1, 2, 3, and 4. The scheduling policies visible in those tables are:
- default: the default scheduling policy. Command started without any wrappers. Corresponds to `SCHED_OTHER` with a nice value of 0.
- chrt-batch: the default `SCHED_BATCH` scheduling policy. Command started with the `chrt --batch 0` prefix.
- nice 10: `SCHED_OTHER` scheduling policy with the nice command’s default nice value of 10. Command started with the `nice` prefix.
- nice 19: `SCHED_OTHER` scheduling policy with a nice value of 19. Command started with the `nice -n 19` prefix. This should theoretically take around 5 % of CPU time away from the useful work.
- nice 19 batch: `SCHED_BATCH` scheduling policy with a nice value of 19. Command started with the `nice -n 19 chrt --batch 0` prefix.
- sched idle: `SCHED_IDLE` scheduling policy. Command started with the `chrt --idle 0` prefix. This should theoretically take around 80 % of the CPU time of nice level 19 away from the useful work.
The results in those tables reflect the relative average percentage of work done compared to the situation where no additional processes disturb the benchmark. They show only the average value and do not show, for example, the variance, which could be useful for determining how close those values are to each other. You can download the raw benchmarking data and the source code used to generate and analyze this data from the following links:
CPU looper
The CPU looper application consists purely of code that causes CPU load. Its main purpose is just to test the scheduling policies without forcing CPU cache misses. Of course there will be context switches that can replace data cache entries related to process state and instruction cache entries of one process with those of another, but there won’t be any extra intentional cache flushing happening here.
The actual CPU looper application consists of the code shown in figure 1, compiled without any optimizations:
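Figure 1 itself is not reproduced in this text, but a minimal sketch of such a pure CPU burner could look like the following. This is my own reconstruction, not necessarily the exact code used in the benchmarks.

```c
/* Sketch of a pure CPU burner: spins forever without touching memory
 * beyond a single counter, so it competes mostly for CPU time rather
 * than for cache or memory bandwidth. Compile without optimizations
 * so the loop is not optimized away. */
int main(void)
{
    volatile unsigned long counter = 0;

    for (;;) {
        counter++;
    }
    return 0;
}
```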
The results of starting 0–7 times the number of CPU cores of CPU looper processes with the default priority and then running the benchmark with 1 or 7 benchmark workers can be found in tables 1 and 2.
Table 1: 1 worker | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 64.6 | 36.4 | 25.5 | 18.2 | 15.4 | 12.5 | 11.0 |
nice 0 vs. sched batch | 100 | 66.5 | 37.0 | 24.0 | 16.3 | 14.8 | 12.2 | 10.9 |
nice 0 vs. nice 10 | 100 | 93.3 | 90.2 | 82.7 | 75.4 | 75.2 | 75.0 | 74.9 |
nice 0 vs. nice 19 | 100 | 92.9 | 90.4 | 91.0 | 91.8 | 89.8 | 89.4 | 90.9 |
nice 0 vs. nice 19 batch | 100 | 94.0 | 89.2 | 91.4 | 89.8 | 86.4 | 89.4 | 90.7 |
nice 0 vs. sched idle | 100 | 95.0 | 91.3 | 92.2 | 92.0 | 92.6 | 91.0 | 92.4 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 80.0 | 74.7 | 68.0 | 67.7 | 67.5 | 64.5 | 60.7 |
nice 10 vs. nice 19 batch | 100 | 92.5 | 83.8 | 75.9 | 75.0 | 74.3 | 74.1 | 69.8 |
nice 10 vs. sched idle | 100 | 89.8 | 89.7 | 90.9 | 89.0 | 87.8 | 87.4 | 87.2 |
— | ||||||||
nice 19 vs. sched idle | 100 | 77.9 | 69.7 | 66.9 | 66.3 | 64.6 | 63.4 | 58.1 |
Table 2: 7 workers | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 49.8 | 33.6 | 24.6 | 19.4 | 16.4 | 13.8 | 12.2 |
nice 0 vs. sched batch | 100 | 51.1 | 34.0 | 25.1 | 20.0 | 16.6 | 14.2 | 12.4 |
nice 0 vs. nice 10 | 100 | 92.9 | 87.2 | 80.1 | 75.0 | 71.0 | 66.3 | 61.1 |
nice 0 vs. nice 19 | 100 | 94.1 | 94.9 | 95.4 | 95.2 | 96.2 | 95.7 | 96.0 |
nice 0 vs. nice 19 batch | 100 | 95.7 | 95.2 | 95.7 | 95.5 | 95.2 | 94.6 | 95.4 |
nice 0 vs. sched idle | 100 | 96.3 | 95.7 | 96.2 | 95.6 | 96.7 | 95.4 | 96.7 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 93.6 | 84.9 | 78.0 | 71.4 | 65.7 | 60.4 | 56.2 |
nice 10 vs. nice 19 batch | 100 | 92.4 | 84.1 | 77.6 | 69.9 | 65.6 | 60.7 | 56.4 |
nice 10 vs. sched idle | 100 | 95.5 | 94.9 | 94.5 | 93.9 | 94.3 | 93.7 | 91.9 |
— | ||||||||
nice 19 vs. sched idle | 100 | 91.4 | 80.9 | 71.4 | 64.3 | 58.3 | 52.5 | 48.5 |
Tables 1 and 2 show that the effect is not consistent between different scheduling policies: when there is 1 benchmarking worker running, it suffers more from the lower priority background processes than when there are 7 benchmarking workers running. On the other hand, with higher priority scheduling policies for the background processes, the average amount of work done by 1 worker stays higher under light loads than with 7 worker processes. These discrepancies can probably be explained by hyperthreading sharing the same physical CPU cores and by caching issues.
Memory looper
Processors nowadays have multiple levels of cache and no cache isolation, so memory accesses from one core can wreak havoc on other cores just by moving data from and to memory. So I wanted to see what happens with different scheduling policies when multiple instances of a simple memory-intensive program run in the background while John the Ripper is trying to do some real work.
The program shown in figure 2 is compiled without any optimizations. It reads one word from memory, adds 1 to it, stores the result to some other memory area, and then XORs the read memory area with the written memory area. So it does not do anything useful, but it reads and writes a lot of data in memory every time it gets some execution time.
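Figure 2 is not included in this text either, but a sketch of such a memory burner could look like the code below. The buffer size is my own assumption; the point is simply to touch far more data than fits in the CPU caches.

```c
/* Sketch of a memory burner: read a word, add 1, store it into another
 * buffer, then XOR the source buffer with the destination buffer.
 * The 64 MiB per-buffer size is an assumption chosen to be much larger
 * than the CPU caches. Compile without optimizations. */
#include <stddef.h>

#define BUF_WORDS ((64UL * 1024 * 1024) / sizeof(unsigned long))

static unsigned long src[BUF_WORDS];
static unsigned long dst[BUF_WORDS];

int main(void)
{
    for (;;) {
        /* read a word, add 1, store it into the other buffer */
        for (size_t i = 0; i < BUF_WORDS; i++)
            dst[i] = src[i] + 1;

        /* XOR the read buffer with the written buffer */
        for (size_t i = 0; i < BUF_WORDS; i++)
            src[i] ^= dst[i];
    }
    return 0;
}
```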
Tables 3 and 4 show the relative effect on the John the Ripper benchmark when various numbers of instances of the program shown in figure 2 are running at the same time. If you compare these numbers to the values shown in tables 1 and 2, where a program that only uses CPU cycles is running, the numbers for useful work can in some cases be around 10 percentage points lower. So there is apparently some cache contention going on with this benchmarking program, and the effect of the lower priority scheduling policies is not the same as could theoretically be expected just from the allocated time slices.
Table 3: 1 worker | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default priority | 100 | 61.7 | 32.6 | 21.8 | 16.6 | 13.3 | 11.2 | 9.8 |
nice 0 vs. sched batch | 100 | 60.2 | 32.6 | 22.7 | 16.2 | 12.9 | 10.9 | 10.2 |
nice 0 vs. nice 10 | 100 | 85.2 | 82.1 | 75.8 | 70.8 | 69.6 | 70.5 | 68.1 |
nice 0 vs. nice 19 | 100 | 85.4 | 83.0 | 79.7 | 81.7 | 78.2 | 81.9 | 84.7 |
nice 0 vs. nice 19 batch | 100 | 83.8 | 81.2 | 80.5 | 83.4 | 80.4 | 79.4 | 84.3 |
nice 0 vs. sched idle | 100 | 82.9 | 80.9 | 81.2 | 82.1 | 80.6 | 82.3 | 81.8 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 80.0 | 74.7 | 68.0 | 67.7 | 67.5 | 64.5 | 60.7 |
nice 10 vs. nice 19 batch | 100 | 80.8 | 74.1 | 67.6 | 67.1 | 67.9 | 65.5 | 63.3 |
nice 10 vs. sched idle | 100 | 83.9 | 81.6 | 79.2 | 77.0 | 75.9 | 76.8 | 77.1 |
— | ||||||||
nice 19 vs. sched idle | 100 | 77.9 | 69.7 | 66.9 | 66.3 | 64.6 | 63.4 | 58.1 |
Table 4: 7 workers | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 48.4 | 32.6 | 24.5 | 19.7 | 16.3 | 14.2 | 12.3 |
nice 0 vs. sched batch | 100 | 48.6 | 32.2 | 23.7 | 19.4 | 16.2 | 14.0 | 12.6 |
nice 0 vs. nice 10 | 100 | 89.3 | 81.5 | 74.6 | 69.2 | 64.6 | 60.4 | 56.3 |
nice 0 vs. nice 19 | 100 | 92.0 | 90.5 | 90.4 | 91.2 | 91.3 | 90.6 | 90.5 |
nice 0 vs. nice 19 batch | 100 | 91.5 | 91.8 | 92.5 | 91.8 | 91.9 | 92.1 | 91.9 |
nice 0 vs. sched idle | 100 | 92.1 | 91.9 | 91.9 | 91.5 | 92.0 | 92.4 | 92.1 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 85.4 | 77.3 | 70.4 | 63.3 | 58.3 | 54.2 | 50.3 |
nice 10 vs. nice 19 batch | 100 | 86.8 | 77.7 | 70.0 | 63.4 | 58.8 | 54.4 | 50.6 |
nice 10 vs. sched idle | 100 | 90.6 | 89.0 | 88.9 | 88.9 | 88.0 | 87.4 | 86.3 |
— | ||||||||
nice 19 vs. sched idle | 100 | 82.8 | 72.9 | 62.9 | 57.3 | 52.2 | 48.2 | 44.1 |
Conclusions
These are just the results of one specific benchmark with two specific workloads on one specific machine with 4 hyperthreaded CPU cores. They should nevertheless give you some kind of idea of how different CPU scheduling policies under Linux affect the load that you do not want to disturb when your machine is more or less overloaded. Clearly, when the background processes read and write a lot of memory, they have a bigger impact on the actual worker process than the allocated time slices alone would predict. But, unsurprisingly, a single user can still do quite a lot to control how much their long running CPU hogging processes affect the rest of the machine.
I did not investigate how much work the processes that are supposed to disturb the benchmark actually get done. In this case, `SCHED_BATCH` with a nice value of 19 could probably be the best scheduling policy if we want to get the most work done and at the same time avoid disturbing other users. Otherwise, it looks like the `SCHED_IDLE` policy, taken into use with the `chrt --idle 0` command, indeed has the lowest impact on other processes, as promised. This is especially true when considering processes started with lower nice values than the default one.