Being a good CPU neighbor
Computational tasks are often roughly divided into real-time and batch processing tasks. Sometimes you want to run tasks that take a large amount of computational resources and get the results as fast as possible, but you don’t really care exactly when the results come out.
In larger organizations there are shared computers that usually have a lot of CPU power and memory available and that are mostly idle. But they are also used by other users, who will be quite annoyed if they suffer from long delays to key presses or similar issues because your batch processing tasks take away all the resources the system has available. Fortunately *nix systems provide several mechanisms, available to non-root users, to avoid these kinds of issues.
These issues can also arise in situations where you have a system that can only allocate static resource requirements to tasks, like Jenkins. As an example, you might have some test jobs that need to finish within a certain time limit mixed with compilation jobs that should finish as soon as possible, but that can yield some resources to these higher priority test jobs, as they don’t have as strict timing requirements. And you usually don’t want to limit the resources that compilation jobs can use when they run on an otherwise idle machine.
Here I’m focusing specifically on process priority settings and other CPU scheduling settings provided by Linux, as it currently happens to be probably the most used operating system kernel for multi-user systems. These settings affect how much CPU time a process gets relative to other processes and are especially useful on shared, overloaded systems where there are more runnable processes than there are CPU cores available.
Different process scheduling priorities in Linux
Linux and POSIX interfaces provide different scheduling policies that define how the Linux scheduler allocates CPU time to a process. There are two main groups of scheduling policies for processes: real-time policies and normal scheduling policies. Real-time policies are usually accessible only to the root user and are not of interest here, as they normally can not be used by regular users on shared systems.
At the time of writing this article there are three normal scheduling policies available to normal users for process priorities:
`SCHED_OTHER`
: the default scheduling policy in Linux, with the default dynamic priority of 0.

`SCHED_BATCH`
: a policy for scheduling batch processes. This schedules processes in a similar fashion as `SCHED_OTHER` and is affected by the dynamic priority of a process. It makes the scheduler assume that the process is CPU intensive and makes it harder to wake up from sleep.

`SCHED_IDLE`
: the lowest priority scheduling policy. Not affected by dynamic priorities. This equals a nice level priority of 20.
The standard *nix way to change process priorities is to use the `nice` command. By default in Linux, a process can get nice values from -20 to 19, where the default nice value for a process is 0. Values 0–19 are available to a normal user and -20–-1 are available to the root user. When a process is started with the `nice` command without an explicit adjustment, it gets a nice value of 10. The lowest priority nice value of 19 gives a process 5% of CPU time with the default scheduler in Linux.
The `chrt` command can be used to give a process access to the `SCHED_BATCH` and `SCHED_IDLE` scheduling policies. `chrt` can be used in combination with the `nice` command to lower the priority of the `SCHED_BATCH` scheduling policy further. And using the `SCHED_IDLE` policy (equal to nice level 20) should give around 80% of the CPU time that nice level 19 gets, as the scheduler weight of each nice level is about 1.25 times that of the next lower priority level.
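These policies can also be selected from within a program through the sched_setscheduler() system call, which is roughly what `chrt` does under the hood. Below is a minimal sketch of a process putting itself into `SCHED_IDLE` before starting heavy work; it is not taken from the benchmark code used here, and the work itself is left as a placeholder.

```c
/* Minimal sketch: put the calling process into SCHED_IDLE, roughly what
 * "chrt --idle 0 <command>" arranges before executing the command. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* SCHED_IDLE and SCHED_BATCH only accept a static priority of 0. */
    struct sched_param param = { .sched_priority = 0 };

    /* pid 0 means "the calling process"; SCHED_BATCH would be set the
     * same way if the process should still be affected by nice values. */
    if (sched_setscheduler(0, SCHED_IDLE, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... the actual low-priority batch work would run here ... */
    return 0;
}
```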
Benchmarking with different competing workloads
I wanted to benchmark the effect of different scheduling policies on the work done by a specific benchmark program. I used a system with a Linux 3.17.0 kernel, an Intel Core i7-3770K processor running at 3.5 GHz with hyperthreading and frequency scaling turned on, and 32 gigabytes of RAM. I didn’t manage to make the CPU frequency constant, so results vary a little between runs. I used John the Ripper’s bcrypt benchmark, started with the following command, as the test program for useful work:
$ /usr/sbin/john --format=bcrypt -test
I benchmarked 1 and 7 instances of John the Ripper for 9 iterations (9 * 5 = 45 seconds) with varying numbers of non-benchmarking processes running at the same time. 1 benchmarking process should by default not suffer any cache contention from other benchmarking processes, while 7 benchmarking processes saturate all but 1 logical CPU core and can have cache and computational unit contention with each other.
Non-benchmarking processes were divided into different categories, and 0–7 times the number of CPU cores of them were started to make them fight for CPU time with the benchmarking processes. The results of running different numbers of non-benchmarking processes with different scheduling policies can be found in tables 1, 2, 3, and 4. The scheduling policies visible in those tables are:
- default: the default scheduling policy. Command started without any wrappers. Corresponds to `SCHED_OTHER` with a nice value of 0.
- chrt-batch: the default `SCHED_BATCH` scheduling policy. Command started with the `chrt --batch 0` prefix.
- nice 10: `SCHED_OTHER` scheduling policy with the nice command’s default nice value of 10. Command started with the `nice` prefix.
- nice 19: `SCHED_OTHER` scheduling policy with a nice value of 19. Command started with the `nice -n 19` prefix. This should theoretically take around 5 % of CPU time away from the useful work.
- nice 19 batch: `SCHED_BATCH` scheduling policy with a nice value of 19. Command started with the `nice -n 19 chrt --batch 0` prefix.
- sched idle: `SCHED_IDLE` scheduling policy. Command started with the `chrt --idle 0` prefix. This should theoretically take around 80 % of the CPU time of nice level 19 away from the useful work.
The results in those tables reflect the relative average percentage of work done compared to the situation where no additional processes disturb the benchmark. They show only the average value and do not show, for example, the variance, which could be useful for determining how close those values are to each other. You can download the raw benchmarking data and the source code used to generate and analyze this data from the following links:
CPU looper
The CPU looper application consists purely of code that causes CPU load. Its main purpose is just to test the scheduling policies without forcing CPU cache misses. Of course there will be context switches that can replace data cache entries related to process state and instruction cache entries of one process with those of another, but there won’t be any extra intentional cache flushing happening here.
The actual CPU looper application consists of the code shown in figure 1, compiled without any optimizations:
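Figure 1 itself is not reproduced in this text, but a minimal sketch of such a pure CPU burner could look like the following. This is my own reconstruction, not necessarily the exact code used in the benchmarks.

```c
/* Sketch of a pure CPU burner: spins forever without touching memory
 * beyond a single counter, so it competes mostly for CPU time rather
 * than for cache or memory bandwidth. Compile without optimizations
 * so the loop is not optimized away. */
int main(void)
{
    volatile unsigned long counter = 0;

    for (;;) {
        counter++;
    }
    return 0;
}
```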
The results of starting 0–7 times the number of CPU cores of CPU looper processes with the default priority and then running the benchmark with 1 or 7 benchmark workers can be found in tables 1 and 2.
Table 1: 1 worker | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 64.6 | 36.4 | 25.5 | 18.2 | 15.4 | 12.5 | 11.0 |
nice 0 vs. sched batch | 100 | 66.5 | 37.0 | 24.0 | 16.3 | 14.8 | 12.2 | 10.9 |
nice 0 vs. nice 10 | 100 | 93.3 | 90.2 | 82.7 | 75.4 | 75.2 | 75.0 | 74.9 |
nice 0 vs. nice 19 | 100 | 92.9 | 90.4 | 91.0 | 91.8 | 89.8 | 89.4 | 90.9 |
nice 0 vs. nice 19 batch | 100 | 94.0 | 89.2 | 91.4 | 89.8 | 86.4 | 89.4 | 90.7 |
nice 0 vs. sched idle | 100 | 95.0 | 91.3 | 92.2 | 92.0 | 92.6 | 91.0 | 92.4 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 80.0 | 74.7 | 68.0 | 67.7 | 67.5 | 64.5 | 60.7 |
nice 10 vs. nice 19 batch | 100 | 92.5 | 83.8 | 75.9 | 75.0 | 74.3 | 74.1 | 69.8 |
nice 10 vs. sched idle | 100 | 89.8 | 89.7 | 90.9 | 89.0 | 87.8 | 87.4 | 87.2 |
— | ||||||||
nice 19 vs. sched idle | 100 | 77.9 | 69.7 | 66.9 | 66.3 | 64.6 | 63.4 | 58.1 |
Table 2: 7 workers | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 49.8 | 33.6 | 24.6 | 19.4 | 16.4 | 13.8 | 12.2 |
nice 0 vs. sched batch | 100 | 51.1 | 34.0 | 25.1 | 20.0 | 16.6 | 14.2 | 12.4 |
nice 0 vs. nice 10 | 100 | 92.9 | 87.2 | 80.1 | 75.0 | 71.0 | 66.3 | 61.1 |
nice 0 vs. nice 19 | 100 | 94.1 | 94.9 | 95.4 | 95.2 | 96.2 | 95.7 | 96.0 |
nice 0 vs. nice 19 batch | 100 | 95.7 | 95.2 | 95.7 | 95.5 | 95.2 | 94.6 | 95.4 |
nice 0 vs. sched idle | 100 | 96.3 | 95.7 | 96.2 | 95.6 | 96.7 | 95.4 | 96.7 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 93.6 | 84.9 | 78.0 | 71.4 | 65.7 | 60.4 | 56.2 |
nice 10 vs. nice 19 batch | 100 | 92.4 | 84.1 | 77.6 | 69.9 | 65.6 | 60.7 | 56.4 |
nice 10 vs. sched idle | 100 | 95.5 | 94.9 | 94.5 | 93.9 | 94.3 | 93.7 | 91.9 |
— | ||||||||
nice 19 vs. sched idle | 100 | 91.4 | 80.9 | 71.4 | 64.3 | 58.3 | 52.5 | 48.5 |
Tables 1 and 2 show that the effect is not consistent between different scheduling policies: when there is 1 benchmarking worker running, it suffers more from the lower priority background processes than when there are 7 benchmarking workers running. On the other hand, with higher priority scheduling policies for the background processes, the average amount of work done by 1 worker stays higher under light loads than with 7 worker processes. These discrepancies can probably be explained by hyperthreading sharing the same physical CPU cores and by caching issues.
Memory looper
Processors nowadays have multiple levels of cache and no cache isolation, so memory accesses from one core can wreak havoc on other cores just by moving data from and to memory. So I wanted to see what happens with different scheduling policies when multiple instances of a simple memory-intensive program run in the background while John the Ripper is trying to do some real work.
The program shown in figure 2 is compiled without any optimizations. It reads one word from memory, adds 1 to it, stores the result to some other memory area, and then XORs the read memory area with the written memory area. So it does not do anything useful, but it reads and writes a lot of data in memory every time it gets some execution time.
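Figure 2 is not included in this text either, but a sketch of such a memory burner could look like the code below. The buffer size is my own assumption; the point is simply to touch far more data than fits in the CPU caches.

```c
/* Sketch of a memory burner: read a word, add 1, store it into another
 * buffer, then XOR the source buffer with the destination buffer.
 * The 64 MiB per-buffer size is an assumption chosen to be much larger
 * than the CPU caches. Compile without optimizations. */
#include <stddef.h>

#define BUF_WORDS ((64UL * 1024 * 1024) / sizeof(unsigned long))

static unsigned long src[BUF_WORDS];
static unsigned long dst[BUF_WORDS];

int main(void)
{
    for (;;) {
        /* read a word, add 1, store it into the other buffer */
        for (size_t i = 0; i < BUF_WORDS; i++)
            dst[i] = src[i] + 1;

        /* XOR the read buffer with the written buffer */
        for (size_t i = 0; i < BUF_WORDS; i++)
            src[i] ^= dst[i];
    }
    return 0;
}
```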
Tables 3 and 4 show the relative effect on the John the Ripper benchmark when various numbers of instances of the program shown in figure 2 are running at the same time. If you compare these numbers to the values shown in tables 1 and 2, where a program that only uses CPU cycles is running, the numbers for useful work can in some cases be around 10 percentage points lower. So there is apparently some cache contention going on with this benchmarking program, and the effect of the lower priority scheduling policies is not the same as could theoretically be expected just from the allocated time slices.
Table 3: 1 worker | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default priority | 100 | 61.7 | 32.6 | 21.8 | 16.6 | 13.3 | 11.2 | 9.8 |
nice 0 vs. sched batch | 100 | 60.2 | 32.6 | 22.7 | 16.2 | 12.9 | 10.9 | 10.2 |
nice 0 vs. nice 10 | 100 | 85.2 | 82.1 | 75.8 | 70.8 | 69.6 | 70.5 | 68.1 |
nice 0 vs. nice 19 | 100 | 85.4 | 83.0 | 79.7 | 81.7 | 78.2 | 81.9 | 84.7 |
nice 0 vs. nice 19 batch | 100 | 83.8 | 81.2 | 80.5 | 83.4 | 80.4 | 79.4 | 84.3 |
nice 0 vs. sched idle | 100 | 82.9 | 80.9 | 81.2 | 82.1 | 80.6 | 82.3 | 81.8 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 80.0 | 74.7 | 68.0 | 67.7 | 67.5 | 64.5 | 60.7 |
nice 10 vs. nice 19 batch | 100 | 80.8 | 74.1 | 67.6 | 67.1 | 67.9 | 65.5 | 63.3 |
nice 10 vs. sched idle | 100 | 83.9 | 81.6 | 79.2 | 77.0 | 75.9 | 76.8 | 77.1 |
— | ||||||||
nice 19 vs. sched idle | 100 | 77.9 | 69.7 | 66.9 | 66.3 | 64.6 | 63.4 | 58.1 |
Table 4: 7 workers | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
default | 100 | 48.4 | 32.6 | 24.5 | 19.7 | 16.3 | 14.2 | 12.3 |
nice 0 vs. sched batch | 100 | 48.6 | 32.2 | 23.7 | 19.4 | 16.2 | 14.0 | 12.6 |
nice 0 vs. nice 10 | 100 | 89.3 | 81.5 | 74.6 | 69.2 | 64.6 | 60.4 | 56.3 |
nice 0 vs. nice 19 | 100 | 92.0 | 90.5 | 90.4 | 91.2 | 91.3 | 90.6 | 90.5 |
nice 0 vs. nice 19 batch | 100 | 91.5 | 91.8 | 92.5 | 91.8 | 91.9 | 92.1 | 91.9 |
nice 0 vs. sched idle | 100 | 92.1 | 91.9 | 91.9 | 91.5 | 92.0 | 92.4 | 92.1 |
— | ||||||||
nice 10 vs. nice 19 | 100 | 85.4 | 77.3 | 70.4 | 63.3 | 58.3 | 54.2 | 50.3 |
nice 10 vs. nice 19 batch | 100 | 86.8 | 77.7 | 70.0 | 63.4 | 58.8 | 54.4 | 50.6 |
nice 10 vs. sched idle | 100 | 90.6 | 89.0 | 88.9 | 88.9 | 88.0 | 87.4 | 86.3 |
— | ||||||||
nice 19 vs. sched idle | 100 | 82.8 | 72.9 | 62.9 | 57.3 | 52.2 | 48.2 | 44.1 |
Conclusions
These are just the results of one specific benchmark with two specific workloads on one specific machine with 4 hyperthreaded CPU cores. They should nevertheless give you some kind of idea of how different CPU scheduling policies under Linux affect the load that you do not want to disturb when your machine is more or less overloaded. Clearly, when the background processes read and write a lot of memory, they have a bigger impact on the actual worker process than the allocated time slices alone would predict. But, unsurprisingly, a single user can still do quite a lot to control how much their long running CPU hogging processes affect the rest of the machine.
I did not investigate how much work the processes that are supposed to disturb the benchmark actually get done. In this case, `SCHED_BATCH` with a nice value of 19 could probably be the best scheduling policy if we want to get the most work done and at the same time avoid disturbing other users. Otherwise, it looks like the `SCHED_IDLE` policy, taken into use with the `chrt --idle 0` command, indeed has the lowest impact on other processes, as promised. This is especially true when considering processes started with lower nice values than the default one.