Core scheduling lands in 5.14

By Jonathan Corbet
July 1, 2021
The core scheduling feature has been under discussion for over three years. For those who need it, the wait is over at last; core scheduling was merged for the 5.14 kernel release. Now that this work has reached a (presumably) final form, a look at why this feature makes sense and how it works is warranted. Core scheduling is not for everybody, but it may prove to be quite useful for some user communities.

Simultaneous multithreading (SMT, or "hyperthreading") is a hardware feature that implements two or more threads of execution in a single processor, essentially causing one CPU to look like a set of "sibling" CPUs. When one sibling is executing, the other must wait. SMT is useful because CPUs often go idle while waiting for events — usually the arrival of data from memory. While one CPU waits, the other can be executing. SMT does not result in a performance gain for all workloads, but it is a significant improvement for most.

SMT siblings share almost all of the hardware in the CPU, including the many caches that CPUs maintain. That opens up the possibility that one CPU could extract data from the other by watching for visible changes in the caches; the Spectre class of hardware vulnerabilities has made this problem far worse, and there is little to be done about it. About the only way to safely run processes that don't trust each other (with current kernels) is to disable SMT entirely; that is a prospect that makes a lot of people, cloud-computing providers in particular, distinctly grumpy.

While one might argue that cloud-computing providers are usually grumpy anyway, there is still value in anything that might improve their mood. One possibility would be a way to allow them to enable SMT on their systems without opening up the possibility that their customers may use it to attack each other; that could be done by ensuring that mutually distrusting processes do not run simultaneously in siblings of the same CPU core. Cloud customers often have numerous processes running; spamming Internet users at scale requires a lot of parallel activity, after all. If those processes can be segregated so that all siblings of any given core run processes from the same customer, we can be spared the gruesome prospect of one spammer stealing another's target list — or somebody else's private keys.

Core scheduling can provide this segregation. In abstract terms, each process is assigned a "cookie" that identifies it in some way; one approach might be to give each user a unique cookie. The scheduler then enforces a regime where processes can share an SMT core only if they have the same cookie value — only if they trust each other, in other words.

More specifically, core scheduling is managed with the prctl() system call, which is defined generically as:

    int prctl(int option, unsigned long arg2, unsigned long arg3,
              unsigned long arg4, unsigned long arg5);

For core-scheduling operations, option is PR_SCHED_CORE, and the rest of the arguments are defined this way:

    int prctl(PR_SCHED_CORE, int cs_command, pid_t pid, enum pid_type type,
              unsigned long *cookie);

There are four possible operations that can be selected with cs_command:

  • PR_SCHED_CORE_CREATE causes the kernel to create a new cookie value and assign it to the process identified by pid. The type argument controls how widely spread this assignment is; PIDTYPE_PID only changes the identified process, for example, while PIDTYPE_TGID assigns the cookie to the entire thread group. The cookie argument must be NULL.
  • PR_SCHED_CORE_GET retrieves the cookie value for pid, storing it in cookie. Note that there is not much that a user-space process can actually do with a cookie value; its utility is limited to checking whether two processes have the same cookie.
  • PR_SCHED_CORE_SHARE_TO assigns the calling process's cookie value to pid (using type to control the scope as described above).
  • PR_SCHED_CORE_SHARE_FROM fetches the cookie from pid and assigns it to the calling process.

Naturally, a process cannot just fetch and assign cookies at will; the usual "can this process call ptrace() on the target" test applies. It is also not possible to generate cookie values in user space, a restriction that is necessary to ensure that unrelated processes get unique cookie values. By only allowing cookie values to propagate between processes that already have a degree of mutual trust, the kernel prevents a hostile process from setting its own cookie to match that of a target process.
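
To make the interface concrete, here is a minimal usage sketch (it is not from the kernel documentation, and error handling is kept to a bare minimum). The PR_SCHED_CORE_* constants are provided by <linux/prctl.h> on 5.14-era kernels; the fallback definitions and the literal pid_type values (0 for PIDTYPE_PID, 1 for PIDTYPE_TGID) are assumptions based on the description above.

    /*
     * Minimal sketch: create a core-scheduling cookie covering this whole
     * thread group, read it back, and fork a worker that starts out with
     * the same cookie.  Not production code; error handling is minimal.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>

    #ifndef PR_SCHED_CORE                    /* fallback for older headers */
    #define PR_SCHED_CORE            62
    #define PR_SCHED_CORE_GET         0
    #define PR_SCHED_CORE_CREATE      1
    #define PR_SCHED_CORE_SHARE_TO    2
    #define PR_SCHED_CORE_SHARE_FROM  3
    #endif

    int main(void)
    {
        unsigned long cookie = 0;

        /* New cookie for the whole thread group (1 = PIDTYPE_TGID). */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(), 1, 0))
            perror("PR_SCHED_CORE_CREATE");

        /* Read it back (0 = PIDTYPE_PID); the value is opaque and only
           useful for comparing against other tasks' cookies. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, getpid(), 0,
                  (unsigned long)&cookie))
            perror("PR_SCHED_CORE_GET");
        printf("cookie: %#lx\n", cookie);

        /* A child forked now should start with the same cookie, so it can
           share an SMT core with its parent but not with unrelated tasks. */
        if (fork() == 0) {
            execlp("sleep", "sleep", "5", (char *)NULL);
            _exit(1);
        }
        wait(NULL);
        return 0;
    }

Note that, on a kernel built without core scheduling or a machine without SMT, these calls should simply fail rather than silently doing nothing, so callers need to be prepared for that.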

Whenever a CPU enters the scheduler, the highest-priority task will be picked to run in the usual way. If core scheduling is in use, though, the next step will be to send an inter-processor interrupt to the sibling CPUs, each of which will respond by checking the newly scheduled process's cookie value against the value for the process running locally. If need be, the interrupted processor(s) will switch to running a process with an equal cookie, even if the currently running process has a higher priority. If no compatible process exists, the processor will simply go idle until the situation changes. The scheduler will migrate processes between cores to prevent the forced idling if possible.
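
In rough pseudocode, with entirely hypothetical helper names (the real logic lives in kernel/sched/core.c and is considerably more involved), the choice made by each sibling looks something like this:

    /*
     * Rough pseudocode only; the helpers are hypothetical and this is not
     * the kernel's actual implementation (see kernel/sched/core.c).
     */
    struct task_struct *pick_task_for_sibling(struct rq *rq,
                                              unsigned long core_cookie)
    {
        struct task_struct *t;

        /* Walk this sibling's runnable tasks in priority order... */
        for (t = highest_priority_task(rq); t; t = next_task(rq, t)) {
            /* ...and run the best one whose cookie matches the core's. */
            if (t->core_cookie == core_cookie)
                return t;
        }

        /* Nothing compatible is runnable: force this sibling idle. */
        return idle_task(rq);
    }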

Early versions of the core-scheduling code had a significant throughput cost for the system as a whole; indeed, it was sometimes worse than just disabling SMT altogether, which rather defeated the purpose. The code has been through a number of revisions since then, though, and apparently performs better now. There will always be a cost, though, to a mechanism that will occasionally force processors to go idle when runnable processes exist. For that reason core scheduling, as Linus Torvalds put it, "makes little sense to most people". It can be beneficial, though, in situations where the only alternative is to turn off SMT completely.

While the security use case is driving the development of core scheduling, there are other use cases as well. For example, systems running realtime processes usually must have SMT disabled; you cannot make any response-time guarantees when the CPU has to compete with a sibling for the hardware. Core scheduling can ensure that realtime processes get a core to themselves while allowing the rest of the system to use SMT. There are other situations where the ability to control the mixing of processes on the same core can bring benefits as well.
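
As a hypothetical sketch of that use (the constants and scope values are the same assumptions as in the earlier example), a realtime thread could give itself a unique cookie with thread-level scope, leaving the process's other threads free to mix with the rest of the system:

    /*
     * Hypothetical sketch: give only the calling (realtime) thread its own
     * cookie.  The thread-level scope (0 = PIDTYPE_PID) leaves the rest of
     * the thread group with its existing cookie, if any.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    #ifndef PR_SCHED_CORE                    /* fallback for older headers */
    #define PR_SCHED_CORE            62
    #define PR_SCHED_CORE_CREATE      1
    #endif

    static void isolate_this_thread(void)
    {
        pid_t tid = (pid_t)syscall(SYS_gettid);

        /* A cookie nobody else holds means no other task will be allowed
           to run on this thread's SMT siblings while it is running. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tid, 0, 0))
            perror("PR_SCHED_CORE_CREATE (thread scope)");
    }

The function would be called by the realtime thread itself; from then on, the forced idling described above applies to its siblings whenever it runs.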

So, while core scheduling is probably not useful for most Linux users, there are user communities that will be glad that this feature has finally found its way into the mainline. Adding this sort of complication to a central, performance-critical component like the scheduler was never going to be easy but, where there is sufficient determination, a way can be found. The developers involved have certainly earned a cookie for pushing this work to a successful completion.


Core scheduling lands in 5.14

Posted Jul 1, 2021 19:05 UTC (Thu) by bluca (subscriber, #118303) [Link]

Is there any particular reason why this cannot be set at the cgroup level, rather than having yet-another-knob userspace has to deal with?

Control groups

Posted Jul 1, 2021 19:07 UTC (Thu) by corbet (editor, #1) [Link]

I suspect they didn't want to force the use of control groups, but that's a guess.

Regardless, it's a knob to tweak either way, so I don't think that would change much.

Control groups

Posted Jul 1, 2021 20:23 UTC (Thu) by walters (subscriber, #7396) [Link]

See the linked article:

> The patch set has seen a fair amount of discussion. Greg Kerr, representing Chrome OS, questioned the control-group interface. Making changes to control groups is a privileged operation, but he would like for unprivileged processes to be able to set their own cookies. To that end, he proposed an API based on prctl() calls. Zijlstra replied that the interface issues can be worked out later; first it's necessary to get everything working as desired.

Personally I find this surprising because systemd already supports delegating cgroup access: https://systemd.io/CGROUP_DELEGATION/.

Control groups

Posted Jul 2, 2021 3:44 UTC (Fri) by Gaelan (subscriber, #145108) [Link]

It isn't very clear from a quick google, but I don't think Chrome OS uses systemd.

Control groups

Posted Jul 2, 2021 4:06 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link]

> It isn't very clear from a quick google, but I don't think Chrome OS uses systemd.

Uses Upstart. They hired the primary developer and moved to it years back.

https://www.chromium.org/chromium-os/chromiumos-design-do...

Control groups

Posted Jul 2, 2021 4:47 UTC (Fri) by re:fi.64 (subscriber, #132628) [Link]

Worth noting that I believe Container Linux images are based on Chrome OS but do use systemd.

Control groups

Posted Jul 2, 2021 4:53 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

There are patches (overlays) for Chromium OS that use systemd. Moreover, even the stock Chromium OS uses journald with upstart.

Control groups

Posted Jul 2, 2021 17:44 UTC (Fri) by bnorris (subscriber, #92090) [Link]

> even the stock Chromium OS uses journald with upstart.

No longer: https://crbug.com/1066706

Core scheduling lands in 5.14

Posted Jul 2, 2021 3:13 UTC (Fri) by willy (subscriber, #9762) [Link]

> [SMT] is a hardware feature that implements two or more threads of execution in a single processor, essentially causing one CPU to look like a set of "sibling" CPUs. When one sibling is executing, the other must wait.

I suspect Our Grumpy Editor already knows this, but that's a simplified and relatively low-performing implementation (I believe Itanium Montecito did this; it was called SoEMT).

What most implementations do is issue micro-operations to execution units ("ports" in Intel terminology), regardless of which thread the uOps come from.

This is what the Portsmash vulnerability exploits; by detecting which ports are currently busy, a thread can deduce which operations are being executed by the other thread.

Core scheduling lands in 5.14

Posted Jul 3, 2021 11:23 UTC (Sat) by Sesse (subscriber, #53779) [Link]

Yes. The best explanation I've heard came from an Intel engineer, saying something along the lines of: “Modern hardware has so many execution ports that in reality, the only way to use it fully is to write spaghetti code—code that does two unrelated things at the same time.” So HT/SMT is a way to feed the hardware with two execution streams at the same time, that don't have dependencies on each other's results.

Core scheduling lands in 5.14

Posted Jul 5, 2021 18:53 UTC (Mon) by fratti (guest, #105722) [Link]

I believe the "extreme" version of this is what modern GPUs do: fine-grained multithreading. Instead of doing things like branch prediction to keep the pipeline saturated, they simply execute a different thread's instruction with each clock cycle, such that each thread only ever has one instruction at most in the pipeline. Naturally this trades off single-threaded performance and requires keeping as many register files as one has pipeline stages, but it's a pretty elegant solution for maximising throughput if one really does have that many independent threads.

Core scheduling lands in 5.14

Posted Jul 2, 2021 10:15 UTC (Fri) by roc (subscriber, #30627) [Link]

Normally in Linux the word "process" means, technically, "thread group". So it's not clear to me what PIDTYPE_PID actually does. Is it actually setting the cookie for the current task a.k.a. thread?

Core scheduling lands in 5.14

Posted Jul 6, 2021 4:20 UTC (Tue) by ncm (guest, #165) [Link]

Usually, looking out from the kernel, a thread is a process is a thread, just with varying degrees of memory-map sharing. Thread groups as processes is largely a user-space notion; and those processes collected into cgroups is another. It is easy to see why the kernel prefers to avoid the issue, and try to treat them all as an undifferentiated pile of threads, wherever it can get away with that.

Scheduling, though, is a place where it often can't, because users want what they think of as fairness.

Core scheduling lands in 5.14

Posted Jul 2, 2021 12:26 UTC (Fri) by nix (subscriber, #2304) [Link]

It seems to me this feature could be useful for a subset of processes on any desktop system, because while most processes on a desktop shouldn't be affected by this stuff, web browsers in particular routinely run untrusted code. They could use this to assign distinct cookies to processes handling mutually untrusted code from distinct security domains (whatever those might be in the present-day state of the web). If you wanted to minimize performance impact, it seems to me you could allow a process to have *no* cookie (perhaps by having all processes share the same cookie until explicitly assigned), and prohibit uncookied processes from running on the same core as any processes with cookies. The only intrinsic, necessary performance impact then would be to stop most things sharing a core with a web browser running potentially untrusted code, which is exactly what you want in this case.

Thread or Process?

Posted Jul 2, 2021 18:40 UTC (Fri) by glenn (subscriber, #102223) [Link]

> PIDTYPE_PID only changes the identified process...
> ...systems running realtime processes usually must have SMT disabled; you cannot make any response-time guarantees when the CPU has to compete with a sibling for the hardware. Core scheduling can ensure that realtime processes get a core to themselves while allowing the rest of the system to use SMT.

Is this control at the process level, or can threads within a process be assigned unique cookies? From the realtime perspective, it's common to have a realtime thread that offloads I/O to a non-realtime I/O thread within the same process (e.g., for data logging). One might want to ensure that the realtime thread does not experience interference from SMT, but not care about the non-realtime thread.

Thread or Process?

Posted Jul 7, 2021 5:22 UTC (Wed) by ncm (guest, #165) [Link]

It reads like threads can have their own cookies, even shared with certain threads in other processes, if you like.

Thread or Process?

Posted Jul 7, 2021 10:17 UTC (Wed) by immibis (subscriber, #105511) [Link]

> The type argument controls how widely spread this assignment is; PIDTYPE_PID only changes the identified process, for example, while PIDTYPE_TGID assigns the cookie to the entire thread group. The cookie argument must be NULL.

Note that what the kernel calls a "process" or "task" is what us user-space plebs would call a "thread", and what they call a "thread group" is what we would call a "process".

