Almost all current systems all have more than one CPU/core -IPhone 4’s have 2 CPU and 3 GPU cores -Galaxy S3 has 4 cores Multiprocessor -More than one physical CPU -SMP: Symmetric multiprocessing, --Each CPU is identical to every other --Each has the same capabilities and privileges -Each CPU is plugged into system via its own slot/socket Multicore -More than one CPU in a single physical package -Multiple CPUs connect to system via a shared slot /socket -Currently most multicores are SMP --But this might change soon!

1.Multiprocessing and NUMA

2.What we sort of assumed so far… Northbridge connects CPU and memory to rest of system Memory controller implemented in Northbridge chipset Devices and CPU can access memory via requests to Northbridge CPU connects using a Front Side Bus

3.Modern Systems Almost all current systems all have more than one CPU/core IPhone 4’s have 2 CPU and 3 GPU cores Galaxy S3 has 4 cores Multiprocessor More than one physical CPU SMP: Symmetric multiprocessing, E ach CPU is identical to every other Each has the same capabilities and privileges Each CPU is plugged into system via its own slot/socket Multicore More than one CPU in a single physical package Multiple CPUs connect to system via a shared slot / socket Currently most multicores are SMP But this might change soon!

4.SMP Operation Each processor in system can perform the same tasks Execute same set of instructions Access memory Interact with devices Each proc. connects to system in same way Traditional approach: Bus Modern approach: Interconnect Interacting with the rest of the system (memory/devices) done via communication over the shared bus/interconnect Obviously this can easily lead to chaos Why we need synchronization

5.SMP architecture First approach to multiprocessing Just connect another CPU to the northbridge Most of these systems used a shared bus CPUs could communicate with each other and with the northbridge But, only one user at a time, so scalability was limited (bus contention)

6.Multicore architecture During the early/mid 2000s CPUs started to change dramatically Could no longer increase speeds exponentially But: transistor density was still increasing Only thing architects could do was add more computing elements Replicated entire CPUs inside the same processor die The standard architecture is just like SMP, but with only one CPU slot in the system

7.Multiprocessor-Multicores SMP with multicore CPUs Multiple processor slots in system Each slot hosts multiple CPU cores What does this mean for the OS? Mostly hidden by the hardware OS sees N cpus that are identical, so treats them the same way But the similarity does not always hold for memory More on that in a minute

8.The Future (?) Manycore CPUs are currently being developed This could be a game changer A local machine starts to look like a distributed system

9.What does this mean for the OS? Many more resources must be managed OS must ensure that all CPUs cooperate together Example: If two CPUs try to schedule the same process simultaneously How do we identify CPUs? Hardware must provide identification interface X86: Each CPU assigned a number at boot time ID tied to local APIC gateway for all inter-CPU communication

10.Programming models What do we do with all these CPUs? Actually we don’t really know yet… 6 cores are about as much as we can effectively use in a desktop environment Still waiting for the killer app Some ideas… Side core: Dedicate entire cores for a single task I/O core: Dedicate entire core to handle an I/O device GUI core: Dedicate entire core to handle GUI Fine grain parallelization of Apps Pretty difficult… How much parallelism is actually in an interactive task? Virtual Machines Run an entirely separate OS environment on dedicated cores

11.Dealing with devices Current I/O devices must generally be handled by a single core Device interrupts are delivered to only one core CPUs must coordinate access to the device controller But this is changing Basic approach: Dedicate a single core for I/O All I/O requests forwarded to one CPU core Cores queue up I/O requests that the I/O core then services Slightly more advanced approach I/O devices are balanced across cores E.g. 1 core handles network, another core handles disk Even more advanced approach I/O devices reassigned to cores that are using them Interrupts are routed to the core that is making the most I/O requests

12.Cross CPU C ommunication (Shared Memory) OS must still track state of entire system Global data structure updated by each core i.e. the system load avg is computed based on load avg across every core Traditional approach Single copy of data, protected by locks Bad scalability, every CPU constantly takes a global lock to update its own state Modern approach Replicate state across all CPUs/cores Each core updates its own local copy (so NO locks!) Contention only when state is read Global lock Is required, but reads are rare

13.Cross CPU Communication (Signals) System allows CPUs to explicitly signal each other Two approaches: notifications and cross-calls Almost always built on top of interrupts X86: Inter Processor Interrupts (IPIs) Notifications CPU is notified that “something” has happened No other information Mostly used to wakeup a remote CPU Cross Calls The target CPU jumps to a specified instruction Source CPU makes a function call that execs on target CPU Synchronous or asynchronous? Can be both, up to the programmer

14.CPU interconnects Mechanism by which CPUs communicate Old way: Front Side Bus (FSB) Slow with limited scalability With potentially 100s of CPUs in a system, a bus won’t work Modern Approach: Exploit HPC networking techniques Embed a true interconnect into the system Intel: QPI ( QuickPath Interconnects) AMD: HyperTransport Interconnects allow point to point communication Multiple messages can be sent in parallel if they don’t intersect

15.Interconnects and Memory Interconnects allow for complex message types Can interface directly with memory Memory controllers can be moved onto CPU Memory references no longer have to go through Northbridge Definition of memory has become… less concrete PCIe devices can handle memory operations NVRAM and DRAM can exist in same address space Is it a disk or is it main memory?

16.Multiprocessing and memory Shared memory is by far the most popular approach to multiprocessing Each CPU can access all of a system’s memory Conflicting accesses resolved via synchronization (locks) Benefits Easy to program, allows direct communication Disadvantages Limits scalability and performance Requires more advanced caching behavior Systems contain a cache hierarchy with different scopes

17.Multiprocessor caching On multicore CPUs some (but not all) caches are shared Each core has its own private L1 cache L2 cache can either be private to a core, or shared between cores L3 cache almost always shared between cores Caches not shared across physical CPU dies What if two CPUs update the same memory location stored in their L1 caches? Shared memory systems require an absolute ordering of operations Cache coherency ensures this ordering Implemented in hardware to ensure that memory updates are propagated throughout the entire system Utilizes CPU interconnect for communication

18.Memory Issues As core count increases shared memory becomes harder Increasingly difficult for HW to provide shared memory behavior to all CPU cores M anycore CPUs: N eed to cross other cores to access memory Some cores are closer to memory and thus faster Memory is slow or fast depending on which CPU is accessing it This is called Non Uniform Memory Access (NUMA)

19.Dell R710

20.Non Uniform Memory Access Memory is organized in a non uniform manner Its closer to some CPUs than others Far away memory is slower than close memory Not required to be cache coherent, but usually is ccNUMA : Cache Coherent NUMA Typical organization is to divide system into “zones” A zone usually contains a CPU socket/slot and a portion of the system memory Memory is “local” if its in the CPU’s zone Fast to access

21.NUMA cont’d Accessing memory in the local zone does not impact performance in other zones Interconnect is point to point Looks a lot like a distributed shared memory (DSM) system… Local operations are fast, but if you go to another zone you take a performance hit DSM died in the 90s because it couldn’t scale and was hard to program Unclear whether NUMA will share that same fate

22.Dealing with NUMA Programming a NUMA system is hard Ultimately it’s a failed abstraction Goal: Make all memory ops the same But they aren’t, because some are slower AND the abstraction hides the details Result: Very few people explicitly design an application with NUMA support Those that do are generally in the HPC community So its up to the user and the OS to deal with it But mostly people just ignore it…

23.Dealing with NUMA (users) Users can query the system for the NUMA layout [ jarusl@cambria ~]$ numactl --hardware available: 2 nodes (0-1) node 0 cpus : 0 2 3 4 5 6 node 0 size: 8182 MB node 0 free: 7215 MB node 1 cpus : 1 7 8 9 10 11 node 1 size: 8192 MB node 1 free: 7475 MB node distances: node 0 1 0: 10 16 1: 16 10

24.Dealing with NUMA (users) Users can force OS to confine a process to a specific zone Restricts what memory a process gets allocated Restricts which CPUs process can run on Per process via command line ‘ numactl -- physcpubind = < cpus > < cmd >’ Groups of processes using scheduling domains Linux: cgroups and containers

25.Dealing with NUMA (OS) An OS can deal with NUMA systems by restricting its own behavior Force processes to always execute in a zone, and always allocate memory from the same zone This makes balancing resource utilization tricky However, nothing prevents an application from forcing bad behavior E.g. two applications in separate zones want to communicate using shared memory…

26.Managing NUMA (OS) How can OS know what zone a process should run in? Needs to know what the process behavior will be OS cannot know the future, but it can predict it based on past events Recent OS X and Windows versions profile application behavior When should a process switch zones? If it is communicating with a process in another zone If the system load is currently imbalanced in one zone If we can save power by shutting down a zone’s CPUs How should we layout process memory? Keep all memory in a single zone, or just the working set?

27.Multiprocessing and Power More cores require more energy (and heat) Managing the energy consumption of a system becoming critically important Modern systems cannot fully utilize all resources for very long Approaches Slow down processors periodically CPUs no longer identical (some faster, some slower) Shutdown entire cores System dynamically powers down CPUs OS must deal with processors coming and going This doesn’t really match the SMP model anymore

28.Heterogeneous CPUs Systems are beginning to look much different The SMP model is on its way out Heterogeneous computing resources across system Core specialization: CPU resources tailored to specific workloads GPUs, l ightweight cores, I/O cores, stream processors OS must manage these dynamically What to schedule where and when? How should the OS approach this issue? Active area of current research