Multi-core architectures




                    1
Single-core computer




                 2
Single-core CPU chip
                     [figure: CPU chip containing the single core]




                 3
Multi-core architectures
• This lecture is about a new trend in
  computer architecture:
  Replicate multiple processor cores on a
  single die.
    Core 1            Core 2   Core 3   Core 4




Multi-core CPU chip                       4
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)


   [figure: four cores (core 1, core 2, core 3, core 4) side by side on one chip]



                                  5
The cores run in parallel
    [figure: thread 1, thread 2, thread 3, and thread 4, each running on its own core (core 1 through core 4)]




                                             6
Within each core, threads are time-sliced
       (just like on a uniprocessor)
     [figure: several threads time-sliced on each of the four cores]




                                           7
Interaction with the
           Operating System
• OS perceives each core as a separate processor

• OS scheduler maps threads/processes
  to different cores

• Most major operating systems support multi-core today:
  Windows, Linux, Mac OS X, …



                                     8
Why multi-core ?
• Difficult to make single-core
  clock frequencies even higher
• Deeply pipelined circuits:
  –   heat problems
  –   speed of light problems
  –   difficult design and verification
  –   large design teams necessary
  –   server farms need expensive
      air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift
  towards more parallelism)            9
Instruction-level parallelism
• Parallelism at the machine-instruction
  level
• The processor can re-order and pipeline
  instructions, split them into
  microinstructions, do aggressive branch
  prediction, etc.
• Instruction-level parallelism enabled rapid
  increases in processor speeds over the
  last 15 years
                                   10
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
  thread (Web server, database server)
• A computer game can do AI, graphics, and
  physics in three separate threads
• Single-core superscalar processors cannot
  fully exploit TLP
• Multi-core architectures are the next step in
  processor evolution: explicitly exploiting TLP
                                    11
General context: Multiprocessors
• Multiprocessor is any
  computer with several
  processors

• SIMD
  – Single instruction, multiple data
  – Modern graphics cards
• MIMD
  – Multiple instructions, multiple data
[photo: Lemieux cluster, Pittsburgh Supercomputing Center]
                                           12
Multiprocessor memory types
• Shared memory:
  In this model, there is one (large) common
  shared memory for all processors

• Distributed memory:
  In this model, each processor has its own
  (small) local memory, and its content is
  not replicated anywhere else

                                 13
Multi-core processor is a special
     kind of multiprocessor:
  All processors are on the same chip
• Multi-core processors are MIMD:
  Different cores execute different threads
  (Multiple Instructions), operating on different
  parts of memory (Multiple Data).

• Multi-core is a shared memory multiprocessor:
  All cores share the same memory


                                         14
What applications benefit
         from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with
  thread-level parallelism
  (as opposed to instruction-level parallelism)
(Each of these can run on its own core)
                                  15
More examples
• Editing a photo while recording a TV show
  through a digital video recorder
• Downloading software while running an
  anti-virus program
• “Anything that can be threaded today will
  map efficiently to multi-core”
• BUT: some applications difficult to
  parallelize
                                 16
A technique complementary to multi-core:
       Simultaneous multithreading
• Problem addressed:
  The processor pipeline can get stalled:
  – Waiting for the result of a long floating point (or integer) operation
  – Waiting for data to arrive from memory
   Other execution units wait unused
[figure: Intel processor pipeline block diagram: L1 D-Cache and D-TLB, integer and
 floating-point units, schedulers, uop queues, rename/alloc, BTB, trace cache,
 uCode ROM, decoder, BTB and I-TLB, bus, L2 cache and control. Source: Intel]
                                                                17
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
  SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
  on the same core

• Example: if one thread is waiting for a floating
  point operation to complete, another thread can
  use the integer units


                                       18
Without SMT, only a single thread
    can run at any given time
[figure: processor pipeline in which only Thread 1 (floating point) occupies execution resources; the rest sit idle]
                                                             19
Without SMT, only a single thread
    can run at any given time
[figure: processor pipeline in which only Thread 2 (integer operation) occupies execution resources; the rest sit idle]
                                                          20
SMT processor: both threads can
       run concurrently
[figure: processor pipeline in which Thread 1 (floating point) and Thread 2 (integer operation) run concurrently on the same core]
                                                             21
But: Can’t simultaneously use the
       same functional unit
This scenario is impossible with SMT on a single core
(assuming a single integer unit).
[figure: processor pipeline with Thread 1 and Thread 2 both trying to use the single integer unit: IMPOSSIBLE]
                                                          22
SMT not a “true” parallel processor
• Enables better threading (performance gains of up to about 30%)
• OS and applications perceive each
  simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy
  of each resource
• Compare to multi-core:
  each core has its own copy of resources
                                  23
Multi-core:
  threads can run on separate cores
[figure: two complete processor pipelines (two cores) side by side; Thread 1 runs on one core, Thread 2 on the other]
                                                                                      24
Multi-core:
  threads can run on separate cores
[figure: two complete processor pipelines (two cores) side by side; Thread 3 runs on one core, Thread 4 on the other]
                                                                                      25
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads:
  2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads” 26
SMT Dual-core: all four threads can
       run concurrently
[figure: two SMT-enabled cores side by side; Threads 1 and 3 run on one core, Threads 2 and 4 on the other]
                                                                                      27
Comparison: multi-core vs SMT
• Advantages/disadvantages?




                              28
Comparison: multi-core vs SMT
• Multi-core:
  – Since there are several cores,
    each is smaller and not as powerful
    (but also easier to design and manufacture)
  – However, great with thread-level parallelism
• SMT
  – Can have one large and fast superscalar core
  – Great performance on a single thread
  – Mostly still only exploits instruction-level
    parallelism
                                      29
The memory hierarchy
• If simultaneous multithreading only:
  – all caches shared
• Multi-core chips:
  – L1 caches private
  – L2 caches private in some architectures
    and shared in others
• Memory is always shared


                                     30
“Fish” machines
• Dual-core
  Intel Xeon processors
• Each core is
  hyper-threaded
• Private L1 caches
• Shared L2 caches
[figure: CORE0 and CORE1, each hyper-threaded with its own L1 cache, sharing a single L2 cache above main memory]
                                         31
Designs with private L2 caches
• Both L1 and L2 are private
  Examples: AMD Opteron, AMD Athlon, Intel Pentium D
  [figure: CORE0 and CORE1, each with its own L1 and L2 cache, above main memory]
• A design with L3 caches
  Example: Intel Itanium 2
  [figure: CORE0 and CORE1, each with its own L1, L2, and L3 cache, above main memory]
                                                         32
Private vs shared caches?
• Advantages/disadvantages?




                              33
Private vs shared caches
• Advantages of private:
  – They are closer to the core, so faster access
  – Reduces contention
• Advantages of shared:
  – Threads on different cores can share the
    same cache data
  – More cache space available if a single (or a
    few) high-performance thread runs on the
    system
                                      34
The cache coherence problem
• Since we have private caches:
  How to keep the data consistent across caches?
• Each core should perceive the memory as a
  monolithic array, shared by all the cores




                                    35
The cache coherence problem
Suppose variable x initially contains 15213


[figure: Core 1 through Core 4, each with one or more levels of cache, on a multi-core chip; main memory: x=15213]
                                                 36
The cache coherence problem
Core 1 reads x


[figure: Core 1's cache now holds x=15213; main memory: x=15213]
                                                 37
The cache coherence problem
Core 2 reads x


[figure: Core 1's cache and Core 2's cache both hold x=15213; main memory: x=15213]
                                                 38
The cache coherence problem
Core 1 writes to x, setting it to 21660


[figure: Core 1's cache holds x=21660, Core 2's cache still holds x=15213;
 main memory: x=21660 (assuming write-through caches)]
                                                 39
The cache coherence problem
Core 2 attempts to read x… gets a stale copy


[figure: Core 1's cache holds x=21660, Core 2's cache still holds the stale x=15213; main memory: x=21660]
                                                 40
Solutions for cache coherence
• This is a general problem with
  multiprocessors, not limited just to multi-core
• There exist many solution algorithms,
  coherence protocols, etc.

• A simple solution:
  invalidation-based protocol with snooping


                                     41
Inter-core bus


[figure: Core 1 through Core 4 with their caches and main memory, connected by an inter-core bus on the multi-core chip]
                                              42
Invalidation protocol with snooping
• Invalidation:
  If a core writes to a data item, all other
  copies of this data item in other caches
  are invalidated
• Snooping:
  All cores continuously “snoop” (monitor)
  the bus connecting the cores.


                                   43
The cache coherence problem
Revisited: Cores 1 and 2 have both read x


[figure: Core 1's cache and Core 2's cache both hold x=15213; main memory: x=15213]
                                                 44
The cache coherence problem
Core 1 writes to x, setting it to 21660


[figure: Core 1 writes x=21660 into its cache and sends an invalidation request
 over the inter-core bus; Core 2's copy of x is INVALIDATED;
 main memory: x=21660 (assuming write-through caches)]
                                                    45
The cache coherence problem
After invalidation:


[figure: only Core 1's cache holds x=21660; main memory: x=21660]
                                                 46
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.


[figure: Core 1's cache and Core 2's cache both hold x=21660; main memory: x=21660]
                                                 47
Alternative to invalidate protocol:
               update protocol
Core 1 writes x=21660:


[figure: Core 1 writes x=21660 into its cache and broadcasts the updated value
 over the inter-core bus; Core 2's copy of x is UPDATED to 21660;
 main memory: x=21660 (assuming write-through caches)]
                                                    48
Which do you think is better?
  Invalidation or update?




                      49
Invalidation vs update
• Multiple writes to the same location
  – invalidation: only the first time
  – update: must broadcast each write
            (which includes new variable value)


• Invalidation generally performs better:
  it generates less bus traffic


                                     50
Invalidation protocols
• This was just the basic
  invalidation protocol
• More sophisticated protocols
  use extra cache state bits
• MSI (Modified, Shared, Invalid) and
  MESI (Modified, Exclusive, Shared, Invalid)



                                  51
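
As a rough illustration of how an invalidation protocol can track per-line state, here is a minimal C sketch of MSI-style transitions for a single cache line. The event names and the simplified transition function are assumptions made for this example, not any specific processor's implementation.

 #include <stdio.h>

 /* Per-cache-line coherence state (MSI). */
 typedef enum { INVALID, SHARED, MODIFIED } mstate_t;

 /* Events a cache controller reacts to: requests from its own core,
    and requests it snoops on the inter-core bus from other cores.   */
 typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_WRITE } event_t;

 /* Next state of one cache line after an event (simplified: no
    Exclusive state, no write-back details).                      */
 static mstate_t msi_next(mstate_t s, event_t e) {
     switch (e) {
     case CPU_READ:  return (s == INVALID) ? SHARED : s;   /* miss: fetch, share   */
     case CPU_WRITE: return MODIFIED;                      /* invalidate others    */
     case BUS_READ:  return (s == MODIFIED) ? SHARED : s;  /* supply data, demote  */
     case BUS_WRITE: return INVALID;                       /* another core wrote   */
     }
     return s;
 }

 int main(void) {
     mstate_t line = INVALID;
     line = msi_next(line, CPU_READ);   /* this core reads x   -> SHARED  */
     line = msi_next(line, BUS_WRITE);  /* another core writes -> INVALID */
     printf("final state: %d (0 = INVALID)\n", line);
     return 0;
 }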
Programming for multi-core
• Programmers must use threads or
  processes

• Spread the workload across multiple cores

• Write parallel algorithms

• OS will map threads/processes to cores
                                52
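
A minimal sketch of the "spread the workload" idea with POSIX threads; the worker function, the array, and the chunking scheme are illustrative choices, not something prescribed by the slides. The OS scheduler is free to map the four threads onto the four cores.

 #include <pthread.h>
 #include <stdio.h>

 #define NTHREADS 4
 #define N 1000000

 static double data[N];
 static double partial[NTHREADS];

 /* Each thread sums its own contiguous chunk of the array. */
 static void *worker(void *arg) {
     long id = (long)arg;
     long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
     double s = 0.0;
     for (long i = lo; i < hi; i++) s += data[i];
     partial[id] = s;
     return NULL;
 }

 int main(void) {
     pthread_t t[NTHREADS];
     for (long i = 0; i < N; i++) data[i] = 1.0;
     for (long i = 0; i < NTHREADS; i++)
         pthread_create(&t[i], NULL, worker, (void *)i);  /* OS maps threads to cores */
     double total = 0.0;
     for (long i = 0; i < NTHREADS; i++) {
         pthread_join(t[i], NULL);
         total += partial[i];
     }
     printf("sum = %f\n", total);
     return 0;
 }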
Thread safety very important
• Pre-emptive context switching:
  context switch can happen AT ANY TIME

• True concurrency, not just uniprocessor
  time-slicing

• Concurrency bugs exposed much faster
  with multi-core

                                 53
However: Need to use synchronization
even if only time-slicing on a uniprocessor
 int counter = 0;

 void thread1() {
   int temp1 = counter;   /* read the shared counter                    */
   counter = temp1 + 1;   /* write back: not atomic with the read above */
 }

 void thread2() {
   int temp2 = counter;
   counter = temp2 + 1;
 }
                                54
Need to use synchronization even if only
    time-slicing on a uniprocessor

temp1 = counter;
counter = temp1 + 1;    gives counter = 2
temp2 = counter;
counter = temp2 + 1;

temp1 = counter;
temp2 = counter;        gives counter = 1 (one increment is lost)
counter = temp1 + 1;
counter = temp2 + 1;

                              55
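
A minimal sketch of the fix, assuming a POSIX mutex is used to make the read-modify-write atomic (the slides do not prescribe a particular synchronization primitive):

 #include <pthread.h>
 #include <stdio.h>

 int counter = 0;
 pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

 /* The read-modify-write now happens inside a critical section,
    so the two increments can no longer interleave.              */
 static void *increment(void *arg) {
     pthread_mutex_lock(&counter_lock);
     int temp = counter;
     counter = temp + 1;
     pthread_mutex_unlock(&counter_lock);
     return NULL;
 }

 int main(void) {
     pthread_t t1, t2;
     pthread_create(&t1, NULL, increment, NULL);
     pthread_create(&t2, NULL, increment, NULL);
     pthread_join(t1, NULL);
     pthread_join(t2, NULL);
     printf("counter = %d\n", counter);   /* always 2 */
     return 0;
 }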
Assigning threads to the cores

• Each thread/process has an affinity mask

• Affinity mask specifies what cores the
  thread is allowed to run on

• Different threads can have different masks

• Affinities are inherited across fork()
                                    56
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT

         1        1       0        1




       core 3   core 2   core 1   core 0



 • Process/thread is allowed to run on
   cores 0,2,3, but not on core 1
                                       57
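
A small C sketch of the same idea: the mask 1101 from the slide written as an integer bit vector, with an illustrative helper (not an OS interface) that tests whether a given core is allowed.

 #include <stdio.h>

 /* Affinity mask 1101 (binary): bit i set => allowed to run on core i. */
 #define AFFINITY_MASK 0xDUL   /* cores 0, 2, 3 allowed; core 1 not */

 static int allowed_on_core(unsigned long mask, int core) {
     return (mask >> core) & 1UL;
 }

 int main(void) {
     for (int core = 0; core < 4; core++)
         printf("core %d: %s\n", core,
                allowed_on_core(AFFINITY_MASK, core) ? "allowed" : "not allowed");
     return 0;
 }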
Affinity masks when multi-core and
             SMT combined
• Separate bits for each simultaneous thread
• Example: 4-way multi-core, 2 threads per core
           1        1        0         0       1        0       1        1


            core 3            core 2               core 1           core 0



         thread   thread   thread   thread   thread   thread   thread   thread
            1        0        1        0        1        0        1        0


 • Core 2 can’t run the process
 • Core 1 can only use one simultaneous thread
                                          58
Default Affinities
• Default affinity mask is all 1s:
  all threads can run on all processors

• Then, the OS scheduler decides what
  threads run on what core

• OS scheduler detects skewed workloads,
  migrating threads to less busy processors

                                  59
Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• OS scheduler tries to avoid migration as
  much as possible:
  it tends to keep a thread on the same core
• This is called soft affinity



                                 60
Hard affinities

• The programmer can prescribe her own
  affinities (hard affinities)

• Rule of thumb: use the default scheduler
  unless there is a good reason not to



                                 61
When to set your own affinities
• Two (or more) threads share data structures in memory
  – map them to the same core so that they can share the cache
• Real-time threads:
  Example: a thread running a robot controller:
  - must not be context switched, or else the robot can go unstable
  - dedicate an entire core just to this thread
  [photo source: Sensable.com]
                                                 62
Kernel scheduler API
#include <sched.h>
int sched_getaffinity(pid_t pid,
  unsigned int len, unsigned long * mask);

Retrieves the current affinity mask of process ‘pid’ and
   stores it into the space pointed to by ‘mask’.
‘len’ is the system word size: sizeof(unsigned long)




                                              63
Kernel scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid,
   unsigned int len, unsigned long * mask);

Sets the current affinity mask of process ‘pid’ to *mask
‘len’ is the system word size: sizeof(unsigned long)

To query affinity of a running process:
[barbic@bonito ~]$ taskset -p 3935
pid 3935's current affinity mask: f


                                              64
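
For comparison, current glibc exposes these calls through a cpu_set_t interface (with _GNU_SOURCE), rather than the older raw prototype shown above. A minimal sketch that pins the calling process to cores 0 and 2 and reads the mask back; the choice of cores is arbitrary for illustration.

 #define _GNU_SOURCE
 #include <sched.h>
 #include <stdio.h>

 int main(void) {
     cpu_set_t mask;
     CPU_ZERO(&mask);          /* start with an empty affinity mask */
     CPU_SET(0, &mask);        /* allow core 0 */
     CPU_SET(2, &mask);        /* allow core 2 */

     /* pid 0 means "the calling process" */
     if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
         perror("sched_setaffinity");
         return 1;
     }

     /* Read the mask back to confirm it took effect. */
     cpu_set_t check;
     CPU_ZERO(&check);
     sched_getaffinity(0, sizeof(check), &check);
     printf("allowed on core 1? %s\n", CPU_ISSET(1, &check) ? "yes" : "no");
     return 0;
 }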
Windows Task Manager

[screenshot: Windows Task Manager CPU usage history, with separate graphs for core 1 and core 2]
                65
Legal licensing issues
• Will software vendors charge a separate
  license per core, or only a single
  license per chip?

• Microsoft, Red Hat Linux, SUSE Linux will
  license their OS per chip, not per core



                                  66
Conclusion
• Multi-core chips are an
  important new trend in
  computer architecture

• Several new multi-core
  chips in design phases

• Parallel programming techniques
  likely to gain importance
                               67
