SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
FlexSC
Flexible System Call Scheduling with
    Exception-Less System Calls

 Livio Soares and Michael Stumm
        University of Toronto
Motivation
  The synchronous system call interface is a
       legacy from the single core era


                       Expensive! Costs are:
                        ➔ direct: mode-switch
                        ➔ indirect: processor

                               structure pollution


  FlexSC implements efficient and flexible
      system calls for the multicore era
                                                2
FlexSC overview
Two contributions: FlexSC and FlexSC-Threads




Results in:
  1) MySQL throughput increase of up to 40%
     and latency reduction of 30%
  2) Apache throughput increase of up to 115%
     and latency reduction of 50%               3
Performance impact of synchronous syscalls
➔   Xalan from SPEC CPU 2006
    ➔   Virtually no time in the OS
➔   Linux on Intel Core i7 (Nehalem)
➔   Injected exceptions with varying frequencies
    ➔ Direct: emulate null system call
      Direct
    ➔ Indirect: emulate “write()” system call
      Indirect
➔   Measured only user-mode time
    ➔   Kernel time ignored

    Ideally, user-mode performance is unaltered
                                                   4
Degradation due to sync. syscalls
 Degradation (lower is faster)
                                              Xalan (SPEC CPU 2006)
                                 70%
                                 60%
                                             Apache                      Indirect
                                 50%                                     Direct
                                 40%                        MySQL
                                 30%
                                 20%
                                 10%
                                 0%
                                   1K       2K     5K     10K     20K    50K       100K
                                       user-mode instructions between exceptions
                                                       (log scale)

        System calls can half processor efficiency;
            indirect cause is major contributor
                                                                                          5
Processor state pollution

➔   Key source of performance impact

➔   On a Linux write() call:
                rd
    ➔   up to 2/3 of the L1 data cache and data
        TLB are evicted

➔   Kernel performance equally affected
    ➔   Processor efficiency for OS code is also cut
        in half
                                                       6
Synchronous system calls are expensive




User

Kernel


  Traditional system calls are synchronous
    and use exceptions to cross domains
                                             7
Alternative: side-step the boundary




User

Kernel



Exception-less syscalls remove synchronicity
  by decoupling invocation from execution
                                          8
Benefits of exception-less system calls
 ➔   Significantly reduce direct costs
     ➔   Fewer mode switches

User
 ➔ Allow for batching


Kernel
   ➔ Reduce indirect costs




 ➔   Allow for dynamic multicore specialization
     ➔   Further reduce direct and indirect costs
                                                    9
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                         10
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;            SUBMIT
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                           11
Exception-less interface: syscall page
write(fd, buf, 4096);



entry = free_syscall_entry();

/* write syscall */
entry->syscall = 1;
entry->num_args = 3;
entry->args[0] = fd;
entry->args[1] = buf;
entry->args[2] = 4096;            DONE
entry->status = SUBMIT;
                SUBMIT

while (entry->status != DONE)
                        DONE
   do_something_else();

return entry->return_code;

                                         12
Syscall threads
➔   Kernel-only threads
    ➔   Part of application process
➔   Execute requests from syscall page
➔   Schedulable on a per-core basis




                                         13
System call batching




  Request as many system calls as possible
  Switch to kernel-mode
  Start executing all posted system calls

     Avoids direct and indirect costs,
          even on a single core          14
Dynamic multicore specialization




 FlexSC makes specializing cores simple
 Dynamically adapts to workload needs
                                          15
What programs can benefit from FlexSC?
Event-driven servers
(e.g., memcached, nginx webserver)
  ➔   Use asynchoronous calls, similar to FlexSC
  ➔   Can use FlexSC directly
  ➔   Mix sync and exception-less system calls

Multi-threaded servers: FlexSC-Threads
  ➔   Thread library, compatible with Pthreads
  ➔   No changes to app. code or recompilation required
  ➔   Transparently converts legacy syscalls into
      exception-less ones
                                                      16
FlexSC-Threads library
➔   Hybrid (M-on-N) threading model
    ➔ One kernel visible thread per core
    ➔ Many user threads per kernel-visible thread



➔   Redirects system calls (libc wrappers)
    ➔ Posts exception-less syscall to syscall page
    ➔ Switches to other user-level thread

    ➔ Resumes thread upon syscall completion



     Benefits of exception-less syscalls
while maintaining sequential syscall interface
                                                     17
FlexSC-Threads in action




User




                           18
FlexSC-Threads in action




On a syscall:
  Post request to system call page
  Block user-level thread
                                     19
FlexSC-Threads in action




Kernel

 On a syscall:
   Post request to system call page
   Block user-level thread
   Switch to next ready thread        20
FlexSC-Threads in action




User

Kernel

  If all user-level threads become blocked:
  1) enter kernel
  2) wait for completion of at least 1 syscall
                                                 21
Evaluation
➔   Linux 2.6.33

➔   Nehalem (Core i7) server, 2.3GHz
    ➔   4 cores on a chip

➔   Clients connected on 1 Gbps network

➔   Workloads
    ➔   Sysbench on MySQL (80% user, 20% kernel)
    ➔   ApacheBench on Apache (50% user, 50% kernel)

➔   Default Linux NTPL (“sync”) vs.
                         sync
         FlexSC-Threads (“flexsc”)
                          flexsc
                                                       22
Sysbench: “OLTP” on MySQL (1 core)

                  500

                  400
(requests/sec.)
  Throughput




                  300
                         15% improvement
                  200
                                                    flexsc
                  100
                                                    sync
                    0
                     0   50     100    150    200   250      300
                              Request Concurrency

                                                               23
Sysbench: “OLTP” on MySQL (4 cores)

                  1,000

                   800
(requests/sec.)
  Throughput




                   600
                               40% improvement
                   400

                   200                                flexsc
                                                      sync
                     0
                      0   50      100    150   200   250   300
                               Request Concurrency

                                                               24
MySQL latency per client request
                                     256 connections
                        1900
               1,000
                 900                                   95th
                 800                                   percentile
Latency (ms)




                 700                                   average
                 600
                 500
                 400
                 300
                 200
                 100
                   0
                       sync flexsc       sync flexsc   sync flexsc
                         1 core            2 cores       4 cores

                       Up to 30% reduction of average
                              request latencies
                                                                     25
MySQL processor metrics
                                               SysBench (4 cores)
                       1.4
                       1.2
Relative Performance




                                        User                             Kernel
                        1
    (flexsc/sync)




                       0.8
                       0.6
                       0.4
                       0.2
                        0
                                   L3    d-cache TLB          IPC      L2   i-cache Branch
                             IPC        L2   i-cache Branch         L3  d-cache TLB

                       Performance improvements consequence of
                            more efficient processor execution
                                                                                         26
ApacheBench throughput (1 core)

                  45,000
                  40,000                               flexsc
                  35,000                               sync
(requests/sec.)
  Throughput




                  30,000
                  25,000
                  20,000
                  15,000           80-90% improvement
                  10,000
                   5,000
                       0
                        0   200   400     600    800    1000
                             Request Concurrency
                                                                27
ApacheBench throughput (4 cores)

                  45,000
                  40,000
                  35,000
(requests/sec.)




                             115% improvement
  Throughput




                  30,000
                  25,000
                  20,000
                  15,000
                  10,000                               flexsc
                   5,000                               sync
                       0
                        0   200   400     600    800   1000
                             Request Concurrency
                                                                28
Apache latency per client request
                                256 concurrent requests
                      238
                 30
                                                   99th
                 25                                percentile
  Latency (ms)




                 20                                average

                 15

                 10

                  5

                  0
                      sync flexsc    sync flexsc    sync flexsc
                       1 core         2 cores        4 cores

                      Up to 50% reduction of average
                             request latencies
                                                                  29
Apache processor metrics
                                               Apache (1 core)
                        2
Relative Performance




                       1.5
    (flexsc/sync)




                                        User                             Kernel
                        1


                       0.5


                        0
                                   L3    d-cache TLB          IPC      L2   i-cache Branch
                             IPC        L2   i-cache Branch         L3  d-cache TLB

                             Processor efficiency doubles for kernel
                                   and user-mode execution
                                                                                         30
Discussion

➔   New OS architecture not necessary
    ➔   Exception-less syscalls can coexist with legacy ones
➔   Foundation for non-blocking system calls
    ➔   select() / poll() in user-space
    ➔   Interesting case of non-blocking free()
➔   Multicore ultra -specialization
    ➔   TCP Servers (Rutgers; Iftode et.al), FS Servers
➔   Single-ISA asymmetric cores
    ➔   OS-friendly cores (HP Labs; Mogul et. al)

                                                               31
Concluding Remarks
➔   System calls degrade server performance
    ➔   Processor pollution is inherent to synchronous
        system calls
➔   Exception-less syscalls
    ➔   Flexible and efficient system call execution
➔   FlexSC-Threads
    ➔   Leverages exception-less syscalls
    ➔   No modifications to multi-threaded applications
➔   Throughput & latency gains
    ➔   2x throughput improvement for Apache and BIND
    ➔   1.4x throughput improvement for MySQL             32
FlexSC
Flexible System Call Scheduling with
    Exception-Less System Calls

 Livio Soares and Michael Stumm
        University of Toronto

Contenu connexe

Tendances

Instruction on creating a cluster on jboss eap environment
Instruction on creating a cluster on jboss eap environmentInstruction on creating a cluster on jboss eap environment
Instruction on creating a cluster on jboss eap environmentMadhusudan Pisipati
 
Enterprise manager cloud control 12c(12.1) &agent安装图文指南
Enterprise manager cloud control 12c(12.1) &agent安装图文指南Enterprise manager cloud control 12c(12.1) &agent安装图文指南
Enterprise manager cloud control 12c(12.1) &agent安装图文指南maclean liu
 
Docker Setting for Static IP allocation
Docker Setting for Static IP allocationDocker Setting for Static IP allocation
Docker Setting for Static IP allocationJi-Woong Choi
 
JBoss EAP / WildFly, State of the Union
JBoss EAP / WildFly, State of the UnionJBoss EAP / WildFly, State of the Union
JBoss EAP / WildFly, State of the UnionDimitris Andreadis
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Виталий Стародубцев
 
Asynchronous Io Programming
Asynchronous Io ProgrammingAsynchronous Io Programming
Asynchronous Io Programmingl xf
 
Visão geral sobre Citrix XenServer 6 - Ferramentas e Licenciamento
Visão geral sobre Citrix XenServer 6 - Ferramentas e LicenciamentoVisão geral sobre Citrix XenServer 6 - Ferramentas e Licenciamento
Visão geral sobre Citrix XenServer 6 - Ferramentas e LicenciamentoLorscheider Santiago
 
Benchmark emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mware
Benchmark   emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mwareBenchmark   emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mware
Benchmark emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mwaresolarisyougood
 
Windows Server "10": что нового в кластеризации
Windows Server "10": что нового в кластеризацииWindows Server "10": что нового в кластеризации
Windows Server "10": что нового в кластеризацииВиталий Стародубцев
 
Docker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingDocker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingOpen Source Consulting
 
Windows Internals for Linux Kernel Developers
Windows Internals for Linux Kernel DevelopersWindows Internals for Linux Kernel Developers
Windows Internals for Linux Kernel DevelopersKernel TLV
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Santosh Kangane
 
Citrix XenServer 5.5 Troubleshooting
Citrix XenServer 5.5 TroubleshootingCitrix XenServer 5.5 Troubleshooting
Citrix XenServer 5.5 TroubleshootingThomas Krampe
 
Uponor Exadata e-Business Suite Migration Case Study
Uponor Exadata e-Business Suite Migration Case StudyUponor Exadata e-Business Suite Migration Case Study
Uponor Exadata e-Business Suite Migration Case StudySimo Vilmunen
 
Consistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityConsistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityYoshinori Matsunobu
 
JBoss Enterprise Application Platform 6 Troubleshooting
JBoss Enterprise Application Platform 6 TroubleshootingJBoss Enterprise Application Platform 6 Troubleshooting
JBoss Enterprise Application Platform 6 TroubleshootingAlexandre Cavalcanti
 
WildFly BOF and V9 update @ Devoxx 2014
WildFly BOF and V9 update @ Devoxx 2014WildFly BOF and V9 update @ Devoxx 2014
WildFly BOF and V9 update @ Devoxx 2014Dimitris Andreadis
 
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.Dimitris Andreadis
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4Vepsun Technologies
 
vSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformancevSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformanceProfessionalVMware
 

Tendances (20)

Instruction on creating a cluster on jboss eap environment
Instruction on creating a cluster on jboss eap environmentInstruction on creating a cluster on jboss eap environment
Instruction on creating a cluster on jboss eap environment
 
Enterprise manager cloud control 12c(12.1) &agent安装图文指南
Enterprise manager cloud control 12c(12.1) &agent安装图文指南Enterprise manager cloud control 12c(12.1) &agent安装图文指南
Enterprise manager cloud control 12c(12.1) &agent安装图文指南
 
Docker Setting for Static IP allocation
Docker Setting for Static IP allocationDocker Setting for Static IP allocation
Docker Setting for Static IP allocation
 
JBoss EAP / WildFly, State of the Union
JBoss EAP / WildFly, State of the UnionJBoss EAP / WildFly, State of the Union
JBoss EAP / WildFly, State of the Union
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Asynchronous Io Programming
Asynchronous Io ProgrammingAsynchronous Io Programming
Asynchronous Io Programming
 
Visão geral sobre Citrix XenServer 6 - Ferramentas e Licenciamento
Visão geral sobre Citrix XenServer 6 - Ferramentas e LicenciamentoVisão geral sobre Citrix XenServer 6 - Ferramentas e Licenciamento
Visão geral sobre Citrix XenServer 6 - Ferramentas e Licenciamento
 
Benchmark emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mware
Benchmark   emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mwareBenchmark   emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mware
Benchmark emc vnx7500, emc fast suite, emc snap sure and oracle rac on v-mware
 
Windows Server "10": что нового в кластеризации
Windows Server "10": что нового в кластеризацииWindows Server "10": что нового в кластеризации
Windows Server "10": что нового в кластеризации
 
Docker on openstack by OpenSource Consulting
Docker on openstack by OpenSource ConsultingDocker on openstack by OpenSource Consulting
Docker on openstack by OpenSource Consulting
 
Windows Internals for Linux Kernel Developers
Windows Internals for Linux Kernel DevelopersWindows Internals for Linux Kernel Developers
Windows Internals for Linux Kernel Developers
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
Citrix XenServer 5.5 Troubleshooting
Citrix XenServer 5.5 TroubleshootingCitrix XenServer 5.5 Troubleshooting
Citrix XenServer 5.5 Troubleshooting
 
Uponor Exadata e-Business Suite Migration Case Study
Uponor Exadata e-Business Suite Migration Case StudyUponor Exadata e-Business Suite Migration Case Study
Uponor Exadata e-Business Suite Migration Case Study
 
Consistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced DurabilityConsistency between Engine and Binlog under Reduced Durability
Consistency between Engine and Binlog under Reduced Durability
 
JBoss Enterprise Application Platform 6 Troubleshooting
JBoss Enterprise Application Platform 6 TroubleshootingJBoss Enterprise Application Platform 6 Troubleshooting
JBoss Enterprise Application Platform 6 Troubleshooting
 
WildFly BOF and V9 update @ Devoxx 2014
WildFly BOF and V9 update @ Devoxx 2014WildFly BOF and V9 update @ Devoxx 2014
WildFly BOF and V9 update @ Devoxx 2014
 
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.
WildFly v9 - State of the Union Session at Voxxed, Istanbul, May/9th 2015.
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4
 
vSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting PerformancevSphere vStorage: Troubleshooting Performance
vSphere vStorage: Troubleshooting Performance
 

Similaire à FlexSC: Exception-Less System Calls - presented @ OSDI 2010

20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)Darpan Dinker
 
Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket LinxiaofengMichael Zhang
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
 
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6David Pasek
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...netvis
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Fwdays
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalTommy Lee
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelKernel TLV
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesSeveralnines
 
Full Stack Load Testing
Full Stack Load Testing Full Stack Load Testing
Full Stack Load Testing Terral R Jordan
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureRobert Metzger
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorAljoscha Krettek
 
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC ClustersSR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC ClustersGlenn K. Lockwood
 
VMworld 2013: Extreme Performance Series: Network Speed Ahead
VMworld 2013: Extreme Performance Series: Network Speed Ahead VMworld 2013: Extreme Performance Series: Network Speed Ahead
VMworld 2013: Extreme Performance Series: Network Speed Ahead VMworld
 
MySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsMySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsRachel Li
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 

Similaire à FlexSC: Exception-Less System Calls - presented @ OSDI 2010 (20)

20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
 
Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket Linxiaofeng
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-Kernels
 
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
VMware ESXi - Intel and Qlogic NIC throughput difference v0.6
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
NoSQL with MySQL
NoSQL with MySQLNoSQL with MySQL
NoSQL with MySQL
 
Load Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - SlidesLoad Balancing MySQL with HAProxy - Slides
Load Balancing MySQL with HAProxy - Slides
 
Oss4b - pxc introduction
Oss4b   - pxc introductionOss4b   - pxc introduction
Oss4b - pxc introduction
 
Full Stack Load Testing
Full Stack Load Testing Full Stack Load Testing
Full Stack Load Testing
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC ClustersSR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
SR-IOV: The Key Enabling Technology for Fully Virtualized HPC Clusters
 
VMworld 2013: Extreme Performance Series: Network Speed Ahead
VMworld 2013: Extreme Performance Series: Network Speed Ahead VMworld 2013: Extreme Performance Series: Network Speed Ahead
VMworld 2013: Extreme Performance Series: Network Speed Ahead
 
MySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsMySQL Replication: Pros and Cons
MySQL Replication: Pros and Cons
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 

FlexSC: Exception-Less System Calls - presented @ OSDI 2010

  • 1. FlexSC Flexible System Call Scheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto
  • 2. Motivation The synchronous system call interface is a legacy from the single core era Expensive! Costs are: ➔ direct: mode-switch ➔ indirect: processor structure pollution FlexSC implements efficient and flexible system calls for the multicore era 2
  • 3. FlexSC overview Two contributions: FlexSC and FlexSC-Threads Results in: 1) MySQL throughput increase of up to 40% and latency reduction of 30% 2) Apache throughput increase of up to 115% and latency reduction of 50% 3
  • 4. Performance impact of synchronous syscalls ➔ Xalan from SPEC CPU 2006 ➔ Virtually no time in the OS ➔ Linux on Intel Core i7 (Nehalem) ➔ Injected exceptions with varying frequencies ➔ Direct: emulate null system call Direct ➔ Indirect: emulate “write()” system call Indirect ➔ Measured only user-mode time ➔ Kernel time ignored Ideally, user-mode performance is unaltered 4
  • 5. Degradation due to sync. syscalls Degradation (lower is faster) Xalan (SPEC CPU 2006) 70% 60% Apache Indirect 50% Direct 40% MySQL 30% 20% 10% 0% 1K 2K 5K 10K 20K 50K 100K user-mode instructions between exceptions (log scale) System calls can half processor efficiency; indirect cause is major contributor 5
  • 6. Processor state pollution ➔ Key source of performance impact ➔ On a Linux write() call: rd ➔ up to 2/3 of the L1 data cache and data TLB are evicted ➔ Kernel performance equally affected ➔ Processor efficiency for OS code is also cut in half 6
  • 7. Synchronous system calls are expensive User Kernel Traditional system calls are synchronous and use exceptions to cross domains 7
  • 8. Alternative: side-step the boundary User Kernel Exception-less syscalls remove synchronicity by decoupling invocation from execution 8
  • 9. Benefits of exception-less system calls ➔ Significantly reduce direct costs ➔ Fewer mode switches User ➔ Allow for batching Kernel ➔ Reduce indirect costs ➔ Allow for dynamic multicore specialization ➔ Further reduce direct and indirect costs 9
  • 10. Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 10
  • 11. Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; SUBMIT entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 11
  • 12. Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; DONE entry->status = SUBMIT; SUBMIT while (entry->status != DONE) DONE do_something_else(); return entry->return_code; 12
  • 13. Syscall threads ➔ Kernel-only threads ➔ Part of application process ➔ Execute requests from syscall page ➔ Schedulable on a per-core basis 13
  • 14. System call batching Request as many system calls as possible Switch to kernel-mode Start executing all posted system calls Avoids direct and indirect costs, even on a single core 14
  • 15. Dynamic multicore specialization FlexSC makes specializing cores simple Dynamically adapts to workload needs 15
  • 16. What programs can benefit from FlexSC? Event-driven servers (e.g., memcached, nginx webserver) ➔ Use asynchoronous calls, similar to FlexSC ➔ Can use FlexSC directly ➔ Mix sync and exception-less system calls Multi-threaded servers: FlexSC-Threads ➔ Thread library, compatible with Pthreads ➔ No changes to app. code or recompilation required ➔ Transparently converts legacy syscalls into exception-less ones 16
  • 17. FlexSC-Threads library ➔ Hybrid (M-on-N) threading model ➔ One kernel visible thread per core ➔ Many user threads per kernel-visible thread ➔ Redirects system calls (libc wrappers) ➔ Posts exception-less syscall to syscall page ➔ Switches to other user-level thread ➔ Resumes thread upon syscall completion Benefits of exception-less syscalls while maintaining sequential syscall interface 17
  • 19. FlexSC-Threads in action On a syscall: Post request to system call page Block user-level thread 19
  • 20. FlexSC-Threads in action Kernel On a syscall: Post request to system call page Block user-level thread Switch to next ready thread 20
  • 21. FlexSC-Threads in action User Kernel If all user-level threads become blocked: 1) enter kernel 2) wait for completion of at least 1 syscall 21
  • 22. Evaluation ➔ Linux 2.6.33 ➔ Nehalem (Core i7) server, 2.3GHz ➔ 4 cores on a chip ➔ Clients connected on 1 Gbps network ➔ Workloads ➔ Sysbench on MySQL (80% user, 20% kernel) ➔ ApacheBench on Apache (50% user, 50% kernel) ➔ Default Linux NTPL (“sync”) vs. sync FlexSC-Threads (“flexsc”) flexsc 22
  • 23. Sysbench: “OLTP” on MySQL (1 core) 500 400 (requests/sec.) Throughput 300 15% improvement 200 flexsc 100 sync 0 0 50 100 150 200 250 300 Request Concurrency 23
  • 24. Sysbench: “OLTP” on MySQL (4 cores) 1,000 800 (requests/sec.) Throughput 600 40% improvement 400 200 flexsc sync 0 0 50 100 150 200 250 300 Request Concurrency 24
  • 25. MySQL latency per client request 256 connections 1900 1,000 900 95th 800 percentile Latency (ms) 700 average 600 500 400 300 200 100 0 sync flexsc sync flexsc sync flexsc 1 core 2 cores 4 cores Up to 30% reduction of average request latencies 25
  • 26. MySQL processor metrics SysBench (4 cores) 1.4 1.2 Relative Performance User Kernel 1 (flexsc/sync) 0.8 0.6 0.4 0.2 0 L3 d-cache TLB IPC L2 i-cache Branch IPC L2 i-cache Branch L3 d-cache TLB Performance improvements consequence of more efficient processor execution 26
  • 27. ApacheBench throughput (1 core) 45,000 40,000 flexsc 35,000 sync (requests/sec.) Throughput 30,000 25,000 20,000 15,000 80-90% improvement 10,000 5,000 0 0 200 400 600 800 1000 Request Concurrency 27
  • 28. ApacheBench throughput (4 cores) 45,000 40,000 35,000 (requests/sec.) 115% improvement Throughput 30,000 25,000 20,000 15,000 10,000 flexsc 5,000 sync 0 0 200 400 600 800 1000 Request Concurrency 28
  • 29. Apache latency per client request 256 concurrent requests 238 30 99th 25 percentile Latency (ms) 20 average 15 10 5 0 sync flexsc sync flexsc sync flexsc 1 core 2 cores 4 cores Up to 50% reduction of average request latencies 29
  • 30. Apache processor metrics Apache (1 core) 2 Relative Performance 1.5 (flexsc/sync) User Kernel 1 0.5 0 L3 d-cache TLB IPC L2 i-cache Branch IPC L2 i-cache Branch L3 d-cache TLB Processor efficiency doubles for kernel and user-mode execution 30
  • 31. Discussion ➔ New OS architecture not necessary ➔ Exception-less syscalls can coexist with legacy ones ➔ Foundation for non-blocking system calls ➔ select() / poll() in user-space ➔ Interesting case of non-blocking free() ➔ Multicore ultra -specialization ➔ TCP Servers (Rutgers; Iftode et.al), FS Servers ➔ Single-ISA asymmetric cores ➔ OS-friendly cores (HP Labs; Mogul et. al) 31
  • 32. Concluding Remarks ➔ System calls degrade server performance ➔ Processor pollution is inherent to synchronous system calls ➔ Exception-less syscalls ➔ Flexible and efficient system call execution ➔ FlexSC-Threads ➔ Leverages exception-less syscalls ➔ No modifications to multi-threaded applications ➔ Throughput & latency gains ➔ 2x throughput improvement for Apache and BIND ➔ 1.4x throughput improvement for MySQL 32
  • 33. FlexSC Flexible System Call Scheduling with Exception-Less System Calls Livio Soares and Michael Stumm University of Toronto