This document presents an introduction to AWS and EC2. It explains how EC2 offers virtual servers in the cloud with rapid scalability and flexible pricing. It describes the different EC2 instance types and their performance characteristics for different workloads. It also examines factors such as virtualization and CPU and I/O performance, and offers tips for optimizing performance on EC2.
2. What to expect from this session
• Introduction to AWS and EC2
• Defining system performance and how it is characterized for different workloads
• How EC2 instances deliver optimal performance while maintaining flexibility and agility
• How to make the most of EC2 instances
4. AWS Global Infrastructure
• 12 Regions
• 33 Availability Zones
• 54 Edge Locations
5. AWS Regions and Availability Zones (AZs)
[Map of AWS Regions and their AZs: US East (VA): AZ A-E; US West (OR): AZ A-C; US West (CA): AZ A-C; GovCloud (US): AZ A-B; EU (Ireland): AZ A-C; EU (Frankfurt): AZ A-B; Asia Pacific (Tokyo): AZ A-C; Asia Pacific (Singapore): AZ A-B; Asia Pacific (Sydney): AZ A-B; S. America (Sao Paulo): AZ A-B; China (Beijing)*: AZ A-B]
*A limited preview of the China (Beijing) Region is available to a select group of China-based and multinational companies with customers in China. These customers are required to create an AWS Account, with a set of credentials that are distinct and separate from other global AWS Accounts.
7. Amazon Elastic Compute Cloud (EC2)
• Virtual servers in the AWS cloud
• Quick and easy scalability, as you need it
• Pay only for what you use
• Familiar operating systems: Linux and Windows
8. A wide variety of instance types
• General purpose: M3, M4
• Compute optimized: C3, C4
• Storage and IO optimized: I2, D2
• GPU enabled: G2
• Memory optimized: R3
9. Amazon EC2 lets you…
• Easily build applications with HA
• Distribute load across EC2 servers using AWS Elastic Load Balancers
• Guarantee high availability and scalability using Auto Scaling
• Use multiple Availability Zones (AZs)
• Choose among different pricing models
10. Different pricing models
• Reserved Instances: pay a minimal upfront fee, reserve capacity, and secure a lower hourly rate
• On-Demand Instances: pay as you go, flat hourly rate, no contracts or commitments
• Spot Instances: place a bid, save up to 90% compared with On-Demand, launch thousands of instances
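A back-of-the-envelope comparison of the three models can be sketched as follows. All rates here are hypothetical placeholders, not actual AWS prices:

```python
# Sketch: comparing EC2 purchasing models for a steady workload.
# All rates and upfront fees are hypothetical, for illustration only.

def on_demand_cost(hourly_rate, hours):
    """Flat hourly rate, no contracts or commitments."""
    return hourly_rate * hours

def reserved_cost(upfront, discounted_rate, hours):
    """Minimal upfront payment plus a lower hourly rate."""
    return upfront + discounted_rate * hours

def spot_cost(avg_spot_rate, hours):
    """Bid-based rate; can be up to ~90% below On-Demand."""
    return avg_spot_rate * hours

# One year of continuous use, with hypothetical rates:
hours = 8760
print(on_demand_cost(0.10, hours))
print(reserved_cost(300, 0.04, hours))
print(spot_cost(0.03, hours))
```

For a steady, always-on workload the Reserved rate wins over On-Demand, and Spot is cheaper still when interruptions are acceptable, which is exactly the hybrid-purchase discussion the slide sets up.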
12. Choosing a server
• Servers are hired to do jobs
• Performance is measured differently depending on the job being done
13. Performance = perspective
• What performance means depends on the perspective:
• Response time
• Throughput
• Consistency
[Stack diagram, top to bottom: Application → System libraries → System calls → Kernel → Device, driven by the workload]
14. Performance factors

Resource          | Factors                                          | Indicators
CPU               | Sockets, number of cores, clock frequency, capacity | CPU utilization, run-queue length
Memory            | Capacity                                         | Free memory, paging, swapping
Network interface | Maximum bandwidth, packets                       | Packets received, packet transfer rate against maximum bandwidth
Disk              | IOPS, throughput                                 | Wait-queue length, device utilization, device errors
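As an illustration of the CPU indicator in the table, utilization can be derived from two samples of the Linux /proc/stat 'cpu' counters (field layout per proc(5); the samples below are made up):

```python
# Sketch: deriving the CPU-utilization indicator from two samples of the
# Linux /proc/stat 'cpu' line (jiffy counters: user nice system idle
# iowait irq softirq steal). The sample values are made up.

def cpu_utilization(sample1, sample2):
    """Utilization over the interval = 1 - (idle delta / total delta)."""
    idle_delta = sample2[3] - sample1[3]
    total_delta = sum(sample2) - sum(sample1)
    busy_delta = total_delta - idle_delta
    return busy_delta / total_delta

# Two hypothetical samples taken some interval apart:
s1 = [100, 0, 50, 800, 0, 0, 0, 0]
s2 = [140, 0, 70, 840, 0, 0, 0, 0]
print(cpu_utilization(s1, s2))  # 0.6 -> 60% busy over the interval
```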
15. Resource utilization
• Each application has its own resource-utilization profile for a given level of performance
• A resource at 100% utilization cannot accept or serve more requests
• Low utilization indicates that more resources have been provisioned than necessary
16. Example: web application
• MediaWiki installed on an Apache server with 140 pages of content
• Load increased over time intervals
21. Instance selection = optimization
• Selecting an instance is equivalent to optimizing resources
• Terminating instances is as easy as acquiring new ones
• Match the workload type to the optimal instance type
23. CPU instructions and protection levels
• The CPU has two protection levels: kernel and application
• Privileged instructions cannot be executed in user mode, to protect the system
• Applications reach the kernel through system calls
Privileged instructions:
• Initiating I/O
• Accessing I/O devices (network, disk)
• Timekeeping
• Halting the CPU
[Diagram: Application above Kernel]
25. x86 CPU virtualization: before Intel VT-x
• Binary translation for privileged instructions
• Paravirtualization (PV)
• PV requires passing through the VMM, introducing latency
• Applications that are bound to system calls are affected the most
[Diagram: Application → Kernel → (PV) → VMM]
26. Applying Moore's law
180 nm (1999) → 130 nm (2001) → 90 nm (2003) → 65 nm (2005) → 45 nm (2007) → 32 nm (2009) → 22 nm (2012) → 14 nm (2014)
MOORE'S LAW: enabling new devices with greater functionality and complexity, while controlling power, cost, and size (doubling integration every 2 years)
29. x86 CPU virtualization: after Intel VT-x
• Hardware-assisted virtualization (HVM)
• PV-HVM uses PV drivers for operations that are slow to emulate:
• e.g. network and disk I/O
[Diagram: Application → Kernel → (PV-HVM) → VMM]
32. Instances: T2
• Lowest-cost instances
• Burstable performance
• Fixed allocation of CPU credits

Model     | vCPU | CPU Credits/Hour | Memory (GiB) | Storage
t2.nano   | 1    | 3                | 0.5          | EBS Only
t2.micro  | 1    | 6                | 1            | EBS Only
t2.small  | 1    | 12               | 2            | EBS Only
t2.medium | 2    | 24               | 4            | EBS Only
t2.large  | 2    | 36               | 8            | EBS Only
33. How credits work
• One CPU credit provides the performance of a full CPU core for one minute
• An instance earns CPU credits at a constant rate
• An instance spends credits while it is active
• Credits expire (leak) after 24 hours
[Chart: credit balance rising at the baseline rate and draining at the burst rate]
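The mechanics above can be sketched as a toy simulation. The earn rate and 24-hour cap below mirror the table's t2.micro figures (6 credits/hour, so a 144-credit cap), but the model is deliberately simplified: a real instance is throttled to its baseline rather than stopping at zero.

```python
# Toy model of T2 CPU-credit accounting, hour by hour. Simplified:
# credit expiry is modeled as a hard cap, and an empty balance floors
# at zero instead of throttling to the baseline rate.

def simulate_credits(earn_per_hour, usage_per_hour, hours, max_balance):
    """One credit = one full CPU core for one minute."""
    balance = 0.0
    for _ in range(hours):
        balance = min(balance + earn_per_hour, max_balance)  # earn, capped (expiry)
        balance = max(balance - usage_per_hour, 0.0)         # spend while active
    return balance

# t2.micro-like instance, idle for 10 hours: banks 60 credits.
print(simulate_credits(earn_per_hour=6, usage_per_hour=0, hours=10, max_balance=144))
# After a day or more of idling the balance stops growing at the cap.
print(simulate_credits(earn_per_hour=6, usage_per_hour=0, hours=48, max_balance=144))
```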
35. Tip: how to interpret steal time
• Fixed CPU allocations can be offered with established CPU limits
• Steal time occurs when the CPU time limit has been exhausted
• Check the CloudWatch metrics
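On Linux, steal time can also be observed inside the guest: it can be derived from the aggregate 'cpu' line of /proc/stat (the 8th counter is 'steal', per proc(5)). A minimal sketch with made-up samples:

```python
# Sketch: steal-time percentage from two samples of the /proc/stat 'cpu'
# line (counters: user nice system idle iowait irq softirq steal).
# The sample values are made up for illustration.

def steal_percent(sample1, sample2):
    """Share of elapsed CPU time stolen by the hypervisor, in percent."""
    steal_delta = sample2[7] - sample1[7]
    total_delta = sum(sample2) - sum(sample1)
    return 100.0 * steal_delta / total_delta

s1 = [500, 0, 100, 300, 0, 0, 0, 10]
s2 = [560, 0, 120, 380, 0, 0, 0, 50]
print(steal_percent(s1, s2))  # 20.0 -> 20% of the interval was stolen
```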
37. I/O and device virtualization
• Split driver model
• Each device has two components:
• A ring buffer for communication
• An event channel that notifies the ring buffer of activity
• Intel VT-d
• Direct pass-through for dedicated devices
• Enhanced Networking (SR-IOV)
38. Split driver model: network
[Diagram: in the guest domain, the application's sockets go through a frontend driver; the frontend talks over a VMM ring buffer to the backend driver and device driver in the driver domain, which owns the physical network device. The VMM schedules the guests' virtual CPUs and virtual memory onto the physical CPU and memory.]
43. Device pass-through: Enhanced Networking
• SR-IOV removes the need for the driver domain
• The physical network device exposes a virtual function to the instance
• It requires a special driver:
• The instance's operating system needs to know about the driver
• "Enhanced Networking" must be enabled in EC2
44. Device pass-through: Enhanced Networking
[Diagram: with SR-IOV, the guest domain's NIC driver talks directly to a virtual function exposed by the SR-IOV network device, bypassing the driver domain's backend/frontend path; other guests can still use the split driver model.]
47. Tip: use Enhanced Networking
• More packets per second
• Lower latency variance
• The instance's operating system must support it
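A guest-side way to check which path is in use is to look at the driver backing the NIC. The helper below only classifies a driver-name string; the specific driver names ('ixgbevf' for the Intel virtual-function driver of this generation, 'vif'/'xen_netfront' for the Xen split-driver path) and the lookup commands in the comments are assumptions to verify on your own instance.

```python
# Sketch: telling the SR-IOV (enhanced networking) path from the Xen
# split-driver path by NIC driver name. The driver-name sets below are
# assumptions, not an exhaustive or authoritative list.

SRIOV_DRIVERS = {"ixgbevf"}                     # Intel 82599 virtual-function driver
XEN_FRONTEND_DRIVERS = {"vif", "xen_netfront"}  # split-driver frontend path

def is_enhanced_networking_driver(driver_name):
    return driver_name in SRIOV_DRIVERS

# On a real Linux instance the driver name can be found with
# 'ethtool -i eth0' or under /sys/class/net/eth0/device/driver.
print(is_enhanced_networking_driver("ixgbevf"))       # True
print(is_enhanced_networking_driver("xen_netfront"))  # False
```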
49. I2 instances
• Provide SSD storage
• Provide IOPS at low cost
• Optimized for demanding random I/O

Model      | vCPU | Memory (GiB) | Storage     | Read IOPS | Write IOPS
i2.xlarge  | 4    | 30.5         | 1 x 800 SSD | 35,000    | 35,000
i2.2xlarge | 8    | 61           | 2 x 800 SSD | 75,000    | 75,000
i2.4xlarge | 16   | 122          | 4 x 800 SSD | 175,000   | 155,000
i2.8xlarge | 32   | 244          | 8 x 800 SSD | 365,000   | 315,000
50. Grants on kernels before version 3.8.0
• Before version 3.8.0, a grant map is required
• The grant map requires costly operations due to TLB (Translation Lookaside Buffer) flushes
read(fd, buffer,…)
51. Grants on kernels after version 3.8.0
• The grant map is defined in a pool
• Data is copied to and from the grant pool
52. Tip: use kernels newer than version 3.8.0
• Amazon Linux 13.09 or later
• Ubuntu 14.04 or later
• RHEL 7 or later
• Etc.
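A quick way to apply this tip is to compare the running kernel release against 3.8.0. The parsing below is a simplification of real release strings (it assumes the numeric part comes before the first hyphen, as in '4.4.0-62-generic'):

```python
# Sketch: checking whether a kernel release string meets the 3.8.0 cutoff.
# Simplified parsing; exotic release strings may need more care.
import platform

def kernel_at_least(release, minimum=(3, 8, 0)):
    numeric = release.split("-")[0]
    parts = tuple(int(p) for p in numeric.split(".")[:3])
    # Pad short strings like '3.8' so the tuple comparison is well defined.
    parts = parts + (0,) * (3 - len(parts))
    return parts >= minimum

print(kernel_at_least("3.7.10"))            # False
print(kernel_at_least("4.4.0-62-generic"))  # True
# On the instance itself: kernel_at_least(platform.release())
```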
53. Summary
• Use PV-HVM
• Monitor T2 credits
• Use Enhanced Networking
• Use kernels newer than version 3.8.0
A great pleasure to be at the first Summit in Buenos Aires.
Support. I help my customers deep-dive performance issues on EC2 and AWS services.
This session is designed to be educational and consultative.
I want you all to come take away something that can help you use EC2,
starting with how you can define performance down to features and tips you can use to get more performance and how they work
You all have specific things you care about and objectives, but if those aren’t covered in the talk, don’t be too disappointed.
We’ve brought some great engineers into our booth to answer your questions after the session.
The most important part is how to make the best use of the instances.
Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios.
Our data center footprint is global, spanning 5 continents with highly redundant clusters of data centers in each region. Our footprint is expanding continuously as we increase capacity, redundancy and add locations to meet the needs of our customers around the world.
You can choose to deploy and run your applications in multiple physical locations within the AWS cloud.
Amazon Web Services are available in geographic Regions that are independent and separate, as much as possible for data sovereignty, while offering the same services as much as possible.
When you use AWS, you can specify the Region in which your data will be stored, instances run, queues started, and databases instantiated.
Within each Region are Availability Zones (AZs).
Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from a failure (unlikely as it might be) that affects an entire zone. Regions consist of one or more Availability Zones, are geographically dispersed, and are in separate geographic areas or countries. The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 Region.
AWS maintains Regions, which are major geographic areas, and Availability Zones (AZs), which are individual data centers or clusters of data centers that make up a Region. Regions are independent and separate, offering the same services as much as possible, but isolated as much as possible for data sovereignty.
Today, AWS operates 9 Regions around the world. Each Region has a minimum of 2 AZs (separate power, flood plains, etc.) to allow customers to set up high-availability architectures and data redundancy. An AZ is an abstraction of a data center with fault isolation, but close enough to others to build high-availability architectures.
In addition to Regions, AWS maintains edge locations that support Route 53 DNS and Amazon CloudFront (CDN) points of presence.
EC2 is designed to make web scale computing easier for developers. It has resizable compute capacity, configurable security and network access, and you have complete control of the resource.
Resources can be started, terminated and monitored as needed, and you can increase availability by deploying instances across multiple physical locations.
Talk about instance families and really stress the breadth of our offering. This graphic does not cover all the Instance Types, but it does let you begin the conversation about the different families and the purpose we had in mind when AWS created the different Instance Types.
Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs.
Speaker Note: Describe ELB and auto scaling, the key use cases and how they can be interdependent but not necessarily.
ELB benefits: HA, health checks, SSL offloading, sticky sessions, logging, etc.
Detects the health of Amazon EC2 instances so that failing instances are detected and removed
Dynamically grows and shrinks resources based on traffic
Seamlessly integrates with autoscaling to add and remove instances based on scaling activities
Supports load balancing of applications using HTTP, HTTPS, SSL, and TCP protocols.
Auto Scaling: Automatically scale your Amazon ec2 fleet to optimize utilization based on your conditions and needs
Scales a customer's EC2 capacity automatically and sheds unneeded Amazon EC2 instances automatically
Good for apps that experience variability in usage
Is enabled by Amazon CloudWatch and carries no additional fees
Explain how pricing works, and integrate our new Spot model with 1-6 hour durations. A great slide to just talk over and whiteboard how our offerings could be bought in a hybrid model: some Spot, some On-Demand, and some RIs.
In order to know how to improve performance you have to first know what it is and how to measure it.
Performance can mean different things depending on what you’re talking about.
Servers are hired to do jobs, and what those jobs are depends on your business or personal objectives
Defining performance for your application is the first step to knowing what you need out of your virtual machines on EC2
Skipping that step can lead to overprovisioning or underprovisioning: spending too much, or not spending enough and not meeting your customer promise
Because EC2 offers lots of virtual server configurations on demand, paid by the hour, the approach to right-sizing is different and less stressful
You aren’t stuck with it, and you can experiment easily
The goal is to hire the right server for the job
CPU bound, IO intensive etc.
The ways that you can generalize performance are the following:
How quickly does a unit of work get done, or response time
How much work is being done per unit of time, or throughput
And how consistently over time is a level of performance achieved. Consistency can be very important.
How quickly does a unit of work get done
When you execute the database query, how quickly does it come back
When you enter a website how quickly does the page load
How much work are you getting done in a unit of time
Web application: the number of requests per second handled within a tolerable response time
Database: transactions per second
Transcoding video: frames per second
Machine learning: inferences per second, or number of training jobs per unit of time
Going further down the stack, to the filesystem for example, you might look at filesystem cache hits.
Down to the hardware resources that do the work, you’re paying attention to CPU, Memory, Disk, Network, and whether these resources are fully saturated or utilized.
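The three performance perspectives above can be made concrete with a small sketch. The request latencies are made up, and the percentile pick is a rough simplification:

```python
# Sketch: computing the three performance perspectives (response time,
# throughput, consistency) from a list of request latencies in seconds.
# The data and window are illustrative only.

def summarize(latencies, window_seconds):
    ordered = sorted(latencies)
    n = len(ordered)
    return {
        # Response time: how quickly a unit of work gets done.
        "mean_response_time": sum(ordered) / n,
        # Throughput: how much work gets done per unit of time.
        "throughput_rps": n / window_seconds,
        # Consistency: the tail of the distribution (rough p99 pick).
        "p99_response_time": ordered[int(0.99 * (n - 1))],
    }

stats = summarize([0.05, 0.06, 0.05, 0.07, 0.50], window_seconds=1.0)
print(stats)
```

A single slow outlier barely moves the mean but dominates the tail, which is why consistency deserves its own metric.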
For instances, we think about performance at the resource level: the capabilities of those resources and how they are utilized
Performance indicators are resource-level indicators that show whether all of the potential is being used or not.
Explain CPU, and stuff.
What is the right utilization? The demand on each component depends on the type of application; remember whether it is an application that uses a lot of CPU, disk, etc.
Each application can have a different resource utilization profile for a given level of workload performance.
Utilization: 100% utilization is usually a sign of a bottleneck.
What if we have low resource capacity?
Performance of an application on EC2.
As an illustration we set up a simple mediawiki deployment – a PHP application using apache and mysql.
We set it up with 140 pages of content and ran a load test
We used siege and gradually upped the load over time
On the server side, we collected some basic metrics using collectd and used a graphing tool to pull some charts together. The default interval of 10 seconds was used, so you get pretty good granularity
It plugs into Apache, and here we show the apache requests per second rate from the web server status output
Why not use CloudWatch metrics? The hypervisor only shows certain metrics. Let's have a look.
Buffers for Filesystem metadata
Cache for file cache to reduce disk accesses
No swapping
The information displayed in the memory section provides the same data about memory usage as the command free -m.
The swpd or “swapped” column reports how much memory has been swapped out to a swap file or disk. The free column reports the amount of unallocated memory. The buff or “buffers” column reports the amount of allocated memory in use. The cache column reports the amount of allocated memory that could be swapped to disk or unallocated if the resources are needed for another task.
The swap section reports the rate that memory is sent to or retrieved from the swap system. By reporting “swapping” separately from total disk activity, vmstat allows you to determine how much disk activity is related to the swap system.
The si column reports the amount of memory that is moved from swap to “real” memory per second. The so column reports the amount of memory that is moved to swap from “real” memory per second.
I/O
The io section reports the amount of input and output activity per second in terms of blocks read and blocks written.
The bi column reports the number of blocks received, or “blocks in”, from a disk per second. The bo column reports the number of blocks sent, or “blocks out”, to a disk per second.
r/s & w/s: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly).
rrqm/s & wrqm/s: How many requests were merged by the block layer. In an ideal world there would be no merges at the I/O level, because applications would have done it ages ago. Ideals differ, though; for some it is good to have the kernel doing this job so they don't have to do it inside the application.
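The vmstat columns described above can be pulled into a dictionary for scripting. The column order follows vmstat's default report; the sample line is made up:

```python
# Sketch: parsing one data line of vmstat's default report into a dict.
# Column order per vmstat's default output; the sample line is made up.

VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache",
                  "si", "so", "bi", "bo", "in", "cs",
                  "us", "sy", "id", "wa", "st"]

def parse_vmstat_line(line):
    values = [int(v) for v in line.split()]
    return dict(zip(VMSTAT_COLUMNS, values))

row = parse_vmstat_line("1 0 0 507504 113588 426060 0 0 1 12 40 77 1 0 99 0 0")
print(row["free"], row["si"], row["so"], row["st"])
```

With the columns named, checks like "no swapping" (si and so at 0) or "steal time present" (st above 0) become one-line conditions.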
The cpu section reports on the use of the system’s CPU resources. The columns in this section always add to 100 and reflect “percentage of available time”.
The us column reports the amount of time that the processor spends on userland tasks, or all non-kernel processes.
The sy column reports the amount of time that the processor spends on kernel related tasks.
The id column reports the amount of time that the processor spends idle.
The wa column reports the amount of time that the processor spends waiting for IO operations to complete before being able to continue processing tasks.
The nice thing is that you can terminate the instances once you're done, or launch the tests on other ones.
Protection, system call performance, scheduling, and P- and C-state management.
Tips: HVM, which system calls to use for timekeeping, how to manage P- and C-states.
CPU has at least two protection levels: Kernel mode and user mode
CPU checks current protection level on each instruction
Privileged instructions can’t be executed in user mode to protect system. They include:
“Initiate I/O” Access I/O devices, such as network and disk
“Access protected memory” Manipulate memory management unit
Time keeping
Halt CPU or change power state
Done in user mode software through system calls – trap to kernel mode.
Took a sample of the system calls being done by httpd and here’s the list of the most frequently used
Creating processes
Input / output operations (file system operations)
And mapping files and devices into memory
These are generally some of the most popular system calls.
If you have debugging enabled, for example, you’ll see an elevation in the number of gettimeofday() calls to put time stamps in the debug logs.
Most time-related PHP functions will use the system time. Since they use the system time, gettimeofday will be called a lot, so if you want to reduce the calls, reduce your time-related functions.
If your application does a lot of I/O or you want to use debugging mode with lots of time checks, for example, you would start to care more about your system call performance.
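One way to act on that tip is to amortize timestamp lookups by caching. The sketch below is illustrative only (a real logger would typically bound staleness by elapsed time rather than by call count):

```python
# Sketch: reducing gettimeofday-style calls by caching the timestamp and
# refreshing it only every N reads. Illustrative, not production code.
import time

class CoarseClock:
    """Consults the real (gettimeofday-backed) clock only once every
    `refresh_every` reads, instead of on every read."""

    def __init__(self, refresh_every=100):
        self.refresh_every = refresh_every
        self.reads = 0
        self.cached = time.time()  # one real clock call up front
        self.clock_calls = 1

    def now(self):
        if self.reads and self.reads % self.refresh_every == 0:
            self.cached = time.time()  # refresh the cached timestamp
            self.clock_calls += 1
        self.reads += 1
        return self.cached

clock = CoarseClock(refresh_every=10)
for _ in range(100):
    clock.now()
print(clock.clock_calls)  # 10 real clock calls instead of 100
```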
When virtualizing hardware, it's the job of the hypervisor to enforce protections and schedule resources to offer a controlled virtual machine experience.
Otherwise, the OS and user land share the same ring.
The hypervisor must be able to trap and moderate any instruction that changes the hardware or state of the system. This is to provide isolation between virtual machines.
The hypervisor moderates system calls and sends them back to the OS. System call performance is poor.
So you have a couple of options:
You can scan the instruction stream of each virtual machine for privileged instructions and do binary translation; performance is not ideal.
You can ignore those instructions and provide “hypercalls” to replace the instructions that lose their functionality; this requires a modified OS kernel, with compatibility and portability costs.
Or you can use hardware-assisted virtualization technology, which provides a new CPU execution mode that allows the hypervisor to run in a new root mode below ring 0.
Then there are complex devices that need to get emulated
The hypervisor also provides hypercall interfaces for other critical kernel operations such as memory management, interrupt handling and time keeping.
When virtualizing the CPU, one also has a choices of how to assign physical CPU cores to virtual CPUs.
But fully virtualized mode, even with PV drivers, has a number of things that are unnecessarily inefficient. One example is the interrupt controllers: fully virtualized mode provides the guest kernel with emulated interrupt controllers (APICs and IOAPICs). Each instruction that interacts with the APIC requires a trip up into Xen and a software instruction decode; and each interrupt delivered requires several of these emulations.
With the introduction of PVHVM mode, we can start to see paravirtualization not as binary on or off, but as a spectrum. In PVHVM mode, the disk and network are paravirtualized, as are interrupts and timers. But the guest still boots with an emulated motherboard, PCI bus, and so on. It also goes through a legacy boot, starting with a BIOS and then booting into 16-bit mode. Privileged instructions are virtualized using the HVM extensions, and pagetables are fully virtualized, using either shadow pagetables, or the hardware assisted paging (HAP) available on more recent AMD and Intel processors.
The "HVM callback vector" line shows that PV interrupts are enabled (from PVHVM), which is a big difference. On full HVM mode, emulated PCI interrupts are used for device I/O delivery, along with emulating the PCI bus, local APIC, and IO APIC. If you doing a high rate of disk I/O or network packets – which is easy to do on today's networks – these emulation overheads add up. With vector callbacks instead of interrupts, the Xen hypervisor can call the destination guest driver directly, avoiding these overheads.
A fully virtualized system, like an OS running on bare hardware, relies on the timer interrupt for its time keeping. This means a number of things:
An idle virtual machine still has to process hundreds of interrupts a second.
Missed interrupts result in unstable time.
On Linux there are two different time mechanisms
Clock source and clock events
With gettimeofday you are accessing a clock source; the same goes for QueryPerformanceCounter.
There are commands that let you see your clock source.
Usually, by default, it's going to be the xen clock source.
JVM tracing does very heavy gettimeofday calls.
Benchmarks tend to show this problem more than a lot of applications.
TSC is a hardware clock source that gets rid of all of the software that has to sit on top of things.
In Linux you can access the TSC without talking to the kernel.
Xen pvclock gives you compatibility with a wide range of hardware.
If you want to see the differences, you need to use a timekeeping benchmark. In the real world it most often shows up when you are using a JVM with a high debug level enabled, so the JVM does time-based tracing. Another classic example is SAP, because they do a large amount of timekeeping operations. High-fidelity trace records.
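The clock source mentioned above can be inspected through sysfs on Linux. The paths are the standard kernel ones; on Xen guests 'xen' is the usual default, with 'tsc' among the alternatives. The parsing helper is kept pure so it can be exercised without a live system:

```python
# Sketch: inspecting the kernel clock source via sysfs (standard Linux
# paths). The sample contents mirror a typical Xen guest.

CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"
# To read on a real instance:
#   open(CLOCKSOURCE_DIR + "/current_clocksource").read()
#   open(CLOCKSOURCE_DIR + "/available_clocksource").read()

def parse_clocksource(current_contents, available_contents):
    """Pure helper over the contents of the two sysfs files."""
    return {
        "current": current_contents.strip(),
        "available": available_contents.split(),
    }

info = parse_clocksource("xen\n", "xen tsc hpet acpi_pm\n")
print(info["current"])  # xen
```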
Test before changing!
A CPU customized for EC2.
Overclocking / C-states.
They are super fast: a C4.8xlarge instance can reach 3.5 GHz on a single core.
C-state and P-state controls
C-states
Control the sleep level that a core may enter
Numbered from C0 (the core is working normally and executing instructions) to C6 (the core is powered off)
P-states
Control the desired performance level of a core
Numbered from P0 (the highest core performance, where it is allowed to use Intel Turbo Boost technology to raise the frequency), then from P1 (requests the maximum base frequency) down to P15 (requests the lowest possible frequency)
You might want to change the C-state or P-state settings to increase processor performance consistency, reduce latency, or tune your instance for a specific workload. The default C-state and P-state settings provide maximum performance, which is optimal for most workloads. However, if your application would benefit from reduced latency at the cost of higher single- or dual-core frequencies, or from consistent performance at lower frequencies as opposed to bursty Turbo Boost frequencies, consider experimenting with the C-state or P-state settings that are available to these instances.
In this example, vCPUs 21 and 28 are running at their maximum Turbo Boost frequency because the other cores have entered the C6 sleep state to save power and provide both power and thermal headroom for the working cores. vCPUs 3 and 10 (each sharing a processor core with vCPUs 21 and 28) are in the C1 state, waiting for instruction.
C-states control the sleep levels that a core may enter when it is inactive. You may want to control C-states to tune your system for latency versus performance. Putting cores to sleep takes time, and although a sleeping core allows more headroom for another core to boost to a higher frequency, it takes time for that sleeping core to wake back up and perform work. For example, if a core that is assigned to handle network packet interrupts is asleep, there may be a delay in servicing that interrupt. You can configure the system to not use deeper C-states, which reduces the processor reaction latency, but that in turn also reduces the headroom available to other cores for Turbo Boost.
A common scenario for disabling deeper sleep states is a Redis database application, which stores the database in system memory for the fastest possible query response time.
You can reduce the variability of processor frequency with P-states. P-states control the desired performance (in CPU frequency) from a core. Most workloads perform better in P0, which requests Turbo Boost. But you may want to tune your system for consistent performance rather than bursty performance that can happen when Turbo Boost frequencies are enabled.
Intel Advanced Vector Extensions (AVX or AVX2) workloads can perform well at lower frequencies, and AVX instructions can use more power. Running the processor at a lower frequency, by disabling Turbo Boost, can reduce the amount of power used and keep the speed more consistent. For more information about optimizing your instance configuration and workload for AVX, see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf.
T2.nano
This instance type was created because we had customers whose workloads used very little CPU.
It works on a bursting model.
A CPU Credit provides the performance of a full CPU core for one minute
Hefty initial CPU credit balance for good startup experience
Use credits when active, accrue credits when idle
Transparency on credit balances
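The credit mechanics above can be sketched as a toy model: one credit buys a full core for one minute, credits accrue while the instance idles, and the balance is capped. The earn rate, initial balance, and cap below are illustrative figures for a t2.micro-sized instance, not authoritative values.

```python
# Toy model of the T2 CPU-credit mechanism: accrue while idle, spend while
# bursting. Rates (30 initial credits, 6 credits/hour, 24h cap) are
# illustrative assumptions, not official numbers.

def simulate_credits(usage_per_min, initial=30, earn_per_hour=6.0, cap=144.0):
    """usage_per_min: fraction of a core used each minute (0.0-1.0).
    Returns the credit balance after each minute."""
    balance = float(initial)
    history = []
    for u in usage_per_min:
        balance += earn_per_hour / 60.0  # accrue this minute's credits
        balance -= u                     # spend u core-minutes
        balance = min(max(balance, 0.0), cap)
        history.append(round(balance, 3))
    return history

# Idle for an hour, then burst at 100% of a core for 10 minutes.
trace = [0.0] * 60 + [1.0] * 10
hist = simulate_credits(trace)
print(hist[59], hist[69])  # 36.0 after the idle hour, 27.0 after the burst
```

The model makes the "use credits when active, accrue credits when idle" behavior easy to reason about: a sustained burst drains the balance at (usage minus earn rate) credits per minute.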
Review the CloudWatch metrics to understand utilization.
Hmm, we see steal time. Is someone stealing my CPU?
To do time accounting for each process, the OS measures the time, schedules the process, runs it, checks the time again, and charges the process with the difference.
This model assumes the OS is running 100% of the time. If the hypervisor takes time away from the instance, the OS doesn't know it, because it wouldn't be getting timer interrupts.
The "steal time" category is enabled by a paravirtual extension, where the guest queries the hypervisor for time and can figure out when time was taken away.
This exists in Linux but not in Windows.
There's a caveat: if the guest has halted, time taken while halted doesn't get reported as steal time. So steal time doesn't always account for all the time the hypervisor has taken from you.
“A common misconception about steal time (due to the unfortunate naming) is that it is a metric for showing the amount of CPU cycles stolen by other virtual machines in the same virtual host. No doubt that cloud service providers tend to oversell but steal time should not be the basis for this assumption.
Steal time actually accounts for the cycles the local virtual machine is trying to go over its originally allocated resources. It should actually be named involuntary wait as mentioned in the Linux kernel documentation for /proc/stat.”
There are a number of corner cases where the hypervisor is doing work on your behalf. Steal time can help you understand what's happening, but it doesn't necessarily indicate that your performance is worse. The big takeaway is that steal time by itself does not mean your performance is being impacted.
The goal of steal time is to correct process accounting.
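On Linux, the steal counter is the eighth field of the `cpu` line in `/proc/stat`, so you can measure it yourself from two snapshots. The field order follows the documented `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq, steal, ...); the sample lines below are fabricated.

```python
# Compute the steal-time percentage over an interval from two /proc/stat
# "cpu" lines. The sample lines are made up; on a real system you would
# read /proc/stat twice with a sleep in between.

def steal_percent(stat_line_before, stat_line_after):
    f0 = list(map(int, stat_line_before.split()[1:]))
    f1 = list(map(int, stat_line_after.split()[1:]))
    total = sum(f1) - sum(f0)
    steal = f1[7] - f0[7]  # 8th field of the cpu line is steal
    return 100.0 * steal / total

before = "cpu 1000 0 300 5000 100 0 10 50 0 0"
after  = "cpu 1400 0 420 5600 120 0 14 86 0 0"
print(round(steal_percent(before, after), 1))  # 3.1 (% of the interval)
```

This is essentially what tools like `top` and `mpstat` report in their `st` column.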
The hypervisor's responsibilities include protection, system call performance, scheduling, and P- and C-state management.
Tips: use HVM, know which system calls to use for timekeeping, and know how to manage P- and C-states.
Consistent device drivers provided in a split driver model allow for better portability of machine images across hardware generations. Hardware-specific drivers reside in a control operating system, and a simple front-end driver in the guest communicates with the back-end through ring buffers in shared memory pages. The multiplexing happens on the host, and it can require host CPU resources.
The original challenge of assigning a device to a virtual machine has to do with direct memory access: a device can modify memory without involving the CPU, which would be a serious security hole if allowed in the context of a multi-tenant host.
IOMMU can identify source devices and either deny or translate memory requests using IOMMU page tables. This enables the hypervisor to assign specific devices to a guest and restrict device memory access to pages owned by the guest. This is how we enable PCI-passthrough for things like GPU instances and SR-IOV network devices.
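The IOMMU behavior just described can be sketched as a toy model: each source device gets its own page table, DMA addresses are translated through it, and anything unmapped is denied. This is purely illustrative; real IOMMU page tables are multi-level hardware structures.

```python
# Toy model of IOMMU DMA remapping: per-device page tables that either
# translate or deny a device's memory request. Illustrative only.

class Iommu:
    def __init__(self):
        self.tables = {}  # device id -> {device page -> host page}

    def assign(self, device, device_page, host_page):
        """Hypervisor grants the device access to one guest-owned page."""
        self.tables.setdefault(device, {})[device_page] = host_page

    def translate(self, device, device_page):
        """Translate a DMA request, or deny it if the page isn't mapped."""
        mapping = self.tables.get(device, {})
        if device_page not in mapping:
            raise PermissionError(f"DMA from {device} to page {hex(device_page)} denied")
        return mapping[device_page]

iommu = Iommu()
iommu.assign("gpu-vf0", device_page=0x10, host_page=0x9A)
print(hex(iommu.translate("gpu-vf0", 0x10)))  # 0x9a: translated
# iommu.translate("gpu-vf0", 0x11) would raise PermissionError: not mapped
```

The deny path is the key point: a passed-through device can only reach pages the hypervisor explicitly mapped for it, which is what makes PCI passthrough safe on a multi-tenant host.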
Single Root I/O Virtualization
Physical network device exposes Virtual Function to instance
The driver in your instance binds to a lightweight PCIe function (the VF) with limited configuration and direct access to the physical NIC.
Packets no longer processed in software.
But it’s a specialized driver, which means:
Your instance OS needs to know about it and be using it
EC2 needs to be told your instance OS knows about it and can handle it.
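One quick way to check the first requirement is to look at which driver the network interface is using. The sketch below parses `ethtool -i`-style output; it assumes the VF driver is `ixgbevf` (the Intel 82599 Virtual Function driver used by these enhanced-networking instances), and the driver list may need extending for other hardware generations.

```python
# Decide from ethtool -i style output whether the interface is bound to an
# SR-IOV Virtual Function driver. The driver set is an assumption based on
# the ixgbevf driver used for enhanced networking on these instance types.

SRIOV_VF_DRIVERS = {"ixgbevf"}

def uses_sriov(ethtool_output):
    """ethtool_output: text like the output of `ethtool -i eth0`."""
    for line in ethtool_output.splitlines():
        if line.startswith("driver:"):
            return line.split(":", 1)[1].strip() in SRIOV_VF_DRIVERS
    return False

sample = "driver: ixgbevf\nversion: 2.16.4\nbus-info: 0000:00:03.0"
print(uses_sriov(sample))  # True: packets bypass the split-driver path
```

If this returns False (for example, the driver is `xen_netfront`), the instance is still on the paravirtual split-driver path described earlier.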
In a virtualized system, a virtual address points to a guest physical address, which points to a host physical address. You have this for both the I/O domain and the instance. A grant maps two different guest physical addresses to the same host physical address, with permissions. Grants always have to originate from the instance.
If the request is a write operation, the grants are filled with the data to write to the disk. The necessary permissions are given to the driver domain so it can map the grants (read-only if the request is a write operation, or with write permissions if the request is a read operation). Once the grants are set up, a reference (the grant reference) is added to the request, the request is queued on the shared ring, and the driver domain is notified that it has a pending request.
When the driver domain reads the request, it parses the grant references in the message and maps the grants into the driver domain's memory space. Once that is done, the driver domain can access this memory as if it were local memory.
The request is then fulfilled by the driver domain, and data is read or written to the grants. After the operation has completed, the grants are unmapped, a response is written to the shared ring, and the guest is notified.
When the guest realizes it has a pending response, it reads it and removes the permissions on the related grants. After that, the operation is complete.
As we can see from the above flow, there is no memory copy, but each request requires the driver domain to perform several mapping and unmapping operations, and each unmapping requires a TLB flush. TLB flushes are expensive operations, and the time required to perform a TLB flush increases with the number of CPUs.
To solve this problem, an extension to the block ring protocol has been added, called "persistent grants". Persistent grants consist of reusing the same grants for all block-related transactions between the guest and the driver domain, so there is no need to unmap the grants on the driver domain, and TLB flushes are not performed (unless the device is disconnected and all mappings are removed). Furthermore, since grants are set up only once, there is no need to grab the driver domain's grant lock on every transaction.
This, of course, doesn't come for free: since grants are no longer mapped and unmapped per request, data has to be copied from or to the persistently mapped grant. But for large numbers of guests, the overhead from TLB flushes and lock contention greatly outweighs the overhead of copying.
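The tradeoff can be made concrete with a back-of-the-envelope cost model: the classic path pays a map, an unmap, and a TLB flush whose cost grows with CPU count, while the persistent-grant path pays a fixed memory copy. All cost units below are made up purely to show the crossover.

```python
# Back-of-the-envelope model of classic grants (map + unmap + TLB flush,
# flush cost scaling with CPU count) versus persistent grants (one copy).
# All costs are fabricated units for illustration only.

def per_request_cost(persistent, ncpus, map_cost=1.0, tlb_flush_per_cpu=1.0,
                     copy_cost=5.0):
    if persistent:
        return copy_cost                              # copy into premapped grant
    return 2 * map_cost + ncpus * tlb_flush_per_cpu   # map + unmap + TLB flush

for ncpus in (1, 4, 16):
    print(ncpus, per_request_cost(False, ncpus), per_request_cost(True, ncpus))
# With few CPUs the classic path is cheaper; as the CPU count grows, the
# TLB flushes dominate and persistent grants win.
```

This mirrors the observation in the text: the fixed copy cost is a good trade once TLB-flush and lock-contention costs scale up with the number of CPUs and guests.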