A Cloud Gaming System Based on User-Level Virtualization and Its Resource Scheduling

Youhui Zhang, Member, IEEE, Peng Qu, Jiang Cihang, and Weimin Zheng, Member, IEEE

Abstract—Many believe the future of gaming lies in the cloud, namely Cloud Gaming, which renders an interactive gaming application in the cloud and streams the scenes as a video sequence to the player over the Internet. This paper proposes GCloud, a GPU/CPU hybrid cluster for cloud gaming based on user-level virtualization technology. Specifically, we present a performance model to analyze server capacity and games' resource consumption, which categorizes games into two types: CPU-critical and memory-io-critical. Consequently, several scheduling strategies have been proposed to improve resource utilization and compared with others. Simulation tests show that both the First-Fit-like and the Best-Fit-like strategies outperform the others; in particular, they are near optimal in the batch-processing mode. Other test results indicate that GCloud is efficient: an off-the-shelf PC can support five high-end video games running at the same time. In addition, the average per-frame processing delay is 8~19 ms under different image resolutions, which outperforms other similar solutions.

Index Terms—Cloud computing, cloud gaming, resource scheduling, user-level virtualization

1 INTRODUCTION

Cloud gaming provides game-on-demand services over the Internet. This model has several advantages [1]: it allows easy access to games without owning a game console or high-end graphics processing units (GPUs), and game distribution and maintenance become much easier.

For cloud gaming, the response latency is the most essential factor in the quality of gamers' experience "on the cloud". The number of games that can run on one machine simultaneously is another important issue, which makes this mode economical and thus really practical.
Thus, to optimize cloud gaming experiences, CPU/GPU hybrid systems are usually employed because CPU-only solutions are not efficient for graphics rendering. One of the industrial pioneers of cloud gaming, OnLive,^1 emphasized the former: it allocated one GPU per instance for high-end video games. To improve utilization, some other service providers use virtual machine (VM) technology to share the GPU among games running on top of VMs. For example, GaiKai^2 and G-cluster^3 stream games from cloud servers located around the world to internet-connected devices. Since the end of 2013, Amazon EC2 has also provided a service for streaming games based on VMs.^4

More technical details can be acquired from non-commercial projects. GamePipe [2] is a VM-based cloud cluster of CPU/GPU servers. Its characteristic lies in that not only cloud resources but also the local resources of clients can be employed to improve the gaming quality. Another system, GamingAnywhere [3], has used user-level virtualization technology; compared with some solutions, its processing delay is lower.

Besides, task scheduling is regarded as another key issue for improving the utilization of resources, which has been verified in the high-performance GPU-computing field [4], [5], [6], [7]. However, to the best of our knowledge, scheduling research for cloud gaming has not received much attention yet. One example based on VMs is VGRIS [8] (including its successor VGASA [9]). It is a GPU-resource management framework in the host OS and schedules the virtualized resources of guest OSes.

This paper proposes the design of a GPU/CPU hybrid system for cloud gaming and its prototype, GCloud. GCloud uses user-level virtualization technology to implement a sandbox for different types of games, which can isolate more than one game instance from each other on a game server, transparently capture a game's video/audio outputs for streaming, and handle the remote client device's inputs.
Moreover, a performance model has been presented; thus we have analyzed the resource consumption of games and the performance bottleneck(s) of a server through extensive experiments using a variety of hardware performance counters. Accordingly, several task-scheduling strategies have been designed to improve server utilization, and each has been evaluated.

Different from related research, we focus on the guideline for task assignment; that is, on reception of a game-launch request, we should judge whether a server is suitable to undertake the new instance, under the condition of satisfying the performance requirements.

In addition, from the aspect of user-level virtualization (there are some existing user-level solutions, like GamingAnywhere [3]), GCloud has its own characteristics:

1. http://www.onlive.com/
2. https://www.gaikai.com/
3. http://www.g-cluster.com/eng/
4. https://aws.amazon.com/game-hosting/

The authors are with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. E-mail: {zyh02, zwm-dcs}@tsinghua.edu.cn, shen_yhx@163.com, famousjch@qq.com.

Manuscript received 13 Nov. 2014; revised 11 May 2015; accepted 11 May 2015. Date of publication 14 May 2015; date of current version 13 Apr. 2016. Recommended for acceptance by Y. Wang. For information on obtaining reprints of this article, please send e-mail to reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2015.2433916

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016

© 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
First, it implements a virtual input layer for each concurrently running instance, rather than a system-wide one, which can support more than one Direct3D game at the same time. Second, it designs a virtual storage layer to transparently store each client's configurations across all servers, which has not been mentioned by related projects.

In summary, the following contributions have been accomplished:

1) Enabling technologies based on light-weight virtualization are introduced, especially those characteristic of GCloud. (Section 3)

2) To balance gaming responsiveness and costs, we adopt a "just good enough" principle to fix the FPS (frames per second) of games at an acceptable level. Under this principle, a performance model is constructed to analyze the resource consumption of games, which categorizes games into two types: CPU-critical and memory-io-critical; thus several scheduling mechanisms have been presented to improve utilization and compared. In addition, different from previous work focused on the GPU resource, our work has found that the host CPU or the memory bus is the system bottleneck when several games are running simultaneously. (Section 4)

3) Such a cloud-gaming cluster has been constructed, which supports the mainstream game types. Test results show that GCloud is highly efficient: an off-the-shelf PC can support up to five concurrently running video games (each game's image resolution is 1024 × 768 and the frame rate is 30 FPS). The average per-frame processing delay is 8~19 ms under different image resolutions, which can satisfy the stringent delay requirements of highly interactive games. Tests have also verified the effects of our performance model. (Section 5)

The remainder of this paper is organized as follows. Section 2 presents the background knowledge of cloud gaming as well as related work.
Sections 3 and 4 are the main part: the former introduces the user-level virtualization framework and enabling technologies; the performance model and its analysis method are given in the latter, as well as the scheduling strategies. Section 5 presents the prototype cluster and evaluates its performance. Section 6 concludes.

2 RELATED WORK

2.1 Cloud Gaming

Cloud gaming is a type of online gaming that allows direct and on-demand streaming of game scenes to networked devices, in which the actual game is running on the server end (the main steps are described in Fig. 1). To ensure interactivity, all of these serial operations must happen on the order of milliseconds, which critically challenges the system design. The sum of these latencies is defined as the interaction delay. Existing research [10] has shown that different types of games put forward different requirements.

One type of cloud-gaming solution is VM-based. For solutions based on VMs, Step 1 is completed in the guest OS while the other server-end steps are accomplished by the host. Barboza et al. [11] present such a solution, which provides cloud gaming services and uses three levels of managers for the cloud, hosts and clients. Some existing work, like GaiKai, G-cluster, Amazon EC2 game streaming and GamePipe [2], also belongs to this category.

In contrast to VM-based solutions, the user-level solution inserts the virtualization layer between applications and the run-time environment. This mode simplifies the processing stack and thus reduces the extra overhead. GamingAnywhere [3] is such a user-level implementation, which supports Direct3D/SDL games on Windows and SDL games on Linux.

Some solutions have enhanced the thin-client protocol to support interactive gaming applications. Depending on the concrete implementation, they can be classified into these two types. For example, Winter et al.
[12] have enhanced the thin-client server driver to integrate a real-time desktop streamer that streams the graphical output of applications after GPU processing, which can be regarded as a light-weight virtualization-based solution. In contrast, Muse [13] uses VMs to isolate and share GPU resources on the cloud end; it has enhanced the remote frame buffer (RFB) protocol to compress the frame-buffer contents of server-side VMs.

However, this research has focused on optimizing the interaction delay, namely the performance of a single game on the cloud, rather than the interference between concurrently running instances. Moreover, none of these systems has presented any specific scheduling strategy.

2.2 Resource Scheduling

For high-performance computing (HPC), GPU virtualization has been widely researched [14], [15], [16] for general-purpose computing. From the scheduling viewpoint, there is also a body of research, including Phull et al. [4], Ravi et al. [5], Elliott and Anderson [6], Chen et al. [7] and Bautin et al. [17].

Fig. 1. The whole workflow of cloud gaming.
However, none of this research has considered the characteristics of cloud gaming, including the critical demand on processing latencies, highly coupled sequential operations and so on.

Work on scheduling for cloud gaming is limited. VGRIS [8] and its successor VGASA [9] are resource-management frameworks for VM-based GPU resources, which have implemented several scheduling algorithms for different objectives. However, they are focused on scheduling rendering tasks on a GPU, without considering other tasks like image capture/encoding, etc. iCloudAccess [18] has proposed an online control algorithm to perform gaming-request dispatching and VM-server provisioning to reduce latencies of the cloud gaming platform. A recent work is [19], which has studied the optimized placement of cloud-gaming-enabled VMs; the proposed heuristic algorithms are efficient and nearly optimal. Ours can be regarded as complementary to this research, because they focus on VM-granularity dispatching/provisioning while we pay attention to issues inside an OS.

One related work on GPU scheduling (but not cloud-gaming-specific) is TimeGraph [20]: it is a real-time GPU scheduler that has modified the device driver to protect important GPU workloads from performance interference. Similarly, it has not considered the characteristics of cloud gaming.

Another category of related research [21], [22] concerns streaming-media applications. For example, Cherkasova and Staley [21] developed a workload-aware performance model for video-on-demand (VOD) applications, which is helpful for measuring the capacity of a streaming server as well as the resource requirements. We have referred to their design principles to construct our performance model.

2.3 Others

To improve processing efficiency and adaptation, Wang and Dey [23] propose a rendering adaptation technique to adapt the game rendering parameters to satisfy Cloud Mobile Gaming's constraints.
Klionsky [24] has presented an architecture which amortizes the cost of across-user rendering. However, these two technologies are not transparent to games. In addition, Jurgelionis et al. [25] explored the impact of networking on gaming; Ojala and Tyrvainen [26] developed a business model behind a cloud gaming company.

In summary, compared with the above-mentioned work, GCloud has the following features:

1) It is based on user-level virtualization. Compared with existing user-level solutions, GCloud has proposed more thorough solutions for virtual input/storage.

2) From the aspect of performance modeling and scheduling, more real jobs (including image capture, encoding, etc.) have been considered (compared with VGRIS/VGASA [8], [9]). In addition, we use hardware-assisted video encoding to mitigate the interference between games and to improve performance.

3) Last but not least, our work is focused on related issues inside a node, while [18], [19] work at the VM granularity.

4) Furthermore, quite a few studies have been carried out to measure the performance of cloud gaming systems, like [27], [28], [29] and [30]. We also referred to them to complete our measurements.

3 SYSTEM ARCHITECTURE AND ENABLING TECHNOLOGIES

3.1 The Framework

The system (Fig. 2) is built with a cluster of CPU/GPU-hybrid computing servers; a dedicated storage server is used as the shared storage. Each computing server can host the execution of several games simultaneously. One of these servers is employed as the manager node, which collects real-time running information of all servers and completes management tasks, including task assignment, user authentication, etc.

It is necessary to note that the framework in Fig. 2 is for small/medium system scales. For a large-scale system with many users, a hierarchical architecture is needed to avoid the bottleneck of information exchange.
In fact, because the quality of gamers' experience highly depends on the response latency, and the latter is sensitive to the physical distance between clients and servers, the architecture may be geographically distributed, which is out of the scope of this paper. It also means that at any one site the scale will not be very large.^5

Fig. 2. System architecture.

5. According to OnLive, the theoretical upper bound of the distance between a user and a cloud gaming server is approximately 1,000 miles. In China, some gaming systems provide services for just one city or several cities.

Initially, gaming agents on available computing servers register with the manager, indicating that they are ready and
which games they can execute. When a client wants to play some game, the manager searches for candidates among the registered information. After such a server has been chosen, a start-up command is sent to the corresponding agent to boot the game within a light-weight virtualization environment. Then its address is sent to the client, and future communication is done directly between the two ends. During run time, each agent collects local runtime information and sends it to the manager periodically, so the latter always has the latest status of resource consumption.

The storage server plays an important role in providing the personalized game configuration for each user. For instance, suppose User A had played Game B on Server C. Now A wants to play the game again, but the manager finds that Server C's resources have been depleted. The task then has to be assigned to another server, D. Consequently, it is necessary to restore A's configurations of B on D, including the game's progress and other customized information. The storage server is simply used as the shared storage for all computing nodes.

3.2 The User-Level Virtualization Environment

For each game, API interception is employed to implement a lightweight virtualization environment. API interception means intercepting calls from the application to the underlying running system; typical applications include software streaming [31], [32], etc. Here it is used to catch the corresponding resource-access APIs of the game. In addition, our main target platform is MS Windows, as Windows dominates the PC video-game world.

3.2.1 Image Capture

Usually, gaming applications employ mainstream 3D computer-graphics rendering libraries, like Direct3D or OpenGL, for hardware (GPU) acceleration; GCloud supports both of them.
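The wrap-capture-forward pattern behind this interception can be illustrated with a toy sketch. GCloud hooks the native Direct3D/OpenGL entry points, but the same idea is visible in a few lines of Python, where an attribute swap stands in for a native hook; all names below are illustrative, not GCloud's actual code.

```python
# Toy illustration of user-level API interception (wrap, capture, forward).
# GCloud hooks native calls such as Direct3D's Present; here an attribute
# swap plays the same role. All names are illustrative.

class Renderer:
    """Stand-in for a game's rendering backend."""
    def present(self, frame):
        return f"displayed:{frame}"

captured = []  # frames grabbed by the virtualization layer

def install_capture_hook(renderer):
    original = renderer.present            # keep the original entry point
    def hooked(frame):
        captured.append(frame)             # side effect: capture the frame
        return original(frame)             # forward to the real call
    renderer.present = hooked              # redirect all future calls

r = Renderer()
install_capture_hook(r)
result = r.present("frame-0")              # the game's code is unchanged
```

The essential property, as in the paper, is that the game is unmodified: it still calls the same entry point, and the virtualization layer observes every frame on the way through.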
In the case of Direct3D, the typical workflow of a game is an endless loop: first, some CPU computation prepares the data for the GPU, e.g., calculating objects in the upcoming frame. Then the data is uploaded to the GPU buffer and the GPU performs the computation, e.g., rendering, using its buffer contents, and fills the front buffer. To fetch the contents of the image into system memory for subsequent processing, we intercept Direct3D's Present API. For OpenGL, we intercept the Present-like API, glutSwapBuffers, to capture images. For other games based on a common GUI window, we simply set a timer for the application's main window and intercept the corresponding message handler to capture the image of the target window periodically.

3.2.2 Audio Capture

Capturing audio data is a platform-dependent task. Because our main target platform is MS Windows, we intercept the Windows Audio Session APIs to capture the sound. Core Audio serves as the foundation of quite a few higher-level APIs; thus this method brings the best adaptability.

3.2.3 Virtual Input Layer

Flash-based or OpenGL-based applications usually use the window's default message loop to handle inputs. Thus, the solution is straightforward: we inject a dedicated input thread into the intercepted game process. On reception of any control command from the client, this thread converts it into a local input message and sends it to the target window.

For Direct3D-based games, the situation is more complicated. Existing work [3] replays input events using the SendInput API on Windows. However, SendInput inserts events into a system-wide queue, rather than the queue of a specific process, so it is difficult to support more than one instance in a non-VM solution.
To conquer this problem, we intercept quite a few DirectInput APIs to simulate input queues for each virtualized application; thus the user's input can be pushed into these queues and made accessible to the applications.

3.2.4 Virtual Storage Layer

From the storage aspect, a program can be divided into three parts [31]: Parts 1-2 include all resources provided by the OS and those created/modified by the installation process; Part 3 is the data created/modified/deleted during run time, which contains the game configurations of each user. For the immutable parts, it is relatively easy to distribute them to servers through some system-clone method. The focus is how to migrate the resources of Part 3 across servers to provide personalized game configurations for users.

We construct a virtual storage layer by intercepting the file-system and registry-access APIs of all games. During run time, resources modified by the game instance are moved into Part 3. When the case described in Section 3.1 occurs, the virtual storage layer of Game B on the current server can redirect resource accesses to the shared storage to fetch the latest configurations of User A, which were stored by the last run on Server C.

4 PERFORMANCE MODEL AND TASK SCHEDULING

As mentioned in Section 1, the response latency and the number of games that one machine can execute simultaneously are both essential to a cloud gaming system. To a large extent, they are in contradiction, and existing systems (like [3], [11], [12]) usually focus on the first issue. However, this is not always economical. For example, if the FPS of a given game is too high, it will consume more resources; moreover, lossy compression will counteract the high video quality to a certain extent. Some scheduling work, like VGRIS/VGASA [8], [9], has presented multi-task scheduling strategies.
There are several essential differences between our work and VGRIS/VGASA. First, they focus on how to schedule existing games on a server, including the allocation of enough GPU resources for a game, etc.; in contrast, GCloud focuses on the assignment of a new task. Second, they focus on the GPU resource, and no other operations (like image capture, encoding, etc.) are considered, while our tests (presented in Section 4.4) show that the host CPU or the memory bus is the bottleneck. Third, VGRIS and VGASA are VM-specific.
In this paper, we adopt a "just good enough" strategy, which means that we keep the game quality at some acceptable level and then try to satisfy the interactivity requests of as many games as possible. Hence, there are two main issues:

Issue 1: For a given server and its running game instances, how to make sure the game quality is acceptable?

Issue 2: On an incoming request, which server is suitable to launch the new game instance?

For Issue 1, we first give a brief pipeline model for cloud gaming, which can be used to judge whether the game quality is acceptable or not. Second, a method to fix the FPS is presented to provide the "just good enough" quality; a hardware-assisted video-encoding technique is also used to further mitigate the interference between games. For Issue 2, several resource metrics are given. Then we carry out tests to measure the server capacity and to categorize games into different types. Accordingly, we design a server capacity model and corresponding task-assignment strategies; these strategies have been compared with others.

4.1 Game Quality

A cloud gaming system's interaction delay contains three parts [27]: (1) network delay, the time required for a round of data exchange between the server and client; (2) play-out delay, the time required for the client to handle the received frame for playback; (3) processing delay, the time required for the server to process a player's command, and to encode and send the corresponding frame back.

This paper is mainly about the server side, and the network is assumed to provide sufficient bandwidth; thus we focus on the processing delay, which should be confined to a limited range. The work [25] on measuring the latency of cloud gaming has disclosed that, for some existing service providers (like OnLive), the processing delay is about 100-200 ms. Thus, we use 100 ms as our scheduling target, denoted MAX_PD.
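Concretely, this target reduces the server-side admission question to a comparison of the measured per-frame latencies against a 100 ms budget. A minimal sketch (the function name is illustrative, not the paper's code):

```python
# Sketch: check a frame's server-side processing delay against the
# 100 ms scheduling target (MAX_PD). The name is illustrative.

MAX_PD = 0.100  # seconds; the scheduling target chosen above

def within_processing_budget(t_render_capture, t_encode):
    """Processing delay = rendering/capture time plus encoding time."""
    return t_render_capture + t_encode <= MAX_PD
```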
Another key metric is the FPS; the required FPS is denoted FIXED_FPS. In this work, FIXED_FPS is set to 30 by default.

As presented in Fig. 1, the gaming workflow can be regarded as a pipeline of four steps: operations of gaming logic, graphics rendering (including the image capture), encoding (including the color-space conversion) and transmission. In addition, our tests show that, given sufficient bandwidth, the delay of transmission is much less than that of the other steps. Thus, the fourth step can be skipped and we focus on the remaining three. Furthermore, the first two steps are completed by the intercepted process, which is transparent to us; thus we combine them together and denote the sum of their latencies by T_present. The average processing time of the encoding step is denoted by T_encoding (the pipeline is presented in Fig. 3). Hence, if the following conditions (referred to as the Responsiveness Conditions) are satisfied, the requirements on the FPS and processing delay will undoubtedly be met. More precisely, satisfaction of the first two conditions implies the last one under the default settings:

T_present <= 1/FIXED_FPS, and      (1)
T_encoding <= 1/FIXED_FPS, and     (2)
T_encoding + T_present <= MAX_PD   (3)

4.2 Fixed FPS

To provide the "just good enough" gaming quality, the FPS value should be fixed at some acceptable level (Issue 1). Because the interface of GPU drivers is not open, our solution is in user space, too. Taking a Direct3D game as an example, we intercept the Present API to insert a Sleep call that adjusts the loop latency. The rendering complexity is mostly determined by the complexity of the gaming scenes, and the latter changes gradually; thus, it is reasonable to predict T_present from its own historical information.
In the implementation, the average time (denoted T_avg_present) of the past 100 loops is used as the prediction for the upcoming one (a similar method has been adopted by [8], [9]) and the sleep time (T_sleep) is calculated as:

T_sleep = 1/FIXED_FPS - T_avg_present

The real problem lies in how to judge whether a busy server is suitable to undertake a new game instance or not; thus, we must solve Issue 2 anyway.

4.3 Hardware-Assisted Video Encoding

The fixed FPS can mitigate the interference between games because it allocates just enough resource for rendering. Further, we use the hardware-assisted video-encoding capability of commodity CPUs for less interference. The hardware technology of Intel CPUs, Quick Sync, has been employed. It owns a full-hardware function pipeline to compress raw images in the RGB or YUV format into H.264 video, and it has become one of the mainstream hardware encoding technologies.^6 On the test server, a Quick-Sync-enabled CPU can simultaneously support up to twenty 30-FPS encoding tasks (at an image resolution of 1024 × 768); the latency for one frame is as low as 4.9 ms.

Fig. 3. Gaming pipeline.

6. Quick Sync was introduced with the Sandy Bridge CPU microarchitecture. It is part of the graphics processor integrated on the same die as the CPU. Thus, to make it work with a discrete graphics card (used for gaming), some special configuration should be set up, as described at http://mirillis.com/en/products/tutorials/action-tutorial-intel-quick-sync-setup_for_desktops.html. For AMD, its Accelerated Processing Unit (APU) has a similar function.
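The pacing rule of Section 4.2 — predict T_present as the mean of the last 100 frames, then sleep for the remainder of the frame budget — can be sketched as follows. The class name is an assumption; in GCloud the equivalent logic runs inside the intercepted Present call.

```python
from collections import deque

# Sketch of the fixed-FPS pacing of Section 4.2:
#   T_sleep = 1/FIXED_FPS - T_avg_present
# where T_avg_present is the mean render/capture time of the last
# 100 loops. Names are illustrative, not GCloud's actual code.

FIXED_FPS = 30
HISTORY = 100

class FpsLimiter:
    def __init__(self):
        self.samples = deque(maxlen=HISTORY)  # sliding window of T_present

    def sleep_time(self, t_present):
        """Seconds to sleep after this frame; 0 if already behind."""
        self.samples.append(t_present)
        t_avg = sum(self.samples) / len(self.samples)
        return max(0.0, 1.0 / FIXED_FPS - t_avg)
```

Clamping at zero covers the case where the predicted frame time already exceeds the budget: the loop then runs unthrottled, and the achieved FPS simply drops below the target.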
Moreover, the CPU utilization of one such task is almost negligible: less than 0.2 percent (details are presented in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2015.2433916). This result means it causes little interference to other tasks. Thus, we use it as the reference implementation in all following tests, as well as in the system prototype.

4.4 Resource Metrics

Five types of system resources are considered: the CPU, GPU, system RAM, video RAM and the system bandwidth. The first two are denoted by utilization ratios; the next two are represented by memory consumption; and the last refers to the miss number of the LLC (last-level cache). Correspondingly, the server capacity and the average resource requirements of a game (under the condition of satisfying the Responsiveness Conditions) can each be denoted by a five-item tuple: (U_CPU, U_GPU, M_HOST, M_GPU, B).

Based on the above metrics, we should judge whether the remaining resource capacities of a server can meet the demands of a new game or not. The key lies in how to measure the capacity of a server, as well as the game requirements. We present the following method to accomplish these tasks, namely, to solve Issue 2.

4.4.1 Test Methods

Commercial GPUs usually implement driver/hardware counters to provide runtime performance information. For example, NVIDIA's PerfKit APIs^7 can collect resource-consumption information of each GPU in real time. Hence, we can get results accumulated since the previous time the GPU was sampled, including the percentage of time the GPU is idle/busy, the consumption of graphics memory, etc. For commodity CPUs, a similar method is used: for instance, Intel has already provided the capability to monitor performance events inside its processors.
Through its Performance Counter Monitor (PCM), many performance-related events per CPU core, including the number of LLC misses, instructions per CPU cycle, etc., can be obtained periodically. The sample periods for CPU and GPU are both set to 3 s. In addition, we embed monitoring code into the intercepted gaming APIs to record the processing delay of each frame, which is used to judge whether the Responsiveness Conditions have been met or not.

Moreover, it is necessary to note that the integrated graphics processor (which contains the Quick Sync encoding engine) shares the LLC with the CPU cores, and there is no on-chip graphics memory.^8 Thus the hardware encoding process needs to access the system memory (if the required data misses in the LLC), which means the corresponding miss number is still suitable to indicate the memory throughput with hardware encoding.

In addition, we select four representative games, including three Direct3D video games (Direct3D is the most popular development library for PC video games) and one OpenGL game:

1) Need for Speed: Most Wanted (abbreviated to NFS), a classic racing video game.
2) Modern Combat 2: Black Pegasus (abbreviated to Combat), a first-person shooter video game.
3) Elder Scrolls: Skyrim - Dragonborn (abbreviated to Scrolls), an action role-playing video game.
4) Angry Birds Classic (abbreviated to Birds), the well-known mobile-phone game's PC version.

Several volunteers were invited to play games on the cloud gaming system and encouraged to play quite a few game scenes; the duration was more than 15 minutes for each game. After several loops, runtime information was collected for further analysis.

4.4.2 Test Cases

A Windows 7 (64-bit) PC is used as the server, equipped with an NVIDIA GTX 780 GPU adapter (3 GB video memory), a common Core i7 CPU (four cores, 3.4 GHz) and 8 GB RAM.
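The five-item tuple of Section 4.4 lends itself to a direct capacity check: a server can accept a new instance only if every remaining resource covers the game's average requirement. A minimal sketch (field and function names are assumptions, not the paper's code):

```python
from dataclasses import dataclass, astuple

# Sketch of the five-item resource tuple (U_CPU, U_GPU, M_HOST, M_GPU, B)
# from Section 4.4 and the element-wise capacity check behind task
# assignment. Field names are illustrative.

@dataclass
class Resources:
    u_cpu: float   # CPU utilization, percent
    u_gpu: float   # GPU utilization, percent
    m_host: float  # system RAM, MB
    m_gpu: float   # video RAM, MB
    b: float       # million LLC misses per second (memory-bus proxy)

def fits(demand, remaining):
    """True if every resource demand is covered by remaining capacity."""
    return all(d <= r for d, r in zip(astuple(demand), astuple(remaining)))
```

A task-assignment strategy such as First-Fit then amounts to scanning servers and launching on the first one for which `fits` holds; the demand values themselves come from the single-instance measurements described next.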
By default, games are streamed at a resolution of 1024 × 768 and the game picture quality is set to medium in all cases; the FPS is fixed to 30. Video encoding is performed by Quick Sync.

Single instance (resource-requirement tests). Each game has been played in our virtualization environment alone and resource consumptions are recorded in real time. As expected, the Responsiveness Conditions can be met for each game on this powerful machine; the corresponding resource requirements are presented in Table 1. Considering resource consolidation, the average value of each item of the tuple has been used.

Multiple instances running simultaneously. Quite a few game groups have been executed and sampled simultaneously. For example, we have played 2–6 NFS instances at the same time. Based on the runtime information, we can see that this server can support up to five acceptable instances simultaneously (we consider a game's running quality acceptable if its average FPS value is not less than 90 percent of the FIXED_FPS). While six instances are running, the FPS value is less than 27, which is regarded as unacceptable.

Furthermore, we should identify the bottleneck that is pivotal for task assignment. Considering the following facts (in Fig. 4a), NFS is memory-IO-critical: when no more than five games are running simultaneously, the average FPS is stable (about 30.3) and the value of million-miss-number-per-second increases almost linearly. When six instances are running, the FPS is about 24.7 and the throughput remains nearly unchanged (from 37.6 to 37.9). At the same time, both U_GPU and U_CPU are far from exhausted, 47 and 71 percent respectively. This phenomenon indicates that memory accesses have impeded tasks from utilizing the CPU/GPU resources efficiently. Moreover, memory consumptions are not the bottleneck; thus no swap operation will happen (for clarity, the information of memory consumptions is skipped in these figures).
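The acceptability criterion above can be written as a one-line predicate (a trivial sketch; FIXED_FPS is 30 and the 0.9 factor is the paper's 90 percent threshold):

```python
FIXED_FPS = 30  # target frame rate used throughout the experiments

def acceptable(avg_fps, fixed_fps=FIXED_FPS, ratio=0.9):
    """A run is acceptable if its average FPS is >= 90% of the fixed FPS."""
    return avg_fps >= ratio * fixed_fps
```

With FIXED_FPS = 30, the threshold is 27: the five-instance NFS case (average FPS about 30.3) passes, while the six-instance case (about 24.7) fails.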
For Combat and Scrolls (in Figs. 4b and 4c), the same conclusion holds: under the condition satisfying the

7. http://www.nvidia.com/object/nvperfkit_home.html
8. http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Sandy-Bridge-Microarchitecture/1161

1244 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 5, MAY 2016
performance requirements, there can be at most three concurrent instances of Scrolls. For Combat, the maximum number of instances is 5. At the same time, both U_GPU and U_CPU are limited, too.

On the other hand, Birds (in Fig. 4d) is CPU-critical because it can exhaust the CPU (97 percent with 10 instances running, and the average FPS is 27.1), while the value of million-miss-number-per-second increases almost linearly.

4.4.3 Modeling

Based on the previous results, we have normalized the resource requirements and the server capacity; the principle is critical-resource-first:

1) For a memory-IO-critical game of which the game server can hold Ni instances, the fifth item (Bandwidth) of its tuple is set to MAX_SYSTEM_THROUGHPUT (see footnote 9) / Ni, regardless of the absolute value.
2) For any CPU-critical game of which the game server can hold Nj instances, the value of its U_CPU is set to 1 / Nj.
3) The other tuple items are kept unchanged.

For example, the tuple of NFS is <9.15%, 2.01%, 526, 220, MAX_SYSTEM_THROUGHPUT / 5>, and the Birds tuple is <100% / 10, 1.1%, 181, 142, 6.54>. Tuples of these four games are listed in Table 2.

Then, for a set of M games (each denoted as Game_i, 0 ≤ i < M), if the sum of each kind of resource consumption is less than the corresponding system capacity, we consider that these games can run simultaneously and smoothly. Formally, we use the following notations:

<U_CPU_game_i, U_GPU_game_i, M_HOST_game_i, M_GPU_game_i, B_game_i>: the tuple of resource requirements of Game_i;

<100%, 100%, SERVER_RAM_CAPACITY, SERVER_VIDEO_RAM_CAPACITY, MAX_SYSTEM_THROUGHPUT>_server: the capacity of a given server.

If the following conditions are met, this server can hold all games of the set running simultaneously:

Σ_{0 ≤ i < M} U_CPU_game_i ≤ 100%
Σ_{0 ≤ i < M} U_GPU_game_i ≤ 100%
Σ_{0 ≤ i < M} M_HOST_game_i ≤ SERVER_RAM_CAPACITY
Σ_{0 ≤ i < M} M_GPU_game_i ≤ SERVER_VIDEO_RAM_CAPACITY
Σ_{0 ≤ i < M} B_game_i ≤ MAX_SYSTEM_THROUGHPUT

Fig. 4. FPS and resource-consumptions of games.
TABLE 1
Resource-Requirements of Each Game

Game     U_CPU (%)  U_GPU (%)  M_HOST (MB)  M_GPU (MB)  B (million miss-number per second)
NFS      9.15       2.01       526          220         8.10
Scrolls  14.55      7.02       795          560         13.52
Combat   8.47       3.27       800          296         7.97
Birds    9.36       1.1        181          142         6.54

9. MAX_SYSTEM_THROUGHPUT refers to the maximal LLC-miss number per second that the system can sustain. It can be evaluated by a specially designed program that accesses the memory space randomly and intensively.

ZHANG ET AL.: A CLOUD GAMING SYSTEM BASED ON USER-LEVEL VIRTUALIZATION AND ITS RESOURCE SCHEDULING 1245
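Under the normalized tuples, the feasibility conditions of Section 4.4.3 reduce to a per-resource sum check. A minimal sketch in Python: the capacity values assume the 8 GB RAM / 3 GB video-memory test server; bandwidth entries are expressed as fractions of MAX_SYSTEM_THROUGHPUT, with Birds' absolute value of 6.54 normalized against the roughly 37.8 million-misses-per-second saturation point observed in Fig. 4a (all concrete numbers here are illustrative readings of the paper's tables, not an official implementation):

```python
# Tuple order: (U_CPU %, U_GPU %, M_HOST MB, M_GPU MB, B).
# B is a fraction of MAX_SYSTEM_THROUGHPUT, so "MAX_SYSTEM_THROUGHPUT / 5"
# becomes 1/5 and the capacity for B is 1.0.
SERVER_CAPACITY = (100.0, 100.0, 8192.0, 3072.0, 1.0)

GAME_TUPLES = {
    "NFS":     (9.15, 2.01, 526, 220, 1 / 5),
    "Scrolls": (14.55, 7.02, 795, 560, 1 / 3),
    "Combat":  (8.47, 3.27, 800, 296, 1 / 5),
    "Birds":   (10.0, 1.1, 181, 142, 6.54 / 37.8),  # assumed saturation ~37.8
}

def fits(games, capacity=SERVER_CAPACITY):
    """True iff, for every resource, the summed demand stays within capacity."""
    return all(
        sum(GAME_TUPLES[g][k] for g in games) <= capacity[k]
        for k in range(5)
    )
```

This reproduces the worked example that follows: one Scrolls, one Combat and two NFS fit (their B fractions sum to 1/3 + 1/5 + 2/5 ≈ 0.93), while adding a third NFS pushes B past 1.0.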
For example, one Scrolls, one Combat and two NFS can run at the same time; if an extra NFS joins, this condition will not be met and the bottleneck is B. Quite a few tests with real games will be given in Section 5.1 to verify this design.

4.5 The Scheduling Strategy

In conclusion, the following procedure for task assignment is used, which contains two stages.

Ready stage: when a game is brought online, it is tested to get its resource requirements. Then, for any game (denoted as Game_i), a tuple <U_CPU, U_GPU, M_HOST, M_GPU, B>_game_i can be given to represent its requirements. In addition, for any Server_j, its capacity is denoted as <U_CPU, U_GPU, M_HOST, M_GPU, B>_server_j. The corresponding test process has been described in the previous paragraphs, and each element is labeled with the corresponding maximum capacity.

Runtime stage: during run time, the current resource consumptions of each server (denoted as <U_CPU, U_GPU, M_HOST, M_GPU, B>_server_j_cur; in our prototype, the average values of the latest one minute are used) are sampled periodically.

Moreover, the main goal of our scheduling strategy is to minimize the number of servers used, which can be regarded as a bin-packing problem. Several theoretical studies [33], [34] have shown that the First-fit and Best-fit algorithms behave well for this problem, especially for the online version with requests inserted in random order [34]. Thus, we have designed two heuristic task-scheduling algorithms based on the well-known First-fit and Best-fit strategies, namely first-fit-like (FFL) and best-fit-like (BFL). The principle is straightforward; thus we only give their outlines here:

In FFL, for a given request of game_i, all servers are checked in order; if one server (for example, server_j) can hold the new game, meaning that each kind of resource consumption of all games on server_j (including game_i) does not exceed the capacity, the algorithm ends successfully.
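The FFL outline above can be sketched as follows; the `fits` predicate stands in for the per-resource capacity check of Section 4.4.3, and all names are illustrative:

```python
def first_fit(request, servers, fits):
    """FFL: scan servers in order and place the game on the first server
    that can still hold it; otherwise activate a new server.

    `servers` is a list of per-server game lists; `fits` is a feasibility
    predicate over one server's game list (hypothetical signature).
    """
    for assigned in servers:
        if fits(assigned + [request]):
            assigned.append(request)  # first feasible server wins
            return servers
    servers.append([request])  # no existing server fits: start a new one
    return servers
```

The best-fit variant keeps the same feasibility test but, among all feasible servers, picks the one that would leave the least remaining critical resource.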
In BFL, the procedure is similar. The difference is that, if there is more than one suitable server, the one that would leave the least critical resource is chosen.

4.5.1 Tests with Artificial Traces

We have simulated our algorithms in two situations:

1) Several requests of the four games come simultaneously and must be dispatched instantly, namely, in the batch-processing mode.
2) Requests come one by one. The request sequence follows a Poisson process with a mean time interval of 5 seconds; the duration of each game also follows a Poisson process and the mean time is 40 minutes.

In both situations, we assume that there are enough servers and that each has an initial resource usage of <10, 5, 3096, 512, 0> (gathered from our real servers). Thus, we can start a new server whenever needed. Moreover, from the aspect of resource usage, we mainly focus on the number of servers used by each algorithm.

For the first situation, we have compared our algorithms with three others:

Size-based task assignment (STA) [35]: This algorithm is widely used in distributed systems; all tasks within a given size range of resource requirements are assigned to a particular server. Specific to our case, two types of servers (for CPU-critical and memory-IO-critical games respectively) are designated.

Packing algorithm (PA): A greedy algorithm. Each server is assigned as many games as possible until all the games have been dispatched.

Dominant resource fairness (DRF) [36]: A fair-sharing model that generalizes max-min fairness to multiple resource types. In our implementation, the collection of all currently used servers (called small servers) is regarded as one big server. Whether the big server can satisfy an incoming request depends on whether there exists such a small server. If not, a new small server is added to enlarge the big one.
The scheduling strategy inside the big server is First-fit, and all gaming requests are considered to be issued by different users.

We also estimate the ideal server number for reference. For each kind of resource (denoted by s), the minimum server number is Σ_{i=1..n} R_i^s / R^s, where n is the total number of game requests, R_i^s denotes the s-resource utilization of the i-th game, and R^s is the corresponding resource capacity of a server. The maximum of these per-resource minimums is the ideal number.

In the second situation, our algorithms have been compared only with the STA algorithm, because the others require information about the whole request sequence (which is unavailable in this case) and would reduce to the FFL.

TABLE 2
Resource-Requirements of Games

Game     Tuple                                                  Game type
NFS      <9.15%, 2.01%, 526, 220, MAX_SYSTEM_THROUGHPUT / 5>    memory-IO-critical
Scrolls  <14.55%, 7.02%, 795, 560, MAX_SYSTEM_THROUGHPUT / 3>   memory-IO-critical
Combat   <8.47%, 3.27%, 800, 296, MAX_SYSTEM_THROUGHPUT / 5>    memory-IO-critical
Birds    <10%, 1.1%, 181, 142, 6.54>                            CPU-critical

Fig. 5. Server-numbers in Situation 1.

Simulation results of Situation 1 are given in Fig. 5. The y-axis stands for the needed server numbers (for clarity,
values have been normalized) as several requests arrive simultaneously (the request number is shown on the x-axis). We can see that, compared with the others, the heuristic algorithms are quite good. Even against the ideal number, our algorithms are really close to the optimum (the maximal value is 101.23 percent). Moreover, these two algorithms perform almost equally in all cases.

Fig. 6 shows the number of requested servers when requests arrive in sequence (Situation 2). We can see that our heuristic algorithms are more efficient than the STA. These two algorithms also perform similarly in all cases: compared with the BFL algorithm, the extra resources consumed by the FFL are less than 3.6 percent (57 vs. 55 servers).

Finally, results show that FFL is about 20 percent faster than BFL, while both are fast enough (in the batch-processing mode, both can complete the task assignment in several milliseconds when the request number is 1,000).

4.5.2 Tests with Real Game-Traces

To further evaluate the proposed task-scheduling strategies, we conduct a trace-driven simulation of a large-scale cluster (a similar simulation method has been used in [37]); each server is the same as the one presented in Section 4.4. The dataset we used is the World of Warcraft history dataset provided by Yeng-Ting Chen et al. [38]. Although this dataset is based on the MMORPG "World of Warcraft", we think it is useful in our case because cloud gaming and MMORPGs share many similarities, such as wide variations in gaming time, a huge bandwidth demand and a large number of concurrent users. Of course, necessary pre-processing is introduced to make the dataset more suitable: we have mapped the first four races in the dataset (Blood Elf, Orc, Tauren and Troll) to the four kinds of games in our system, and the remaining one (the Undead) is mapped to one of these four games randomly.
In detail, we have used traces of three months that consist of 396,631 game requests (details are shown in Table 3). Accordingly, a cluster of 200 servers has been simulated, in which the master node collects the resource utilization of all servers every minute. Because the previous tests have shown that BFL and FFL perform similarly, we have only tested the BFL scheduling policy here.

Fig. 7 shows the numbers of running game instances, activated servers and used servers (once it is used, a server is regarded as a used server no matter whether it is currently activated or not); there is an obvious linear relationship between the number of game instances and the number of activated servers. What's more, the average number of activated servers is 64, which is significantly less than the maximum number of used servers (152). This means that the scheduling efficiency is good; it also means server consolidation [37] can be used to further reduce the number of servers.

Fig. 8 shows the average resource utilizations of activated servers for each day. Although the utilization rates of the other resources are relatively low, the bandwidth's is high. This confirms that most games are memory-IO-critical, which accords with our performance model.

We have completed another simulation, in which the server number is infinite, to illustrate the relationship between the total number of used servers and the update interval of resource utilization. Fig. 9 shows the relationship; we can see that when the update interval is less than 20 minutes, the number of used servers varies slightly. When the interval is larger, the number increases significantly. This means that we could use a longer update interval with very limited impact on system efficiency. It is also helpful for managing a large-scale cloud gaming system, because message exchanges between server agents and the manager will be reduced markedly.

Fig. 6. Server-numbers in Situation 2.
TABLE 3
Details of the Dataset

Parameter                                        Value
Simulated period                                 3 months
Server number                                    200
Total game requests                              396,631
Maximum game requests arriving simultaneously    227
Maximum game instances running simultaneously    757
Average lifetime of game instances               85 minutes
Average interval between game requests           3 minutes

Fig. 7. Running games and servers of each day.
4.6 Discussions

4.6.1 Different Game Configurations and/or Heterogeneous Servers

The above work is targeted at specific hardware and games, and we believe the method is practical: it is reasonable to assume that any game should be tested fully before going online; thus the resource requirements of each game can be measured on a given server whose hardware configuration will remain unchanged for a long period.

If heterogeneous servers are used, since we have found that the host CPU or the memory bus is the system bottleneck, new servers' capacities can also be derived from a comparison of the CPU performance and system bandwidth of reference servers and new servers (these metrics may have been labeled by the producer or can be tested), which avoids the exponentially growing complexity of testing. Appendix B, available in the online supplemental material, gives an example showing that the capability of a new server for known games is predictable, and then summarizes the prediction method.

For different game configurations, the situation is more complicated. Even if only the resolution differs, tests show that there is no obvious relationship between the resolution and resource consumptions, although the consumption of our framework itself (like encoding and image capture) is proportional to the resolution.

Therefore, our solution is: during the real service period, such configurations can be evaluated online first. For example, we can schedule the same game with the same configuration to some dedicated server(s) if a user has demanded it. With the accumulation of game runs, the metrics will become more accurate.

4.6.2 Time-Dependent Factors

We use average values to denote the resource requirements of a given game. In reality, requirements are time-dependent and may vary in different gaming stages.
However, we believe average values are enough owing to the following facts:

1) The degree of variation depends heavily on the time granularity. Our tests show that the degree becomes smaller as the time interval increases. When the time interval is 30 s (in Appendix C, available in the online supplemental material), the variation of requirements is relatively small.
2) Considering resource consolidation of multiple concurrently running games, the use of average values is reasonable.

Moreover, it is necessary to note that some games take a very long time to finish. Thus, in our experimental environment, it is difficult to explore plenty of scenes. However, such a game can be evaluated online first for data accumulation (as we have mentioned above).

5 IMPLEMENTATION AND EVALUATION

5.1 Implementation

We have implemented the cloud gaming system based on the user-level virtualization technology. Eight PC servers are connected by a Gigabit Ethernet; their configurations are the same as the one in Section 4.4. Detours [39] has been used to implement the required interception functions. In detail, we have implemented a DLL (called gamedll) that can be injected into any gaming process to wrap all interesting APIs and to spawn two threads, for input reception and data encoding/streaming respectively.

Now our virtualization layer can stream Direct3D games, OpenGL games and Flash games to Windows, iOS and Android clients, and receive remote operations. The UDT (UDP-based Data Transfer) protocol [40] is used to deliver the video/audio/operation data between the server and client.

We use the periodic video capture as the timing reference on the server side; any audio data between two consecutive video-capture timestamps is delivered with the current video data. To be specific, the Windows Audio Session APIs provide interfaces to create and manage audio streams to and from audio devices. Our interception replicates such stream buffers.
After the current image has been captured, the audio data between the current read/write positions of the buffer (the read position is just the current playback position) is copied out immediately and sent out with the current image. This method achieves video/audio synchronization and limits the timing discrepancy to roughly the reciprocal of the FPS value.

As mentioned in Section 4.1, an exception is that games may decrease the FPS deliberately in some scenes, which will cause larger timing discrepancies. To remedy this

Fig. 8. Resource-utilizations of activated servers.

Fig. 9. Used servers of different update-intervals.
situation, a dedicated timer has been introduced to trigger audio transmission only if the current interval between successive frames is longer than a threshold.

Moreover, on the client side, to smooth the playback of received audio, one extra audio buffer is managed by the cloud-gaming client software. Any received audio is first stored into this buffer, appended to the existing data. When the whole buffer has been filled, all of it is copied to the playback device. Thus, combined with the default buffer of the playback device, it constructs a double-buffering mechanism, which can parallelize playback and reception and thereby smooth the playback. Therefore, any audio data will be delayed for some time: in our system, the length of this buffer is set to hold 200 ms of audio data, which makes the playback smooth. Results are given in the next section.

5.2 Evaluation

The test environment and configurations are the same as those in Section 4.4, as well as the testing method.

5.2.1 Overheads of the User-Level Virtualization Technology Itself

We execute a game on a physical machine directly and record the game speed (in terms of the average FPS) and the average memory consumption. Then, this game is run in the user-level virtualization environment (all related APIs have been intercepted but no real work, like image capture, encoding, etc., has been enabled) and in a virtual machine respectively; the same runtime information is recorded again. The latest VMware Player 6 is employed and both the host/guest OSes are Windows 7.

The comparison is shown in Fig. 10 (for clarity, values have been normalized). Considering the GPU utilization, the user-level technology itself introduces almost no performance loss, while the VM-based solution's efficiency is a little lower, about 90 percent of the native.
On the other side, the memory consumption of the VM-based solution is 2.4 times that of the native, because the memory occupied by the guest OS is considerable. For the user-level solution, this consumption is almost the same as the native, too.

5.2.2 Processing Performance of the Server

The processing procedure of a cloud-gaming instance can be divided into four parts: (1) image capture, which copies a rendered frame into the system memory, (2) video encoding, (3) transferring, which sends each compressed frame into the network, and (4) the game-logic processing and rendering. The last one mainly depends on the concrete game, while GCloud handles the others. Thus the first three are the object of this test, and the sum of their delays is denoted as SD (server delay).

Moreover, we intend to find the performance limit. Hence only one instance is running on a server and the "try the best" strategy is used: no Sleep call has been inserted, so the games can run as fast as possible. Existing work [3] has performed a similar test for GamingAnywhere and OnLive, so we can compare our results with theirs. Although the games tested in [3] are different, we believe the comparison is meaningful because the server delay is largely independent of specific games.

Fig. 11 reports the average SD of three video games under different resolutions. The corresponding FPS is in

Fig. 10. Comparison of resource consumption.

Fig. 11. Processing performance and the decomposition (three resolutions).
Fig. 12. The average value at 720P is given in Fig. 13, as well as the corresponding values of GamingAnywhere and OnLive (values have been normalized). Results show that, compared with similar solutions, GCloud achieves smaller SDs (ranging from 8 ms to 19 ms), which are positively correlated with resolution. We attribute this mainly to the high encoding performance of Quick Sync. In contrast, the encoding delay of GamingAnywhere is about 14–16 ms per frame.

The transferring latency is smaller than the others by two orders of magnitude. Even in the following cases of multiple games, this still holds. Thus, the transferring latency can be neglected, as we proposed in Section 4.

5.2.3 Multiple Games

The "just good enough" strategy is used; a Sleep call fixes the FPS. First, an OpenGL game and three Direct3D games have been played one by one and the processing delay (including the sleep time) is sampled periodically; the sample period is one frame. Second, quite a few game combinations, each including more than one game, have been executed and sampled. Without loss of generality, FPS values of some game combinations played simultaneously are presented in Table 4, as well as the average absolute deviations (AADs). These combinations are:

Case 1: two NFS instances;
Case 2: one NFS, one Combat and one Scrolls;
Case 3: two NFS, one Combat and one Scrolls;
Case 4: one NFS, one Combat, one Scrolls and two Birds.

On the whole, the average FPS ranges from 30.5 to 31.5 when one game is running alone. The average absolute deviations are 0.10 (Birds), 0.11 (NFS), 0.15 (Combat) and 1.47 (Scrolls) respectively, which means the FPS value is fairly stable. Of course, there are quite a few delay fluctuations. These usually mean the corresponding game scenes are changing rapidly, which is common for highly interactive games, especially for Scrolls.
As the number of concurrently running games increases (meaning more interference between games), the FPS values decrease correspondingly while the average absolute deviations increase: for Scrolls, with three games running at the same time (Case 2), its average FPS is 28.3 and the AAD is 2.13; for four instances (Case 3), the values are 27.8 and 2.98 respectively. For Combat, with three games running simultaneously, the average FPS is 29.2 and the AAD is 0.89; for four, the values are 28.8 and 1.59 respectively.

For the uncertainties of the FPS values, we believe the main reasons lie in two aspects:

1) There is interference among the running instances, including resource contention, which makes resource consumption not totally linear in the number of instances (as illustrated in Fig. 4). For example, Scrolls consumes the most resources; thus its uncertainty is the biggest.
2) As mentioned in Section 4.6, resource requirements of games are time-dependent and may vary in different stages, which also causes some uncertainty.

In summary, the system achieves a satisfactory gaming effect and the FPS can be kept relatively stable when multiple games are running simultaneously.

5.2.4 Verification of the Performance Model

According to the results of the performance model and scheduling strategy, we test several typical server loads for verification. Without loss of generality, the following cases are presented.

1) One Scrolls, one Combat and two NFS. As presented in Table 5 (first row), the FPS value of each game is more than 27 and the lowest is Scrolls's, about 27.1. All are not less than 90 percent of the FIXED_FPS (30); thus they are acceptable. Because the system-RAM bandwidth has been nearly exhausted (about 93 percent of the MAX_SYSTEM_THROUGHPUT), when another game joins (regardless of whether it is NFS or Birds), the FPS of Scrolls will drop below the acceptable level.

Fig. 12. FPS of games.

Fig. 13.
Comparison of the processing delay (1280 × 720; the lower the better).

TABLE 4
FPS Values and Average Absolute Deviations of Different Numbers of Running Games

Game / Case      1     2     3     4
NFS      FPS   30.2  30.3  30.2  30.2
         AAD   0.18  0.24  0.44  0.70
Combat   FPS   N/A   29.2  28.8  28.6
         AAD   N/A   0.89  1.59  1.89
Scrolls  FPS   N/A   28.3  27.8  27.3
         AAD   N/A   2.13  2.98  3.30
Birds    FPS   N/A   N/A   N/A   29.8
         AAD   N/A   N/A   N/A   0.56
2) One Scrolls, one Combat, one NFS and three Birds. For this case, the sum of each kind of resource consumption is less than the corresponding system capacity; the relative maximum is the sum of memory throughputs, about 95 percent of the MAX_SYSTEM_THROUGHPUT. In Table 5 (second row), the FPS value of each game is more than 27.
3) One NFS, two Combat and five Birds.
4) Three NFS and five Birds.

In Cases 3 and 4, the sum of memory throughputs is about 96 percent of the MAX_SYSTEM_THROUGHPUT. As the sum of each kind of resource consumption is less than the corresponding system capacity, the FPS value of each game is still more than 27.

5.2.5 Discrepancy between Video and Audio

We have designed a method to calculate this discrepancy: on the server, some sequences of full-black images are inserted into the video stream to replace original scenes; at the same time, mute data replaces the corresponding audio data, too. On the client, a screen-recording software runs alongside the gaming client. Thus, through analysis of the audio/video streams of the recorded data, we can get the timestamps of the beginnings of the inserted video/audio sequences respectively. Then, the discrepancies can be calculated. Results show that these values are in the range of 180–410 ms (Table 6).

We think the reasons lie in the following, besides the preset delays aforementioned:

1) The delay fluctuations of games. The corresponding FPS values will be less than 30, which will increase the timing discrepancy, because the accumulation of audio data will be slowed.
2) The network's delay fluctuations. They will increase the timing discrepancy, too. Our tests were carried out on campus; we believe that, over the Internet, this factor will cause more delay.
3) The measurement error. The recording software records the screen periodically (30 FPS), while the audio recording is continuous.
Thus, the beginnings of some sequences of full-black images may be lost, which will decrease the measured gap.

6 CONCLUSIONS AND FUTURE WORK

This paper proposes GCloud, a GPU/CPU hybrid cluster for cloud gaming based on the user-level virtualization technology. We focus on the guideline of task scheduling: to balance gaming responsiveness and costs, we fix each game's FPS to allocate just enough resources, which can also mitigate the interference between games. Accordingly, a performance model has been analyzed to explore the server capacity and the games' resource demands, which can locate the performance bottleneck and guide the task scheduling based on games' critical resource demands. Comparisons show that both the First-fit-like and Best-fit-like scheduling strategies outperform others. Moreover, they are near-optimal in the batch-processing mode. In the future, we plan to enhance the performance models to support heterogeneous servers.

ACKNOWLEDGMENTS

The work is supported by the High Tech. R&D Program of China under Grant No. 2013AA01A215.

REFERENCES

[1] R. Shea, L. Jiangchuan, E. C.-H. Ngai, and C. Yong, “Cloud gaming: Architecture and performance,” IEEE Netw., vol. 27, no. 4, pp. 16–21, Jul./Aug. 2013.
[2] Z. Zhao, K. Hwang, and J. Villeta, “GamePipe: A virtualized cloud platform design and performance evaluation,” in Proc. ACM 3rd Workshop Sci. Cloud Comput., 2012, pp. 1–8.
[3] C.-Y. Huang, C.-H. Hsu, Y.-C. Chang, and K.-T. Chen, “GamingAnywhere: An open cloud gaming system,” in Proc. ACM Multimedia Syst., Feb. 2013, pp. 36–47.
[4] R. Phull, C.-H. Li, K. Rao, S. Cadambi, and S. T. Chakradhar, “Interference-driven resource management for GPU-based heterogeneous clusters,” in Proc. 21st ACM Int. Symp. High Perform. Distrib. Comput., 2012, pp. 109–120.
[5] V. T. Ravi, M. Becchi, G. Agrawal, and S. T. Chakradhar, “Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework,” in Proc. 20th ACM Int. Symp. High Perform.
Distrib. Comput., 2011, pp. 217–228.
[6] G. A. Elliott and J. H. Anderson, “Globally scheduled real-time multiprocessor systems with GPUs,” Real-Time Syst., vol. 48, no. 1, pp. 34–74, 2012.
[7] L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, “Dynamic load balancing on single- and multi-GPU systems,” in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2010, pp. 1–12.
[8] M. Yu, C. Zhang, Z. Qi, J. Yao, Y. Wang, and H. Guan, “GRIS: Virtualized GPU resource isolation and scheduling in cloud gaming,” in Proc. 22nd Int. Symp. High-Perform. Parallel Distrib. Comput., 2012, pp. 203–214.
[9] C. Zhang, J. Yao, Z. Qi, M. Yu, and H. Guan, “vGASA: Adaptive scheduling algorithm of virtualized GPU resource in cloud gaming,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 11, pp. 3036–3045, 2014.
[10] M. Claypool and K. Claypool, “Latency and player actions in online games,” Commun. ACM, vol. 49, no. 11, pp. 40–45, 2006.
[11] D. C. Barboza, V. E. F. Rebello, E. W. G. Clua, and H. Lima, “A simple architecture for digital games on demand using low performance resources under a cloud computing paradigm,” in Proc. Brazilian Symp. Games Digital Entertainment, 2010, pp. 33–39.
[12] D. De Winter, P. Simoens, and L. Deboosere, “A hybrid thin-client protocol for multimedia streaming and interactive gaming applications,” in Proc. Int. Workshop Netw. Oper. Syst. Support Digital Audio Video, 2006, p. 15.

TABLE 5
FPS of Concurrently-Running Games

TABLE 6
Discrepancy Values on the Client Side

Game     Minimum  Maximum  Average
NFS      205 ms   395 ms   287 ms
Scrolls  213 ms   410 ms   323 ms
Combat   196 ms   336 ms   278 ms
Birds    180 ms   275 ms   242 ms
[13] W. Yu, J. Li, C. Hu, and L. Zhong, "Muse: A multimedia streaming enabled remote interactivity system for mobile devices," in Proc. 10th Int. Conf. Mobile Ubiquitous Multimedia, 2011, pp. 216-225.
[14] L. Shi, H. Chen, and J. Sun, "vCUDA: GPU accelerated high performance computing in virtual machines," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2009, pp. 1-11.
[15] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Orti, "rCUDA: Reducing the number of GPU-based accelerators in high performance clusters," in Proc. Int. Conf. High Perform. Comput. Simul., 2010, pp. 224-231.
[16] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan, "GViM: GPU-accelerated virtual machines," in Proc. ACM Workshop Syst.-Level Virtualization High Perform. Comput., 2009, pp. 17-24.
[17] M. Bautin, A. Dwarakinath, and T.-C. Chiueh, "Graphic engine resource management," in Proc. 15th Multimedia Comput. Netw., 2008, pp. 15-21.
[18] D. Wu, Z. Xue, and J. He, "iCloudAccess: Cost-effective streaming of video games from the cloud with low latency," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 8, pp. 1405-1416, 2014.
[19] H.-J. Hong, D.-Y. Chen, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu, "Placing virtual machines to optimize cloud gaming experience," IEEE Trans. Cloud Comput., vol. 3, no. 1, pp. 42-53, Jan.-Mar. 2015.
[20] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, "TimeGraph: GPU scheduling for real-time multi-tasking environments," in Proc. USENIX Annu. Tech. Conf., 2011, p. 2.
[21] L. Cherkasova and L. Staley, "Building a performance model of streaming media application in utility data center environment," in Proc. 3rd IEEE/ACM Int. Symp. Cluster Comput. Grid, 2003, pp. 52-59.
[22] V. Ishakian and A. Bestavros, "MORPHOSYS: Efficient colocation of QoS-constrained workloads in the cloud," in Proc. 12th IEEE/ACM Int. Symp. Cluster, Cloud Grid Comput., 2012, pp. 90-97.
[23] S. Wang and S. Dey, "Rendering adaptation to address communication and computation constraints in cloud mobile gaming," in Proc. IEEE Global Telecommun. Conf., Dec. 2010, pp. 1-6.
[24] D. Klionsky, "A new architecture for cloud rendering and amortized graphics," M.S. thesis, School Comput. Sci., Carnegie Mellon Univ., CMU-CS-11-122. [Online]. Available: http://reports-archive.adm.cs.cmu.edu/anon/2011/abstracts/11-122.html
[25] A. Jurgelionis, P. Fechteler, P. Eisert, F. Bellotti, and H. David, "Platform for distributed 3D gaming," Int. J. Comput. Games Technol., vol. 2009, p. 1, 2009.
[26] A. Ojala and P. Tyrvainen, "Developing cloud business models: A case study on cloud gaming," IEEE Softw., vol. 28, no. 4, pp. 42-47, Jul. 2011.
[27] S.-W. Chen, Y.-C. Chang, P.-H. Tseng, C.-Y. Huang, and C.-L. Lei, "Measuring the latency of cloud gaming systems," in Proc. 19th ACM Int. Conf. Multimedia, 2011, pp. 1269-1272.
[28] S. Choy, B. Wong, G. Simon, and C. Rosenberg, "The brewing storm in cloud gaming: A measurement study on cloud to end-user latency," in Proc. 11th Annu. Workshop Netw. Syst. Support Games, 2012, p. 2.
[29] Y.-T. Lee, K.-T. Chen, H.-I. Su, and C.-L. Lei, "Are all games equally cloud-gaming-friendly? An electromyographic approach," in Proc. IEEE/ACM NetGames, 2012, pp. 109-120.
[30] K.-T. Chen, Y.-C. Chang, H.-J. Hsu, D.-Y. Chen, C.-Y. Huang, and C.-H. Hsu, "On the quality of service of cloud gaming systems," IEEE Trans. Multimedia, vol. 16, no. 2, pp. 480-495, Feb. 2014.
[31] Y. Zhang, X. Wang, and L. Hong, "Portable desktop applications based on P2P transportation and virtualization," in Proc. 22nd Large Installation Syst. Administration Conf., 2008, pp. 133-144.
[32] P. Guo, "CDE: Run any Linux application on-demand without installation," in Proc. 25th USENIX Large Installation Syst. Administration Conf., 2011, p. 2.
[33] B. Xia and Z. Tan, "Tighter bounds of the first fit algorithm for the bin-packing problem," Discrete Appl. Math., vol. 158, no. 15, pp. 1668-1675, 2010.
[34] C. Kenyon, "Best-fit bin-packing with random order," in Proc. 7th Annu. ACM-SIAM Symp. Discrete Algorithms, 1996, pp. 359-364.
[35] M. Harchol-Balter, M. E. Crovella, and C. D. Murta, "On choosing a task assignment policy for a distributed server system," J. Parallel Distrib. Comput., vol. 59, no. 2, pp. 204-228, 1999.
[36] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in Proc. 8th USENIX Symp. Netw. Syst. Des. Implementation, 2011, pp. 323-336.
[37] Y.-T. Lee and K.-T. Chen, "Is server consolidation beneficial to MMORPG? A case study of World of Warcraft," in Proc. IEEE 3rd Int. Conf. Cloud Comput., 2013, pp. 435-442.
[38] Y.-T. Lee, K.-T. Chen, Y.-M. Cheng, and C.-L. Lei, "World of Warcraft avatar history dataset," in Proc. 2nd Annu. ACM Multimedia Syst., Feb. 2011, pp. 123-128.
[39] G. Hunt and D. Brubacher, "Detours: Binary interception of Win32 functions," in Proc. 3rd USENIX Windows NT Symp., Jul. 1999, p. 14.
[40] Y. Gu and R. L. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Comput. Netw., vol. 51, no. 7, May 2007.

Youhui Zhang received the BSc and PhD degrees in computer science from Tsinghua University, China, in 1998 and 2002, respectively. He is currently a professor in the Department of Computer Science, Tsinghua University. His research interests include computer architecture, cloud computing, and high-performance computing. He is a member of the IEEE and the IEEE Computer Society.

Peng Qu received the BSc degree in computer science from Tsinghua University, China, in 2013. He is currently working toward the PhD degree in the Department of Computer Science, Tsinghua University, China. His interests include cloud computing and micro-architecture.

Cihang Jiang received the BSc degree in computer science from Tsinghua University, China, in 2013. He is currently a master's student in the Department of Computer Science, Tsinghua University, China. His research interest is cloud computing.

Weimin Zheng received the BSc and MSc degrees in computer science from Tsinghua University, China, in 1970 and 1982, respectively. He is currently a professor in the Department of Computer Science, Tsinghua University, China. His research interests include high performance computing, network storage, and distributed computing. He is a member of the IEEE and the IEEE Computer Society.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.

IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, May 2016.
