7. But what could you do if all objects were intelligent… and connected?
8. What could you do with unlimited computing power… for pennies? Could you predict the path of a storm down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
21. SPE BLOCK DIAGRAM [Figure: execution units (Permute Unit, Load-Store Unit, Floating-Point Unit, Fixed-Point Unit, Branch Unit, Channel Unit); Result Forwarding and Staging; Register File; Local Store (256kB, Single Port SRAM, 128B Read, 128B Write); DMA Unit; Instruction Issue Unit / Instruction Line Buffer; On-Chip Coherent Bus. Data path widths shown: 8, 16, 64 and 128 Byte/Cycle.]
42. Multigrid Finite Element Solver on Cell using the free SDK. Ported by: www.digitalmedics.de, ls7-www.cs.uni-dortmund.de. 235 584 tetrahedra, 48 000 nodes, 28 iterations in NKMG solver, in 3.8 seconds. Sustained performance for large objects: 52 GFLOP/s.
43. Computational Fluid Dynamics Solver on Cell using the free SDK. Ported by: www.digitalmedics.de, ls7-www.cs.uni-dortmund.de. Sustained performance for large objects: not yet benchmarked (3/2007).
44. Computational Fluid Dynamics Solver on Cell: a Lattice-Boltzmann solver developed by Fraunhofer ITWM, http://www.itwm.fraunhofer.de/
45. Terrain Rendering Engine (TRE) and IBM Blades. Systems and Technology Group. [Diagram labels: Aircraft data / Field Data; Combine Data & Render; Commodity Cell BE Blade; BladeCenter-1 Chassis QS20; Add Live Video, Aerial Information, Combat Situational Awareness; Next-Gen GCS]
46. Example: Medical Computer Tomography (CT) Scans. Image whole heart in 1 rotation. 4D CT: includes time. [Figure: slice counts of 2, 4, 8, 16, 32, 64, 128 and 256 slices, spanning Current CT Products to Future CT Products]
47. “Image Registration” Using Cell. The moving image is aligned to the fixed image as the registration proceeds. [Figure labels: Fixed Image, Moving Image, Registration Process]
VMX (AltiVec): SIMD instructions on IBM PowerPC processors. Less speculative logic.
The switch does not exist yet.
Dr. V. S. Pande, Distributed Computing Project, Stanford University (permission given for showing the video as well)

Folding@Home on the PS3: the Cure@PS3 project

INTRODUCTION

Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. FAH has targeted the study of protein folding and protein folding disease, and numerous scientific advances have come from the project.

Now in 2006, we are looking forward to another major advance in capabilities. This advance utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology (as well as new advances with GPUs), we will likely be able to attain performance on the 100 gigaflop scale per computer. With about 10,000 such machines, we would be able to achieve performance on the petaflop scale.

With software from Sony, the PlayStation 3 will now be able to contribute to the Folding@Home project, pushing Folding@Home a major step forward. Our goal is to apply this new technology to push Folding@Home into a new level of capabilities, applying our simulations to further study of protein folding and related diseases, including Alzheimer’s Disease, Huntington's Disease, and certain forms of cancer. With these computational advances, coupled with new simulation methodologies to harness the new techniques, we will be able to address questions previously considered impossible to tackle computationally, and make even greater impacts on our knowledge of folding and folding related diseases.

ADVANCED FEATURES FOR THE PS3

The PS3 client will also support some advanced visualization features. While the Cell microprocessor does most of the calculation processing of the simulation, the graphic chip of the PLAYSTATION 3 system (the RSX) displays the actual folding process in real-time using new technologies such as HDR and ISO surface rendering. It is possible to navigate the 3D space of the molecule using the interactive controller of the PS3, allowing us to look at the protein from different angles in real-time.

For a preview of a prototype of the GUI for the PS3 client, check out a screenshot or one of these videos (355K avi, 866K avi, 6MB avi, 6MB avi; more videos and formats to come). There is also a "bootleg" video of Sony's presentation on FAH that is now on YouTube (although the audio and video quality is pretty bad).

http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
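The petaflop figure in the Folding@Home introduction above follows from simple multiplication. A minimal sketch, assuming perfectly linear scaling across machines (an idealization; the helper name `aggregate_flops` is illustrative and not part of any FAH software):

```python
# Figures quoted in the Folding@Home notes; perfect scaling is an assumption.
PER_MACHINE_FLOPS = 100e9   # "100 gigaflop scale per computer"
MACHINES = 10_000           # "about 10,000 such machines"


def aggregate_flops(per_machine_flops, machines):
    """Total throughput if every machine contributes its full rate."""
    return per_machine_flops * machines


total = aggregate_flops(PER_MACHINE_FLOPS, MACHINES)
print(f"Aggregate: {total / 1e15:.1f} PFLOP/s")
```

In practice the sustained total depends on how many clients are online at once and on how well the workload uses each machine, so this is an upper bound rather than a prediction.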
Cell Blade systems compute and compress images. These images are then delivered via the network to clients for decompression and display. The GPStream framework can be used to deliver the images to mobile clients via wireless. This is really an example of situational awareness. In this specific case, the Predator Unmanned Aerial Vehicle has a small camera mounted in the nose (the blue circle would be live video); the surroundings would be rendered for the remote pilot to help them avoid turning into a mountain or a no-fly zone. We think this is also valid for commercial aircraft at night, in poor weather, etc.
An experienced physician can read a great deal from cross-sectional images. But three-dimensional, dynamic images, i.e. images that include the factor of time, open up entirely new diagnostic possibilities.

Medical imaging is another area that is progressing rapidly and creating a new, more demanding workload. Today an average exam generates 1 GByte of data; you can’t go to the future and add time-dependent analysis without an application-optimized system.

- An average exam generates 1 GByte of data (for one digital x-ray or simple CT scan; much more for complicated CT or MRI studies)
- We estimate that 10^2-10^4 floating point operations are used to capture, process and analyze a byte of medical data
- So, a typical exam requires 10^11-10^13 operations
- Assume an exam must be completed in “real time” (5 minutes?) to be of diagnostic use
- This requires 0.3-33 GF/s of compute power, delivered today by single-processor Intel workstations
- Scanner technology will rapidly evolve to generate 10-20x the amount of data in the same scan time

Sixteen Slice CT Scanner
- 600-2000 slices per exam
- 300 MB - 1 GB per exam

CT scan workflow (typical helical scan, multi-slice acquisition)
Stage 1: Interpolate data to generate equivalent “step-and-shoot” slices
Stage 2: Filtered back-projection to generate 2D slice view (Fourier filter + numerical integration)
Stage 3: Volume rendering (optional; many radiologists prefer to look at slices, but with increasing resolution/slice count, it may become mandatory)
Note (1) Stage 2 should be trivially parallelizable (scale out)
Note (2) An increase in the number of slices acquired simultaneously increases the computational cost of “cone-effect” corrections
Note (3) There are claims that improved algorithms can reduce the computational burden enormously (UIUC Technology Licensing Office)

Example: 313 MB of raw scan data, 5 x 1 MB images (cross-sections?). Each image takes 19 seconds to process on a 3GHz Wintel box. A high-resolution 3000-slice run (from machines like the new Siemens Somatom 64) might take ~16 hours to process on such a commodity system. Note that the 3 GB of 2D image data can be accommodated within main memory.

PV-4D (www.pv-4d.com)
- Showcase at Supercomputing 2005 / Cebit 2006
- About 4 times faster than Opteron with same algorithm
- If fully optimized, projected to be more than 6 times faster than Opteron
- Last-minute prototype running on four Cell blades
- Stereo display using shutter glasses, 8-10 frames per second
  - Achieving this frame rate using two blades at a time
  - Four blades required for data set size
- Data sets about 1.6GB in size
  - Beating heart (400x400x400 voxels, 6 samples)
  - CFD simulation (~600x200x100 voxels, 40 samples)
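The medical-imaging estimates in the notes above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted there; the function names are illustrative, and it assumes that each slice of the 3000-slice run costs the same 19 seconds as one image in the 313 MB example and is processed serially.

```python
# Figures from the notes; function names and the serial-processing
# assumption are illustrative, not taken from any vendor software.
BYTES_PER_EXAM = 1e9            # average exam: 1 GByte
FLOP_PER_BYTE = (1e2, 1e4)      # estimated range of operations per byte
SECONDS_PER_EXAM = 5 * 60       # "real time" taken as 5 minutes


def required_gflops(bytes_per_exam, flop_per_byte, seconds):
    """Sustained GFLOP/s needed to finish one exam within the time budget."""
    return bytes_per_exam * flop_per_byte / seconds / 1e9


def run_hours(slices, seconds_per_image):
    """Wall-clock hours if slices are processed one after another."""
    return slices * seconds_per_image / 3600


low, high = (required_gflops(BYTES_PER_EXAM, f, SECONDS_PER_EXAM)
             for f in FLOP_PER_BYTE)
print(f"Required compute: {low:.2f} to {high:.1f} GFLOP/s")
print(f"3000-slice run at 19 s per image: {run_hours(3000, 19):.1f} hours")
```

The results agree with the quoted 0.3-33 GF/s range and the ~16 hour estimate. Because Stage 2 is described as trivially parallelizable, dividing the serial time by the number of processing elements gives a rough lower bound for a scaled-out system.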
Handling large data. Handling large code. SIMD aspect?
Q: What are the parameters to spe_create_thread…
Middleware / libraries likely to be optimized
- media, e.g., mplayer
- encryption, e.g., OpenSSH
PPE = Power Processor Element