PARALLEX – THE SUPER COMPUTER
A PROJECT REPORT
Submitted by
Mr. AMIT KUMAR
Mr. ANKIT SINGH
Mr. SUSHANT BHADKAMKAR
in partial fulfillment for the award of the degree
Of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE
GUIDE: MR. ANIL KADAM
AISSMS’S COLLEGE OF ENGINEERING, PUNE
UNIVERSITY OF PUNE
2007 - 2008
CERTIFICATE
Certified that this project report “Parallex - The Super Computer” is
the bonafide work of
Mr. AMIT KUMAR (Seat No. :: B3*****7)
Mr. ANKIT SINGH (Seat No. :: B3*****8)
Mr. SUSHANT BHADKAMKAR (Seat No. :: B3*****2)
who carried out the project work under my supervision.
Prof. M. A. Pradhan Prof. Anil Kadam
HEAD OF DEPARTMENT GUIDE
Acknowledgment
The success of any project is never limited to the individual undertaking it;
it is the collective effort of the people around that individual that spells
success. Some key people played roles vital to paving the way for the success
of this project, and we take this opportunity to express our sincere thanks
and gratitude to them.
We would like to thank all the faculty (teaching and non-teaching) of the
Computer Engineering Department of AISSMS College of Engineering,
Pune. Our project guide, Prof. Anil Kadam, was very generous with his
time and knowledge. We are grateful to Mr. Shasikant Athavale, who was
a source of constant motivation and inspiration for us. We are also thankful
for the valuable suggestions constantly given by Prof. Nitin Talhar and
Ms. Sonali Nalamwar, which proved very helpful to the success of our
project. Our deepest gratitude goes to Prof. M. A. Pradhan for her
thoughtful comments and gentle support throughout our academics.
We would like to thank the college authorities for providing us with full
support regarding the lab, network, and related software.
Abstract
Parallex is a parallel processing cluster consisting of control nodes and
execution nodes. Our implementation removes all requirements for kernel-level
modifications and kernel patches to run a Beowulf-style cluster system. A typical
Parallex cluster can have many control nodes, and these control nodes no longer
just monitor: they also take part in execution if resources permit. We have
removed the kernel, architecture, and platform dependencies, making our cluster
system work across completely different sets of CPU powers, operating systems,
and architectures, without using any existing parallel libraries such as MPI
or PVM.
With a radically new perspective on how a parallel system should work, we
have implemented our own distribution and parallel algorithms, aimed at
ease of administration and simplicity of usage without compromising efficiency.
With a fully modular 7-step design we attack the traditional complications and
deficiencies of existing parallel systems, such as redundancy, scheduling,
cluster accounting, and parallel monitoring.
A typical Parallex cluster may consist of a few old 386s running NetBSD,
some modern Intel Dual-Core machines running Linux, and some server-class MIPS
processors running IRIX, all working in parallel with full homogeneity.
Table of Contents
Chapter No. Title Page No.
LIST OF FIGURES I
LIST OF TABLES II
1. A General Introduction
1.1 Basic concepts 1
1.2 Promises and Challenges 5
1.2.1 Processing technology 6
1.2.2 Networking technology 6
1.2.3 Software tools and technology 7
1.3 Current scenario 8
1.3.1 End user perspectives 8
1.3.2 Industrial perspective 8
1.3.3 Developers, researchers & scientists perspective 9
1.4 Obstacles and Why we don’t have 10 GHz today 9
1.5 Myths and Realities: 2 x 3 GHz < 6GHz 10
1.6 The problem statement 11
1.7 About PARALLEX 11
1.8 Motivation 12
1.9 Features of PARALLEX 13
1.10 Why our design is an “alternative” to parallel systems 13
1.11 Innovation 14
2. REQUIREMENT ANALYSIS 16
2.1 Determining the overall mission of Parallex 16
2.2 Functional requirement for Parallex system 16
2.3 Non-functional requirement for system 17
3. PROJECT PLAN 19
4. SYSTEM DESIGN 21
5. IMPLEMENTATION DETAIL 24
5.1 Hardware architecture 24
5.2 Software architecture 26
5.3 Description for software behavior 28
5.3.1 Events 32
5.3.2 States 32
6. TECHNOLOGIES USED 33
6.1 General terms 33
7. TESTING 35
8. COST ESTIMATION 44
9. USER MANUAL 45
9.1 Dedicated cluster setup 45
9.1.1 BProc Configuration 45
9.1.2 Bringing up BProc 47
9.1.3 Build phase 2 image 48
9.1.4 Loading phase 2 image 48
9.1.5 Using the cluster 49
9.1.6 Managing the cluster 50
9.1.7 Troubleshooting techniques 51
9.2 Shared cluster setup 52
9.2.1 DHCP 52
9.2.2 NFS 54
9.2.2.1 Running NFS 55
9.2.3 SSH 57
9.2.3.1 Using SSH 60
9.2.4 Host file and name service 65
9.3 Working with PARALLEX 65
I. LIST OF FIGURES:
1.1 High-performance distributed system.
1.2 Transistor vs. Clock Speed
4.1 Design Framework
4.2 Parallex Design
5.1 Parallel System H/W Architecture
5.2 Parallel System S/W Architecture
7.1 Cyclomatic Diagram for the system
7.2 System Usage pattern
7.3 Histogram
7.4 One frame from Complex Rendering on Parallex: Simulation of an
explosion
II. LIST OF TABLES:
1.1 Project Plan
7.1 Logic/Coverage/Decision Testing
7.2 Functional Test
7.3 Console Test cases
7.4 Black box Testing
7.5 Benchmark Results
The Super Computer
AISSMS College of Engineering - 1 -
Chapter 1. A General Introduction
1.1 BASIC CONCEPTS
The last two decades spawned a revolution in the world of computing; a move away
from central mainframe-based computing to network-based computing. Today,
servers are fast achieving the levels of CPU performance, memory capacity, and I/O
bandwidth once available only in mainframes, at cost orders of magnitude below that
of a mainframe. Servers are being used to solve computationally intensive problems
in science and engineering that once belonged exclusively to the domain of
supercomputers. A distributed computing system is the system architecture that makes
a collection of heterogeneous computers, workstations, or servers act and behave as a
single computing system. In such a computing environment, users can uniformly
access and name local or remote resources, and run processes from anywhere in the
system, without being aware of which computers their processes are running on.
Distributed computing systems have been studied extensively by researchers, and a
great many claims and benefits have been made for using such systems. In fact, it is
hard to rule out any desirable feature of a computing system that has not been claimed
to be offered by a distributed system [24]. However, the current advances in
processing and networking technology and software tools make it feasible to achieve
the following advantages:
• Increased performance. The existence of multiple computers in a distributed system
allows applications to be processed in parallel and thus improves application and
system performance. For example, the performance of a file system can be improved
by replicating its functions over several computers; the file replication allows several
applications to access that file system in parallel. Furthermore, file replication
distributes network traffic associated with file access across the various sites and thus
reduces network contention and queuing delays.
• Sharing of resources. Distributed systems are cost-effective and enable efficient
access to all system resources. Users can share special purpose and sometimes
expensive hardware and software resources such as database servers, compute servers,
virtual reality servers, multimedia information servers, and printer servers, to name
just a few.
• Increased extendibility. Distributed systems can be designed to be modular and
adaptive so that for certain computations, the system will configure itself to include a
large number of computers and resources, while in other instances, it will just consist
of a few resources. Furthermore, limitations in file system capacity and computing
power can be overcome by adding more computers and file servers to the system
incrementally.
• Increased reliability, availability, and fault tolerance. The existence of multiple
computing and storage resources in a system makes it attractive and cost-effective to
introduce fault tolerance to distributed systems. The system can tolerate the failure in
one computer by allocating its tasks to another available computer. Furthermore, by
replicating system functions and/or resources, the system can tolerate one or more
component failures.
• Cost-effectiveness. The performance of computers has been approximately doubling
every two years, while their cost has decreased by half every year during the last
decade. Furthermore, the emerging high speed network technology [e.g., wave-
division multiplexing, asynchronous transfer mode (ATM)] will make the
development of distributed systems attractive in terms of the price/performance ratio
compared to that of parallel computers. These advantages cannot be achieved easily
because designing a general purpose distributed computing system is several orders of
magnitude more difficult than designing centralized computing systems—designing a
reliable general-purpose distributed system involves a large number of options and
decisions, such as the physical system configuration, communication network and
computing platform characteristics, task scheduling and resource allocation policies
and mechanisms, consistency control, concurrency control, and security, to name just
a few. The difficulties can be attributed to many factors related to the lack of maturity
in the distributed computing field, the asynchronous and independent behavior of the
systems, and the geographic dispersion of the system resources. These are
summarized in the following points:
• There is a lack of a proper understanding of distributed computing theory—the field
is relatively new and we need to design and experiment with a large number of
general-purpose reliable distributed systems with different architectures before we can
master the theory of designing such computing systems. One interesting explanation
for the lack of understanding of the design process of distributed systems was given
by Mullender. Mullender compared the design of a distributed system to the design of
a reliable national railway system that took a century and half to be fully understood
and mature. Similarly, distributed systems (which have been around for
approximately two decades) need to evolve into several generations of different
design architectures before their designs, structures, and programming techniques can
be fully understood and mature.
• The asynchronous and independent behavior of the system resources and/or
(hardware and software) components complicate the control software that aims at
making them operate as one centralized computing system. If the computers are
structured in a master–slave relationship, the control software is easier to develop and
system behavior is more predictable. However, this structure is in conflict with the
distributed system property that requires computers to operate independently and
asynchronously.
• The use of a communication network to interconnect the computers introduces
another level of complexity. Distributed system designers not only have to master the
design of the computing systems and system software and services, but also have to
master the design of reliable communication networks, how to achieve
synchronization and consistency, and how to handle faults in a system composed of
geographically dispersed heterogeneous computers. The number of resources
involved in a system can vary from a few to hundreds, thousands, or even hundreds of
thousands of computing and storage resources.
Despite these difficulties, there has been success in designing special-purpose
distributed systems such as banking systems, online transaction systems, and point-of-
sale systems. However, the design of a general-purpose reliable distributed system
that has the advantages of both centralized systems (accessibility, management, and
coherence) and networked systems (sharing, growth, cost, and autonomy) is still a
challenging task. Kleinrock makes an interesting analogy between the human-made
computing systems and the brain. He points out that the brain is organized and
structured very differently from our present computing machines. Nature has been
extremely successful in implementing distributed systems that are far more intelligent
and impressive than any computing machines humans have yet devised. We have
succeeded in manufacturing highly complex devices capable of high speed
computation and massive accurate memory, but we have not gained sufficient
understanding of distributed systems; our systems are still highly constrained and
rigid in their construction and behavior. The gap between natural and man-made
systems is huge, and more research is required to bridge this gap and to design better
distributed systems. In the next section we present a design framework to better
understand the architectural design issues involved in developing and implementing
high performance distributed computing systems. A high-performance distributed
system (HPDS) (Figure 1.1) includes a wide range of computing resources, such as
workstations, PCs, minicomputers, mainframes, supercomputers, and other special-
purpose hardware units. The underlying network interconnecting the system resources
can span LANs, MANs, and even WANs, can have different topologies (e.g., bus,
ring, full connectivity, random interconnect), and can support a wide range of
communication protocols.
Fig. 1.1 High-performance distributed system.
1.2 PROMISES AND CHALLENGES OF PARALLEL AND
DISTRIBUTED SYSTEMS
The proliferation of high-performance systems and the emergence of high speed
networks (terabit networks) have attracted a lot of interest in parallel and distributed
computing. The driving forces toward this end will be
(1) The advances in processing technology,
(2) The availability of high-speed network, and
(3) The increasing research efforts directed toward the development of software
support and programming environments for distributed computing.
Further, with the increasing requirements for computing power and the diversity in
the computing requirements, it is apparent that no single computing platform will
meet all these requirements. Consequently, future computing environments need to
capitalize on and effectively utilize the existing heterogeneous computing resources.
Only parallel and distributed systems provide the potential of achieving such an
integration of resources and technologies in a feasible manner while retaining desired
usability and flexibility. Realization of this potential, however, requires advances on a
number of fronts: processing technology, network technology, and software tools and
environments.
1.2.1 Processing Technology
Distributed computing relies to a large extent on the processing power of the
individual nodes of the network. Microprocessor performance has been growing at a
rate of 35 to 70 percent during the last decade, and this trend shows no indication of
slowing down in the current decade. The enormous power of the future generations of
microprocessors, however, cannot be utilized without corresponding improvements in
memory and I/O systems. Research in main-memory technologies, high-performance
disk arrays, and high-speed I/O channels are, therefore, critical to utilize efficiently
the advances in processing technology and the development of cost-effective high
performance distributed computing.
1.2.2 Networking Technology
The performance of distributed algorithms depends to a large extent on the bandwidth
and latency of communication among work nodes. Achieving high bandwidth and
low latency involves not only fast hardware, but also efficient communication
protocols that minimize the software overhead. Developments in high-speed networks
provide gigabit bandwidths over local area networks as well as wide area networks at
moderate cost, thus increasing the geographical scope of high-performance distributed
systems.
The problem of providing the required communication bandwidth for distributed
computational algorithms is now relatively easy to solve given the mature state of
fiber-optic and optoelectronic device technologies. Achieving the low latencies
necessary, however, remains a challenge. Reducing latency requires progress on a
number of fronts. First, current communication protocols do not scale well to a high-
speed environment. To keep latencies low, it is desirable to execute the entire protocol
stack, up to the transport layer, in hardware. Second, the communication interface of
the operating system must be streamlined to allow direct transfer of data from the
network interface to the memory space of the application program. Finally, the speed
of light (approximately 5 microseconds per kilometer) poses the ultimate limit to
latency. In general, achieving low latency requires a two-pronged approach:
1. Latency reduction. Minimize protocol-processing overhead by using streamlined
protocols executed in hardware and by improving the network interface of the
operating system.
2. Latency hiding. Modify the computational algorithm to hide latency by pipelining
communication and computation. These problems are now perhaps most fundamental
to the success of parallel and distributed computing, a fact that is increasingly being
recognized by the research community.
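The latency-hiding idea above — overlap communication with computation so the two delays do not add up — can be sketched in a few lines of Python. This is only an illustrative sketch: `fetch` and `compute` are stand-ins of our own invention, not part of any real cluster API, and the "network" here is simulated.

```python
import threading
import queue

# Latency hiding by pipelining: while chunk i is being computed,
# chunk i+1 is already being fetched by the communication thread.
def fetch(chunk_id):
    # Stand-in for a network transfer; a real system would block here.
    return list(range(chunk_id * 4, chunk_id * 4 + 4))

def compute(data):
    # Stand-in for the per-chunk computation.
    return sum(x * x for x in data)

def pipelined(num_chunks):
    q = queue.Queue(maxsize=2)  # small buffer between the two stages

    def producer():
        for i in range(num_chunks):
            q.put(fetch(i))     # communication stage
        q.put(None)             # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (data := q.get()) is not None:
        results.append(compute(data))  # overlaps the next fetch
    return results
```

Because fetching and computing run in separate threads, a slow network (stage one) and a busy CPU (stage two) overlap rather than add, which is exactly the effect latency hiding aims for.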
1.2.3 Software Tools and Environments
The development of parallel and distributed applications is a nontrivial process and
requires a thorough understanding of the application and the architecture. Although a
parallel and distributed system provides the user with enormous computing power and
a great deal of flexibility, this flexibility implies increased degrees of freedom which
have to be optimized in order to fully exploit the benefits of the distributed system.
For example, during software development, the developer is required to select the
optimal hardware configuration for the particular application, the best decomposition
of the problem on the hardware configuration selected, and the best communication
and synchronization strategy to be used, and so on. The set of reasonable alternatives
that have to be evaluated in such an environment is very large, and selecting the best
alternative among these is a nontrivial task. Consequently, there is a need for a set of
simple and portable software development tools that can assist the developer in
appropriately distributing the application computations to make efficient use of the
underlying computing resources. Such a set of tools should span the software life
cycle and must support the developer during each stage of application development,
starting from the specification and design formulation stages, through the
programming, mapping, distribution, scheduling phases, tuning, and debugging
stages, up to the evaluation and maintenance stages.
1.3 Current Scenario
The current scenario of parallel systems can be viewed from three
perspectives. A concept common to all of them is Total Ownership Cost (TOC).
TOC is by far the most common scale on which a level of computer processing is
assessed worldwide. It is defined as the ratio of the total cost of
implementation and maintenance to the net throughput the parallel cluster
delivers:

          TOTAL COST OF IMPLEMENTATION AND MAINTENANCE
TOC = ------------------------------------------------------------
        NET SYSTEM THROUGHPUT (IN FLOATING-POINT OPS / SEC)
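As a small sketch, the ratio can be computed directly; the cost and throughput figures below are hypothetical, chosen only to illustrate the units:

```python
def total_ownership_cost(total_cost, throughput_flops):
    """Total Ownership Cost: implementation-and-maintenance cost
    divided by net system throughput in floating-point ops per second.
    A lower TOC means more computing delivered per unit of money."""
    if throughput_flops <= 0:
        raise ValueError("throughput must be positive")
    return total_cost / throughput_flops

# e.g. a cluster costing 12,000 currency units sustaining 3 GFLOP/s:
toc = total_ownership_cost(12_000, 3e9)   # 4e-06 per (FLOP/s)
```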
1.3.1 End user perspectives
Activities such as rendering, Adobe Photoshop workloads, and similar
processes fall under this category. As the need for processing power grows
day by day, hardware costs rise with it. From the end-user perspective, the
parallel system aims to reduce expense and avoid complexity, so at this stage
we are trying to implement a parallel system that is cost effective and user
friendly. For the end user, however, TOC matters less in most cases, because
a parallel cluster is rarely owned by a single user; here the net throughput
of the parallel system becomes the most crucial factor.
1.3.2 Industrial Perspective
Parallel systems are extensively deployed in the corporate sector. Such
systems may, in theory if not yet in practice, have to handle millions of
nodes. From the industrial point of view, the parallel system aims at
resource isolation, replacing large-scale dedicated commodity hardware and
mainframes. Corporate sectors often make TOC the primary criterion by which
a parallel cluster is judged: as scalability increases, the cost of owning
parallel clusters shoots up to unmanageable heights, and our primary aim in
this area is to bring the TOC down as much as possible.
1.3.3 Developers, Researchers & Scientists Perspective
Scientific applications such as 3D simulation, large-scale scientific
rendering, intense numerical calculation, complex programming logic, and
large-scale implementations of algorithms (e.g., BLAS and FFT libraries)
require levels of processing and calculation that no modern dedicated vector
CPU could possibly meet on its own. Parallel systems have consequently proven
to be the most efficient, and often the only, alternative for keeping pace
with modern scientific advancement and research. TOC is rarely a concern here.
1.4 Obstacles and Why we don’t have 10 GHz today…
Fig 1.2 Transistor vs. Clock Speed
CPU performance growth as we have known it has hit a wall. The figure graphs
the history of Intel chip introductions by clock speed and number of
transistors. The number of transistors continues to climb, at least for now;
clock speed, however, is a different story.
Around the beginning of 2003, you’ll note a disturbing sharp turn in the previous
trend toward ever-faster CPU clock speeds. We have added lines to show the limit
trends in maximum clock speed; instead of continuing on the previous path, as
indicated by the thin dotted line, there is a sharp flattening. It has become harder and
harder to exploit higher clock speeds due to not just one but several physical issues,
notably heat (too much of it and too hard to dissipate), power consumption (too high),
and current leakage problems.
Sure, Intel has samples of their chips running at even higher speeds in the
lab—but only by heroic efforts, such as attaching hideously impractical quantities of
cooling equipment. You won’t have that kind of cooling hardware in your office any
day soon, let alone on your lap while computing on the plane.
1.5 Myths and Realities: 2 x 3GHz < 6 GHz
So a dual-core CPU that combines two 3GHz cores practically offers 6GHz of
processing power. Right?
Wrong. Even having two threads running on two physical processors doesn’t
mean getting two times the performance. Similarly, most multi-threaded applications
won’t run twice as fast on a dual-core box. They should run faster than on a single-
core CPU; the performance gain just isn’t linear, that’s all.
Why not? First, there is coordination overhead between the cores to ensure
cache coherency (a consistent view of cache, and of main memory) and to perform
other handshaking. Today, a two- or four-processor machine isn’t really two or four
times as fast as a single CPU even for multi-threaded applications. The problem
remains essentially the same even when the CPUs in question sit on the same die.
Second, unless the two cores are running different processes, or different
threads of a single process that are well-written to run independently and almost never
wait for each other, they won’t be well utilized. (Despite this, we speculate that
today’s single-threaded applications as actually used in the field could see a
performance boost for most users by going to a dual-core chip, not because the
extra core is doing anything useful, but because it is running the adware and
spyware that infest many users’ systems and are otherwise slowing down the single CPU
that user has today. We leave it up to you to decide whether adding a CPU to run your
spy ware is the best solution to that problem.)
If you’re running a single-threaded application, then the application can only
make use of one core. There should be some speedup as the operating system and the
application can run on separate cores, but typically the OS isn’t going to be maxing
out the CPU anyway, so one of the cores will be mostly idle. (Again, the spyware can
share the OS’s core most of the time.)
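The non-linear gain described above is commonly modeled by Amdahl's law, which bounds the speedup by the serial fraction of the program. A minimal sketch follows; the parallel fraction `p` used in the example is an assumed illustrative figure, not a measurement:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: the speedup on n cores when a fraction p of the
    work is parallelizable and the rest (1 - p) must run serially."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a well-threaded program that is 90% parallel gains
# far less than 2x on two cores:
s = amdahl_speedup(0.9, 2)   # ~1.82x, not 2x
```

This is why 2 x 3 GHz < 6 GHz: the serial part of the work, plus the coordination overhead described above, keeps the second core from contributing its full share.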
1.6 The problem statement
So now let us summarize and define the problem statement:
• Since the growth of requirements of processing is far greater than the growth
of CPU power, and since the silicon chip is fast approaching its full capacity,
the implementation of parallel processing at every level of computing becomes
inevitable.
• There is a need to have a single and complete clustering solution which
requires minimum user interference but at the same time supports
editing/modifications to suit the user’s requirements.
• There should be no need to modify the existing applications.
• The parallel system must be able to support different platforms.
• The system should be able to fully utilize all the available hardware resources
without the need of buying any extra/special kind of hardware.
1.7 About PARALLEX
While the term parallel is often used to describe clusters, they are more
correctly described as a type of distributed computing. Typically, the term parallel
computing refers to tightly coupled sets of computation. Distributed computing is
usually used to describe computing that spans multiple machines or multiple
locations. When several pieces of data are being processed simultaneously in the same
CPU, this might be called a parallel computation, but would never be described as a
distributed computation. Multiple CPUs within a single enclosure might be used for
parallel computing, but would not be an example of distributed computing. When
talking about systems of computers, the term parallel usually implies a homogenous
collection of computers, while distributed computing typically implies a more
heterogeneous collection. Computations that are done asynchronously are more likely
to be called distributed than parallel. Clearly, the terms parallel and distributed lie at
either end of a continuum of possible meanings. In any given instance, the exact
meanings depend upon the context. The distinction is more one of connotations than
of clearly established usage.
Parallex is both a parallel and a distributed cluster: it supports multiple
CPUs within a single enclosure as well as a heterogeneous collection of
computers.
1.8 Motivation
The motivation behind this project is to provide a cheap, easy-to-use
solution that caters to the high-performance computing requirements of
organizations without the need to install any expensive hardware.
In many organizations, including our college, we have observed that when old
systems are replaced by newer ones, the old ones are generally dumped or sold
at throwaway prices. We also wanted to find a way to use this “silicon
waste” effectively. These wasted resources can easily be added to our system
as processing needs increase, because the parallel system is linearly
scalable and hardware independent. The intent is thus an environmentally
friendly and effective solution that utilizes all available CPU power to
execute applications faster.
1.9 Features of Parallex
• Parallex simplifies the cluster setup, configuration and management process.
• It supports machines with hard disks as well as diskless machines running at
the same time.
• It is flexible in design and easily adaptable.
• Parallex does not require any special kind of hardware.
• It is multi platform compatible.
• It ensures efficient utilization of silicon waste (old unused hardware).
• Parallex is scalable.
How these features are achieved and details of design will be discussed in subsequent
chapters.
1.10 Why our design is an “alternative” to parallel systems
Every established technology needs to evolve after a time, as each new
generation addresses the shortcomings of the technology that came before.
What we have achieved is a bare-bones reworking of the parallel system.
While studying parallel and distributed systems we had the advantage of
working with the latest technology, and the parallel systems designed by
scientists before us are no doubt far more sophisticated than ours. Our
system is nevertheless unique because it actually splits up the task
according to the processing power of the nodes instead of just load
balancing. A slow node therefore gets a smaller task than a faster one, and
all nodes deliver their output at the same calculated time on the master node.
We faced the difficulty of deciding how much of the task should be given to
each machine in a heterogeneous system so that all results arrive at the same
time. We worked on this problem and developed a mathematical distribution
algorithm, which was successfully implemented and is functional. The algorithm
breaks up the task according to the speed of the CPUs by sending a small test
application to all nodes and storing each node's response time in a file. We then
worked on automating the entire system. We were using password-less secure shell
(SSH) logins and the Network File System (NFS). We succeeded to some extent, but
the SSH and NFS configuration could not be fully automated, so every new node had
to be set up manually, which is a drawback of that approach. To overcome it we
looked for an alternative, the Beowulf cluster, but after studying it we concluded
that it assumes all nodes have the same configuration and sends equal tasks to all
of them.
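The proportional split described above can be sketched as a short script. This is an illustrative reconstruction, not the project's actual code: the node names, benchmark times, and the frame-range output format are assumptions, and a node's speed is simply taken as the inverse of its measured time.

```shell
#!/bin/bash
# Hypothetical sketch of the distribution step: read "node seconds" pairs
# (as collected from the test application), treat speed as 1/time, and
# split a frame range in proportion to speed so all nodes finish together.

TOTAL_FRAMES=100

OUT=$(awk -v total="$TOTAL_FRAMES" '
  { node[NR] = $1; speed[NR] = 1 / $2; sum += speed[NR] }
  END {
    start = 1
    for (i = 1; i <= NR; i++) {
      frames = int(total * speed[i] / sum + 0.5)  # round to nearest frame
      if (i == NR) frames = total - start + 1     # remainder goes to the last node
      printf "%s %d %d\n", node[i], start, start + frames - 1
      start += frames
    }
  }
' <<'EOF'
node1 10
node2 20
node3 40
EOF
)
echo "$OUT"
```

With these made-up timings, the fastest node (node1) gets frames 1-57, node2 gets 58-86, and the slowest node gets 87-100, so all three should finish at roughly the same time.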
To improve our system we decided to think differently from the Beowulf cluster
and tried to make the system more cost-effective. We adopted the diskless-cluster
concept in order to get rid of hard disks, cutting cost and improving machine
reliability: a storage device affects the performance of the entire system,
increases cost (through replacement of disks), and wastes time in fault
diagnosis. So we studied and patched the Beowulf server and the Beowulf
Distributed Process Space according to the needs of our system. We made kernel
images for running diskless nodes using the RARP protocol. When a node boots this
kernel image in its memory, it requests an IP address from the master node, which
can also be called the server. The server assigns the node its IP address and node
number. With this, our diskless cluster system is up and ready for parallel
computing. We then modified our various programs, including our own distribution
algorithm, to fit the new design. The best part of the new system is that no
authorization setup is needed: everything is now automatic.
Till now we had been working on CODE LEVEL PARALLELISM: we modify the code
slightly to run on our system, just as MPI libraries are used to make code
executable in parallel. The next challenge was what to do if we did not get the
source code and instead received only a binary file to execute on our parallel
system. We therefore needed to enhance our system by adding BINARY LEVEL
PARALLELISM. We studied openMosix. Once openMosix is installed and all the nodes
are booted, the openMosix nodes see each other in the cluster and start exchanging
information about their load levels and resource usage. Once the load on a node
increases beyond the defined level, a process migrates to another node on the
network. There can be a situation where a process demands heavy resource usage; in
that case the process may keep migrating from node to node without ever being
serviced. This is the major design flaw of openMosix, and we are working to find a
solution.
In this sense, our design is an ALTERNATIVE to the problems described above in
the world of parallel computing.
1.11 Innovation
First, our system does not require any additional hardware if the existing
machines are already connected in a network. Second, even in a heterogeneous
environment, with a few fast CPUs and a few slower ones, the efficiency of the
system drops by no more than 1 to 5%, still maintaining an efficiency of around
80% for suitably adapted applications. This is because the mathematical distribution algorithm
considers the relative processing power of each node, distributing only the amount
of load that the node can process in the system's calculated optimal time. All
nodes then process their respective tasks and produce output at this calculated
time. The most important feature of our system is its ability to use diskless
nodes in the cluster, thereby reducing hardware costs, space, and the required
maintenance. Even for binary executables (when source code is not available), our
system exhibits performance gains of almost 20%.
Chapter 2. Requirement Analysis
2.1 Determining the overall mission of Parallex
• User base: Students, educational institutes, small to medium business
organizations.
• Cluster usage: one part of the cluster will be fully dedicated to solving the
problem at hand, with an optional part where computing resources from
individual workstations are used. In the latter part, the parallel problems
will have lower priority.
• Software to be run on the cluster: depends upon the user base. At the cluster
management level, the system software will be Linux.
• Dedicated or shared cluster: as mentioned above, it will be both.
• Extent of the cluster: computers that are all on the same subnet.
2.2 Functional Requirements for Parallex system
Functional Requirement 1
The PCs must be connected in a LAN so that the system can be used without any
obstacles.
Functional Requirement 2
There will be one master, or controlling node, which distributes tasks according
to the processing speed of each node.
Services
Three services are to be provided on the master:
1. A network monitoring tool for resource discovery (e.g. IP addresses,
MAC addresses, up/down status, etc.).
2. The distribution algorithm, which distributes the task according to the
current processing speed of the nodes.
3. The Parallex master script, which sends the distributed tasks to the nodes,
collects the results, integrates them, and gives out the output.
Functional Requirement 3
The final size of the executable code should be such that it fits within the
limited memory constraints of the machine.
Functional Requirement 4
This product will only be used to speed up applications that already exist in
the enterprise.
2.3 Non-Functional Requirements for system
- Performance
Even in a heterogeneous environment, with a few fast CPUs and a few slower ones,
the efficiency of the system drops by no more than 1 to 5%, still maintaining an
efficiency of around 80% for suitably adapted applications. This is because the
mathematical distribution algorithm considers the relative processing power of
each node, distributing only the amount of load that the node can process in the
system's calculated optimal time. All nodes then process their respective tasks
and produce output at this calculated time. The most important feature of our
system is its ability to use diskless nodes in the cluster, thereby reducing
hardware costs, space, and the required maintenance. Even for binary executables
(when source code is not available), our system exhibits performance gains of
almost 20%.
- Cost
While a system of n parallel processors is less efficient than a single processor
that is n times faster, the parallel system is often cheaper to build. Parallel
computation is used for tasks that require very large amounts of computation,
take a lot of time, and can be divided into n independent subtasks. In recent
years, most high-performance computing systems, also known as supercomputers,
have had parallel architectures.
- Manufacturing costs
No extra hardware is required; the only cost is that of setting up the LAN.
- Benchmarks
There are at least three reasons for running benchmarks. First, a benchmark will
provide us with a baseline. If we make changes to our cluster or if we suspect
problems with our cluster, we can rerun the benchmark to see if performance is really
any different. Second, benchmarks are useful when comparing systems or cluster
configurations. They can provide a reasonable basis for selecting between
alternatives. Finally, benchmarks can be helpful with planning.
For benchmarking we will use the 3D rendering tool POV-Ray (Persistence of
Vision Raytracer; see the Appendix for more details).
- Hardware required
• x686-class PCs (Linux with 2.6.x kernels installed, with intranet connection)
• Switch (100/10T)
• Serial port connectors
• 100BASE-T LAN cable, RJ45 connectors
- Software resources required
• Linux (2.6.x kernel)
• Intel compiler suite (non-commercial)
• LSB (Linux Standard Base) set of GNU kits with GNU CC/C++/F77/LD/AS
• GNU Krell monitor
- Number of PCs connected in the LAN
8 nodes in the LAN.
Chapter 3. Project Plan
Plan of execution for the project was as follows:
No. | Activity | Software Used | Days
1 | Project planning: (a) choosing the domain, (b) identifying key areas of work, (c) requirement analysis | - | 10
2 | Basic installation of Linux | Linux (2.6.x kernel) | 3
3 | Brushing up on C programming skills | - | 5
4 | Shell scripting | Linux (2.6.x kernel), GNU Bash | 12
5 | C programming in the Linux environment | GNU C compiler suite | 5
6 | A demo project (Universal Sudoku Solver) to familiarize ourselves with the Linux programming environment | GNU C compiler suite, Intel compiler suite (non-commercial) | 16
7 | Study of advanced Linux tools and installation of packages & Red Hat RPMs | iptraf, mc, tar, rpm, awk, sed, gnuplot, strace, gdb, etc. | 10
8 | Studying networking basics & network configuration in Linux | - | 8
9 | Recompiling, patching and analyzing the system kernel | Linux (kernel 2.6.x), GNU C compiler | 3
10 | Study & implementation of advanced networking tools: SSH & NFS | ssh & OpenSSH, NFS | 7
11 | (a) Preparing the preliminary design of the total workflow of the project, (b) deciding the modules for overall execution and dividing the areas of concentration among the project group, (c) building the Stage I prototype | All of the above | 17
12 | Building the Stage II prototype (replacing ssh with a custom-made application) | All of the above | 15
13 | Building the Stage III prototype (making the diskless cluster) | All of the above | 10
14 | Testing & building the final packages | All of the above | 10
Table 1.1 Project Plan
Chapter 4. System Design
Generally speaking, the design process of a distributed system involves three main
activities:
(1) designing the communication system that enables the distributed system resources
and objects to exchange information,
(2) defining the system structure (architecture) and the system services that enable
multiple computers to act as a system rather than as a collection of computers, and
(3) defining the distributed computing programming techniques to develop parallel
and distributed applications.
Based on this notion of the design process, the distributed system design framework
can be described in terms of three layers:
(1) network, protocol, and interface (NPI) layer,
(2) system architecture and services (SAS) layer, and
(3) distributed computing paradigms (DCP) layer. In what follows, we describe the
main design issues to be addressed in each layer.
Fig. 4.1 Design Framework
• Communication network, protocol, and interface layer. This layer describes the
main components of the communication system that will be used for passing control
and information among the distributed system resources. This layer is decomposed
into three sublayers: network type, communication protocols, and network interfaces.
• Distributed system architecture and services layer. This layer represents the
designer’s and system manager’s view of the system. SAS layer defines the structure
and architecture and the system services (distributed file system, concurrency control,
redundancy management, load sharing and balancing, security service, etc.) that must
be supported by the distributed system in order to provide a single-image
computing system.
• Distributed computing paradigms layer. This layer represents the programmer
(user) perception of the distributed system. This layer focuses on the programming
paradigms that can be used to develop distributed applications. Distributed computing
paradigms can be broadly characterized based on the computation and communication
models. Parallel and distributed computations can be described in terms of two
paradigms: functional parallel and data parallel paradigms. In functional parallel
paradigm, the computations are divided into distinct functions which are then
assigned to different computers. In data parallel paradigm, all the computers run the
same program, the same program multiple data (SPMD) stream, but each computer
operates on different data streams.
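As a toy illustration of the data parallel (SPMD) model, the following script runs the same slicing code for each rank; only the rank value differs, just as each computer in an SPMD program works on its own slice of the data. The decomposition scheme here (contiguous blocks, with the extra items going to the lowest ranks) is a common convention, not something specified in this report.

```shell
#!/bin/bash
# SPMD sketch: 10 data items are split across 3 hypothetical nodes.
# Every "node" runs the same code; its rank alone decides its slice.

N=10        # total data items
NPROCS=3    # number of nodes

# Contiguous block decomposition: the first (N mod NPROCS) ranks get one
# extra item, so the slices differ in size by at most one.
slice_bounds() {
  local rank=$1 n=$2 nprocs=$3
  local base=$(( n / nprocs )) extra=$(( n % nprocs ))
  local lo hi
  if [ "$rank" -lt "$extra" ]; then
    lo=$(( rank * (base + 1) ))
    hi=$(( lo + base ))
  else
    lo=$(( extra * (base + 1) + (rank - extra) * base ))
    hi=$(( lo + base - 1 ))
  fi
  echo "$lo $hi"
}

for rank in 0 1 2; do
  echo "rank $rank handles items $(slice_bounds "$rank" "$N" "$NPROCS")"
done
```

For 10 items on 3 nodes, rank 0 gets items 0-3 and ranks 1 and 2 get three items each, covering all the data with no overlap.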
With reference to Fig. 4.1, Parallex can be described as follows:
Fig. 4.2 Parallex Design
Chapter 5. Implementation Details
The goal of the project is to provide an efficient system that will handle process
parallelism with the help of Clusters. This parallelism will thereby reduce the time of
execution. Currently we form a cluster of 8 nodes. Executing any heavy process on
a single computer takes a lot of time, so we form a cluster and execute such
processes in parallel by dividing each process into a number of subprocesses.
Depending on the nodes in the cluster, we migrate the subprocesses to those nodes
and, when execution is over, bring the output they produce back to the master
node. By doing this we reduce the process execution time and increase CPU
utilization.
5.1 Hardware Architecture
We have implemented a shared-nothing parallel architecture using a
coarse-grained cluster structure. The interconnect is an ordinary 8-port switch,
optionally on a Class B or Class C network. It is a three-level architecture:
1. Master topology
2. Slave topology
3. Network interconnect
1. The master is a machine running Linux with a 2.6.x or 2.4.x kernel (both under
testing). It runs the parallel server and contains the application interface that
drives the remaining machines. The master runs a network-scanning script to detect
all the slaves that are alive and retrieves all the necessary information about
each slave. To determine the load on each slave just before the processing of the
main application, the master sends a small diagnostic application to the slave to
estimate the load it can take at that moment. Having collected all the relevant
information, it does all the scheduling: implementing the parallel algorithms
(distributing the tasks according to processing power and current load), making
use of CPU extensions (MMX, SSE, 3DNow!) depending upon the slave architecture,
and everything except the execution of the program itself. It accepts the input
task to be executed. It allocates the tasks to
the underlying slave nodes constituting the parallel system, which execute the
tasks in parallel and return the output to the master. The master plays the role
of a watchdog, which may or may not participate in the actual processing but
manages the entire task.
2. A slave runs a single system cluster image (SSCI) and is basically dedicated to
processing. It accepts a subtask along with the necessary library modules,
executes it, and returns the output to the master. In our case, the slaves are
multi-boot-capable systems: at one point in time they can be diskless cluster
hosts, at another they can behave as general-purpose cluster nodes, and at yet
another they can act as normal PCs handling routine office and home tasks. In the
case of diskless machines, the slave boots from a pre-created, appropriately
patched kernel image.
3. The network interconnect merges the master and slave topologies. It uses an
8-port switch, RJ45 connectors, and CAT 5 cables. It is a star topology in which
the master and the slaves are interconnected through the switch.
Fig. 5.1 Parallel System H/W Architecture
Cluster monitoring: each slave runs a server that collects the kernel,
processing, I/O, memory, and CPU details from the /proc virtual file system and
forwards them to the master node (which here acts as a client of the server
running on each slave), and a user-space program plots them interactively on the
master's screen, showing the CPU, memory, and I/O details of each node separately.
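The kind of extraction such a collector performs can be sketched as follows. The sample line mirrors the format of /proc/loadavg, but the values are invented; on a live slave the script would read the real /proc files instead, as shown in the trailing comments.

```shell
#!/bin/bash
# Sketch of a per-node statistics collector. The parsing is demonstrated
# on a captured /proc/loadavg-style line so it can be followed without a
# live Linux node; the sample values are made up.

LOADAVG_LINE="0.42 0.35 0.30 2/118 4567"   # format of /proc/loadavg

# Fields: 1-, 5-, 15-minute load averages; running/total tasks; last PID.
LOAD1=$(echo "$LOADAVG_LINE" | awk '{print $1}')
RUNNING=$(echo "$LOADAVG_LINE" | awk '{split($4, a, "/"); print a[1]}')

echo "1-min load: $LOAD1, running tasks: $RUNNING"

# On a live node the same extraction would read /proc directly, e.g.:
#   LOAD1=$(awk '{print $1}' /proc/loadavg)
#   MEMFREE=$(awk '/MemFree/ {print $2}' /proc/meminfo)   # in kB
```

A slave-side server would gather a handful of such values and ship them to the master over a socket on each sampling interval.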
5.2 Software Architecture
This architecture consists of two parts:
1. Master architecture
2. Slave architecture
The master consists of the following levels:
1. LinuxBIOS: LinuxBIOS usually loads a Linux kernel directly.
2. Linux: the platform on which the master runs.
3. SSCI + Beoboot: this level extracts the single system cluster image used by
the slave nodes.
4. Fedora Core / Red Hat: the actual operating system running on the master.
5. System services: essential services running on the master, e.g. the RARP
resolver daemon.
The slave inherits from the master, with the following levels:
1. LinuxBIOS
2. Linux
3. SSCI
Fig 5.2 Parallel System S/W Architecture
Parallex is broadly divided into the following modules:
1. Scheduler: the heart of our system. With a radically new approach to data-
and instruction-level distribution, we have implemented a fully optimized
heterogeneous cluster technology. We allocate tasks based on the actual
measured processing capability of each node, not on the rated GHz figure in
the system's manual. Task allocation is dynamic, and the scheduling policy
is based on the POSIX scheduling implementation. We are also capable of
implementing preemption, which we currently do not do, since systems such
as Linux and FreeBSD already provide industry-grade preemption.
2. Job/instruction allocator: a set of remote-fork-like utilities that
allocate the jobs to the nodes. Unlike traditional cluster technology, this
job allocator can execute in disconnected mode, which substantially reduces
the impact of network latency during temporary disconnections.
3. Accounting: we have written a utility, the remote cluster monitor, which
samples results from all the nodes: information about CPU load,
temperature, and memory statistics. We propose that, with less than 0.2%
CPU power consumption, our network monitoring utility can sample over 1000
nodes in less than 3 seconds.
4. Authentication: all transactions between the nodes are 128-bit encrypted
and do not require root privileges; only a common user account must exist
on all the standalone nodes. For the diskless part, even this restriction
is removed.
5. Resource discovery: we run our own socket-layer resource discovery
utility, which discovers any additional nodes and reports when a resource
has been lost. Any additional hardware capable of being used as part of the
parallel system, such as an extra processor or the replacement of a
processor with a dual-core one, is also reported continually.
6. Synchronizer: the central balancing component of the cluster. Since the
cluster can simultaneously run both diskless and standalone nodes, the
synchronizer queues output in real time so that data from different nodes
is not mixed up. It performs instruction dependency analysis and uses
pipelining over the network to make the interconnect more communicative.
5.3 Description of software behavior
The end user submits the process/application to the administrator when the
application is source-based, and the cluster administrator owns the responsibility
of explicitly parallelizing the application for maximum exploitation of the
parallel architectures within the CPU and across the cluster nodes. When the
application is binary (no source), the user may submit the code directly to the
master node's program acceptor, which in turn runs the application with somewhat
lower efficiency compared to a source submission to the administrator. The total
system is then responsible for minimizing the processing time, which in turn
increases the throughput and speeds up the processing.
5.3.1 Events
1. System Installation
2. Network initialization
3. Server and host configuration
4. Take input
5. Parallel execution
6. Send response
5.3.2 States
1. System Ready
2. System Busy
3. System Idle
Chapter 6. Technologies Used
6.1 General terms
We will now briefly define the general terms that will be used in further descriptions
or are related to our system.
Cluster: an interconnection of a large number of computers working together in a
closely synchronized manner to achieve higher performance, scalability, and net
computational power.
Master: the server machine that acts as the administrator of the entire parallel
cluster and performs task scheduling.
Slave: a client node that executes the task given to it by the master.
SSCI: Single System Cluster Image, the idea of implementing the cluster nodes as
a single image, in which each cluster node behaves as if it were an additional
processor, add-on RAM, etc. of the controlling master computer. This is the base
theory of cluster-level parallelism. Example implementations are multi-node NUMA
(IBM/Sequent) multi-quad computers and SGI Altix servers. However, the idea of a
true SSCI remains unimplemented when it comes to heterogeneous clusters for
parallel processing, except in supercomputing clusters such as Thunder and the
Earth Simulator.
RARP: the Reverse Address Resolution Protocol, a network-layer protocol used to
resolve an IP address from a given hardware address (such as an Ethernet/MAC
address).
BProc:
The Beowulf Distributed Process Space (BProc) is a set of kernel modifications,
utilities, and libraries which allow a user to start processes on other machines
in a Beowulf-style cluster. Remote processes started with this mechanism appear in
the process table of the front-end machine of the cluster. This allows remote
process management using the normal UNIX process-control facilities. Signals are
transparently forwarded to remote processes, and exit status is received using the
usual wait() mechanisms.
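The "usual wait() mechanisms" that BProc preserves are the same ones available to an ordinary shell script. The sketch below shows the pattern locally: background jobs stand in for remote subtasks, and each exit status is collected with wait. This is an illustration of the mechanism only, not Parallex or BProc code.

```shell
#!/bin/bash
# Start several background jobs (stand-ins for remote subtasks), record
# their PIDs, then collect each exit status with the normal wait mechanism.

pids=()
for task in 1 2 3; do
  ( exit 0 ) &              # a subtask that succeeds
  pids+=("$!")
done

FAILED=0
for pid in "${pids[@]}"; do
  wait "$pid" || FAILED=$((FAILED + 1))
done
echo "subtasks failed: $FAILED"
```

Because BProc makes remote processes appear in the front-end's process table, a master-side script written in exactly this style can supervise work running on the slave nodes.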
Having discussed the basic concepts of parallel and distributed systems, the problems
in this field, and an overview of Parallex, we now move forward with the requirement
analysis and design details of our system.
Chapter 7. Testing
Logic coverage / decision-based test cases:

No. | Test case name | Test procedure | Pre-condition | Expected result | Reference to detailed design
1 | Initial_frame_fail | Initial frame not defined | None | Parallex should give an error & exit | Distribution algo.
2 | Final_frame_fail | Final frame not defined | None | Parallex should give an error & exit | Distribution algo.
3 | Initial_final_full | Initial & final frames given | None | Parallex should distribute according to speed | Distribution algo.
4 | Input_file_name_blank | No input file given | None | Input file not found | Parallex master
5 | Input_parameters_blank | No parameters defined at the command line | None | Exit on error | Parallex master

Table 7.1 Logic coverage/decision-based testing
Initial functional test cases for Parallex:

Use case | Function being tested | Initial system state | Input | Expected output
System startup | Master is started when the switch is turned "on" | Master is off | Activate the "on" switch | Master is on
System startup | Nodes are started when the switch is turned "on" | Nodes are off | Activate the "on" switch | Nodes are on
System startup | Nodes are assigned IPs by the master | Booting | Get the boot image from the master | Master shows that the nodes are up
System shutdown | System is shut down when the switch is turned "off" | System is on and not servicing a customer | Activate the "off" switch | System is off
System shutdown | Connection to the master is terminated when the system is shut down | System has just been shut down | - | Verify from the master side that a connection to the slave no longer exists
Session | System reads a customer's program | System is on and not servicing a customer | Insert a readable code/program | Program accepted
Session | System rejects an unreadable program | System is on and not servicing a customer | Insert an unreadable code/program | Program is rejected; system displays an error screen; system is ready to start a new session
Session | System accepts the customer's program | System is asking for entry of the RANGE of the calculation | Enter a RANGE | System gets the RANGE
Session | System breaks up the task | System is breaking up the task according to the processing speed of the nodes | Perform the distribution algorithm | System breaks up the task & writes it into a file
Session | System feeds the tasks to the nodes for processing | System feeds tasks to the nodes for execution | Send the tasks | System displays a menu of the tasks running on the nodes
Session | Session ends when all nodes give their output | System is collecting the output of all nodes | Get the output from all nodes | System displays the output & quits

Table 7.2 Functional test cases
Cyclomatic complexity:
Control flow graph of the system:
Fig 7.1 Cyclomatic diagram for the system
Cyclomatic complexity is a software metric developed by Thomas McCabe and is
used to measure the complexity of a program. It directly measures the number of
linearly independent paths through a program's source code.
Computation of cyclomatic complexity:
In the above flow graph,
E = number of edges = 9
N = number of nodes = 7
M = E - N + 2 = 9 - 7 + 2 = 4
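The same computation can be kept as a tiny helper for checking other control flow graphs (a single connected component is assumed, as in the formula above):

```shell
#!/bin/bash
# Cyclomatic complexity M = E - N + 2 for a connected control flow graph.
cyclomatic() {
  echo $(( $1 - $2 + 2 ))    # $1 = edges, $2 = nodes
}

M=$(cyclomatic 9 7)          # the graph above: E = 9, N = 7
echo "M = $M"
```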
Console and black box testing:

CONSOLE TEST CASES

No. | Test procedure | Pre-condition | Expected result | Actual result
1 | Testing in a Linux terminal | Terminal variables have default values | Xterm-related tools are disabled | No graphical information displayed
2 | Invalid no. of arguments | All nodes are up | Error message | Proper usage given
3 | Pop-up terminals for different nodes | All nodes are up | No. of pop-ups = no. of cores in alive nodes | No. of pop-ups = no. of cores in alive nodes
4 | 3D rendering on a single machine | All necessary files in place | Live 3D rendering | Shows frame being rendered
5 | 3D rendering on the Parallex system | All nodes are up | Status of rendering | Rendered video
6 | MPlayer testing | Rendered frames | Animation in .avi format | Rendered video (.avi)

Table 7.3 Console test cases
BLACK BOX TEST CASES

No. | Test procedure | Pre-condition | Expected result | Actual result
1 | New node up | Node is down | Status message displayed by the NetMon tool | Message "Node UP"
2 | Node goes down | Node is up | Status message displayed by the NetMon tool | Message "Node DOWN"
3 | Node information | Nodes are up | Internal information of the nodes | Status, IP, MAC addr, RAM, etc.
4 | Main task submission | Application is compiled | Next module called (distribution algo.) | Processing speed of the nodes
5 | Main task submission with faulty input | Application is compiled | ERROR | Displays error & exits
6 | Distribution algorithm | Get RANGE | Break up the task according to the processing speed of the nodes | Breaks up the RANGE & generates scripts
7 | Cluster feed script | All nodes up | Task sent to individual machines for execution | Display shows the task executed on each machine
8 | Result assembly | All machines have returned results | Final result calculation | Final result displayed on screen
9 | Fault tolerance | Machine(s) go down in between execution | Error recovery script is executed | Task resent to all alive machines

Table 7.4 Black box testing
System Usage Specification outline:
Fig 7.2 System usage pattern
Fig 7.3 Histogram
Runtime Benchmark:
Fig 7.4 One frame from Complex Rendering on Parallex: Simulation of an explosion
The following is an output comparison of the same application, run with the same parameters on a standalone machine, an existing Beowulf parallel cluster, and our Parallex cluster system.
Application: POVRAY
Hardware specifications:
NODE 0: P4, 2.8 GHz
NODE 1: Core 2 Duo, 2.8 GHz
NODE 2: AMD 64, 2.01 GHz
NODE 3: AMD 64, 1.80 GHz
NODE 4: Celeron D, 2.16 GHz
Benchmark Results:
Time | Single Machine | Existing Parallel System (4 nodes) | Parallex Cluster System (4 nodes)
Real time | 14m 44.3s | 3m 41.61s | 3m 1.62s
User time | 13m 33.2s | 10m 4.67s | 9m 30.75s
Sys time | 2m 2.26s | 0m 2.26s | 0m 2.31s
Table 7.5 Benchmark Results
Note: the cluster's user time is approximately the sum of the per-node user times.
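As a sanity check, the real-time speedups implied by Table 7.5 can be computed directly from the figures above (arithmetic only, no new measurements):

```shell
# Real-time speedup over the standalone machine, using the Table 7.5 values
awk 'BEGIN {
  single   = 14*60 + 44.3    # 884.30 s, standalone machine
  existing = 3*60 + 41.61    # 221.61 s, existing parallel system
  parallex = 3*60 + 1.62     # 181.62 s, Parallex cluster
  printf "existing cluster speedup: %.2fx\n", single/existing
  printf "parallex speedup:         %.2fx\n", single/parallex
}'
# prints 3.99x for the existing cluster and 4.87x for Parallex
```

So on four nodes Parallex achieves roughly a 4.9x real-time speedup, versus about 4.0x for the existing Beowulf-style cluster.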
Chapter 8. Cost Estimation
Since processing requirements are growing far faster than CPU power, and since the silicon chip is fast approaching its limits, the implementation of parallel processing at every level of computing becomes inevitable.
We therefore propose that, in the coming years, parallel processing and the algorithms that support it, like the ones we have designed and implemented, will form the heart of modern computing. Not surprisingly, parallel processing has already begun to penetrate the computing market directly in the form of multi-core processors such as Intel's dual-core and quad-core processors.
Two of our primary aims, a simple implementation and minimal administrative overhead, make Parallex simple and effective to deploy. It can easily be deployed in any sector of modern computing where CPU-intensive applications are important to growth.
While a system of n parallel processors is less efficient than a single processor that is n times faster, a parallel system is often cheaper to build. Parallel computation is used for tasks that require very large amounts of computation and time, and that can be divided into n independent subtasks. In recent years, most high-performance computing systems, also known as supercomputers, have had parallel architectures.
Cost effectiveness is one of the major achievements of our Parallex system. It requires no special or expensive hardware or software, so the system itself is inexpensive. It is based on heterogeneous clusters, in which differences in CPU power are not an issue thanks to our mathematical distribution algorithm; system efficiency drops by no more than 5% when a few slower machines are included.
In fact, we turn silicon waste into an asset: outdated, slower CPUs can be put back to work, making the system environmentally friendly by design. Another feature is the use of diskless nodes, which reduces the total cost of the system by approximately 20%, since the nodes need no storage devices of their own; in place of per-node disks we use a centralized storage solution. Last but not least, all of our software tools are open source.
Hence, we conclude that our Parallex system is one of the most cost-effective systems in its genre.
Chapter 9. User Manual
9.1 Dedicated cluster setup
For a dedicated cluster with one master and many diskless slaves, all the user has to do is install the RPMs supplied on the installation disk on the master. The BProc configuration file will then be found at /etc/bproc/config.
9.1.1 BProc Configuration
Main configuration file:
/etc/bproc/config
• Edit with favorite text editor
• Lines starting with # are comments
• Other lines consist of a keyword followed by its arguments
• Specify interface:
interface eth0 10.0.4.1 255.255.255.0
• eth0 is interface connected to nodes
• IP of master node is 10.0.4.1
• Netmask of master node is 255.255.255.0
• Interface will be configured when BProc is started
Specify range of IP addresses for nodes:
iprange 0 10.0.4.10 10.0.4.14
• Start assigning IP addresses at node 0
• First address is 10.0.4.10, last is 10.0.4.14
• The size of this range determines the number of nodes in the cluster
• Next entries are default libraries to be installed on nodes
• Can explicitly specify libraries or extract library information from an
executable
• Need to add entry to install extra libraries
librariesfrombinary /bin/ls /usr/bin/gdb
• The bplib command can be used to see libraries that will be loaded
Next line specifies the name of the phase 2 image
bootfile /var/bproc/boot.img
• Should be no need to change this
• Need to add a line to specify kernel command line
• kernelcommandline apm=off console=ttyS0,19200
• Turn APM support off (since these nodes don’t have any)
• Set console to use ttyS0 and speed to 19200
• This is used by beoboot command when building phase 2 image
Final lines specify Ethernet addresses of nodes, examples given
#node 0 00:50:56:00:00:00
#node 00:50:56:00:00:01
• Needed so node can learn its IP address from master
• First 0 is optional, assign this address to node 0
• Can automatically determine and add ethernet addresses using the
nodeadd command
• We will use this command later, so no need to change now
• Save file and exit from editor
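Putting the entries walked through above in one place, a minimal /etc/bproc/config might look like the following sketch (the interface, IP range, libraries, boot file, kernel command line and MAC addresses are the example values used in this section):

```
# Interface connected to the nodes; master is 10.0.4.1, netmask 255.255.255.0
interface eth0 10.0.4.1 255.255.255.0

# Node IP addresses: node 0 starts at 10.0.4.10; the size of the
# range (5 addresses here) fixes the number of nodes
iprange 0 10.0.4.10 10.0.4.14

# Default libraries for the nodes, extracted from these binaries
librariesfrombinary /bin/ls /usr/bin/gdb

# Name of the phase 2 image built by beoboot
bootfile /var/bproc/boot.img

# APM off, serial console on ttyS0 at 19200
kernelcommandline apm=off console=ttyS0,19200

# Ethernet addresses of the nodes (or let nodeadd fill these in)
node 0 00:50:56:00:00:00
node 00:50:56:00:00:01
```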
Other configuration files
/etc/bproc/config.boot
• Specifies PCI devices that are going to be used by the nodes at boot time
• Modules are included in phase 1 and phase 2 boot images
• By default the node will try all network interfaces it can find
/etc/bproc/node_up.conf
• Specifies actions to be taken in order to bring a node up
• Load modules
• Configure network interfaces
• Probe for PCI devices
• Copy files and special devices out to node
9.1.2 Bringing up BProc
Check BProc will be started at boot time
# chkconfig --list clustermatic
• Restart master daemon and boot server
# service bjs stop
# service clustermatic restart
# service bjs start
• Load the new configuration
• BJS uses BProc, so needs to be stopped first
• Check interface has been configured correctly
# ifconfig eth0
• Should have IP address we specified in config file
9.1.3 Build a Phase 2 Image
• Run the beoboot command on the master
# beoboot -2 -n --plugin mon
• -2 this is a phase 2 image
• -n image will boot over network
• --plugin add plugin to the boot image
• The following warning messages can be safely ignored
WARNING: Didn’t find a kernel module called gmac.o
WARNING: Didn’t find a kernel module called bmac.o
• Check phase 2 image is available
# ls -l /var/clustermatic/boot.img
9.1.4 Loading the Phase 2 Image
• Two Kernel Monte is a piece of software that loads a new
Linux kernel, replacing the one that is already running
• This allows you to use Linux as your boot loader!
• Using Linux means you can use any network that Linux supports.
• There is no PXE bios or Etherboot support for Myrinet, Quadrics or Infiniband
• “Pink” network boots on Myrinet which allowed us to avoid buying a 1024
port ethernet network
• Currently supports x86 (including AMD64) and Alpha
9.1.5 Using the Cluster
bpsh
• Migrates a process to one or more nodes
• Process is started on front-end, but is immediately migrated onto nodes
• Effect similar to rsh command, but no login is performed and no shell is
started
• I/O forwarding can be controlled
• Output can be prefixed with node number
• Run date command on all nodes which are up
# bpsh -a -p date
• See other arguments that are available
# bpsh -h
bpcp
• Copies files to a node
• Files can come from master node, or other nodes
• Note that a node only has a ram disk by default
• Copy /etc/hosts from master to /tmp/hosts on node 0
# bpcp /etc/hosts 0:/tmp/hosts
# bpsh 0 cat /tmp/hosts
9.1.6 Managing the Cluster
bpstat
• Shows status of nodes
• up node is up and available
• down node is down or can’t be contacted by master
• boot node is coming up (running node_up)
• error an error occurred while the node was booting
• Shows owner and group of node
• Combined with permissions, determines who can start jobs on the node
• Shows permissions of the node
---x------ execute permission for node owner
------x--- execute permission for users in node group
---------x execute permission for other users
bpctl
• Control a node's status
• Reboot node 1 (takes about a minute)
# bpctl -S 1 -R
• Set state of node 0
# bpctl -S 0 -s groovy
• Only up, down, boot and error have special meaning, everything else
means not down
• Set owner of node 0
# bpctl -S 0 -u nobody
• Set permissions of node 0 so anyone can execute a job
# bpctl -S 0 -m 111
bplib
• Manage libraries that are loaded on a node
• List libraries to be loaded
# bplib -l
• Add a library to the list
# bplib -a /lib/libcrypt.so.1
• Remove a library from the list
# bplib -d /lib/libcrypt.so.1
9.1.7 Troubleshooting techniques
• The tcpdump command can be used to check for node activity during and after a
node has booted
• Connect a cable to serial port on node to check console output for errors in boot
process
• Once node reaches node_up processing, messages will be logged in
/var/log/bproc/node.N (where N is node number)
9.2 Shared Cluster Setup
Once you have the basic installation completed, you'll need to configure the system.
Many of the tasks are no different for machines in a cluster than for any other system.
For other tasks, being part of a cluster impacts what needs to be done. The following
subsections describe the issues associated with several services that require special
considerations.
9.2.1 DHCP
Dynamic Host Configuration Protocol (DHCP) is used to supply network
configuration parameters, including IP addresses, host names, and other information
to clients as they boot. With clusters, the head node is often configured as a DHCP
server and the compute nodes as DHCP clients. There are two reasons to do this. First,
it simplifies the installation of compute nodes since the information DHCP can supply
is often the only thing that is different among the nodes. Since a DHCP server can
handle these differences, the node installation can be standardized and automated. A
second advantage of DHCP is that it is much easier to change the configuration of the
network. You simply change the configuration file on the DHCP server, restart the
server, and reboot each of the compute nodes.
The basic installation is rarely a problem. The DHCP system can be installed as a part
of the initial Linux installation or after Linux has been installed. The DHCP server
configuration file, typically /etc/dhcpd.conf, controls the information distributed to
the clients. If you are going to have problems, the configuration file is the most likely
source.
The DHCP configuration file may be created or changed automatically when some
cluster software is installed. Occasionally, the changes may not be done optimally or
even correctly so you should have at least a reading knowledge of DHCP
configuration files. Here is a heavily commented sample configuration file that
illustrates the basics. (Lines starting with "#" are comments.)
# A sample DHCP configuration file.
# The first commands in this file are global,
# i.e., they apply to all clients.
# Only answer requests from known machines,
# i.e., machines whose hardware addresses are given.
deny unknown-clients;
# Set the subnet mask, broadcast address, and router address.
option subnet-mask 255.255.255.0;
option broadcast-address 172.16.1.255;
option routers 172.16.1.254;
# This section defines individual cluster nodes.
# Each subnet in the network has its own section.
subnet 172.16.1.0 netmask 255.255.255.0 {
group {
# The first host, identified by the given MAC address,
# will be named node1.cluster.int, will be given the
# IP address 172.16.1.1, and will use the default router
# 172.16.1.254 (the head node in this case).
host node1{
hardware ethernet 00:08:c7:07:68:48;
fixed-address 172.16.1.1;
option routers 172.16.1.254;
option domain-name "cluster.int";
}
host node2{
hardware ethernet 00:08:c7:07:c1:73;
fixed-address 172.16.1.2;
option routers 172.16.1.254;
option domain-name "cluster.int";
}
# Additional node definitions go here.
}
}
# For servers with multiple interfaces, this entry says to ignore requests
# on specified subnets.
subnet 10.0.32.0 netmask 255.255.248.0 { not authoritative; }
As shown in this example, you should include a subnet section for each subnet on
your network. If the head node has an interface for the cluster and a second interface
connected to the Internet or your organization's network, the configuration file will
have a group for each interface or subnet. Since the head node should answer DHCP
requests for the cluster but not for the organization, DHCP should be configured so
that it will respond only to DHCP requests from the compute nodes.
9.2.2 NFS
A network filesystem is a filesystem that physically resides on one computer (the file
server), which in turn shares its files over the network with other computers on the
network (the clients). The best-known and most common network filesystem is
Network File System (NFS). In setting up a cluster, designate one computer as your
NFS server. This is often the head node for the cluster, but there is no reason it has to
be. In fact, under some circumstances, you may get slightly better performance if you
use different machines for the NFS server and head node. Since the server is where
your user files will reside, make sure you have enough storage. This machine is a
likely candidate for a second disk drive or raid array and a fast I/O subsystem. You
may even want to consider mirroring the filesystem using a small high-availability
cluster.
Why use an NFS? It should come as no surprise that for parallel programming you'll
need a copy of the compiled code or executable on each machine on which it will run.
You could, of course, copy the executable over to the individual machines, but this
quickly becomes tiresome. A shared filesystem solves this problem. Another
advantage to an NFS is that all the files you will be working on will be on the same
system. This greatly simplifies backups. (You do backups, don't you?) A shared
filesystem also simplifies setting up SSH, as it eliminates the need to distribute keys.
(SSH is described later in this chapter.) For this reason, you may want to set up NFS
before setting up SSH. NFS can also play an essential role in some installation
strategies.
If you have never used NFS before, setting up the client and the server are slightly
different, but neither is particularly difficult. Most Linux distributions come with most
of the work already done for you.
9.2.2.1 Running NFS
Begin with the server; you won't get anywhere with the client if the server isn't
already running. Two things need to be done to get the server running. The file
/etc/exports must be edited to specify which machines can mount which directories,
and then the server software must be started. Here is a single line from the file
/etc/exports on the server amy:
/home basil(rw) clara(rw) desmond(rw) ernest(rw) george(rw)
This line gives the clients basil, clara, desmond, ernest, and george read/write access
to the directory /home on the server. Read access is the default. A number of other
options are available and could be included. For example, the no_root_squash option
could be added if you want to edit root permission files from the nodes.
Had a space been inadvertently included between basil and (rw), read access would
have been granted to basil and read/write access would have been granted to all other
systems. (Once you have the systems set up, it is a good idea to use the command
showmount -a to see who is mounting what.)
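The whitespace pitfall described above is easier to see side by side (same hypothetical host name as in the example; /etc/exports allows end-of-line comments):

```
/home basil(rw)     # basil gets read/write access to /home
/home basil (rw)    # basil gets read-only; (rw) now applies to all other hosts
```

After any change, showmount -a is a quick way to confirm who is actually mounting what.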
Once /etc/exports has been edited, you'll need to start NFS. For testing, you can use
the service command as shown here
[root@fanny init.d]# /sbin/service nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: [ OK ]
Starting NFS daemon: [ OK ]
[root@fanny init.d]# /sbin/service nfs status
rpc.mountd (pid 1652) is running...
nfsd (pid 1666 1665 1664 1663 1662 1661 1660 1657) is running...
rpc.rquotad (pid 1647) is running...
(With some Linux distributions, when restarting NFS, you may find it necessary to
explicitly stop and restart both nfslock and portmap as well.) You'll want to change
the system configuration so that this starts automatically when the system is rebooted.
For example, with Red Hat, you could use the serviceconf or chkconfig commands.
For the client, the software is probably already running on your system. You just need
to tell the client to mount the remote filesystem. You can do this several ways, but in
the long run, the easiest approach is to edit the file /etc/fstab, adding an entry for the
server. Basically, you'll add a line to the file that looks something like this:
amy:/home /home nfs rw,soft 0 0
In this example, the local system mounts the /home filesystem located on amy as the
/home directory on the local machine. The filesystems may have different names. You
can now manually mount the filesystem with the mount command
[root@ida /]# mount /home
When the system reboots, this will be done automatically.
When using NFS, you should keep a couple of things in mind. The mount point,
/home, must exist on the client prior to mounting. While the remote directory is
mounted, any files that were stored on the local system in the /home directory will be
inaccessible. They are still there; you just can't get to them while the remote directory
is mounted. Next, if you are running a firewall, it will probably block NFS traffic. If
you are having problems with NFS, this is one of the first things you should check.
File ownership can also create some surprises. User and group IDs should be
consistent among systems using NFS, i.e., each user will have identical IDs on all
systems. Finally, be aware that root privileges don't extend across NFS shared systems
(if you have configured your systems correctly). So if, as root, you change the
directory (cd) to a remotely mounted filesystem, don't expect to be able to look at
every file. (Of course, as root you can always use su to become the owner and do all
the snooping you want.) Details for the syntax and options can be found in the nfs(5),
exports(5), fstab(5), and mount(8) manpages.
9.2.3 SSH
To run software across a cluster, you'll need some mechanism to start processes on
each machine. In practice, a prerequisite is the ability to log onto each machine within
the cluster. If you need to enter a password for each machine each time you run a
program, you won't get very much done. What is needed is a mechanism that allows
logins without passwords.
This boils down to two choices—you can use remote shell (RSH) or secure shell
(SSH). If you are a trusting soul, you may want to use RSH. It is simpler to set up with
less overhead. On the other hand, SSH network traffic is encrypted, so it is safe from
snooping. Since SSH provides greater security, it is generally the preferred approach.
SSH provides mechanisms to log onto remote machines, run programs on remote
machines, and copy files among machines. SSH is a replacement for ftp, telnet, rlogin,
rsh, and rcp. A commercial version of SSH is available from SSH Communications
Security (http://www.ssh.com), a company founded by Tatu Ylönen, an original
developer of SSH. Or you can go with OpenSSH, an open source version from
http://www.openssh.org.
OpenSSH is the easiest since it is already included with most Linux distributions. It
has other advantages as well. By default, OpenSSH automatically forwards the
DISPLAY variable. This greatly simplifies using the X Window System across the
cluster. If you are running an SSH connection under X on your local machine and
execute an X program on the remote machine, the X window will automatically open
on the local machine. This can be disabled on the server side, so if it isn't working,
that is the first place to look.
There are two sets of SSH protocols, SSH-1 and SSH-2. Unfortunately, SSH-1 has a
serious security vulnerability. SSH-2 is now the protocol of choice. This discussion
will focus on using OpenSSH with SSH-2.
Before setting up SSH, check to see if it is already installed and running on your
system. With Red Hat, you can check to see what packages are installed using the
package manager.
[root@fanny root]# rpm -q -a | grep ssh
openssh-3.5p1-6
openssh-server-3.5p1-6
openssh-clients-3.5p1-6
openssh-askpass-gnome-3.5p1-6
openssh-askpass-3.5p1-6
This particular system has the SSH core package, both server and client software as
well as additional utilities. The SSH daemon is usually started as a service. As you
can see, it is already running on this machine.
[root@fanny root]# /sbin/service sshd status
sshd (pid 28190 1658) is running...
Of course, it is possible that it wasn't started as a service but is still installed and
running. You can use ps to double check.
[root@fanny root]# ps -aux | grep ssh
root 29133 0.0 0.2 3520 328 ? S Dec09 0:02 /usr/sbin/sshd
...
Again, this shows the server is running.
With some older Red Hat installations, e.g., the 7.3 workstation, only the client
software is installed by default. You'll need to manually install the server software. If
using Red Hat 7.3, go to the second install disk and copy over the file
RedHat/RPMS/openssh-server-3.1p1-3.i386.rpm. (Better yet, download the latest
version of this software.) Install it with the package manager and then start the
service.
[root@james root]# rpm -vih openssh-server-3.1p1-3.i386.rpm
Preparing... ########################################### [100%]
1:openssh-server ########################################### [100%]
[root@james root]# /sbin/service sshd start
Generating SSH1 RSA host key: [ OK ]
Generating SSH2 RSA host key: [ OK ]
Generating SSH2 DSA host key: [ OK ]
Starting sshd: [ OK ]
When SSH is started for the first time, encryption keys for the system are generated.
Be sure to set this up so that it is done automatically when the system reboots.
Configuration files for both the server, sshd_config, and client, ssh_config, can be
found in /etc/ssh, but the default settings are usually quite reasonable. You shouldn't
need to change these files.
9.2.3.1 Using SSH
To log onto a remote machine, use the command ssh with the name or IP address of
the remote machine as an argument. The first time you connect to a remote machine,
you will receive a message with the remote machine's fingerprint, a string that
identifies the machine. You'll be asked whether to proceed or not. This is normal.
[root@fanny root]# ssh amy
The authenticity of host 'amy (10.0.32.139)' can't be established.
RSA key fingerprint is 98:42:51:3e:90:43:1c:32:e6:c4:cc:8f:4a:ee:cd:86.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'amy,10.0.32.139' (RSA) to the list of known hosts.
root@amy's password:
Last login: Tue Dec 9 11:24:09 2003
[root@amy root]#
The fingerprint will be recorded in a list of known hosts on the local machine. SSH
will compare fingerprints on subsequent logins to ensure that nothing has changed.
You won't see anything else about the fingerprint unless it changes. Then SSH will
warn you and query whether you should continue. If the remote system has changed,
e.g., if it has been rebuilt or if SSH has been reinstalled, it's OK to proceed. But if you
think the remote system hasn't changed, you should investigate further before logging
in.
Notice in the last example that SSH automatically uses the same identity when
logging into a remote machine. If you want to log on as a different user, use the -l
option with the appropriate account name.
You can also use SSH to execute commands on remote systems. Here is an example
of using date remotely.
[root@fanny root]# ssh -l sloanjd hector date
sloanjd@hector's password:
Mon Dec 22 09:28:46 EST 2003
Notice that a different account, sloanjd, was used in this example.
To copy files, you use the scp command. For example,
[root@fanny root]# scp /etc/motd george:/root/
root@george's password:
motd 100% |*****************************| 0 00:00
Here file /etc/motd was copied from fanny to the /root directory on george.
In the examples thus far, the system has asked for a password each time a command
was run. If you want to avoid this, you'll need to do some extra work. You'll need to
generate a pair of authorization keys that will be used to control access and then store
these in the directory ~/.ssh. The ssh-keygen command is used to generate keys.
[sloanjd@fanny sloanjd]$ ssh-keygen -b1024 -trsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/sloanjd/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/sloanjd/.ssh/id_rsa.
Your public key has been saved in /home/sloanjd/.ssh/id_rsa.pub.
The key fingerprint is:
2d:c8:d1:e1:bc:90:b2:f6:6d:2e:a5:7f:db:26:60:3f sloanjd@fanny
[sloanjd@fanny sloanjd]$ cd .ssh
[sloanjd@fanny .ssh]$ ls -a
. .. id_rsa id_rsa.pub known_hosts
The options in this example are used to specify a 1,024-bit key and the RSA
algorithm. (You can use DSA instead of RSA if you prefer.) Notice that SSH will
prompt you for a pass phrase, basically a multi-word password.
Two keys are generated, a public and a private key. The private key should never be
shared and resides only on the client machine. The public key is distributed to remote
machines. Copy the public key to each system you'll want to log onto, renaming it
authorized_keys2.
[sloanjd@fanny .ssh]$ cp id_rsa.pub authorized_keys2
[sloanjd@fanny .ssh]$ chmod go-rwx authorized_keys2
[sloanjd@fanny .ssh]$ chmod 755 ~/.ssh
If you are using NFS, as shown here, all you need to do is copy and rename the file in
the current directory. Since that directory is mounted on each system in the cluster, it
is automatically available.
If you used the NFS setup described earlier, root's home
directory, /root, is not shared. If you want to log in as root
without a password, manually copy the public keys to the target
machines. You'll need to decide whether you feel secure setting
up the root account like this.
You will use two utilities supplied with SSH to manage the login process. The first is
an SSH agent program that caches private keys, ssh-agent. This program stores the
keys locally and uses them to respond to authentication queries from SSH clients. The
second utility, ssh-add, is used to manage the local key cache. Among other things, it
can be used to add, list, or remove keys.
[sloanjd@fanny .ssh]$ ssh-agent $SHELL
[sloanjd@fanny .ssh]$ ssh-add
Enter passphrase for /home/sloanjd/.ssh/id_rsa:
Identity added: /home/sloanjd/.ssh/id_rsa (/home/sloanjd/.ssh/id_rsa)
(While this example uses the $SHELL variable, you can substitute the actual name of
the shell you want to run if you wish.) Once this is done, you can log in to remote
machines without a password.
This process can be automated to varying degrees. For example, you can add the call
to ssh-agent as the last line of your login script so that it will be run before you make
any changes to your shell's environment. Once you have done this, you'll need to run
ssh-add only when you log in. But you should be aware that Red Hat console logins
don't like this change.
You can find more information by looking at the ssh(1), ssh-agent(1), and ssh-add(1)
manpages. If you want more details on how to set up ssh-agent, you might look at
SSH, The Secure Shell by Barrett and Silverman, O'Reilly, 2001. You can also find
scripts on the Internet that will set up a persistent agent so that you won't need to
rerun ssh-add each time.
9.2.4 Hosts file and name services
Life will be much simpler in the long run if you provide appropriate name services.
NIS is certainly one possibility. At a minimum, don't forget to edit /etc/hosts for your
cluster. At the very least, this will reduce network traffic and speed up some software.
And some packages assume it is correctly installed. Here are a few lines from the host
file for amy:
127.0.0.1 localhost.localdomain localhost
10.0.32.139 amy.wofford.int amy
10.0.32.140 basil.wofford.int basil
...
Notice that amy is not included on the line with localhost. Specifying the host name as
an alias for localhost can break some software.
9.3 Working with Parallex
Once the master has been configured and all nodes are up, working with Parallex to
utilize all your available resources is very easy. Follow these simple steps to use the
power of all nodes that are up.
• Compile your code and place it in $PARALLEX_DIR/bin/
You can use the Makefile to do this for you.
# make main_app
• After the application compiles without errors, start the network
monitoring tool of Parallex