Introduction:
Cluster computing is a very economical form of parallel computing. The configuration described in this document is based on the concept of a Beowulf cluster; however, the software used is not entirely the same. These instructions are intended to make the process as easy as possible.
Parts of a Cluster:
The cluster consists of four major parts: 1) the network, 2) the nodes, 3) the server, and 4) the gateway. Each part has a specific function. The last line of each description states any special requirements the hardware must meet to perform its function.
The shared file system on the server will contain the users' home directories as well as the MPI programs. To avoid data loss from hard drive failure, it is highly recommended that the shared drive be a RAID drive. A RAID drive consists of several drives that work as one. Data is mirrored, or striped with parity, across these drives so that it survives the failure of a single hard disk. (How the computer does this is determined by the RAID level, and not every level is redundant: RAID 0, plain striping, gives no protection. For a description of the different RAID levels see the RAID setup file.)
There are two ways to build a RAID drive: hardware RAID and software RAID. Hardware RAID uses a special controller card to link the disks together, which makes this option a bit more expensive. A less expensive way to RAID your disks is software RAID, in which the OS handles the RAID arrangement rather than a controller card. All of the RAID levels can be accomplished through software RAID, and the process to set up software RAID is described in the RAID setup file.
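To make the idea of surviving a disk failure concrete, here is a minimal sketch of the XOR parity scheme that parity-based RAID levels rely on. It is only an illustration of the principle, not how the OS or a controller card actually implements it. The block contents and the four-drive layout are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per data drive.
data = [b"home", b"dirs", b"mpi!"]

# The parity block would be stored on a fourth drive.
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block from the
# surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Because XOR is its own inverse, any single missing block can be recomputed from the others. Losing two drives at once is not recoverable under this scheme.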
Hardware Considerations:
Network:
This is the hardware configuration that was used to create this documentation:
Nodes:
Server:
Gateway:
If you use the same or a very similar setup, you should be able to use the documentation without making any changes. Differences (especially in the ethernet card) will not render this documentation worthless, but you will need to be able to select the correct drivers for your hardware, especially while building the minimal kernel. Every effort has been made to make this documentation as generic as possible; however, hardware-specific items may still be lurking within.
What Hardware do I need?
General Hardware:
Other Considerations:
The hardware list is pretty flexible. To make sure that installation goes as smoothly as possible it is preferable to use hardware that you know will work easily with LINUX.
Computers:
For the computers a general rule applies: the faster the processor and the more memory, the better. SCSI is also a good idea for the computers (especially the server). If you intend to store information on a scratch partition on the nodes, the nodes will need larger drives. If you do not, you can get away with much smaller hard drives.
You are going to set up a rather large number of computers in an enclosed space. The average engineer/architect does not anticipate 18+ computers in a room. (They usually anticipate only one or two.) This means that there may be electrical and heat dissipation issues that you need to consider. Before constructing your cluster, check with someone to make sure that the room you will build it in is suitable.
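As a rough aid for that conversation, the sketch below estimates the electrical and heat load of a room full of machines. The 18 machines come from the text above. The 200 W per machine and the 120 V supply are assumptions for illustration only, so substitute measured values for your own hardware. The function name cluster_load is hypothetical.

```python
# 1 watt of electrical load ends up as roughly 3.412 BTU per hour of heat.
WATTS_TO_BTU_PER_HOUR = 3.412


def cluster_load(machines, watts_each, volts):
    """Estimate total power draw, current, and heat output for a cluster."""
    total_watts = machines * watts_each
    return {
        "watts": total_watts,
        "amps": total_watts / volts,
        "btu_per_hour": total_watts * WATTS_TO_BTU_PER_HOUR,
    }


# Assumed figures: 200 W per machine on a 120 V supply.
load = cluster_load(machines=18, watts_each=200, volts=120)
print(load)
```

The current figure can be compared against the rating of the circuits in the room, and the BTU per hour figure against the capacity of the air conditioning.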
What Software do I need?
To make your cluster run you will need several software packages: RedHat LINUX (your operating system), ssh (your communications package), MPICH (the software that allows you to run parallel programs), and the ntp timeserver (not essential, but it helps keep the time on the cluster synchronized). These packages are provided with this documentation.
Setting up the Cluster:
General Information:
Once you have your hardware assembled, the setup of the cluster can be divided into five parts. Each part is explained in the documents that make up the remainder of this documentation.
ADDING USERS
When you add users you will need to make sure that they can ssh unchallenged (without being prompted for a password) from the server to any node. To accomplish this you will need to do the following.
COMPILING PROGRAMS
You will need to add the following lines to your makefile:

    IFLAGS = -I$(INCLDIR) -I/usr/local/mpich/include -I/usr/local/mpich/build/LINUX/ch_p4/include
    LFLAGS = -L./timing/ -L/usr/local/mpich/build/LINUX/ch_p4/lib -lm -lmpich -ltiming

Alternatively, compile with mpicc (or mpiCOMPILERNAME for your compiler), which should take care of the include and library paths as well. The MPI compilers should be located in mpich/bin.
To run programs in parallel, add "mpirun -np N" before the regular command line for the program, where N is the number of processes to start (normally the number of nodes).
For example, if you would normally type "hello++ -ivqrh" to run your program, then on a cluster configured with 493 nodes you would type "mpirun -np 493 hello++ -ivqrh" to run the parallel version of the program.
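If jobs are launched from a script rather than typed by hand, the same rule can be applied mechanically. The following is a small sketch that only builds the command line and does not run it. The function name mpirun_command is hypothetical.

```python
import shlex


def mpirun_command(num_processes, serial_command):
    """Prefix a normal command line with 'mpirun -np <num_processes>'."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # shlex.split breaks the command line into arguments the way a shell would.
    return ["mpirun", "-np", str(num_processes)] + shlex.split(serial_command)


cmd = mpirun_command(493, "hello++ -ivqrh")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where mpirun is installed.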
IMPORTANT NOTE:
To gain anything from running programs in parallel, they have to be written in parallel. This means that MPI calls must have been inserted into the program to divide the work among the processes. If a program contains no MPI calls, running it under mpirun simply starts the same serial program on every process, and you will see no speedup.
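What "written in parallel" means can be sketched without MPI itself. In a real MPI program each process would learn its own rank and the total number of processes from MPI_Comm_rank and MPI_Comm_size, work only on its own share of the data, and combine the results with a call such as MPI_Reduce. The Python below is a serial simulation of that pattern, not MPI code. The function names and the choice of four processes are hypothetical.

```python
def block_range(n_items, rank, size):
    """Return the half-open range [start, stop) of items owned by `rank`."""
    base, extra = divmod(n_items, size)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def partial_sum(values, rank, size):
    """Sum only the slice of `values` that belongs to this rank."""
    start, stop = block_range(len(values), rank, size)
    return sum(values[start:stop])


values = list(range(1, 101))  # the numbers 1..100
size = 4                      # assumed number of processes

# Under MPI each process would compute exactly one of these partial sums.
# Here the ranks are simulated one after another in a loop.
partials = [partial_sum(values, rank, size) for rank in range(size)]
total = sum(partials)  # stands in for the reduction step

print(partials, total)
```

Each simulated rank touches only a quarter of the data, which is where the speedup comes from once the ranks really run on separate nodes.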