Clusters exist mainly for two reasons:
High availability (HA): Another computer takes over the work if the primary computer fails.
Load balancing: CPU or I/O load is distributed over several computers in order to increase the capacity of the system.
The industry has created a number of solutions for managing the HA problem. Load balancing, however, is still a challenging problem, especially when not only CPU load but also I/O load, or a combination of the two, must be distributed over several computers. Moreover, load balancing subsumes the HA problem: the nodes of a cluster fail from time to time, and the system has to cope with that, or uptime goals cannot be reached.
Examples of individual cluster software
Here is a list of cluster components that were developed in the past:
Cluster-wide registry: The task is to reliably store configuration values that can change dynamically (e.g. port numbers, data mappings), and to ensure enough query capacity for the whole cluster.
Key/value store: The task is to store huge data volumes (in the multi-terabyte range) where only a single field is used as the key. The data is distributed by key over the cluster computers. The goal is to achieve extremely low access latencies, and to provide hot-spare computers that can take over when one of the primary nodes fails.
Map/reduce: The task is to solve a complex data reordering problem on a cluster, such as the creation of a search index. The term "map/reduce" was introduced by Google researchers for an algorithmic schema that provides two placeholder functions, "map" and "reduce", which the developer instantiates to obtain the final algorithm. The goal is to maximize the CPU utilization of the cluster and to minimize network accesses to data. The schematic parts of map/reduce are handled by a platform such as Apache Hadoop (see the sketch after this list).
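To make the schema concrete, here is a minimal sketch of map/reduce in Ocaml, applied to the classic word-count task. The function names are illustrative and do not correspond to Hadoop's API; on a real platform the grouping between the two phases, as well as the distribution of the work, happens across the cluster:

    (* "map": emit (key, value) pairs for one input record *)
    let map_fn (line : string) : (string * int) list =
      String.split_on_char ' ' line
      |> List.filter (fun w -> w <> "")
      |> List.map (fun w -> (w, 1))

    (* group all values by key; on a real platform this step runs
       distributed between the "map" and "reduce" phases *)
    let group pairs =
      let tbl = Hashtbl.create 64 in
      List.iter
        (fun (k, v) ->
           let vs = try Hashtbl.find tbl k with Not_found -> [] in
           Hashtbl.replace tbl k (v :: vs))
        pairs;
      Hashtbl.fold (fun k vs acc -> (k, vs) :: acc) tbl []

    (* "reduce": fold all values of one key into the final result *)
    let reduce_fn (word, counts) = (word, List.fold_left (+) 0 counts)

    let word_count lines =
      List.concat_map map_fn lines |> group |> List.map reduce_fn

    let () =
      word_count ["the quick fox"; "the lazy dog"]
      |> List.iter (fun (w, n) -> Printf.printf "%s: %d\n" w n)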
Of course, in a real cluster application, a number of such components have to be combined to get the desired functionality.
Choosing the right technology is difficult. The usual "developer reflexes" are not always the right ones, and innovative approaches get a chance to prove themselves.
The stability of the running programs cannot be emphasized enough. In normal environments a developer is already happy when the application does not crash weekly. In a cluster system, even a crash rate (per running instance) of once per month may be way too high. Imagine a cluster of 100 nodes: this rate would mean more than three failures per day somewhere in the cluster (100 crashes spread over 30 days).
Also, correctness is of paramount importance. Given that a cluster computation is very costly (and often lengthy), it is highly undesirable to discover only at the end of the computation that the results are incorrect.
Other criteria for choosing the right technology are runtime speed and "time-to-market", i.e. how long the development cycle takes.
Gerd Stolpmann prefers:
Remote procedure calls (RPC) are a well-established technology for connecting nodes into a cluster system. As a specialty, Gerd Stolpmann is an expert in asynchronous RPCs, which make it possible to run many calls in parallel without falling back to multi-threading (the second sketch after this list illustrates the underlying pattern).
Multi-processing means that tasks executed in parallel are protected from each other (a crash in one task has no impact on the other tasks). Because of this, it is preferred over multi-threading, even at the price of higher IPC costs (see the first sketch after this list).
Event-driven programming is very useful for parallelizing network accesses within a single thread or process, and for providing reasonable error paths (e.g. total timeouts for RPCs).
Ocaml as implementation language is still unusual in the software industry, but it has properties that make it superior in the area of cluster computing. Ocaml focuses on correctness (through its rich and unmatched type system), on stability (by providing mature compilers and runtimes, and by systematically banning unsafe language constructs), and on execution speed (by generating machine code). Also, Ocaml programs are change-friendly, and the time-to-market is low.
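First, a minimal Ocaml sketch of the multi-processing model, using only the standard Unix library (the tasks themselves are made up for the example): every task runs in its own forked process, so a crash surfaces merely as a nonzero exit status and cannot affect the other tasks.

    (* run a task in its own process; a crash stays inside the child *)
    let run_task (task : unit -> unit) : int =
      match Unix.fork () with
      | 0   -> (try task (); exit 0 with _ -> exit 1)  (* child *)
      | pid -> pid                                     (* parent *)

    let () =
      let tasks =
        [ (fun () -> print_endline "task 1 ok");
          (fun () -> failwith "task 2 crashes") ] in
      let pids = List.map run_task tasks in
      List.iter
        (fun pid ->
           let _, status = Unix.waitpid [] pid in
           match status with
           | Unix.WEXITED 0 -> Printf.printf "pid %d: ok\n" pid
           | _ -> Printf.printf "pid %d: failed, others unaffected\n" pid)
        pids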
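Second, a minimal sketch of the event-driven pattern: several TCP connects proceed in parallel within a single process, and one total timeout covers all of them. The addresses and the timeout value are placeholders; a real asynchronous RPC layer (such as the one in Ocamlnet) builds protocol handling on top of the same select-based mechanism.

    (* start a TCP connect without blocking; completion is signalled
       later by the socket becoming writable *)
    let connect_nb addr =
      let fd = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
      Unix.set_nonblock fd;
      (try Unix.connect fd addr
       with Unix.Unix_error (Unix.EINPROGRESS, _, _) -> ());
      fd

    let () =
      let addrs =
        [ Unix.ADDR_INET (Unix.inet_addr_loopback, 8001);
          Unix.ADDR_INET (Unix.inet_addr_loopback, 8002) ] in
      let pending = ref (List.map connect_nb addrs) in
      let deadline = Unix.gettimeofday () +. 5.0 in  (* total timeout *)
      while !pending <> [] && Unix.gettimeofday () < deadline do
        let timeout = max 0.0 (deadline -. Unix.gettimeofday ()) in
        let _, ready, _ = Unix.select [] !pending [] timeout in
        List.iter
          (fun fd ->
             (match Unix.getsockopt_error fd with
              | None   -> prerr_endline "connected"
              | Some _ -> prerr_endline "connect failed");
             Unix.close fd)
          ready;
        pending := List.filter (fun fd -> not (List.mem fd ready)) !pending
      done;
      List.iter Unix.close !pending  (* these ran into the total timeout *)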
The architecture of a cluster system is extremely important for reaching the promised performance goals. In general, Amdahl's law limits the possible speed-up of a cluster system, and a careful analysis must be done to ensure that really all parts of the system are parallelizable.
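To make this limit concrete: Amdahl's law states that if p is the fraction of the work that can be parallelized and N is the number of nodes, the achievable speed-up is at most

    S(N) = \frac{1}{(1 - p) + p / N}

For example, even with p = 0.95 on N = 100 nodes, the speed-up is at most 1 / (0.05 + 0.0095), i.e. about 16.8, far below the factor of 100 one might naively expect.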
Success often depends on the data architecture. If the data can be organized in a cluster-friendly way, it can be accessed and processed simultaneously in a natural way, and the best speed-ups can be achieved. Some examples:
Data buckets: This is a simple but very effective technique in which the whole data set is split into a number of buckets. A hash function determines which bucket a data item goes into. This works well when there is only a single key and algorithms access the data items randomly by this key (see the sketch after this list).
Pipelines: The data is a sequence of items, and this sequence flows through a chain of algorithms that transform the items step by step.
Algorithm "travels" to data: In a cluster system it makes often more sense when the algorithm is "sent" over the network to the data rather than the other way round. In practice this means a job control scheme is established where jobs are tried to be started on the node where the processed data (or most of it) are located.
There is expertise in:
Apache Hadoop, a map/reduce platform
Apache Lucene, a text indexing engine
ICE, a multi-language RPC platform, and Hydro, the ICE implementation for Ocaml
Ocamlnet, a base library for multi-processing, event-driven programming, and networking