Linux clusters -- The New Workhorse of Gene Sequencing, Proteomics and Drug Development
Posted on 02/09/2002 3:49:17 PM PST by Justa
Linux clusters, which network multiple processors together to form a unified and more powerful computing system, are becoming a major technology in the bioinformatics industry. Universities, government labs and commercial entities now boast Linux clusters of dozens, if not hundreds, of these processors, or "nodes," for the explicit purpose of gene sequencing, proteomic research, or drug discovery and development.
A node within a Linux cluster is the basic unit of processing. Typically, it is a server or workstation dedicated to the massive amounts of number crunching involved in biotech research. With information being processed at such rapid rates, scientists and engineers can devote their time and energy to analyzing the results the cluster produces, rather than waiting for those results or worrying about their accuracy. Although some traditional supercomputer programs must be adapted for clusters, this investment is attractive for two reasons.
First, nodes consist of readily available off-the-shelf components with commodity pricing. Most users of Linux clusters benchmark a price/performance ratio that is 10 times better than that of traditional alternatives. Second, the Linux operating system and cluster management software both allow clusters to scale from four processors to several hundred. Users can take advantage of this scalability to grow their system as demand warrants. The scalability and price/performance ratio of clustering allow program managers to add nodes economically as needed without changing software programs.
Because of the standard components used in Linux clusters, organizations involved with biotechnology research can afford a small cluster to start processing data, and scale as demands increase. Companies such as Tularik and Celera have chosen Linux clusters to facilitate their supercomputing needs and to significantly speed their research and development processes while increasing reliability. The risk of losing an entire processing run can be eliminated when the cluster is programmed to be forgiving of node failures. A portion of computational capacity may be lost without compromising the data or completion of the task.
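As a rough illustration of what "forgiving of node failures" can mean in practice, the sketch below hands tasks to nodes in turn and puts a task back in the queue whenever its node fails. The function name, the node names, and the failure model (a named node fails the first time it is used) are all assumptions made for this example; the article does not describe how any particular cluster implements this.

```python
from collections import deque

def run_with_requeue(tasks, nodes, failing_nodes):
    """Assign tasks round-robin and requeue any task whose node fails.

    tasks: list of task ids
    nodes: list of node names
    failing_nodes: set of node names that fail on first use
                   (hypothetical failure model for this sketch)
    Returns (results, alive), where results maps task id -> node name.
    """
    pending = deque(tasks)
    alive = list(nodes)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all nodes failed; the run cannot complete")
        task = pending.popleft()
        node = alive[turn % len(alive)]
        if node in failing_nodes:
            # The node is lost: drop it from the pool and retry the task later.
            alive.remove(node)
            pending.append(task)
            continue
        results[task] = node
        turn += 1
    return results, alive
```

With four nodes and one failure, every task still completes on the three surviving nodes; only capacity is lost, not the run.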
Algorithm development now focuses on the ability to scale across large numbers of processors, but not all nodes are equally suited to every algorithm. For example, consider clusters with two very different architectures and communications schemes that have been successful. Each cluster has 16 nodes and includes the same chassis, control systems, and Linux operating system:
The tasks used here are for example only. Each user application should be profiled across a variety of node configurations.
The large node configuration is appropriate when each compute task, or "granule," is large and computationally demanding. Because the software task takes so long to run once it is in the node, it becomes more economical to crunch with all the power possible, using as much memory as possible. Also, the use of large amounts of local storage for distribution of the problem data set reduces the need for faster communications.
In the "small" node configuration, faster communications mean centralized storage is more efficient for some tasks. Faster distribution of smaller computation packets reduces the need for large memory and local storage. But when trying to run the large application, the small node loses its speed advantage because of churning and communications overhead.
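The trade-off between the two configurations can be sketched with a toy cost model: time to move the data plus time to compute on it, with a penalty when the data no longer fits in memory. Every number below, the `Node` fields, and the `thrash_penalty` factor are invented for illustration; they are not measurements from the clusters described in the article.

```python
from dataclasses import dataclass

@dataclass
class Node:
    ops_per_s: float          # compute rate of the node
    link_bytes_per_s: float   # bandwidth to wherever the data lives
    memory_bytes: float       # RAM available to the task

def task_time(node, compute_ops, data_bytes, thrash_penalty=5.0):
    """Estimate seconds for one task: data transfer plus computation.

    If the data set exceeds node memory, computation is slowed by
    thrash_penalty (an assumed stand-in for swapping, or "churning").
    """
    transfer = data_bytes / node.link_bytes_per_s
    compute = compute_ops / node.ops_per_s
    if data_bytes > node.memory_bytes:
        compute *= thrash_penalty
    return transfer + compute

# Hypothetical nodes: "large" has more CPU and memory but a slower link,
# "small" has less CPU and memory but a faster link.
large = Node(ops_per_s=2e9, link_bytes_per_s=1e7, memory_bytes=4e9)
small = Node(ops_per_s=1e9, link_bytes_per_s=1e8, memory_bytes=5e8)

coarse = task_time(large, 1e12, 1e9), task_time(small, 1e12, 1e9)
fine = task_time(large, 1e9, 1e8), task_time(small, 1e9, 1e8)
```

Under these assumed figures the large node finishes the coarse task sooner, while the small node finishes the fine-grained task sooner. Real applications should still be profiled rather than modeled.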
Which node configuration is best? It depends on the application and the program architecture. But as is demonstrated in the example tasks, price/performance is extremely application dependent. Most successes in clustered systems include a rigorous benchmarking against a variety of platforms and "code tweaking" to take advantage of the node and processor configuration. One cluster user relates the benefits of reordering the data access in a program: the performance doubled after minor tweaking.
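The article does not say what reordering that user performed. One common form of it is loop interchange: walking a row-major buffer row by row so that consecutive reads are adjacent in memory, instead of column by column, where each read jumps ahead by a full row. The sketch below shows only the two access orders on a flat buffer; the function names are made up for this example, and any speedup depends on the language, compiler, and hardware (in interpreted Python the difference is typically small).

```python
from array import array

def sum_row_order(buf, rows, cols):
    """Sum a row-major buffer with consecutive reads adjacent in memory."""
    total = 0
    for r in range(rows):
        base = r * cols
        for c in range(cols):
            total += buf[base + c]        # stride of 1 element
    return total

def sum_column_order(buf, rows, cols):
    """Same sum, but each consecutive read jumps ahead by one full row."""
    total = 0
    for c in range(cols):
        for r in range(rows):
            total += buf[r * cols + c]    # stride of `cols` elements
    return total

# A 3 x 4 matrix of integers stored row by row in one flat buffer.
matrix = array("q", range(12))
```

Both functions return the same total, so swapping one for the other changes only the order in which memory is touched, not the result.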
Cluster Management and Integrated Solutions
The advantages of clustering in genomic and drug development applications have been proven by various companies in the biotech industry, such as BioCryst Pharmaceuticals Inc. and Sequemon, but the difficulty of managing clusters remains a challenge, especially as the number of processors increases. The hardware and software tools available from Linux NetworX allow users to control the cluster as a single system. Users have the ability to remotely monitor and manage the entire cluster as well as individual nodes. Features of the software include monitoring of processor temperature and CPU usage, and the ability to reset individual nodes. Cluster management software allows users to overcome many of the concerns associated with large cluster systems. ICE Box, from Linux NetworX, provides advanced serial switching and power control so more time can be spent in the drug development or gene sequencing process, not in the server room.
Preintegrated Linux clusters eliminate the headaches associated with complicated system setups. In this area, experience counts. Linux NetworX, for example, designed and delivered the world's first commercial Linux cluster system in 1997. Preintegrated clusters lead to a lower total cost of ownership. The fewer resources an organization has to dedicate to integrating the Linux cluster system, the more time and money it can direct into processing valuable data. A fully integrated Linux cluster solution includes computational hardware, networks, storage, software, applications, management tools, and service and support.
Lower Total Cost of Ownership
An outstanding price/performance ratio is one advantage of Linux clusters, but users need to think beyond the initial price tag to the total cost of ownership. Man-hours dedicated to managing and maintaining the cluster, including integration and set-up, can lead to a higher total cost of ownership. Linux NetworX provides complete end-to-end Linux cluster solutions that help reduce the total cost of ownership.
Overall, bioinformatics organizations need to consider value when it comes to Linux clusters. Cluster management tools can save an organization time and money by allowing administrators to control the cluster as a single system. Instead of forcing administrators to attend to every problem, ClusterWorX® management software uses a sophisticated event engine to automatically handle problems, without administrator intervention. For example, the administrator can customize ClusterWorX to send an e-mail alert if a processor reaches a user-defined temperature, and even shut down the problem node.
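The general idea of a threshold-driven event engine can be sketched as a list of rules, each pairing a monitored value with a limit and the actions to run when the limit is reached. This is not the ClusterWorX interface, which the article does not show; the `Rule` class, the metric name `cpu_temp_c`, and the two action functions are hypothetical, and the actions only return log strings rather than sending mail or cutting power.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    metric: str                                  # name of the monitored value
    threshold: float                             # user-defined limit
    actions: List[Callable[[str, float], str]]   # run in order when triggered

def email_alert(node, value):
    # Placeholder: a real system would send mail here.
    return f"email: {node} reported {value}"

def shut_down(node, value):
    # Placeholder: a real system would power the node off here.
    return f"shutdown: {node}"

def evaluate(rules, readings):
    """Apply every rule to every node's readings and return the action log.

    readings: {node_name: {metric_name: value}}
    """
    log = []
    for node, metrics in sorted(readings.items()):
        for rule in rules:
            value = metrics.get(rule.metric)
            if value is not None and value >= rule.threshold:
                for action in rule.actions:
                    log.append(action(node, value))
    return log

rules = [Rule("cpu_temp_c", 70.0, [email_alert, shut_down])]
readings = {"node01": {"cpu_temp_c": 55.0}, "node02": {"cpu_temp_c": 82.5}}
```

Calling `evaluate(rules, readings)` triggers both actions for `node02` and none for `node01`, with no administrator involved.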
The administration resources an organization needs can be greatly reduced when the management software allows for secure remote access to the cluster system. If cluster problems develop at inconvenient times, administrators have full access to the cluster from home or on the road through any Java-enabled browser. Administrators can also monitor a variety of cluster values and take actions on the cluster through an easy-to-use GUI.
The ICE Box hardware appliance fully integrates with ClusterWorX to provide administrators with advanced serial switching and remote power control capabilities, and is the only appliance of its kind designed specifically to improve the manageability of Linux clusters. ICE Box features include node health and environmental monitoring, power control, node reset capabilities, and advanced serial switching, which allows administrators to maintain redundant serial connections in a cluster. By providing direct serial access to individual nodes within the cluster, ICE Box delivers a level of control, convenience and manageability not previously available for Linux cluster systems.
On larger cluster systems, disk cloning is another significant time-saving feature because it allows software and other updates to be installed on one node and automatically distributed to the entire system. Users can store a variety of system images for different types of nodes in the cluster.
Adding nodes to an existing cluster is as simple as a few clicks of a mouse when using ClusterWorX. Clusters, by nature, are very scalable. Cluster management software can ensure additional nodes mesh seamlessly into the existing system and function in parallel.
Linux clustering has proven to be a powerful, scalable and cost-effective computing resource for biotechnology organizations. With new cluster management technologies, such as ClusterWorX and ICE Box, the complexities of managing large cluster systems have been greatly reduced. Because of all the advantages associated with this computing solution, Linux clusters are becoming the new workhorse of genomic sequencing, proteomics, and drug development.
Copyright 2002 Linux NetworX
That was about all I understood out of that entire article. Is that enough? LOL
Yeah, but it pays itself off in a short time since you don't have to 'dump' your OS knowledgebase with every new release like one has to do with MS. I'd certainly like to be able to use the MS-DOS-thru-NT4 knowledge I have to help me admin 2000 Server but Nope. The best part of the open source *nixes is of course the security available under them. Imo NT4 and NT5 are totally compromised. Of course it's speculation on my part because I'd need access to the source codes to prove it but I've done enough cause-and-effect testing on them to determine they're owned and operated by others than the End User.