The next several years will see the widespread introduction and use of gigabit wide-area and local-area networks. These networks have the potential to transform how people compute and, more importantly, how they interact and collaborate with one another. The increased bandwidth will enable the construction of wide-area virtual computers, or metasystems, that allow users to work in a wide-area environment of interconnected workstations, personal computers, graphics rendering engines, supercomputers, and non-traditional devices (such as televisions, toasters, etc.). Realizing this environment requires new conceptual models, which must provide a solid, integrated foundation for the applications that will unleash the potential of these diverse resources. This foundation must meet several conditions, including the ability to render the underlying physical infrastructure transparent to users and most programmers; support access-, location-, and fault-transparency; allow interoperation between its components; support construction of larger, integrated components; guaranteeing system security; and scale to millions of autonomous hosts.
Legion will tap the enormous computational capabilities of resources connected by local-area networks, enterprise-wide networks, and the National Information Infrastructure. While the user sits at a single workstation, Legion will transparently schedule application components on processors, manage data transfer and coercion, and provide communication and synchronization. System boundaries, data location, and faults will be invisible; processor cycles, communication channels, and data will be shared; and a workstation across the continent will be as accessible as the workstation down the hall.
Legion is not a centralized system controlled by any single person or organization. Legion systems are independent entities that can connect to other systems as they choose. Individual systems determine who has access to their resources and how that access will be monitored. Individual users, once logged on to their local system, are in a wide-ranging network of other Legion systems, and, depending on security and access protocols, can use the resources of any single part of that network.
Legion, an object-oriented metasystem software project at the University of Virginia, was first begun in the fall of 1993. Our goal is to provide a flexible, efficient, and scalable system based on solid principles. When complete, Legion will provide a single, coherent virtual machine that provides universal object and name spaces, application-adjustable fault tolerance, improved response time, and greater throughput.
To realize the Legion vision is not trivial, and several design objectives have been determined to be essential to its success. Among them are site autonomy, multiple languages and interoperability, high performance via parallelism, and a single persistent name space.
Legion is not a monolithic system. As the system grows, it will need to integrate resources owned and controlled by a wide array of organizations. There is simply no way that thousands of organizations and millions of users will subject themselves to the dictates of a "big brother" centralized control mechanism, or subject their resources and data to external management. Organizations, quite properly, will insist on having control over their own resources, specifying how much of a resource can be used, when it can be used, and who can or cannot use the resource.
Legion must support applications written in a variety of languages. It must be possible to integrate heterogeneous source-language application components in much the same manner that heterogeneous architectures are integrated. Interoperability requires that Legion support legacy codes, as well as work with emerging standards such as CORBA [See Ben-Naten] and DCE [See Lockhart].
This does not mean that all applications will be parallel--Legion will necessarily best support relatively course-grain applications. This objective should not be misinterpreted to mean that we think a single application will ever use all of the computers in the country. Most parallel applications will use only a small subset of the total resource pool at any time.
One of the most significant obstacles to wide-area parallel processing is the lack of a single name space for data and resource access. The existing multitude of disjoint name spaces makes writing applications that span sites extremely difficult. Any Legion object should be able to access (subject to security constraints) any other Legion object transparently, without regard to location or replication.
Object-orientation provides a coherent solution to these problems. In Legion, all components of interest to the system are objects, and all objects are instances of defined classes. Thus users, data, applications, and even class definitions are objects. An object-oriented foundation, including encapsulation and inheritance, offers many benefits, including software reuse, fault containment, and reduction in complexity. There is a particularly acute need for such a paradigm in a system as large and complex as Legion.
Objects written in either an object-oriented language or a language such as HPF (High Performance Fortran) will encapsulate their implementation, data structures, and parallelism and interact with other objects via well-defined interfaces. They may have associated inherited timing, fault, persistence, priority, and protection characteristics as well. Naturally, these characteristics may be overloaded to provide different functionality on a class by class basis. Similarly, a class may have multiple implementations with the same interface.
Complementing the use of an object-oriented paradigm is one of Legion's driving philosophical themes: we cannot design a system that will satisfy every user's needs. Legion is designed to allow both users and class implementors flexibility in their applications' semantics. Whenever possible, users choose both the kind and the level of functionality, and make their own trade-offs between function and cost.
For example, users will have widely varying standards of security. Some users are more concerned with privacy, some are content maintaining their data's integrity, and others want both privacy and integrity. Banks and hospitals, for example, would fall in the last category. There is a difference between the kind of security functionality and the degree, or level, of security. Some users are content with password authentication, while others might feel a need for more stringent user identification, such as signature analysis, fingerprint identification, or another approach. A user might be willing to pay the higher cost (in terms of CPU, bandwidth, and time) of a more powerful cryptographic key in order to have a stronger degree of security without changing the basic nature of the type of security provided. On the other hand, an application that requires low overhead cannot afford such a policy and should not be forced to use it. Such an application might instead choose a light-weight policy that merely verifies communication integrity or perhaps one with no security at all. Users decide what trade-offs to make, whether by implementing their own policies or using existing policies via inheritance [See Wulf et al], instead of coping with an inevitably unsatisfactory fixed security mechanism.
Next, consider the problem of maintaining consistent semantics in a distributed file system. To achieve good performance, it is often desirable to copy all or part of a file. If updates to the file are permitted, the different copies may begin to diverge. There are many ways to circumvent this problem: do not replicate writable files, use a cache-invalidate protocol, use lazy updates to a master copy, etc. Of course, while some applications don't require all copies to be the same and some require a strict "reads deliver the last value written" semantics, others consider a file to be read-only. In the last case, consistency protocols are a waste of time. On the other hand, other applications may need different semantics for files in different regions of the application. Independent of file semantics, some users will want automatic backup and frequent archiving, while others may not. The point is that the user--not the system--should make such decisions.
This philosophy has been extended into the system architecture. The Legion object model specifies the composition and functionality of Legion's core objects (objects that cooperate to create, locate, manage, and remove objects in the Legion system). Legion specifies the functionality but not the implementation of the system's core objects. The core therefore consists of extensible, replaceable components. The Legion project provides implementations of the objects that comprise the core but users are not obligated to use them. Instead, users can select or construct objects that implement mechanisms and policies that meet specific requirements.
The object model provides a natural way to achieve this kind of flexibility. Files, for example, are not part of Legion itself. Anyone may define a new class whose general semantics would be recognized as a file but whose specifics match a particular semantics. The current Legion software system provides an initial collection of file classes that reflect the most common needs, but it does not anticipate all possible future needs.
In the summer of 1995, we released the first prototype Legion implementation, the Campus-Wide Virtual Computer (CWVC). This first implementation was based on an earlier object-oriented parallel processing system, Mentat. Mentat was originally designed to operate in homogeneous, dedicated environments, but was extended to operate in an environment with heterogeneous hosts, disjoint file systems, local resource autonomy, and host failure. We could have continued to stretch Mentat, but we felt that a system can be manipulated only so much before it shows signs of stress, and that it would be better to design a new system and work with a clean, coherent architecture, rather than a patchwork of solutions for unrelated problems.
The CWVC was a direct extension of Mentat to a larger scale, and was a prototype for the nationwide Legion system. Even though the CWVC was much smaller and the components much closer together it presented many of the same challenges as a nationwide Legion system. The computational resources of the University were operated by many different departments and had no shared name space and few shared resources, heterogeneous processors, irregular interconnection networks, differences on the scale of orders of magnitude in bandwidth and latency, and on-site applications that required protection. Further, each department operated as an island of services, using individual NFS mount structures and trusting only machines on its own island.
The CWVC consisted of over one hundred workstations and an IBM SP-2 in six buildings using two completely disjoint underlying file systems. We had developed a suite of tools to address common problems, and, in collaboration with domain scientists at the University of Virginia and elsewhere, we also developed a set of applications that exploited the environment.
We ran the CWVC in the local environment, and demonstrated it on wide-area systems. During the Supercomputing '95 conference in San Diego it ran on the I-Way, an experimental network connecting the NSF supercomputer centers, several of the DOE and NASA labs, and a number of other sites. Many of the connections were DS-3 (45 mb/sec) and OC-3 (155 mb/sec) rates. The CWVC was installed at three sites, using seven hosts of three different architectures: At NCSA (Urbana) we used four SGI Power Challenges and the Convex Exemplar, and at CTC (Cornell) and ANL (Argone) we used IBM SP-2s.
Once the IP routing tables had been properly configured moving the CWVC to the wide-area environment was relatively simple. We copied the CWVC to the platforms, adjusted the tables to use IP names that routed through the high-speed network, and tested the system. As expected, files in different locations could be transparently accessed, executables could be moved transparently from one location to another as needed, the scheduler worked, and the system automatically reconfigured on host failure. Utilities and tools such as the debugger also migrated easily. The real bonus, though, was that user applications required no changes to run in the new environment.
For our demonstration, we exercised our utilities and ran one of our applications, complib, on the I-Way. Complib compares two DNA or protein sequence databases using one of several selectable algorithms [See Grimshaw et al]. The first database was located at ANL, while the second was located at NCSA. The application transparently accessed the databases using the Legion file system while the underlying system schedulers placed application computation objects throughout the three-site system. All communication, placement, synchronization, and code and data migration were transparently handled by Legion.
Since then we have repeated the demonstration several times, and are now in the process of constructing a more permanent prototype. The prototype will span NCSA and SDSC and will operate as a part of the DARPA-funded Distributed Object Computation Testbed.
The vision of a seamless metacomputer such as Legion is not new--worldwide computers have been the territory of science fiction and distributed systems research for decades. To our knowledge, no other project has the same broad scope and ambitious goals that Legion has. However, a large body of relevant research in distributed systems, parallel computing, fault-tolerance, workstation farm management, and pioneering wide-area parallel processing projects provides a strong foundation for building a metacomputer.
Related efforts such as OSF/DCE (Open Software Foundation/Distributed Computing Environment) [See H.W. Lockhart] and CORBA (Common Object Request Broker Architecture) [See Ben-Naten] are rapidly becoming industry standards. Legion and DCE share many of the same objectives, and draw upon the same heterogeneous distributed computing literature for inspiration. Consequently, both projects use many of the same techniques (i.e., object-based architecture and model, Interface Descriptive Languages [IDLs] to describe object behavior, and wrappers to support legacy code). However, Legion and DCE differ in fundamental ways: DCE does not target high-performance computing and its underlying computational model is based on blocking remote procedure calls between objects. Further, DCE does not support parallel computing but places its emphasis on client-server-based distributed computing. Legion is based on a parallel computing model, and one of its primary objectives is high-performance via parallel computing. Legion also specifies very little about implementation, since users and resource owners are permitted and even encouraged to provide their own implementations of "system" services. Our core model is completely extensible, and provides choice at every opportunity, from security to scheduling.
Similarly, CORBA defines an object-oriented model for accessing distributed objects. CORBA includes an IDL, and a specification for the functionality of run-time systems that enable access to objects (ORBs). But, like DCE, CORBA is based on a client-server model, rather than a parallel computing model, and less emphasis is placed on issues such as object persistence, placement, and migration.
The Globus project [See Foster et al] at Argonne National Laboratory and the University of Southern California has similar goals, as well as a number of similar design features. Both systems also support a range of programming interfaces, including popular packages such as MPI. They differ significantly, however, in their basic architectural techniques and design principles.
Globus adds value to existing high-performance computing systems by rendering them interoperable and by extending their implementations to operate effectively within a wide-area distributed environment. This sum-of-services approach has a number of advantages, such as taking advantage of code reuse and allowing the user to retain familiar tools and work environments. However, as the amount of provided services grows the lack of a common programming interface and model becomes an increasingly significant burden for end users. Legion's common-object programming model allows users and tool builders to employ the many services needed to effectively use a metacomputing environment: schedulers, I/O services, application components, etc. Essentially, Legion feels that the short-term advantages of patching existing parallel and distributed computing services together do not outweigh the long-term necessity of constructing a metacomputing software system on an extensible design made up of orthogonal building blocks.
The Globe [See van Steen et al] project, developed at Vrije Univeriteit, Netherlands, also shares many goals and attributes with Legion. Globe and Legion are both middleware metasystems that run on top of existing host operating systems and networks, both support flexible implementation, both have a single uniform object model and architecture, and both use class objects to abstract implementation details. But where a Globe object is passive and assumed to be physically distributed over potentially many resources in the system, a Legion object is active and is expected to physically reside within a single address space.
These conflicting views lead to different types of interobject communication: Globe loads part of the object into the caller's address space, but Legion sends a message of a specified format from the caller to the callee. Legion also has a set of core object types that are designed to provide abstractions for a wide variety of implementations. To our knowledge Globe does not offer an equivalent set of tools. It also does not address key issues such as security and site autonomy [See van Steen et al].
Other projects share many of the same objectives, but not the scope of Legion. Nexus [See Foster et al] provides communication and resource management facilities for parallel language compilers. Castle [See The Castle Project] is a set of related projects that aims to support scientific applications, parallel languages and libraries, and low-level communication issues. The NOW [See Anderson et al] project provides a somewhat more unified strategy for managing networks of workstations, but is intended to scale only to hundreds of machines instead of millions.
In its intended application for distributed collaboration and information systems, Legion might be compared to the World Wide Web. In particular, the object-oriented, secure, platform-independent, remote execution model afforded by the Java language [See Sun Microsystems] has added more Legion-like capabilities to the Web. The most significant differences between Java and Legion lie in Java's lack of a remote method invocation facility, lack of support for distributed memory parallelism, and its interpreted nature, which even in the presence of "just-in-time" compilation leads to significantly lower performance than can be achieved using compilation. Furthermore, the security and object placement models provided by Java are rigid, and are a poor fit for many applications.
Although we are committed to the object-oriented paradigm we recognize that Legion applications will be written in a variety of languages to support existing legacy code, to permit organizations to use familiar languages (C, Fortran), and to support parallel processing languages. We will provide support for multiple languages and interoperability between user objects written in these languages by generating object "wrappers" for code written in languages such as Fortran, Ada, and C, by exporting the Legion run-time system interface and retargeting existing compilers, as well as by a combination of these two methods.
Legion has both PVM and MPI interfaces. Both interfaces include functions for data marshaling, message passing, and task creation and control. Legion's PVM and MPI implementations provide an environment that includes Legion security and placement services. For information about the PVM and MPI libraries, please see "PVM" and "MPI" in the Basic User Manual.
Legacy codes and programs written in other languages can be incorporated into Legion applications by encapsulating them in an object wrapper, as shown in Figure1. Object wrappers can be hand generated, using the Mentat Programming Language (please see the "Mentat Programming Language Reference Manual," available from the Legion Research Group) or generated with an IDL and an interface compiler. In the case of the IDL, the programmer provides a "class definition" of the object, which lists the functions, type and number of any parameters, the object's state, etc.; a compiler generates the interface. This is a common technique, and is used in the OMG ORB [See Ben-Naten]. Additional IDLs can be readily supported, because class objects have the capability to export multiple interfaces or views of a class.