The term "policy" requires caution in the context of data management in grids. In the literature, different authors use the term to refer to a number of different things, and many other things that could be considered policies are not always referred to as such. Thus, it is important to be clear about what is meant when using this term.
The authors of [2], [3], and [4] refer to policies as goals of the VO, upon which data placement algorithms may be based. The primary policies they refer to in [2] are related to file replication and dissemination. They also consider the relationship between data management and workflow execution and discuss how this may influence policy. They discuss two examples of real systems currently in use in the scientific community. The first system is PhEDEx, a data management system for the high-energy physics community used by the Compact Muon Solenoid (CMS) experiment at CERN (see [5]). This system distributes data throughout the VO after initial publication at CERN based on a tiered dissemination model. The second system is the Lightweight Data Replicator (LDR), used by gravitational wave physicists at the Laser Interferometer Gravitational-Wave Observatory (LIGO) for data distribution throughout the sites in the VO based on metadata queries by scientists at these sites (see [10] and [11]). In these papers, the authors also classify policies into three groups: policies for staging data in to computational nodes during workflow execution, policies for staging data out after workflow execution, and policies for data reliability and durability in general, not necessarily during workflow execution.
In [15], the authors introduce a data management system for the CEDPS, called the Managed Object Placement Service (MOPS), that is intended to place data intelligently according to the data management policies of the VO or workflow manager. The authors of this paper use policies in much the same way as in [2] and provide similar policy examples that involve data replication and dissemination, for instance, referring to [1].
In [8], [13], and [14], the authors present the integrated Rule-Oriented Data System (iRODS), a data management system that enforces policies. This system uses a rule engine for policy expression and enforcement. The authors speak of management policies for data replication, pre- and post-processing, metadata extraction and assignment, administration, authorization, auditing, and accounting, as well as policies to enforce integrity, access restrictions, and data placement and presentation. iRODS is also discussed by the authors of [12], but their focus is more on digital curation and preservation than on the types of policies mentioned in [2]. They also discuss other mechanisms for digital curation and preservation.
The authors of [6], [7], [9], [19], [20], and [21] also mention policies in the context of grid computing. However, they focus on resource usage and management policies rather than data management policies, so their use of the term "policy" does not refer to data management policies. Nevertheless, we mention this work here because resource usage policies do have repercussions for data management, so it is not possible to fully separate the two concerns.
Policies in grid computing are also mentioned in [17] and [18]. However, these policies primarily relate to security and authorization, so again, they are not related to data management policies.
Lastly, in [16], the authors describe a system used in the European DataGrid project for replica management. While they do not explicitly use the term "policy", the goals of their replica management system are more or less like the replication policies described in [2], [3], and [4].
In this work, when we refer to policies, we mean data management policies similar to those described in [2], [3], and [4]. Primarily, we focus on data dissemination and replication policies. In the future, we will examine more complex policies that may incorporate other goals, such as policies that relate to data placement to improve workflow execution.
Up to now, we have implemented two practical policies that are enforced by our rule engine-based Policy-Driven Data Placement application. The first policy specifies a hierarchical or tier-based pattern for data dissemination to sites in a VO upon initial data publication. This policy is modeled after the PhEDEx system used by CMS at CERN, as mentioned above and in [5]. The second policy enforces a rule that every data file have at least a minimum number of copies at various storage elements within the system, subject to certain constraints. For instance, it stipulates that no file should have two copies on the same storage element and that copies should only be transferred to a storage element if it has a certain number of bytes free. This policy was implemented with the goals of availability and reliability in mind.
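To make the two policies concrete, the following is a minimal sketch in plain Python of the decisions each one implies. It is an illustration under stated assumptions, not the rule-engine implementation described above: the names `StorageElement`, `plan_tiered_dissemination`, and `plan_replication`, the site labels, and the tie-breaking preference for the storage element with the most free space are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StorageElement:
    """Hypothetical model of a storage element in the VO."""
    name: str
    free_bytes: int
    files: set = field(default_factory=set)


def plan_tiered_dissemination(root, children):
    """Return (source, destination) transfers for a tiered dissemination.

    `children` maps each site to the sites in the tier directly below it.
    Sites are visited breadth-first, so each tier is served by the tier above.
    """
    transfers = []
    queue = deque([root])
    while queue:
        src = queue.popleft()
        for dst in children.get(src, []):
            transfers.append((src, dst))
            queue.append(dst)
    return transfers


def plan_replication(file_name, elements, min_copies, min_free_bytes):
    """Return names of storage elements that should receive a new copy.

    Constraints taken from the policy description:
      * a storage element never holds two copies of the same file;
      * a copy is only sent to an element with at least `min_free_bytes` free.
    Assumption (not from the source): prefer elements with the most free space.
    """
    holders = [se for se in elements if file_name in se.files]
    missing = min_copies - len(holders)
    if missing <= 0:
        return []  # the file already has enough copies
    candidates = [
        se for se in elements
        if file_name not in se.files and se.free_bytes >= min_free_bytes
    ]
    candidates.sort(key=lambda se: se.free_bytes, reverse=True)
    # Fewer than `missing` names may be returned if too few elements qualify.
    return [se.name for se in candidates[:missing]]


# Hypothetical example data.
tiers = {"T0": ["T1a", "T1b"], "T1a": ["T2a"], "T1b": ["T2b", "T2c"]}
tier_plan = plan_tiered_dissemination("T0", tiers)

elements = [
    StorageElement("se1", 500, {"f"}),
    StorageElement("se2", 50),
    StorageElement("se3", 300),
    StorageElement("se4", 200),
]
replica_plan = plan_replication("f", elements, min_copies=3, min_free_bytes=100)
```

In this example, `se2` is skipped because it falls below the free-space threshold, and `se1` is skipped because it already holds a copy. A rule engine would express the same conditions declaratively as rules over facts about files and storage elements, rather than as procedural functions.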