h1. Control Policies Process Flow
Translating human-readable policy into machine-readable policy takes substantial human effort, so in practice it is likely to be limited to the areas where machine-readable policy will bring the most benefit to the organisation. The written natural language policy must be available before the translation can be undertaken.
The SCAPE project has identified three stages in this process: the first covers activities that look at the whole policy; the second looks at the statements within the policy; and in the third the resulting machine-readable statements are checked for overlaps and inconsistencies.
h2. Terms used here
| *User community* | An identifiable set of people who will use/manage the preserved digital object, cf. OAIS Designated Community \\ |
| *Content Set* | A cohesive collection of digital objects, where the same preservation actions are applicable to all members of the content set \\ |
| *Preservation Case* | A particular risk/event, in which a particular user community and content set are combined. There may be more than one user community and/or content set in a preservation case as long as all the conditions/objectives apply \\ |
| *Objective* | A particular condition/question/goal for preservation at a very low level. \\ |
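The relationship between these terms can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names and the measure name {{number-of-copies}} are hypothetical and are not taken from the SCAPE implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Objective:
    """A low-level, testable condition, e.g. 'number of copies is 2'."""
    measure: str   # hypothetical measure name
    value: object  # required value for that measure


@dataclass(frozen=True)
class PreservationCase:
    """Combines content sets and user communities that share objectives."""
    name: str
    content_sets: frozenset
    user_communities: frozenset
    objectives: frozenset = field(default_factory=frozenset)


# One preservation case: every objective applies to every listed
# content set and user community, without exceptions.
case = PreservationCase(
    name="bit-loss",
    content_sets=frozenset({"digitised newspapers"}),
    user_communities=frozenset({"library preservation manager"}),
    objectives=frozenset({Objective("number-of-copies", 2)}),
)
print(case.name, len(case.objectives))
```

Using frozen dataclasses and frozensets keeps objectives hashable, so they can be compared across preservation cases later in the process.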
h2. Stage 1: Whole policy activities
h3. Stage 1.1: Identify the content set which is addressed by the policy
The content set is an intellectually cohesive collection of digital objects to which *all the objectives within a preservation case apply without exceptions*. Depending on the scope of the written policy, there may be a one-to-one mapping between the scope and a content set (especially likely if the scope is a specific file format) or there may be many content sets for the policy (especially likely if the policy covers a subject area which uses many file types for different types of material).
h3. Stage 1.2: Identify the user communities/roles required by the policy
The identification of the user community(ies) is important as it defines the group of people who have a specific role/use case within the preservation case. At a minimum there will be the curators/managers of the collection/digital objects and the potential users of the preserved digital objects, as these two roles are fairly universal. The third obvious role is that of the data creator/original owner.
For collections/content sets which have special restrictions, there may be distinctions within one or both of these basic underpinning roles. As a starting point, we recommend making fine distinctions, as it is easier to merge categories at the end of the process than it is to introduce finer-grained distinctions at that point.
h2. Stage 2: Policy statements within the whole policy activities
Natural language policy is written by humans in a specific context (organisation, legal framework, etc.) and is usually intended to be used by colleagues based in the same context. There may therefore be information that machine-actionable statements need but that is not explicitly stated in the written document. Whilst policy makers aim to be precise when making policy, natural language cannot be completely unambiguous, especially to those outside the organisational context.
For each of the relevant lines in the human readable policy, follow the procedure below:
h3. Stage 2.1: Clarification of implicit meaning
This stage is designed to find and remove as much implicit contextual meaning as possible, so that the resulting control policy statements are as unambiguous as possible. This is not a straightforward activity: part of the problem with implicit information is that neither the reader nor the writer realises that it needs to be written down, because it is assumed that everyone knows it. For example, in an organisation where a document management system is routinely used, there would be a common understanding that documentation is centrally kept and automatically versioned, so this might not be explicitly mentioned in policy.
h3. Stage 2.2: Identification of control policy preservation case
The next step for each policy statement is to decide which user communities have an interest in it and which content set is in question, to enable an initial choice of the preservation case being described.
The preservation cases of the control policy model link a content set, a specific user community and the objectives required to satisfy this combination. The final preservation cases are likely to emerge only once the entire natural language policy has been through the process.
h3. Stage 2.3: Identification of objectives
Using the content of the policy statement, identify the testable objectives which a machine could use to verify that the intent behind the natural language statement is met. Keep in mind whether these objectives apply only to this particular combination of user community and content set or might also apply to other combinations.
h3. Stage 2.4: Generate control policy statements
Either using a tool or creating RDF by hand, translate the objectives into specific, measurable RDF statements. The SCAPE control policy implementation used an internal controlled list of measures and attributes to enable the objectives to be realised.
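As a rough illustration of what creating RDF by hand might look like, the sketch below renders one preservation case and its objectives as Turtle text using plain string formatting. The namespace {{http://example.org/control-policy#}}, the class and property names, and the measure name {{numberOfCopies}} are all hypothetical placeholders; they are not the SCAPE vocabulary, which is not reproduced here. The sketch does no escaping, so it assumes simple identifiers and integer or plain string values.

```python
# Hypothetical prefix and namespace; NOT the SCAPE vocabulary.
PREFIX = "ex"
NAMESPACE = "http://example.org/control-policy#"


def to_turtle(case_id, content_set, user_community, objectives):
    """Render one preservation case and its objectives as Turtle text.

    objectives: list of (measure, value) pairs; values are assumed to be
    integers or plain strings without quotes (no escaping is performed).
    """
    lines = [f"@prefix {PREFIX}: <{NAMESPACE}> .", ""]
    lines.append(f"{PREFIX}:{case_id} a {PREFIX}:PreservationCase ;")
    lines.append(f'    {PREFIX}:contentSet "{content_set}" ;')
    lines.append(f'    {PREFIX}:userCommunity "{user_community}" .')
    for index, (measure, value) in enumerate(objectives, start=1):
        # Integers become bare numeric literals, everything else a string.
        literal = str(value) if isinstance(value, int) else f'"{value}"'
        lines.append("")
        lines.append(f"{PREFIX}:{case_id}_objective{index} a {PREFIX}:Objective ;")
        lines.append(f"    {PREFIX}:appliesTo {PREFIX}:{case_id} ;")
        lines.append(f"    {PREFIX}:measure {PREFIX}:{measure} ;")
        lines.append(f"    {PREFIX}:value {literal} .")
    return "\n".join(lines)


turtle = to_turtle(
    "rareNewspapers",
    "rare digitised newspapers",
    "curation managers",
    [("numberOfCopies", 3)],
)
print(turtle)
```

In practice an RDF library and an agreed vocabulary would replace the string formatting; the point here is only that each objective becomes a statement with a named measure and a required value.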
h2. Stage 3: Review the Preservation Cases and identify any rationalisation required
After the completion of Stage 2, check that the preservation cases are distinct. If there is significant overlap, consider combining preservation cases or moving the shared statements into organisational-level control statement sets.
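One possible way to look for overlap is to compare the sets of objectives attached to each pair of preservation cases. The sketch below uses Jaccard similarity with a threshold of 0.5; both the measure of overlap and the threshold are assumptions made for illustration, as the process itself does not define what counts as "significant". The case names and objectives in the example are also hypothetical.

```python
from itertools import combinations


def overlap_ratio(objectives_a, objectives_b):
    """Jaccard similarity of two sets of (measure, value) objectives."""
    union = objectives_a | objectives_b
    if not union:
        return 0.0
    return len(objectives_a & objectives_b) / len(union)


def find_overlaps(cases, threshold=0.5):
    """Return (name_a, name_b, ratio) for case pairs at or above threshold.

    cases: dict mapping a preservation case name to its set of objectives.
    The threshold is an arbitrary illustrative value.
    """
    results = []
    for (name_a, objs_a), (name_b, objs_b) in combinations(sorted(cases.items()), 2):
        ratio = overlap_ratio(objs_a, objs_b)
        if ratio >= threshold:
            results.append((name_a, name_b, ratio))
    return results


cases = {
    "metadata": {
        ("format-documentation-availability", "Yes"),
        ("format-validation-support", "Yes"),
    },
    "access": {
        ("format-documentation-availability", "Yes"),
        ("format-validation-support", "Yes"),
        ("format-identifier", "TIFF"),
    },
    "bit-loss": {("number-of-copies", 2)},
}
overlaps = find_overlaps(cases)
print(overlaps)
```

Pairs reported by such a check are candidates for human review, not automatic merging: whether two cases should be combined still depends on their content sets and user communities.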
h2. Worked Example
The following example is illustrative; it is not based on any specific organisation but is intended to help understanding of the process.
A university has a collection of digitised newspapers. At least one of the papers in the collection is rare and of local interest; the rest are available from other sources. The newspaper collection is available to staff and students of the University. The high-level policy concerning this material includes the following statements:
{quote}
“The University Library seeks to preserve data in highly independent environments (e.g. geographically, technically and organisationally) within the resources available. The file formats used must be appropriate.”
{quote}
The preservation procedure policy states that:
{quote}
Two copies of all material will be kept in separate geographically dispersed locations, and in the case of particularly rare material an additional copy will be kept. The file formats used must be well understood and be part of the approved list of supported formats, see Appendix A.
{quote}
|| Step || Outcome || Notes ||
| 1.1 Define Content set | 1. Digitised newspapers \\
2. Rare digitised newspaper | As one of the newspapers is rare and additional preservation activities will be undertaken for it, it should be a content set in its own right. \\
Therefore two content sets have been identified. |
| 1.2 Identify user communities | 1. Library Preservation Manager \\
2. Library users | It is not clear from the information provided whether there are further restrictions/differences between students, teaching staff and researchers. There might be access restrictions to specific journals as a result of publishers' requests, but these are not described in the example. |
| 2.1 Clarification of Implicit meaning | For statement A on copies the following statements can be generated: \\
1. Two copies of material will be kept \\
2. These two copies will be held in different geographic locations \\
3. For rare material three copies will be kept \\
4. They will be in three separate geographic locations \\
For statement B on file formats the following statements can be generated: \\
5. File formats must have documentation \\
6. The approved file formats for digitised images are TIFF and JPEG | The first statement is fairly self-explanatory, although it is not explicit about whether the third copy implies a third geographic location; for the purposes of this example we will assume that it does. For the second one, the local meaning of "well understood" has been related to documentation on the file format, and the information in Appendix A (not shown in the example) has been explicitly stated. |
| 2.2 Identification of control policy preservation case | Statement A relates to geographic dispersion of material to minimise risks due to loss of the bits \\
\\
Statement B relates to file format quality checking for data managers | The control policy preservation cases are related to planning preservation cases and risks. The two risks being mitigated are, at a high level, bit loss and unreadable/unmaintainable file formats. \\ |
| 2.3 Identification of objectives | Statement A \\
For normal collection \\
* There must be at least two copies
* There must be at least two geographic locations \\
For rare collection \\
* There must be at least three copies
* There must be at least three geographic locations \\
Statement B \\
* The file format should have documentation
* The file format must be an approved format for the content set
* It should be possible to validate the file format \\ | |
| 2.4 Generate Control Statements | Statement A \\
For content set: digitised newspapers and user community of curation managers \\
* No-of-copies = 2
* No-Geographic locations = 2 \\
For content set: rare digitised newspapers and user community of curation managers \\
* No-of-copies = 3
* No-Geographic locations = 3 \\
Statement B \\
* format documentation quality MUST be COMPLETE
* format documentation availability MUST be Yes
* format validation support MUST be Yes
* format identifier MUST be TIFF | To be able to generate machine-readable control statements there needs to be a common vocabulary, so that the tools and the user mean the same thing by the terms used. The SCAPE project used an internal vocabulary for this work. |
| 3.1 Review | These examples have not produced any overlap, so there is no change. | With a larger policy there might be overlap, for example between policy on metadata and policy on access, which might mean that the same objectives are generated for different parts of the policy. |
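To show how the Statement A control statements from step 2.4 could be tested by a machine, the sketch below compares the stored copies of a content set against the required numbers. The dictionary keys, the function name {{check_copies}} and the site names are hypothetical, and the sketch assumes the values are minimums (more copies than required is treated as compliant).

```python
# Control statements from step 2.4, keyed by content set.
# Key names are illustrative, not the SCAPE vocabulary.
CONTROL_STATEMENTS = {
    "digitised newspapers": {"copies": 2, "geographic_locations": 2},
    "rare digitised newspapers": {"copies": 3, "geographic_locations": 3},
}


def check_copies(content_set, copy_locations):
    """Compare stored copies against the control statements for a content set.

    copy_locations: one location name per stored copy.
    Returns a list of human-readable violations (empty if compliant).
    """
    required = CONTROL_STATEMENTS[content_set]
    violations = []
    if len(copy_locations) < required["copies"]:
        violations.append(
            f"copies: found {len(copy_locations)}, need {required['copies']}"
        )
    distinct = len(set(copy_locations))
    if distinct < required["geographic_locations"]:
        violations.append(
            f"geographic locations: found {distinct}, "
            f"need {required['geographic_locations']}"
        )
    return violations


# Three copies of the rare newspaper, but two share a location.
print(check_copies("rare digitised newspapers", ["site-a", "site-b", "site-b"]))
```

The rare newspaper in this usage example has enough copies but too few distinct locations, which is exactly the ambiguity that step 2.1 resolved by assuming a third geographic location is required.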