CSDMS Standard Names — Overview
- The Semantic Web concept and movement recognizes that the already transformative World Wide Web will become even more powerful if it is extended beyond an interconnected set of human readable documents to a set of machine readable documents that are able to capture and convey knowledge.
- The field of semantics is concerned with the study of meaning, and ontology is essentially concerned with capturing and organizing knowledge. In computer science, an ontology is a system that attempts to capture and organize knowledge in a particular domain (in machine readable form), as understood by experts in that domain or subject area.
- There are a variety of concepts that fall under the banner of "semantics and ontology" that are used to address specific issues in the development of "intelligent software". Some of these are:
- While there are subtle differences between the items in this list, they can be divided into two broad groups. The terms controlled vocabulary, crosswalk, lingua franca, nomenclature, preferred label and standard names are all closely related and have the fairly simple, linear structure of a list or lookup table. They are used primarily to map a term used in one setting to an equivalent term in another setting. Relationships between entries in the list are not of primary interest. The main interest is knowing whether two terms refer to the same object.
- By contrast, the terms ontology, master dictionary, taxonomy and typology represent efforts to capture relationships between entries (objects). They attempt to organize the objects into a hierarchy, which may include nested classes (sub- and super-classes). The connectedness or closeness of objects is also of interest. Because of this, they have the potential to capture knowledge, which is broadly concerned with relationships and the degree to which objects are similar. Their fundamental structure is that of a graph (nodes connected by lines) instead of a list.
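The structural difference between the two groups can be sketched in a few lines of Python. This is only an illustration: every term, label and class below is made up for the example and is not taken from any CSDMS list. The first group reduces to a lookup table that answers "do these two terms refer to the same object?", while the second needs a graph that can be traversed to find super-classes and to measure how close two objects are.

```python
# Group 1: a crosswalk is a flat lookup table from local terms to one
# preferred label. All terms here are hypothetical.
CROSSWALK = {
    "rainfall rate": "precipitation_rate",
    "precip": "precipitation_rate",
    "rain_rate": "precipitation_rate",
    "streamflow": "discharge",
    "Q": "discharge",
}


def same_object(term_a, term_b):
    """Return True if both terms map to the same preferred label."""
    label_a = CROSSWALK.get(term_a)
    label_b = CROSSWALK.get(term_b)
    return label_a is not None and label_a == label_b


# Group 2: a taxonomy is a graph. Here each node points to its parent class.
# The classes are hypothetical as well.
PARENT = {
    "glacier": "ice_mass",
    "ice_sheet": "ice_mass",
    "ice_mass": "land_surface_feature",
    "river": "channel",
    "channel": "land_surface_feature",
}


def ancestors(node):
    """Walk up the hierarchy and list every super-class of node."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def closeness(node_a, node_b):
    """Count the edges between two nodes via their nearest common ancestor.

    Returns None when the nodes share no ancestor.
    """
    path_a = [node_a] + ancestors(node_a)
    path_b = [node_b] + ancestors(node_b)
    for steps_a, candidate in enumerate(path_a):
        if candidate in path_b:
            return steps_a + path_b.index(candidate)
    return None
```

With these tables, `same_object("precip", "rain_rate")` is true, whereas `closeness("glacier", "ice_sheet")` returns 2 and `closeness("glacier", "river")` returns 4, a notion of relatedness that the flat table cannot express.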
- These two broad groups of tools are used to address three main software use cases, namely:
  - Semantic mediation and matching
  - Discovery of related information
  - Capture and archiving of domain knowledge
The last two of these require tools from the second, more complex group.
- According to a BBC article, the vocabulary of a well-educated person contains on the order of 50,000 to 100,000 words.
- The CSDMS semantic "use case" is one of automated semantic mediation, matching or reconciliation. While our focus is on a "lingua franca", our standard names are often built from a hierarchical set of concepts and may eventually be used to construct a type of ontology.
- The CSDMS plug-and-play modeling system requires a set of standard names for input and output variables (quantities) in order to automatically determine whether an input variable in one model (or database) is equivalent to (or compatible with) an output variable in another model (or database) for the purpose of coupling the two resources (as user and provider). These standard names do not need to be used within a model, and they are too long to be practical as internal variable names. However, CSDMS requires model contributors to implement the BMI (Basic Model Interface), and this includes mapping each of the model's input and output variables to a CSDMS Standard Name. In addition, contributors provide a Model Metadata File (MMF) that (1) specifies how each standard name is used within the model (e.g. units and assumptions) and (2) describes other key attributes of the model that must be known to facilitate coupling to other models. See CSDMS Basic Model Interface for more information.
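A minimal sketch of this matching step is shown below. It does not use the real BMI functions. It assumes only that each model exposes a mapping from its short internal variable names to standard names. The internal names, the standard names and the `match_variables` helper are all hypothetical and chosen for illustration.

```python
# Hypothetical name mappings. A real model would publish these through its
# BMI implementation; the names here are illustrative only.
PROVIDER_OUTPUTS = {  # e.g. a meteorology model
    "P": "atmosphere_water__precipitation_rate",
    "T_air": "atmosphere_air__temperature",
}
USER_INPUTS = {  # e.g. a hydrology model
    "precip_in": "atmosphere_water__precipitation_rate",
    "snow_depth_in": "snowpack__depth",
}


def match_variables(provider_outputs, user_inputs):
    """Pair each user input with the provider output sharing its standard name.

    Both arguments map internal variable names to standard names.
    Returns (matched, unmatched): matched maps a user's internal name to the
    provider's internal name, and unmatched lists user inputs that the
    provider cannot supply.
    """
    # Invert the provider mapping so it can be searched by standard name.
    by_standard_name = {std: internal for internal, std in provider_outputs.items()}
    matched = {}
    unmatched = []
    for internal, std in user_inputs.items():
        if std in by_standard_name:
            matched[internal] = by_standard_name[std]
        else:
            unmatched.append(internal)
    return matched, unmatched


matched, unmatched = match_variables(PROVIDER_OUTPUTS, USER_INPUTS)
```

Here `matched` pairs the user's `precip_in` with the provider's `P`, although the two models never agreed on an internal name, and `unmatched` reports that `snow_depth_in` must come from some other resource.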
- Our focus is more on identifying general rules and patterns for consistent construction of standard names (i.e. a systematic naming scheme) that span the geosciences and less on creating an exhaustive list of names, which comes later. We have identified numerous patterns and templates that cover a broad range of needs and these are listed and discussed in the subsequent sections of this document. This includes numerous Object Templates, Quantity Templates and Operation Templates.
- RDF (Resource Description Framework) is built around an "object + attribute + value" concept. Our "object + quantity" names follow a similar pattern and are used to retrieve the values from a model or database. The word "attribute" is a more general term than "quantity"; the latter is essentially a type of attribute that can be described with numbers and has units.
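The parallel with the "object + attribute + value" pattern can be sketched as follows. The example assumes that the object part and the quantity part of a name are joined by a double underscore. That separator, the example name and the helper functions are assumptions made for illustration and are not defined in this section.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """An "object + attribute + value" record."""
    obj: str
    attribute: str
    value: float


SEPARATOR = "__"  # assumed separator between the object and quantity parts


def split_standard_name(name):
    """Split an "object + quantity" name into its two parts."""
    obj, sep, quantity = name.partition(SEPARATOR)
    if not sep or not obj or not quantity:
        raise ValueError(f"not an object + quantity name: {name!r}")
    return obj, quantity


def to_triple(name, values):
    """Use the name as a key to retrieve a value and return it as a triple."""
    obj, quantity = split_standard_name(name)
    return Triple(obj, quantity, values[name])


# Hypothetical model state keyed by standard name.
model_values = {"channel_water__depth": 1.5}
triple = to_triple("channel_water__depth", model_values)
```

The resulting `triple` has `obj == "channel_water"`, `attribute == "depth"` and `value == 1.5`: the quantity plays the role of the attribute, and the full name is the key used to fetch the value.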
- As with CF Standard Names, units are not given as part of the name. However, in CF Standard Names, a particular SI unit is often implied by the name. The CF Standard Names also allow assumptions to be included in the name, such as "_assuming_clear_sky". In CSDMS Standard Names, we use the name as a "key" or "index" to access not only the associated values but also the associated metadata, which provides the units, set of assumptions, datum, measurement method and so on. Including every assumption in the standard name limits the number of matches that are likely to be found during the discovery process or when trying to couple models. It also discourages a complete listing of the relevant assumptions. Metadata (including assumptions) can be used to distinguish between exact and approximate matches, and this information can be presented to users when desirable.
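One way to picture the distinction between exact and approximate matches is sketched below. The `VariableMetadata` record, the `classify_match` function and the example name are hypothetical. The sketch assumes that assumptions are stored as a set of short labels and compares only those, leaving unit conversion as a separate step that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Metadata reached through a standard name (hypothetical layout)."""
    units: str
    assumptions: frozenset = frozenset()


def classify_match(name_a, meta_a, name_b, meta_b):
    """Compare two variables by name first, then by their assumptions.

    Returns (kind, differing), where kind is "none", "approximate" or
    "exact", and differing holds the assumptions made by only one side.
    """
    if name_a != name_b:
        return "none", frozenset()
    # Symmetric difference: assumptions present on exactly one side.
    differing = meta_a.assumptions ^ meta_b.assumptions
    kind = "exact" if not differing else "approximate"
    return kind, differing


NAME = "land_surface__downwelling_shortwave_flux"  # hypothetical name
model_a = VariableMetadata("W m-2", frozenset({"clear_sky"}))
model_b = VariableMetadata("W m-2")
result = classify_match(NAME, model_a, NAME, model_b)
```

Because the assumption lives in the metadata rather than in the name, the two variables are still found as a match, and `result` is `("approximate", frozenset({"clear_sky"}))`. The differing assumption can then be shown to the user instead of silently preventing the match.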
- Guidelines for construction of CF Standard Names can be found at CF Standard Name Guidelines. The rules for CSDMS Standard Names being developed here are meant to be more general, more rigorously defined and less ambiguous. As of 5/3/12, there are 2134 CF Standard Names, but the number of distinct patterns reflected in this set is much, much smaller. Some of them already conform to the patterns and templates of the CSDMS Standard Names and these will be favored (or assimilated) whenever possible. However, CSDMS plans to provide a lookup table that maps each CF Convention Standard Name to a CSDMS Standard Name.
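The planned lookup table might look something like the sketch below. The keys are existing CF Standard Names, but the CSDMS names on the right-hand side are placeholders invented for this example, as are the helper functions. The sketch also assumes that an assumption embedded in a CF name would be moved into metadata, so that several CF names can map to a single CSDMS name.

```python
# CF name -> (placeholder CSDMS name, assumptions moved into metadata).
# The right-hand names are hypothetical, not entries from a published table.
CF_TO_CSDMS = {
    "air_temperature": ("atmosphere_air__temperature", ()),
    "surface_downwelling_shortwave_flux_in_air": (
        "land_surface__downwelling_shortwave_flux",
        (),
    ),
    "surface_downwelling_shortwave_flux_in_air_assuming_clear_sky": (
        "land_surface__downwelling_shortwave_flux",
        ("clear_sky",),
    ),
}


def translate_cf_name(cf_name):
    """Return (csdms_name, assumptions) for a CF name, or None if unmapped."""
    return CF_TO_CSDMS.get(cf_name)


def cf_names_for(csdms_name):
    """Reverse lookup: every CF name that maps to the given CSDMS name."""
    return sorted(
        cf for cf, (name, _assumptions) in CF_TO_CSDMS.items() if name == csdms_name
    )
```

In this sketch the table is many-to-one: `cf_names_for("land_surface__downwelling_shortwave_flux")` returns both shortwave CF names, and `translate_cf_name` reports the `clear_sky` assumption separately rather than as part of the name.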