Data:Geomaterials Vocab: Difference between revisions

From CSDMS

Latest revision as of 17:19, 19 February 2018

Geomaterials Vocab dataset information page



Short Description

Logi Graph.png
Simple visualization of part of the vocab

Statement: A Structured Vocabulary for Geomaterials

Abstract:

(In roll-out stage March 2014)

Introduction

Word-based data are pervasive in the geosciences, even in numerical modeling. Parameters and units, materials, processes, and events are all identified linguistically. For example, in numerical modeling of earth surface dynamics, the CSDMS Standard Names (https://csdms.colorado.edu/wiki/CSDMS_Standard_Names) document the syntaxes used for word-based parameter names.

As a contribution to earth surface modeling and data handling, a comprehensive vocabulary of earth materials is presented here. Geomaterials include soils, sediments, rocks, biogenic buildups, ice and snow, and man-moved and man-made materials. The vocabulary is presented as a number of resources, including an ontology document which is a subset of the total vocabulary structure. A paper on the vocabulary is being finalized.

Building the Vocabulary

The vocabulary is computed from a corpus of glossaries, dictionaries, thesauri, ontologies, and classifications. Computing it was necessary because of the great number of geomaterials terms now available, estimated to be 10^4. Manual efforts to create a structured vocabulary through ontologies have encompassed only ~300 terms, with rudimentary relationships, in several years of work (GeoSciML 2012). Computing the vocabulary also allows quantitative linguistic measures of concept distance and scope to be made.

The corpora used here were sourced from authoritative institutions such as the British Geological Survey, the US National Aeronautics and Space Administration (NASA), the US Geological Survey (USGS), the Society for Sedimentary Geology, CSIRO Australia, the US Federal Geographic Data Committee, the Center for Deep Earth Exploration (CDEX) in Japan, and the World Meteorological Organization (WMO). At last count, 962 nodes (concepts) were being served, along with 1126 'strong words' obtained by processing these corpora.

Components

Please see the detailed documentation in the served zip file. The vocabulary comes in three parts: general components, and vocabularies for the geology ('litho') and cryology ('cryo') subthemes.

The tallies are: 2315 strongwords, 836 lithology concepts, 16 corpora, ##.

(i) A table of geomaterials concepts with their names, definitions, relationships, metrics and metadata.
(ii) Tables of ‘strong words’ and weak words (the ‘stop list’) that are involved in describing the geomaterials concepts. Strong words are those that occur in the names of geomaterials concepts and are not in the stop list. They are accompanied by frequency metrics, the sets of words with which they associate, Levenshtein variants, and stemmed morphologies.
(iii) A formal ontology of subsumption relations (i.e., related, synonym, broader, narrower) expressed using SKOS and RDF logic systems in TTL (Turtle) syntax.
(iv) (TBA) A semantic net of subsumption relations, with quantitative strengths on the links between them.
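
The definition of strong words in component (ii) can be illustrated with a minimal sketch. It assumes that concept names are plain strings split on whitespace; the concept names, the stop list, and the function name `strong_words` are hypothetical and are not taken from the served files, whose actual processing may differ.

```python
from collections import Counter, defaultdict

# Hypothetical inputs for illustration only (not from the served zip file).
CONCEPT_NAMES = [
    "feldspathic sandstone",
    "glauconitic sand",
    "sand with glauconite",
    "sandstone of the shelf",
]
STOP_LIST = {"with", "of", "the", "and"}


def strong_words(names, stop_list):
    """Collect words that occur in concept names and are not in the stop list.

    Returns a frequency Counter and, for each strong word, the set of other
    strong words that appear with it in the same concept name.
    """
    freq = Counter()
    associates = defaultdict(set)
    for name in names:
        words = [w for w in name.lower().split() if w not in stop_list]
        freq.update(words)
        for w in words:
            associates[w].update(x for x in words if x != w)
    return freq, dict(associates)


freq, associates = strong_words(CONCEPT_NAMES, STOP_LIST)
print(freq.most_common(3))
print(sorted(associates["sand"]))
```

Levenshtein variants and stemmed morphologies are left out of this sketch; they would need an edit-distance routine and a stemmer, neither of which is specified in the description above.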

Use cases

The vocabulary components provide a large resource that is needed by downstream software applications such as query mediation, semantic crosswalks, disambiguation, and databasing.

(i) A query can be launched using a set of terms (e.g., “feldspar-bearing sediments with glauconite”). The query uses local vocabulary and could alternatively be written “feldspathic sediments with verdine”. A ‘smart search’ (‘concept search’) drawing on a semantic net resource is able to search for both expressions, and also for narrower ones such as “glauconitic albitic sands”. This is ‘query mediation’ and ‘query extension’.
(ii) Crosswalks relate and compare two concepts. How close are they, does one subsume the other, and what are their neighbours?
(iii) Disambiguation is a similar concept: given a homonym like “caterpillar”, animal and tractor can be distinguished by their typical word-associates in the text, with the patterns defined in a structured vocabulary like that served here.
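
Query mediation and extension, as in use case (i), can be sketched as a term-by-term expansion over synonym and narrower relations. The small `VOCAB` dictionary, its relation entries, and the function names below are assumptions made for illustration; they are not the structure or content of the served vocabulary.

```python
from itertools import product

# Hypothetical mini-vocabulary; relations are illustrative assumptions only.
VOCAB = {
    "feldspar-bearing": {"synonym": ["feldspathic"], "narrower": ["albitic"]},
    "sediments": {"synonym": [], "narrower": ["sands"]},
    "glauconite": {"synonym": ["verdine"], "narrower": []},
}


def alternatives(term, vocab, include_narrower=True):
    """Return the term itself followed by its synonyms and, optionally, narrower terms."""
    entry = vocab.get(term, {})
    alts = [term] + list(entry.get("synonym", []))
    if include_narrower:
        alts += list(entry.get("narrower", []))
    return alts


def expand_query(query, vocab, include_narrower=True):
    """Generate every query variant by substituting alternatives for each term."""
    options = [alternatives(t, vocab, include_narrower) for t in query.lower().split()]
    return [" ".join(combo) for combo in product(*options)]


variants = expand_query("feldspar-bearing sediments with glauconite", VOCAB)
for v in variants:
    print(v)
```

Words that are missing from the vocabulary, such as “with”, pass through unchanged. Setting `include_narrower=False` restricts the expansion to synonyms (mediation only), while the default also adds narrower terms (extension).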

Data format

Data type: Substrates
Data origin: Measured
Data format: ASCII
Other format:
Data resolution: All
Datum: All

Data Coverage

Spatial data coverage: All
Temporal data coverage: Time series
Time period covered: All

Availability

Download data: http://instaar.colorado.edu/~jenkinsc/dbseabed/resources/geomaterials/GeomaterialsVocab.zip
Data source: http://instaar.colorado.edu/~jenkinsc/dbseabed/resources/geomaterials/GeomaterialsVocab.zip

References