Between the Bytes


CSDMS as a Teenager
Greg Tucker, September 2020

The year 2020 marks the 13th birthday of the Community Surface Dynamics Modeling System. Following a series of community workshops and white papers, CSDMS (the acronym is often pronounced affectionately as “systems”) became an entity in April 2007, when NSF awarded a five-year grant led by Prof. Jaia Syvitski to establish a new facility at the University of Colorado, Boulder. The early vision expressed the ambition of a hopeful community:

CSDMS is envisioned as a modeling environment containing a community-built and freely available suite of integrated, ever-improving software modules predicting the transport and accumulation of sediment and solutes in landscapes and sedimentary basins over a broad range of time and space scales -- Syvitski et al., 2004

The vision built on the notion that the earth’s surface is the environment, and that to understand water and sediment dynamics quantitatively—and ultimately forecast behavior—would require a computationally centered community effort.

Like most 13-year-olds, CSDMS has come a long way since birth, but has plenty more growth and development ahead before reaching full maturity and potential. And like a typical adolescent, CSDMS’ development has come at different speeds in different dimensions: late-blooming in some aspects, and precocious in others.

One of the surprises has been the growth of community. The makeup of CSDMS’ first executive committee gives a sense of the early disciplinary scope: sedimentary geologists, geomorphologists, and sediment-oriented oceanographers. I sat on that committee as a chair of the Terrestrial Working Group, and at the time the working groups were envisioned as just that: small teams that would actually create and manage software. But after the first few years of operation, it became clear that interest in CSDMS extended way beyond its original core of sedimentary processes. In response to the surge of interest from related communities, CSDMS established Focus Research Groups, with topics ranging from solid-earth geodynamics to ecosystems and human dimensions. Today, CSDMS has nearly 2000 members, divided among a dozen different Working and Focus Research Groups. Even the smallest group now has over 100 members, while the largest—Terrestrial, chaired by Nicole Gasparini of Tulane University and Leslie Hsu of the US Geological Survey—numbers over 900 members. The annual all-hands meetings are popular, especially with early career scientists, and full of enthusiastic buzz. The buzz remained even when the meeting was forced online by the COVID-19 pandemic: the May 2020 event had over 400 individual attendees. CSDMS may have set out to build software, but it ended up building a community.

CSDMS has also helped nurture a new culture of code sharing. When the facility first launched, model codes were mostly trade secrets: kept within lab groups and close networks of collaborators. A common attitude was that a computer model is like a lab; as Randy LeVeque of the University of Washington put it, sharing code could be seen as “like inviting every scientist in the world to come use your carefully constructed lab apparatus free of charge.” But while that view has merit in some situations, Randy went on to note that there are many good reasons to share code anyway. For one thing, funding agencies require open sharing of software and data. But even if that weren’t the case, the fear of being scooped by your own software is almost always unwarranted. The reality is that no one understands your code better than you do (in fact, to create a research code that’s as accessible to outsiders as it is to its creator would be a rare and remarkable feat). And there’s no shortage of important questions that a well-crafted code can help address. In my experience, a much more common outcome from code sharing is new collaborations and contributions, as other researchers seek to build on what you’ve started.

But that wasn’t the prevailing view in the earth-surface community when the CSDMS Model Repository was first created as a platform for open sharing of version-controlled model software and metadata. The question was (to paraphrase Field of Dreams): if you built a repository, would they come? The answer turned out to be a resounding “yes”: the CSDMS Model Repository now catalogues over 370 models and tools, and continues to grow.

The same spirit of generosity took hold in the sharing of technical expertise. For the past ten years, community members have volunteered their time and energy to offer hands-on “clinics” at the CSDMS annual meetings, on topics ranging from techniques like machine-learning to the use of particular models.

Meanwhile, the vision of a comprehensive, multi-scale, and ever-improving modeling environment posed a computational challenge worthy of a tech giant. Simply constructing a single numerical model, perhaps global in scale, would have been challenging enough. But the community made their wishes clear: a single model could never hope to encompass all the scales, processes, and concepts that lie at the forefront of the earth-surface sciences. The modeling system would have to be modular, with the ability to swap in alternative sub-models. It would have to address processes ranging from glacial erosion on high peaks to mud transport on submarine fans. And it would have to embrace time scales ranging from storm events to geologic periods.

A small facility with just two or three research software engineers could never hope to build all of this, from scratch, by themselves. The key to success therefore lay in taking full advantage of existing resources, and making it an open community-wide project. It would be a “stone soup” vision: the facility provides the kettle, while the community brings the ingredients. The Integration Facility began with technology fronted by a graphical user interface. The CSDMS Modeling Tool displayed community-developed modules as graphical icons, which were coupled by drawing lines to connect inputs and outputs. Once assembled, the resulting model would run on a remote high-performance computing cluster. It was cloud computing before that term even existed.

The development team quickly discovered the need for two additional elements: a standard interface through which to operate and query each module, and a standard vocabulary (an ontology) for naming variables in a consistent way. The vocabulary standard addressed the proliferation of different names for the same thing (“discharge” and “stream flow,” for example), as well as similar names for different quantities (mean annual versus instantaneous discharge, for instance). Scott Peckham designed the ontology pattern, first as the CSDMS Standard Names, and later, with Maria Stoica, in an expanded version known as the Scientific Variables Ontology.
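
To make the naming problem concrete, here is a minimal sketch of how a shared vocabulary lets software decide whether two models are talking about the same quantity. The model labels and the lookup table are hypothetical, and the shared names only loosely imitate a double-underscore object/quantity pattern; they are placeholders for illustration and have not been checked against the published vocabulary.

```python
# Hypothetical lookup table: (model, local variable name) -> shared name.
# The shared names are illustrative placeholders, not verified entries
# from the CSDMS Standard Names.
LOCAL_TO_SHARED = {
    ("model_a", "discharge"): "channel_water__volume_flow_rate",
    ("model_b", "stream flow"): "channel_water__volume_flow_rate",
    ("model_c", "discharge"): "channel_water__annual_mean_of_volume_flow_rate",
}


def same_quantity(var_1, var_2):
    """Return True if two (model, local name) pairs map to one shared name."""
    return LOCAL_TO_SHARED[var_1] == LOCAL_TO_SHARED[var_2]


# Different local names, same quantity: safe to connect.
print(same_quantity(("model_a", "discharge"), ("model_b", "stream flow")))

# Same local name, different quantities: should not be connected blindly.
print(same_quantity(("model_a", "discharge"), ("model_c", "discharge")))
```

The point of the sketch is that the comparison happens on the shared name, never on the local label, so both kinds of mismatch described above are caught by the same check.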

To meet the need for a standard programmatic interface, CSDMS developed the Basic Model Interface (BMI). In order to have a numerical model act as a modular component—a software “building block” that can be initialized, advanced, queried, given new data, and combined with other components—that model code needs to provide a consistent set of interface functions. The BMI specifies what these functions should look like: their names, their signatures, and their return types, as well as the syntax specific to particular programming languages. A model equipped with a BMI becomes interactive. You can advance it, pause execution, interrogate state variables, plot data—and exchange values with another model, which becomes the key to model coupling. Beyond that, the BMI provides a standardized operating mechanism: like the steering wheel and accelerator in a car, it offers a set of standardized controls that are the same from model to model, making the learning curve much simpler. And BMI is catching on. It’s now used, for example, in models developed by researchers at Deltares, the US Geological Survey, and the Netherlands eScience Center.
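
The idea of a common set of controls can be sketched in a few lines of Python. This is a toy illustration and not the BMI specification itself: the method names (initialize, update, get_value, set_value, get_current_time, finalize) echo BMI-style functions, but the signatures are simplified assumptions (a real BMI model, for instance, initializes from a configuration file), and the Rain and Bucket models are hypothetical.

```python
class ToyComponent:
    """Shared, simplified controls that every toy model exposes."""

    def initialize(self, dt=1.0, **variables):
        # Simplified assumption: state is passed as keyword arguments
        # instead of being read from a configuration file.
        self._time = 0.0
        self._dt = dt
        self._vars = dict(variables)

    def get_value(self, name):
        return self._vars[name]

    def set_value(self, name, value):
        self._vars[name] = value

    def get_current_time(self):
        return self._time

    def finalize(self):
        self._vars = {}


class Rain(ToyComponent):
    """Hypothetical model: a rainfall rate that halves every step."""

    def update(self):
        self._vars["rainfall_rate"] *= 0.5
        self._time += self._dt


class Bucket(ToyComponent):
    """Hypothetical model: storage that fills with the rain it is given."""

    def update(self):
        self._vars["storage"] += self._vars["rainfall_rate"] * self._dt
        self._time += self._dt


def run_coupled(n_steps):
    """Couple the two models by passing a value between them each step."""
    rain, bucket = Rain(), Bucket()
    rain.initialize(rainfall_rate=8.0)
    bucket.initialize(rainfall_rate=0.0, storage=0.0)
    for _ in range(n_steps):
        rain.update()
        # The exchange of values is what couples the two models.
        bucket.set_value("rainfall_rate", rain.get_value("rainfall_rate"))
        bucket.update()
    storage = bucket.get_value("storage")
    rain.finalize()
    bucket.finalize()
    return storage


print(run_coupled(3))
```

Because both models answer to the same controls, the coupling loop does not need to know anything about what happens inside either one.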

The CSDMS framework tool that makes use of BMI has continued to evolve. We learned that many, perhaps most, model coupling and model-data integration projects need a level of programmatic finesse that can only be handled by scripting. In response to this need, the script-based machinery behind the graphical front end was brought forward into a user-facing product: the Python Modeling Tool (pymt). With the 1.0 release in 2019, pymt recognizes the explosive growth in the popularity of Python in the geoscience community, and provides access to a collection of BMI-enabled components and tools, alongside Jupyter notebooks that provide hands-on tutorials. Pymt has already been used to power research ranging from permafrost to river and coastal morphodynamics.

Pymt provides a standardized, accessible pathway to legacy models and model-integration tools, but what about creating new models? New data and ideas drive new and refined theory, and that in turn requires adaptation of the numerical software that embodies these ideas. To meet the need for efficient creation and modification of numerical models, CSDMS supports the Landlab Toolkit. Landlab is a Python-language programming library that promotes standardization and re-use by providing interoperable process components that can be assembled, together with a grid object, to create complete integrated models. Since its 2016 debut, Landlab has featured in more than two dozen publications, with applications that collectively span hydrology, geomorphology, tectonics, ecology, basin stratigraphy, landslide hazards, and ecohydrology.
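
The pattern of process components sharing a grid object can be illustrated with a small, self-contained sketch. The Grid, Uplifter, and Diffuser classes below are illustrative stand-ins written for this example; they are not Landlab's actual classes, and the one-dimensional grid, field name, and parameter values are assumptions chosen to keep the arithmetic simple.

```python
class Grid:
    """Hypothetical 1D grid that stores named fields, one value per node."""

    def __init__(self, n_nodes, spacing=1.0):
        self.n_nodes = n_nodes
        self.spacing = spacing
        self.fields = {}

    def add_field(self, name, values):
        if len(values) != self.n_nodes:
            raise ValueError("field must have one value per node")
        self.fields[name] = [float(v) for v in values]
        return self.fields[name]


class Uplifter:
    """Raises interior nodes at a constant rate; end nodes stay fixed."""

    def __init__(self, grid, rate):
        self.grid = grid
        self.rate = rate

    def run_one_step(self, dt):
        z = self.grid.fields["elevation"]
        for i in range(1, len(z) - 1):
            z[i] += self.rate * dt


class Diffuser:
    """Explicit linear diffusion of elevation; end nodes stay fixed."""

    def __init__(self, grid, diffusivity):
        self.grid = grid
        self.diffusivity = diffusivity

    def run_one_step(self, dt):
        z = self.grid.fields["elevation"]
        factor = self.diffusivity * dt / self.grid.spacing ** 2
        # Compute all changes from the old profile before applying them.
        change = [0.0] * len(z)
        for i in range(1, len(z) - 1):
            change[i] = factor * (z[i - 1] - 2.0 * z[i] + z[i + 1])
        for i, dz in enumerate(change):
            z[i] += dz


def run_model(n_steps, dt=1.0):
    """Assemble a model from a grid plus two interchangeable components."""
    grid = Grid(n_nodes=5)
    grid.add_field("elevation", [0, 0, 0, 0, 0])
    components = [Uplifter(grid, rate=1.0), Diffuser(grid, diffusivity=0.25)]
    for _ in range(n_steps):
        for component in components:
            component.run_one_step(dt)
    return grid.fields["elevation"]


print(run_model(1))
```

Each component reads and writes fields on the shared grid and exposes the same stepping method, so adding, removing, or swapping a process means editing one entry in the component list.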

Still, much remains to be done to fully realize the CSDMS community’s vision. One challenge—not just in the geosciences, but across the sciences—lies in training. Many scientists report spending a large fraction of their research time in developing software, yet they also report being largely self-taught. Self-taught scientific programmers are less likely to be aware of tools and best practices that can significantly improve software reliability, transparency, reusability, and productivity. Clearly, geoscientists should not be expected to possess the complete skill set of a software engineer, yet some level of training beyond the status quo is essential if we are to have a computationally fluent scientific workforce. Domain-science facilities like CSDMS have an important role to play. To this end, in 2020 CSDMS launched a new summer institute for early career scientists (albeit initially a virtual one, due to the COVID-19 pandemic). Similarly, CSDMS continues to provide opportunities for community members to work directly with, and learn from, professional Research Software Engineers.

Likewise, a sustainable cyber-ecosystem requires rewards and incentives for contribution. The emergence of new software journals like the Journal of Open Source Software helps a lot here, by providing a formal review and publication venue for well-designed, tested, and documented research software. Domain-based awards that recognize software contributions, like the CSDMS Syvitski Student Modeler Award, are important ingredients as well.

Plenty of opportunities and challenges remain on the technology front. The increasing capability of cloud computing presents a potentially valuable resource for research, given the flexibility in hardware resources that it offers. And a critical frontier lies in discovery through data-model integration: a need that CSDMS has begun to address with a standard programmatic interface for accessing and sub-setting datasets, and a library of access functions known as Data Components. There is plenty of room to grow the library of BMI-enabled model components that can operate in frameworks like pymt. And Landlab has just begun to scratch the surface, with lots of potential for new capabilities such as automated matrix configuration tools, performance enhancement, visualization, and 3D gridding.

Looking back, it’s heartening to see a growing and thriving community, and the roots of connection across interests and disciplines that have grown around it. CSDMS isn’t fully grown yet, but it’s come a long way. Welcome to the teen years.