Abstract

This presentation will outline the untapped potential of Information and Library Science (ILS) programs as an integral space for the long-term training and support of biodiversity informatics work. It will also outline the specific steps proposed at Indiana University, Bloomington (IU), to provide long-term, systematized training of students focused on information work within this broad domain.

As a discipline, ILS has long been preoccupied with the organization, description, curation, and accessibility of a wide variety of information and data sources. ILS curricula necessarily emphasize a broad range of information topics, given that many different kinds of institutions require these particular skillsets. Typical ILS curricula focus on topics such as knowledge organization, metadata, ontologies, database design, scholarly communication, intellectual property, information ethics, interface design, data analytics, online publishing, museum studies, data curation, and collection management/administration. Given this broad range of training, students graduating from ILS programs are well situated to support biodiversity informatics broadly conceived, especially as it relates to the standardization and normalization of data sources across geographically and temporally distributed locations and sources within specific institutional environments.

Yet, despite the overlaps between ILS departments, biodiversity informatics, and museum environments, no ILS program has officially taken steps to support this intersectional space. Using concrete examples, this talk will show how the ILS program at IU is building on already-existing capacities to more robustly support biodiversity work. The proposed way forward is a tightly integrated approach to biodiversity informatics that combines theoretical experience and technical training with hands-on internships in museum and biodiversity environments. Through close partnerships with on-campus institutes, such as the Indiana Geological & Water Survey and the Center for Biological Research Collections, as well as larger, external institutions such as the Smithsonian National Museum of Natural History, students will receive intensive fieldwork experience in data management and standards-driven work specific to the museum and biodiversity world. A tiered approach to this training will be suggested, proceeding at both the professional level (for example, master's-level work) and at more advanced, research-driven levels (such as postdoctoral work).

Part of this new approach to biodiversity informatics training requires the rearticulation of ILS courses, as well as the addition of new courses that provide domain-specific knowledge. This presentation, then, will outline a proposed curriculum to support this kind of collaborative training and work. A distributed training structure will be suggested, utilizing expertise from across the globe. In addition, it will show how a more project- and fieldwork-centric approach to ILS education can more quickly and deeply train students to enter a quickly changing field.

Part of the difficulty with training biodiversity informatics specialists is that building such programs from the ground up is often costly and requires the creation of new workflows and practices. An integrated approach, such as that proposed in this presentation, will instead leverage the respective strengths of ILS programs and museum environments in ways that are sustainable and resilient for the long term. The goal is for institutions to support each other in ways that strengthen their core missions and push the discipline forward in systematic and unique ways.
Abstract

As rapid advances in sequencing technology result in more branches of the tree of life being illuminated, there has been a decrease in the percentage of sequence records that are backed by voucher specimens (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high-quality sequence records. Another problem is that there are millions of existing sequence records known to contain either incorrect or incomplete specimen data. I will show an end-to-end example of sequencing specimens from a museum, depositing their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also discuss linking in the other direction, from specimen database records to sequences. Over one million records in the Global Biodiversity Information Facility (GBIF) (Trizna 2018a) contain a value in the Darwin Core term "associatedSequences", and I will examine what is currently contained in these entries and how best to format them to ensure that a tight connection is made to sequence records.
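As a minimal sketch of how such links can be inspected programmatically (illustrative only, not the talk's code; the occurrence key is a placeholder), the public GBIF API exposes each occurrence's verbatim Darwin Core record, including any associatedSequences value:

```python
# Illustrative sketch: fetch the verbatim Darwin Core record for one GBIF
# occurrence and return its raw associatedSequences value, if any.
import requests

def associated_sequences(occurrence_key):
    url = f"https://api.gbif.org/v1/occurrence/{occurrence_key}/verbatim"
    record = requests.get(url, timeout=30).json()
    # Verbatim records key their fields by full Darwin Core term URIs.
    return record.get("http://rs.tdwg.org/dwc/terms/associatedSequences")

print(associated_sequences(1234567890))  # placeholder occurrence key
```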
Abstract

SOCCOMAS is a ready-to-use Semantic Ontology-Controlled Content Management System (http://escience.biowikifarm.net/wiki/SOCCOMAS). Each web content management system (WCMS) run by SOCCOMAS is controlled by a set of ontologies and an accompanying Java-based middleware, with the data housed in a Jena tuple store. The ontologies describe the behavior of the WCMS, including all of its input forms, input controls, data schemes and workflow processes (Fig. 1).

Data are organized into different types of data entries, which represent collections of data referring to a particular material entity, for instance an individual specimen. SOCCOMAS implements a suite of general processes, which can be used to manage and organize all data entry types. One category of processes manages the life-cycle of a data entry, including everything required for changing between the following possible entry states:

current draft version;
backup draft version;
recycle bin draft version;
deleted draft version;
current published version;
previously published version.

The processes also allow a user to create a revised draft based on the current published version. Another category of processes automatically tracks the overall provenance (i.e. creator, authors, creation and publication dates, contributors, relations between different versions, etc.) for each particular data entry. Additionally, at a significantly finer level of granularity, SOCCOMAS tracks all changes made to a particular data record in a detailed change-history log, down to the level of individual input fields. All information (data, provenance metadata, change-history metadata) is stored according to Resource Description Framework (RDF)-compliant data schemes in different named graphs (i.e. URIs under which triple statements are stored in the tuple store). All recorded information can be accessed through a SPARQL endpoint. All data entries are published as Linked Open Data and thus provide access to an HTML representation of the data for visualization in a web browser, or to a machine-readable RDF file. The ontology-controlled design of SOCCOMAS allows administrators to easily customize existing templates for the input forms of data entries, define new templates for new types of data entries, and define underlying RDF-compliant data schemes and apply them to each relevant input field. SOCCOMAS provides an engine for running and developing semantic WCMSs in which only ontology editing, but no middleware or front-end programming, is required to adapt the WCMS to one's own specific requirements.
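Since all recorded information sits behind a SPARQL endpoint, a query of the following shape should apply; in this hedged sketch the endpoint URL and named-graph URI are hypothetical placeholders, not actual SOCCOMAS addresses:

```python
# Hedged sketch: list the triples stored in one named graph of a SOCCOMAS
# tuple store via its SPARQL endpoint (both URLs below are hypothetical).
import requests

ENDPOINT = "https://example.org/soccomas/sparql"   # hypothetical endpoint
QUERY = """
SELECT ?s ?p ?o
WHERE { GRAPH <https://example.org/entry/specimen-42/current> { ?s ?p ?o } }
LIMIT 25
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```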
Abstract

Taxonomic names are ambiguous as identifiers of biodiversity data, as they refer to a particular concept of a taxon in an expert's mind (Kennedy et al. 2005). This ambiguity is particularly problematic when attempting to reconcile taxonomic names from disparate sources with clades on a phylogeny. Currently, such reconciliation requires expert interpretation, which is necessarily subjective, difficult to reproduce, and refractory to scaling. In contrast, phylogenetic clade definitions are a well-developed method for unambiguously defining the semantics of a clade concept in terms of shared evolutionary ancestry (Queiroz and Gauthier 1990, Queiroz and Gauthier 1994), and these semantics allow clades to be located on any phylogeny. Although a few software tools have been created for resolving clade definitions, including definitions expressed in the Mathematical Markup Language (e.g. Names on Nodes in Keesey 2007) and as lists of GenBank accession numbers (e.g. mor in Hibbett et al. 2005), these are application-specific representations that do not provide formal definitions with well-defined semantics for every component of a clade definition. Being able to create such machine-interpretable definitions would allow computers to store, compare, distribute and resolve semantically rich clade definitions.

To this end, the Phyloreferencing project (http://phyloref.org, Cellinese and Lapp 2015) is working on a specification for encoding phylogenetic clade definitions as ontologies using the Web Ontology Language (OWL; W3C OWL Working Group 2012). Our specification allows the semantics of these definitions, which we call phyloreferences, to be described in terms of shared-ancestor and excluded-lineage properties. The aim of this effort is to allow any OWL-DL reasoner to resolve phyloreferences on a phylogeny that has itself been translated into a compatible OWL representation. We have developed a workflow that allows us to curate phyloreferences from phylogenetic clade definitions published in natural language, and to resolve the curated phyloreference against the phylogeny upon which the definition was originally created, allowing us to validate that the phyloreference reflects the authors' original intent. We have started work on curating dozens of phyloreferences from publications and from the clade definition database RegNum (http://phyloregnum.org), which will provide an online catalog of all clade definitions that are part of the Phylonym Volume, to be published together with the PhyloCode (https://www.ohio.edu/phylocode/). We will comprehensively curate these definitions into a reusable and fully computable ontology of phyloreferences.
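To give a flavor of what an OWL encoding of a clade definition can look like, here is a toy sketch built with rdflib; the property and class IRIs in the ex: namespace are invented stand-ins and do not reproduce the project's actual model:

```python
# Toy sketch: an OWL class for "CladeX := everything descended from the
# most recent common ancestor of taxa A and B", built with rdflib.
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF, OWL

EX = Namespace("http://example.org/phyloref#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasAncestor))          # hypothetical property
g.add((restriction, OWL.someValuesFrom, EX.MRCA_of_A_and_B))  # hypothetical class
g.add((EX.CladeX, RDF.type, OWL.Class))
g.add((EX.CladeX, OWL.equivalentClass, restriction))

print(g.serialize(format="turtle"))
```

An OWL-DL reasoner can then classify the nodes of an OWL-encoded phylogeny under ex:CladeX, which is the resolution step described above.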
In our presentation, we will provide an overview of phyloreferencing and will describe the model and workflow we use to encode clade definitions in OWL, based on concepts and terms taken from the Comparative Data Analysis Ontology (Prosdocimi et al. 2009), Darwin-SW (Baskauf and Webb 2016) and Darwin Core (Wieczorek et al. 2012). We will demonstrate how phyloreferences can be visualized, resolved and tested on the phylogeny that they were originally described on, and how they resolve on one of the largest synthetic phylogenies available, the Open Tree of Life (Hinchliff et al. 2015). We will conclude with a discussion of the problems we faced in referring to taxonomic units in phylogenies, which is one of the key challenges in enabling better integration of phylogenetic information into biodiversity analyses.
Abstract

Parasitism can be defined as an interaction between species in which one of the interaction partners, the parasite, lives in or on the other, the host. The parasite draws food from its host and harms it in the process. According to some estimates, over 40% of all eukaryotes are parasites. Nevertheless, it is difficult to computationally obtain information on whether a particular taxon is a parasite, which makes it difficult to query large sets of taxa.

Here we test to what extent it is possible to use the Open Tree of Life (OTL), a synthesis of phylogenetic trees on a backbone taxonomy (resulting in unresolved nodes), to expand available information via phylogenetic trait prediction. We use the Global Biotic Interactions (GloBI) database to categorise 25,992 and 34,879 species as parasites and free-living, respectively, and predict states for the ~2.3 million (97.34%) leaf nodes without state information.

We estimate the accuracy of our maximum parsimony-based predictions using cross-validation and simulation at roughly 60-80% overall, but varying strongly between clades. The cross-validation resulted in an accuracy of 98.17%, which is explained by the fact that the data are not uniformly distributed. We describe this variation across taxa as associated with the available state and topology information. We compare our results with several smaller-scale studies that used manual expert curation, and conclude that the computationally inferred state changes largely agree with them in number and placement. In clades in which the available state information is biased (mostly towards parasites, e.g. in nematodes), phylogenetic prediction is bound to provide results contradicting conventional wisdom.
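For readers unfamiliar with parsimony-based state prediction, the following toy sketch shows the core of the Fitch downpass on an invented three-leaf tree; it is a didactic illustration, not the study's pipeline:

```python
# Toy sketch: Fitch parsimony state sets on a rooted binary tree.
# Internal nodes are (left, right) tuples; leaves are strings.
def fitch(node, states):
    """Return the Fitch state set for `node`; `states` maps leaves to sets."""
    if node in states:                    # leaf with an observed state
        return states[node]
    left, right = node
    a, b = fitch(left, states), fitch(right, states)
    return a & b if a & b else a | b      # intersect if possible, else union

tree = (("louse", "tapeworm"), "free_living_nematode")  # invented tree
observed = {"louse": {"parasite"}, "tapeworm": {"parasite"},
            "free_living_nematode": {"free-living"}}
print(fitch(tree, observed))  # ambiguous root: {'parasite', 'free-living'}
```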
This represents, to our knowledge, the first comprehensive computational reconstruction of the emergence of parasitism in eukaryotes. We argue that such an approach is necessary to allow further incorporation of parasitism as an important trait in species interaction databases and in individual studies on eukaryotes, e.g. in the microbiome.
Abstract

The Open Tree of Life project is a collaborative effort to synthesize, share and update a comprehensive tree of life (Fig. 1). We have completed a draft synthesis of a tree summarizing digitally available taxonomic and phylogenetic knowledge for all 2.6 million named species, available at tree.opentreeoflife.org (Hinchliff et al. 2015). This tree provides ready access to phylogenetic information that can link together biodiversity data on the basis of what we know about relevant evolutionary history. Both the unified reference taxonomy (Rees and Cranston 2017) and the published phylogenetic statements underlying the tree (McTavish et al. 2015) are available and accessible online. Taxa in the phylogenies are mapped to the reference taxonomy, which aligns Open Tree taxon identifiers to those from NCBI and GBIF, among several other taxonomy resources. The synthesis tree is revised as new data become available, and captures conflict and consensus across different published phylogenetic estimates. This undertaking requires both the development of novel infrastructure and analysis tools and community engagement with the Open Tree of Life project. I will discuss the challenges in and the progress towards achieving these goals.
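As an illustration of the name-to-identifier mapping described above, here is a minimal sketch against the public Open Tree v3 API (not code from the talk; response field names follow the published API documentation):

```python
# Sketch: match a scientific name to an Open Tree Taxonomy (OTT) identifier
# using the public TNRS endpoint.
import requests

resp = requests.post("https://api.opentreeoflife.org/v3/tnrs/match_names",
                     json={"names": ["Canis lupus"]})
for result in resp.json()["results"]:
    for match in result["matches"]:
        taxon = match["taxon"]
        print(taxon["ott_id"], taxon["unique_name"])
```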
Abstract

Connecting biodiversity data across databases is not as easy as one might think. Different databases use different identifiers and taxonomies, and connecting these data often results in loss of information and precision. Here we present some of the challenges we faced with integrating multiple biodiversity data sets, including specimen data from scientific collections, during a hackathon hosted by the Phenoscape project in December of 2017. The hackathon brought together a diverse group of participants, including biologists and software developers, to explore ways of using the computable phenotype data in the Phenoscape Knowledgebase (KB) (Edmunds et al. 2015). The KB contains ontology-annotated data that links evolutionary phenotypes from the comparative literature to model organism phenotypes, enabling, e.g., the retrieval of candidate genes for evolutionary phenotypes and the generation of synthetic supermatrices of presence/absence characters. During this hackathon, our team explored how to link phenotype data in the KB to museum specimen data in iDigBio (Matsunaga et al. 2013), with the hope of creating visualizations including world maps showing species distributions with different character states and their phylogenetic relationships. We visualized lineage relationships by querying the Open Tree of Life (OT) (Hinchliff et al. 2015) website, using data integrated by another group at the hackathon that linked KB and OT taxonomic identifiers.
Phenoscape uses terms from anatomy, quality, and taxonomy ontologies to annotate characters and taxonomic information from the phylogenetic literature, along with specimen information. When populating the KB, specimen identifiers such as occurrence identifiers, collector numbers, and catalog numbers were preserved if present in the literature. We found that these identifiers, although standard in the biodiversity domain, were mostly insufficient to uniquely identify the source specimen in iDigBio. As an alternative, we instead mapped all the occurrences of taxa using string matches of the genus and species from Vertebrate Taxonomy Ontology identifiers. Without specimen identifiers that are consistent across databases, we lost the ability to explore spatial and temporal variation of characters within genera and were only able to explore phenotypes and geographic distributions among genera. We look forward to discussing these issues with the collections community represented at this meeting by the Society for the Preservation of Natural History Collections (SPNHC).
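The genus/species string matching described above can be sketched against iDigBio's public search API as follows (illustrative only, not the hackathon code; the example taxon is arbitrary):

```python
# Sketch: query iDigBio specimen records by genus and specific epithet.
import requests

def idigbio_occurrences(genus, species, limit=10):
    resp = requests.post(
        "https://search.idigbio.org/v2/search/records",
        json={"rq": {"genus": genus, "specificepithet": species},
              "limit": limit},
    )
    return [item["indexTerms"] for item in resp.json()["items"]]

for rec in idigbio_occurrences("Danio", "rerio"):
    print(rec.get("uuid"), rec.get("geopoint"))
```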
We developed an R Shiny application that integrates characters and taxa from Phenoscape with specimen records from iDigBio and phylogenies from OT, to visualize phenotypic characters and taxon distributions in three interactive panels. The app allows a user to visualize OT phylogenies and place presence/absence character data on the tree. Specifically, users can: select taxa or specific characters to visualize their geographic distributions, navigate a phylogeny browser which displays the character and specimen data available for the taxa under consideration, and view a heatmap of characters available for character and taxon combinations. Because of our challenges joining data, our distribution map leaves users with the impression that all individuals in a genus exhibit a character, whereas the KB was populated with data describing individuals. We hope that with improved data standards and their use by more people, constructing applications like ours will become easier.
Abstract

There is a large amount of publicly available biodiversity data from many different data sources. When doing research, one ideally interacts with biodiversity data programmatically, so that the work is reproducible. The entry point to biodiversity data records is largely through taxonomic names, or common names in some cases (e.g., birds). However, many researchers have phylogeny-focused projects, meaning taxonomic names are not the ideal interface to biodiversity data. Ideally, it would be simple to go programmatically from a phylogeny to biodiversity records through a phylogeny-based query.

I'll discuss a new project, 'phylodiv' (https://github.com/ropensci/phylodiv/), that attempts to facilitate phylogeny-based biodiversity data collection (see Fig. 1). The project takes the form of an R software package. The idea is to make the user interface take essentially two inputs: a phylogeny and a phylogeny-based question. Behind the scenes we'll do many things, including gathering taxonomic names and hierarchies for the taxa in the phylogeny, sending queries to GBIF (or other data sources), and mapping the results. The user will of course have control over the behind-the-scenes parts, but I imagine the majority use case will be to input a phylogeny and a question and expect an answer back.

We already have R tools for nearly all parts of the work-flow shown above: there are many phylogeny tools; 'taxize'/'taxizedb' can handle taxonomic name collection; 'rgbif' can handle interaction with GBIF; and there are many mapping options in R. However, a few areas still need work.

First, there is not yet a clear way to do a phylogeny-based query. Ideally a user would be able to express a simple query like "taxon A vs. its sister group". That is simple to imagine, but implementing it in software is another matter.
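Although phylodiv itself is an R package, the underlying GBIF calls that such a query would decompose into can be sketched in Python against the public GBIF API (illustrative only; the two example clades stand in for "taxon A" and its sister group):

```python
# Sketch: resolve two clade names to GBIF taxon keys and compare their
# occurrence counts.
import requests

def gbif_key(name):
    return requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": name}).json()["usageKey"]

def occurrence_count(taxon_key):
    resp = requests.get("https://api.gbif.org/v1/occurrence/search",
                        params={"taxonKey": taxon_key, "limit": 0})
    return resp.json()["count"]

for name in ["Panthera", "Neofelis"]:  # stand-ins for taxon A and its sister
    print(name, occurrence_count(gbif_key(name)))
```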
Second, users would ideally like answers back (in this case a map of occurrences) relatively quickly, to be able to iterate on their research work-flow. The most likely solution will be to use GBIF's map tile service to visualize binned occurrence data, but we'll need to explore this in detail to make sure it works.
Abstract

Xper3 (Vignes Lebbe et al. 2016) is a collaborative knowledge base publishing platform that, since its launch in November 2013, has been adopted by over two thousand users (Pinel et al. 2017). This is mainly due to its user-friendly interface and the simplicity of its data model. The data are stored in a MySQL relational database, but the exchange format uses the TDWG standard format SDD (Structured Descriptive Data; Hagedorn et al. 2005). However, each Xper3 knowledge base is a closed world that the author(s) may or may not share with the scientific community or the public by publishing content and/or identification keys (Kopfstein 2016). The explicit taxonomic, geographic and phenotypic limits of a knowledge base are not always well defined in the metadata fields. Conversely, terminology vocabularies, such as the Phenotype and Trait Ontology (PATO) and the Plant Ontology (PO), and software to edit them, such as Protégé and Phenoscape, are essential in the semantic web but difficult to handle for biologists without computer skills. These ontologies constitute open worlds and are themselves expressed as Resource Description Framework (RDF) triples. Protégé offers visualisation and reasoning capabilities for these ontologies (Gennari et al. 2003, Musen 2015).

Our challenge is to combine the user-friendliness of Xper3 with the expressive power of OWL (Web Ontology Language), the W3C standard for building ontologies. We therefore focused on analyzing the representation of the same taxonomic contents under Xper3 and under different models in OWL. After this critical analysis, we chose a description model that allows automatic export of SDD to OWL and can be easily enriched. We will present the results obtained and their validation on two knowledge bases, one on parasitic crustaceans (Sacculina) and the second on extant and fossil ferns (Corvez and Grand 2014). The evolution of the Xper3 platform and the perspectives offered by this link with semantic web standards will be discussed.
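To illustrate the kind of SDD-to-OWL export meant here, a toy sketch follows; the XML fragment and IRIs are invented, and the actual SDD schema and the chosen OWL model are considerably richer:

```python
# Toy sketch: turn SDD-style character definitions into OWL classes.
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, OWL

SDD = """<Characters>
  <CategoricalCharacter id="c1"><Label>Carapace shape</Label></CategoricalCharacter>
  <CategoricalCharacter id="c2"><Label>Frond division</Label></CategoricalCharacter>
</Characters>"""

EX = Namespace("http://example.org/xper3#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)
for char in ET.fromstring(SDD):
    cls = EX[char.get("id")]
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.label, Literal(char.findtext("Label"))))
print(g.serialize(format="turtle"))
```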
Abstract

Anthropogenic climate change has already altered the conditions to which species have adapted locally, and consequently, shifts of occurrence areas have been reported (Chen et al. 2011). Anticipating the results of climate change is urgent, and using these results efficiently to guide decision-making can help to build strategies to protect species from those changes. Therefore, our objective is to propose the use of climate change impact assessments, obtained through species distribution models (SDMs), to guide decision-making. The emphasis will be on data that could help determine potentially vulnerable species and priority areas, which could act as climate refuges as well as wildlife corridors. SDMs are based on species occurrence points, available mainly from biological collections and observations (Franklin 2010). When these points are combined with geospatially explicit layers of abiotic or biotic data (e.g. temperature, precipitation, land use) that define the ecological requirements of the species under study, species distribution models can be generated. These models are projected in the form of maps indicating areas where the species can find the most suitable habitats and, therefore, where one is most likely to find them. To support public policy decisions, generating robust and reliable models is essential. A minimum of six occurrence points is a mandatory requirement, with non-overlapping areas as a filter criterion. Unfortunately, in Brazil, as well as in Latin America in general, this type of data is scarce.
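The two screening rules just mentioned can be sketched as follows (toy coordinates and an assumed 0.5-degree grid; the actual modelling pipeline is not shown):

```python
# Sketch: keep one occurrence per grid cell, then require at least six points.
def usable_for_sdm(points, cell=0.5, minimum=6):
    """points: (longitude, latitude) tuples; returns thinned points or None."""
    cells = {(round(lon / cell), round(lat / cell)): (lon, lat)
             for lon, lat in points}      # one representative per cell
    thinned = list(cells.values())
    return thinned if len(thinned) >= minimum else None

occurrences = [(-47.9, -15.8), (-47.1, -15.9), (-46.5, -16.2), (-45.8, -14.9),
               (-44.9, -15.1), (-44.1, -16.0), (-47.91, -15.81)]  # last overlaps
print(usable_for_sdm(occurrences))
```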
Thus, with SDMs, four types of decision-making information regarding priority species and areas can be obtained (Fig. 1).

Size of potential occurrence areas: species that have a small area of occurrence are potentially vulnerable, since they tend to be endemic, usually living under restricted environmental conditions. In this case, any small change in environmental conditions can result in the extinction of the impacted species. Thus, such regions need to be protected.

Difference between current and future areas: species presenting the most significant reduction in potential areas should be prioritized by decision-makers. This measurement could be used as an indication of vulnerability.

Even species with no predicted area reduction, or with an increase, could be prioritized in management programs due to their role in the complex interaction networks underlying ecosystem services, for example as pollinators, seed dispersers or agents of disease control. These species could be more resilient to climate-driven changes in interaction networks, and are possibly better able to provide their services under extremely unfavorable climate scenarios.

Areas that maintain higher species diversity in future scenarios: their protection could be prioritized in restoration and conservation programs. Especially in cases involving multiple species, those areas could be considered climate refuges by decision-makers. Additionally, for the reconstruction and reuse of SDMs published in peer-reviewed journals, all information about the models (their generation, ensemble methods, data cleaning and the data quality criteria applied) should be available.

The availability of the four above-mentioned types of information can support decision-making strategies aimed at the protection of priority species and areas. In conclusion, SDMs provide essential information about the present and future impacts of projected climate change, and their derived data could be preserved using a standard controlled vocabulary.
Abstract

Can Essential Biodiversity Variables (EBVs) be developed to monitor changes in species interactions? That was the difficult question asked at the GLOBIS-B workshop in February 2017, in which more than 50 experts participated. EBVs can be defined as harmonized measurements that allow us to inform policy about essential changes in biodiversity. They can be seen as biological state variables from which more refined indicators may be derived. They have been presented as a means to monitor global biodiversity change and as a concept to drive the gathering, sharing, and standardisation of data on our biota (Geijzendorffer et al. 2015, Kissling et al. 2017, Pereira et al. 2013).

There are different classes of EBVs that characterize, for example, the state of species populations, species traits and ecosystem structure and function. It has also been proposed that there should be EBVs related to species interactions. However, until now there has been little progress formulating what these should be, even though species interactions are central to ecology. Species interactions cover a wide range of important processes, from mutualisms, such as pollination, to different forms of heterotrophic nutrition, such as the predator-prey relationship. Indeed, ecological interactions are critical to understanding why an ecosystem is more than the sum of its parts. Nevertheless, direct observation of species interactions is often difficult and time-consuming work, which makes it difficult to monitor them in the long term. For this reason, the workshop focused on those species interactions that are feasible to study and most relevant to policy. To bring focus to our discussions we concentrated on pollination, predation and microbial interactions.

Taking pollination as an example, there was recognition of the importance of ecological networks and that network metrics may be a sensitive indicator of change. Potential EBVs might be the number of pairwise interactions between species, or the modularity and interaction diversity of the whole network. This requires standardised data collection and reporting (e.g. standardization of measures of interaction strength or minimum data specifications for ecological networks) and sufficient data across time to regularly calculate these metrics. Other, simpler surrogates for pollination might also prove useful, such as flower visitation rates or the proportion of fruit set.
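Two of the candidate metrics named above can be computed directly from a binary interaction matrix, as in this toy sketch (the matrix is invented; modularity needs a community-detection step and is omitted):

```python
# Sketch: pairwise interaction count and Shannon interaction diversity
# from a binary plant-pollinator matrix (rows: plants; columns: pollinators).
import numpy as np

web = np.array([[1, 1, 0],
                [0, 1, 0],
                [0, 1, 1]])

pairwise_interactions = int(web.sum())
p = web.flatten() / web.sum()
interaction_diversity = -np.sum(p[p > 0] * np.log(p[p > 0]))  # Shannon H'
print(pairwise_interactions, round(float(interaction_diversity), 2))
```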
Finally, there was a recognition that we do not yet have enough tools to monitor some important interactions. Many interactions, particularly among microbes, can currently only be inferred from the co-occurrence of taxa. However, technology is rapidly developing, and it is possible to foresee a future where even these interactions can be monitored efficiently. Species interactions are essential to understanding ecology, but they are also difficult to monitor. Yet delegates at the workshop left with a positive outlook: it is valuable to develop standardisation and harmonization of species interaction data to make them suitable for EBV production.
Abstract

Understanding the role that species play in their environment is a fundamental goal of biodiversity research, informing both ecosystem maintenance and the provision of ecosystem services. The different types of interaction that species establish with their partners regulate the functioning of ecosystems (McCann 2007). Interactions between plants and pollinators (Potts et al. 2016) and between plants and seed dispersers (Wang and Smith 2002) are examples of mutualism, crucial to the maintenance of floristic composition and overall biodiversity in different biomes. They also illustrate well nature's contributions to people, supporting ecosystem services with key economic consequences, such as pollination of agricultural crops (Klein et al. 2007) and seed dispersal in the natural or assisted restoration of degraded areas (Wunderle 1997).

Interactions are mediated by different functional traits (morphological and/or behavioral characteristics of organisms that influence their performance) (Ball et al. 2015). As the zoochorous transfer of pollen grains and seeds usually involves contact, the success of pollination and seed dispersal depends to a large extent on the relationship of size and morphology between flower/fruit and their respective pollinator/seed disperser. Because these traits have been selected over a long shared evolutionary history, it is feasible to rely on their predictive potential to determine whether a certain animal is able to transfer pollen grains and/or seeds of specific plants in the landscape (Howe 2016).
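A minimal sketch of such a trait-based prediction rule follows; the traits, species and threshold are invented for illustration:

```python
# Sketch: size matching for frugivory. A bird can swallow (and so disperse)
# a fruit only if its gape is wide enough.
def can_disperse(gape_width_mm, fruit_diameter_mm, tolerance_mm=1.0):
    return fruit_diameter_mm <= gape_width_mm + tolerance_mm

birds = {"thrush": 9.5, "toucan": 30.0}             # gape widths, mm (invented)
fruits = {"small_berry": 7.0, "large_drupe": 22.0}  # diameters, mm (invented)
for bird, gape in birds.items():
    for fruit, diameter in fruits.items():
        print(bird, fruit, can_disperse(gape, diameter))
```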
Biodiversity is facing constant negative impacts, especially related to climate and habitat changes. These are threatening the provision of ecosystem services, jeopardizing the basic premise of sustainable development, which is to guarantee resources for future generations. The novel landscapes that result from these impacts will certainly be dependent on these ecosystem services, but will the services persist in the face of extinctions and invasive competitors? Ultimately, will these services be predictable by functional traits in landscapes where shared evolutionary history is reduced? Strategies that help our understanding of the interactions and their role in the provision of services are urgently needed (Corlett 2011). Given this context, our objective here is to present the types of data that, if made available, could assist in determining the role of species in terms of the interactions they make and the provision of ecosystem services. Moreover, we aim to elucidate how this role can be associated with functional traits.

The current work focuses on the following groups: plants, birds, bats and bees (Fig. 1). Of particular interest are interactions involving:

pollination, which is carried out predominantly by bees, but also by nectarivorous birds and bats; and

seed dispersal, mainly carried out by frugivorous birds and bats.

These interactions are mediated by key traits. In plants, common flower traits are aperture, color, odor strength and type, shape, orientation, size and symmetry, nectar guides, sexual organs, and reward. Fruit or seed traits, such as fleshy nutritious tissue, chemical attractants and clinging structures, are also relevant for seed dispersal. In animals, the most common traits are body size (for bees, the intertegular distance; for bats, forearm length; and for birds, the weight), gape width for birds, and feeding habit (nectarivorous, frugivorous, omnivorous) for bats and birds. Providing standardized data on the traits that mediate interactions between fauna and flora is important to fill knowledge gaps, which could help decision-making processes aimed at conservation, restoration and management programs for protecting ecosystem services based on biodiversity.
Abstract

The Brazilian Plant-Pollinator Interactions Network*1 (REBIPP) aims to develop scientific and teaching activities on plant-pollinator interactions. The main goals of the network are to:

generate a diagnosis of plant-pollinator interactions in Brazil;
integrate knowledge on pollination in natural, agricultural, urban and restored areas;
identify knowledge gaps;
support public policy guidelines aimed at the conservation of biodiversity and ecosystem services for pollination and food production;
and encourage collaborative studies among REBIPP participants.

To achieve these goals the group has resumed and built on previous work on data standard definition done under the auspices of the IABIN-PTN (Etienne Américo et al. 2007) and FAO (Saraiva et al. 2010) projects (Saraiva et al. 2017). The ultimate goal is to standardize the ways data on plant-pollinator interactions are digitized, to facilitate data sharing and aggregation. A database will be built with standardized data from Brazilian researchers who are members of the network, to be used by the national community and to allow sharing data with data aggregators.

To achieve those goals, three task groups of specialists with similar interests and backgrounds (e.g. botanists, zoologists, pollination biologists) have been created. Each group is working on the definition of the terms to describe plants, pollinators and their interactions. The resulting glossary explains their meaning, attempts to map the suggested terms to Darwin Core (DwC) terms, and follows the TDWG Standards Documentation Standard*2 in its definitions.
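To make the mapping concrete, an interaction occurrence might be digitized as below; the field choices are a hedged illustration using existing Darwin Core and ResourceRelationship terms, not the REBIPP standard itself:

```python
# Illustrative record: a pollinator visit linked to a plant occurrence.
interaction = {
    "dwc:occurrenceID": "urn:uuid:00000000-0000-0000-0000-000000000001",
    "dwc:scientificName": "Apis mellifera",
    "dwc:eventDate": "2017-10-03",
    # ResourceRelationship terms pointing at the plant occurrence:
    "dwc:relatedResourceID": "urn:uuid:00000000-0000-0000-0000-000000000002",
    "dwc:relationshipOfResource": "visitsFlowersOf",
}
```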
Reaching a consensus on terms and their meaning among the members of each group is challenging, since researchers have different views and concerns about which data are important enough to be included in a standard. That reflects the variety of research questions that underlie different projects and the data they collect. Thus, we ended up with a long list of terms, many of them useful only in very specialized research protocols and experiments, and sometimes rarely collected or measured. Nevertheless, we opted to maintain a very comprehensive set of terms, so that a large number of researchers feel that the standard meets their needs and that the databases based on it are a suitable place to store their data, thus encouraging the adoption of the data standard.

An update of the work will soon be available on the REBIPP website and will be open for comments and contributions. This proposal of a data standard is also being discussed within the TDWG Biological Interaction Data Interest Group*3 in order to propose an international standard for species interaction data.

The importance of interaction data for guiding conservation practices and the management of ecosystem service provision has led to the proposal of defining Essential Biodiversity Variables (EBVs) related to biological interactions. Essential Biodiversity Variables (Pereira et al. 2013) were developed to identify the key measurements required to monitor biodiversity change. EBVs act as an intermediate abstraction layer between primary observations (raw data) and indicators (Niemeijer 2002). Six EBV classes were defined in an initial stage: genetic composition, species populations, species traits, community composition, ecosystem function and ecosystem structure. Each EBV class defines a list of candidate EBVs for biodiversity change monitoring (Fig. 1). Consequently, digitizing such data and making them available online are essential. Differences in sampling protocols may affect data scalability across space and time, hence imposing barriers to the full use of primary data and to EBV calculation (Henry et al. 2008). Thus, adopting common protocols and methods is the most straightforward approach to promote integration of collected data and to allow calculation of EBVs (Jürgens et al. 2011). Recently a workshop was held by GLOBIS-B*4 (GLOBal Infrastructures for Supporting Biodiversity research) to discuss species interaction EBVs (February 26-28, Bari, Italy). Plant-pollinator interactions received considerable attention, and REBIPP's work was presented there. As an outcome we expect to define specific EBVs for interactions, using plant-pollinator interactions as an example and considering pairwise interactions as well as variables related to interaction networks.

The terms in the plant-pollinator data standard under discussion at REBIPP will provide information not only on the EBVs related to interactions, but also on the other five EBV classes: species populations, species traits, community composition, ecosystem function and ecosystem structure. As noted above, some EBVs for specific ecosystem functions (e.g. pollination) lie beyond interaction network structures. The EBV 'Species interactions' (EBV class 'Community composition') should incorporate other aspects such as frequency (Vázquez et al. 2005), duration and empirical estimates of interaction strengths (Berlow et al. 2004).

Overall, we think the proposed plant-pollinator interaction data standard currently being developed by REBIPP will contribute to data aggregation, fill many data gaps, and also provide indicators for long-term monitoring, being an essential source of data for EBVs.
Abstract

The cTAKES package (using the ClearTK Natural Language Processing toolkit; Bethard et al. 2014, http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al. 2013, Styler et al. 2014). Dozens of medical institutions use it on a daily basis to automatically process clinical notes and extract relevant information. ClearEarth is a collaborative project that brings together computational linguists and domain scientists to port Natural Language Processing (NLP) modules trained on the same types of linguistic annotation to the fields of geology, cryology, and ecology. The goal for ClearEarth in the ecology domain is the extraction of ecologically relevant terms, including eco-phenotypic traits, from text and the assignment of those traits to taxa. Four annotators used Anafora (an annotation tool; https://github.com/weitechen/anafora) to mark seven entity types (biotic, aggregate, abiotic, locality, quality, unit, value) and six reciprocal property types (synonym of/has synonym, part of/has part, subtype/supertype) in 133 documents drawn primarily from the Encyclopedia of Life (EOL) and Wikipedia, according to project guidelines (https://github.com/ClearEarthProject/AnnotationGuidelines). Inter-annotator agreement ranged from 43% to 90%. Overall performance of ClearEarth on identifying named entities in biology text was good (precision: 85.56%; recall: 71.57%). The named entities with the best performance were organisms and their parts/products (biotic entities; precision: 72.09%, recall: 54.17%) and systems and environments (aggregate entities; precision: 79.23%, recall: 75.34%). Terms and their relationships extracted by ClearEarth can be embedded in the new ecocore ontology after vetting (http://www.obofoundry.org/ontology/ecocore.html). This project enables the use of advanced industry and research software within the natural sciences for downstream operations such as data discovery, assessment, and analysis. In addition, ClearEarth uses the NLP results to generate domain-specific ontologies and other semantic resources.
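For reference, precision and recall figures like those quoted above follow the usual entity-level computation, sketched here on invented spans (exact span-and-type matching assumed):

```python
# Sketch: entity-level precision/recall; spans are (start, end, type).
gold = {(0, 12, "biotic"), (20, 28, "quality"), (35, 40, "unit")}
predicted = {(0, 12, "biotic"), (20, 28, "value"), (50, 55, "unit")}

tp = len(gold & predicted)
precision = tp / len(predicted)
recall = tp / len(gold)
print(f"precision={precision:.2%} recall={recall:.2%}")
```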
Abstract

There are many ways to capture data from herbarium specimen labels. Here we compare the results of in-house versus outsourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same.

In 2014 Meise Botanic Garden (BR) embarked on a mass digitization project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was outsourced to a commercial company. The minimal data comprised the fields: the specimen's herbarium location, barcode, filing name, family, collector, collector number, country code and phytoregion (for the Democratic Republic of the Congo, Rwanda & Burundi). The outsourced data capture consisted of three types:

additional label information for central African specimens having minimal data;
complete data for the remaining African specimens; and,
species filing name information for African and Belgian specimens without minimal data.

As part of the preparation for outsourcing, a strict protocol had to be established setting out the criteria for acceptable data quality levels. Several lookup tables for data entry also had to be created to improve data quality. During the start-up phase all the data were checked, feedback was given, compromises were made and the protocol was amended. After this phase, an agreed-upon subsample was quality controlled. If the error score exceeded the agreed level, the batch was returned for retyping. The data went through three quality control checks during the process: by the data capturers, by the contractor's project managers and by ourselves.
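The accept/return rule can be sketched as follows (sample size, threshold and error flags are invented placeholders, not the contractual figures):

```python
# Sketch: audit a random subsample of a transcription batch; return the
# batch for retyping if the sampled error rate exceeds the agreed level.
import random

def audit_batch(batch, sample_size=200, max_error_rate=0.01):
    sample = random.sample(batch, min(sample_size, len(batch)))
    errors = sum(1 for record in sample if record["has_error"])
    return errors / len(sample) <= max_error_rate

batch = [{"has_error": random.random() < 0.008} for _ in range(5000)]
print("accept" if audit_batch(batch) else "return for retyping")
```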
Data quality was analysed and compared between the in-house and outsourced modes of data capture. The error rates of our staff and the external company were comparable. The types of error that occurred were often linked to the specific field in question. These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error < 1%). By improving the workflow and field definitions, a notable improvement could be made in the "data cleaning" phase.

The initial motivation for capturing some data in-house was financial. However, after analysis, this may not have been the most cost-effective approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
Abstract

Recent developments in digitisation technologies and equipment have enabled advances in the rate of natural history specimen digitisation. However, Europe's natural history collection institutions are home to over one billion specimens, and currently only a small fraction of these have been digitally catalogued, with fewer still imaged. It is clear that institutions still face huge challenges when digitising the vast number of specimens in their collections.

I will present the results of two surveys that aimed to discover the main successes and challenges facing institutions in their digitisation programmes. The first survey was undertaken in 2014 within the SYNTHESYS 3 project and gathered information from project partners on their digitisation facilities, equipment and workflows at the time, providing some key recommendations based on these findings. The second survey was completed more recently, in 2017, through the Consortium of European Taxonomic Facilities (CETAF) Digitisation Working Group. This survey aimed to discover successful protocols and implementations of digitisation, and to identify shortfalls in resources and protocols. Results from both surveys will feed into the future programme of the CETAF Digitisation Working Group as well as forthcoming and proposed EU projects, including Innovation and Consolidation for large-scale Digitisation of natural heritage (ICEDIG).
Abstract

On herbarium sheets, data elements such as plant name, collection site, collector, barcode and accession number are found mostly on labels glued to the sheet. The data are thus visible in specimen images. With continuously improving technologies for collection mass-digitisation, it has become progressively easier to produce high-quality images of herbarium sheets, and in the last few years herbarium collections worldwide have started to digitize specimens on an industrial scale (Tegelberg et al. 2014). To use the label data contained in these massive numbers of images, the data have to be captured and databased. Currently, manual data entry prevails and forms the principal cost and time limitation in the digitization process. The StanDAP-Herb project has developed a standard process for the (semi-)automatic detection of data on herbarium sheets. This is a formal, extensible workflow integrating a wide range of automated specimen image analysis services, used to replace time-consuming manual data input as far as possible. We have created web services for OCR (Optical Character Recognition), for identifying regions of interest in specimen images, and for the context-sensitive extraction of information from text recognized by OCR. We implemented the workflow as an extension of the OpenRefine platform (Verborgh and De Wilde 2013).
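As a rough illustration of the OCR step (this uses the common pytesseract/Pillow stack, not the project's own web services; the file name, crop box and barcode pattern are invented):

```python
# Sketch: OCR a cropped label region, then pull out a barcode-like token.
import re
from PIL import Image
import pytesseract

label = Image.open("sheet_0001.jpg").crop((2000, 3000, 3000, 3500))
text = pytesseract.image_to_string(label)

match = re.search(r"\bBR\d{7,10}\b", text)  # hypothetical barcode pattern
print(text)
print("barcode:", match.group() if match else "not found")
```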
Abstract

Globally, there are a number of citizen science portals to support the digitisation of biodiversity collections. Digitisation not only involves imaging of the specimen itself, but also includes the digital transcription of label and ledger data, georeferencing and linking to other digital resources. Making use of the skills and enthusiasm of volunteers is potentially a good way to reduce the great backlog of specimens to be digitised.

These citizen science portals engage the public and are liberating data that would otherwise remain on paper. There is also considerable scope for expansion into other countries and languages. Therefore, should we continue to expand? Volunteers give their time for free, but the creation and maintenance of the platform is not without costs. Given a finite budget, what can you get for your money? How does the quality compare with other methods? Is crowdsourcing of label transcription faster, better and cheaper than other forms of transcription?

We will summarize the use of volunteer transcription from our own experience and the reports of other projects. We will base our evaluation on the costs, speed and quality of the systems, and reach conclusions on why you should or should not use this method.
Abstract

The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity data portal, based on the Atlas of Living Australia (ALA), which provides integrated, free, and open access to data and information about Costa Rican biodiversity in order to support science, education, and conservation. It is managed by the Biodiversity Informatics Research Center (CRBio) and the National Biodiversity Institute (INBio). Currently, the Atlas of Living Costa Rica includes nearly 8 million georeferenced species occurrence records, mediated by the Global Biodiversity Information Facility (GBIF), which come from more than 900 databases and have been published by research centers in 36 countries. Half of those records are published by Costa Rican institutions. In addition, CRBio is making a special effort to enrich and share more than 5,000 species pages, developed by INBio, about Costa Rican vertebrates, arthropods, molluscs, nematodes, plants and fungi. These pages contain information elements pertaining to, for instance, morphological descriptions, distribution, habitat, conservation status, management, nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries, such as Spain, Mexico, Colombia and Brazil, to standardize this type of information through Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms that can be used to describe different aspects of biological species.

The Biodiversity Information Explorer (BIE) is one of the modules made available by ALA; it indexes taxonomic and species content and provides a search interface for it. We will present how CRBio is implementing BIE as part of the Atlas of Living Costa Rica in order to share all the information elements contained in the Costa Rican species pages.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. It has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia).

GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338,000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. GBIF Benin is the first African country to implement this module of the ALA infrastructure.

In this presentation, we will give an overview of the registry and the occurrence search engine of the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how Beninese occurrences can be found through the hub.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. In 2010, it launched an open and free platform for sharing and exploring biodiversity data. Thanks to this new infrastructure, it has been able to drastically increase the number of occurrences published through GBIF.org. In order to help other GBIF nodes and institutions, ALA has made all of its modules publicly available for reuse and customization through GitHub (https://github.com/AtlasOfLivingAustralia).

Since 2013, the community of developers interested in ALA tools has organized, with the help of GBIF, eight technical workshops around the world. These workshops have helped launch at least 13 data portals. The last training session, funded through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), was attended by 23 participants from 19 countries on 6 continents. Moreover, on the new GBIF website a section has been dedicated to this programme (https://www.gbif.org/programme/82953/living-atlases), the official Living Atlases community website was launched in 2017 (https://living-atlases.gbif.org), and the technical documentation has been improved and translated into several languages. All of these achievements would not have been possible without a huge effort from the ALA developer community.

After a brief introduction to the Living Atlases community, we will present the work done by ALA to simplify the process of getting a living atlas up and running. We will also show how ALA developers have helped community members create their own versions by performing simple HTML/CSS customizations.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. Since 2010, it has developed and improved a platform for sharing and exploring biodiversity information. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia).

The National Biodiversity Network, a registered charity, is the UK GBIF node and has been sharing biodiversity data since 2000. It has published more than 79 million occurrences from 818 datasets. In 2016, it launched the NBN Atlas Scotland (https://scotland.nbnatlas.org/), based on the Atlas of Living Australia infrastructure. Since then, it has released the NBN Atlas (https://nbnatlas.org/), the NBN Atlas Wales (https://wales.nbnatlas.org/) and, soon, the NBN Atlas Isle of Man. In addition to the occurrence/species search engine and the metadata registry, it has put in place several tools that help users work with data published in the network: the spatial portal and the "explore your region" module. Both elements are based on Atlas of Living Australia developments.

Because the Atlas of Living Australia platform is powerful and reusable, we want to showcase these two applications used to make geographical analyses. To do so, we will present the specificities of each component by giving examples of some of their functionalities.
Abstract | |
During the last few years, a large number of countries have deployed national customized versions of the Atlas of Living Australia (ALA) (https://www.ala.org.au/), a collaboratively developed, open infrastructure for collecting and presenting biodiversity data nationally and for sharing it globally through GBIF (https://gbif.org). The increasing number of national nodes deploying this free and open-source software platform has built a worldwide community involving more than 17 countries that collaborate openly in a decentralized way (https://living-atlases.gbif.org/), helping each other out by organizing technical workshops and by developing and sharing new software modules using GitHub.
One of these modules in the Living Atlases infrastructure is an R package called ALA4R, originally created by Ben Raymond (https://github.com/AtlasOfLivingAustralia/ALA4R). It provides the research community with programmatic access to many of the Living Atlases data services using R.
This presentation will show how ALA4R can be used to access data from different national Living Atlases nodes, and how this R package can enable research studies that use the methods and practices for reproducible workflows increasingly being established within the research community (https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf).
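Although ALA4R itself is an R package, the Living Atlases data services it wraps are plain HTTP, so the same queries can be sketched in other languages. Below is a minimal Python sketch; the endpoint and parameter names are assumptions to be verified against each node's API documentation.

    # A minimal Python sketch (not ALA4R itself) of querying a Living Atlas
    # occurrence web service over plain HTTP; the endpoint and parameter
    # names are assumptions to verify against each node's documentation.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://biocache.ala.org.au/ws/occurrences/search"  # assumed ALA endpoint
    params = {"q": "Phascolarctos cinereus", "pageSize": 5}

    with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
        data = json.load(resp)

    print(data.get("totalRecords"))
    for occ in data.get("occurrences", []):
        print(occ.get("scientificName"), occ.get("decimalLatitude"), occ.get("decimalLongitude"))

For another national node, only the base URL would change, which is what makes a shared client library such as ALA4R practical across the whole Living Atlases community.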
Abstract | |
Many, if not most, countries have several official or widely used languages, and most, if not all, of these countries have herbaria. Furthermore, specimens have been exchanged between herbaria from many countries, so herbaria are often polylingual collections. It is therefore useful to have label transcription systems that can attract users proficient in a wide variety of languages. Belgium is a typical polylingual country at the boundary between the Romance and Franconian languages (French, Dutch & German). Yet there are currently few non-English transcription platforms for citizen science. This is why, in Belgium, we built DoeDat from the Digivol system of the Atlas of Living Australia.
We will demonstrate DoeDat and its multilingual features. We will explain how we enter translations, both for the user interface and for the dynamic parts of the website. We will share our experiences of running a multilingual site and the challenges it brings. Translating and running such a website requires skilled personnel and patience. However, our experience has been positive, and the number and quality of our volunteer transcriptions have been rewarding. We look forward to the further use of DoeDat to transcribe data in many other languages. There is no longer any reason to exclude willing volunteers, whatever their language.
Abstract | |
MapBio is a project initiated by the Chinese Academy of Sciences that aims to integrate species distribution data from different sources and to map the biodiversity of China in support of biodiversity research and conservation decisions. Species distribution data may be found in journal articles, books and various databases in many formats, and most species distributions are described in free text. MapBio is building a workflow for collecting this free text, parsing it into standardized data and projecting distributions onto a map for each species in China. A map module of MapBio has been designed and implemented, based on Web GIS, to visualize species distributions at different levels, e.g., occurrence points, counties, provinces, distribution ranges, protected areas, waterbodies and biogeographic realms. Since the completeness of distribution data is very important for assessing biodiversity, we developed a tool in MapBio for analyzing gaps in distribution data. Based on the species distribution data, especially the occurrence data, MapBio provides an integrated modeling tool to help users build species niche models. MapBio is an open-access project. Users can easily obtain data and services from it for biodiversity research and conservation, and can also contribute their own biodiversity data to MapBio.
Abstract | |
For more than a decade, the biodiversity informatics community has recognised the importance of stable, resolvable identifiers to enable unambiguous references to data objects and the associated concepts and entities, including museum/herbarium specimens and, more broadly, all records serving as evidence of species occurrence in time and space. Early efforts built on the Darwin Core institutionCode, collectionCode and catalogNumber terms, treated as a triple and expected uniquely to identify a specimen. Following a review of current technologies for globally unique identifiers, TDWG adopted Life Science Identifiers (LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in the LSID consortium soon withdrew support, leaving TDWG committed to a moribund technology. Subsequently, publishers of biodiversity data have adopted a range of technologies to provide unique identifiers, including (among others) HTTP Uniform Resource Identifiers (URIs), Universally Unique Identifiers (UUIDs), Archival Resource Keys (ARKs), and Handles. Each of these technologies has merit, but none provides consistent guarantees of persistence or resolvability. More importantly, the heterogeneity of these solutions hampers delivery of services that could treat all of these data objects as part of a consistent linked-open-data domain.
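The fragility of the early triple approach is easy to see when written out; below is a minimal Python sketch with hypothetical values.

    # A minimal sketch of the early Darwin Core "triple" approach; all values
    # are hypothetical. Uniqueness depends entirely on the three free-text
    # fields never colliding and never changing.
    record = {
        "institutionCode": "NHMUK",
        "collectionCode": "E",
        "catalogNumber": "1234567",
    }
    triple_id = ":".join(record[k] for k in
                         ("institutionCode", "collectionCode", "catalogNumber"))
    print(triple_id)  # NHMUK:E:1234567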
The geoscience community has established the System for Earth Sample Registration (SESAR), which enables collections to publish standard metadata records for their samples and to associate each of these with an International Geo Sample Number (IGSN; http://www.geosamples.org/igsnabout). IGSNs follow a standard format, distribute responsibility for uniqueness between SESAR and the publishing collections, and support resolution via HTTP URIs or Handles. Each IGSN resolves to a standard metadata page, roughly equivalent in detail to a Darwin Core specimen record. The standardisation of identifiers has allowed the community to secure support from some journal publishers for the promotion and use of IGSNs within articles.
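From a client's perspective, IGSN resolution can be sketched as follows; the sample number is hypothetical and the resolver URL patterns (igsn.org and the 10273 Handle prefix) are assumptions to verify.

    # A minimal sketch of IGSN resolution from a client's perspective; the
    # sample number is hypothetical, and both resolver URL patterns are
    # assumptions to verify.
    import urllib.request

    igsn = "SSH000SUA"  # hypothetical IGSN
    for url in (f"http://igsn.org/{igsn}",                # HTTP URI resolution
                f"https://hdl.handle.net/10273/{igsn}"):  # Handle resolution
        with urllib.request.urlopen(url) as resp:
            print(url, "->", resp.geturl())  # follows redirects to the landing page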
The biodiversity informatics community encompasses a much larger number of publishers and greater pre-existing variation in identifier formats. Nevertheless, it would be possible to deliver a shared global identifier scheme with the same features as IGSNs by building on the aggregation services offered by the Global Biodiversity Information Facility (GBIF). The GBIF data index includes normalised Darwin Core metadata for all data records from registered data sources and could serve as a platform for resolution of HTTP URIs and/or Handles for all specimens and all occurrence records. The most significant trade-off requiring consideration would be between autonomy for collections and other publishers in how they format identifiers within their own data, and the benefits that may arise from greater consistency and predictability in the form of resolvable identifiers.
Abstract | |
A simple, permanent and reliable specimen identifier system is needed to take the informatics of collections into a new era of interoperability. A system of identifiers based on HTTP URIs (Uniform Resource Identifiers), endorsed by the Consortium of European Taxonomic Facilities (CETAF), has now been rolled out to 14 member organisations (Güntsch et al. 2017).
CETAF identifiers have a Linked Open Data redirection mechanism for both human- and machine-readable access and, if fully implemented, provide Resource Description Framework (RDF)-encoded specimen data following best practices continuously improved by members of the initiative. To date, more than 20 million physical collection objects have been equipped with CETAF identifiers (Groom et al. 2017).
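The redirection mechanism relies on standard HTTP content negotiation: the same specimen URI serves HTML to a browser and RDF when a machine asks for it, as in the minimal Python sketch below (the URI is a hypothetical example, not a real CETAF identifier).

    # A minimal sketch of the Linked Open Data redirection: one specimen URI,
    # two representations selected by HTTP content negotiation.
    import urllib.request

    uri = "http://data.example-herbarium.org/specimen/B100001234"  # hypothetical
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req) as resp:
        print(resp.geturl())      # redirected to the RDF representation
        print(resp.read()[:200])  # start of the RDF document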
To facilitate the implementation of stable identifiers, simple redirection scripts and guidelines for deciding on the local identifier syntax have been compiled (http://cetafidentifiers.biowikifarm.net/wiki/Main_Page). Furthermore, a capable "CETAF Specimen URI Tester" (http://herbal.rbge.info/) provides an easy-to-use service for testing whether existing identifiers are operational.
For the usability and potential of any identifier system associated with evolving data objects, active links to the source information are critically important. This is particularly true for natural history collections facing the next wave of industrialised mass digitisation, where specimens come online with only basic, but rapidly evolving, label data. Specimen identifier systems must therefore have components for monitoring the availability and correct implementation of individual data objects. Our next implementation steps will involve the development of a "Semantic Specimen Catalogue", which will hold a list of all existing specimen identifiers together with the latest RDF metadata snapshot. The catalogue will be used for semantic inference across collections, as well as the basis for periodic testing of identifiers.
Abstract | |
Life sciences research, and more specifically biodiversity sciences research, has yet to coalesce on a single system of identifiers for specimens (physical samples collected for research), or even a single set of standards for identifiers. Diverse identifier systems lead to duplication and ambiguity, which in turn lead to challenges in finding specimens, tracking and citing their usage, and linking them to data. Other research disciplines provide experience that the biodiversity sciences could use to overcome these challenges. Earth sciences/geology may be the most advanced discipline in this regard, thanks to the use of the International GeoSample Number (IGSN) system, which was established to provide globally unique identifiers for geological samples. The original motivation of IGSN was to overcome duplication of sample numbers reported in the scientific literature and to support the correlation of observations on the same samples carried out by different laboratories and reported in different publications. The IGSN system is managed through a small set of 'allocating agents' who act on behalf of a national agency or community, under the overall coordination of the IGSN Organization, a volunteer group representing a mixture of research institutions and agencies. As with the widely recognized Digital Object Identifiers (DOIs), the primary requirement of an allocating agent is to maintain the mapping from an IGSN to a web 'landing page' corresponding to each sample. A standard (minimal) schema for describing samples registered with IGSN has been developed, but individual IGSN allocating agents often supplement the base metadata with additional information. Other efforts are working on cross-disciplinary sample metadata schemas, but no single core standard has been agreed upon yet. An important part of the development of the IGSN system has been engagement with scholarly publishers, with the goals of making each mention of an IGSN within a report or paper a hyperlink, and of having links to other observations relating to the same sample automatically highlighted by the publisher.
Abstract | |
Zooarchaeological specimens are the remains of animals, including vertebrate and invertebrate taxa, recovered from, or in association with, archaeological contexts of deposition or surrounding landscapes. The physical scope of zooarchaeological specimens is diverse and includes macro- and micro-zooarchaeological specimens composed of archaeologically preserved bone, shell, exoskeletons, teeth, hair or fur, scales, horns or antlers, as well as geochemical (e.g., isotopes) and biochemical (e.g., ancient DNA) signatures derived from faunal remains. Artifacts and objects created from animal remains, such as bone pins, shell beads, and preserved animal hides, are also zooarchaeological specimens. Here we present recent work to use identifiers for archaeological samples in new data publishing routines, focusing on key challenges. One critical challenge is that archaeological samples are often composited into different units depending on collection managers and analysts. Thus, in some cases, when migrating datasets for publication, identifiers can refer to different sets of units, even within the same dataset. Another key challenge is ensuring that different repositories can share sample identifiers. We show how Open Context, a site-based, archaeology-focused repository that also manages objects such as zooarchaeological material, and VertNet, a specimen-oriented biodiversity repository, have collaborated to share sample identifiers.
While this illustrates a success story of linking data across repositories, we discuss the complexity whereby "occurrence identifiers" (not true sample identifiers) in VertNet are propagated to another system, where they point to a similar record called "Animal Bone" in Open Context.
Abstract | |
The Ocean Biogeographic Information System (OBIS) began in 2000 as the repository for data from the Census of Marine Life. Since that time, OBIS has expanded its goals beyond simply hosting data to supporting more aspects of marine conservation (Pooter et al. 2017). To accomplish those goals, the OBIS secretariat, in partnership with its European node (EurOBIS) hosted at the Flanders Marine Institute (VLIZ, Belgium) and the Intergovernmental Oceanographic Commission (IOC) Committee on International Oceanographic Data and Information Exchange (IODE; 23rd session, March 2015, Brugge), established a two-year pilot project to address a particularly problematic issue: environmental data collected as part of marine biological research were being disassociated from the biological data. OBIS-Event-Data is the solution developed from that pilot project, which devised a method for keeping environmental data together with the biological data (Pooter et al. 2017).
OBIS is seeking early adopters of the new OBIS-Event-Data standard from among the marine biodiversity monitoring communities, to further validate the standard and to develop data products and scientific applications supporting the enhancement of Biological and Ecosystem Essential Ocean Variables (EOVs) in the framework of the Global Ocean Observing System (GOOS) and the Marine Biodiversity Observation Network of the Group on Earth Observations (GEO BON MBON).
After the successful two-year IODE pilot project OBIS-ENV-DATA, the IOC established a new two-year IODE pilot project, OBIS-Event-Data for Scientific Applications (2017-2019). The OBIS-Event-Data standard, building on Darwin Core, provides a technical solution for combined biological and environmental data, and incorporates details about sampling methods and effort, including the event hierarchy. It also standardizes the parameters involved in biological, environmental and sampling details using an international standard controlled vocabulary (British Oceanographic Data Centre, Natural Environment Research Council).
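The event hierarchy and measurement pattern at the heart of the standard can be sketched as follows; the Python sketch below uses illustrative identifiers, values and vocabulary URI, not records from a real dataset.

    # A minimal sketch of the OBIS-Event-Data pattern: a Darwin Core event
    # hierarchy with an occurrence and an environmental measurement linked to
    # the same sampling event. All values are illustrative assumptions.
    cruise = {"eventID": "cruise-2018-01", "eventDate": "2018-04"}
    station = {"eventID": "cruise-2018-01-st05",
               "parentEventID": "cruise-2018-01",   # event hierarchy
               "eventDate": "2018-04-12",
               "samplingProtocol": "CTD rosette"}
    occurrence = {"occurrenceID": "cruise-2018-01-st05-occ1",
                  "eventID": "cruise-2018-01-st05",
                  "scientificName": "Calanus finmarchicus",
                  "basisOfRecord": "HumanObservation"}
    measurement = {"eventID": "cruise-2018-01-st05",  # environment stays with the event
                   "measurementType": "water temperature",
                   "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",  # assumed BODC term URI
                   "measurementValue": 7.4,
                   "measurementUnit": "degrees Celsius"}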
A workshop organized by IODE/OBIS in April brought together major animal tagging and tracking networks, such as the Ocean Tracking Network (OTN), the Animal Telemetry Network (ATN), the Integrated Marine Observing System (IMOS), the European Tracking Network (ETN) and the Acoustic Tracking Array Platform (ATAP), to test the OBIS-Event-Data standard through the development of data products and science applications. This workshop also contributed to the further maturation of the GOOS EOV on fish, as well as the EOV on birds, mammals and turtles. We will present the outcomes and lessons learned from this workshop on the problems, solutions, and applications of using Darwin Core/OBIS-Event-Data for bio-logging data.
Abstract | |
In recent years, bio-logging data, automatically gathered by sensors deployed on animals, have become one of the fastest-growing sources of biodiversity data. This is largely due to the steadily declining mass, size and cost of sensors, continuously opening new opportunities to monitor new species. While 'tracking data' (data from spatially enabled sensors such as GPS sensors) was previously most prominent, almost 70% of all bio-logging data now consists of non-spatial data, e.g., physiological data. In contrast to the biodiversity data community, where standards to mobilize and exchange data are relatively well established, the bio-logging community still lacks standards to transport data from sensors into repositories, or to mobilize data in a standardized format from different repositories, to enable cooperation between users, shared software tools, data aggregation for meta-analysis, or a consistent format for long-term archiving.
To set the stage for a discussion about standards for bio-logging data to be developed or adapted, we present a mind map describing the different pathways of bio-logging data during its life cycle, and the opportunities for standardization within this cycle. As an example, we present the use of the Open Geospatial Consortium (OGC) 'SensorML' and 'Observations & Measurements' standards to transfer bio-logging data from a sensor to a repository and ultimately to a user for subsequent analysis. These standards provide machine-readable methods for describing bio-logging sensors and the measurements they collect, offering a standardized structure that can be customized by the bio-logging community (e.g. with standardized vocabularies) to achieve interoperability.
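To give a flavour of such a transfer, the minimal Python sketch below serializes one measurement in a simplified Observations & Measurements style; it is illustrative only and not a schema-valid O&M document.

    # A minimal sketch serializing one bio-logging measurement in a
    # simplified Observations & Measurements style; illustrative only,
    # not a schema-valid O&M document.
    import xml.etree.ElementTree as ET

    OM = "http://www.opengis.net/om/2.0"
    ET.register_namespace("om", OM)

    obs = ET.Element(f"{{{OM}}}OM_Observation")
    ET.SubElement(obs, f"{{{OM}}}phenomenonTime").text = "2018-05-01T12:00:00Z"
    ET.SubElement(obs, f"{{{OM}}}procedure").text = "heart-rate-logger-0042"  # hypothetical sensor ID
    ET.SubElement(obs, f"{{{OM}}}observedProperty").text = "heart_rate"
    ET.SubElement(obs, f"{{{OM}}}result").text = "62"  # beats per minute

    print(ET.tostring(obs, encoding="unicode"))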
Abstract | |
Usefully describing sensor deployments on animals is a major challenge for advocates of data standards. Bio-logging studies also need to be documented in a standard manner to facilitate discovery and to determine relevance. For systems aggregating biodiversity occurrence records, the use of the Darwin Core standard (Wieczorek et al. 2012) to express species occurrences is near ubiquitous. Bio-logging studies are, without exception, collections of species occurrences that yield high-quality spatial and temporal data recorded by specialists.
There are many benefits to summarising these studies as a single, flat-file record. Simple Darwin Core offers the ability to do this by representing the multiple occurrences as a date range in dwc:eventDate and a footprint polygon in dwc:footprintWKT for the area covered by the track. By also uniformly describing the species, setting dwc:basisOfRecord to MachineObservation, and using a controlled vocabulary to describe the type of bio-logging data, systems could offer an effective means of querying tracking data. It is important to look to other data standards initiatives relevant to bio-logging to ensure common usage of Darwin Core terms.
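A flat summary record of this kind, using the Darwin Core terms named above with hypothetical values, might look like the following sketch.

    # A minimal sketch of a flat Simple Darwin Core record summarising one
    # tracking deployment; all values are hypothetical.
    track_summary = {
        "basisOfRecord": "MachineObservation",
        "scientificName": "Chelonia mydas",
        "eventDate": "2017-01-03/2017-06-28",  # date range of the deployment
        "footprintWKT": ("POLYGON ((146.0 -18.0, 147.5 -18.0, "
                         "147.5 -16.5, 146.0 -16.5, 146.0 -18.0))"),
        "samplingProtocol": "satellite telemetry",  # candidate controlled-vocabulary slot
    }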
The Atlas of Living Australia is using an implementation of Simple Darwin Core to represent data from the bio-logging platform ZoaTrack as occurrence data, making it discoverable via location- or species-based searches. Other initiatives, for example Swedish LifeWatch, follow a similar approach to represent data from the Wireless Remote Animal Monitoring (WRAM) Scandinavian bio-logging infrastructure. With endorsement from the community, the implementation could be useful as a type of metadata catalogue record, opening it up for use in application programming interface (API) development and thus enabling machine interoperability between systems and users. In short, bio-logging systems and practitioners would be able to easily discover relevant studies by searching by location and/or species.
Abstract | |
With the continuous development of imaging technology, the amount of insect 3D data is increasing, but research on its data management is still virtually non-existent. This paper discusses the specifications and standards relevant to the process of insect 3D data acquisition, processing and analysis.
The collection of insect 3D data includes specimen collection, sample preparation, image scanning specifications and 3D model specifications. The specimen collection information uses existing biodiversity information standards such as Darwin Core. However, the 3D scanning process involves unique specimen preparation specifications, depending on the scanning equipment, to achieve the best imaging results.
Data processing of 3D images includes 3D reconstruction, tagging morphological structures (such as muscle and skeleton), and 3D model building. There are different algorithms in the 3D reconstruction process, but the processing results generally follow the DICOM (Digital Imaging and Communications in Medicine) standards. There is no available standard for marking morphological structures, because this process is currently executed by individual researchers who create operational specifications according to their own needs. 3D models have specific file specifications, such as Wavefront object files (https://en.wikipedia.org/wiki/Wavefront_.obj_file) and the 3ds Max format (https://en.wikipedia.org/wiki/.3ds), which are widely used at present.
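As an example of working with these formats, the minimal Python sketch below loads a DICOM slice stack into a single voxel array using the pydicom library; the file layout is hypothetical.

    # A minimal sketch, assuming the pydicom library, of loading a DICOM
    # slice stack from a micro-CT scan into one voxel array; the file
    # layout is hypothetical.
    import glob
    import numpy as np
    import pydicom

    slices = [pydicom.dcmread(p) for p in sorted(glob.glob("scan/slice_*.dcm"))]
    volume = np.stack([s.pixel_array for s in slices])  # (n_slices, rows, cols)
    print(volume.shape)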
There are only a few simple tools for the analysis of three-dimensional data, and there are no specific standards or specifications in Audubon Core (https://terms.tdwg.org/wiki/Audubon_Core), the TDWG standard for biodiversity-related multimedia.
There are very few 3D databases of animals at this time. Most insect 3D data are created by individual entomologists and are not even stored in databases. Specifications for the management of insect 3D data need to be established step by step. Based on our attempt to construct a database of insect 3D data, we preliminarily discuss the necessary specifications.
Abstract | |
iDigBio (Matsunaga et al. 2013) currently references over 22 million media files, and stores approximately 120 terabytes of those media files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively.
Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. We have placed a Jupyter notebook server in front of this architecture, providing an easy environment, with deep learning libraries for Python already loaded, in which end users can write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub.
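The minimal Python sketch below illustrates the kind of user-defined step the pipeline supports; it assumes PySpark and worker-accessible image paths, and is a drastic simplification of the actual GUODA setup.

    # A minimal sketch of a user-defined step in such a pipeline, assuming
    # PySpark and worker-accessible image paths; a drastic simplification
    # of the actual GUODA setup.
    import io
    from pyspark.sql import SparkSession
    from PIL import Image

    spark = SparkSession.builder.appName("image-pipeline-sketch").getOrCreate()

    def thumbnail_size(path):
        # Example user-defined transformation: resize and report dimensions.
        with open(path, "rb") as f:
            img = Image.open(io.BytesIO(f.read()))
        img.thumbnail((256, 256))
        return (path, img.size)

    paths = ["/data/images/specimen_0001.jpg", "/data/images/specimen_0002.jpg"]  # hypothetical
    results = spark.sparkContext.parallelize(paths).map(thumbnail_size).collect()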
As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio, both to illustrate the application of these techniques to broad image corpora and, potentially, to notify other data publishers of contamination. We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
Abstract | |
Earth's ecosystems are threatened by anthropogenic change, yet relatively little is known about biodiversity across broad spatial (i.e. continental) and temporal (i.e. year-round) scales. There is a significant gap at these scales in our understanding of species distribution and abundance, which is the precursor to conservation (Hochachka et al. 2012). The cost and availability of experts to collect data do not scale to broad spatial or temporal surveys. With recent advances in artificial intelligence (AI), it is becoming possible to automate some of this data collection and analysis (Joppa 2017). The Cornell Lab of Ornithology is working to apply AI in three ways:
incorporating AI into the analysis of radar data to assess densities of migratory birds at a continent-wide scale and across years;
utilizing new techniques in convolutional neural networks (CNNs) to improve our ability to classify natural sounds by limiting background noise;
applying our ability to train models to classify birds in images to build a system that can analyze video streams.
Our approach to accomplishing this is through partnerships between our | |
non-profit organization, computer science faculty, and industry leaders. | |
By leveraging deep learning technologies and including an array of | |
stakeholders, we are able to process data that would take years to | |
analyze using traditional methods. | |
Methods. | |
We use 28 years of Next-Generation Radar (NEXRAD) imagery, which captures birds aloft during nocturnal migration. Using CNNs, we can assess the density of birds captured on radar images to count the number of individuals crossing the continental U.S. each spring and fall. For acoustic analysis of birds vocalizing during nocturnal migration, we are using recorders to monitor the calling activity of birds aloft and CNNs to detect and classify bird vocalizations in noisy landscapes. We gathered more than 6 million images from the eBird community, archived them in the Macaulay Library at the Cornell Lab of Ornithology, and crowdsourced millions of annotations to train models to classify more than 5,000 species of birds in images. Now we are applying this approach to video. These projects have used both supervised and unsupervised learning techniques. With supervised learning and the use of elaborate training datasets, we have made tremendous headway in bird photo identification. Unsupervised learning was used successfully to eliminate rain in NEXRAD images, with little training data incorporated. We expect advances in unsupervised learning to open new possibilities in the future.
Conclusions. | |
The Cornell Lab pioneered the concept of autonomous recording units for monitoring biodiversity two decades ago, but without AI to process the data, discoveries were limited by human processing time. Today, we can combine our findings from radar with acoustic monitoring and sightings from citizen scientists for a more complete understanding of bird populations. We expect AI processes to be able to identify birds with high confidence in the near future for images, audio recordings and videos. Furthermore, while conventional approaches require separate neural nets combined in a separate process, we now integrate multi-modal sensor data into a single CNN, removing the need for pre-processing of data for AI pattern recognition. Our vision is to continue to apply these techniques to create a 'real-time global bird monitoring network' with a combination of humans and automated sensors. This network of sensors (or robots) will have an ability comparable to a human's to detect, identify, and count birds, gathering information systematically and in places where humans cannot reach.
Abstract | |
Widespread technology usage has resulted in a deluge of data that is not limited to scientific domains. For example, technology companies accumulate vast amounts of data on their users to support their applications and platforms. The participation of many domains in big data collection, data analysis and visualization, and the need for fast data exploration, has provided a stellar market opportunity for high-quality data visualization software to emerge. In this talk, leading industry visualization software (Tableau) will be used to explore a biodiversity dataset (Carex spp. distribution and morphology). The advantages and disadvantages of using Tableau for scientific exploration will be discussed, as well as how to integrate data visualization tools early in the data pipeline. Lastly, the potential for developing a data visualization "stack" (i.e., a combination of software products and programming languages) using available tools will be discussed, as well as what the future might look like for scientists looking to capitalize on the growth of industry tools.
Abstract | |
Phytoplankton form the basis of the marine food web and are an indicator of the overall status of the marine ecosystem. Changes in this community may impact a wide range of species (Capuzzo et al. 2018), ranging from zooplankton and fish to seabirds and marine mammals. Efficient monitoring of the phytoplankton community is therefore essential (Edwards et al. 2002). Traditional monitoring techniques are highly time-intensive and involve taxonomists identifying and counting numerous specimens under the light microscope. With the recent development of automated sampling devices, image analysis technologies and learning algorithms, the rate of counting and identification of phytoplankton can be increased significantly (Thyssen et al. 2015). The FlowCAM (Álvarez et al. 2013) is an imaging particle analysis system for the identification and classification of phytoplankton. Within the Belgian LifeWatch observatory, monthly phytoplankton samples are taken at nine stations in the Belgian part of the North Sea. These samples are run through the FlowCAM and each particle is photographed. Next, the particles are identified based on their morphology (and fluorescence) using state-of-the-art convolutional neural networks (CNNs) for computer vision. This procedure requires learning sets of expert-validated images. The CNNs are specifically designed to take advantage of the two-dimensional structure of these images by finding local patterns, making them easier to train and giving them many fewer parameters than a fully connected network with the same number of hidden units.
In this work we present our approach to the use of CNNs for the identification and classification of phytoplankton, testing it on several benchmarks and comparing it with previous classification techniques. The network architecture used is ResNet50 (He et al. 2016). The framework is fully written in Python using the TensorFlow (Abadi et al. 2016) module for deep learning.
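A minimal training sketch, assuming TensorFlow/Keras and a directory of expert-validated FlowCAM images with one sub-directory per taxon, is shown below; paths and hyperparameters are illustrative rather than the configuration actually used.

    # A minimal sketch, assuming TensorFlow/Keras and a directory of
    # expert-validated FlowCAM images (one sub-directory per taxon);
    # paths and hyperparameters are illustrative.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "flowcam_images/train",  # hypothetical layout: one folder per class
        image_size=(224, 224), batch_size=32)
    num_classes = len(train_ds.class_names)

    model = tf.keras.applications.ResNet50(
        weights=None, classes=num_classes, input_shape=(224, 224, 3))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_ds, epochs=...)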
Deployment and exploitation of the current framework are supported by the recently started project DEEP-Hybrid-Datacloud, funded by the European Union Horizon 2020 programme (Grant Agreement number 777435), which supports the computationally expensive training of the system needed to develop the application and provides the necessary computational resources to users.
Abstract | |
Over the next five years, major advances in the development and application of numerous technologies related to computing, mobile phones, artificial intelligence (AI), and augmented reality (AR) will have a dramatic impact on biodiversity monitoring and conservation. Over a two-week period, several of us had the opportunity to meet with multiple technology experts in Silicon Valley, California, USA to discuss trends in technology innovation and how they could be applied to conservation science and ecology research. Here we briefly highlight some of the key points of these meetings with respect to AI and Deep Learning.
Computing: Investment and rapid growth in AI and Deep Learning technologies are transforming how machines can perceive the environment. Much of this change is due to the increased processing speeds of Graphics Processing Units (GPUs), now a billion-dollar industry. Machine learning applications, such as convolutional neural networks (CNNs), run more efficiently on GPUs and are being applied to analyze visual imagery and sounds in real time. Rapid advances in CNNs that use both supervised and unsupervised learning to train the models are improving accuracy. A Deep Learning approach in which the base layers of the model are built upon datasets of known images and sounds (supervised learning) and later layers rely on unclassified images or sounds (unsupervised learning) dramatically improves the flexibility of CNNs in perceiving novel stimuli. The potential to have autonomous sensors gathering biodiversity data in the same way personal weather stations gather atmospheric information is close at hand.
Mobile Phones: The phone is the most widely used information appliance in the world. No device on the near horizon will challenge this platform, for several key reasons. First, network access is ubiquitous in many parts of the world. Second, batteries are improving by about 20% annually, allowing for more functionality. Third, app development is a growing industry with significant investment in specializing apps for machine learning. While GPUs already run on phones for video streaming, there is much optimism that reduced or approximate Deep Learning models will operate on phones. These models are already working in the lab; the biggest hurdle is power consumption, and developing energy-efficient applications and algorithms to run complicated AI processes will be important. It is just a matter of time before industry has AI functionality on phones.
These rapid improvements in computing and mobile phone technologies have huge implications for biodiversity monitoring, conservation science, and understanding ecological systems. Computing: AI processing of video imagery or acoustic streams creates the potential to deploy autonomous sensors in the environment that will be able to detect and classify organisms to species. Further, AI processing of Earth spectral imagery has the potential to provide finer-grained classification of habitats, which is essential in developing fine-scale models of species distributions over broad spatial and temporal extents. Mobile Phones: Increased computing functionality and more efficient batteries will allow applications to be developed that improve an individual's perception of the world. Already, the AI functionality of Merlin improves a birder's ability to identify a bird accurately. Linking this functionality to sensor devices like specialized glasses, binoculars, or listening devices will help an individual detect and classify objects in the environment.
In conclusion, computing technology is advancing at a rapid rate, and soon autonomous sensors placed strategically in the environment will augment the species occurrence data gathered by humans. The mobile phone in everyone's pocket should be thought of strategically as a way to connect people to the environment and improve their ability to gather meaningful biodiversity information.
Abstract | |
Reliable plant species identification from seeds is intrinsically difficult due to the scarcity of features, and because it requires specialized expertise that is becoming increasingly rare as the number of field plant taxonomists diminishes (Bacher 2012, Haas and Häuser 2005). On the other hand, seed identification is relevant in several science domains, such as plant community ecology, archaeology and paleoclimatology. In addition, economic activities such as agriculture require seed identification to assess the weed species contained in "soil seed banks" (Colbach 2014), enabling targeted treatments before they become a problem.
In this work, we explore and evaluate several approaches by using different training image sets with various requisites and assessing their performance with test datasets from different sources.
The core training dataset is provided by the Anthos project (Castroviejo et al. 2017) as a subset of its image collection. It consists of nearly 1,000 images of seeds identified by experts.
As the identification algorithm, we will use state-of-the-art convolutional neural networks for image classification (He et al. 2016). The framework is fully written in Python using the TensorFlow (Abadi et al. 2016) module for deep learning.
Abstract | |
Automated identification of plants and animals has improved considerably in the last few years, in particular thanks to recent advances in deep learning. In order to evaluate the performance of automated plant identification technologies in a sustainable and repeatable way, a dedicated system-oriented benchmark was set up in 2011 in the context of ImageCLEF (Goëau et al. 2011). Each year since then, several research groups have participated in this large collaborative evaluation by benchmarking their image-based plant identification systems. In 2014, the LifeCLEF research platform (Joly et al. 2014) was created in the continuity of this effort, so as to enlarge the evaluated challenges by considering birds and fishes in addition to plants, and audio and video content in addition to images.
The 2017 edition of the LifeCLEF plant identification challenge (Joly et al. 2017) is an important milestone towards automated plant identification systems working at the scale of continental floras, with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Such ambitious systems are now enabled by the conjunction of dazzling recent progress in image classification with deep learning and several outstanding international initiatives aggregating visual knowledge on plant species from the main national botanical institutes. The PlantCLEF plant challenge that we propose to present at this workshop aimed at evaluating to what extent a large, noisy training dataset collected through the web (and thus containing many labelling errors) can compete with a smaller but trusted training dataset checked by experts. To compare both training strategies fairly, the test dataset was created from a third data source, the Pl@ntNet (Joly et al. 2015) mobile application, which collects millions of plant image queries all over the world.
Given the good results obtained in the 2017 edition of the LifeCLEF plant identification challenge, the next big question is how far such automated systems are from human expertise. Indeed, even the best experts are sometimes confused and/or disagree with each other when validating images of living organisms. A multimedia record actually contains only partial information, usually not sufficient to determine the right species with certainty. Quantifying this uncertainty and comparing it to the performance of automated systems is of high interest for both computer scientists and expert naturalists. This work reports an experimental study following this idea in the plant domain. In total, 9 deep-learning systems implemented by 3 different research teams were evaluated against 9 expert botanists of the French flora. The main outcome of this work is that the performance of state-of-the-art deep learning models is now close to that of the most advanced human expertise. This shows that automated plant identification systems are now mature enough for several routine tasks, and can offer very promising tools for autonomous ecological surveillance systems.
Abstract | |
The fast and accurate identification of forest species is critical to support their sustainable management, to combat illegal logging, and ultimately to conserve them. Traditionally, the anatomical identification of forest species is a manual process that requires a human expert with a high level of knowledge to observe and differentiate certain anatomical structures present in a wood sample (Wiedenhoeft 2011).
In recent years, deep learning techniques have drastically improved the state of the art in many areas such as speech recognition, visual object recognition, and image and music information retrieval, among others (LeCun et al. 2015). In the context of the automatic identification of plants, these techniques have recently been applied with great success (Carranza-Rojas et al. 2017), and mobile apps such as Pl@ntNet have even been developed to identify a species from images captured on the fly (Joly et al. 2014). In contrast to conventional machine learning techniques, deep learning techniques extract and learn the relevant features from large datasets by themselves.
One of the main limitations on the application of deep learning techniques to forest species identification is the lack of comprehensive datasets for training and testing convolutional neural network (CNN) models. For this work, we used a dataset developed at the Federal University of Parana (UFPR) in Curitiba, Brazil, that comprises 2,939 uncompressed JPG images with a resolution of 3,264 x 2,448 pixels. It includes 41 different forest species of the Brazilian flora that were cataloged by the Laboratory of Wood Anatomy at UFPR (Paula Filho et al. 2014). Due to the lack of comprehensive datasets worldwide, this has become a benchmark dataset in previous research (Paula Filho et al. 2014, Hafemann et al. 2014).
In this work, we propose and demonstrate the power of deep CNNs to identify forest species based on macroscopic images. We use a pre-trained model built from the ResNet50 architecture with weights pre-trained on ImageNet. We apply fine-tuning by first truncating the top (softmax) layer of the pre-trained network and replacing it with a new softmax layer. We then retrain the model with the dataset of macroscopic images of species of the Brazilian flora used by Hafemann et al. (2014) and Paula Filho et al. (2014).
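This fine-tuning step can be sketched as follows, assuming TensorFlow/Keras; the input size and hyperparameters are illustrative.

    # A minimal sketch of the fine-tuning step, assuming TensorFlow/Keras;
    # input size and hyperparameters are illustrative.
    import tensorflow as tf

    NUM_SPECIES = 41  # forest species in the UFPR dataset

    # ResNet50 backbone with ImageNet weights; include_top=False drops the
    # original 1000-class softmax layer.
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))

    # New softmax head for the 41 target classes.
    outputs = tf.keras.layers.Dense(NUM_SPECIES, activation="softmax")(backbone.output)
    model = tf.keras.Model(backbone.input, outputs)

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=...)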
Using the proposed model, we achieve a top-1 accuracy of 98%, which is better than the 95.77% reported by Hafemann et al. (2014) using the same dataset. In addition, our result is slightly better than the 97.77% reported by Paula Filho et al. (2014), which was obtained by combining several conventional computer vision techniques.
Abstract | |
Costa Rica is one of the countries with the highest density of species biodiversity in the world. More than 2,000 tree species have already been identified, many of which are used in the building, furniture, and packaging industries (Grayum et al. 2003). This rich diversity makes the correct identification of tree species very difficult. As a result, it is common to see species commercialized in the national market under mistaken identifications, which makes quality control particularly challenging. In addition, because 90 timber tree species have been classified as "threatened" in Costa Rica, correct identifications are indispensable for law enforcement.
The traditional system of tree species identification is based on macro- and microscopic evaluation of the anatomy of the wood. It entails assessing anatomical features such as patterns of vessels, parenchymas, and fibers. Typically, 7.7 x 10 cm wood cuts are used to identify the tree species (Pan and Kudo 2011, Yusof et al. 2013). However, assessing these features is extremely difficult for taxonomists because the properties of the wood can vary considerably due to environmental conditions and intra-specific genetic variability.
Deep learning techniques have recently been used to identify plant species (Carranza-Rojas et al. 2017a, Carranza-Rojas et al. 2017b) and are potentially useful for detecting subtle differences in patterns of vessels, parenchyma, and other anatomical features of wood. However, it is necessary to have a large collection of macroscopic photographs of individuals from various parts of the country (Pan and Kudo 2011). As a first step in the application of deep learning techniques, we have defined a formal, standard protocol for collecting wood samples, physically processing them, taking pictures, performing data augmentation, and using metadata to provide the primary data necessary for deep learning applications. Unlike traditional xylotheque sampling methods that destroy trees or use wood from fallen trees, we propose a method that extracts small samples of sufficient quality for anatomical characterization without affecting the growth and survival of the individual.
This study has been developed in three permanent forest plots in Costa Rica, all of which are sites with historical growth data over the last 20 years. We have so far evaluated 40 species (10 individuals per species) with diameters greater than 20 cm. From each individual, a cylindrical sample 12 mm in diameter and 7.5 cm in length was extracted with a cordless drill. Each sample is then cut into five 8 x 8 x 8 mm cubes and further processed to produce curated xylotheque samples, a dataset with all relevant metadata and original images, and a dataset with images obtained by performing data augmentation on the original images.
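The data augmentation step might be sketched as follows with TensorFlow/Keras preprocessing layers; the transformations shown are illustrative, not the exact protocol.

    # A minimal sketch of the data augmentation step, using TensorFlow/Keras
    # preprocessing layers; the transformations chosen are illustrative.
    import tensorflow as tf

    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # wood cubes have no canonical orientation
        tf.keras.layers.RandomRotation(0.25),
        tf.keras.layers.RandomZoom(0.1),
        tf.keras.layers.RandomContrast(0.2),
    ])
    # augmented = augment(image_batch, training=True)  # (N, H, W, 3) float tensor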
Abstract | |
As a child, I loved exhibits at the museum. As an adult conservation | |
biologist, entering the back rooms of the museum to view the collections | |
is even more remarkable. I have begun to realise the scope of what might | |
be held in museum collections, and to consider what these specimens, | |
artefacts, taonga (treasure) might tell us. Using examples from my work | |
on insects, birds and kahukurii (dogskin cloaks), and analyses from | |
morphometrics to isotopes, I will show how sampling from museum | |
collections can add layers of richness and complexity to research, with | |
the added dimensions of space, time, and connection to communities. | |
Finally, I'll discuss some of the ethics and understandings that guide | |
my work with museum collections, and what it means to be part of | |
collaborative partnerships of discovery with museum curators and | |
communities. | |
Abstract | |
Natural history collections are essential for understanding the world's biodiversity and drive research in taxonomy, systematics, ecology and biosecurity. One of the biggest challenges faced is the decline in new taxonomists and in public interest in collections-based research, which is alarming considering that an estimated 70% of the world's species are yet to be formally described.
Science communication combines public relations with the dissemination | |
of scientific knowledge and offers many benefits to promoting natural | |
history collections to a wide audience. For example, social media has | |
revolutionised the way collections and their staff communicate with the | |
public in real time, and can attract more visitors to collection | |
exhibits and new students interested in natural history. Although not | |
everyone is born a natural science communicator, institutions can | |
encourage and provide training for their staff to become engaging | |
spokespeople skilled in social media and public speaking, including | |
television, radio and/or print media. By embracing science | |
communication, natural history collections can influence their target | |
audiences in a positive and meaningful way, raise the profile of their | |
institution, encourage respect for biodiversity, promote their events | |
and research outputs, seek philanthropic donations, connect with other | |
researchers or industry leaders, and most importantly, inspire the next | |
generation of natural historians. | |
Abstract | |
Since 2010, the Canterbury region on the eastern coast of New Zealand's South Island has experienced more than 14,000 earthquakes. This presentation begins by considering the immediate impact of these seismic events on Canterbury Museum: how were its buildings, its collections, its team and its communities affected? Within the first weeks and months, what processes were put in place to manage the collections, and to what extent was the Museum's team able to undertake work to ensure the institution remained relevant during a national disaster? Almost eight years after the first major earthquake, this presentation reflects on some of the lessons learnt about the realities of planning for, and responding to, disaster, and on the impact of a continuing series of earthquakes on the concept of 'business as usual'.
Abstract | |
Taxonomic work is slow and time-consuming. Alarm bells have rung for years about the need to go faster, the need to attract and train new taxonomic workers, and the need to convince other branches of science that taxonomic work is vital. Morphological taxonomy is either being overrun or augmented -- depending on your perspective -- by genomics, artificial intelligence, new imaging methods and species-related data from other branches of science.
Ecology is one such branch of science, where defining, documenting and managing information about species traits has emerged as one of the most significant problems in the discipline. Traits have been recorded for aeons, but the resulting data have largely been insulated within cliques. How do we integrate these data and make them available in a form that will help to address significant issues about our environment? The 'speed bumps' on the route to a useful solution may be more social than technical.
Cross-disciplinary collaboration is required to address the big | |
questions in biodiversity research today, and it will need to extend | |
beyond taxonomy and ecology to other disciplines, such as pharmacology | |
and material science. As Harry Truman said, and John LaSalle often | |
quoted, "It is amazing what you can accomplish if you do not care who | |
gets the credit". | |
We are challenged to understand and answer the key questions about the | |
world on which we all depend. What are the challenges and the | |
opportunities to accelerate biodiversity discovery and documentation? | |
Abstract | |
Standards set up by Biodiversity Information Standards (TDWG, formerly the Taxonomic Databases Working Group), initially developed as a way to share taxonomic data, greatly facilitated the establishment of the Global Biodiversity Information Facility (GBIF) as the largest index to digitally accessible primary biodiversity information records (PBR) held by many institutions around the world. The level of detail and coverage of the body of standards that later became the Darwin Core terms has enabled increasingly precise retrieval of relevant records, useful for increasing digitally accessible knowledge (DAK), which, in turn, may have helped to answer ecologically relevant questions.
After more than a decade of data accrual and release, an increasing number of papers and reports cite GBIF either as a source of data or as a pointer to the original datasets. GBIF has curated a list of over 5,000 citations, which were examined for content and tagged with additional keywords describing that content. The list now provides a window on what users want to accomplish using such DAK.
We performed a preliminary word-frequency analysis of this literature, which refers to GBIF as a resource, starting with titles. Through standardization and mapping of terms, we examined how the facility-enabled data seem to have been used by scientists and other practitioners through time: which concepts/issues are pervasive, which taxon groups are most often addressed, and whether data concentrate around specific geographical or biogeographical regions. We hoped to cast light on which types of ecological problems the community believes are amenable to study through the judicious use of this data commons, and found that, indeed, a few themes were distinctly more frequently mentioned than others. Among those, generally perceived issues such as climate change and its effect on biodiversity at global and regional scales seemed prevalent. Taxonomic groups were also unevenly mentioned, with birds and plants the most frequently named. However, the list of potential subjects that might have used GBIF-enabled data is now quite wide, showing that the availability of well-structured data has spawned a widening spectrum of possible use cases. Among them, some enjoy early and continuous presence (e.g. species, biodiversity, climate), while others started to show up only later, once a critical mass of data seemed to have been attained (e.g. ecosystems, suitability, endemism). Biodiversity information in the form of standards-compliant DAK may thus already have become a commodity enabling insight into an increasingly complex and diverse body of science. Paraphrasing Tennyson, more things were wrought by data than TDWG dreamt of.
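The kind of preliminary word-frequency analysis described above can be sketched in a few lines of Python; the titles and stop-word list below are hypothetical placeholders.

    # A minimal sketch of a word-frequency analysis over citation titles;
    # the titles and the stop-word list are hypothetical placeholders.
    import re
    from collections import Counter

    titles = [
        "Climate change and the distribution of montane birds",
        "Modelling habitat suitability for endemic plants",
    ]  # hypothetical examples

    stopwords = {"the", "and", "of", "for", "a", "in"}
    words = (w for t in titles for w in re.findall(r"[a-z]+", t.lower()))
    print(Counter(w for w in words if w not in stopwords).most_common(10))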
Abstract | |
Agile, interconnected and diverse communities of practice can serve as a hedge against an uncertain world. We currently live in an era of populist
politics and diminishing government funding, challenging our collective | |
optimism for the future. However, the communities we build and | |
contribute to can be prepared and strengthened to address the challenges | |
ahead. How we choose to operate in this world of less funding is tied to | |
the collective impacts we all believe we can achieve by working | |
together. How we choose to work together and structure our communities | |
matters. | |
Abstract | |
Taxidermy made for display is often considered less significant in museum research collections. This is because historical taxidermy material often becomes disassociated from key data and, through the rigours of public display, ends up in poor physical condition.
However, by tracing a specimen's biography as a living animal and following its transition into a museum afterlife, much can be revealed about the development of natural history collections and changing attitudes towards animals.
This presentation will investigate several pieces of taxidermy in the | |
zoology collection of the Tasmanian Museum and Art Gallery (TMAG) | |
(http://www.tmag.tas.gov.au/collections\_and\_research/zoology/collections), | |
where research has uncovered surprising stories and helped reassess the | |
significance and cultural value of this material. | |
An unregistered lion head, identified as animal celebrity John Burns, | |
tells the story of the golden age of Australian and New Zealand | |
circuses, changing attitudes around animal ethics in the circus and the | |
negotiations between scientific institutions in acquiring exotic
species in the late nineteenth century.
A collection of taxidermied domestic chickens from the 1940s is found to | |
mark the modernisation of the TMAG public displays in communicating | |
current research and the development of a dedicated museum education | |
unit. | |
The colourful afterlife of these specimens in the museum collection
highlights struggles with storage, changes in collecting priorities,
and the evolution of public display and education at TMAG.
Abstract | |
In France, a national information system on water withdrawals called | |
Banque Nationale des Prélèvements en Eau (BNPE) has been set up to | |
comply with the Water Framework Directive (WFD) and national Law on | |
Water and Aquatic Environments. The aims are to centralize information | |
on the volume of water withdrawals and to share it on the website | |
www.bnpe.eaufrance.fr, where data can both be viewed and exported | |
without restriction. BNPE shares data in a form that can be used for | |
water management studies, scientific research, or to assess impacts on | |
aquatic habitats. | |
THE BNPE PROJECT SCOPE | |
The BNPE is a part of the French Water Information System (SIE), set up | |
to share public data on water and aquatic environments*1. The BNPE
project is managed by the French Biodiversity Agency (AFB) and the | |
Adour-Garonne Water Agency, and is supervised by the French Ministry in | |
charge of Environment. Database and related tools were developed with | |
the French Geological Survey (BRGM). | |
To achieve its goals, the project mainly reuses information from Water | |
Agencies, based on taxes collected using the 'taker-payer' principle:
persons who take water from the natural environment have to pay. Data on | |
water withdrawals disseminated by BNPE can now be reused by land
managers, decision-makers and researchers thanks to a single point of
access to these data for the whole of France (metropolitan and
overseas). These data
are: | |
Detailed data on each withdrawal: volume of water withdrawn (m^3^),
geographic coordinates of the water pump, water uses (e.g. energy, | |
irrigation, drinking water supply, industries), type of water | |
(groundwater, surface water: river, lake or estuary), | |
Aggregated data: synthesis is available by year, geography, use or type | |
of water. | |
In 2018, BNPE shared data from 2008 to 2016. | |
CHALLENGES OF CENTRALIZATION AND REUSE OF DATA: FEEDBACK FROM THE
PROJECT | |
The BNPE project faced the challenges of centralization and reuse of | |
data at a national level by making the data available to everyone. The | |
reuse of data derived from taxes due to environmental issues is not | |
easy, even in an open data context. We identified two main issues: | |
The data standardization issue | |
The stakeholders of the project set up a dictionary*2 to define common
repositories and a data exchange format. This work was done in
collaboration with the Sandre*3, the French National Service for Water
Data and Common Repositories Management. However, the definition of the
standard is too broad and producers encounter issues in standardizing
their data. This project shows us the need to define a limited core of | |
data concepts to share, which are very well defined and cannot be | |
misinterpreted. The BNPE experience also highlights the importance of
using concepts that already exist in producers' information systems.
Centralization and enrichment of datasets are two separate steps that
need to be differentiated for a project to succeed.
The challenge of reusing data | |
The project is confronting issues related to assembling a relevant
dataset of water withdrawals. Data from taxes paid by water takers lack
key environmental information, which limits their use for environmental
studies. For example, only 50% of water withdrawn is linked to a
specific river, lake or groundwater source. Moreover, because current | |
water use datasets are derived from taxes on withdrawals greater than | |
7000 m^3^ per year, the data are missing for some withdrawals. AFB is | |
studying additional data sources to complete the dataset (e.g., local | |
authorities, crowdsourcing, spatial joining). | |
Abstract | |
The European Search Catalogue for Plant Genetic Resources, EURISCO, | |
provides information about more than 1.9 million accessions of crop | |
plants and their wild relatives, preserved ex situ by almost 400 | |
institutes in Europe and beyond (Weise et al. 2017). EURISCO, which is | |
being maintained on behalf of the European Cooperative Programme for | |
Plant Genetic Resources, is based on a network of National Inventories | |
of 43 member countries. It represents an important effort for the | |
preservation of the world's agrobiological diversity by providing | |
information about the large genetic diversity kept by the collaborating | |
institutions. | |
Besides the classical passport data, EURISCO began in 2016 to
collect phenotypic data about the documented germplasm
accessions. The selection of genebank material for both research and | |
breeding purposes is increasingly carried out through the selection of | |
specific phenotypic values, e.g. flowering time or plant height. Thus, | |
these data are of high importance to users of plant genetic resources | |
(PGR) since they determine the value of the respective germplasm. | |
However, because no commonly agreed standards exist within the genebank
community, this kind of data is very difficult to handle. The
challenges range from synonymous or homonymous descriptor names,
through differing rating scales, to varying or insufficient metadata,
hampering both the integration and the cross-experiment comparison of
data.
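A minimal sketch of the descriptor harmonization such integration calls for (Python; the synonym table and the naive linear rescaling are illustrative assumptions, not EURISCO's actual rules):

    # Hypothetical synonym table mapping descriptor names to a canonical form.
    DESCRIPTOR_SYNONYMS = {
        "flowering date": "flowering time",
        "anthesis": "flowering time",
        "height": "plant height",
    }

    def harmonize(descriptor, value, scale_max, target_max=9):
        """Canonicalize a descriptor name and rescale its score naively."""
        name = DESCRIPTOR_SYNONYMS.get(descriptor.lower(), descriptor.lower())
        return name, value * target_max / scale_max

    print(harmonize("Anthesis", 3, scale_max=5))  # ('flowering time', 5.4)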
The presentation will illustrate the approach followed within EURISCO,
together with the challenges arising from it. Intended as a basis for a
discussion about the utilization of this kind of data, the presentation
should be regarded as a call for cooperation.
Abstract | |
Trait data in biology can be extracted from text and structured for
reuse within and across taxa. For example, body length is one trait
applicable to many species, and "body length is about 170 cm" is one
trait data point for the human species. Trait data can be used in more
detailed analyses of species evolution and developmental processes, so
they have begun to be valued by more than taxonomists. The EOL
(Encyclopedia of Life) TraitBank provides an example of a trait
database.
Current trait databases are in their infancy. Most are based on
morphological data such as shape, color, and structural and sexual
characteristics. Yet other data, such as behavioral and biological
characteristics, could be included in trait databases in the same way.
To build a trait database we constructed a controlled vocabulary to
record the states of various terms. These terms tend to exhibit common
characteristics:
They can be grouped as conceptual (subject) and descriptive (delimiter)
terms. For example, in "the shoulder height is 65--70 cm", "shoulder
height" is the conceptual term and "65--70 cm" is the descriptive
term.
Conceptual terms may be part of an interdependent hierarchical
structure. Examples in morphology, physiology, and conservation or
protection status demonstrate how parts or systems may be broken into
smaller measurable (quantifiable) or enumerable pieces.
Descriptive terms modify or delimit the parameters of conceptual terms.
These may be numerical, with distinguishing units or counts, or
enumerable, expressed with adjectives or special nouns.
Although controlled vocabularies about animals are complex, they can be
normalized using the RDF (Resource Description Framework) and OWL (Web
Ontology Language) standards.
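As a rough illustration, a conceptual term and its descriptive term might be expressed as an RDF triple as follows (Python with rdflib; the namespace and property names are illustrative assumptions, not the project's actual vocabulary):

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/traits/")  # hypothetical namespace

    g = Graph()
    taxon = URIRef(EX["CanisLupusFamiliaris"])
    # Conceptual term (shoulder height) linked to its descriptive term.
    g.add((taxon, EX.shoulderHeight, Literal("65-70 cm")))

    print(g.serialize(format="turtle"))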
Next, we extract traits from two main types of existing descriptions:
tabular data, which is more easily digested by machines, and
descriptive text, which is complex.
Pure text often needs to be extracted manually or by natural language
processing (NLP); sometimes machine learning methods can be used.
Moreover, different human languages may demand different
extraction methods.
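A minimal sketch of rule-based extraction for simple measurement statements (Python; the pattern and unit list are toy assumptions, and real descriptive text demands far more robust NLP):

    import re

    # Toy pattern: "<conceptual term> is [about] <value or range> <unit>"
    PATTERN = re.compile(
        r"(?P<term>[\w ]+?) is (?:about )?(?P<value>[\d.]+(?:-[\d.]+)?) ?(?P<unit>cm|mm|m|kg|g)")

    text = "The shoulder height is 65-70 cm. The body mass is about 30 kg."
    for m in PATTERN.finditer(text):
        print(m.group("term").strip(), "=", m.group("value"), m.group("unit"))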
Because the number of recordable traits exceeds current collection
records, the database structure should be optimized for retrieval speed.
For this reason, key-value databases are more suitable than relational
databases for storing trait data. EOL uses Virtuoso, a non-relational
database, for TraitBank.
Using existing mature ontology tools and standards, we can construct
a preliminary workflow for animal trait data, but some tools and
specifications for data analysis and use must await further data
accumulation.
Abstract | |
The South African National Biodiversity Institute (SANBI) has initiated | |
the development of the National Biodiversity Information System to | |
provide access to integrated South African biodiversity information. The | |
aim of the project is to centrally manage all biodiversity information | |
to support researchers, conservationists, policy and decision-makers in | |
achieving their goals, support planners in making sensible decisions, | |
and help SANBI understand the anthropogenic impact on biodiversity. The | |
project is set to deliver a centralised web-based infrastructure to | |
capture, aggregate, manage, discover, analyse and visualise biodiversity | |
data and associated information through a suite of tools and spatial | |
layers. The infrastructure is a Microsoft technology stack with a
microservices component architecture
(http://microservices.io/patterns/microservices.html), which builds the
application out of small collaborating services that integrate into the
enterprise system.
SANBI conducted a review of the data holdings of the individual herbaria | |
and museums in South Africa. The intention is to have a federated
approach to data management, exposing what is available as a collection
while ensuring that each individual natural science collection has full
ownership and management control over its data, within a defined
framework and governed by internationally accepted data policies and
standards. The presentation highlights the opportunities and unexpected
difficulties with developing a national botanical and zoological | |
collections data management service in South Africa. | |
Abstract | |
The long-term lifecycle management of natural history data requires | |
careful planning. Elements that have a significant impact on this | |
planning include data quality, domain-specific requirements, and data | |
interoperability. Standards like Darwin Core (Wieczorek et al. 2012) are
built to be flexible, allowing institutions to share data quickly | |
without extensive modification of internal information management | |
processes. However, there is often limited consensus on the exact | |
meanings and use of key terms by various domains. If we want to increase | |
the quality, interoperability, and long-term health of collections data, | |
we must reassess how we record specimen data, paying special attention | |
to the terms we use and how we use them. | |
Here we share results from efforts to evaluate current data sharing | |
practices for data from paleontology collections. By analysing the use | |
of terms in Darwin Core, we are constructing a framework for how | |
paleontological data is shared, how terms are used across many | |
institutions, and where there are inconsistencies or lack of terms to | |
support a fully robust record. We have also used data quality assessment
and validation tools developed by organizations like the Global
Biodiversity Information Facility (GBIF) to test term-specific
requirements, addressing quality at a more global scale than any
locally driven data quality assessment might.
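A minimal sketch of the sort of term-usage tally behind such an assessment (Python with pandas; the file name and the records it holds are illustrative assumptions):

    import pandas as pd

    # Hypothetical Darwin Core occurrence export from one collection.
    df = pd.read_csv("occurrences.txt", sep="\t", dtype=str)

    # Share of records populating each Darwin Core term.
    fill_rates = df.notna().mean().sort_values()
    print(fill_rates)  # sparsely filled terms hint at gaps or inconsistent use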
These assessments can guide the development of a new framework for | |
sharing paleontological data, enabling the community to collaborate and | |
find solutions to increase quality and interoperability. Additionally, | |
individual institutions can utilize the framework to enhance long-term | |
care of digital assets with global participation in mind. | |
Abstract | |
Since the Nagoya Protocol on Access to Genetic Resources and Benefit
Sharing (ABS) came into force in 2014, the conservation and safeguarding
of national biodiversity have been internationally stressed. The
Government of South Korea is making significant efforts to integrate and
manage the information pertaining to biological resources in line with
this
global trend. However, connecting and sharing biodiversity data has | |
certain challenges because the existing databases and information | |
systems are being operated using different standards. | |
In the present study, we established an integrated management system for | |
freshwater biodiversity information, the Freshwater Biodiversity | |
Platform (FBP), to support the conservation and sustainable use of | |
biodiversity. This platform allows the management of various types of | |
biodiversity data, such as occurrences, habitats and genetics, for | |
freshwater species inhabiting South Korea. The data fields are based on | |
a global biodiversity data standard, Darwin Core, and national | |
biodiversity standards of South Korea in order to share our data more | |
efficiently, both nationally and internationally. It is important to | |
note that the platform deals with information related to the utilization | |
of biological resources as well as information representing the national | |
biodiversity. We have collected bibliographical data, such as papers and | |
patents, from databases, including information on the use of biological | |
resources. The data have been refined by applying a national species | |
list of South Korea and ontology terms from the Medical Subject Headings
(MeSH) to compile valuable
information for biological industries. Furthermore, our platform is open
source and is compatible with multiple language packs to facilitate the | |
availability of biodiversity data for other countries and institutions. | |
Currently, the Freshwater Biodiversity Platform is being used to collect | |
and standardize various types of existing freshwater biodiversity data | |
to build foundations for data management. Based on these data, we will | |
improve the platform by adding new systems that can analyze and release | |
data for public access. This platform will provide integrated | |
information on freshwater species from the Korean Peninsula to the world | |
and contribute to the conservation and sustainable use of biological | |
resources. | |
Abstract | |
Freshwater biodiversity is critically understudied in Rwanda, and to | |
date there has not been an efficient mechanism to integrate freshwater | |
biodiversity information or make it accessible to decision-makers, | |
researchers, private sector or communities, where it is needed for | |
planning, management and the implementation of the National Biodiversity | |
Strategy and Action Plan (NBSAP). A framework to capture and distribute | |
freshwater biodiversity data is crucial to understanding how economic | |
transformation and environmental change are affecting freshwater
biodiversity and the resulting ecosystem services. To optimize conservation
efforts for freshwater ecosystems, detailed information is needed | |
regarding current and historical species distributions and abundances | |
across the landscape. From these data, specific conservation concerns | |
can be identified, analyzed and prioritized. | |
The purpose of this project is to establish and implement a long-term | |
strategy for freshwater biodiversity data mobilization, sharing, | |
processing and reporting in Rwanda. The expected outcome of the project | |
is to support the mandates of the Rwanda Environment Management | |
Authority (REMA), the national agency in charge of environmental | |
monitoring and the implementation of Rwanda's NBSAP, and the Center of | |
Excellence in Biodiversity and Natural Resources Management (CoEB). The | |
project also aligns with the mission of the Albertine Rift Conservation | |
Society (ARCOS) to enhance sustainable management of natural resources | |
in the Albertine rift region. Specifically, organizational structures,
technology platforms, and workflows for biodiversity data capture and
mobilization are being enhanced to promote data availability and
accessibility, improve Rwanda's NBSAP, and support other
decision-making processes. The project is enhancing the capacity of
technical staff from relevant government and non-government institutions | |
in biodiversity informatics, strengthening the capacity of CoEB to | |
achieve its mission as the Rwandan national biodiversity knowledge | |
management center. Twelve institutions have been identified as data | |
holders, and the digitization of these data using Darwin Core standards
is in progress, along with data cleaning for publication
through the ARCOS Biodiversity Information System
(http://arbmis.arcosnetwork.org/). The release of the first national | |
State of Freshwater Biodiversity Report is the next step. CoEB is a | |
registered publisher to the Global Biodiversity Information Facility | |
(GBIF) and holds an Integrated Publishing Toolkit (IPT) account on the | |
ARCOS portal. This project was developed for the African Biodiversity | |
Challenge, a competition coordinated by the South African National | |
Biodiversity Institute (SANBI) and funded by the JRS Biodiversity | |
Foundation which supports on-going efforts to enhance the biodiversity | |
information management activities of the GBIF Africa network. This | |
project also aligns with SANBI's Regional Engagement Strategy, and | |
endeavors to strengthen both emerging biodiversity informatics networks | |
and data management capacity on the continent in support of sustainable | |
development. | |
Abstract | |
As a national center for managing biological data, the Korean | |
Bioinformation Center (KOBIC) provides capabilities and resources to | |
manage and standardize the explosively growing amount of biological data | |
from national Research and Development grants by developing a systematic | |
and integrative approach. The biological data includes biological | |
material resource, genome, and biodiversity data, such as observation, | |
collection, taxonomy, character, and genome information of living | |
organisms. The Korean government enacted legislation on the
collection, management and utilization of biological data in 2009 and,
as a follow-up, KOBIC has undertaken the mission to collect and | |
integrate the scattered biological data in Korea. We first created a
biological data format for exchanging data between government agencies.
We then developed the Korean Bio-resource Information System (KOBIS), an
integrated information system for the efficient acquisition and
systematic management of biological data. KOBIS contains
more than 109,000 species and 12.1 million occurrence records from 107 | |
collaborating institutions from four ministries. KOBIS establishes a
catalog of scientific names by linking species information across
ministries. Its main function is integrated information search, whose
results include character information, bibliographic information,
electronic books, DNA classifications, gene information, photographic
images, and research achievements. We will continue to
focus our efforts on managing KOBIS to facilitate information sharing,
distribution, and services for mining biological data.
KOBIS is available at http://www.kobis.re.kr. | |
Abstract | |
Primary biodiversity data, or occurrence data, are being produced at an | |
increasing rate and are used in numerous studies (Hampton et al. 2013, | |
La Salle et al. 2016). This data avalanche is a remarkable opportunity | |
but it comes with hurdles. First, available software solutions are rare | |
for very large datasets and those solutions often require significant | |
computer skills (Gaiji et al. 2013), while most biologists are not | |
formally trained in bioinformatics (List et al. 2017). Second, large | |
datasets are heterogeneous because they come from different producers | |
and they can contain erroneous data (Gaiji et al. 2013). Hence, they | |
need to be curated. In this context, we developed a biodiversity | |
occurrence curator designed to quickly handle large amounts of data | |
through a simple interface: the Darwin Core Spatial Processor (DwCSP). | |
DwCSP does not require the installation or use of third-party software | |
and has a simple graphical user interface that requires no computer | |
knowledge. DwCSP allows for the data enrichment of biodiversity | |
occurrences and also ensures data quality through outlier detection. For | |
example, the software can enrich a tabulated occurrence file (Darwin
Core, for instance) with spatial data from polygon files (e.g., Esri
shapefiles) or raster files (GeoTIFF). The speed of the enrichment
procedures is ensured through multithreading and optimized spatial | |
access methods (R-Tree indexes). DwCSP can also detect and tag outliers | |
based on their geographic coordinates or environmental variables. The | |
first type of outlier detection uses a computed distance between the | |
occurrence and its nearest neighbors, whereas the second type uses a | |
Mahalanobis distance (Mahalanobis 1936). One hundred thousand | |
occurrences can be processed by DwCSP in less than 20 minutes and | |
another test on forty million occurrences was completed in a few days on | |
a recent personal computer. DwCSP has an English interface and
documentation, and will be available as a stand-alone Java Archive (JAR)
executable that runs on any computer with a Java environment
(version 1.8 onward).
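A minimal sketch of the environmental-outlier idea described above (Python with NumPy; DwCSP itself is Java, and these toy values are illustrative assumptions):

    import numpy as np

    # Rows: occurrences; columns: environmental variables (e.g. temperature, rainfall).
    env = np.array([[12.1, 800], [11.8, 760], [12.5, 820], [30.0, 100]])

    mean = env.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(env, rowvar=False))
    diff = env - mean
    # Mahalanobis distance of each occurrence from the multivariate mean.
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

    print(d)  # the last record stands out with the largest distance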
Abstract | |
Museum-preserved samples are attracting attention as a rich resource for | |
DNA studies. Museomics aims to link DNA sequence data back to the museum | |
collection. Molecular biologists are interested in morphological
information, including body size, pattern, and color, while sequence
data have also become essential for biodiversity research as evidence
for species identification and phylogenetic analysis.
For more than 30 years, molecular data, such as DNA and protein | |
sequences, have been captured by the DNA Data Bank of Japan (DDBJ), the | |
European Bioinformatics Institute (EBI, UK), and the National Center for | |
Biotechnology Information (NCBI, US) under the International Nucleotide | |
Sequence Database Collaboration (INSDC). INSDC provides collected | |
molecular data to researchers as public databases including GenBank for | |
DNA sequences and Gene Expression Omnibus (GEO) for gene expression. | |
These three institutes synchronize archived data and publish all data on | |
an FTP (File Transfer Protocol) site so that it is available for big | |
data analysis. | |
In recent years, high-throughput sequencing technology, also called | |
next-generation sequencing (NGS) technology, has been widely utilized | |
for molecular biology including genomics, transcriptomics, and | |
metagenomics. Biodiversity researchers also focus on NGS data for DNA | |
barcoding and phylogenetic analysis as well as molecular biology. | |
Additionally, a portable NGS platform, MinION (Oxford Nanopore | |
Technologies), has been launched, enabling biodiversity researchers to | |
perform DNA sequencing in the field. Along with GenBank and GEO data, | |
INSDC accepts NGS data and provides a public primary database, called | |
the Sequence Read Archive (SRA). As of March 2018, 6.4 petabases of NGS
data are freely available under more than 130,000 projects in SRA. The
Database Center for Life Science (DBCLS) provides a search engine for | |
public NGS data, called DBCLS SRA (http://sra.dbcls.jp/) in | |
collaboration with DDBJ. SRA contains not only raw sequence reads and
processed data mapped to genomes, but also information on the
experimental design, including project types, sequencing platforms, and | |
sample species. Researchers can use this data to refine their search | |
results. We also linked publications referring to NGS data to the | |
corresponding SRA entries. | |
The mission of DBCLS is to accelerate the accessibility of life science | |
data. Collected data used to be described in Excel-readable tabular
formats, but such formats are difficult to merge with other databases
because of the ambiguity of their labels. To overcome this difficulty, we
recently integrated life science data with Semantic Web technology. We | |
held annual meetings to integrate life science data, called | |
BioHackathons, in which researchers from all over the world | |
participated. UniProt and Ensembl databases currently provide an RDF | |
(Resource Description Framework) version of curated genome and protein | |
data, respectively. In the biodiversity domain, there are many databases | |
such as GBIF (The Global Biodiversity Information Facility) for species | |
occurrence records, EoL (The Encyclopedia of Life) as a knowledge base | |
of all species, and BoL (The Barcode of Life) for DNA barcoding data. | |
RDF is utilized to describe Darwin Core-based data so that
bioinformatics and biodiversity informatics researchers can technically
merge both types of data. Currently, however, specimen data and DNA
sequence data are not linked. Museomics starts with cross-referencing
specimen and sequence IDs and with making data sources comply with
existing standards.
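A minimal sketch of such a cross-reference expressed in RDF with Darwin Core terms (Python with rdflib; the identifiers are illustrative assumptions):

    from rdflib import Graph, Namespace, URIRef

    DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

    g = Graph()
    specimen = URIRef("http://example.org/specimen/DEMO-12345")  # hypothetical ID
    # Link the specimen record to its INSDC sequence accession.
    g.add((specimen, DWC.associatedSequences,
           URIRef("https://www.ncbi.nlm.nih.gov/nuccore/AB123456")))

    print(g.serialize(format="turtle"))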
Abstract | |
The Eastern Highlands of Zimbabwe is a biodiversity hotspot that forms | |
part of the Eastern Afromontane region, which has seen an increase in | |
human activities such as agriculture, illegal mining, and introduction | |
of invasive species. These anthropogenic activities have had negative | |
environmental consequences including land degradation and water | |
pollution, which have negatively impacted on the quality of aquatic | |
habitats and biodiversity in the region. The region harbours several | |
freshwater species of conservation interest whose numbers and | |
distribution are little known. We also do not know the impacts of the | |
ongoing human activities and threats on the local wetland biodiversity | |
and the integrity of the ecosystem in the region. The relevant data on
wetland biodiversity from previous studies and surveys are also not
readily available to guide policies and conservation efforts in this
region.
With the aid of the Biodiversity Information for Development (BID) | |
program sponsored by the Global Biodiversity Information Facility (GBIF) | |
and the European Union (EU), a project titled 'Freshwater Biodiversity
of the Eastern Highlands of Zimbabwe: Assessing Conservation Priorities
Using Primary Species-Occurrence Data' has mobilized and digitized over
2,000 occurrence records on freshwater biodiversity, with a focus on | |
fish, invertebrates, amphibians and bird species in the region, since | |
October 2017. The project also makes use of biodiversity informatics | |
tools such as ecological niche modelling, to identify the important | |
sites for conservation of the freshwater biodiversity in this region. | |
The outputs will help to show policy makers, wildlife managers, | |
researchers and conservationists where to target resources and | |
conservation efforts. This will also help protect the biodiversity that
still exists in the unprotected wetlands of the Eastern Highlands of
Zimbabwe and that could be lost to human activities such as clearing for
agriculture.
Abstract | |
Recognizing the abundance and accumulation of information and data on
biodiversity that remain poorly exploited and even unfunded, the
REBIOMA project (Madagascar Biodiversity Networking), in collaboration
with partners, has developed an online data portal to provide easy
access to critical information and data, supporting conservation
planning and the expansion of scientific and professional activity on
Madagascar's biodiversity.
The mission of the REBIOMA data portal is to serve quality-labeled, | |
up-to-date species occurrence data and environmental niche models for | |
Madagascar's flora and fauna, both marine and terrestrial. REBIOMA is a | |
project of the Wildlife Conservation Society Madagascar and the | |
University of California, Berkeley. | |
REBIOMA serves species occurrence data for marine and terrestrial | |
regions of Madagascar. Following upload, data is automatically validated | |
against a geographic mask and a taxonomic authority. Data providers can | |
decide whether their data will be public, private, or shared only with | |
selected collaborators. Data reviewers can add quality labels to | |
individual records, allowing selection of data for modeling and | |
conservation assessments according to quality. Portal users can query | |
data in numerous ways. | |
One of the key features of the REBIOMA web portal is its support for | |
species distribution models, created from taxonomically valid and | |
quality-reviewed occurrence data. Species distribution models are | |
produced for species for which there are at least eight reliably
reviewed, non-duplicate (per grid cell) records. Maximum Entropy
Modeling (MaxEnt for short) is used to produce continuous distribution | |
models from these occurrence records and environmental data for | |
different eras: past (1950), current (2000), and future (2080). The | |
result is generally interpreted as a prediction of habitat suitability. | |
Results for each model are available on the portal and ready for | |
download as ASCII and HTML files. | |
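A minimal sketch of the record-eligibility rule described above (Python; the grid resolution and the records are illustrative assumptions):

    from collections import defaultdict

    # (species, longitude, latitude) occurrences; toy data.
    records = [("Lemur catta", 46.51, -23.12), ("Lemur catta", 46.52, -23.11),
               ("Lemur catta", 46.51, -23.12)]  # third duplicates the first cell

    CELL = 0.01  # hypothetical grid resolution in degrees

    cells = defaultdict(set)
    for species, lon, lat in records:
        cells[species].add((round(lon / CELL), round(lat / CELL)))

    eligible = [sp for sp, c in cells.items() if len(c) >= 8]
    print(eligible)  # empty here: only 2 unique cells, fewer than the 8 required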
The REBIOMA Data Portal address is http://data.rebioma.net, or visit | |
http://www.rebioma.net for more general information about the entire | |
REBIOMA project. | |
Abstract | |
Herbaria in Taiwan face critical data challenges: | |
Different taxonomic views prevent data exchange; | |
There is a lack of development practices to keep up with standard and | |
technological advances; | |
Data are disconnected from researchers' perspectives, so it is difficult
to demonstrate the value of taxonomists' activities, even though a few
herbaria have their specimen catalogues partially exposed in Darwin Core.
In consultation with the Herbarium of the Taiwan Forestry Research | |
Institute (TAIF), the Herbarium of the National Taiwan University (TAI) | |
and the Herbarium of the Biodiversity Research Center, Academia Sinica | |
(HAST), which together host the most important collections of the
island's vegetation, we have planned the following activities to address
the data challenges:
Investigate a new data model for scientific names that will accommodate | |
different taxonomic views and create a web service for access to | |
taxonomic data; | |
Refactor existing herbarium systems to utilize the aforementioned | |
service so the three herbaria can share and maintain a standardized name | |
database; | |
Create a layer of Application Programming Interface (API) to allow | |
multiple types of accessing devices; | |
Conduct behavioral research regarding various personas engaged in the | |
curatorial workflow; | |
Create a unified front-end that supports data management, data | |
discovery, and data analysis activities with user experience | |
improvements. | |
To manage these developments at various levels, while maximizing the | |
contribution of participating parties, it is crucial to use a proven | |
methodological framework. As the creative industry has led the way in
solution development, the concept of design thinking and the design
thinking process (Brown and Katz 2009) came to our attention.
Design thinking is a systematic approach to handling problems and | |
generating new opportunities (Pal 2016). From requirement capture to | |
actual implementation, it helps consolidate ideas and identify agreed-on | |
key priorities by constantly iterating through a series of interactive | |
divergence and convergence steps, namely the following: | |
Empathize: A divergent step. We learn about our audience, which in this | |
case includes curators and visitors of the herbarium systems, about what | |
they do and how they interact with the system, and collate our findings. | |
Define: A convergent step. We construct a point of view based on | |
audience needs. | |
Ideate: A divergent step. We brainstorm and come up with creative | |
solutions, which might be novel or based on existing practice. | |
Prototype: A convergent step. We build representations of the chosen | |
idea from the previous step. | |
Test: Use the prototype to test whether the idea works. Then refine from
step 3 if the problems lay with the prototype, or even from step 1 if
the point of view needs to be revisited.
The benefits of adopting this process are:
Instead of "design for you", we "design together", which strengthens the | |
sense of community and helps the communication of what the revision and | |
refactoring will achieve; | |
When put in context, increased awareness and understanding of | |
biodiversity data standards, such as Darwin Core (DwC) and Access to | |
Biological Collections Data (ABCD); | |
As we lend the responsibility of process control to an external
facilitator, we are able to focus on each step as participants.
We illustrate how the planned activities are conducted through these
five iterative steps.
Abstract | |
GBIF Benin, hosted at the University of Abomey-Calavi, has published | |
more than 338,000 occurrence records in 87 datasets and checklists. It | |
has been a Global Biodiversity Information Facility (GBIF) node since | |
2004 and is a leader in several projects from the Biodiversity | |
Information for Development (BID) programme. | |
GBIF facilitates collaboration between nodes at different levels through | |
its Capacity Enhancement Support Programme (CESP,
https://www.gbif.org/programme/82219/capacity-enhancement-support-programme).
One of the actions included in the CESP guidelines is called 'Mentoring | |
activities'. Its main goal is the transfer of knowledge between partners | |
such as information, technologies, experience, and best practices. | |
Sharing architecture and development effort is a key way to address some
technical challenges or impediments (hosting, staff turnover, etc.) that
GBIF nodes may face. The Atlas of Living Australia (ALA) team developed
a feature called a 'data hub', which makes it possible to create a
standalone website with a dedicated occurrence search engine that covers
a defined range of data (e.g. a specific genus or geographic area).
In 2017, GBIF Benin and GBIF France wanted to strengthen their | |
partnership and started a CESP project. One of the core objectives of | |
this project is the creation of the Atlas of Living Benin using ALA | |
modules. GBIF France developers, with the help of the GBIF Benin team, | |
are in the process of configuring a data hub that will give access to | |
Beninese data only, while at the same time Atlas of Living France will | |
give access to French data only. Both data portals will use the same
back end and therefore the same databases. Benin is the first African GBIF
node to implement this kind of infrastructure. | |
On this poster, we will present the specific architecture of the Atlas
of Living Benin and how we have managed to distinguish data coming from
Benin from data coming from France.
Abstract | |
The existing web representation of the Flora of North America (FNA) | |
project needs improvement. Despite being electronically available, it | |
has little more functionality than its printed counterpart. Over the | |
past few years, our team has been working diligently to build a new,
more effective online presence for the FNA. The main objective is to
capitalize on modern Natural Language Processing (NLP) tools built for | |
biodiversity data (Explorer of Taxon Concepts or ETC; Cui et al. 2016), | |
and present the FNA online in both machine and human readable formats. | |
With machine-comprehensible data, the mobilization and usability of | |
flora treatments is enhanced and capabilities for data linkage to a | |
Biodiversity Knowledge Graph (Page 2016) are enabled. For example, | |
usability of treatments increases when morphological statements are | |
parsed into finely grained pieces of data using ETC, because these data | |
can be easily traversed across taxonomic groups to reveal trends. | |
Additionally, the development of new features in our online FNA is | |
facilitated by FNA data parsing and processing in ETC, including a | |
feature to enable users to explore all treatments and illustrations | |
generated by an author of interest. The current status of the ongoing | |
project to develop a Semantic MediaWiki (SMW) platform for the FNA is | |
presented here. New features recently implemented are introduced, | |
challenges in assembling the Semantic MediaWiki are discussed, and | |
future opportunities, which include the integration of additional floras | |
and data sources, are explored. Furthermore, the implications for the
standardization of taxonomic treatments that work such as this entails
will be discussed.
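For instance, a parsed morphological statement might be represented in fine-grained form roughly as follows (an illustrative sketch, not ETC's actual output schema):

    # Hypothetical fine-grained parse of "Leaves ovate, 3-5 cm long."
    parsed = [
        {"organ": "leaf", "character": "shape", "value": "ovate"},
        {"organ": "leaf", "character": "length", "value": "3-5", "unit": "cm"},
    ]

    # Such records can be queried across taxa, e.g. for all taxa with ovate leaves.
    print([p for p in parsed if p.get("value") == "ovate"])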
Abstract | |
In 2015, the global biodiversity information initiatives Biodiversity | |
Heritage Library (BHL), Barcode of Life Data systems (BoLD), Catalogue | |
of Life (CoL), Encyclopedia of Life (EOL), and the Global Biodiversity | |
Information Facility (GBIF) took the first step to work on the idea for | |
building a single shared authoritative nomenclature and taxonomic | |
foundation that could be used as a backbone to order and connect | |
biodiversity data across various domains. At present, the Catalogue of | |
Life is being used by BHL, BoLD, EOL, and GBIF, but each extends the CoL
with additional data to meet its specific backbone service requirements.
The goal of the CoL+ project is to innovate the CoL systems by | |
developing a new information technology infrastructure that includes | |
both the current Catalogue of Life and a provisional Catalogue of Life | |
(replacing the current GBIF backbone taxonomy), separates scientific | |
names and taxonomic concepts with associated unique identifiers, and | |
provides some (infrastructural) support for taxonomic and nomenclatural | |
content authorities to finish their work. The project's specific | |
objectives are to | |
establish a clearinghouse covering scientific names across all life; | |
provide a single taxonomic view grounded in the consensus classification | |
of the Catalogue of Life along with candidate taxonomic sources, show | |
differences between sources, and provide an avenue for feedback to | |
content authorities while allowing the broader community to contribute, | |
and | |
establish a partnership and governance, allowing a continuing commitment | |
after the project's end for a clearinghouse infrastructure and its | |
associated components, including a roadmap for future developments of | |
the infrastructure. | |
As a result of the project, we expect to have a shared information space
for names and taxonomy connecting the Catalogue of Life, nomenclator
content authorities (e.g. IPNI, ZooBank) and several global biodiversity
information initiatives.
Abstract | |
The 3i World Auchenorrhyncha database (http://dmitriev.speciesfile.org) | |
is being migrated into TaxonWorks (http://taxonworks.org) and comprises | |
nomenclatural data for all known Auchenorrhyncha taxa (leafhoppers, | |
planthoppers, treehoppers, cicadas, spittle bugs). Of all those | |
scientific names, 8,700 are unique genus-group names (which include | |
valid genera and subgenera as well as their synonyms). According to the | |
Rules of Zoological Nomenclature, a properly formed species-group name | |
when combined with a genus-group name must agree with the latter in | |
gender if the species-group name is or ends with a Latin or Latinized | |
adjective or participle. This poses a double challenge for researchers
describing new taxa or citing existing ones. For each species-group
name, knowledge of its part of speech is essential (nouns do not change
their form when associated with different generic names); for each
genus-group name, knowledge of its gender is essential.
Every time the species is transferred from one genus to another, its | |
ending may need to be transformed to make a proper new scientific name | |
(a binominal name). In modern practice, it is important, when
establishing a new name, to provide information about its etymology and
the way it should be used in future publications: the grammatical gender
for a genus, and the part of speech for a species. Older names often do
not provide enough information about their etymology for the proper
construction of scientific names. That is why, in the literature, we can
find numerous cases where a scientific name is not formed in conformity
with the Rules of Nomenclature. An attempt was
made to resolve the etymology of the generic names in Auchenorrhyncha to
made to resolve the etymology of the generic names in Auchenorrhyncha to | |
unify and clarify nomenclatural issues in this group of insects. In | |
TaxonWorks, the rules of nomenclature are defined using the NOMEN
ontology (https://github.com/SpeciesFileGroup/nomen).
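A minimal sketch of the ending transformation involved (Python; the suffix table is a drastic simplification of Latin grammar, far short of the rules the database actually encodes):

    # Toy adjectival endings by grammatical gender (first/second declension only).
    ENDINGS = {"masculine": "us", "feminine": "a", "neuter": "um"}

    def agree(epithet, genus_gender):
        """Re-form a Latin adjectival epithet to agree with the genus gender."""
        stem = epithet
        for suffix in ENDINGS.values():
            if epithet.endswith(suffix):
                stem = epithet[: -len(suffix)]
                break
        return stem + ENDINGS[genus_gender]

    # Transferring the epithet 'albus' into a feminine genus yields 'alba'.
    print(agree("albus", "feminine"))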
Abstract | |
Compilation and retrieval of reliable data on biological interactions is | |
one of the critical bottlenecks affecting efficiency and statistical | |
power in testing ecological theories. TaxonWorks, a web-based workbench, | |
can facilitate such research by enabling the digitization of complex | |
biological interactions involving multiple species, individuals, and | |
trophic levels. These data can be further organized into spatial and | |
temporal axes, and annotated at the level of individual or grouped | |
interactions (e.g. singularly citing the combined elements of a | |
tritrophic interaction). The simple, customizable nature of these tools
ultimately reduces the time-consuming steps of gathering, cleaning,
and formatting datasets for subsequent exploration and analysis while
also improving the asserted semantics. | |
An example use case is provided with a dataset of associations among | |
plants, pathogens and insect vectors. The curated data are accessed
through the JSON-serving TaxonWorks API (Application Programming
Interface) by an R package. Analysis and visualization of the network
graphs persisted in TaxonWorks is demonstrated using core R | |
functionality and the igraph package (Csardi and Nepusz 2006). | |
TaxonWorks is open-source, collaboratively built software available at | |
http://taxonworks.org. | |
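A rough sketch of the retrieve-and-analyse pattern (here in Python with requests and networkx standing in for the R/igraph workflow described above; the endpoint path, parameters, and field names are illustrative assumptions, not the documented TaxonWorks API):

    import networkx as nx
    import requests

    # Hypothetical endpoint returning biological associations as JSON.
    resp = requests.get("https://example.org/api/v1/biological_associations",
                        params={"project_token": "DEMO"})

    G = nx.DiGraph()
    for assoc in resp.json():
        # Each association links a subject taxon to an object taxon.
        G.add_edge(assoc["subject"], assoc["object"], kind=assoc.get("relationship"))

    print(G.number_of_nodes(), G.number_of_edges())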
Abstract | |
As part of the Biodiversity Information System on Nature and Landscapes | |
(Système d'Informations Nature et Paysages or SINP), the French
National Natural History Museum has been appointed by the French | |
ministry in charge of ecology to develop mechanisms for biodiversity | |
data exchange, especially taxon occurrences (there are also elements on | |
habitat occurrences, geo-heritage, etc.). Given that there are thousands | |
of different sources for datasets, containing over 42 million records, | |
such a development brings into question the underlying quality of data. | |
To add complexity, there can be several layers of quality assurance: one | |
by the producer of the data, one by a regional node, and another one by | |
the national node. | |
The approach to quality issues was addressed by a dedicated working | |
group, representative of biodiversity stakeholders in France. The | |
resulting documents focus on core methodology elements that characterize | |
a data quality process for, in the first instance, taxon occurrences | |
only. It may be extended to habitats, geology, etc. in the near future. | |
For scientific validation, two processes are used: | |
One automated process that uses expertise upstream (automated validation | |
based on previous databases created through the use of said expertise), | |
with several criteria such as comparison with a national taxonomic | |
reference database (TAXREF), and with species reference distributions. | |
The outcomes of this process will indicate error potential and can be | |
used to automatically flag data above a certain threshold for the | |
following process. | |
A second, manual process that allows for further scrutiny in order to
reach a conclusive evaluation.
The combination of both processes allows experts to focus on data that | |
has a higher likelihood of being erroneous, thus saving time and | |
resources. | |
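A minimal sketch of the automated flagging step (Python; the criteria, toy reference data, and threshold are illustrative assumptions, not the SINP's actual rules):

    # Hypothetical record and reference extracts.
    record = {"taxon": "Lutra lutra", "dept": "75"}
    TAXREF_NAMES = {"Lutra lutra", "Canis lupus"}       # toy TAXREF extract
    KNOWN_RANGE = {"Lutra lutra": {"29", "56", "64"}}   # departments with known presence

    score = 0
    if record["taxon"] not in TAXREF_NAMES:
        score += 2  # unknown name: strong error signal
    if record["dept"] not in KNOWN_RANGE.get(record["taxon"], set()):
        score += 1  # outside reference distribution: weaker signal

    if score > 1:  # arbitrary cutoff
        print("flag for manual validation")
    # Here score == 1, so the record would not be flagged automatically.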
One objective of the INPN (Inventaire National du Patrimoine Naturel, or | |
National Inventory of Natural Heritage), after one or both approaches, | |
is to have each record assigned a confidence level. | |
The poster will present the national scientific validation of data in
the SINP. It will show for whom and why it is done, whether the
expertise lies upstream (automated validation) or downstream (manual
validation through expert networks), what documents exist, and what
attributes have been proposed for addition to the national standards so
as to convey the information derived from these processes.
Abstract | |
Web portals are commonly used to expose and share scientific data. They | |
enable end users to find, organize and obtain data relevant to their | |
interests. With the continuous growth of data across all science | |
domains, researchers commonly find themselves overwhelmed as finding, | |
retrieving and making sense of data becomes increasingly difficult. | |
Search engines can help find relevant websites, but the short summaries
they provide in results lists often say little about how relevant a
website actually is to a given research interest.
To yield better results, a strategy adopted by Google, Yahoo, Yandex and | |
Bing involves consuming structured content that they extract from | |
websites. Towards this end, the schema.org collaborative community | |
defines vocabularies covering common entities and relationships (e.g., | |
events, organizations, creative works) (Guha et al. 2016). Websites can | |
leverage these vocabularies to embed semantic annotations within web | |
pages, in the form of markup using standard formats. Search engines, in | |
turn, exploit semantic markup to enhance the ranking of most relevant | |
resources while providing more informative and accurate summarization. | |
Additionally, adding such rich metadata is a step forward to make data | |
FAIR, i.e. Findable, Accessible, Interoperable and Reusable. | |
Although schema.org encompasses terms related to data repositories, | |
datasets, citations, events, etc., it lacks specialized terms for | |
modeling research entities. The Bioschemas community (Garcia et al. | |
2017) aims to extend schema.org to support markup for Life Sciences | |
websites. A major pillar lies in reusing types from schema.org as well | |
as well-adopted domain ontologies, while only proposing a limited set of | |
new types. The goal is to enable semantic cross-linking between | |
knowledge graphs extracted from marked-up websites. An overview of the | |
main types is presented in Fig. 1. Bioschemas also provides profiles | |
that specify how to describe an entity of some type. For instance, the | |
protein profile requires a unique identifier, recommends to list | |
transcribed genes and associated diseases, and points to recommended | |
terms from the Protein Ontology and Semantic Science Integrated | |
Ontology. | |
The success of schema.org lies in its simplicity and the support by | |
major search engines. By extending schema.org, Bioschemas enables life | |
sciences research communities to benefit from a lightweight semantic | |
layer on websites and thus facilitates discoverability and | |
interoperability across them. From an initial pilot including just a few | |
bio-types such as proteins and samples, the Bioschemas community has | |
grown and is now opening up towards other disciplines. The biodiversity | |
domain is a promising candidate for such further extensions. We can | |
think of additional profiles to account for biodiversity-related | |
information. For instance, since taxonomic registers are the backbone of | |
many web portals and databases, new profiles could describe taxa and | |
scientific names while reusing well-adopted vocabularies such as Darwin | |
Core terms (Baskauf et al. 2016) or TDWG ontologies (TDWG Vocabulary | |
Management Task Group 2013). Fostering the use of such markup by web | |
portals reporting traits, observations or museum collections could not | |
only improve information discovery using search engines, but could also | |
be a key to spur large-scale biodiversity data integration scenarios. | |
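As a rough illustration, taxon markup might look as follows (sketched here as a Python structure; Taxon, taxonRank and parentTaxon follow draft schema.org/Bioschemas usage, and the exact property set is an assumption, not a published profile):

    import json

    # Hypothetical JSON-LD a taxon page could embed for search engines.
    taxon_markup = {
        "@context": "https://schema.org",
        "@type": "Taxon",
        "name": "Lutra lutra (Linnaeus, 1758)",
        "taxonRank": "species",
        "parentTaxon": {"@type": "Taxon", "name": "Lutra"},
    }

    print(json.dumps(taxon_markup, indent=2))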
Abstract | |
BIOfid is a specialized information service currently being developed to | |
mobilize biodiversity data dormant in printed historical and modern | |
literature and to offer a platform for open access journals on the | |
science of biodiversity. Our team of librarians, computer scientists and | |
biologists produce high-quality text digitizations, develop new | |
text-mining tools and generate detailed ontologies enabling semantic | |
text analysis and semantic search by means of user-specific queries. In | |
a pilot project we focus on German publications on the distribution and | |
ecology of vascular plants, birds, moths and butterflies extending back | |
to the Linnaeus period about 250 years ago. The three organism groups | |
have been selected according to current demands of the relevant research | |
community in Germany. The text corpus defined for this purpose comprises | |
over 400 volumes with more than 100,000 pages to be digitized and will | |
be complemented by journals from other digitization projects, | |
copyright-free and project-related literature. With TextImager (Natural | |
Language Processing & Text Visualization) and TextAnnotator (Discourse | |
Semantic Annotation) we have already extended and launched tools that | |
focus on the text-analytical section of our project. Furthermore, | |
taxonomic and anatomical ontologies elaborated by us for the taxa | |
prioritized by the project's target group - German institutions and | |
scientists active in biodiversity research - are constantly improved and | |
expanded to maximize scientific data output. Our poster describes the
general workflow of our project, ranging from literature acquisition,
via software development, to data availability on the BIOfid web portal
(http://biofid.de/) and implementation into existing platforms that
serve to promote global accessibility of biodiversity data.
Abstract | |
A new R package for biodiversity data cleaning, 'bdclean', was
initiated in the Google Summer of Code (GSoC) 2017 and is available on
GitHub. Several R packages have great data validation and cleaning
functions, but 'bdclean' provides features to manage a complete
pipeline for biodiversity data cleaning, from data quality explorations
to cleaning procedures and reporting. Users are able to go through the
quality control process in a very structured, intuitive, and effective
way. A modular approach to data cleaning functionality should make this | |
package extensible for many biodiversity data cleaning needs. Under GSoC
2018, 'bdclean' will go through a comprehensive upgrade. New features
will be highlighted in the demonstration.
Abstract | |
TaxonWorks (http://taxonworks.org) is an integrated workbench for | |
taxonomists and biodiversity scientists. It is designed to capture, | |
organize, and enrich data, share and refine it with collaborators, and | |
package it for analysis and publication. It is built on PostgreSQL
(database) and Ruby on Rails, a web application framework
(https://github.com/SpeciesFileGroup/taxonworks). The TaxonWorks
community is built around an open software ecosystem that facilitates | |
participation at many levels. TaxonWorks is designed to serve both | |
researchers who create and curate the data, as well as technical users, | |
such as programmers and informatics specialists, who act as data | |
consumers. TaxonWorks provides researchers with robust, user friendly | |
interfaces based on well thought out customized workflows for efficient | |
and validated data entry. It provides technical users database access | |
through an application programming interface (API) that serves data in | |
JSON format. The data model covers nearly all classes of data recorded
in modern taxonomic treatments and primary studies of biodiversity,
including nomenclature, bibliographies, specimens and collecting events,
phylogenetic matrices, species descriptions, etc.
The nomenclatural classes are based on the NOMEN ontology | |
(https://github.com/SpeciesFileGroup/nomen).
Abstract | |
Providing data in a semantically structured format has become the gold | |
standard in data science. However, a significant amount of data is still | |
provided as unstructured text - either because it is legacy data or | |
because adequate tools for storing and disseminating data in a | |
semantically structured format are still missing. We have developed a | |
description module for Morph∙D∙Base, a semantic knowledge base for | |
taxonomic and morphologic data, that enables users to generate highly | |
standardized and formalized descriptions of anatomical entities using | |
free text and ontology-based descriptions. The main organizational | |
backbone of a description in Morph∙D∙Base is a partonomy, to which the | |
user adds all the anatomical entities of the specimen that they want to | |
describe. Each element of this partonomy is an instance of an ontology | |
class and can be further described in two different ways: | |
as a semantically enriched free-text description annotated with terms
from ontologies, and
semantically, through defined input forms with a wide range of
ontology terms to choose from.
To facilitate the integration of the free text into a semantic context, | |
text can be automatically annotated using jAnnotator, a JavaScript
library that draws on about 700 ontologies with more than 8.5 million
classes from the National Center for Biomedical Ontology (NCBO) BioPortal.
Users get to choose from suggested class definitions and link them to | |
terms in the text, resulting in a semantic markup of the text. This | |
markup may also include labels of elements that the user already added | |
to the partonomy. Anatomical entities marked in the text can be added to | |
the partonomy as new elements that can subsequently be described | |
semantically using the input forms. Each free text together with its | |
semantic annotations is stored following the W3C Web Annotation Data | |
Model standard (https://www.w3.org/TR/annotation-model). The whole | |
description, with the annotated free text and the formalized semantic
descriptions for each element of the partonomy, is saved in the
tuple store of Morph∙D∙Base.
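A rough sketch of what one stored annotation could look like under the W3C Web Annotation Data Model (expressed here as a Python structure; the identifiers and the ontology class are illustrative assumptions):

    import json

    # Hypothetical annotation linking a text span to an ontology class.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": "http://purl.obolibrary.org/obo/UBERON_0002101",  # the 'limb' class
        "target": {
            "source": "https://proto.morphdbase.de/description/demo-123",
            "selector": {"type": "TextQuoteSelector", "exact": "forelimb"},
        },
    }

    print(json.dumps(annotation, indent=2))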
The demonstration is targeted at developers and users of data portals | |
and will give an insight to the semantic Morph∙D∙Base knowledge base | |
(https://proto.morphdbase.de) and jAnnotator | |
(http://git.morphdbase.de/christian/jAnnotator). | |
Abstract | |
Web APIs (Application Programming Interfaces) are a common means for Web
portals and data producers to enable HTTP-based, machine-processable
access to their data. They are a prominent source of information*1
pertaining to topics as diverse as scientific information, social
networks, entertainment or finance. The methods of Linked Data (Heath
and Bizer 2011) similarly aim to publish machine-readable data on the | |
Web, while connecting related resources within and between datasets, | |
thereby creating a large distributed knowledge graph. Today, the | |
biodiversity community is increasingly adopting the Linked Data | |
principles to publish data such as trait banks, museum collections and | |
taxonomic registers (Parr et al. 2016, Baskauf et al. 2016). However, | |
standard approaches are still missing to combine disparate | |
representations coming from both Linked Data interfaces and the manifold | |
Web APIs that were developed during the last two decades to expose | |
legacy biodiversity databases on the Web. | |
The SPARQL Micro-Service architecture (Michel et al. 2018) tackles the | |
goal of reconciling Linked Data interfaces and Web APIs. It proposes a | |
lightweight method to query a Web API using SPARQL (Harris and Seaborne | |
2013), the Semantic Web standard to query knowledge graphs expressed in | |
the Resource Description Framework (RDF). A SPARQL micro-service | |
provides access to a small RDF graph, typically resource-centric, that | |
it builds at run-time by transforming a fraction of the whole dataset | |
served by the Web API into RDF triples. Furthermore, Web APIs | |
traditionally rely on internal, proprietary resource identifiers that | |
are unsuited for use as Uniform Resource Identifiers (URIs). To address | |
this concern, a SPARQL micro-service can assign a URI to a Web API | |
resource, allowing an application to look up this URI and get a | |
description of the resource in return (this process is referred to as | |
dereferencing). | |
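A minimal sketch of what calling such a service could look like from
Python, using the standard SPARQL protocol; the micro-service URL is
hypothetical, and behind the scenes the service would translate the
query into a Web API call and return the result as RDF.

    import requests

    ENDPOINT = "https://sparql-ms.example.org/flickr/getPhotosByTaxon"  # hypothetical

    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dwc:  <http://rs.tdwg.org/dwc/terms/>
    SELECT ?photo WHERE {
      ?photo foaf:depicts ?taxon .
      ?taxon dwc:scientificName "Delphinus delphis" .
    } LIMIT 10
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"]:
        print(binding["photo"]["value"])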
In this demo, we wish to showcase the value of SPARQL micro-services in | |
the biodiversity domain. We first query TAXREF-LD, a Linked Data | |
representation of the French taxonomic register of living beings (Michel | |
et al. 2017), to retrieve information about a given taxon. Then, we | |
demonstrate how we can enrich our knowledge about this taxon with | |
various types of data retrieved on-the-fly from multiple Web APIs: | |
trait data from the Encyclopedia of Life trait bank (Parr et al. 2016), | |
articles or books from the Biodiversity Heritage Library, | |
audio recordings from the Macaulay scientific media archive, | |
photos from the Flickr photography social network, and | |
music tunes from MusicBrainz. | |
Different visualizations are demonstrated, ranging from raw RDF triples | |
to Web pages generated dynamically and integrating heterogeneous data, | |
as suggested in Fig. 1. Depending on the audience's interests, we shall | |
touch upon the alignment of Web APIs' proprietary vocabularies with | |
well-adopted thesauri or ontologies, or more technical concerns, e.g.,
the effort required to deploy a new SPARQL micro-service.
Abstract | |
In recent years, the natural history collections community has made | |
great progress in accelerating the pace of collection digitization and | |
global data-sharing. However, a common workflow bottleneck often occurs
in the period immediately following image capture and preceding image
submission to portals, a critical phase involving quality control, file | |
management, image processing, metadata capture, data backup, and | |
monitoring performance and progress. | |
While larger institutions have likely developed reliable, automated | |
workflows over time, small and medium institutions may not have the | |
expertise or resources to design and implement workflows that take full | |
advantage of automation opportunities. Without automation, these | |
institutions must invest many hours of manual effort to meet quality and | |
performance goals. | |
To address its own needs, BRIT developed a number of workflow automation | |
components, which coalesced over time into a suite of tools that operate | |
on both an image capture station as a client application and on a server | |
that provides file storage and image processing features. Together, | |
these tools were created to meet the following goals: | |
Simplify file management and data preservation through automation | |
Quickly identify quality issues | |
Quickly capture skeletal metadata to facilitate later databasing | |
Significantly reduce time between image capture and online availability | |
Provide performance and quality monitoring and reporting | |
Simplify configuration and maintenance of the client and server
The client and server components together can be considered a | |
"digitization appliance": software integrated with the specific goal of | |
providing a comprehensive suite of digitization tools that can be | |
quickly and easily deployed on simple consumer hardware. We have made | |
this software available to the natural history collections community | |
under an open-source license at | |
https://github.com/BRITorg/digitization_appliance.
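As an illustration of the sort of client-side automation such an
appliance bundles (a generic sketch, not the BRIT code itself), the
following Python loop watches a capture folder, checksums each new image
for data preservation, and hands it off to a mounted server share:

    import hashlib
    import shutil
    import time
    from pathlib import Path

    CAPTURE = Path("capture")       # where the imaging station writes files
    OUTBOX = Path("server_share")   # mounted server storage (assumption)
    for p in (CAPTURE, OUTBOX):
        p.mkdir(exist_ok=True)

    def sha256(path: Path) -> str:
        """Checksum a file in chunks so large TIFFs do not exhaust memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    while True:
        for img in CAPTURE.glob("*.tif"):
            digest = sha256(img)
            # store the checksum beside the image so the server can verify it
            (OUTBOX / (img.stem + ".sha256")).write_text(digest)
            shutil.move(str(img), str(OUTBOX / img.name))
            print(f"moved {img.name} ({digest[:12]}...)")
        time.sleep(5)  # polling; a production tool might use filesystem events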
Abstract | |
The Specify Software Project (www.specifysoftware.org) has been funded | |
by the University of Kansas and with grants from the U.S. National | |
Science Foundation for 20 years. In 2018, the effort is pivoting from a | |
grant-funded project to a community-supported effort through the | |
establishment of a consortium of biological collection institutions. | |
Specify Collection Consortium software products will remain open source | |
and free to download and use. Consortium membership benefits will | |
include access to technical support services and seats on the Board of | |
Directors and advisory committees, groups that will determine priorities | |
for future products, platform capabilities, and technical support | |
services. In 2017 and 2018, we have been engaged in organizational
planning and development, modeling the Specify Collections Consortium on
examples of viable open-source and open-access consortia in other
research communities. Founding members of the Consortium in the U.S.
include the University of Michigan, University of Florida, and | |
University of Kansas. The Consortium's mission will be to support
collections institutions in mobilizing data from their holdings to | |
broader biological and computational initiatives to advance | |
collections-based research, while facilitating efficient data curation | |
and collection management. We will provide an update on our progress | |
with the Consortium's development and highlight new capabilities and
integration features of the Specify 6 & 7 software platforms. | |
Abstract | |
To improve access to biodiversity knowledge for diverse audiences, the | |
Encyclopedia of Life (EOL) aggregates materials from hundreds of content | |
providers. In addition to text, media, references, taxon names and | |
hierarchies, traits and other structured data are an increasingly | |
important component of EOL (TraitBank). Content priorities for TraitBank | |
include information about body size, geographic distribution, habitat, | |
trophic ecology, and biotic interactions in general. Our goal is to | |
summarize available data at the level of species and supraspecific taxa | |
and to achieve broad taxonomic coverage for high priority topics. | |
Integration of information from heterogeneous sources relies on a | |
variety of community standards (e.g., Dublin Core, Darwin Core, Audubon | |
Core) as well as post-hoc semantic annotations that standardize | |
terminology for traits and metadata and provide links to domain | |
ontologies and controlled vocabularies (e.g., Ontology of Biological | |
Attributes, Phenotypic Quality Ontology, Environment Ontology, Uber | |
Anatomy Ontology). Taxon names are mapped to a reference hierarchy that | |
leverages taxonomic information from many different resources (e.g., | |
Catalogue of Life, World Register of Marine Species, Paleobiology | |
Database, National Center for Biotechnology Information). Name
reconciliation takes into account canonical name strings, authorities,
and synonym relationships as well as information about ranks and | |
hierarchies (parent/child taxa). In EOL version 3 this infrastructure | |
supports complex queries across EOL data sets, autogenerated natural | |
language descriptions of taxa, and knowledge-based recommender systems | |
for the exploration of content along multiple axes, including phylogeny, | |
ecology, life history, relevance to humans and other characteristics | |
derived from structured data. Most TraitBank data currently come from | |
published data compilations and databases of specialist projects, but | |
there are still significant gaps in coverage for many lesser known | |
groups. Recent advances in natural language processing, image analysis,
and machine learning technologies facilitate the automated extraction
and processing of data from unstructured text and images. This will soon | |
make it possible to recruit vast amounts of information from millions of | |
pages of taxonomic, ecological, and natural history literature available | |
in open access repositories like Biodiversity Heritage Library (BHL) and | |
Plazi. Natural history collections are another promising source of new | |
taxon information. Millions of museum specimens indexed by organizations | |
like the Global Biodiversity Information Facility (GBIF) and Integrated | |
Digitized Biocollections (iDigBio) already contribute significantly to | |
our understanding of species occurrences in space and time. But | |
specimens and associated labels and field notes can also provide | |
information about morphology, phenology, habitats, and biotic | |
interactions. Data mined from literature corpora or specimen collections | |
will generally lack detailed descriptions of what exactly was measured, | |
metadata about the data capture process, measurement accuracy, and other | |
important parameters. The integration of this information with data sets | |
from the primary literature therefore poses challenges that go beyond | |
the standardization of taxonomy and terminology. Leveraging data from a
wide variety of sources is, however, necessary to achieve a comprehensive,
interconnected biodiversity knowledge base that supports the exploration | |
of trait diversity across the tree of life. | |
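As a toy illustration of the name reconciliation step described above
(the real EOL pipeline is considerably richer), the following Python
sketch reduces a name string to its canonical form and resolves synonyms
against a stand-in lookup table:

    import re

    SYNONYMS = {  # canonical name -> accepted name (illustrative entries)
        "Felis concolor": "Puma concolor",
    }

    def canonical(name: str) -> str:
        """Strip authorship: 'Puma concolor (Linnaeus, 1771)' -> 'Puma concolor'."""
        name = re.sub(r"\(.*?\)", "", name)                 # drop parenthesized authority
        name = re.sub(r"[A-Z][a-z]+,?\s*\d{4}", "", name)   # drop 'Author, 1771'
        return " ".join(name.split()[:2])

    def reconcile(name: str) -> str:
        """Map a raw name string to its accepted name, if a synonym is known."""
        c = canonical(name)
        return SYNONYMS.get(c, c)

    print(reconcile("Felis concolor Linnaeus, 1771"))  # -> Puma concolor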
Abstract | |
The World Flora Online (WFO) is primarily a data management project | |
initiated in 2012 in response to Target 1 of the Global Strategy for | |
Plant Conservation -- \"To create an online flora of all known plants by | |
2020\". A WFO Consortium has been formed of now 42 international | |
partners with a governing Council and three Working Groups. The World | |
Flora Online Public Portal (www.worldfloraonline.org) was launched at | |
the International Botanical Congress in Shenzhen, China in July, 2017. | |
The baseline Public Portal was primarily populated with a taxonomic | |
backbone of information gathered from The Plant List augmented by newer | |
taxonomic sources like Solanaceae Source. To support all known plant | |
names in the WFO. including both vascular and non-vascular plants, new | |
WFO identifiers (WFOIDs) were created, which were also cross-referenced | |
to the International Plant Names Index (IPNI) identifiers for plant | |
names included there. The next phase of the World Flora Online involves | |
additional enhancement of the taxonomic backbone by engagement of new | |
plant Taxonomic Expert Networks (TENs) and acceleration of ingestion of | |
descriptive data from digital floras and monographs, and other sources | |
like International Union for Conservation of Nature (IUCN) threat | |
assessments and the Botanic Gardens Conservation International (BGCI) | |
Global Tree Assessment. Descriptive data can be text descriptions, | |
images, geographic distributions, identification keys, phylogenetic | |
trees, as well as atomized trait data like threat status, lifeform or | |
habitat. Initial digital descriptive datasets have been received by WFO | |
from Flora of Brazil, Flora of South Africa, Flora of China, Flora of | |
North Africa, Solanaceae Source and several others. The hard work is
underway to match the names associated with the submitted descriptions
to the names and WFOIDs in the World Flora Online taxonomic backbone,
and then to merge the descriptive data elements into the WFO database.
Numerous data tools have been adopted and created to accomplish the data | |
cleaning, standardization and transformation required before descriptive | |
data can be integrated. The WFO project has discovered many variations
among just the few datasets received so far, which highlights the need
for better standardization and controlled vocabularies for flora and | |
monographic descriptive data. This presentation will review some of the | |
issues identified by the project when merging descriptive data and some | |
potential gaps in the TDWG standards specifically for flora descriptive | |
data. Some opportunities for consideration by the TDWG Species | |
Information Interest Group will be presented. | |
Abstract | |
Species-level information, an important component of the biodiversity
information landscape, is an area where several TDWG standards and
activities coincide. Plinian Core (Plinian Core Task Group 2018) is a
generalist specification that covers aspects such as species descriptions
and nomenclature, as well as many others (legal, conservation,
management, etc.). While the non-biological Plinian Core terms have no
counterpart in the TDWG developments, some of its biological ones do,
and those are the focus of this work. First, it must be noted that
Plinian Core relies on some TDWG standards for specific facets of | |
species information: | |
Standard: Darwin Core (Darwin Core maintenance group, Biodiversity
Information Standards (TDWG) 2014)
Elements: taxonConceptID, Hierarchy, MeasurementOrFact,
ResourceRelationship.
Standard: Ecological Metadata Language (EML project members 2011)
Elements: associatedParty, keywordSet, coverage, dataset.
Standard: Encyclopedia of Life Schema (EOL Team 2012)
Elements: AncillaryData: DataObjectBase.
Standard: Global Invasive Species Network (GISIN 2008)
Elements: origin, presence, persistence, distribution, harmful,
modified, startValidDate, endValidDate, countryCode, stateProvince,
county, localityName, language, citation, abundance, etc.
Standard: Taxon Concept Schema (TCS) (Taxonomic Names and Concepts
interest group 2006)
Elements: scientificName.
Given Plinian Core's direct dependency on these standards for those
terms, they do not pose any compatibility or interoperability problem.
However, biological descriptions (especially structured ones) are the
object of DELTA (Dallwitz 2006) and the Structured Descriptive Data
(SDD) standard (Hagedorn et al. 2005), and are also covered by Plinian
Core. This convergence presents overlaps, mismatches and nuances, whose
discussion is the core of this work.
Using some species descriptions as a test case, and transforming them | |
between these standards (Plinian Core, DELTA, and SDD), the strengths | |
and compatibility issues of these specifications are evaluated and | |
discussed. | |
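The following schematic Python sketch illustrates the kind of
element-level crosswalk such transformations rely on; the element names
are simplified stand-ins rather than the exact Plinian Core, DELTA, or
SDD terms, and unmapped elements surface exactly the overlaps and
mismatches under discussion:

    CROSSWALK = {
        # (source_standard, element) -> (target_standard, element)
        ("PlinianCore", "Habitat"): ("SDD", "CategoricalCharacter:habitat"),
        ("PlinianCore", "FullDescription"): ("SDD", "NaturalLanguageDescription"),
        ("DELTA", "character"): ("SDD", "Character"),
    }

    def transform(record, source, target):
        """Carry over elements with a known counterpart; report the rest."""
        mapped, unmapped = {}, []
        for element, value in record.items():
            hit = CROSSWALK.get((source, element))
            if hit and hit[0] == target:
                mapped[hit[1]] = value
            else:
                unmapped.append(element)  # the mismatches under discussion
        return {"mapped": mapped, "unmapped": unmapped}

    print(transform({"Habitat": "montane forest", "Uses": "medicinal"},
                    "PlinianCore", "SDD"))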
Some operational aspects of Plinian Core in relation to GBIF's IPT
(GBIF Secretariat 2016) and the INSPIRE directive (European Commission | |
2007) are also reviewed. | |
Abstract | |
Taxonomic monographs are a series of publications covering a higher
taxonomic group, with each monograph focusing on an individual species.
Each is a compendium of the current state of research and knowledge
detailing many aspects of the species, and they are extensively used by
researchers, ornithologists and conservationists to learn what is
'currently' known about a species. Birds, being one of the more easily
seen and studied taxa, have a number of specialized taxonomic monographs | |
where data from a wide variety of disciplines are combined into a single | |
place and utilized for research and conservation management. Many of the | |
existing avian monographs have regional or subdomain focus such as | |
"Birds of the Western Palearctic" or "Catalan Breeding Bird Atlas | |
1999-2002" and monographs are sometimes focused on different user | |
communities, ranging from those with casual interest to professional | |
ornithologists and researchers. | |
The Lab of Ornithology maintains several monograph series. Merlin and | |
All About Birds include simplified information that is of interest to | |
the casual observer, while Birds of North America and Neotropical Birds
Online are monographs with complete, detailed life histories, prepared | |
for ornithologists and active researchers. These monograph projects were
originally supported using different Content Management Systems, which
became difficult to maintain, made it hard to keep content current, and
provided no capacity for organizing and sharing content across
monograph projects. Bird taxonomies change annually, and the previous
systems had no capacity to intelligently manage taxonomic changes. To | |
solve these issues, we created a new Content Management System with | |
Taxonomic Concepts at its core. Reviewing a number of existing monograph | |
projects led us to create an underlying content structure that is closely
analogous to Plinian Core. The initial requirement to support multiple | |
monograph series, some focused on the professional community and others | |
focused on budding amateurs, presented challenges to creating a 'one | |
size fits all' model for structuring content that includes authoritative | |
articles covering most aspects of a species life history, traditional | |
range maps, dynamic observation maps, relative abundance models, photos, | |
images, video and a bibliography. In this talk I'll present in detail
the Content Management System and the five underlying models we have
developed. Four of these models are tied to the underlying
taxonomic concept, while the fifth is tied to taxonomic names.
Articles, multimedia (including traditional range maps), taxonomic
description and bibliography have long existed in print monographs, and
having these authored and displayed via the web makes it much simpler to
incorporate new information, keep the information current, and
publish the information to an existing standard.
dynamic content has only been possible with the advent of the web and | |
standards for the underlying Taxonomic Concepts. With four monographs | |
currently in production and several more in development, we've | |
encountered both advantages and disadvantages in using these models for | |
managing and serving monograph series. I will discuss these in detail | |
and compare the models with Plinian Core to highlight both fundamental | |
differences as well as common ground. | |
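A minimal sketch (an assumed structure, not the actual schema) of why
keying content to taxonomic concepts rather than names pays off when
taxonomies change annually:

    from dataclasses import dataclass

    @dataclass
    class TaxonConcept:
        concept_id: str       # stable identifier for the circumscription
        accepted_name: str    # current name; may change between taxonomies

    @dataclass
    class Article:
        title: str
        concept: TaxonConcept  # articles, maps and media hang off the concept

    @dataclass
    class NameRecord:
        name: str
        concept: TaxonConcept  # names map onto concepts, not the other way round

    grouse = TaxonConcept("tc:0001", "Centrocercus urophasianus")
    article = Article("Breeding biology", grouse)
    name = NameRecord("Centrocercus urophasianus", grouse)
    # A taxonomy update can rename the concept without touching its content:
    grouse.accepted_name = "Centrocercus urophasianus [revised]"
    print(article.concept.accepted_name)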
Abstract | |
Aiming to promote interaction among researchers and the integration of
data from their pollen collections, herbaria and bee collections, RCPol
was created in 2013. In order to structure RCPol work, researchers and
collaborators have organized information on palynology and trophic
interactions between bees and plants. During the project development,
different computing tools were developed and provided on RCPol website | |
(http://rcpol.org.br), including: interactive keys with multiple inputs | |
for species identification (http://chaves.rcpol.org.br); a glossary of
palynology-related terms
(http://chaves.rcpol.org.br/profile/glossary/eco); a plant-bee | |
interactions database (http://chaves.rcpol.org.br/interactions); and a | |
data quality tool (http://chaves.rcpol.org.br/admin/data-quality). Those | |
tools were developed in partnership with researchers and collaborators | |
from Escola Politécnica (USP) and other Brazilian and foreign | |
institutions that act on palynology, floral biology, pollination, plant | |
taxonomy, ecology, and trophic interactions. The interactive keys are | |
organized in four branches: palynoecology, paleopalynology, | |
palynotaxonomy and spores. This information is collaboratively
digitized and managed using standardized Google Spreadsheets. All the
information is assessed by a data quality assurance tool (based on the
conceptual framework of the TDWG Biodiversity Data Quality Interest
Group; Veiga et al. 2017) and curated by palynology experts. In total,
the project has published 1,774 specimen records, 1,488 species records
(automatically generated by merging specimen records with the same
scientific name), 656 interaction records, 370 glossary term records and
15 institution records, all of them translated from the original language (usually
Portuguese or English) to Portuguese, English and Spanish. During the
project's first three years, 106 partners, researchers and
collaborators from 28 institutions in Brazil and abroad, actively
participated in the project. An important part of the project's
activities involved training researchers and students in palynology,
data digitization and the use of the system. So far, six training
courses have reached 192 people.
Abstract | |
The Australian Department of the Environment and Energy (DoEE) is
working with the Atlas of Living Australia (ALA) and the Biodiversity
and Climate Change Virtual Laboratory (BCCVL), together with two state
environment departments (New South Wales and Queensland), to develop a
standard framework for modelling threatened species distributions for
use in policy and environmental decision making.
In addition, DoEE is working with seven state and territory environment
departments to implement a common assessment method (CAM) for the
assessment and listing of nationally threatened species. The method is
based on the IUCN Red List criteria. Each Australian jurisdiction has
traditionally used a different assessment method, with its own
categories, criteria, thresholds, definitions and scales of assessment,
to list threatened species within its jurisdiction. The CAM is a
standardised method for species assessed for listing at the national
level. Through
cross-jurisdictional collaboration, this will improve the efficiency of | |
the assessment process and facilitate consistency across jurisdictional | |
lists. | |
The BCCVL includes linkages to species observations on the ALA, and users
are able to add their own data, including contextual and species data.
The project aims to create a secure environment where | |
cross-jurisdictional collaboration can occur both on the standardisation | |
of methodologies for creating species distributions and the integration | |
of data. The project also aims to provide a secure platform for | |
jurisdictions to contribute sensitive observations not available through | |
the ALA and take into consideration expert feedback on the distribution | |
of species. | |
The project will provide a public-facing platform whereby species
distribution models (SDMs) can be published. This will be searchable by
area, species or contributor. All outputs will be scientifically robust,
repeatable, maintainable, open and transparent. The increased validity
and robustness of models leads to better-informed decisions relating to
the impacts of development and the conservation of species.
Abstract | |
How do you successfully engage volunteers in citizen science projects? | |
In recent years, citizen science has grown considerably in popularity, | |
resulting in rapid increases in the number of citizen science and | |
crowdsourcing projects and providing cost-effective means for scientists | |
to gather more data over broader spatial ranges to tackle research | |
questions in a wide variety of scientific, conservation, and | |
environmental fields Bonney et al. 2016, Aceves-Bueno et al. 2017. While | |
the proliferation of such projects has produced a growing abundance of | |
citizen scientist-generated data and published research informed by | |
citizen science methods Follett and Strezov 2015, this also means that | |
volunteers have a greater number of projects competing for their time. | |
When faced with an increasingly-crowded landscape, how can you generate | |
interest in a citizen science or crowdsourcing project and maintain | |
contributions over the project's lifetime? | |
The Biodiversity Heritage Library (BHL) supports a variety of citizen | |
science and crowdsourcing projects, from transcribing field notes to | |
tagging scientific illustrations with taxonomic names on Flickr and | |
enhancing data for 19th-century periodicals through its
Zooniverse-based Science Gossip project. Through a variety of outreach | |
strategies including collaborative social media campaigns, partnerships | |
with citizen science communities, and interactive incentives, BHL has | |
successfully engaged volunteers with diverse projects to enrich the | |
library's data and increase discoverability of its collections. | |
This presentation will discuss outreach strategies for citizen science | |
projects that BHL has undertaken to further support research initiatives | |
with our content. In addition, the presentation will share | |
lessons-learned and offer suggestions that attendees can apply to their | |
own citizen science engagement efforts. | |
Abstract | |
Biodiversity literature and archival collections are not only
indispensable in taxonomic research; they also provide crucial information
for understanding museums' natural history collections. Literature
and archives document collecting events resulting in specimen | |
collections, contain original descriptions based on those specimens, and | |
provide a wealth of other contextual information for the study of life | |
on earth. The Biodiversity Heritage Library is committed to improving | |
research efficiency by providing open access to a growing body of | |
biodiversity literature and archives. While descriptive metadata is
widely available for both specimen collections (e.g., Darwin Core) and
literature (e.g., MARCXML), connections between the two collection types
cannot generally be found at these descriptive levels, thus hindering
efficient discovery of relevant materials. The integration of name-
finding services, powered by Global Names Architecture, provides a | |
significant value-add through page-level access to mentions of a given | |
taxon name. Yet how might one search based on a museum code, a common | |
name, or a place name? This presentation will share how BHL's top | |
technical priorities for 2018 will help facilitate more efficient | |
searching and discovery of information in the pages of the BHL corpus. | |
Specifically, updates on BHL's top two priorities -- the implementation
of full-text search and the incorporation of available crowdsourced
transcriptions -- will be covered.
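As a toy illustration of the indexing that page-level full-text search
requires (BHL's actual implementation is not described here), the
following Python sketch builds an inverted index from OCR page text to
page identifiers:

    from collections import defaultdict

    pages = {  # page_id -> OCR text (illustrative)
        "page/101": "Notes on Puma concolor collected near the museum",
        "page/102": "A common name list for North American mammals",
    }

    # map each token to the set of pages on which it occurs
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token].add(page_id)

    def search(term):
        return index.get(term.lower(), set())

    print(search("museum"))  # -> {'page/101'}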
Abstract | |
The classification of living things depends upon the literature. Access | |
to this literature is essential to taxonomic research and to our | |
understanding of biodiversity. There have been tremendous efforts to | |
digitise the world's biodiversity literature; the Biodiversity Heritage
Library (BHL) alone has uploaded over 54 million pages, all of which are
freely accessible online. Our scientific literature is far more
accessible than it has ever been, but that does not mean it is easily | |
discoverable. Much of the taxonomic literature online remains outside | |
the linked network of scholarly research. But that is rapidly changing. | |
Taxonomic aggregators are an invaluable source of authoritative | |
information on species names and their hierarchical classification. It | |
is critical that this information includes citations for taxonomic | |
descriptions, that these citations link to the published literature | |
online and that (wherever possible) the citations include DOIs (Digital | |
Object Identifiers). The DOI is an essential part of a publication's | |
bibliographic metadata and should be included (as a live link) in any | |
reference to that content. | |
However, the definitive (DOI'd) versions of recent publications are | |
frequently behind paywalls. And, while much of the historic literature | |
available online is open access, commercial publishers are uploading | |
out-of-copyright publications onto their own websites, assigning DOIs to | |
"their" definitive versions (the versions that must be cited in other | |
publications, as per DOI requirements) and then locking the definitive
versions behind paywalls. This is perfectly within their rights. DOIs | |
may be assigned to legacy publications retrospectively, providing that: | |
a) the party assigning them owns the rights for the content, or has | |
permission from the rights holder to assign a DOI, and b) the | |
publication does not already have a DOI. If there are no rights attached | |
to a piece of content, anyone can assign a DOI to it. | |
This means that citation traffic from the bibliographies of current | |
publications is increasingly directed towards commercial publishers' | |
websites, rather than towards open access versions, such as those freely | |
available on the Biodiversity Heritage Library (BHL). However, taxonomic | |
aggregators are not bound by the same obligations as publishers and may | |
therefore choose to link to any online version of a publication | |
(although the DOI should still be included in the citation). | |
Many taxonomic aggregators link to the literature available on BHL. The | |
taxonomic name profiles in EOL (Encyclopedia of Life), GBIF (Global | |
Biodiversity Information Facility) and ALA (Atlas of Living Australia) | |
each contain a BHL bibliography: a list of links to the pages in BHL | |
that contain an identified mention of that taxon name. However, the | |
lists of returned results can be long, and they may or may not include | |
the citations for accepted names, synonyms and taxon concepts. Some | |
biodiversity aggregators feature these key citations on the names pages | |
(or tabs) of taxon profiles. However, where these do exist, they are | |
usually plain text rather than links. | |
BHL is now registering DOIs for the content it hosts and is creating | |
landing pages for articles, containing the full bibliographic metadata, | |
including (where applicable) the DOI. Articles are now discoverable by | |
article title, keywords within titles (scientific names, locations, | |
traits, etc.), author names and DOIs, and can be easily linked to (via | |
their landing pages) by other parties. | |
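Once an article has a DOI and a landing page, its bibliographic metadata
can also be fetched machine-readably via DOI content negotiation, a
service offered by the major DOI registration agencies. A minimal Python
sketch (the DOI below is a placeholder):

    import requests

    def csl_metadata(doi):
        """Fetch CSL JSON bibliographic metadata by resolving the DOI."""
        resp = requests.get(
            f"https://doi.org/{doi}",
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    meta = csl_metadata("10.1000/example-doi")  # placeholder; use a real DOI
    print(meta.get("title"), meta.get("container-title"))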
This paper will examine the issues, benefits and complexities associated | |
with linking to definitive versions, the difference between easy and | |
open access, the ethics of putting out-of-copyright content behind | |
paywalls, and the future of creating order amongst the massively | |
expanding resource of literature online. | |
Abstract | |
The Biodiversity Heritage Library (BHL) provides open access to over 54 | |
million pages of biodiversity literature. Much of this literature is | |
either in the public domain or is licensed for reuse under the Creative | |
Commons framework. Anyone can therefore freely reuse much of the | |
information and data provided by BHL. This presentation will outline how | |
the work of a citizen scientist using BHL content might benefit research | |
scientists. It will discuss how a citizen scientist can reuse and link | |
BHL literature and data in Wikipedia and Wikidata. It will explain the | |
research efficiencies that can be obtained through this reuse and | |
linking, for example through the consolidation of database identifiers. | |
The presentation will outline the subsequent reuse of the BHL data added | |
to Wikipedia and Wikidata by the internet search engine Google. It will | |
discuss an example of the linking of this information in the citizen | |
science observation platform iNaturalist. The presentation will explain | |
how BHL, as a result of its open reuse licensing of information and | |
data, helps in the creation of more accurate citizen science generated | |
biodiversity data and assists with the wider and more effective | |
dissemination of biodiversity information. | |
Abstract | |
A program to integrate species diversity information systems was | |
launched by the Chinese Academy of Sciences (CAS) in January 2018, with | |
funding from the CAS Earth project, a Strategic Priority Research | |
Program of CAS. The program will create a series of data products, such | |
as China flora online, species catalogues, distribution maps, software | |
tools for data mining and knowledge discovery based on big data and | |
artificial intelligence technology, and a service platform and portal | |
highlighting species diversity information in China. The products and | |
platform will provide the robust data to support decision making on | |
biodiversity conservation, fundamental research on biodiversity | |
evolution and spatial patterns, and species identification for citizen | |
science. China flora online will include 35,000 species of higher plants | |
in China and an online editing environment for botanists to maintain the | |
floral records. The trait database will include structured data of | |
animals, plants and fungi, such as weight, height, length, color and | |
shape of organisms. The species catalogue will be an annually updated
version of the Catalogue of Life, China. The distribution maps will show
the spatial pattern for each species of vertebrate animal and higher | |
plant. Cell phone apps will help users to easily and quickly identify | |
plants in the field. The mechanism and workflow for data collection, | |
integration, public sharing and quality control will be built up in the | |
next few years. | |
Abstract | |
Due to the recent establishment of the Global Genome Biodiversity | |
Network (GGBN) data portal, we have extended Specify collections | |
management software (http://www.sustain.specifysoftware.org/) to more | |
effectively manage, publish, and integrate tissue and DNA extract data | |
by adding support for the GGBN data schema. Specify's database design | |
now includes a number of data fields and tables prescribed by GGBN
standard vocabularies. We also realigned some of the underlying table
relationships to address the needs of specimen curation and collection | |
transactions for extract and tissue samples. Specify now also supports | |
"Next Generation" sequencing metadata with fields to record NCBI SRA ID | |
numbers for web-linking tissue and extract metadata to entries in the | |
NCBI SRA databases. | |
With the ongoing evolution of the TDWG Darwin Core (DwC) standard for | |
specimen data exchange, we generalized Specify 7's data publishing | |
capabilities to export collections data to any DwC or other | |
standards-based, exchange schema. This generic, external schema mapping | |
capability enables Specify collections to design and map data packages | |
to integrate their data with any community aggregator or collaborative | |
project database based on Darwin Core or other community standard-based | |
format. The development of these versatile new integration capabilities
was carried out in collaboration with, and with financial support from,
GGBN. This
talk will highlight these changes in the context of delivery of museum | |
tissue and extract data records to the GGBN data portal for aggregation. | |
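A generic sketch of the kind of external schema mapping described above:
an internal record is exported through a user-defined mapping onto
Darwin Core and GGBN-style terms. The field names on both sides are
illustrative assumptions, not Specify's actual schema:

    # Internal field -> exchange-schema term (illustrative on both sides)
    DWC_MAPPING = {
        "catalog_no": "dwc:catalogNumber",
        "sci_name": "dwc:scientificName",
        "tissue_type": "ggbn:materialSampleType",  # assumed GGBN-style term
        "ncbi_sra_id": "ggbn:sraAccession",        # hypothetical field label
    }

    def export_record(record, mapping):
        """Export only the fields the mapping covers, renamed to target terms."""
        return {mapping[k]: v for k, v in record.items() if k in mapping}

    specimen = {
        "catalog_no": "KU:IT:00123",
        "sci_name": "Apis mellifera",
        "tissue_type": "DNA extract",
        "ncbi_sra_id": "SRR0000000",
        "internal_note": "not exported",
    }
    print(export_record(specimen, DWC_MAPPING))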
Abstract | |
The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/) | |
is an open access repository for geographic and ecological metadata | |
associated with biosamples and genetic data. It contributes to the | |
informatics stack -- Biocode Commons -- of the Genomic Observatories | |
Network | |
(https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-2). | |
The GeOMe project interface enables administrators to plan and execute
field-based sample collection efforts. GeOMe projects specify a core set
of sample metadata fields based on community standard vocabularies and
also include plugins for associating samples with photos, subsamples,
NextGen sequence metadata, and permits. Users can upload their own
expedition-specific metadata, which contributes to the overall project | |
dataset while providing the user a convenient method for updating and | |
refining their contributed data. GeOMe provides connection points to the | |
Global Biodiversity Information Facility and archived genetic data | |
stored in the National Center for Biotechnology Information's (NCBI's)
Sequence Read Archive (SRA), linking specimens and sequences via unique
persistent identifiers. | |
Abstract | |
Genomic research depends upon access to DNA or tissue collected and | |
preserved according to high-quality standards. At present, the | |
collections in most natural history museums do not sufficiently address | |
these standards. In response to these challenges, natural history | |
museums, culture collections, herbaria, botanical gardens and others | |
have started to build high-quality biodiversity biobanks. Unfortunately, | |
information about these collections remains fragmented, scattered and | |
largely inaccessible. Without a central registry of relevant | |
institutions, it is difficult and time-consuming to locate the needed | |
samples. | |
The Global Genome Biodiversity Network (GGBN) was created to fill this | |
gap by establishing a central access point for locating samples meeting | |
quality standards for genome-scale applications, while complying with | |
national and international legislation and conventions (e.g., the Nagoya
Protocol). The GGBN is growing rapidly, currently has 70 members, and
works closely with GBIF, SPNHC, CETAF, INSDC, BOLD, ESBB,
ISBER, GSC and others to reach its goals.
Knowledge of biodiversity biobank content is urgently needed to enable | |
concerted efforts and strategies in collecting and sampling new material | |
and making Access and Benefit-Sharing (ABS) a reality. GGBN provides an
infrastructure for making
genomic samples discoverable and accessible. | |
While respecting national law, GGBN requires that its members comply
with the provisions of the Nagoya Protocol. Thus researchers,
collection-holding institutions, and networks should adopt the common
Best Practice approach to managing ABS that GGBN has developed. A Code
of Conduct, recommendations for implementing the Code of Conduct (the
Best Practices), and implementation tools, such as standard Material
Transfer Agreements (MTAs) and mandatory and recommended data fields in
collection databases, will aid compliance. This talk
provides an overview of GGBN and comprises updates on GGBN's best | |
practices on ABS and the Nagoya Protocol, with examples of their use and | |
applicability. | |
Abstract | |
Arctos (https://arctosdb.org), an online collection management | |
information system, was developed in 1999 to manage museum specimen data | |
and to make those data publicly available. The portal | |
(arctos.database.museum) now serves data on over 3.5 million cataloged | |
specimens from more than 130 collections throughout North America in an | |
instance at the Texas Advanced Computing Center. Arctos is also a
community of museum professionals that collaborates on museum best
practices and works together to improve Arctos data richness and
functionality for online museum data streaming. In 2017, three large
Arctos genomics collections at the Museum of Southwestern Biology (MSB), | |
Museum of Vertebrate Zoology, Berkeley (MVZ), and University of Alaska | |
Museum of the North (UAM), received support from GGBN to create a | |
pipeline for publishing data from Arctos to the GGBN portal. | |
Modifications to Arctos included standardization of controlled | |
vocabulary for tissues; changes to the data structure and code tables | |
with regard to permit information, container history, part attributes, | |
and sample quality; implementation of interfaces and protocols for | |
parent-child relationships between tissues, tissue subsamples, and DNA | |
extracts; and coordination with the DwC community to ensure that all
GGBN data standards and formatting are included in the standard DWC | |
export in order to finalize the pipeline to GGBN. The addition of these | |
three primary Arctos biorepositories to the GGBN network will add over | |
750,000 tissue and DNA records representing over 11,000 species and 667 | |
families. These voucher-based archives represent primarily vertebrate | |
taxa, with growing collections of arthropods, endoparasites, and | |
incipient collections of microbiome and environmental samples associated | |
with online media and linked to GenBank and other external databases. | |
The high-quality data in Arctos complement and significantly extend | |
existing GGBN holdings, and the establishment of an Arctos-GGBN pipeline | |
also will facilitate future collaboration between more Arctos | |
collections and GGBN. | |
Abstract | |
The GGBN Data Standard | |
(https://terms.tdwg.org/wiki/GGBN_Data_Standard) provides a platform
based on a documented agreement to promote the efficient sharing and | |
usage of genomic sample material and associated specimen information in | |
a consistent way. It builds upon existing standards commonly used within
the community, extending them with the capability to exchange data on
tissue, environmental and DNA samples as well as sequences. The standard
has recently been extended to support environmental DNA (eDNA) and High
Throughput Sequencing (HTS) library samples. Both eDNA and HTS library
sample use cases have been published in the GGBN Sandbox
(http://sandbox.ggbn.org) and will be presented here. The use case | |
collection is documented in the GGBN wiki | |
(http://wiki.ggbn.org/ggbn/Use_Case_Collection).
In addition, a general overview of the GGBN Data Portal
(http://www.ggbn.org) will be given. Based on ABCD, DwC and the GGBN
Data Standard, the GGBN Data Portal is the gateway to standardized access
to DNA, tissue and environmental samples and their associated specimens.
The third core piece of GGBN is the GGBN Document Library | |
(https://library.ggbn.org), today containing more than 300 documents | |
about research, management and legal aspects of biodiversity biobanks. | |
We will provide an overview of covered topics and gaps that the | |
community can help to fill. | |
Finally, an outlook on goals and priority tasks for the next two years
will be given.
Abstract | |
The Open Biodiversity Knowledge Management System (OBKMS) is an | |
end-to-end, eXtensible Markup Language (XML)- and Linked Open Data | |
(LOD)-based ecosystem of tools and services that encompasses the entire | |
process of authoring, submission, review, publication, dissemination, | |
and archiving of biodiversity literature, as well as the text mining of | |
published biodiversity literature (Fig. 1). These capabilities lead to | |
the creation of interoperable, computable, and reusable biodiversity | |
data with provenance linking facts to publications. | |
OBKMS is the result of a years-long joint endeavour by Plazi and
Pensoft. The system was developed with the support of several
biodiversity informatics projects: initially ViBRANT (Virtual
Biodiversity Research and Access Network for Taxonomy), followed by
pro-iBiosphere, the European Biodiversity Observation Network (EU BON),
and Biosystematics, Informatics and Genomics of the Big 4 Insect Groups
(BIG4). The system includes the following key components:
ARPHA Journal Publishing Platform: a journal publishing platform based | |
on the TaxPub XML extension for the National Library of Medicine (NLM)'s
Journal Publishing Document Type Definition (DTD) (Version 3.0). Its | |
advanced ARPHA-BioDiv component deals with integrated biodiversity data | |
and narrative publishing (Penev et al. 2017). | |
GoldenGATE Imagine: an environment for marking up, enhancing, and | |
extracting text and data from PDF files, supporting the TaxonX XML | |
schema. It has specific enhancements for articles containing | |
descriptions of taxa ("taxonomic treatments") in the field of
biological systematics, but its core features may be used for general | |
purposes as well. | |
Biodiversity Literature Repository (BLR): a public repository hosted at
Zenodo (CERN) for published articles (PDF and XML) and images extracted | |
from articles. | |
Ocellus/Zenodeo: a search interface for the images stored at BLR. | |
TreatmentBank: an XML-based repository for taxonomic treatments and data | |
therein extracted from literature. | |
The OpenBiodiv knowledge graph: a biodiversity knowledge graph built
according to the Linked Open Data (LOD) principles. It uses the RDF data
model and the SPARQL Protocol and RDF Query Language (SPARQL), is open
to the public, and is powered by the OpenBiodiv-O
ontology (Senderov et al. 2018).
OpenBiodiv portal: | |
Semantic search and browser for the biodiversity knowledge graph. | |
Multiple semantic apps packaging specific views of the biodiversity
knowledge graph.
Supporting tools: | |
Pensoft Markup Tool (PMT) | |
ARPHA Writing Tool (AWT) | |
ReFindit | |
R libraries for working with RDF and for converting XML to RDF | |
(ropenbio, RDF4R). | |
Plazi RDF converter, web services and APIs. | |
As part of OBKMS, Plazi and Pensoft offer the following services beyond | |
supplying the software toolkit: | |
Digitization through imaging and text capture of paper-based or | |
digitally born (PDF) legacy literature. | |
XML markup of both legacy and newly published literature (journals and | |
books). | |
Data extraction and markup of taxonomic names, literature references, | |
taxonomic treatments and organism occurrence records. | |
Export and storage of text, images, and structured data in data | |
repositories. | |
Linking and semantic enhancement of text and data, bibliographic | |
references, taxonomic treatments, illustrations, organism occurrences | |
and organism traits. | |
Re-packaging of extracted information into new, user-demanded outputs | |
via semantic apps at the OpenBiodiv portal. | |
Re-publishing of legacy literature (e.g., Flora, Fauna, and Mycota | |
series, important biodiversity monographs, etc.). | |
Semantic open access publishing (including data publishing) of journals
and books.
Integration of biodiversity information from legacy and newly published | |
literature into interoperable biodiversity repositories and platforms | |
(Global Biodiversity Information Facility (GBIF), Encyclopedia of Life | |
(EOL), Species-ID, Plazi, Wikidata, and others). | |
In this presentation we make the case for why OpenBiodiv is an essential | |
tool for advancing biodiversity science. Our argument is that through | |
OpenBiodiv, biodiversity science takes a step towards the ideals of open
science (Senderov and Penev 2016). Furthermore, by linking data from | |
various silos, OpenBiodiv allows for the discovery of hidden facts. | |
A particular example of how OpenBiodiv can advance biodiversity science
is demonstrated by OpenBiodiv's solution to "taxonomic anarchy"
(Garnett and Christidis 2017). "Taxonomic anarchy" is a term coined by
Garnett and Christidis to denote the instability of taxonomic names as
symbols for taxonomic meaning. They propose an "authoritarian"
top-down approach to stabilize the naming of species. OpenBiodiv, on the
other hand, relies on taxonomic concepts as integrative units, and
integration can therefore occur through alignment of taxonomic concepts
via the Region Connection Calculus (RCC-5) (Franz and Peet 2009). The
alignment is "democratically" created by the users of the system, but no
consensus is forced, and "anarchy" is avoided by using unambiguous
taxonomic concept labels (Franz et al. 2016) in addition to Linnean
names.
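As a toy illustration of RCC-5 concept alignment, the following Python
sketch enumerates the five relations and a pair of invented alignment
assertions between concept labels (a taxon name according to a
particular treatment):

    # The five RCC-5 relations between two taxonomic concepts
    RCC5 = {
        "==": "congruent (same circumscription)",
        "<":  "proper part of (narrower than)",
        ">":  "inverse proper part (broader than)",
        "><": "overlapping",
        "|":  "disjoint",
    }

    # A user-supplied alignment between two treatments (invented examples)
    alignment = [
        ("Aus bus sec. Smith 1990", "<", "Aus bus sec. Jones 2015"),
        ("Aus cus sec. Smith 1990", "|", "Aus bus sec. Jones 2015"),
    ]

    for left, rel, right in alignment:
        print(f"{left} {rel} {right}: {RCC5[rel]}")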
Abstract | |
The temporality of specimens is an often overlooked but quintessential | |
part of using aggregated biodiversity occurrences for research, | |
especially when millions of these occurrences exist in deep time. | |
Presently in Darwin Core, there are terms for describing the geological | |
context of specimens, which is needed for paleontological specimens. | |
However, information about the contextual absolute date associated with
a specimen, and how that date was generated, is not supported in Darwin
Core, though it would strongly enhance usability for research. Providers do
occasionally try to provision this information, but it is currently
hidden in a few different Darwin Core fields, making it hard to discover | |
and nearly impossible to search for in biodiversity portals. Here we | |
provide an overview of where absolute date content for paleontological
and archaeological specimens is currently found in published specimen
records. We will then introduce a working Darwin Core extension that
focuses on chronometric content, and demonstrate the use of this | |
extension with published datasets from the zooarchaeological and | |
paleontological communities. This new advancement will allow providers
to make these crucial data available, and researchers to easily find the
temporal range associated with an occurrence, evaluate how that range
was determined, and compile occurrences based on their shared ages,
helping to streamline the research process.
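To make the idea concrete, here is a hypothetical sketch of what a
chronometric extension record might carry alongside an occurrence; the
term names below are invented for illustration and are not the
extension's published vocabulary:

    occurrence = {
        "dwc:occurrenceID": "urn:uuid:example-0001",  # placeholder identifier
        "dwc:scientificName": "Bison bison",
    }

    chronometric_age = {
        "occurrenceID": occurrence["dwc:occurrenceID"],  # star-schema link
        "earliestAge": 11500,               # years before present (illustrative)
        "latestAge": 10200,
        "ageUnit": "years BP",
        "datingMethod": "AMS radiocarbon",  # how the date was generated
        "datedMaterial": "bone collagen",
        "labNumber": "Beta-000000",         # placeholder
    }
    print(chronometric_age)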
Abstract | |
Important initiatives, such as the Convention on Biological Diversity's
(CBD) Aichi targets and the United Nations' 2030 Agenda for Sustainable
Development (with its Sustainable Development Goals), highlight the urgent
need to stop the continuous and increasing loss of biodiversity. That
requires an increase in the knowledge that will allow for sustainable | |
use of natural resources. To accomplish that, detailed studies are | |
needed to evaluate multiple species and regions. These studies demand | |
great effort from professionals, searching for species and/or observing | |
their behavior. In this case, the use of new monitoring devices could be
beneficial for data collection and identification, optimizing the
specialist effort to detect and observe species in situ.
advance of technology platforms for developing connected devices and | |
sensors, associated with the evolution of the Internet of Things (IoT) | |
concepts, and the advances of unmanned aerial vehicles (UAVs) and
wireless sensor networks (WSNs), new scenarios in biodiversity studies
are possible. The technology available now could allow studies applying
relatively cheap sensors with long-range (approx. 15 km), low-power,
low-bit-rate communication and up to 10-year battery life, using a Low
Power Wide Area Network (LPWAN), with the capacity to run bio-acoustic or
image-processing detection. Platforms like the Raspberry Pi, or any other
with signal processing capabilities, can be applied (Hodgkinson and Young
2016). Sensor technology protocols applied in IoT networks are usually
simple and flexible. Common semantics and metadata definitions are
necessary to extract information and representations to construct | |
complex networks. Some of these metadata definitions can be adopted from | |
the current Darwin Core schema. However, Darwin Core evolved based on
enterprise technologies (e.g., XML) and relational database definitions,
which usually need machines with significant bandwidth to transmit data.
Today the technology scenario is taking another route, going from | |
centralized to distributed architectures, occasionally applying | |
non-relational and distributed databases, ready to deal with | |
synchronization and eventual consistency problems. These distributed | |
databases are usually employed to construct complex networks, where | |
relation restrictions are not mandatory or, sometimes, even desired | |
(Baggio et al. 2016). With these new techniques becoming a reality in | |
biodiversity conservation studies, new metadata definitions are | |
necessary. Those new metadata need to standardize and create a shared
vocabulary that covers requirements for device information exchange,
data analytics, and model generation. These new definitions could also
incorporate the Essential Biodiversity Variables (EBVs) concepts, which
aim to identify the minimum set of variables that can be used to inform
scientists, managers and decision makers (Haase et al. 2018). For this
reason, we propose the insertion of EBV definitions into the construction
of sensor-integration metadata and model characterization within the
Darwin Core metadata definitions (Fig. 1).
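As a sketch of the device-side economy such definitions must respect,
the following Python snippet expands a terse LPWAN-style payload into
Darwin Core-style terms plus an EBV annotation on the server side; the
payload layout, term choices and EBV label are assumptions for
illustration:

    import json

    def expand(payload):
        """Expand a terse field-device message into shared-vocabulary terms."""
        return {
            "dwc:decimalLatitude": payload["lat"],
            "dwc:decimalLongitude": payload["lon"],
            "dwc:eventDate": payload["t"],
            "dwc:scientificName": payload["sp"],  # from on-device detection
            "ebv:class": "Species populations",   # assumed EBV annotation
            "device:id": payload["dev"],          # assumed device term
        }

    msg = {"dev": "node-17", "t": "2018-08-21T06:00:00Z",
           "lat": -23.55, "lon": -46.63, "sp": "Ramphastos toco"}
    print(json.dumps(expand(msg), indent=2))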
Abstract | |
The Specialized Information Service Biodiversity Research (BIOfid; | |
http://biofid.de/) has recently been launched to mobilize valuable | |
biodiversity data hidden in German print sources of the past 250 years. | |
The partners involved in this project started digitisation of the | |
literature corpus envisaged for the pilot stage and provided novel | |
applications for natural language processing and visualization. In order | |
to foster development of new text mining tools, the Senckenberg | |
Biodiversity Informatics team focuses on the design of ontologies for | |
taxa and their anatomy. We present our progress for the taxa prioritized | |
by the target group for the pilot stage, i.e. for vascular plants, moths | |
and butterflies, as well as birds. With regard to our text corpus, a key
aspect of our taxonomic ontologies is the inclusion of German vernacular
names. For this purpose, we assembled a taxonomy ontology for vascular
plants by synchronizing taxon lists from the Global Biodiversity | |
Information Facility (GBIF) and the Integrated Taxonomic Information | |
System (ITIS) with K.P. Buttler's Florenliste von Deutschland | |
(http://www.kp-buttler.de/florenliste/). Hierarchical classification of | |
the taxonomic names and class relationships focus on rank and status | |
(validity vs. synonymy). All classes are additionally annotated with | |
details on scientific name, taxonomic authorship, and source. Taxonomic | |
names for birds are mainly compiled from ITIS and the International | |
Ornithological Congress (IOC) World Bird List, for moths and butterflies | |
mainly from GBIF, both lists being classified and annotated accordingly. | |
We intend to cross-link our taxonomy ontologies with the Environment | |
Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype | |
Ontology (FLOPO). For moths and butterflies we started to design the | |
Lepidoptera Anatomy Ontology (LepAO) on the basis of the already | |
available Hymenoptera Anatomy Ontology (HAO). LepAO is planned to be | |
interoperable with other ontologies in the framework of the OBO foundry. | |
A main modification of HAO is the inclusion of German anatomical terms | |
from published glossaries that we add as scientific and vernacular | |
synonyms to make use of already available identifiers (URIs) for | |
corresponding English terms. International collaboration with the | |
founders of HAO and teams focusing on other insect orders such as | |
beetles (ColAO) aims at development of a unified Insect Anatomy | |
Ontology. By restricting itself to terms applicable to all insects, the
unified Insect Anatomy Ontology is intended to establish a basis for
accelerating the design of more specific anatomy ontologies for any
particular insect order. The advancement of such ontologies aligns with
current needs to make knowledge accumulated in descriptive studies on | |
the systematics of organisms accessible to other domains. In the context | |
of BIOfid, our ontologies provide exemplars of how semantic queries over
as-yet-untapped data relevant to biodiversity studies can be achieved for
literature in non-English languages. Furthermore, BIOfid will serve as
an open access platform for professional international journals | |
facilitating non-commercial publishing of biodiversity and | |
biodiversity-related data. | |
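A minimal sketch (using the rdflib Python library, with an illustrative
namespace) of the kind of enrichment described above: annotating a taxon
class with a German vernacular name as a language-tagged label:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS
    from rdflib.namespace import OWL, SKOS

    EX = Namespace("http://example.org/biofid/")  # illustrative namespace

    g = Graph()
    taxon = EX["Quercus_robur"]
    g.add((taxon, RDF.type, OWL.Class))
    g.add((taxon, RDFS.label, Literal("Quercus robur", lang="la")))
    g.add((taxon, SKOS.altLabel, Literal("Stieleiche", lang="de")))  # vernacular

    print(g.serialize(format="turtle"))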
Abstract | |
Field data collection by Citizen Scientists has been hugely assisted by | |
the rapid development and spread of smartphones, as well as apps that
make use of the integrated technologies contained in these devices. We | |
can improve the quality of the data by increasing utilisation of the | |
device in-built sensors and improving the software user-interface. | |
Improvements to data timeliness can be made by integrating directly with | |
national and international biodiversity repositories, such as the Atlas | |
of Living Australia (ALA). | |
I will present two Citizen Science apps that we developed for the | |
conservation of two of Australia's iconic species -- the koala and the | |
echidna. First is the Koala Counter app used in the Great Koala Count 2 | |
-- a two-day Blitz-style population census. The aim was to improve both | |
the recording of citizen science effort as well as to improve the | |
recording of "absence" data which would improve population modelling. | |
Our solution was to increase the transparent use of the phone's sensors as
well as to provide an easy-to-use interface. Second is the
EchidnaCSI app -- an observational tool for collecting sightings and | |
samples of echidna. | |
From a software developer's perspective, I will provide details on | |
multi-platform app development as well as collaboration and integration | |
with the Australian national biodiversity repository -- the Atlas of | |
Living Australia. Preliminary analysis regarding data quality will be | |
presented along with lessons learned and paths for future research. I | |
also seek feedback and further ideas on possible enhancements or | |
modifications that might usefully be made to improve these techniques. | |
Abstract | |
Scratchpads are an online Virtual Research Environment (VRE) for | |
biodiversity scientists, allowing anyone to share their data and create | |
their own research networks (http://scratchpads.eu/). In operation since | |
2007, the platform has supported more than 1,000 communities in their | |
efforts to share, manage and aggregate information on the natural world. | |
Funded through a series of European Commission and United Kingdom | |
research council grants, the platform reached a height of popularity in | |
2014 with more than 14,500 users, but high levels of usage, coupled with | |
the difficulty of sustaining external funding, led to a significant | |
decline in the quality of service provision and support available to the | |
project. Consequently, the Scratchpads service was closed to new | |
communities in October 2016 and was managed on an essential care and | |
maintenance basis until new permanent funding became available in | |
December 2017. Despite these challenges, the Scratchpad system continues | |
to be used by a loyal community of taxonomists and systematists. As part | |
of our efforts to stabilise the platform and develop a sustainable | |
future for its users, we present our findings from an in-depth analysis | |
of Scratchpad usage metrics and user behaviour. We investigate the | |
growth of the Scratchpads since their inception; how global taxonomic | |
concepts have been generated, used and adapted; the geographical and | |
taxonomic coverage of Scratchpads; the functionality most popular with | |
users, and those features that failed to gain traction with the | |
community; and finally how aggregated data was used and modified by | |
select user communities. Our presentation examines the challenges of | |
maintaining a complex digital project once funding expires and the | |
initial project team disperses. We conclude with a summary of the | |
Scratchpad software development roadmap based on this quantitative | |
analysis of user behaviour. This analysis is informing the future of the
Scratchpads system and identifying how VREs for the biodiversity data
community might be developed to provide a more integrated and | |
sustainable solution to the problem of community management for | |
biodiversity data. | |
Abstract | |
The quality of data produced by citizen science (CS) programs has been | |
called into question by academic scientists, governments, and | |
corporations. Their doubts arise because they perceive CS groups as | |
intruding on the rightful opportunities of standard science and industry | |
organizations, because of a normal skepticism of novel approaches, and | |
because of a lack of understanding of how CS produces data. | |
I propose a three-pronged strategy to overcome these objections and | |
improve trust in CS data. | |
Develop methods for CS programs to advertise their efforts in data
quality control and quality assurance (QC/QA). As a first step, PPSR
Core could incorporate a field that would allow programs to point to
webpages that document the QC/QA practices of each program. It is my
experience that many programs think carefully about data quality, but | |
the CS community currently lacks an established protocol to share this | |
information. | |
Define and implement best practices for generating biodiversity data | |
using different methods. Wiggins et al. (2011) published a list of
approaches that can be used for QC/QA in CS projects, but how these
approaches should be implemented has not been systematically | |
investigated. | |
Measure and report data quality. If one takes the point of view that | |
citizen science is akin to a new category of scientific instruments, | |
then the ideas of instrument measurement and calibration can be applied
to CS. Scientists are well aware that any instrument needs to be calibrated
before its efficacy can be established. However, because CS is a new
approach, the specific procedures needed for different kinds of programs
are only now being worked out.
The strategy outlined above faces some specific challenges. Citizen | |
science biodiversity programs must address two important problems that | |
standard scientific entities encounter when sampling and monitoring | |
biodiversity. The first is correctly identifying species. For citizens | |
this can be a problem because they often do not have the training and | |
background of trained scientists. Likewise, it may be difficult for CS
projects to manage updating and maintaining the taxonomies of the | |
species being investigated. A second set of challenges is the diverse | |
kinds of biodiversity data collected by CS programs. For instance,
Notes from Nature decodes the labels of museum specimens, Snapshot
Serengeti identifies species of large mammals from camera trap
photographs, iNaturalist collects images of species and then has a
crowdsourced identification process, while eBird collects observations
of birds that are immediately filtered with computer algorithms for
review by the observer and, if subsequently flagged, reviewed by a local
expert. Each of these programs likely requires a different set of best | |
practices and methods to measure data quality. | |
Abstract | |
Pl@ntNet is an international initiative which was the first to
combine the force of citizen networks with automated
identification tools based on machine learning technologies (Joly et al. | |
2014). Launched in 2009 by a consortium involving research institutes in | |
computer sciences, ecology and agriculture, it was the starting point of | |
several scientific and technological productions (Goëau et al. 2012) | |
which finally led to the first release of the Pl@ntNet app (iOS in
February 2013 (Goëau et al. 2013) and Android (Goëau et al. 2014) the | |
following year). Initially based on 800 plant species, the app was | |
progressively enlarged to thousands of species of the European, North | |
American and tropical regions. Nowadays, the app covers more than 15 000 | |
species and is adapted to 22 regional and thematic contexts, such as the | |
Andean plant species, the wild salads of southern Europe, the indigenous | |
trees species of South Africa, the flora of the Indian Ocean Islands, | |
the New Caledonian Flora, etc. The app is translated into 11 languages
and is used by more than 3 million end-users all over the world,
mostly in Europe and the US. | |
The analysis of the data collected by Pl@ntNet users, which represent
more than 24 million observations to date, has a high potential
for different ecological and management questions. A recent work | |
(Botella et al. 2018), in particular, showed that the stream of
Pl@ntNet observations could allow a fine-grained and regular monitoring
of some species of interest such as invasive ones. However, this | |
requires cautious consideration of the contexts in which the
application is used. In this talk, we will synthesize the results of | |
this study and present another one related to phenology. Indeed, as the | |
phenological stage of the observed plants is also recorded, these data | |
offer a rich and unique material for phenological studies at large | |
geographical or taxonomical scale. We will share preliminary results | |
obtained on some important pantropical species (such as Melia
azedarach L. and Lantana camara L.), for which we have detected
significant intercontinental phenological patterns in the project
data.
Abstract | |
Many organisations running citizen science projects lack access to, or
the knowledge and means to develop, databases and apps for their
projects. Some are also concerned about long-term data management and | |
also how to make the data that they collect accessible and impactful in | |
terms of scientific research, policy and management outcomes. To solve | |
these issues, the Atlas of Living Australia (ALA) has developed | |
BioCollect. BioCollect is a sophisticated yet simple-to-use tool that
has been built in collaboration with hundreds of real users who are | |
actively involved in field data capture. It has been developed to | |
support the needs of scientists, ecologists, citizen scientists and | |
natural resource managers in the field-collection and management of | |
biodiversity, ecological and natural resource management (NRM) data. | |
BioCollect is a cloud-based facility hosted by the ALA and also includes | |
associated mobile apps for offline data collection in the field. | |
BioCollect provides form-based structured data collection for: | |
Ad-hoc survey-based records; | |
Method-based systematic structured surveys; and | |
Activity-based projects such as natural resource management intervention | |
projects (e.g. revegetation, site restoration, seed collection, weed and
pest management, etc.). | |
This session will cover how BioCollect is being used for citizen science | |
in Australia and some of the features of the tool. | |
Abstract | |
eBird is a global citizen science project that gathers observations of | |
birds. The project has been making a considerable contribution to the | |
collection and sharing of bird observations, even in the data-poorest | |
countries, and is accelerating the accumulation of bird records | |
globally. On 22 March 2018 eBird surpassed ½ billion bird observations. | |
A primary component of ensuring the best quality data is the network of | |
more than 1300 volunteer reviewers who scour incoming data for accuracy. | |
Reviewers provide active feedback to participants on everything from | |
bird identification to best practices for data collection. Since eBird's | |
inception in 2002, almost 23 million observations have been reviewed, | |
requiring more than 190,000 hours of effort by reviewers. In this | |
presentation we review how eBird recruits expert reviewers, describe | |
their responsibilities, and offer some insight in new developments to | |
improve the reviewing process. | |
How are reviewers recruited? There are three primary methods used
to identify new reviewers. First, if we don't have any active
participants in a region (e.g., Kamchatka, Russia) eBird staff search
birding listservs to find an individual who is reporting a lot of
high-quality observations from the area. We then contact those | |
individuals and offer them the opportunity to review records for the | |
region. This option has the lowest likelihood of success. Second, if an | |
individual is submitting a lot of records to eBird from a region that | |
needs a reviewer we contact them and request their participation. Third, | |
in much of the world eBird has partner groups. These partner | |
organizations (e.g., Taiwan, Spain, India, Portugal, Australia, and all | |
of the Western Hemisphere) recruit their own reviewers. The third method | |
is the most effective way to gain expert participation. | |
What does a reviewer do? eBird reviewers work to improve eBird data in | |
three primary areas. First, they develop and manage the eBird checklist | |
filters for a region. These filters generate a checklist of birds for a | |
particular time and location, and determine what records get flagged for | |
further review. Second, if an eBird participant tries to report a | |
species that is not on the checklist, or if the number of individuals of | |
a species exceeds the filter limit, then these records get flagged for | |
review. Reviewers contact the observer and request further | |
documentation. Currently, 57% of all records that are evaluated by | |
reviewers are validated. Finally, eBird reviewers check whether the
participant is eBirding correctly: that is, whether they are correctly
filling out the information on when, where, and how they went birding. It has
been our experience that different types of reviewers are required to | |
effectively review eBird submissions: those who are good at reviewing | |
bird records and those who are good at educating observers on how to | |
participate. | |
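A minimal sketch of how such a checklist filter flags records follows;
the data structures are illustrative, not eBird's actual implementation:

```python
# Sketch: flag species that are not on the regional checklist, or whose
# reported count exceeds the filter limit. `filter_limits` maps each
# expected species to a maximum count and is purely illustrative.

def flag_for_review(observations, filter_limits):
    """Return (species, count, reason) tuples needing reviewer attention."""
    flagged = []
    for species, count in observations.items():
        if species not in filter_limits:
            flagged.append((species, count, "not on checklist"))
        elif count > filter_limits[species]:
            flagged.append((species, count, "exceeds filter limit"))
    return flagged

filter_limits = {"Mallard": 500, "Gyrfalcon": 1}
observations = {"Mallard": 12, "Gyrfalcon": 3, "Smew": 1}
print(flag_for_review(observations, filter_limits))
# [('Gyrfalcon', 3, 'exceeds filter limit'), ('Smew', 1, 'not on checklist')]
```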
What are future plans? eBird will move towards more effective reviewer | |
teams, where the volume of observations can be split amongst a number of | |
individuals with different strengths, allowing identification experts to | |
focus on observation-level ID issues, and strong communicators to focus
on working with contributors on checklist-level best practices. | |
Currently, a single eBird review platform handles a broad array of | |
different reviewing functions. It is our intent to split some of these | |
functions into multiple platforms. For example, right now all review | |
happens at the database level of the 'observation': a record of a taxon | |
at a date and location. Plans are underway to develop tools that will | |
allow reviewers to work at the entire checklist level (i.e., to more | |
easily review the accuracy of how all the observations during a | |
checklist event were submitted), which will enable much more effective | |
review of checklist-level data quality concerns. | |
Abstract | |
Volunteers, researchers and citizen scientists are important | |
contributors to observation and monitoring databases. Their | |
contributions thus become part of a global digital data pool that forms
the basis for important and powerful tools for conservation, research, | |
education and policy. With the data contributed by citizen scientists | |
also come concerns about data completeness and quality. For data | |
generated by citizen scientists taxonomic bias effects, where certain | |
species (groups) are underrepresented in observations, are even stronger | |
than for professionally collected data. Identification tools that help | |
citizen scientists to access more difficult, underrepresented groups, | |
can help to close this gap. | |
We are exploring the possibilities of using artificial intelligence for | |
automatic species identification as a tool to support the registration | |
of field observations. Our aim is to offer nature enthusiasts the | |
possibility of automatically identifying species, based on photos they | |
have taken as part of an observation. Furthermore, by allowing them to | |
register these identifications as part of the observation, we aim to | |
enhance the completeness and quality of the observation database. We | |
will demonstrate the use of automatic species recognition as part of the | |
process of observation registration, using a recognition model that is | |
based on deep learning techniques. | |
We investigated automatic species recognition using deep learning
models trained with observation data of the popular website | |
Observation.org (https://observation.org/). At Observation.org data | |
quality is ensured by a review process of all observations by experts. | |
Using the pictures and corresponding validated metadata from their | |
database, models were developed covering several species groups. These | |
techniques were based on earlier work that culminated in ObsIdentify, a
free offline mobile app for identifying species based on pictures taken
in the field. The models are also made available as an API web service, | |
which allows for identification by submitting a photo through common
HTTP communication -- essentially like uploading it through a webpage.
This web service was implemented in the observation entry workflows of | |
Observation.org. By providing an automatically generated taxonomic | |
identification with each image, we expect to stimulate existing citizen | |
scientists to generate a larger quantity of, and more biodiverse,
observations. Additionally, we hope to motivate new citizen scientists to
start contributing. | |
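A minimal sketch of such an HTTP call follows; the endpoint URL and
response fields are hypothetical, not the service's actual API:

```python
# Sketch: submit a photo to an image-recognition web service and print
# the returned candidate identifications. URL and fields are placeholders.
import requests

API_URL = "https://example.org/identify"  # hypothetical endpoint

with open("observation_photo.jpg", "rb") as photo:
    response = requests.post(API_URL, files={"image": photo})
response.raise_for_status()

for match in response.json().get("predictions", []):
    print(match["species"], match["probability"])
```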
Additionally, we investigated the use of image recognition for the | |
identification of additional species in the photo other than the primary | |
subject, for example the identification of the host plant in photos of | |
insects. The Observation.org database contains many such photos that
are associated with a single species observation, while additional
species present in the photo remain unidentified.
Combining object detection to detect individual species with species | |
recognition models opens up the possibility of automatically identifying | |
and counting these species, enhancing the quality of the observations. | |
In the presentation we will present the initial results of this | |
application of deep learning technology, and discuss the possibilities | |
and challenges. | |
Abstract | |
Specimen labels are written in numerous languages and accurate | |
interpretation requires local knowledge of place names, vernacular names | |
and people's names. In many countries more than one language is in | |
common usage. Belgium, for example, has three official languages. | |
Crowdsourcing has helped many collections digitize their labels and | |
generates useful data for science. Furthermore, direct engagement of the | |
public with a herbarium increases the collection's visibility and | |
potentially reinforces a sense of common ownership. For these reasons we | |
built DoeDat, a multilingual crowdsourcing platform forked from Digivol | |
of the Australian Museum (Figs 1, 2). Some of the useful features we | |
inherited from Digivol include a georeferencing tool, configurable | |
templates, simple project management and individual institutional | |
branding. | |
Running a multilingual website does increase the work needed to set up
and manage projects, but we hope to gain from the broader engagement we | |
can attract. Currently, we are focusing our work on Belgian collections | |
where Dutch and French are the primary languages, but in the future we
may expand our languages when we work on our international collections. | |
We also hope that we can eventually merge our code with that of Digivol, | |
so that we can both benefit from each other's developments.
Abstract | |
The implementation of Citizen Science in biodiversity studies has led | |
the general public to engage in environmental actions and to contribute | |
to the conservation of natural resources (Chandler et al. 2017). | |
Smartphones have become part of the daily lives of millions of people, | |
allowing the general public to collect data and conduct automatic | |
measurements at a very low cost. Indeed, a series of Citizen Science | |
mobile applications have allowed citizens to rapidly record specimen | |
observations and contribute to the development of large biodiversity
databases around the world. Citizen Science applications have a
multitude of purposes, as well as target a variety of taxa, biological | |
questions and geographical regions. | |
Brazil is a megadiverse country that includes many threatened species | |
and biomes. Conservation efforts are urgent and the engagement of the
civil society is critical. Brazilian dry and wet forests are dominated | |
by members of the plant family Bignoniaceae, all of which are | |
characterized by beautiful trumpet-shaped flowers and a big-bang | |
flowering strategy. Species of the Neotropical Bignoniaceae trees are | |
popularly known in Brazil as "Ipê" and are broadly cultivated throughout | |
the country due to the showy flowers and strong wood. Different species | |
have different flower colors, making their identification relatively easy.
The showy and colorful flowers are extremely admired by the local | |
population and the media. Flowering of "Ipês" is triggered by dry | |
climate, lower temperatures and increasing daylight, making this group
an excellent model for phenological and climatic studies involving | |
Citizen Science. | |
Here, we developed a multi-platform mobile application focused on the | |
plant family Bignoniaceae that allows users to contribute phenological | |
data for species from this plant family. More specifically, through this | |
application the user is able to provide data about specimen locations, | |
phenology and date, all of which can be validated by a photograph. This | |
platform is based on React Native, a hybrid app framework that helps
developers reuse code across multiple mobile platforms, making
development much more efficient and keeping effort focused on the user
experience. This technology uses JavaScript as its programming language
and Facebook's React as a basis for development. The system is similar to
other CS apps such as iNaturalist. Namely, observations improve in
ranked quality through positive feedback from the community,
strengthening the network of interactions between users and
encouraging active participation. The application also allows users to
access all previously stored observations and, in turn, to suggest
improvements to a particular observation.
Furthermore, observations without a correct ID can be stored until | |
others can suggest a correct identification, maximizing the value of | |
individual observations and data gathered. | |
An important aspect of this mobile application is the participation of a | |
network of experts on this plant family, allowing a rapid and accurate | |
verification of individual observations. This team of Bignoniaceae | |
experts is also able to make full use of the data gathered by | |
correlating climate and phenological patterns. Results from these | |
analyses are provided to the citizens gathering the data which will, in | |
turn, stimulate the collection of new data, especially in poorly sampled | |
locations. This is a very dynamic mobile application, that aims to | |
engage the civil society with true scientific research, stimulating the | |
management of natural resources and conservation efforts. Through this | |
mobile app, we hope to engage the general public into biodiversity | |
studies by improving their knowledge on an iconic group of Brazilian | |
plants, while contributing data for scientific studies. The system is | |
expected to be released in May and will be available at | |
ipesdobrasil.org.br. | |
Abstract | |
The Online Pollen Catalogs Network (RCPol) (http://rcpol.org.br) was | |
conceived to promote interaction among researchers and the integration | |
of data from pollen collections, herbaria and bee collections. In order | |
to structure RCPol work, researchers and collaborators have organized | |
information on Palynology in four branches: palynoecology, | |
paleopalynology, palynotaxonomy and spores. This information is | |
collaboratively digitized and managed using standardized Google | |
Spreadsheets. These datasets are assessed by the RCPol palynology | |
experts and when a dataset is compliant with the RCPol data quality | |
policy, it is published to http://chaves.rcpol.org.br. | |
Data quality assessment used to be performed manually by the experts and | |
was time-consuming and inconsistent in detecting data quality problems
such as incomplete and inconsistent information. In order to support | |
data quality assessment in a more automated and effective way, we are | |
developing a data quality tool which implements a series of mechanisms | |
to measure, validate and improve completeness, consistency, conformity, | |
accessibility and uniqueness of data, prior to a manual expert | |
assessment. The system was designed according to the conceptual | |
framework proposed by Task Group 1 of the Biodiversity Data Quality | |
Interest Group (Veiga et al. 2017). For each sheet in the Google
Spreadsheet, the system generates a set of assertions of measures, | |
validations and amendments for the records (rows) and datasets (sheets), | |
according to a profile defined for RCPol. The profile follows the | |
policies of data quality measurement, validation and enhancement. The | |
data quality measurement policy encompasses the dimensions of
completeness, consistency, conformity, accessibility and uniqueness. | |
RCPol uses a quality assurance approach: only data that are compliant | |
with all the quality requirements are published in the system. | |
Therefore, its data quality validation policy only considers datasets | |
with 100% completeness, consistency, conformity, accessibility and | |
uniqueness. In order to improve the quality in each relevant dimension, | |
a set of enhancements was defined in the data quality enhancement | |
policy. Based on this RCPol profile, the system is able to generate | |
reports that contain measures, validations and amendments assertions | |
with the method and tool used to generate the assertion. This web-based | |
system can be tested at http://chaves.rcpol.org.br/admin/data-quality | |
with the dataset | |
https://docs.google.com/spreadsheets/u/1/d/1gH0aa2qqnAgfAixGom3Gnx6Qp91ZvWhUHPb_QeoIreQ.
This system is able to assure that only data
compliant with the data quality profile defined by RCPol are fit for use | |
and can be published. | |
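The publication gate itself can be sketched minimally as follows; the
dimension scores are illustrative inputs, not output of the real tool:

```python
# Sketch: a dataset is fit for publication only when every measured
# dimension of the RCPol profile reaches 100%.

REQUIRED_DIMENSIONS = ("completeness", "consistency", "conformity",
                       "accessibility", "uniqueness")

def fit_for_publication(dataset_measures):
    """Return True only if all required dimensions are at 100%."""
    return all(dataset_measures.get(dim) == 100.0
               for dim in REQUIRED_DIMENSIONS)

measures = {"completeness": 100.0, "consistency": 98.5, "conformity": 100.0,
            "accessibility": 100.0, "uniqueness": 100.0}
print(fit_for_publication(measures))  # False: consistency is below 100%
```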
This system contributes significantly to decreasing the workload of the | |
experts. Some data may still contain values that cannot be easily | |
automatically assessed, e.g. validating whether the content of an image
matches the respective scientific name, so manual assessment by experts remains
necessary. After the system reports that data are compliant with the | |
profile, a manual assessment must be performed by the experts, using the | |
data quality report as support, and only after that will the data be | |
published. The next steps include archiving the data quality reports
in a database, improving the web interface to enable searching and
sorting of assertions, and providing a machine-readable interface for
the data quality reports.
Abstract | |
Task Group 2 of the TDWG Data Quality Interest Group aims to provide a | |
standard suite of tests and resulting assertions that can assist with | |
filtering occurrence records for as many applications as possible. | |
Currently 'data aggregators' such as the Global Biodiversity Information | |
Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run | |
their own suite of tests over records received and report the results of | |
these tests (the assertions): there is, however, no standard reporting
mechanism. We reasoned that the availability of an internationally
agreed set of tests would encourage implementations by the aggregators, | |
and at the data sources (museums, herbaria and others) so that issues | |
could be detected and corrected early in the process. | |
All the tests are limited to Darwin Core terms. The ~95 tests, refined
from over 250 in use around the world, were classified into four output
types: validations, notifications, amendments and measures. Validations | |
test one or more Darwin Core terms, for example, that
dwc:decimalLatitude is in a valid range (i.e. between -90 and +90 | |
inclusive). Notifications report a status that a user of the record | |
should know about, for example, if there is a user-annotation associated | |
with the record. Amendments are made to one or more Darwin Core terms | |
when the information across the record can be improved, for example, if | |
there is no value for dwc:scientificName, it can be filled in from a | |
valid dwc:taxonID. Measures report values that may be useful for | |
assessing the overall quality of a record, for example, the number of | |
validation tests passed. | |
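Three of these output types can be pictured with a minimal sketch; the
field names follow Darwin Core, but the functions and return values are
simplified illustrations, not the standard's serialization:

```python
# Sketch: one validation, one amendment, and one measure over a record.

def validate_latitude(record):
    """Validation: dwc:decimalLatitude must lie between -90 and +90."""
    try:
        return -90 <= float(record.get("decimalLatitude", "")) <= 90
    except ValueError:
        return False

def amend_scientific_name(record, taxa_by_id):
    """Amendment: fill an empty dwc:scientificName from a valid dwc:taxonID."""
    if not record.get("scientificName") and record.get("taxonID") in taxa_by_id:
        record["scientificName"] = taxa_by_id[record["taxonID"]]
    return record

def measure_validations_passed(record, validations):
    """Measure: the number of validation tests the record passes."""
    return sum(1 for test in validations if test(record))

record = {"decimalLatitude": "95.0", "taxonID": "t1", "scientificName": ""}
record = amend_scientific_name(record, {"t1": "Apis mellifera"})
print(validate_latitude(record))                                # False
print(measure_validations_passed(record, [validate_latitude]))  # 0
```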
Evaluation of the tests was complex and time-consuming, but the | |
important parameters of each test have been consistently documented. | |
Each test has a globally unique identifier, a label, an output type, a | |
resource type, the Darwin Core terms used, a description, a dimension | |
(from the Framework on Data Quality from TG1), an example, references, | |
implementations (if any), test-prerequisites and notes. For each test, | |
generic code is being written that should be easy for institutions to | |
implement -- be they aggregators or data custodians. | |
A valuable product of the work of TG2 has been a set of general | |
principles. One example is "Darwin Core terms are either: | |
literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed | |
capable of validation, | |
open-ended (e.g., dwc:behavior) and cannot be assumed capable of | |
validation, or | |
bounded by an agreed vocabulary or extents, and therefore capable of | |
validation (e.g., dwc:countryCode)". | |
Another is "criteria for including tests is that they are informative, | |
relatively simple to implement, mandatory for amendments and have power | |
in that they will not likely result in 0% or 100% of all record hits." A | |
third: "Do not ascribe precision where it is unknown." | |
GBIF, the ALA and iDigBio have committed to implementing the tests once | |
they have been finalized. We are confident that many museums and | |
herbaria will also implement the tests over time. We anticipate that | |
demonstration code and a test dataset that will validate the code will | |
be available on project completion. | |
Abstract | |
In the process of sharing information, it is of the highest importance
that we utilize common codes and signifiers, so that communication is
effective. This process presents a series of complexities that are | |
related to capturing and transmitting the meaning of the information | |
despite homonymy, polysemy and synonymy. Biodiversity data sharing is | |
not exempt from these challenges and understanding the meaning often | |
requires expert knowledge. For communication to be effective, and | |
therefore for data to be of maximal re-use, we need common vocabularies | |
that unequivocally refer us to the same concepts. | |
The community has agreed upon some vocabularies to structure shared | |
information, i.e., biodiversity data standards such as the Darwin Core | |
standard (Wieczorek et al. 2012). The terms in Darwin Core can be
thought of as the names of the columns in a spreadsheet. For example, | |
there are terms such as genus, stateProvince, sex, etc. This allows us | |
to capture and share information which we agree belongs under one of | |
those terms. However, we have not yet reached an agreement on how to | |
express the permitted values under all those terms, that is, | |
vocabularies of values. As a simple example, we agree that if we have a | |
record of an organism that is a female, we will share the fact that it | |
is a female under the "sex" term, but we could represent female with the | |
values "female", "fem.", "f.", and other possible abbreviation and | |
language variants. Other more complex examples, bound to expert | |
knowledge, include biological taxonomies and how we name distinct | |
species and species concepts. | |
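A minimal sketch of mapping such variants onto a single controlled
value follows; the variant list is illustrative only:

```python
# Sketch: normalise free-text values of the dwc:sex term to a
# controlled vocabulary, including abbreviations and language variants.

SEX_VOCABULARY = {
    "female": "female", "fem.": "female", "f.": "female", "f": "female",
    "weiblich": "female",  # a German language variant of the same concept
    "male": "male", "m.": "male", "m": "male",
}

def normalise_sex(value):
    """Return the controlled value, or None for an unknown variant."""
    return SEX_VOCABULARY.get(value.strip().lower())

print(normalise_sex("Fem."))     # female
print(normalise_sex("unknown"))  # None
```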
While many vocabularies exist in the community, we currently do not | |
possess a full suite of vocabularies of values that apply uniformly | |
across the biodiversity data community and there is no single repository | |
to explore the available resources. While some of the available | |
vocabularies are discipline-specific, many that could be applied more | |
broadly remain independent and scattered. Additionally, similar lists of | |
terms that refer to the same concepts can be found in different | |
languages, but disconnected from one another. | |
The lack of or non-adherence to vocabularies of values constitutes a | |
data quality issue, as the heterogeneity in the data renders data less | |
discoverable and difficult to use. Capturing information in myriad ways | |
risks being incomplete and inaccurate in our transmission of | |
information. If we cannot be certain that a particular value | |
unambiguously refers to a particular concept, we cannot assert that a | |
record containing that value could reliably be used for a particular | |
purpose. In this context, the construction and use of vocabularies of | |
values, including the explicit declaration of usage, is a data quality | |
issue. | |
From the TDWG Data Quality Interest Group we have begun to tackle this | |
problem, with the aim of creating a suitable environment for thought and | |
development of vocabularies of values. Accordingly, a new task group has | |
been constituted, whose main goals are to: | |
prepare a scoping document in which we will determine the types of | |
vocabularies needed (including multi-lingual approaches) and the | |
strategy for organizing the construction and/or management of | |
new/existing vocabularies; | |
develop a common repository to store vocabularies and/or link to | |
existing ones; | |
develop best practices for building TDWG vocabularies; and | |
develop an exemplary vocabulary following the standard format. | |
This will provide the community with a framework to work on and build | |
upon vocabularies of values in a way that would allow better | |
understanding and maximal interoperability. | |
Abstract | |
As the world strives towards achieving Sustainable Development Goals, | |
development planners both at national and local levels have now come to | |
understand the importance of informed decision-making. Natural resources | |
management is one of the areas where careful planning is required to | |
ensure sustainable use of and maximum benefit from the services we get | |
from ecosystems. | |
In developing countries, the scarcity of resources (both in terms of | |
funding and skills) constitutes the main hindrance to the generation of | |
accurate and timely data and information that would guide planning and | |
implementation of development strategies. As a result, decisions are | |
taken on an ad-hoc basis and without possibility of appreciating the | |
long-term effect of these decisions. | |
In that regard, Albertine Rift Conservation Society (ARCOS) has | |
developed a participatory and cost-effective framework to monitor the | |
status and trends of biodiversity and ecosystem services at the | |
landscape level and to assess the socio-economic conditions that affect | |
them. | |
The approach, termed "Integrated Landscape Assessment and Monitoring --
ILAM", uses the Driver-Pressure-State-Impact-Response model and applies a
simple indicators framework that allows teams to collect needed data in
a rapid and cost-effective way (Burkhard and Müller 2008).
This approach is flexible enough to be adaptable to the available time | |
and funding resources and is therefore very suitable to be applied in | |
the context of the developing world including east-African countries. | |
This flexibility ranges from the use of GIS and remote sensing techniques
combined with thorough biodiversity field surveys to simple rapid | |
assessment of key indicators using smaller teams and for short periods | |
of time in the field. | |
Since 2013, ARCOS has been biennially conducting ILAM studies in its | |
five focal landscapes in Rwanda, Uganda and Burundi and the results have | |
influenced major decisions such as the designation of at least two | |
wetlands as Ramsar sites and the upgrade of one forest as a national | |
park. | |
In addition to this, other planning processes have been informed by the | |
results of these studies, such as the process to develop the new Rwandan | |
National Strategy for Transformation for 2017--2024 and the development | |
of the districts' strategic plans for 2018--2024. | |
Currently the biodiversity data generated through these studies is being | |
published through the Global Biodiversity Information Facility (GBIF) for wider
access by researchers and educators in the region and a portal, the | |
ARCOS Biodiversity Information Management System (ARBIMS), has been | |
established to facilitate sharing of data and information to guide | |
planning and decision-making in the region. | |
Abstract | |
Species-level observational data comprise the largest and | |
fastest-growing part of the Global Biodiversity Information Facility | |
(GBIF). The largest single contributor of species observations is eBird, | |
which so far has contributed more than 361 million records to GBIF. | |
eBird engages a vast network of human observers (citizen-scientists) to | |
report bird observations, with the goal of estimating the range, | |
abundance, habitat preferences, and trends of bird species at high | |
spatial and temporal resolutions across each species' entire life-cycle. | |
Since its inception, eBird has focused on improving the data quality of | |
its observations, primarily focused in two areas: | |
ensuring that participants describe how they gathered their
observations, and
ensuring that all observations are reviewed for accuracy.
In this presentation I will review how this is done in eBird. | |
Standardized Data Collection. eBird gathers bird observations based on
how bird watchers typically observe birds, with the unit of data
collection being a "checklist" of zero or more species, including a
count of individuals for each species observed. Participants choose the location
where they made their observations and submit their checklists via | |
Mobile Apps (50% of all submissions) or the website (50% of all | |
submissions). All checklists are submitted in a standard format | |
identifying where, how, and with whom they made their observations. | |
Mobile apps precisely record locations, the track taken, and the | |
distance traveled while making the observations. The start time and
duration of surveys are also recorded. All observers must report whether | |
they reported all the birds they detected and identified, which allows | |
analysts to infer absence of birds if they were not reported. All data | |
are stored within an Oracle data management framework. | |
Data Accuracy. The most significant data quality challenge for species | |
observations is detecting and correctly identifying organisms to | |
species. The issue involves how to handle both false positives -- the
misidentification of an observed organism -- and false negatives -- failing
to report a species that was present. The most egregious false positives | |
can be identified as anomalies that fall outside the norm of occurrence | |
for a species at a particular time or space. However, false positives | |
can also be misidentifications of common species. These challenges are | |
addressed by: | |
Data-driven filters. eBird's existing data can identify and flag | |
potentially erroneous records at increasingly fine spatial, temporal, | |
and user-specific scales. These filters can identify outliers and likely | |
errors, which are the foundation of the eBird review process. By using | |
the vetted data to identify outliers, data quality checks run against | |
expected occurrence probabilities at very fine scales and identify | |
anomalies during data submission (including on mobile devices). | |
Incorporate observer expertise scores. Observer differences are the | |
largest source of variability in eBird data. Assessment of observer | |
metrics, and the inclusion of these data in species distribution models, | |
improves analysis output and model performance. | |
Expert reviewer network. More than 2000 volunteers review records | |
identified by the data-driven filters and contact data submitters to | |
confirm their observations. The existing data quality process functions | |
globally. Currently the approach is focused on misidentified birds, but
in the future it will also cover collection event issues (e.g., issues
with protocol, location, or methodology), sensitive species, exotic
species, and better handling of widely observed individual rarities.
Additional tools will also be developed to help editors improve
efficiency and better prioritize review.
In 2017, 4,107,757 observations representing 4.6% of all eBird records | |
submitted were flagged for review by the data-driven filters. Of these
records, 57.4% were validated and 42.6% were invalidated.
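A minimal sketch of such a data-driven check follows, with illustrative
probabilities and threshold rather than eBird's actual model outputs:

```python
# Sketch: flag reported species whose expected occurrence probability at
# a given location and week falls below a threshold.

def flag_anomalies(checklist, occurrence_prob, threshold=0.001):
    """Return species whose reported presence is anomalous."""
    return [species for species in checklist
            if occurrence_prob.get(species, 0.0) < threshold]

# Illustrative expected probabilities for one location and week.
occurrence_prob = {"Snowy Owl": 0.0002, "American Robin": 0.41}
print(flag_anomalies(["Snowy Owl", "American Robin"], occurrence_prob))
# ['Snowy Owl'] -- held for expert review
```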
Abstract | |
From 81 study sites across the United States, the US National Ecological
Observatory Network (NEON) generates >75,000 samples per year. Samples
range from soil and dust deposition material to tissue samples (e.g.,
from small mammals and fish), DNA extracts, and whole organisms (e.g.,
ground beetles and ticks). Samples are collected, processed, and documented
according to protocols that are standardized across study sites and | |
according to the needs of the ecological research community for future | |
studies. NEON has faced numerous challenges with managing data related | |
to these many diverse physical samples, particularly when data are | |
gathered at numerous steps throughout processing. Here, we share these | |
challenges as well as solutions, including innovative semantically | |
driven software tools and processing pipelines that manage data from | |
each sample's point of collection to its ultimate fate (consumption,
archive facility, or partnering data repository) while maintaining links | |
across sample hierarchies. | |
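A minimal sketch of maintaining such links follows; identifiers and
fates are illustrative, not NEON's actual pipeline:

```python
# Sketch: each derived sample keeps a link to its parent, so the chain
# from ultimate fate back to the point of collection can be recovered.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    sample_id: str
    parent_id: Optional[str]  # the sample this one was derived from
    fate: str                 # consumption, archive, or partner repository

samples = {
    "SOIL-001": Sample("SOIL-001", None, "archive facility"),
    "DNA-001": Sample("DNA-001", "SOIL-001", "partner data repository"),
}

def lineage(sample_id):
    """Walk the hierarchy back to the originally collected sample."""
    chain = []
    while sample_id is not None:
        chain.append(sample_id)
        sample_id = samples[sample_id].parent_id
    return chain

print(lineage("DNA-001"))  # ['DNA-001', 'SOIL-001']
```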
Abstract | |
What is a provider (or consumer) of biodiversity data to think when one | |
quality assessment tool asserts that a particular problem exists in | |
their data, while a different tool asserts that this problem is not | |
present? Is there a problem with their data? Is there a problem with one | |
of the tools? The Biodiversity Data Quality Task Group 2 is developing a | |
suite of standardized descriptions of tests (validations, measures, | |
amendments) of biodiversity data, implementations of which would be | |
expected to provide consistent assertions about a particular data set so | |
that input of identical data sets into two different test suite | |
implementations will produce the same results (for some meaning of "the | |
same"). | |
Development of standard test definitions is a big step in the direction | |
of consistency. More is needed. Clear and detailed specifications for | |
each test will help. For example, data might have suitable quality for | |
global change analysis if collecting dates have a temporal resolution of | |
one year or less. One implementer's test may check if the event date
has a duration of 365 days or less, another might account for leap days, | |
another might test if the data can be unambiguously binned into single | |
years. For some data, each implementation will produce different | |
assertions about the record. If the standard test specification states | |
which of these meanings apply, then correct implementations should make | |
identical assertions. To tell, however, if two implementations of a | |
suite of tests will produce the same result for identical inputs we need | |
two things, one is a set of tests (of the tests), the other is an | |
understanding of what it means for results to be the same. It is | |
expected that there will be changes in the results of tests of | |
scientific names over time, and that different authorities will have | |
different opinions about that set of scientific names. One element of | |
"the same" is an expectation that results will be the same when test | |
implementations are run at the same time and with the same | |
configuration, but not necessarily otherwise. | |
Consider tests at three levels: First, tests of the internals of a test, | |
separate from the fitness for use framework (Veiga et al. 2017) or | |
serialization of test results. At this first level, unit tests are very | |
appropriate, but these are tightly coupled to the language of | |
implementation and the unit testing framework, and to the internal | |
details of the implementation. Unit tests are very effective for | |
software quality control, but not particularly portable. Second, | |
consider tests of the output of a suite of tests. At this level (of | |
integration tests), we are tightly coupled to both the fitness for use | |
framework and the serialization, and the meaning of "the same" is | |
important. Different software implementations may be expected to have | |
different orders of output for the same input, and human readable | |
comments would be expected to vary (e.g. with internationalization). | |
Identity of machine readable assertions but in varying orders should be | |
tolerable, but this is not easily accomplished. Implementation at this | |
level is difficult. Third, consider tests of the framework output of a | |
particular test. Order becomes unimportant, only machine readable | |
framework assertions can be considered, and this is probably the level | |
to target for testing. Input data for tests could be synthetic, real, or | |
modified real data. Real data has the advantage of being realistic, but | |
it is difficult to find real data which contains single issues. Clean | |
real data into which synthetic error conditions have been introduced is | |
enticing for test purposes, but risks confusion with real data, so I | |
propose some standard values for certain Darwin Core terms for | |
identifying synthetic data. | |
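A minimal sketch of an order-independent comparison at this third level
follows, using a hypothetical simplification of framework assertions:

```python
# Sketch: treat the machine-readable assertions of two implementations
# as multisets, ignoring order and human-readable comments.
from collections import Counter

def same_assertions(output_a, output_b):
    """Compare assertion outputs irrespective of order and comments."""
    key = lambda a: (a["test_id"], a["record_id"], a["status"])
    return Counter(map(key, output_a)) == Counter(map(key, output_b))

a = [{"test_id": "VALIDATION_EVENTDATE", "record_id": "r1",
      "status": "COMPLIANT", "comment": "date parses"}]
b = [{"test_id": "VALIDATION_EVENTDATE", "record_id": "r1",
      "status": "COMPLIANT", "comment": "la date est valide"}]
print(same_assertions(a, b))  # True: comments differ, assertions match
```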
Abstract | |
The ability to communicate and assess the quality and fitness for use of | |
data is crucial to ensure maximum utility and re-use. Data consumers | |
have certain requirements for the data they seek and need to be able to | |
check if a data set conforms with these requirements. Data publishers | |
aim to provide data with the highest possible quality and need to be | |
able to identify potential errors that can be addressed with the | |
available information at hand. The development and adoption of data | |
publication guidelines is one approach to define and meet those | |
requirements. However, the use of a guideline, the mapping decisions, | |
and the requirements a dataset is expected to meet, are generally not | |
communicated with the provided data. Moreover, these guidelines are | |
typically intended for humans only. | |
In this talk, we will present 'whip': a proposed syntax for data
specifications. With whip, one can define column-based constraints for | |
tabular (tidy) data using a number of rules, e.g. how data is structured | |
following Darwin Core, how a term uses controlled vocabulary values, or | |
what the expected minimum and maximum values are. These rules are human- | |
and machine-readable, which communicates the specifications and allows
them to be validated automatically in pipelines for data publication and
quality assessment, such as Kurator. Whip can be formatted as a (YAML)
text file that can be provided with the published data, communicating | |
the specifications a dataset is expected to meet. The scope of these | |
specifications can be specific to a dataset, but can also be used to | |
express expected data quality and fitness for use of a publisher, | |
consumer or community, allowing bottom-up and top-down adoption. As | |
such, these specifications are complementary to the core set of data | |
quality tests as currently under development by the TDWG Biodiversity | |
Data Quality Task Group 2. Whip rules are currently generic, but more
specific ones can be defined to address requirements for biodiversity | |
information. | |
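A minimal sketch in the spirit of whip follows, pairing a small
specification with a toy checker; the rule names are illustrative, and
the actual whip syntax is defined in its own documentation:

```python
# Sketch: column-based constraints expressed in YAML, checked row by row.
import yaml

SPEC = yaml.safe_load("""
sex:
  allowed: [female, male]
individualCount:
  min: 1
  max: 100
""")

def violations(row, spec):
    """Yield (column, rule) pairs for every violated constraint."""
    for column, rules in spec.items():
        value = row.get(column)
        if "allowed" in rules and value not in rules["allowed"]:
            yield column, "allowed"
        if "min" in rules and float(value) < rules["min"]:
            yield column, "min"
        if "max" in rules and float(value) > rules["max"]:
            yield column, "max"

row = {"sex": "fem.", "individualCount": "120"}
print(list(violations(row, SPEC)))
# [('sex', 'allowed'), ('individualCount', 'max')]
```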
Abstract | |
Georeferencing helps to fill in biodiversity information gaps, allowing | |
biodiversity data to be represented spatially to allow for valuable | |
assessments to be conducted. The South African National Biodiversity | |
Institute has embarked on a number of projects that have required the | |
georeferencing of biodiversity data to assist in assessments for | |
redlisting of species and measuring the protection levels of species. | |
Data quality is an important aspect of biodiversity information. Due to
a lack of standardisation in collection and recording methods, historical
biodiversity data collections pose a challenge when it comes to
ascertaining fitness for use or determining the quality of data. The
quality of historical locality information recorded in biodiversity data
collections faces particular scrutiny regarding fitness for use, as this
information is critical in performing assessments. A lack of descriptive
locality information, or ambiguous locality information, renders most
historical biodiversity records unfit for use. Georeferencing should essentially
improve the quality of biodiversity data, but how do you measure the | |
fitness for use of georeferenced data? | |
Through the use of the Darwin Core term coordinateUncertaintyInMeters,
georeferenced data can be queried to investigate and determine the | |
quality of the georeferenced data produced. My presentation will cover | |
the scope of ascertaining georeferenced data quality through the use of | |
the Darwin Core term coordinateUncertaintyInMeters, the impacts of using
a controlled vocabulary in representing the | |
coordinateUncertaintyInMeters, and will highlight how SANBI's | |
georeferencing efforts have contributed to data quality within the | |
management of biodiversity information. | |
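A minimal sketch of such a fitness-for-use query follows, filtering
illustrative records against an uncertainty threshold:

```python
# Sketch: keep only records whose georeference is precise enough for a
# given assessment; records with unknown uncertainty are excluded.

def fit_for_use(records, max_uncertainty_m):
    """Select records with a known uncertainty at or below the threshold."""
    return [r for r in records
            if r.get("coordinateUncertaintyInMeters") is not None
            and r["coordinateUncertaintyInMeters"] <= max_uncertainty_m]

records = [
    {"id": 1, "coordinateUncertaintyInMeters": 250},
    {"id": 2, "coordinateUncertaintyInMeters": 10000},
    {"id": 3, "coordinateUncertaintyInMeters": None},  # unknown precision
]
print(fit_for_use(records, max_uncertainty_m=2000))  # only record 1
```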
Abstract | |
As part of the Biodiversity Information System on Nature and Landscapes | |
(SINP), the French National Natural History Museum has been appointed to | |
develop biodiversity data exchanges by the French ministry in charge of | |
ecology. Given there are, quite literally, thousands of different | |
sources, such a development brings into question the underlying quality | |
of data. To add complexity, there can be several layers of quality: one | |
being appraised by the producer himself, one by a regional node, and one | |
by the national node. | |
The approach to quality issues was addressed by a dedicated working | |
group, representative of biodiversity stakeholders in France. The | |
resulting documents focus on core methodology elements that characterize | |
a data quality process for taxon occurrences only in the first instance | |
(it may be extended to habitats, geology, etc. in the near future).
Three processes are covered, how to ensure: | |
data conformity by checking for the presence of compulsory elements or | |
that a given attribute is of the right type, | |
data consistency by checking information versus other information (for | |
example, an end date has to be later than a start date), | |
and scientific validation, through either manual (use of expertise) or | |
automated (comparison with knowledge databases) means, or even a | |
combined approach that provides users with a quality appraisal of said | |
data. | |
Within the SINP, only data that has passed conformity and consistency | |
tests can be exchanged, whatever the validation level. For
example, should no expert exist for a specific taxon group,
unvalidated data can still be shared.
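A minimal sketch of the first two processes follows, using hypothetical
field names rather than the national exchange standard:

```python
# Sketch: conformity checks presence and type of compulsory elements;
# consistency checks one value against another.
from datetime import date

def conforms(record):
    """Conformity: compulsory elements are present and of the right type."""
    return (bool(record.get("taxonName"))
            and isinstance(record.get("startDate"), date)
            and isinstance(record.get("endDate"), date))

def consistent(record):
    """Consistency: the end date must not precede the start date."""
    return record["startDate"] <= record["endDate"]

record = {"taxonName": "Lutra lutra",
          "startDate": date(2018, 5, 2), "endDate": date(2018, 5, 1)}
print(conforms(record), consistent(record))  # True False: not exchangeable
```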
For scientific validation, two processes are used, one automatic that | |
uses several criteria such as comparison with a national taxonomic | |
reference database (TAXREF), and with species reference maps. The | |
combination of all these elements can be used to automatically flag data | |
for a second, deeper, manual process that allows for further scrutiny in | |
order to reach a conclusive evaluation. This allows experts to work only | |
on "doubtful" data, thus saving time. | |
In the future, other criteria that are currently used with the manual
approach, such as congruity, data scarcity on a given
species, determination difficulty, existence of associated proof | |
(specimen, picture...), knowledge of the ability of the observer, | |
databases on most frequent determination errors etc., could be added to | |
the automatic process. | |
Some elements must be included in the data to allow for comprehensive | |
testing, and have been included in a national data standard so that the | |
result of the validation process can be shared with users, allowing them | |
to judge how the data is fit for their use. | |
The presentation will deal with how such work was undertaken and how
conformity, consistency and scientific validation have been treated and | |
issues solved by the workgroup. For example, there could be a 40 million | |
data record backlog. The presentation will also show how the required | |
elements could be integrated into the French national standard. | |
Abstract | |
The success of Darwin Core and ABCD Schema as flexible standards for | |
sharing specimen data and species occurrence records has enabled GBIF to | |
aggregate around one billion data records. At the same time, other | |
thematic, national or regional aggregators have developed a wide range | |
of other data indexes and portals, many of which enrich the data by | |
interpreting and normalising elements not currently handled by GBIF or | |
by linking other data from geospatial layers, trait databases, etc. | |
Unfortunately, although each of these aggregators has specific strengths | |
and supports particular audiences, this diversification produces many | |
weaknesses and deficiencies for data publishers and for data users, | |
including: incomplete and inconsistent inclusion of relevant datasets; | |
proliferation of record identifiers; inconsistent and bespoke workflows | |
to interpret and standardise data; absence of any shared basis for | |
linked open data and annotations; divergent data formats and APIs; lack | |
of clarity around provenance and impact; etc. | |
The time is ripe for the global community to review these processes. | |
From a technical standpoint, it would be feasible to develop a shared, | |
integrated pipeline which harvested, validated and normalised all | |
relevant biodiversity data records on behalf of all stakeholders. Such a | |
system could build on TDWG expertise to standardise data checks and all | |
stages in data transformation. It could incorporate a modular structure | |
that allowed thematic, national or regional networks to generate | |
additional data elements appropriate to the needs of their users, but | |
for all of these elements to remain part of a single record with a | |
single identifier, facilitating a much more rigorous approach to linked | |
open data. Most of the other issues we currently face around | |
fitness-for-use, predictability and repeatability, transparency and | |
provenance could be supported much more readily under such a model. | |
The key challenges that would need to be overcome would be around social | |
factors, particularly to deliver a flexible and appropriate governance | |
model and to allow research networks, national agencies, etc. to embed | |
modular components within a shared workflow. Given the urgent need to | |
improve data management to support Essential Biodiversity Variables and | |
to deliver an effective global virtual natural history collection, we | |
should review these challenges and seek to establish a data management | |
and aggregation architecture that will support us for the coming | |
decades. | |
Abstract | |
Digitized natural history data are enabling a broad range of innovative | |
studies of biodiversity. Large-scale data aggregators such as Global | |
Biodiversity Information Facility (GBIF) and Integrated Digitized
Biocollections (iDigBio) provide easy, global access to millions of | |
specimen records contributed by thousands of collections. A developing | |
community of eager users of specimen data -- whether locality, image, | |
trait, etc. -- is perhaps unaware of the effort and resources required | |
to curate specimens, digitize information, capture images, mobilize | |
records, serve the data, and maintain the infrastructure (human and | |
cyber) to support all of these activities. Tracking of specimen | |
information throughout the research process is needed to provide | |
appropriate attribution to the institutions and staff that have supplied | |
and served the records. Such tracking may also allow for annotation and | |
comment on particular records or collections by the global community. | |
Detailed data tracking is also required for open, reproducible science. | |
Despite growing recognition of the value and need for thorough data | |
tracking, both technical and sociological challenges continue to impede | |
progress. In this talk, I will present a brief vision of how application | |
of a DOI to each iteration of a data set in a typical research project | |
could provide attribution to the provider, opportunity for comment and | |
annotation of records, and the foundation for reproducible science based | |
on natural history specimen records. Sociological change -- such as | |
journal requirements for data deposition of all iterations of a data set | |
-- can be accomplished using community meetings and workshops, along | |
with editorial efforts, as were applied to DNA sequence data two decades | |
ago. | |
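By way of illustration, the following sketch shows DataCite-style metadata that could chain each iteration of a data set to its predecessor and to its source; the DOIs and field values are hypothetical placeholders, not the speaker's actual implementation:

```python
# Minimal sketch: DataCite-style metadata linking data-set iterations so
# every cleaning or subsetting step remains citable. DOIs are placeholders.
dataset_v2 = {
    "doi": "10.5072/example.dataset.v2",  # DOI for this iteration
    "titles": [{"title": "Cleaned occurrence records, iteration 2"}],
    "relatedIdentifiers": [
        {   # link back to the previous iteration of the same data set
            "relatedIdentifier": "10.5072/example.dataset.v1",
            "relatedIdentifierType": "DOI",
            "relationType": "IsNewVersionOf",
        },
        {   # link back to the raw aggregator download it was derived from
            "relatedIdentifier": "10.5072/example.download",
            "relatedIdentifierType": "DOI",
            "relationType": "IsDerivedFrom",
        },
    ],
}
```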
Abstract | |
DiSSCo (The Distributed System of Scientific Collections) is a Research | |
Infrastructure (RI) aiming at providing unified physical | |
(transnational), remote (loans) and virtual (digital) access to the | |
approximately 1.5 billion biological and geological specimens in | |
collections across Europe. DiSSCo represents the largest ever formal | |
agreement between natural science museums (114 organisations across 21 | |
European countries). With political and financial support from 14 European governments and a robust governance model, DiSSCo will deliver,
by 2025, a series of innovative end-user discovery, access, | |
interpretation and analysis services for natural science collections | |
data. | |
As part of DiSSCo's developing data model, we evaluate the application
of Digital Objects (DOs), which can act as the centrepiece of its | |
architecture. DOs have bit-sequences representing some content, are | |
identified by globally unique persistent identifiers (PIDs) and are | |
associated with different types of metadata. The PIDs can be used to | |
refer to different types of information such as locations, checksums, | |
types and other metadata to enable immediate operations. In the world of | |
natural science collections, currently fragmented data classes (inter | |
alia genes, traits, occurrences) that have derived from the study of | |
physical specimens, can be re-united as parts in a virtual container | |
(i.e., as components of a Digital Object). These typed DOs, when | |
combined with software agents that scan the data offered by | |
repositories, can act as complete digital surrogates of the physical | |
specimens. | |
In this paper we: | |
investigate the architectural and technological applicability of DOs for | |
large scale data RIs for bio- and geo-diversity, | |
identify benefits and challenges of a DO approach for the DiSSCo RI and | |
describe key specifications (incl. metadata profiles) for a | |
specimen-based new DO type. | |
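To make the Digital Object idea concrete, here is a minimal sketch of a specimen-based DO as a typed container; all field names and identifiers are illustrative assumptions, not the DiSSCo specification:

```python
# Minimal sketch of a typed Digital Object re-uniting data classes derived
# from one physical specimen. Field names and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DigitalSpecimen:
    pid: str                      # globally unique persistent identifier
    object_type: str              # the registered DO type
    checksum: str                 # integrity check for the bit-sequence
    metadata: dict = field(default_factory=dict)    # descriptive metadata
    components: dict = field(default_factory=dict)  # re-united data classes

do = DigitalSpecimen(
    pid="21.T11148/0000-example",       # hypothetical Handle-style PID
    object_type="DigitalSpecimen",
    checksum="sha256:...",              # placeholder digest
    metadata={"institution": "https://ror.org/example"},
    components={
        "occurrences": ["https://example.org/occ/123"],
        "sequences": ["https://example.org/seq/XY000001"],
        "traits": ["https://example.org/trait/456"],
    },
)
```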
Abstract | |
Collections, aggregators, data re-packagers, publishers, researchers, | |
and external user groups form a complex web of data connections and | |
pipelines. This forms the natural history infrastructure essential for collections use by an ever-increasing and diverse external user community. We have made great strides in developing the individual
actors within this system and we are now well poised to utilize these | |
capabilities to address big picture questions. We need to continue work | |
on the individual aspects, but the focus now needs to be on integration | |
of the functionality provided by the actors involved in the pipeline to | |
facilitate the transfer of data between them with as few human | |
interventions as possible. In order for the system to function | |
efficiently and to the benefit of all parties, information, data, and | |
resources need not only to be integrated efficiently but also to flow in the
reverse direction (attribution) to facilitate collections advocacy and | |
sustainability. There are unrealized benefits to collections from | |
inclusion into aggregators and subsequent use by researchers and | |
publishers. A recent needs-assessment workshop of the Biodiversity Collections Network (BCoN), a National Science Foundation (NSF) funded Research Coordination Network (RCN), identified a possible solution to the integration and attribution of collections data and specimen information using a suite of unique, persistent identifiers for specimen records
(Universally Unique Identifiers or UUIDs), datasets (Digital Object | |
Identifiers or DOIs) and institutions/collections (Cool Uniform Resource | |
Identifiers or Cool URIs). This talk will highlight this potential | |
workflow and the work needed to achieve this solution while soliciting | |
participation from actors in the pipeline and the community at large. | |
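A minimal sketch of how the three identifier types could travel together on a single record; all values below are hypothetical:

```python
# Minimal sketch: one record carrying the proposed identifier suite.
import uuid

specimen_record = {
    "occurrenceID": str(uuid.uuid4()),                   # specimen-level UUID
    "datasetDOI": "https://doi.org/10.5072/example.42",  # dataset-level DOI
    "institutionURI": "http://example.org/cool/nhm-x",   # institution Cool URI
}
print(specimen_record)
```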
Abstract | |
Increasing the number of occurrence records available for biodiversity | |
research requires developing efficient pipelines from collectors and | |
observers to data aggregators and then marketing those pipelines to | |
biodiversity researchers. To be effective, these pipelines must | |
recognize that in many countries, internet access is slow, intermittent, | |
or expensive; cell phone internet access may be more common but many | |
people cannot afford the costs associated with using a cell phone for | |
databasing. The pipelines must also make it easy for users to provide | |
high quality data that conforms to international biodiversity data | |
standards. Marketing of these pipelines should include building | |
understanding of these standards and enable data providers to benefit | |
almost immediately from their contributions. Symbiota has succeeded in | |
making over 32 million specimen records available but most come from the | |
United States, a country with fast and reliable internet access in most | |
regions. We have established two Symbiota-based websites, OpenHerbarium | |
and OpenZooMuseum, to enable collectors and collections in Old World | |
countries that lack a national network to become contributors to and
participants in the global biodiversity data sharing community. Talking | |
with biodiversity researchers in such countries has clarified the many | |
impediments to data sharing faced by their collectors and collections. | |
In this presentation, we shall describe the steps we have taken, and are | |
proposing to take, to improve the pipeline for collectors and | |
collections in countries with poor internet access. | |
Abstract | |
VertNet (vertnet.org) is a collaborative project that makes biodiversity | |
data free and available on the web. VertNet is also a tool designed to help people discover, improve, and publish biodiversity data, and it is the core of a collaboration among hundreds of biocollections that contribute biodiversity data and work together to improve it. VertNet
has its genesis in the late 1990s and the very beginnings of vertebrate | |
collections data sharing, and is nearing its 20th birthday. The small | |
team that coordinates VertNet efforts long recognized the value of | |
archival versions of VertNet data separate from individual published | |
Darwin Core Archives. Here we describe why we produce what we call | |
"snapshots" of the VertNet index. To understand the snapshots, it is | |
important to also know how the VertNet indexing process works, which | |
includes efforts at better flagging record types and special content of | |
particular value to data consumers. We provide a brief explanation of | |
the process we developed for creating these snapshots, focusing on how | |
to assure their citation and licensing, and how to decide the scope of | |
different snapshots. We also discuss the collaborative process of | |
deciding infrastructure for archiving those snapshots, and our thinking | |
about timing of new snapshots. In particular, we cover the use of Google | |
BigQuery to produce snapshots and CyVerse as infrastructure for archival | |
storage. | |
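For readers unfamiliar with the mechanics, the following sketch shows what a snapshot export from BigQuery to cloud object storage might look like; the project, table and bucket names are invented, not VertNet's actual infrastructure:

```python
# Minimal sketch: export a BigQuery index table to compressed, sharded CSV
# files in cloud storage for archiving. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")
job_config = bigquery.ExtractJobConfig(
    destination_format="CSV", compression="GZIP")
extract_job = client.extract_table(
    "example-project.vertnet.index_2018",                    # source table
    "gs://example-archive/snapshots/vertnet-2018-*.csv.gz",  # sharded output
    job_config=job_config,
)
extract_job.result()  # block until the snapshot export completes
```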
Abstract | |
The South African Institute for Aquatic Biodiversity (SAIAB) operates | |
several research platforms, which may be used by the broader South | |
African research community (e.g. a marine research vessel and a remotely | |
operated underwater vehicle). SAIAB's enterprise-grade data centre,
along with expertise in systems administration and biodiversity | |
information management, allow the institute to offer a Biodiversity | |
Information Management Platform. | |
Data hosted by SAIAB are replicated across three data centres, each at least 250 m from the others and operating independently. Infrastructure at two data centres replicates in real time, forming a high availability cluster. The third data centre is dedicated to storing
backups. High-capacity tape backup will be added in the near future. As | |
an additional measure, cloud storage is used to store daily extracts of | |
Specify databases, which are retained for one year. | |
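A minimal sketch of such a daily extract, assuming a Specify database running on MySQL and an S3-compatible object store; hostnames, credentials, bucket and paths are placeholders:

```python
# Minimal sketch: dump a Specify (MySQL) database and copy the compressed
# dump to cloud object storage under a dated key. All names are placeholders.
import datetime
import subprocess

import boto3  # any S3-compatible object store client would do

stamp = datetime.date.today().isoformat()
dump_file = f"/backups/specify-{stamp}.sql.gz"

# mysqldump piped through gzip; credentials would come from a config file
with open(dump_file, "wb") as out:
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "specify"],
        stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)

boto3.client("s3").upload_file(
    dump_file, "example-saiab-backups", f"specify/{stamp}.sql.gz")
```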
In the first instance, the Platform aims to provide SAIAB researchers | |
and associates with biodiversity data curation services. This begins | |
with support for the SAIAB Collections Division, to ensure that voucher | |
specimens, tissue samples and associated media are accurately catalogued | |
and can be easily retrieved. Biodiversity data curation is broader than | |
this. It also means that any biodiversity data/metadata (records of | |
species, events, occurrences/observations and traits) can potentially be | |
curated using Specify Software, and standardised and published (subject | |
to relevant policies) to the GBIF Data Portal using the GBIF Integrated | |
Publishing Toolkit. The use of Specify Software to curate biodiversity
data that do not represent voucher specimens (e.g. underwater images and | |
video) is a new research project within SAIAB, which has the potential | |
to be extended beyond SAIAB. | |
A new national initiative, the Natural Science Collections Facility | |
(NSCF), was launched in 2017 to reinvigorate natural science museums | |
across the country, to halt deterioration of specimens and improve | |
capacity for specimen and data curation. | |
In support of the NSCF, the SAIAB platform is offered to natural science | |
museums in South Africa (excluding herbaria, which are all part of or | |
affiliated with SANBI, and therefore accommodated by a different | |
system). Each museum will be provided with a webserver, Specify 7 | |
database, Specify web portal and IPT server. | |
In offering this platform to the broader South African Biodiversity | |
Science community, SAIAB is primarily motivated by the potential for | |
collaborative research in capacity development for biodiversity data | |
curation / information management, using Specify Software. The first | |
research project will examine participating museums' capacity to use the | |
Specify Workbench sustainably, to import new voucher/occurrence records | |
generated by fieldwork. The requisite training to enhance this potential | |
will be provided. | |
The Natural Science Collections Facility (NSCF) is an important | |
collaborator in the context of enhancing the general state of South | |
Africa's specimen collections, and the Specify Collections Consortium is an important collaborator, specifically for software support.
Abstract | |
Long-term archival and disaster recovery are key components of our collective responsibility for managing digital data and metadata. As more and more data are collected digitally and as the
metadata for traditional museum collections becomes both digitized and | |
more comprehensive, the need to ensure that these data are safe and | |
accessible in the long term becomes essential. Unfortunately, disasters | |
do occur and many irreplaceable datasets on biodiversity have been | |
permanently lost. Maintaining a long-term archive and putting in place | |
reliable disaster recovery processes can be prohibitively expensive, | |
both in the cost of hardware and software as well as the costs of | |
personnel to manage and maintain an archival system. Traditionally, | |
storing digital data for the long term and ensuring the data are | |
loss-less, safe and completely recoverable when a disaster occurs has | |
been managed on-premises with a combination of on-site and off-site | |
storage. This requires complex data workflows to ensure that all data | |
are securely and redundantly stored in multiple highly dispersed | |
locations to minimize the threat of data loss due to local or regional | |
disasters. Files are often moved multiple times across operating systems | |
and media types on their way to and from a deep archive, increasing the | |
risk of file integrity issues. With the recent advent of an array of | |
Cloud Services from organizations such as Amazon, Microsoft and Google | |
to more focused offerings from Iron Mountain, Atempo and others, we have | |
a number of options for long-term archival of digital data. Deep archive solutions, i.e. storage where retrieval is expected only in the case of a disaster, are offered by many of these organizations at a rate
substantially less than their normal data storage fees. | |
The most basic requirement for an archival system is storing multiple | |
replicates of the data in geographically isolated locations with a | |
mechanism for guaranteeing file integrity, usually using a checksum | |
algorithm. Additional components that are integral to a robust archive | |
include a simple metadata search and reliable retrieval. | |
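The checksum mechanism can be illustrated with a short sketch; the manifest values and file paths below are hypothetical:

```python
# Minimal sketch: verify archived files against digests recorded at ingest.
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream the file so arbitrarily large archives can be checked."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# manifest of expected digests, written when the files were ingested
manifest = {"media/IMG_0001.tif": "9f86d081884c7d65..."}  # placeholder value
for path, expected in manifest.items():
    assert sha256sum(path) == expected, f"integrity failure: {path}"
```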
In this presentation, we'll discuss the need for long-term archival and
disaster recovery capabilities, detail the current best practices of | |
data archival systems and review a variety of archival options that have | |
become available with Cloud Services. | |
Abstract | |
The Cornell Lab of Ornithology gathers, utilizes and archives a wide | |
variety of digital assets ranging from details of a bird observation to | |
photos, video and sound recordings. Some of these datasets are fairly | |
small, while others are hundreds of terabytes. In this presentation we | |
will describe how the Lab archives these datasets to ensure the data are | |
both loss-less and recoverable in the case of a widespread disaster, how | |
the archival strategy has evolved over the years and explore in detail | |
the current hybrid cloud storage management system. | |
The Lab runs eBird and several other citizen science programs focused on | |
birds where individuals from around the globe enter their sightings into | |
a centralized database. The eBird project alone stores over 500,000,000 | |
observations and the underlying database is over a terabyte in size. | |
Birds of North America, Neotropical Birds and All About Birds are online | |
species accounts comprising a wide range of authoritative life history
articles maintained in a relatively small database. Macaulay Library is | |
the world's largest image, sound and video archive with over 6,000,000 | |
cuts totaling nearly 100 TB of data. The Bioacoustics Research Program | |
utilizes automated recording units (SWIFTs) in the forests of the US, | |
jungles of Africa and in all seven oceans to record the environment. | |
These units record 24 hours a day and gather a tremendous amount of raw
data, over 200 TB to date with an expected rate of an additional 100TB | |
per year. Lastly, BirdCams run by the Lab add a steady stream of media
detailing the reproductive cycles of a number of species. The lab is | |
committed to making these archives of the natural world available for | |
research and conservation today. More importantly, ensuring these data | |
exist and are accessible in 100 years is a critical component of the Lab | |
data strategy. | |
The data management system for these digital assets has been completely | |
overhauled to handle the rapidly increasing volume and to utilize | |
on-premises systems and cloud services in a hybrid cloud storage system | |
to ensure data are archived in a manner that is redundant, loss-less and | |
insulated from disasters yet still accessible for research. With | |
multimedia being the largest and most rapidly growing block of data, | |
cost rapidly becomes a constraining factor of archiving these data in | |
redundant, geographically isolated facilities. Datasets with a smaller footprint, such as eBird and the species accounts, allow for a wider variety of solutions, as cost is less of a factor. Using different methods to take
advantage of differing technologies and balancing cost vs recovery | |
speed, the Lab has implemented several strategies based on data | |
stability (eBird data are constantly changing), retrieval frequency | |
required for research, and overall size of the dataset. We utilize Amazon S3 and Glacier as our media archive; we tag each media file in Glacier with a set of basic Darwin Core metadata fields that key back to a master metadata database and numerous project-specific databases. Because these metadata databases are much smaller in size, yet critical for searching and retrieving a required media file, they are archived differently, with up-to-the-minute replication to prevent any data loss due to an unexpected disaster. The media files are tagged with a standard set of basic metadata so that, should the metadata databases become unavailable, retrieval of specific media and basic metadata can still occur.
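As an illustration of this tagging approach (not the Lab's actual code), the sketch below attaches a few Darwin Core fields to an archived media object; the bucket, key and values are invented:

```python
# Minimal sketch: attach basic Darwin Core fields as object tags so limited
# retrieval stays possible if the metadata databases are down.
import boto3

s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="example-media-archive",      # hypothetical archive bucket
    Key="macaulay/ML12345678.wav",       # hypothetical media object
    Tagging={"TagSet": [
        {"Key": "dwc:catalogNumber", "Value": "ML12345678"},
        {"Key": "dwc:scientificName", "Value": "Setophaga ruticilla"},
        {"Key": "dwc:eventDate", "Value": "2018-05-14"},
    ]},
)
```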
This system has allowed the Lab to place hundreds of terabytes of data into long-term archive, store them in redundant, geographically isolated locations, and provide for complete disaster recovery of the data and metadata.
Abstract | |
Validation using schemas and tools like the Darwin Core Archive Validator from GBIF is mainly seen as a method of checking data quality and fitness for use, but it is also important for long-term preservation.
We may like to think that our present (meta)data standards and formats | |
are made for eternity, but in reality we know that standards evolve, | |
formats change (some even become obsolete with time), and so do our | |
needs for storage, searching and future dissemination for re-use. So we | |
might eventually come to a point where transformation of our archival | |
records and migration to other formats will be necessary. This could | |
also mean that even if the AIPs, the Archival Information Packages stay | |
the same in storage, the DIPs, the Dissemination Information Packages | |
that we want to extract from the archive are subject to change of | |
format. Further, in order for archival information packages to be self-sustainable, as required by the OAIS model, it is important to take interdependencies between individual files in the information packages into account, already at the time of ingest and validation of the SIPs, the Submission Information Packages, and later at the points of necessary transformation and migration (from SIP to AIP, from AIP to DIP, etc.) to counter obsolescence. Validation schemas and transformation code should also be archived together with the AIPs. By ensuring compliance with standards, these tools are essential for controlling the uniformity of records in a collection, in view of future needs for transformation and migration to new, sustainable formats. An example is given of the problems encountered in transforming even a small, relatively well-defined collection of about 1,000 archival items, with substantial variation among them, caused by a lack of effective input constraints and validation at ingest.
A further assessment is made of validation errors encountered in some | |
Darwin Core Archives comprising thousands of records from some hundred | |
published datasets, and how these errors might affect a future potential | |
transformation / migration effort. Migration efforts must necessarily be | |
general in scope, while errors in datasets from non-compliance with | |
standards risk being reinforced or aggravated in the transformation | |
process, making the information contained in the resulting records more | |
difficult to interpret. The conclusion is that efforts should be made, e.g. by embedding validation measures into upload forms and other methods of information transfer (e.g. FTP, OAI-PMH), to ensure compliance with standards as closely as possible, already at the time of ingest.
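A minimal sketch of such validation at ingest; the required-term list and date rule below are illustrative, not a normative profile:

```python
# Minimal sketch: reject or flag records that omit required Darwin Core
# terms or carry malformed dates, before they enter the archival package.
import re

REQUIRED_TERMS = ["occurrenceID", "basisOfRecord", "scientificName"]
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # YYYY[-MM[-DD]]

def validate_record(record):
    errors = [f"missing {t}" for t in REQUIRED_TERMS if not record.get(t)]
    if record.get("eventDate") and not ISO_DATE.match(record["eventDate"]):
        errors.append("eventDate is not ISO 8601")
    return errors

print(validate_record({"occurrenceID": "urn:uuid:...", "eventDate": "14/05/2018"}))
# ['missing basisOfRecord', 'missing scientificName', 'eventDate is not ISO 8601']
```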
Abstract | |
Biodiversity Information Serving our Nation - BISON (bison.usgs.gov) is | |
the U.S. node to the Global Biodiversity Information Facility | |
(gbif.org), containing more than 375 million documented locations for | |
all species in the U.S. It is hosted by the United States Geological | |
Survey (USGS) and includes a web site and application programming | |
interface for apps and other websites to use for free. With this massive | |
database one can see not only the 15 million records for nearly 10 | |
thousand non-native species in the U.S. and its territories, but also | |
their relationship to all of the other species in the country as well as | |
their full national range. Leveraging this huge resource and its | |
enterprise level cyberinfrastructure, USGS BISON staff have created a | |
value-added feature by labeling non-native species records, even where | |
contributing datasets have not provided such labels. | |
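The labeling step can be illustrated with a short sketch (not the BISON implementation); the name set stands in for the compiled list of non-native species names:

```python
# Minimal sketch: flag records whose scientific name appears in a compiled
# non-native species list, even when the dataset supplied no such label.
NON_NATIVE = {"Sus scrofa", "Dreissena polymorpha"}  # placeholder compilation

def label_record(record):
    if record.get("scientificName") in NON_NATIVE:
        record["establishmentMeans"] = "introduced"  # Darwin Core term
    return record

print(label_record({"scientificName": "Dreissena polymorpha"}))
```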
Based on our ongoing four-year compilation of non-native species | |
scientific names from the literature, specific examples will be shared of the ambiguity and evolution of the terms that have been discovered, as they relate to invasiveness, impact, dispersal, and management. The idea
of incorporating these terms into an invasive species extension to | |
Darwin Core has been discussed by Biodiversity Information Standards | |
(TDWG) working group participants since at least 2005. One roadblock to | |
the implementation of this standard's extension has been the diverse
terminology used to describe the characteristics of biological | |
invasions, terminology which has evolved significantly over the past | |
decade. | |
Abstract | |
Reducing the damage caused by invasive species requires a community | |
approach informed by rapidly mobilized data. Even if local stakeholders | |
work together, invasive species do not respect borders, and national, | |
continental and global policies are required. Yet, in general, data on | |
invasive species are slow to be mobilized, often of insufficient quality | |
for their intended application and distributed among many stakeholders | |
and their organizations, including scientists, land managers, and | |
citizen scientists. The Belgian situation is typical. We struggle with | |
the fragmentation of data sources and restrictions on data mobility.
Nevertheless, there is a common view that the issue of invasive alien | |
species needs to be addressed. In 2017 we launched the Tracking Invasive | |
Alien Species (TrIAS) project, which envisages a future where alien | |
species data are rapidly mobilized, the spread of exotic species is | |
regularly monitored, and potential impacts and risks are rapidly | |
evaluated in support of policy decisions (Vanderhoeven et al. 2017). | |
TrIAS is building a seamless, data-driven workflow, from raw data to | |
policy support documentation. TrIAS brings together 21 different | |
stakeholder organizations that cover all organisms in the
terrestrial, freshwater and marine environments. These organizations | |
also include those involved in citizen science, research and wildlife | |
management. | |
TrIAS is an Open Science project and all the software, data and | |
documentation are being shared openly (Groom et al. 2018). This means | |
that the workflow can be reused as a whole or in part, either after the | |
project or in different countries. We hope to prove that rapid data workflows are indispensable tools, not only for the control of invasive species, but also for integrating and motivating the citizens and organizations involved.
Abstract | |
The Global Register of Introduced and Invasive Species (GRIIS) presents | |
annotated country checklists of introduced and invasive species. | |
Annotations include higher taxonomy of the species, synonyms, | |
environment/system in which the species occurs, and its biological | |
status in that country. Invasiveness is classified according to evidenced impact in that country. Draft country checklists are subjected to a process of
validation and verification by networks of country experts. Challenges | |
encountered across the world include confusion with alien/invasive | |
species terminology, classification of the 'invasive' status of an alien | |
species and issues with taxonomic synonyms. | |
Abstract | |
North America's Great Lakes contain 21% of the planet's fresh water, and | |
their protection is a matter of national security to both the USA & | |
Canada. One of the greatest threats to the health of this unparalleled | |
natural resource is invasion by non-indigenous species, several of which | |
already have had catastrophic impacts on property values, the fisheries, | |
shipping, and tourism industries, and continue to threaten the survival | |
of native species and wetland ecosystems. | |
The Great Lakes Invasives Network is a consortium (20 institutions) of | |
herbaria and zoology museums from among the Great Lakes states of | |
Minnesota, Wisconsin, Illinois, Indiana, Michigan, Ohio, and New York | |
created to better document the occurrence of selected non-indigenous | |
species and their congeners in space and time by imaging and providing | |
online access to the information on the specimens of the critical | |
organisms. The list of non-indigenous species (1 alga, 42 vascular | |
plants, 22 fish, and 13 mollusks) to be digitized was generated by | |
conducting a query of all fish, plants, algae, and mollusks present in | |
the database of GLANSIS -- the Great Lakes Aquatic Nonindigenous Species | |
Information System -- maintained by the National Oceanic and Atmospheric | |
Administration (NOAA). The network consists of collections at 20 | |
institutions, including 4 of the 10 largest herbaria in North America, | |
each of which curates 1-7 million specimens (NY, F, MICH, and WIS). | |
Eight of the nation's largest zoology museums are also represented, | |
several of which (e.g., Ohio State and U of Minnesota) are | |
internationally recognized for their fish and mollusk collections. | |
Each genus includes at least one species that is considered a Great | |
Lakes non-indigenous taxon -- several have many, whereas others have | |
congeners on "watchlists", meaning that they have not arrived in the | |
Great Lakes Basin yet, but have the potential to do so, especially in | |
light of human activity and climate change. Because the introduction and | |
spread of these species, their close relatives, and hybrids into the | |
region is known to have occurred almost entirely from areas in North | |
America outside of the Basin, our effort will include non-indigenous | |
specimens collected from throughout North America. | |
Digitized specimens of Great Lakes non-indigenous species and their | |
congeners will allow for more accurate identification of invasive | |
species and hybrids from their non-invasive relatives by a wider | |
audience of end users. The metadata derived from digitized specimens of | |
Great Lakes non-indigenous species and their congeners will help | |
biologists to track, monitor, and predict the spread of invasive species | |
through space and time, especially in the face of a more rapidly | |
changing climate in the upper Midwest. Altogether, consortium members will digitize >2 million individual specimens from >860,000 sheets/lots of non-indigenous species and their congeneric taxa. Data and metadata are uploaded to the Great Lakes Invasives Network, a Symbiota portal (GreatLakesInvasives.org), and ingested by iDigBio (iDigBio.org), the national resource for Advancing Digitization of Biodiversity Collections (ADBC).
Several initiatives are already in place to alert citizens to the | |
dangers of spreading aquatic invasive species among our nation's
waterways, but this project is developing complementary scientific and | |
educational tools for scientists, students, wildlife officers, teachers, | |
and the public who have had little access to images or data derived | |
directly from preserved specimens of invasive species collected over the | |
past three centuries. | |
Abstract | |
Agriculture and Agri-Food Canada (AAFC) is home to numerous specimen and | |
environmental collections generating highly relational data sets that | |
are analyzed using molecular methods (Sanger and NGS). The need to have | |
a system to properly manage these data sets and to capture accurate, | |
standardized metadata over entire laboratory workflows has been a | |
long-term strategic vision of the Biodiversity group at AAFC. Without | |
robust tracking, many difficulties arise when trying to publish or | |
submit data to external repositories. Even knowing what work has been carried out on individual collection records over a researcher's career becomes a demanding task, if the information is retrievable at all. SeqDB
was built to resolve these issues by centralizing, standardizing and | |
improving the availability and data quality of source specimen | |
collection data that is being studied using molecular methods. SeqDB | |
also facilitates integration with tools and external repositories in order to relieve researchers and technicians of having to create adequate systems to track and mobilize their data sets, allowing them to focus on research and collection management.
The development of SeqDB aligns with agile development methodologies and | |
attempts to fulfill rapidly emerging needs from genetics and genomics | |
research, which can evolve and fade quickly at times or be without clear | |
requirements. The success of SeqDB as an application supporting DNA | |
sequencing workflows has put it in the same space as other monolithic | |
architectures before it. As the feature set to support the application | |
continues to increase, the number of software developers vs operations | |
and maintenance staff is difficult to rebalance in our organisation. In | |
an effort to manage the scope for the project and ensure we are able to | |
continue to deliver on our mandate, the sequence tracking workflows of | |
the application will become part of the DINA ecosystem ("DIgital | |
information system for NAtural history data", https://dina-project.net). | |
Other functions of SeqDB, such as collections management and taxonomy
tree curation, will be replaced with the DINA modules implementing these | |
functions. | |
In order to allow SeqDB to become a module of DINA, it has been decided to refactor the application and base it on a Service-Oriented Architecture. In doing so, all molecular data in SeqDB will be exposed as JSON API web services (JavaScript Object Notation application programming interfaces), allowing other modules, user interfaces and the current SeqDB application to communicate in a standardised way. The new architecture will also bring an important technology upgrade for SeqDB, with the front end eventually becoming a project in itself.
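To indicate what exposing data as JSON API web services implies in practice, here is a sketch of a response document following the jsonapi.org conventions; the resource and field names are hypothetical, not the SeqDB schema:

```python
# Minimal sketch of a JSON API (jsonapi.org) resource document: every
# resource travels with a type, an id, attributes and relationships.
dna_sequence_response = {
    "data": {
        "type": "dna-sequences",
        "id": "seq-0001",
        "attributes": {
            "region": "ITS2",
            "sequence": "ACGT...",  # placeholder
        },
        "relationships": {
            "specimen": {"data": {"type": "specimens", "id": "spec-042"}},
        },
    },
}
```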
Abstract | |
As the biodiversity community increasingly adopts Semantic Web (SW) | |
standards to represent taxonomic registers, trait banks or museum | |
collections, some questions come up relentlessly: How to model the data? | |
For what goals? Can the same model fulfill different goals? | |
So far, the community has mostly considered the SW standards through | |
their most salient manifestation: the Web of Linked Data (Heath and | |
Bizer 2011). Indeed, the 5-star Linked Data principles are geared | |
towards the building of a large, distributed knowledge graph that may | |
successfully fulfill biodiversity's need for interoperability and data | |
integration. However, the SW addresses a much broader set of problems | |
involving automatic reasoning. For instance, reasoners can exploit | |
ontological knowledge to improve query answering, leverage class | |
definitions to infer class subsumption relationships, or classify | |
individuals i.e. compute instance relationships between individuals and | |
classes by applying reasoning techniques on class definitions and | |
instance descriptions (Shearer et al. 2008). | |
Whether a \"thing\" should be modelled as a class or a class instance | |
has been debated at length in the SW community, and the answer is often | |
a matter of perspective. In the context of taxonomic registers for | |
example, the NCBI Organismal Classification (Federhen 2012) and | |
Vertebrate Taxonomy Ontology (Midford et al. 2013) represent taxa as | |
classes in the Ontology Web Language (OWL). By contrast, other | |
initiatives represent taxa as instances of various classes, e.g. the | |
SKOS Concept class (skos:Concept) in the AGROVOC thesaurus (Caracciolo | |
et al. 2013) (we speak of the instances as SKOS concepts), the Darwin | |
Core taxon class (dwc:Taxon) in Encyclopedia of Life (Parr et al. 2016), | |
or classes depicting taxonomic ranks in GeoSpecies, DBpedia and the BBC | |
Wildlife Ontology. Such modelling discrepancies impede linking congruent | |
taxa throughout taxonomic registers. Indeed, one can state the | |
equivalence between two classes (with owl:equivalentClass) or two class | |
instances (with owl:sameAs, skos:exactMatch, etc.), but good practices | |
discourage the alignment of classes with class instances (Baader et al. | |
2003). | |
Recently, Darwin Core\'s popularity has fostered the modeling of taxa as | |
instances of class dwc:Taxon (Senderov et al. 2018, Parr et al. 2016). | |
In this context, pragmatism may incline a Linked Data provider to comply | |
with this majority trend to ensure maximum interlinking. Although | |
technically and conceptually valid, this choice entails certain | |
drawbacks. First, considering a taxon only as an instance misses the
fact that it is a set of biological individuals with common | |
characteristics. An OWL class exactly captures this semantics through | |
the set of necessary and sufficient conditions that an individual must | |
meet to be a class member. In turn, an OWL reasoner can leverage this | |
knowledge to perform query answering, compute subsumption or instance | |
relationships. By contrast, taxa depicted by class instances are not | |
defined but described by stating their properties. Hence the second | |
drawback: unless we develop bespoke reasoners, there is not much a | |
standard OWL reasoner can deduce from instances. | |
Yet, some works have demonstrated the effectiveness of logic | |
representation and reasoning capabilities, e.g. computing the alignments | |
of two primate classifications (Franz et al. 2016) using generic | |
reasoners that nevertheless require proprietary input formats. OWL | |
reasoners are typically designed to solve such classification problems. | |
They may leverage taxonomic ontologies to compute alignments with other | |
ontologies or apply reasoning to individuals' properties to infer their
species. Hence, pragmatically following the instance-based approach may | |
indeed maximize interlinking in the short term, but bears the risk of | |
denying ourselves potentially desirable use cases in the longer term. We | |
believe that developing class-based ontologies for biodiversity should | |
help leverage the SW's extensive theoretical and practical works to | |
tackle a variety of use cases that so far have been addressed with | |
bespoke solutions. | |
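The two modelling options can be contrasted in a short sketch using the rdflib library; the URIs below are hypothetical:

```python
# Minimal sketch of the two modelling options: taxon-as-class supports
# subsumption reasoning; taxon-as-instance supports description and linking.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/taxon/")
DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()

# Option 1: taxa as OWL classes, so Panthera leo is subsumed by Panthera
g.add((EX.Panthera_leo, RDF.type, OWL.Class))
g.add((EX.Panthera_leo, RDFS.subClassOf, EX.Panthera))

# Option 2: taxa as instances of dwc:Taxon, described by their properties
g.add((EX.lion, RDF.type, DWC.Taxon))
g.add((EX.lion, DWC.scientificName, Literal("Panthera leo")))
```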
Abstract | |
The DINA Consortium ("DIgital information system for NAtural history | |
data", https://dina-project.net,Fig. 1 was formed in order to provide a | |
framework for like-minded large natural history collection-holding | |
institutions to collaborate through a distributed Open Source | |
development model to produce a flexible and sustainable collection | |
management system. Target collections include zoological, botanical, | |
mycological, geological and paleontological collections, living | |
collections, biodiversity inventories, observation records, and | |
molecular data. | |
The DINA system is architected as a loosely-coupled set of several | |
web-based modules. The conceptual basis for this modular ecosystem is a | |
compilation of comprehensive guidelines for Web application programming | |
interfaces (APIs) to guarantee the interoperability of its components. | |
Thus, all DINA components can be modified or even replaced by other | |
components without crashing the rest of the system as long as they are | |
DINA compliant. Furthermore, the modularity enables the institutions to | |
host only the components they need. DINA focuses on an Open Source | |
software philosophy and on community-driven open development, so the | |
contributors share their development resources and expertise outside of | |
their own institutions. | |
One of the overarching reasons to develop a new collection management | |
system is the need to better model complex relationships between | |
collection objects (typically specimens) involving their derivatives, | |
preparations and storage. We will discuss enhancements made in the DINA | |
data model to better represent these relationships and the influence it | |
has on the management of these objects, and on the sharing of | |
information. Technical detail of various components of the DINA system | |
will be shown in other talks in this symposium followed by a discussion | |
session. | |
Abstract | |
The DINA Symposium ("DIgital information system for NAtural history | |
data", https://dina-project.net) ends with a plenary session involving | |
the audience to discuss the interplay of collection management and | |
software tools. The discussion will touch different areas and issues | |
such as: | |
(1) Collection management using modern technology:
How should and could collections be managed using current technology? What is the ultimate objective of using a new collection management system?
How should traditional management processes be changed? | |
(2) Development and community
Why are there so many collection management systems? | |
Why is it so difficult to create one system that fits everyone's | |
requirements? | |
How could the community of developers and collection staff be built | |
around DINA project in the future? | |
(3) Features and tools
How to identify needs that are common to all collections? | |
What are the new tools and technologies that could facilitate collection | |
management? | |
How could those tools be implemented as DINA compliant services? | |
(4) Data
What data must be captured about collections and specimens? | |
What criteria need to be applied in order to distinguish essential and | |
"nice-to-have" information? | |
How should established data standards (e.g. Darwin Core & ABCD (Access | |
to Biological Collection Data)) be used to share data from rich and | |
diverse data models? | |
In addition to the plenary discussion around these questions, we will agree on a streamlined format for continuing the discussion in order to write a white paper on them. The results and outcome of the session will constitute the basis of the paper and will subsequently be refined.
Abstract | |
In order to ensure long-term commitment to the DINA project ("DIgital | |
information system for NAtural history data", https://dina-project.net), | |
it is essential to continuously deliver features of high value to the | |
user community. This is also what agile software development methods try | |
to achieve by emphasizing early delivery, rapid response to changes and | |
close collaboration with users (see for example the Manifesto for Agile | |
Software Development at http://agilemanifesto.org). We will give a brief | |
overview on how current development of the DINA collection management | |
system core is guided by agile principles. The mammal collection at the | |
Swedish Museum of Natural History will be used as an example. | |
Developing a cross-disciplinary collection management system is a | |
complex task that poses many challenges: Which features should we focus | |
on? What kinds of data should the system ultimately support? How can the | |
system be flexible but still easy to use? Since we cannot do everything | |
at once, we work towards a minimum viable product (MVP) that contains | |
just enough features at a time to bring value for selected target users. | |
In the mammal collection case, the MVP is the simplest product that is | |
able to replace the functions of the current system used for managing | |
the collection. As we begin to work with other collections, new MVPs are | |
defined and used to guide further development. Thus, the set of features | |
available will increase with each MVP, benefiting both new and current | |
users. | |
Another big challenge is migration of legacy data, which is labor | |
intensive and involves standardizing data that are not compatible with | |
the new system. To address these issues, we aim to build a flexible data | |
model that allows less structured data to coexist with more complex, | |
highly structured data. Migration should thus not require extensive data | |
standardization, transformation and cleaning. The plan is to instead | |
offer tools for transforming and cleaning the data after they have been | |
imported. With the data in place, it will be easier for the user to | |
provide feedback and suggest new features. | |
Abstract | |
The DINA system ("DIgital information system for NAtural history data", | |
https://dina-project.net) consists of several web-based services that | |
fulfill specific tasks. Most of the existing services cover single core features of the collection management system and can be used either as integrated components in the DINA environment or as stand-alone services.
In this presentation, individual services will be highlighted, as they represent technically interesting approaches and practical solutions to daily challenges in collection management, data curation and migration workflows. The focus will be on the following topics: (1) a generic
reporting and label printing service, (2) practical decisions on | |
taxonomic references in collection data and (3) the generic management | |
and referencing of related research data and metadata: | |
Reporting as presented in this context is defined as an extraction and | |
subsequent compilation of information from the collection management | |
system rather than just summarizing statistics. With this quite broad understanding of the term, the DINA Reports & Labels Service (Museum für Naturkunde Berlin 2018) can assist in several different collection workflows, such as generating labels, barcodes, specimen lists, vouchers, paper loan forms, etc. As it is based on customizable HTML templates, it can even be used for creating customized web forms for any kind of interaction (e.g. annotations).
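A minimal sketch of template-driven label generation in this spirit (illustrative only; the DINA service itself is built on customizable HTML templates, but the template and field names below are invented):

```python
# Minimal sketch: render collection data into an HTML label template.
from string import Template

LABEL_TEMPLATE = Template("""
<div class="label">
  <strong>$scientificName</strong><br/>
  $locality, $eventDate<br/>
  leg. $recordedBy ($catalogNumber)
</div>
""")

record = {
    "scientificName": "Carabus auratus",
    "locality": "Berlin, Grunewald",
    "eventDate": "2018-06-02",
    "recordedBy": "A. Collector",
    "catalogNumber": "ZMB-000123",
}
print(LABEL_TEMPLATE.substitute(record))
```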
Many collection management systems try to cope with taxonomic issues, | |
because in practice taxonomy is used not only for determinations, but | |
also for organizing the collections and categorizing storage units (e.g. | |
"Coleoptera hall"). Addressing taxonomic challenges in a collection | |
management system can slow down development and add complexity for the | |
users. The DINA system uncouples these issues in a simple taxonomic | |
service for the sole assignment of names to specimens, for example | |
determinations. This draws a clear line between collection management | |
and taxonomic research, of which the latter can be supported in a | |
separate service. | |
While the digitization of collection data and workflows proceeds, | |
linking related data is essential for data management and enrichment. In many institutions, research data are disconnected from the collection specimen data because their type and structure cannot easily be included in the collection management databases. With the DINA Generic Data
Module (Museum für Naturkunde Berlin 2017) a service exists that allows | |
for attaching any relational data structures to the DINA system. It can | |
also be used as a standalone service that accommodates structured data | |
within a DINA compliant interface for data management. | |
Abstract | |
The large efforts to document and map aboveground biodiversity have | |
helped to elucidate ecological and evolutionary mechanisms and | |
processes, predict responses to global change, and identify potential | |
management options in response to those changes. Yet these concepts have | |
mostly been applied to aboveground plant and animal communities, while | |
microbial diversity remains difficult to incorporate. The ability to | |
integrate microbial sequence data into an accessible global | |
infrastructure has previously been limited by a few key factors: First, | |
most microbial diversity remains undescribed and unknown; there is
just an enormous amount of biodiversity. Second, there is a lack of | |
congruence between the many disparate microbial datasets (e.g. taxonomy, | |
phylogeny, and methodological biases), which limits the ability to | |
monitor and quantify global patterns of the terrestrial microbiome. | |
Finally, there is a lack of coordination and networking between | |
scientists studying microbes. In this presentation I will discuss two | |
case studies that highlight how we can begin to link microbial data to the already well-established macro-knowledge and to other environmental databases (like global carbon maps).
Study 1 -- a megameta analysis: The emergence of high-throughput DNA | |
sequencing methods provides unprecedented opportunities to further | |
unravel microbial ecology and its worldwide role from human health to | |
ecosystem functioning. However, in spite of the abundance of sequencing | |
studies, combining data from multiple individual studies to address | |
macroecological questions of bacterial diversity remains methodically | |
challenging and plagued with biases. While previous meta-analysis | |
efforts have focused on diversity measures or abundances of major taxa, | |
in a recent study (1) we show that disparate amplicon sequence data can
be combined at the taxonomy-based level to assess bacterial community | |
structure. Using a machine learning approach, we found that rarer taxa | |
are more important for structuring soil communities than abundant taxa. | |
We concluded that combining data from independent studies can be used to | |
explore novel patterns in bacterial communities, identify potential | |
'indicator' taxa with an important role in structuring communities, and propose new hypotheses on previously overlooked factors that shape microbial biogeography.
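For illustration, here is a sketch in the spirit of such a machine-learning analysis (random placeholder data, not the study's actual pipeline): fit a random forest on a site-by-taxon abundance table and rank the taxa by their importance for predicting community groupings:

```python
# Minimal sketch: rank taxa by random-forest feature importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(2, size=(120, 50))  # 120 samples x 50 taxa (abundances)
y = rng.integers(0, 3, size=120)    # e.g. a habitat category per sample

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]
print("taxa most important for structuring the communities:", ranked[:5])
```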
Study 2 -- a global soil biodiversity database: Greater access to | |
microbial data is an important next step for biodiversity research and | |
conservation, and for understanding the ecology and evolution of | |
microbial communities. In collaboration with the Global Soil | |
Biodiversity Initiative and the German Biodiversity Synthesis Centre | |
(sDIV) we outlined steps that must be taken to ensure microbial sequence | |
data can be included in global measures and maps of biodiversity (2).
Here I will discuss how the plant associated microbiome is an optimal | |
starting point to synthesize microbial sequence data on an open and | |
global platform. The plant-microbiome is an optimal model system that | |
goes across scales and time, can act as a bridge between microorganisms | |
and macroorganisms, and as an opportunity to more thoroughly explore the | |
synthesis of global microbial sequence data (for a global soil | |
biodiversity database). Beyond expanding primary research, the patterns | |
discovered in a synthesis of plant-microbiome can be used to explore and | |
guide ecosystem restoration and sustainability. Overall, a better | |
understanding of microbial biodiversity will help to predict | |
consequences of (human-induced) global changes and facilitate | |
conservation and adaptation responses. | |
(1) Ramirez, K.S., C.G. Knight et al. and F.T. de Vries (2017). Detecting macroecological patterns in bacterial communities across independent studies of global soils. Nature Microbiology.
(2) Ramirez, K.S., M. Döring, N. Eisenhauer, C. Gardi, J. Ladau, J.W. Leff, G. Lentendu, Z. Lindo, M.C. Rillig, D. Russell, S. Scheu, M.G. St. John, F.T. de Vries, T. Wubet, W.H. van der Putten and D.H. Wall (2015). Towards a global platform for linking soil biodiversity data. Frontiers in Ecology and Evolution 3(91). doi: 10.3389/fevo.2015.00091
Abstract | |
Traditionally, taxonomic characterisation of organisms has relied on | |
their morphology; however, molecular methods are increasingly used to | |
monitor and assess biodiversity and ecosystem health. Approaches such as | |
DNA amplicon diversity assessments are particularly useful tools when
morphology-based taxonomy is difficult or taxa are morphologically | |
ambiguous, for example for freshwater bacteria and fungi as well as many | |
freshwater invertebrate species. DNA metabarcoding provides the ability | |
to distinguish cryptic taxa (which can differ markedly in their | |
ecological requirements and tolerances) and in addition it can provide | |
valuable insights into the genetic and ecological diversity of taxa and | |
ecosystems. While DNA metabarcoding has been used mostly on tissue of | |
sampled specimens, recent years have seen an increased use of | |
metabarcoding on environmental DNA samples: DNA extracted not from | |
sampled specimens, but from the surrounding soil or water. However, the | |
ability of metabarcoding of specimens and metabarcoding of environmental | |
DNA (eDNA) to assess biodiversity and the impact of anthropogenic | |
stressors on freshwater ecosystems is largely understudied. In this | |
talk, several studies that document the advantages and still open | |
challenges of (e)DNA metabarcoding for assessing impacts of | |
environmental stressors on aquatic ecosystems will be presented. These | |
studies, performed in Europe and New Zealand, integrate impacts across | |
different biotic groups, i.e. look at stressor effects on bacterial, | |
protist, fungal and macroinvertebrate communities. Specifically, we use | |
various case studies from freshwater ecosystems to address the following | |
questions: | |
whether eDNA samples, which can be relatively quickly obtained from the | |
water, can act as reliable proxies for catchment-level stressor impacts | |
by comparing these to DNA obtained from local bulk samples, and | |
whether DNA metabarcoding data can also provide quantitative information | |
rather than only presence-absence data. | |
In view of the case studies presented, a perspective on the urgent next | |
steps that need to be taken in order to include genetic tools in routine | |
biomonitoring will be derived and linked to the vision of the | |
international network DNAqua-Net. | |
Abstract | |
Adventitious roots in canopy soils associated with silver beech | |
(Lophozonia menziesii (Hook.f.) Heenan & Smissen (Nothofagaceae)) form | |
ectomycorrhizal associations. We used amplicon sequencing of the | |
internal transcribed spacer 2 region to compare diversity of | |
ectomycorrhizal fungal species in canopy and terrestrial sites. The | |
study data are archived as an NCBI BioProject (accession PRJNA421209), with the raw DNA sequence reads available from the NCBI Sequence Read Archive (SRA637723). Community composition of canopy ectomycorrhizal fungi was significantly different from the terrestrial community composition,
with several abundant ectomycorrhizal species significantly more | |
represented in the terrestrial soil than the canopy soil. Additionally, | |
we found evidence that an introduced ectomycorrhizal species was present | |
in these native forest soils. We identified OTUs in two ways: (i) by | |
manually curated BLAST searching of the NCBI nr database, and (ii) by | |
comparison with Species Hypotheses on UNITE v.7.2. We aimed to make species identifications where we could be reasonably confident that they were robust, but we had to avoid making identifications where an incorrect name could have implications for biosecurity or for our understanding of biodiversity and biogeography. We found some UNITE Species Hypotheses
included sequences of more than one taxon, which we were able to | |
separate and distinguish by phylogenetic analysis. Consequently we | |
exercised caution in reporting names based on the Species Hypotheses. | |
Using data from this case study, we will illustrate the achievements and | |
challenges faced in identifying species of ectomycorrhizal fungi from | |
DNA barcodes. Most DNA sequences of ectomycorrhizal fungi matched | |
closely New Zealand voucher specimens stored in either the New Zealand | |
Fungal Herbarium (PDD) or the Otago Regional Herbarium (OTA), which | |
facilitated the validation of identifications. In the case of PDD | |
specimens, collection and DNA data were linked via the Systematics | |
Collections Data database (https://scd.landcareresearch.co.nz). We are | |
working towards a similar database for OTA specimens, using the Specify | |
6 database platform. | |
Abstract | |
Several national and international environmental laws require countries | |
to meet clearly defined targets with respect to the ecological status of | |
aquatic ecosystems. In Europe, the EU-Water Framework Directive (WFD; | |
2000/60/EC) represents such a detailed piece of legislation. The WFD requires the European member countries to achieve an at least 'good' ecological status of all surface waters by the year 2027 at the latest. In order to assess the ecological status of a given water body under the WFD, data on its aquatic biodiversity are obtained and compared to a reference status. The mismatch between these two metrics is then used to derive the respective ecological status class. While the
workflow to carry out the assessment is well established, it relies on only a few biological groups (typically fish, macroinvertebrates and a few algal taxa such as diatoms), is time consuming, and remains at a lower taxonomic resolution, so that the identifications can be done routinely by non-experts with an acceptable learning curve. Here, novel genetic and genomic tools provide new solutions to speed up the process and allow a much greater proportion of biodiversity to be included in the assessment process. Further, results are easily comparable through the genetic 'barcodes' used to identify organisms.
The aim of the large international COST Action DNAqua-Net | |
(http://dnaqua.net/) is to develop strategies on how to include novel | |
genetic tools in bioassessment of aquatic ecosystems in Europe and | |
beyond and how to standardize these among the participating countries. | |
It is the ambition of the network to have these new genetic tools | |
accepted in future legal frameworks such as the EU-Water Framework | |
Directive (WFD; 2000/60/EC) and the Marine Strategy Framework Directive | |
(2008/56/EC). However, a prerequisite is that various aspects, ranging from the validation and completion of DNA barcode reference databases to the lab and field protocols, the analysis processes, and the subsequently derived biotic indices and metrics, are dealt with and commonly agreed upon. Furthermore, many pragmatic questions, such as adequate short- and long-term storage of samples or specimens for further processing or to serve as an accessible reference, also need to be addressed. In Europe, the conformity and backward compatibility of the new methods with the existing legislation and workflows are also of high importance. Without rigorous harmonization and inter-calibration
concepts, the implementation of the powerful new genetic tools will be | |
substantially delayed in real-world legal framework applications. | |
After a short introduction to the structure and vision of DNAqua-Net, we
discuss how the DNAqua-Net community considers possibilities to include
novel DNA-based approaches in current bioassessment and how formal
standardization, e.g. through the framework of CEN (the European
Committee for Standardization), may aid in that process (Hering et al.
2018, Leese et al. 2016, Leese et al. 2018). Further, we explore how
TDWG data standards can facilitate swift adoption of the genetic methods
in routine use. We also present potential impacts of the legislative
requirements of the Nagoya Protocol on the exchange of genetic resources
and their implications for biomonitoring. Last but not least, we will
touch upon the rather unexpected influence that the new General Data
Protection Regulation (GDPR) may have on bioassessment work in practice.
Abstract
Although they are hyperdiverse and intensively studied, parasites
present major challenges when it comes to phylogenetics, taxonomy, and
biodiversity informatics. The collection of any parasitic organism
entails the linking of at least two specimens - the parasite and the
host. If the parasite has a complex life cycle, this becomes further
complicated by requiring the linking of three or more specimens: the
parasite, its intermediate host (vector), and its definitive host(s).
Parasites are sometimes collected as a byproduct of another collection
event and are not studied immediately, which has the potential to
disconnect them further in terms of information content and continuity.
The converse is also common: parasites can be collected by
parasitologists who do not necessarily take host vouchers or incorporate
host taxonomy, let alone other metadata for these events.
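One way to keep such records connected - shown here purely as an
illustration, not as part of the talk - is to state each parasite-host
link explicitly, for example with Darwin Core ResourceRelationship
terms. A minimal Python sketch (all specimen identifiers are
hypothetical):

def link_specimens(parasite_id: str, host_id: str, remark: str) -> dict:
    # A ResourceRelationship-style record: the parasite is the resource,
    # the host specimen is the related resource.
    return {
        "resourceID": parasite_id,                # dwc:resourceID
        "relatedResourceID": host_id,             # dwc:relatedResourceID
        "relationshipOfResource": "parasite of",  # dwc:relationshipOfResource
        "relationshipRemarks": remark,            # dwc:relationshipRemarks
    }

# A parasite with a complex life cycle needs one link per host specimen.
links = [
    link_specimens("parasite:P-001", "mosquito:M-042", "intermediate host (vector)"),
    link_specimens("parasite:P-001", "bird:B-317", "definitive host"),
]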
Using the specific example of the malaria parasites (Order Haemosporida),
I will present examples of the specific challenges that have accompanied
the study of these parasites, including issues of species delimitation;
phylogenetic study, including genetic oddities that are unique to these
organisms; and the taxonomic quandaries we now find ourselves in, along
with other problems of maintaining continuity of information in a group
that is both biologically diverse and medically important.
Abstract
Madagascar is one of the world's hottest biodiversity hotspots and a
natural laboratory for evolutionary research. Tenrecs (Tenrecidae; 32
currently recognized species) -- small placental mammals endemic to
Madagascar -- colonized the island >35 million years ago and have
evolved a stunning range of behaviors and morphologies, including
heterothermic species; species with hedgehog-like spines; and fossorial,
aquatic, and scansorial ecotypes. In 2016, we produced the first
taxonomically complete phylogeny of tenrecs, which has served as a
framework for studying morphological evolution, phylogeography, and
species limits. Most recently, we have built on this phylogeny to
incorporate an enormous database of genetic, morphometric, and
geographic data from >800 vouchered tenrec specimens. These data have
revealed interesting and unexpected aspects of their evolutionary
history, including decoupled diversification of the cranium and
postcranium. Using a machine learning approach, we have also uncovered
numerous new, cryptic species in the family Tenrecidae. As phylogenetic
and phenotypic data become more readily available through online
repositories, we expect that the same approaches can be applied to other
taxonomic groups, providing unprecedented resolution of the tree of life.
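The abstract does not name the machine learning method used; as a
generic sketch of how model-based clustering of vouchered measurement
data can flag more lineages than the currently named species, the
following uses synthetic data and scikit-learn's GaussianMixture as
stand-ins for the authors' actual pipeline:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Hypothetical input: 200 specimens x 6 morphometric measurements.
X = rng.normal(size=(200, 6))

# Fit mixtures with 1-5 components and pick the best by BIC; a best fit
# with more clusters than named species suggests candidate cryptic taxa.
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print("clusters:", best.n_components)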
# Loop over all abstracts and write the output to abstracts.txt
for abstract in $(cat TDWG_abstracts.txt); do
  # Strip the article path down to just the abstract number
  anum=$(echo "$abstract" | sed 's/\/article\///g;s/\/download\/xml\///g')
  # Download the XML representation of the abstract
  wget "https://biss.pensoft.net${abstract}" -O "$anum.xml"
  # Extract just the abstract text from the XML using XPath and convert it to Markdown
  xmllint --xpath "/article/front/article-meta/abstract" "$anum.xml" | pandoc --from html --to markdown >> abstracts.txt
done
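The listing that follows appears to be the TDWG_abstracts.txt input
consumed by the loop above - one biss.pensoft.net article path per line.
If the script is saved as, say, fetch_abstracts.sh (a hypothetical name)
and run with bash in the same directory as that file, it should write
the collected abstracts to abstracts.txt.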
/article/27339/download/xml/
/article/26369/download/xml/
/article/25437/download/xml/
/article/26922/download/xml/
/article/26860/download/xml/
/article/26516/download/xml/
/article/26323/download/xml/
/article/26304/download/xml/
/article/26262/download/xml/
/article/26235/download/xml/
/article/26177/download/xml/
/article/26080/download/xml/
/article/26075/download/xml/
/article/25842/download/xml/
/article/25738/download/xml/
/article/25661/download/xml/
/article/25577/download/xml/
/article/25223/download/xml/
/article/26168/download/xml/
/article/27244/download/xml/
/article/26490/download/xml/
/article/26367/download/xml/
/article/26286/download/xml/
/article/26104/download/xml/
/article/26102/download/xml/
/article/25960/download/xml/
/article/25864/download/xml/
/article/25828/download/xml/
/article/25890/download/xml/
/article/25885/download/xml/
/article/25724/download/xml/
/article/25723/download/xml/
/article/25881/download/xml/
/article/25836/download/xml/
/article/25876/download/xml/
/article/25564/download/xml/
/article/25560/download/xml/
/article/25535/download/xml/
/article/25481/download/xml/
/article/26122/download/xml/
/article/25852/download/xml/
/article/26731/download/xml/
/article/25869/download/xml/
/article/25693/download/xml/
/article/25658/download/xml/
/article/25165/download/xml/
/article/25641/download/xml/
/article/25586/download/xml/
/article/25700/download/xml/
/article/25298/download/xml/
/article/26749/download/xml/
/article/25651/download/xml/
/article/25289/download/xml/
/article/25525/download/xml/
/article/25282/download/xml/
/article/25748/download/xml/
/article/25694/download/xml/
/article/25653/download/xml/
/article/25585/download/xml/
/article/26665/download/xml/
/article/25838/download/xml/
/article/25450/download/xml/
/article/25439/download/xml/
/article/25394/download/xml/
/article/25268/download/xml/
/article/25148/download/xml/
/article/25582/download/xml/
/article/25657/download/xml/
/article/25608/download/xml/
/article/25438/download/xml/
/article/25395/download/xml/
/article/25351/download/xml/
/article/25324/download/xml/
/article/25317/download/xml/
/article/25310/download/xml/
/article/25176/download/xml/
/article/26808/download/xml/
/article/26060/download/xml/
/article/25474/download/xml/
/article/25456/download/xml/
/article/26836/download/xml/
/article/25840/download/xml/
/article/25812/download/xml/
/article/25811/download/xml/
/article/25805/download/xml/
/article/25642/download/xml/
/article/24749/download/xml/
/article/25306/download/xml/
/article/24930/download/xml/
/article/25647/download/xml/
/article/25646/download/xml/
/article/25635/download/xml/
/article/25580/download/xml/
/article/25579/download/xml/
/article/26009/download/xml/
/article/25983/download/xml/
/article/25982/download/xml/
/article/25953/download/xml/
/article/25604/download/xml/
/article/25936/download/xml/
/article/25776/download/xml/
/article/25739/download/xml/
/article/25727/download/xml/
/article/25698/download/xml/
/article/25589/download/xml/
/article/25614/download/xml/
/article/25478/download/xml/
/article/25409/download/xml/
/article/25345/download/xml/
/article/25343/download/xml/
/article/26514/download/xml/
/article/25969/download/xml/
/article/25415/download/xml/
/article/25410/download/xml/
/article/25990/download/xml/
/article/25488/download/xml/
/article/25487/download/xml/
/article/25486/download/xml/
/article/25121/download/xml/
/article/24991/download/xml/
/article/27087/download/xml/
/article/26658/download/xml/
/article/26615/download/xml/
/article/26471/download/xml/
/article/25728/download/xml/
/article/25914/download/xml/
/article/25664/download/xml/
/article/26561/download/xml/
/article/25699/download/xml/
/article/27251/download/xml/
/article/25762/download/xml/
/article/25833/download/xml/
/article/25749/download/xml/
/article/25637/download/xml/
/article/25261/download/xml/
/article/25260/download/xml/
/article/29123/download/xml/
/article/28479/download/xml/
/article/28364/download/xml/
/article/28131/download/xml/
/article/28158/download/xml/