Abstract

This presentation will outline the untapped potential of Information and Library Science (ILS) programs as an integral space for the long-term training and support of biodiversity informatics work. It will also outline the specific steps proposed at Indiana University, Bloomington (IU), to provide long-term, systematized training of students focused on information work within this broad domain.

As a discipline, ILS has long been preoccupied with the organization, description, curation, and accessibility of a wide variety of information and data sources. ILS curricula necessarily emphasize a broad range of information topics, given that many different kinds of institutions require these particular skillsets. Typical ILS curricula focus on topics such as knowledge organization, metadata, ontologies, database design, scholarly communication, intellectual property, information ethics, interface design, data analytics, online publishing, museum studies, data curation, and collection management/administration. Given this broad range of training, students graduating from ILS programs are well situated to support biodiversity informatics broadly conceived, especially as it relates to the standardization and normalization of data sources across geographically and temporally distributed locations and sources within specific institutional environments.

Yet, despite the overlaps between ILS departments, biodiversity informatics, and museum environments, no ILS program has officially taken steps to support this intersectional space. Using concrete examples, this talk will show how the ILS program at IU is building on already-existing capacities to more robustly support biodiversity work. The proposed way forward is a tightly integrated approach to biodiversity informatics that combines theoretical experience and technical training with hands-on internships in museum and biodiversity environments. Through close partnerships with on-campus institutes, such as the Indiana Geological & Water Survey and the Center for Biological Research Collections, as well as larger, external institutions such as the Smithsonian National Museum of Natural History, students will receive intensive fieldwork experience in data management and standards-driven work specific to the museum and biodiversity world. A tiered approach to this training will be suggested, proceeding at both the professional level (for example, master's-level work) and at more advanced, research-driven levels (such as postdoctoral work).

Part of this new approach to biodiversity informatics training requires the rearticulation of ILS courses, as well as the addition of new courses that provide domain-specific knowledge. This presentation, then, will outline a proposed curriculum to support this kind of collaborative training and work. A distributed training structure will be suggested, utilizing expertise from across the globe. In addition, it will show how a more project- and fieldwork-centric approach to ILS education can more quickly and deeply train students to enter a quickly changing field.

Part of the difficulty with training biodiversity informatics specialists is that building such programs from the ground up is often costly and requires the creation of new workflows and practices. An integrated approach, such as that proposed in this presentation, will instead leverage the respective strengths of ILS programs and museum environments in ways that are sustainable and resilient for the long term. The goal is for institutions to support each other in ways that strengthen their core missions and push the discipline forward in systematic and unique ways.
Abstract

As rapid advances in sequencing technology result in more branches of the tree of life being illuminated, there has been a decrease in the percentage of sequence records that are backed by voucher specimens (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high-quality sequence records. Another problem is that there are millions of existing sequence records known to contain either incorrect or incomplete specimen data. I will show an end-to-end example of sequencing specimens from a museum, depositing their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also discuss linking in the other direction, from specimen database records to sequences. Over one million records in the Global Biodiversity Information Facility (GBIF) (Trizna 2018a) contain a value in the Darwin Core term "associatedSequences", and I will examine what is currently contained in these entries and how best to format them to ensure that a tight connection is made to sequence records.
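As a minimal sketch of how such links can be inspected programmatically (illustrative only, not the talk's code; the occurrence key is a placeholder), the public GBIF API exposes each occurrence's verbatim Darwin Core record, including any associatedSequences value:

```python
# Illustrative sketch: fetch the verbatim Darwin Core record for one GBIF
# occurrence and return its raw associatedSequences value, if any.
import requests

def associated_sequences(occurrence_key):
    url = f"https://api.gbif.org/v1/occurrence/{occurrence_key}/verbatim"
    record = requests.get(url, timeout=30).json()
    # Verbatim records key their fields by full Darwin Core term URIs.
    return record.get("http://rs.tdwg.org/dwc/terms/associatedSequences")

print(associated_sequences(1234567890))  # placeholder occurrence key
```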
Abstract

SOCCOMAS is a ready-to-use Semantic Ontology-Controlled Content Management System (http://escience.biowikifarm.net/wiki/SOCCOMAS). Each web content management system (WCMS) run by SOCCOMAS is controlled by a set of ontologies and an accompanying Java-based middleware, with the data housed in a Jena tuple store. The ontologies describe the behavior of the WCMS, including all of its input forms, input controls, data schemes and workflow processes (Fig. 1).

Data are organized into different types of data entries, which represent collections of data referring to a particular material entity, for instance an individual specimen. SOCCOMAS implements a suite of general processes, which can be used to manage and organize all data entry types. One category of processes manages the life-cycle of a data entry, including everything required for changing between the following possible entry states:

current draft version;
backup draft version;
recycle bin draft version;
deleted draft version;
current published version;
previously published version.

The processes also allow a user to create a revised draft based on the current published version. Another category of processes automatically tracks the overall provenance (i.e. creator, authors, creation and publication dates, contributors, relations between different versions, etc.) for each particular data entry. Additionally, at a significantly finer level of granularity, SOCCOMAS tracks all changes made to a particular data record in a detailed change-history log, down to the level of individual input fields. All information (data, provenance metadata, change-history metadata) is stored according to Resource Description Framework (RDF)-compliant data schemes in different named graphs (i.e. URIs under which triple statements are stored in the tuple store). All recorded information can be accessed through a SPARQL endpoint. All data entries are published as Linked Open Data and thus provide access to an HTML representation of the data for visualization in a web browser, or to a machine-readable RDF file. The ontology-controlled design of SOCCOMAS allows administrators to easily customize existing templates for the input forms of data entries, define new templates for new types of data entries, and define underlying RDF-compliant data schemes and apply them to each relevant input field. SOCCOMAS provides an engine for running and developing semantic WCMSs in which only ontology editing, but no middleware or front-end programming, is required to adapt the WCMS to one's own specific requirements.
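Since all recorded information sits behind a SPARQL endpoint, a query of the following shape should apply; in this hedged sketch the endpoint URL and named-graph URI are hypothetical placeholders, not actual SOCCOMAS addresses:

```python
# Hedged sketch: list the triples stored in one named graph of a SOCCOMAS
# tuple store via its SPARQL endpoint (both URLs below are hypothetical).
import requests

ENDPOINT = "https://example.org/soccomas/sparql"   # hypothetical endpoint
QUERY = """
SELECT ?s ?p ?o
WHERE { GRAPH <https://example.org/entry/specimen-42/current> { ?s ?p ?o } }
LIMIT 25
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```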
Abstract

Taxonomic names are ambiguous as identifiers of biodiversity data, as they refer to a particular concept of a taxon in an expert's mind (Kennedy et al. 2005). This ambiguity is particularly problematic when attempting to reconcile taxonomic names from disparate sources with clades on a phylogeny. Currently, such reconciliation requires expert interpretation, which is necessarily subjective, difficult to reproduce, and refractory to scaling. In contrast, phylogenetic clade definitions are a well-developed method for unambiguously defining the semantics of a clade concept in terms of shared evolutionary ancestry (Queiroz and Gauthier 1990, Queiroz and Gauthier 1994), and these semantics allow clades to be located on any phylogeny. Although a few software tools have been created for resolving clade definitions, including definitions expressed in the Mathematical Markup Language (e.g. Names on Nodes in Keesey 2007) and as lists of GenBank accession numbers (e.g. mor in Hibbett et al. 2005), these are application-specific representations that do not provide formal definitions with well-defined semantics for every component of a clade definition. Being able to create such machine-interpretable definitions would allow computers to store, compare, distribute and resolve semantically rich clade definitions.

To this end, the Phyloreferencing project (http://phyloref.org, Cellinese and Lapp 2015) is working on a specification for encoding phylogenetic clade definitions as ontologies using the Web Ontology Language (OWL; W3C OWL Working Group 2012). Our specification allows the semantics of these definitions, which we call phyloreferences, to be described in terms of shared-ancestor and excluded-lineage properties. The aim of this effort is to allow any OWL-DL reasoner to resolve phyloreferences on a phylogeny that has itself been translated into a compatible OWL representation. We have developed a workflow that allows us to curate phyloreferences from phylogenetic clade definitions published in natural language, and to resolve the curated phyloreference against the phylogeny upon which the definition was originally created, allowing us to validate that the phyloreference reflects the authors' original intent. We have started work on curating dozens of phyloreferences from publications and from the clade definition database RegNum (http://phyloregnum.org), which will provide an online catalog of all clade definitions that are part of the Phylonym Volume, to be published together with the PhyloCode (https://www.ohio.edu/phylocode/). We will comprehensively curate these definitions into a reusable and fully computable ontology of phyloreferences.
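To give a flavor of what an OWL encoding of a clade definition can look like, here is a toy sketch built with rdflib; the property and class IRIs in the ex: namespace are invented stand-ins and do not reproduce the project's actual model:

```python
# Toy sketch: an OWL class for "CladeX := everything descended from the
# most recent common ancestor of taxa A and B", built with rdflib.
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF, OWL

EX = Namespace("http://example.org/phyloref#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasAncestor))          # hypothetical property
g.add((restriction, OWL.someValuesFrom, EX.MRCA_of_A_and_B))  # hypothetical class
g.add((EX.CladeX, RDF.type, OWL.Class))
g.add((EX.CladeX, OWL.equivalentClass, restriction))

print(g.serialize(format="turtle"))
```

An OWL-DL reasoner can then classify the nodes of an OWL-encoded phylogeny under ex:CladeX, which is the resolution step described above.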
In our presentation, we will provide an overview of phyloreferencing and will describe the model and workflow we use to encode clade definitions in OWL, based on concepts and terms taken from the Comparative Data Analysis Ontology (Prosdocimi et al. 2009), Darwin-SW (Baskauf and Webb 2016) and Darwin Core (Wieczorek et al. 2012). We will demonstrate how phyloreferences can be visualized, resolved and tested on the phylogeny that they were originally described on, and how they resolve on one of the largest synthetic phylogenies available, the Open Tree of Life (Hinchliff et al. 2015). We will conclude with a discussion of the problems we faced in referring to taxonomic units in phylogenies, which is one of the key challenges in enabling better integration of phylogenetic information into biodiversity analyses.
Abstract

Parasitism can be defined as an interaction between species in which one of the interaction partners, the parasite, lives in or on the other, the host. The parasite draws food from its host and harms it in the process. According to some estimates, over 40% of all eukaryotes are parasites. Nevertheless, it is difficult to computationally obtain information on whether a particular taxon is a parasite, which makes it difficult to query large sets of taxa.

Here we test to what extent it is possible to use the Open Tree of Life (OTL), a synthesis of phylogenetic trees on a backbone taxonomy (resulting in unresolved nodes), to expand available information via phylogenetic trait prediction. We use the Global Biotic Interactions (GloBI) database to categorise 25,992 and 34,879 species as parasites and free-living, respectively, and predict states for the ~2.3 million (97.34%) leaf nodes without state information.

We estimate the accuracy of our maximum parsimony-based predictions using cross-validation and simulation at roughly 60-80% overall, but varying strongly between clades. The cross-validation resulted in an accuracy of 98.17%, which is explained by the fact that the data are not uniformly distributed. We describe this variation across taxa as associated with the available state and topology information. We compare our results with several smaller-scale studies that used manual expert curation, and conclude that the computationally inferred state changes largely agree with them in number and placement. In clades in which the available state information is biased (mostly towards parasites, e.g. in nematodes), phylogenetic prediction is bound to provide results contradicting conventional wisdom.
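For readers unfamiliar with parsimony-based state prediction, the following toy sketch shows the core of the Fitch downpass on an invented three-leaf tree; it is a didactic illustration, not the study's pipeline:

```python
# Toy sketch: Fitch parsimony state sets on a rooted binary tree.
# Internal nodes are (left, right) tuples; leaves are strings.
def fitch(node, states):
    """Return the Fitch state set for `node`; `states` maps leaves to sets."""
    if node in states:                    # leaf with an observed state
        return states[node]
    left, right = node
    a, b = fitch(left, states), fitch(right, states)
    return a & b if a & b else a | b      # intersect if possible, else union

tree = (("louse", "tapeworm"), "free_living_nematode")  # invented tree
observed = {"louse": {"parasite"}, "tapeworm": {"parasite"},
            "free_living_nematode": {"free-living"}}
print(fitch(tree, observed))  # ambiguous root: {'parasite', 'free-living'}
```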
This represents, to our knowledge, the first comprehensive computational reconstruction of the emergence of parasitism in eukaryotes. We argue that such an approach is necessary to allow further incorporation of parasitism as an important trait in species interaction databases and in individual studies on eukaryotes, e.g. in the microbiome.
Abstract

The Open Tree of Life project is a collaborative effort to synthesize, share and update a comprehensive tree of life (Fig. 1). We have completed a draft synthesis of a tree summarizing digitally available taxonomic and phylogenetic knowledge for all 2.6 million named species, available at tree.opentreeoflife.org (Hinchliff et al. 2015). This tree provides ready access to phylogenetic information that can link together biodiversity data on the basis of what we know about relevant evolutionary history. Both the unified reference taxonomy (Rees and Cranston 2017) and the published phylogenetic statements underlying the tree (McTavish et al. 2015) are available and accessible online. Taxa in the phylogenies are mapped to the reference taxonomy, which aligns Open Tree taxon identifiers to those from NCBI and GBIF, among several other taxonomy resources. The synthesis tree is revised as new data become available, and captures conflict and consensus across different published phylogenetic estimates. This undertaking requires both the development of novel infrastructure and analysis tools and community engagement with the Open Tree of Life project. I will discuss the challenges in and the progress towards achieving these goals.
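As an illustration of the name-to-identifier mapping described above, here is a minimal sketch against the public Open Tree v3 API (not code from the talk; response field names follow the published API documentation):

```python
# Sketch: match a scientific name to an Open Tree Taxonomy (OTT) identifier
# using the public TNRS endpoint.
import requests

resp = requests.post("https://api.opentreeoflife.org/v3/tnrs/match_names",
                     json={"names": ["Canis lupus"]})
for result in resp.json()["results"]:
    for match in result["matches"]:
        taxon = match["taxon"]
        print(taxon["ott_id"], taxon["unique_name"])
```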
Abstract

Connecting biodiversity data across databases is not as easy as one might think. Different databases use different identifiers and taxonomies, and connecting these data often results in loss of information and precision. Here we present some of the challenges we faced with integrating multiple biodiversity data sets, including specimen data from scientific collections, during a hackathon hosted by the Phenoscape project in December of 2017. The hackathon brought together a diverse group of participants, including biologists and software developers, to explore ways of using the computable phenotype data in the Phenoscape Knowledgebase (KB) (Edmunds et al. 2015). The KB contains ontology-annotated data that links evolutionary phenotypes from the comparative literature to model organism phenotypes, enabling, e.g., the retrieval of candidate genes for evolutionary phenotypes and the generation of synthetic supermatrices of presence/absence characters. During this hackathon, our team explored how to link phenotype data in the KB to museum specimen data in iDigBio (Matsunaga et al. 2013), with the hope of creating visualizations including world maps showing species distributions with different character states and their phylogenetic relationships. We visualized lineage relationships by querying the Open Tree of Life (OT) (Hinchliff et al. 2015) website, using data integrated by another group at the hackathon that linked KB and OT taxonomic identifiers.
Phenoscape uses terms from anatomy, quality, and taxonomy ontologies to annotate characters and taxonomic information from the phylogenetic literature, along with specimen information. When populating the KB, specimen identifiers such as occurrence identifiers, collector numbers, and catalog numbers were preserved if present in the literature. We found that these identifiers, although standard in the biodiversity domain, were mostly insufficient to uniquely identify the source specimen in iDigBio. As an alternative, we instead mapped all the occurrences of taxa using string matches of the genus and species from Vertebrate Taxonomy Ontology identifiers. Without specimen identifiers that are consistent across databases, we lost the ability to explore spatial and temporal variation of characters within genera and were only able to explore phenotypes and geographic distributions among genera. We look forward to discussing these issues with the collections community represented at this meeting by the Society for the Preservation of Natural History Collections (SPNHC).
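The genus/species string matching described above can be sketched against iDigBio's public search API as follows (illustrative only, not the hackathon code; the example taxon is arbitrary):

```python
# Sketch: query iDigBio specimen records by genus and specific epithet.
import requests

def idigbio_occurrences(genus, species, limit=10):
    resp = requests.post(
        "https://search.idigbio.org/v2/search/records",
        json={"rq": {"genus": genus, "specificepithet": species},
              "limit": limit},
    )
    return [item["indexTerms"] for item in resp.json()["items"]]

for rec in idigbio_occurrences("Danio", "rerio"):
    print(rec.get("uuid"), rec.get("geopoint"))
```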
We developed an R Shiny application that integrates characters and taxa from Phenoscape with specimen records from iDigBio and phylogenies from OT, to visualize phenotypic characters and taxon distributions in three interactive panels. The app allows a user to visualize OT phylogenies and place presence/absence character data on the tree. Specifically, users can: select taxa or specific characters to visualize their geographic distributions, navigate a phylogeny browser which displays the character and specimen data available for the taxa under consideration, and view a heatmap of characters available for character and taxon combinations. Because of our challenges joining data, our distribution map leaves users with the impression that all individuals in a genus exhibit a character, whereas the KB was populated with data describing individuals. We hope that with improved data standards and their use by more people, constructing applications like ours will become easier.
Abstract

There is a large amount of publicly available biodiversity data from many different data sources. When doing research, one ideally interacts with biodiversity data programmatically, so that the work is reproducible. The entry point to biodiversity data records is largely through taxonomic names, or common names in some cases (e.g., birds). However, many researchers have phylogeny-focused projects, meaning taxonomic names are not the ideal interface to biodiversity data. Ideally, it would be simple to go programmatically from a phylogeny to biodiversity records through a phylogeny-based query.

I'll discuss a new project, 'phylodiv' (https://github.com/ropensci/phylodiv/), that attempts to facilitate phylogeny-based biodiversity data collection (see Fig. 1). The project takes the form of an R software package. The idea is to make the user interface take essentially two inputs: a phylogeny and a phylogeny-based question. Behind the scenes we'll do many things, including gathering taxonomic names and hierarchies for the taxa in the phylogeny, sending queries to GBIF (or other data sources), and mapping the results. The user will of course have control over the behind-the-scenes parts, but I imagine the majority use case will be to input a phylogeny and a question and expect an answer back.

We already have R tools for nearly all parts of the work-flow shown above: there are many phylogeny tools; 'taxize'/'taxizedb' can handle taxonomic name collection; 'rgbif' can handle interaction with GBIF; and there are many mapping options in R. However, a few areas still need work.

First, there is not yet a clear way to do a phylogeny-based query. Ideally a user would be able to express a simple query like "taxon A vs. its sister group". That is simple to imagine, but implementing it in software is another matter.
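Although phylodiv itself is an R package, the underlying GBIF calls that such a query would decompose into can be sketched in Python against the public GBIF API (illustrative only; the two example clades stand in for "taxon A" and its sister group):

```python
# Sketch: resolve two clade names to GBIF taxon keys and compare their
# occurrence counts.
import requests

def gbif_key(name):
    return requests.get("https://api.gbif.org/v1/species/match",
                        params={"name": name}).json()["usageKey"]

def occurrence_count(taxon_key):
    resp = requests.get("https://api.gbif.org/v1/occurrence/search",
                        params={"taxonKey": taxon_key, "limit": 0})
    return resp.json()["count"]

for name in ["Panthera", "Neofelis"]:  # stand-ins for taxon A and its sister
    print(name, occurrence_count(gbif_key(name)))
```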
Second, users would ideally like answers back (in this case a map of occurrences) relatively quickly, to be able to iterate on their research work-flow. The most likely solution will be to use GBIF's map tile service to visualize binned occurrence data, but we'll need to explore this in detail to make sure it works.
Abstract

Xper3 (Vignes Lebbe et al. 2016) is a collaborative knowledge base publishing platform that, since its launch in November 2013, has been adopted by over two thousand users (Pinel et al. 2017). This is mainly due to its user-friendly interface and the simplicity of its data model. The data are stored in a MySQL relational database, but the exchange format uses the TDWG standard format SDD (Structured Descriptive Data; Hagedorn et al. 2005). However, each Xper3 knowledge base is a closed world that the author(s) may or may not share with the scientific community or the public by publishing content and/or identification keys (Kopfstein 2016). The explicit taxonomic, geographic and phenotypic limits of a knowledge base are not always well defined in the metadata fields. Conversely, terminology vocabularies, such as the Phenotype and Trait Ontology (PATO) and the Plant Ontology (PO), and software to edit them, such as Protégé and Phenoscape, are essential in the semantic web but difficult to handle for biologists without computer skills. These ontologies constitute open worlds and are themselves expressed as Resource Description Framework (RDF) triples. Protégé offers visualisation and reasoning capabilities for these ontologies (Gennari et al. 2003, Musen 2015).

Our challenge is to combine the user-friendliness of Xper3 with the expressive power of OWL (Web Ontology Language), the W3C standard for building ontologies. We therefore focused on analyzing the representation of the same taxonomic contents under Xper3 and under different models in OWL. After this critical analysis, we chose a description model that allows automatic export of SDD to OWL and can be easily enriched. We will present the results obtained and their validation on two knowledge bases, one on parasitic crustaceans (Sacculina) and the second on extant and fossil ferns (Corvez and Grand 2014). The evolution of the Xper3 platform and the perspectives offered by this link with semantic web standards will be discussed.
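To illustrate the kind of SDD-to-OWL export meant here, a toy sketch follows; the XML fragment and IRIs are invented, and the actual SDD schema and the chosen OWL model are considerably richer:

```python
# Toy sketch: turn SDD-style character definitions into OWL classes.
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, OWL

SDD = """<Characters>
  <CategoricalCharacter id="c1"><Label>Carapace shape</Label></CategoricalCharacter>
  <CategoricalCharacter id="c2"><Label>Frond division</Label></CategoricalCharacter>
</Characters>"""

EX = Namespace("http://example.org/xper3#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)
for char in ET.fromstring(SDD):
    cls = EX[char.get("id")]
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.label, Literal(char.findtext("Label"))))
print(g.serialize(format="turtle"))
```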
Abstract

Anthropogenic climate change has already altered the conditions to which species have adapted locally, and consequently, shifts of occurrence areas have been reported (Chen et al. 2011). Anticipating the results of climate change is urgent, and using these results efficiently to guide decision-making can help to build strategies to protect species from those changes. Therefore, our objective is to propose the use of climate change impact assessments, obtained through species distribution models (SDMs), to guide decision-making. The emphasis will be on data that could help determine potentially vulnerable species and priority areas, which could act as climate refuges as well as wildlife corridors. SDMs are based on species occurrence points, available mainly from biological collections and observations (Franklin 2010). When these points are combined with geospatially explicit layers of abiotic or biotic data (e.g. temperature, precipitation, land use) that define the ecological requirements of the species under study, species distribution models can be generated. These models are projected in the form of maps indicating areas where the species can find the most suitable habitats and, therefore, where one is most likely to find them. To support public policy decisions, generating robust and reliable models is essential. A minimum of six occurrence points is a mandatory requirement, with non-overlapping areas as a filter criterion. Unfortunately, in Brazil, as well as in Latin America in general, this type of data is scarce.
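The two screening rules just mentioned can be sketched as follows (toy coordinates and an assumed 0.5-degree grid; the actual modelling pipeline is not shown):

```python
# Sketch: keep one occurrence per grid cell, then require at least six points.
def usable_for_sdm(points, cell=0.5, minimum=6):
    """points: (longitude, latitude) tuples; returns thinned points or None."""
    cells = {(round(lon / cell), round(lat / cell)): (lon, lat)
             for lon, lat in points}      # one representative per cell
    thinned = list(cells.values())
    return thinned if len(thinned) >= minimum else None

occurrences = [(-47.9, -15.8), (-47.1, -15.9), (-46.5, -16.2), (-45.8, -14.9),
               (-44.9, -15.1), (-44.1, -16.0), (-47.91, -15.81)]  # last overlaps
print(usable_for_sdm(occurrences))
```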
Thus, with SDMs, four types of decision-making information regarding priority species and areas can be obtained (Fig. 1).

Size of potential occurrence areas: species that have a small area of occurrence are potentially vulnerable, since they tend to be endemic, usually living under restricted environmental conditions. In this case, any small change in environmental conditions can result in the extinction of the impacted species. Thus, such regions need to be protected.

Difference between current and future areas: species presenting the most significant reduction in potential areas should be prioritized by decision-makers. This measurement could be used as an indication of vulnerability.

Even species with no predicted area reduction, or with an increase, could be prioritized in management programs due to their role in the complex interaction networks underlying ecosystem services, for example as pollinators, seed dispersers or agents of disease control. These species could be more resilient to climate-driven changes in interaction networks, and are possibly better able to provide their services under extremely unfavorable climate scenarios.

Areas that maintain higher species diversity in future scenarios: their protection could be prioritized in restoration and conservation programs. Especially in cases involving multiple species, those areas could be considered climate refuges by decision-makers. Additionally, for the reconstruction and reuse of SDMs published in peer-reviewed journals, all information about the models (their generation, ensemble methods, data cleaning and the data quality criteria applied) should be available.

The availability of the four above-mentioned types of information can support decision-making strategies aimed at the protection of priority species and areas. In conclusion, SDMs provide essential information about the present and future impacts of projected climate change, and their derived data could be preserved using a standard controlled vocabulary.
Abstract

Can Essential Biodiversity Variables (EBVs) be developed to monitor changes in species interactions? That was the difficult question asked at the GLOBIS-B workshop in February 2017, in which more than 50 experts participated. EBVs can be defined as harmonized measurements that allow us to inform policy about essential changes in biodiversity. They can be seen as biological state variables from which more refined indicators may be derived. They have been presented as a means to monitor global biodiversity change and as a concept to drive the gathering, sharing, and standardisation of data on our biota (Geijzendorffer et al. 2015, Kissling et al. 2017, Pereira et al. 2013).

There are different classes of EBVs that characterize, for example, the state of species populations, species traits and ecosystem structure and function. It has also been proposed that there should be EBVs related to species interactions. However, until now there has been little progress formulating what these should be, even though species interactions are central to ecology. Species interactions cover a wide range of important processes, from mutualisms, such as pollination, to different forms of heterotrophic nutrition, such as the predator-prey relationship. Indeed, ecological interactions are critical to understanding why an ecosystem is more than the sum of its parts. Nevertheless, direct observation of species interactions is often difficult and time-consuming work, which makes it difficult to monitor them in the long term. For this reason, the workshop focused on those species interactions that are feasible to study and most relevant to policy. To bring focus to our discussions we concentrated on pollination, predation and microbial interactions.

Taking pollination as an example, there was recognition of the importance of ecological networks and that network metrics may be a sensitive indicator of change. Potential EBVs might be the number of pairwise interactions between species, or the modularity and interaction diversity of the whole network. This requires standardised data collection and reporting (e.g. standardization of measures of interaction strength or minimum data specifications for ecological networks) and sufficient data across time to regularly calculate these metrics. Other, simpler surrogates for pollination might also prove useful, such as flower visitation rates or the proportion of fruit set.
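Two of the candidate metrics named above can be computed directly from a binary interaction matrix, as in this toy sketch (the matrix is invented; modularity needs a community-detection step and is omitted):

```python
# Sketch: pairwise interaction count and Shannon interaction diversity
# from a binary plant-pollinator matrix (rows: plants; columns: pollinators).
import numpy as np

web = np.array([[1, 1, 0],
                [0, 1, 0],
                [0, 1, 1]])

pairwise_interactions = int(web.sum())
p = web.flatten() / web.sum()
interaction_diversity = -np.sum(p[p > 0] * np.log(p[p > 0]))  # Shannon H'
print(pairwise_interactions, round(float(interaction_diversity), 2))
```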
Finally, there was a recognition that we do not yet have enough tools to monitor some important interactions. Many interactions, particularly among microbes, can currently only be inferred from the co-occurrence of taxa. However, technology is rapidly developing, and it is possible to foresee a future where even these interactions can be monitored efficiently. Species interactions are essential to understanding ecology, but they are also difficult to monitor. Yet delegates at the workshop left with a positive outlook: it is valuable to develop standardisation and harmonization of species interaction data to make them suitable for EBV production.
Abstract

Understanding the role that species play in their environment is a fundamental goal of biodiversity research, informing both ecosystem maintenance and the provision of ecosystem services. The different types of interaction that species establish with their partners regulate the functioning of ecosystems (McCann 2007). Interactions between plants and pollinators (Potts et al. 2016) and between plants and seed dispersers (Wang and Smith 2002) are examples of mutualism, crucial to the maintenance of floristic composition and overall biodiversity in different biomes. They also illustrate well nature's contributions to people, supporting ecosystem services with key economic consequences, such as pollination of agricultural crops (Klein et al. 2007) and seed dispersal in the natural or assisted restoration of degraded areas (Wunderle 1997).

Interactions are mediated by different functional traits (morphological and/or behavioral characteristics of organisms that influence their performance) (Ball et al. 2015). As the zoochorous transfer of pollen grains and seeds usually involves contact, the success of pollination and seed dispersal depends to a large extent on the relationship of size and morphology between flower/fruit and their respective pollinator/seed disperser. Because these traits have been selected over a long shared evolutionary history, it is feasible to rely on their predictive potential to determine whether a certain animal is able to transfer pollen grains and/or seeds of specific plants in the landscape (Howe 2016).
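A minimal sketch of such a trait-based prediction rule follows; the traits, species and threshold are invented for illustration:

```python
# Sketch: size matching for frugivory. A bird can swallow (and so disperse)
# a fruit only if its gape is wide enough.
def can_disperse(gape_width_mm, fruit_diameter_mm, tolerance_mm=1.0):
    return fruit_diameter_mm <= gape_width_mm + tolerance_mm

birds = {"thrush": 9.5, "toucan": 30.0}             # gape widths, mm (invented)
fruits = {"small_berry": 7.0, "large_drupe": 22.0}  # diameters, mm (invented)
for bird, gape in birds.items():
    for fruit, diameter in fruits.items():
        print(bird, fruit, can_disperse(gape, diameter))
```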
Biodiversity is facing constant negative impacts, especially related to climate and habitat changes. These are threatening the provision of ecosystem services, jeopardizing the basic premise of sustainable development, which is to guarantee resources for future generations. The novel landscapes that result from these impacts will certainly be dependent on these ecosystem services, but will the services persist in the face of extinctions and invasive competitors? Ultimately, will these services be predictable by functional traits in landscapes where shared evolutionary history is reduced? Strategies that help our understanding of the interactions and their role in the provision of services are urgently needed (Corlett 2011). Given this context, our objective here is to present the types of data that, if made available, could assist in determining the role of species in terms of the interactions they make and the provision of ecosystem services. Moreover, we aim to elucidate how this role can be associated with functional traits.

The current work focuses on the following groups: plants, birds, bats and bees (Fig. 1). Of particular interest are interactions involving:

pollination, which is carried out predominantly by bees, but also by nectarivorous birds and bats; and

seed dispersal, mainly carried out by frugivorous birds and bats.

These interactions are mediated by key traits. In plants, common flower traits are aperture, color, odor strength and type, shape, orientation, size and symmetry, nectar guides, sexual organs, and reward. Fruit or seed traits, such as fleshy nutritious tissue, chemical attractants and clinging structures, are also relevant for seed dispersal. In animals, the most common traits are body size (for bees, the intertegular distance; for bats, forearm length; and for birds, the weight), gape width for birds, and feeding habit (nectarivorous, frugivorous, omnivorous) for bats and birds. Providing standardized data on the traits that mediate interactions between fauna and flora is important to fill knowledge gaps, which could help decision-making processes aimed at conservation, restoration and management programs for protecting ecosystem services based on biodiversity.
Abstract

The Brazilian Plant-Pollinator Interactions Network*1 (REBIPP) aims to develop scientific and teaching activities on plant-pollinator interactions. The main goals of the network are to:

generate a diagnosis of plant-pollinator interactions in Brazil;
integrate knowledge on pollination in natural, agricultural, urban and restored areas;
identify knowledge gaps;
support public policy guidelines aimed at the conservation of biodiversity and ecosystem services for pollination and food production;
and encourage collaborative studies among REBIPP participants.

To achieve these goals the group has resumed and built on previous work on data standard definition done under the auspices of the IABIN-PTN (Etienne Américo et al. 2007) and FAO (Saraiva et al. 2010) projects (Saraiva et al. 2017). The ultimate goal is to standardize the ways data on plant-pollinator interactions are digitized, to facilitate data sharing and aggregation. A database will be built with standardized data from Brazilian researchers who are members of the network, to be used by the national community and to allow sharing data with data aggregators.

To achieve those goals, three task groups of specialists with similar interests and backgrounds (e.g. botanists, zoologists, pollination biologists) have been created. Each group is working on the definition of the terms to describe plants, pollinators and their interactions. The resulting glossary explains their meaning, attempts to map the suggested terms to Darwin Core (DwC) terms, and follows the TDWG Standards Documentation Standard*2 in its definitions.
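To make the mapping concrete, an interaction occurrence might be digitized as below; the field choices are a hedged illustration using existing Darwin Core and ResourceRelationship terms, not the REBIPP standard itself:

```python
# Illustrative record: a pollinator visit linked to a plant occurrence.
interaction = {
    "dwc:occurrenceID": "urn:uuid:00000000-0000-0000-0000-000000000001",
    "dwc:scientificName": "Apis mellifera",
    "dwc:eventDate": "2017-10-03",
    # ResourceRelationship terms pointing at the plant occurrence:
    "dwc:relatedResourceID": "urn:uuid:00000000-0000-0000-0000-000000000002",
    "dwc:relationshipOfResource": "visitsFlowersOf",
}
```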
Reaching a consensus on terms and their meaning among the members of each group is challenging, since researchers have different views and concerns about which data are important enough to be included in a standard. That reflects the variety of research questions that underlie different projects and the data they collect. Thus, we ended up with a long list of terms, many of them useful only in very specialized research protocols and experiments, and sometimes rarely collected or measured. Nevertheless, we opted to maintain a very comprehensive set of terms, so that a large number of researchers feel that the standard meets their needs and that the databases based on it are a suitable place to store their data, thus encouraging the adoption of the data standard.

An update of the work will soon be available on the REBIPP website and will be open for comments and contributions. This proposal of a data standard is also being discussed within the TDWG Biological Interaction Data Interest Group*3 in order to propose an international standard for species interaction data.

The importance of interaction data for guiding conservation practices and the management of ecosystem service provision has led to the proposal of defining Essential Biodiversity Variables (EBVs) related to biological interactions. Essential Biodiversity Variables (Pereira et al. 2013) were developed to identify the key measurements required to monitor biodiversity change. EBVs act as an intermediate abstraction layer between primary observations (raw data) and indicators (Niemeijer 2002). Six EBV classes were defined in an initial stage: genetic composition, species populations, species traits, community composition, ecosystem function and ecosystem structure. Each EBV class defines a list of candidate EBVs for biodiversity change monitoring (Fig. 1). Consequently, digitizing such data and making them available online are essential. Differences in sampling protocols may affect data scalability across space and time, hence imposing barriers to the full use of primary data and to EBV calculation (Henry et al. 2008). Thus, adopting common protocols and methods is the most straightforward approach to promote integration of collected data and to allow calculation of EBVs (Jürgens et al. 2011). Recently a workshop was held by GLOBIS-B*4 (GLOBal Infrastructures for Supporting Biodiversity research) to discuss species interaction EBVs (February 26-28, Bari, Italy). Plant-pollinator interactions received considerable attention, and REBIPP's work was presented there. As an outcome we expect to define specific EBVs for interactions, using plant-pollinator interactions as an example and considering pairwise interactions as well as variables related to interaction networks.

The terms in the plant-pollinator data standard under discussion at REBIPP will provide information not only on the EBVs related to interactions, but also on the other five EBV classes: species populations, species traits, community composition, ecosystem function and ecosystem structure. As noted above, some EBVs for specific ecosystem functions (e.g. pollination) lie beyond interaction network structures. The EBV 'Species interactions' (EBV class 'Community composition') should incorporate other aspects such as frequency (Vázquez et al. 2005), duration and empirical estimates of interaction strengths (Berlow et al. 2004).

Overall, we think the proposed plant-pollinator interaction data standard currently being developed by REBIPP will contribute to data aggregation, fill many data gaps, and also provide indicators for long-term monitoring, being an essential source of data for EBVs.
Abstract

The cTAKES package (using the ClearTK Natural Language Processing toolkit; Bethard et al. 2014, http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al. 2013, Styler et al. 2014). Dozens of medical institutions use it on a daily basis to automatically process clinical notes and extract relevant information. ClearEarth is a collaborative project that brings together computational linguists and domain scientists to port Natural Language Processing (NLP) modules trained on the same types of linguistic annotation to the fields of geology, cryology, and ecology. The goal for ClearEarth in the ecology domain is the extraction of ecologically relevant terms, including eco-phenotypic traits, from text and the assignment of those traits to taxa. Four annotators used Anafora (an annotation tool; https://github.com/weitechen/anafora) to mark seven entity types (biotic, aggregate, abiotic, locality, quality, unit, value) and six reciprocal property types (synonym of/has synonym, part of/has part, subtype/supertype) in 133 documents drawn primarily from the Encyclopedia of Life (EOL) and Wikipedia, according to project guidelines (https://github.com/ClearEarthProject/AnnotationGuidelines). Inter-annotator agreement ranged from 43% to 90%. Overall performance of ClearEarth on identifying named entities in biology text was good (precision: 85.56%; recall: 71.57%). The named entities with the best performance were organisms and their parts/products (biotic entities; precision: 72.09%, recall: 54.17%) and systems and environments (aggregate entities; precision: 79.23%, recall: 75.34%). Terms and their relationships extracted by ClearEarth can be embedded in the new ecocore ontology after vetting (http://www.obofoundry.org/ontology/ecocore.html). This project enables the use of advanced industry and research software within the natural sciences for downstream operations such as data discovery, assessment, and analysis. In addition, ClearEarth uses the NLP results to generate domain-specific ontologies and other semantic resources.
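For reference, precision and recall figures like those quoted above follow the usual entity-level computation, sketched here on invented spans (exact span-and-type matching assumed):

```python
# Sketch: entity-level precision/recall; spans are (start, end, type).
gold = {(0, 12, "biotic"), (20, 28, "quality"), (35, 40, "unit")}
predicted = {(0, 12, "biotic"), (20, 28, "value"), (50, 55, "unit")}

tp = len(gold & predicted)
precision = tp / len(predicted)
recall = tp / len(gold)
print(f"precision={precision:.2%} recall={recall:.2%}")
```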
Abstract

There are many ways to capture data from herbarium specimen labels. Here we compare the results of in-house versus outsourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same.

In 2014 Meise Botanic Garden (BR) embarked on a mass digitization project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was outsourced to a commercial company. The minimal data comprised the fields: the specimen's herbarium location, barcode, filing name, family, collector, collector number, country code and phytoregion (for the Democratic Republic of the Congo, Rwanda & Burundi). The outsourced data capture consisted of three types:

additional label information for central African specimens having minimal data;
complete data for the remaining African specimens; and,
species filing name information for African and Belgian specimens without minimal data.

As part of the preparation for outsourcing, a strict protocol had to be established setting out the criteria for acceptable data quality levels. Several lookup tables for data entry also had to be created to improve data quality. During the start-up phase all the data were checked, feedback was given, compromises were made and the protocol was amended. After this phase, an agreed-upon subsample was quality controlled. If the error score exceeded the agreed level, the batch was returned for retyping. The data went through three quality control checks during the process: by the data capturers, by the contractor's project managers and by ourselves.
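The accept/return rule can be sketched as follows (sample size, threshold and error flags are invented placeholders, not the contractual figures):

```python
# Sketch: audit a random subsample of a transcription batch; return the
# batch for retyping if the sampled error rate exceeds the agreed level.
import random

def audit_batch(batch, sample_size=200, max_error_rate=0.01):
    sample = random.sample(batch, min(sample_size, len(batch)))
    errors = sum(1 for record in sample if record["has_error"])
    return errors / len(sample) <= max_error_rate

batch = [{"has_error": random.random() < 0.008} for _ in range(5000)]
print("accept" if audit_batch(batch) else "return for retyping")
```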
Data quality was analysed and compared between the in-house and outsourced modes of data capture. The error rates of our staff and the external company were comparable. The types of error that occurred were often linked to the specific field in question. These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error < 1%). By improving the workflow and field definitions, a notable improvement could be made in the "data cleaning" phase.

The initial motivation for capturing some data in-house was financial. However, after analysis, this may not have been the most cost-effective approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
Abstract

Recent developments in digitisation technologies and equipment have enabled advances in the rate of natural history specimen digitisation. However, Europe's natural history collection institutions are home to over one billion specimens, and currently only a small fraction of these have been digitally catalogued, with fewer still imaged. It is clear that institutions still face huge challenges when digitising the vast number of specimens in their collections.

I will present the results of two surveys that aimed to discover the main successes and challenges facing institutions in their digitisation programmes. The first survey was undertaken in 2014 within the SYNTHESYS 3 project and gathered information from project partners on their digitisation facilities, equipment and workflows at the time, providing some key recommendations based on these findings. The second survey was completed more recently, in 2017, through the Consortium of European Taxonomic Facilities (CETAF) Digitisation Working Group. This survey aimed to discover successful protocols and implementations of digitisation, and to identify shortfalls in resources and protocols. Results from both surveys will feed into the future programme of the CETAF Digitisation Working Group as well as forthcoming and proposed EU projects, including Innovation and Consolidation for large-scale Digitisation of natural heritage (ICEDIG).
Abstract

On herbarium sheets, data elements such as plant name, collection site, collector, barcode and accession number are found mostly on labels glued to the sheet. The data are thus visible in specimen images. With continuously improving technologies for collection mass-digitisation, it has become progressively easier to produce high-quality images of herbarium sheets, and in the last few years herbarium collections worldwide have started to digitize specimens on an industrial scale (Tegelberg et al. 2014). To use the label data contained in these massive numbers of images, the data have to be captured and databased. Currently, manual data entry prevails and forms the principal cost and time limitation in the digitization process. The StanDAP-Herb project has developed a standard process for the (semi-)automatic detection of data on herbarium sheets. This is a formal, extensible workflow integrating a wide range of automated specimen image analysis services, used to replace time-consuming manual data input as far as possible. We have created web services for OCR (Optical Character Recognition), for identifying regions of interest in specimen images, and for the context-sensitive extraction of information from text recognized by OCR. We implemented the workflow as an extension of the OpenRefine platform (Verborgh and De Wilde 2013).
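As a rough illustration of the OCR step (this uses the common pytesseract/Pillow stack, not the project's own web services; the file name, crop box and barcode pattern are invented):

```python
# Sketch: OCR a cropped label region, then pull out a barcode-like token.
import re
from PIL import Image
import pytesseract

label = Image.open("sheet_0001.jpg").crop((2000, 3000, 3000, 3500))
text = pytesseract.image_to_string(label)

match = re.search(r"\bBR\d{7,10}\b", text)  # hypothetical barcode pattern
print(text)
print("barcode:", match.group() if match else "not found")
```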
Abstract

Globally, there are a number of citizen science portals to support the digitisation of biodiversity collections. Digitisation not only involves imaging of the specimen itself, but also includes the digital transcription of label and ledger data, georeferencing and linking to other digital resources. Making use of the skills and enthusiasm of volunteers is potentially a good way to reduce the great backlog of specimens to be digitised.

These citizen science portals engage the public and are liberating data that would otherwise remain on paper. There is also considerable scope for expansion into other countries and languages. Therefore, should we continue to expand? Volunteers give their time for free, but the creation and maintenance of the platform is not without costs. Given a finite budget, what can you get for your money? How does the quality compare with other methods? Is crowdsourcing of label transcription faster, better and cheaper than other forms of transcription?

We will summarize the use of volunteer transcription from our own experience and the reports of other projects. We will base our evaluation on the costs, speed and quality of the systems, and reach conclusions on why you should or should not use this method.
Abstract

The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity data portal, based on the Atlas of Living Australia (ALA), which provides integrated, free, and open access to data and information about Costa Rican biodiversity in order to support science, education, and conservation. It is managed by the Biodiversity Informatics Research Center (CRBio) and the National Biodiversity Institute (INBio). Currently, the Atlas of Living Costa Rica includes nearly 8 million georeferenced species occurrence records, mediated by the Global Biodiversity Information Facility (GBIF), which come from more than 900 databases and have been published by research centers in 36 countries. Half of those records are published by Costa Rican institutions. In addition, CRBio is making a special effort to enrich and share more than 5,000 species pages, developed by INBio, about Costa Rican vertebrates, arthropods, molluscs, nematodes, plants and fungi. These pages contain information elements pertaining to, for instance, morphological descriptions, distribution, habitat, conservation status, management, nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries, such as Spain, Mexico, Colombia and Brazil, to standardize this type of information through Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms that can be used to describe different aspects of biological species.

The Biodiversity Information Explorer (BIE) is one of the modules made available by ALA; it indexes taxonomic and species content and provides a search interface for it. We will present how CRBio is implementing BIE as part of the Atlas of Living Costa Rica in order to share all the information elements contained in the Costa Rican species pages.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. It has developed an open and free platform for sharing and exploring biodiversity data. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia).

GBIF Benin, hosted at the University of Abomey-Calavi, has published more than 338,000 occurrence records from 87 datasets and 2 checklists. Through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), GBIF Benin, with the help of GBIF France, is in the process of deploying the Beninese data portal using the GBIF France back-end architecture. GBIF Benin is the first African country to implement this module of the ALA infrastructure.

In this presentation, we will give an overview of the registry and the occurrence search engine of the Beninese data portal. We will begin with the administration interface and how to manage metadata, then continue with the user interface of the registry and how Beninese occurrences can be found through the hub.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. In 2010, it launched an open and free platform for sharing and exploring biodiversity data. Thanks to this new infrastructure, it has been able to drastically increase the number of occurrences published through GBIF.org. In order to help other GBIF nodes and institutions, ALA has made all of its modules publicly available for reuse and customization through GitHub (https://github.com/AtlasOfLivingAustralia).

Since 2013, the community of developers interested in ALA tools has organized, with the help of GBIF, eight technical workshops around the world. These workshops have helped launch at least 13 data portals. The last training session, funded through the GBIF Capacity Enhancement Support Programme (https://www.gbif.org/programme/82219/capacity-enhancement-support-programme), was attended by 23 participants from 19 countries on 6 continents. Moreover, on the new GBIF website a section has been dedicated to this programme (https://www.gbif.org/programme/82953/living-atlases), the official Living Atlases community website was launched in 2017 (https://living-atlases.gbif.org), and the technical documentation has been improved and translated into several languages. All of these achievements would not have been possible without a huge effort from the ALA developer community.

After a brief introduction to the Living Atlases community, we will present the work done by ALA to simplify the process of getting a living atlas up and running. We will also show how ALA developers have helped community members create their own versions by performing simple HTML/CSS customizations.
Abstract

The Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global Biodiversity Information Facility (GBIF) node of Australia. Since 2010, it has developed and improved a platform for sharing and exploring biodiversity information. All the modules are publicly available for reuse and customization on its GitHub account (https://github.com/AtlasOfLivingAustralia).

The National Biodiversity Network, a registered charity, is the UK GBIF node and has been sharing biodiversity data since 2000. It has published more than 79 million occurrences from 818 datasets. In 2016, it launched the NBN Atlas Scotland (https://scotland.nbnatlas.org/), based on the Atlas of Living Australia infrastructure. Since then, it has released the NBN Atlas (https://nbnatlas.org/), the NBN Atlas Wales (https://wales.nbnatlas.org/) and, soon, the NBN Atlas Isle of Man. In addition to the occurrence/species search engine and the metadata registry, it has put in place several tools that help users work with data published in the network: the spatial portal and the "explore your region" module. Both elements are based on Atlas of Living Australia developments.

Because the Atlas of Living Australia platform is powerful and reusable, we want to showcase these two applications used to make geographical analyses. To do so, we will present the specificities of each component by giving examples of some of their functionalities.
Abstract | |
During the last few years, a large number of countries have deployed national customized versions of the Atlas of Living Australia (ALA) (https://www.ala.org.au/), a collaboratively developed, open infrastructure for collecting and presenting biodiversity data nationally and for sharing it globally through GBIF (https://gbif.org). The increasing number of national nodes deploying this free and open-source software platform has built a worldwide community involving more than 17 countries that collaborate openly in a decentralized way (https://living-atlases.gbif.org/), helping each other out by organizing technical workshops and by developing and sharing new software modules using GitHub.
One of these modules in the Living Atlases infrastructure is an R package called ALA4R, originally created by Ben Raymond (https://github.com/AtlasOfLivingAustralia/ALA4R). It provides the research community with programmatic access to many of the Living Atlases data services using R.
This presentation will show how ALA4R can be used to access data from different national Living Atlases nodes, and how this R package can enable research studies that use the methods and practices for reproducible workflows increasingly being established within the research community (https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf).
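Although ALA4R itself is an R package, the Living Atlases data services it wraps are plain HTTP, so the same queries can be sketched in other languages. Below is a minimal Python sketch; the endpoint and parameter names are assumptions to be verified against each node's API documentation.

    # A minimal Python sketch (not ALA4R itself) of querying a Living Atlas
    # occurrence web service over plain HTTP; the endpoint and parameter
    # names are assumptions to verify against each node's documentation.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://biocache.ala.org.au/ws/occurrences/search"  # assumed ALA endpoint
    params = {"q": "Phascolarctos cinereus", "pageSize": 5}

    with urllib.request.urlopen(BASE + "?" + urllib.parse.urlencode(params)) as resp:
        data = json.load(resp)

    print(data.get("totalRecords"))
    for occ in data.get("occurrences", []):
        print(occ.get("scientificName"), occ.get("decimalLatitude"), occ.get("decimalLongitude"))

For another national node, only the base URL would change, which is what makes a shared client library such as ALA4R practical across the whole Living Atlases community.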
Abstract | |
Many, if not most, countries have several official or widely used languages, and most, if not all, of these countries have herbaria. Furthermore, specimens have been exchanged between herbaria from many countries, so herbaria are often polylingual collections. It is therefore useful to have label transcription systems that can attract users proficient in a wide variety of languages. Belgium is a typical polylingual country at the boundary between the Romance and Franconian languages (French, Dutch & German). Yet there are currently few non-English transcription platforms for citizen science. This is why, in Belgium, we built DoeDat from the Digivol system of the Atlas of Living Australia.
We will demonstrate DoeDat and its multilingual features. We will explain how we enter translations, both for the user interface and for the dynamic parts of the website. We will share our experiences of running a multilingual site and the challenges it brings. Translating and running such a website requires skilled personnel and patience. However, our experience has been positive, and the number and quality of our volunteer transcriptions have been rewarding. We look forward to the further use of DoeDat to transcribe data in many other languages. There is no longer any reason to exclude willing volunteers, whatever their language.
Abstract | |
MapBio is a project initiated by the Chinese Academy of Sciences that aims to integrate species distribution data from different sources and to map the biodiversity of China in support of biodiversity research and conservation decisions. Species distribution data may be found in journal articles, books and various databases in many formats, and most species distributions are described in free text. MapBio is building a workflow for collecting this free text, parsing it into standardized data and projecting distributions onto a map for each species in China. A map module of MapBio has been designed and implemented, based on Web GIS, to visualize species distributions at different levels, e.g., occurrence points, counties, provinces, distribution ranges, protected areas, waterbodies and biogeographic realms. Since the completeness of distribution data is very important for assessing biodiversity, we developed a tool in MapBio for analyzing gaps in distribution data. Based on the species distribution data, especially the occurrence data, MapBio provides an integrated modeling tool to help users build species niche models. MapBio is an open-access project. Users can easily obtain data and services from it for biodiversity research and conservation, and can also contribute their own biodiversity data to MapBio.
Abstract | |
For more than a decade, the biodiversity informatics community has recognised the importance of stable, resolvable identifiers to enable unambiguous references to data objects and the associated concepts and entities, including museum/herbarium specimens and, more broadly, all records serving as evidence of species occurrence in time and space. Early efforts built on the Darwin Core institutionCode, collectionCode and catalogNumber terms, treated as a triple and expected uniquely to identify a specimen. Following a review of current technologies for globally unique identifiers, TDWG adopted Life Science Identifiers (LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in the LSID consortium soon withdrew support, leaving TDWG committed to a moribund technology. Subsequently, publishers of biodiversity data have adopted a range of technologies to provide unique identifiers, including (among others) HTTP Uniform Resource Identifiers (URIs), Universally Unique Identifiers (UUIDs), Archival Resource Keys (ARKs), and Handles. Each of these technologies has merit, but none provides consistent guarantees of persistence or resolvability. More importantly, the heterogeneity of these solutions hampers delivery of services that could treat all of these data objects as part of a consistent linked-open-data domain.
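The fragility of the early triple approach is easy to see when written out; below is a minimal Python sketch with hypothetical values.

    # A minimal sketch of the early Darwin Core "triple" approach; all values
    # are hypothetical. Uniqueness depends entirely on the three free-text
    # fields never colliding and never changing.
    record = {
        "institutionCode": "NHMUK",
        "collectionCode": "E",
        "catalogNumber": "1234567",
    }
    triple_id = ":".join(record[k] for k in
                         ("institutionCode", "collectionCode", "catalogNumber"))
    print(triple_id)  # NHMUK:E:1234567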
The geoscience community has established the System for Earth Sample Registration (SESAR), which enables collections to publish standard metadata records for their samples and to associate each of these with an International Geo Sample Number (IGSN; http://www.geosamples.org/igsnabout). IGSNs follow a standard format, distribute responsibility for uniqueness between SESAR and the publishing collections, and support resolution via HTTP URIs or Handles. Each IGSN resolves to a standard metadata page, roughly equivalent in detail to a Darwin Core specimen record. The standardisation of identifiers has allowed the community to secure support from some journal publishers for the promotion and use of IGSNs within articles.
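From a client's perspective, IGSN resolution can be sketched as follows; the sample number is hypothetical and the resolver URL patterns (igsn.org and the 10273 Handle prefix) are assumptions to verify.

    # A minimal sketch of IGSN resolution from a client's perspective; the
    # sample number is hypothetical, and both resolver URL patterns are
    # assumptions to verify.
    import urllib.request

    igsn = "SSH000SUA"  # hypothetical IGSN
    for url in (f"http://igsn.org/{igsn}",                # HTTP URI resolution
                f"https://hdl.handle.net/10273/{igsn}"):  # Handle resolution
        with urllib.request.urlopen(url) as resp:
            print(url, "->", resp.geturl())  # follows redirects to the landing page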
The biodiversity informatics community encompasses a much larger number of publishers and greater pre-existing variation in identifier formats. Nevertheless, it would be possible to deliver a shared global identifier scheme with the same features as IGSNs by building on the aggregation services offered by the Global Biodiversity Information Facility (GBIF). The GBIF data index includes normalised Darwin Core metadata for all data records from registered data sources and could serve as a platform for resolution of HTTP URIs and/or Handles for all specimens and all occurrence records. The most significant trade-off requiring consideration would be between autonomy for collections and other publishers in how they format identifiers within their own data, and the benefits that may arise from greater consistency and predictability in the form of resolvable identifiers.
Abstract | |
A simple, permanent and reliable specimen identifier system is needed to take the informatics of collections into a new era of interoperability. A system of identifiers based on HTTP URIs (Uniform Resource Identifiers), endorsed by the Consortium of European Taxonomic Facilities (CETAF), has now been rolled out to 14 member organisations (Güntsch et al. 2017).
CETAF identifiers have a Linked Open Data redirection mechanism for both human- and machine-readable access and, if fully implemented, provide Resource Description Framework (RDF)-encoded specimen data following best practices continuously improved by members of the initiative. To date, more than 20 million physical collection objects have been equipped with CETAF identifiers (Groom et al. 2017).
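The redirection mechanism relies on standard HTTP content negotiation: the same specimen URI serves HTML to a browser and RDF when a machine asks for it, as in the minimal Python sketch below (the URI is a hypothetical example, not a real CETAF identifier).

    # A minimal sketch of the Linked Open Data redirection: one specimen URI,
    # two representations selected by HTTP content negotiation.
    import urllib.request

    uri = "http://data.example-herbarium.org/specimen/B100001234"  # hypothetical
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req) as resp:
        print(resp.geturl())      # redirected to the RDF representation
        print(resp.read()[:200])  # start of the RDF document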
To facilitate the implementation of stable identifiers, simple redirection scripts and guidelines for deciding on the local identifier syntax have been compiled (http://cetafidentifiers.biowikifarm.net/wiki/Main_Page). Furthermore, a capable "CETAF Specimen URI Tester" (http://herbal.rbge.info/) provides an easy-to-use service for testing whether existing identifiers are operational.
For the usability and potential of any identifier system associated with evolving data objects, active links to the source information are critically important. This is particularly true for natural history collections facing the next wave of industrialised mass digitisation, where specimens come online with only basic, but rapidly evolving, label data. Specimen identifier systems must therefore have components for monitoring the availability and correct implementation of individual data objects. Our next implementation steps will involve the development of a "Semantic Specimen Catalogue", which will hold a list of all existing specimen identifiers together with the latest RDF metadata snapshot. The catalogue will be used for semantic inference across collections, as well as the basis for periodic testing of identifiers.
Abstract | |
Life sciences research, and more specifically biodiversity sciences research, has yet to coalesce on a single system of identifiers for specimens (physical samples collected for research), or even a single set of standards for identifiers. Diverse identifier systems lead to duplication and ambiguity, which in turn lead to challenges in finding specimens, tracking and citing their usage, and linking them to data. Other research disciplines provide experience that the biodiversity sciences could use to overcome these challenges. Earth sciences/geology may be the most advanced discipline in this regard, thanks to the use of the International GeoSample Number (IGSN) system, which was established to provide globally unique identifiers for geological samples. The original motivation of IGSN was to overcome duplication of sample numbers reported in the scientific literature and to support the correlation of observations on the same samples carried out by different laboratories and reported in different publications. The IGSN system is managed through a small set of 'allocating agents' who act on behalf of a national agency or community, under the overall coordination of the IGSN Organization, a volunteer group representing a mixture of research institutions and agencies. As with the widely recognized Digital Object Identifiers (DOIs), the primary requirement of an allocating agent is to maintain the mapping from an IGSN to a web 'landing page' corresponding to each sample. A standard (minimal) schema for describing samples registered with IGSN has been developed, but individual IGSN allocating agents often supplement the base metadata with additional information. Other efforts are working on cross-disciplinary sample metadata schemas, but no single core standard has been agreed upon yet. An important part of the development of the IGSN system has been engagement with scholarly publishers, with the goals of making each mention of an IGSN within a report or paper a hyperlink, and of having links to other observations relating to the same sample automatically highlighted by the publisher.
Abstract | |
Zooarchaeological specimens are the remains of animals, including vertebrate and invertebrate taxa, recovered from, or in association with, archaeological contexts of deposition or surrounding landscapes. The physical scope of zooarchaeological specimens is diverse and includes macro- and micro-zooarchaeological specimens composed of archaeologically preserved bone, shell, exoskeletons, teeth, hair or fur, scales, horns or antlers, as well as geochemical (e.g., isotopes) and biochemical (e.g., ancient DNA) signatures derived from faunal remains. Artifacts and objects created from animal remains, such as bone pins, shell beads, and preserved animal hides, are also zooarchaeological specimens. Here we present recent work to use identifiers for archaeological samples in new data publishing routines, focusing on key challenges. One critical challenge is that archaeological samples are often composited into different units depending on collection managers and analysts. Thus, in some cases, when migrating datasets for publication, identifiers can refer to different sets of units, even within the same dataset. Another key challenge is ensuring that different repositories can share sample identifiers. We show how Open Context, a site-based, archaeology-focused repository that also manages objects such as zooarchaeological material, and VertNet, a specimen-oriented biodiversity repository, have collaborated to share sample identifiers.
While this illustrates a success story of linking data across repositories, we discuss the complexity whereby "occurrence identifiers" (not true sample identifiers) in VertNet are propagated to another system, where they point to a similar record called "Animal Bone" in Open Context.
Abstract | |
The Ocean Biogeographic Information System (OBIS) began in 2000 as the repository for data from the Census of Marine Life. Since that time, OBIS has expanded its goals beyond simply hosting data to supporting more aspects of marine conservation (Pooter et al. 2017). To accomplish those goals, the OBIS secretariat, in partnership with its European node (EurOBIS) hosted at the Flanders Marine Institute (VLIZ, Belgium) and the Intergovernmental Oceanographic Commission (IOC) Committee on International Oceanographic Data and Information Exchange (IODE; 23rd session, March 2015, Brugge), established a two-year pilot project to address a particularly problematic issue: environmental data collected as part of marine biological research were being disassociated from the biological data. OBIS-Event-Data is the solution developed from that pilot project, which devised a method for keeping environmental data together with the biological data (Pooter et al. 2017).
OBIS is seeking early adopters of the new OBIS-Event-Data standard from among the marine biodiversity monitoring communities, to further validate the standard and to develop data products and scientific applications supporting the enhancement of Biological and Ecosystem Essential Ocean Variables (EOVs) in the framework of the Global Ocean Observing System (GOOS) and the Marine Biodiversity Observation Network of the Group on Earth Observations (GEO BON MBON).
After the successful two-year IODE pilot project OBIS-ENV-DATA, the IOC established a new two-year IODE pilot project, OBIS-Event-Data for Scientific Applications (2017-2019). The OBIS-Event-Data standard, building on Darwin Core, provides a technical solution for combined biological and environmental data, and incorporates details about sampling methods and effort, including the event hierarchy. It also standardizes the parameters involved in biological, environmental and sampling details using an international standard controlled vocabulary (British Oceanographic Data Centre, Natural Environment Research Council).
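The event hierarchy and measurement pattern at the heart of the standard can be sketched as follows; the Python sketch below uses illustrative identifiers, values and vocabulary URI, not records from a real dataset.

    # A minimal sketch of the OBIS-Event-Data pattern: a Darwin Core event
    # hierarchy with an occurrence and an environmental measurement linked to
    # the same sampling event. All values are illustrative assumptions.
    cruise = {"eventID": "cruise-2018-01", "eventDate": "2018-04"}
    station = {"eventID": "cruise-2018-01-st05",
               "parentEventID": "cruise-2018-01",   # event hierarchy
               "eventDate": "2018-04-12",
               "samplingProtocol": "CTD rosette"}
    occurrence = {"occurrenceID": "cruise-2018-01-st05-occ1",
                  "eventID": "cruise-2018-01-st05",
                  "scientificName": "Calanus finmarchicus",
                  "basisOfRecord": "HumanObservation"}
    measurement = {"eventID": "cruise-2018-01-st05",  # environment stays with the event
                   "measurementType": "water temperature",
                   "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",  # assumed BODC term URI
                   "measurementValue": 7.4,
                   "measurementUnit": "degrees Celsius"}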
A workshop organized by IODE/OBIS in April brought together major animal tagging and tracking networks, such as the Ocean Tracking Network (OTN), the Animal Telemetry Network (ATN), the Integrated Marine Observing System (IMOS), the European Tracking Network (ETN) and the Acoustic Tracking Array Platform (ATAP), to test the OBIS-Event-Data standard through the development of data products and science applications. This workshop also contributed to the further maturation of the GOOS EOV on fish, as well as the EOV on birds, mammals and turtles. We will present the outcomes and lessons learned from this workshop on the problems, solutions, and applications of using Darwin Core/OBIS-Event-Data for bio-logging data.
Abstract | |
In recent years, bio-logging data, automatically gathered by sensors deployed on animals, have become one of the fastest-growing sources of biodiversity data. This is largely due to the steadily declining mass, size and cost of sensors, continuously opening new opportunities to monitor new species. While 'tracking data' (data from spatially enabled sensors such as GPS sensors) was previously most prominent, almost 70% of all bio-logging data now consists of non-spatial data, e.g., physiological data. In contrast to the biodiversity data community, where standards to mobilize and exchange data are relatively well established, the bio-logging community still lacks standards to transport data from sensors into repositories, or to mobilize data in a standardized format from different repositories, to enable cooperation between users, shared software tools, data aggregation for meta-analysis, or a consistent format for long-term archiving.
To set the stage for a discussion about standards for bio-logging data to be developed or adapted, we present a mind map describing the different pathways of bio-logging data during its life cycle, and the opportunities for standardization within this cycle. As an example, we present the use of the Open Geospatial Consortium (OGC) 'SensorML' and 'Observations & Measurements' standards to transfer bio-logging data from a sensor to a repository and ultimately to a user for subsequent analysis. These standards provide machine-readable methods for describing bio-logging sensors and the measurements they collect, offering a standardized structure that can be customized by the bio-logging community (e.g. with standardized vocabularies) to achieve interoperability.
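To give a flavour of such a transfer, the minimal Python sketch below serializes one measurement in a simplified Observations & Measurements style; it is illustrative only and not a schema-valid O&M document.

    # A minimal sketch serializing one bio-logging measurement in a
    # simplified Observations & Measurements style; illustrative only,
    # not a schema-valid O&M document.
    import xml.etree.ElementTree as ET

    OM = "http://www.opengis.net/om/2.0"
    ET.register_namespace("om", OM)

    obs = ET.Element(f"{{{OM}}}OM_Observation")
    ET.SubElement(obs, f"{{{OM}}}phenomenonTime").text = "2018-05-01T12:00:00Z"
    ET.SubElement(obs, f"{{{OM}}}procedure").text = "heart-rate-logger-0042"  # hypothetical sensor ID
    ET.SubElement(obs, f"{{{OM}}}observedProperty").text = "heart_rate"
    ET.SubElement(obs, f"{{{OM}}}result").text = "62"  # beats per minute

    print(ET.tostring(obs, encoding="unicode"))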
Abstract | |
Usefully describing sensor deployments on animals is a major challenge for advocates of data standards. Bio-logging studies also need to be documented in a standard manner to facilitate discovery and to determine relevance. For systems aggregating biodiversity occurrence records, the use of the Darwin Core standard (Wieczorek et al. 2012) to express species occurrences is near ubiquitous. Bio-logging studies are, without exception, collections of species occurrences that yield high-quality spatial and temporal data recorded by specialists.
There are many benefits to summarising these studies as a single, flat-file record. Simple Darwin Core offers the ability to do this by representing the multiple occurrences as a date range in dwc:eventDate and a footprint polygon in dwc:footprintWKT for the area covered by the track. By also uniformly describing the species, setting dwc:basisOfRecord to MachineObservation, and using a controlled vocabulary to describe the type of bio-logging data, systems could offer an effective means of querying tracking data. It is important to look to other data standards initiatives relevant to bio-logging to ensure common usage of Darwin Core terms.
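A flat summary record of this kind, using the Darwin Core terms named above with hypothetical values, might look like the following sketch.

    # A minimal sketch of a flat Simple Darwin Core record summarising one
    # tracking deployment; all values are hypothetical.
    track_summary = {
        "basisOfRecord": "MachineObservation",
        "scientificName": "Chelonia mydas",
        "eventDate": "2017-01-03/2017-06-28",  # date range of the deployment
        "footprintWKT": ("POLYGON ((146.0 -18.0, 147.5 -18.0, "
                         "147.5 -16.5, 146.0 -16.5, 146.0 -18.0))"),
        "samplingProtocol": "satellite telemetry",  # candidate controlled-vocabulary slot
    }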
The Atlas of Living Australia is using an implementation of Simple Darwin Core to represent data from the bio-logging platform ZoaTrack as occurrence data, making it discoverable via location- or species-based searches. Other initiatives, for example Swedish LifeWatch, follow a similar approach to represent data from the Wireless Remote Animal Monitoring (WRAM) Scandinavian bio-logging infrastructure. With endorsement from the community, the implementation could be useful as a type of metadata catalogue record, opening it up for use in application programming interface (API) development and thus enabling machine interoperability between systems and users. In short, bio-logging systems and practitioners would be able to easily discover relevant studies by searching by location and/or species.
Abstract | |
With the continuous development of imaging technology, the amount of insect 3D data is increasing, but research on its data management is still virtually non-existent. This paper discusses the specifications and standards relevant to the process of insect 3D data acquisition, processing and analysis.
The collection of insect 3D data includes specimen collection, sample preparation, image scanning specifications and 3D model specifications. The specimen collection information uses existing biodiversity information standards such as Darwin Core. However, the 3D scanning process involves unique specimen preparation specifications, depending on the scanning equipment, to achieve the best imaging results.
Data processing of 3D images includes 3D reconstruction, tagging morphological structures (such as muscle and skeleton), and 3D model building. There are different algorithms in the 3D reconstruction process, but the processing results generally follow the DICOM (Digital Imaging and Communications in Medicine) standards. There is no available standard for marking morphological structures, because this process is currently executed by individual researchers who create operational specifications according to their own needs. 3D models have specific file specifications, such as Wavefront object files (https://en.wikipedia.org/wiki/Wavefront_.obj_file) and the 3ds Max format (https://en.wikipedia.org/wiki/.3ds), which are widely used at present.
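As an example of working with these formats, the minimal Python sketch below loads a DICOM slice stack into a single voxel array using the pydicom library; the file layout is hypothetical.

    # A minimal sketch, assuming the pydicom library, of loading a DICOM
    # slice stack from a micro-CT scan into one voxel array; the file
    # layout is hypothetical.
    import glob
    import numpy as np
    import pydicom

    slices = [pydicom.dcmread(p) for p in sorted(glob.glob("scan/slice_*.dcm"))]
    volume = np.stack([s.pixel_array for s in slices])  # (n_slices, rows, cols)
    print(volume.shape)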
There are only a few simple tools for the analysis of three-dimensional data, and there are no specific standards or specifications in Audubon Core (https://terms.tdwg.org/wiki/Audubon_Core), the TDWG standard for biodiversity-related multimedia.
There are very few 3D databases of animals at this time. Most insect 3D data are created by individual entomologists and are not even stored in databases. Specifications for the management of insect 3D data need to be established step by step. Based on our attempt to construct a database of insect 3D data, we preliminarily discuss the necessary specifications.
Abstract | |
iDigBio (Matsunaga et al. 2013) currently references over 22 million media files, and stores approximately 120 terabytes of those media files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively.
Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems lab (ACIS) alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. We have placed a Jupyter notebook server in front of this architecture, providing an easy environment, with deep learning libraries for Python already loaded, in which end users can write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub.
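The minimal Python sketch below illustrates the kind of user-defined step the pipeline supports; it assumes PySpark and worker-accessible image paths, and is a drastic simplification of the actual GUODA setup.

    # A minimal sketch of a user-defined step in such a pipeline, assuming
    # PySpark and worker-accessible image paths; a drastic simplification
    # of the actual GUODA setup.
    import io
    from pyspark.sql import SparkSession
    from PIL import Image

    spark = SparkSession.builder.appName("image-pipeline-sketch").getOrCreate()

    def thumbnail_size(path):
        # Example user-defined transformation: resize and report dimensions.
        with open(path, "rb") as f:
            img = Image.open(io.BytesIO(f.read()))
        img.thumbnail((256, 256))
        return (path, img.size)

    paths = ["/data/images/specimen_0001.jpg", "/data/images/specimen_0002.jpg"]  # hypothetical
    results = spark.sparkContext.parallelize(paths).map(thumbnail_size).collect()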
As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio, both to illustrate the application of these techniques to broad image corpora and, potentially, to notify other data publishers of contamination. We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
Abstract | |
Earth's ecosystems are threatened by anthropogenic change, yet relatively little is known about biodiversity across broad spatial (i.e. continental) and temporal (i.e. year-round) scales. There is a significant gap at these scales in our understanding of species distribution and abundance, which is the precursor to conservation (Hochachka et al. 2012). The cost and availability of experts to collect data do not scale to broad spatial or temporal surveys. With recent advances in artificial intelligence (AI), it is becoming possible to automate some of this data collection and analysis (Joppa 2017). The Cornell Lab of Ornithology is working to apply AI in three ways:
incorporating AI into the analysis of radar data to assess densities of migratory birds at a continent-wide scale and across years;
utilizing new techniques in convolutional neural networks (CNNs) to improve our ability to classify natural sounds by limiting background noise;
applying our ability to train models to classify birds in images to build a system that can analyze video streams.
Our approach to accomplishing this is through partnerships between our | |
non-profit organization, computer science faculty, and industry leaders. | |
By leveraging deep learning technologies and including an array of | |
stakeholders, we are able to process data that would take years to | |
analyze using traditional methods. | |
Methods. | |
We use 28 years of Next-Generation Radar (NEXRAD) imagery, which captures birds aloft during nocturnal migration. Using CNNs, we can assess the density of birds captured on radar images to count the number of individuals crossing the continental U.S. each spring and fall. For acoustic analysis of birds vocalizing during nocturnal migration, we are using recorders to monitor the calling activity of birds aloft and CNNs to detect and classify bird vocalizations in noisy landscapes. We gathered more than 6 million images from the eBird community, archived them in the Macaulay Library at the Cornell Lab of Ornithology, and crowdsourced millions of annotations to train models to classify more than 5,000 species of birds in images. Now we are applying this approach to video. These projects have used both supervised and unsupervised learning techniques. With supervised learning and the use of elaborate training datasets, we have made tremendous headway in bird photo identification. Unsupervised learning was used successfully to eliminate rain in NEXRAD images, with little training data incorporated. We expect advances in unsupervised learning to open new possibilities in the future.
Conclusions. | |
The Cornell Lab pioneered the concept of autonomous recording units for monitoring biodiversity two decades ago, but without AI to process the data, discoveries were limited by human processing time. Today, we can combine our findings from radar with acoustic monitoring and sightings from citizen scientists for a more complete understanding of bird populations. We expect AI processes to be able to identify birds with high confidence in the near future for images, audio recordings and videos. Furthermore, while conventional approaches require separate neural nets combined in a separate process, we now integrate multi-modal sensor data into a single CNN, removing the need for pre-processing of data for AI pattern recognition. Our vision is to continue to apply these techniques to create a 'real-time global bird monitoring network' with a combination of humans and automated sensors. This network of sensors (or robots) will have an ability comparable to a human's to detect, identify, and count birds, gathering information systematically and in places where humans cannot reach.
Abstract | |
Widespread technology usage has resulted in a deluge of data that is not limited to scientific domains. For example, technology companies accumulate vast amounts of data on their users to support their applications and platforms. The participation of many domains in big data collection, data analysis and visualization, and the need for fast data exploration, has provided a stellar market opportunity for high-quality data visualization software to emerge. In this talk, leading industry visualization software (Tableau) will be used to explore a biodiversity dataset (Carex spp. distribution and morphology). The advantages and disadvantages of using Tableau for scientific exploration will be discussed, as well as how to integrate data visualization tools early in the data pipeline. Lastly, the potential for developing a data visualization "stack" (i.e., a combination of software products and programming languages) using available tools will be discussed, as well as what the future might look like for scientists looking to capitalize on the growth of industry tools.
Abstract | |
Phytoplankton form the basis of the marine food web and are an indicator of the overall status of the marine ecosystem. Changes in this community may impact a wide range of species (Capuzzo et al. 2018), ranging from zooplankton and fish to seabirds and marine mammals. Efficient monitoring of the phytoplankton community is therefore essential (Edwards et al. 2002). Traditional monitoring techniques are highly time-intensive and involve taxonomists identifying and counting numerous specimens under the light microscope. With the recent development of automated sampling devices, image analysis technologies and learning algorithms, the rate of counting and identification of phytoplankton can be increased significantly (Thyssen et al. 2015). The FlowCAM (Álvarez et al. 2013) is an imaging particle analysis system for the identification and classification of phytoplankton. Within the Belgian LifeWatch observatory, monthly phytoplankton samples are taken at nine stations in the Belgian part of the North Sea. These samples are run through the FlowCAM and each particle is photographed. Next, the particles are identified based on their morphology (and fluorescence) using state-of-the-art convolutional neural networks (CNNs) for computer vision. This procedure requires learning sets of expert-validated images. The CNNs are specifically designed to take advantage of the two-dimensional structure of these images by finding local patterns, making them easier to train and giving them many fewer parameters than a fully connected network with the same number of hidden units.
In this work we present our approach to the use of CNNs for the identification and classification of phytoplankton, testing it on several benchmarks and comparing it with previous classification techniques. The network architecture used is ResNet50 (He et al. 2016). The framework is fully written in Python using the TensorFlow (Abadi et al. 2016) module for deep learning.
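A minimal training sketch, assuming TensorFlow/Keras and a directory of expert-validated FlowCAM images with one sub-directory per taxon, is shown below; paths and hyperparameters are illustrative rather than the configuration actually used.

    # A minimal sketch, assuming TensorFlow/Keras and a directory of
    # expert-validated FlowCAM images (one sub-directory per taxon);
    # paths and hyperparameters are illustrative.
    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "flowcam_images/train",  # hypothetical layout: one folder per class
        image_size=(224, 224), batch_size=32)
    num_classes = len(train_ds.class_names)

    model = tf.keras.applications.ResNet50(
        weights=None, classes=num_classes, input_shape=(224, 224, 3))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_ds, epochs=...)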
Deployment and exploitation of the current framework are supported by the recently started project DEEP-Hybrid-Datacloud, funded by the European Union Horizon 2020 programme (Grant Agreement number 777435), which supports the computationally expensive training of the system needed to develop the application and provides the necessary computational resources to users.
Abstract | |
Over the next five years, major advances in the development and application of numerous technologies related to computing, mobile phones, artificial intelligence (AI), and augmented reality (AR) will have a dramatic impact on biodiversity monitoring and conservation. Over a two-week period, several of us had the opportunity to meet with multiple technology experts in Silicon Valley, California, USA to discuss trends in technology innovation and how they could be applied to conservation science and ecology research. Here we briefly highlight some of the key points of these meetings with respect to AI and Deep Learning.
Computing: Investment and rapid growth in AI and Deep Learning technologies are transforming how machines can perceive the environment. Much of this change is due to the increased processing speeds of Graphics Processing Units (GPUs), now a billion-dollar industry. Machine learning applications, such as convolutional neural networks (CNNs), run more efficiently on GPUs and are being applied to analyze visual imagery and sounds in real time. Rapid advances in CNNs that use both supervised and unsupervised learning to train the models are improving accuracy. A Deep Learning approach in which the base layers of the model are built upon datasets of known images and sounds (supervised learning) and later layers rely on unclassified images or sounds (unsupervised learning) dramatically improves the flexibility of CNNs in perceiving novel stimuli. The potential to have autonomous sensors gathering biodiversity data in the same way personal weather stations gather atmospheric information is close at hand.
Mobile Phones: The phone is the most widely used information appliance in the world. No device on the near horizon will challenge this platform, for several key reasons. First, network access is ubiquitous in many parts of the world. Second, batteries are improving by about 20% annually, allowing for more functionality. Third, app development is a growing industry with significant investment in specializing apps for machine learning. While GPUs already run on phones for video streaming, there is much optimism that reduced or approximate Deep Learning models will operate on phones. These models are already working in the lab; the biggest hurdle is power consumption, and developing energy-efficient applications and algorithms to run complicated AI processes will be important. It is just a matter of time before industry has AI functionality on phones.
These rapid improvements in computing and mobile phone technologies have huge implications for biodiversity monitoring, conservation science, and understanding ecological systems. Computing: AI processing of video imagery or acoustic streams creates the potential to deploy autonomous sensors in the environment that will be able to detect and classify organisms to species. Further, AI processing of Earth spectral imagery has the potential to provide finer-grained classification of habitats, which is essential in developing fine-scale models of species distributions over broad spatial and temporal extents. Mobile Phones: Increased computing functionality and more efficient batteries will allow applications to be developed that improve an individual's perception of the world. Already, the AI functionality of Merlin improves a birder's ability to identify a bird accurately. Linking this functionality to sensor devices like specialized glasses, binoculars, or listening devices will help an individual detect and classify objects in the environment.
In conclusion, computing technology is advancing at a rapid rate, and soon autonomous sensors placed strategically in the environment will augment the species occurrence data gathered by humans. The mobile phone in everyone's pocket should be thought of strategically as a way to connect people to the environment and improve their ability to gather meaningful biodiversity information.
Abstract | |
Reliable plant species identification from seeds is intrinsically difficult due to the scarcity of features, and because it requires specialized expertise that is becoming increasingly rare as the number of field plant taxonomists diminishes (Bacher 2012, Haas and Häuser 2005). On the other hand, seed identification is relevant in several science domains, such as plant community ecology, archaeology and paleoclimatology. In addition, economic activities such as agriculture require seed identification to assess the weed species contained in "soil seed banks" (Colbach 2014), enabling targeted treatments before they become a problem.
In this work, we explore and evaluate several approaches by using different training image sets with various requisites and assessing their performance with test datasets from different sources.
The core training dataset is provided by the Anthos project (Castroviejo et al. 2017) as a subset of its image collection. It consists of nearly 1,000 images of seeds identified by experts.
As the identification algorithm, we will use state-of-the-art convolutional neural networks for image classification (He et al. 2016). The framework is fully written in Python using the TensorFlow (Abadi et al. 2016) module for deep learning.
Abstract | |
Automated identification of plants and animals has improved considerably in the last few years, in particular thanks to recent advances in deep learning. In order to evaluate the performance of automated plant identification technologies in a sustainable and repeatable way, a dedicated system-oriented benchmark was set up in 2011 in the context of ImageCLEF (Goëau et al. 2011). Each year since then, several research groups have participated in this large collaborative evaluation by benchmarking their image-based plant identification systems. In 2014, the LifeCLEF research platform (Joly et al. 2014) was created in the continuity of this effort, so as to enlarge the evaluated challenges by considering birds and fishes in addition to plants, and audio and video content in addition to images.
The 2017 edition of the LifeCLEF plant identification challenge (Joly et al. 2017) is an important milestone towards automated plant identification systems working at the scale of continental floras, with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Such ambitious systems are now enabled by the conjunction of dazzling recent progress in image classification with deep learning and several outstanding international initiatives aggregating visual knowledge on plant species from the main national botanical institutes. The PlantCLEF plant challenge that we propose to present at this workshop aimed at evaluating to what extent a large, noisy training dataset collected through the web (and thus containing many labelling errors) can compete with a smaller but trusted training dataset checked by experts. To compare both training strategies fairly, the test dataset was created from a third data source, the Pl@ntNet (Joly et al. 2015) mobile application, which collects millions of plant image queries all over the world.
Given the good results obtained in the 2017 edition of the LifeCLEF plant identification challenge, the next big question is how far such automated systems are from human expertise. Indeed, even the best experts are sometimes confused and/or disagree with each other when validating images of living organisms. A multimedia record actually contains only partial information, usually not sufficient to determine the right species with certainty. Quantifying this uncertainty and comparing it to the performance of automated systems is of high interest for both computer scientists and expert naturalists. This work reports an experimental study following this idea in the plant domain. In total, 9 deep-learning systems implemented by 3 different research teams were evaluated against 9 expert botanists of the French flora. The main outcome of this work is that the performance of state-of-the-art deep learning models is now close to that of the most advanced human expertise. This shows that automated plant identification systems are now mature enough for several routine tasks, and can offer very promising tools for autonomous ecological surveillance systems.
Abstract | |
The fast and accurate identification of forest species is critical to support their sustainable management, to combat illegal logging, and ultimately to conserve them. Traditionally, the anatomical identification of forest species is a manual process that requires a human expert with a high level of knowledge to observe and differentiate certain anatomical structures present in a wood sample (Wiedenhoeft 2011).
In recent years, deep learning techniques have drastically improved the state of the art in many areas such as speech recognition, visual object recognition, and image and music information retrieval, among others (LeCun et al. 2015). In the context of the automatic identification of plants, these techniques have recently been applied with great success (Carranza-Rojas et al. 2017), and mobile apps such as Pl@ntNet have even been developed to identify a species from images captured on the fly (Joly et al. 2014). In contrast to conventional machine learning techniques, deep learning techniques extract and learn the relevant features from large datasets by themselves.
One of the main limitations on the application of deep learning techniques to forest species identification is the lack of comprehensive datasets for training and testing convolutional neural network (CNN) models. For this work, we used a dataset developed at the Federal University of Parana (UFPR) in Curitiba, Brazil, that comprises 2,939 uncompressed JPG images with a resolution of 3,264 x 2,448 pixels. It includes 41 different forest species of the Brazilian flora that were cataloged by the Laboratory of Wood Anatomy at UFPR (Paula Filho et al. 2014). Due to the lack of comprehensive datasets worldwide, this has become a benchmark dataset in previous research (Paula Filho et al. 2014, Hafemann et al. 2014).
In this work, we propose and demonstrate the power of deep CNNs to identify forest species based on macroscopic images. We use a pre-trained model built from the ResNet50 architecture with weights pre-trained on ImageNet. We apply fine-tuning by first truncating the top (softmax) layer of the pre-trained network and replacing it with a new softmax layer. We then retrain the model with the dataset of macroscopic images of species of the Brazilian flora used by Hafemann et al. (2014) and Paula Filho et al. (2014).
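This fine-tuning step can be sketched as follows, assuming TensorFlow/Keras; the input size and hyperparameters are illustrative.

    # A minimal sketch of the fine-tuning step, assuming TensorFlow/Keras;
    # input size and hyperparameters are illustrative.
    import tensorflow as tf

    NUM_SPECIES = 41  # forest species in the UFPR dataset

    # ResNet50 backbone with ImageNet weights; include_top=False drops the
    # original 1000-class softmax layer.
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))

    # New softmax head for the 41 target classes.
    outputs = tf.keras.layers.Dense(NUM_SPECIES, activation="softmax")(backbone.output)
    model = tf.keras.Model(backbone.input, outputs)

    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=...)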
Using the proposed model, we achieve a top-1 accuracy of 98%, which is better than the 95.77% reported by Hafemann et al. (2014) using the same dataset. In addition, our result is slightly better than the 97.77% reported by Paula Filho et al. (2014), which was obtained by combining several conventional computer vision techniques.
Abstract | |
Costa Rica is one of the countries with the highest density of species biodiversity in the world. More than 2,000 tree species have already been identified, many of which are used in the building, furniture, and packaging industries (Grayum et al. 2003). This rich diversity makes the correct identification of tree species very difficult. As a result, it is common to see species commercialized in the national market under mistaken identifications, which makes quality control particularly challenging. In addition, because 90 timber tree species have been classified as "threatened" in Costa Rica, correct identifications are indispensable for law enforcement.
The traditional system of tree species identification is based on macro- and microscopic evaluation of the anatomy of the wood. It entails assessing anatomical features such as patterns of vessels, parenchymas, and fibers. Typically, 7.7 x 10 cm wood cuts are used to identify the tree species (Pan and Kudo 2011, Yusof et al. 2013). However, assessing these features is extremely difficult for taxonomists because the properties of the wood can vary considerably due to environmental conditions and intra-specific genetic variability.
Deep learning techniques have recently been used to identify plant species (Carranza-Rojas et al. 2017a, Carranza-Rojas et al. 2017b) and are potentially useful for detecting subtle differences in patterns of vessels, parenchyma, and other anatomical features of wood. However, it is necessary to have a large collection of macroscopic photographs of individuals from various parts of the country (Pan and Kudo 2011). As a first step in the application of deep learning techniques, we have defined a formal, standard protocol for collecting wood samples, physically processing them, taking pictures, performing data augmentation, and using metadata to provide the primary data necessary for deep learning applications. Unlike traditional xylotheque sampling methods that destroy trees or use wood from fallen trees, we propose a method that extracts small samples of sufficient quality for anatomical characterization without affecting the growth and survival of the individual.
This study has been developed in three permanent forest plots in Costa Rica, all of which are sites with historical growth data over the last 20 years. We have so far evaluated 40 species (10 individuals per species) with diameters greater than 20 cm. From each individual, a cylindrical sample 12 mm in diameter and 7.5 cm in length was extracted with a cordless drill. Each sample is then cut into five 8 x 8 x 8 mm cubes and further processed to produce curated xylotheque samples, a dataset with all relevant metadata and original images, and a dataset with images obtained by performing data augmentation on the original images.
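The data augmentation step might be sketched as follows with TensorFlow/Keras preprocessing layers; the transformations shown are illustrative, not the exact protocol.

    # A minimal sketch of the data augmentation step, using TensorFlow/Keras
    # preprocessing layers; the transformations chosen are illustrative.
    import tensorflow as tf

    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # wood cubes have no canonical orientation
        tf.keras.layers.RandomRotation(0.25),
        tf.keras.layers.RandomZoom(0.1),
        tf.keras.layers.RandomContrast(0.2),
    ])
    # augmented = augment(image_batch, training=True)  # (N, H, W, 3) float tensor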
Abstract | |
As a child, I loved exhibits at the museum. As an adult conservation | |
biologist, entering the back rooms of the museum to view the collections | |
is even more remarkable. I have begun to realise the scope of what might | |
be held in museum collections, and to consider what these specimens, | |
artefacts, taonga (treasure) might tell us. Using examples from my work | |
on insects, birds and kahukurii (dogskin cloaks), and analyses from | |
morphometrics to isotopes, I will show how sampling from museum | |
collections can add layers of richness and complexity to research, with | |
the added dimensions of space, time, and connection to communities. | |
Finally, I'll discuss some of the ethics and understandings that guide | |
my work with museum collections, and what it means to be part of | |
collaborative partnerships of discovery with museum curators and | |
communities. | |
Abstract | |
Natural history collections are essential for understanding the world's biodiversity and drive research in taxonomy, systematics, ecology and biosecurity. One of the biggest challenges faced is the decline in new taxonomists and in public interest in collections-based research, which is alarming considering that an estimated 70% of the world's species are yet to be formally described.
Science communication combines public relations with the dissemination | |
of scientific knowledge and offers many benefits to promoting natural | |
history collections to a wide audience. For example, social media has | |
revolutionised the way collections and their staff communicate with the | |
public in real time, and can attract more visitors to collection | |
exhibits and new students interested in natural history. Although not | |
everyone is born a natural science communicator, institutions can | |
encourage and provide training for their staff to become engaging | |
spokespeople skilled in social media and public speaking, including | |
television, radio and/or print media. By embracing science | |
communication, natural history collections can influence their target | |
audiences in a positive and meaningful way, raise the profile of their | |
institution, encourage respect for biodiversity, promote their events | |
and research outputs, seek philanthropic donations, connect with other | |
researchers or industry leaders, and most importantly, inspire the next | |
generation of natural historians. | |
Abstract | |
Since 2010, the Canterbury region on the eastern coast of New Zealand's South Island has experienced more than 14,000 earthquakes. This presentation begins by considering the immediate impact of these seismic events on Canterbury Museum: how were its buildings, its collections, its team and its communities affected? Within the first weeks and months, what processes were put in place to manage the collections, and to what extent was the Museum's team able to undertake work to ensure the institution remained relevant during a national disaster? Almost eight years after the first major earthquake, this presentation reflects on some of the lessons learnt about the realities of planning for, and responding to, disaster, and on the impact of a continuing series of earthquakes on the concept of 'business as usual'.
Abstract | |
Taxonomic work is slow and time-consuming. Alarm bells have rung for years about the need to go faster, the need to attract and train new taxonomic workers, and the need to convince other branches of science that taxonomic work is vital. Morphological taxonomy is either being overrun or augmented -- depending on your perspective -- by genomics, artificial intelligence, new imaging methods and species-related data from other branches of science.
Ecology is one such branch of science, where defining, documenting and managing information about species traits has emerged as one of the most significant problems in the discipline. Traits have been recorded for aeons, but the resulting data have largely been insulated within cliques. How do we integrate these data and make them available in a form that will help to address significant issues about our environment? The 'speed bumps' on the route to a useful solution may be more social than technical.
Cross-disciplinary collaboration is required to address the big | |
questions in biodiversity research today, and it will need to extend | |
beyond taxonomy and ecology to other disciplines, such as pharmacology | |
and material science. As Harry Truman said, and John LaSalle often | |
quoted, "It is amazing what you can accomplish if you do not care who | |
gets the credit". | |
We are challenged to understand and answer the key questions about the | |
world on which we all depend. What are the challenges and the | |
opportunities to accelerate biodiversity discovery and documentation? | |
Abstract | |
Standards set up by Biodiversity Information Standards (TDWG, formerly the Taxonomic Databases Working Group), initially developed as a way to share taxonomic data, greatly facilitated the establishment of the Global Biodiversity Information Facility (GBIF) as the largest index to digitally accessible primary biodiversity information records (PBR) held by many institutions around the world. The level of detail and coverage of the body of standards that later became the Darwin Core terms has enabled increasingly precise retrieval of relevant records, useful for increasing digitally accessible knowledge (DAK), which, in turn, may have helped to answer ecologically relevant questions.
After more than a decade of data accrual and release, an increasing number of papers and reports cite GBIF either as a source of data or as a pointer to the original datasets. GBIF has curated a list of over 5,000 citations, which were examined for content and tagged with additional keywords describing that content. The list now provides a window on what users want to accomplish using such DAK.
We performed a preliminary word-frequency analysis of this literature, which refers to GBIF as a resource, starting with titles. Through standardization and mapping of terms, we examined how the facility-enabled data seem to have been used by scientists and other practitioners through time: which concepts/issues are pervasive, which taxon groups are most often addressed, and whether data concentrate around specific geographical or biogeographical regions. We hoped to cast light on which types of ecological problems the community believes are amenable to study through the judicious use of this data commons, and found that, indeed, a few themes were distinctly more frequently mentioned than others. Among those, generally perceived issues such as climate change and its effect on biodiversity at global and regional scales seemed prevalent. Taxonomic groups were also unevenly mentioned, with birds and plants the most frequently named. However, the list of potential subjects that might have used GBIF-enabled data is now quite wide, showing that the availability of well-structured data has spawned a widening spectrum of possible use cases. Among them, some enjoy early and continuous presence (e.g. species, biodiversity, climate), while others started to show up only later, once a critical mass of data seemed to have been attained (e.g. ecosystems, suitability, endemism). Biodiversity information in the form of standards-compliant DAK may thus already have become a commodity enabling insight into an increasingly complex and diverse body of science. Paraphrasing Tennyson, more things were wrought by data than TDWG dreamt of.
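The kind of preliminary word-frequency analysis described above can be sketched in a few lines of Python; the titles and stop-word list below are hypothetical placeholders.

    # A minimal sketch of a word-frequency analysis over citation titles;
    # the titles and the stop-word list are hypothetical placeholders.
    import re
    from collections import Counter

    titles = [
        "Climate change and the distribution of montane birds",
        "Modelling habitat suitability for endemic plants",
    ]  # hypothetical examples

    stopwords = {"the", "and", "of", "for", "a", "in"}
    words = (w for t in titles for w in re.findall(r"[a-z]+", t.lower()))
    print(Counter(w for w in words if w not in stopwords).most_common(10))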
Abstract | |
Agile, interconnected and diverse communities of practice can serve as a hedge against an uncertain world. We currently live in an era of populist
politics and diminishing government funding, challenging our collective | |
optimism for the future. However, the communities we build and | |
contribute to can be prepared and strengthened to address the challenges | |
ahead. How we choose to operate in this world of less funding is tied to | |
the collective impacts we all believe we can achieve by working | |
together. How we choose to work together and structure our communities | |
matters. | |
Abstract | |
Taxidermy made for display is often considered less significant in museum research collections. This is because historical taxidermy material often becomes disassociated from key data and, through the rigours of public display, ends up in poor physical condition.
However, by tracing a specimen's biography as a living animal and following its transition into a museum afterlife, much can be revealed about the development of natural history collections and changing attitudes towards animals.
This presentation will investigate several pieces of taxidermy in the | |
zoology collection of the Tasmanian Museum and Art Gallery (TMAG) | |
(http://www.tmag.tas.gov.au/collections\_and\_research/zoology/collections), | |
where research has uncovered surprising stories and helped reassess the | |
significance and cultural value of this material. | |
An unregistered lion head, identified as animal celebrity John Burns, | |
tells the story of the golden age of Australian and New Zealand | |
circuses, changing attitudes around animal ethics in the circus and the | |
negotiations between scientific institutions in acquiring exotic
species in the late nineteenth century.
A collection of taxidermied domestic chickens from the 1940s is found to | |
mark the modernisation of the TMAG public displays in communicating | |
current research and the development of a dedicated museum education | |
unit. | |
The colourful afterlife of these specimens in the museum collection
highlights struggles with storage, changes in collecting priorities,
and the evolution of public display and education at TMAG.
Abstract | |
In France, a national information system on water withdrawals called | |
Banque Nationale des Prélèvements en Eau (BNPE) has been set up to | |
comply with the Water Framework Directive (WFD) and national Law on | |
Water and Aquatic Environments. The aims are to centralize information | |
on the volume of water withdrawals and to share it on the website | |
www.bnpe.eaufrance.fr, where data can both be viewed and exported | |
without restriction. BNPE shares data in a form that can be used for | |
water management studies, scientific research, or to assess impacts on | |
aquatic habitats. | |
THE BNPE PROJECT SCOPE | |
The BNPE is a part of the French Water Information System (SIE), set up | |
to share public data on water and aquatic environments*1. The BNPE
project is managed by the French Biodiversity Agency (AFB) and the | |
Adour-Garonne Water Agency, and is supervised by the French Ministry in | |
charge of Environment. Database and related tools were developed with | |
the French Geological Survey (BRGM). | |
To achieve its goals, the project mainly reuses information from Water | |
Agencies, based on taxes collected using the 'taker-payer' principle:
persons who take water from the natural environment have to pay. Data on | |
water withdrawals disseminated by BNPE can now be reused by land
managers, decision-makers and researchers thanks to a single point of
access to these data for the whole of France (metropolitan and
overseas). These data
are: | |
Detailed data on each withdrawal: volume of water withdrawn (m^3^),
geographic coordinates of the water pump, water uses (e.g. energy, | |
irrigation, drinking water supply, industries), type of water | |
(groundwater, surface water: river, lake or estuary), | |
Aggregated data: synthesis is available by year, geography, use or type | |
of water. | |
In 2018, BNPE shared data from 2008 to 2016. | |
CHALLENGES OF CENTRALIZATION AND REUSE OF DATA: FEEDBACK FROM THE
PROJECT | |
The BNPE project faced the challenges of centralization and reuse of | |
data at a national level by making the data available to everyone. The | |
reuse of data derived from taxes due to environmental issues is not | |
easy, even in an open data context. We identified two main issues: | |
The data standardization issue | |
The stakeholders of the project set up a dictionary*2 to define common
repositories and a data exchange format. This work was done in
collaboration with the Sandre*3, the French National Service for Water
Data and Common Repositories Management. However, the definition of the
standard is too broad and producers encounter issues in standardizing
their data. This project shows us the need to define a limited core of | |
data concepts to share, which are very well defined and cannot be | |
misinterpreted. The BNPE experience also highlights the importance of
using concepts that already exist in producers' information systems.
Centralization and enrichment of datasets are two separate steps that
need to be differentiated for a project to succeed.
The challenge of reusing data | |
The project is confronting issues related to assembling a relevant
dataset of water withdrawals. Data from taxes paid by water takers lack
key environmental information, which limits their use for environmental
studies. For example, only 50% of water withdrawn is linked to a
specific river, lake or groundwater source. Moreover, because current | |
water use datasets are derived from taxes on withdrawals greater than | |
7000 m^3^ per year, the data are missing for some withdrawals. AFB is | |
studying additional data sources to complete the dataset (e.g., local | |
authorities, crowdsourcing, spatial joining). | |
Abstract | |
The European Search Catalogue for Plant Genetic Resources, EURISCO, | |
provides information about more than 1.9 million accessions of crop | |
plants and their wild relatives, preserved ex situ by almost 400 | |
institutes in Europe and beyond (Weise et al. 2017). EURISCO, which is | |
being maintained on behalf of the European Cooperative Programme for | |
Plant Genetic Resources, is based on a network of National Inventories | |
of 43 member countries. It represents an important effort for the | |
preservation of the world's agrobiological diversity by providing | |
information about the large genetic diversity kept by the collaborating | |
institutions. | |
Besides the classical passport data, EURISCO began in 2016 to
collect phenotypic data about the documented germplasm
accessions. The selection of genebank material for both research and | |
breeding purposes is increasingly carried out through the selection of | |
specific phenotypic values, e.g. flowering time or plant height. Thus, | |
these data are of high importance to users of plant genetic resources | |
(PGR) since they determine the value of the respective germplasm. | |
However, because no commonly agreed standards exist within the genebank
community, this kind of data is very difficult to handle. The
challenges range from synonymous or homonymous descriptor names,
through differing rating scales, to varying or insufficient metadata,
hampering both the integration and the cross-experiment comparison of
data.
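A minimal sketch of the descriptor harmonization such integration calls for (Python; the synonym table and the naive linear rescaling are illustrative assumptions, not EURISCO's actual rules):

    # Hypothetical synonym table mapping descriptor names to a canonical form.
    DESCRIPTOR_SYNONYMS = {
        "flowering date": "flowering time",
        "anthesis": "flowering time",
        "height": "plant height",
    }

    def harmonize(descriptor, value, scale_max, target_max=9):
        """Canonicalize a descriptor name and rescale its score naively."""
        name = DESCRIPTOR_SYNONYMS.get(descriptor.lower(), descriptor.lower())
        return name, value * target_max / scale_max

    print(harmonize("Anthesis", 3, scale_max=5))  # ('flowering time', 5.4)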
The presentation will illustrate the approach followed within EURISCO,
together with the challenges arising from it. Intended as a basis for a
discussion about the utilization of this kind of data, the presentation
should be regarded as a call for cooperation.
Abstract | |
Trait data in biology can be extracted from text and structured for
reuse within and across taxa. For example, body length is one trait
applicable to many species, and "body length is about 170 cm" is one
trait data point for the human species. Trait data can be used in more
detailed analyses of species evolution and developmental processes, so
they have begun to be valued by more than taxonomists. The EOL
(Encyclopedia of Life) TraitBank provides an example of a trait
database.
Current trait databases are in their infancy. Most are based on
morphological data such as shape, color, and structural and sexual
characteristics. Yet other data, such as behavioral and biological
characteristics, could be included in trait databases in the same way.
To build a trait database we constructed a controlled vocabulary to
record the states of various terms. These terms tend to exhibit common
characteristics:
They can be grouped as conceptual (subject) and descriptive (delimiter)
terms. For example, in "the shoulder height is 65--70 cm", "shoulder
height" is the conceptual term and "65--70 cm" is the descriptive
term.
Conceptual terms may be part of an interdependent hierarchical
structure. Examples in morphology, physiology, and conservation or
protection status demonstrate how parts or systems may be broken into
smaller measurable (quantifiable) or enumerable pieces.
Descriptive terms modify or delimit the parameters of conceptual terms.
These may be numerical, with distinguishing units or counts, or
enumerable, expressed with adjectives or special nouns.
Although controlled vocabularies about animals are complex, they can be
normalized using the RDF (Resource Description Framework) and OWL (Web
Ontology Language) standards.
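As a rough illustration, a conceptual term and its descriptive term might be expressed as an RDF triple as follows (Python with rdflib; the namespace and property names are illustrative assumptions, not the project's actual vocabulary):

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/traits/")  # hypothetical namespace

    g = Graph()
    taxon = URIRef(EX["CanisLupusFamiliaris"])
    # Conceptual term (shoulder height) linked to its descriptive term.
    g.add((taxon, EX.shoulderHeight, Literal("65-70 cm")))

    print(g.serialize(format="turtle"))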
Next, we extract traits from two main types of existing descriptions:
tabular data, which is more easily digested by machines, and
descriptive text, which is complex.
Pure text often needs to be extracted manually or by natural language
processing (NLP); sometimes machine learning methods can be used.
Moreover, different human languages may demand different
extraction methods.
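A minimal sketch of rule-based extraction for simple measurement statements (Python; the pattern and unit list are toy assumptions, and real descriptive text demands far more robust NLP):

    import re

    # Toy pattern: "<conceptual term> is [about] <value or range> <unit>"
    PATTERN = re.compile(
        r"(?P<term>[\w ]+?) is (?:about )?(?P<value>[\d.]+(?:-[\d.]+)?) ?(?P<unit>cm|mm|m|kg|g)")

    text = "The shoulder height is 65-70 cm. The body mass is about 30 kg."
    for m in PATTERN.finditer(text):
        print(m.group("term").strip(), "=", m.group("value"), m.group("unit"))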
Because the number of recordable traits exceeds current collection
records, the database structure should be optimized for retrieval speed.
For this reason, key-value databases are more suitable than relational
databases for storing trait data. EOL uses Virtuoso, a non-relational
database, for TraitBank.
Using existing mature ontology tools and standards, we can construct
a preliminary workflow for animal trait data, but some tools and
specifications for data analysis and use must await further data
accumulation.
Abstract | |
The South African National Biodiversity Institute (SANBI) has initiated | |
the development of the National Biodiversity Information System to | |
provide access to integrated South African biodiversity information. The | |
aim of the project is to centrally manage all biodiversity information | |
to support researchers, conservationists, policy and decision-makers in | |
achieving their goals, support planners in making sensible decisions, | |
and help SANBI understand the anthropogenic impact on biodiversity. The | |
project is set to deliver a centralised web-based infrastructure to | |
capture, aggregate, manage, discover, analyse and visualise biodiversity | |
data and associated information through a suite of tools and spatial | |
layers. The infrastructure is a Microsoft technology stack with a
microservices component architecture
(http://microservices.io/patterns/microservices.html), which builds the
application out of small collaborating services that integrate into the
enterprise system.
SANBI conducted a review of the data holdings of the individual herbaria | |
and museums in South Africa. The intention is to have a federated
approach to data management, exposing what is available as a collection
while ensuring that each individual natural science collection has full
ownership and management control over its data, within a defined
framework and governed by internationally accepted data policies and
standards. The presentation highlights the opportunities and unexpected
difficulties with developing a national botanical and zoological | |
collections data management service in South Africa. | |
Abstract | |
The long-term lifecycle management of natural history data requires | |
careful planning. Elements that have a significant impact on this | |
planning include data quality, domain-specific requirements, and data | |
interoperability. Standards like Darwin Core (Wieczorek et al. 2012) are
built to be flexible, allowing institutions to share data quickly | |
without extensive modification of internal information management | |
processes. However, there is often limited consensus on the exact | |
meanings and use of key terms by various domains. If we want to increase | |
the quality, interoperability, and long-term health of collections data, | |
we must reassess how we record specimen data, paying special attention | |
to the terms we use and how we use them. | |
Here we share results from efforts to evaluate current data sharing | |
practices for data from paleontology collections. By analysing the use | |
of terms in Darwin Core, we are constructing a framework for how | |
paleontological data is shared, how terms are used across many | |
institutions, and where there are inconsistencies or lack of terms to | |
support a fully robust record. We have also used data quality assessment
and validation tools developed by organizations like the Global
Biodiversity Information Facility (GBIF) to test term-specific
requirements, addressing quality at a more global scale than any
locally driven data quality assessment might.
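A minimal sketch of the sort of term-usage tally behind such an assessment (Python with pandas; the file name and the records it holds are illustrative assumptions):

    import pandas as pd

    # Hypothetical Darwin Core occurrence export from one collection.
    df = pd.read_csv("occurrences.txt", sep="\t", dtype=str)

    # Share of records populating each Darwin Core term.
    fill_rates = df.notna().mean().sort_values()
    print(fill_rates)  # sparsely filled terms hint at gaps or inconsistent use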
These assessments can guide the development of a new framework for | |
sharing paleontological data, enabling the community to collaborate and | |
find solutions to increase quality and interoperability. Additionally, | |
individual institutions can utilize the framework to enhance long-term | |
care of digital assets with global participation in mind. | |
Abstract | |
Since the Nagoya Protocol on Access to Genetic Resources and Benefit
Sharing (ABS) came into force in 2014, the conservation and safeguarding
of national biodiversity have been internationally stressed. The
Government of South Korea is making significant efforts to integrate and
manage the information pertaining to biological resources in line with
this
global trend. However, connecting and sharing biodiversity data has | |
certain challenges because the existing databases and information | |
systems are being operated using different standards. | |
In the present study, we established an integrated management system for | |
freshwater biodiversity information, the Freshwater Biodiversity | |
Platform (FBP), to support the conservation and sustainable use of | |
biodiversity. This platform allows the management of various types of | |
biodiversity data, such as occurrences, habitats and genetics, for | |
freshwater species inhabiting South Korea. The data fields are based on | |
a global biodiversity data standard, Darwin Core, and national | |
biodiversity standards of South Korea in order to share our data more | |
efficiently, both nationally and internationally. It is important to | |
note that the platform deals with information related to the utilization | |
of biological resources as well as information representing the national | |
biodiversity. We have collected bibliographical data, such as papers and | |
patents, from databases, including information on the use of biological | |
resources. The data have been refined by applying a national species | |
list of South Korea and ontology terms from the Medical Subject Headings
(MeSH) to compile valuable
information for biological industries. Furthermore, our platform is open
source and is compatible with multiple language packs to facilitate the | |
availability of biodiversity data for other countries and institutions. | |
Currently, the Freshwater Biodiversity Platform is being used to collect | |
and standardize various types of existing freshwater biodiversity data | |
to build foundations for data management. Based on these data, we will | |
improve the platform by adding new systems that can analyze and release | |
data for public access. This platform will provide integrated | |
information on freshwater species from the Korean Peninsula to the world | |
and contribute to the conservation and sustainable use of biological | |
resources. | |
Abstract | |
Freshwater biodiversity is critically understudied in Rwanda, and to | |
date there has not been an efficient mechanism to integrate freshwater | |
biodiversity information or make it accessible to decision-makers, | |
researchers, private sector or communities, where it is needed for | |
planning, management and the implementation of the National Biodiversity | |
Strategy and Action Plan (NBSAP). A framework to capture and distribute | |
freshwater biodiversity data is crucial to understanding how economic | |
transformation and environmental change are affecting freshwater
biodiversity and the resulting ecosystem services. To optimize conservation
efforts for freshwater ecosystems, detailed information is needed | |
regarding current and historical species distributions and abundances | |
across the landscape. From these data, specific conservation concerns | |
can be identified, analyzed and prioritized. | |
The purpose of this project is to establish and implement a long-term | |
strategy for freshwater biodiversity data mobilization, sharing, | |
processing and reporting in Rwanda. The expected outcome of the project | |
is to support the mandates of the Rwanda Environment Management | |
Authority (REMA), the national agency in charge of environmental | |
monitoring and the implementation of Rwanda's NBSAP, and the Center of | |
Excellence in Biodiversity and Natural Resources Management (CoEB). The | |
project also aligns with the mission of the Albertine Rift Conservation | |
Society (ARCOS) to enhance sustainable management of natural resources | |
in the Albertine rift region. Specifically, organizational structures,
technology platforms, and workflows for biodiversity data capture and
mobilization are being enhanced to promote data availability and
accessibility, improve Rwanda's NBSAP, and support other
decision-making processes. The project is enhancing the capacity of
technical staff from relevant government and non-government institutions | |
in biodiversity informatics, strengthening the capacity of CoEB to | |
achieve its mission as the Rwandan national biodiversity knowledge | |
management center. Twelve institutions have been identified as data | |
holders, and the digitization of these data using Darwin Core standards
is in progress, along with data cleaning for publication
through the ARCOS Biodiversity Information System
(http://arbmis.arcosnetwork.org/). The release of the first national | |
State of Freshwater Biodiversity Report is the next step. CoEB is a | |
registered publisher to the Global Biodiversity Information Facility | |
(GBIF) and holds an Integrated Publishing Toolkit (IPT) account on the | |
ARCOS portal. This project was developed for the African Biodiversity | |
Challenge, a competition coordinated by the South African National | |
Biodiversity Institute (SANBI) and funded by the JRS Biodiversity | |
Foundation which supports on-going efforts to enhance the biodiversity | |
information management activities of the GBIF Africa network. This | |
project also aligns with SANBI's Regional Engagement Strategy, and | |
endeavors to strengthen both emerging biodiversity informatics networks | |
and data management capacity on the continent in support of sustainable | |
development. | |
Abstract | |
As a national center for managing biological data, the Korean | |
Bioinformation Center (KOBIC) provides capabilities and resources to | |
manage and standardize the explosively growing amount of biological data | |
from national Research and Development grants by developing a systematic | |
and integrative approach. The biological data includes biological | |
material resource, genome, and biodiversity data, such as observation, | |
collection, taxonomy, character, and genome information of living | |
organisms. The Korean government enacted legislation on the
collection, management and utilization of biological data in 2009 and,
as a follow-up, KOBIC has undertaken the mission to collect and | |
integrate the scattered biological data in Korea. We first created a
biological data format for exchanging data between government agencies.
We then developed the Korean Bio-resource Information System (KOBIS), an
integrated information system for the efficient acquisition and
systematic management of biological data. KOBIS contains
more than 109,000 species and 12.1 million occurrence records from 107 | |
collaborating institutions from four ministries. KOBIS establishes a
catalog of scientific names by linking species information across
ministries. Its main function is integrated information search, whose
results include character information, bibliographic information,
electronic books, DNA classifications, gene information, photographic
images, and research achievements. We will continue to
focus our efforts on managing KOBIS to facilitate information sharing,
distribution, and services for mining biological data.
KOBIS is available at http://www.kobis.re.kr. | |
Abstract | |
Primary biodiversity data, or occurrence data, are being produced at an | |
increasing rate and are used in numerous studies (Hampton et al. 2013, | |
La Salle et al. 2016). This data avalanche is a remarkable opportunity | |
but it comes with hurdles. First, available software solutions are rare | |
for very large datasets and those solutions often require significant | |
computer skills (Gaiji et al. 2013), while most biologists are not | |
formally trained in bioinformatics (List et al. 2017). Second, large | |
datasets are heterogeneous because they come from different producers | |
and they can contain erroneous data (Gaiji et al. 2013). Hence, they | |
need to be curated. In this context, we developed a biodiversity | |
occurrence curator designed to quickly handle large amounts of data | |
through a simple interface: the Darwin Core Spatial Processor (DwCSP). | |
DwCSP does not require the installation or use of third-party software | |
and has a simple graphical user interface that requires no computer | |
knowledge. DwCSP allows for the data enrichment of biodiversity | |
occurrences and also ensures data quality through outlier detection. For | |
example, the software can enrich a tabulated occurrence file (Darwin
Core, for instance) with spatial data from polygon files (e.g., Esri
shapefiles) or raster files (GeoTIFF). The speed of the enrichment
procedures is ensured through multithreading and optimized spatial | |
access methods (R-Tree indexes). DwCSP can also detect and tag outliers | |
based on their geographic coordinates or environmental variables. The | |
first type of outlier detection uses a computed distance between the | |
occurrence and its nearest neighbors, whereas the second type uses a | |
Mahalanobis distance (Mahalanobis 1936). One hundred thousand | |
occurrences can be processed by DwCSP in less than 20 minutes and | |
another test on forty million occurrences was completed in a few days on | |
a recent personal computer. DwCSP has an English interface and
documentation, and will be available as a stand-alone Java Archive (JAR)
executable that runs on any computer with a Java environment
(version 1.8 onward).
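A minimal sketch of the environmental-outlier idea described above (Python with NumPy; DwCSP itself is Java, and these toy values are illustrative assumptions):

    import numpy as np

    # Rows: occurrences; columns: environmental variables (e.g. temperature, rainfall).
    env = np.array([[12.1, 800], [11.8, 760], [12.5, 820], [30.0, 100]])

    mean = env.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(env, rowvar=False))
    diff = env - mean
    # Mahalanobis distance of each occurrence from the multivariate mean.
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

    print(d)  # the last record stands out with the largest distance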
Abstract | |
Museum-preserved samples are attracting attention as a rich resource for | |
DNA studies. Museomics aims to link DNA sequence data back to the museum | |
collection. Molecular biologists are interested in morphological
information, including body size, pattern, and color, while sequence
data have also become essential for biodiversity research as evidence
for species identification and phylogenetic analysis.
For more than 30 years, molecular data, such as DNA and protein | |
sequences, have been captured by the DNA Data Bank of Japan (DDBJ), the | |
European Bioinformatics Institute (EBI, UK), and the National Center for | |
Biotechnology Information (NCBI, US) under the International Nucleotide | |
Sequence Database Collaboration (INSDC). INSDC provides collected | |
molecular data to researchers as public databases including GenBank for | |
DNA sequences and Gene Expression Omnibus (GEO) for gene expression. | |
These three institutes synchronize archived data and publish all data on | |
an FTP (File Transfer Protocol) site so that it is available for big | |
data analysis. | |
In recent years, high-throughput sequencing technology, also called | |
next-generation sequencing (NGS) technology, has been widely utilized | |
for molecular biology including genomics, transcriptomics, and | |
metagenomics. Biodiversity researchers also focus on NGS data for DNA | |
barcoding and phylogenetic analysis as well as molecular biology. | |
Additionally, a portable NGS platform, MinION (Oxford Nanopore | |
Technologies), has been launched, enabling biodiversity researchers to | |
perform DNA sequencing in the field. Along with GenBank and GEO data, | |
INSDC accepts NGS data and provides a public primary database, called | |
the Sequence Read Archive (SRA). As of March 2018, 6.4 petabases of NGS
data are freely available under more than 130,000 projects in SRA. The
Database Center for Life Science (DBCLS) provides a search engine for | |
public NGS data, called DBCLS SRA (http://sra.dbcls.jp/) in | |
collaboration with DDBJ. SRA contains not only raw sequence reads and
processed data mapped to genomes, but also information on the
experimental design, including project types, sequencing platforms, and | |
sample species. Researchers can use this data to refine their search | |
results. We also linked publications referring to NGS data to the | |
corresponding SRA entries. | |
The mission of DBCLS is to accelerate the accessibility of life science | |
data. Collected data used to be described in Excel-readable tabular
formats, but such formats are difficult to merge with other databases
because of the ambiguity of their labels. To overcome this difficulty, we
recently integrated life science data with Semantic Web technology. We | |
held annual meetings to integrate life science data, called | |
BioHackathons, in which researchers from all over the world | |
participated. UniProt and Ensembl databases currently provide an RDF | |
(Resource Description Framework) version of curated genome and protein | |
data, respectively. In the biodiversity domain, there are many databases | |
such as GBIF (The Global Biodiversity Information Facility) for species | |
occurrence records, EoL (The Encyclopedia of Life) as a knowledge base | |
of all species, and BoL (The Barcode of Life) for DNA barcoding data. | |
RDF is utilized to describe Darwin Core-based data so that
bioinformatics and biodiversity informatics researchers can technically
merge both types of data. Currently, however, specimen data and DNA
sequence data are not linked. Museomics starts with cross-referencing
specimen and sequence IDs and with making data sources comply with
existing standards.
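A minimal sketch of such a cross-reference expressed in RDF with Darwin Core terms (Python with rdflib; the identifiers are illustrative assumptions):

    from rdflib import Graph, Namespace, URIRef

    DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

    g = Graph()
    specimen = URIRef("http://example.org/specimen/DEMO-12345")  # hypothetical ID
    # Link the specimen record to its INSDC sequence accession.
    g.add((specimen, DWC.associatedSequences,
           URIRef("https://www.ncbi.nlm.nih.gov/nuccore/AB123456")))

    print(g.serialize(format="turtle"))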
Abstract | |
The Eastern Highlands of Zimbabwe is a biodiversity hotspot that forms | |
part of the Eastern Afromontane region, which has seen an increase in | |
human activities such as agriculture, illegal mining, and introduction | |
of invasive species. These anthropogenic activities have had negative | |
environmental consequences including land degradation and water | |
pollution, which have negatively impacted on the quality of aquatic | |
habitats and biodiversity in the region. The region harbours several | |
freshwater species of conservation interest whose numbers and | |
distribution are little known. We also do not know the impacts of the | |
ongoing human activities and threats on the local wetland biodiversity | |
and the integrity of the ecosystem in the region. The relevant data on
wetland biodiversity from previous studies and surveys are also not
readily available to guide policies and conservation efforts in this
region.
With the aid of the Biodiversity Information for Development (BID) | |
program sponsored by the Global Biodiversity Information Facility (GBIF) | |
and the European Union (EU), a project titled 'Freshwater Biodiversity
of the Eastern Highlands of Zimbabwe: Assessing Conservation Priorities
Using Primary Species-Occurrence Data' has mobilized and digitized over
2,000 occurrence records on freshwater biodiversity, with a focus on | |
fish, invertebrates, amphibians and bird species in the region, since | |
October 2017. The project also makes use of biodiversity informatics | |
tools such as ecological niche modelling, to identify the important | |
sites for conservation of the freshwater biodiversity in this region. | |
The outputs will help to show policy makers, wildlife managers, | |
researchers and conservationists where to target resources and | |
conservation efforts. This will also help protect the biodiversity that
still exists in the unprotected wetlands of the Eastern Highlands of
Zimbabwe and that could be lost to human activities such as clearing for
agriculture.
Abstract | |
Recognizing the abundance and accumulation of information and data on
biodiversity that remain poorly exploited and even unfunded, the
REBIOMA project (Madagascar Biodiversity Networking), in collaboration
with partners, has developed an online data portal to provide easy
access to critical information and data, supporting conservation
planning and the expansion of scientific and professional activity on
Madagascar's biodiversity.
The mission of the REBIOMA data portal is to serve quality-labeled, | |
up-to-date species occurrence data and environmental niche models for | |
Madagascar's flora and fauna, both marine and terrestrial. REBIOMA is a | |
project of the Wildlife Conservation Society Madagascar and the | |
University of California, Berkeley. | |
REBIOMA serves species occurrence data for marine and terrestrial | |
regions of Madagascar. Following upload, data is automatically validated | |
against a geographic mask and a taxonomic authority. Data providers can | |
decide whether their data will be public, private, or shared only with | |
selected collaborators. Data reviewers can add quality labels to | |
individual records, allowing selection of data for modeling and | |
conservation assessments according to quality. Portal users can query | |
data in numerous ways. | |
One of the key features of the REBIOMA web portal is its support for | |
species distribution models, created from taxonomically valid and | |
quality-reviewed occurrence data. Species distribution models are | |
produced for species for which there are at least eight reliably
reviewed, non-duplicate (per grid cell) records. Maximum Entropy
Modeling (MaxEnt for short) is used to produce continuous distribution | |
models from these occurrence records and environmental data for | |
different eras: past (1950), current (2000), and future (2080). The | |
result is generally interpreted as a prediction of habitat suitability. | |
Results for each model are available on the portal and ready for | |
download as ASCII and HTML files. | |
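A minimal sketch of the record-eligibility rule described above (Python; the grid resolution and the records are illustrative assumptions):

    from collections import defaultdict

    # (species, longitude, latitude) occurrences; toy data.
    records = [("Lemur catta", 46.51, -23.12), ("Lemur catta", 46.52, -23.11),
               ("Lemur catta", 46.51, -23.12)]  # third duplicates the first cell

    CELL = 0.01  # hypothetical grid resolution in degrees

    cells = defaultdict(set)
    for species, lon, lat in records:
        cells[species].add((round(lon / CELL), round(lat / CELL)))

    eligible = [sp for sp, c in cells.items() if len(c) >= 8]
    print(eligible)  # empty here: only 2 unique cells, fewer than the 8 required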
The REBIOMA Data Portal address is http://data.rebioma.net, or visit | |
http://www.rebioma.net for more general information about the entire | |
REBIOMA project. | |
Abstract | |
Herbaria in Taiwan face critical data challenges: | |
Different taxonomic views prevent data exchange; | |
There is a lack of development practices to keep up with standard and | |
technological advances; | |
Data are disconnected from researchers' perspectives, so it is difficult
to demonstrate the value of taxonomists' activities, even though a few
herbaria have their specimen catalogues partially exposed in Darwin Core.
In consultation with the Herbarium of the Taiwan Forestry Research | |
Institute (TAIF), the Herbarium of the National Taiwan University (TAI) | |
and the Herbarium of the Biodiversity Research Center, Academia Sinica | |
(HAST), which together host the most important collections of the
island's vegetation, we have planned the following activities to address
the data challenges:
Investigate a new data model for scientific names that will accommodate | |
different taxonomic views and create a web service for access to | |
taxonomic data; | |
Refactor existing herbarium systems to utilize the aforementioned | |
service so the three herbaria can share and maintain a standardized name | |
database; | |
Create a layer of Application Programming Interface (API) to allow | |
multiple types of accessing devices; | |
Conduct behavioral research regarding various personas engaged in the | |
curatorial workflow; | |
Create a unified front-end that supports data management, data | |
discovery, and data analysis activities with user experience | |
improvements. | |
To manage these developments at various levels, while maximizing the | |
contribution of participating parties, it is crucial to use a proven | |
methodological framework. As the creative industry has led the way in
solution development, the concept of design thinking and the design
thinking process (Brown and Katz 2009) came to our attention.
Design thinking is a systematic approach to handling problems and | |
generating new opportunities (Pal 2016). From requirement capture to | |
actual implementation, it helps consolidate ideas and identify agreed-on | |
key priorities by constantly iterating through a series of interactive | |
divergence and convergence steps, namely the following: | |
Empathize: A divergent step. We learn about our audience, which in this | |
case includes curators and visitors of the herbarium systems, about what | |
they do and how they interact with the system, and collate our findings. | |
Define: A convergent step. We construct a point of view based on | |
audience needs. | |
Ideate: A divergent step. We brainstorm and come up with creative | |
solutions, which might be novel or based on existing practice. | |
Prototype: A convergent step. We build representations of the chosen | |
idea from the previous step. | |
Test: Use the prototype to test whether the idea works. Then refine from
step 3 if the problems lay with the prototype, or even from step 1 if
the point of view needs to be revisited.
The benefits of adopting this process are:
Instead of "design for you", we "design together", which strengthens the | |
sense of community and helps the communication of what the revision and | |
refactoring will achieve; | |
When put in context, increased awareness and understanding of | |
biodiversity data standards, such as Darwin Core (DwC) and Access to | |
Biological Collections Data (ABCD); | |
As we lend the responsibility of process control to an external
facilitator, we are able to focus on each step as participants.
We illustrate how the planned activities are conducted through these
five iterative steps.
Abstract | |
GBIF Benin, hosted at the University of Abomey-Calavi, has published | |
more than 338,000 occurrence records in 87 datasets and checklists. It | |
has been a Global Biodiversity Information Facility (GBIF) node since | |
2004 and is a leader in several projects from the Biodiversity | |
Information for Development (BID) programme. | |
GBIF facilitates collaboration between nodes at different levels through | |
its Capacity Enhancement Support Programme (CESP,
https://www.gbif.org/programme/82219/capacity-enhancement-support-programme).
One of the actions included in the CESP guidelines is called 'Mentoring | |
activities'. Its main goal is the transfer of knowledge between partners | |
such as information, technologies, experience, and best practices. | |
Sharing architecture and development effort is a key way to address some
technical challenges or impediments (hosting, staff turnover, etc.) that
GBIF nodes may face. The Atlas of Living Australia (ALA) team developed
a feature called a 'data hub', which makes it possible to create a
standalone website with a dedicated occurrence search engine that covers
a defined range of data (e.g. a specific genus or geographic area).
In 2017, GBIF Benin and GBIF France wanted to strengthen their | |
partnership and started a CESP project. One of the core objectives of | |
this project is the creation of the Atlas of Living Benin using ALA | |
modules. GBIF France developers, with the help of the GBIF Benin team, | |
are in the process of configuring a data hub that will give access to | |
Beninese data only, while at the same time Atlas of Living France will | |
give access to French data only. Both data portals will use the same
back end and therefore the same databases. Benin is the first African GBIF
node to implement this kind of infrastructure. | |
On this poster, we will present the specific architecture of the Atlas
of Living Benin and how we have managed to distinguish data coming from
Benin from data coming from France.
Abstract | |
The existing web representation of the Flora of North America (FNA) | |
project needs improvement. Despite being electronically available, it | |
has little more functionality than its printed counterpart. Over the | |
past few years, our team has been working diligently to build a new,
more effective online presence for the FNA. The main objective is to
capitalize on modern Natural Language Processing (NLP) tools built for | |
biodiversity data (Explorer of Taxon Concepts or ETC; Cui et al. 2016), | |
and present the FNA online in both machine and human readable formats. | |
With machine-comprehensible data, the mobilization and usability of | |
flora treatments is enhanced and capabilities for data linkage to a | |
Biodiversity Knowledge Graph (Page 2016) are enabled. For example, | |
usability of treatments increases when morphological statements are | |
parsed into finely grained pieces of data using ETC, because these data | |
can be easily traversed across taxonomic groups to reveal trends. | |
Additionally, the development of new features in our online FNA is | |
facilitated by FNA data parsing and processing in ETC, including a | |
feature to enable users to explore all treatments and illustrations | |
generated by an author of interest. The current status of the ongoing | |
project to develop a Semantic MediaWiki (SMW) platform for the FNA is | |
presented here. New features recently implemented are introduced, | |
challenges in assembling the Semantic MediaWiki are discussed, and | |
future opportunities, which include the integration of additional floras | |
and data sources, are explored. Furthermore, the implications for the
standardization of taxonomic treatments that work such as this entails
will be discussed.
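For instance, a parsed morphological statement might be represented in fine-grained form roughly as follows (an illustrative sketch, not ETC's actual output schema):

    # Hypothetical fine-grained parse of "Leaves ovate, 3-5 cm long."
    parsed = [
        {"organ": "leaf", "character": "shape", "value": "ovate"},
        {"organ": "leaf", "character": "length", "value": "3-5", "unit": "cm"},
    ]

    # Such records can be queried across taxa, e.g. for all taxa with ovate leaves.
    print([p for p in parsed if p.get("value") == "ovate"])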
Abstract | |
In 2015, the global biodiversity information initiatives Biodiversity | |
Heritage Library (BHL), Barcode of Life Data systems (BoLD), Catalogue | |
of Life (CoL), Encyclopedia of Life (EOL), and the Global Biodiversity | |
Information Facility (GBIF) took the first step to work on the idea for | |
building a single shared authoritative nomenclature and taxonomic | |
foundation that could be used as a backbone to order and connect | |
biodiversity data across various domains. At present, the Catalogue of | |
Life is being used by BHL, BoLD, EOL, and GBIF, but each extends the CoL
with additional data to meet its specific backbone service requirements.
The goal of the CoL+ project is to innovate the CoL systems by | |
developing a new information technology infrastructure that includes | |
both the current Catalogue of Life and a provisional Catalogue of Life | |
(replacing the current GBIF backbone taxonomy), separates scientific | |
names and taxonomic concepts with associated unique identifiers, and | |
provides some (infrastructural) support for taxonomic and nomenclatural | |
content authorities to finish their work. The project's specific | |
objectives are to | |
establish a clearinghouse covering scientific names across all life; | |
provide a single taxonomic view grounded in the consensus classification | |
of the Catalogue of Life along with candidate taxonomic sources, show | |
differences between sources, and provide an avenue for feedback to | |
content authorities while allowing the broader community to contribute, | |
and | |
establish a partnership and governance, allowing a continuing commitment | |
after the project's end for a clearinghouse infrastructure and its | |
associated components, including a roadmap for future developments of | |
the infrastructure. | |
As a result of the project, we expect to have a shared information space
for names and taxonomy connecting the Catalogue of Life, nomenclator
content authorities (e.g. IPNI, ZooBank) and several global biodiversity
information initiatives.
Abstract | |
The 3i World Auchenorrhyncha database (http://dmitriev.speciesfile.org) | |
is being migrated into TaxonWorks (http://taxonworks.org) and comprises | |
nomenclatural data for all known Auchenorrhyncha taxa (leafhoppers, | |
planthoppers, treehoppers, cicadas, spittle bugs). Of all those | |
scientific names, 8,700 are unique genus-group names (which include | |
valid genera and subgenera as well as their synonyms). According to the | |
Rules of Zoological Nomenclature, a properly formed species-group name | |
when combined with a genus-group name must agree with the latter in | |
gender if the species-group name is or ends with a Latin or Latinized | |
adjective or participle. This poses a double challenge for researchers
describing new taxa or citing existing ones. For each species-group
name, knowledge of its part of speech is essential (nouns do not change
their form when associated with different generic names); for each
genus-group name, knowledge of its gender is essential.
Every time the species is transferred from one genus to another, its | |
ending may need to be transformed to make a proper new scientific name | |
(a binominal name). In modern practice, it is important, when
establishing a new name, to provide information about its etymology and
the way it should be used in future publications: the grammatical gender
for a genus, and the part of speech for a species. Older names often do
not provide enough information about their etymology for the proper
construction of scientific names. That is why, in the literature, we can
find numerous cases where a scientific name is not formed in conformity
with the Rules of Nomenclature. An attempt was
made to resolve the etymology of the generic names in Auchenorrhyncha to
made to resolve the etymology of the generic names in Auchenorrhyncha to | |
unify and clarify nomenclatural issues in this group of insects. In | |
TaxonWorks, the rules of nomenclature are defined using the NOMEN
ontology (https://github.com/SpeciesFileGroup/nomen).
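A minimal sketch of the ending transformation involved (Python; the suffix table is a drastic simplification of Latin grammar, far short of the rules the database actually encodes):

    # Toy adjectival endings by grammatical gender (first/second declension only).
    ENDINGS = {"masculine": "us", "feminine": "a", "neuter": "um"}

    def agree(epithet, genus_gender):
        """Re-form a Latin adjectival epithet to agree with the genus gender."""
        stem = epithet
        for suffix in ENDINGS.values():
            if epithet.endswith(suffix):
                stem = epithet[: -len(suffix)]
                break
        return stem + ENDINGS[genus_gender]

    # Transferring the epithet 'albus' into a feminine genus yields 'alba'.
    print(agree("albus", "feminine"))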
Abstract | |
Compilation and retrieval of reliable data on biological interactions is | |
one of the critical bottlenecks affecting efficiency and statistical | |
power in testing ecological theories. TaxonWorks, a web-based workbench, | |
can facilitate such research by enabling the digitization of complex | |
biological interactions involving multiple species, individuals, and | |
trophic levels. These data can be further organized into spatial and | |
temporal axes, and annotated at the level of individual or grouped | |
interactions (e.g. singularly citing the combined elements of a | |
tritrophic interaction). The simple, customizable nature of these tools
ultimately reduces the time-consuming steps of gathering, cleaning,
and formatting datasets for subsequent exploration and analysis while
also improving the asserted semantics. | |
An example use case is provided with a dataset of associations among | |
plants, pathogens and insect vectors. The curated data are accessed
through the JSON-serving TaxonWorks API (Application Programming
Interface) by an R package. Analysis and visualization of the network
graphs persisted in TaxonWorks is demonstrated using core R | |
functionality and the igraph package (Csardi and Nepusz 2006). | |
TaxonWorks is open-source, collaboratively built software available at | |
http://taxonworks.org. | |
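A rough sketch of the retrieve-and-analyse pattern (here in Python with requests and networkx standing in for the R/igraph workflow described above; the endpoint path, parameters, and field names are illustrative assumptions, not the documented TaxonWorks API):

    import networkx as nx
    import requests

    # Hypothetical endpoint returning biological associations as JSON.
    resp = requests.get("https://example.org/api/v1/biological_associations",
                        params={"project_token": "DEMO"})

    G = nx.DiGraph()
    for assoc in resp.json():
        # Each association links a subject taxon to an object taxon.
        G.add_edge(assoc["subject"], assoc["object"], kind=assoc.get("relationship"))

    print(G.number_of_nodes(), G.number_of_edges())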
Abstract | |
As part of the Biodiversity Information System on Nature and Landscapes | |
(Système d'Informations Nature et Paysages or SINP), the French
National Natural History Museum has been appointed by the French | |
ministry in charge of ecology to develop mechanisms for biodiversity | |
data exchange, especially taxon occurrences (there are also elements on | |
habitat occurrences, geo-heritage, etc.). Given that there are thousands | |
of different sources for datasets, containing over 42 million records, | |
such a development brings into question the underlying quality of data. | |
To add complexity, there can be several layers of quality assurance: one | |
by the producer of the data, one by a regional node, and another one by | |
the national node. | |
The approach to quality issues was addressed by a dedicated working | |
group, representative of biodiversity stakeholders in France. The | |
resulting documents focus on core methodology elements that characterize | |
a data quality process for, in the first instance, taxon occurrences | |
only. It may be extended to habitats, geology, etc. in the near future. | |
For scientific validation, two processes are used: | |
One automated process that uses expertise upstream (automated validation | |
based on previous databases created through the use of said expertise), | |
with several criteria such as comparison with a national taxonomic | |
reference database (TAXREF), and with species reference distributions. | |
The outcomes of this process will indicate error potential and can be | |
used to automatically flag data above a certain threshold for the | |
following process. | |
A second, manual process that allows for further scrutiny in order to
reach a conclusive evaluation.
The combination of both processes allows experts to focus on data that | |
has a higher likelihood of being erroneous, thus saving time and | |
resources. | |
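A minimal sketch of the automated flagging step (Python; the criteria, toy reference data, and threshold are illustrative assumptions, not the SINP's actual rules):

    # Hypothetical record and reference extracts.
    record = {"taxon": "Lutra lutra", "dept": "75"}
    TAXREF_NAMES = {"Lutra lutra", "Canis lupus"}       # toy TAXREF extract
    KNOWN_RANGE = {"Lutra lutra": {"29", "56", "64"}}   # departments with known presence

    score = 0
    if record["taxon"] not in TAXREF_NAMES:
        score += 2  # unknown name: strong error signal
    if record["dept"] not in KNOWN_RANGE.get(record["taxon"], set()):
        score += 1  # outside reference distribution: weaker signal

    if score > 1:  # arbitrary cutoff
        print("flag for manual validation")
    # Here score == 1, so the record would not be flagged automatically.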
One objective of the INPN (Inventaire National du Patrimoine Naturel, or | |
National Inventory of Natural Heritage), after one or both approaches, | |
is to have each record assigned a confidence level. | |
The poster will present the national scientific validation of data in
the SINP. It will show for whom and why it is done, whether the
expertise lies upstream (automated validation) or downstream (manual
validation through expert networks), what documents exist, and what
attributes have been proposed for addition to the national standards so
as to convey the information derived from these processes.
Abstract | |
Web portals are commonly used to expose and share scientific data. They | |
enable end users to find, organize and obtain data relevant to their | |
interests. With the continuous growth of data across all science | |
domains, researchers commonly find themselves overwhelmed as finding, | |
retrieving and making sense of data becomes increasingly difficult. | |
Search engines can help find relevant websites, but the short summaries
they provide in results lists often say little about how relevant a
website actually is to a given research interest.
To yield better results, a strategy adopted by Google, Yahoo, Yandex and | |
Bing involves consuming structured content that they extract from | |
websites. Towards this end, the schema.org collaborative community | |
defines vocabularies covering common entities and relationships (e.g., | |
events, organizations, creative works) (Guha et al. 2016). Websites can | |
leverage these vocabularies to embed semantic annotations within web | |
pages, in the form of markup using standard formats. Search engines, in | |
turn, exploit semantic markup to enhance the ranking of most relevant | |
resources while providing more informative and accurate summarization. | |
Additionally, adding such rich metadata is a step forward to make data | |
FAIR, i.e. Findable, Accessible, Interoperable and Reusable. | |
Although schema.org encompasses terms related to data repositories, | |
datasets, citations, events, etc., it lacks specialized terms for | |
modeling research entities. The Bioschemas community (Garcia et al. | |
2017) aims to extend schema.org to support markup for Life Sciences | |
websites. A major pillar lies in reusing types from schema.org as well | |
as well-adopted domain ontologies, while only proposing a limited set of | |
new types. The goal is to enable semantic cross-linking between | |
knowledge graphs extracted from marked-up websites. An overview of the | |
main types is presented in Fig. 1. Bioschemas also provides profiles | |
that specify how to describe an entity of some type. For instance, the | |
protein profile requires a unique identifier, recommends to list | |
transcribed genes and associated diseases, and points to recommended | |
terms from the Protein Ontology and Semantic Science Integrated | |
Ontology. | |
The success of schema.org lies in its simplicity and the support by | |
major search engines. By extending schema.org, Bioschemas enables life | |
sciences research communities to benefit from a lightweight semantic | |
layer on websites and thus facilitates discoverability and | |
interoperability across them. From an initial pilot including just a few | |
bio-types such as proteins and samples, the Bioschemas community has | |
grown and is now opening up towards other disciplines. The biodiversity | |
domain is a promising candidate for such further extensions. We can | |
think of additional profiles to account for biodiversity-related | |
information. For instance, since taxonomic registers are the backbone of | |
many web portals and databases, new profiles could describe taxa and | |
scientific names while reusing well-adopted vocabularies such as Darwin | |
Core terms (Baskauf et al. 2016) or TDWG ontologies (TDWG Vocabulary | |
Management Task Group 2013). Fostering the use of such markup by web | |
portals reporting traits, observations or museum collections could not | |
only improve information discovery using search engines, but could also | |
be a key to spur large-scale biodiversity data integration scenarios. | |
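As a rough illustration, taxon markup might look as follows (sketched here as a Python structure; Taxon, taxonRank and parentTaxon follow draft schema.org/Bioschemas usage, and the exact property set is an assumption, not a published profile):

    import json

    # Hypothetical JSON-LD a taxon page could embed for search engines.
    taxon_markup = {
        "@context": "https://schema.org",
        "@type": "Taxon",
        "name": "Lutra lutra (Linnaeus, 1758)",
        "taxonRank": "species",
        "parentTaxon": {"@type": "Taxon", "name": "Lutra"},
    }

    print(json.dumps(taxon_markup, indent=2))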
Abstract | |
BIOfid is a specialized information service currently being developed to | |
mobilize biodiversity data dormant in printed historical and modern | |
literature and to offer a platform for open access journals on the | |
science of biodiversity. Our team of librarians, computer scientists and | |
biologists produce high-quality text digitizations, develop new | |
text-mining tools and generate detailed ontologies enabling semantic | |
text analysis and semantic search by means of user-specific queries. In | |
a pilot project we focus on German publications on the distribution and | |
ecology of vascular plants, birds, moths and butterflies extending back | |
to the Linnaeus period about 250 years ago. The three organism groups | |
have been selected according to current demands of the relevant research | |
community in Germany. The text corpus defined for this purpose comprises | |
over 400 volumes with more than 100,000 pages to be digitized and will | |
be complemented by journals from other digitization projects, | |
copyright-free and project-related literature. With TextImager (Natural | |
Language Processing & Text Visualization) and TextAnnotator (Discourse | |
Semantic Annotation) we have already extended and launched tools that | |
focus on the text-analytical section of our project. Furthermore, | |
taxonomic and anatomical ontologies elaborated by us for the taxa | |
prioritized by the project's target group - German institutions and | |
scientists active in biodiversity research - are constantly improved and | |
expanded to maximize scientific data output. Our poster describes the
general workflow of our project, ranging from literature acquisition,
via software development, to data availability on the BIOfid web portal
(http://biofid.de/) and implementation into existing platforms that
serve to promote global accessibility of biodiversity data.
Abstract | |
A new R package for biodiversity data cleaning, 'bdclean', was
initiated in the Google Summer of Code (GSoC) 2017 and is available on
GitHub. Several R packages have great data validation and cleaning
functions, but 'bdclean' provides features to manage a complete
pipeline for biodiversity data cleaning, from data quality explorations
to cleaning procedures and reporting. Users are able to go through the
quality control process in a very structured, intuitive, and effective
way. A modular approach to data cleaning functionality should make this | |
package extensible for many biodiversity data cleaning needs. Under GSoC
2018, 'bdclean' will go through a comprehensive upgrade. New features
will be highlighted in the demonstration.
Abstract | |
TaxonWorks (http://taxonworks.org) is an integrated workbench for | |
taxonomists and biodiversity scientists. It is designed to capture, | |
organize, and enrich data, share and refine it with collaborators, and | |
package it for analysis and publication. It is built on PostgreSQL
(database) and Ruby on Rails, a web application framework
(https://github.com/SpeciesFileGroup/taxonworks). The TaxonWorks
community is built around an open software ecosystem that facilitates | |
participation at many levels. TaxonWorks is designed to serve both | |
researchers who create and curate the data, as well as technical users, | |
such as programmers and informatics specialists, who act as data | |
consumers. TaxonWorks provides researchers with robust, user friendly | |
interfaces based on well thought out customized workflows for efficient | |
and validated data entry. It provides technical users database access | |
through an application programming interface (API) that serves data in | |
JSON format. The data model covers nearly all classes of data recorded
in modern taxonomic treatments and primary studies of biodiversity,
including nomenclature, bibliographies, specimens and collecting events,
phylogenetic matrices, species descriptions, etc.
The nomenclatural classes are based on the NOMEN ontology | |
(https://github.com/SpeciesFileGroup/nomen).
Abstract | |
Providing data in a semantically structured format has become the gold | |
standard in data science. However, a significant amount of data is still | |
provided as unstructured text - either because it is legacy data or | |
because adequate tools for storing and disseminating data in a | |
semantically structured format are still missing. We have developed a | |
description module for Morph∙D∙Base, a semantic knowledge base for | |
taxonomic and morphologic data, that enables users to generate highly | |
standardized and formalized descriptions of anatomical entities using | |
free text and ontology-based descriptions. The main organizational | |
backbone of a description in Morph∙D∙Base is a partonomy, to which the | |
user adds all the anatomical entities of the specimen that they want to | |
describe. Each element of this partonomy is an instance of an ontology | |
class and can be further described in two different ways: | |
as a semantically enriched free-text description annotated with terms
from ontologies, and
semantically, through defined input forms with a wide range of
ontology terms to choose from.
To facilitate the integration of the free text into a semantic context, | |
text can be automatically annotated using jAnnotator, a JavaScript
library that draws on about 700 ontologies with more than 8.5 million
classes from the National Center for Biomedical Ontology (NCBO) BioPortal.
Users get to choose from suggested class definitions and link them to | |
terms in the text, resulting in a semantic markup of the text. This | |
markup may also include labels of elements that the user already added | |
to the partonomy. Anatomical entities marked in the text can be added to | |
the partonomy as new elements that can subsequently be described | |
semantically using the input forms. Each free text together with its | |
semantic annotations is stored following the W3C Web Annotation Data | |
Model standard (https://www.w3.org/TR/annotation-model). The whole | |
description, with the annotated free text and the formalized semantic
descriptions for each element of the partonomy, is saved in the
tuple store of Morph∙D∙Base.
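A rough sketch of what one stored annotation could look like under the W3C Web Annotation Data Model (expressed here as a Python structure; the identifiers and the ontology class are illustrative assumptions):

    import json

    # Hypothetical annotation linking a text span to an ontology class.
    annotation = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": "http://purl.obolibrary.org/obo/UBERON_0002101",  # the 'limb' class
        "target": {
            "source": "https://proto.morphdbase.de/description/demo-123",
            "selector": {"type": "TextQuoteSelector", "exact": "forelimb"},
        },
    }

    print(json.dumps(annotation, indent=2))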
The demonstration is targeted at developers and users of data portals | |
and will give an insight to the semantic Morph∙D∙Base knowledge base | |
(https://proto.morphdbase.de) and jAnnotator | |
(http://git.morphdbase.de/christian/jAnnotator). | |
Abstract | |
Web APIs (Application Programming Interfaces) are a common means for Web
portals and data producers to enable HTTP-based, machine-processable
access to their data. They are a prominent source of information*1
pertaining to topics as diverse as scientific information, social
networks, entertainment or finance. The methods of Linked Data (Heath
and Bizer 2011) similarly aim to publish machine-readable data on the | |
Web, while connecting related resources within and between datasets, | |
thereby creating a large distributed knowledge graph. Today, the | |
biodiversity community is increasingly adopting the Linked Data | |
principles to publish data such as trait banks, museum collections and | |
taxonomic registers (Parr et al. 2016, Baskauf et al. 2016). However, | |
standard approaches are still missing to combine disparate | |
representations coming from both Linked Data interfaces and the manifold | |
Web APIs that were developed during the last two decades to expose | |
legacy biodiversity databases on the Web. | |
The SPARQL Micro-Service architecture (Michel et al. 2018) tackles the | |
goal of reconciling Linked Data interfaces and Web APIs. It proposes a | |
lightweight method to query a Web API using SPARQL (Harris and Seaborne | |
2013), the Semantic Web standard to query knowledge graphs expressed in | |
the Resource Description Framework (RDF). A SPARQL micro-service | |
provides access to a small RDF graph, typically resource-centric, that | |
it builds at run-time by transforming a fraction of the whole dataset | |
served by the Web API into RDF triples. Furthermore, Web APIs | |
traditionally rely on internal, proprietary resource identifiers that | |
are unsuited for use as Uniform Resource Identifiers (URIs). To address | |
this concern, a SPARQL micro-service can assign a URI to a Web API | |
resource, allowing an application to look up this URI and get a | |
description of the resource in return (this process is referred to as | |
dereferencing). | |
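A minimal sketch of what calling such a service could look like from
Python, using the standard SPARQL protocol; the micro-service URL is
hypothetical, and behind the scenes the service would translate the
query into a Web API call and return the result as RDF.

    import requests

    ENDPOINT = "https://sparql-ms.example.org/flickr/getPhotosByTaxon"  # hypothetical

    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dwc:  <http://rs.tdwg.org/dwc/terms/>
    SELECT ?photo WHERE {
      ?photo foaf:depicts ?taxon .
      ?taxon dwc:scientificName "Delphinus delphis" .
    } LIMIT 10
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    for binding in resp.json()["results"]["bindings"]:
        print(binding["photo"]["value"])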
In this demo, we wish to showcase the value of SPARQL micro-services in | |
the biodiversity domain. We first query TAXREF-LD, a Linked Data | |
representation of the French taxonomic register of living beings (Michel | |
et al. 2017), to retrieve information about a given taxon. Then, we | |
demonstrate how we can enrich our knowledge about this taxon with | |
various types of data retrieved on-the-fly from multiple Web APIs: | |
trait data from the Encyclopedia of Life trait bank (Parr et al. 2016), | |
articles or books from the Biodiversity Heritage Library, | |
audio recordings from the Macaulay scientific media archive, | |
photos from the Flickr photography social network, and | |
music tunes from MusicBrainz. | |
Different visualizations are demonstrated, ranging from raw RDF triples | |
to Web pages generated dynamically and integrating heterogeneous data, | |
as suggested in Fig. 1. Depending on the audience's interests, we shall | |
touch upon the alignment of Web APIs' proprietary vocabularies with | |
well-adopted thesauri or ontologies, or more technical concerns, e.g.,
the effort required to deploy a new SPARQL micro-service.
Abstract | |
In recent years, the natural history collections community has made | |
great progress in accelerating the pace of collection digitization and | |
global data-sharing. However, a common workflow bottleneck often occurs
in the period immediately following image capture and preceding image
submission to portals, a critical phase involving quality control, file | |
management, image processing, metadata capture, data backup, and | |
monitoring performance and progress. | |
While larger institutions have likely developed reliable, automated | |
workflows over time, small and medium institutions may not have the | |
expertise or resources to design and implement workflows that take full | |
advantage of automation opportunities. Without automation, these | |
institutions must invest many hours of manual effort to meet quality and | |
performance goals. | |
To address its own needs, BRIT developed a number of workflow automation | |
components, which coalesced over time into a suite of tools that operate | |
on both an image capture station as a client application and on a server | |
that provides file storage and image processing features. Together, | |
these tools were created to meet the following goals: | |
Simplify file management and data preservation through automation | |
Quickly identify quality issues | |
Quickly capture skeletal metadata to facilitate later databasing | |
Significantly reduce time between image capture and online availability | |
Provide performance and quality monitoring and reporting | |
Simplify configuration and maintenance of the client and server
The client and server components together can be considered a | |
"digitization appliance": software integrated with the specific goal of | |
providing a comprehensive suite of digitization tools that can be | |
quickly and easily deployed on simple consumer hardware. We have made | |
this software available to the natural history collections community | |
under an open-source license at | |
https://github.com/BRITorg/digitization_appliance.
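As an illustration of the sort of client-side automation such an
appliance bundles (a generic sketch, not the BRIT code itself), the
following Python loop watches a capture folder, checksums each new image
for data preservation, and hands it off to a mounted server share:

    import hashlib
    import shutil
    import time
    from pathlib import Path

    CAPTURE = Path("capture")       # where the imaging station writes files
    OUTBOX = Path("server_share")   # mounted server storage (assumption)
    for p in (CAPTURE, OUTBOX):
        p.mkdir(exist_ok=True)

    def sha256(path: Path) -> str:
        """Checksum a file in chunks so large TIFFs do not exhaust memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    while True:
        for img in CAPTURE.glob("*.tif"):
            digest = sha256(img)
            # store the checksum beside the image so the server can verify it
            (OUTBOX / (img.stem + ".sha256")).write_text(digest)
            shutil.move(str(img), str(OUTBOX / img.name))
            print(f"moved {img.name} ({digest[:12]}...)")
        time.sleep(5)  # polling; a production tool might use filesystem events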
Abstract | |
The Specify Software Project (www.specifysoftware.org) has been funded | |
by the University of Kansas and with grants from the U.S. National | |
Science Foundation for 20 years. In 2018, the effort is pivoting from a | |
grant-funded project to a community-supported effort through the | |
establishment of a consortium of biological collection institutions. | |
Specify Collection Consortium software products will remain open source | |
and free to download and use. Consortium membership benefits will | |
include access to technical support services and seats on the Board of | |
Directors and advisory committees, groups that will determine priorities | |
for future products, platform capabilities, and technical support | |
services. In 2017 and 2018, we have been engaged in organizational
planning and development, modeling the Specify Collections Consortium on
examples of viable open-source and open-access consortia in other
research communities. Founding members of the Consortium in the U.S.
include the University of Michigan, University of Florida, and | |
University of Kansas. The Consortium's mission will be to support
collections institutions in mobilizing data from their holdings to | |
broader biological and computational initiatives to advance | |
collections-based research, while facilitating efficient data curation | |
and collection management. We will provide an update on our progress | |
with the Consortium's development and highlight new capabilities and
integration features of the Specify 6 & 7 software platforms. | |
Abstract | |
To improve access to biodiversity knowledge for diverse audiences, the | |
Encyclopedia of Life (EOL) aggregates materials from hundreds of content | |
providers. In addition to text, media, references, taxon names and | |
hierarchies, traits and other structured data are an increasingly | |
important component of EOL (TraitBank). Content priorities for TraitBank | |
include information about body size, geographic distribution, habitat, | |
trophic ecology, and biotic interactions in general. Our goal is to | |
summarize available data at the level of species and supraspecific taxa | |
and to achieve broad taxonomic coverage for high priority topics. | |
Integration of information from heterogeneous sources relies on a | |
variety of community standards (e.g., Dublin Core, Darwin Core, Audubon | |
Core) as well as post-hoc semantic annotations that standardize | |
terminology for traits and metadata and provide links to domain | |
ontologies and controlled vocabularies (e.g., Ontology of Biological | |
Attributes, Phenotypic Quality Ontology, Environment Ontology, Uber | |
Anatomy Ontology). Taxon names are mapped to a reference hierarchy that | |
leverages taxonomic information from many different resources (e.g., | |
Catalogue of Life, World Register of Marine Species, Paleobiology | |
Database, National Center for Biotechnology Information). Name
reconciliation takes into account canonical name strings, authorities,
and synonym relationships as well as information about ranks and | |
hierarchies (parent/child taxa). In EOL version 3 this infrastructure | |
supports complex queries across EOL data sets, autogenerated natural | |
language descriptions of taxa, and knowledge-based recommender systems | |
for the exploration of content along multiple axes, including phylogeny, | |
ecology, life history, relevance to humans and other characteristics | |
derived from structured data. Most TraitBank data currently come from | |
published data compilations and databases of specialist projects, but | |
there are still significant gaps in coverage for many lesser known | |
groups. Recent advances in natural language processing, image analysis,
and machine learning technologies facilitate the automated extraction
and processing of data from unstructured text and images. This will soon | |
make it possible to recruit vast amounts of information from millions of | |
pages of taxonomic, ecological, and natural history literature available | |
in open access repositories like Biodiversity Heritage Library (BHL) and | |
Plazi. Natural history collections are another promising source of new | |
taxon information. Millions of museum specimens indexed by organizations | |
like the Global Biodiversity Information Facility (GBIF) and Integrated | |
Digitized Biocollections (iDigBio) already contribute significantly to | |
our understanding of species occurrences in space and time. But | |
specimens and associated labels and field notes can also provide | |
information about morphology, phenology, habitats, and biotic | |
interactions. Data mined from literature corpora or specimen collections | |
will generally lack detailed descriptions of what exactly was measured, | |
metadata about the data capture process, measurement accuracy, and other | |
important parameters. The integration of this information with data sets | |
from the primary literature therefore poses challenges that go beyond | |
the standardization of taxonomy and terminology. Leveraging data from a
wide variety of sources is, however, necessary to achieve a comprehensive,
interconnected biodiversity knowledge base that supports the exploration | |
of trait diversity across the tree of life. | |
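As a toy illustration of the name reconciliation step described above
(the real EOL pipeline is considerably richer), the following Python
sketch reduces a name string to its canonical form and resolves synonyms
against a stand-in lookup table:

    import re

    SYNONYMS = {  # canonical name -> accepted name (illustrative entries)
        "Felis concolor": "Puma concolor",
    }

    def canonical(name: str) -> str:
        """Strip authorship: 'Puma concolor (Linnaeus, 1771)' -> 'Puma concolor'."""
        name = re.sub(r"\(.*?\)", "", name)                 # drop parenthesized authority
        name = re.sub(r"[A-Z][a-z]+,?\s*\d{4}", "", name)   # drop 'Author, 1771'
        return " ".join(name.split()[:2])

    def reconcile(name: str) -> str:
        """Map a raw name string to its accepted name, if a synonym is known."""
        c = canonical(name)
        return SYNONYMS.get(c, c)

    print(reconcile("Felis concolor Linnaeus, 1771"))  # -> Puma concolor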
Abstract | |
The World Flora Online (WFO) is primarily a data management project | |
initiated in 2012 in response to Target 1 of the Global Strategy for | |
Plant Conservation -- \"To create an online flora of all known plants by | |
2020\". A WFO Consortium has been formed of now 42 international | |
partners with a governing Council and three Working Groups. The World | |
Flora Online Public Portal (www.worldfloraonline.org) was launched at | |
the International Botanical Congress in Shenzhen, China in July, 2017. | |
The baseline Public Portal was primarily populated with a taxonomic | |
backbone of information gathered from The Plant List augmented by newer | |
taxonomic sources like Solanaceae Source. To support all known plant | |
names in the WFO. including both vascular and non-vascular plants, new | |
WFO identifiers (WFOIDs) were created, which were also cross-referenced | |
to the International Plant Names Index (IPNI) identifiers for plant | |
names included there. The next phase of the World Flora Online involves | |
additional enhancement of the taxonomic backbone by engagement of new | |
plant Taxonomic Expert Networks (TENs) and acceleration of ingestion of | |
descriptive data from digital floras and monographs, and other sources | |
like International Union for Conservation of Nature (IUCN) threat | |
assessments and the Botanic Gardens Conservation International (BGCI) | |
Global Tree Assessment. Descriptive data can be text descriptions, | |
images, geographic distributions, identification keys, phylogenetic | |
trees, as well as atomized trait data like threat status, lifeform or | |
habitat. Initial digital descriptive datasets have been received by WFO | |
from Flora of Brazil, Flora of South Africa, Flora of China, Flora of | |
North Africa, Solanaceae Source and several others. The hard work is
underway to match the names associated with the submitted descriptions
to the names and WFOIDs in the World Flora Online taxonomic backbone,
and then to merge the descriptive data elements into the WFO database.
Numerous data tools have been adopted and created to accomplish the data | |
cleaning, standardization and transformation required before descriptive | |
data can be integrated. The WFO project has discovered many variations
among just the few datasets received so far, which highlights the need
for better standardization and controlled vocabularies for flora and | |
monographic descriptive data. This presentation will review some of the | |
issues identified by the project when merging descriptive data and some | |
potential gaps in the TDWG standards specifically for flora descriptive | |
data. Some opportunities for consideration by the TDWG Species | |
Information Interest Group will be presented. | |
Abstract | |
Species-level information, an important component of the biodiversity
information landscape, is an area where several TDWG standards and
activities coincide. Plinian Core (Plinian Core Task Group 2018) is a
generalist specification that covers aspects such as species descriptions
and nomenclature, as well as many others (legal, conservation,
management, etc.). While the non-biological Plinian Core terms have no
counterpart in the TDWG developments, some of its biological ones do,
and those are the focus of this work. First, it must be noted that
Plinian Core relies on some TDWG standards for specific facets of | |
species information: | |
Standard: Darwin Core (Darwin Core maintenance group, Biodiversity
Information Standards (TDWG) 2014)
Elements: taxonConceptID, Hierarchy, MeasurementOrFact,
ResourceRelationship.
Standard: Ecological Metadata Language (EML project members 2011)
Elements: associatedParty, keywordSet, coverage, dataset.
Standard: Encyclopedia of Life Schema (EOL Team 2012)
Elements: AncillaryData: DataObjectBase.
Standard: Global Invasive Species Network (GISIN 2008)
Elements: origin, presence, persistence, distribution, harmful,
modified, startValidDate, endValidDate, countryCode, stateProvince,
county, localityName, language, citation, abundance, etc.
Standard: Taxon Concept Schema (TCS) (Taxonomic Names and Concepts
interest group 2006)
Elements: scientificName.
Given Plinian Core's direct dependency on these standards for those
terms, they do not pose any compatibility or interoperability problem.
However, biological descriptions (especially structured ones) are the
object of DELTA (Dallwitz 2006) and the Structured Descriptive Data
(SDD) standard (Hagedorn et al. 2005), and are also covered by Plinian
Core. This convergence presents overlaps, mismatches and nuances, whose
discussion is the core of this work.
Using some species descriptions as a test case, and transforming them | |
between these standards (Plinian Core, DELTA, and SDD), the strengths | |
and compatibility issues of these specifications are evaluated and | |
discussed. | |
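The following schematic Python sketch illustrates the kind of
element-level crosswalk such transformations rely on; the element names
are simplified stand-ins rather than the exact Plinian Core, DELTA, or
SDD terms, and unmapped elements surface exactly the overlaps and
mismatches under discussion:

    CROSSWALK = {
        # (source_standard, element) -> (target_standard, element)
        ("PlinianCore", "Habitat"): ("SDD", "CategoricalCharacter:habitat"),
        ("PlinianCore", "FullDescription"): ("SDD", "NaturalLanguageDescription"),
        ("DELTA", "character"): ("SDD", "Character"),
    }

    def transform(record, source, target):
        """Carry over elements with a known counterpart; report the rest."""
        mapped, unmapped = {}, []
        for element, value in record.items():
            hit = CROSSWALK.get((source, element))
            if hit and hit[0] == target:
                mapped[hit[1]] = value
            else:
                unmapped.append(element)  # the mismatches under discussion
        return {"mapped": mapped, "unmapped": unmapped}

    print(transform({"Habitat": "montane forest", "Uses": "medicinal"},
                    "PlinianCore", "SDD"))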
Some operational aspects of Plinian Core in relation to GBIF's IPT
(GBIF Secretariat 2016) and the INSPIRE directive (European Commission | |
2007) are also reviewed. | |
Abstract | |
Taxonomic monographs are a series of publications covering a higher
taxonomic group, with each monograph focusing on an individual species.
Each is a compendium of the current state of research and knowledge
detailing many aspects of the species, and they are extensively used by
researchers, ornithologists and conservationists to learn what is
'currently' known about a species. Birds, being one of the more easily
seen and studied taxa, have a number of specialized taxonomic monographs | |
where data from a wide variety of disciplines are combined into a single | |
place and utilized for research and conservation management. Many of the | |
existing avian monographs have regional or subdomain focus such as | |
"Birds of the Western Palearctic" or "Catalan Breeding Bird Atlas | |
1999-2002" and monographs are sometimes focused on different user | |
communities, ranging from those with casual interest to professional | |
ornithologists and researchers. | |
The Lab of Ornithology maintains several monograph series. Merlin and | |
All About Birds include simplified information that is of interest to | |
the casual observer, while Birds of North America and Neotropical Birds
Online are monographs with complete, detailed life histories, prepared | |
for ornithologists and active researchers. These monograph projects were
originally supported using different Content Management Systems, which
became difficult to maintain, made it hard to keep content current, and
provided no capacity for organizing and sharing content across
monograph projects. Bird taxonomies change annually, and the previous
systems had no capacity to intelligently manage taxonomic changes. To | |
solve these issues, we created a new Content Management System with | |
Taxonomic Concepts at its core. Reviewing a number of existing monograph | |
projects led us to create an underlying content structure that is closely
analogous to Plinian Core. The initial requirement to support multiple | |
monograph series, some focused on the professional community and others | |
focused on budding amateurs, presented challenges to creating a 'one | |
size fits all' model for structuring content that includes authoritative | |
articles covering most aspects of a species life history, traditional | |
range maps, dynamic observation maps, relative abundance models, photos, | |
images, video and a bibliography. In this talk I'll present in detail
the Content Management System and the five underlying models we have
developed. Four of these models are tied to the underlying
taxonomic concept, while the fifth is tied to taxonomic names.
Articles, multimedia (including traditional range maps), taxonomic
description and bibliography have long existed in print monographs, and
having these authored and displayed via the web makes it much simpler to
incorporate new information, keep the information current, and
publish the information to an existing standard.
dynamic content has only been possible with the advent of the web and | |
standards for the underlying Taxonomic Concepts. With four monographs | |
currently in production and several more in development, we've | |
encountered both advantages and disadvantages in using these models for | |
managing and serving monograph series. I will discuss these in detail | |
and compare the models with Plinian Core to highlight both fundamental | |
differences as well as common ground. | |
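A minimal sketch (an assumed structure, not the actual schema) of why
keying content to taxonomic concepts rather than names pays off when
taxonomies change annually:

    from dataclasses import dataclass

    @dataclass
    class TaxonConcept:
        concept_id: str       # stable identifier for the circumscription
        accepted_name: str    # current name; may change between taxonomies

    @dataclass
    class Article:
        title: str
        concept: TaxonConcept  # articles, maps and media hang off the concept

    @dataclass
    class NameRecord:
        name: str
        concept: TaxonConcept  # names map onto concepts, not the other way round

    grouse = TaxonConcept("tc:0001", "Centrocercus urophasianus")
    article = Article("Breeding biology", grouse)
    name = NameRecord("Centrocercus urophasianus", grouse)
    # A taxonomy update can rename the concept without touching its content:
    grouse.accepted_name = "Centrocercus urophasianus [revised]"
    print(article.concept.accepted_name)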
Abstract | |
Aiming to promote interaction among researchers and the integration of
data from their pollen collections, herbaria and bee collections, RCPol
was created in 2013. In order to structure RCPol work, researchers and
collaborators have organized information on palynology and trophic
interactions between bees and plants. During the project development,
different computing tools were developed and provided on RCPol website | |
(http://rcpol.org.br), including: interactive keys with multiple inputs | |
for species identification (http://chaves.rcpol.org.br); a glossary of
palynology-related terms
(http://chaves.rcpol.org.br/profile/glossary/eco); a plant-bee | |
interactions database (http://chaves.rcpol.org.br/interactions); and a | |
data quality tool (http://chaves.rcpol.org.br/admin/data-quality). Those | |
tools were developed in partnership with researchers and collaborators | |
from Escola Politécnica (USP) and other Brazilian and foreign | |
institutions that act on palynology, floral biology, pollination, plant | |
taxonomy, ecology, and trophic interactions. The interactive keys are | |
organized in four branches: palynoecology, paleopalynology, | |
palynotaxonomy and spores. This information is collaboratively
digitized and managed using standardized Google Spreadsheets. All the
information is assessed by a data quality assurance tool (based on the
conceptual framework of the TDWG Biodiversity Data Quality Interest
Group; Veiga et al. 2017) and curated by palynology experts. In total,
the project has published 1,774 specimen records, 1,488 species records
(automatically generated by merging specimen records with the same
scientific name), 656 interaction records, 370 glossary term records and
15 institution records, all of them translated from the original language (usually
Portuguese or English) to Portuguese, English and Spanish. During the
project's first three years, 106 partners, researchers and
collaborators from 28 institutions in Brazil and abroad, actively
participated in the project. An important part of the project's
activities involved training researchers and students in palynology,
data digitization and the use of the system. So far, six training
courses have reached 192 people.
Abstract | |
The Australian Department of the Environment and Energy (DoEE) is
working with the Atlas of Living Australia (ALA) and the Biodiversity
and Climate Change Virtual Laboratory (BCCVL), together with two state
environment departments (New South Wales and Queensland), to develop a
standard framework for modelling threatened species distributions for
use in policy and environmental decision making.
In addition, DoEE is working with seven state and territory environment
departments to implement a common assessment method (CAM) for the
assessment and listing of nationally threatened species. The method is
based on the IUCN Red List criteria. Each Australian jurisdiction has
traditionally used a different assessment method, with its own
categories, criteria, thresholds, definitions and scales of assessment,
to list threatened species within its jurisdiction. The CAM is a
standardised method for species assessed for listing at the national
level. Through
cross-jurisdictional collaboration, this will improve the efficiency of | |
the assessment process and facilitate consistency across jurisdictional | |
lists. | |
The BCCVL includes linkages to species observations on the ALA, and users
are able to add their own data, including contextual and species data.
The project aims to create a secure environment where | |
cross-jurisdictional collaboration can occur both on the standardisation | |
of methodologies for creating species distributions and the integration | |
of data. The project also aims to provide a secure platform for | |
jurisdictions to contribute sensitive observations not available through | |
the ALA and take into consideration expert feedback on the distribution | |
of species. | |
The project will provide a public-facing platform whereby species
distribution models (SDMs) can be published. This will be searchable by
area, species or contributor. All outputs will be scientifically robust,
repeatable, maintainable, open and transparent. The increased validity
and robustness of models leads to better-informed decisions relating to
the impacts of development and the conservation of species.
Abstract | |
How do you successfully engage volunteers in citizen science projects? | |
In recent years, citizen science has grown considerably in popularity, | |
resulting in rapid increases in the number of citizen science and | |
crowdsourcing projects and providing cost-effective means for scientists | |
to gather more data over broader spatial ranges to tackle research | |
questions in a wide variety of scientific, conservation, and | |
environmental fields Bonney et al. 2016, Aceves-Bueno et al. 2017. While | |
the proliferation of such projects has produced a growing abundance of | |
citizen scientist-generated data and published research informed by | |
citizen science methods Follett and Strezov 2015, this also means that | |
volunteers have a greater number of projects competing for their time. | |
When faced with an increasingly-crowded landscape, how can you generate | |
interest in a citizen science or crowdsourcing project and maintain | |
contributions over the project's lifetime? | |
The Biodiversity Heritage Library (BHL) supports a variety of citizen | |
science and crowdsourcing projects, from transcribing field notes to | |
tagging scientific illustrations with taxonomic names on Flickr and | |
enhancing data for 19th-century periodicals through its
Zooniverse-based Science Gossip project. Through a variety of outreach | |
strategies including collaborative social media campaigns, partnerships | |
with citizen science communities, and interactive incentives, BHL has | |
successfully engaged volunteers with diverse projects to enrich the | |
library's data and increase discoverability of its collections. | |
This presentation will discuss outreach strategies for citizen science | |
projects that BHL has undertaken to further support research initiatives | |
with our content. In addition, the presentation will share | |
lessons-learned and offer suggestions that attendees can apply to their | |
own citizen science engagement efforts. | |
Abstract | |
Biodiversity literature and archival collections are not only
indispensable in taxonomic research; they also provide crucial information
for understanding museums' natural history collections. Literature
and archives document collecting events resulting in specimen | |
collections, contain original descriptions based on those specimens, and | |
provide a wealth of other contextual information for the study of life | |
on earth. The Biodiversity Heritage Library is committed to improving | |
research efficiency by providing open access to a growing body of | |
biodiversity literature and archives. While descriptive metadata is
widely available for both specimen collections (e.g., Darwin Core) and
literature (e.g., MARCXML), connections between the two collection types
cannot generally be found at these descriptive levels, thus hindering
efficient discovery of relevant materials. The integration of name-
finding services, powered by Global Names Architecture, provides a | |
significant value-add through page-level access to mentions of a given | |
taxon name. Yet how might one search based on a museum code, a common | |
name, or a place name? This presentation will share how BHL's top | |
technical priorities for 2018 will help facilitate more efficient | |
searching and discovery of information in the pages of the BHL corpus. | |
Specifically, updates on BHL's top two priorities -- the implementation
of full-text search and the incorporation of available crowdsourced
transcriptions -- will be covered.
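As a toy illustration of the indexing that page-level full-text search
requires (BHL's actual implementation is not described here), the
following Python sketch builds an inverted index from OCR page text to
page identifiers:

    from collections import defaultdict

    pages = {  # page_id -> OCR text (illustrative)
        "page/101": "Notes on Puma concolor collected near the museum",
        "page/102": "A common name list for North American mammals",
    }

    # map each token to the set of pages on which it occurs
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token].add(page_id)

    def search(term):
        return index.get(term.lower(), set())

    print(search("museum"))  # -> {'page/101'}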
Abstract | |
The classification of living things depends upon the literature. Access | |
to this literature is essential to taxonomic research and to our | |
understanding of biodiversity. There have been tremendous efforts to | |
digitise the world's biodiversity literature; the Biodiversity Heritage
Library (BHL) alone has uploaded over 54 million pages, all of which are
freely accessible online. Our scientific literature is far more
accessible than it has ever been, but that does not mean it is easily | |
discoverable. Much of the taxonomic literature online remains outside | |
the linked network of scholarly research. But that is rapidly changing. | |
Taxonomic aggregators are an invaluable source of authoritative | |
information on species names and their hierarchical classification. It | |
is critical that this information includes citations for taxonomic | |
descriptions, that these citations link to the published literature | |
online and that (wherever possible) the citations include DOIs (Digital | |
Object Identifiers). The DOI is an essential part of a publication's | |
bibliographic metadata and should be included (as a live link) in any | |
reference to that content. | |
However, the definitive (DOI'd) versions of recent publications are | |
frequently behind paywalls. And, while much of the historic literature | |
available online is open access, commercial publishers are uploading | |
out-of-copyright publications onto their own websites, assigning DOIs to | |
"their" definitive versions (the versions that must be cited in other | |
publications, as per DOI requirements) and then locking the definitive
versions behind paywalls. This is perfectly within their rights. DOIs | |
may be assigned to legacy publications retrospectively, providing that: | |
a) the party assigning them owns the rights for the content, or has | |
permission from the rights holder to assign a DOI, and b) the | |
publication does not already have a DOI. If there are no rights attached | |
to a piece of content, anyone can assign a DOI to it. | |
This means that citation traffic from the bibliographies of current | |
publications is increasingly directed towards commercial publishers' | |
websites, rather than towards open access versions, such as those freely | |
available on the Biodiversity Heritage Library (BHL). However, taxonomic | |
aggregators are not bound by the same obligations as publishers and may | |
therefore choose to link to any online version of a publication | |
(although the DOI should still be included in the citation). | |
Many taxonomic aggregators link to the literature available on BHL. The | |
taxonomic name profiles in EOL (Encyclopedia of Life), GBIF (Global | |
Biodiversity Information Facility) and ALA (Atlas of Living Australia) | |
each contain a BHL bibliography: a list of links to the pages in BHL | |
that contain an identified mention of that taxon name. However, the | |
lists of returned results can be long, and they may or may not include | |
the citations for accepted names, synonyms and taxon concepts. Some | |
biodiversity aggregators feature these key citations on the names pages | |
(or tabs) of taxon profiles. However, where these do exist, they are | |
usually plain text rather than links. | |
BHL is now registering DOIs for the content it hosts and is creating | |
landing pages for articles, containing the full bibliographic metadata, | |
including (where applicable) the DOI. Articles are now discoverable by | |
article title, keywords within titles (scientific names, locations, | |
traits, etc.), author names and DOIs, and can be easily linked to (via | |
their landing pages) by other parties. | |
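Once an article has a DOI and a landing page, its bibliographic metadata
can also be fetched machine-readably via DOI content negotiation, a
service offered by the major DOI registration agencies. A minimal Python
sketch (the DOI below is a placeholder):

    import requests

    def csl_metadata(doi):
        """Fetch CSL JSON bibliographic metadata by resolving the DOI."""
        resp = requests.get(
            f"https://doi.org/{doi}",
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    meta = csl_metadata("10.1000/example-doi")  # placeholder; use a real DOI
    print(meta.get("title"), meta.get("container-title"))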
This paper will examine the issues, benefits and complexities associated | |
with linking to definitive versions, the difference between easy and | |
open access, the ethics of putting out-of-copyright content behind | |
paywalls, and the future of creating order amongst the massively | |
expanding resource of literature online. | |
Abstract | |
The Biodiversity Heritage Library (BHL) provides open access to over 54 | |
million pages of biodiversity literature. Much of this literature is | |
either in the public domain or is licensed for reuse under the Creative | |
Commons framework. Anyone can therefore freely reuse much of the | |
information and data provided by BHL. This presentation will outline how | |
the work of a citizen scientist using BHL content might benefit research | |
scientists. It will discuss how a citizen scientist can reuse and link | |
BHL literature and data in Wikipedia and Wikidata. It will explain the | |
research efficiencies that can be obtained through this reuse and | |
linking, for example through the consolidation of database identifiers. | |
The presentation will outline the subsequent reuse of the BHL data added | |
to Wikipedia and Wikidata by the internet search engine Google. It will | |
discuss an example of the linking of this information in the citizen | |
science observation platform iNaturalist. The presentation will explain | |
how BHL, as a result of its open reuse licensing of information and | |
data, helps in the creation of more accurate citizen science generated | |
biodiversity data and assists with the wider and more effective | |
dissemination of biodiversity information. | |
Abstract | |
A program to integrate species diversity information systems was | |
launched by the Chinese Academy of Sciences (CAS) in January 2018, with | |
funding from the CAS Earth project, a Strategic Priority Research | |
Program of CAS. The program will create a series of data products, such | |
as China flora online, species catalogues, distribution maps, software | |
tools for data mining and knowledge discovery based on big data and | |
artificial intelligence technology, and a service platform and portal | |
highlighting species diversity information in China. The products and | |
platform will provide the robust data to support decision making on | |
biodiversity conservation, fundamental research on biodiversity | |
evolution and spatial patterns, and species identification for citizen | |
science. China flora online will include 35,000 species of higher plants | |
in China and an online editing environment for botanists to maintain the | |
floral records. The trait database will include structured data of | |
animals, plants and fungi, such as weight, height, length, color and | |
shape of organisms. The species catalogue will be an annually updated
version of the Catalogue of Life, China. The distribution maps will show
the spatial pattern for each species of vertebrate animal and higher | |
plant. Cell phone apps will help users to easily and quickly identify | |
plants in the field. The mechanism and workflow for data collection, | |
integration, public sharing and quality control will be built up in the | |
next few years. | |
Abstract | |
Due to the recent establishment of the Global Genome Biodiversity | |
Network (GGBN) data portal, we have extended Specify collections | |
management software (http://www.sustain.specifysoftware.org/) to more | |
effectively manage, publish, and integrate tissue and DNA extract data | |
by adding support for the GGBN data schema. Specify's database design | |
now includes a number of data fields and tables prescribed by GGBN
standard vocabularies. We also realigned some of the underlying table
relationships to address the needs of specimen curation and collection | |
transactions for extract and tissue samples. Specify now also supports | |
"Next Generation" sequencing metadata with fields to record NCBI SRA ID | |
numbers for web-linking tissue and extract metadata to entries in the | |
NCBI SRA databases. | |
With the ongoing evolution of the TDWG Darwin Core (DwC) standard for | |
specimen data exchange, we generalized Specify 7's data publishing | |
capabilities to export collections data to any DwC or other | |
standards-based, exchange schema. This generic, external schema mapping | |
capability enables Specify collections to design and map data packages | |
to integrate their data with any community aggregator or collaborative | |
project database based on Darwin Core or other community standard-based | |
format. The development of these versatile new integration capabilities
was carried out in collaboration with, and with financial support from,
GGBN. This
talk will highlight these changes in the context of delivery of museum | |
tissue and extract data records to the GGBN data portal for aggregation. | |
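A generic sketch of the kind of external schema mapping described above:
an internal record is exported through a user-defined mapping onto
Darwin Core and GGBN-style terms. The field names on both sides are
illustrative assumptions, not Specify's actual schema:

    # Internal field -> exchange-schema term (illustrative on both sides)
    DWC_MAPPING = {
        "catalog_no": "dwc:catalogNumber",
        "sci_name": "dwc:scientificName",
        "tissue_type": "ggbn:materialSampleType",  # assumed GGBN-style term
        "ncbi_sra_id": "ggbn:sraAccession",        # hypothetical field label
    }

    def export_record(record, mapping):
        """Export only the fields the mapping covers, renamed to target terms."""
        return {mapping[k]: v for k, v in record.items() if k in mapping}

    specimen = {
        "catalog_no": "KU:IT:00123",
        "sci_name": "Apis mellifera",
        "tissue_type": "DNA extract",
        "ncbi_sra_id": "SRR0000000",
        "internal_note": "not exported",
    }
    print(export_record(specimen, DWC_MAPPING))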
Abstract | |
The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/) | |
is an open access repository for geographic and ecological metadata | |
associated with biosamples and genetic data. It contributes to the | |
informatics stack -- Biocode Commons -- of the Genomic Observatories | |
Network | |
(https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-2). | |
The GeOMe project interface enables administrators to plan and execute
field-based sample collection efforts. GeOMe projects specify a core set
of sample metadata fields based on community standard vocabularies and
also include plugins for associating samples with photos, subsamples,
NextGen sequence metadata, and permits. Users can upload their own
expedition-specific metadata, which contributes to the overall project | |
dataset while providing the user a convenient method for updating and | |
refining their contributed data. GeOMe provides connection points to the | |
Global Biodiversity Information Facility and archived genetic data | |
stored in the National Center for Biotechnology Information's (NCBI's)
Sequence Read Archive (SRA), linking specimens and sequences via unique
persistent identifiers. | |
Abstract | |
Genomic research depends upon access to DNA or tissue collected and | |
preserved according to high-quality standards. At present, the | |
collections in most natural history museums do not sufficiently address | |
these standards. In response to these challenges, natural history | |
museums, culture collections, herbaria, botanical gardens and others | |
have started to build high-quality biodiversity biobanks. Unfortunately, | |
information about these collections remains fragmented, scattered and | |
largely inaccessible. Without a central registry of relevant | |
institutions, it is difficult and time-consuming to locate the needed | |
samples. | |
The Global Genome Biodiversity Network (GGBN) was created to fill this | |
gap by establishing a central access point for locating samples meeting | |
quality standards for genome-scale applications, while complying with | |
national and international legislation and conventions (e.g., the Nagoya
Protocol). The GGBN is growing rapidly, currently has 70 members, and
works closely with GBIF, SPNHC, CETAF, INSDC, BOLD, ESBB,
ISBER, GSC and others to reach its goals.
Knowledge of biodiversity biobank content is urgently needed to enable | |
concerted efforts and strategies in collecting and sampling new material | |
and making Access and Benefit-Sharing (ABS) a reality. GGBN provides an
infrastructure for making
genomic samples discoverable and accessible. | |
While respecting national law, GGBN requires that its members comply
with the provisions of the Nagoya Protocol. Thus researchers,
collection-holding institutions, and networks should adopt the common
Best Practice approach to managing ABS that GGBN has developed. A Code
of Conduct, recommendations for implementing the Code of Conduct (the
Best Practices), and implementation tools, such as standard Material
Transfer Agreements (MTAs) and mandatory and recommended data fields in
collection databases, will aid compliance. This talk
provides an overview of GGBN and comprises updates on GGBN's best | |
practices on ABS and the Nagoya Protocol, with examples of their use and | |
applicability. | |
Abstract | |
Arctos (https://arctosdb.org), an online collection management | |
information system, was developed in 1999 to manage museum specimen data | |
and to make those data publicly available. The portal | |
(arctos.database.museum) now serves data on over 3.5 million cataloged | |
specimens from more than 130 collections throughout North America in an | |
instance at the Texas Advanced Computing Center. Arctos is also a
community of museum professionals that collaborates on museum best
practices and works together to improve Arctos data richness and
functionality for online museum data streaming. In 2017, three large
Arctos genomics collections at the Museum of Southwestern Biology (MSB), | |
Museum of Vertebrate Zoology, Berkeley (MVZ), and University of Alaska | |
Museum of the North (UAM), received support from GGBN to create a | |
pipeline for publishing data from Arctos to the GGBN portal. | |
Modifications to Arctos included standardization of controlled | |
vocabulary for tissues; changes to the data structure and code tables | |
with regard to permit information, container history, part attributes, | |
and sample quality; implementation of interfaces and protocols for | |
parent-child relationships between tissues, tissue subsamples, and DNA | |
extracts; and coordination with the DwC community to ensure that all
GGBN data standards and formatting are included in the standard DWC | |
export in order to finalize the pipeline to GGBN. The addition of these | |
three primary Arctos biorepositories to the GGBN network will add over | |
750,000 tissue and DNA records representing over 11,000 species and 667 | |
families. These voucher-based archives represent primarily vertebrate | |
taxa, with growing collections of arthropods, endoparasites, and | |
incipient collections of microbiome and environmental samples associated | |
with online media and linked to GenBank and other external databases. | |
The high-quality data in Arctos complement and significantly extend | |
existing GGBN holdings, and the establishment of an Arctos-GGBN pipeline | |
also will facilitate future collaboration between more Arctos | |
collections and GGBN. | |
Abstract | |
The GGBN Data Standard | |
(https://terms.tdwg.org/wiki/GGBN_Data_Standard) provides a platform
based on a documented agreement to promote the efficient sharing and | |
usage of genomic sample material and associated specimen information in | |
a consistent way. It builds upon existing standards commonly used within
the community, extending them with the capability to exchange data on
tissue, environmental and DNA samples as well as sequences. The standard
has recently been extended to support environmental DNA (eDNA) and High
Throughput Sequencing (HTS) library samples. Both eDNA and HTS library
sample use cases have been published in the GGBN Sandbox
(http://sandbox.ggbn.org) and will be presented here. The use case | |
collection is documented in the GGBN wiki | |
(http://wiki.ggbn.org/ggbn/Use_Case_Collection).
In addition, a general overview of the GGBN Data Portal
(http://www.ggbn.org) will be given. Based on ABCD, DwC and the GGBN
Data Standard, the GGBN Data Portal is the gateway to standardized access
to DNA, tissue and environmental samples and their associated specimens.
The third core piece of GGBN is the GGBN Document Library | |
(https://library.ggbn.org), today containing more than 300 documents | |
about research, management and legal aspects of biodiversity biobanks. | |
We will provide an overview of covered topics and gaps that the | |
community can help to fill. | |
Finally, an outlook on goals and priority tasks for the next two years
will be given.
Abstract | |
The Open Biodiversity Knowledge Management System (OBKMS) is an | |
end-to-end, eXtensible Markup Language (XML)- and Linked Open Data | |
(LOD)-based ecosystem of tools and services that encompasses the entire | |
process of authoring, submission, review, publication, dissemination, | |
and archiving of biodiversity literature, as well as the text mining of | |
published biodiversity literature (Fig. 1). These capabilities lead to | |
the creation of interoperable, computable, and reusable biodiversity | |
data with provenance linking facts to publications. | |
OBKMS is the result of a years-long joint endeavour by Plazi and
Pensoft. The system was developed with the support of several
biodiversity informatics projects: initially ViBRANT (Virtual
Biodiversity Research and Access Network for Taxonomy), followed by
pro-iBiosphere, the European Biodiversity Observation Network (EU BON),
and Biosystematics, Informatics and Genomics of the Big 4 Insect Groups
(BIG4). The system includes the following key components:
ARPHA Journal Publishing Platform: a journal publishing platform based | |
on the TaxPub XML extension for the National Library of Medicine (NLM)'s
Journal Publishing Document Type Definition (DTD) (Version 3.0). Its | |
advanced ARPHA-BioDiv component deals with integrated biodiversity data | |
and narrative publishing (Penev et al. 2017). | |
GoldenGATE Imagine: an environment for marking up, enhancing, and | |
extracting text and data from PDF files, supporting the TaxonX XML | |
schema. It has specific enhancements for articles containing | |
descriptions of taxa ("taxonomic treatments") in the field of
biological systematics, but its core features may be used for general | |
purposes as well. | |
Biodiversity Literature Repository (BLR): a public repository hosted at
Zenodo (CERN) for published articles (PDF and XML) and images extracted | |
from articles. | |
Ocellus/Zenodeo: a search interface for the images stored at BLR. | |
TreatmentBank: an XML-based repository for taxonomic treatments and data | |
therein extracted from literature. | |
The OpenBiodiv knowledge graph: a biodiversity knowledge graph built
according to the Linked Open Data (LOD) principles. It uses the RDF data
model and the SPARQL Protocol and RDF Query Language (SPARQL), is open
to the public, and is powered by the OpenBiodiv-O
ontology (Senderov et al. 2018).
OpenBiodiv portal: | |
Semantic search and browser for the biodiversity knowledge graph. | |
Multiple semantic apps packaging specific views of the biodiversity
knowledge graph.
Supporting tools: | |
Pensoft Markup Tool (PMT) | |
ARPHA Writing Tool (AWT) | |
ReFindit | |
R libraries for working with RDF and for converting XML to RDF | |
(ropenbio, RDF4R). | |
Plazi RDF converter, web services and APIs. | |
As part of OBKMS, Plazi and Pensoft offer the following services beyond | |
supplying the software toolkit: | |
Digitization through imaging and text capture of paper-based or | |
digitally born (PDF) legacy literature. | |
XML markup of both legacy and newly published literature (journals and | |
books). | |
Data extraction and markup of taxonomic names, literature references, | |
taxonomic treatments and organism occurrence records. | |
Export and storage of text, images, and structured data in data | |
repositories. | |
Linking and semantic enhancement of text and data, bibliographic | |
references, taxonomic treatments, illustrations, organism occurrences | |
and organism traits. | |
Re-packaging of extracted information into new, user-demanded outputs | |
via semantic apps at the OpenBiodiv portal. | |
Re-publishing of legacy literature (e.g., Flora, Fauna, and Mycota | |
series, important biodiversity monographs, etc.). | |
Semantic open access publishing (including data publishing) of journals
and books.
Integration of biodiversity information from legacy and newly published | |
literature into interoperable biodiversity repositories and platforms | |
(Global Biodiversity Information Facility (GBIF), Encyclopedia of Life | |
(EOL), Species-ID, Plazi, Wikidata, and others). | |
In this presentation we make the case for why OpenBiodiv is an essential | |
tool for advancing biodiversity science. Our argument is that through | |
OpenBiodiv, biodiversity science takes a step towards the ideals of open
science (Senderov and Penev 2016). Furthermore, by linking data from | |
various silos, OpenBiodiv allows for the discovery of hidden facts. | |
A particular example of how OpenBiodiv can advance biodiversity science
is demonstrated by OpenBiodiv's solution to "taxonomic anarchy"
(Garnett and Christidis 2017). "Taxonomic anarchy" is a term coined by
Garnett and Christidis to denote the instability of taxonomic names as
symbols for taxonomic meaning. They propose an "authoritarian"
top-down approach to stabilize the naming of species. OpenBiodiv, on the
other hand, relies on taxonomic concepts as integrative units, and
integration can therefore occur through alignment of taxonomic concepts
via the Region Connection Calculus (RCC-5) (Franz and Peet 2009). The
alignment is "democratically" created by the users of the system, but no
consensus is forced, and "anarchy" is avoided by using unambiguous
taxonomic concept labels (Franz et al. 2016) in addition to Linnean
names.
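As a toy illustration of RCC-5 concept alignment, the following Python
sketch enumerates the five relations and a pair of invented alignment
assertions between concept labels (a taxon name according to a
particular treatment):

    # The five RCC-5 relations between two taxonomic concepts
    RCC5 = {
        "==": "congruent (same circumscription)",
        "<":  "proper part of (narrower than)",
        ">":  "inverse proper part (broader than)",
        "><": "overlapping",
        "|":  "disjoint",
    }

    # A user-supplied alignment between two treatments (invented examples)
    alignment = [
        ("Aus bus sec. Smith 1990", "<", "Aus bus sec. Jones 2015"),
        ("Aus cus sec. Smith 1990", "|", "Aus bus sec. Jones 2015"),
    ]

    for left, rel, right in alignment:
        print(f"{left} {rel} {right}: {RCC5[rel]}")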
Abstract | |
The temporality of specimens is an often overlooked but quintessential | |
part of using aggregated biodiversity occurrences for research, | |
especially when millions of these occurrences exist in deep time. | |
Presently in Darwin Core, there are terms for describing the geological | |
context of specimens, which is needed for paleontological specimens. | |
However, information about the contextual absolute date associated with
a specimen, and how that date was generated, is not supported in Darwin
Core, though it would strongly enhance usability for research. Providers do
occasionally try to provision this information, but it is currently
hidden in a few different Darwin Core fields, making it hard to discover | |
and nearly impossible to search for in biodiversity portals. Here we | |
provide an overview of where absolute date content for paleontological
and archaeological specimens is currently found in published specimen
records. We will then introduce a working Darwin Core extension that
focuses on chronometric content, and demonstrate the use of this | |
extension with published datasets from the zooarchaeological and | |
paleontological communities. This new advancement will allow providers
to make these crucial data available, and researchers to easily find the
temporal range associated with an occurrence, evaluate how that range
was determined, and compile occurrences based on their shared ages,
helping to streamline the research process.
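To make the idea concrete, here is a hypothetical sketch of what a
chronometric extension record might carry alongside an occurrence; the
term names below are invented for illustration and are not the
extension's published vocabulary:

    occurrence = {
        "dwc:occurrenceID": "urn:uuid:example-0001",  # placeholder identifier
        "dwc:scientificName": "Bison bison",
    }

    chronometric_age = {
        "occurrenceID": occurrence["dwc:occurrenceID"],  # star-schema link
        "earliestAge": 11500,               # years before present (illustrative)
        "latestAge": 10200,
        "ageUnit": "years BP",
        "datingMethod": "AMS radiocarbon",  # how the date was generated
        "datedMaterial": "bone collagen",
        "labNumber": "Beta-000000",         # placeholder
    }
    print(chronometric_age)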
Abstract | |
Important initiatives, such as the Convention on Biological Diversity's
(CBD) Aichi targets and the United Nations' 2030 Agenda for Sustainable
Development (with its Sustainable Development Goals), highlight the urgent
need to stop the continuous and increasing loss of biodiversity. That
requires an increase in the knowledge that will allow for sustainable | |
use of natural resources. To accomplish that, detailed studies are | |
needed to evaluate multiple species and regions. These studies demand | |
great effort from professionals, searching for species and/or observing | |
their behavior. In this case, the use of new monitoring devices could be
beneficial for data collection and identification, optimizing the
specialist effort to detect and observe species in situ.
advance of technology platforms for developing connected devices and | |
sensors, associated with the evolution of the Internet of Things (IoT) | |
concepts, and the advances of unmanned aerial vehicles (UAVs) and
wireless sensor networks (WSNs), new scenarios in biodiversity studies
are possible. The technology available now could allow studies applying
relatively cheap sensors with long-range (approx. 15 km), low-power,
low-bit-rate communication and up to 10-year battery life, using a Low
Power Wide Area Network (LPWAN), with the capacity to run bio-acoustic or
image-processing detection. Platforms like the Raspberry Pi, or any other
with signal processing capabilities, can be applied (Hodgkinson and Young
2016). Sensor technology protocols applied in IoT networks are usually
simple and flexible. Common semantics and metadata definitions are
necessary to extract information and representations to construct | |
complex networks. Some of these metadata definitions can be adopted from | |
the current Darwin Core schema. However, Darwin Core evolved based on
enterprise technologies (e.g., XML) and relational database definitions,
which usually need machines with significant bandwidth to transmit data.
Today the technology scenario is taking another route, going from | |
centralized to distributed architectures, occasionally applying | |
non-relational and distributed databases, ready to deal with | |
synchronization and eventual consistency problems. These distributed | |
databases are usually employed to construct complex networks, where | |
relation restrictions are not mandatory or, sometimes, even desired | |
(Baggio et al. 2016). With these new techniques becoming a reality in | |
biodiversity conservation studies, new metadata definitions are | |
necessary. Those new metadata need to standardize and create a shared
vocabulary that covers requirements for device information exchange,
data analytics, and model generation. These new definitions could also
incorporate the Essential Biodiversity Variables (EBVs) concepts, which
aim to identify the minimum set of variables that can be used to inform
scientists, managers and decision makers (Haase et al. 2018). For this
reason, we propose the insertion of EBV definitions into the construction
of sensor-integration metadata and model characterization within the
Darwin Core metadata definitions (Fig. 1).
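As a sketch of the device-side economy such definitions must respect,
the following Python snippet expands a terse LPWAN-style payload into
Darwin Core-style terms plus an EBV annotation on the server side; the
payload layout, term choices and EBV label are assumptions for
illustration:

    import json

    def expand(payload):
        """Expand a terse field-device message into shared-vocabulary terms."""
        return {
            "dwc:decimalLatitude": payload["lat"],
            "dwc:decimalLongitude": payload["lon"],
            "dwc:eventDate": payload["t"],
            "dwc:scientificName": payload["sp"],  # from on-device detection
            "ebv:class": "Species populations",   # assumed EBV annotation
            "device:id": payload["dev"],          # assumed device term
        }

    msg = {"dev": "node-17", "t": "2018-08-21T06:00:00Z",
           "lat": -23.55, "lon": -46.63, "sp": "Ramphastos toco"}
    print(json.dumps(expand(msg), indent=2))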
Abstract | |
The Specialized Information Service Biodiversity Research (BIOfid; | |
http://biofid.de/) has recently been launched to mobilize valuable | |
biodiversity data hidden in German print sources of the past 250 years. | |
The partners involved in this project started digitisation of the | |
literature corpus envisaged for the pilot stage and provided novel | |
applications for natural language processing and visualization. In order | |
to foster development of new text mining tools, the Senckenberg | |
Biodiversity Informatics team focuses on the design of ontologies for | |
taxa and their anatomy. We present our progress for the taxa prioritized | |
by the target group for the pilot stage, i.e. for vascular plants, moths | |
and butterflies, as well as birds. With regard to our text corpus, a key
aspect of our taxonomic ontologies is the inclusion of German vernacular
names. For this purpose, we assembled a taxonomy ontology for vascular
plants by synchronizing taxon lists from the Global Biodiversity | |
Information Facility (GBIF) and the Integrated Taxonomic Information | |
System (ITIS) with K.P. Buttler's Florenliste von Deutschland | |
(http://www.kp-buttler.de/florenliste/). Hierarchical classification of | |
the taxonomic names and class relationships focus on rank and status | |
(validity vs. synonymy). All classes are additionally annotated with | |
details on scientific name, taxonomic authorship, and source. Taxonomic | |
names for birds are mainly compiled from ITIS and the International | |
Ornithological Congress (IOC) World Bird List, for moths and butterflies | |
mainly from GBIF, both lists being classified and annotated accordingly. | |
We intend to cross-link our taxonomy ontologies with the Environment | |
Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype | |
Ontology (FLOPO). For moths and butterflies we started to design the | |
Lepidoptera Anatomy Ontology (LepAO) on the basis of the already | |
available Hymenoptera Anatomy Ontology (HAO). LepAO is planned to be | |
interoperable with other ontologies in the framework of the OBO foundry. | |
A main modification of HAO is the inclusion of German anatomical terms | |
from published glossaries that we add as scientific and vernacular | |
synonyms to make use of already available identifiers (URIs) for | |
corresponding English terms. International collaboration with the | |
founders of HAO and teams focusing on other insect orders such as | |
beetles (ColAO) aims at development of a unified Insect Anatomy | |
Ontology. By restricting itself to terms applicable to all insects, the
unified Insect Anatomy Ontology is intended to establish a basis for
accelerating the design of more specific anatomy ontologies for any
particular insect order. The advancement of such ontologies aligns with
current needs to make knowledge accumulated in descriptive studies on | |
the systematics of organisms accessible to other domains. In the context | |
of BIOfid, our ontologies provide exemplars of how semantic queries over
as-yet-untapped data relevant to biodiversity studies can be achieved for
literature in non-English languages. Furthermore, BIOfid will serve as
an open access platform for professional international journals | |
facilitating non-commercial publishing of biodiversity and | |
biodiversity-related data. | |
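A minimal sketch (using the rdflib Python library, with an illustrative
namespace) of the kind of enrichment described above: annotating a taxon
class with a German vernacular name as a language-tagged label:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS
    from rdflib.namespace import OWL, SKOS

    EX = Namespace("http://example.org/biofid/")  # illustrative namespace

    g = Graph()
    taxon = EX["Quercus_robur"]
    g.add((taxon, RDF.type, OWL.Class))
    g.add((taxon, RDFS.label, Literal("Quercus robur", lang="la")))
    g.add((taxon, SKOS.altLabel, Literal("Stieleiche", lang="de")))  # vernacular

    print(g.serialize(format="turtle"))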
Abstract | |
Field data collection by Citizen Scientists has been hugely assisted by | |
the rapid development and spread of smartphones, as well as apps that
make use of the integrated technologies contained in these devices. We | |
can improve the quality of the data by increasing utilisation of the | |
device in-built sensors and improving the software user-interface. | |
Improvements to data timeliness can be made by integrating directly with | |
national and international biodiversity repositories, such as the Atlas | |
of Living Australia (ALA). | |
I will present two Citizen Science apps that we developed for the | |
conservation of two of Australia's iconic species -- the koala and the | |
echidna. First is the Koala Counter app used in the Great Koala Count 2 | |
-- a two-day Blitz-style population census. The aim was to improve both | |
the recording of citizen science effort as well as to improve the | |
recording of "absence" data which would improve population modelling. | |
Our solution was to increase the transparent use of the phone's sensors as
well as to provide an easy-to-use interface. Second is the
EchidnaCSI app -- an observational tool for collecting sightings and | |
samples of echidna. | |
From a software developer's perspective, I will provide details on | |
multi-platform app development as well as collaboration and integration | |
with the Australian national biodiversity repository -- the Atlas of | |
Living Australia. Preliminary analysis regarding data quality will be | |
presented along with lessons learned and paths for future research. I | |
also seek feedback and further ideas on possible enhancements or | |
modifications that might usefully be made to improve these techniques. | |
Abstract | |
Scratchpads are an online Virtual Research Environment (VRE) for | |
biodiversity scientists, allowing anyone to share their data and create | |
their own research networks (http://scratchpads.eu/). In operation since | |
2007, the platform has supported more than 1,000 communities in their | |
efforts to share, manage and aggregate information on the natural world. | |
Funded through a series of European Commission and United Kingdom | |
research council grants, the platform reached a height of popularity in | |
2014 with more than 14,500 users, but high levels of usage, coupled with | |
the difficulty of sustaining external funding, led to a significant | |
decline in the quality of service provision and support available to the | |
project. Consequently, the Scratchpads service was closed to new | |
communities in October 2016 and was managed on an essential care and | |
maintenance basis until new permanent funding became available in | |
December 2017. Despite these challenges, the Scratchpad system continues | |
to be used by a loyal community of taxonomists and systematists. As part | |
of our efforts to stabilise the platform and develop a sustainable | |
future for its users, we present our findings from an in-depth analysis | |
of Scratchpad usage metrics and user behaviour. We investigate the | |
growth of the Scratchpads since their inception; how global taxonomic | |
concepts have been generated, used and adapted; the geographical and | |
taxonomic coverage of Scratchpads; the functionality most popular with | |
users, and those features that failed to gain traction with the | |
community; and finally how aggregated data was used and modified by | |
select user communities. Our presentation examines the challenges of | |
maintaining a complex digital project once funding expires and the | |
initial project team disperses. We conclude with a summary of the | |
Scratchpad software development roadmap based on this quantitative | |
analysis of user behaviour. This analysis is informing the future of the
Scratchpads system and identifying how VREs for the biodiversity data
community might be developed to provide a more integrated and | |
sustainable solution to the problem of community management for | |
biodiversity data. | |
Abstract | |
The quality of data produced by citizen science (CS) programs has been | |
called into question by academic scientists, governments, and | |
corporations. Their doubts arise because they perceive CS groups as | |
intruding on the rightful opportunities of standard science and industry | |
organizations, because of a normal skepticism of novel approaches, and | |
because of a lack of understanding of how CS produces data. | |
I propose a three-pronged strategy to overcome these objections and | |
improve trust in CS data. | |
Develop methods for CS programs to advertise their efforts in data
quality control and quality assurance (QC/QA). As a first step, PPSR
Core could incorporate a field that would allow programs to point to
webpages that document the QC/QA practices of each program. It is my
experience that many programs think carefully about data quality, but | |
the CS community currently lacks an established protocol to share this | |
information. | |
Define and implement best practices for generating biodiversity data | |
using different methods. Wiggins et al. (2011) published a list of
approaches that can be used for QC/QA in CS projects, but how these
approaches should be implemented has not been systematically | |
investigated. | |
Measure and report data quality. If one takes the point of view that | |
citizen science is akin to a new category of scientific instruments, | |
then the ideas of instrument measurement and calibration can be applied
to CS. Scientists are well aware that any instrument needs to be calibrated
before its efficacy can be established. However, because CS is a new
approach, the specific procedures needed for different kinds of programs
are only now being worked out.
The strategy outlined above faces some specific challenges. Citizen | |
science biodiversity programs must address two important problems that | |
standard scientific entities encounter when sampling and monitoring | |
biodiversity. The first is correctly identifying species. For citizens | |
this can be a problem because they often do not have the training and | |
background of trained scientists. Likewise, it may be difficult for CS
projects to manage updating and maintaining the taxonomies of the | |
species being investigated. A second set of challenges is the diverse | |
kinds of biodiversity data collected by CS programs. For instance,
Notes from Nature decodes the labels of museum specimens, Snapshot
Serengeti identifies species of large mammals from camera trap
photographs, iNaturalist collects images of species and then has a
crowdsourced identification process, while eBird collects observations
of birds that are immediately filtered with computer algorithms for
review by the observer and, if subsequently flagged, reviewed by a local
expert. Each of these programs likely requires a different set of best | |
practices and methods to measure data quality. | |
Abstract | |
Pl@ntNet is an international initiative which was the first to
combine the force of citizen networks with automated
identification tools based on machine learning technologies (Joly et al. | |
2014). Launched in 2009 by a consortium involving research institutes in | |
computer sciences, ecology and agriculture, it was the starting point of | |
several scientific and technological productions (Goëau et al. 2012) | |
which finally led to the first release of the Pl@ntNet app (iOS in
February 2013 (Goëau et al. 2013) and Android (Goëau et al. 2014) the | |
following year). Initially based on 800 plant species, the app was | |
progressively enlarged to thousands of species of the European, North | |
American and tropical regions. Nowadays, the app covers more than 15 000 | |
species and is adapted to 22 regional and thematic contexts, such as the | |
Andean plant species, the wild salads of southern Europe, the indigenous | |
trees species of South Africa, the flora of the Indian Ocean Islands, | |
the New Caledonian Flora, etc. The app is translated into 11 languages
and is used by more than 3 million end-users all over the world,
mostly in Europe and the US. | |
The analysis of the data collected by Pl@ntNet users, which represent
more than 24 million observations to date, has a high potential
for different ecological and management questions. A recent work | |
(Botella et al. 2018), in particular, showed that the stream of
Pl@ntNet observations could allow a fine-grained and regular monitoring
of some species of interest such as invasive ones. However, this | |
requires cautious consideration of the contexts in which the
application is used. In this talk, we will synthesize the results of | |
this study and present another one related to phenology. Indeed, as the | |
phenological stage of the observed plants is also recorded, these data | |
offer a rich and unique material for phenological studies at large | |
geographical or taxonomical scale. We will share preliminary results | |
obtained on some important pantropical species (such as Melia
azedarach L. and Lantana camara L.), for which we have detected
significant intercontinental phenological patterns in the project
data.
Abstract | |
Many organisations running citizen science projects lack access to, or
the knowledge and means to develop, databases and apps for their
projects. Some are also concerned about long-term data management and | |
also how to make the data that they collect accessible and impactful in | |
terms of scientific research, policy and management outcomes. To solve | |
these issues, the Atlas of Living Australia (ALA) has developed | |
BioCollect. BioCollect is a sophisticated yet simple-to-use tool that
has been built in collaboration with hundreds of real users who are | |
actively involved in field data capture. It has been developed to | |
support the needs of scientists, ecologists, citizen scientists and | |
natural resource managers in the field-collection and management of | |
biodiversity, ecological and natural resource management (NRM) data. | |
BioCollect is a cloud-based facility hosted by the ALA and also includes | |
associated mobile apps for offline data collection in the field. | |
BioCollect provides form-based structured data collection for: | |
Ad-hoc survey-based records; | |
Method-based systematic structured surveys; and | |
Activity-based projects such as natural resource management intervention | |
projects (e.g. revegetation, site restoration, seed collection, weed and
pest management, etc.). | |
This session will cover how BioCollect is being used for citizen science | |
in Australia and some of the features of the tool. | |
Abstract | |
eBird is a global citizen science project that gathers observations of | |
birds. The project has been making a considerable contribution to the | |
collection and sharing of bird observations, even in the data-poorest | |
countries, and is accelerating the accumulation of bird records | |
globally. On 22 March 2018 eBird surpassed ½ billion bird observations. | |
A primary component of ensuring the best quality data is the network of | |
more than 1300 volunteer reviewers who scour incoming data for accuracy. | |
Reviewers provide active feedback to participants on everything from | |
bird identification to best practices for data collection. Since eBird's | |
inception in 2002, almost 23 million observations have been reviewed, | |
requiring more than 190,000 hours of effort by reviewers. In this | |
presentation we review how eBird recruits expert reviewers, describe | |
their responsibilities, and offer some insight in new developments to | |
improve the reviewing process. | |
How are reviewers recruited? There are three primary methods used
to identify new reviewers. First, if we don't have any active
participants in a region (e.g., Kamchatka, Russia) eBird staff search
birding listservs to find an individual who is reporting a lot of
high-quality observations from the area. We then contact those | |
individuals and offer them the opportunity to review records for the | |
region. This option has the lowest likelihood of success. Second, if an | |
individual is submitting a lot of records to eBird from a region that | |
needs a reviewer we contact them and request their participation. Third, | |
in much of the world eBird has partner groups. These partner | |
organizations (e.g., Taiwan, Spain, India, Portugal, Australia, and all | |
of the Western Hemisphere) recruit their own reviewers. The third method | |
is the most effective way to gain expert participation. | |
What does a reviewer do? eBird reviewers work to improve eBird data in | |
three primary areas. First, they develop and manage the eBird checklist | |
filters for a region. These filters generate a checklist of birds for a | |
particular time and location, and determine what records get flagged for | |
further review. Second, if an eBird participant tries to report a | |
species that is not on the checklist, or if the number of individuals of | |
a species exceeds the filter limit, then these records get flagged for | |
review. Reviewers contact the observer and request further | |
documentation. Currently, 57% of all records that are evaluated by | |
reviewers are validated. Finally, eBird reviewers check whether the
participant is eBirding correctly: that is, whether they are correctly
filling out the information on when, where, and how they went birding. It has
been our experience that different types of reviewers are required to | |
effectively review eBird submissions: those who are good at reviewing | |
bird records and those who are good at educating observers on how to | |
participate. | |
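A minimal sketch of how such a checklist filter flags records follows;
the data structures are illustrative, not eBird's actual implementation:

```python
# Sketch: flag species that are not on the regional checklist, or whose
# reported count exceeds the filter limit. `filter_limits` maps each
# expected species to a maximum count and is purely illustrative.

def flag_for_review(observations, filter_limits):
    """Return (species, count, reason) tuples needing reviewer attention."""
    flagged = []
    for species, count in observations.items():
        if species not in filter_limits:
            flagged.append((species, count, "not on checklist"))
        elif count > filter_limits[species]:
            flagged.append((species, count, "exceeds filter limit"))
    return flagged

filter_limits = {"Mallard": 500, "Gyrfalcon": 1}
observations = {"Mallard": 12, "Gyrfalcon": 3, "Smew": 1}
print(flag_for_review(observations, filter_limits))
# [('Gyrfalcon', 3, 'exceeds filter limit'), ('Smew', 1, 'not on checklist')]
```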
What are future plans? eBird will move towards more effective reviewer | |
teams, where the volume of observations can be split amongst a number of | |
individuals with different strengths, allowing identification experts to | |
focus on observation-level ID issues, and strong communicators to focus
on working with contributors on checklist-level best practices. | |
Currently, a single eBird review platform handles a broad array of | |
different reviewing functions. It is our intent to split some of these | |
functions into multiple platforms. For example, right now all review | |
happens at the database level of the 'observation': a record of a taxon | |
at a date and location. Plans are underway to develop tools that will | |
allow reviewers to work at the entire checklist level (i.e., to more | |
easily review the accuracy of how all the observations during a | |
checklist event were submitted), which will enable much more effective | |
review of checklist-level data quality concerns. | |
Abstract | |
Volunteers, researchers and citizen scientists are important | |
contributors to observation and monitoring databases. Their | |
contributions thus become part of a global digital data pool that forms
the basis for important and powerful tools for conservation, research, | |
education and policy. With the data contributed by citizen scientists | |
also come concerns about data completeness and quality. For data | |
generated by citizen scientists taxonomic bias effects, where certain | |
species (groups) are underrepresented in observations, are even stronger | |
than for professionally collected data. Identification tools that help | |
citizen scientists to access more difficult, underrepresented groups, | |
can help to close this gap. | |
We are exploring the possibilities of using artificial intelligence for | |
automatic species identification as a tool to support the registration | |
of field observations. Our aim is to offer nature enthusiasts the | |
possibility of automatically identifying species, based on photos they | |
have taken as part of an observation. Furthermore, by allowing them to | |
register these identifications as part of the observation, we aim to | |
enhance the completeness and quality of the observation database. We | |
will demonstrate the use of automatic species recognition as part of the | |
process of observation registration, using a recognition model that is | |
based on deep learning techniques. | |
We investigated automatic species recognition using deep learning
models trained with observation data of the popular website | |
Observation.org (https://observation.org/). At Observation.org data | |
quality is ensured by a review process of all observations by experts. | |
Using the pictures and corresponding validated metadata from their | |
database, models were developed covering several species groups. These | |
techniques were based on earlier work that culminated in ObsIdentify, a
free offline mobile app for identifying species based on pictures taken
in the field. The models are also made available as an API web service, | |
which allows for identification by submitting a photo through common
HTTP communication -- essentially like uploading it through a webpage.
This web service was implemented in the observation entry workflows of | |
Observation.org. By providing an automatically generated taxonomic | |
identification with each image, we expect to stimulate existing citizen | |
scientists to generate a larger quantity of, and more biodiverse,
observations. Additionally, we hope to motivate new citizen scientists to
start contributing. | |
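A minimal sketch of such an HTTP call follows; the endpoint URL and
response fields are hypothetical, not the service's actual API:

```python
# Sketch: submit a photo to an image-recognition web service and print
# the returned candidate identifications. URL and fields are placeholders.
import requests

API_URL = "https://example.org/identify"  # hypothetical endpoint

with open("observation_photo.jpg", "rb") as photo:
    response = requests.post(API_URL, files={"image": photo})
response.raise_for_status()

for match in response.json().get("predictions", []):
    print(match["species"], match["probability"])
```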
Additionally, we investigated the use of image recognition for the | |
identification of additional species in the photo other than the primary | |
subject, for example the identification of the host plant in photos of | |
insects. The Observation.org database contains many such photos that
are associated with a single species observation, while additional
species present in the photo remain unidentified.
Combining object detection to detect individual species with species | |
recognition models opens up the possibility of automatically identifying | |
and counting these species, enhancing the quality of the observations. | |
In the presentation we will present the initial results of this | |
application of deep learning technology, and discuss the possibilities | |
and challenges. | |
Abstract | |
Specimen labels are written in numerous languages and accurate | |
interpretation requires local knowledge of place names, vernacular names | |
and people's names. In many countries more than one language is in | |
common usage. Belgium, for example, has three official languages. | |
Crowdsourcing has helped many collections digitize their labels and | |
generates useful data for science. Furthermore, direct engagement of the | |
public with a herbarium increases the collection's visibility and | |
potentially reinforces a sense of common ownership. For these reasons we | |
built DoeDat, a multilingual crowdsourcing platform forked from Digivol | |
of the Australian Museum (Figs 1, 2). Some of the useful features we | |
inherited from Digivol include a georeferencing tool, configurable | |
templates, simple project management and individual institutional | |
branding. | |
Running a multilingual website does increase the work needed to set up
and manage projects, but we hope to gain from the broader engagement we | |
can attract. Currently, we are focusing our work on Belgian collections | |
where Dutch and French are the primary languages, but in the future we
may expand our languages when we work on our international collections. | |
We also hope that we can eventually merge our code with that of Digivol, | |
so that we can both benefit from each other's developments.
Abstract | |
The implementation of Citizen Science in biodiversity studies has led | |
the general public to engage in environmental actions and to contribute | |
to the conservation of natural resources (Chandler et al. 2017). | |
Smartphones have become part of the daily lives of millions of people, | |
allowing the general public to collect data and conduct automatic | |
measurements at a very low cost. Indeed, a series of Citizen Science | |
mobile applications have allowed citizens to rapidly record specimen | |
observations and contribute to the development of large biodiversity
databases around the world. Citizen Science applications have a
multitude of purposes, as well as target a variety of taxa, biological | |
questions and geographical regions. | |
Brazil is a megadiverse country that includes many threatened species | |
and biomes. Conservation efforts are urgent and the engagement of the
civil society is critical. Brazilian dry and wet forests are dominated | |
by members of the plant family Bignoniaceae, all of which are | |
characterized by beautiful trumpet-shaped flowers and a big-bang | |
flowering strategy. Species of the Neotropical Bignoniaceae trees are | |
popularly known in Brazil as "Ipê" and are broadly cultivated throughout | |
the country due to the showy flowers and strong wood. Different species | |
have different flower colors, making their identification relatively easy.
The showy and colorful flowers are extremely admired by the local | |
population and the media. Flowering of "Ipês" is triggered by dry | |
climate, lower temperatures and increasing daylight, making this group
an excellent model for phenological and climatic studies involving | |
Citizen Science. | |
Here, we developed a multi-platform mobile application focused on the | |
plant family Bignoniaceae that allows users to contribute phenological | |
data for species from this plant family. More specifically, through this | |
application the user is able to provide data about specimen locations, | |
phenology and date, all of which can be validated by a photograph. This | |
platform is based on React Native, a hybrid app framework that helps
developers reuse code across multiple mobile platforms, making
development much more efficient and keeping effort focused on the user
experience. This technology uses JavaScript as its programming language
and Facebook's React as a basis for development. The system is similar to
other CS apps such as iNaturalist. Namely, observations improve in
ranked quality through positive feedback from the community,
strengthening the network of interactions between users and
encouraging active participation. The application also allows users to
access all previously stored observations and, in turn, to suggest
improvements to a particular observation.
Furthermore, observations without a correct ID can be stored until | |
others can suggest a correct identification, maximizing the value of | |
individual observations and data gathered. | |
An important aspect of this mobile application is the participation of a | |
network of experts on this plant family, allowing a rapid and accurate | |
verification of individual observations. This team of Bignoniaceae | |
experts is also able to make full use of the data gathered by | |
correlating climate and phenological patterns. Results from these | |
analyses are provided to the citizens gathering the data which will, in | |
turn, stimulate the collection of new data, especially in poorly sampled | |
locations. This is a very dynamic mobile application, that aims to | |
engage the civil society with true scientific research, stimulating the | |
management of natural resources and conservation efforts. Through this | |
mobile app, we hope to engage the general public into biodiversity | |
studies by improving their knowledge on an iconic group of Brazilian | |
plants, while contributing data for scientific studies. The system is | |
expected to be released in May and will be available at | |
ipesdobrasil.org.br. | |
Abstract | |
The Online Pollen Catalogs Network (RCPol) (http://rcpol.org.br) was | |
conceived to promote interaction among researchers and the integration | |
of data from pollen collections, herbaria and bee collections. In order | |
to structure RCPol work, researchers and collaborators have organized | |
information on Palynology in four branches: palynoecology, | |
paleopalynology, palynotaxonomy and spores. This information is | |
collaboratively digitized and managed using standardized Google | |
Spreadsheets. These datasets are assessed by the RCPol palynology | |
experts and when a dataset is compliant with the RCPol data quality | |
policy, it is published to http://chaves.rcpol.org.br. | |
Data quality assessment used to be performed manually by the experts and | |
was time-consuming and inconsistent in detecting data quality problems
such as incomplete and inconsistent information. In order to support | |
data quality assessment in a more automated and effective way, we are | |
developing a data quality tool which implements a series of mechanisms | |
to measure, validate and improve completeness, consistency, conformity, | |
accessibility and uniqueness of data, prior to a manual expert | |
assessment. The system was designed according to the conceptual | |
framework proposed by Task Group 1 of the Biodiversity Data Quality | |
Interest Group (Veiga et al. 2017). For each sheet in the Google
Spreadsheet, the system generates a set of assertions of measures, | |
validations and amendments for the records (rows) and datasets (sheets), | |
according to a profile defined for RCPol. The profile follows the | |
policies of data quality measurement, validation and enhancement. The | |
data quality measurement policy encompasses the dimensions of
completeness, consistency, conformity, accessibility and uniqueness. | |
RCPol uses a quality assurance approach: only data that are compliant | |
with all the quality requirements are published in the system. | |
Therefore, its data quality validation policy only considers datasets | |
with 100% completeness, consistency, conformity, accessibility and | |
uniqueness. In order to improve the quality in each relevant dimension, | |
a set of enhancements was defined in the data quality enhancement | |
policy. Based on this RCPol profile, the system is able to generate | |
reports that contain measures, validations and amendments assertions | |
with the method and tool used to generate the assertion. This web-based | |
system can be tested at http://chaves.rcpol.org.br/admin/data-quality | |
with the dataset | |
https://docs.google.com/spreadsheets/u/1/d/1gH0aa2qqnAgfAixGom3Gnx6Qp91ZvWhUHPb_QeoIreQ.
This system is able to assure that only data
compliant with the data quality profile defined by RCPol are fit for use | |
and can be published. | |
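The publication gate itself can be sketched minimally as follows; the
dimension scores are illustrative inputs, not output of the real tool:

```python
# Sketch: a dataset is fit for publication only when every measured
# dimension of the RCPol profile reaches 100%.

REQUIRED_DIMENSIONS = ("completeness", "consistency", "conformity",
                       "accessibility", "uniqueness")

def fit_for_publication(dataset_measures):
    """Return True only if all required dimensions are at 100%."""
    return all(dataset_measures.get(dim) == 100.0
               for dim in REQUIRED_DIMENSIONS)

measures = {"completeness": 100.0, "consistency": 98.5, "conformity": 100.0,
            "accessibility": 100.0, "uniqueness": 100.0}
print(fit_for_publication(measures))  # False: consistency is below 100%
```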
This system contributes significantly to decreasing the workload of the | |
experts. Some data may still contain values that cannot be easily | |
automatically assessed, e.g. validating whether the content of an image
matches the respective scientific name, so manual assessment by experts remains
necessary. After the system reports that data are compliant with the | |
profile, a manual assessment must be performed by the experts, using the | |
data quality report as support, and only after that will the data be | |
published. The next steps include archiving the data quality reports
in a database, improving the web interface to enable searching and
sorting of assertions, and providing a machine-readable interface for
the data quality reports.
Abstract | |
Task Group 2 of the TDWG Data Quality Interest Group aims to provide a | |
standard suite of tests and resulting assertions that can assist with | |
filtering occurrence records for as many applications as possible. | |
Currently 'data aggregators' such as the Global Biodiversity Information | |
Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run | |
their own suite of tests over records received and report the results of | |
these tests (the assertions): there is, however, no standard reporting
mechanism. We reasoned that the availability of an internationally
agreed set of tests would encourage implementations by the aggregators, | |
and at the data sources (museums, herbaria and others) so that issues | |
could be detected and corrected early in the process. | |
All the tests are limited to Darwin Core terms. The ~95 tests, refined
from over 250 in use around the world, were classified into four output
types: validations, notifications, amendments and measures. Validations | |
test one or more Darwin Core terms, for example, that
dwc:decimalLatitude is in a valid range (i.e. between -90 and +90 | |
inclusive). Notifications report a status that a user of the record | |
should know about, for example, if there is a user-annotation associated | |
with the record. Amendments are made to one or more Darwin Core terms | |
when the information across the record can be improved, for example, if | |
there is no value for dwc:scientificName, it can be filled in from a | |
valid dwc:taxonID. Measures report values that may be useful for | |
assessing the overall quality of a record, for example, the number of | |
validation tests passed. | |
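Three of these output types can be pictured with a minimal sketch; the
field names follow Darwin Core, but the functions and return values are
simplified illustrations, not the standard's serialization:

```python
# Sketch: one validation, one amendment, and one measure over a record.

def validate_latitude(record):
    """Validation: dwc:decimalLatitude must lie between -90 and +90."""
    try:
        return -90 <= float(record.get("decimalLatitude", "")) <= 90
    except ValueError:
        return False

def amend_scientific_name(record, taxa_by_id):
    """Amendment: fill an empty dwc:scientificName from a valid dwc:taxonID."""
    if not record.get("scientificName") and record.get("taxonID") in taxa_by_id:
        record["scientificName"] = taxa_by_id[record["taxonID"]]
    return record

def measure_validations_passed(record, validations):
    """Measure: the number of validation tests the record passes."""
    return sum(1 for test in validations if test(record))

record = {"decimalLatitude": "95.0", "taxonID": "t1", "scientificName": ""}
record = amend_scientific_name(record, {"t1": "Apis mellifera"})
print(validate_latitude(record))                                # False
print(measure_validations_passed(record, [validate_latitude]))  # 0
```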
Evaluation of the tests was complex and time-consuming, but the | |
important parameters of each test have been consistently documented. | |
Each test has a globally unique identifier, a label, an output type, a | |
resource type, the Darwin Core terms used, a description, a dimension | |
(from the Framework on Data Quality from TG1), an example, references, | |
implementations (if any), test-prerequisites and notes. For each test, | |
generic code is being written that should be easy for institutions to | |
implement -- be they aggregators or data custodians. | |
A valuable product of the work of TG2 has been a set of general | |
principles. One example is "Darwin Core terms are either: | |
literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed | |
capable of validation, | |
open-ended (e.g., dwc:behavior) and cannot be assumed capable of | |
validation, or | |
bounded by an agreed vocabulary or extents, and therefore capable of | |
validation (e.g., dwc:countryCode)". | |
Another is "criteria for including tests is that they are informative, | |
relatively simple to implement, mandatory for amendments and have power | |
in that they will not likely result in 0% or 100% of all record hits." A | |
third: "Do not ascribe precision where it is unknown." | |
GBIF, the ALA and iDigBio have committed to implementing the tests once | |
they have been finalized. We are confident that many museums and | |
herbaria will also implement the tests over time. We anticipate that | |
demonstration code and a test dataset that will validate the code will | |
be available on project completion. | |
Abstract | |
In the process of sharing information, it is of the highest importance
that we utilize common codes and signifiers, so that communication is
effective. This process presents a series of complexities that are | |
related to capturing and transmitting the meaning of the information | |
despite homonymy, polysemy and synonymy. Biodiversity data sharing is | |
not exempt from these challenges and understanding the meaning often | |
requires expert knowledge. For communication to be effective, and | |
therefore for data to be of maximal re-use, we need common vocabularies | |
that unequivocally refer us to the same concepts. | |
The community has agreed upon some vocabularies to structure shared | |
information, i.e., biodiversity data standards such as the Darwin Core | |
standard (Wieczorek et al. 2012). The terms in Darwin Core can be
thought of as the names of the columns in a spreadsheet. For example, | |
there are terms such as genus, stateProvince, sex, etc. This allows us | |
to capture and share information which we agree belongs under one of | |
those terms. However, we have not yet reached an agreement on how to | |
express the permitted values under all those terms, that is, | |
vocabularies of values. As a simple example, we agree that if we have a | |
record of an organism that is a female, we will share the fact that it | |
is a female under the "sex" term, but we could represent female with the | |
values "female", "fem.", "f.", and other possible abbreviation and | |
language variants. Other more complex examples, bound to expert | |
knowledge, include biological taxonomies and how we name distinct | |
species and species concepts. | |
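A minimal sketch of mapping such variants onto a single controlled
value follows; the variant list is illustrative only:

```python
# Sketch: normalise free-text values of the dwc:sex term to a
# controlled vocabulary, including abbreviations and language variants.

SEX_VOCABULARY = {
    "female": "female", "fem.": "female", "f.": "female", "f": "female",
    "weiblich": "female",  # a German language variant of the same concept
    "male": "male", "m.": "male", "m": "male",
}

def normalise_sex(value):
    """Return the controlled value, or None for an unknown variant."""
    return SEX_VOCABULARY.get(value.strip().lower())

print(normalise_sex("Fem."))     # female
print(normalise_sex("unknown"))  # None
```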
While many vocabularies exist in the community, we currently do not | |
possess a full suite of vocabularies of values that apply uniformly | |
across the biodiversity data community and there is no single repository | |
to explore the available resources. While some of the available | |
vocabularies are discipline-specific, many that could be applied more | |
broadly remain independent and scattered. Additionally, similar lists of | |
terms that refer to the same concepts can be found in different | |
languages, but disconnected from one another. | |
The lack of or non-adherence to vocabularies of values constitutes a | |
data quality issue, as the heterogeneity in the data renders data less | |
discoverable and difficult to use. Capturing information in myriad ways | |
risks being incomplete and inaccurate in our transmission of | |
information. If we cannot be certain that a particular value | |
unambiguously refers to a particular concept, we cannot assert that a | |
record containing that value could reliably be used for a particular | |
purpose. In this context, the construction and use of vocabularies of | |
values, including the explicit declaration of usage, is a data quality | |
issue. | |
From the TDWG Data Quality Interest Group we have begun to tackle this | |
problem, with the aim of creating a suitable environment for thought and | |
development of vocabularies of values. Accordingly, a new task group has | |
been constituted, whose main goals are to: | |
prepare a scoping document in which we will determine the types of | |
vocabularies needed (including multi-lingual approaches) and the | |
strategy for organizing the construction and/or management of | |
new/existing vocabularies; | |
develop a common repository to store vocabularies and/or link to | |
existing ones; | |
develop best practices for building TDWG vocabularies; and | |
develop an exemplary vocabulary following the standard format. | |
This will provide the community with a framework to work on and build | |
upon vocabularies of values in a way that would allow better | |
understanding and maximal interoperability. | |
Abstract | |
As the world strives towards achieving Sustainable Development Goals, | |
development planners both at national and local levels have now come to | |
understand the importance of informed decision-making. Natural resources | |
management is one of the areas where careful planning is required to | |
ensure sustainable use of and maximum benefit from the services we get | |
from ecosystems. | |
In developing countries, the scarcity of resources (both in terms of | |
funding and skills) constitutes the main hindrance to the generation of | |
accurate and timely data and information that would guide planning and | |
implementation of development strategies. As a result, decisions are | |
taken on an ad-hoc basis and without possibility of appreciating the | |
long-term effect of these decisions. | |
In that regard, Albertine Rift Conservation Society (ARCOS) has | |
developed a participatory and cost-effective framework to monitor the | |
status and trends of biodiversity and ecosystem services at the | |
landscape level and to assess the socio-economic conditions that affect | |
them. | |
The approach, termed "Integrated Landscape Assessment and Monitoring --
ILAM", uses the Driver-Pressure-State-Impact-Response model and applies a
simple indicators framework that allows teams to collect needed data in
a rapid and cost-effective way (Burkhard and Müller 2008).
This approach is flexible enough to be adaptable to the available time | |
and funding resources and is therefore very suitable to be applied in | |
the context of the developing world including east-African countries. | |
This flexibility ranges from the use of GIS and remote sensing techniques
combined with thorough biodiversity field surveys to simple rapid | |
assessment of key indicators using smaller teams and for short periods | |
of time in the field. | |
Since 2013, ARCOS has been biennially conducting ILAM studies in its | |
five focal landscapes in Rwanda, Uganda and Burundi and the results have | |
influenced major decisions such as the designation of at least two | |
wetlands as Ramsar sites and the upgrade of one forest as a national | |
park. | |
In addition to this, other planning processes have been informed by the | |
results of these studies, such as the process to develop the new Rwandan | |
National Strategy for Transformation for 2017--2024 and the development | |
of the districts' strategic plans for 2018--2024. | |
Currently the biodiversity data generated through these studies is being | |
published through the Global Biodiversity Information Facility (GBIF) for wider
access by researchers and educators in the region and a portal, the | |
ARCOS Biodiversity Information Management System (ARBIMS), has been | |
established to facilitate sharing of data and information to guide | |
planning and decision-making in the region. | |
Abstract | |
Species-level observational data comprise the largest and | |
fastest-growing part of the Global Biodiversity Information Facility | |
(GBIF). The largest single contributor of species observations is eBird, | |
which so far has contributed more than 361 million records to GBIF. | |
eBird engages a vast network of human observers (citizen-scientists) to | |
report bird observations, with the goal of estimating the range, | |
abundance, habitat preferences, and trends of bird species at high | |
spatial and temporal resolutions across each species' entire life-cycle. | |
Since its inception, eBird has focused on improving the data quality of | |
its observations, primarily focused in two areas: | |
ensuring that participants describe how they gathered their
observations, and
ensuring that all observations are reviewed for accuracy.
In this presentation I will review how this is done in eBird. | |
Standardized Data Collection. eBird gathers bird observations based on
how bird watchers typically observe birds, with the unit of data
collection being a "checklist" of zero or more species, including a
count of individuals for each species observed. Participants choose the location
where they made their observations and submit their checklists via | |
Mobile Apps (50% of all submissions) or the website (50% of all | |
submissions). All checklists are submitted in a standard format | |
identifying where, how, and with whom they made their observations. | |
Mobile apps precisely record locations, the track taken, and the | |
distance traveled while making the observations. The start time and
duration of surveys are also recorded. All observers must report whether | |
they reported all the birds they detected and identified, which allows | |
analysts to infer absence of birds if they were not reported. All data | |
are stored within an Oracle data management framework. | |
Data Accuracy. The most significant data quality challenge for species | |
observations is detecting and correctly identifying organisms to | |
species. The issue involves how to handle both false positives -- the
misidentification of an observed organism -- and false negatives -- failing
to report a species that was present. The most egregious false positives | |
can be identified as anomalies that fall outside the norm of occurrence | |
for a species at a particular time or space. However, false positives | |
can also be misidentifications of common species. These challenges are | |
addressed by: | |
Data-driven filters. eBird's existing data can identify and flag | |
potentially erroneous records at increasingly fine spatial, temporal, | |
and user-specific scales. These filters can identify outliers and likely | |
errors, which are the foundation of the eBird review process. By using | |
the vetted data to identify outliers, data quality checks run against | |
expected occurrence probabilities at very fine scales and identify | |
anomalies during data submission (including on mobile devices). | |
Incorporate observer expertise scores. Observer differences are the | |
largest source of variability in eBird data. Assessment of observer | |
metrics, and the inclusion of these data in species distribution models, | |
improves analysis output and model performance. | |
Expert reviewer network. More than 2000 volunteers review records | |
identified by the data-driven filters and contact data submitters to | |
confirm their observations. The existing data quality process functions | |
globally. Currently the approach is focused on misidentified birds, but
in the future it will also cover collection event issues (e.g., issues
with protocol, location, or methodology), sensitive species, exotic
species, and better handling of widely observed individual rarities.
Additional tools will also be developed to help editors improve
efficiency and better prioritize review.
In 2017, 4,107,757 observations representing 4.6% of all eBird records | |
submitted were flagged for review by the data-driven filters. Of these
records, 57.4% were validated and 42.6% were invalidated.
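A minimal sketch of such a data-driven check follows, with illustrative
probabilities and threshold rather than eBird's actual model outputs:

```python
# Sketch: flag reported species whose expected occurrence probability at
# a given location and week falls below a threshold.

def flag_anomalies(checklist, occurrence_prob, threshold=0.001):
    """Return species whose reported presence is anomalous."""
    return [species for species in checklist
            if occurrence_prob.get(species, 0.0) < threshold]

# Illustrative expected probabilities for one location and week.
occurrence_prob = {"Snowy Owl": 0.0002, "American Robin": 0.41}
print(flag_anomalies(["Snowy Owl", "American Robin"], occurrence_prob))
# ['Snowy Owl'] -- held for expert review
```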
Abstract | |
From 81 study sites across the United States, the US National Ecological
Observatory Network (NEON) generates >75,000 samples per year. Samples
range from soil and dust deposition material to tissue samples (e.g.,
from small mammals and fish), DNA extracts, and whole organisms (e.g.,
ground beetles and ticks). Samples are collected, processed, and documented
according to protocols that are standardized across study sites and | |
according to the needs of the ecological research community for future | |
studies. NEON has faced numerous challenges with managing data related | |
to these many diverse physical samples, particularly when data are | |
gathered at numerous steps throughout processing. Here, we share these | |
challenges as well as solutions, including innovative semantically | |
driven software tools and processing pipelines that manage data from | |
each sample's point of collection to its ultimate fate (consumption,
archive facility, or partnering data repository) while maintaining links | |
across sample hierarchies. | |
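A minimal sketch of maintaining such links follows; identifiers and
fates are illustrative, not NEON's actual pipeline:

```python
# Sketch: each derived sample keeps a link to its parent, so the chain
# from ultimate fate back to the point of collection can be recovered.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    sample_id: str
    parent_id: Optional[str]  # the sample this one was derived from
    fate: str                 # consumption, archive, or partner repository

samples = {
    "SOIL-001": Sample("SOIL-001", None, "archive facility"),
    "DNA-001": Sample("DNA-001", "SOIL-001", "partner data repository"),
}

def lineage(sample_id):
    """Walk the hierarchy back to the originally collected sample."""
    chain = []
    while sample_id is not None:
        chain.append(sample_id)
        sample_id = samples[sample_id].parent_id
    return chain

print(lineage("DNA-001"))  # ['DNA-001', 'SOIL-001']
```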
Abstract | |
What is a provider (or consumer) of biodiversity data to think when one | |
quality assessment tool asserts that a particular problem exists in | |
their data, while a different tool asserts that this problem is not | |
present? Is there a problem with their data? Is there a problem with one | |
of the tools? The Biodiversity Data Quality Task Group 2 is developing a | |
suite of standardized descriptions of tests (validations, measures, | |
amendments) of biodiversity data, implementations of which would be | |
expected to provide consistent assertions about a particular data set so | |
that input of identical data sets into two different test suite | |
implementations will produce the same results (for some meaning of "the | |
same"). | |
Development of standard test definitions is a big step in the direction | |
of consistency. More is needed. Clear and detailed specifications for | |
each test will help. For example, data might have suitable quality for | |
global change analysis if collecting dates have a temporal resolution of | |
one year or less. One implementer's test may check if the event date
has a duration of 365 days or less, another might account for leap days, | |
another might test if the data can be unambiguously binned into single | |
years. For some data, each implementation will produce different | |
assertions about the record. If the standard test specification states | |
which of these meanings apply, then correct implementations should make | |
identical assertions. To tell, however, if two implementations of a | |
suite of tests will produce the same result for identical inputs we need | |
two things, one is a set of tests (of the tests), the other is an | |
understanding of what it means for results to be the same. It is | |
expected that there will be changes in the results of tests of | |
scientific names over time, and that different authorities will have | |
different opinions about that set of scientific names. One element of | |
"the same" is an expectation that results will be the same when test | |
implementations are run at the same time and with the same | |
configuration, but not necessarily otherwise. | |
Consider tests at three levels: First, tests of the internals of a test, | |
separate from the fitness for use framework (Veiga et al. 2017) or | |
serialization of test results. At this first level, unit tests are very | |
appropriate, but these are tightly coupled to the language of | |
implementation and the unit testing framework, and to the internal | |
details of the implementation. Unit tests are very effective for | |
software quality control, but not particularly portable. Second, | |
consider tests of the output of a suite of tests. At this level (of | |
integration tests), we are tightly coupled to both the fitness for use | |
framework and the serialization, and the meaning of "the same" is | |
important. Different software implementations may be expected to have | |
different orders of output for the same input, and human readable | |
comments would be expected to vary (e.g. with internationalization). | |
Identity of machine readable assertions but in varying orders should be | |
tolerable, but this is not easily accomplished. Implementation at this | |
level is difficult. Third, consider tests of the framework output of a | |
particular test. Order becomes unimportant, only machine readable | |
framework assertions can be considered, and this is probably the level | |
to target for testing. Input data for tests could be synthetic, real, or | |
modified real data. Real data has the advantage of being realistic, but | |
it is difficult to find real data which contains single issues. Clean | |
real data into which synthetic error conditions have been introduced is | |
enticing for test purposes, but risks confusion with real data, so I | |
propose some standard values for certain Darwin Core terms for | |
identifying synthetic data. | |
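A minimal sketch of an order-independent comparison at this third level
follows, using a hypothetical simplification of framework assertions:

```python
# Sketch: treat the machine-readable assertions of two implementations
# as multisets, ignoring order and human-readable comments.
from collections import Counter

def same_assertions(output_a, output_b):
    """Compare assertion outputs irrespective of order and comments."""
    key = lambda a: (a["test_id"], a["record_id"], a["status"])
    return Counter(map(key, output_a)) == Counter(map(key, output_b))

a = [{"test_id": "VALIDATION_EVENTDATE", "record_id": "r1",
      "status": "COMPLIANT", "comment": "date parses"}]
b = [{"test_id": "VALIDATION_EVENTDATE", "record_id": "r1",
      "status": "COMPLIANT", "comment": "la date est valide"}]
print(same_assertions(a, b))  # True: comments differ, assertions match
```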
Abstract | |
The ability to communicate and assess the quality and fitness for use of | |
data is crucial to ensure maximum utility and re-use. Data consumers | |
have certain requirements for the data they seek and need to be able to | |
check if a data set conforms with these requirements. Data publishers | |
aim to provide data with the highest possible quality and need to be | |
able to identify potential errors that can be addressed with the | |
available information at hand. The development and adoption of data | |
publication guidelines is one approach to define and meet those | |
requirements. However, the use of a guideline, the mapping decisions, | |
and the requirements a dataset is expected to meet, are generally not | |
communicated with the provided data. Moreover, these guidelines are | |
typically intended for humans only. | |
In this talk, we will present 'whip': a proposed syntax for data
specifications. With whip, one can define column-based constraints for | |
tabular (tidy) data using a number of rules, e.g. how data is structured | |
following Darwin Core, how a term uses controlled vocabulary values, or | |
what the expected minimum and maximum values are. These rules are human- | |
and machine-readable, which communicates the specifications and allows
them to be validated automatically in pipelines for data publication and
quality assessment, such as Kurator. Whip can be formatted as a (YAML)
text file that can be provided with the published data, communicating | |
the specifications a dataset is expected to meet. The scope of these | |
specifications can be specific to a dataset, but can also be used to | |
express expected data quality and fitness for use of a publisher, | |
consumer or community, allowing bottom-up and top-down adoption. As | |
such, these specifications are complementary to the core set of data | |
quality tests as currently under development by the TDWG Biodiversity | |
Data Quality Task Group 2. Whip rules are currently generic, but more
specific ones can be defined to address requirements for biodiversity | |
information. | |
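A minimal sketch in the spirit of whip follows, pairing a small
specification with a toy checker; the rule names are illustrative, and
the actual whip syntax is defined in its own documentation:

```python
# Sketch: column-based constraints expressed in YAML, checked row by row.
import yaml

SPEC = yaml.safe_load("""
sex:
  allowed: [female, male]
individualCount:
  min: 1
  max: 100
""")

def violations(row, spec):
    """Yield (column, rule) pairs for every violated constraint."""
    for column, rules in spec.items():
        value = row.get(column)
        if "allowed" in rules and value not in rules["allowed"]:
            yield column, "allowed"
        if "min" in rules and float(value) < rules["min"]:
            yield column, "min"
        if "max" in rules and float(value) > rules["max"]:
            yield column, "max"

row = {"sex": "fem.", "individualCount": "120"}
print(list(violations(row, SPEC)))
# [('sex', 'allowed'), ('individualCount', 'max')]
```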
Abstract | |
Georeferencing helps to fill in biodiversity information gaps, allowing | |
biodiversity data to be represented spatially to allow for valuable | |
assessments to be conducted. The South African National Biodiversity | |
Institute has embarked on a number of projects that have required the | |
georeferencing of biodiversity data to assist in assessments for | |
redlisting of species and measuring the protection levels of species. | |
Data quality is an important aspect of biodiversity information. Due to
a lack of standardisation in collection and recording methods, historical
biodiversity data collections pose a challenge when it comes to
ascertaining fitness for use or determining the quality of data. The
quality of historical locality information recorded in biodiversity data
collections faces particular scrutiny regarding fitness for use, as this
information is critical in performing assessments. A lack of descriptive
locality information, or ambiguous locality information, renders most
historical biodiversity records unfit for use. Georeferencing should essentially
improve the quality of biodiversity data, but how do you measure the | |
fitness for use of georeferenced data? | |
Through the use of the Darwin Core term coordinateUncertaintyInMeters,
georeferenced data can be queried to investigate and determine the | |
quality of the georeferenced data produced. My presentation will cover | |
the scope of ascertaining georeferenced data quality through the use of | |
the Darwin Core term coordinateUncertaintyInMeters, the impacts of using
a controlled vocabulary in representing the | |
coordinateUncertaintyInMeters, and will highlight how SANBI's | |
georeferencing efforts have contributed to data quality within the | |
management of biodiversity information. | |
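A minimal sketch of such a fitness-for-use query follows, filtering
illustrative records against an uncertainty threshold:

```python
# Sketch: keep only records whose georeference is precise enough for a
# given assessment; records with unknown uncertainty are excluded.

def fit_for_use(records, max_uncertainty_m):
    """Select records with a known uncertainty at or below the threshold."""
    return [r for r in records
            if r.get("coordinateUncertaintyInMeters") is not None
            and r["coordinateUncertaintyInMeters"] <= max_uncertainty_m]

records = [
    {"id": 1, "coordinateUncertaintyInMeters": 250},
    {"id": 2, "coordinateUncertaintyInMeters": 10000},
    {"id": 3, "coordinateUncertaintyInMeters": None},  # unknown precision
]
print(fit_for_use(records, max_uncertainty_m=2000))  # only record 1
```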
Abstract | |
As part of the Biodiversity Information System on Nature and Landscapes | |
(SINP), the French National Natural History Museum has been appointed to | |
develop biodiversity data exchanges by the French ministry in charge of | |
ecology. Given there are, quite literally, thousands of different | |
sources, such a development brings into question the underlying quality | |
of data. To add complexity, there can be several layers of quality: one | |
being appraised by the producer himself, one by a regional node, and one | |
by the national node. | |
The approach to quality issues was addressed by a dedicated working | |
group, representative of biodiversity stakeholders in France. The | |
resulting documents focus on core methodology elements that characterize | |
a data quality process for taxon occurrences only in the first instance | |
(it may be extended to habitats, geology, etc. in the near future).
Three processes are covered, how to ensure: | |
data conformity by checking for the presence of compulsory elements or | |
that a given attribute is of the right type, | |
data consistency by checking information versus other information (for | |
example, an end date has to be later than a start date), | |
and scientific validation, through either manual (use of expertise) or | |
automated (comparison with knowledge databases) means, or even a | |
combined approach that provides users with a quality appraisal of said | |
data. | |
Within the SINP, only data that has passed conformity and consistency | |
tests can be exchanged, whatever the validation level. For
example, should no expert exist for a specific taxon group,
unvalidated data can still be shared.
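A minimal sketch of the first two processes follows, using hypothetical
field names rather than the national exchange standard:

```python
# Sketch: conformity checks presence and type of compulsory elements;
# consistency checks one value against another.
from datetime import date

def conforms(record):
    """Conformity: compulsory elements are present and of the right type."""
    return (bool(record.get("taxonName"))
            and isinstance(record.get("startDate"), date)
            and isinstance(record.get("endDate"), date))

def consistent(record):
    """Consistency: the end date must not precede the start date."""
    return record["startDate"] <= record["endDate"]

record = {"taxonName": "Lutra lutra",
          "startDate": date(2018, 5, 2), "endDate": date(2018, 5, 1)}
print(conforms(record), consistent(record))  # True False: not exchangeable
```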
For scientific validation, two processes are used, one automatic that | |
uses several criteria such as comparison with a national taxonomic | |
reference database (TAXREF), and with species reference maps. The | |
combination of all these elements can be used to automatically flag data | |
for a second, deeper, manual process that allows for further scrutiny in | |
order to reach a conclusive evaluation. This allows experts to work only | |
on "doubtful" data, thus saving time. | |
In the future, other criteria that are currently used with the manual
approach, such as congruity, data scarcity on a given
species, determination difficulty, existence of associated proof | |
(specimen, picture...), knowledge of the ability of the observer, | |
databases on most frequent determination errors etc., could be added to | |
the automatic process. | |
Some elements must be included in the data to allow for comprehensive | |
testing, and have been included in a national data standard so that the | |
result of the validation process can be shared with users, allowing them | |
to judge how the data is fit for their use. | |
The presentation will deal with how such work was undertaken and how
conformity, consistency and scientific validation have been treated and | |
issues solved by the workgroup. For example, there could be a 40 million | |
data record backlog. The presentation will also show how the required | |
elements could be integrated into the French national standard. | |
Abstract | |
The success of Darwin Core and ABCD Schema as flexible standards for | |
sharing specimen data and species occurrence records has enabled GBIF to | |
aggregate around one billion data records. At the same time, other | |
thematic, national or regional aggregators have developed a wide range | |
of other data indexes and portals, many of which enrich the data by | |
interpreting and normalising elements not currently handled by GBIF or | |
by linking other data from geospatial layers, trait databases, etc. | |
Unfortunately, although each of these aggregators has specific strengths | |
and supports particular audiences, this diversification produces many | |
weaknesses and deficiencies for data publishers and for data users, | |
including: incomplete and inconsistent inclusion of relevant datasets; | |
proliferation of record identifiers; inconsistent and bespoke workflows | |
to interpret and standardise data; absence of any shared basis for | |
linked open data and annotations; divergent data formats and APIs; lack | |
of clarity around provenance and impact; etc. | |
The time is ripe for the global community to review these processes. | |
From a technical standpoint, it would be feasible to develop a shared, | |
integrated pipeline which harvested, validated and normalised all | |
relevant biodiversity data records on behalf of all stakeholders. Such a | |
system could build on TDWG expertise to standardise data checks and all | |
stages in data transformation. It could incorporate a modular structure | |
that allowed thematic, national or regional networks to generate | |
additional data elements appropriate to the needs of their users, but | |
for all of these elements to remain part of a single record with a | |
single identifier, facilitating a much more rigorous approach to linked | |
open data. Most of the other issues we currently face around | |
fitness-for-use, predictability and repeatability, transparency and | |
provenance could be supported much more readily under such a model. | |
The key challenges that would need to be overcome would be around social | |
factors, particularly to deliver a flexible and appropriate governance | |
model and to allow research networks, national agencies, etc. to embed | |
modular components within a shared workflow. Given the urgent need to | |
improve data management to support Essential Biodiversity Variables and | |
to deliver an effective global virtual natural history collection, we | |
should review these challenges and seek to establish a data management | |
and aggregation architecture that will support us for the coming | |
decades. | |
Abstract | |
Digitized natural history data are enabling a broad range of innovative | |
studies of biodiversity. Large-scale data aggregators such as Global | |
Biodiversity Information Facility (GBIF) and Integrated Digitized
Biocollections (iDigBio) provide easy, global access to millions of | |
specimen records contributed by thousands of collections. A developing | |
community of eager users of specimen data -- whether locality, image, | |
trait, etc. -- is perhaps unaware of the effort and resources required | |
to curate specimens, digitize information, capture images, mobilize | |
records, serve the data, and maintain the infrastructure (human and | |
cyber) to support all of these activities. Tracking of specimen | |
information throughout the research process is needed to provide | |
appropriate attribution to the institutions and staff that have supplied | |
and served the records. Such tracking may also allow for annotation and | |
comment on particular records or collections by the global community. | |
Detailed data tracking is also required for open, reproducible science. | |
Despite growing recognition of the value and need for thorough data | |
tracking, both technical and sociological challenges continue to impede | |
progress. In this talk, I will present a brief vision of how application | |
of a DOI to each iteration of a data set in a typical research project | |
could provide attribution to the provider, opportunity for comment and | |
annotation of records, and the foundation for reproducible science based | |
on natural history specimen records. Sociological change -- such as | |
journal requirements for data deposition of all iterations of a data set | |
-- can be accomplished using community meetings and workshops, along | |
with editorial efforts, as were applied to DNA sequence data two decades | |
ago. | |
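By way of illustration, the following sketch shows DataCite-style metadata that could chain each iteration of a data set to its predecessor and to its source; the DOIs and field values are hypothetical placeholders, not the speaker's actual implementation:

```python
# Minimal sketch: DataCite-style metadata linking data-set iterations so
# every cleaning or subsetting step remains citable. DOIs are placeholders.
dataset_v2 = {
    "doi": "10.5072/example.dataset.v2",  # DOI for this iteration
    "titles": [{"title": "Cleaned occurrence records, iteration 2"}],
    "relatedIdentifiers": [
        {   # link back to the previous iteration of the same data set
            "relatedIdentifier": "10.5072/example.dataset.v1",
            "relatedIdentifierType": "DOI",
            "relationType": "IsNewVersionOf",
        },
        {   # link back to the raw aggregator download it was derived from
            "relatedIdentifier": "10.5072/example.download",
            "relatedIdentifierType": "DOI",
            "relationType": "IsDerivedFrom",
        },
    ],
}
```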
Abstract | |
DiSSCo (The Distributed System of Scientific Collections) is a Research | |
Infrastructure (RI) aiming at providing unified physical | |
(transnational), remote (loans) and virtual (digital) access to the | |
approximately 1.5 billion biological and geological specimens in | |
collections across Europe. DiSSCo represents the largest ever formal | |
agreement between natural science museums (114 organisations across 21 | |
European countries). With political and financial support from 14 European governments and a robust governance model, DiSSCo will deliver,
by 2025, a series of innovative end-user discovery, access, | |
interpretation and analysis services for natural science collections | |
data. | |
As part of DiSSCo's developing data model, we evaluate the application
of Digital Objects (DOs), which can act as the centrepiece of its | |
architecture. DOs have bit-sequences representing some content, are | |
identified by globally unique persistent identifiers (PIDs) and are | |
associated with different types of metadata. The PIDs can be used to | |
refer to different types of information such as locations, checksums, | |
types and other metadata to enable immediate operations. In the world of | |
natural science collections, currently fragmented data classes (inter | |
alia genes, traits, occurrences) that have derived from the study of | |
physical specimens, can be re-united as parts in a virtual container | |
(i.e., as components of a Digital Object). These typed DOs, when | |
combined with software agents that scan the data offered by | |
repositories, can act as complete digital surrogates of the physical | |
specimens. | |
In this paper we: | |
investigate the architectural and technological applicability of DOs for | |
large scale data RIs for bio- and geo-diversity, | |
identify benefits and challenges of a DO approach for the DiSSCo RI and | |
describe key specifications (incl. metadata profiles) for a | |
specimen-based new DO type. | |
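To make the Digital Object idea concrete, here is a minimal sketch of a specimen-based DO as a typed container; all field names and identifiers are illustrative assumptions, not the DiSSCo specification:

```python
# Minimal sketch of a typed Digital Object re-uniting data classes derived
# from one physical specimen. Field names and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DigitalSpecimen:
    pid: str                      # globally unique persistent identifier
    object_type: str              # the registered DO type
    checksum: str                 # integrity check for the bit-sequence
    metadata: dict = field(default_factory=dict)    # descriptive metadata
    components: dict = field(default_factory=dict)  # re-united data classes

do = DigitalSpecimen(
    pid="21.T11148/0000-example",       # hypothetical Handle-style PID
    object_type="DigitalSpecimen",
    checksum="sha256:...",              # placeholder digest
    metadata={"institution": "https://ror.org/example"},
    components={
        "occurrences": ["https://example.org/occ/123"],
        "sequences": ["https://example.org/seq/XY000001"],
        "traits": ["https://example.org/trait/456"],
    },
)
```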
Abstract | |
Collections, aggregators, data re-packagers, publishers, researchers, | |
and external user groups form a complex web of data connections and | |
pipelines. This forms the natural history infrastructure essential for collections use by an ever-increasing and diverse external user community. We have made great strides in developing the individual
actors within this system and we are now well poised to utilize these | |
capabilities to address big picture questions. We need to continue work | |
on the individual aspects, but the focus now needs to be on integration | |
of the functionality provided by the actors involved in the pipeline to | |
facilitate the transfer of data between them with as few human | |
interventions as possible. In order for the system to function | |
efficiently and to the benefit of all parties, information, data, and | |
resources need not only to be integrated efficiently but also to flow in the
reverse direction (attribution) to facilitate collections advocacy and | |
sustainability. There are unrealized benefits to collections from | |
inclusion into aggregators and subsequent use by researchers and | |
publishers. A recent needs-assessment workshop of the Biodiversity Collections Network (BCoN), a National Science Foundation (NSF) funded Research Coordination Network (RCN), identified a possible solution to the integration and attribution of collections data and specimen information using a suite of unique, persistent identifiers for specimen records
(Universally Unique Identifiers or UUIDs), datasets (Digital Object | |
Identifiers or DOIs) and institutions/collections (Cool Uniform Resource | |
Identifiers or Cool URIs). This talk will highlight this potential | |
workflow and the work needed to achieve this solution while soliciting | |
participation from actors in the pipeline and the community at large. | |
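A minimal sketch of how the three identifier types could travel together on a single record; all values below are hypothetical:

```python
# Minimal sketch: one record carrying the proposed identifier suite.
import uuid

specimen_record = {
    "occurrenceID": str(uuid.uuid4()),                   # specimen-level UUID
    "datasetDOI": "https://doi.org/10.5072/example.42",  # dataset-level DOI
    "institutionURI": "http://example.org/cool/nhm-x",   # institution Cool URI
}
print(specimen_record)
```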
Abstract | |
Increasing the number of occurrence records available for biodiversity | |
research requires developing efficient pipelines from collectors and | |
observers to data aggregators and then marketing those pipelines to | |
biodiversity researchers. To be effective, these pipelines must | |
recognize that in many countries, internet access is slow, intermittent, | |
or expensive; cell phone internet access may be more common but many | |
people cannot afford the costs associated with using a cell phone for | |
databasing. The pipelines must also make it easy for users to provide | |
high quality data that conforms to international biodiversity data | |
standards. Marketing of these pipelines should include building | |
understanding of these standards and enable data providers to benefit | |
almost immediately from their contributions. Symbiota has succeeded in | |
making over 32 million specimen records available but most come from the | |
United States, a country with fast and reliable internet access in most | |
regions. We have established two Symbiota-based websites, OpenHerbarium | |
and OpenZooMuseum, to enable collectors and collections in Old World | |
countries that lack a national network to become contributors to and
participants in the global biodiversity data sharing community. Talking | |
with biodiversity researchers in such countries has clarified the many | |
impediments to data sharing faced by their collectors and collections. | |
In this presentation, we shall describe the steps we have taken, and are | |
proposing to take, to improve the pipeline for collectors and | |
collections in countries with poor internet access. | |
Abstract | |
VertNet (vertnet.org) is a collaborative project that makes biodiversity | |
data free and available on the web. VertNet is also a tool designed to help people discover, improve, and publish biodiversity data, and it is the core of a collaboration among hundreds of biocollections that contribute biodiversity data and work together to improve it. VertNet
has its genesis in the late 1990s and the very beginnings of vertebrate | |
collections data sharing, and is nearing its 20th birthday. The small | |
team that coordinates VertNet efforts long recognized the value of | |
archival versions of VertNet data separate from individual published | |
Darwin Core Archives. Here we describe why we produce what we call | |
"snapshots" of the VertNet index. To understand the snapshots, it is | |
important to also know how the VertNet indexing process works, which | |
includes efforts at better flagging record types and special content of | |
particular value to data consumers. We provide a brief explanation of | |
the process we developed for creating these snapshots, focusing on how | |
to assure their citation and licensing, and how to decide the scope of | |
different snapshots. We also discuss the collaborative process of | |
deciding infrastructure for archiving those snapshots, and our thinking | |
about timing of new snapshots. In particular, we cover the use of Google | |
BigQuery to produce snapshots and CyVerse as infrastructure for archival | |
storage. | |
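For readers unfamiliar with the mechanics, the following sketch shows what a snapshot export from BigQuery to cloud object storage might look like; the project, table and bucket names are invented, not VertNet's actual infrastructure:

```python
# Minimal sketch: export a BigQuery index table to compressed, sharded CSV
# files in cloud storage for archiving. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")
job_config = bigquery.ExtractJobConfig(
    destination_format="CSV", compression="GZIP")
extract_job = client.extract_table(
    "example-project.vertnet.index_2018",                    # source table
    "gs://example-archive/snapshots/vertnet-2018-*.csv.gz",  # sharded output
    job_config=job_config,
)
extract_job.result()  # block until the snapshot export completes
```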
Abstract | |
The South African Institute for Aquatic Biodiversity (SAIAB) operates | |
several research platforms, which may be used by the broader South | |
African research community (e.g. a marine research vessel and a remotely | |
operated underwater vehicle). SAIAB's enterprise-grade data centre,
along with expertise in systems administration and biodiversity | |
information management, allow the institute to offer a Biodiversity | |
Information Management Platform. | |
Data hosted by SAIAB are replicated across three data centres, each at least 250 m from the others and operating independently. Infrastructure at two data centres replicates in real time, forming a high availability cluster. The third data centre is dedicated to storing
backups. High-capacity tape backup will be added in the near future. As | |
an additional measure, cloud storage is used to store daily extracts of | |
Specify databases, which are retained for one year. | |
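A minimal sketch of such a daily extract, assuming a Specify database running on MySQL and an S3-compatible object store; hostnames, credentials, bucket and paths are placeholders:

```python
# Minimal sketch: dump a Specify (MySQL) database and copy the compressed
# dump to cloud object storage under a dated key. All names are placeholders.
import datetime
import subprocess

import boto3  # any S3-compatible object store client would do

stamp = datetime.date.today().isoformat()
dump_file = f"/backups/specify-{stamp}.sql.gz"

# mysqldump piped through gzip; credentials would come from a config file
with open(dump_file, "wb") as out:
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "specify"],
        stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)

boto3.client("s3").upload_file(
    dump_file, "example-saiab-backups", f"specify/{stamp}.sql.gz")
```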
In the first instance, the Platform aims to provide SAIAB researchers | |
and associates with biodiversity data curation services. This begins | |
with support for the SAIAB Collections Division, to ensure that voucher | |
specimens, tissue samples and associated media are accurately catalogued | |
and can be easily retrieved. Biodiversity data curation is broader than | |
this. It also means that any biodiversity data/metadata (records of | |
species, events, occurrences/observations and traits) can potentially be | |
curated using Specify Software, and standardised and published (subject | |
to relevant policies) to the GBIF Data Portal using the GBIF Integrated | |
Publishing Toolkit. The use of Specify Software to curate biodiversity
data that do not represent voucher specimens (e.g. underwater images and | |
video) is a new research project within SAIAB, which has the potential | |
to be extended beyond SAIAB. | |
A new national initiative, the Natural Science Collections Facility | |
(NSCF), was launched in 2017 to reinvigorate natural science museums | |
across the country, to halt deterioration of specimens and improve | |
capacity for specimen and data curation. | |
In support of the NSCF, the SAIAB platform is offered to natural science | |
museums in South Africa (excluding herbaria, which are all part of or | |
affiliated with SANBI, and therefore accommodated by a different | |
system). Each museum will be provided with a webserver, Specify 7 | |
database, Specify web portal and IPT server. | |
In offering this platform to the broader South African Biodiversity | |
Science community, SAIAB is primarily motivated by the potential for | |
collaborative research in capacity development for biodiversity data | |
curation / information management, using Specify Software. The first | |
research project will examine participating museums' capacity to use the | |
Specify Workbench sustainably, to import new voucher/occurrence records | |
generated by fieldwork. The requisite training to enhance this potential | |
will be provided. | |
The Natural Science Collections Facility (NSCF) is an important | |
collaborator in the context of enhancing the general state of South | |
Africa's specimen collections, and the Specify Collections Consortium is an important collaborator, specifically for software support.
Abstract | |
Long-term archival and disaster recovery are key components of our collective responsibility for managing digital data and metadata. As more and more data are collected digitally and as the
metadata for traditional museum collections becomes both digitized and | |
more comprehensive, the need to ensure that these data are safe and | |
accessible in the long term becomes essential. Unfortunately, disasters | |
do occur and many irreplaceable datasets on biodiversity have been | |
permanently lost. Maintaining a long-term archive and putting in place | |
reliable disaster recovery processes can be prohibitively expensive, | |
both in the cost of hardware and software as well as the costs of | |
personnel to manage and maintain an archival system. Traditionally, | |
storing digital data for the long term and ensuring the data are | |
loss-less, safe and completely recoverable when a disaster occurs has | |
been managed on-premises with a combination of on-site and off-site | |
storage. This requires complex data workflows to ensure that all data | |
are securely and redundantly stored in multiple highly dispersed | |
locations to minimize the threat of data loss due to local or regional | |
disasters. Files are often moved multiple times across operating systems | |
and media types on their way to and from a deep archive, increasing the | |
risk of file integrity issues. With the recent advent of an array of | |
Cloud Services from organizations such as Amazon, Microsoft and Google | |
to more focused offerings from Iron Mountain, Atempo and others, we have | |
a number of options for long-term archival of digital data. Deep archive solutions, i.e. storage where retrieval is expected only in the case of a disaster, are offered by many of these organizations at a rate
substantially less than their normal data storage fees. | |
The most basic requirement for an archival system is storing multiple | |
replicates of the data in geographically isolated locations with a | |
mechanism for guaranteeing file integrity, usually using a checksum | |
algorithm. Additional components that are integral to a robust archive | |
include a simple metadata search and reliable retrieval. | |
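The checksum mechanism can be illustrated with a short sketch; the manifest values and file paths below are hypothetical:

```python
# Minimal sketch: verify archived files against digests recorded at ingest.
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream the file so arbitrarily large archives can be checked."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# manifest of expected digests, written when the files were ingested
manifest = {"media/IMG_0001.tif": "9f86d081884c7d65..."}  # placeholder value
for path, expected in manifest.items():
    assert sha256sum(path) == expected, f"integrity failure: {path}"
```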
In this presentation, we'll discuss the need for long-term archival and
disaster recovery capabilities, detail the current best practices of | |
data archival systems and review a variety of archival options that have | |
become available with Cloud Services. | |
Abstract | |
The Cornell Lab of Ornithology gathers, utilizes and archives a wide | |
variety of digital assets ranging from details of a bird observation to | |
photos, video and sound recordings. Some of these datasets are fairly | |
small, while others are hundreds of terabytes. In this presentation we | |
will describe how the Lab archives these datasets to ensure the data are | |
both loss-less and recoverable in the case of a widespread disaster, how | |
the archival strategy has evolved over the years and explore in detail | |
the current hybrid cloud storage management system. | |
The Lab runs eBird and several other citizen science programs focused on | |
birds where individuals from around the globe enter their sightings into | |
a centralized database. The eBird project alone stores over 500,000,000 | |
observations and the underlying database is over a terabyte in size. | |
Birds of North America, Neotropical Birds and All About Birds are online | |
species accounts comprising a wide range of authoritative life history
articles maintained in a relatively small database. Macaulay Library is | |
the world's largest image, sound and video archive with over 6,000,000 | |
cuts totaling nearly 100 TB of data. The Bioacoustics Research Program | |
utilizes automated recording units (SWIFTs) in the forests of the US, | |
jungles of Africa and in all seven oceans to record the environment. | |
These units record 24 hours a day and gather a tremendous amount of raw
data, over 200 TB to date with an expected rate of an additional 100TB | |
per year. Lastly, BirdCams run by the Lab add a steady stream of media
detailing the reproductive cycles of a number of species. The lab is | |
committed to making these archives of the natural world available for | |
research and conservation today. More importantly, ensuring these data | |
exist and are accessible in 100 years is a critical component of the Lab | |
data strategy. | |
The data management system for these digital assets has been completely | |
overhauled to handle the rapidly increasing volume and to utilize | |
on-premises systems and cloud services in a hybrid cloud storage system | |
to ensure data are archived in a manner that is redundant, loss-less and | |
insulated from disasters yet still accessible for research. With | |
multimedia being the largest and most rapidly growing block of data, | |
cost rapidly becomes a constraining factor of archiving these data in | |
redundant, geographically isolated facilities. Datasets with a smaller footprint, such as eBird and the species accounts, allow for a wider variety of solutions, as cost is less of a factor. Using different methods to take
advantage of differing technologies and balancing cost vs recovery | |
speed, the Lab has implemented several strategies based on data | |
stability (eBird data are constantly changing), retrieval frequency | |
required for research, and overall size of the dataset. We utilize Amazon S3 and Glacier as our media archive; we tag each media file in Glacier with a set of basic Darwin Core metadata fields that key back to a master metadata database and numerous project-specific databases. Because these metadata databases are much smaller in size, yet critical for searching and retrieving a required media file, they are archived differently, with up-to-the-minute replication to prevent any data loss due to an unexpected disaster. The media files are tagged with a standard set of basic metadata so that, should the metadata databases become unavailable, retrieval of specific media and basic metadata can still occur.
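As an illustration of this tagging approach (not the Lab's actual code), the sketch below attaches a few Darwin Core fields to an archived media object; the bucket, key and values are invented:

```python
# Minimal sketch: attach basic Darwin Core fields as object tags so limited
# retrieval stays possible if the metadata databases are down.
import boto3

s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="example-media-archive",      # hypothetical archive bucket
    Key="macaulay/ML12345678.wav",       # hypothetical media object
    Tagging={"TagSet": [
        {"Key": "dwc:catalogNumber", "Value": "ML12345678"},
        {"Key": "dwc:scientificName", "Value": "Setophaga ruticilla"},
        {"Key": "dwc:eventDate", "Value": "2018-05-14"},
    ]},
)
```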
This system has allowed the Lab to place hundreds of terabytes of data into long-term archive, store them in redundant, geographically isolated locations, and provide for complete disaster recovery of the data and metadata.
Abstract | |
Validation using schemas and tools like the Darwin Core Archive Validator from GBIF is mainly seen as a method of checking data quality and fitness for use, but it is also important for long-term preservation.
We may like to think that our present (meta)data standards and formats | |
are made for eternity, but in reality we know that standards evolve, | |
formats change (some even become obsolete with time), and so do our | |
needs for storage, searching and future dissemination for re-use. So we | |
might eventually come to a point where transformation of our archival | |
records and migration to other formats will be necessary. This could | |
also mean that even if the AIPs, the Archival Information Packages stay | |
the same in storage, the DIPs, the Dissemination Information Packages | |
that we want to extract from the archive are subject to change of | |
format. Further, in order for archival information packages to be self-sustainable, as required by the OAIS model, it is important to take interdependencies between individual files in the information packages into account, already at the time of ingest and validation of the SIPs, the Submission Information Packages, and later at the points of necessary transformation and migration (from SIP to AIP, from AIP to DIP, etc.) to counter obsolescence. Validation schemas and transformation code should also be archived together with the AIPs. By ensuring compliance with standards, these tools are essential for controlling the uniformity of records in a collection, in view of future needs for transformation and migration to new, sustainable formats. An example is given of the problems encountered in transforming even a small, relatively well-defined collection of about 1,000 archival items, with substantial variation among them, caused by a lack of effective input constraints and validation at ingest.
A further assessment is made of validation errors encountered in some | |
Darwin Core Archives comprising thousands of records from some hundred | |
published datasets, and how these errors might affect a future potential | |
transformation / migration effort. Migration efforts must necessarily be | |
general in scope, while errors in datasets from non-compliance with | |
standards risk being reinforced or aggravated in the transformation | |
process, making the information contained in the resulting records more | |
difficult to interpret. The conclusion is that efforts should be made, e.g. by embedding validation measures into upload forms and other methods of information transfer (e.g. FTP, OAI-PMH), to ensure compliance with standards as closely as possible, already at the time of ingest.
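A minimal sketch of such validation at ingest; the required-term list and date rule below are illustrative, not a normative profile:

```python
# Minimal sketch: reject or flag records that omit required Darwin Core
# terms or carry malformed dates, before they enter the archival package.
import re

REQUIRED_TERMS = ["occurrenceID", "basisOfRecord", "scientificName"]
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # YYYY[-MM[-DD]]

def validate_record(record):
    errors = [f"missing {t}" for t in REQUIRED_TERMS if not record.get(t)]
    if record.get("eventDate") and not ISO_DATE.match(record["eventDate"]):
        errors.append("eventDate is not ISO 8601")
    return errors

print(validate_record({"occurrenceID": "urn:uuid:...", "eventDate": "14/05/2018"}))
# ['missing basisOfRecord', 'missing scientificName', 'eventDate is not ISO 8601']
```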
Abstract | |
Biodiversity Information Serving our Nation - BISON (bison.usgs.gov) is | |
the U.S. node to the Global Biodiversity Information Facility | |
(gbif.org), containing more than 375 million documented locations for | |
all species in the U.S. It is hosted by the United States Geological | |
Survey (USGS) and includes a web site and application programming | |
interface for apps and other websites to use for free. With this massive | |
database one can see not only the 15 million records for nearly 10 | |
thousand non-native species in the U.S. and its territories, but also | |
their relationship to all of the other species in the country as well as | |
their full national range. Leveraging this huge resource and its | |
enterprise level cyberinfrastructure, USGS BISON staff have created a | |
value-added feature by labeling non-native species records, even where | |
contributing datasets have not provided such labels. | |
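The labeling step can be illustrated with a short sketch (not the BISON implementation); the name set stands in for the compiled list of non-native species names:

```python
# Minimal sketch: flag records whose scientific name appears in a compiled
# non-native species list, even when the dataset supplied no such label.
NON_NATIVE = {"Sus scrofa", "Dreissena polymorpha"}  # placeholder compilation

def label_record(record):
    if record.get("scientificName") in NON_NATIVE:
        record["establishmentMeans"] = "introduced"  # Darwin Core term
    return record

print(label_record({"scientificName": "Dreissena polymorpha"}))
```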
Based on our ongoing four-year compilation of non-native species | |
scientific names from the literature, specific examples will be shared of the ambiguity and evolution of the terms that have been discovered, as they relate to invasiveness, impact, dispersal, and management. The idea
of incorporating these terms into an invasive species extension to | |
Darwin Core has been discussed by Biodiversity Information Standards | |
(TDWG) working group participants since at least 2005. One roadblock to | |
the implementation of this standard's extension has been the diverse
terminology used to describe the characteristics of biological | |
invasions, terminology which has evolved significantly over the past | |
decade. | |
Abstract | |
Reducing the damage caused by invasive species requires a community | |
approach informed by rapidly mobilized data. Even if local stakeholders | |
work together, invasive species do not respect borders, and national, | |
continental and global policies are required. Yet, in general, data on | |
invasive species are slow to be mobilized, often of insufficient quality | |
for their intended application and distributed among many stakeholders | |
and their organizations, including scientists, land managers, and | |
citizen scientists. The Belgian situation is typical. We struggle with | |
the fragmentation of data sources and restrictions on data mobility.
Nevertheless, there is a common view that the issue of invasive alien | |
species needs to be addressed. In 2017 we launched the Tracking Invasive | |
Alien Species (TrIAS) project, which envisages a future where alien | |
species data are rapidly mobilized, the spread of exotic species is | |
regularly monitored, and potential impacts and risks are rapidly | |
evaluated in support of policy decisions (Vanderhoeven et al. 2017). | |
TrIAS is building a seamless, data-driven workflow, from raw data to | |
policy support documentation. TrIAS brings together 21 different | |
stakeholder organizations that cover all organisms in the
terrestrial, freshwater and marine environments. These organizations | |
also include those involved in citizen science, research and wildlife | |
management. | |
TrIAS is an Open Science project and all the software, data and | |
documentation are being shared openly (Groom et al. 2018). This means | |
that the workflow can be reused as a whole or in part, either after the | |
project or in different countries. We hope to prove that rapid data workflows are indispensable tools, not only for the control of invasive species, but also for integrating and motivating the citizens and organizations involved.
Abstract | |
The Global Register of Introduced and Invasive Species (GRIIS) presents | |
annotated country checklists of introduced and invasive species. | |
Annotations include higher taxonomy of the species, synonyms, | |
environment/system in which the species occurs, and its biological | |
status in that country. Invasiveness is classified according to evidenced impact in that country. Draft country checklists are subjected to a process of
validation and verification by networks of country experts. Challenges | |
encountered across the world include confusion with alien/invasive | |
species terminology, classification of the 'invasive' status of an alien | |
species and issues with taxonomic synonyms. | |
Abstract | |
North America's Great Lakes contain 21% of the planet's fresh water, and | |
their protection is a matter of national security to both the USA & | |
Canada. One of the greatest threats to the health of this unparalleled | |
natural resource is invasion by non-indigenous species, several of which | |
already have had catastrophic impacts on property values, the fisheries, | |
shipping, and tourism industries, and continue to threaten the survival | |
of native species and wetland ecosystems. | |
The Great Lakes Invasives Network is a consortium (20 institutions) of | |
herbaria and zoology museums from among the Great Lakes states of | |
Minnesota, Wisconsin, Illinois, Indiana, Michigan, Ohio, and New York | |
created to better document the occurrence of selected non-indigenous | |
species and their congeners in space and time by imaging and providing | |
online access to the information on the specimens of the critical | |
organisms. The list of non-indigenous species (1 alga, 42 vascular | |
plants, 22 fish, and 13 mollusks) to be digitized was generated by | |
conducting a query of all fish, plants, algae, and mollusks present in | |
the database of GLANSIS -- the Great Lakes Aquatic Nonindigenous Species | |
Information System -- maintained by the National Oceanic and Atmospheric | |
Administration (NOAA). The network consists of collections at 20 | |
institutions, including 4 of the 10 largest herbaria in North America, | |
each of which curates 1-7 million specimens (NY, F, MICH, and WIS). | |
Eight of the nation's largest zoology museums are also represented, | |
several of which (e.g., Ohio State and U of Minnesota) are | |
internationally recognized for their fish and mollusk collections. | |
Each genus includes at least one species that is considered a Great | |
Lakes non-indigenous taxon -- several have many, whereas others have | |
congeners on "watchlists", meaning that they have not arrived in the | |
Great Lakes Basin yet, but have the potential to do so, especially in | |
light of human activity and climate change. Because the introduction and | |
spread of these species, their close relatives, and hybrids into the | |
region is known to have occurred almost entirely from areas in North | |
America outside of the Basin, our effort will include non-indigenous | |
specimens collected from throughout North America. | |
Digitized specimens of Great Lakes non-indigenous species and their | |
congeners will allow for more accurate identification of invasive | |
species and hybrids from their non-invasive relatives by a wider | |
audience of end users. The metadata derived from digitized specimens of | |
Great Lakes non-indigenous species and their congeners will help | |
biologists to track, monitor, and predict the spread of invasive species | |
through space and time, especially in the face of a more rapidly | |
changing climate in the upper Midwest. Altogether, consortium members will digitize >2 million individual specimens from >860,000 sheets/lots of non-indigenous species and their congeneric taxa. Data and metadata are uploaded to the Great Lakes Invasives Network, a Symbiota portal (GreatLakesInvasives.org), and ingested by iDigBio (iDigBio.org), the national resource for Advancing Digitization of Biodiversity Collections (ADBC).
Several initiatives are already in place to alert citizens to the | |
dangers of spreading aquatic invasive species among our nation's
waterways, but this project is developing complementary scientific and | |
educational tools for scientists, students, wildlife officers, teachers, | |
and the public who have had little access to images or data derived | |
directly from preserved specimens of invasive species collected over the | |
past three centuries. | |
Abstract | |
Agriculture and Agri-Food Canada (AAFC) is home to numerous specimen and | |
environmental collections generating highly relational data sets that | |
are analyzed using molecular methods (Sanger and NGS). The need to have | |
a system to properly manage these data sets and to capture accurate, | |
standardized metadata over entire laboratory workflows has been a | |
long-term strategic vision of the Biodiversity group at AAFC. Without | |
robust tracking, many difficulties arise when trying to publish or | |
submit data to external repositories. Even knowing what work has been carried out on individual collection records over a researcher's career becomes a demanding task, if the information is retrievable at all. SeqDB
was built to resolve these issues by centralizing, standardizing and | |
improving the availability and data quality of source specimen | |
collection data that is being studied using molecular methods. SeqDB | |
also facilitates integration with tools and external repositories in order to relieve researchers and technicians of having to create adequate systems to track and mobilize their data sets, allowing them to focus on research and collection management.
The development of SeqDB aligns with agile development methodologies and | |
attempts to fulfill rapidly emerging needs from genetics and genomics | |
research, which can evolve and fade quickly at times or be without clear | |
requirements. The success of SeqDB as an application supporting DNA | |
sequencing workflows has put it in the same space as other monolithic | |
architectures before it. As the feature set to support the application | |
continues to increase, the number of software developers vs operations | |
and maintenance staff is difficult to rebalance in our organisation. In | |
an effort to manage the scope for the project and ensure we are able to | |
continue to deliver on our mandate, the sequence tracking workflows of | |
the application will become part of the DINA ecosystem ("DIgital | |
information system for NAtural history data", https://dina-project.net). | |
Other functions of SeqDB, such as collections management and taxonomy
tree curation, will be replaced with the DINA modules implementing these | |
functions. | |
In order to allow SeqDB to become a module of DINA, it has been decided to refactor the application and base it on a Service-Oriented Architecture. In doing so, all molecular data in SeqDB will be exposed as JSON API web services (JavaScript Object Notation application programming interfaces), allowing other modules, user interfaces and the current SeqDB application to communicate in a standardised way. The new architecture will also bring an important technology upgrade for SeqDB, with the front end eventually becoming a project in itself.
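To indicate what exposing data as JSON API web services implies in practice, here is a sketch of a response document following the jsonapi.org conventions; the resource and field names are hypothetical, not the SeqDB schema:

```python
# Minimal sketch of a JSON API (jsonapi.org) resource document: every
# resource travels with a type, an id, attributes and relationships.
dna_sequence_response = {
    "data": {
        "type": "dna-sequences",
        "id": "seq-0001",
        "attributes": {
            "region": "ITS2",
            "sequence": "ACGT...",  # placeholder
        },
        "relationships": {
            "specimen": {"data": {"type": "specimens", "id": "spec-042"}},
        },
    },
}
```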
Abstract | |
As the biodiversity community increasingly adopts Semantic Web (SW) | |
standards to represent taxonomic registers, trait banks or museum | |
collections, some questions come up relentlessly: How to model the data? | |
For what goals? Can the same model fulfill different goals? | |
So far, the community has mostly considered the SW standards through | |
their most salient manifestation: the Web of Linked Data (Heath and | |
Bizer 2011). Indeed, the 5-star Linked Data principles are geared | |
towards the building of a large, distributed knowledge graph that may | |
successfully fulfill biodiversity's need for interoperability and data | |
integration. However, the SW addresses a much broader set of problems | |
involving automatic reasoning. For instance, reasoners can exploit | |
ontological knowledge to improve query answering, leverage class | |
definitions to infer class subsumption relationships, or classify | |
individuals i.e. compute instance relationships between individuals and | |
classes by applying reasoning techniques on class definitions and | |
instance descriptions (Shearer et al. 2008). | |
Whether a \"thing\" should be modelled as a class or a class instance | |
has been debated at length in the SW community, and the answer is often | |
a matter of perspective. In the context of taxonomic registers for | |
example, the NCBI Organismal Classification (Federhen 2012) and | |
Vertebrate Taxonomy Ontology (Midford et al. 2013) represent taxa as | |
classes in the Ontology Web Language (OWL). By contrast, other | |
initiatives represent taxa as instances of various classes, e.g. the | |
SKOS Concept class (skos:Concept) in the AGROVOC thesaurus (Caracciolo | |
et al. 2013) (we speak of the instances as SKOS concepts), the Darwin | |
Core taxon class (dwc:Taxon) in Encyclopedia of Life (Parr et al. 2016), | |
or classes depicting taxonomic ranks in GeoSpecies, DBpedia and the BBC | |
Wildlife Ontology. Such modelling discrepancies impede linking congruent | |
taxa throughout taxonomic registers. Indeed, one can state the | |
equivalence between two classes (with owl:equivalentClass) or two class | |
instances (with owl:sameAs, skos:exactMatch, etc.), but good practices | |
discourage the alignment of classes with class instances (Baader et al. | |
2003). | |
Recently, Darwin Core\'s popularity has fostered the modeling of taxa as | |
instances of class dwc:Taxon (Senderov et al. 2018, Parr et al. 2016). | |
In this context, pragmatism may incline a Linked Data provider to comply | |
with this majority trend to ensure maximum interlinking. Although | |
technically and conceptually valid, this choice entails certain | |
drawbacks. First, considering a taxon only as an instance misses the
fact that it is a set of biological individuals with common | |
characteristics. An OWL class exactly captures this semantics through | |
the set of necessary and sufficient conditions that an individual must | |
meet to be a class member. In turn, an OWL reasoner can leverage this | |
knowledge to perform query answering, compute subsumption or instance | |
relationships. By contrast, taxa depicted by class instances are not | |
defined but described by stating their properties. Hence the second | |
drawback: unless we develop bespoke reasoners, there is not much a | |
standard OWL reasoner can deduce from instances. | |
Yet, some works have demonstrated the effectiveness of logic | |
representation and reasoning capabilities, e.g. computing the alignments | |
of two primate classifications (Franz et al. 2016) using generic | |
reasoners that nevertheless require proprietary input formats. OWL | |
reasoners are typically designed to solve such classification problems. | |
They may leverage taxonomic ontologies to compute alignments with other | |
ontologies or apply reasoning to individuals' properties to infer their
species. Hence, pragmatically following the instance-based approach may | |
indeed maximize interlinking in the short term, but bears the risk of | |
denying ourselves potentially desirable use cases in the longer term. We | |
believe that developing class-based ontologies for biodiversity should | |
help leverage the SW's extensive theoretical and practical works to | |
tackle a variety of use cases that so far have been addressed with | |
bespoke solutions. | |
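The two modelling options can be contrasted in a short sketch using the rdflib library; the URIs below are hypothetical:

```python
# Minimal sketch of the two modelling options: taxon-as-class supports
# subsumption reasoning; taxon-as-instance supports description and linking.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/taxon/")
DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()

# Option 1: taxa as OWL classes, so Panthera leo is subsumed by Panthera
g.add((EX.Panthera_leo, RDF.type, OWL.Class))
g.add((EX.Panthera_leo, RDFS.subClassOf, EX.Panthera))

# Option 2: taxa as instances of dwc:Taxon, described by their properties
g.add((EX.lion, RDF.type, DWC.Taxon))
g.add((EX.lion, DWC.scientificName, Literal("Panthera leo")))
```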
Abstract | |
The DINA Consortium ("DIgital information system for NAtural history | |
data", https://dina-project.net,Fig. 1 was formed in order to provide a | |
framework for like-minded large natural history collection-holding | |
institutions to collaborate through a distributed Open Source | |
development model to produce a flexible and sustainable collection | |
management system. Target collections include zoological, botanical, | |
mycological, geological and paleontological collections, living | |
collections, biodiversity inventories, observation records, and | |
molecular data. | |
The DINA system is architected as a loosely-coupled set of several | |
web-based modules. The conceptual basis for this modular ecosystem is a | |
compilation of comprehensive guidelines for Web application programming | |
interfaces (APIs) to guarantee the interoperability of its components. | |
Thus, all DINA components can be modified or even replaced by other | |
components without crashing the rest of the system as long as they are | |
DINA compliant. Furthermore, the modularity enables the institutions to | |
host only the components they need. DINA focuses on an Open Source | |
software philosophy and on community-driven open development, so the | |
contributors share their development resources and expertise outside of | |
their own institutions. | |
One of the overarching reasons to develop a new collection management | |
system is the need to better model complex relationships between | |
collection objects (typically specimens) involving their derivatives, | |
preparations and storage. We will discuss enhancements made in the DINA | |
data model to better represent these relationships and the influence it | |
has on the management of these objects, and on the sharing of | |
information. Technical detail of various components of the DINA system | |
will be shown in other talks in this symposium followed by a discussion | |
session. | |
Abstract | |
The DINA Symposium ("DIgital information system for NAtural history | |
data", https://dina-project.net) ends with a plenary session involving | |
the audience to discuss the interplay of collection management and | |
software tools. The discussion will touch different areas and issues | |
such as: | |
(1) Collection management using modern technology:
How should and could collections be managed using current technology? What is the ultimate objective of using a new collection management system?
How should traditional management processes be changed? | |
(2) Development and community
Why are there so many collection management systems? | |
Why is it so difficult to create one system that fits everyone's | |
requirements? | |
How could the community of developers and collection staff be built | |
around DINA project in the future? | |
(3) Features and tools
How to identify needs that are common to all collections? | |
What are the new tools and technologies that could facilitate collection | |
management? | |
How could those tools be implemented as DINA compliant services? | |
(4) Data
What data must be captured about collections and specimens? | |
What criteria need to be applied in order to distinguish essential and | |
"nice-to-have" information? | |
How should established data standards (e.g. Darwin Core & ABCD (Access | |
to Biological Collection Data)) be used to share data from rich and | |
diverse data models? | |
In addition to the plenary discussion around these questions, we will agree on a streamlined format for continuing the discussion in order to write a white paper on them. The results and outcome of the session will constitute the basis of the paper and will subsequently be refined.
Abstract | |
In order to ensure long-term commitment to the DINA project ("DIgital | |
information system for NAtural history data", https://dina-project.net), | |
it is essential to continuously deliver features of high value to the | |
user community. This is also what agile software development methods try | |
to achieve by emphasizing early delivery, rapid response to changes and | |
close collaboration with users (see for example the Manifesto for Agile | |
Software Development at http://agilemanifesto.org). We will give a brief | |
overview on how current development of the DINA collection management | |
system core is guided by agile principles. The mammal collection at the | |
Swedish Museum of Natural History will be used as an example. | |
Developing a cross-disciplinary collection management system is a | |
complex task that poses many challenges: Which features should we focus | |
on? What kinds of data should the system ultimately support? How can the | |
system be flexible but still easy to use? Since we cannot do everything | |
at once, we work towards a minimum viable product (MVP) that contains | |
just enough features at a time to bring value for selected target users. | |
In the mammal collection case, the MVP is the simplest product that is | |
able to replace the functions of the current system used for managing | |
the collection. As we begin to work with other collections, new MVPs are | |
defined and used to guide further development. Thus, the set of features | |
available will increase with each MVP, benefiting both new and current | |
users. | |
Another big challenge is migration of legacy data, which is labor | |
intensive and involves standardizing data that are not compatible with | |
the new system. To address these issues, we aim to build a flexible data | |
model that allows less structured data to coexist with more complex, | |
highly structured data. Migration should thus not require extensive data | |
standardization, transformation and cleaning. The plan is to instead | |
offer tools for transforming and cleaning the data after they have been | |
imported. With the data in place, it will be easier for the user to | |
provide feedback and suggest new features. | |
Abstract | |
The DINA system ("DIgital information system for NAtural history data", | |
https://dina-project.net) consists of several web-based services that | |
fulfill specific tasks. Most of the existing services cover single core features of the collection management system and can be used either as integrated components in the DINA environment or as stand-alone services.
In this presentation, individual services will be highlighted, as they represent technically interesting approaches and practical solutions to daily challenges in collection management, data curation and migration workflows. The focus will be on the following topics: (1) a generic
reporting and label printing service, (2) practical decisions on | |
taxonomic references in collection data and (3) the generic management | |
and referencing of related research data and metadata: | |
Reporting as presented in this context is defined as an extraction and | |
subsequent compilation of information from the collection management | |
system rather than just summarizing statistics. With this quite broad understanding of the term, the DINA Reports & Labels Service (Museum für Naturkunde Berlin 2018) can assist in several different collection workflows, such as generating labels, barcodes, specimen lists, vouchers, paper loan forms, etc. As it is based on customizable HTML templates, it can even be used for creating customized web forms for any kind of interaction (e.g. annotations).
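A minimal sketch of template-driven label generation in this spirit (illustrative only; the DINA service itself is built on customizable HTML templates, but the template and field names below are invented):

```python
# Minimal sketch: render collection data into an HTML label template.
from string import Template

LABEL_TEMPLATE = Template("""
<div class="label">
  <strong>$scientificName</strong><br/>
  $locality, $eventDate<br/>
  leg. $recordedBy ($catalogNumber)
</div>
""")

record = {
    "scientificName": "Carabus auratus",
    "locality": "Berlin, Grunewald",
    "eventDate": "2018-06-02",
    "recordedBy": "A. Collector",
    "catalogNumber": "ZMB-000123",
}
print(LABEL_TEMPLATE.substitute(record))
```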
Many collection management systems try to cope with taxonomic issues, | |
because in practice taxonomy is used not only for determinations, but | |
also for organizing the collections and categorizing storage units (e.g. | |
"Coleoptera hall"). Addressing taxonomic challenges in a collection | |
management system can slow down development and add complexity for the | |
users. The DINA system uncouples these issues in a simple taxonomic | |
service for the sole assignment of names to specimens, for example | |
determinations. This draws a clear line between collection management | |
and taxonomic research, of which the latter can be supported in a | |
separate service. | |
While the digitization of collection data and workflows proceeds, | |
linking related data is essential for data management and enrichment. In many institutions, research data are disconnected from the collection specimen data because their type and structure cannot easily be included in the collection management databases. With the DINA Generic Data
Module (Museum für Naturkunde Berlin 2017) a service exists that allows | |
for attaching any relational data structures to the DINA system. It can | |
also be used as a standalone service that accommodates structured data | |
within a DINA compliant interface for data management. | |
Abstract | |
The large efforts to document and map aboveground biodiversity have | |
helped to elucidate ecological and evolutionary mechanisms and | |
processes, predict responses to global change, and identify potential | |
management options in response to those changes. Yet these concepts have | |
mostly been applied to aboveground plant and animal communities, while | |
microbial diversity remains difficult to incorporate. The ability to | |
integrate microbial sequence data into an accessible global | |
infrastructure has previously been limited by a few key factors: First, | |
most microbial diversity remains undescribed and unknown; there is
just an enormous amount of biodiversity. Second, there is a lack of | |
congruence between the many disparate microbial datasets (e.g. taxonomy, | |
phylogeny, and methodological biases), which limits the ability to | |
monitor and quantify global patterns of the terrestrial microbiome. | |
Finally, there is a lack of coordination and networking between | |
scientists studying microbes. In this presentation I will discuss two | |
case studies that highlight how we can begin to link microbial data to the already well-established macro-knowledge and to other environmental databases (like global carbon maps).
Study 1 -- a megameta analysis: The emergence of high-throughput DNA | |
sequencing methods provides unprecedented opportunities to further | |
unravel microbial ecology and its worldwide role from human health to | |
ecosystem functioning. However, in spite of the abundance of sequencing | |
studies, combining data from multiple individual studies to address | |
macroecological questions of bacterial diversity remains methodically | |
challenging and plagued with biases. While previous meta-analysis | |
efforts have focused on diversity measures or abundances of major taxa, | |
in a recent study (1) we show that disparate amplicon sequence data can
be combined at the taxonomy-based level to assess bacterial community | |
structure. Using a machine learning approach, we found that rarer taxa | |
are more important for structuring soil communities than abundant taxa. | |
We concluded that combining data from independent studies can be used to | |
explore novel patterns in bacterial communities, identify potential | |
'indicator' taxa with an important role in structuring communities, and propose new hypotheses on previously overlooked factors that shape microbial biogeography.
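For illustration, here is a sketch in the spirit of such a machine-learning analysis (random placeholder data, not the study's actual pipeline): fit a random forest on a site-by-taxon abundance table and rank the taxa by their importance for predicting community groupings:

```python
# Minimal sketch: rank taxa by random-forest feature importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(2, size=(120, 50))  # 120 samples x 50 taxa (abundances)
y = rng.integers(0, 3, size=120)    # e.g. a habitat category per sample

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]
print("taxa most important for structuring the communities:", ranked[:5])
```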
Study 2 -- a global soil biodiversity database: Greater access to | |
microbial data is an important next step for biodiversity research and | |
conservation, and for understanding the ecology and evolution of | |
microbial communities. In collaboration with the Global Soil | |
Biodiversity Initiative and the German Biodiversity Synthesis Centre | |
(sDIV) we outlined steps that must be taken to ensure microbial sequence | |
data can be included in global measures and maps of biodiversity (2).
Here I will discuss how the plant associated microbiome is an optimal | |
starting point to synthesize microbial sequence data on an open and | |
global platform. The plant-microbiome is an optimal model system that | |
goes across scales and time, can act as a bridge between microorganisms | |
and macroorganisms, and as an opportunity to more thoroughly explore the | |
synthesis of global microbial sequence data (for a global soil | |
biodiversity database). Beyond expanding primary research, the patterns | |
discovered in a synthesis of plant-microbiome can be used to explore and | |
guide ecosystem restoration and sustainability. Overall, a better | |
understanding of microbial biodiversity will help to predict | |
consequences of (human-induced) global changes and facilitate | |
conservation and adaptation responses. | |
(1) Ramirez, K.S., C.G. Knight et al. and F.T. de Vries (2017). Detecting macroecological patterns in bacterial communities across independent studies of global soils. Nature Microbiology.
(2) Ramirez, K.S., M. Döring, N. Eisenhauer, C. Gardi, J. Ladau, J.W. Leff, G. Lentendu, Z. Lindo, M.C. Rillig, D. Russell, S. Scheu, M.G. St. John, F.T. de Vries, T. Wubet, W.H. van der Putten and D.H. Wall (2015). Towards a global platform for linking soil biodiversity data. Frontiers in Ecology and Evolution 3(91). doi: 10.3389/fevo.2015.00091
Abstract | |
Traditionally, taxonomic characterisation of organisms has relied on | |
their morphology; however, molecular methods are increasingly used to | |
monitor and assess biodiversity and ecosystem health. Approaches such as | |
DNA amplicon diversity assessments are particularly useful tools when
morphology-based taxonomy is difficult or taxa are morphologically | |
ambiguous, for example for freshwater bacteria and fungi as well as many | |
freshwater invertebrate species. DNA metabarcoding provides the ability | |
to distinguish cryptic taxa (which can differ markedly in their | |
ecological requirements and tolerances) and in addition it can provide | |
valuable insights into the genetic and ecological diversity of taxa and | |
ecosystems. While DNA metabarcoding has been used mostly on tissue of | |
sampled specimens, recent years have seen an increased use of | |
metabarcoding on environmental DNA samples: DNA extracted not from | |
sampled specimens, but from the surrounding soil or water. However, the | |
ability of metabarcoding of specimens and metabarcoding of environmental | |
DNA (eDNA) to assess biodiversity and the impact of anthropogenic | |
stressors on freshwater ecosystems is largely understudied. In this | |
talk, several studies that document the advantages and still open | |
challenges of (e)DNA metabarcoding for assessing impacts of | |
environmental stressors on aquatic ecosystems will be presented. These | |
studies, performed in Europe and New Zealand, integrate impacts across | |
different biotic groups, i.e. look at stressor effects on bacterial, | |
protist, fungal and macroinvertebrate communities. Specifically, we use | |
various case studies from freshwater ecosystems to address the following | |
questions: | |
whether eDNA samples, which can be relatively quickly obtained from the | |
water, can act as reliable proxies for catchment-level stressor impacts | |
by comparing these to DNA obtained from local bulk samples, and | |
whether DNA metabarcoding data can also provide quantitative information | |
rather than only presence-absence data. | |
In view of the case studies presented, a perspective on the urgent next | |
steps that need to be taken in order to include genetic tools in routine | |
biomonitoring will be derived and linked to the vision of the | |
international network DNAqua-Net. | |
Abstract | |
Adventitious roots in canopy soils associated with silver beech | |
(Lophozonia menziesii (Hook.f.) Heenan & Smissen (Nothofagaceae)) form | |
ectomycorrhizal associations. We used amplicon sequencing of the | |
internal transcribed spacer 2 region to compare diversity of | |
ectomycorrhizal fungal species in canopy and terrestrial sites. The | |
study data are archived as an NCBI BioProject (accession PRJNA421209), with the raw DNA sequence reads available from the NCBI Sequence Read Archive (SRA637723). Community composition of canopy ectomycorrhizal fungi was significantly different from the terrestrial community composition,
with several abundant ectomycorrhizal species significantly more | |
represented in the terrestrial soil than the canopy soil. Additionally, | |
we found evidence that an introduced ectomycorrhizal species was present | |
in these native forest soils. We identified OTUs in two ways: (i) by | |
manually curated BLAST searching of the NCBI nr database, and (ii) by | |
comparison with Species Hypotheses on UNITE v.7.2. We aimed to make species identifications where we could be reasonably confident that they were robust, but we had to avoid making identifications where an incorrect name could have implications for biosecurity or for our understanding of biodiversity and biogeography. We found some UNITE Species Hypotheses
included sequences of more than one taxon, which we were able to | |
separate and distinguish by phylogenetic analysis. Consequently we | |
exercised caution in reporting names based on the Species Hypotheses. | |
Using data from this case study, we will illustrate the achievements and | |
challenges faced in identifying species of ectomycorrhizal fungi from | |
DNA barcodes. Most DNA sequences of ectomycorrhizal fungi matched | |
closely New Zealand voucher specimens stored in either the New Zealand | |
Fungal Herbarium (PDD) or the Otago Regional Herbarium (OTA), which | |
facilitated the validation of identifications. In the case of PDD | |
specimens, collection and DNA data were linked via the Systematics | |
Collections Data database (https://scd.landcareresearch.co.nz). We are | |
working towards a similar database for OTA specimens, using the Specify | |
6 database platform. | |
Abstract | |
Several national and international environmental laws require countries | |
to meet clearly defined targets with respect to the ecological status of | |
aquatic ecosystems. In Europe, the EU-Water Framework Directive (WFD; | |
2000/60/EC) represents such a detailed piece of legislation. The WFD requires the European member countries to achieve an at least 'good' ecological status of all surface waters by the year 2027 at the latest. In order to assess the ecological status of a given water body under the WFD, data on its aquatic biodiversity are obtained and compared to a reference status. The mismatch between these two metrics is then used to derive the respective ecological status class. While the
workflow to carry out the assessment is well established, it relies on only a few biological groups (typically fish, macroinvertebrates and a few algal taxa such as diatoms), is time consuming, and remains at a lower taxonomic resolution, so that the identifications can be done routinely by non-experts with an acceptable learning curve. Here, novel genetic and genomic tools provide new solutions to speed up the process and allow a much greater proportion of biodiversity to be included in the assessment process. Further, results are easily comparable through the genetic 'barcodes' used to identify organisms.
The aim of the large international COST Action DNAqua-Net | |
(http://dnaqua.net/) is to develop strategies on how to include novel | |
genetic tools in bioassessment of aquatic ecosystems in Europe and | |
beyond and how to standardize these among the participating countries. | |
It is the ambition of the network to have these new genetic tools | |
accepted in future legal frameworks such as the EU-Water Framework | |
Directive (WFD; 2000/60/EC) and the Marine Strategy Framework Directive | |
(2008/56/EC). However, a prerequisite is that various aspects, ranging from the validation and completion of DNA barcode reference databases to the lab and field protocols, the analysis processes, and the subsequently derived biotic indices and metrics, are dealt with and commonly agreed upon. Furthermore, many pragmatic questions, such as adequate short- and long-term storage of samples or specimens for further processing or to serve as an accessible reference, also need to be addressed. In Europe, the conformity and backward compatibility of the new methods with the existing legislation and workflows are also of high importance. Without rigorous harmonization and inter-calibration
concepts, the implementation of the powerful new genetic tools will be | |
substantially delayed in real-world legal framework applications. | |
After a short introduction to the structure and vision of DNAqua-Net, we
discuss how the DNAqua-Net community considers possibilities to include
novel DNA-based approaches in current bioassessment and how formal
standardization, e.g. through the framework of CEN (the European
Committee for Standardization), may aid in that process (Hering et al.
2018, Leese et al. 2016, Leese et al. 2018). Further, we explore how
TDWG data standards can facilitate swift adoption of the genetic methods
in routine use. We also present potential impacts of the legislative
requirements of the Nagoya Protocol on the exchange of genetic resources
and their implications for biomonitoring. Last but not least, we will
touch upon the rather unexpected influence that the new General Data
Protection Regulation (GDPR) may have on bioassessment work in practice.
Abstract
Although they are hyperdiverse and intensively studied, parasites
present major challenges when it comes to phylogenetics, taxonomy, and
biodiversity informatics. The collection of any parasitic organism
entails the linking of at least two specimens - the parasite and the
host. If the parasite has a complex life cycle, this becomes further
complicated by requiring the linking of three or more specimens: the
parasite, its intermediate host (vector), and its definitive host(s).
Parasites are sometimes collected as a byproduct of another collection
event and are not studied immediately, which has the potential to
disconnect them further in terms of information content and continuity.
The converse is also common: parasites can be collected by
parasitologists who do not necessarily take host vouchers or incorporate
host taxonomy, let alone other metadata for these events.
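One way to keep such records connected - shown here purely as an
illustration, not as part of the talk - is to state each parasite-host
link explicitly, for example with Darwin Core ResourceRelationship
terms. A minimal Python sketch (all specimen identifiers are
hypothetical):

def link_specimens(parasite_id: str, host_id: str, remark: str) -> dict:
    # A ResourceRelationship-style record: the parasite is the resource,
    # the host specimen is the related resource.
    return {
        "resourceID": parasite_id,                # dwc:resourceID
        "relatedResourceID": host_id,             # dwc:relatedResourceID
        "relationshipOfResource": "parasite of",  # dwc:relationshipOfResource
        "relationshipRemarks": remark,            # dwc:relationshipRemarks
    }

# A parasite with a complex life cycle needs one link per host specimen.
links = [
    link_specimens("parasite:P-001", "mosquito:M-042", "intermediate host (vector)"),
    link_specimens("parasite:P-001", "bird:B-317", "definitive host"),
]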
Using the specific example of the malaria parasites (Order Haemosporida),
I will present examples of the specific challenges that have accompanied
the study of these parasites, including issues of species delimitation;
phylogenetic study, including genetic oddities that are unique to these
organisms; and the taxonomic quandaries we now find ourselves in, along
with other problems of maintaining continuity of information in a group
that is both biologically diverse and medically important.
Abstract
Madagascar is one of the world's hottest biodiversity hotspots and a
natural laboratory for evolutionary research. Tenrecs (Tenrecidae; 32
currently recognized species) -- small placental mammals endemic to
Madagascar -- colonized the island >35 million years ago and have
evolved a stunning range of behaviors and morphologies, including
heterothermic species; species with hedgehog-like spines; and fossorial,
aquatic, and scansorial ecotypes. In 2016, we produced the first
taxonomically complete phylogeny of tenrecs, which has served as a
framework for studying morphological evolution, phylogeography, and
species limits. Most recently, we have built on this phylogeny to
incorporate an enormous database of genetic, morphometric, and
geographic data from >800 vouchered tenrec specimens. These data have
revealed interesting and unexpected aspects of their evolutionary
history, including decoupled diversification of the cranium and
postcranium. Using a machine learning approach, we have also uncovered
numerous new, cryptic species in the family Tenrecidae. As phylogenetic
and phenotypic data become more readily available through online
repositories, we expect that the same approaches can be applied to other
taxonomic groups, providing unprecedented resolution of the tree of life.
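The abstract does not name the machine learning method used; as a
generic sketch of how model-based clustering of vouchered measurement
data can flag more lineages than the currently named species, the
following uses synthetic data and scikit-learn's GaussianMixture as
stand-ins for the authors' actual pipeline:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Hypothetical input: 200 specimens x 6 morphometric measurements.
X = rng.normal(size=(200, 6))

# Fit mixtures with 1-5 components and pick the best by BIC; a best fit
# with more clusters than named species suggests candidate cryptic taxa.
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print("clusters:", best.n_components)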
# Loop over all abstracts and write the output to abstracts.txt
for abstract in $(cat TDWG_abstracts.txt); do
  # Strip the article path down to just the abstract number
  anum=$(echo "$abstract" | sed 's/\/article\///g;s/\/download\/xml\///g')
  # Download the XML representation of the abstract
  wget "https://biss.pensoft.net${abstract}" -O "$anum.xml"
  # Extract just the abstract text from the XML using XPath and convert it to Markdown
  xmllint --xpath "/article/front/article-meta/abstract" "$anum.xml" | pandoc --from html --to markdown >> abstracts.txt
done
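The listing that follows appears to be the TDWG_abstracts.txt input
consumed by the loop above - one biss.pensoft.net article path per line.
If the script is saved as, say, fetch_abstracts.sh (a hypothetical name)
and run with bash in the same directory as that file, it should write
the collected abstracts to abstracts.txt.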
/article/27339/download/xml/
/article/26369/download/xml/
/article/25437/download/xml/
/article/26922/download/xml/
/article/26860/download/xml/
/article/26516/download/xml/
/article/26323/download/xml/
/article/26304/download/xml/
/article/26262/download/xml/
/article/26235/download/xml/
/article/26177/download/xml/
/article/26080/download/xml/
/article/26075/download/xml/
/article/25842/download/xml/
/article/25738/download/xml/
/article/25661/download/xml/
/article/25577/download/xml/
/article/25223/download/xml/
/article/26168/download/xml/
/article/27244/download/xml/
/article/26490/download/xml/
/article/26367/download/xml/
/article/26286/download/xml/
/article/26104/download/xml/
/article/26102/download/xml/
/article/25960/download/xml/
/article/25864/download/xml/
/article/25828/download/xml/
/article/25890/download/xml/
/article/25885/download/xml/
/article/25724/download/xml/
/article/25723/download/xml/
/article/25881/download/xml/
/article/25836/download/xml/
/article/25876/download/xml/
/article/25564/download/xml/
/article/25560/download/xml/
/article/25535/download/xml/
/article/25481/download/xml/
/article/26122/download/xml/
/article/25852/download/xml/
/article/26731/download/xml/
/article/25869/download/xml/
/article/25693/download/xml/
/article/25658/download/xml/
/article/25165/download/xml/
/article/25641/download/xml/
/article/25586/download/xml/
/article/25700/download/xml/
/article/25298/download/xml/
/article/26749/download/xml/
/article/25651/download/xml/
/article/25289/download/xml/
/article/25525/download/xml/
/article/25282/download/xml/
/article/25748/download/xml/
/article/25694/download/xml/
/article/25653/download/xml/
/article/25585/download/xml/
/article/26665/download/xml/
/article/25838/download/xml/
/article/25450/download/xml/
/article/25439/download/xml/
/article/25394/download/xml/
/article/25268/download/xml/
/article/25148/download/xml/
/article/25582/download/xml/
/article/25657/download/xml/
/article/25608/download/xml/
/article/25438/download/xml/
/article/25395/download/xml/
/article/25351/download/xml/
/article/25324/download/xml/
/article/25317/download/xml/
/article/25310/download/xml/
/article/25176/download/xml/
/article/26808/download/xml/
/article/26060/download/xml/
/article/25474/download/xml/
/article/25456/download/xml/
/article/26836/download/xml/
/article/25840/download/xml/
/article/25812/download/xml/
/article/25811/download/xml/
/article/25805/download/xml/
/article/25642/download/xml/
/article/24749/download/xml/
/article/25306/download/xml/
/article/24930/download/xml/
/article/25647/download/xml/
/article/25646/download/xml/
/article/25635/download/xml/
/article/25580/download/xml/
/article/25579/download/xml/
/article/26009/download/xml/
/article/25983/download/xml/
/article/25982/download/xml/
/article/25953/download/xml/
/article/25604/download/xml/
/article/25936/download/xml/
/article/25776/download/xml/
/article/25739/download/xml/
/article/25727/download/xml/
/article/25698/download/xml/
/article/25589/download/xml/
/article/25614/download/xml/
/article/25478/download/xml/
/article/25409/download/xml/
/article/25345/download/xml/
/article/25343/download/xml/
/article/26514/download/xml/
/article/25969/download/xml/
/article/25415/download/xml/
/article/25410/download/xml/
/article/25990/download/xml/
/article/25488/download/xml/
/article/25487/download/xml/
/article/25486/download/xml/
/article/25121/download/xml/
/article/24991/download/xml/
/article/27087/download/xml/
/article/26658/download/xml/
/article/26615/download/xml/
/article/26471/download/xml/
/article/25728/download/xml/
/article/25914/download/xml/
/article/25664/download/xml/
/article/26561/download/xml/
/article/25699/download/xml/
/article/27251/download/xml/
/article/25762/download/xml/
/article/25833/download/xml/
/article/25749/download/xml/
/article/25637/download/xml/
/article/25261/download/xml/
/article/25260/download/xml/
/article/29123/download/xml/
/article/28479/download/xml/
/article/28364/download/xml/
/article/28131/download/xml/
/article/28158/download/xml/