CNI Fall 2012 Meeting


Website: Fall 2012 Meeting
Twitter: #cni12f

Disclaimer 1: These notes provide my perception and understanding of the CNI conference presentations. It is possible that errors were made in the reporting or that speaker statements have been inadvertently misrepresented.

Disclaimer 2: I'm writing almost entirely from my conference notes. There are gaps in my reporting, and some sections are scattered and rough. For more information on a project, see the "CNI project page" (linked in each section).

Jump to:

  1. DataRes Symposium - Fall 2012 CNI Pre-Conference
  2. New Member Session
  3. Opening Plenary
  4. Digital Preservation Network Update
  5. Developing a Customized, Extensible Application for Digital Collections
  6. Open Annotation Update
  7. Building An Archival Identity Management Network
  8. Debunking Myths And Establishing Guidelines For The ETD Lifecycle
  9. HathiTrust Research Center
  10. Closing Plenary

DataRes Symposium - Fall 2012 CNI Pre-Conference

Website: Data Management at UNT
Twitter: #drs12

Panel 1:
Shannon Stark, DataRes project
Martin Halbert, iCAMP
Spencer Keralis, Denton Declaration

This panel discussed two research projects (DataRes and iCAMP) that are exploring responses to a) the exponentially increasing quantity of data produced by research institutions, and b) "open research data" mandates from grant-distributing agencies like the NSF, NIH, and NEH. It also covered the Denton Declaration, a "manifesto" in support of open access formulated at the May 2012 UNT Symposium on Open Access. The two research projects are closely related (iCAMP is largely centered at the UNT College of Information and DataRes at UNT Libraries), and both are funded by grants from IMLS.

The DataRes project (PI: Halbert) is concerned with identifying how institutions have responded and are responding to open data mandates (for example, whether policies have been formulated by university administration), and what expectations stakeholders have regarding guidance on the issue.

Stark and the DataRes researchers looked first at the agencies' guidance to researchers. The first step in this process was some very basic textual analysis of the language used by these agencies in describing open access policies. Word frequency statistics showed that the NIH was focused on "data", "sharing", and "privacy", the NEH on "data" and "management", and that the NSF had no clear "top words", which Stark attributed to a) the variety of disciplines involved and b) an intentional vagueness forcing researchers to make their own decisions.
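For illustration, the kind of basic word-frequency analysis described here can be sketched in a few lines of Python; the policy excerpt and stopword list below are placeholders, not the DataRes team's actual corpus or method.

```python
import re
from collections import Counter

# Placeholder policy text; the actual analysis used the agencies' published guidance.
policy_text = """
Investigators are expected to share with other researchers the primary data
created or gathered in the course of work under the grant...
"""

STOPWORDS = {"the", "of", "in", "and", "to", "with", "under", "or", "are", "other"}

def top_words(text, n=10):
    """Return the n most frequent non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

print(top_words(policy_text))
```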

The second step was a "policy scan", which involved looking for a formal response from institutions to these agencies' guidelines. They found that only 18% had institutional-level policies, and that the language used was often vague (e.g., you should probably make your data available).

The third step was to administer a questionnaire survey to gauge faculty and librarian viewpoints (231 respondents, with librarians possibly over-represented). The survey produced some interesting results; in response to the question

Does your institution have a policy regarding sharing research data?

72% chose "No", 19% "I don't know", and only 9% "Yes".

They also found that "the library" and/or the "office of research" were the preferred centers of support for data sharing. In institutions with data sharing guidelines in place, these entities, along with IT departments, were the ones supporting data sharing.

The key takeaways from their research included the following:

  • Institutional-level policies are not, and cannot be, the first response to open data mandates. Creating policies naturally takes time, so grassroots efforts (often coming from the library) should start first.
  • Research data management issues are inherently cross-departmental, and solutions within an institution should be developed collaboratively.
  • Collaboration, domain knowledge, and infrastructure are keys to success.
  • At some point, guidance (e.g., to faculty) is going to need to be prescriptive (e.g., you *need* to take these mandates seriously).

The DataRes researchers are still evaluating survey responses and are planning a new version targeted more directly to faculty. Their full report will be available in March 2013. They are planning a professional development curriculum for liaisons based on their findings.

Questions (selected and paraphrased):

Have the researchers noticed any cases where grant proposals had been declined or not renewed because of a refusal to share research data?

The panel noted that there had been some reports of this happening, but these were primarily anecdotal, although a declined grant proposal/renewal in the behavioral science department at Tulane seemed directly related to a data management failure.


iCAMP (Curate, Archive, Manage, and Preserve) is a research project (PI: William Moen) that asks how LIS education can be changed/enhanced to prepare information professionals and researchers for managing research data.

There are too few people and too few resources to manage academic data, as several reports have identified.

Individuals generate data, and institutions manage it. The question is how to bridge the gap between current capabilities for managing data and what is necessary for doing so (and the gap is continually widening). Data loss is occurring on a grand scale, and this loss is inevitable. A solution is not currently possible, but data managers can take steps towards a solution (you can't cross the Grand Canyon in a day; iCAMP involves taking steps and "camping" along the way).

There are four roles related to data: creator, scientist, manager, and librarian (source). Scientists and other researchers are (in some ways first and foremost) "data creators" although they are not often perceived as such.

The 2010 ARL E-Science and Data Support Services study has shown that 87% of data management at universities has been collaborative, and that libraries are starting to respond to different needs involved in data management, although an LIS degree is still often required for data professionals.

There are already several projects dealing with data management in the LIS curriculum.

The iCAMP project is a three-year capacity-building project, paired with DataRes and functioning as a collaboration between the UNT College of Information and UNT Libraries. Moen and his researchers are developing four courses on digital curation and data management:

  1. Digital Curation and Data Management Fundamentals (overview)
  2. Tools, Apps, and Infrastructure for Digital Curation (technology)
  3. Managing and Preserving Digital Data and Information (people and management skills)
  4. Digital Curation and Data Management Seminar (capstone)

The educational program will be modeled on the concept of a "teaching library", where, as in a "teaching hospital", students will be able to practice the profession in a real-world environment. The program is targeted to graduate students in LIS, post-master's information professionals, and other graduate students (non-LIS disciplines). These graduate students will have the opportunity to earn a certificate in "data management". The end goal is increased student preparedness and an LIS education more aligned with current needs.

The curriculum is focused on competency--specifically, on competencies sought in actual job postings. To this end, Moen et al. analyzed 110 job postings in the field of data management, and they found that 85% required or preferred an ALA-accredited degree, 28% required or preferred another graduate degree, and only 1% required a PhD. Key skills and qualifications found in these postings included:

  • comfort working in an IT intensive environment
  • familiarity with specifications and standards
  • project management skills
  • research skills

The program will use a Drupal/Islandora/Fedora installation as a "virtual lab"/sandbox for the technology environment.

Questions and Comments (selected and paraphrased):

To what extent will libraries/librarians be involved in data management?

Libraries will need to adapt and hire outside LIS programs if LIS education cannot keep up with current needs.

Will LIS programs be agile enough to keep up with technology?

But Jeff Trzeciak's position--he noted (as Chief Librarian) that he wouldn't (or would be reluctant to) hire another librarian with an LIS degree--seems to go too far.

Current efforts (including this LIS program) seem to be too little, too slow. The Grand Canyon does need to be crossed in one day. How do we get together and support data without getting replaced?

Some research has shown that 80% of current LIS syllabi that list "data curation" or "data management" as course topics are lying; they are just using the buzzwords and do not cover the topics in any real fashion.


The authors of the "Denton Declaration" are a diverse set of individuals who met at the May 2012 Open Access Symposium. They chose to write a "manifesto" to reflect the importance of the movement and because manifestos are often written at moments of crisis.

The manifesto has at its core a set of 13 principles. Their key claim is that "Open access to research data is critical for advancing science, scholarship, and society".

The manifesto is a polemic document inviting response, and the signatories are hoping that others will engage in discussion over the document.

Questions and Comments (selected and paraphrased):

What is the value of libraries' participation in data management? Are libraries adding value? How are they adding value?

Libraries seem to be a natural place for cross-departmental collaboration and having a 3rd party involved helps survivability of data.

If libraries can't step up and fill this role, they should get out of the way.

Libraries' value is having the foresight to save something that they will only be thanked for in hindsight.

The current culture embraces the value of saving data (which is good), but not so much the value of sharing it (where libraries come in).

IT departments can provide space for 'stashing' your data, but they often don't have mechanisms for discoverability.

How do grant agencies or institutions ensure that researchers fulfill data sharing mandates?

A modest proposal: at the end of the project, a URL has to be provided for data and results. This URL will be checked (automatically) on a recurring basis.
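A recurring URL check of the kind proposed could be scripted simply; the URL list below is hypothetical, and the sketch only tests that each address still answers.

```python
import urllib.request
import urllib.error

# Hypothetical list of data/result URLs reported at the end of funded projects.
DATA_URLS = [
    "https://example.edu/project-42/data",
    "https://example.edu/project-42/results",
]

def check_url(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

for url in DATA_URLS:
    status = "OK" if check_url(url) else "MISSING"
    print(f"{status}\t{url}")
```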

Libraries need to assist with data management problems to show their continuing value/relevance. Doing so in a satisfactory manner will also create faculty advocates, which will help with libraries' PR problem. Librarians also need to take their conversations at CNI and ARL back to their institutions and/or scholarly conferences.

New Member Session

Clifford Lynch

In this session, CNI executive director Clifford Lynch provided some background on the organization and identified a snapshot of current priorities.

CNI is now all grown up. It was founded in 1990 by the ARL and the precursor to EDUCAUSE. In those days, when technology meant primarily the supercomputer and email, connectivity was just starting to become a priority, but its importance was recognized. The campus CIO and head librarian were often involved in similar pursuits, but there was no real cooperation between these fields. CNI was developed as a place where these different fields could come together to collaborate.

CNI played an important role in some "web" milestones, including just getting cultural heritage materials online and in developing authentication protocols and procedures.

CNI has member organizations (primarily but not exclusively academic research libraries) across the US, Canada, and Europe; it is not a worldwide organization. It is organizationally and financially small, and does not have many operational services. CNI does not accept grants (to facilitate conversation with funding agencies). Lynch himself reports to the ARL and EDUCAUSE steering committees.

CNI has two annual conferences and occasional topical events. CNI assists with advocacy and representation for member organizations (for example, by meeting with faculty about the effect of digital publishing), and CNI administrators serve on advisory committees for academic organizations.

Perhaps the organization's most important job is that of facilitating information exchange among member organizations (e.g., did you know so-and-so is doing something very similar at their organization?).

The organization's priorities relate to content, organizations and institutions, and standards. Regarding content, current priorities are digital preservation, digital humanities research, institutional repositories, and metadata best practices. Included here are discussions of how scholarly communication is changing (moving away from the simple PDF/"print journal online" model), and of the ramifications of e-books and the disappearance of monographs.

Regarding institutions/organizations, priorities are exploring how students are using technologies in learning (are they amateurs or experts?), the role of technology in content creation, and mobile devices. The "Executive Roundtable" event at CNI conferences relates to institutional level developments (member organizations discuss what is going on at their institutions). The "Executive Roundtable" for this conference covered the topic of "New Platforms for Scholarly Publishing".

Regarding standards, priorities relate to infrastructure: tracking identity management and privacy issues, especially "scholarly identity" (see the "Transforming Archival Practice" session).

Opening Plenary

Clifford Lynch

Lynch's opening plenary began with a recognition of a transition: Charles Lowry is retiring as Executive Director of ARL in December 2012 and will be replaced by Elliott Shore. New members were recognized and a few conference-related announcements were made.

The year in review--much has happened in 2012. Research is changing, and interconnectedness is becoming more and more important. This year has really seen the emergence of Massive Open Online Courses (MOOCs--an unfortunate acronym), especially Coursera (covered in a recent EDUCAUSE meeting). The meeting's sessions on MOOCs cover the institutional layout of such courses, not simply a single professor's experience with them. These MOOCs have shown that they are effective and valuable, at least in some fields. Although there are massive dropout percentages, there are still vast numbers that complete and succeed in MOOCs. How is success measured?

MOOCs have developed out of the practice of video capture of lectures, and are the next step in this level of sharing. Even more in the future, MOOCs will be assembled from multiple sources (unlike the current model) by various kinds of institutions. MOOCs may be better suited to some subjects, and whether they work at all for humanities remains to be seen. If MOOCs are successful in the humanities, academic economies will change dramatically (since graduate students in humanities often work as TAs).

One indicator of whether a subject will work is whether machine grading can be successful for that subject. Teaching computer programming is similar to teaching writing in this respect, and has seen some success: tools have been developed to automatically evaluate code (and other writing).

MOOCs are also producing massive amounts of data, which is incredibly valuable in understanding learning and in improving online learning. But who owns/controls this data? Who can access it? Can students add their online coursework to their academic record? Google has started to get involved, offering a course on how to "search good".

E-textbooks are another interesting development in the academic world. E-textbooks introduce many licensing issues (with many exciting possibilities), and promise to remap relationships among bookstores, faculty, publishers, and students. The IT community in general has taken the lead on e-textbooks.

E-textbooks also raise interesting and sometimes troubling issues; for example, one e-textbook platform promises to let teachers know whether students have in fact done their reading (which again raises questions regarding students' rights).

Moving to the "macro level": the career prospects for STEM graduates are very much up in the air. Scientists in general are under a great deal of pressure (reproducibility of results is becoming more and more important). The rise of PLOS ONE (and its sheer size) is impressive, and its model of vetting for correctness only, not importance/ranking, may become more popular.

Intellectual property goods and the principles of fair use are topics for debate. Can anyone inherit your e-books?

Math Overflow is an exciting new (note: only kind of new--Math Overflow and the other StackExchange sites have been around for years) scholarly environment, reminiscent of walking down a university corridor to ask a colleague a question.

Research data management is becoming more important, and data re-use is becoming a priority for researchers (though IRBs have traditionally been against re-use). Academic mobile development is another interesting intersection of consumer and higher education worlds. Identity tracking and open annotation methods (ability to annotate across silos) are exciting areas.

Disaster recovery procedures unfortunately had to be revisited this year in response to Hurricane Sandy. Those affected will no doubt share their experience and insights with the community. Too often, disaster planning is just backing up data, but there are physical structures and many layers of physical connectivity that are affected.

Digital Preservation Network Update

CNI project page
DPN

James L. Hilton
Steven Morales

Note: I didn't do my research in choosing to attend this session: I assumed (from the title and a scan of the project briefings) that this project was related to the recent Digital Preservation Outreach & Education programs from the Library of Congress. It's not: the Digital Preservation Network (DPN) is a federation/service for replicating and preserving digital content across repositories like HathiTrust and APTrust, with a focus on sustainability, longevity, and succession capabilities.

In the current model, institutions are aggregating content with a focus on access and some preservation intent, but these efforts are often distributed across the different content areas and are not really functioning at the correct scale.

By replicating content across repositories, the DPN ensures against a single point of failure. In this way it is like Internet2, and in fact the landscape for Digital Preservation currently looks similar to the landscape for networking back when Internet2 was getting started. In this landscape:

  • there are many layers to the problem
  • there are lots of one-off solutions
  • there are cost advantages to scaled solutions
  • problems need to be solved together
  • solutions should be able to move up the preservation stack

DPN is not a software project. The DPN infrastructure will audit and verify content to ensure proper succession. DPN will identify and use the best storage providers over time.
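Content auditing in a replicated network generally comes down to comparing fixity values across copies. The sketch below illustrates that general idea with SHA-256 checksums over hypothetical local paths; it is not DPN's actual audit protocol.

```python
import hashlib
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(replicas):
    """Compare checksums of the same object stored at several replica paths."""
    checksums = {path: sha256(path) for path in replicas}
    ok = len(set(checksums.values())) == 1
    return ok, checksums

# Hypothetical local paths standing in for copies held at different nodes.
ok, sums = audit([Path("nodeA/object.tar"), Path("nodeB/object.tar")])
print("fixity match" if ok else "fixity MISMATCH", sums)
```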

DPN currently has 57 charter members. The technical working group is now up and running and is testing connections, and the succession rights working group will soon start functioning. Next steps include proving, especially to researchers, that DPN is necessary for properly preserving research data.

Proving this value is especially important since DPN administrators imagine deposit fees being paid directly from grant funds.

Questions and Comments (selected and paraphrased):

How does one get researchers/scientists to recognize the value of DPN? Most researchers do not care about long-term storage, let alone preservation level storage.

Yes, this is a problem. Open access data mandates will help.

What is at risk? What are you preserving?

DPN replaces your tapes in Iron Mountain. It provides failover access to an institution's data and metadata. It is not intended for digital curation (although web access to content is possible in an emergency).

Where is the data? What copyright code is in place?

DPN is currently functioning only in the US to simplify copyright concerns.

What is DPN aiming for in the long haul?

Preservation standardization--automatic translation from one format to another over time; the capture of academic discovery.

What are costs of DPN membership?

A $20,000 membership fee plus associated service or deposit fees (the financial process has not been finalized yet). DPN membership does not replace membership in HathiTrust, APTrust, or Chronopolis.

What is the relationship between the DPN and the Research Data Alliance?

Nothing formal--organization administrators have talked and will be meeting together next year to discuss partnership potential.

DPN has not kept its charter members in the loop in terms of progress, and there is a feeling that they are not "at the table".

DPN administrators apologize for the lack of progress updates; member institutions will be contacted early next year in a number of conference calls.

Developing a Customized, Extensible Application for Digital Collections

CNI project page
Breuer Collection
Plastics Collection

Suzanne E. Thorin
Sean Quimby
Jeremy Morgan

This application was developed as part of the Breuer project--an NEH Preservation and Access grant-funded project to digitize over 30,000 items from Marcel Breuer's collections (including furniture, private residences, and other objects). The project has been a model for international collaboration, with institutions from across Europe participating.

Design for the Breuer collection website was outsourced. Digitization and metadata creation were performed by a team of professionals, including an architecture expert to provide domain context. In developing the application, they realized the value of an XML-based platform and made the decision to move all of their digital collections (in ContentDM) to the new platform.

The Breuer Collection was a success and a donor expressed interest in having something similar done for the Plastics Collection. Development followed similar steps and a curator familiar with plastics was hired. They established an Intellectual Property policy framework, complete with takedown request forms, rights/license metadata, and so on.

In migrating from ContentDM to their custom (XTF-based) platform, the team took the following steps:

  • identify restricted items
  • normalize metadata
  • convert structural metadata to METS
  • convert descriptive metadata to MODS
  • convert metadata about identities (people, organizations) to EAC

The base XML files (records) for XTF are created with a METS database application, custom built in PHP. For image viewing they use Djatoka, although they are planning to move to IIPImage (a process that will involve converting JP2K files to TIFFs). They are working on implementing an unobtrusive watermark for their digital images. XTF provides the collection front-end (transforming record or search-result XML via XSLT), and also exposes OAI-PMH metadata for harvesting into the library catalog and their Summon discovery service.
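To illustrate the OAI-PMH harvesting step, here is a minimal harvester sketch; the base URL is a placeholder, and a real harvester would also handle resumption tokens and errors.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint; XTF exposes a comparable OAI-PMH interface.
BASE_URL = "https://library.example.edu/xtf/oai"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

def list_records(base_url, metadata_prefix="oai_dc"):
    """Fetch one page of ListRecords and yield (identifier, title) pairs."""
    url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for record in tree.iterfind(".//oai:record", NS):
        identifier = record.findtext(".//oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        yield identifier, title

for identifier, title in list_records(BASE_URL):
    print(identifier, title)
```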

They are confident that this application and digital asset model are a good match for medium-sized digital collections. Their METS DB application will be released as open source software in the future.

Open Annotation Update: OAC Experiment Results And Ongoing Work Of The W3C OA Community Group

CNI project page
Core Data Model

Timothy Cole
Paolo Ciccarese

In this session, Cole and Ciccarese discussed the results of the Open Annotation Collaboration experiment and described the ongoing work of the W3C Open Annotation Community Group. Annotations in this context are digital marginalia, often simple textual notes (for personal or public use). Levels of sharing can be specified in the metadata, but controls and mechanisms for publishing and limiting annotations are not in the scope of the OA standard.

The W3C OAC group is engaged in the following:

  • developing the basic open annotation data model
  • developing use cases, both simple and complex
  • testing and assisting development of annotation clients

They are working towards a common RDF based specification for annotating digital resources, and are focused on scholarly use cases and interoperability--sharing across clients and web services.

Open annotations are persistent web resources. The simplest annotation consists of two parts (RDF triples): Annotation oa:hasBody URI (e.g., text about the target), and Annotation oa:hasTarget URI (e.g., a target web page).

However, common web document models are often too simple: URIs don't readily allow specification of arbitrary chunks of data on a webpage. For example, in the use case of annotating for OCR text correction, annotations must use SVG overlay selectors or text offsets to specify a discrete chunk of text (i.e., the piece that needs correction).
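A minimal sketch of the two-triple annotation described above, using rdflib and assuming the http://www.w3.org/ns/oa# namespace (the namespace adopted in later drafts of the model); the annotation, body, and target URIs are placeholders. In the fuller model, a selector (e.g., an SVG overlay or text offset) would be attached to the target rather than pointing at the page URI directly.

```python
from rdflib import Graph, Namespace, URIRef, RDF

# Assumed namespace for the Open Annotation vocabulary.
OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)

# Placeholder URIs for the annotation, its body, and its target.
anno = URIRef("http://example.org/anno/1")
body = URIRef("http://example.org/notes/1")       # e.g., text about the target
target = URIRef("http://example.org/page.html")   # e.g., the annotated web page

g.add((anno, RDF.type, OA.Annotation))
g.add((anno, OA.hasBody, body))
g.add((anno, OA.hasTarget, target))

print(g.serialize(format="turtle"))
```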

Annotation can be incredibly useful in biomedicine in the following ways:

  • by enabling the unique identification of subjects (across synonyms or changes over time)
  • by tapping into ontologies
  • by creating maps of knowledge (related subjects or bibliographic/citation connections)

The Annotation Ontology and Open Annotation Collaboration groups merged to form the W3C group, which is remarkable considering their diverse needs and the typical independence/reluctance to collaborate of the related fields.

The group has a public mailing list, a specifications page, and a wiki. The stable 1.0 release of their specification will be available by March or April.

Several projects are developing open annotation clients or are otherwise engaged in open annotation development.

Building An Archival Identity Management Network: Transforming Archival Practice And Historical Research

CNI project page
Speaker Deck

Daniel Pitti
Brian Tingle
Clifford Lynch

This panel described two connected projects: SNAC (Social Networks and Archival Context), which involves creating "a test bed of archival authority records", and Building a National Archival Authorities Infrastructure, which involves forming these records into a maintained resource and program like NACO. Building such a network involves constructing a set of systems (databases, etc.) for the collection of identities (including scholars, scientists, historical figures, and all kinds of organizations), with metadata for names, dates of birth, offices held, unique IDs, and so on.

The projects are funded by grants from the NEH and the Mellon Foundation. The data sources for their network are (or will be):

  • about 150,000 finding aids
  • WorldCat records
  • ULAN and VIAF
  • other data sources like DBPedia

In the first phase, they extracted names from about 30,000 finding aids, searching through name, description, and correspondents fields. In doing so they identified 170,000 names, which were converted to EAC-CPF (Encoded Archival Context for Corporate Bodies, Persons, and Families) format (which is a "sister standard" to EAD).

In the second phase they began merging the names to create unique identities. In the process, they encountered common naming issues (like shared names, multiple names for the same person, and so on); they first removed exact-match duplicates, and then duplicates based on VIAF synonyms. They are still working on automated merging solutions using n-gram based textual analysis and the SHINGLE Language Model for Names. They have found that the key is not to merge too much, and to let humans intervene on questionable records.
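As an illustration of n-gram based matching (not the SHINGLE Language Model itself), character-trigram similarity between two name strings might look like the following; the threshold for sending a pair to human review is arbitrary.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized name string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(name_a, name_b, n=3):
    """Jaccard similarity between the n-gram sets of two names."""
    a, b = ngrams(name_a, n), ngrams(name_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Candidate pairs in a (purely illustrative) middle band go to a human reviewer
# rather than being merged automatically.
score = similarity("Breuer, Marcel, 1902-1981", "Breuer, Marcel")
print(score, "needs human review" if 0.5 < score < 0.95 else "")
```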

The researchers are not just identifying an authoritative name, but a full record and the context of a person's life (or the life of an organization). In determining what should go into an EAC-CPF record, researchers have identified target users: the Researcher, Contributor, Linker, Authority Librarian, and so on. Their archival records are maintained in an XTF framework for searching and browsing. Their website is like "Facebook for dead people" (although individuals do not have to be dead).

On an individual record page is a "related entries" section, which is the core of the project. These include entries for creatorOf and referencedIn relationships and entries for people (the "friend list"). There is also a "Linked Data" section providing access to the graph database and related search and visualization tools.

They have also developed a GreaseMonkey script that provides a "Context" button for identity-ready collections that can find related records in the Library of Congress, Online Archive of California, and so on.

Their site exposes a SPARQL endpoint, but Tingle questioned the value of such linked open data (at least for the end-user researcher), and expressed the feeling that RDF is "sucking up the oxygen".

Future plans include:

  • conducting assessment activities involving the members of their target user bases
  • scaling their interface up to the level of millions of records
  • providing more visualization/visualization tools
  • creating stable URLs (URLs currently break every time they update)
  • providing gateways to crowdsourcing
  • integrating local systems

Building and maintaining resources like this is important: archives are responsible for the curation of records, and records (which document human activity and creativity) are the cornerstone for cultural heritage. The Building a National Archival Authorities Infrastructure program will maintain these records and will be an emerging national cooperative program. A blueprint has been developed and institutions will meet in Fall 2013 to discuss the project. Engaging end-users and getting direction and correction from these users will be important.

The goal for this cooperative includes the following:

  • to provide flexible descriptions of identities
  • to better serve users of archives
  • to answer the following question: "I'm interested in Person X; where do I find all primary sources?"

NARA will host the administration of the resource/cooperative, although the technical infrastructure will be maintained by other institutions. Researchers are meeting with NARA in January 2013.

Questions and Comments (selected and paraphrased):

How much can the data be trusted?

Records will not be over-confident. Where data is vague, the "subjects" section may provide clues to the identity of the person/organization. The right researcher may make connections which can be passed back to the data maintainers.

Why move away from NACO?

Scope is slightly different from NACO. For example, many of the persons in this archival identity framework never wrote a book.

Will the cooperative feed corrections back to VIAF?

There will probably be such lines of communication.

What about living persons in this resource? There are a number of privacy issues, and it is going to be difficult to track identity across web presences (e.g. Twitter handles, etc.).

Debunking Myths And Establishing Guidelines For The ETD Lifecycle

CNI project page
Coverage in The Chronicle

Gail McMillan, Do Open Access ETDs Affect Publishing Opportunities in the Sciences? Findings from the 2012 Survey of Academic Journal Editors
Martin Halbert, Providing the ETDs of Today for the Researchers of Tomorrow
Matt Schultz, Providing the ETDs of Today for the Researchers of Tomorrow

In the first part of this session, McMillan described a survey designed to gauge science journal editors' attitudes about open access ETDs. There are reports and fears circulating in academia about articles being rejected by publishers because they have been "published" previously in digital archives or institutional repositories. However, there is not a lot of recent research or hard data in this area--the last time a similar survey was performed was over ten years ago.

In a preliminary survey, researchers asked recent graduates: if you restricted access to your ETD, why did you do so? A very large segment, 44%, responded that they had been advised to do so by a faculty member. Fewer than 5% replied that they had been advised to do so by a publisher.

Recent data from ProQuest shows that there are takedown requests for only about 0.02% of ETDs in their collection.

The NDLTD encouraged the survey of science journal editors. In determining their sample, they identified top journals in a number of fields. The primary survey question was: For manuscripts that are derived from Open Access ETDs, are they: always welcome, sometimes welcome (determined case by case), welcome only if different, welcome only if limited access, or never welcome?

They selected 290 editors to take part in the survey. There was no incentive for editors to take part; the survey was set up on SurveyMonkey and ran from August 9 to 27, 2012; follow-up calls were made in September. 27.9% of those contacted responded. 60% were at commercial publishers, 25% were members of academic societies, and 69% were editors-in-chief.

For their primary question they found the following responses (n = 81):

  • Always welcome: 55%
  • Case by case: 21%
  • Only if different: 9%
  • Only if limited access: 2%
  • Never: 13%

An earlier survey of Humanities and Social Sciences editors found the following (fairly similar) responses (n = 128):

  • Always welcome: 45%
  • Case by case: 27%
  • Only if different: 14%
  • Only if limited access: 3%
  • Never: 4%

It seems that editors often don't know that article submissions are derived from Open Access ETDs. Students often submit articles before the dissertation is complete.

In the open-ended comments section of the survey, they found that 33% made comments relating to Open Access being okay (usually in a positive context; one noted that they were happy to publish articles that would otherwise languish in an online archive); 21% made comments noting that their organization had no policy in place. Others noted that an online ETD, having not been peer-reviewed, is not considered published ("gray" literature). There were not really any comments from those who responded "Never".

It is possible that those who responded "Never" do not in fact know whether their organization has a policy in place.

Questions and Comments (selected and paraphrased):

Was there a correlation between discipline and the "Never" response?

Didn't really look into that.

It makes sense that students are choosing to embargo their ETDs. Looking at the survey statistics, the reservations are high enough that it is safest and easiest to just embargo the ETD. Of course, one could find out the policies of the prospective publisher, but that is extra work.

A lot of the fears were stirred up by an article in the Chronicle of Higher Education telling students to take down their Open Access ETDs.

It also seems like the humanities are most worried about this issue.

Embargoes reserve for students the right to convert their thesis or dissertation into an article or monograph, but long-term or indefinite embargoes don't make sense. Statistically, if this conversion doesn't take place in the first 3-5 years, it's not going to happen at all.


In the second part of this session, Halbert and Schultz discussed an IMLS-funded project (supported by NDLTD, MetaArchive, and UNT) to ensure that acquired ETDs will be available to future researchers by providing tools and other resources to manage the states/lifecycle of an ETD.

One part of this project is to provide guidance documents to ETD stakeholders (graduate schools, academic librarians, IT, students, and faculty) regarding the following:

  • access levels/embargos
  • ETD copyright issues
  • implementing programs

The researchers are engaging all stakeholders in developing these documents. They hope to provide well-rounded policy options that are descriptive rather than prescriptive and to inform students of their rights. They stress the value of usage metrics in demonstrating value. OA repositories provide long-term management as well as assist with dissemination. Download counts matter, and will become more of an impact factor.

They also provide guidance on metadata and record keeping (collecting metadata on ingest, usually by the author; PREMIS for tracking transitions) and on cost planning (do ETDs go into the institutional repository or their own framework? Should they go into a cooperative system?).

They are also developing software microservices (standalone tools for standalone concerns) as lifecycle curation tools for ETDs; a minimal sketch of this approach follows the list. These involve the following:

  • format recognition
  • PREMIS event tracking
  • virus checking
  • Dropbox-style submission frameworks
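A standalone format-recognition step of the sort listed above could be as small as matching file signatures and recording a PREMIS-style event; the signatures, event fields, and file name below are simplified illustrations, not the project's actual tools.

```python
import datetime

# A few common magic-number signatures (illustrative subset).
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",   # also DOCX/EPUB containers
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def identify_format(path):
    """Guess a file's format from its leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, mime in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return "application/octet-stream"

def premis_style_event(path, outcome):
    """Record a simplified, PREMIS-inspired event for the identification step."""
    return {
        "eventType": "format identification",
        "eventDateTime": datetime.datetime.utcnow().isoformat() + "Z",
        "eventOutcome": outcome,
        "linkingObject": str(path),
    }

# "thesis.pdf" is a hypothetical submission file.
mime = identify_format("thesis.pdf")
print(premis_style_event("thesis.pdf", mime))
```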

Archivematica is an open-source option for preservation.

There will be a full workshop on the ETD lifecycle in 2013.

The HathiTrust Research Center: Opening Up The Elephant For New Knowledge Creation

CNI project page
HTRC homepage

John Unsworth
Beth Sandore Namachchivaya

The HathiTrust Research Center (HTRC) is the "public arm" of the HathiTrust Digital Library, providing access to the library's vast corpus. The HTRC, established July 2011, is engaged in:

  • facilitating terascale text mining
  • developing software tools for the exploration of textual data
  • building a cyber infrastructure for this textual data.

The HT corpus consists of 10.5 million volumes, and data analysis at this scale requires systems like the HTRC: 475 terabytes is too big for desktop research. The computation has to go to the data, rather than vice versa.

The HTRC is being developed with the needs of digital humanities researchers in mind. HTRC administrators invited 22 researchers to participate in discussions about their wants and needs for such a system. They found that the top needs for a typical researcher were:

  • better OCR quality
  • metadata that identifies the OCR quality level
  • metadata about language
  • human sourced metadata creation (including some descriptive) and correction for some areas

The HTRC has two closely-related primary goals:

  • to provide a persistent, sustainable structure to enable scholars to ask and answer new questions
  • to expose HT content within the bounds of copyright and licensing agreements

Ted Underwood at the University of Illinois (for example) has used quantitative and probabilistic models to automatically identify a work as poetry, prose, or non-fiction based on usage levels for words in use before 1150 AD (for example, higher levels of old words are sometimes indicative of poetry).

In another example, researchers can perform topic modeling to identify topics like 'like light wild over' (word groups that recur in various forms), which are more likely to be found in poetry.
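Topic modeling of this kind is commonly done with LDA; a minimal scikit-learn sketch over placeholder page texts (not the HTRC's actual tooling or corpus) looks like this.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents standing in for pages or volumes from the corpus.
pages = [
    "like light wild over the hills the light wild",
    "the committee reported on trade and taxation",
    "over the wild sea the light of the moon",
    "taxation of trade goods was reported to the committee",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(pages)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:4]]
    print(f"topic {i}:", " ".join(top))
```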

But before such analysis can be done, the data has to be cleaned (and sometimes enriched). This process involves:

  • correcting OCR mistakes
  • cleaning vague metadata
  • discarding duplicates
  • adding metadata (about, for example, the gender of author or genre of work; there may be automated methods for generating such metadata)
  • identifying parts of volumes
  • removing running headers

In this process, researchers are adding value and can share their clean data, findings, and tools with others; for example, period lexicons, gazetteers of proper nouns, OCR correction rules for a specific period, document segmentation, FRBRization, clean metadata, and the code to do these things.

The HTRC architecture is a system of web services to enable remote access. Some analysis algorithms are available to researchers (for example, word frequency counts, sort lists, and so on; researchers don't always have to develop their own). The SEASR suite of tools was developed to work with this data. System-supplied algorithms are collocated with the data. The system uses a Solr full-text index and is built on a NoSQL database. They use OpenID for authentication/authorization, and provide a portal front-end and programmatic access. Small jobs can be interactive, but heavier computational use must be written as a batch process.
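Programmatic access to a Solr full-text index generally goes over HTTP; the endpoint, core name, and fields below are placeholders rather than the HTRC's actual API, which sits behind its own authentication and job-submission layers.

```python
import json
import urllib.parse
import urllib.request

# Placeholder Solr endpoint and core; not the HTRC's real services.
SOLR_URL = "https://solr.example.org/solr/corpus/select"

def solr_search(query, rows=5):
    """Run a simple full-text query against a Solr core and return documents."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    with urllib.request.urlopen(f"{SOLR_URL}?{params}") as resp:
        payload = json.load(resp)
    return payload["response"]["docs"]

for doc in solr_search("title:poetry"):
    print(doc.get("id"), doc.get("title"))
```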

In providing analytical access to the corpus, the system must be aware of what is going out to ensure that a work is not reproduced in a form that would violate copyright. The HTRC system has in place a "trust but verify" model for handling requests.

At this point, the HTRC is working out how to engage digital humanists and computational scientists. They are meeting one-on-one with researchers: giving advice, facilitating engagement, and encouraging sharing among users.

Phase 2 is beginning in March 2013; in this phase non-copyrighted materials will be made openly available.

Questions and Comments (selected and paraphrased):

Is Google still involved with HathiTrust?

Yes, Google is still working with HT and continues to feed OCR corrections back to the corpus. Interleaving these corrections with the corrections and metadata from HTRC end-users is going to prove problematic.

How does the user interact with the corpus at this point? How does one get started?

There is not really a formalized process at this point. There is a queue of jobs to be run; adding your job involves communicating with HTRC administrators.

Closing Plenary

Hunter R. Rawlings III

The closing plenary addressed the future of higher education, libraries, and IT. A fundamental question here is: what is college for? The answer is up for grabs; for the speaker, going to college was about getting an education, learning how to learn, and becoming a good citizen. Now, college seems to be about getting a job. There are a number of initiatives focused on vocational education. But this is the responsibility of community colleges; there is nothing wrong with community colleges, and they provide a good service for society, but their model is not a good model for universities.

Virginia now requires schools to provide a breakdown of graduates' salaries 18 months after graduation. This program, along with other similar steps, is reducing education to a number and treating it as a simple commodity. More and more schools will follow these examples. Training for good jobs doesn't need to happen at school; it can happen on the job. In fact, many of the best jobs don't exist yet, so such in-school training is impossible.

The increased numbers of undergraduates in the US from China and other parts of the world is a bright spot in the current academic world.

PubMed is a powerful model for publishing and open access. The NIH will soon begin enforcing open research data mandates. The AAU supports the Federal Research Public Access Act with one qualification: the timeframe for open reporting should be one year, not six months.

The decline of library purchasing of monographs is troubling.

MOOCs are interesting, but they are clearly not there yet. They do not offer degrees; there is a lot of student cheating; and there are low course completion rates. Students learn best when they have to perform in the classroom.

There is a connection to be made between the advent of MOOCs and the invention/adoption of books in Greece. Plato feared that though books had some benefits, they would cause Greeks to lose memory; he also felt that they were deficient because they could not be queried in the same way a person could. Plato himself treated books as a plaything, maximizing their benefits while minimizing costs. MOOCs are not an avenue for deep learning, but they may be useful for light learning or for some subjects.
