Workshop on Cyberinfrastructure and Machine Learning for Digital Libraries and Archives

June 3, 2018  |  Fort Worth, Texas

In conjunction with Joint Conference on Digital Libraries 2018


Thank you all for attending the workshop. The presentations, as well as your questions and discussions, illustrated the state of the space, suggested areas for improvement, and pointed to new research and practice opportunities. We hope you met new colleagues and enjoyed the sessions. We have added links to the presentations in the Agenda section. You can submit your feedback online at https://goo.gl/forms/Lgpik5k62LZ81IVC3.


Call for Special Issue

As we announced at the workshop, we plan to organize a special issue of the journal Data and Information Management. Each article should focus on a specific research problem and be at least 6,000 words long.

The manuscript format can be seen at: https://content.sciendo.com/supplemental/1668101184/journals/dim/dim-overview.xml

Important dates:
Submission Deadline: July 31
Notification of Acceptance: August 31
Final Version: September 30

If you are interested, please let Dr. Dan Wu (woodan@whu.edu.cn) know at your earliest convenience (before June 17).


About The Workshop

Academic libraries and archives have made significant progress in accommodating data in their operations by implementing data management consulting services and repositories for final and relatively small datasets. However, providing scalable data management support and services remains challenging, especially for large volumes of data or at large research institutions. There is an urgent need for further research and implementation of automated methods to describe, represent, preserve, and facilitate prompt and efficient access to, and reuse of, large-scale scholarly data. This workshop introduces a triptych model to address these challenges.

The model above illustrates the interactions between digital libraries and archives, cyberinfrastructure, and machine learning methods and tools.

As collecting institutions aggregate more and larger digital content, the processes of curating, preserving, and making that content accessible require automation and scalability. Cyberinfrastructure refers to shared online research environments, backed up by advanced computing resources, hosted in data centers, and supported by experts. Coupled with cyberinfrastructure, machine learning methods and tools can provide digital libraries and archives with powerful resources to enhance their ability to represent, keep, and provide persistent access to collections, thus facilitating reuse. In turn, cyberinfrastructure and the projects that make use of it benefit from adopters within libraries and archives, who provide grounding in best practices and standards for data curation, discoverability, and integrity. To explore these topics, we invite researchers and practitioners from the cyberinfrastructure, digital libraries, archives, and machine learning fields, as well as domain experts, to share ideas, introduce theory and research methods, and share examples of best practices. The workshop will include keynote speakers, peer-reviewed papers, and a panel discussion.

Workshop Agenda 

The workshop introduces a triptych model that connects digital libraries and archives, cyberinfrastructure, and machine learning to stimulate research and implementation of automated methods to describe, represent, preserve, and facilitate access and reuse of large-scale scholarly data. Cyberinfrastructure refers to shared online research environments, backed up by advanced computing resources and supported by experts. Coupled with cyberinfrastructure, machine learning methods and tools can provide digital libraries and archives with powerful resources to enhance their ability to curate, organize, represent, and provide persistent access to large-scale collections, thus facilitating their discoverability and reuse.

The papers and activities address the combination of cyberinfrastructure and machine learning throughout the lifecycle of digital collections: from data management planning and requirements gathering to description, preservation, access, and publication. Bring your laptop for a day of lectures by remarkable researchers and for exciting hands-on exercises.

Time | Topic | Speaker
9:00-9:30 | Welcome and overview of the workshop's agenda | Conference Chairs
9:30-10:00 | Expanding Library Capacity and Facilitating Reuse through a Consortial Data Repository | Matthew McEniry, Jessica Trelogan, and Santi Thompson
10:00-10:30 | Coffee Break
10:30-11:00 | Audiovisual Metadata Platform Planning Project | Tanya Clement, Jon Dunn, Juliet Hardesty, Chris Lacinak and Amy Rudersdorf
11:00-11:30 | Petabytes in Practice: Working with Collections as Data at Scale | Will Thomas, Benjamin Galewsky, Gregory Jansen, Sandeep Satheesan, Richard Marciano, Shannon Bradley, Jong Lee, Luigi Marini and Kenton McHenry
11:30-12:00 | Hands-on tutorial: A Method for Modeling Large-scale Data Requirements to Cyberinfrastructure and Machine Learning* | Maria Esteva
12:00-1:30 | Lunch Break
1:30-2:00 | What can Machine Learning and Crowdsourcing do for you? Exploring New Tools for Scalable Data Processing | Matt Lease
2:00-2:30 | Data Capsule Appliance for Restricted Data in Libraries | Sachith Withana, Inna Kouper, and Beth Plale
2:30-3:00 | Hands-on tutorial: Machine Learning on Cyberinfrastructure** | Ruizhu Huang
3:00-3:30 | Coffee Break
3:30-4:00 | Predicting Library OPAC Users' Cross-device Transitions | Dan Wu and Shaobo Liang
4:00-4:30 | Improve Accessibility of Biology Papers through Integration of Domain Information Extraction in the Publication Pipeline | Amit Gupta, Pankaj Jaiswal, Crispin Taylor, and Weijia Xu
4:30-5:00 | Closing Discussion

*After an introduction to how data modeling was used in the design of the Digital Rocks Portal (https://www.digitalrocksportal.org), attendees working in multi-disciplinary groups will model large-scale data use cases, including analysis, curation, access, and publication functions, and will map those to cyberinfrastructure. You are welcome to share large-scale data curation and analysis cases to discuss and resolve during the workshop.

**Attendees will have a chance to learn how to log on to a supercomputer and start an interactive session using a big data cluster to explore how it can be used for a machine learning project.

Call for Participation

We are soliciting presentations in the following research areas:

  • Best practices for using open science cyberinfrastructure for digital libraries and archives.
  • Models and methods to improve large-scale data access and reuse, including issues of data understandability and representation of complex datasets (e.g., derived from large-scale simulations, experiments, and observational research projects).
  • Machine learning methods using linked data models and ontology applications for digital libraries and archives.
  • Challenges and solutions in curating datasets beyond textual content (e.g., video, volumetric images, genomics/bio data, architectural drawings, point clouds, GIS, satellite imagery, etc.).
  • Automated methods for managing and preserving scientific data collections of diverse formats.
  • Machine learning methods for improving collections accessibility and reuse.
  • Large-scale metadata generation and management for integration and interoperability of scholarly data.
  • Systems design and implementation, including data analysis/ for digital collections services in cloud computing environments.
  • Theory and models for integration of analysis and curation tasks for evolving scientific data collections using cyberinfrastructure.

Important Dates

Abstract Submission Deadline: May 7, 2018
Final Version (for presentation): May 31, 2018

Workshop Registration

Workshop attendees should register for the workshop through the JCDL'18 conference registration system: https://2018.jcdl.org/registration

ACM/IEEE Members: $105 (by May 3) / $155
Non-members: $155 (by May 3) / $175
Students: $25 (by May 3) / $75

Submission Instructions

We accept extended abstracts of a minimum of 2 pages. Abstracts should be submitted as PDFs in the standard ACM conference format available here: https://www.acm.org/publications/proceedings-template

After the workshop, presenters will have the opportunity to revise their submissions based on the feedback they receive and to publish them in one of the following:

As a reference, the latest version of the bulletin can be accessed here: http://www.ieee-tcdl.org/Bulletin/current/index.html

Submission link: https://easychair.org/conferences/?conf=cmd18

Workshop Program Committee

Maria Esteva, Texas Advanced Computing Center

Weijia Xu, Texas Advanced Computing Center

Jessica Trelogan, University of Texas Libraries

Ashley Adair, University of Texas Libraries

Richard Marciano, University of Maryland

Mark Hedges, King's College

Dan Wu, Wuhan University

Location Details

We will meet at 8:45 at:
EAD 514
Carl E. Everett Education & Administration
Camp Bowie Blvd, Fort Worth, TX 76107
https://goo.gl/maps/vFMWuuxYojN2



Workshop Descriptions

Matthew McEniry, Jessica Trelogan, and Santi Thompson 

Expanding Library Capacity and Facilitating Reuse through a Consortial Data Repository

The Texas Data Repository (TDR, https://thetexasdatarepository.org/), hosted by the Texas Digital Library (http://tdl.org), was launched in late 2016 as a platform for publishing and archiving research data produced by the faculty, staff, and students of higher education institutions in Texas. TDR has since been adopted by 11 members, representing institutions of widely different sizes and means. This paper will discuss further work to improve member institutions' capacity for big data curation, likely by leveraging partnerships with high-performance computing centers like the Texas Advanced Computing Center (http://tacc.utexas.edu) at the University of Texas at Austin. In the model proposed by this workshop, digital library consortia like the TDR have much to contribute in terms of pooled expertise on standards and best practice within research communities. They are also uniquely positioned to directly support and empower researchers themselves through advocacy, training, outreach, and the creation of self-help resources. This paper will offer a reflection on the first year of TDR from two of the larger member institutions, the University of Texas at Austin and Texas Tech University, and discuss their respective approaches to expanding capacity for big data curation.



Tanya Clement, Jon Dunn, Juliet Hardesty, Chris Lacinak and Amy Rudersdorf 

Audiovisual Metadata Platform Planning Project

This paper describes a planning project for the design and development of an audiovisual metadata platform (AMP). The proposed platform will perform mass description of audiovisual content utilizing automated mechanisms linked together with human labor in a recursive and reflexive workflow to generate and manage metadata at scale for libraries and archives. The partners leading this planning project are the Indiana University (IU) Libraries, University of Texas at Austin (UT) School of Information, and AVP.



Will Thomas, Benjamin Galewsky, Gregory Jansen, Sandeep Satheesan, Richard Marciano, Shannon Bradley, Jong Lee, Luigi Marini and Kenton McHenry 

Petabytes in Practice: Working with Collections as Data at Scale

New modes of data-driven research require large quantities of data. Responsible stewardship of that data, as well as of the ever-growing digitized and born-digital collections of records and cultural data, requires a robust practice of computational archival science (CAS) to facilitate computer-assisted access, management, and preservation of digital assets at scale.

By placing composable, scalable technologies into the hands of archivists, we can extend their human intelligence with powerful decision support and processing, extending their gaze and diligence into new dimensions of data and complexity, just as the microscope and telescope extend the eyes into otherwise inaccessible realms. We define a reference technology platform and outline a hands-on program using it to train archivists and information scientists in computational methods and thinking. In the sections that follow, we describe the components of the reference platform and discuss the more broadly applicable skills that archivists working at scale must acquire through hands-on practice.



Maria Esteva 

Hands-on tutorial: A Method for Modeling Large-scale Data Requirements to Cyberinfrastructure and Machine Learning

After an introduction to how data modeling was used in the design of the Digital Rocks Portal (https://www.digitalrocksportal.org), attendees working in multi-disciplinary groups will model large-scale data use cases, including analysis, curation, access, and publication functions, and will map those to cyberinfrastructure. You are welcome to share large-scale data curation and analysis cases to discuss and resolve during the workshop.



Matt Lease 

What can Machine Learning and Crowdsourcing do for you? Exploring New Tools for Scalable Data Processing

Global growth in Internet access is driving a renaissance in human computation: the use of people rather than computers to complete data processing tasks for which human abilities continue to exceed those of state-of-the-art machine learning (e.g., annotating text or images). As crowd computing expands traditional accuracy-time-cost tradeoffs relative to purely automated approaches, this has begun to change how we design and implement our computing systems.

Beyond collecting annotated data from crowds to train machine learning systems, we are increasingly seeing a new form of hybrid, socio-computational system emerge, which harnesses crowd labor in concert with machine learning at run-time to better tackle difficult processing tasks. As such, we find ourselves today in an exciting new design space, where the potential capabilities of tomorrow's computing systems are seemingly limited only by our imagination and creativity in designing new algorithms to compute with crowds as well as machine learning. In this talk, I will discuss recent research work in performing data curation and language processing tasks using machine learning, crowds, and their combination.



Sachith Withana, Inna Kouper, and Beth Plale 

Data Capsule Appliance for Restricted Data in Libraries

With the tremendous increase in the volume of digitized content, research libraries have growing numbers of digital collections with access sensitivities. Our project extends an existing remote VM service in response to library needs. The first step, to package the service as an appliance, is carried out in the context of participatory design. We discuss lessons learned and future work.



Ruizhu Huang 

Hands-on tutorial: Machine Learning on Cyberinfrastructure

Attendees will have a chance to learn how to log on to a supercomputer and start an interactive session using a big data cluster to explore how it can be used for a machine learning project. Bring your laptops!



Dan Wu and Shaobo Liang 

Predicting Library OPAC Users' Cross-device Transitions

With the increasing ownership of smart devices, such as iPads and smartphones, users can access library OPAC (Online Public Access Catalogue) services through multiple devices in different contexts to meet their needs. This leads to transitions between devices, which reflect users' continuing information needs.

This paper studies and then predicts users' cross-device transition behavior when they use a library OPAC system, in order to integrate information behaviors from multiple devices and give users search recommendations. To predict users' next activities and the next device they might use after a device transition, we capture behavior features from different perspectives and analyze feature importance. Our study examines cross-device transition prediction in library OPACs, which can help libraries provide smart services for users who access a library OPAC on different devices in different situations.



Amit Gupta, Pankaj Jaiswal, Crispin Taylor, and Weijia Xu 

Improve Accessibility of Biology Papers through Integration of Domain Information Extraction in the Publication Pipeline

We present an ongoing project on extracting domain information from scientific publications. The extracted results can be added at the end of a publication to increase its accessibility. The project implements text mining methods for entity extraction and utilizes cyberinfrastructure for online processing and service support. We will also describe the integration of this service with the existing publication pipeline at the American Society of Plant Biologists and report the initial feedback from publishers and authors.


 

Workshop Objectives
  • Bring together data curators, librarians and archivists, researchers, and computational scientists to share practical experiences at the intersection of digital libraries, machine learning, and cyberinfrastructure.
  • Promote the usage of cyberinfrastructure within the digital library and archives community.
  • Advocate adoption of best data curation practices within the computational research and cyberinfrastructure communities.
  • Discuss future opportunities and forge a dedicated community.