This is an abridged version of a final report describing the activities surrounding Phase I of a one-year pilot project called the University of Notre Dame Institutional Digital Repository. After outlining the goals and methods of the project, the report enumerates ways the project could be continued. The seventy-some people who participated in the project are now looking to administrators across the University to become familiar with the contents of the report and set its future course.
The full report, designed for duplex printing and complete with appendices, is available in PDF format at /idr/documents/idr-final-report.pdf [830 KB].
Eric Lease Morgan & Team IDR
University Libraries of Notre Dame
December 18, 2006
Here is the briefest of summaries regarding what we did, what we learned, and where we think future directions should go:
The three-fold purpose of the Institutional Digital Repository (IDR) is closely aligned with the goals of the University. The IDR's three goals are:
With these goals in mind the IDR is defined as a set of digital objects combined with sets of services applied against those objects - think "digital library".
Great content is created here at Notre Dame, ranging from theses & dissertations written by graduate students to essays and posters by excellent undergraduates. It includes the output of institutes that explore and elaborate upon issues pertinent to their fields of study, articles written by faculty as the result of their research and scholarship, and scanned images used to teach art history and other subjects.
By making greater amounts of University scholarly output accessible on the Web, more people with similar interests will be able to find, read, and comment on the good work happening at Notre Dame. Instead of hiding the University's "candle under a bushel basket", the IDR can make content more accessible to the outside world and exploit the globally networked environment in which we live and work.
The more than 1,300 items comprising the current content of the IDR are diverse in subject matter and format. For example, the subjects represented by the IDR include but are not limited to: anthropology, Asian studies, biology & life science, business, economics, engineering, European studies, geology, history, Latin American Studies, Latino Studies, political science, science and technology, social science, sociology, and theology & religion. This content is manifested by more than:
The IDR is searchable through a number of simple or advanced interfaces. It can be browsed by author, department, subject, and format. The IDR is not necessarily intended to be a destination, but rather a platform for syndicating content. Consequently, information contained in the IDR can easily and seamlessly be incorporated into the Web pages of the campus-wide portal, a department's website, or an individual's home page.
The IDR's digital objects can be just about any computer file created at the University for teaching, learning, or research. Types of content that could be included are: working papers, data sets, pre-prints, pictures, movies, sounds, technical reports, conference presentations, etc. Computer formats that could be included are: Word documents, PDF files, JPEG images, tab-delimited text files, MPEG files, LaTeX files, PostScript files, etc.
The services for the IDR would be the same sorts of services expected from any repository that includes search and browse, but also could include other value-added services. Some examples of value-added services are: What's New?, create my vita, show me my Google PageRank, syndicate my content to the campus-wide portal, syndicate my content to departmental Web pages, create content- or subject-specific search and browse interfaces, show me who has looked at my page, show me who links to my page. In short, the IDR is a digital library designed to address a number of the University's priorities and the articulated needs of University students, instructors, and scholars for teaching, learning, and research.
The primary goal of the IDR is to explore and supplement ways to enhance learning, teaching, and research. It attempts to do so by collecting, organizing, archiving, and providing access to, as well as value-added services against, content "born digital" here at Notre Dame.
The balance of this report outlines the current environment of institutional repositories and scholarly communication; what Team IDR did to learn about providing institutional repository services; a number of issues that have come to our attention because of these efforts; and finally, a number of options available to University administrations for taking the IDR to the next step.
People's expectations regarding access to information have significantly changed with the advent of globally networked computers. In combination with sets of nebulously resolved intellectual property rights issues, these expectations have created an environment where some traditional best practices are increasingly seen as dysfunctional and new sets of best practices are yet to be established. The Academe, specifically the scholarly communications process, has not gone untouched in this regard. This section paints a picture of the current environment and how it came to be.
By the early to mid-1990's almost every faculty and staff member in institutions of higher education had computers on their desktops that were connected to the Internet. The typewriter had all but disappeared to be replaced with the word processor. Through the use of internal networks, files were easily duplicated from computer to computer. Scholarly works such as journal articles and monograph manuscripts were sent to publishers via email. Computers allowed people to create exact copies of their "born digital" creations.
At the same time, the scholarly publishing industry was undergoing great changes. Smaller publishing houses were being purchased and subsumed by larger publishers. Academia saw an increasing number of specialized titles. Fewer and fewer journals were being published by institutions of higher education and instead they were being sold to and published by for-profit institutions.
In 1994 Stevan Harnad, then a cognitive scientist at Princeton University, identified this trend and wrote the "Subversive Proposal", suggesting that authors publish their own scholarly works through the use of "public FTP (file transfer protocol)", thereby making them freely available on the Internet.1 This was an influential proposal, and essentially the first articulation of something that would later be called "open access".
By the mid-1990's the scholarly publishing industry had leveraged its increasing monopolies and, combined with a number of other factors, significantly raised prices, especially journal prices. These other factors include: junior faculty mandated to write and to publish in particular journals in order to achieve tenure ("publish or perish"); legal agreements that transfer intellectual property rights from author to publisher; and librarians pressured to maintain broad as well as deep collections. In such an environment for-profit publishers have been able to raise the prices of their products and services by at least 7 percent per year for the past fifteen years. Consequently, libraries commonly pay $2,000/year for a subscription to a scientific journal and $150/year for a humanities journal. As of 2004 scientific books cost around $100 each and humanities books cost about $50 each.
In the late 1990's and beginnings of the 2000's, we saw the birth of the World Wide Web and the "dot-com boom". While the technology behind the Web was created in the very early 1990's, it did not become popular until the mass distribution of the "graphical Web browser" by Netscape, Inc. Companies fueled by venture capitalists sprang up all over the place. Using HTML, people rushed to put their content on the Web. With easy availability of all sorts of content, people spent hours "Web surfing."
Libraries responded in two ways. First, they began creating digital surrogates of some of their more rare and special collections. Second, they created lists of useful Internet resources akin to records in the venerable library catalog. Universities, university departments, and faculty reacted similarly by creating sets of "home pages" describing what they were doing and why it was important. University computing centers offered space for individuals to host their own sites, where online vitas, research agendas, and publication lists were created. Publishers responded by making more of their content available electronically and in full-text forms. Publishers also changed their subscription model. Often now libraries license the ability to view content, but not necessarily to own it.
Probably the most visible by-products of the dot-com boom are Yahoo and Google. Yahoo began its life doing what many libraries were doing, namely creating lists of Internet resources, organizing them into groups, and allowing people to browse the collection. Yahoo is an acronym for "Yet Another Hierarchical Officious Oracle". Google focused its attention on search instead of browse. By crawling the Web, counting the number of times people linked to various websites, indexing the content, and providing a very simple interface for searching the index, Google became the de-facto standard for searching the Web. If there is one thing that changed people's expectations regarding access to information, then that one thing is Google. Enter a few words. Get back a list of hits. Click on an item. Get the content. Fast. Easy. Usually very satisfying.
Academia, especially in Europe, was taking notice of these changes, and in 2002 the phrase "open access" was coined in the Budapest Open Access Initiative. The Initiative, a public statement regarding the dissemination of peer-reviewed scholarly materials, defined "open access" as a kind of free and unrestricted online availability of scholarly materials. From the Initiative:

An old tradition and a new technology have converged to make possible an unprecedented public good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research in scholarly journals without payment, for the sake of inquiry and knowledge. The new technology is the internet. The public good they make possible is the world-wide electronic distribution of the peer-reviewed journal literature and completely free and unrestricted access to it by all scientists, scholars, teachers, students, and other curious minds. Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge.
Since the Initiative was articulated, numerous other statements have been written advocating the importance of open access publishing, including: the Bethesda Statement on Open Access Publishing, the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, and the Declaration On Access To Research Data From Public Funding.
The concept of open access publishing has manifested in two different ways: peer-reviewed journals and repositories. To date the Directory of Open Access Journals lists about 2,400 peer-reviewed, open access journal titles.3 These journals all are "free", scholarly, and cover a wide range of disciplines and subjects. The Public Library of Science and BioOne have become major publishers of open access journals, disseminating a large body of scientific literature.
Open access publishing also is manifested in repositories of two types: institutional and disciplinary. Institutional repositories are collections of materials such as pre-prints, working papers, post-prints, and technical reports that originate from a specific university or college. MIT's repository is a good example. Collections of electronic theses & dissertations are probably the most popular form of repository; their content is aggregated into the Networked Digital Library of Theses and Dissertations. Other repositories are subject-based. The most notable subject-based repository is arXiv, hosted at Cornell University. It contains a huge collection of scientific literature with a heavy emphasis on physics. A project called OpenDOAR is attempting to list as many repositories as possible.
People's expectations have changed since the advent of Google. They expect to do some sort of Internet search, be presented with a list of possible items of interest, click on an item, and read the result. Usability studies show that users are frustrated with libraries because there is too much choice, things are too difficult to search, and electronic full text may not be available. A lack of full-text availability could mean the item does not exist digitally. More probably it exists digitally, but the item is copyrighted by a publisher who secured the rights from the original author and now provides access to it through a proprietary, non-standard interface. Journal article publication, unlike book publication, is not about paying the author but rather about the author's work getting cited and recognized. If academia were to make a greater amount of its "born digital" content freely available online via open access journals and/or repositories, then indexers (like Google) would find it, and libraries would collect and preserve it. Most likely the sphere of knowledge would grow faster and more people would be aware of the Academy's accomplishments.
Phase one of the IDR was a one-year pilot project. Its scope was limited to the content of four groups of people and a larger number of associated individuals:
Each of the associated individuals and their co-workers has been working with one or more members of the 27-member Team IDR from the Libraries (see ) to collect, organize, preserve, and disseminate the digital content. The following sections outline the work done to accommodate these collections.
In order to spotlight meritorious scholarship done by undergraduate students, a number of excellent undergraduate research materials have been added to the IDR.4
Late in 2005 a Vice President and Associate Provost brought together people from a number of University departments in an effort to begin the process of highlighting excellent undergraduate research. These departments included: Anthropology, Architecture, Biological Sciences, Electrical Engineering, FTT (Film, Television, and Theater), the Gigot Center for Entrepreneurial Studies, the Institute for Scholarship in the Liberal Arts, and the Nanovic Institute. At that time it was decided each department would investigate ways to submit examples of "excellent" undergraduate research to the IDR.
Site visits were made, relationships were built, and a laissez-faire policy was adopted regarding what would get submitted and when; each department was to use its own standards for defining "excellence". In addition it was decided each submission should include a short biography and a picture of the student, as well as a copy of his/her work. All of these things were expected to be displayed as a part of the project, and special software was written and procedures put into place in order to make this come to fruition.
Since the University in this instance is essentially operating as a publisher of content it did not own, a copyright permissions form (see Appendix ?) was drafted by University General Counsel essentially stating four things:
As materials were submitted to the IDR, a lot of time was spent getting the copyright permission forms signed. In the end, a total of 14 works were submitted from the Anthropology Department, Biological Sciences Department, Business School, and Nanovic Institute.
Thirty publications from the Institute of Latino Studies have been incorporated into the IDR.5
The Institute of Latino Studies publishes research reports, policy briefs, monographs, an annual news magazine, and more. The majority of their content is not in a digital format, but thirty of their publications exist as PDF files. Working with the Institute we acquired the digital content, created records describing the publications, and added them to the IDR. As with the content of the Computer Science Department, the records' descriptions are not as complete as they could be because of a lack of subject expertise on the part of the people doing the data entry. This emphasizes the need to work more closely with content creators.
Using the previously existing metadata from the University Libraries' catalog, we were able to create 161 records describing Kellogg Institute working papers and include them in the IDR.6
The Kellogg Institute has a collection of at least twenty years' worth of working papers, that is, essays and articles written by Faculty Fellows, Visiting Fellows, and Guest Scholars. Approximately half of their 300 papers are available digitally, and the majority, if not all, of them have already been cataloged by the Libraries.
Working with the Institute, Team IDR acquired the digitized working papers, used the library catalog as a base, and created database records for each of the papers. Then the metadata was harvested into the centralized cache, where browsable interfaces were provided similar to the one on Kellogg's website, along with a searchable interface to the collection, something the Kellogg website does not support.
This particular IDR experiment was especially fruitful because of the full descriptive cataloging that had already been done.
Through an almost entirely automated process more than 300 engineering citations were added to the IDR.7
One of the primary purposes of the IDR is to accomplish scholarly communication by demonstrating and making available the good work done at the University. This work is partially exemplified by articles published in the scholarly journal literature. In an effort to demonstrate how such content can be retrospectively included in the IDR, the Libraries' bibliographic databases were searched for articles authored by Notre Dame faculty in the various engineering departments. The resulting sets of bibliographic citations were saved in a database application called EndNote and subsequently exported to an XML file. The XML was converted into a format usable by the DSpace importing routines, and each citation was associated with a dummy PDF document. The dummy PDF documents were used because the actual articles are quite likely copyright protected. After the data were ingested, they were harvested into the centralized cache where they are searchable and browsable in the same manner as the IDR's other content.
Using this automated means we were able to incorporate a large quantity of data quickly, thoroughly, and accurately.
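As a rough illustration of the conversion step described above, the following Python sketch turns a simplified, hypothetical EndNote-style XML export into the dublin_core.xml documents that DSpace's batch importer expects. The element names in the sample input and the exact Dublin Core qualifiers are assumptions, not the project's actual configuration.

```python
# Hypothetical sketch: convert an EndNote-style XML export into DSpace's
# Simple Archive Format metadata (one dublin_core.xml per item).
# The sample input below is fabricated and much simpler than a real
# EndNote export.
import xml.etree.ElementTree as ET

SAMPLE = """<records>
  <record>
    <title>A Study of Turbine Blade Fatigue</title>
    <author>Smith, J.</author>
    <year>2004</year>
  </record>
</records>"""

# Assumed crosswalk from export fields to (element, qualifier) pairs.
FIELD_MAP = {
    "title": ("title", "none"),
    "author": ("contributor", "author"),
    "year": ("date", "issued"),
}

def record_to_dublin_core(record):
    """Build a dublin_core.xml document for one citation."""
    dc = ET.Element("dublin_core")
    for tag, (element, qualifier) in FIELD_MAP.items():
        value = record.findtext(tag)
        if value:
            dcvalue = ET.SubElement(dc, "dcvalue",
                                    element=element, qualifier=qualifier)
            dcvalue.text = value
    return ET.tostring(dc, encoding="unicode")

items = [record_to_dublin_core(r) for r in ET.fromstring(SAMPLE)]
print(items[0])
```

In practice each generated document would be written into its own item directory alongside the (dummy) PDF and a contents file before running the importer.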
Using a very manual process, about 250 computer science technical reports were incorporated into the IDR.8
For more than a decade the Computer Science Department has been making various technical reports available on its website. Earlier documents are saved as PostScript files. Later documents are saved as PDF documents. As a demonstration, the older PostScript files were converted to PDF, and names, titles, dates, and abstracts were entered into the database. These items were then harvested and cached into the central IDR for searching and browsing.
Because the people doing the data entry did not have the necessary subject expertise, some things such as keywords were not added to the system unless they had been provided by the original document. This process taught us about the necessity for working more closely with content creators.
The entire corpus of the University's electronic theses & dissertations collection has been incorporated into the IDR.9
A few years ago the University Libraries in conjunction with the Graduate School implemented a system for collecting, indexing, searching, and disseminating electronic versions of theses & dissertations. The process centers around a piece of open source software called ETD-db, supported by Virginia Tech. As graduate students (both Masters and Ph.D.) approach graduation they are encouraged by the Graduate School to submit their thesis or dissertation electronically as a PDF document. Upon submission students are asked for their name, degree, title of work, abstract of work, a few keywords describing their work, and whether or not they would like their work to be freely distributed. The Graduate School makes sure the work is correctly formatted and signed by all parties involved. Upon approval from the Graduate School the work gets passed on to the Libraries where it gets more formally described and added to the library catalog. Finally, the work is made available to the general public, if so specified. To date there are about 360 theses & dissertations in the system.
Using a metadata harvesting protocol called OAI-PMH, information about the theses & dissertations was copied to a centralized cache, and sets of value-added services were provided against it. Just like the out-of-the-box ETD-db interface, browsable interfaces were provided based on names and titles. Similarly, searchable interfaces were created against names, titles, abstracts, and keywords. Unlike the standard ETD-db application, the IDR is able to provide lists of theses & dissertations based on departments. This enables academic units to display lists of recently completed works. Since Google has been able to crawl the ETD-db over the past few years, other people have been able to link to its content. Such links increase the works' relevance, and their "PageRank" can be displayed by querying Google for this information.
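The harvesting step described above can be sketched in a few lines of Python. This is an illustrative example, not the IDR's actual code; the sample response is fabricated and simplified (a real oai_dc record nests its Dublin Core fields inside an oai_dc:dc container).

```python
# Illustrative sketch of OAI-PMH harvesting: parse one ListRecords
# response, pull out Dublin Core titles, and note the resumptionToken
# that signals another page of records to fetch.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Fabricated, simplified response for demonstration purposes.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><dc:title>Sample Thesis</dc:title></metadata></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return (titles, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.iter(DC + "title")]
    token_el = root.find(OAI + "ListRecords/" + OAI + "resumptionToken")
    token = token_el.text if token_el is not None else None
    return titles, token

titles, token = parse_list_records(SAMPLE_RESPONSE)
print(titles, token)
```

A real harvester would loop, re-requesting the verb=ListRecords URL with the returned resumptionToken until the repository stops supplying one, then write the accumulated records into the central cache.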
This project demonstrated additional value-added services the IDR can provide against content "born digital" at the University.
About 650 images from the Art Image Library have been made a part of the IDR.10
The Art Image Library is a branch of the University Libraries. It is staffed by one full-time curator and a number of part-time students. Located in O'Shaughnessy Hall it includes about 230,000 slides depicting art and architecture from all over the world with an emphasis on Western culture. Most of the slides were created by taking photographs from books. To a lesser extent the slides were purchased by the University or given to the Library by faculty. Each slide is classified, given an accession number, and stored in sets of large metal cabinets.
The primary audience of the Art Image Library is the art historians who use it to teach art history. To make use of the collection, instructors search and browse the library collection and/or identify images from their own collections for classroom use. The library staff then digitize the necessary content and make the resulting files available on a shared file system (hard disk). Instructors incorporate the images into PowerPoint files for in-class presentations.
About 10,000 of the slides have been digitized over the past few years. A number of differing automated systems have been used to organize and access these images. Each system had its own strengths and weaknesses. None of the systems has been comprehensive regarding the total number of slides in the collection, and none of the systems has been used to thoroughly and consistently describe each slide.
To automate the work of the Art Image Library to a greater degree, an application called DigiTool was purchased. As a digital asset manager designed for libraries, it promises methods for easily describing images using Dublin Core metadata elements, access control at the item level, and customizable browser-based as well as computer-based interfaces.
Because of a number of challenges, migrating the digital images and their associated metadata into DigiTool proceeded in fits and starts. First, Dublin Core was not seen as an optimal metadata language for art history content. Instead VRA Core was desired, and consequently, time was spent creating a mapping from VRA Core to Dublin Core. A relational database schema was also designed to manage VRA Core metadata. A previously existing VRA Core database was then identified, but its acquisition was delayed many times and proved too time-consuming to implement. Second, the metadata records in the database system most recently used by the library were determined to be incomplete and inconsistent. This made any migration from the existing database to a VRA Core-based system difficult, if not impossible, to automate. It also made it very difficult to ingest the metadata into DigiTool. Third, the OAI-PMH interface implemented by DigiTool and used to create the IDR's centralized cache was not standards-compliant. The problem was exacerbated by an upgrade to DigiTool that broke the interface altogether.
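The VRA Core-to-Dublin Core mapping mentioned above can be thought of as a simple crosswalk table. The sketch below is hypothetical: the field names and pairings shown are common choices for such a crosswalk, not the mapping the project actually produced.

```python
# Hypothetical VRA Core -> Dublin Core crosswalk. Field names are
# illustrative; real VRA Core distinguishes Work and Image records
# and carries far more elements than shown here.
VRA_TO_DC = {
    "work.title": "title",
    "work.creator": "creator",
    "work.date": "date",
    "work.material": "format",
    "work.subject": "subject",
    "image.rights": "rights",
}

def crosswalk(vra_record):
    """Map a flat VRA Core record (a dict) onto Dublin Core fields."""
    dc = {}
    for vra_field, value in vra_record.items():
        dc_field = VRA_TO_DC.get(vra_field)
        if dc_field:  # fields with no DC equivalent are dropped
            dc.setdefault(dc_field, []).append(value)
    return dc

print(crosswalk({"work.title": "Pietà", "work.creator": "Michelangelo"}))
```

The lossiness visible here (fields without a Dublin Core equivalent simply disappear) is one reason VRA Core was preferred as the native metadata language, with Dublin Core generated only for harvesting.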
To a large extent these challenges have been overcome through the combined use of brute force manual data entry and automation. A subset of the digitized content was identified. Each of the descriptions of the items in the subset was supplemented with subject terms garnered from ARTstor. These descriptions, along with the associated images, were then ingested into DigiTool. Finally, the DigiTool OAI-PMH interface was configured and harnessed to allow the IDR to provide searchable/browsable services against the 630 digital images in the collection.
This section describes the technological infrastructure behind the IDR.
The computer technology used by the IDR is a combination of legacy software, commercial software, and open source software. For example, the University's Electronic Theses & Dissertations Project was implemented a few years ago using Virginia Tech's ETD-db application. While not suited for all types of content such as images, the ETD-db is very well suited for theses & dissertations. Since the Graduate School was using this application with a great deal of success, it was deemed silly to migrate its functionality to another system. It was determined that an application called DigiTool would be used to automate the collection, description, and dissemination of content from the Art Image Library. This commercial software is supported by Ex Libris, the same vendor of the Libraries' integrated library system. DSpace, an open source software application specifically designed for institutional repository use, is used to collect, describe, and disseminate things like working papers, technical reports, and the excellent undergraduate research.
Each of these applications has its own strengths. ETD-db does very well what it was designed to do. DigiTool provides finely grained access control. DSpace is a "free" application widely supported by the academic community and probably ranks as the most popular in its class. At the same time, each of these applications has its own weaknesses. For example, each application operates in its own silo, so it is difficult to integrate the searching process. More importantly, each of these systems operates as a turn-key application, so it is very difficult to change anything but the most rudimentary cosmetic customizations.
Luckily each of these three systems supports a harvesting protocol called OAI-PMH. This protocol allows the metadata to be collected from each of the systems, cached locally, and then searched and browsed through services provided against the cache. MyLibrary, the same software used to drive much of the Libraries' website, is used to provide this functionality. Because MyLibrary is more like a toolbox and less like a turn-key application, there is much more control over the appearance of the implementation as well as more ability to provide additional functionality that none of the turn-key applications supports. Below is an enumeration of some of these features:
While the goals of the IDR are laudable, the IDR does not sell itself. People are often too busy to consider change. In this developmental phase, Team IDR learned from several different sources about the necessity of raising awareness through marketing and advocacy.
Two members of Team IDR and the Libraries' Associate Director for User Services attended an Association of Research Libraries Institute on Scholarly Communication in July 2006. Institute speakers recommended creating a systematic plan for increasing scholarly communication effectiveness. Plan objectives would include:
We learned a number of things about promoting awareness, marketing, and advocacy on issues surrounding scholarly communication:
The following comments shared by teaching/research faculty in the group helped us begin to understand some of the marketing challenges we will face:
To become and remain successful, the IDR must have a continuous flow of new content and it must also be able to substantiate that the content is being accessed, viewed and applied by other scholars. The symbiotic relationship of these two requirements implies that marketing the IDR will concentrate both on solicitation of content and promoting the IDR as an information resource.
A review of the literature confirms that institutions have faced an uphill battle in marketing their repositories to faculty. (See Appendix G for detailed information.) At least one institution responded to this obstacle by focusing its institutional repository marketing efforts on promoting and archiving student work. With this approach, "students - and eventually faculty - ...develop some conception of the issues surrounding copyright, fair use, licensing, and alternative publishing models" (Nolan & Costanza, 2006), exposing future academics to the potential benefits of open access in scholarly communication.
The success potential for any marketing effort increases when the need is clearly defined and when the innovation is aligned with institutional goals. The University's recent selection to participate in the Carnegie Academy for the Scholarship of Teaching and Learning, with emphasis on enhancing undergraduate research, creates a pointed need for a central repository for archiving and promoting exemplary student work. When considering the various paths a marketing plan for ND's IDR might take, focus on Excellent Undergraduate Research consistently rises to the top. This direction, at least in the early phase of IDR promotion, provides a unique opportunity to present the IDR as an innovation that successfully meets its stated goals:
Initial focus on Excellent Undergraduate Research does not preclude a long-term goal of marketing the benefits of the IDR to faculty. If the effort to spotlight student-produced content in the IDR is successful, it is likely that the IDR's appeal to faculty would increase.
The IDR is a collection of digital objects, and as such it presents a new challenge for the Libraries. Preserving digital objects requires proactive, ongoing intervention. Analog content is inherently more stable and, even if it sits unused, can often be read 50-100+ years after it is produced. If digital objects go unused for that length of time, they become inaccessible, because the technology used to create and to read the digital content is ever-changing. For long-term accessibility (preservation), active human intervention is required.

One widely accepted method for preserving digital objects is migration: transforming content from one format to another as software and technology change. GIF to JPEG. JPEG to JPEG2000. WordStar to WordPerfect. WordPerfect to Word. Word to PDF. Etc. In this environment preservation is a self-selecting process. It often happens as a matter of course for digital objects that are heavily used: the documents that are accessed most get migrated from format to format as needed. It is also an example of the principle alluded to by Thomas Jefferson: "...let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident." Migration can, however, lead to the loss of data and information, since characteristics of the original, e.g. formatting, may not translate to the newer version.

A lesser used and more costly method for the preservation of digital objects is emulation, a process whereby older computer technologies are mimicked to enable the display of obsolescent digital formats. This process requires access to outdated software and hardware, along with the accompanying documentation. Emulation allows for the display of content as it originally looked.
When it comes to the IDR, probably the most important items for preservation are the theses & dissertations. Three years ago the Graduate School, in conjunction with the Libraries, implemented the Electronic Theses & Dissertations project. In this system graduate students are encouraged to submit their theses or dissertations electronically in PDF format. Items sent through this process are neither printed nor archived locally. They are sent to ProQuest, a company that has contracted to preserve the materials, and, if ProQuest goes out of business, the Library of Congress has committed to their continued preservation.
The theses & dissertations are primary literature. By definition they are pieces of information unique to the University of Notre Dame community. These materials are regularly backed up to tape archives, which are stored in an adequate environment and routinely checked to ensure that the data is still readable. In addition, the IDR employs a number of best practices for the preservation of digital materials: 1) widely used, open standards, 2) widely used document formats, and 3) non-proprietary software. These measures enable the Libraries to retrieve the content from the underlying system and migrate it more easily from one format to another.
The other content of the IDR is being preserved by creating duplicate copies of the content. Kellogg working papers exist at the Kellogg Institute. Much of the Institute for Latino Studies content is in paper form. The images from the Art Image Library exist as photographic slides. Any articles, preprints, etc. in the IDR may have become published works and are therefore duplicated through the publication process. A more proactive approach to preserving the unique content of the Electronic Theses & Dissertations may be something to consider.
Copyright presents the most challenging issues for the IDR.
Most people believe intellectual expressions are inherently owned by the individuals or organizations creating them. These beliefs have manifested themselves in patent, trademark, and copyright laws. Under the current copyright laws of the United States, works are copyrighted as soon as they are recorded in a fixed medium, and unless the recorded works are considered works for hire, the copyrights are owned by the creators. Rights can be transferred only through written legal agreements from creators to other parties.
Much but not all of this thinking is embodied in a University policy, "University Ownership of Intellectual Property Arising from University Research". For example, the policy explicitly states "The University owns all rights to all patentable inventions arising from University research". This sounds very much like work for hire. When it comes to more traditional scholarly outputs, though, specific exceptions are made: "However, consistent with long-standing academic tradition, the University does not normally claim ownership of works such as textbooks, articles, papers, scholarly monographs, or artistic works. Creators, therefore, retain copyright in their works, unless they are created under a grant or sponsored program that specifies ownership rights in some entity other than the creator, they are the subject of a contract modifying ownership rights, or they are otherwise addressed in this policy."
Because of these laws and policies, copyrights for things like articles, working papers, technical reports, etc. are owned by the faculty and students who wrote them until they are transferred in writing.
To legally post excellent undergraduate research to the IDR a copyright agreement had to be written and signed by undergraduates. The agreement essentially states four things:
Trying to catch up with the undergraduates to physically sign the agreement was very time consuming. Time schedules did not mesh. Students left the University, and email addresses did not work. The postal service delayed correspondence. With only a single exception, Team IDR secured signed copyright forms for all the examples of undergraduate research. The one exception was a work where the undergraduate had a co-author (a faculty member) who was not willing (or able) to sign the copyright form.
Working with traditional scholarly output proved to be even more difficult. Since scholarly publishers require authors to sign copyright agreements, usually transferring copyright from author to publisher, getting permission to redistribute published articles means getting signed agreements from publishers. While an increasing number of publishers allow authors to make their works available on the Web after an embargo period, this is the exception and not the rule. Additionally, such permission usually applies only to personal websites, not institutional or subject-based repositories. This is why there are no articles, only dummy PDF documents, in the Engineering collections of the IDR.
In contrast, working with the Kellogg Institute and the Institute for Latino Studies proved much easier. The institutes have a mandate to disseminate their locally generated information as freely as possible. Visiting scholars are expected to write working papers and reports intended to be made accessible to the general public, and not necessarily through traditional publishing venues. These two institutes already have policies in place stating that work done there is owned by the University as well as the authors.
There is much misunderstanding regarding copyright across campus. In general, creators of copyrightable materials do not know what their rights are. People do not understand that they are able to negotiate copyright transfer agreements. Copyright laws are nebulous in our current digital environment, where information is increasingly bought and sold. On one hand, the Academy is about "expanding the sphere of knowledge", in turn making it easier for people to benefit from the fruits of its labor. On the other hand, copyright is freely signed away to publishers whose primary market is the Academy.
The definition of metadata is "data about data". This section outlines the role metadata and metadata creation played in the IDR. Much of this section's content is based on the observations and experience of a subset of Team IDR, the Working Group on Organizing and Describing the Digital Content of the University Libraries' IDR, whose report is presented in Appendix F.
Metadata can be divided into three types: 1) descriptive, 2) structural, and 3) administrative. Descriptive metadata is used to pull like things together, create homogeneous sets of information, and make explicit various pieces of implicit information about content. This section focuses primarily on descriptive metadata. Structural metadata describes the physical make-up of content and will be elaborated upon in the Preservation section of this report. Administrative metadata, used to accommodate a workflow within a system, is not relevant to this particular discussion.
Team IDR spent much of its time creating descriptive metadata to describe and provide access to digital content. Some of this content was in the form of images. Some of the content was in the form of working papers and technical reports. Based on experience, different forms of content require different forms of descriptive metadata. For example, working papers and technical reports often have authors and titles. On the other hand, images of specific parts of the Vatican are not so easily described. For this reason various descriptive metadata schema have been created. Dublin Core, consisting of fifteen commonly used fields/elements, is quite popular. It forms the basis of describing things in DSpace and the ETD-db. It is also an inherent component of our harvesting protocol (OAI-PMH). On the other hand, VRA Core has been created in an effort to better describe images of art and architecture. It includes a variety of additional descriptive fields/elements, such as material, technique, and stylePeriod, in order to meet the specific needs of users of art images. One size does not fit all, and different metadata schema were used to describe the materials in the IDR.
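To make the schema discussion concrete, the sketch below assembles a minimal Dublin Core description as XML using only Python's standard library. The record values are invented examples, not actual IDR metadata:

```python
import xml.etree.ElementTree as ET

# Namespace for the fifteen Dublin Core elements (title, creator, subject, etc.).
DC_NS = "http://purl.org/dc/elements/1.1/"

def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element -> value as a small XML record."""
    root = ET.Element("record")
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

record = dublin_core_record({
    "title": "A Hypothetical Working Paper",
    "creator": "Doe, Jane",
    "type": "Text",
})
```

A VRA Core record would be built the same way but with image-oriented elements such as material and stylePeriod in place of the general-purpose Dublin Core set.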
Through the process of creating metadata, the need for policies, guidelines, and authority control was re-enforced. A guiding principle of librarianship is "Save the time of the user." One way to implement this is through the use of name authority lists. Authors are often cited with different forms of their names, such as: first initial and last name; full first name and full last name; etc. When searching for people's names it is important to bring together all the documents by the same person, even if their names were cited differently in different documents. Authority control identifies a single authorized form of a given personal name, corporate body, series, subject term, etc. and provides cross-references from variant forms for the user. The applications used in the IDR do not include authority control functionality. In lieu of significant software enhancements, policy and procedural manuals can be created to partially compensate for the lack of authority control features. Without very sophisticated software, computers will fail to retrieve all of the documents on a specific topic or by a specific person. This is especially true in the humanities. Creating metadata for IDR content re-enforced the need for authority control, or, at the very least, the need for data-entry guidelines and policies.
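A name authority list of the kind described above can be approximated with a simple lookup table. The names and variant forms below are hypothetical examples, not entries from an actual authority file:

```python
# Hypothetical authority file: variant name forms map to one authorized heading.
AUTHORITY = {
    "Smith, J.": "Smith, John A.",
    "Smith, John": "Smith, John A.",
}

def authorized_form(name):
    """Return the authorized heading for a cited name, or the name unchanged."""
    return AUTHORITY.get(name, name)

def collocate(records):
    """Group (author, title) pairs under authorized headings so documents
    cited with variant name forms are retrieved together."""
    grouped = {}
    for author, title in records:
        grouped.setdefault(authorized_form(author), []).append(title)
    return grouped
```

With such a table in place, a search under any variant of a name can be redirected to the single authorized form, which is the effect the cross-references in a traditional authority file provide.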
The thorough application of descriptive metadata to IDR content requires specialized skills and collaboration. Based on the experience of other universities that have implemented institutional repositories, Team IDR learned that "if you build it, they won't come." Put another way, it takes more than bringing up DSpace for it to become populated with content. This is true primarily because authors do not want to spend the time doing data-entry and applying metadata. In an attempt to overcome this issue, all the data-entry into the IDR was done by members of Team IDR. In fact, the data-entry people constituted the largest proportion of people working on the IDR. The skills, knowledge, and experience of these people varied significantly. Some had little or no formal metadata experience. Others had the experience of thirty years of library cataloging. Some had a complete understanding of how to use their computers. Others needed a bit of training on computer fundamentals. Most importantly, doing data-entry on behalf of authors emphasized the need for subject-specific knowledge. In order to adequately describe the content of a technical report or a working paper, the metadata specialist needs to know, or have direct access to another person who knows, about the subject area. "Is the name of this species of frog significant to the description of this paper, and if so, then what is that name?" "To what degree is it important to know that this paper describes the differences between a compiled computer language and an interpreted language?" Without this sort of knowledge or without easy access to a person with this sort of knowledge, the application of descriptive metadata will be incomplete at best. In short, we learned that people applying descriptive metadata need to have the following skills:
This is a list of possible choices - a menu - regarding future developments of the IDR. In general, items are listed on a return-on-investment basis. Items with lower costs but higher returns are given priority. Put another way, items at the top of the list are easier to implement than items at the end of the list.
The previous "Chinese menu" of choices was a list of possible future directions, but implementing everything on the list is impractical. The University, and the Libraries in particular, can simply not afford to do it all. Assuming the Libraries has three to four people with the necessary skills who are willing and able to spend at least some of their time on future IDR efforts, the Chinese menu has been grouped into the following "meals", and Team IDR recommends the meal called "The Safe Bet".
Many people from inside and outside the University Libraries were involved with this first phase of the Repository. Within the Libraries, Team Institutional Digital Repository consisted of the following people:
At one point there was a call for volunteers to help with some data-entry and a few more people from within the Libraries came forward:
On various visits to campus academic departments a number of library liaisons participated:
The institutional digital repository effort is a joint effort between people who work in the Libraries and people who work outside the Libraries. This is a list of people, sans the authors, who have contributed to the Repository effort: