Principal Investigator
Personal Name
Sam Smith
Personal Identifier
https://orcid.org/0000-0001-6461-9054
Personal Affiliation
University of Prince Edward Island
Contributor
Role (Personal)
ProjectLeader
Person
Personal Name
Donald Moses
Project description/abstract
Text-based data collected from Statistics Canada were used to create a union list of born-digital products from the Canadian Census of Population, starting with the 1961 Census. This union list indicates where the census files are located in Canada (for example, the University of Toronto Data Library) and what they contain. The data are stored in a database and are accessible through an online search engine (see Search for Aggregated Data Files from Canadian Censuses, URL: http://mdc.lib.uwo.ca/census/pubsearch.htm).
How will the data be collected or created?

The data will be collected in an Inmagic v.15 database from which non-proprietary file-formats, including ASCII, HTML and/or XHTML text files, may be extracted. Inmagic is a content management system, a type of database used to manage large amounts of content, such as documents, images, and more.
What conventions and procedures will you use to structure, name and version-control your files to help you and others better understand how your data are organized?
As records are created or edited, each change will be time-stamped, and the name of the person making the change will also be requested. A log file will be kept to record information about changes to the database.
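
As an illustration only, a change-log entry of the kind described above could look like the following sketch. The plan does not specify the log format, so the JSON Lines layout, the field names (record_id, editor, action, timestamp) and the function names are all hypothetical, and this is not how Inmagic itself records changes.

```python
import json
from datetime import datetime, timezone


def make_log_entry(record_id, editor, action, when=None):
    """Build one change-log entry (hypothetical field names)."""
    # Default to the current UTC time when no time stamp is supplied.
    when = when or datetime.now(timezone.utc)
    return {
        "record_id": record_id,
        "editor": editor,
        "action": action,  # e.g. "create" or "edit"
        "timestamp": when.isoformat(timespec="seconds"),
    }


def append_log(path, entry):
    """Append one entry to the log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Appending one line per change keeps the log readable in any text editor and lets it travel with the ASCII exports.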

What documentation and metadata will accompany the data?

Four different pieces of documentation will be needed.

  1. A description of the project, which will include a description of the process undertaken to identify the various historical census data files;
  2. A description of the field structure in Inmagic (e.g., whether a field is required, uses a controlled vocabulary, is repeatable, etc.);
  3. The data entry instructions to be followed in populating the database; and
  4. Lists showing the possible values of the various controlled-vocabulary fields, and of the substitution lists used to automatically translate English to French text and vice versa within the paired controlled-vocabulary fields.
What data will you collect or create?

Text-based data are collected from Statistics Canada, falling under the Statistics Canada Open License.
 

How will you manage any ethical issues?

The raw data (i.e., bibliographic and holdings information) will be shared.

How will the data be stored and backed up during the research?

Currently, the Inmagic database has approximately 640 records and occupies approximately 4.7 MB across 11 proprietary-format files. If the database grows to approximately 10,000 records, it is expected to occupy no more than 100 MB. Additionally, regular file dumps in ASCII format will be performed to ensure that the contents of the database can be transported to other database systems or used by other interfaces. The records currently occupy about 2 KB each in delimited format, and the full dump compresses from 1,220 KB to 94 KB. At its largest, each backup file might be expected to require 1.5 MB in compressed format.
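
The storage estimates above can be checked with a short calculation. This is a rough sketch that assumes both the database and the delimited dump grow linearly with the number of records and that the compression ratio stays constant; neither assumption is stated in the plan, and the function and variable names are illustrative.

```python
# Figures quoted in the plan
CURRENT_RECORDS = 640
CURRENT_DB_MB = 4.7          # size of the Inmagic database files
DUMP_KB = 1220               # delimited ASCII dump, uncompressed
DUMP_COMPRESSED_KB = 94      # the same dump, compressed
TARGET_RECORDS = 10_000


def project_storage(target_records=TARGET_RECORDS):
    """Scale current sizes linearly to the target record count."""
    scale = target_records / CURRENT_RECORDS
    compression_ratio = DUMP_COMPRESSED_KB / DUMP_KB
    dump_kb = DUMP_KB * scale
    return {
        "db_mb": CURRENT_DB_MB * scale,
        "dump_kb": dump_kb,
        # 1 MB taken as 1,000 KB for this rough estimate
        "compressed_dump_mb": dump_kb * compression_ratio / 1000,
    }


estimate = project_storage()
print(f"Database: about {estimate['db_mb']:.0f} MB")
print(f"Compressed backup: about {estimate['compressed_dump_mb']:.1f} MB")
```

Under these assumptions the database comes to roughly 73 MB, within the 100 MB ceiling, and each compressed backup to roughly 1.5 MB, in line with the figures given above.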

How will you manage access and security?

Access to the database will be either directly through the Inmagic interface for batch loading (V. Gray), or through the web-based interface (A. Cooper and other contributors). Editing access to the database will be restricted. Contributors will be given a user ID and password to allow for editing.

Which data are of long-term value and should be retained, shared, and/or preserved?

The project does not have a foreseeable end date. An ASCII delimited (and potentially an XHTML) version of the database will be created and could be stored on Scholars Portal Dataverse, a data repository which assigns DOIs to datasets and supports preservation, discovery, citations, and data usage metrics. However, a consultation with the University’s Research Data Management Librarian will help identify other possible repository options for our research data.

What is the long-term preservation plan for the dataset?

Preservation copies of the database will be stored in ASCII delimited format. As new versions are created, they will be compared with previous versions to ensure that records that have not been modified contain the same data in both. The required documentation files will be saved on the preservation site along with the delimited files.
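
The version comparison described above could be automated along the following lines. This is a minimal sketch, not the project's actual procedure: it assumes comma-delimited exports with a header row, and the column names RecordID and Modified (the time stamp of the last change) are hypothetical stand-ins for whatever fields the Inmagic export provides.

```python
import csv
import io


def load_records(text, key="RecordID", delimiter=","):
    """Read a delimited export into a dict keyed by record ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[key]: row for row in reader}


def find_unexpected_changes(old_text, new_text,
                            key="RecordID", stamp="Modified"):
    """List records that differ although their time stamp is unchanged."""
    old = load_records(old_text, key)
    new = load_records(new_text, key)
    problems = []
    for record_id, old_row in old.items():
        new_row = new.get(record_id)
        if new_row is None:
            problems.append((record_id, "missing from new version"))
        elif new_row[stamp] == old_row[stamp] and new_row != old_row:
            # Same time stamp means the record should be unmodified.
            problems.append((record_id, "changed without a new time stamp"))
    return problems
```

An empty result means every record present in the previous version is either identical in the new version or carries a newer time stamp; records added since the previous version are ignored.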

How will you share the data?

To date, presentations have been made at regional Data Liberation Initiative training sessions for the Ontario, Western Canada, and Atlantic regions, and to Statistics Canada (Data Rescue and Recovery Update URL: https://cudo.carleton.ca/dli-training/4075). It is hoped that a presentation to the Quebec region will also be possible. Finally, an article may be written and submitted to Statistics Canada for inclusion in its DLI Newsletter and/or to an academic library journal to highlight the existence of the tool. Depending on the repository in which the data are deposited, there may be additional resources for notifying the community. If the data are deposited in Dataverse, a digital object identifier (DOI) will be minted for the dataset, providing a persistent identifier and improving its discoverability.

Who will be responsible for data management?

Sam Smith: batch uploading of records in French and English, editing records, performing routine maintenance on the Inmagic database; extracting ASCII files; creating documentation; migration to preservation platform

Donald Moses: creating and editing records; creating documentation; migration to preservation platform

What resources will you require to deliver your plan?

Storage space on a web-enabled server. The entire project would fit onto a 4-GB USB key with space to spare, or it could be written at intervals to DVD. Minimal long-term costs would be expected as long as Western maintains a web-based Inmagic service. Should this change, a new platform would need to be selected and created.