Introduction
The Learning Registry is a software solution and infrastructure that solves the problem of transmitting, storing, and replicating metadata and paradata (metadata about usage) about education resources. At present,
it consists of two root nodes which replicate to each other, as well as a network of other nodes which replicate with one of the two root nodes. Publishers and
creators of education resources can publish metadata about education resources to the Learning Registry so that knowledge about them can be transmitted to the
wider education community.
What the Learning Registry is not: The Learning Registry is not a searchable repository, nor is it a specific destination,
portal, or engine that educators will go to. It is an open technology framework to which any content creator can publish, and which any technology vendor or
developer can leverage for applications. IOER harvests documents from the Learning Registry, applies business rules to make them easier to use, and stores
the resulting metadata in a database. This database is then used to build an index that users can search for useful education resources.
A diagram of the flow of data between the Learning Registry and IOER follows:
Structure of a Learning Registry Document
Items in the Learning Registry (LR) are called documents and are typically formatted as JSON. Each document consists of an envelope and a payload. Think of an LR document as a letter you might receive in the mail: the envelope is the container that the payload (the letter) is put into before it goes into the LR.
The following properties (or fields) of the LR envelope are relevant for importing into the Data Warehouse:
- Signer
- The entity who signed the document (For example, SRI International).
- Submitter
- The person (or entity) who submitted the document (for example, SRI International on behalf of National Science Digital Library).
- Signature
- The digital signature used to verify the validity of the document.
- key_location
- The location on the web of the public part of the key used to sign the document.
- doc_ID
- The LR-assigned ID of the document in the Learning Registry.
- resource_locator
- The location of the actual resource (usually a URL).
- resource_data_type
- This is the type of data contained within the resource. Valid values are metadata and paradata.
- payload_placement
- Where the payload is placed. Valid values include inline, linked, and attached. The import process handles only inline; other values are logged and reported. So far only inline has been encountered.
- submission_tos
- The terms of use of the submission, not the terms of use of the resource.
- payload_schema
- The schema used for the payload. Where multiple schemas are listed, the payload contains data that matches more than one schema.
- resource_data
- This is the payload (the letter contained within the envelope).
- keywords
- These are keywords that can be used to search for data in the LR.
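As a rough illustration of how the envelope properties above might be used, the sketch below reads the payload from a parsed document only when payload_placement is inline, and logs anything else. The flat key layout, the sample values, and the function name read_payload are assumptions made for illustration; a real LR envelope may nest some of these properties.

```python
import logging

logger = logging.getLogger("lr_import")

# Hypothetical, flattened LR document; real envelopes may nest some fields.
sample_doc = {
    "doc_ID": "example-doc-id",
    "resource_locator": "http://example.org/lesson",
    "resource_data_type": "metadata",
    "payload_placement": "inline",
    "payload_schema": ["LRMI"],
    "resource_data": {"name": "Solving Linear Equations"},
    "keywords": ["algebra", "Grade 8"],
}


def read_payload(doc):
    """Return the inline payload, or None (after logging) for other placements."""
    placement = doc.get("payload_placement")
    if placement != "inline":
        logger.warning("Skipping %s: payload_placement=%r",
                       doc.get("doc_ID"), placement)
        return None
    return doc.get("resource_data")
```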
Extracting from the Learning Registry
Learning Registry API
The LR defines API methods for extracting data. The method most useful for extracting the information we need for our index is listrecords, one of the methods in the Harvest API.
Listrecords extracts LR documents in JSON format. These documents are converted to XML and stored in files. Each file contains approximately 200 LR documents (this is configurable and can be changed at any time). The files are placed in a queue. The import process then dequeues each file and imports the data into the database.
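The batching step can be sketched as follows. This is only an illustration: an in-memory deque stands in for the file queue, the XML conversion is omitted, and the names batch_documents and DOCS_PER_FILE are hypothetical.

```python
from collections import deque
from itertools import islice

DOCS_PER_FILE = 200  # configurable batch size mentioned in the text


def batch_documents(documents, size=DOCS_PER_FILE):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Each queue entry stands in for one file's worth of documents.
queue = deque(batch_documents(range(450)))
```

With 450 harvested documents and the default size, the queue holds three batches, the last one partially filled.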
Schedule
Data is extracted from the LR nightly. The schedule is configurable, and extraction currently begins around 6pm. Each run extracts data from the point in time where the previous run left off up to the current time.
Handling Long Imports
Currently LR activity is fairly low for most of the month, with periodic spikes when a publisher publishes a large amount of data. Importing the data is far more resource intensive than extracting it from the LR. In some cases it takes a few days to import a single day's data from the LR. Fortunately, LR activity is very low most of the time, so the import has time to catch up.
The LR takes an eventual consistency approach to managing its data, and we are adapting this approach to the import process. We envision teachers searching the database well into the night, and probably starting early in the morning, so we have taken the following approach to limiting the hours in which the import will run:
- At the end of processing each file in the queue, the file is removed from the queue and placed in an archive folder.
- Data Warehouse totals are updated (this takes under a minute).
- The current time is checked. If the current time is during the window where the import is not allowed to run (this window is configurable), the import ends. The next time the import begins, any new data in the LR is added to the end of the queue, and the import picks up at the point where it left off during the previous run.
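The loop described above might look something like the following sketch. The blocked window of 05:00 to 23:00 is a placeholder, not the configured value, and import_allowed, run_import, process_file, and clock are hypothetical names. A plain list stands in for the queue and the archive folder.

```python
from datetime import time


def import_allowed(now, blocked_start=time(5, 0), blocked_end=time(23, 0)):
    """Return True when `now` falls outside the blocked window.

    The default bounds are assumed values. A window that crosses
    midnight (start later than end) is also handled.
    """
    if blocked_start <= blocked_end:
        blocked = blocked_start <= now < blocked_end
    else:
        blocked = now >= blocked_start or now < blocked_end
    return not blocked


def run_import(queue, process_file, clock):
    """Process queued files until the queue is empty or the window closes."""
    archived = []
    while queue:
        current = queue.pop(0)
        process_file(current)
        archived.append(current)  # stands in for moving the file to the archive
        if not import_allowed(clock()):
            break  # remaining files wait for the next run
    return archived
```

Because the time is checked only after a file has been fully processed, a run never stops partway through a file, and anything left in the queue is picked up by the next run.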
Transforming and Loading into the Database
There are two basic types of data in the LR, metadata and paradata. Metadata is data about the resource. It includes but is not limited to title, description, education level, subject area, and access and use rights. Paradata is a specialized form of metadata that contains data about the usage of a resource. It includes but is not limited to views, favorites, comments, and ratings.
It is possible that multiple entities have submitted metadata and paradata about a resource. IOER combines metadata from multiple sources into a single object in an attempt to give a more accurate picture of the qualities of the resource.
Paradata from multiple submitters for a single resource will also be combined. Here's an example showing why it should be this way:
National Science Digital Library publishes an article about how to solve linear equations to the LR. A teacher from Olympia School District rates the article 4 out of 5 stars. If the results are kept separate, nobody will know that the teacher from Olympia rated the NSDL-published article.
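One way to picture the combination is to group every record by the resource it describes, regardless of who submitted it. The sketch below assumes each record carries resource_locator and resource_data_type keys; the sample records and the function name combine_by_resource are made up for illustration and do not reflect IOER's actual schema.

```python
from collections import defaultdict

# Hypothetical records describing the same resource from two submitters.
records = [
    {"resource_locator": "http://example.org/linear-equations",
     "resource_data_type": "metadata", "submitter": "NSDL"},
    {"resource_locator": "http://example.org/linear-equations",
     "resource_data_type": "paradata",
     "submitter": "Olympia School District", "rating": 4},
]


def combine_by_resource(items):
    """Group records from any submitter under the resource URL they describe."""
    combined = defaultdict(lambda: {"metadata": [], "paradata": []})
    for item in items:
        combined[item["resource_locator"]][item["resource_data_type"]].append(item)
    return dict(combined)


combined = combine_by_resource(records)
```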
Determine Schema
The payload_schema property of the envelope is used to determine which handler to use for processing the LR document. This property can be a single value or an array. When it is an array, the payload contains elements from multiple schemas.
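A minimal sketch of this dispatch follows, assuming handlers are kept in a dictionary keyed by lower-cased schema name. The handler registry, the case-insensitive matching, and the names schemas_of and pick_handlers are assumptions for illustration only.

```python
def schemas_of(envelope):
    """Return payload_schema as a list, whether it arrived as a string or an array."""
    value = envelope.get("payload_schema", [])
    if isinstance(value, str):
        return [value]
    return list(value)


def pick_handlers(envelope, handlers):
    """Return the handlers registered for the document's schemas, in order."""
    found = []
    for schema in schemas_of(envelope):
        handler = handlers.get(schema.strip().lower())
        if handler is not None and handler not in found:
            found.append(handler)
    return found
```

Schemas with no registered handler are skipped here; a real import would presumably log them.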
Validating Minimum Requirements
Documents Published by IOER
Documents published by Illinois OER come from us. They are already in our database, so there is no need to import these documents. These documents are ignored and not logged.
URL Cleansing
The space character, along with a few others, is prohibited in a URL. Spaces in a URL's query string are commonly encoded as "+"; however, this does not work for the part of the URL before the query string. When a publisher has incorrectly encoded spaces as "+" in that part of the URL, attempting to navigate to the page results in a Not Found (404) error. Outside the query string, spaces should be encoded as %20. This allows the URL to work properly and the resource to be accessed.
IOER's import checks URLs for "+" characters outside the query string portion of the URL, and encodes these as %20 instead, thus allowing the link to work correctly for our users.
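The fix can be sketched with the standard library's URL parser. This version rewrites only the path component and leaves the query string untouched; the function name fix_plus_in_path is hypothetical, and the sketch does not try to distinguish a "+" that the publisher genuinely intended as a plus sign.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_plus_in_path(url):
    """Replace '+' with '%20' in the path component only."""
    parts = urlsplit(url)
    fixed_path = parts.path.replace("+", "%20")
    return urlunsplit(parts._replace(path=fixed_path))
```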
Spam Detection
Currently, spam detection consists of examining the LR document for words on a bad word list. The bad word list includes various swear words, misspellings of swear words, names of pharmaceutical products commonly associated with email spam, and other words commonly associated with email spam. If the LR document contains any of the words on the list, it is automatically flagged as spam and the record is thrown away. Spam detection occurs immediately before the record is transformed and loaded into the database.
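A word-list check of this kind could be sketched as below. The list entries are placeholders, and matching on whole lower-cased words is an assumption; the actual list and matching rules are not reproduced here.

```python
import re

# Placeholder entries; the real bad word list is not reproduced here.
BAD_WORDS = {"badword", "spamdrug"}


def is_spam(text, bad_words=BAD_WORDS):
    """Return True if any whole word in `text` is on the bad word list."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return any(word in bad_words for word in words)
```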
Alternative to Physical Deletes
On the resource version table there is a field called IsActive. This field is used when we do not want to display a resource in the search, but do not want to remove it (for example, maybe the resource is good but it is severely lacking in metadata). In such a case, IsActive is set to false, so this version of the resource will not display in search results. Cases where IsActive is set to false include:
- The title is numeric
- The title is a date
- The title does not meet minimum length requirements (currently titles must be at least six characters long)
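The title checks above can be sketched as a single function that returns the value to store in IsActive. The accepted date formats and the definition of "numeric" used here are assumptions, and the helper names are hypothetical.

```python
import re
from datetime import datetime

MIN_TITLE_LENGTH = 6  # current minimum stated in the text
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")  # assumed formats


def is_date(text):
    """Return True if `text` parses with one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def is_numeric(text):
    """Return True for plain integers or decimals, optionally signed."""
    return re.fullmatch(r"[-+]?\d+(\.\d+)?", text) is not None


def title_is_active(title):
    """Return False when the title is too short, numeric, or a date."""
    title = title.strip()
    if len(title) < MIN_TITLE_LENGTH:
        return False
    return not (is_numeric(title) or is_date(title))
```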
Audit Error and Warning Process
Errors and warnings are stored in a table along with the docID and the name of the file containing the LR document. This lets us review errors and warnings and handle them as appropriate, including tweaking the import and reprocessing the record.
Cleansing Age Ranges
Age ranges come from the LR in various formats. It is not unusual to see age ranges like "-14+" or " --15-U" in addition to age ranges that make sense like "14-18." Age ranges are cleaned up using the following process:
- All whitespace is removed from the age range.
- If equal to "U-" change to "0-99"
- Month abbreviations are converted to numbers (for example, May-8 would be converted to 5-8)
- HTML entities for > and < are converted to their characters.
- >age is converted to age-99 (for example, >21 is converted to 21-99).
- <age is converted to 0-age (for example, <5 is converted to 0-5).
- + is converted to - if it is not the last character in the string.
- Ending + is converted to -99.
- Two or more 9's are converted to 99.
- Leading - characters are removed.
- Multiple consecutive - characters are removed.
- If there are two numbers separated by a - character, remove any trailing - characters, otherwise replace any trailing - with -99.
- By this point there should be two numbers separated by a -. If not, make it a range with only one number (for example, 18 becomes 18-18).
- Replace -U with -99.
- Make sure ages are in the correct order (so 99-14 becomes 14-99).
- Drop any leading zeroes (so 09-10 becomes 9-10).
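The steps above can be sketched in order as a single function. Several details are interpretations rather than documented behavior: runs of "-" are collapsed to one, a value with no "-" becomes a single-number range, month names are matched by their three-letter abbreviations, and anything that still is not two numbers at the end returns None. The function name clean_age_range is hypothetical.

```python
import html
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
MONTH_PATTERN = re.compile("|".join(MONTHS), re.IGNORECASE)


def clean_age_range(raw):
    """Normalize a raw age range to 'low-high', or return None if impossible."""
    s = re.sub(r"\s+", "", raw)                  # remove all whitespace
    if s == "U-":
        s = "0-99"
    s = MONTH_PATTERN.sub(lambda m: str(MONTHS[m.group().lower()]), s)
    s = html.unescape(s)                         # &gt; and &lt; become > and <
    s = re.sub(r"^>(\d+)$", r"\1-99", s)         # >21 becomes 21-99
    s = re.sub(r"^<(\d+)$", r"0-\1", s)          # <5 becomes 0-5
    s = s[:-1].replace("+", "-") + s[-1:]        # inner + becomes -
    if s.endswith("+"):
        s = s[:-1] + "-99"                       # trailing + becomes -99
    s = re.sub(r"9{2,}", "99", s)                # two or more 9s become 99
    s = s.lstrip("-")                            # drop leading -
    s = re.sub(r"-{2,}", "-", s)                 # collapse runs of - (interpretation)
    if re.fullmatch(r"\d+-\d+-+", s):
        s = s.rstrip("-")                        # two numbers: drop trailing -
    elif s.endswith("-"):
        s = s.rstrip("-") + "-99"                # one number: open-ended range
    if "-" not in s:
        s = f"{s}-{s}"                           # 18 becomes 18-18
    s = s.replace("-U", "-99")
    match = re.fullmatch(r"(\d+)-(\d+)", s)
    if match is None:
        return None                              # assumed: caller logs and skips
    low, high = sorted(int(part) for part in match.groups())
    return f"{low}-{high}"                       # int() also drops leading zeroes
```

Under these assumptions, the two malformed examples mentioned earlier, "-14+" and " --15-U", come out as 14-99 and 15-99 respectively.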
Age ranges, where possible, will be mapped to Grade Levels, unless Grade Levels are also present in the document received from the LR. This will be done as follows:
- If the ending age is greater than 21, map to "General Public."
- If the ending age is between 18 and 22, and the age range is less than or equal to 4, map to appropriate (college) grade levels.
- If the ending age is less than 18, map to K-12 grade levels, including Pre-Kindergarten.
- Otherwise, do not map to grade level.
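The decision rules above can be sketched as follows, applied in the order listed. The labels "College" and "K-12" are placeholders for the actual grade-level lookups, "between 18 and 22" is read as inclusive, and the size of the age range is taken to be the ending age minus the starting age; all of these are assumptions, as is the function name grade_band_for.

```python
def grade_band_for(age_range):
    """Classify a cleaned 'low-high' age range using the rules in order."""
    low, high = (int(part) for part in age_range.split("-"))
    if high > 21:
        return "General Public"
    if 18 <= high <= 22 and high - low <= 4:
        return "College"   # placeholder for the college grade-level mapping
    if high < 18:
        return "K-12"      # placeholder for the Pre-K through 12 mapping
    return None            # no grade-level mapping
```

Because the first rule is checked first, an ending age of 22 maps to "General Public" in this sketch even though it also falls within the second rule's bounds.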
Grade Levels will be converted to Age Ranges where grade levels are available but age ranges are not. This will be done using mapping tables.
Cleansing Subjects and Keywords
Occasionally multiple subjects or multiple keywords come through in a single subject or keyword element. This is not desirable. If a subject (or keyword) contains semicolons but no ampersands (that is, it contains semicolons but no HTML entities), it is split on the semicolon and each resulting value is stored separately.
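A minimal sketch of this rule follows. Trimming whitespace and dropping empty pieces are additions made for illustration, and the function name split_terms is hypothetical.

```python
def split_terms(value):
    """Split on ';' unless an '&' suggests HTML entities are present."""
    if ";" in value and "&" not in value:
        return [part.strip() for part in value.split(";") if part.strip()]
    return [value]
```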
Data Normalization
Different schemas (and indeed different submitters using the same schema) use different vocabulary to describe their data. This data is normalized to the vocabulary that ISLE is using for storing and displaying data via mapping tables that allow us to map many other vocabularies to our own.
Mapping tables contain the rules that convert the various vocabularies to our values. For each field crosswalked in the previously mentioned über-crosswalk, there is a mapping table that is used to convert these vocabularies to the ISLE vocabulary. It is possible for a single LR value to map to multiple values in our vocabulary. For example, an age range of 9-10 could map to Grade 4 as well as Grade 5, so two rows would be inserted in the Education Level table.
Orphan Tables
If a value has no rule to crosswalk it to our vocabulary, it is stored in an "orphan table" so that it is not lost. When a rule is later created for that value, a process can be run that crosswalks as many of the orphan table entries as possible to our vocabulary. Successfully crosswalked rows are placed in their respective tables and removed from the orphan tables. Orphan tables exist for the same fields as the mapping tables.
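The interplay between mapping tables and orphan tables can be sketched with a dictionary standing in for a mapping table. The case-insensitive lookup and the function name crosswalk are assumptions; the 9-10 example comes from the text above.

```python
def crosswalk(values, mapping):
    """Split values into mapped vocabulary terms and unmapped orphans."""
    mapped, orphans = [], []
    for value in values:
        targets = mapping.get(value.strip().lower())
        if targets:
            mapped.extend(targets)   # one value may map to several terms
        else:
            orphans.append(value)    # kept so the value is not lost
    return mapped, orphans


# Hypothetical mapping table for education level, keyed by lower-cased value.
mapping = {"9-10": ["Grade 4", "Grade 5"]}
mapped, orphans = crosswalk(["9-10", "Year 6"], mapping)

# Later, once a rule exists, the orphans can be run through again.
mapping["year 6"] = ["Grade 5"]
remapped, still_orphaned = crosswalk(orphans, mapping)
```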
Handling Duplicate Values
Duplicate values are common in the LR, and even encouraged. For example, grade levels and subjects are commonly placed both in the correct payload fields and in keywords. This makes it easier to find resources with a slice (one of the LR APIs), which looks only at keywords. If Grade 8 is present in both education level and keywords, the correct location for this data is education level. Duplicate values are removed automatically, and a value is stored in keywords only if it is not present in the other fields.
It is also fairly common to see the same term twice for a resource. For example, "Algebra" can appear twice in the subject field. In cases like this, "Algebra" will be stored only once.
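Both kinds of de-duplication can be sketched together for the keyword field. Comparing values case-insensitively is an assumption, and the field names and the function name dedupe_keywords are illustrative only.

```python
def dedupe_keywords(keywords, other_fields):
    """Drop keywords already stored in another field, and repeated keywords."""
    seen = {term.strip().lower()
            for values in other_fields.values()
            for term in values}
    kept = []
    for keyword in keywords:
        key = keyword.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(keyword)
    return kept
```

The first occurrence of a keyword is kept with its original spelling, and later repeats are dropped.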