OTN Data Workflows
A description of how the OTN Data Centre processes incoming data and joins it to the database of machine-derived animal presence data that is the core of the Data Centre's holdings.
OTN's data system ingests records from many researchers and returns individual project reports on a quarterly basis. Submitted records pass through a robust system of QA/QC processes, with recommended changes returned to researchers via their private project folders. Throughout the data processing procedure, we maintain open communication with researchers about any changes needed in their data and metadata submissions.
How OTNDC processes each data submission
For project metadata, the OTNDC team receives emails at otndc@dal.ca with project metadata formatted according to our downloadable template. Once submitted to OTNDC, staff visually and programmatically inspect the metadata before making the necessary additions to the database and Plone file system to fully incorporate the new project.
For other types of metadata, researchers can upload the data submission through the previously created folder made specifically for their project. These data submissions should follow the respective downloadable template, e.g. the tag metadata template or the receiver metadata template. A submission alert is automatically sent to the OTNDC team through otndc@dal.ca. The team determines which type of data was submitted (tag, receiver, detection, or other), as sketched below. Depending on the type, a task list is created in the GitLab issue tracker and the data submission is visually inspected and quality controlled for any obviously erroneous records.
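As a rough illustration of that routing step, the sketch below classifies a submission by which template's required columns appear in the file. The column names and "signatures" are placeholders for this example, not the authoritative template definitions.

```python
# Hypothetical routing of a data submission by its template columns.
# The required-column "signatures" below are illustrative only; the actual
# OTN templates define their own required fields.

TEMPLATE_SIGNATURES = {
    "tag": {"TAG_SERIAL_NUMBER", "TAG_CODE_SPACE", "UTC_RELEASE_DATE_TIME"},
    "receiver": {"STATION_NO", "DEPLOY_DATE_TIME", "INS_SERIAL_NO"},
    "detection": {"Receiver", "Transmitter", "Date and Time (UTC)"},
}

def classify_submission(columns):
    """Return the template type whose required columns are all present."""
    cols = set(columns)
    for submission_type, required in TEMPLATE_SIGNATURES.items():
        if required <= cols:  # every required column appears in the file
            return submission_type
    return "other"

print(classify_submission(["STATION_NO", "DEPLOY_DATE_TIME", "INS_SERIAL_NO", "BOTTOM_DEPTH"]))
# -> receiver
```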
How the data joins the OTN data system
The quality-controlled, visually inspected records are entered into the database via a suite of data-processing Python scripts, coordinated into workflows by single-purpose Jupyter notebooks known collectively as Nodebooks. The Nodebooks process and verify the records as they move through the database into higher-level formatted tables, which are then used to match detection records to tag records.
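The sketch below gives a flavour of what a single Nodebook-style step might look like: verify a raw table, then promote its rows to the next level. It assumes a PostgreSQL connection via psycopg2, and all schema, table, and column names are illustrative rather than OTN's actual structures.

```python
# Minimal sketch (not OTN's actual code) of a Nodebook-style step:
# verify a raw table, then promote its rows to an intermediate table.
# Assumes psycopg2; schema, table, and column names are illustrative.
import psycopg2

def promote_raw_table(conn, schema, raw_table, intermediate_table):
    with conn.cursor() as cur:
        # Verification: count rows missing a required field before promoting.
        cur.execute(
            f"SELECT count(*) FROM {schema}.{raw_table} WHERE catalognumber IS NULL"
        )
        missing = cur.fetchone()[0]
        if missing:
            raise ValueError(f"{missing} rows in {raw_table} have no catalognumber")

        # Promotion: copy the verified rows into the intermediate-level table.
        cur.execute(
            f"INSERT INTO {schema}.{intermediate_table} "
            f"SELECT * FROM {schema}.{raw_table}"
        )
    conn.commit()

# Usage with a hypothetical connection and table names:
# conn = psycopg2.connect(dbname="otn", user="processor")
# promote_raw_table(conn, "myproject", "c_tag_2024_05", "tagcache_2024_05")
```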
Each type of metadata record goes through a specific process facilitated by the Nodebooks. After project metadata is received, the OTNDC team creates a GitLab issue with the necessary steps for project metadata processing. Following this, database entries for the project and its affiliated contacts are populated, and a schema is created where the metadata and data tables will reside. On the Plone system, a private project folder is created, accessible to project contacts with submission rights, where future data and metadata updates can be submitted directly to OTNDC.
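A minimal sketch of the project-setup step might look like the following, assuming psycopg2 and a hypothetical contacts table; the names and layout are assumptions for illustration, not OTN's actual DDL.

```python
# Illustrative sketch of project setup: create a schema for the new project
# and record its affiliated contacts. The schema name, contacts table, and
# columns are assumptions for the example.
import psycopg2

def create_project(conn, collectioncode, contacts):
    schema = collectioncode.lower()
    with conn.cursor() as cur:
        # One schema per project holds its metadata and data tables.
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
        # Record the project's affiliated contacts (hypothetical table layout).
        for name, email, role in contacts:
            cur.execute(
                "INSERT INTO project_contacts (collectioncode, name, email, role) "
                "VALUES (%s, %s, %s, %s)",
                (collectioncode, name, email, role),
            )
    conn.commit()

# Usage with a hypothetical connection:
# conn = psycopg2.connect(dbname="otn", user="processor")
# create_project(conn, "XMPL", [("A. Researcher", "a.researcher@example.org", "principalInvestigator")])
```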
The database ingestion workflow of some of the other metadata types is as follows:
This diagram shows the metadata coming through our Document Management System (Plone). These data submissions are quality controlled and processed into the database as the first level of tables, the raw tables (denoted in blue). From there, the Nodebooks process them to the intermediate level (denoted in purple). After these are processed and verified with the Nodebooks, they are processed into the higher level "inherited" tables (denoted in yellow), which are automatically loaded into the parent schema.
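The "automatic loading" into the parent schema relies on table inheritance in the database. The sketch below shows the general idea in PostgreSQL with assumed schema and table names; the real project and parent schemas follow their own naming conventions.

```python
# Illustrative sketch of the inheritance step: a project's higher level table
# is created as a child of the parent-schema table, so rows written to the
# project table are automatically visible when the parent is queried.
# Schema and table names here are assumptions for the example.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS myproject.otn_detections_2024 ()
    INHERITS (parent_schema.otn_detections_2024);
"""

def attach_child_table(conn):
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()

# After this, SELECT * FROM parent_schema.otn_detections_2024 also returns
# the project's rows -- the automatic loading described above.
```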
Each type of submission goes through a specific workflow. For tag metadata, the OTNDC team downloads the submitted tagging metadata, visually inspects it, and checks that it follows the tag metadata template. The team then creates a GitLab issue with the necessary steps for tag metadata processing. The submission is entered into its first level table (c_tag_suffix) and verified via the Nodebooks. The 'suffix' part of the table name is generally the date of submission or the last date in the file, so the table is easy to find later and to connect with related tables. These first level tables are then processed into the intermediate level tables (animalcache_suffix and tagcache_suffix), where the Nodebooks check for errors such as a tag deployed in two animals at the same time (see the sketch below). These intermediate tables are then processed into the higher level tables (otn_animals and otn_transmitters) and verified. The otn_transmitters table is in the same format as moorings and is automatically inherited into moorings in the parent schema; otn_animals is automatically inherited into otn_animals in the parent schema. The Nodebooks are used for this processing and verification, and the OTNDC team investigates any errors that arise during these verifications and discusses the fixes with researchers.
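As an example of the kind of verification described above, the following pandas sketch flags a tag that appears to be deployed in two animals over overlapping time ranges. The column names are illustrative, not the real cache-table layout.

```python
# Minimal pandas sketch of one tag-metadata check: flag a tag that appears
# to be deployed in two animals at the same time. Columns are illustrative.
import pandas as pd

tags = pd.DataFrame(
    {
        "tag_id": ["A69-1601-100", "A69-1601-100", "A69-1601-200"],
        "animal_id": ["fish-01", "fish-02", "fish-03"],
        "release_date": pd.to_datetime(["2023-05-01", "2023-06-01", "2023-05-15"]),
        "end_date": pd.to_datetime(["2023-08-01", "2023-09-01", "2023-12-31"]),
    }
)

def overlapping_deployments(df):
    """Return pairs of rows where the same tag is active in two animals at once."""
    merged = df.merge(df, on="tag_id", suffixes=("_a", "_b"))
    # Keep each pair of distinct animals only once.
    merged = merged[merged["animal_id_a"] < merged["animal_id_b"]]
    overlap = (merged["release_date_a"] <= merged["end_date_b"]) & (
        merged["release_date_b"] <= merged["end_date_a"]
    )
    return merged[overlap]

print(overlapping_deployments(tags)[["tag_id", "animal_id_a", "animal_id_b"]])
# Flags A69-1601-100, which is recorded in fish-01 and fish-02 over
# overlapping date ranges.
```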
Receiver metadata follows a similar workflow to tag metadata: the OTNDC team downloads the submitted deployment metadata, visually inspects it, and checks that it follows the receiver metadata template. The team then creates a GitLab issue with the necessary steps for receiver metadata processing. The submission is entered into its first level table (c_shortform_suffix) and verified via the Nodebooks. The distinct stations are processed into the intermediate level table (stations), and the Nodebooks check for errors such as stations with similar names (see the sketch below). Once the errors are noted and fixed, the deployments in the first level table are processed into the intermediate level table (rcvr_locations) and checked for errors such as overlapping receivers. These intermediate tables are then processed into the higher level table (moorings) and verified. These moorings tables are inherited into the moorings table in the parent schema.
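The "similar station names" check can be illustrated with a small sketch like the one below, which flags station names that differ only slightly (for example, a typo or inconsistent spacing). The station names and similarity threshold are assumptions for the example.

```python
# Illustrative check for station names that are suspiciously similar,
# suggesting the same station was entered two different ways.
import difflib
from itertools import combinations

stations = ["HFX001", "HFX 001", "HFX002", "CBS010"]

def similar_station_names(names, threshold=0.85):
    flagged = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged

print(similar_station_names(stations))
# [('HFX 001', 'HFX001', 0.92)] -- likely the same station, entered two ways
```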
Detection data has a somewhat different workflow: the OTNDC team creates a GitLab issue with the necessary steps for detection data processing, then downloads the unedited VRL submission and runs it through VUE or Fathom to acquire the raw detections and events. These raw detections and events are processed into the first level tables (c_detections_suffix and c_events_suffix, respectively) and verified with the Nodebooks. The first level detections are then processed further into intermediate tables split by year (detections_yyyy) and verified, while the first level events are processed into an intermediate table (events) that holds all the event data. The intermediate detection data is processed into higher level tables, still split by year (otn_detections_yyyy), and verified; these tables are inherited into the otn_detections_yyyy tables in the parent schema. Part of this processing is matching each detection to its correct receiver (see the sketch below). If the Nodebooks find a detection that cannot be matched to a receiver, the researchers are consulted to see if there is missing receiver metadata. The downloads from the intermediate event data are processed into moorings as DOWNLOAD records and matched to receivers from the moorings RECEIVER metadata; the Nodebooks perform a similar check for missing receiver metadata for events. Once all metadata is processed and verified, the otn_detections_yyyy records are matched to tag records to facilitate animal tracking.
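The detection-to-receiver match can be illustrated as follows: a detection joins a deployment when its receiver serial number matches and its timestamp falls inside the deployment/recovery window recorded in the receiver metadata. The column names and values below are placeholders, not the real table layouts.

```python
# Minimal sketch of matching detections to receiver deployments by serial
# number and time window. Columns and values are illustrative only.
import pandas as pd

deployments = pd.DataFrame(
    {
        "rcv_serial": ["VR2W-123456", "VR2W-123456"],
        "station": ["HFX001", "HFX002"],
        "deploy_time": pd.to_datetime(["2023-04-01", "2023-10-01"]),
        "recover_time": pd.to_datetime(["2023-09-30", "2024-04-01"]),
    }
)

detections = pd.DataFrame(
    {
        "rcv_serial": ["VR2W-123456", "VR2W-123456"],
        "detected_at": pd.to_datetime(["2023-06-15 12:00", "2024-06-01 08:00"]),
        "tag_id": ["A69-1601-100", "A69-1601-100"],
    }
)

# Join on receiver serial, then keep only detections inside a deployment window.
matched = detections.merge(deployments, on="rcv_serial", how="left")
in_window = matched["detected_at"].between(
    matched["deploy_time"], matched["recover_time"]
)
print(matched.loc[in_window, ["tag_id", "detected_at", "station"]])
# The 2024-06-01 detection falls outside every deployment window, so the
# researcher would be consulted about possibly missing receiver metadata.
```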
The Nodebooks also handle other types of metadata, such as mission reports, active tracking, and gliders. Each of these workflows has its own code set to help facilitate its processing.
How information is returned to researchers when available
Once the OTNDC team has processed and verified the records and matched detection records to tag records, reports of the results (known as detection extracts) are sent to the researchers. These detection extracts are sent out during the Data Push, which occurs on average once every three months. Researchers can then check the detection extracts and run them through a series of OTN-supported tools (Python: resonATe, R: glatos) for further analysis of these matches.
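As a starting point, a researcher might take a first look at a detection extract with pandas before handing it to resonATe or glatos; the filename and column names below are placeholders rather than a guaranteed extract schema.

```python
# A quick first look at a detection extract, assuming it arrives as a CSV.
# The filename and the 'station' column are placeholders; inspect the actual
# file's columns first.
import pandas as pd

extract = pd.read_csv("myproject_matched_detections_2024.csv")  # hypothetical filename
print(extract.columns.tolist())           # inspect which fields are provided
print(extract["station"].value_counts())  # e.g. detections per station, if such a column exists
```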
The above diagram shows the full workflow of the OTN system. The data is first acquired by the OTNDC and then loaded into the database, into a schema for each project that holds different types of tables depending on the type of data submission. These tables generally have a lower level (raw tables), an intermediate level, and a higher level. Through the Nodebooks, these tables and schemas are quality controlled and processed through the levels. The higher level tables are automatically loaded into a "parent schema" via inheritance in the database. This parent schema is then used to load the data to public endpoints. For the first public endpoint, summaries are created from the parent schema, and this summary information is used to populate the discovery pages on our website. Higher-level data with the correct permissions is added to our GeoServer and ERDDAP.
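A summary of the kind used to populate the discovery pages could be derived from the parent schema with a query along these lines; the schema, table, and column names are assumptions for illustration, not OTN's actual structures.

```python
# Illustrative summary over the parent schema: detection counts per project
# per year, of the kind that could feed a discovery page.
import psycopg2

QUERY = """
SELECT collectioncode,
       date_part('year', datecollected) AS year,
       count(*) AS detections
FROM parent_schema.otn_detections_2024
GROUP BY collectioncode, year
ORDER BY collectioncode, year;
"""

def summarize(conn):
    with conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()

# Usage with a hypothetical connection:
# conn = psycopg2.connect(dbname="otn", user="reader")
# for collectioncode, year, detections in summarize(conn):
#     print(collectioncode, int(year), detections)
```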