Effective data management in a data integration, analytics and sharing project like ICARUS is instrumental in ensuring that the data are properly collected, curated and mapped in accordance with the specific needs of the aviation industry. The ICARUS data management approach has been designed to lay the theoretical foundations of the core Data Bundles of the ICARUS Platform related to Data Collection and Data Curation (in terms of Data Cleaning, Data Provenance and Data Mapping).
To put these data management methods on a solid footing, an iterative methodological approach was followed. It consisted of studying the state-of-play in the related domains, elaborating the related methods (taking into account the needs, requirements and peculiarities of the aviation industry) and extracting key considerations.
In general, Data Collection is a broad term that refers to populating the ICARUS platform with high-quality data from distributed information sources, at proper granularity levels and in a timely manner. The ICARUS Data Collection Methods are based on a supply-driven mentality (since the data providers are responsible for collecting their data assets, ensuring their quality and checking them in) and are presented under 3 core axes: Stakeholders, Applicable Processes and Data Profile.
As depicted in the following figure, Data Collection spans the collection of upstream, downstream, indirect and open data assets from the data providers’ perspective. For the time being, the de facto data collection approach in ICARUS is file upload / exchange. The applicable data check-in (9-step) and data update (4-step) processes are elaborated, and the supported (text-based) data profiles are put into context in terms of formats and standards. To better understand the data collection needs of the aviation data value chain, the continuously updated aviation data profiling has been revisited and the different data-related interactions have been fleshed out. The ICARUS Data Collection approach also considered a set of aviation data APIs, from ICARUS stakeholders and other aviation sources, that were identified and documented.
The ICARUS Data Curation Methods consist of techniques and approaches for data cleaning, data provenance and data mapping and linking.
From the ICARUS perspective, the scope of the data cleaning approach is to safeguard the quality of the data assets that are uploaded to the ICARUS platform, by removing or correcting erroneous data that would lead to incorrect, inaccurate or even invalid results or conclusions. To this end, Data Cleaning (or Data Cleansing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table or database. It refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data (Wu, 2013).
Hence, the main purpose of the Data Cleaning process is to improve the overall quality and usability of the data asset by employing: (a) a set of validation rules, covering several aspects of the data quality dimensions and data validation practices, in order to identify possible errors or inconsistencies, and (b) a set of data cleaning and data completion techniques in order to correct or eliminate the identified errors or inconsistencies.
The Data Cleaning process contains a series of steps related to the assessment and analysis of the data, as well as the refinement or removal of parts of the data through the corrective actions that follow from that initial assessment and analysis. Overall, the ICARUS Data Cleaning approach is a 5-step process: (1) a preliminary data analysis, (2) the definition of the validation rules, (3) the definition of the cleansing workflow, with its cleansing and missing value handling rules, (4) the execution of the cleansing workflow, and (5) the verification of the cleansing workflow results.
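To make the interplay of validation rules, cleansing rules and missing value handling more concrete, here is a minimal Python sketch. It is not the ICARUS implementation: the `ValidationRule` class, the field names, the plausible-delay range and the default values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class ValidationRule:
    """Hypothetical rule: a field name plus a predicate its value must satisfy."""
    field: str
    is_valid: Callable[[Any], bool]
    description: str


def find_violations(records: list[Record], rules: list[ValidationRule]):
    """Return (row index, field, rule description) for every failed check."""
    violations = []
    for i, rec in enumerate(records):
        for rule in rules:
            value = rec.get(rule.field)
            if value is not None and not rule.is_valid(value):
                violations.append((i, rule.field, rule.description))
    return violations


def run_cleansing(records, rules, defaults):
    """Execute an assumed cleansing workflow on a copy of the records."""
    cleaned = []
    for rec in records:
        row = dict(rec)
        # Cleansing rule (assumed): an invalid value is treated as missing.
        for rule in rules:
            value = row.get(rule.field)
            if value is not None and not rule.is_valid(value):
                row[rule.field] = None
        # Missing value handling (assumed): fill from defaults where possible.
        for name, value in list(row.items()):
            if value is None and name in defaults:
                row[name] = defaults[name]
        # Rows that still contain gaps are removed.
        if all(v is not None for v in row.values()):
            cleaned.append(row)
    return cleaned


# Illustrative data only.
flights = [
    {"flight": "A3 601", "delay_min": 12, "gate": "B4"},
    {"flight": "A3 602", "delay_min": -999, "gate": None},
    {"flight": None, "delay_min": 5, "gate": "A1"},
]
rules = [ValidationRule("delay_min", lambda v: -60 <= v <= 1440,
                        "delay within plausible range")]
defaults = {"delay_min": 0, "gate": "UNKNOWN"}

violations = find_violations(flights, rules)     # analysis against the rules
cleaned = run_cleansing(flights, rules, defaults)  # execution of the workflow
remaining = find_violations(cleaned, rules)      # verification of the results
```

Re-running the same validation rules on the cleaned records is one simple way to approximate the verification step; a real workflow would likely report what was changed or dropped as well.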
Data provenance is typically associated with the evidence-based detection of the origin and the evolution over time of a data asset, as well as of all its related processes, and helps to resolve any controversial data ownership aspects. To this end, the ICARUS data provenance process captures and manages trustworthy data asset trails that track the lineage and the derivation of the data assets that have been checked in to ICARUS, in a coarse-grained, light-weight manner at dataset level.
The ICARUS-relevant provenance information complies with the W3C PROV Data Model and spans over 4 core perspectives, namely the Agent Perspective (Who?), the Artefact Perspective (Which?), the Process Perspective (What?) and the Underlying Timing Perspective (When?), as depicted in the following figure.
Since the provenance trails cannot refer to the actual data included in a data asset, due to inherent restrictions imposed by the ICARUS encryption schemes, the data asset values cannot be monitored and actual reproducibility of the data cannot be achieved (e.g. to view intermediate data or to replay possibly alternative data processing steps on intermediate data). A full history log of the actions related to the data asset as a whole will nevertheless be maintained.
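A dataset-level provenance trail of this kind could be sketched as follows. This is an illustrative assumption, not the ICARUS design: the `ProvenanceEvent` and `ProvenanceLog` names, the identifiers and the action labels are hypothetical. The four fields loosely mirror the Who / Which / What / When perspectives, and `derived_from` plays the role that a derivation relation has in W3C PROV. Only asset identifiers are stored, never data values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    agent: str                          # Who?   (agent perspective)
    asset_id: str                       # Which? (artefact perspective)
    action: str                         # What?  (process perspective)
    at: datetime                        # When?  (timing perspective)
    derived_from: Optional[str] = None  # parent asset, if this one was derived


class ProvenanceLog:
    """Append-only history of actions on whole data assets."""

    def __init__(self) -> None:
        self._events: list[ProvenanceEvent] = []

    def record(self, agent, asset_id, action, derived_from=None, at=None):
        event = ProvenanceEvent(agent, asset_id, action,
                                at or datetime.now(timezone.utc), derived_from)
        self._events.append(event)
        return event

    def history(self, asset_id):
        """All recorded actions for one data asset, in insertion order."""
        return [e for e in self._events if e.asset_id == asset_id]

    def lineage(self, asset_id):
        """Follow derived_from links back towards the original asset."""
        parents = {e.asset_id: e.derived_from
                   for e in self._events if e.derived_from}
        chain = [asset_id]
        current = asset_id
        # The membership check guards against accidental cycles.
        while current in parents and parents[current] not in chain:
            current = parents[current]
            chain.append(current)
        return chain


# Hypothetical identifiers, for illustration only.
log = ProvenanceLog()
log.record("org:airline-a", "asset:schedules-2019", "check-in")
log.record("org:airline-a", "asset:schedules-2019", "update")
log.record("org:airport-b", "asset:schedules-delays", "check-in",
           derived_from="asset:schedules-2019")

actions = [e.action for e in log.history("asset:schedules-2019")]
chain = log.lineage("asset:schedules-delays")
```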
ICARUS Metadata Schema
The ICARUS metadata schema is also built considering international metadata schemas, such as Dublin Core, DCAT, VoID, DataCite, CKAN, Aviation Metadata Profile and ISO 19115, and features over 70 metadata elements, classified as:
- Core Metadata encapsulating the basic information accompanying a data asset, e.g. a unique identifier for the data asset following specific naming conventions, the title by which the data asset is formally known and a brief (free-text) description of the data asset.
- Semantic Metadata referring to semantic annotations for the data asset, as well as its mapping to the ICARUS data model and its linking to other data assets.
- Distribution Metadata that provide a better understanding of the availability of a data asset, define its accessible forms and allow for retrieving certain data asset extracts (as defined by the data asset provider).
- Sharing Metadata shedding light on the rights and the policies associated with a data asset.
- Trading Metadata keeping track of the data contracts that have been made and registered in ICARUS.
- Preservation Metadata presenting the quality assessment of a data asset, as well as information related to its provenance.
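As a rough illustration of how the six categories above might be grouped in a single metadata record, consider the following Python sketch. The field names and values are hypothetical; the actual schema has over 70 elements that are not reproduced here. The only constraint taken from the text is that the core metadata carry an identifier, a title and a description.

```python
CATEGORIES = ("core", "semantic", "distribution",
              "sharing", "trading", "preservation")
REQUIRED_CORE = ("identifier", "title", "description")

# Hypothetical record; every field name and value is illustrative.
metadata = {
    "core": {
        "identifier": "icarus-asset-0001",
        "title": "Example flight schedule extract",
        "description": "Free-text summary of the data asset.",
    },
    "semantic": {"annotations": ["schedule"], "model_mappings": {}, "links": []},
    "distribution": {"formats": ["csv"], "extract_allowed": True},
    "sharing": {"license": "custom", "policies": []},
    "trading": {"contracts": []},
    "preservation": {"quality_assessment": None, "provenance_ref": None},
}


def validate_metadata(md: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing category: {c}" for c in CATEGORIES if c not in md]
    core = md.get("core", {})
    problems += [f"missing core field: {f}"
                 for f in REQUIRED_CORE if not core.get(f)]
    return problems


problems = validate_metadata(metadata)
```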
Data Mapping and Linking encompass methods and techniques to address the inherent semantic interoperability problem, at syntactic, schematic and semantic heterogeneity levels, that appears in any data integration endeavour. In ICARUS, a common aviation data model reconciling the different aviation data standards is considered instrumental to ensure effective data integration at data check-in time and at data query time. To this end, the data model has been “designed for change” with the purpose of efficiently managing its whole lifecycle and anticipating its consistent evolution.
ICARUS Data Model Lifecycle
The ICARUS data model lifecycle thus consists of the following 8 phases:
- Phase I: Modelling, during which certain preparatory activities have been performed and the ICARUS common aviation data model has been constructed. The preparatory activities included two parallel streams: (a) the study of the ICARUS aviation ontology, based on the NASA ATM Ontology and considering the data collection activities from the ICARUS demonstrators and OAG that were documented in D1.3, and (b) the analysis of a set of aviation data standards that were prioritized, namely A-CDM, ACRIS, AIXM and partly SSIM, as well as a general-purpose data standard, UN/CEFACT CCTS (Core Components Technical Specification).
- Phase II: Model Storage that properly and securely stores the model in its JSON representation in order to be easily accessible at run-time.
- Phase III: Mapping Algorithms Definition embracing the design of algorithms for effectively mapping the data that are checked in in ICARUS (source schema) to the underlying ICARUS common aviation data model (target schema). In ICARUS, such algorithms range from traditional schema matching algorithms (that leverage the domain knowledge) to supervised machine learning algorithms (which learn from the data that are mapped) that shall be employed to calculate the mappings between source and target schema, at run-time.
- Phase IV: Mapping Algorithms Training, referring to the “offline” use of specific small training datasets that have been created by ICARUS to fit and tune the mapping algorithms that have been created in Phase III.
- Phase V: Semi-automated Data Mapping that practically executes the mapping algorithms and proposes specific mappings between the data that are checked in and the ICARUS common aviation data model.
- Phase VI: Model Evolution which reflects the inevitable updates and changes that need to be performed on the ICARUS common aviation data model as time goes by, either on the ICARUS administrators' own initiative to anticipate new needs (e.g. an update of an existing aviation data standard or the emergence of a new data standard) or on demand to address specific proposals received from data providers who attempt to check in their data assets in ICARUS. The changes that are performed on the data model are classified as major or minor, and result in a new version of the data model that may be backward compatible (so no action is needed for data that are already checked in) or non-backward compatible (so certain actions for propagating the changes need to be taken). In detail, all evolution events concern addition, update or deletion and are governed by a set of evolution rules.
- Phase VII: Data Transformation which is responsible for transforming the data structure of a dataset in accordance with the mapping rules. Note that if the ICARUS common aviation data model imposes specific measurement units or code lists on specific properties, this phase can also take responsibility for transforming the data values accordingly, since it is applied prior to any data cleaning, anonymization and encryption method.
- Phase VIII: Data Linking that concerns how data that belong to different data assets can be potentially linked during query time, at schema level based on the ICARUS common aviation data model and their metadata (in accordance with the ICARUS metadata schema), to facilitate data consumers in exploring the data assets.
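The mapping and transformation phases above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: plain string similarity from the standard library (`difflib.SequenceMatcher`) stands in for the schema matching and machine learning algorithms of Phase III, and the target properties, source field names, threshold and unit conversion are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical extract of a target model: property -> imposed unit (if any).
TARGET_MODEL = {
    "flight_number": None,
    "scheduled_departure_time": None,
    "departure_delay_minutes": "min",
}


def normalise(name: str) -> str:
    """Lower-case a field name and unify its separators."""
    return name.lower().replace("-", "_").replace(" ", "_")


def propose_mappings(source_fields, target_fields, threshold=0.6):
    """Propose, per source field, the most similar target property.

    Returns {source: (target or None, score)}; None means that no
    candidate reached the (assumed) similarity threshold.
    """
    proposals = {}
    for src in source_fields:
        scored = [(SequenceMatcher(None, normalise(src), tgt).ratio(), tgt)
                  for tgt in target_fields]
        score, best = max(scored)
        proposals[src] = (best if score >= threshold else None, round(score, 2))
    return proposals


def transform(record, mappings, converters):
    """Restructure a record per the confirmed mappings, converting units."""
    out = {}
    for src, tgt in mappings.items():
        convert = converters.get(tgt)
        out[tgt] = convert(record[src]) if convert else record[src]
    return out


# Illustrative source record with provider-specific field names.
record = {"Flight Number": "A3 601", "delay_sec": 720, "PAX": 148}
proposals = propose_mappings(record.keys(), TARGET_MODEL)

# Semi-automated step: a data provider reviews the proposals and confirms
# or corrects them; this confirmed mapping is written by hand here.
confirmed = {"Flight Number": "flight_number",
             "delay_sec": "departure_delay_minutes"}
converters = {"departure_delay_minutes": lambda seconds: seconds / 60}
transformed = transform(record, confirmed, converters)
```

The proposals are suggestions only. Keeping the confirmation as an explicit step reflects the semi-automated nature of Phase V, where a proposed mapping can be accepted or overridden before any transformation takes place.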
Such phases are practically interwoven to support 3 main workflows, namely Workflow I: At data model preparation time (based on the Related Data Model Lifecycle Phases: I-IV), Workflow II: At data check-in time (based on Related Data Model Lifecycle Phases: V-VII) and Workflow III: At data query time (based on Related Data Model Lifecycle Phases: VIII).
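Returning to the Model Evolution phase, the relation between evolution events and model versions could be sketched as below. The actual ICARUS evolution rules are not given in this post, so the rule table is purely an assumption made for illustration: additions are treated as minor, backward compatible changes, while updates and deletions are treated as major, non-backward compatible ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    major: int
    minor: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"


# Assumed rules (not from ICARUS): event type -> backward compatible?
BACKWARD_COMPATIBLE = {"addition": True, "update": False, "deletion": False}


def evolve(version: ModelVersion, events: list[str]):
    """Return (new_version, needs_propagation) for a batch of events."""
    unknown = [e for e in events if e not in BACKWARD_COMPATIBLE]
    if unknown:
        raise ValueError(f"unknown evolution events: {unknown}")
    if not events:
        return version, False
    if any(not BACKWARD_COMPATIBLE[e] for e in events):
        # Major change: already checked-in data need the change propagated.
        return ModelVersion(version.major + 1, 0), True
    # Minor change: no action needed for data that are already checked in.
    return ModelVersion(version.major, version.minor + 1), False


current = ModelVersion(1, 2)
minor_result = evolve(current, ["addition"])
major_result = evolve(current, ["addition", "deletion"])
```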
The respective data management methods were originally defined in D2.1 and were revisited and refined in D2.3, with minor or major improvements that are explained in detail, in order to reflect the latest advancements and perspectives in alignment with the ongoing ICARUS platform development activities.
Blog post prepared by Suite5.
ACI, EUROCONTROL & IATA (2017) A-CDM Implementation Manual v5, March 2017, Available online at: https://www.eurocontrol.int/sites/default/files/publication/files/airport-cdm-manual-2017.PDF
ACI (2019) Airport Community Recommended Information Services (ACRIS), Available online at: https://aci.aero/about-aci/priorities/airport-it/acris/
Aeronautical Information Exchange Model (AIXM), Available online at: http://aixm.aero/
IATA (2019) SSIM – Standard Schedules Information, Available online at: https://www.iata.org/publications/store/Pages/standard-schedules-information.aspx
NASA (2018) NASA ATM Ontology, Available online at: https://data.nasa.gov/ontologies/atmonto/
UN/CEFACT (2019) Core Component Technical Specification, Version 3.0, Available online at: https://www.unece.org/cefact/codesfortrade/ccts_index.html
Wu, S. (2013) A review on coarse warranty data and analysis. Reliability Engineering & System Safety, 114: 1–11.