Data Preparation Challenges

Although the datasets uploaded in the ICARUS platform conform to the ICARUS common aviation data model, the uploaded datasets may not be directly usable for data analytics purposes as in order for the analytics algorithms to be applied, numerous data preparation and feature engineering steps are typically required. In addition, appropriately pre-processing the originally ingested data is often required for the creation of more insightful visualisations.

For example, it may be the case that a dataset contains flight data, including the scheduled departure datetime and the actual departure datetime (both of which are foreseen by the common data model). For most data analytics jobs, the exact timestamps would not be useful. Instead, a data scientist that wishes to experiment with a delay prediction algorithm, would maybe want to use the two timestamps to compute the delay in minutes and then group the flights into 3 categories depending on the computed delay time, e.g. on time, with small delay, with large delay. Furthermore, the data scientist may also need to combine information from two datasets, e.g. enrich the flight data with weather data. These data manipulation functionalities, which correspond to a very important part of a data analysis workflow, are offered in the ICARUS Platform through the Data Preparation functionalities.

Data manipulation functionalities in ICARUS

The provided data manipulation functionalities provided through the ICARUS Platform can be grouped under different actions:

  • Create column: The functionalities provided in this category allow the user to create new columns based on dataset values in existing columns. The category is further divided into subcategories:
    • Timestamp-related operations which allow the extraction of information from datetime and date fields. Indicatively, a user could choose to extract the year from a datetime field and save it in a separate column.
    • Math-related operations which allow the user to perform mathematical operations between columns and/or between columns and values, e.g. to add the values of two existing columns and keep the result in a new column.
    • Aggregation-related operations, which allow the user to create columns that hold the computation of aggregations (sum, count, average) over existing data. Under this category, the user may find also functions that facilitate the computation of time-based aggregations, e.g. the monthly average of a given column based on the datetime information of another column.
    • Shift operations, which allow the user to shift the values of an existing column by a selected offset in either direction (lagging or leading values).
    • Conditional operations, which allow the user to create a new column whose values will depend on the values of another column in a custom way, while supporting a more flexible conditional column creation (through AND and OR operations), as well. For example, under this category, the user may find mainly aviation-specific functions, such as the extraction of the IATA season from a selected datetime column.
    • String operations, which allow the user to create a new column whose values will be the result of applying string manipulation functions over the values of another column. Indicative examples include transforming to upper-/lowercase and casting a string to a numeric type.
  • Drop rows or columns: This category allows the user to drop some of the dataset columns and/or rows.
  • Filter row(s): The functionalities provided under this category allow the user to filter the dataset based on the values of selected columns. Indicatively, thresholds may be defined for numeric columns above/under which the rows should be discarded. The user is able to create and apply filters for numeric, time and string fields.
  • Rename column: The user is allowed to rename some of the dataset columns, both the ones of the original dataset and the ones created through the data preparation job.
  • Merge with another dataset: The functionalities in this category allow the user to merge the current dataset with another one that is already included in the user’s assets (including results of previously executed data preparation jobs).
  • Fill null values: The provided functionalities allow the user to fill missing values in selected columns either by explicitly providing a value or by choosing an appropriate aggregation in case of numeric data, e.g. filling null values with the computed average of the column values.
  • Compute aggregations: These functionalities allow the user to compute aggregations, i.e. sum, average and count, on the selected dataset.
  • Split columns: These functionalities allow the user to create new columns by splitting the values of a dataset column based on a selected separator character. Through these functionalities the user may split a column that contains ranges in two separate columns (holding the minimum and maximum value) or split a value in an arbitrary number of tokens. The latter is a special case that will result not only in a new column being created, but also on additional rows being added to the dataset, one for each part of the original string that was split.
  • Enrich content: The provided functionalities allow the user to enrich the dataset with additional external information. Indicative examples include geospatial data enrichment, such as adding columns with a country’s average latitude and longitude and extracting the ISO code of a country based on a column with country names, or the aviation-focused enrichment based on certain predefined controlled vocabularies from IATA and ICAO (e.g. airport codes).

Keeping all options open for a data scientist

The data manipulation actions that need to be performed over a dataset to make it suitable for analysis and visualization cannot be conclusively and uniquely defined. Instead, such actions depend (apart from the dataset) also on the preferences of each data scientist, the requirements imposed by each algorithm, the specific case being investigated, other datasets with which the original dataset will be combined etc. As such, a data preparation process even on the same dataset may be significantly diverse. The data scientist may need to experiment with various data manipulation methods and sequences of steps in order to bring the dataset in the desired form to be further used for analysis and/or visualization. Considering that data filtering and dataset merging are also applied through the Data Preparation Manager, the result of a data preparation workflow may constitute a “new” dataset that should be kept in order to be used as input to various algorithms and visualisations but also to be reused as input to new data preparation workflows.

Controlled expressivity on data manipulation

For security reasons, the ICARUS Platform does not allow data scientists to define any data manipulation function they wish to apply on the raw data (e.g. by writing code in a notebook). Instead, they should always proceed with utilizing the supported data manipulation functions. However, there is no limitation as to how many steps a data preparation job may have and each step builds on top of the outcome of the previous step, hence it is possible to create a new column using one of the provided functionalities and then further process the new column’s values. The steps are executed sequentially in the exact order they are defined by the data scientist.

Need to check license compatibility and IPR

Prior to applying any data preparation functions on a dataset, the ICARUS Platform cross-checks the applicable license details (e.g. whether derivation is allowed) and informs the data scientist accordingly.

Need to follow the results step-by-step

To facilitate the data scientists in selecting and applying the appropriate steps during a data preparation job, some insights are provided regarding the outcome of each step. Specifically, they may obtain information, such as rows count, null count, distinct values and common numeric aggregations, and preview the results (10-15 rows only).

In summary, based on the demonstrators’ experience, the data preparation functionalities have proved to be instrumental in gaining new insights on the raw data, but also on facilitating the proper definition of the analytics jobs in the ICARUS Platform.

Blog post authored by Suite5.

Featured Image Photo by Franki Chamaki on Unsplash