Data Cleaning is an umbrella term for the data processing actions related to the detection and correction (or removal) of corrupted or inaccurate elements of a dataset. Typical data cleaning operations involve identifying incomplete, incorrect, inaccurate or irrelevant data elements, known as dirty or coarse data, and correcting, replacing or deleting them.
The ICARUS Data Cleaning Process
In the ICARUS perspective, the data cleaning process aims at removing or correcting erroneous data that could lead to incorrect, inaccurate or even invalid results or conclusions, safeguarding in this way the quality of the results of any subsequent data analysis. The ICARUS Data Cleaning process is composed of four core steps: a) a preliminary analysis of the selected data, where the included data elements are inspected and insights are gained that are utilised in the next steps; b) the definition of the cleansing workflow with a set of validation rules, as well as cleansing rules and missing-values handling rules that define the corrective actions to be performed when a validation rule is violated and an error is identified; c) the execution of the cleansing workflow defined in the previous step; and d) an optional verification step, where the execution results are verified and assessed by the data provider.
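Steps b) and c) can be sketched as validation rules paired with corrective actions. The following is a minimal illustration only, not the actual ICARUS implementation; the rule and column names are assumptions:

```python
def not_negative(value):
    """Validation rule: a delay in minutes must be non-negative (or absent)."""
    return value is None or value >= 0

def clamp_to_zero(value):
    """Cleansing rule: replace an invalid negative value with 0."""
    return max(value, 0)

# Step b) define the workflow as (column, validation rule, cleansing rule) triples.
workflow = [
    ("delay_minutes", not_negative, clamp_to_zero),
]

def run_workflow(rows, workflow):
    """Step c) execute: apply the corrective action wherever a rule is violated."""
    cleaned = []
    for row in rows:
        row = dict(row)  # leave the input untouched
        for column, validate, correct in workflow:
            if not validate(row.get(column)):
                row[column] = correct(row[column])
        cleaned.append(row)
    return cleaned

rows = [{"delay_minutes": 12}, {"delay_minutes": -5}]
print(run_workflow(rows, workflow))
# [{'delay_minutes': 12}, {'delay_minutes': 0}]
```

Pairing each validation rule with its own corrective action keeps step b) declarative: the workflow is data that can be inspected or extended before step c) runs it.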
Challenges in Data Cleaning
Data Cleaning is highly dependent on the dataset for which the cleansing workflow is designed and executed. Depending on the dataset, the data cleaning process can be rather complex and multifaceted, hence several challenges arise. The first challenge concerns the flexibility and usability of the data cleaning process. The process needs to be flexible enough to cope with all the peculiarities of the selected dataset, as well as with the needs of the data provider for that dataset. At the same time, this flexibility might introduce confusion and limit the usability of the process for the data provider. In the ICARUS perspective, the designed data cleaning process strikes a balance between flexibility and usability based on the feedback of the ICARUS data providers.
The Need for Dynamic Data Validation
Another crucial challenge is the need for a dynamic data validation process, supported by dynamic data validation rules. Due to the variety in the nature and context of each data source, the data validation rules should effectively handle the key characteristics of each dataset in order to ensure the highest quality of results. To this end, the ICARUS data validation approach provides a list of dynamic, flexible, complete, coherent and efficient validation rules that were designed and implemented in collaboration with the data providers and the demonstrators of the project, taking into consideration the key characteristics of the foreseen datasets. Additionally, the process is highly configurable and flexible, so that it can accommodate new datasets with different contexts and types of information.
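One common way to make validation rules dynamic and configurable, as described above, is to express them as per-dataset configuration rather than code. The rule types and the tiny interpreter below are illustrative assumptions, not the ICARUS rule catalogue:

```python
# Generic, reusable rule types: each maps a value and rule parameters to a verdict.
RULE_TYPES = {
    "range": lambda v, p: p["min"] <= v <= p["max"],
    "not_null": lambda v, p: v is not None,
    "one_of": lambda v, p: v in p["allowed"],
}

# Per-dataset configuration: accommodating a new dataset only needs a new entry.
dataset_rules = {
    "icao_code": [{"type": "one_of", "allowed": {"EGLL", "LFPG", "EHAM"}}],
    "altitude_ft": [{"type": "range", "min": 0, "max": 60000}],
}

def validate_record(record, dataset_rules):
    """Return the list of (column, rule type) violations for one record."""
    violations = []
    for column, rules in dataset_rules.items():
        for rule in rules:
            check = RULE_TYPES[rule["type"]]
            if column in record and not check(record[column], rule):
                violations.append((column, rule["type"]))
    return violations

print(validate_record({"icao_code": "XXXX", "altitude_ft": 35000}, dataset_rules))
# [('icao_code', 'one_of')]
```

Keeping the rule catalogue small and generic while pushing dataset specifics into configuration is one way to reconcile the flexibility and usability demands discussed earlier.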
In any data cleaning process, performance is another important challenge, as the process includes computationally intensive and time-consuming tasks executed on the large volumes of data typically handled in a big data ecosystem such as ICARUS. Hence, the ICARUS data cleaning process was designed with these aspects in mind, utilising state-of-the-art tools and libraries for data processing and manipulation. Moreover, since the flexibility of the process allows multiple validation and cleansing rules to be configured, the background operations of the cleaning workflow execution are designed to run as effectively and efficiently as possible, minimising execution time.
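One concrete example of such an efficiency concern, sketched here with pandas purely for illustration (the column name is an assumption), is applying a rule as a vectorised column operation rather than a per-row Python loop:

```python
import pandas as pd

df = pd.DataFrame({"delay_minutes": [12, -5, 30, -1]})

# Vectorised validation: one boolean mask for the whole column at once,
# instead of checking each row in an interpreted loop.
invalid = df["delay_minutes"] < 0

# Vectorised cleansing: correct all violations in a single operation.
df.loc[invalid, "delay_minutes"] = 0

print(df["delay_minutes"].tolist())
# [12, 0, 30, 0]
```

On large datasets, the vectorised form delegates the work to optimised native code, which is exactly the kind of execution-time saving the paragraph above refers to.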
Handling of Missing Values
Finally, another crucial challenge is the data completion, or missing-values handling, process and how it can be kept bias-free. In the literature, a large variety of methods and processes empowered with advanced statistical or analytical techniques are available. Nevertheless, their suitability and employment are highly dependent on the nature and context of the information included in the dataset, as several methods might impact the performed analysis or significantly affect the conclusions that can be drawn from it. Hence, the crucial prerequisite for selecting the applied methods is a valid analysis of the root cause of the missing values, which is the responsibility of the data provider, who has extended knowledge of the selected dataset.
Within the context of ICARUS, the data cleaning process offers an extensive list of data completion and missing-values handling methods that can be leveraged by the data provider, assisting them as far as possible in the selection process by exploiting the metadata of the dataset as provided by the data provider.
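A minimal sketch of the idea, under the assumption of a few common handling methods and illustrative metadata fields (none of these names come from the ICARUS platform), might pick a method based on provider-supplied metadata like so:

```python
import statistics

def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    return [mean if v is None else v for v in values]

def impute_constant(values, fill=0):
    """Replace missing entries with a fixed fallback value."""
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Discard missing entries entirely."""
    return [v for v in values if v is not None]

def choose_method(metadata):
    """Pick a handling method from dataset metadata supplied by the provider."""
    if metadata.get("missing_mechanism") == "completely_at_random":
        return impute_mean        # mean imputation is least distorting under MCAR
    if metadata.get("type") == "categorical":
        return drop_missing       # a numeric mean is meaningless for categories
    return impute_constant        # conservative default

values = [10, None, 14]
print(choose_method({"missing_mechanism": "completely_at_random"})(values))
# [10, 12, 14]
```

The point of the sketch is the dispatch step: the metadata encodes the provider's analysis of why values are missing, so the choice of method, and hence the bias it may introduce, stays under the provider's control.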
Blog post authored by UBITECH.