Introducing Slik: A Data Processing and Modelling Python Library
One thing I have realised about the data science and machine learning (ML) world is that there is a lot of buzz around model building and the kinds of problems that state-of-the-art ML algorithms solve.
One piece that people tend to pay less attention to is the data cleaning and wrangling process. The actual learning that happens with an algorithm can often be done in one line of code, and depending on the complexity of the algorithm and the available compute resources, a training job can finish in seconds or even milliseconds.
However, how you get to that final step of the model building process matters just as much, and that is predominantly a matter of preparing your data for modelling. Companies can use data from nearly endless sources (internal information, customer service interactions, and all over the internet) to help inform their choices and improve their business outcomes. But this raw data needs to be transformed before it is fed into a machine learning algorithm.
The idea of helping data scientists leapfrog the data cleaning and wrangling process is what gave birth to Slik-wrangler. Slik-wrangler is designed to help navigate the issues of basic data wrangling and preprocessing when dealing with any form of data. The library helps to jump-start supervised learning projects, and it has several tools that make it easy to load data of any format and to clean and inspect it. It also offers a quick way to preprocess data and perform feature engineering.
The Slik-wrangler documentation describes in detail how the library works and how it can be used.
In this article, we will show some high-level usage of Slik-wrangler. While introducing some concepts, we will also demonstrate some useful functions within the Slik-wrangler library that help in data cleaning and wrangling.
Why is data preprocessing important?
Data preprocessing is the process of transforming raw data into an easy-to-understand format. This is also an important step in data mining, as it is difficult to work with raw data. Before applying machine learning or data mining algorithms, you need to check the quality of your data.
“Garbage in, garbage out” is a phrase you will often hear when training machine learning models. It means that using “bad” or “dirty” data to train a model will result in an improperly trained model that performs poorly.
Data Preprocessing Steps
Let’s take a look at the established steps you’ll need to go through to make sure your data is successfully preprocessed.
- Data loading.
- Data quality assessment.
- Data cleaning.
- Data transformation.
- Data reduction.
Data Loading
This is the very first step when working with any dataset. Slik-wrangler provides an easy way to read your data files efficiently using the loadfile module.
When reading a file with pandas, you need to call the reader that matches the file format, for example read_csv for CSV, read_excel for Excel, or read_parquet for Parquet. With Slik-wrangler you can read files with different extensions without naming the format yourself. This means that with a single function, read_file, you can load CSV, Excel, and Parquet files.
All you need to do is specify the file path to the dataset, and the data extension difference is automatically handled.
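To make the idea of extension-based loading concrete, here is a minimal pandas sketch. It is not Slik-wrangler's read_file implementation; the READERS mapping and the read_any helper are hypothetical names used only for illustration, and the Excel and Parquet readers additionally assume that the optional engines pandas needs for those formats are installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping (not part of Slik-wrangler) from file extension
# to the pandas reader that handles that format.
READERS = {
    ".csv": pd.read_csv,
    ".xls": pd.read_excel,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def read_any(path, **kwargs):
    """Load a file into a DataFrame, choosing the reader from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    # Extra keyword arguments are forwarded to the underlying pandas reader.
    return READERS[suffix](path, **kwargs)
```

Calling read_any("data.csv") or read_any("data.parquet") would then go through the same entry point, which is the convenience the paragraph above describes.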
Let’s say you have a CSV file with a hundred thousand rows, and you need to split it into smaller files of twenty thousand rows each. This is where split_csv_file comes in. It is useful in scenarios where there is a limit on the number of rows that can be read from a single file.
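The splitting step itself can be sketched with Python's standard csv module. This is an illustration of the technique rather than the code behind split_csv_file; the function names, the output file naming scheme, and the choice to repeat the header in every part are all assumptions made for this example.

```python
import csv
from pathlib import Path


def _write_part(out_dir, stem, index, header, rows):
    """Write one part file containing the header followed by the given rows."""
    part_path = out_dir / f"{stem}_part{index}.csv"
    with open(part_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return part_path


def split_csv(path, rows_per_file, out_dir):
    """Split a CSV file into numbered parts of at most rows_per_file data rows.

    Hypothetical helper: assumes the file is non-empty and its first line
    is a header, which is copied into every part.
    """
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    parts = []

    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                parts.append(_write_part(out_dir, stem, len(parts) + 1, header, chunk))
                chunk = []
        if chunk:  # leftover rows that do not fill a whole part
            parts.append(_write_part(out_dir, stem, len(parts) + 1, header, chunk))
    return parts
```

With a hundred-thousand-row file, split_csv("data.csv", 20_000, "parts") would write five part files. Because rows are streamed one at a time, only one chunk is held in memory at once.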
Data Quality Assessment (DQA)
Data Quality Assessment (DQA) is the process of assessing the quality of the data (or dataset).
The process of asserting data quality ensures the data is suitable for use and meets the quality required for projects or business processes.
Slik-wrangler handles data quality assessment with a dedicated module called dqa, which contains several functions for checking data quality. One function you will use often is data_cleanness_assessment, which shows an overview of how clean the data is.
A single call to it is all you need to get an overview of how clean your dataset is and whether there are any issues to be addressed. The dqa module also provides independent functions for checking specific issues with the dataset, such as missing_value_assessment and duplicate_assessment.
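For readers who want to see what such an overview boils down to, the following is a rough pandas approximation of two of the checks named above (missing values and duplicates). It is not the dqa module's code or output format; quality_overview and the toy data are hypothetical.

```python
import pandas as pd


def quality_overview(df):
    """Summarise missing values per column and count fully duplicated rows.

    Hypothetical helper, written only to illustrate the kind of checks
    a data quality assessment performs.
    """
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    })
    # duplicated() marks every repeat of an earlier, identical row.
    duplicate_rows = int(df.duplicated().sum())
    return summary, duplicate_rows


# Toy data made up for this example.
df = pd.DataFrame({
    "age": [22, None, 35, 22],
    "city": ["A", "B", None, "A"],
})
summary, duplicate_rows = quality_overview(df)
print(summary)
print("duplicate rows:", duplicate_rows)
```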
Data Cleaning
Data cleaning is the process of formatting and cleaning data to make it suitable for analysis. It includes handling missing values, removing duplicates, and correcting, repairing, or removing incorrect or irrelevant data. It is arguably the most important preprocessing step in making your data meet your downstream needs. Slik-wrangler provides an API for cleaning your dataset called preprocessing (imported as pp by convention).
Identifying and Fixing Outliers
Outliers can have a huge impact on data analysis and modelling results. For example, suppose you run a survey of people's ages and someone enters 1000 as their age. We know this is not possible, and if it is not removed or corrected, it will greatly skew the results.
Slik-wrangler currently relies on the interquartile range (IQR) approach to detect outliers in a dataset. It can also fix the outliers it finds using different methods, such as replacing an outlier with the mean or median of the feature. You can select the numerical features you want to operate on and display a table identifying the rows that contain at least n outliers.
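The interquartile range rule is easy to sketch in pandas. The version below is an illustration under common assumptions, not Slik-wrangler's code: it uses the conventional multiplier k = 1.5, and it computes the replacement value from the non-outlying points only, which is a design choice made for this example. The function names are hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def replace_outliers(series, k=1.5, strategy="median"):
    """Flag values outside the IQR fences and replace them.

    Hypothetical helper: strategy is "median" or "mean", computed from
    the values that were not flagged as outliers.
    """
    series = series.astype(float)
    lower, upper = iqr_bounds(series, k)
    is_outlier = (series < lower) | (series > upper)
    inliers = series[~is_outlier]
    fill = inliers.median() if strategy == "median" else inliers.mean()
    return series.mask(is_outlier, fill), is_outlier


# The survey example from the text: one respondent entered 1000 as an age.
ages = pd.Series([23, 25, 31, 35, 40, 1000], dtype=float)
fixed, flagged = replace_outliers(ages)
print(fixed.tolist())
```

Here only the value 1000 falls outside the fences, so it is the only entry replaced.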
Identifying Missing values
To check whether our dataset contains missing values, we can use the check_nan function from the slik_wrangler.preprocessing module, which returns a plot of the percentage of missing values present in the data.
We can see that the column Cabin has over 77% missing values.
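The quantity shown in such a plot, the percentage of missing values per column, can be computed directly with pandas. The sketch below is not check_nan itself; missing_percentage is a hypothetical helper, and the toy rows are made up for illustration, so the percentages differ from the figure quoted above.

```python
import pandas as pd


def missing_percentage(df):
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy data with a Cabin column; values are invented for this example.
df = pd.DataFrame({
    "Cabin": [None, "C85", None, None],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
pct = missing_percentage(df)
print(pct)
```

If matplotlib is installed, pct.plot.barh() would draw a comparable bar chart from the same numbers.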
Handling Missing Values
There are several ways to handle missing values: you can either drop them or replace them. Slik-wrangler helps you handle the missing values in your data intelligently and efficiently. You can choose a strategy for your numerical features, and for your categorical features you can pass a value through the fillna parameter or let them be filled with the mode by default. You can also drop missing values across rows and columns using threshold parameters.
Dropping missing values can be done in one of the following ways:
- remove rows with missing values.
- remove columns containing missing values.
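In plain pandas, the two dropping options above map onto DataFrame.dropna. The last part of the sketch shows one way a threshold could work; the 0.5 cut-off and the variable names are assumptions for illustration and are not Slik-wrangler's parameters.

```python
import pandas as pd

# Toy data made up for this example.
df = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": [None, None, None, 8.0],
    "c": ["x", "y", "z", "w"],
})

# Remove every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Remove every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Threshold variant (assumed cut-off): keep only the columns in which
# at most half of the values are missing.
max_missing_fraction = 0.5
keep = df.columns[df.isna().mean() <= max_missing_fraction]
thresholded = df[keep]

print(rows_dropped.shape, list(cols_dropped.columns), list(thresholded.columns))
```

Dropping rows is harsh on this toy frame because column b is mostly empty; removing that column first with a threshold preserves far more data, which is why the two options are usually combined.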
Alternatively, missing values can be replaced with another value. Usually, the following strategies are adopted:
- for numerical values, replace the missing value with the mean, median, or mode of the column.
- for categorical values, replace the missing value with the most frequent value of the column (the mode) or any value you choose.
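The replacement strategies above can be sketched as a small pandas helper. This is an illustration only: fill_missing and its parameter names are hypothetical and do not mirror Slik-wrangler's signature, and the median is included as a common alternative for numerical columns. The helper assumes each column has at least one non-missing value.

```python
import pandas as pd


def fill_missing(df, numeric_strategy="mean", categorical_fill=None):
    """Fill missing values column by column.

    Hypothetical helper. Numeric columns use the mean, median, or mode;
    other columns use categorical_fill, or the mode when it is None.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            if numeric_strategy == "mean":
                value = out[col].mean()
            elif numeric_strategy == "median":
                value = out[col].median()
            else:  # treat anything else as "mode"
                value = out[col].mode().iloc[0]
        elif categorical_fill is not None:
            value = categorical_fill
        else:
            value = out[col].mode().iloc[0]
        out[col] = out[col].fillna(value)
    return out


# Toy data made up for this example.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "city": ["A", "A", None, "B"],
})
filled = fill_missing(df)
labelled = fill_missing(df, categorical_fill="Unknown")
print(filled)
```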
The holy grail of the Slik-wrangler library is the slik.preprocessing function that cleans your data in one line of code. This function cleans your data by removing outliers present in the data, handling missing values, featurizing DateTime columns, and mapping relevant columns.
The function saves the preprocessed file in a project path that you specify.
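Of the steps that function is described as performing, featurizing DateTime columns is the one not illustrated elsewhere in this article. A minimal pandas sketch of the idea follows; the choice of derived features, the column naming, and the featurize_datetime helper are all assumptions for illustration and may differ from what Slik-wrangler produces.

```python
import pandas as pd


def featurize_datetime(df, column):
    """Replace a date column with numeric year/month/day/weekday features.

    Hypothetical helper written for this example.
    """
    out = df.copy()
    dt = pd.to_datetime(out[column])
    out[f"{column}_year"] = dt.dt.year
    out[f"{column}_month"] = dt.dt.month
    out[f"{column}_day"] = dt.dt.day
    out[f"{column}_dayofweek"] = dt.dt.dayofweek  # Monday is 0
    return out.drop(columns=[column])


# Toy data made up for this example.
df = pd.DataFrame({"signup": ["2021-01-15", "2022-07-04"], "amount": [10, 20]})
features = featurize_datetime(df, "signup")
print(features)
```

Breaking a timestamp into numeric parts like this gives a model something it can use directly, since most algorithms cannot consume raw date strings.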
Summary
In machine learning, data preprocessing is the process of preparing raw data so that it can be fed into a machine learning algorithm. It is often reported that data scientists spend as much as 80% of their time on data preprocessing. That is a significant amount of time that could be spent on other things, like EDA and predictive modelling. Slik-wrangler is suited to both beginner and expert data scientists: with its low-code implementation and detailed documentation, users can easily apply data wrangling and preprocessing techniques to their datasets.
Furthermore, by using Slik-wrangler for data preprocessing you can achieve the following:
- Treating missing values, handling outliers, and managing columns effectively.
- Building a data schema from your data.
- Handling data quality issues and inconsistencies present in the data.
- Formatting the data so that it can be passed to a machine learning algorithm.
With Slik-wrangler, data scientists can focus on model building without spending too much time reinventing the wheel, and complete their projects in record time.
Check out the project's GitHub repository for more information about the project and its contributors. Also, feel free to star the repository, as it will help the project reach more people. Thank you.