How to manipulate an enormous Data Set


Data preparation is the core of data science and is crucial to any data analysis. It consists of data cleansing and feature engineering. Together these steps usually take 60 to 80 percent of the effort in the whole analytical pipeline. Despite that cost, data preparation is mandatory if you want to get the best accuracy from machine learning algorithms on your data sets [1].

The first part of this process is data cleansing: altering data in a given storage resource to make sure that it is accurate and correct. Data cleansing is also known as data cleaning or data scrubbing. It includes many different functions, such as: Basics (select, filter, removal of duplicates, etc.), Sampling (balanced, stratified), Data Partitioning (creating training, validation, and test data sets), Transformations (normalization, standardization, scaling, pivoting), Binning (count-based, handling of missing values as their own group), Data Replacement (cutting, splitting, merging), Weighting and Selection (attribute weighting, automatic optimization, etc.), Attribute Generation (ID generation), and Imputation (replacement of missing observations by using statistical algorithms).
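A few of these functions can be illustrated in a minimal sketch. It uses only the Python standard library and covers removal of duplicates, mean imputation, min-max normalization, and data partitioning. The function names, the 60/20/20 split, and the sample rows are assumptions made for illustration; they do not come from [1], and a real project would more likely rely on a dedicated data library.

```python
import random
from statistics import mean


def remove_duplicates(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows, column):
    """Replace missing (None) values in `column` with the mean of the observed values."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def min_max_normalize(rows, column):
    """Rescale `column` linearly to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, span = min(values), max(values) - min(values)
    return [{**r, column: (r[column] - lo) / span if span else 0.0} for r in rows]


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle the rows and split them into training, validation, and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical sample data: one duplicated row and one missing income value.
raw = [
    {"id": 1, "age": 20, "income": 1000.0},
    {"id": 2, "age": 30, "income": None},
    {"id": 2, "age": 30, "income": None},
    {"id": 3, "age": 40, "income": 3000.0},
    {"id": 4, "age": 50, "income": 2000.0},
    {"id": 5, "age": 60, "income": 4000.0},
]

clean = remove_duplicates(raw)
clean = impute_mean(clean, "income")
clean = min_max_normalize(clean, "age")
train_set, validation_set, test_set = partition(clean)
```

Each step returns a new list instead of modifying its input, so the raw data stays available for comparison after cleansing.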

The second part is feature engineering: selecting the right attributes to analyze. One uses domain knowledge of the data to select or create attributes that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. It is an iterative cycle of steps: brainstorming or testing features, selecting features, validating how the features work with your model, improving the features if needed, and returning to brainstorming and creating more features until the work is done.
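One pass through this cycle can be sketched as follows: create a candidate feature from domain knowledge, then rank all candidates against the target. Ranking by absolute Pearson correlation is only one simple selection heuristic, chosen here as an assumption; the house data, the derived `area` attribute, and the function names are likewise hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def rank_features(rows, feature_names, target):
    """Sort features by the absolute correlation with the target, strongest first."""
    ys = [r[target] for r in rows]
    scores = {
        name: abs(pearson([r[name] for r in rows], ys)) for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data set: the price depends on the floor area of a house.
houses = [
    {"width": 2, "length": 10, "price": 2000},
    {"width": 4, "length": 3, "price": 1200},
    {"width": 5, "length": 6, "price": 3000},
    {"width": 10, "length": 1, "price": 1000},
    {"width": 3, "length": 8, "price": 2400},
]

# Feature creation: domain knowledge suggests combining width and length.
for house in houses:
    house["area"] = house["width"] * house["length"]

# Feature selection: the derived attribute ranks above both raw attributes.
ranking = rank_features(houses, ["width", "length", "area"], "price")
best_feature = ranking[0][0]
```

In this constructed example the derived `area` attribute explains the price far better than either raw attribute alone, which is the kind of gain that feature creation aims for. A correlation ranking only captures linear relationships, so the validation step against the actual model remains necessary.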

Both data cleansing and feature engineering are part of data preparation and fundamental to the application of machine learning and deep learning algorithms. They are also complicated and time-consuming.

[1] S. P. Kai Wähner, “Data preprocessing vs. data wrangling in machine learning projects.”

Disclaimer: The present content may not be used for training artificial intelligence or machine learning algorithms. All other uses, including search, entertainment, and commercial use, are permitted.