Email the Author
You can use this page to email Fru Kingsly about Effective Data Wrangling and Exploration with R.
About the Book
Data wrangling is one of the most important steps in data science and analytics; it is often claimed to take between 80% and 90% of an analyst’s time. Data wrangling goes by many names, including data munging, data manipulation, data preparation and data transformation. Just as it has many names, it also has many definitions. Below we look at two of the most important ones:
Trifacta, a leading provider of data wrangling software of the same name, defines data wrangling as:
“Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time”.
Gartner, which uses the term data preparation, defines it as:
“Data preparation is an iterative-agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for self-service data integration, data science, data discovery, and BI/analytics”.
From the above, we can deduce that data wrangling is the process of converting raw data from one form into another that is appropriate for the task at hand. In analytics, we rarely receive data in the form and shape we need for our analysis. Most often, we have to transform, clean, enrich and explore the data before moving on to the analysis itself.
Data wrangling involves:
- Importing and exporting data: to and from CSV files, Excel workbooks, databases, etc.
- Cleaning data: identifying and dealing with missing data, outliers, and duplicates
- Manipulating text and categorical data
- Manipulating dates
- Encoding and enriching data
- Manipulating columns and rows
- Split-apply-combine data
- Merging data
- Reshaping data
- Grouping and aggregating data
- Exploring data
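A few of the steps listed above can be sketched in base R. This is only a minimal illustration, not the book's own code: the `sales` and `regions` data frames, their columns, and their values are all invented for the example.

```r
# Hypothetical raw data: every name and value here is invented for illustration.
sales <- data.frame(
  order_id = c(1, 2, 2, 3, 4),
  region   = c("north", "South", "South", "north", NA),
  amount   = c(100, 250, 250, NA, 80),
  stringsAsFactors = FALSE
)
regions <- data.frame(
  region  = c("North", "South"),
  manager = c("Ada", "Ben"),
  stringsAsFactors = FALSE
)

# Cleaning: drop exact duplicate rows, then rows with any missing value.
clean <- unique(sales)
clean <- clean[complete.cases(clean), ]

# Manipulating text: standardise the capitalisation of region names.
clean$region <- paste0(toupper(substr(clean$region, 1, 1)),
                       substring(clean$region, 2))

# Merging: enrich each order with the manager of its region.
merged <- merge(clean, regions, by = "region")

# Grouping and aggregating: total amount per region.
totals <- aggregate(amount ~ region, data = merged, FUN = sum)
print(totals)
```

Dropping incomplete rows is only one way of dealing with missing data; depending on the task, imputing or flagging the missing values may be more appropriate.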
Data exploration is, and should be, the initial step of any data analysis project. It is a miniature form of data analysis in which we use both descriptive statistics and data visualization techniques to better understand our dataset. In traditional analysis and research, we know exactly what we are after (that is, the hypothesis is known) before collecting data. In exploratory analysis, the process is reversed: we assume little or nothing about the outcome of the analysis and instead explore the data to arrive at meaningful insights or hypotheses. Data exploration involves:
- looking at the structure and size of the data
- looking at the completeness and correctness of the data
- looking at the possible relationships that may exist between data elements
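The three checks above might look as follows in base R. This is a rough sketch under stated assumptions: the `orders` data frame and its columns are hypothetical, and a real exploration would usually add plots as well.

```r
# Hypothetical dataset: columns and values are invented for illustration.
orders <- data.frame(
  quantity = c(1, 2, 3, 4, NA),
  price    = c(10, 20, 30, 40, 50),
  discount = c(0, 0, 5, 5, 10)
)

# Structure and size of the data.
str(orders)
dims <- dim(orders)  # number of rows, number of columns

# Completeness and correctness: count missing values per column and
# inspect the value ranges for anything implausible.
missing_per_column <- colSums(is.na(orders))
print(missing_per_column)
print(summary(orders))

# Possible relationships: pairwise correlations between numeric columns,
# computed only on rows with no missing values.
correlations <- cor(orders, use = "complete.obs")
print(round(correlations, 2))
```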
As can be observed, the boundary between data exploration and data wrangling is blurred because both make use of data cleaning techniques to make sure that the data is correct and complete for data analysis.
This book is all about data wrangling and exploration as important steps leading up to data analysis.
How Is This Book Structured?
The book is divided into seven parts:
- Part 1: Programming with R (chapters 1 to 15)
- Part 2: Import and export data (chapters 16 to 18)
- Part 3: String and categorical data manipulation (chapters 19 to 21)
- Part 4: Date manipulation (chapters 22 to 24)
- Part 5: Data manipulation (chapters 25 to 28)
- Part 6: Data cleaning (chapters 29 to 30)
- Part 7: Data exploration (chapters 31 to 32)
About the Author