This week we took part in the Data Science Club organized by Exponea, a great event consisting of a series of tech talks about Big Data technologies and approaches. For the second time, Instarea’s Martin Bago gave a talk at this event. This time Martin’s topic was data exploration techniques in the R language. The venue was full of young talent from the Faculty of Informatics and Information Technologies at the Slovak University of Technology in Bratislava, a leading IT faculty in Slovakia.
The first steps with a new dataset are often crucial for later success. If you understand a new collection of data on the first try (the outliers, the structure of the database, the variables, and so on), your subconscious mind can work with that knowledge even while you are asleep, and you’ll have far fewer problems. However, a small mistake at the beginning, such as misunderstanding the given values or their units, can lead to huge problems in the end. Just ask NASA’s Mars Climate Orbiter, which was lost in 1999 after imperial and metric units got mixed up.
The data journey always looks the same: from importing data to cleaning it, then transforming it to your needs, then visualizing and modelling. Repeat the last three activities until you have something to show and use.
In the example below I will show you how to approach a new dataset in a few simple steps. I will describe these steps in the R environment, my favorite. The code and screenshots below are from RStudio.
For reproducibility I will use one of the free datasets from the package called datasets, which is free and rich. The package ships with base R and is attached by default, so there is nothing to install. Loading it explicitly is harmless and makes the dependency visible:
>> library(datasets) # part of base R, so install.packages() is not needed
For this example we will use the dataset called “mtcars”, which consists of automotive data. First, I recommend looking at the first and last rows of the dataset. To do this, use the commands “head()” and “tail()”.
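As a minimal sketch of this first look, using only base R (the variable names and the row count passed as n are arbitrary choices for illustration):

```r
# mtcars ships with base R in the datasets package
library(datasets)

first_rows <- head(mtcars)         # first 6 rows by default
last_rows  <- tail(mtcars, n = 3)  # last 3 rows

print(first_rows)
print(last_rows)

# Overall size of the table: number of rows, then number of columns
print(dim(mtcars))
```

Both functions accept n, so head(mtcars, n = 10) would show the first ten rows instead.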
To investigate the dataset further, use commands with richer output, such as “summary()” and “str()”.
Both commands give you a detailed view of the columns and their data types. It is crucial to find out which format each column of your dataset is stored in; often you have to convert a column to another type just to run the model correctly.
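The following sketch shows one way this inspection and type conversion might look. Treating cyl and am as categorical is my assumption for illustration, not something the dataset enforces; the copy named cars and the factor labels are likewise illustrative.

```r
library(datasets)

# Compact structure: one line per column with its type and first values
str(mtcars)

# Per-column summary statistics
print(summary(mtcars))

# Every column in mtcars is stored as numeric, even the coded ones
col_types <- sapply(mtcars, class)
print(col_types)

# Work on a copy and convert two coded columns to factors
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

str(cars[, c("cyl", "am")])
```

Converting on a copy keeps the original mtcars untouched, which is convenient when later steps such as cor() expect purely numeric columns.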
The summary() command also shows the minimum, maximum, median, mean, and quartile values of each column, which is very important for understanding the distribution of values. Another way to understand the distribution is to plot a histogram. To do this, call hist() on the column you want to visualize, for example hist(mtcars$mpg).
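A small sketch of both views for a single column; the choice of mpg, the number of breaks, and the labels are arbitrary assumptions for illustration:

```r
library(datasets)

# Numeric summary of one column: min, quartiles, median, mean, max
print(summary(mtcars$mpg))

# Histogram of the same column; hist() also returns the bins it used
h <- hist(mtcars$mpg,
          breaks = 10,  # a suggestion only, R may pick a nearby number
          main = "Distribution of mpg",
          xlab = "Miles per gallon")

# Number of cars that fell into each bin
print(h$counts)
```

When run outside an interactive session, the plot goes to the default graphics device (a PDF file) rather than to a window.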
If your dataset is incomplete, you need to quantify how many values are missing. To find out, use the command is.na(), which returns a matrix of Boolean values TRUE/FALSE; in the best-case scenario you will see nothing but FALSE on your screen. Wrapping it as sum(is.na(mtcars)) gives the total count of missing values directly.
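Reading a screen full of TRUE/FALSE gets tedious quickly, so here is a hedged sketch of how the same information could be condensed into counts (the variable names are my own):

```r
library(datasets)

na_matrix <- is.na(mtcars)   # logical matrix with the same shape as the data

total_missing  <- sum(na_matrix)              # TRUE counts as 1
missing_by_col <- colSums(na_matrix)          # missing values per column
complete_rows  <- sum(complete.cases(mtcars)) # rows without any NA

print(total_missing)
print(missing_by_col)
print(complete_rows)
```

For mtcars all counts of missing values come out as zero, because this particular dataset is complete.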
If the dataset has some missing values, a basic approach is to fill them with the mean or median of the column they belong to.
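Since mtcars has no gaps, the sketch below first removes a few mpg values on purpose. The row positions are arbitrary and the gaps are artificial; only the filling step illustrates the approach described above.

```r
library(datasets)

cars <- mtcars
gap_rows <- c(3, 10, 25)     # hypothetical positions, for illustration only
cars$mpg[gap_rows] <- NA     # introduce artificial missing values

# na.rm = TRUE computes the median from the values that are present
mpg_median <- median(cars$mpg, na.rm = TRUE)

# Replace every missing mpg with that median
cars$mpg[is.na(cars$mpg)] <- mpg_median

print(mpg_median)
print(sum(is.na(cars$mpg)))
```

Swapping median() for mean() gives mean imputation. Either way, the filled column has less spread than the real data would, which is worth keeping in mind before modelling.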
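For an even better understanding of your dataset and the values in it, use box plots with the command boxplot(). This type of chart shows the distribution of values graphically: the median, the first and third quartiles as the edges of the box, and whiskers reaching toward the minimum and maximum values. By default R stops the whiskers at the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond that as individual outlier points. A box plot is a great way to understand the values of a variable.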
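A short sketch of box plots for mpg, first alone and then split by cylinder count. The grouping by cyl and the axis labels are my choices for illustration; boxplot.stats() is used only to print the numbers that the plot draws.

```r
library(datasets)

# The five numbers behind the box:
# lower whisker, first quartile, median, third quartile, upper whisker
stats <- boxplot.stats(mtcars$mpg)
print(stats$stats)
print(stats$out)   # points beyond the whiskers, if there are any

# Single box for one column
boxplot(mtcars$mpg, main = "mpg", ylab = "Miles per gallon")

# One box per group: mpg split by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")
```

The grouped version is often the more informative one, because it shows how the distribution of one variable shifts across the levels of another.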
Now you know a lot about your dataset and the values in it. But you are still missing one crucial thing: which variables move together? Which correlate positively, and which negatively?
For this, I use my favorite library, corrplot, which is very easy to use and has intuitive commands. The correlation matrix itself comes from base R: just write cor(dataset_name), for example cor(mtcars), and pass the result to corrplot() to draw it. The displayed results tell you how strongly each pair of variables is linearly related. A number close to 1 or -1 means a strong correlation; a number close to 0 means a weak one. As a rule of thumb, you can focus on values above 0.5 or below -0.5.
In the resulting correlation matrix you can see the relations between variables. Notice, for example, disp (engine displacement) in the left column and mpg (miles per gallon, i.e. fuel economy) in the header row. They are in a strong negative correlation, about -0.85. And that makes sense: bigger displacement means fewer miles per gallon, that is, higher fuel consumption. Second example: disp and cyl (number of cylinders) are in a strong positive correlation, 0.9. That is plausible too, since a bigger displacement almost always means a higher cylinder count.
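The correlation step could be sketched as follows. cor() comes from base R; the corrplot call runs only if that package happens to be installed, and the 0.5 threshold plus the helper names (corr, strong, strong_pairs) are illustrative assumptions.

```r
library(datasets)

corr <- cor(mtcars)          # Pearson correlation for every pair of columns
print(round(corr, 2))

# The two pairs discussed in the text
print(corr["disp", "mpg"])
print(corr["disp", "cyl"])

# List each pair once (upper triangle) where |r| exceeds 0.5
strong <- which(abs(corr) > 0.5 & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r    = round(corr[strong], 2)
)
print(strong_pairs)

# Optional visual version, skipped when corrplot is not installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, method = "circle")
}
```

Keep in mind that a correlation describes how two variables move together; on its own it does not say which one influences the other.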
That’s all for today. If you want to see how to visualize this dataset in a more intuitive way, please check the full presentation on SlideShare. If you have any questions, don’t hesitate to comment or contact me directly at firstname.lastname@example.org