This week we took part in the Data Science Club organized by Exponea, a great event consisting of a series of tech talks about Big Data technologies and approaches. For the second time, Instarea’s Martin Bago gave a talk at this event. This time, Martin spoke about data exploration techniques in the R language. The venue was full of young talent from the Faculty of Informatics and Information Technologies of the Slovak University of Technology in Bratislava, the leading IT faculty in Slovakia.
The first steps with a new dataset are often crucial for later success. If you understand a new collection of data from the start (the outliers, the structure of the database, the variables, and so on), your subconscious mind can work with that knowledge even while you are asleep, and you will avoid many problems. However, a small mistake at the beginning, such as misunderstanding the given values or the unit system, can lead to huge problems in the end. Just ask NASA’s Mars Climate Orbiter, which was lost in 1999 after a mix-up between imperial and metric units.
The data journey always looks the same: import the data, clean it, transform it to fit your needs, then visualize and model it. Repeat the last three steps until you have something to show and to use.
In the example below I will try to show you how to approach a new dataset in a few simple steps. I will describe these steps in the R environment, my favorite. The code below and the screenshots are from RStudio.
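For reproducibility I will use one of the free datasets available in the package called datasets, which is free and rich. This package ships with every R installation and is attached by default, so there is nothing to install. Loading it explicitly is harmless:
>> library(datasets) # datasets is a base R package, no install.packages() needed
>> data(mtcars) # optional: copies the mtcars data frame into your workspace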
For this example we will use the dataset called mtcars, which consists of automotive data. First, I recommend showing the first and last lines of the dataset. To do this, use the commands head() and tail().
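A minimal sketch of that first look could be the following. It assumes a default R installation, where mtcars comes with the datasets package; the variable names first_rows and last_rows are only illustrative.

```r
# mtcars ships with base R in the datasets package
data(mtcars)

first_rows <- head(mtcars)      # first 6 rows by default
last_rows  <- tail(mtcars, 3)   # last 3 rows; the second argument sets the count

print(first_rows)
print(last_rows)
print(dim(mtcars))              # number of rows and number of columns
```

Looking at both ends of the table is a cheap way to spot header rows, totals or other leftovers that an import sometimes drags in.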
For further investigation of the dataset, use commands with richer output, such as summary() and str().
Both commands give you a detailed view of the columns, their attributes and their data types. It is crucial to find out which format the columns of your dataset are in, because you often have to convert them to another type just to run the model correctly.
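As a rough sketch of this step, the snippet below inspects the column types and converts two columns to factors. The choice of am and cyl, the labels and the copy named cars are my assumptions for illustration; in mtcars, am is coded as 0 for automatic and 1 for manual transmission.

```r
data(mtcars)

str(mtcars)       # compact structure: one line per column with its type
summary(mtcars)   # per-column minimum, quartiles, median, mean and maximum

# All mtcars columns are stored as numeric, even the categorical ones
column_types <- sapply(mtcars, class)
print(column_types)

# Work on a copy so the original dataset stays untouched
cars <- mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

str(cars[, c("am", "cyl")])
```

Converting such columns matters because many modeling functions in R treat a numeric 0/1 column as a continuous quantity and a factor as a category.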
The summary() command also shows the minimum, maximum, median, mean and quartiles of each column, which is very important for understanding the distribution of values. Another way to understand the distribution is to plot histograms. To do this, pass the column you want to visualize to hist(), for example hist(mtcars$mpg).
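A small sketch of a histogram for one column, assuming the mpg column of mtcars; the title, axis label and color are arbitrary choices of mine.

```r
data(mtcars)

# hist() draws the plot and invisibly returns the binning it used
mpg_hist <- hist(mtcars$mpg,
                 main = "Distribution of fuel economy",
                 xlab = "Miles per gallon",
                 col  = "grey80")

print(mpg_hist$breaks)   # bin boundaries chosen by R
print(mpg_hist$counts)   # number of cars falling into each bin
```

Keeping the returned object is optional, but it lets you read the exact bin counts instead of estimating them from the picture.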
If your dataset is incomplete, you need to quantify how many values are missing. To find out, use the is.na() command, which returns a matrix of Boolean values TRUE/FALSE. In the best-case scenario you will have a lot of FALSE on your screen.
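Reading a screen full of TRUE and FALSE does not scale, so in practice I would count them. A minimal sketch on mtcars, which happens to be complete; the variable names are illustrative.

```r
data(mtcars)

na_matrix <- is.na(mtcars)                 # logical matrix, same shape as the data

total_missing      <- sum(na_matrix)       # TRUE counts as 1, FALSE as 0
missing_per_column <- colSums(na_matrix)   # missing values in each column
any_missing        <- anyNA(mtcars)        # quick yes/no check

print(total_missing)
print(missing_per_column)
print(any_missing)
```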
If you have some missing values in the dataset, the basic approach is to fill them with the mean or median value of the affected column.
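Because mtcars has no gaps, the sketch below first removes a few hp values on purpose and then fills them with the column median. The removed row positions and the copy named cars are made up purely for illustration.

```r
data(mtcars)

# Copy with artificially removed values, for illustration only
cars <- mtcars
missing_rows <- c(3, 10, 25)
cars$hp[missing_rows] <- NA

# Replace missing entries with the median of the observed values in that column
hp_median <- median(cars$hp, na.rm = TRUE)
cars$hp[is.na(cars$hp)] <- hp_median

print(hp_median)
print(sum(is.na(cars$hp)))   # number of gaps left after filling
```

The median is less sensitive to outliers than the mean, which is why I tend to reach for it first; either way, this is only a simple baseline.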
For an even better understanding of your dataset and its values, use box plots with the boxplot() command. This special type of chart shows the distribution of values graphically: the median, the boundaries of the first and third quartiles, and whiskers reaching toward the minimum and maximum values. By default, points farther than 1.5 times the interquartile range from the box are drawn separately as outliers. A box plot is a great way to understand the values of variables.
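A short sketch of a box plot, assuming we want to compare mpg across cylinder counts; the grouping and the axis labels are my choice. The second part prints the five numbers behind a single box.

```r
data(mtcars)

# One box per cylinder count
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles per gallon")

# The numbers behind a box: lower whisker, lower hinge (roughly Q1),
# median, upper hinge (roughly Q3), upper whisker
mpg_stats <- boxplot.stats(mtcars$mpg)
print(mpg_stats$stats)
print(mpg_stats$out)     # points beyond the whiskers, if any
```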
Now you know a lot about your dataset and the values in it. But you are still missing one crucial thing: which variables are related to other variables? Which correlate positively? And which negatively?
For this, I use my favorite library, called corrplot. It is very easy to use and has very intuitive commands. The correlation matrix itself comes from base R: just write cor(dataset_name), for example cor(mtcars), and the displayed results inform you about the dependency between variables; corrplot then draws that matrix as a chart. A number close to 1 or -1 means a strong correlation, and a number close to 0 means a weak correlation. Generally, you can focus just on values above 0.5 or below -0.5.
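A minimal sketch of both steps. The cor() call is base R; the plotting part assumes corrplot has been installed separately with install.packages("corrplot") and is skipped otherwise. The method = "circle" setting is just one of the available styles.

```r
data(mtcars)

# cor() is part of base R (stats package) and expects numeric columns
cor_matrix <- cor(mtcars)
print(round(cor_matrix, 2))

# Optional plot; runs only if the corrplot package is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle")
}
```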
In the table above you can see the relations between variables. Notice, for example, disp (engine displacement) in the left column and mpg (miles per gallon, a measure of fuel economy) in the header row. They have a strong negative correlation, about -0.85. And that makes sense: a bigger displacement means fewer miles per gallon, that is, higher fuel consumption. Second example: disp and cyl (number of cylinders) have a strong positive correlation, 0.9. That is plausible too, since a bigger displacement almost always means a bigger cylinder count.
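If you prefer to read individual numbers rather than scan the whole table, the pairs mentioned above can be pulled out by name. This is only a sketch; the 0.5 threshold follows the rule of thumb from earlier, and the variable names are illustrative.

```r
data(mtcars)
cor_matrix <- cor(mtcars)

# Single entries of the matrix, addressed by row and column name
disp_mpg <- cor_matrix["disp", "mpg"]
disp_cyl <- cor_matrix["disp", "cyl"]
print(round(c(disp_mpg = disp_mpg, disp_cyl = disp_cyl), 2))

# Variables whose correlation with mpg exceeds 0.5 in absolute value
mpg_cor <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]
strong  <- sort(mpg_cor[abs(mpg_cor) > 0.5])
print(round(strong, 2))
```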
That’s all for today. If you want to find out how to visualize this dataset in a more intuitive way, please check the full presentation on SlideShare. If you have any questions, don’t hesitate to comment or contact me directly at firstname.lastname@example.org