In this exercise, we will perform some simple data exploration using pandas in Python. We will use a dataset that has information about various car models. The data is in a CSV file, mtcars.csv.
The notebook for this tutorial along with the dataset can be found here.
We can start by importing pandas and loading the data into a dataframe.
import pandas as pd
data = pd.read_csv('mtcars.csv')
Now that we have our data in a dataframe, we can take a peek at the first few rows.
data.head()
We can also quickly get some statistics on the data by using the describe function.
data.describe()
The info function lists each column along with its datatype and its count of non-null values.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null int64
3 disp 32 non-null float64
4 hp 32 non-null int64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
We can also look for null values in the dataframe.
data.isnull().sum()
model 0
mpg 0
cyl 0
disp 0
hp 0
drat 0
wt 0
qsec 0
vs 0
am 0
gear 0
carb 0
dtype: int64
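Since mtcars.csv has no missing values, the output above is all zeros. As a minimal sketch of what the same call reports when values are missing, the example below uses a small made-up dataframe (the name toy and its values are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values, for illustration only.
toy = pd.DataFrame({
    "model": ["A", "B", "C"],
    "mpg": [21.0, np.nan, 12.0],
    "hp": [110, 95, np.nan],
})

# Count of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```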
Now let's say we want to see which model has the maximum MPG. We can do that by finding the row in which the mpg column has its highest value.
data.loc[data['mpg'].idxmax()]
model Toyota Corolla
mpg 33.9
cyl 4
disp 71.1
hp 65
drat 4.22
wt 1.835
qsec 19.9
vs 1
am 1
gear 4
carb 1
Name: 19, dtype: object
As we can see, Toyota Corolla has the highest MPG in our dataset. If we are only interested in the name of the model, we can modify the code above by adding the name of the desired column.
data.loc[data['mpg'].idxmax(), 'model']
'Toyota Corolla'
Similarly, if we want values from more than one column to be displayed, we can do that by passing the names of the columns as a list.
data.loc[data['mpg'].idxmax(), ['model','wt','qsec']]
model Toyota Corolla
wt 1.835
qsec 19.9
Name: 19, dtype: object
The opposite of idxmax is idxmin, so to get the row with the minimum value in a column, we can use the code above but replace idxmax with idxmin.
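As a rough sketch of idxmin and idxmax side by side, the example below uses a small made-up dataframe rather than mtcars.csv (the name toy and its values are hypothetical).

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "model": ["A", "B", "C"],
    "mpg": [21.0, 30.5, 12.0],
    "wt": [2.6, 1.9, 5.1],
})

# idxmin / idxmax return the index label of the row holding the extreme value;
# .loc[row_label, column_label] then picks a single cell from that row.
lowest_mpg_model = toy.loc[toy["mpg"].idxmin(), "model"]
highest_mpg_model = toy.loc[toy["mpg"].idxmax(), "model"]

print(lowest_mpg_model)   # C
print(highest_mpg_model)  # B
```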
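Another step in data exploration is checking the correlation between variables. Pandas makes that very easy. We can have it compute a correlation matrix to give us a broad sense of the correlation between different variables. The model column holds text, so we pass numeric_only=True to restrict the calculation to the numeric columns; recent pandas versions raise an error otherwise.
data.corr(numeric_only=True)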
If we are interested in the correlation between only two variables, for example between mpg and wt, we can calculate it as follows.
data.mpg.corr(data.wt)
-0.8676593765172279
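By default, corr computes the Pearson correlation coefficient. To make that number less of a black box, here is a minimal sketch that computes it by hand and compares it with the pandas result. The dataframe toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "mpg": [30.0, 25.0, 20.0, 15.0, 12.0],
    "wt": [1.8, 2.2, 2.9, 3.4, 4.1],
})

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
x = toy["mpg"] - toy["mpg"].mean()
y = toy["wt"] - toy["wt"].mean()
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

pandas_r = toy["mpg"].corr(toy["wt"])

print(manual_r, pandas_r)  # the two values agree up to floating-point error
```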
Now what if we want to look at the correlations of only one variable with all of the other variables? This is simple. First we compute the correlation matrix, assign it to a variable, and then retrieve the correlations of any column with the others.
matrix = data.corr(numeric_only=True)  # numeric columns only; model is text
matrix['mpg']
mpg 1.000000
cyl -0.852162
disp -0.847551
hp -0.776168
drat 0.681172
wt -0.867659
qsec 0.418684
vs 0.664039
am 0.599832
gear 0.480285
carb -0.550925
Name: mpg, dtype: float64
We can also sort these correlation values in descending order, from the strongest positive correlation with mpg to the strongest negative one.
matrix['mpg'].sort_values(ascending = False)
mpg 1.000000
drat 0.681172
vs 0.664039
am 0.599832
gear 0.480285
qsec 0.418684
carb -0.550925
hp -0.776168
disp -0.847551
cyl -0.852162
wt -0.867659
Name: mpg, dtype: float64
By default, the sort_values function displays values in ascending order, so we set ascending to False to get them in descending order.
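A strong negative correlation is just as informative as a strong positive one, so it can be useful to rank by absolute value instead. The sketch below rebuilds the correlations with mpg from the output shown earlier (leaving out mpg itself) and sorts them with the key argument of sort_values, which needs pandas 1.1 or later. The names mpg_corr and strongest are ours.

```python
import pandas as pd

# Correlations with mpg, copied from the output shown earlier (mpg itself omitted).
mpg_corr = pd.Series({
    "cyl": -0.852162, "disp": -0.847551, "hp": -0.776168,
    "drat": 0.681172, "wt": -0.867659, "qsec": 0.418684,
    "vs": 0.664039, "am": 0.599832, "gear": 0.480285,
    "carb": -0.550925,
})

# key transforms the values used for ordering only; the displayed values keep their sign.
strongest = mpg_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest)
```

With this ordering, wt, cyl and disp come first, and qsec comes last.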
It would not be fair to talk about correlation without talking about covariance.
Covariance measures how two variables vary together. It can be positive or negative. A positive covariance means that the variables tend to change in the same direction. If it is negative, the variables tend to move in opposite directions. Unlike correlation, its magnitude depends on the units of the variables.
The syntax for covariance is similar to correlation, but replace corr with cov. So to get a covariance matrix, we can simply use:
data.cov(numeric_only=True)  # numeric columns only; model is text
All the other tasks that we did with correlation can be done for covariance as well.
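To show how the two measures relate, the sketch below checks that the Pearson correlation equals the covariance divided by the product of the two standard deviations, and that rescaling a column changes the covariance but not the correlation. The dataframe toy, its values, and the factor of 1000 are made up for illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "mpg": [30.0, 25.0, 20.0, 15.0, 12.0],
    "wt": [1.8, 2.2, 2.9, 3.4, 4.1],
})

# Covariance and correlation between two columns.
cov_xy = toy["mpg"].cov(toy["wt"])
r = toy["mpg"].corr(toy["wt"])

# Correlation is covariance normalised by both standard deviations.
r_from_cov = cov_xy / (toy["mpg"].std() * toy["wt"].std())

# Expressing wt in different units scales the covariance, not the correlation.
wt_scaled = toy["wt"] * 1000
cov_scaled = toy["mpg"].cov(wt_scaled)
r_scaled = toy["mpg"].corr(wt_scaled)

print(cov_xy, cov_scaled)
print(r, r_from_cov, r_scaled)
```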