Statistics Part 2, the basic measures

Basic Measures

This is the second part of a brief three-part introduction to statistics. Here for part 1 and here for part 3.


Wikipedia definition for statistics: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

And it's the absolute basis for data mining, so it is a necessary subject for aspiring data miners.

A common situation is that you have a bunch of numbers that you want to summarize: to understand which values appear most frequently and what the central value (average) is (descriptive statistics), and perhaps to use them to predict what will happen in the future (inferential statistics).

These numbers are called sample data from an experiment, and they could measure any number of things:
  • the number of muffins sold in a bakery,
  • the number of times people click the "buy now" button in your online bakery,
  • the number of purple cars on Sundays,
  • the number of drunk people at 1 am versus at 5 am,
  • how many hours your cat sleeps per day,
  • how many shots on target a basketball player makes.

Some of these, if we create a statistic from them, can prove to be useful. For example, suppose you can only buy eggs at the market on the weekend, and you are wondering how many eggs you need to bake muffins for one full week. If you order too many eggs, they will go bad and you will have to throw them away; if you order too few, you won't have enough for the whole week.
So, if you compute a statistic of how many muffins you sold last week, you can estimate how many eggs you will need to buy.
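As a rough sketch of the egg example in R: the daily sales figures below are made up, and the recipe ratio of one egg per four muffins is only an assumption for illustration.

```r
# Hypothetical muffin sales for each day of last week (made-up numbers)
muffins_sold <- c(40, 35, 50, 45, 60, 80, 70)

# Assumption: the recipe needs 1 egg for every 4 muffins
eggs_per_muffin <- 1 / 4

# Total muffins sold last week, used as the estimate for next week
weekly_muffins <- sum(muffins_sold)

# Eggs to buy, rounded up so we do not run short
eggs_needed <- ceiling(weekly_muffins * eggs_per_muffin)
print(eggs_needed)
```

With real data you would probably look at more than one week, since a single week may not be typical.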

Using R software.
We will go through some methods used for analyzing numbers (making statistics) using R software, which was built for exactly this kind of thing!


Basic statistical concepts
There are a few basic statistical concepts to be aware of.

Measures of central tendency
The simplest way to reduce a set of data while still retaining part of the information is to summarize the set with a single value.

mean
answers: what is the average of all the numbers?
> mean(c(10,11,12,9,13,11,11))
11
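To see what mean() is doing, here is a minimal sketch that computes the same value by hand. The variable names x and manual_mean are just illustrative.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# The mean is the sum of the values divided by how many values there are
manual_mean <- sum(x) / length(x)

# Compare with the built-in function
print(manual_mean)
print(mean(x))
```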

median
answers: (if we put all the numbers in order) what is the value in the middle?
> median(c(10,11,12,9,13,11,11))
11
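The "put the numbers in order and take the middle one" idea can be sketched by hand like this. The names sorted_x and manual_median are illustrative only.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# Put the values in order
sorted_x <- sort(x)
n <- length(sorted_x)

# Odd count: take the middle value; even count: average the two middle values
if (n %% 2 == 1) {
  manual_median <- sorted_x[(n + 1) / 2]
} else {
  manual_median <- mean(sorted_x[c(n / 2, n / 2 + 1)])
}
print(manual_median)
```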

mode
answers: which value appears most frequently?
make a histogram, for example:
> hist(c(10,11,12,9,13,11,11))
look at the graph and you can see that the tallest bar is the one containing 11, which means 11 appears most frequently.
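A histogram groups values into bins, so reading the mode off the graph can be imprecise. One possible alternative is to count the values directly with table(). Note that R's built-in mode() function reports the storage type of an object, not the statistical mode, so the sketch below builds its own result in an illustrative variable called mode_value.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# Count how many times each distinct value occurs
counts <- table(x)
print(counts)

# Keep the value(s) with the highest count; names() returns text, so convert back
mode_value <- as.numeric(names(counts)[counts == max(counts)])
print(mode_value)
```

If several values tie for the highest count, this returns all of them.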



Measures of spread
While measures of location attempt to identify the most representative value in a set of data, a description of the data is not complete until its spread (variability) is also known. In fact, the basic numerical description of a data set requires measures of both centre and spread.

Range
answers: how broad is the interval between the min and max values in the data?
> range(c(10,11,12,9,13,11,11))
9 13
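range() returns the two end points rather than a single number. If you want the width of the interval, a small sketch (with illustrative names limits and width) would be:

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# range() returns the smallest and largest values
limits <- range(x)

# The width of the interval is max minus min
width <- diff(limits)
print(width)
```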

box plots
answers: how is the data spread out? (a visual representation)
> boxplot(c(10,11,12,9,13,11,11))
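The box plot is drawn from a handful of summary numbers. As a sketch, you can print them with fivenum(), which gives the minimum, lower hinge, median, upper hinge and maximum, and get the width of the box with IQR(). The variable name five is illustrative.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# Interquartile range: the distance between the first and third quartiles
print(IQR(x))
```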



variance
answers: how far, on average, are the values from the mean? (measured as squared distances)
> var(c(10,11,12,9,13,11,11))
1.666667
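A minimal sketch of the calculation behind var(), which computes the sample variance and therefore divides by n - 1 rather than n. The names squared_dev and manual_var are illustrative.

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# Squared distance of each value from the mean
squared_dev <- (x - mean(x))^2

# Sample variance: sum of squared distances divided by n - 1
manual_var <- sum(squared_dev) / (length(x) - 1)
print(manual_var)
print(var(x))
```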

standard deviation
answers: how far, on average, are the values from the mean? It is the square root of the variance, so it is expressed in the same units as the data and is easier to interpret.
> sd(c(10,11,12,9,13,11,11))
1.290994
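The link between the two measures can be checked directly: the standard deviation is the square root of the variance. A short sketch, with manual_sd as an illustrative name:

```r
x <- c(10, 11, 12, 9, 13, 11, 11)

# Square root of the sample variance
manual_sd <- sqrt(var(x))

# Both lines should print the same value
print(manual_sd)
print(sd(x))
```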

Measures of association
measure how strongly two sets of data are associated.

Covariance
answers: how much do two variables change together?
> cov(1:10, 5:14)
9.166667
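As a sketch of what cov() computes: multiply each value's distance from its own mean in one variable by the matching distance in the other, add the products up, and divide by n - 1. The names x, y and manual_cov are illustrative.

```r
x <- 1:10
y <- 5:14

# Product of the paired distances from each mean, summed, divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
print(manual_cov)
print(cov(x, y))
```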

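Correlation, between -1 and 1
answers: how much do two variables change together? Like the covariance, but rescaled to be more easily interpreted: 1 means a perfect increasing linear relationship, -1 a perfect decreasing one, and 0 no linear relationship.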
> cor(1:10, 5:14)
1
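The rescaling can be sketched as the covariance divided by the product of the two standard deviations, which is how the Pearson correlation (the default in cor()) is defined. The second part uses rev() to reverse one variable, as an illustration of a correlation at the negative end of the scale. The name manual_cor is illustrative.

```r
x <- 1:10
y <- 5:14

# Covariance rescaled by both standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))
print(manual_cor)
print(cor(x, y))

# Reversing y makes it fall as x rises, so the correlation turns negative
print(cor(x, rev(y)))
```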

Correlation values are quickly grasped by looking at the following graph:

