R - Packages and DataTypes



Packages
Functionality in R is organized in packages. Some packages are part of the base functionality and come predefined when you install R; others have to be installed (and loaded) before they can be used.
To find the directories in which your packages live, enter the command .libPaths(); to list the packages that are already loaded, use search(). The commands installed.packages(), install.packages("..") and update.packages() are self-explanatory. To load a package use library(..) or require(..).
To show the content of a package use the command help(package="..").
In RStudio this is easy, as the packages are listed in their own pane.
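As a minimal sketch of these commands in one place (ggplot2 is just an example package name, any package works the same way):
.libPaths()                               # directories in which packages are stored
search()                                  # packages currently attached to the session
installed.packages()                      # matrix of all installed packages
install.packages("ggplot2")               # install a package from CRAN
update.packages()                         # update outdated packages
library(ggplot2)                          # load the package (error if missing)
require(ggplot2)                          # load the package (returns FALSE if missing)
help(package = "ggplot2")                 # show the package content
detach("package:ggplot2", unload = TRUE)  # unload it again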


Data Types
  • Scalar A scalar is a single-value vector (a numeric, logical or character value). Example: s <- 3
  • Vector A vector is a collection of scalars of the same type; to combine them use c(..). Example: v <- c(1, 2, s, 4, 5, 6, 7, 8) (s from above)
  • Matrix A matrix is a collection of vectors; all elements have the same type. Example: m <- matrix(v, nrow=2, ncol=4, byrow = TRUE) (v from above; note that byrow determines whether the values of v are filled in by row or by column, the default)
  • Array An array is a collection of matrices; all elements have the same type. Example: a <- array(v, c(2,2,2))
  • DataFrame A data frame is a matrix-like structure whose columns can hold different types. As your data will usually be a mix of different types, this is the most used data type in R. Example: df <- data.frame(column1, column2, ...) where the columns are vectors of the same length that can be of different types. To set the column names use names(df) <- c("x", "y")
  • Factor A factor is a nominal (categorical) or ordinal (ordered categorical) variable. Example: Yes/No is categorical, Small/Medium/Large/XLarge is ordinal
  • List A list is a collection of arbitrary other data types. Example: l <- list(s, v, m, a, df) (variables from above). To name the elements use list("scalar"=s, ...). To access an element use double brackets with an index [[1]] or a name [["scalar"]]
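Putting these examples together in one small runnable sketch (the data frame columns and the factor values are made up for illustration):
s <- 3                                              # scalar
v <- c(1, 2, s, 4, 5, 6, 7, 8)                      # vector
m <- matrix(v, nrow = 2, ncol = 4, byrow = TRUE)    # matrix, filled by row
a <- array(v, c(2, 2, 2))                           # array
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))  # data frame with mixed types
names(df) <- c("x", "y")                            # (re)name the columns
f <- factor(c("Small", "Medium", "Large", "XLarge"),
            levels = c("Small", "Medium", "Large", "XLarge"),
            ordered = TRUE)                         # ordinal factor
l <- list("scalar" = s, "vector" = v, "matrix" = m, "frame" = df)  # named list
l[[1]]                                              # access by index
l[["scalar"]]                                       # access by name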

Null-Hypothesis, Z-Scores and Normal Distribution


This is Carl Friedrich Gauß, one of the most friendly looking German mathematicians, on a banknote. Nowadays Germany is part of the European Union and uses the (by the way much more secure) EUR banknotes (a different story). He earned that honor because of his widely used contributions to algebra, geometry and astronomy. If you have a close look at the banknote you will even spot the "Gaussian Bell Curve". This is what I want to talk about in this post.
We saw in this post about the null hypothesis and p-values that in order to accept an alternative hypothesis, we have to find a way to reject a null hypothesis. I mentioned the $p$-value, however I did not yet explain how to calculate it.

For every value in the data set we get its corresponding z-score by subtracting the average and dividing by the standard deviation.
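Written as a formula (using $\bar{x}$ for the average and $s$ for the standard deviation of the data set), the $z$-score of a value $x$ is
$$z = \frac{x - \bar{x}}{s}$$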

Instead of the original data set we can then work with the set of standardized values. You might wonder why we should do this: We know that under the null hypothesis the standardized values have an average of $0$ and a standard deviation of $1$ (given that the data set holds enough data; for fewer than 30 values a slightly different assumption is used)! This means that we would expect the observed values to follow the normal distribution (see picture below). In particular, around $68\%$ of the values should lie between $-1$ and $1$, around $95\%$ of the values between $-2$ and $2$, and only very few values should lie outside of $[-3, 3]$:
Actually the bell curve does not end at $-4$ or $4$, but the values approach zero very quickly.
The actual formula of the normal distribution for a mean $\mu$ and a variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$
Here is the R code:
x <- seq(-4, 4, length = 10000)   # grid of x values from -4 to 4
y <- dnorm(x, 0, 1)               # density of the standard normal distribution
plot(x, y, type="n", xlab = "x", ylab = "y", main = "Bell Curve", axes = TRUE)  # empty plot frame
lines(x, y, col="red")            # draw the bell curve

To calculate the $p$-value in R, we can use the command
pnorm(...)
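For example, a small sketch for a standardized observation $z$ (the value 2.1 is just an illustration; which tail(s) to use depends on the test, as explained below):
z <- 2.1                        # example standardized observation
pnorm(z, lower.tail = FALSE)    # one-tailed p-value (right tail), about 0.018
2 * pnorm(-abs(z))              # two-tailed p-value, about 0.036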

What we get out of it:
For every value $x$ the $y$ value describes the relative frequency (the density) of this value in a normally distributed data set. We can now check the probability of a result at least as extreme as our (standardized) observation. We have to choose between a one-tailed and a two-tailed test, meaning we put our decision boundary on one side or (split up) on both sides of the bell curve (details in a different post). Now if this probability is lower than the value we specified as our significance level, we decide to reject the null hypothesis.

Null Hypothesis, P-Value and Significance Level




What is the null hypothesis?
That's easy: It is the assumption that an observation is simply due to chance. The contrary assumption, that an observation is NOT due to chance, is called the alternative hypothesis.

What is it used for?
It is used to formulate a data science problem in its contrary form: Could it be that there is something beyond chance in my observation? Can I build up confidence that there is "something special" in the data (the alternative hypothesis)?
To answer the question, we use a classical mathematical trick: We assume that the null hypothesis is true, so we work on purely random observations. Knowing the rules of chance, we can then calculate the probability of a value at least as extreme as the observed value, the so-called $p$-value, given that the null hypothesis is true. If this probability $p$ is high, then it is not a good idea to reject the null hypothesis. On the other hand, if $p$ is low, then we say that the assumption is not plausible and we have found a way to reject the null hypothesis - we gained confidence that the alternative hypothesis is valid.

When do we reject the null hypothesis?
If the $p$-value falls below a critical level, the so-called significance level $\alpha$. Usual values are $\alpha = 0.05$ or $\alpha = 0.025$. In this case the probability is too low to accept the null hypothesis; the deviation from it is statistically significant.
Assume that the probability of the observation under the null hypothesis is only around 1%. In this case we are pretty confident (99%) that we should reject the null hypothesis. Therefore the $q$-value $q = 1 - p$ (here 99%) is also called the confidence.
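As a tiny sketch of this decision rule in R (the numbers are the ones from the example above):
p     <- 0.01    # p-value of the observation
alpha <- 0.05    # significance level
p < alpha        # TRUE, so we reject the null hypothesis
1 - p            # the confidence q, here 0.99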

If you are interested in how to calculate the $p$-value using the $z$-values, check out this post about the z-scores.



R - How to Start

 

Welcome to a new challenge - an empty green meadow is waiting to be explored and developed!

R is a pretty strange programming language: it does not follow the usual conventions and is not intuitive, at least not in the beginning. I am familiar with quite a few programming languages, however R is nothing like them (in which other programming language do you prefer the assignment operator "<-" over "="??). However, as R is currently considered the reference program for statistics (even more popular than SPSS) and is in addition open source, it is worth looking at.

Here are some tips if you think about starting to learn R:

- Install R first and play around with the console. You will find out that it is not very handy. E.g. R does not require line delimiters (such as ";" or ".") after statements, style guides recommend a maximum line length of 80 characters, ...
- After you have found out that R is quite strange, get RStudio (e.g. here). RStudio is really helpful: it provides shortcuts for the most used commands, a simple structure and a pretty nice user interface.
- Make yourself familiar with Google's R Style Guide, so you do not make a fool of yourself chatting with experts. You also learn a lot about R's specialities (so far I considered "." a bad choice for a character in identifiers, as you expect it to point to a subattribute; however in R it is accepted and "_" is the bad choice...)
- Remember the command "rm(list = ls())", which is used to clear the current workspace (yes, there is a current workspace in which locally created variables live!)
- Look for free online courses (there are tons of them)
- Get familiar with the shortcuts "ALT + -" and "Ctrl + L"
- Find information on available packages on the CRAN sites, and have a look especially at this crantastic page, which allows you to search for (popular) packages
- Use the predefined data sets in R (see "data()" for an overview); they will often be used in examples, and it feels good to already know them
- Use the command "View(..)" regularly on your data to get a clean picture of it
- When you use "require("packageName")", remember to use "detach("package:packageName", unload = TRUE)" at the end (use those commands together in RStudio)
- Use the command demo() to get an overview of the demos included in R. To execute a demo, use the same command with one of the given arguments (e.g. demo(colors)).
- Use fix() on a data frame to correct it manually
- Use transform() and the powerful (s)apply() on your data frames (see the sketch after this list)
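A minimal sketch of the last two tips, using the built-in mtcars data set (the new column name kpl is made up for illustration):
data(mtcars)                                        # load a predefined data set
View(mtcars)                                        # inspect it in the data viewer (e.g. in RStudio)
mtcars2 <- transform(mtcars, kpl = mpg * 0.425144)  # add a derived column (km per liter)
sapply(mtcars2, mean)                               # apply mean() to every column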

This post is being updated regularly.
 



Lift

 
As its name indicates, the lift is a measure of how much your binary classifier model is lifting the predictions. In other words, it measures how much better the new model's selection is compared to an old model's selection or a random selection. A lift plot is an alternative to a ROC curve when you want to compare two classifiers; it provides insights on the model and can help to determine a cutoff.

Let's start with the definition of lift:
The lift measures the change in concentration of a target value when the model is used to choose a subgroup of the test set. Note that the lift is always relative to the concentration of the target value in the whole test set: the lower that concentration, the higher the possible lift of the model. Therefore there are no general restrictions on the values the lift can take.

In an example:
Imagine that you want to improve a marketing campaign which addresses customers in order to sell new products or services. From the past campaign you build a predictive model trying to identify the customers who are likely to respond. In the past campaign 4% of all addressed customers responded. If you now choose a random subset of 10% of the addressed customers, you would expect a response rate of 4% in that subset as well.
With your new model you can identify likely responders, and you therefore choose the 10% of customers most likely to respond to the campaign. If 16% of this selected group respond (compared to the overall 4%), then your classifier has a lift of 16 / 4 = 4 at that point.
Now you can calculate the expected responses when addressing 20%, 30%, ... and you can plot the data in a so-called lift chart, which could look like this (the yellow line corresponds to the random classifier, the red one to the fictitious model):



These lift charts can be built for different classifiers in order to compare them. Also, from the shape of the curve a suitable maximum number of addressed customers can be chosen in order to optimize the costs of the campaigns.
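A minimal sketch of how the lift values for such a chart could be computed in R; the scored test set below is completely made up (the score column and the roughly 4% base response rate are fictitious), so it only illustrates the mechanics:
set.seed(42)                                   # fictitious scored test set
score    <- runif(1000)                        # model score, higher = more likely to respond
response <- rbinom(1000, 1, 0.08 * score)      # overall response rate around 4%
test <- data.frame(score, response)
base_rate <- mean(test$response)               # concentration of the target in the whole test set

test <- test[order(-test$score), ]             # sort customers by descending score
fractions <- seq(0.1, 1, by = 0.1)             # top 10%, 20%, ..., 100%
lift <- sapply(fractions, function(f) {
  top <- head(test, round(f * nrow(test)))     # subgroup chosen by the model
  mean(top$response) / base_rate               # lift = concentration in subgroup / base rate
})
plot(fractions, lift, type = "b", xlab = "Fraction of addressed customers",
     ylab = "Lift", main = "Lift Chart")
abline(h = 1, col = "yellow")                  # a random selection corresponds to a lift of 1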