R - Simple Data Preparation

 
R offers a huge variety of functions for manipulating, reorganizing and splitting data. Here are some of the commands I find most useful when preparing data.

I will start with an example:
Imagine an IoT scenario in which data from the temperature sensors s1, s2 and s3 are collected in an edge unit before a reduced set of data is sent on to the central unit. Each sensor reports a reading at every time step t, measured in (milli)seconds.

s1:
t = 0: temp = 24.3
t = 1: temp = 24.7
t = 2: temp = 25.2
t = 3: temp = 25.0

s2:
t = 0: temp = 20.2
t = 1: temp = 20.1
t = 2: temp = 99.9
t = 3: temp = 20.1

s3:
t = 0: temp = 28.1
t = 1: temp = 28.0
t = 2: temp =
t = 3: temp = 27.7

To get the data into R, we create one vector per sensor holding its values. The value at t = 2 for sensor s3 is missing, so we enter NA there (without quotes) to tell R that the value is not available:
> s1 <- c(24.3, 24.7, 25.2, 25.0)
> s2 <- c(20.2, 20.1, 99.9, 20.1)
> s3 <- c(28.1, 28.0, NA, 27.7)
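
Why the explicit NA matters can be seen in a small sketch: most summary functions propagate NA unless told otherwise. The calls below use only base R; the names mean.all and mean.known are chosen here purely for illustration.

```r
s3 <- c(28.1, 28.0, NA, 27.7)

anyNA(s3)        # TRUE: the vector contains at least one missing value
sum(is.na(s3))   # 1: exactly one value is missing

# By default a single NA makes the whole result NA ...
mean.all <- mean(s3)
# ... unless missing values are dropped explicitly.
mean.known <- mean(s3, na.rm = TRUE)

print(mean.all)     # NA
print(mean.known)   # mean of 28.1, 28.0 and 27.7
```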

We can furthermore create a dataframe holding these measurements by
> sensor.data <- data.frame(s1, s2, s3)


What we get is the following output for sensor.data:
    s1   s2   s3
1 24.3 20.2 28.1
2 24.7 20.1 28.0
3 25.2 99.9   NA
4 25.0 20.1 27.7
 
We do not want the row identifiers that R assigns by default, because we have our own identifier: the values of t. We tell R to use them through the row.names argument of data.frame():
> t <- c(0, 1, 2, 3)
> sensor.data <- data.frame(t, s1, s2, s3, row.names=t)

  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2 99.9   NA
3 3 25.0 20.1 27.7
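
Because the row names now start at 0, selecting a row by name and selecting it by position are no longer the same thing. The following sketch rebuilds the data frame from above and shows the difference; the names by.name and by.position are only illustrative.

```r
# Rebuild the example data frame so that this block runs on its own.
t  <- c(0, 1, 2, 3)
s1 <- c(24.3, 24.7, 25.2, 25.0)
s2 <- c(20.2, 20.1, 99.9, 20.1)
s3 <- c(28.1, 28.0, NA, 27.7)
sensor.data <- data.frame(t, s1, s2, s3, row.names = t)

# A character index selects by row name: the row labelled "2", i.e. t = 2.
by.name <- sensor.data["2", "s1"]

# A numeric index selects by position: the second row, i.e. t = 1.
by.position <- sensor.data[2, "s1"]

print(by.name)      # 25.2
print(by.position)  # 24.7
```

Numeric indices such as [3] always count positions, so the third element belongs to t = 2.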

First of all, we want to look at the data in a simple plot:
> plot(t, sensor.data$s1, type="b", pch=1,
     col="red", xlim=c(0, 5), ylim=c(18, 100), 
     main="Sensor Data", lwd=2, xlab="time", ylab="temperature")
> lines(t, sensor.data$s2, type="b", pch=5, lwd=2, col="green")
> lines(t, sensor.data$s3, type="b", pch=7, lwd=2, col="blue")

The value 99.9 looks like a faulty measurement. (We simply assume this here in order to show how to manipulate the data; in practice, the cause of such an outlier has to be analysed first.) We replace it with NA. The comparison covers the whole data frame, which is safe here because no value of t equals 99.9:

> sensor.data[sensor.data == 99.9] <- NA
 
  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2   NA   NA
3 3 25.0 20.1 27.7
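
Instead of hard-coding 99.9, a suspicious reading could also be flagged automatically. The following is only a rough sketch of one possible check, not the analysis the data really deserves: it marks values that lie more than k scaled median absolute deviations away from the median. The function name flag.outliers and the threshold k = 3 are assumptions made for this example.

```r
# Return the positions of values that deviate strongly from the median.
# k is an arbitrary illustrative threshold, not a recommended setting.
flag.outliers <- function(x, k = 3) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)   # scaled median absolute deviation
  # which() drops NA results, so missing values are never flagged.
  which(abs(x - centre) / spread > k)
}

s1 <- c(24.3, 24.7, 25.2, 25.0)
s2 <- c(20.2, 20.1, 99.9, 20.1)
s3 <- c(28.1, 28.0, NA, 27.7)

print(flag.outliers(s1))   # no position flagged
print(flag.outliers(s2))   # position 3, the 99.9 reading
print(flag.outliers(s3))   # no position flagged
```

With only four readings per sensor such a rule is fragile, and it breaks down when all values are (nearly) identical because the spread becomes zero.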
 
If we plot the data again, the curves have gaps, because plot() and lines() leave out NA values. NA values in a vector v can be identified with is.na(v).
We want to replace the missing values of sensors s2 and s3 with the mean of the neighbouring values (this, too, is an assumption that should be validated carefully):

> sensor.data$s2[3] <- (sensor.data$s2[2] + sensor.data$s2[4])/2
> sensor.data$s3[3] <- (sensor.data$s3[2] + sensor.data$s3[4])/2
 
  t   s1   s2    s3
0 0 24.3 20.2 28.10
1 1 24.7 20.1 28.00
2 2 25.2 20.1 27.85
3 3 25.0 20.1 27.70
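
For equally spaced time steps, the mean of the two neighbours is the same as linear interpolation, so the gaps could also be filled without writing out the indices by hand. A minimal sketch using approx() from base R follows; the helper name fill.gaps is made up for this example.

```r
# Fill interior NA values by linear interpolation over time.
# approx() ignores the NA points and interpolates between the known ones;
# gaps at the very start or end of the series remain NA.
fill.gaps <- function(time, values) {
  approx(x = time, y = values, xout = time)$y
}

t  <- c(0, 1, 2, 3)
s2 <- c(20.2, 20.1, NA, 20.1)
s3 <- c(28.1, 28.0, NA, 27.7)

s2.filled <- fill.gaps(t, s2)
s3.filled <- fill.gaps(t, s3)

print(s2.filled)
print(s3.filled)
```

The same caveat applies as for the manual version: interpolating is an assumption about the data and should be validated.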
 
Now we extend our data frame so that it also holds the sum and the mean of the sensor values at every point in time. We do that with transform():

> sensor.data.xt <- transform(sensor.data, sumx = s1 + s2 + s3, meanx = (s1 + s2 + s3)/3)

  t   s1   s2    s3  sumx    meanx
0 0 24.3 20.2 28.10 72.60 24.20000
1 1 24.7 20.1 28.00 72.80 24.26667
2 2 25.2 20.1 27.85 73.15 24.38333
3 3 25.0 20.1 27.70 72.80 24.26667

Next we classify a data set as critical if the sum is at least 73.0, as abnormal if it is at least 72.7 but below 73.0, and as normal otherwise. So we create another variable in our data frame by

> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 73] <- "critical"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 72.7 & sensor.data.xt$sumx < 73] <- "abnormal"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx < 72.7] <- "normal"

  t   s1   s2    s3  sumx    meanx riskcatg
0 0 24.3 20.2 28.10 72.60 24.20000   normal
1 1 24.7 20.1 28.00 72.80 24.26667 abnormal
2 2 25.2 20.1 27.85 73.15 24.38333 critical
3 3 25.0 20.1 27.70 72.80 24.26667 abnormal
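
As a side note, with more than a handful of sensors it becomes tedious to spell out every column when computing the row-wise sum and mean. A possible alternative, sketched here with base R's rowSums() and rowMeans(), works on a set of columns at once; the helper vector sensor.cols is an assumption of this example.

```r
# Cleaned example data, rebuilt so that the block runs on its own.
sensor.data <- data.frame(
  t  = c(0, 1, 2, 3),
  s1 = c(24.3, 24.7, 25.2, 25.0),
  s2 = c(20.2, 20.1, 20.1, 20.1),
  s3 = c(28.1, 28.0, 27.85, 27.7)
)

# Names of the columns that hold sensor readings.
sensor.cols <- c("s1", "s2", "s3")

sensor.data$sumx  <- rowSums(sensor.data[, sensor.cols])
sensor.data$meanx <- rowMeans(sensor.data[, sensor.cols])

print(sensor.data)
```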


Instead of keeping plain strings, we turn riskcatg into a categorical variable (a factor) with

> sensor.data.xt$riskcatg <- factor(sensor.data.xt$riskcatg)
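
The classification and the conversion to a factor can also be done in one step with cut(), which bins a numeric vector at given break points. The sketch below assumes the same thresholds as above (72.7 and 73) and spells the labels normal, abnormal and critical; with right = FALSE each interval includes its lower bound, which matches the >= comparisons.

```r
# Sums per time step, taken from the example above.
sumx <- c(72.6, 72.8, 73.15, 72.8)

# Intervals: [-Inf, 72.7), [72.7, 73), [73, Inf)
riskcatg <- cut(sumx,
                breaks = c(-Inf, 72.7, 73, Inf),
                labels = c("normal", "abnormal", "critical"),
                right = FALSE,
                ordered_result = TRUE)

print(riskcatg)
```

The result is an ordered factor, so comparisons such as riskcatg >= "abnormal" are possible.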



TIP: The statement
variable[condition] <- expression
is very powerful and useful for data manipulation.