[Research] R Basics 2- Basic Wrangling (HarvardX)

In the previous post, the primary focus was setting up the foundations for dealing with large amounts of data. The natural next step is to change or display the data, i.e. to wrangle it.

Variables and Vectors

First, using c(), which was covered previously, we can assign numbers or strings to a variable. Note that a variable name may even match the name of a built-in function; as long as it does not collide with a reserved word such as TRUE, NULL, or if, and does not start with a digit, R accepts it. That said, reusing built-in names makes code harder to read. Where most other languages assign with the equals sign, the R convention is the arrow <-, although = also works at the top level.

```city<- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
temp <- c(35, 88, 42, 84, 81, 30)```

Then, we can recall the values stored in the vectors temp and city. length(temp) gives the number of elements in the vector temp, and recall that 1:length(temp) produces all integers from 1 to length(temp). Hence, the code

`temp[1:length(temp)]`

would return every element of temp, the same result as typing temp on its own.
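A minimal sketch of this kind of indexing, using the same two vectors. The variable names n_temp, all_temp, first_three, and paris_temp are hypothetical names chosen for illustration.

```r
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
temp <- c(35, 88, 42, 84, 81, 30)

n_temp <- length(temp)          # number of elements in temp
all_temp <- temp[1:n_temp]      # every element, identical to temp itself
first_three <- temp[1:3]        # only the first three elements

# A logical condition on one vector can pick elements from the other
paris_temp <- temp[city == "Paris"]

print(all_temp)
print(first_three)
print(paris_temp)
```

When a vector might be empty, seq_along(temp) is a safer way to generate its indices than 1:length(temp), because 1:0 counts down to 0 instead of producing nothing.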

But some datasets might have an enormous number of columns and rows. In this case, we can use head() to see only the first six rows of a data set (or the first six elements of a vector).

`head(temp)`

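If we want a different number of elements, we can pass the count as a second argument, head(data, number). Note that top_n(data, number) from dplyr does something else: it works on data frames rather than plain vectors, and it keeps the rows with the largest values of a column, not the first rows.

`head(temp, 5)`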
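A short sketch of previewing data with base R only. The weather data frame is a hypothetical example assembled from the two vectors defined earlier; it is not part of the original course material.

```r
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
temp <- c(35, 88, 42, 84, 81, 30)

first_default <- head(temp)     # first six elements by default
first_three <- head(temp, 3)    # first three elements
last_two <- tail(temp, 2)       # tail() is the mirror image of head()

# head() works the same way on the rows of a data frame
weather <- data.frame(city = city, temp = temp)
top_rows <- head(weather, 2)

print(first_three)
print(last_two)
print(top_rows)
```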

As an aside to 1:length(temp), we might want only the 7th, 14th, 21st, ... rows, i.e. the multiples of 7. To obtain them, recall that 1:length(temp) is, in fact, just shorthand for seq(1, length(temp)). Then, we can tweak it as below.

```seq(7, 50, 7)
seq(1, 10, length.out = 100)```

seq(7, 50, 7) produces every multiple of 7 up to 50, that is 7, 14, ..., 49. With length.out, we tell R how many elements the sequence should contain instead of giving the step size. For the second line above, R works out the step needed to place 100 evenly spaced elements from 1 to 10, which is 9/99 = 1/11.
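The two calls can be checked directly. This is a small sketch; the names sevens, fine_grid, and step are hypothetical.

```r
sevens <- seq(7, 50, 7)                      # start at 7, stop by 50, step 7
fine_grid <- seq(1, 10, length.out = 100)    # 100 evenly spaced values from 1 to 10

# The step R chose is the gap between neighbouring elements
step <- diff(fine_grid)[1]

print(sevens)
print(length(fine_grid))
print(step)
```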

Data Types

We know that class() can be used to determine the data type of an unknown variable. One very interesting property is that R stores plain numbers as double-precision floating-point values by default. If we compute class(1), strangely enough, we get "numeric," meaning the 1 is not stored as an integer but as a real number, as if it were written with trailing zeros after the decimal point. This is because R cannot know that a typed 1 will never need a decimal part later. On the other hand, if we were to enter the following, we get "integer."

```a <- seq(1,10)
class(a)```

This is because seq(1,10), like 1:10, returns integers when its endpoints are whole numbers and no step is given. We can also force a single typed value to be stored as an integer by appending "L":

`class(3L)`

This would give us the answer "integer," since the L suffix marks a number as an integer. A double takes 8 bytes while an integer takes 4, so defining whole numbers with L can be a useful skill when computing with large amounts of data.
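The sketch below checks these classes and compares the memory used by integer and double vectors of the same length. The vector length of 100000 and all variable names are arbitrary choices for illustration.

```r
cls_plain <- class(1)           # plain numbers are stored as doubles
cls_suffix <- class(1L)         # the L suffix requests an integer
cls_seq <- class(seq(1, 10))    # whole-number endpoints, no step given

# Same length, different storage type
int_vec <- integer(100000)
dbl_vec <- numeric(100000)
int_bytes <- as.numeric(object.size(int_vec))
dbl_bytes <- as.numeric(object.size(dbl_vec))

print(c(cls_plain, cls_suffix, cls_seq))
print(c(integer = int_bytes, double = dbl_bytes))
```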

A key difference from C and C++ is that R does not require declarations of a variable's data type such as int, float, or double, nor format specifiers such as %d or %f. Instead, R converts values between types as needed, a behavior called coercion. When a value is missing or cannot be converted, R marks it with NA. Consider the code below.

```x <- c(1L, 3L, 5L,"a")
x <- as.numeric(x)```

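Because a vector can hold only one type, c() first coerces 1L, 3L, and 5L together with "a" into character strings. as.numeric() then converts "1", "3", and "5" into the numbers 1, 3, and 5. However, "a" cannot be represented numerically, and hence R outputs NA for it, along with a warning.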
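Each stage of that coercion can be inspected separately. In this sketch the names x_raw and x_num are hypothetical, and suppressWarnings() is used only to keep the output free of the expected coercion warning.

```r
x_raw <- c(1L, 3L, 5L, "a")
raw_class <- class(x_raw)       # mixing in "a" turns everything into character

# Converting back to numbers; "a" has no numeric meaning and becomes NA
x_num <- suppressWarnings(as.numeric(x_raw))
missing_flags <- is.na(x_num)

# Most summary functions need na.rm = TRUE to skip the NA
avg <- mean(x_num, na.rm = TRUE)

print(raw_class)
print(x_num)
print(avg)
```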

Ordering Data

Often when we deal with large amounts of data, we have to order them into a particular pattern, mostly descending or ascending order. While C++ beginners often end up hand-writing a painstaking bubble sort (even though the standard library does provide std::sort), R has a glut of ordering functions; in fact, it has three!

Let's use this vector n as an example.

`n<-c(38,13,1,100,24)`

Then, we can use any one of the three below.

```sort(n)
order(n)
rank(n)```

Putting the three results side by side in a data frame makes the difference between the ordering functions clear.

`data.frame(n, sort=sort(n), order=order(n), rank=rank(n))`

sort(n) would produce the vector in ascending order, and order(n) would yield the indices of the elements in ascending order. The first element of order(n) is 3, which means that the third element of n is the smallest, and the second element is 2, meaning that the second element of n is the second smallest. Finally, rank(n) would give us the rank of each element in its original position: 38 is the fourth smallest value, so the first element of rank(n) is 4.
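The three results for n, and how they relate to each other, can be verified with a short sketch. The names sorted, idx, and rnk are hypothetical.

```r
n <- c(38, 13, 1, 100, 24)

sorted <- sort(n)     # values in ascending order
idx <- order(n)       # positions in n of the smallest, second smallest, ...
rnk <- rank(n)        # rank of each element, in the original positions

# order() tells us which elements to pick to build the sorted vector;
# rank() tells us where each original element lands in the sorted vector
rebuilt_sorted <- n[idx]
rebuilt_original <- sorted[rnk]

print(sorted)
print(idx)
print(rnk)
```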

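Using this knowledge, we can reproduce what sort(n) would give us by indexing n with order(n). Indexing with rank(n) does not work the same way, because a rank says where an element lands in the sorted vector rather than which element to pick. The final line would not produce the entire vector, but would still allow us to find the maximum.

```n[order(n)]                     # same as sort(n)
n[rank(n)]                      # not sorted: 100 13 38 24 1
n[order(n, decreasing = TRUE)]  # descending order
n[which.max(n)]                 # largest element, 100```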
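Where order() is most useful is in sorting one vector by the values of another. The sketch below reuses the city and temp vectors from the start of this post; the result names are hypothetical.

```r
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
temp <- c(35, 88, 42, 84, 81, 30)

# Indices of temp from coldest to warmest, applied to city
coldest_first <- city[order(temp)]
warmest_first <- city[order(temp, decreasing = TRUE)]

# which.max() and which.min() return a single index
hottest <- city[which.max(temp)]
coldest <- city[which.min(temp)]

print(coldest_first)
print(hottest)
print(coldest)
```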

This is only basic wrangling, and future posts will cover more complicated functions and concepts of data wrangling. But even from this set of functions, we can easily see that R is quite effective at, and in fact tailored to, dealing with vector-like data.
