Fascinated by AI and programming languages that crunch massive amounts of data, I began to solidify my dream of becoming a data scientist. To this end, I began learning R! I am currently taking HarvardX's data science professional certificate program, whose first course is "R Basics."
I was initially torn between HarvardX's data science course and MITx's data science course, but I chose HarvardX for a few reasons:
HarvardX seemed to cover more content and had a focus on R, which I want to learn.
It is more expensive, but I think it would be worth the price.
Past users of HarvardX had some positive reviews.
After all, it's Harvard.
So, enough digressing; what is R?
R is a programming language and software environment for statistical computing, well suited to crunching large data sets. One interesting fact is that you usually need two downloads to get started. R itself comes from the CRAN website, https://cran.r-project.org/, not from the RStudio website. You then typically write your code in RStudio, a separate editor for R that can be downloaded from https://rstudio.com/, which is frankly quite strange.
How R treats data is similar to, but also different from, languages such as C++ or Mathematica. R shares many functional programming characteristics with Mathematica, but its data structures differ. Although R does have matrices and arrays, they are not the usual way to hold a data set, because a table that mixes text and numbers does not fit naturally into a single array. Rather, the basic data structure in R is the vector, an ordered collection of values of one type (all numbers or all strings, for example), and a table of data is stored as a data frame, in which each column is a vector.
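To make the idea of vectors and data frames a bit more concrete, here is a minimal sketch. The names ratings, titles, films, and m, and all the values in them, are made up for illustration and do not come from the course.

```r
# A vector holds values of a single type.
ratings <- c(4.5, 3, 5)                      # numeric vector (invented values)
titles  <- c("Movie A", "Movie B", "Movie C") # character vector (invented values)

class(ratings)  # "numeric"
class(titles)   # "character"

# A data frame groups equal-length vectors as columns,
# so text and numbers can sit side by side in one table.
films <- data.frame(title = titles, rating = ratings)
nrow(films)     # 3

# R does have arrays as well; this one is a 2-by-3 array of integers.
m <- array(1:6, dim = c(2, 3))
dim(m)          # 2 3
```

The data frame is the structure that the rest of this post relies on, since data sets like movielens are stored that way.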
Also, I feel like combining different bits of data is much simpler in R, as will be explained soon.
But just like Mathematica or MATLAB, we often have to install packages, because base R on its own is fairly limited. The following code is an example that installs the package dslabs.
install.packages("dslabs")
Whenever we want to use this package, we have to load it, which can be done through
library(dslabs)
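Since installing only has to happen once but loading has to happen in every new session, the two steps can be combined as in the sketch below. This is just one way to do it, and it makes two assumptions that are not from the course: that an internet connection is available, and that the CRAN mirror https://cloud.r-project.org is acceptable.

```r
# Install dslabs only if it is not already available on this machine.
if (!requireNamespace("dslabs", quietly = TRUE)) {
  # The repos value is an assumption; any CRAN mirror should work.
  install.packages("dslabs", repos = "https://cloud.r-project.org")
}

# Load the package so its functions and data sets can be used.
library(dslabs)
```

The check with requireNamespace() simply avoids downloading the package again every time the script runs.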
Also, in R, we can assign numbers, strings, or other values to a variable. This is done through
a <- 3
where 3 is assigned to a. If we wanted to compute a+3 when a=3, we would do
a + 3
which would give us [1] 6. But sometimes we might lose track of which variables we have defined, so we can look at the upper right-hand corner of RStudio (the Environment pane) to see them.
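Here is a small sketch of assignment that can be pasted into the console. The variable b and its value are my own additions for illustration, and ls() is shown as a console alternative to looking at the Environment pane.

```r
a <- 3         # assign the number 3 to the variable a
a + 3          # prints [1] 6

b <- "hello"   # hypothetical example: variables can hold strings too
class(b)       # "character"

ls()           # lists the names of the objects defined so far
```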
But most data are not a single number or a single string. Therefore, there are some special functions and operators for loading and analyzing data in R. Just as we use the function library() to load an installed package, we must do the same for a data set that ships with a package, by running the code below.
data(movielens)
Then, we must know what kinds of information are in the data set movielens. This can be done by using str(), which shows the structure of the data, but the structure alone can be hard to read. Hence, to see the data as a table that we can easily interpret, we can use View(movielens) (with a capital V) in RStudio.
To see only the variable names (movieId, title, year, genres, userId, rating, and timestamp), we can use names() as well.
str(movielens)
View(movielens)
names(movielens)
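To try these functions without installing anything, the sketch below uses a tiny stand-in data frame. The name toy_movies and every row in it are invented, and it has only three of the columns that movielens has. head() is my own addition as a console-friendly alternative to View().

```r
# Toy stand-in for movielens: invented rows, only three columns.
toy_movies <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  year   = c(1995, 2001, 2010),
  rating = c(4.0, 3.5, 5.0)
)

str(toy_movies)       # one line per column, showing its type and first values
names(toy_movies)     # "title" "year" "rating"
head(toy_movies, 2)   # prints the first two rows in the console
```

The same three calls should work the same way on movielens once dslabs is loaded.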
But to analyze data, we must be able to access specific portions of the data. Hence, we use the operator $. Written as movielens$title, the operator $ returns all the values of one variable (column) stored in a data set.
So with movielens$title, we would get all the titles of movies stored in movielens in their default order. If we wanted a specific portion of the data, we would use one of the three codes below, depending on what portion we want.
movielens$title[4]
movielens$title[4:10]
movielens$title[c(4,10)]
The first case is when we only want the title of the fourth movie in the data set, where we would only insert [4]. However, if we wanted all titles from the fourth to the tenth, we would use [4:10], whereas if we wanted only the fourth and the tenth, we would use [c(4,10)]. You may now be wondering about c(), and whether it is a function. Yes, c() is a function that combines multiple values into a vector. It is not necessary to have only two elements in c(); you can also insert more, as below. The values do not all have to be written as the same type, but since a vector holds a single type, R converts them to a common one (mixing numbers and strings gives all strings).
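The three kinds of indexing can be tried on a small made-up example. The data frame toy and its titles ("Movie 1" to "Movie 10") are invented for illustration and stand in for movielens.

```r
# Ten invented titles: "Movie 1", "Movie 2", ..., "Movie 10".
toy <- data.frame(title = paste("Movie", 1:10), stringsAsFactors = FALSE)

toy$title[4]           # "Movie 4"             (only the fourth)
toy$title[4:10]        # the fourth through the tenth, seven titles
toy$title[c(4, 10)]    # "Movie 4" "Movie 10"  (only the fourth and the tenth)

length(toy$title[4:10])  # 7
```

Note that 4:10 is itself a vector of the numbers 4 through 10, so all three cases are really the same idea: a vector of positions inside the square brackets.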
c("Shanghai", "4", "New York", "Mathematica", "Lagos", "Seoul")
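As a quick check of how c() behaves, the sketch below stores the vector from above and then builds a second one with mixed types. The names cities and mixed are my own, and the mixed example is an assumption added for illustration.

```r
cities <- c("Shanghai", "4", "New York", "Mathematica", "Lagos", "Seoul")
length(cities)   # 6
class(cities)    # "character"

# Mixing a string, a number, and a logical value:
# R converts everything to character so the vector has one type.
mixed <- c("Shanghai", 4, TRUE)
class(mixed)     # "character"
mixed            # "Shanghai" "4" "TRUE"
```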
But you might feel that this is not enough, since it is also important to combine and rearrange data from different vectors. In the next post, I will cover how we can work with multiple vectors.