Welcome to the ReMInDeRY!

(Research Methods: Interactive Demonstrations in R at York University)

Here we host a set of interactive demonstrations for concepts related to research methods. These demonstrations will also introduce you to the statistical program called R.

If you are just starting the tutorial, please begin by clicking ‘Start Over’ at the bottom of the index to your left.

If you have any questions or comments, please contact Dr. Raymond Mar (mar@yorku.ca).

UNIT 2: Measures of Central Tendency

GOALS:

- Learn how to use R to describe your data.

- Examine different measures of central tendency, used to summarize scores.

- Use R functions to play with some real-world data.

Now that you’ve been introduced to R and have a feel for how these tutorials work, let’s start to learn about some basic statistics that are important for understanding research. Often, we’ll have lots of data to deal with, not just 5 or 6 numbers, and so it’s useful to summarize all of that data in some way. The analyses we run to summarize our data are known as Summary Statistics, and these are part of a larger category known as Descriptive Statistics: statistics that describe our data. Descriptive statistics are useful for giving us a sense of what our data look like, which can be difficult to judge just by eyeballing things when we have lots of data.

One set of summary statistics is Measures of Central Tendency. This is just what it sounds like: an indication of the centre of our data, or where most of the data lie. We’ve already got a head start here because we learned how to calculate the mean (a.k.a. an average) in UNIT 1. Although the mean is perhaps the most well-known measure of central tendency, it’s not the only one and not always the most useful one. The other two that we’ll learn about are the median and the mode.

Ok, let’s begin. If you’ve forgotten how these tutorials work, feel free to return to UNIT 1 to refresh your memory.

The Mean

Imagine that you’re 19 years old and you’re with 3 of your friends, one who is 20 and two others who are both 23. With only 4 people, you can easily communicate this information to others by saying each person’s age (e.g., I’m 19, but Sinbad is 20, and Mustafa and Jelena are older, they’re both 23). Or, if you like, you might want to summarize this information by reducing it to just one number. You can do this by calculating the mean. In order to calculate the mean, you need to add all the numbers together and then divide by how many numbers there are.

In the box below, use R to calculate the mean for your group of friends. Please put your calculations into an object and then use that object to view the result. (If you recall, this is something you learned in Unit 1.) Remember that you’ll need either to use brackets, so that your numbers are summed first and then divided by the number of people, or to do things in two separate steps. The mathematical symbol for division is /.

You can always click the ‘Hint’ button to get some advice.
Or click the ‘Start Over’ button to reset everything in this box (or the ‘Start Over’ link in the index to reset the entire Unit).

Hint: You need to add up all the ages, then take this new number and divide it by the number of people (i.e., 4 people, don’t forget yourself!). You also need to create an object and then put this information into it using <-. Then, to view the result, just put the name of your object on its own line.

The Mean (2)

If you’ve done things correctly, then you should have gotten a mean of 21.25.

Below is an example of what your code might have looked like:

awesomeprof <- (19 + 20 + 23 + 23) / 4
awesomeprof

awesomeprof is the object that holds our calculation, (19 + 20 + 23 + 23) sums all of our ages, and / 4 divides the result (kept in the brackets) by 4. We write awesomeprof on its own on the next line so we can look inside our object and see our result: 21.25.

As we learned in Unit 1, we can also use the mean() function in R to do the same thing. With this approach, we put all our data into an object (using c() to concatenate all of our data first), then run the function on the object. Here’s what it would look like:

awesomeprof <- c(19, 20, 23, 23)
mean(awesomeprof)

You should get the same result with both sets of code. So now, instead of listing everyone’s ages, you can say, “my group of friends are a little over 21 years old, on average.”
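
The recipe from earlier (add everything up, then divide by how many numbers there are) can also be written with two other built-in functions, sum() and length(). The lines below are only a sketch of one way to check that the recipe and mean() agree; the object names ages and manual_mean are made up for illustration.

```r
# Ages of the four friends from the example above
ages <- c(19, 20, 23, 23)

# The recipe by hand: add everything up, then divide by how many numbers there are
manual_mean <- sum(ages) / length(ages)
manual_mean

# The built-in function should give the same answer (21.25)
mean(ages)
```

Using length() instead of typing 4 means the code keeps working if someone joins or leaves the group.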

The Median

The mean is just one measure of central tendency, with the other two being the median and the mode. Each tells us something different and has different strengths and weaknesses. Let’s learn about all 3 first and then explore their differences in more detail.

The median is simply the middle point of the data. To calculate it, we arrange all of our data in order, from smallest to biggest, find the data-point that’s right in the middle, and that’s the median. (If there is an even number of data-points, there is no single middle one, so the median is the mean of the two middle values.) Imagine our group of friends, but now your Auntie is joining the group and she’s 52 years of age. Answer the quiz below regarding the median age for your group. (Scroll up if you forget everyone’s ages, and don’t forget your Auntie.)

The Median.

The Median (2)

Great! Now let’s use R to confirm that your answer is correct. The function to calculate the median is simple, just like the one that we used to calculate the mean: median(OBJECT). Remember, instead of the word OBJECT you’d use the name of the object that you put your data into. So, in order to calculate the median for our data, we need to first put our data into an object, and then run the function median on that object. Give it a try below:

Hint: You first need to put all your data, which are the ages of the people in your group (19, 52, 23, 20, 23) into an object (Don’t forget to use the little c in front to concatenate all of these numbers; see UNIT 1 if you’ve forgotten this), and you can call this object whatever you want (e.g., Neato). Put the data into the object using the arrow <-. Then you need to use the function median() to find the median, putting the name of your object inside the brackets.

The Mode

If you got the same number using your R code as you calculated by hand (i.e., 23), then you know you’re getting the hang of things!
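
If you’d like to see the by-hand steps spelled out in R, here is a rough sketch. It uses the built-in sort() function to put the ages in order and then picks out the middle one. The object names are made up for illustration, and the middle-position shortcut only works when there is an odd number of data-points, as there is here.

```r
# Ages of the group, including Auntie
ages <- c(19, 52, 23, 20, 23)

# Step 1: arrange the data from smallest to biggest
sorted_ages <- sort(ages)
sorted_ages

# Step 2: find the position right in the middle (the 3rd of 5 values)
middle_position <- (length(sorted_ages) + 1) / 2

# Step 3: look up the value sitting at that position
sorted_ages[middle_position]

# The built-in function does all of this in one step
median(ages)
```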

Ok, the last measure of central tendency that we’re going to learn about is called the mode. This is simply the number that appears most often in the data. So, for the data we’ve been working with, the mode should be really easy for us to spot. (Scroll up if you’ve forgotten everyone’s ages again; my, you’re forgetful!)

The Mode.

The Mode (2)

That’s right! Two people are 23 years of age, and only 1 person is any other age. So, the mode tells us that “More people are 23 in the group than any other age”.

Ok, let’s use R to calculate the mode for our data. In order to do so, we’re going to learn a few new things about R. First, R comes with many pre-loaded functions like mean() and median(), but it doesn’t cover everything. In particular, it doesn’t come with a function to calculate the mode of a set of numbers. The good news is that many people love R and have created all kinds of functions to make R very useful. These functions are contained within packages, which need to be installed and then loaded from R’s “library”.

So, to help you out, I’m going to load a new package that will allow you to find the mode. This package is called “DescTools” (think, “Descriptive Tools”) and you can install and load it into the library using the following lines of code:

install.packages("DescTools")
library(DescTools)

NOTE: You do NOT have to run these lines in the box below. I have already run them for you in the background. This is just important code to learn for later, when you want to install and load packages for your own projects.

The function in the “DescTools” package that finds the mode is Mode(). Note that R is case-sensitive, which means that Mode() will do something different than mode() (note the lower-case m for the latter), so be very careful to use the right one. Whenever we need to learn more about a function, we simply use a line of code that puts ? in front of the function’s name. Give it a try below to learn how the Mode() function works.

Hint: Bring up the help file for the Mode() function by using this line of code: ?Mode
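
To see why the capital M matters, here is a small sketch of what the lower-case mode() function does. It is built into R and reports how R stores an object, which has nothing to do with the most common value. The object name ages is made up for illustration.

```r
# Ages of the group, including Auntie
ages <- c(19, 52, 23, 20, 23)

# Lower-case mode() is built into R: it reports the storage type of the object
mode(ages)   # "numeric"

# Upper-case Mode() comes from the DescTools package and is the one that
# looks for the most common value. It only works after library(DescTools).
```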

The Mode (3)

Ok, it seems pretty simple. Basically, we just put the object with our data in it inside the parentheses for Mode(). Let’s give it a try and find the mode for our data.

Hint: You first need to put all your data, which are the ages of the people in your group (19, 52, 23, 20, 23) into an object (Don’t forget to use the little c in front to concatenate all of these numbers; see UNIT 1 if you’ve forgotten this), and you can call this object whatever you want (e.g., Faizel). Put the data into the object using the arrow <-. Then you need to use the function Mode() to find the mode, putting the name of your object inside the brackets.

Working With Larger Datasets

If you got the same answer using your R code as you did calculating this by hand (i.e., 23), great work! Note that a dataset can have no mode (i.e., all values appear equally often), 1 mode, or even more than one mode (e.g., imagine that in your friend group 2 people were aged 23, and another 2 people were aged 77; both 23 and 77 would be the modes).
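
To make the “more than one mode” idea concrete, here is a sketch that counts how often each age appears using the built-in table() function. This is not how Mode() from DescTools works internally; it is just one base-R way to do the counting. The group below is hypothetical: two people aged 23, two aged 77, and you at 19.

```r
# Hypothetical group: two people aged 23, two aged 77, and one aged 19
ages <- c(19, 23, 23, 77, 77)

# table() counts how many times each value appears
counts <- table(ages)
counts

# Keep every age whose count equals the highest count
modes <- as.numeric(names(counts)[counts == max(counts)])
modes   # 23 and 77 are both modes
```

Note that if every age appeared equally often, this sketch would list all of them, so you would need to recognize that case yourself as “no mode”.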

You’re probably finding it a bit silly to use R to calculate things like the mode and median when you can just spot the answer by looking at the data. Well, you’re right. It is a bit silly. But we won’t always have only 4 or 5 numbers to work with. As researchers, we often have hundreds or thousands of numbers to deal with and then it’s going to be really useful to know how to use R to calculate these things.

There is also lots of interesting data that is free to download. The city of Toronto, for example, has made lots of data about the city available to view and download here.

If we learn how to do statistics and use R, we can download these data and analyze them to answer interesting questions. To get these data into R, we would download the data and then load them into an object within R, similar to how we put data into an object “manually,” in the examples above.
I’ve taken some of this City of Toronto data, removed some unnecessary lines to make things cleaner and easier to import, made sure all of the numbers were formatted correctly (as numbers, or numeric), and saved it as a .CSV file (comma-separated values), which is a nice clean format that R likes. I have then loaded it into R and placed it into an object (TOcatsdogs) using the read.csv() function. Here’s what the line of code looks like:

TOcatsdogs <- read.csv("2016_licenced_dogs_and_cats.csv")
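
If you want to try read.csv() on your own computer without the City of Toronto file, the sketch below makes a tiny .CSV file out of the friend group from earlier and then reads it back in. Everything here apart from read.csv() itself is for illustration only: the object names, the column names, and the temporary file are all made up.

```r
# A small table of data: the friend group from earlier in this Unit
friends <- data.frame(
  name = c("You", "Sinbad", "Mustafa", "Jelena", "Auntie"),
  age  = c(19, 20, 23, 23, 52)
)

# Save it as a .CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(friends, csv_path, row.names = FALSE)

# Load the file into a new object, just like we did for TOcatsdogs
friends_from_file <- read.csv(csv_path)

# str() gives a compact look at the columns and their types
str(friends_from_file)
```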

Looking at Data

Now, let’s take a quick look at what these data look like, using R. To get a glance at the first few lines of data, we can use the head() function. Just like with all the other functions we’ve been using so far, we put the object name within the brackets to apply the function to the object. Give it a try!

Hint: R is case-sensitive, which means it matters whether you use lower-case or upper-case letters. Use the head() function, placing the name of the data-object TOcatsdogs within the parentheses, to take a look at these data.

Looking at Data (2)

Great! You should see a small table showing the first 6 lines of data. You can see that there are 4 columns, corresponding to the 4 different variables in this dataset. (Scroll up if you want to look at them again.)

Variable Name	Description
AREA	First 3 characters of the postal code in Toronto
CAT	Number of cats registered with the city in this postal-code area
DOG	Number of dogs registered with the city in this postal-code area
TOTAL	Total number of cats and dogs registered in this postal-code area

But let’s imagine that we wanted to take a look at the first 10 lines of data, instead of just the first 6. How would we go about doing this? Let’s use what we learned from the previous exercises to figure out how to do this. How do we figure out what a function does, and how it works? Can we use this information to figure out how the head() function operates, and how to change the parameters to get the information we want (i.e., first 10 rows, instead of 6)?

?head

head(TOcatsdogs, n = 6)

head(TOcatsdogs, n = 10)

Problem-solving and trouble-shooting

You have just learned a very valuable lesson in R! Whenever you are struggling to figure out how to do something, the first thing to do is look at the help file (by placing ? in front of the function’s name, e.g., ?head). Functions often have many different ARGUMENTS, which you can think of as options or parameters. For head(), one of the arguments is n, which tells the function how many rows to display. The default for n is 6, so we can change this to display more (or fewer) rows.
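
If you’d like to practise with the n argument outside of this tutorial, here is a small sketch using mtcars, an example dataset that comes with R. The nrow() function, which counts rows, is used here only to confirm how many rows head() returned.

```r
# With no n argument, head() falls back on its default of 6 rows
nrow(head(mtcars))

# Ask for more rows by setting n
nrow(head(mtcars, n = 10))

# Or ask for fewer
nrow(head(mtcars, n = 3))
```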

Another important lesson about R is that almost no-one writes code seamlessly without reference or help. Almost everyone relies on past or published code that does something close to what they want, and then edits this code to suit their needs. Searching for the right R code for what we want to do is immensely useful! Just be certain to include the phrase “R statistics” when searching, to help get the relevant results. StackOverflow is a very useful site for coding advice, so hits from there are often quite helpful.

Using our Measures of Central Tendency for Real Data

Ok, let’s use our measures of central tendency to summarize how many cats and dogs are in each postal-code area of Toronto. To identify a column in a dataset, we use the $ symbol to join the name of the data object and the particular variable we’re interested in. So, to refer to the CAT variable in the TOcatsdogs data object, we’d write TOcatsdogs$CAT. Let’s try to use our R code to find out the average number of cats registered in each postal-code area of Toronto.

Hint: To calculate the average, we use the mean() function and place the object of interest within the parentheses. In this case, the object we’re interested in is the CAT column of the TOcatsdogs data object, which we refer to by using TOcatsdogs$CAT. So, put this into the parentheses for the mean() function.
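
Here is a sketch of the $ idea on a much smaller table, so you can see the pattern without giving away the answer for the cats. It uses the friend group from earlier in this Unit; the object name friends and the column names name and age are made up for illustration.

```r
# A tiny dataset with two columns: name and age
friends <- data.frame(
  name = c("You", "Sinbad", "Mustafa", "Jelena", "Auntie"),
  age  = c(19, 20, 23, 23, 52)
)

# friends$age pulls out just the age column
friends$age

# Functions then work on that column like on any other object
mean(friends$age)
median(friends$age)
```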

Using our Measures of Central Tendency for Real Data (2)

If you’ve done things correctly, you should have found that on average, a postal-code area of Toronto has 252.051 cats registered within it. Now, why don’t you use what you already know about R to find out the answers to the following Quiz questions?

Hint: To calculate the average, we use the mean() function and place the object of interest within the parentheses. To find the median, we use the median() function, and for the mode we use Mode(). To refer just to the CAT column, we use TOcatsdogs$CAT. And so, by extension, to refer to the DOG column of our data we use TOcatsdogs$DOG, and to refer to the total we use TOcatsdogs$TOTAL. Data objects are put within the parentheses for a function.

Cats and Dogs in Toronto

Summary

Great job! You’ve learned so much in this tutorial and should be very proud of everything that you’ve done in R so far. For example, you now know:

How to calculate the Mean (mean()), which is what the data look like on average, across the entire group.
How to calculate the Median (median()), which is the value at the very centre of the distribution. You can also think of this as what the middle person looks like when everyone in the distribution is lined up in order.
How to calculate the Mode (Mode()), or the most common response in a dataset.
And you’ve learned how to do all of this using R!
On top of all that, you’ve learned:
How to load data into R (x <- read.csv("example.csv")).
How to look at the first few rows of a dataset (head()).
How to install packages that have different functions in them (install.packages()).
How to load packages (library()).
How to look up the help file for a particular function or package (?head).

Way to go!

You’ve completed Unit 2 of the ReMInDeRY!

If you are not sure you understood everything in this Unit, you can go through it again by clicking the ‘Start Over’ link in the Index, at the top-left part of your screen.

IMPORTANT NOTE FOR YORK STUDENTS:

PSYC 2030 (Research Methods) students: In order to get full marks for this assignment you must complete the Quiz for this tutorial, mentioned in the eClass assignment.

PSYC 2020 (Statistics) students: Click this link and enter all requested information carefully and accurately in order to receive your points for completing this tutorial.

Now you’re ready to tackle Unit 3 of the ReMInDeRY.