In the biology laboratory, we generate a ton of data. But how do we understand what it is telling us? We can use statistics! There are many ways to perform the analysis, depending on the data we collect. We’re going to start here with some basic concepts in statistics, and over the coming months we will explore more concepts in depth!
The first thing that we can do when we have a numerical data set is to perform a quick summary of the data by calculating the range, mean, median, and mode. These are all basic functions that can be done in Google Sheets (which is free to use) or any similar spreadsheet program.
- The range is the difference between the highest value and the lowest value. So, mathematically, we can think of this as range = max value – min value.
- A data set’s mean is what we generally think of as the average. All the numbers in the data set are added into one number, and then divided by the total number of data points. If we wrote this out, we can think of it as mean = (data1 + data2 + … datax ) / x, where x equals the total number of points in our data set.
- The number that is directly in the middle of the data set is the median. This is determined by ordering data set from largest to smallest, and then selecting the number in the middle. For data sets with odd numbers of samples, the median is easy – you just take that middle number. However, with data sets with even numbers of samples, you have to average the two middle numbers and that becomes the median.
- The mode is the number that shows up most in a data set. To determine this value, you have to count the number of times each number shows up in your data. Now, the interesting thing about the mode is that a data set might one mode, multiple modes, or no modes! It all depends on the frequency that a data point shows up.
The data set we’re going to analyze is available here: https://catalog.data.gov/dataset/inland-fisheries-length-weight-ds195-d727c. This data set has length and weight data for several related species of freshwater fish. However, you can choose to download any data set you choose. Be sure to download the “.csv” version. The “.csv” means comma-separated value; this is a text file in which the data in a line is separated into columns by commas. This can be read by most spreadsheet programs so it is a great file type to use when compatibility is an issue! To use with Google Sheets, we uploaded the file to our Google Drive and then opened the file using Sheets. This process will differ based on the spreadsheet program you are using so be sure to try this out before having your students try!
So, the data set that I’m using for this analysis has information for several species of fish. We want to just focus on the bluegill population so that we can examine the variability between individuals of the same species. So the first thing that I do is “lock” the first row, which contains all the header information (in Sheets, select the first row and then go View > Freeze > one row). The next thing we want to do is to sort all the information alphabetically, so you can click on the little triangle next to “CNAME” (common name) and then sort A-Z. Delete all the entries that are NOT bluegills (make sure to save a copy of the original .csv file in case you need to start over). You should have 140 individual fish in your spreadsheet!
So, let’s start with analyzing the bluegill length data. The first calculation that I want to perform is the range. (For reference, I am going to run all of the bluegill length calculations here so that they can be used as examples). To find the range, we need to find the minimum and maximum lengths. Luckily, the spreadsheet can do that for us with the =MAX(first cell: last cell) and the =MIN(first cell: last cell ) commands. Once you have identified those values, subtract the length of the smallest fish from that of the largest fish and you have your range!
Using a spreadsheet to calculate the mean, median, and mode using a spreadsheet are incredibly easy, as the program can do it all for you!
- Use =AVERAGE(first cell: last cell) to calculate the mean. The algorithm adds all the data and then divides it by the total number of cells that are selected.
- Use =MEDIAN(first cell: last cell) to calculate the median. The algorithm orders the data and determines the correct number, regardless of whether you have an even or odd number of data points.
- Use =MODE(first cell: last cell) to calculate the mode. The algorithm determines how many times each number shows up and which one (if any) show up the most.
All these calculations have been performed for the length of the fish. Now, have your students perform the same calculation for the weight of the fish! What answers do you get? How can we use the range, mean, median, and mode to tell us something about the fish population?