What are frequencies?
Frequencies are counts of things in different categories. For example, 417 people were asked to choose their favorite pet from a list of animals. Here's the data:
Bird | 71 |
Cat | 90 |
Dog | 144 |
Fish | 82 |
Llama | 30 |
There are five categories. The numbers are the counts of people who chose the animal. Dogs are top, naturally.
The same data is often presented in a bar chart, like this:
Sometimes the categories are ranges of numbers, like this:
It's still categories and counts. The pattern for the code is the same.
In this lesson, you're going to learn how to take raw data like this:
dog
dog
cat
llama
bird
cat
dog
… and compute the frequencies.
A frequency array
We'll keep the frequency data in an array. Declare it like this:
Dim petFreqs(5) as Integer
We know there are five categories, so we can declare the array to be exactly that length.
Here's what the pet data will look like:
petFreqs(1)
is the frequency for birds. petFreqs(2)
is the frequency for cats. And so on.
A frequency program
Let's write a program to read a file of favorite pet data, and compute frequencies. You can download the worksheet and some data files at the bottom of this page.
The worksheet looks like this to start:
Click Run, and the program will read in the raw data…
dog
dog
cat
llama
bird
cat
dog
Here's some output:
The overall structure of the program will be:
Extract
Process
Output
Main program
There are two arrays. rawData
is for the raw data read from the file. petFreqs
is for the frequencies, that is, the counts for each pet type. petFreqs(1)
will be the number of birds, petFreqs(2)
the number of cats, and so on.
There are calls to three subs. This one…
readPetData rawData, numDataElements
…reads the raw data from the data file, and returns it in rawData
. numDataElements
has the number of data elements read. Which data file? The one the user names:
The second sub call:
computeFreqs rawData, numDataElements, petFreqs
It takes the rawData
and runs through it, counting the number of each pet type. It returns the result in petFreqs
.
The last sub call:
outputFreqs petFreqs, numDataElements
It shows the results to the user.
Notice how simple and chunky the main program is:
That's what you want.
Get pet data from a file
Here's the sub that gets the data from the file:
Nothing new here. Basically:
Get the file name from the worksheet.
Open the file.
While there is more data to read:
Read one raw data element.
Append it to the array.
Excel needs to know which folder the file is in. It looks in the same folder where the workbook is:
Open ThisWorkbook.Path & "\" & fileName For Input As #1
Remember to close the file. If you forget, and try to run the program again, you'll set a "File already open" error. Open the immediate window (CTRL+G), type…
close #1
… and press Enter.
Compute frequencies
Here's the sub to compute the frequencies:
We want to count things, using the elements of the petFreqs
array to store the counts. The counts have to start at zero, so we need to make sure all of petFreqs
the elements start at zero. So:
The counting loop starts in line 52.
The CPU sets rawDataIndex
to 1, then runs the body of the loop. It makes rawDataIndex
2, then runs the body of the loop. It makes rawDataIndex
3, then runs the body of the loop. Etc. It ends the loop at numDataElements
. That was computed earlier, by readPetData
.
On line 53…
pet = rawData(rawDataIndex)
… the CPU grabs an element from rawData
, whichever one rawDataIndex
says. It puts the value into the variable pet
.
On line 54…
If pet = "bird" Then
… the CPU tests whether pet
is "bird". If it is, add one to the count of birds. The count of birds is in petFreqs(1)
, so this will do the job:
countIndex = 1
...
petFreqs(countIndex) = petFreqs(countIndex) + 1
If the pet was a dog, we need to update a different array element:
countIndex = 3
...
petFreqs(countIndex) = petFreqs(countIndex) + 1
The If
statements run through the pet types, and set countIndex
to one thing or another, depending on the pet. Then countIndex
is used to update the right array element.
petFreqs(1)
, and not petFreqs(4)
, or whatevs?
Cells(7, 1) = "Bird"
Cells(7, 2) = petFreqs(1)
The code assumes that petFreqs(1)
is there the bird data is.
Want to put the bird count in petFreqs(4)
? No big:
For index = 1 To numDataElements
pet = rawData(index)
If pet = "bird" Then
petFreqs(4) = petFreqs(4) + 1
ElseIf pet = "cat" Then
...
Cells(7, 1) = "Bird"
Cells(7, 2) = petFreqs(4)
We just need to be consistent.
ElseIf pet = "llama" Then
petFreqs(5) = petFreqs(5) + 1
End If
Next index
There's a bunch of If
s that just ends.
In the programs we've written earlier, the input sub checked for valid data. That's what should happen here, too. readPetData
should check for unknown pets, and not add that data to rawData
.
Two index variables
There are two arrays. One for the raw data: rawData
. One for the counts: petFreqs
. Arrays are used with indexes, and the same index values mean different things for the arrays. rawData(3)
is what one person said their favorite pet is. Cat, dog, bird, whatevs. petFreqs(3)
has the index 3 as well, but it's the count for a particular pet (it's dog for our program).
Let's think about the code. We could use one index variable, like this:
Dim index as Integer
...
For index = 1 To numDataElements
pet = rawData(index)
If pet = "bird" Then
index = 1
...
petFreqs(index) = petFreqs(index) + 1
index
is being used for two things. An index into rawData
…
pet = rawData(index)
.. and an index into petFreqs
…
petFreqs(index) = petFreqs(index) + 1
Code like this will make you scream. It might work, or might not. Even if it does, it's easy for bugs to creep in.
Better to use two different index variables, one for each array:
Dim rawDataIndex As Integer 'Used with raw data array.
Dim countIndex As Integer 'Used with freq array.
The code looks like this:
For rawDataIndex = 1 To numDataElements
pet = rawData(rawDataIndex)
If pet = "bird" Then
countIndex = 1
...
petFreqs(countIndex) = petFreqs(countIndex) + 1
Ah! Less confusion. Less screaming. Your dog will thank you.
Output
Here's part of the main program:
computeFreqs
sends out petFreqs
, filled with juicy data. outputFreqs
takes that data, and shows it to the user.
It just sends data to the worksheet.
Computing frequency categories
In the pets programs, we used a bunch of If
statements to compute the index, that is, to compute which category of pet a person had chosen:
If pet = "bird" Then
countIndex = 1
ElseIf pet = "cat" Then
countIndex = 2
ElseIf pet = "dog" Then
countIndex = 3
ElseIf pet = "fish" Then
countIndex = 4
ElseIf pet = "llama" Then
countIndex = 5
End If
petFreqs(countIndex) = petFreqs(countIndex) + 1
Sometimes it's easier to use arithmetic to compute indexes. Suppose there's a credit score that ranges from 100 to 500. The file scores.txt
has scores for a bunch of humans:
322
377
317
286
336
225
125
200
265
Let's write a program that computes frequencies in groups of 100, like this:
You can download the worksheet and some data files at the bottom of this page.
We'll have an array scoreFreqs
, with four elements. scoreFreqs(1)
will count the number of scores from 100 to 199. scoreFreqs(2)
will count the number of scores from 200 to 299. And so on.
It's almost the same as the pets program. Most of the differences are in the computeFreqs
and outputFreqs
subs.
Freqy
Here's computeFreqs
:
Check the code at line 53:
LIne 53 gets a score from the raw data array. Line 54 computes which category that score is in. It does the same job all the If
statements from pets.
Let's see how it works. Suppose score
is 188. That should be in the first category, 100 to 199. score / 100
is 1.88. The Int()
function makes that an integer, chopping off the 0.88. So Int(188 / 100)
is 1. That's the right category.
Suppose score
is 372. That should be in the third category, 300 to 399. score / 100
is 3.72. The Int()
function makes that an integer, chopping off the 0.72. So Int(372 / 100)
is 3. That's the right category.
There's just one fly in the ointment. The highest score is 500. That supposed to be in category 4: 400 to 500. But Int(500 / 100)
is 5. The formula doesn't work.
500 is a special case, also called an edge case. The easiest way to deal with it is to add an If
:
Edge cases are common in data analysis work. They're also a good source of bugs. If any source of bugs can be said to be good. Test edge cases with your code.
Output
Here's the output code:
Line 78 outputs a label to a cell:
Cells(countIndex + 6, 1) = label
The label will be something like "From 200 to 299:".
Lines 72 to 77 compute the label. There's an If
to handle the edge case.
Exercises
5934,124,Free
5935,20,Blue jay
6851,51,Stork
5242,192,Stork
6232,133,Free
Each record is data about one customer. The first column is a customer id. The second column is that customer's purchases over the last three months. The third column is the customer's favorite bird. The only possible choices are:
Cardinal
Blue jay
Stork
Larry
Free
Write a program that computes the frequencies of each favorite bird, but only for customers who purchased $100 or more. The data file name should be in the spreadsheet. The output is to the spreadsheet. Like this:
The program should still work if the number of records in the file changes.
Upload your work. The usual coding standards apply.
(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)