Frequencies

Keywords: 

What are frequencies?

Frequencies are counts of things in different categories. For example, 417 people were asked to choose their favorite pet from a list of animals. Here’s the data:

Bird 71
Cat 90
Dog 144
Fish 82
Llama 30

There are five categories. The numbers are the counts of people who chose the animal. Dogs are top, naturally.

The same data is often presented in a bar chart, like this:

Frequencies

Sometimes the categories are ranges of numbers, like this:

Frequencies

It’s still categories and counts. The pattern for the code is the same.

In this lesson, you’re going to learn how to take raw data like this:

dog
dog
cat
llama
bird
cat
dog

… and compute the frequencies.

A frequency array

We’ll keep the frequency data in an array. Declare it like this:

Dim petFreqs(5) as Integer

We know there are five categories, so we can declare the array to be exactly that length.

Here’s what the pet data will look like:

Pet data

petFreqs(1) is the frequency for birds. petFreqs(2) is the frequency for cats. And so on.

A frequency program

Let’s write a program to read a file of favorite pet data, and compute frequencies. You can download the worksheet and some data files at the bottom of this page.

The worksheet looks like this to start:

Worksheet

Click Run, and the program will read in the raw data…

dog
dog
cat
llama
bird
cat
dog

Here’s some output:

Output

The overall structure of the program will be:

Extract
Process
Output

Main program

  1. Private Sub cmdRun_Click()
  2.     Dim rawData(1000) As String
  3.     Dim petFreqs(5) As Integer
  4.     Dim numDataElements As Integer
  5.     'Read data into array.
  6.     readPetData rawData, numDataElements
  7.     'Compute frequencies for each pet.
  8.     computeFreqs rawData, numDataElements, petFreqs
  9.     'Output frequencies.
  10.     outputFreqs petFreqs, numDataElements
  11. End Sub
There are two arrays. rawData is for the raw data read from the file. petFreqs is for the frequencies, that is, the counts for each pet type. petFreqs(1) will be the number of birds, petFreqs(2) the number of cats, and so on.

There are calls to three subs. This one…

readPetData rawData, numDataElements

…reads the raw data from the data file, and returns it in rawData. numDataElements has the number of data elements read. Which data file? The one the user names:

File name

The second sub call:

computeFreqs rawData, numDataElements, petFreqs

It takes the rawData and runs through it, counting the number of each pet type. It returns the result in petFreqs.

The last sub call:

outputFreqs petFreqs, numDataElements

It shows the results to the user.

Notice how simple and chunky the main program is:

  1. Private Sub cmdRun_Click()
  2.     Dim rawData(1000) As String
  3.     Dim petFreqs(5) As Integer
  4.     Dim numDataElements As Integer
  5.     'Read data into array.
  6.     readPetData rawData, numDataElements
  7.     'Compute frequencies for each pet.
  8.     computeFreqs rawData, numDataElements, petFreqs
  9.     'Output frequencies.
  10.     outputFreqs petFreqs, numDataElements
  11. End Sub
That’s what you want.

Get pet data from a file

Here’s the sub that gets the data from the file:

  1. 'Read the pet data file into an array.
  2. 'Params:
  3. '  rawData: array of favorite pet data (out)
  4. '  numDataElements: number of data elements (out)
  5. Sub readPetData(rawData, numDataElements As Integer)
  6.     Dim fileName As String
  7.     Dim pet As String
  8.     fileName = Cells(2, 2)
  9.     numDataElements = 0
  10.     Open ThisWorkbook.Path & "\" & fileName For Input As #1
  11.     'Read until end of file.
  12.     Do While Not EOF(1)
  13.         numDataElements = numDataElements + 1
  14.         Input #1, pet
  15.         'Put data into array.
  16.         rawData(numDataElements) = pet
  17.     Loop
  18.     Close #1
  19. End Sub
Nothing new here. Basically:

Get the file name from the worksheet.
Open the file.
While there is more data to read:
    Read one raw data element.
    Append it to the array.

Excel needs to know which folder the file is in. It looks in the same folder where the workbook is:

Open ThisWorkbook.Path & "\" & fileName For Input As #1

Remember to close the file. If you forget, and try to run the program again, you’ll set a “File already open” error. Open the immediate window (CTRL+G), type…

close #1

… and press Enter.

Compute frequencies

Here’s the sub to compute the frequencies:

  1. 'Compute the frequency of each pet.
  2. 'Params:
  3. '  rawData: array of favorite pet data (in)
  4. '  numDataElements: number of data elements (in)
  5. '  petFreqs: array of pet frequencies (out)
  6. Sub computeFreqs(rawData, numDataElements As Integer, petFreqs)
  7.     Dim rawDataIndex As Integer 'Used with raw data array.
  8.     Dim countIndex As Integer 'Used with freq array.
  9.     Dim pet As String
  10.     'Init frequencies.
  11.     For countIndex = 1 To 5
  12.         petFreqs(countIndex) = 0
  13.     Next countIndex
  14.     'Count number of each type of pet.
  15.     For rawDataIndex = 1 To numDataElements
  16.         pet = rawData(rawDataIndex)
  17.         If pet = "bird" Then
  18.             countIndex = 1
  19.         ElseIf pet = "cat" Then
  20.             countIndex = 2
  21.         ElseIf pet = "dog" Then
  22.             countIndex = 3
  23.         ElseIf pet = "fish" Then
  24.             countIndex = 4
  25.         ElseIf pet = "llama" Then
  26.             countIndex = 5
  27.         End If
  28.         petFreqs(countIndex) = petFreqs(countIndex) + 1
  29.     Next rawDataIndex
  30. End Sub
We want to count things, using the elements of the petFreqs array to store the counts. The counts have to start at zero, so we need to make sure all of petFreqs the elements start at zero. So:

  1. For countIndex = 1 To 5
  2.     petFreqs(countIndex) = 0
  3. Next index
The counting loop starts in line 52.

  1. For rawDataIndex = 1 To numDataElements
  2.     pet = rawData(rawDataIndex)
  3.     If pet = "bird" Then
  4.         countIndex = 1
  5.     ElseIf pet = "cat" Then
The CPU sets rawDataIndex to 1, then runs the body of the loop. It makes rawDataIndex 2, then runs the body of the loop. It makes rawDataIndex 3, then runs the body of the loop. Etc. It ends the loop at numDataElements. That was computed earlier, by readPetData.

On line 53…

pet = rawData(rawDataIndex)

… the CPU grabs an element from rawData, whichever one rawDataIndex says. It puts the value into the variable pet.

On line 54…

If pet = "bird" Then

… the CPU tests whether pet is “bird”. If it is, add one to the count of birds. The count of birds is in petFreqs(1), so this will do the job:

countIndex = 1
...
petFreqs(countIndex) = petFreqs(countIndex) + 1

If the pet was a dog, we need to update a different array element:

countIndex = 3
...
petFreqs(countIndex) = petFreqs(countIndex) + 1

The If statements run through the pet types, and set countIndex to one thing or another, depending on the pet. Then countIndex is used to update the right array element.

Jeremy
Jeremy
How did you know that the count of birds is in petFreqs(1), and not petFreqs(4), or whatevs?

Tara
Tara
Good question! It doesn’t matter where we put the count for birds, or dogs, or cats, as long as we’re consistent. Later, you’ll see this code:

Cells(7, 1) = "Bird"
Cells(7, 2) = petFreqs(1)

The code assumes that petFreqs(1) is there the bird data is.

Want to put the bird count in petFreqs(4)? No big:

For index = 1 To numDataElements
    pet = rawData(index)
    If pet = "bird" Then
        petFreqs(4) = petFreqs(4) + 1
    ElseIf pet = "cat" Then
...
Cells(7, 1) = "Bird"
Cells(7, 2) = petFreqs(4)

We just need to be consistent.

Lily
Lily
What happens if there’s a unknown pet on the list, like chupacabra?
Tara
Tara
Check out the code:

   ElseIf pet = "llama" Then
        petFreqs(5) = petFreqs(5) + 1
    End If
Next index

There’s a bunch of Ifs that just ends.

Lily
Lily
Oh, I get it. If it’s a chupacabra, nothing happens. No count gets added to.
Tara
Tara
Right!

In the programs we’ve written earlier, the input sub checked for valid data. That’s what should happen here, too. readPetData should check for unknown pets, and not add that data to rawData.

Two index variables

There are two arrays. One for the raw data: rawData. One for the counts: petFreqs. Arrays are used with indexes, and the same index values mean different things for the arrays. rawData(3) is what one person said their favorite pet is. Cat, dog, bird, whatevs. petFreqs(3) has the index 3 as well, but it’s the count for a particular pet (it’s dog for our program).

Let’s think about the code. We could use one index variable, like this:

Dim index as Integer
...
For index = 1 To numDataElements
    pet = rawData(index)
    If pet = "bird" Then
        index = 1
    ...
    petFreqs(index) = petFreqs(index) + 1

index is being used for two things. An index into rawData

pet = rawData(index)

.. and an index into petFreqs

petFreqs(index) = petFreqs(index) + 1

Scream Code like this will make you scream. It might work, or might not. Even if it does, it’s easy for bugs to creep in.

Better to use two different index variables, one for each array:

Dim rawDataIndex As Integer 'Used with raw data array.
Dim countIndex As Integer 'Used with freq array.

The code looks like this:

For rawDataIndex = 1 To numDataElements
    pet = rawData(rawDataIndex)
    If pet = "bird" Then
        countIndex = 1
    ...
    petFreqs(countIndex) = petFreqs(countIndex) + 1

Ah! Less confusion. Less screaming. Your dog will thank you.

Output

Here’s part of the main program:

  1. 'Compute frequencies for each pet.
  2. computeFreqs rawData, numDataElements, petFreqs
  3. 'Output frequencies.
  4. outputFreqs petFreqs, numDataElements
computeFreqs sends out petFreqs, filled with juicy data. outputFreqs takes that data, and shows it to the user.

  1. 'Output frequency data and total number of pets.
  2. 'Params:
  3. '  petFreqs: array of pet frequencies (in)
  4. '  numDataElements: number of data records (in)
  5. Sub outputFreqs(petFreqs, numDataElements As Integer)
  6.     Cells(7, 1) = "Bird"
  7.     Cells(7, 2) = petFreqs(1)
  8.     Cells(8, 1) = "Cat"
  9.     Cells(8, 2) = petFreqs(2)
  10.     Cells(9, 1) = "Dog"
  11.     Cells(9, 2) = petFreqs(3)
  12.     Cells(10, 1) = "Fish"
  13.     Cells(10, 2) = petFreqs(4)
  14.     Cells(11, 1) = "Llama"
  15.     Cells(11, 2) = petFreqs(5)
  16.     Cells(13, 1) = "Total"
  17.     Cells(13, 2) = numDataElements
  18. End Sub
It just sends data to the worksheet.

Computing frequency categories

In the pets programs, we used a bunch of If statements to compute the index, that is, to compute which category of pet a person had chosen:

If pet = "bird" Then
    countIndex = 1
ElseIf pet = "cat" Then
    countIndex = 2
ElseIf pet = "dog" Then
    countIndex = 3
ElseIf pet = "fish" Then
    countIndex = 4
ElseIf pet = "llama" Then
    countIndex = 5
End If
petFreqs(countIndex) = petFreqs(countIndex) + 1

Sometimes it’s easier to use arithmetic to compute indexes. Suppose there’s a credit score that ranges from 100 to 500. The file scores.txt has scores for a bunch of humans:

322
377
317
286
336
225
125
200
265

Let’s write a program that computes frequencies in groups of 100, like this:

Result

You can download the worksheet and some data files at the bottom of this page.

We’ll have an array scoreFreqs, with four elements. scoreFreqs(1) will count the number of scores from 100 to 199. scoreFreqs(2) will count the number of scores from 200 to 299. And so on.

It’s almost the same as the pets program. Most of the differences are in the computeFreqs and outputFreqs subs.

Freqy

Here’s computeFreqs:

  1. 'Compute the frequency of scores.
  2. 'Params:
  3. '  rawData: array of favorite score data (in)
  4. '  numDataElements: number of data elements (in)
  5. '  scoreFreqs: array of score frequencies (out)
  6. Sub computeFreqs(rawData, numDataElements As Integer, scoreFreqs)
  7.     Dim rawDataIndex As Integer 'Used with raw data array.
  8.     Dim countIndex As Integer 'Used with freq array.
  9.     Dim score As Integer
  10.     'Init frequencies.
  11.     For countIndex = 1 To 4
  12.         scoreFreqs(countIndex) = 0
  13.     Next countIndex
  14.     'Compute freqs for each score range.
  15.     For rawDataIndex = 1 To numDataElements
  16.         score = rawData(rawDataIndex)
  17.         countIndex = Int(score / 100)
  18.         If countIndex = 5 Then
  19.             countIndex = 4
  20.         End If
  21.         scoreFreqs(countIndex) = scoreFreqs(countIndex) + 1
  22.     Next rawDataIndex
  23. End Sub
Check the code at line 53:

  1. score = rawData(rawDataIndex)
  2. countIndex = Int(score / 100)
LIne 53 gets a score from the raw data array. Line 54 computes which category that score is in. It does the same job all the If statements from pets.

Let’s see how it works. Suppose score is 188. That should be in the first category, 100 to 199. score / 100 is 1.88. The Int() function makes that an integer, chopping off the 0.88. So Int(188 / 100) is 1. That’s the right category.

Suppose score is 372. That should be in the third category, 300 to 399. score / 100 is 3.72. The Int() function makes that an integer, chopping off the 0.72. So Int(372 / 100) is 3. That’s the right category.

There’s just one fly in the ointment. The highest score is 500. That supposed to be in category 4: 400 to 500. But Int(500 / 100) is 5. The formula doesn’t work.

500 is a special case, also called an edge case. The easiest way to deal with it is to add an If:

  1. score = rawData(rawDataIndex)
  2. countIndex = Int(score / 100)
  3. If countIndex = 5 Then
  4.     countIndex = 4
  5. End If
Edge cases are common in data analysis work. They’re also a good source of bugs. If any source of bugs can be said to be good. Test edge cases with your code.

Output

Here’s the output code:

  1. 'Output frequency data and total number of scores.
  2. 'Params:
  3. '  scoreFreqs: array of score frequencies (in)
  4. '  numDataElements: number of data records (in)
  5. Sub outputFreqs(scoreFreqs, numDataElements As Integer)
  6.     Dim countIndex As Integer 'Used with freq array.
  7.     Dim high As Integer
  8.     Dim low As Integer
  9.     Dim label As String
  10.     For countIndex = 1 To 4
  11.         low = countIndex * 100
  12.         high = low + 99
  13.         If high = 499 Then
  14.             high = 500
  15.         End If
  16.         label = "From " & low & " to " & high & ":"
  17.         Cells(countIndex + 6, 1) = label
  18.         Cells(countIndex + 6, 2) = scoreFreqs(countIndex)
  19.     Next countIndex
  20.     Cells(13, 1) = "Total"
  21.     Cells(13, 2) = numDataElements
  22. End Sub
Line 78 outputs a label to a cell:

Cells(countIndex + 6, 1) = label

The label will be something like “From 200 to 299:”.

Lines 72 to 77 compute the label. There’s an If to handle the edge case.

Exercises

Exercise: Favorite bird frequencies
This data file contains customer data from a birding company. Here’s a sample of the data.

5934,124,Free
5935,20,Blue jay
6851,51,Stork
5242,192,Stork
6232,133,Free

Each records is data about one customer. The first column is a customer id. The second column is that customer’s purchases over the last three months. The third column is the customer’s favorite bird. The only possible choices are:

Cardinal
Blue jay
Stork
Larry
Free

Write a program that computes the frequencies of each favorite bird, but only for customers who purchased $100 or more. The data file name should be in the spreadsheet. The output is to the spreadsheet. Like this:

Output

Upload your work. The usual coding standards apply.

Attachments

birders.csv

(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)

Summary