Filtering data

Keywords: 

Just some of the data

Often you want to process some of the data in a data set. Two main reasons.

Dirty data

Data from the RL is often dirty. For example, suppose you had this Web form:

Web form

A bunch of people filled it in. Their responses were copied to an Excel worksheet.

Here are some things people might type:

  • 1
  • None
  • two
  • six lovely kitties
  • 2 real, 2 stuffed

Ack! You need to run through the data, and pick out the values that make sense, before you can do any analysis.

Subsets

Sometimes you want to analyze a subset of the data. For example, suppose you have this sales data:

Regional sales

Maybe you want to process only data from the North region.

Filtering

What you want to do is data filtering.

Marking bad data

Let's start by going through a data set, and marking bad data.

Here's that sales data again:

Regional sales

We want a program that goes through the data, and checks that each entry is OK. Rules:

  • Region should be North, West, or Central. No other values are valid, although any mix of upper- and lowercase is OK.
  • Sales should be numeric, and zero or greater.

If any of the rules are violated, mark that cell, giving it a red background.

There's add one other rule-like-thing. Almost all sales are less than 500. If the program detects a sales value 500 or more, it will set the background color of the cell to yellow. Not definitely an error, but a suspicious value that should be checked.

  • Warning if sales 500 or more.

Here's all the data, and a button:

Data

There might not be 20 rows, though. There might be 28, 19, whatevs.

Here's what the program should do:

Output

Lily writes code

Tara
Tara
Lily, could you write the program, please?

Lily
Lily
OK. Let's see. Read through the data, check each cell, and apply those rules.

I don't know how many rows there are, so I'll use the pattern we just learned about to run through the data.

Pattern

Find the end of a data set

Situation:
The number of items in a data set varies. Maybe 502 one day, and 534 the next. You want to find the number of items.
Actions:
Use a Do loop and an index variable. Check cells one by one, until there's an empty cell.

Patterns are Good Things. They're solutions you can reuse. That's what expert programmers do.

Lily
Lily
I'll start by copying that code, and see what needs to be changed.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5. Loop

Lily
Lily
Now what? Hmm… OK, I know. This is like the input validation we did before, except that instead of showing as error message, the cell's color is changed.

Let me check that pattern. I'll look in the pattern catalog:

Pattern catalog

Ah! Here it is:

Link to pattern

OK, let's see… Here's some code I can use:

Pattern code

It won't fit exactly. It never does. But I can adapt it.

Lily is mixing patterns. She's putting the validation pattern inside the loop of the find-the-end-of-the-data pattern.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5.     userInput = InputBox( "What is the amount?" )
  6.     If Not IsNumeric(userInput) Then
  7.         MsgBox "Sorry, amount must be a number."
  8.         End
  9.     ElseIf userInput <= 0 Then
  10.         MsgBox "Sorry, amount must be greater than zero."
  11.         End
  12.     EndIf
  13. Loop
Lily
Lily
There's no InputBox, though (line 4). The data isn't being typed in. It's coming from a cell. Let's see… The program has just tested Cells(rows + 1, 1) (line 3), so if the program gets to line 4, then that cell isn't blank. I guess we can test that cell.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5.     userInput = Cells(rows, 1)
  6.     If Not IsNumeric(userInput) Then
Lily
Lily
The If (line 6) won't change. I'll just leave it alone.

But the way the program shows errors is different.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5.     userInput = Cells(rows, 1)
  6.     If Not IsNumeric(userInput) Then
  7.         MsgBox "Sorry, amount must be a number."
  8.         End
Lily
Lily
I don't want to show a error message. I want to change the background color of the cell. How do you do that? I don't remember that anywhere.

Hmm… I'll try searching.

Search

There's a tip on how to find stuff on this site.

Results

Lily
Lily
I'll check out the BFF page … Search on the page for "background".

On-page search

Lily
Lily
Oh, that's right, I remember now. Interior.Color.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5.     userInput = Cells(rows, 1)
  6.     If Not IsNumeric(userInput) Then
  7.         Cells(rows, 1).Interior.Color = vbRed
  8.         End
  9.     End If
Lily
Lily
I don't want that End (line 8), because I want the program to keep going when it finds a mistake. So it can find other mistakes.

I'll add the test for negative sales as well.

  1. Dim rows as Integer
  2. rows = 0
  3. Do While Cells(rows + 1, 1) <> ""
  4.     rows = rows + 1
  5.     userInput = Cells(rows, 1)
  6.     If Not IsNumeric(userInput) Then
  7.         Cells(rows, 1).Interior.Color = vbRed
  8.     ElseIf userInput < 0 then
  9.         Cells(rows, 1).Interior.Color = vbRed
  10.     End If
  11. Loop
Lily
Lily
Let me run it.

Some good and bad here. Good thing: Lily decided to test what she has so far. Write a bit, test a bit, write a bit, test a bit…

Bad thing: before testing, Lily should have quickly read through the code. She might have picked up some mistakes. Quicker than having to go through testing cycles.

Rats!

Lily
Lily
Oh, yeah.

Awww...

Lily
Lily
Oops! The numeric checks are supposed to be on column 2.

  1. Option Explicit
  2. Private Sub cmdCheck_Click()
  3.     Dim rows As Integer
  4.     Dim userInput As String
  5.     rows = 0
  6.     Do While Cells(rows + 1, 1) <> ""
  7.         rows = rows + 1
  8.         userInput = Cells(rows, 2)
  9.         If Not IsNumeric(userInput) Then
  10.             Cells(rows, 2).Interior.Color = vbRed
  11.         ElseIf userInput < 0 Then
  12.             Cells(rows, 2).Interior.Color = vbRed
  13.         End If
  14.     Loop
  15. End Sub
Output

Lily
Lily
Hey! Not bad.

Except B1. I need to start the loop at the second row. Let's see…

Oo! I can initialize rows to one, then the code will check row two.

  1.     rows = 1

Lily
Lily
Now, add the check for the warning for sales that could be too high.

  1. Option Explicit
  2. Private Sub cmdCheck_Click()
  3.     Dim rows As Integer
  4.     Dim userInput As String
  5.     rows = 0
  6.     Do While Cells(rows + 1, 1) <> ""
  7.         rows = rows + 1
  8.         userInput = Cells(rows, 2)
  9.         If Not IsNumeric(userInput) Then
  10.             Cells(rows, 2).Interior.Color = vbRed
  11.         ElseIf userInput < 0 Then
  12.             Cells(rows, 2).Interior.Color = vbRed
  13.         ElseIf userInput >= 500 Then
  14.             Cells(rows, 2).Interior.Color = vbYellow
  15.         End If
  16.     Loop
  17. End Sub
Output
Lily
Lily
OK! Now add the test for the region. Wait… OK, uppercase/lowercase doesn't matter. I'll add that trim thing, too.

  1. Option Explicit
  2. Private Sub cmdCheck_Click()
  3.     Dim rows As Integer
  4.     Dim userInput As String
  5.     rows = 0
  6.     Do While Cells(rows + 1, 1) <> ""
  7.         rows = rows + 1
  8.         userInput = Cells(rows, 2)
  9.         If Not IsNumeric(userInput) Then
  10.             Cells(rows, 2).Interior.Color = vbRed
  11.         ElseIf userInput < 0 Then
  12.             Cells(rows, 2).Interior.Color = vbRed
  13.         ElseIf userInput >= 500 Then
  14.             Cells(rows, 2).Interior.Color = vbYellow
  15.         End If
  16.         userInput = Cells(rows, 1)
  17.         userInput = LCase(Trim(userInput))
  18.         If userInput <> "north" And userInput <> "west" _
  19.                 And userInput <> "central" Then
  20.             Cells(rows, 1).Interior.Color = vbRed
  21.         End If
  22.     Loop
  23. End Sub
Output
Tara
Tara
Yay! Good job!

Could you add some comments?

  1. Option Explicit
  2. Private Sub cmdCheck_Click()
  3.     Dim rows As Integer
  4.     Dim userInput As String
  5.     'Initialize.
  6.     rows = 1
  7.     Do While Cells(rows + 1, 1) <> ""
  8.         rows = rows + 1
  9.         'Check sales.
  10.         userInput = Cells(rows, 2)
  11.         If Not IsNumeric(userInput) Then
  12.             Cells(rows, 2).Interior.Color = vbRed
  13.         ElseIf userInput < 0 Then
  14.             Cells(rows, 2).Interior.Color = vbRed
  15.         ElseIf userInput >= 500 Then
  16.             'Show warning.
  17.             Cells(rows, 2).Interior.Color = vbYellow
  18.         End If
  19.         'Check the region.
  20.         userInput = Cells(rows1, 1)
  21.         userInput = LCase(Trim(userInput))
  22.         If userInput <> "north" And userInput <> "west" _
  23.                 And userInput <> "central" Then
  24.             Cells(rows, 1).Interior.Color = vbRed
  25.         End If
  26.     Loop
  27. End Sub

Keeping track

Let's add to Lily's code, so that it keeps track of the number of errors. This is what we want:

Error count

  • Add three more variables.
  • Initialize them to zero.
  • Add 1 to an error count when each error happens.
  • Output the counts.

  1. Option Explicit
  2. Private Sub cmdCheck_Click()
  3.     Dim rows As Integer
  4.     Dim regionErrors As Integer
  5.     Dim salesErrors As Integer
  6.     Dim salesWarnings As Integer
  7.     Dim userInput As String
  8.     'Initialize.
  9.     rows = 1
  10.     regionErrors = 0
  11.     salesErrors = 0
  12.     salesWarnings = 0
  13.     Do While Cells(rows + 1, 1) <> ""
  14.         rows = rows + 1
  15.         'Check sales.
  16.         userInput = Cells(rows, 2)
  17.         If Not IsNumeric(userInput) Then
  18.             salesErrors = salesErrors + 1
  19.             Cells(rows, 2).Interior.Color = vbRed
  20.         ElseIf userInput < 0 Then
  21.             salesErrors = salesErrors + 1
  22.             Cells(rows, 2).Interior.Color = vbRed
  23.         ElseIf userInput >= 500 Then
  24.             'Show warning.
  25.             salesWarnings = salesWarnings + 1
  26.             Cells(rows, 2).Interior.Color = vbYellow
  27.         End If
  28.         'Check the region.
  29.         userInput = Cells(rows, 1)
  30.         userInput = LCase(Trim(userInput))
  31.         If userInput <> "north" And userInput <> "west" _
  32.                 And userInput <> "central" Then
  33.             regionErrors = regionErrors + 1
  34.             Cells(rows, 1).Interior.Color = vbRed
  35.         End If
  36.     Loop
  37.     'Output error count
  38.     Cells(4, 4) = regionErrors
  39.     Cells(5, 4) = salesErrors
  40.     Cells(6, 4) = salesWarnings
  41. End Sub

Exercises

Exercise: Show large price changes
Download this worksheet. It has prices at the start and end of each day:

Start

You program can't be sure how many day's data are in the worksheet.

Add code that will mark large price changes. When the daily price rises by 2 or more, mark the day green. When the price falls by 2 or more, mark it red. Like this:

Output

The first row does not contain data.

Upload your solution. The usual coding standards apply.

When you submit, please answer this question: What did you learn from this exercise? Type in the text area on the solution form.

Attachments

mark-price-changes-start.xlsm

(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)

Exercise: Find the quackers
Duck You want to hire some ducks for your new movie, "The Night of the Zombie Ducks." You go to your local duck pond, and ask the ducks to text you the number of times they quack each day. Through the Magic of Technology, the data is collected in an Excel workbook. Each text has the sender's DuId (duck id) added to it automatically.

Here's a sample:

Start

Duck beaks aren't the best for texting, so there could be bad data. Also, some ducks didn't follow the instructions, as in row 7.

Download the workbook, so you can add your own code. Write a program that does two things:

  • Draws lines through rows that have bad data. Quacks per day could be nonnumeric, or less than zero.
  • Mark the ducks that quack 70 or more times per day.

Here's some output:

Output

Look at row 7. Both cells have strikethroughs.

Hints:

  • The first row doesn't have data in it.
  • Useful code you can adapt: Cells(7, 7).Font.Strikethrough = True

Upload your workbook. The usual coding standards apply, as always.

In the text box on the submission form, please answer this question: What did you learn from this exercise?

Attachments

duck-quacks-start.xlsm

(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)

Summary