People who write business programs, whether they're full-time programmers or not, spend much of their time on data analysis programs. Executives and managers rely on the analyses to run the company.
Many basic analysis programs have a similar structure. In this lesson, we'll take at look at that common pattern.
There's much more to data analysis than this. Building statistical models, writing simulations, real-time analysis… there are whole university degrees in this stuff. The programs we'll write only do the most basic forms of analysis.
What a typical analysis program looks like
Extract
Process
Output
That's a pattern.
Example
The national Camel Dancing Exams are a Big Deal in the camel world. The best camels get endorsement deals ("Red Bull helps me strut my stuff"), appearances in Dancing with the Ungulates and Strictly Camel Dancing, and other goodies.
There are ten testing centers. Examiners send results to Dromedary and Bactrian, independent auditors who certify the results. Testing centers send their data by email, fax, pigeon, or shouting really loudly. Most of the time the data gets through OK, but sometimes there are errors.
Test scores range from 0 to 50, theoretically. However, scores less than 20 or greater than 40 are unusual. Those scores are outliers.
You work for D&B. Your boss, Sheldon Spittyface, asks you to write a program to analyze the test data. He wants to know the highest, lowest, and average score. He wants you exclude data errors (e.g., nonnumeric data), and outliers.
Here's what the worksheet looks like to start:
Main program
The main program follows the extract-process-output pattern. Here it is.
Dim
statements first?
Your program doesn't know how much data it's going to get. It should return a count of the number of valid values, along with the data itself.
Now that you know what params extractValidData
will need, you can declare them:
validData
is an array, with 1,000 elements. Since there have never been more than 300 CDE scores, that should be enough for a long time.
Here's the pattern again:
Extract
Process
Output
The next part is Process.
Send computeStats
the number of data elements, and the data. Get back from it the stats the Boss wanted: highest, lowest, and average. You need to declare those variables:
The pattern again:
Extract
Process
Output
The last part is Output:
Send outputStats
the things it needs to output, and it will figure out how.
Let's work on each of the subs.
Extract to array
The extraction sub is called like this:
The sub runs through the data in the worksheet, until it gets to the end. We know how to do that, with this loop you've seen before:
If you don't know how this works, see Finding the end.
You need to mix in with this tests, to make sure the data is OK. The flag pattern helps. Something like this:
If this isn't familiar, check Extracting to arrays.
Here's the code for the camel program:
Compare it with the pseudocode above it. It's much the same, but with the details filled in.
The loop test is a little different:
Do While Cells(row + dataStartRow, 2) <> ""
The 2 is because the data is in row 2. dataStartRow
is because the data doesn't start at row 1 of the worksheet:
dataStartRow
is an offset variable. It just moves the processing down a bit.
Do While Cells(row + dataStartRow, 2) <> ""
The code we had before was:
Do While Cells(row + 1, 1) <> ""
Should the new code be:
Do While Cells(row + dataStartRow + 1, 2) <> ""
To get the arithmetic right.
Let's see how. If the code is right, what's the first data value that extractValidData
should process?
The first score is 35. That should be the first value that extractValidData
looks at.
The breakpoint is just after the line that sets dataValue
. dataValue
should be 35 the first time program hits the breakpoint.
OK, let's run the program and find out:
Yay! It's 35, just what we expected.
That's what professional programmers do. There's so much that can go wrong in a program. Eliminating potential errors will make you scream less.
Sub extractValidData(count As Integer, validData)
validData
doesn't have a data type, like Integer
.
Process array
That's the extract sub. Now for the processing sub. It's called like this:
The code for the sub will be like what we've seen before, with the accumulator pattern.
Not much new here, except that this time you're processing an array.
When writing code like this, it's good to work backwards. For example, you need the average. That's the total divided by the count. You already have the count from the extraction routine. For the total, use the accumulator pattern.
'Params:
' count: number of valid data items (in)
' validData: array of valid data items (in)
' highest: highest score (out)
' lowest: lowest score (out)
' average: average score (out)
They each have (in)
or (out)
at the end. Shows whether the params is for sending data into the sub, or for getting data out.
Output
The last chunk. Here's the call:
Here's the code.
Variants
Here's the pattern again:
Extract
Process
Output
When using patterns, remember that they're guides, not cut-and-paste templates. Often you need to change them to fit business requirements. Take data extraction:
- From a worksheet to an array.
- From a worksheet to another part of the same worksheet.
- From a worksheet to a different worksheet.
- From a worksheet to a worksheet in a different workbook.
- From a file to a worksheet.
- From a file to an array.
The logic for each of them is essentially the same. There's a data source, with the raw data. There's a destination, for the extracted data. There's a loop that reads the raw data. There's a flag and some If
statements, to decide when to copy the data.
Part of programming is looking beneath the surface, and seeing the logic underneath. If you just rely on surface thinking, you won't get very far.
In the rest of this section, let's look at variants of the extract, process, and output chunks.
Exercises
(The data got a bit scrambled.)
Output sales statistics, but only for stores 1 to 3, and weeks 1 to 10. You can assume that all the data is valid.
Upload you solution.
(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)