Surprised cat
So far, all of the data has been in a worksheet. Here's pseudocode for extracting data from a worksheet into an array:
While there is more data
Get some data
If the data meets some criteria Then
Copy it to an array
EndIf
Loop
In RL, data that comes from other systems, or is downloaded from the Web, might not be in a worksheet. Often it will be in a plain text file. Your program will need to extract data from that.
Here's pseudocode for extracting data from a file into an array:
While there is more data
Get some data
If the data meets some criteria Then
Copy it to an array
EndIf
Loop
Indeed.
Geeks often talk about algorithms. An algorithm is a general way of doing something, like extracting data. This:
While there is more data
Get some data
If the data meets some criteria Then
Copy it to an array
EndIf
Loop
The algorithm can be implemented in different languages, like C++, Java, PHP, VBA, Python… whatevs. The algorithm can handle data from different sources, like worksheets, files, databases… whatevs.
Of course, the details are different. But the underlying approach is the same.
Camels again
Here's the code we wrote for extracting data from a worksheet, for the camels dancing exam example:
Suppose the data is stored in a file instead. It's in CSV (comma-separated values) format. A sample:
1000, 35
1001, 20
1002, 38
1003, 32
1004, 32
1005, 32
1006, 22
In each line, the first number is the exam id, and the second is the score. It's the same data as before, just moved into a file.
Here's what the worksheet will look like:
The exam scores aren't on the worksheet. Instead, the user types in the name of the file with the exam scores.
Here's extractValidData
, rewritten to read data from the file.
The code is similar to what we had before. Let's look at the differences.
Opening the file
We have to tell Excel that we want it to prepare a file for reading. We didn't have to do this when the data was in the worksheet.
Excel needs to know the name and location of the data file. The user typed the file name into cell B7, so we grabbed that from the worksheet:
fileName = Cells(7, 2)
Then we tell Excel to open that file for reading:
Open ThisWorkbook.Path & "\" & fileName For Input As #1
This code assumes that the data file is in the same folder as the worksheet. That's what ThisWorkbook.Path
is all about. The #1
is the file handle. From now on, the program will use #1
when it wants to refer to the file.
Our code needs to know when it has gotten to the end of the data. When extracting data from a worksheet, the program looked for an empty call, with code like this:
Do While Cells(rows + 1, 2) <> ""
The file version is a little simpler, because there is a built-in function that detects end of file:
Do While Not EOF(1)
EOF(1)
is True
when the end of file 1 has been reached. So the loop keeps on running as long as the end of file 1 has not been reached.
This code…
Input #1, inputExamId, inputScore
… reads two values from a line in the data file.
The only other difference is this statement just before the end of the sub:
Close #1
This closes the file, making it available for other users. Always close your files. Some languages don't care about closing, but VBA does.
This sometimes causes problems. Say the program crashes in extractValidData
, after opening the file, but before closing it. Next time you run the program, Excel might complain that you can't open the file, because it is already open.
If this happens to you, then close the data file manually. Open the Immediate window (Ctrl+G). Then type the close statement as it is in the code, and hit Enter:
This will close the file.
Good thunking
To read from a file, only extractValidData
needed to change. This is good thunking. When the Boss asks for a change, only a part of the program is affected. Less writing, debugging, and screaming. Less time spent on the change. Higher productivity. Higher reliability.
Professional programmers think about these things. Having code work is not enough. The code should be easy to manage, as well. Good thunking helps with that.
Other changes
Some changes have a greater effect on the program. Suppose the Boss wants you to output the number of data errors and outliers, along with the stats. Like this:
Here's the main program now:
You need some two more variables. They're computed in extractValidData
, and returned to the main program. outputStats
outputs them. Like the honey badger, computeStats
doesn't care.
Here's the new code for extractValidData
:
The counters are initialized (lines 45 and 46). They're incremented when an error or outlier is detected (lines 59, 62, and 65). You need to add a new range test. Recall that exam scores are from 0 to 50. Numbers outside that range are errors. However, number from 0 to 19 and 41 to 50 are not errors, they're outliers. This code counts them that way.
outputStats
hardly changes:
Again, good thunking means we can make the changes without rewriting everything. Less time spent writing, debugging, and screaming. More productivity. More reliability.
Exercises
- Read the data from this CSV file.
- Write the data to a Web page.
Read the file names from a worksheet. Create this:
Upload your work. The usual coding standards apply.
(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)