Reading a file

Keywords: 

Surprised cat

So far, all of the data has been in a worksheet. Here's pseudocode for extracting data from a worksheet into an array:

While there is more data
    Get some data
    If the data meets some criteria Then
        Copy it to an array
    EndIf
Loop

In RL, data that comes from other systems, or is downloaded from the Web, might not be in a worksheet. Often it will be in a plain text file. Your program will need to extract data from that.

Here's pseudocode for extracting data from a file into an array:

While there is more data
    Get some data
    If the data meets some criteria Then
        Copy it to an array
    EndIf
Loop

The same!

Indeed.

Geeks often talk about algorithms. An algorithm is a general way of doing something, like extracting data. This:

While there is more data
    Get some data
    If the data meets some criteria Then
        Copy it to an array
    EndIf
Loop

The algorithm can be implemented in different languages, like C++, Java, PHP, VBA, Python… whatevs. The algorithm can handle data from different sources, like worksheets, files, databases… whatevs.

Of course, the details are different. But the underlying approach is the same.

Camels again

Here's the code we wrote for extracting data from a worksheet, for the camels dancing exam example:

  1. 'Extract valid data from raw data.
  2. 'Params:
  3. '  count: number of valid data items (out)
  4. '  validData: array of valid data items (out)
  5. Sub extractValidData(count As Integer, validData)
  6.     Dim row As Integer
  7.     Dim dataStartRow As Integer
  8.     Dim dataValue As String
  9.     Dim dataOK As Boolean
  10.     'What row the data starts in.
  11.     dataStartRow = 8
  12.     'Current raw data row.
  13.     row = 0
  14.     'Count of valid data.
  15.     count = 0
  16.     'Does the next row have data in it?
  17.     Do While Cells(row + dataStartRow, 2) <> ""
  18.         'Get the next data value.
  19.         dataValue = Cells(row + dataStartRow, 2)
  20.         'Validate it.
  21.         'Set flag.
  22.         dataOK = True
  23.         'Change flag if there is an error.
  24.         If Not IsNumeric(dataValue) Then
  25.             dataOK = False
  26.         ElseIf dataValue < 20 Or dataValue > 40 Then
  27.             dataOK = False
  28.         End If
  29.         'Did the data pass the tests?
  30.         If dataOK Then
  31.             'Data is OK. Copy to extracted data array.
  32.             count = count + 1
  33.             validData(count) = dataValue
  34.         End If
  35.         'Go to the next raw data row.
  36.         row = row + 1
  37.     Loop
  38. End Sub
Suppose the data is stored in a file instead. It's in CSV (comma-separated values) format. A sample:

1000, 35
1001, 20
1002, 38
1003, 32
1004, 32
1005, 32
1006, 22

In each line, the first number is the exam id, and the second is the score. It's the same data as before, just moved into a file.

Here's what the worksheet will look like:

Worksheet

The exam scores aren't on the worksheet. Instead, the user types in the name of the file with the exam scores.

Here's extractValidData, rewritten to read data from the file.

  1. 'Extract valid data from raw data.
  2. 'Params:
  3. '  count: number of valid data items (out)
  4. '  validData: array of valid data items (out)
  5. Sub extractValidData(count As Integer, validData)
  6.     Dim inputExamId As String
  7.     Dim inputScore As String
  8.     Dim dataOK As Boolean
  9.     Dim fileName As String
  10.     'Count of valid data.
  11.     count = 0
  12.     'Open the data file.
  13.     fileName = Cells(7, 2)
  14.     Open ThisWorkbook.Path & "\" & fileName For Input As #1
  15.     'Loop while there is more data.
  16.     Do While Not EOF(1)
  17.         'Read a line of data.
  18.         Input #1, inputExamId, inputScore
  19.         'Validate the score.
  20.         'Set flag.
  21.         dataOK = True
  22.         'Change flag if there is an error.
  23.         If Not IsNumeric(inputScore) Then
  24.             dataOK = False
  25.         ElseIf inputScore < 20 Or inputScore > 40 Then
  26.             dataOK = False
  27.         End If
  28.         'Did the data pass the tests?
  29.         If dataOK Then
  30.             'Data is OK. Copy to extracted data array.
  31.             count = count + 1
  32.             validData(count) = inputScore
  33.         End If
  34.     Loop
  35.     Close #1
  36. End Sub
The code is similar to what we had before. Let's look at the differences.

Opening the file

We have to tell Excel that we want it to prepare a file for reading. We didn't have to do this when the data was in the worksheet.

Excel needs to know the name and location of the data file. The user typed the file name into cell B7, so we grabbed that from the worksheet:

fileName = Cells(7, 2)

Then we tell Excel to open that file for reading:

Open ThisWorkbook.Path & "\" & fileName For Input As #1

This code assumes that the data file is in the same folder as the worksheet. That's what ThisWorkbook.Path is all about. The #1 is the file handle. From now on, the program will use #1 when it wants to refer to the file.

Our code needs to know when it has gotten to the end of the data. When extracting data from a worksheet, the program looked for an empty call, with code like this:

Do While Cells(rows + 1, 2) <> ""

The file version is a little simpler, because there is a built-in function that detects end of file:

Do While Not EOF(1)

EOF(1) is True when the end of file 1 has been reached. So the loop keeps on running as long as the end of file 1 has not been reached.

This code…

Input #1, inputExamId, inputScore

… reads two values from a line in the data file.

The only other difference is this statement just before the end of the sub:

Close #1

This closes the file, making it available for other users. Always close your files. Some languages don't care about closing, but VBA does.

This sometimes causes problems. Say the program crashes in extractValidData, after opening the file, but before closing it. Next time you run the program, Excel might complain that you can't open the file, because it is already open.

If this happens to you, then close the data file manually. Open the Immediate window (Ctrl+G). Then type the close statement as it is in the code, and hit Enter:

Closing

This will close the file.

Good thunking

To read from a file, only extractValidData needed to change​. This is good thunking. When the Boss asks for a change, only a part of the program is affected. Less writing, debugging, and screaming. Less time spent on the change. Higher productivity. Higher reliability.

Professional programmers think about these things. Having code work is not enough. The code should be easy to manage, as well. Good thunking helps with that.

Other changes

Some changes have a greater effect on the program. Suppose the Boss wants you to output the number of data errors and outliers, along with the stats. Like this:

Errors and outliers

Here's the main program now:

  1. Private Sub cmdAnalyze_Click()
  2.     Dim validData(1000) As Integer
  3.     Dim count As Integer
  4.     Dim highest As Integer
  5.     Dim lowest As Integer
  6.     Dim average As Single
  7.     Dim dataErrors As Integer
  8.     Dim outliers As Integer
  9.     'Extract the valid data.
  10.     extractValidData count, validData, dataErrors, outliers
  11.     'Is there enough data to analyze?
  12.     If count < 1 Then
  13.         MsgBox "Sorry, there is not enough data to analyze."
  14.         End
  15.     End If
  16.     'Compute stats.
  17.     computeStats count, validData, highest, lowest, average
  18.     'Output stats.
  19.     outputStats count, highest, lowest, average, dataErrors, outliers
  20. End Sub
You need some two more variables. They're computed in extractValidData, and returned to the main program. outputStats outputs them. Like the honey badger, computeStats doesn't care.

Here's the new code for extractValidData:

  1. 'Extract valid data from raw data.
  2. 'Params:
  3. '  count: number of valid data items (out)
  4. '  validData: array of valid data items (out)
  5. '  dataErrors: count of data errors - non-numeric,
  6. '          or out of range (out)
  7. '  outliers: count of outliers (out)
  8. Sub extractValidData(count As Integer, validData, _
  9.         dataErrors As Integer, outliers As Integer)
  10.     Dim inputExamId As String
  11.     Dim inputScore As String
  12.     Dim dataOK As Boolean
  13.     Dim fileName As String
  14.     'Count of valid data.
  15.     count = 0
  16.     'Init error counters.
  17.     dataErrors = 0
  18.     outliers = 0
  19.     'Open the data file.
  20.     fileName = Cells(7, 2)
  21.     Open ThisWorkbook.Path & "\" & fileName For Input As #1
  22.     'Loop while there is more data.
  23.     Do While Not EOF(1)
  24.         'Read a line of data.
  25.         Input #1, inputExamId, inputScore
  26.         'Validate the score.
  27.         'Set flag.
  28.         dataOK = True
  29.         'Change flag if there is an error.
  30.         If Not IsNumeric(inputScore) Then
  31.             dataErrors = dataErrors + 1
  32.             dataOK = False
  33.         ElseIf inputScore < 0 Or inputScore > 50 Then
  34.             dataErrors = dataErrors + 1
  35.             dataOK = False
  36.         ElseIf inputScore < 20 Or inputScore > 40 Then
  37.             outliers = outliers + 1
  38.             dataOK = False
  39.         End If
  40.         'Did the data pass the tests?
  41.         If dataOK Then
  42.             'Data is OK. Copy to extracted data array.
  43.             count = count + 1
  44.             validData(count) = inputScore
  45.         End If
  46.     Loop
  47.     Close #1
  48. End Sub
The counters are initialized (lines 45 and 46). They're incremented when an error or outlier is detected (lines 59, 62, and 65). You need to add a new range test. Recall that exam scores are from 0 to 50. Numbers outside that range are errors. However, number from 0 to 19 and 41 to 50 are not errors, they're outliers. This code counts them that way.

outputStats hardly changes:

  1. 'Output statistics to worksheet.
  2. 'Params:
  3. '  count: number of valid data items (in)
  4. '  highest: highest score (in)
  5. '  lowest: lowest score (in)
  6. '  average: average score (in)
  7. '  dataErrors: count of data errors - non-numeric,
  8. '          or out of range (out)
  9. '  outliers: count of outliers (out)
  10. Sub outputStats(count As Integer, highest As Integer, _
  11.         lowest As Integer, average As Single, _
  12.         dataErrors As Integer, outliers As Integer)
  13.     Cells(11, 2) = count
  14.     Cells(12, 2) = highest
  15.     Cells(13, 2) = lowest
  16.     Cells(14, 2) = Round(average, 1)
  17.     Cells(16, 2) = dataErrors
  18.     Cells(17, 2) = outliers
  19. End Sub
Again, good thunking means we can make the changes without rewriting everything. Less time spent writing, debugging, and screaming. More productivity. More reliability.

Exercises

Exercise: Sales for some stores, with files
Earlier, you computed sales statistics for weeks 1 to 3, store 1 to 10. Do that again, but this time:

  • Read the data from this CSV file.
  • Write the data to a Web page.

Read the file names from a worksheet. Create this:

Start

Upload your work. The usual coding standards apply.

Attachments

partial-sales-data.csv

(If you were logged in as a student, you could submit an exercise solution, and get some feedback.)

Summary