Lately I’ve been working on developing a web app that manages assignments in programming classes. I know this isn’t an original project — there are at least two other such things that have been written at CMU alone — but this one was born of the complete terribleness of one of the aforementioned projects. Anyway, the focus of this post isn’t the development part, or the fact that Ruby on Rails is basically programmer’s paradise (this is a true fact, by the way).
Working on this project got me thinking about the way grading is done in CS classes. I’ve been on course staff for four different CS courses at CMU now, and been in many more, and it seems like all of them take different approaches to grading. I’m of many minds about which approach is the best. That train of thought got me thinking more generally about the problems of grade inflation and the fact that people like to blame anyone but themselves for their own inadequacy. So this might be a long post.
If a student entering a college programming course has never taken one before, the nature of programming assignments, and what it implies for how they’re graded, will come as a shock. A programming assignment is a complex thing, but the end result (the program’s behavior) is completely unambiguous. There are no subtleties and no wiggle room. This means that students put a lot of effort into crafting a substantial, complex, difficult piece of work, which is then naturally reduced to a cold, hard number. It can be a frustrating phenomenon to face.
There are four major approaches to grading programs for correctness that I’ve seen:
- Use automated tests exclusively to determine correctness. Give out all these tests, so students can, in effect, grade themselves.
- Use automated tests exclusively to determine correctness. Give out a subset of these tests (typically the less comprehensive ones).
- Use automated tests exclusively to determine correctness. Don’t give out any of them.
- Use a mixture of mostly manual testing and some automated testing, or exclusively manual testing. Don’t give out the automated tests, if any.
Within the approaches that involve giving out automated tests, there’s also the distinction of whether the source code to the tests is given out, or just the results of running them on a server.
The discussions of these approaches assume that the assigned program has been fully (and unambiguously) specified in prose, and any automated tests used to determine correctness accurately test the functionality set out in the spec.
Students obviously favor the first approach, because that way they know exactly what their correctness score is all along. (Typically, the correctness of a program accounts for the vast majority of the overall score, as opposed to other things like coding style.) As a pedagogue, however, I’ve turned against Approach 1. As a teaching assistant, it’s all right: I have to do very little grading work, and students don’t have a leg to stand on if they disagree with the grade they get. (“Pedagogue” and “teaching assistant” are different concepts here, unfortunately.) The reason I don’t like it as a pedagogue is that it discourages (a) good programming practice and (b) learning.
Students end up programming by trial and error. They’ll write some code and run the tests. Some tests fail. At this point, the more competent students debug their code, looking carefully at the reasons for the failures, instrumenting their code, using a debugger, possibly even writing their own tests to probe the bug further. However, staffing CS courses has taught me, mostly, that such competent students are few and far between. The less competent students, i.e. the majority, who have no idea how to systematically debug, will put print statements everywhere, thus confirming that something is indeed wrong, and then run for help. (These students tend not to do well when the code and bug are the kind where merely putting in a print statement makes the bug disappear.) The trouble is that they’re encouraged to think of the course-provided tests as an oracle, rather than as a safety net. People tend not to really think about and plan out their code before starting to write it, since they know that ultimately, their own confidence in the correctness of their code doesn’t matter. The absolute worst is when the tests report a score that students judge as “good enough”, and they turn in code that they know is flawed, assured that it won’t harm them. It’s an insane, absurd luxury; I really hope people harbor no illusions that they can continue doing stuff like that after they graduate, when the code they write will matter for years.
To some extent, I’ve been guilty of some of this myself, despite considering myself a good student and a good programmer. I’ve fallen into the trial-and-error coding trap, where a test fails, I glance at the output, make a change that I think will solve the problem, rerun the test, it fails, rinse and repeat. But I systematically debugged when I needed to, and to this day I have never turned in code with known bugs. That doesn’t change the fact that this approach to grading lulled me into bad habits.
To be fair, Approach 1 does have the advantage that since students have access to testing code, they can study it, modify it to suit their needs at the time (or even make it better), and possibly glean general testing techniques from it. Unfortunately, most students don’t do that kind of thing; they just run the tests and look at what they spit out, so this advantage isn’t worth much.
Approach 2 is a compromise between 1 and 3. Students don’t know their overall scores, since some of the tests that count towards their score are ones whose results they can’t see. This mostly solves the “not perfect but good enough” problem, at least among students who aren’t pathologically lazy. Since the most rigorous tests aren’t given to the students, they’re encouraged to stress-test their code themselves, or risk getting burned when they pass all the tests they have and fail most of the ones they don’t (which typically are worth more score-wise). To a TA, this scheme reintroduces the problem of whining students: they got a poor score on the tests they didn’t have, and for some reason they feel this is unfair. Their complaint is that their grade is being based on an unknown grading instrument. To me, students who use this excuse are miserable, pathetic excuses for programmers, but I think I’m starting to see the psychological basis for it. I think they feel that because programs are completely concrete, deterministic and specified, the methods for determining their correctness must have the same properties. They can’t deal with the presence of an unknown in the evaluation of a concrete, unambiguous piece of work. I may be talking out of my ass, since I don’t identify with this thinking at all, but that’s my theory.
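To make the “getting burned” scenario concrete, here is a minimal sketch of how an Approach 2 score might be tallied. The test names, point values, and the `CaseResult` type are all made up for illustration; they are not taken from any actual course.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    name: str
    points: int     # weight of this test in the correctness score
    released: bool  # True if students could run this test before the deadline
    passed: bool


def correctness_score(results):
    """Return (earned, possible) points over all tests, released or not."""
    earned = sum(r.points for r in results if r.passed)
    possible = sum(r.points for r in results)
    return earned, possible


def visible_score(results):
    """Return (earned, possible) points over only the released tests."""
    return correctness_score([r for r in results if r.released])


# Hypothetical submission: perfect on the released tests, shaky on the rest.
results = [
    CaseResult("basic_input", 5, True, True),
    CaseResult("empty_input", 5, True, True),
    CaseResult("huge_input", 20, False, False),
    CaseResult("malformed_input", 20, False, True),
]

print(visible_score(results))      # (10, 10)
print(correctness_score(results))  # (30, 50)
```

The student sees 10 out of 10 while working, but the score that counts is 30 out of 50, because the unreleased tests carry most of the weight.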
Obviously, students who don’t like Approach 2 absolutely can’t stand Approach 3. I can’t say I like Approach 3 much either, as a TA or pedagogue. It exacerbates the whining students problem, and provides no pedagogical or logistical advantages over Approach 2. (I don’t think releasing a few basic tests to students is pedagogically harmful, as long as they account for a small proportion of the assignment’s score.) I think there’s general agreement on this — I’ve only been in one CS class that took this approach. Well, there was another class that had one programming assignment and eleven discrete math assignments, but the programming assignment was “write a quine”, for which a fully comprehensive test is
make quine && ./quine | cmp - quine.c, so it doesn’t really count.
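For what it’s worth, the same check can be sketched in a few lines of Python. This is only an illustration of the idea (run the program, compare its output byte for byte with its source); the function name `is_quine` and the 10-second timeout are my own choices, and it assumes the program has already been built.

```python
import subprocess
import sys


def is_quine(command, source_path):
    """Return True if running `command` prints exactly the bytes of `source_path`."""
    with open(source_path, "rb") as f:
        source = f.read()
    run = subprocess.run(command, capture_output=True, timeout=10)
    # The program must exit cleanly and reproduce its source exactly.
    return run.returncode == 0 and run.stdout == source


# Hypothetical usage, after building the C program:
#   is_quine(["./quine"], "quine.c")
# For a Python script the interpreter is part of the command:
#   is_quine([sys.executable, "quine.py"], "quine.py")
```

Because the comparison is exact, a stray trailing newline or a platform-specific line ending is enough to make it fail, which is the same strictness `cmp` gives you.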
Approach 4 is a different kind of compromise between Approaches 1 and 3. Possibly recognizing the pedagogical disadvantages of Approach 1 and students’ hatred of Approach 3, Approach 4 replaces the blind ruthlessness of automated testing with the compassion and sympathy of a human grader. Students don’t really have a problem with this approach, since it generally gets them good grades, mainly because manual testing of programs is nowhere near as rigorous as automated testing (and it’s a hell of a tedious job too, let me tell you), and because a human can be lenient to code that is “almost right”. My thoughts are that code that is “almost right” is still broken and deserves an appropriate score, but I didn’t make the rules in the course I staffed that took this approach.
To this end, one of the courses I’ve staffed has adopted a vastly complex approach to grading. The basic approach is 2: correctness is determined solely by automated testing, and a subset of the tests is released (only the results; no source code is ever released). There are several twists. First of all, students are required to write their own tests, and these tests are handed in along with the actual program, for credit. Then, three things happen: the staff’s tests are run against the student’s code, the student’s tests are run against the staff’s reference implementation, and the student’s tests are run against their own code. While running the student’s tests, the tests’ code coverage is measured (this course was done in Java, in case you wondered). Depending on the comprehensiveness of the student’s tests, as measured by code coverage, some of the results of the staff tests on the student’s code are released to the student (never all of them, though). The full results of the student’s tests (on both the student’s code and the reference code) are available to the student all the time. This is actually very useful feedback: if the student’s tests fail on the staff’s code, it’s likely that the tests are wrong, and students can fix the tests and be more confident in using them to test their code. (Also, it gives the staff a good laugh when a student’s tests fail on the student’s own code.) So students are forced to write their own tests, and they are graded on the quality of these tests. High-quality tests are rewarded with peeks at “the real answer”. Of course, this still isn’t how the real world operates, but at least students have an incentive to write good tests.
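The three test runs and the coverage-gated peek can be sketched roughly as follows. This is not the course’s actual system (which was in Java): the function names, the rule that at most half of the staff results are revealed, and the choice to reveal them in alphabetical order are all assumptions of mine, and the coverage figure is simply passed in as a number between 0 and 1 rather than measured.

```python
def run_tests(tests, implementation):
    """Run each test (a callable taking the implementation); map name -> passed."""
    results = {}
    for name, test in tests.items():
        try:
            test(implementation)
            results[name] = True
        except AssertionError:
            results[name] = False
    return results


def released_staff_results(staff_results, coverage, max_fraction=0.5):
    """Reveal a coverage-proportional slice of the staff results, never all."""
    fraction = min(max(coverage, 0.0), 1.0) * max_fraction  # assumed policy
    names = sorted(staff_results)
    count = int(len(names) * fraction)
    return {name: staff_results[name] for name in names[:count]}


def grade(student_impl, student_tests, staff_impl, staff_tests, coverage):
    """Build the feedback a student sees from the three test runs."""
    staff_on_student = run_tests(staff_tests, student_impl)
    student_on_reference = run_tests(student_tests, staff_impl)
    student_on_own_code = run_tests(student_tests, student_impl)
    return {
        "staff_on_student_released": released_staff_results(staff_on_student, coverage),
        "student_on_reference": student_on_reference,
        "student_on_own_code": student_on_own_code,
        # A student test that fails on the reference code is probably wrong.
        "suspect_student_tests": sorted(
            name for name, ok in student_on_reference.items() if not ok
        ),
    }


def check(value, expected):
    """Make a test asserting that impl(value) == expected."""
    def test(impl):
        assert impl(value) == expected
    return test


# Toy assignment: implement absolute value. The student's version is buggy,
# and one of the student's own tests encodes the same misunderstanding.
staff_tests = {
    "positive": check(2, 2),
    "zero": check(0, 0),
    "negative": check(-2, 2),
    "big_negative": check(-10**6, 10**6),
}
student_tests = {"positive": check(3, 3), "negative": check(-3, -3)}

report = grade(lambda x: x, student_tests, abs, staff_tests, coverage=1.0)
print(report["suspect_student_tests"])      # ['negative']
print(report["staff_on_student_released"])  # {'big_negative': False, 'negative': False}
```

In the toy run, the student’s `negative` test passes on their own buggy code but fails on the reference implementation, which is exactly the signal that the test, not the reference, is wrong.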
Man, in the midst of writing this I’ve realized that those stories I’ve read about a shortage of good software developers in the workforce might actually be true. I never believed them — even though I see pretty good evidence for them all the time as a TA. The fact that college CS courses don’t do much to prepare people for real-world software development looks like it might be a major reason.
Students don’t buy the “this isn’t what the real world is like” argument, though. They think that because this technically isn’t what we call “the real world”, if we go out of our way to create a real-world-like process of judging their work, we’re being unfair. I strongly think the opposite is true: if we don’t try to prepare them for the real world, we’re not doing our jobs right. In the real world, nobody is going to walk up to you and hand you a fully comprehensive test suite for the code you’re supposed to be writing. If you don’t know how to test and debug code, you won’t get far. Once you have a job, they’ll probably assume you can already test and debug competently; if you’re picking it up as you go along, you won’t do very well. College CS courses are the perfect opportunity to drill students in good development habits and techniques (not limited to testing and debugging; this includes source control, documentation, coding style, etc.). Courses that don’t take that opportunity are that much less useful to students.
So you know how at the beginning I implied that I’d write about grade inflation here? As it turns out, I haven’t gotten my thoughts on that topic completely organized, and I really would rather write about it coherently. Also, that topic and this one don’t mesh as neatly as I’d thought. So a post on grade inflation is forthcoming.