Much of this discrepancy is due to the "first game" phenomenon. Go programs play moves that look good, and, applying their heuristics for judging human strength, naive human players are wont to give programs more credit than they are due. The result of this respect is that in initial games, human players tend to play in safe, orthodox ways that fail to probe and reveal the weaknesses of the programs.
Once a player has gotten a few games with a particular program under his belt -- or even if he is merely familiar with computer go programs in general -- the effective level of the program's play drops precipitously. Human players learn to make unorthodox moves that challenge a program's understanding of the meanings of moves, and accordingly, programs make big mistakes.
How then should we rate the go programs? Since humans can learn and programs can't, is it fair to let humans use their ability to identify and exploit specific (or for that matter, general) weaknesses? Does the stable level of play that a program reaches against an experienced player really reflect the program's "true" level better than its initial performance? My feeling is, "yes."
Part of the reason that games like go and chess are considered useful domains for research in AI is that the success of a program can be easily assessed by arranging a competition with a human player of known rank. Why confound this property by imposing conditions for the competition? What is gained, except a warm fuzzy feeling for go programmers? Why should we rate go programs based only on their strengths and not on their weaknesses? We certainly don't rate human players that way! Finally, I would point out that in principle there's nothing to keep programmers from writing programs that learn from their mistakes -- why build the assumption of defeat in this task into our thinking?
This is, of course, a matter of speculation, but I'm inclined to disagree. My take on it is that, probably because of their use of patterns, go programs often play good shape without understanding the meaning of the moves. This is basically syntax without semantics, and it results in programs that make huge blunders that seem quite at odds with their overall level of play.
My guess is that it won't be possible to "patch over" these blunders; in my estimation they reflect a fundamental shortcoming of existing programs. Specifically, I think that no program will reach the "true" 1-dan level without a creditable life-and-death analysis module. Of course, programmers will develop these eventually, but I'm inclined to classify this as a major hurdle rather than something that can be overcome by incremental improvements (though I don't want to make too much of the distinction).