Origins of 10X – How Valid is the Underlying Research?

Posted on January 9, 2011, 6:15 PM by Steve McConnell to 10x Software Development

I recently contributed a chapter to Making Software (Oram and Wilson, eds., O'Reilly, 2011). The purpose of this edited collection of essays is to pull together research-based writing on software engineering. In essence, the purpose is to say, "What do we really know (quantitatively based), and what do we only kind of think we know (subjectively based)?" My chapter, "What Does 10x Mean?", is an edited version of my 2008 blog entry "Productivity Variations Among Developers and Teams: The Origin of 10x." The chapter focuses on the research that supports the claim of 10-fold differences in productivity among programmers.

Laurent Bossavit published a critique of my blog entry on his company's website in French. That critique was translated into English on a different website.

The critique (or its English translation, anyway) is quite critical of the claim that programmer productivity varies by 10x, quite critical of the research foundation for that claim, and quite critical of me personally. The specific nature of the criticism gives me an opportunity to talk about the state of research in software development and my approach to writing about it, and to revisit the 10x issue, which is one of my favorite topics.

The State of Software Engineering Research

Bossavit's criticism of my writing is notable in that it cites my work and comments on some of the sources I cite, but doesn't cite any other software-specific research of its own.

In marked contrast, while I was working on the early stages of Code Complete, 1st Ed., I read a paper by B. A. Sheil titled "The Psychological Study of Programming" (Computing Surveys, Vol. 13. No. 1, March 1981). Sheil reviewed dozens of papers on programming issues with a specific eye toward the research methodologies used. The conclusion of Sheil's paper was sobering. The programming studies he reviewed failed to control for variables carefully enough to meet research standards that would be needed for publication in other more established fields like psychology. The papers didn't achieve levels of statistical significance good enough for publication in other fields either. In other words, the research foundation for software engineering (circa 1981) was poor.

One of the biggest issues identified was that studies didn't control for differences in individual capabilities. Suppose you have a new methodology you believe increases productivity and quality by 50%. If there are potential differences as large as 10x between individuals, the differences arising from individuals in any given study will drown out any differences you might want to attribute to a change in methodology. See Figure 1.

[Figure 1: Productivity variation]

This is a very big deal because almost none of the research at the time I was working on Code Complete 1 controlled for this variable. For example, a study would have Programmer Group A read a sample of code formatted using Technique X and Programmer Group B read a sample of code formatted using Technique Y. If Group A was found to be 25% more productive than Group B, you don't really know whether it's because Technique X is better than Technique Y and is helping productivity, or whether it's because Group A started out being way more productive than Group B and Technique X actually hurt Group A's productivity.
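
A rough way to make this dynamic concrete is to simulate it. The sketch below (Python, with illustrative numbers I've assumed for this post rather than data from any study discussed here) supposes that individual productivity spans roughly a 10x range and that "Technique X" genuinely adds 25%; even then, with groups of ten, the group using the better technique can easily come out looking worse simply because of who happened to land in which group.

    import random

    # Illustrative simulation only: assumed numbers, not data from any cited study.
    # Individual baseline productivity spans roughly a 10x range (1.0 to 10.0);
    # "Technique X" is assumed to genuinely improve productivity by 25%.

    def group_output(n, technique_boost):
        # Total output of n programmers, each with a random baseline productivity.
        return sum(random.uniform(1.0, 10.0) * technique_boost for _ in range(n))

    random.seed(1)
    trials = 10000
    group_size = 10
    technique_x_looks_worse = 0

    for _ in range(trials):
        group_a = group_output(group_size, technique_boost=1.25)  # uses Technique X
        group_b = group_output(group_size, technique_boost=1.00)  # control group
        if group_a < group_b:
            technique_x_looks_worse += 1

    print(f"Technique X looked harmful in {technique_x_looks_worse / trials:.0%} of trials")

In a run like this the genuinely better technique looks harmful in a nontrivial fraction of the trials, which is exactly the problem Sheil identified: without controlling for individual capability (or using far larger samples), a modest methodology effect is indistinguishable from noise.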

Since Sheil's paper in 1981, this methodological limitation has continued to show up in productivity claims about new software development practices. For example, in the early 2000s the "poster child" project for Extreme Programming was the Chrysler C3 project. Numerous claims were made for XP's effectiveness based on the productivity of that project. I personally never accepted the claims for the effectiveness of the XP methodology based on the C3 project because that project included rock star programmers Kent Beck, Martin Fowler, and Ron Jeffries, all working on the same project. The productivity of any project those guys work on would be at the top end of the bar shown on the left of Figure 1. Those guys could do a project using batch mode processing and punch cards and still be more productive than 95% of the teams out there. Any methodological variations of 1x or 2x due to XP (or -1x or -2x) would be drowned out by the variation arising from C3's exceptional personnel. In other words, considering the exceptional talent on the C3 project, it was impossible to tell whether the C3 project's results were because of XP's practices or in spite of XP's practices.

My Decision About How to Write Code Complete

Bringing this all back to Code Complete 1: I hit a point early in the writing where I was aware of Sheil's research, aware of the limitations of many of the studies I was using, and trying to decide what kind of book I wanted to write.

The first argument I had with myself was how much weight to put on all the studies I had read. I read about 600 books and articles as background for Code Complete. Was I going to discard them altogether? I decided, No. The studies might not be conclusive, but many of them were surely suggestive. The book was being written by me and ultimately reflected my judgment, so whether the studies were conclusive or suggestive, my role as author was the same--separate the wheat from the chaff and present my personal conclusions. (There was quite a lot of chaff. Of the 600 books and articles I read, only about half made it into the bibliography. Code Complete's bibliography includes only those 300 books and articles that were cited somewhere in the book.)

The second argument I had with myself was how much detail to provide about the studies I cited. The academic side of me argued that every time I cited a study I should explain the limitations of the study. The pragmatic side of me argued that Code Complete wasn't supposed to be an academic book; it was supposed to be a practical book. If I went into detail about every study I cited, the book would be 3x as long without adding any practical value for its readers.

In the end I felt that detailed citations and elaborate explanations of each study would detract from the main focus of the book. So I settled on a citation style in which I cited (Author, Year) keyed to fuller bibliographic citations in the bibliography. I figured readers who wanted more academic detail could follow up on the citations themselves.

A Deeper Dive Into the Research Supporting "10x"

After settling on that approach for Code Complete 1 (back in 1991), I've continued to use it in most of the rest of my writing, including the chapter I contributed to Making Software.

One limitation of my approach has been that, with my terse citation style, someone who is motivated enough to follow up on the citations might not be able to find the part of the book or article that I was citing, or might not understand the specific way in which the material I cited supports the point I'm making. That appears to have been the case with Laurent Bossavit's critique of my "10x" explanation.

Bossavit goes point by point through my citations and is unable to find support for the claim of 10x differences in productivity. Let's follow the same path and fill in the blanks.

Sackman, Erickson, and Grant, 1968. Here is my summary of the first research to find 10x differences in programmer productivity:

Detailed examination of Sackman, Erickson, and Grant's findings shows some flaws in their methodology (including combining results from programmers working in low level programming languages with those working in high level programming languages). However, even after accounting for the flaws, their data still shows more than a 10-fold difference between the best programmers and the worst.

In years since the original study, the general finding that "There are order-of-magnitude differences among programmers" has been confirmed by many other studies of professional programmers (Curtis 1981, Mills 1983, DeMarco and Lister 1985, Curtis et al. 1986, Card 1987, Boehm and Papaccio 1988, Valett and McGarry 1989, Boehm et al 2000).

The research on variations among individual programmers began with Sackman, Erickson, and Grant's study published in 1968. Bossavit states that the 1968 study focused only on debugging, but that is not correct. As I stated in my blog article, the ratio of initial coding time between the best and worst programmers was about 20:1. The difference in program sizes was about 5:1. The difference in debugging was the most dramatic difference, at about 25:1, but it was not the only area in which differences were found. Differences found in coding time, debugging time, and program size all support a general claim of "order of magnitude" differences in productivity, i.e., a 10x difference.

An interesting historical footnote is that Sackman, Erickson, and Grant did not set out to show a 10x or 25x difference in productivity among programmers. The purpose of their research was to determine whether programming online offered any real productivity advantage compared to programming offline. What they discovered, to their surprise, was that, as in Figure 1, any difference in online vs. offline productivity was drowned out by the productivity differences among individuals. The factor they set out to study would be irrelevant today. The conclusion they stumbled onto by accident is one that we're still talking about.

Curtis 1981. Bossavit criticizes my (Curtis 1981) citation by stating

The 1981 Curtis study included 60 programmers, which once again were dealing with a debugging rather than a programming task.

I do not know why he thinks this statement is a criticism of the Curtis study. In my corner of the world debugging is not the only programming task, but it certainly is an essential programming task, and everyone knows that. The Curtis article concludes that, "a statement such as 'order of magnitude differences in the performance of individual programmers' seems justified." The (Curtis 1981) citation directly supports the 10x claim--almost word for word.

Curtis 1986. Moving to the next citation, Bossavit states that, "the 1986 Curtis article does not report on an empirical study." I never stated that Curtis 1986 was an "empirical study." Curtis 1986 is a broad paper that touches on, among other things, differences in programmer productivity. Bossavit says the paper "offers no support for the '10x' claim." But the first paragraph in section II.A. of the paper (p. 1093) summarizes 4 studies with the overall gist of the studies being that there are very large differences in productivity among programmers. The specific numbers cited are 28:1 and 23:1 differences. Clearly that again offers direct support for the 10x claim.

Mills 1983. The "Mills 1983" citation is to a book by Harlan Mills titled Software Productivity in which Mills cites 10:1 differences in productivity not just among individuals but also among teams. As Bossavit points out, the Mills book contains "experience reports," among other things. Apparently Bossavit doesn't consider an "experience report" to be a "study," but I do, which is why I cited Mills' 1983 book.

DeMarco and Lister 1985. Bossavit misreads my citation of DeMarco and Lister 1985, assuming it refers to their classic book Peopleware. That is a natural assumption, but as I stated clearly in the article's bibliography, the reference was to their paper titled "Programmer Performance and the Effects of the Workplace," which was published a couple of years before Peopleware.

Bossavit's objection to this study is

The only "studies" reported on therein are the programming contests organized by the authors, which took place under loosely controlled conditions (participants were to tackle the exercises at their workplace and concurrently with their work as professional programmers), making the results hardly dependable.

Editorial insinuations aside, that is a correct description of what DeMarco and Lister reported, both in the paper I cited and in Peopleware. Their 1985 study had some of the methodological limitations Sheil discussed in 1981. Having said that, their study supports the 10x claim in spades and is not subject to many of the more common methodological weaknesses present in other software engineering studies. DeMarco and Lister reported results from 166 programmers, which is a much larger group than used in most studies. The programmers were working professionals rather than students, which is not always the case. The focus of the study was a complete programming assignment--design, code, desk check, and for part of the group, test and debug.

The programmers in DeMarco and Lister's study were trying to complete an assignment in their normal workplace. Bossavit seems to think that undermines the credibility of their research. I think it enhances the credibility of their research. Which do you trust more: results from a study in which programmers worked in a carefully controlled university environment, or results from a study in which programmers were subjected to all the day-to-day interruptions and distractions that programmers are subjected to in real life? Personally I put more weight on the study that more closely models real-world conditions, which is why I cited it.

As far as the 10x claim goes, Bossavit should have looked at the paper I cited rather than the book. The paper shows a 5.6x difference between the best and worst programmers--among the programmers who finished the assignment. About 10% of the programmers weren't able to complete the assignment at all. That makes the difference between best and worst programmers essentially infinite - and certainly supports the round-number claim of 10x differences from the best programmers to the worst.

Card 1987. Bossavit says,

The 1987 Card reference isn't an academic publication but an executive report by a private research institution, wherein a few tables of figures appear, none of which seem to directly bear on the "10x" claim.

The publication is an article in Information and Software Technology, which is "the international archival journal focusing on research and experience that contributes to the improvement of software development practices." There is no basis for Bossavit to characterize Card's journal article as an "executive report."

Bossavit claims that none of the tables of figures "seem to directly bear on the '10x' claim." But on p. 293 of the article, Figure 3, titled "Programmer productivity variations," shows two graphs: a "large project" graph in which productivity ranges from 0.9 to 7.9 (a difference of 8.8x), and a "small project" graph with a productivity range of 0.5 to 10.8 (a difference of 21.6x). These "programmer productivity variation" graphs support the 10x claim quite directly.

Boehm and Papaccio 1988. I will acknowledge that this wasn't the clearest citation for the underlying research I meant to refer to. I probably should have cited Boehm 1981 instead. In 1981, Barry Boehm published Software Engineering Economics, the first comprehensive description of the Cocomo estimation model. The adjustment factors for the model were derived through analysis of historical data. The model shows differences in team productivity based on programmer capability of 4.18 to 1. This is not quite an order of magnitude, but it is for teams, rather than for individuals, and generally supports the claim that "there are very large differences in capabilities between different individuals and teams."

Boehm 2000. Bossavit states that he did not look at this source. Boehm 2000 is Software Cost Estimation with Cocomo II, the update of the Cocomo model that was originally described in Boehm 1981. In the 2000 update, the factors in the Cocomo model were calibrated using data from a database of about 100 projects. Cocomo II analyzes the effects of a number of personnel factors. According to Cocomo II, if you compare a team made up of top-tier programmers, experienced with the application, programming language, and platform they're using, to a team made up of bottom tier programmers, inexperienced with the application, programming language, and platform they're using, you can expect a difference of 5.3x in productivity.

The same conclusion applies here that applies to Boehm 1981: This is not quite an order of magnitude difference, but since it applies to teams rather than individuals, it generally supports the claim that "there are very large differences in capabilities between different individuals and teams." It is also significant that, according to Cocomo II, the factors related to the personnel composing the team affect productivity more than any other factors.
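
For readers who haven't worked with Cocomo-style models, the sketch below (Python, using placeholder multiplier values made up for illustration, not the published Cocomo II calibration) shows the general shape of the calculation: estimated effort is a nominal effort figure scaled by a product of effort multipliers, so the productivity ratio between two teams on the same project is simply the ratio of their multiplier products. Running the same arithmetic with the published personnel-factor values McConnell describes is how the 5.3x figure quoted above arises.

    # Sketch of how a Cocomo-style model turns personnel ratings into a team-level
    # productivity ratio. Constants and multiplier values are placeholders chosen
    # for illustration, NOT the published Cocomo II calibration.

    def effort_person_months(size_ksloc, effort_multipliers, a=2.9, exponent=1.1):
        """Nominal effort (a * size^exponent) scaled by the product of effort multipliers."""
        product = 1.0
        for multiplier in effort_multipliers.values():
            product *= multiplier
        return a * (size_ksloc ** exponent) * product

    # Placeholder ratings: values below 1.0 reduce effort, values above 1.0 increase it.
    strong_team = {"capability": 0.80, "application_experience": 0.85,
                   "platform_experience": 0.90, "language_tool_experience": 0.90}
    weak_team = {"capability": 1.30, "application_experience": 1.20,
                 "platform_experience": 1.15, "language_tool_experience": 1.15}

    size = 100  # KSLOC; the ratio below is independent of project size
    ratio = effort_person_months(size, weak_team) / effort_person_months(size, strong_team)
    print(f"With these placeholder multipliers, the weak team needs {ratio:.1f}x the effort")

Because the multipliers combine multiplicatively, several modest per-factor differences compound into the large team-level ratios discussed in the text.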

Valett and McGarry 1989. Valett and McGarry provide additional detail from the same data set used by Card 1987 and also cite individual differences ranging from 8.8x to 21.6x. Valett and McGarry's conclusion is based on data from more than 150 individuals across 25 major projects and includes coding as well as debugging. Bossavit claims this study amounts to a "citation of a citation," but I don't know why he claims that. Valett and McGarry were both at the organization described in the study and directly involved in it. And the differences cited certainly support my general claim of 10x differences in productivity among programmers.

Reaffirming: Strong Research Support for the 10x Conclusion

To summarize, the claim that Bossavit doesn't like is this:

The general finding that "There are order-of-magnitude differences among programmers" has been confirmed by many other studies of professional programmers (Curtis 1981, Mills 1983, DeMarco and Lister 1985, Curtis et al. 1986, Card 1987, Boehm and Papaccio 1988, Valett and McGarry 1989, Boehm et al 2000).

As I reviewed these citations once again in writing this article, I concluded again that they support the general finding that there are 10x productivity differences among programmers. The studies have collectively involved hundreds of professional programmers across a spectrum of programming activities. Specific differences range from about 5:1 to about 25:1, and in my judgment that collectively supports the 10x claim. Moreover, the research finding is consistent with my experience, in which I have personally observed 10x differences (or more) between different programmers. I think one reason the 10x claim resonates with many people is that many other software professionals have observed 10x differences among programmers too.

Bossavit concludes his review of my blog entry / book chapter by saying this:

What is happening here is not pretty. I'm not accusing McConnell here of being a bad person. I am claiming that for whatever reasons he is here dressing up, in the trappings of scientific discourse, what is in fact an unsupported assertion meshing well with his favored opinion. McConnell is abusing the mechanism of scientific citation to lend authority to a claim which derives it only from a couple studies which can be at best described as "exploratory" (and at worst, maybe, as "discredited").

Obviously I disagree with Bossavit's conclusion. Saying he thinks there are methodological weaknesses in the studies I cited would be one kind of criticism that might contain a grain of truth. None of the studies are perfect, and we could have a constructive dialog about that. But that isn't what he says. He says I am making "unsupported assertions" and "cheating with citations." Those claims are unfounded. Bossavit seems to be aspiring to some academic ideal in which the only studies that can be cited are those that are methodologically pure in every respect. That's a laudable ideal, but it would have the practical effect of restricting the universe of allowable software engineering studies to zero.

Having said that, the body of research that supports the 10x claim is as solid as any research that's been done in software engineering. Studies that support the 10x claim are singularly not subject to the methodological limitation described in Figure 1, because they are studying individual variability itself (i.e., only the left side of the figure). Bossavit does not cite even one study--flawed or otherwise--that counters the 10x claim, and I haven't seen any such studies either. The fact that no studies have produced findings that contradict the 10x claim provides even more confidence in the 10x claim. When I consider the number of studies that have been conducted, in aggregate I find the research to be not only suggestive, but conclusive--which is rare in software engineering research.

As for my writing style, even if people misunderstand what I've written from time to time, I plan to stand by my practical-focus-with-minimal-citations approach. I think most readers prefer the one-paragraph summary with citations that I repeated at the top of this section to the two dozen paragraphs that academically dissect it. It's interesting to go into that level of detail once in a while, but not very often.

References

Boehm, Barry W., and Philip N. Papaccio. 1988. "Understanding and Controlling Software Costs." IEEE Transactions on Software Engineering SE-14, no. 10 (October): 1462-77.

Boehm, Barry. 1981. Software Engineering Economics. Boston, Mass.: Addison Wesley.

Boehm, Barry, et al. 2000. Software Cost Estimation with Cocomo II. Boston, Mass.: Addison Wesley.

Boehm, Barry W., T. E. Gray, and T. Seewaldt. 1984. "Prototyping Versus Specifying: A Multiproject Experiment." IEEE Transactions on Software Engineering SE-10, no. 3 (May): 290-303.

Card, David N. 1987. "A Software Technology Evaluation Program." Information and Software Technology 29, no. 6 (July/August): 291-300.

Curtis, Bill. 1981. "Substantiating Programmer Variability." Proceedings of the IEEE 69, no. 7: 846.

Curtis, Bill, et al. 1986. "Software Psychology: The Need for an Interdisciplinary Program." Proceedings of the IEEE 74, no. 8: 1092-1106.

DeMarco, Tom, and Timothy Lister. 1985. "Programmer Performance and the Effects of the Workplace." Proceedings of the 8th International Conference on Software Engineering. Washington, D.C.: IEEE Computer Society Press, 268-72.

DeMarco, Tom, and Timothy Lister. 1999. Peopleware: Productive Projects and Teams, 2d Ed. New York: Dorset House.

Mills, Harlan D. 1983. Software Productivity. Boston, Mass.: Little, Brown.

Sackman, H., W.J. Erikson, and E. E. Grant. 1968. "Exploratory Experimental Studies Comparing Online and Offline Programming Performance." Communications of the ACM 11, no. 1 (January): 3-11.

Sheil, B. A. 1981. "The Psychological Study of Programming." Computing Surveys 13, no. 1 (March).

Valett, J., and F. E. McGarry. 1989. "A Summary of Software Measurement Experiences in the Software Engineering Laboratory." Journal of Systems and Software 9, no. 2 (February): 137-48.

JeanHuguesRobert said:

January 9, 2011 7:58:PM

Two things.

1/ As a software developer myself, yes, 10x, easy.

2/ English is not Laurent Bossavit's mother tongue, and French people are not English (we'd rather "overstate" than "understate") ;)

3/ That the team members would be the number one factor explaining productivity is obviously not the best news for those who promote better methodologies

Well, 3 things actually.

Gaurav Sharma said:

January 9, 2011 8:24:PM

@JeanHuguesRobert No. 3 nails it. It took around 10 seconds to google this guy up and learn that he's "Agile".

Steve McConnell said:

January 9, 2011 9:00:PM

Re: No. 3 - Maybe. How do the 10x programmers get to be 10x? Part of it has to be the use of better practices--agile or otherwise. If a programmer is truly 10x and agile practices make them 25% better, they'll be 12.5x. If another programmer is 5x and agile practices make them 25% better, they'll be 6.25x. The better programmer is still better, but *both* of them are better than when they started, and that's really the objective, isn't it? Start with what you've got, and make the most of it -- whether you start at 1x or 10x.

Larry OBrien said:

January 9, 2011 9:12:PM

Isn't it pointless to say "the best are X better than the worst"? Surely the question is the variance in the small population of a team and its candidates. It is in that context that I have a problem with the common belief that the distribution is order-of-magnitude wide. In my experience, std dev is likely in the ~1-2x range.

Geoff H said:

January 10, 2011 2:27:AM

Assume that of two people, one has never programmed a specific task and the other has programmed it 4 times before.

The one with the knowledge of 4 full product life cycles will be repeating known territory, and possibly innovating along the way.

The one without previous experience in this domain will have to figure it out as they go. False starts will be made; by the end, things will be visible that couldn't have been known until the project was finished; and the project's quality will suffer for it, as will the time to get it fully functional.

Also, experts can more easily find patterns similar to solutions they already know, and adapt those solutions to fit different problems, because those too are repeatable patterns.

That's my take.  Plus, I see it all the time, as well as the infinite case: truly difficult work cannot be done by some programmers, they aren't there yet.

Paul Johnson said:

January 10, 2011 6:54:AM

On the COCOMO II: I think your multiple of 5.3 from "top tier" versus "bottom tier" teams comes from setting those Effort Adjustment Factors to be either "High" or "Low".  However since you are looking for the extremes, the top few percent against the bottom few percent, it would be better to use "V.Low" and "V.High".  With a 100kSLOC project I get 1989 months effort for a "V.Low" team and 117 months effort for a "V.High" team, for a factor of 17 overall.

Secondly, have you seen Lutz Prechelt's paper "An empirical comparison of seven programming languages"?  (IEEE Computer 33(10):23-29, October 2000.)  It shows (amongst other things) the time required to complete a programming problem in several languages, and has plots showing the range.  The "10x" claim is clearly supported by this data.

Josh Lane said:

January 10, 2011 7:06:AM

Assume that 10x "in the large" is true, and that 1-2x is true amongst the cast of available characters for a typical project.

In many cases, 1-2x could be attributed to willful behavior of the participants... explicit management choice to staff principally from one end of the productivity spectrum, small local talent market + unwillingness to use remote staff, unfriendly working conditions, etc. etc.

The point being, perhaps different decisions would create opportunities to attract other talent, and take 1-2x closer to 10x.

Obviously if you're already going out of your way to hire the best of the best (and assuming you're reasonably good at that), this isn't very interesting to you.

But if you've got productivity issues today, and you have flexibility in your hiring and retention practices, perhaps understanding the consequences of increasing the size of your available talent pool is useful.

No silver bullet, but still useful.

Paul Johnson said:

January 10, 2011 7:26:AM

On staffing: bear in mind that those at the top end of the spectrum often have a clear idea of their abilities, and will seek jobs which reward them appropriately, both in job satisfaction and money.  These people will simply not apply for a "Senior java programmer $50k/year in BigCorp" position.

Taran Rampersad said:

January 10, 2011 7:59:AM

I welcome this (I am new to the 10:1 ratio), and I do acknowledge that it almost seems intuitive from my experience.

In the 'real world', though, I offer that it could be dangerous, in that poor project management, poor requirements definition, and poor communication with the client (be it internal, external, or hybrid) can also affect the productivity of the team. I'd say that under the right circumstances, as you point out in the comment, a programmer can get nearer to their potential (though, personally, I'm not an agile aficionado). Combine that with the relative simplicity of programming itself these days, with 'copy and paste' OO happening, and it's difficult to ascertain that any one of the top developers is consistently a top developer in any circumstance. I think you might agree with me (and I look forward to finding out).

Thus, I think that there is a need to balance this research from a management/project management standpoint as well. My concern is that there are too many young software project managers out there who may read about such development productivity and not realize that they themselves could and should be contributors to the productivity of the project as well as the productivity of the developers. This has nothing to do with what you wrote... but everything to do with what you wrote?

In the end, there still isn't a silver bullet.

Steve McConnell said:

January 10, 2011 9:35:AM

The 10x difference is just the difference attributable to personnel capability. When you throw in other factors, you can get differences far higher than 10x.

Throw in differences in language level (the Prechelt paper), differences in working environment, differences in companies' ability to minimize turnover, and other non-capability factors and you can easily get to a difference of greater than 10x.

As Larry O'Brien pointed out in his comment above (and as I pointed out in my original blog article on this topic), I think it's unusual to see a full 10x difference within any single organization. Organizations tend to hire from the top of the talent pool, or from the bottom, and so my personal observation is that within any given development team the real differences are probably in the 3x - 5x range.

If you have a large enough organization, though (Boeing, Microsoft, Google, etc.) I think you'll find the full 10x range within the organization, although not necessarily within any given dev team.

Lorin Hochstein said:

January 10, 2011 10:51:AM

Steve:

Have you read Lutz Prechelt's tech report "The 28:1 Grant/Sackman legend is misleading, or: How large is interpersonal variation really?" (1999)? As far as I know, it's never been published in a peer-reviewed venue (it was submitted to TSE), but it's an interesting attempt to quantify productivity variation by analyzing the results from previous studies.

Carl Delsey said:

January 10, 2011 11:17:AM

To me the DeMarco and Lister paper doesn't really support the 10x theory. While they do note a 5.6x difference between the best and worst, when they attempt to factor out environmental effects, they suggest only a 1.21x difference in individuals.

They state "...characteristics of the workplace and corporate culture ... may explain much of the overall variation in programmer performance." The study was conducted with pairs of individuals from the same organization and they only saw a 1.21x difference between individuals in each pair.

Steve McConnell said:

January 10, 2011 1:59:PM

@Lorin, Yes, I read the Prechelt paper a long time ago. Prechelt's correction of Sackman, Erickson and Grant still leaves 9.5:1 and 14:1 differences, which still supports the 10x claim.

The Prechelt paper is interesting because it says, among other things, "if we ignore the most extreme cases, the differences between best and worst are by far not as dramatic..." That's kind of the point, isn't it? That there are extreme cases? I don't think ignoring the extreme cases makes sense.

He also compares the middle of the top quartile to the middle of the bottom quartile (i.e., the 87.5th percentile to the 12.5th percentile), i.e., removes the best 12.5% and worst 12.5% of the results from his comparisons. In other places he compares quartiles. And in other places he looks at 1 standard deviation (i.e., the middle 67%). None of those approaches makes sense to me. He's basically removing the chunks of people who are best and worst from his data set. If we're really trying to determine the range from best to worst, we can't remove the best and the worst from the data set!

@Carl - DeMarco and Lister don't make any claims about what happens when you factor out environmental differences, per se. The 1.21x you cite is the *average* difference between individuals within the same pair of programmers. They don't report what the *maximum* difference is between the individuals in a pair, and it's the maximum difference that is of interest to the 10x issue.

Bryan Pflug said:

January 10, 2011 4:27:PM

First, thanks for being intellectually honest and open-minded regarding the original criticism. You set a good example for us all.

I think most practitioners accept the observations and common sense that others have offered in response to your article. However, there will always be skeptics of any assertion, no matter how much evidence you may assemble. In my following remarks, I hope to offer you additional ammunition to build your case that the range in individual performance is as wide as indicated. Yet like Bossavit, I also have concerns about the state of practice in software engineering research when compared with other scientific disciplines; but your request that he produce counter-examples is exactly the appropriate response, since in the absence of alternative studies, he is merely offering his opinion against the body of evidence that I personally find quite compelling.

The sources of variability observed in the studies you cite seem to fall into many different buckets, and it is not clear that they are all based upon individual differences. Until we can meaningfully control all but a few of the independent variables which affect productivity, and replicate experiments done by others over time, it is indeed risky to draw conclusions about the magnitude and relationship of selected variables on dependent factors, and to describe them as causal. While many of the studies you referenced talk about variability across individuals, the difficulty of the different assignments and the experience of the individuals can also be significant factors in productivity. In my experience, individual performance varies widely depending upon the problem itself, the fitness of an individual to a situation (i.e., their experience and training), the degree to which an individual's capabilities can be utilized in an assignment, the available capacity, level of utilization, and raw talent of the individuals themselves, and even how we choose to measure productivity itself.

In my opinion, few studies in software provide sufficient controls that allow conclusions to be drawn about the effects of individual talent alone, and because of measurement differences alone, comparisons across these studies can be even more problematic. However, as you indicate, it is the pattern across these studies that is convincing, and the fact that such broad performance differences should not be a surprise to us. Indeed, studies of human performance in other areas have substantiated that a range of as much as 10 to one exists for creative ventures similar to software development. For example, in Hard Facts, Sutton and Pfeffer report: “Psychologist Dean Keith Simonton, who has spent his career studying greatness and genius, concludes: "No matter where you look, the same story can be told, with only minor adjustments. Identify the 10 percent who have contributed the most to some endeavor, whether it be songs, poems, paintings, patents, articles, legislation, battles, films, designs, or anything else. Count all the accomplishments that they have to their credit. Now tally the achievements of the remaining 90 percent who struggled in the same area of achievement. The first tally will equal or surpass the second tally. Period." One study showed that a mere 16 composers produced about 50 percent of classical music that is performed and recorded today, while 235 others produced the remaining half. Another study found that 10 percent of the authors had written about 50 percent of the books in the Library of Congress. Research on computer programmers showed that the most productive programmers were 10 times more productive than the least productive, and five times more productive than average programmers.”

Perhaps the best data available documenting such individual differences in software in a controlled manner is from the individuals who took the Personal Software Process training from the SEI. That data was collected using a normalized set of measures, has well defined entry criteria for participants, and covers a large student population solving the same set of problems over many years. In A Discipline for Software Engineering (in the book’s section on Productivity Variations) Watts Humphrey reports: “While productivity factors can be useful for statistically controlled populations, they generally are meaningless when applied to uncontrolled populations. This is because those populations could have members who are incapable of satisfactorily completing the project. For example, notice from Fig 1.10 the time 25 graduate students took to write program 1A. The times range from 53 to 1080 minutes. Cumulatively, 50 percent of the students took fewer than 145 minutes and 72 percent took fewer than 236. Overall, there is more than a 20 to 1 ratio between the fastest and the slowest times. When you consider that some of these programs probably still had defects and thus would not work properly, the true range is probably higher.”

When we speak of measuring productivity, my view is that we should be considering measures of the utility of the resulting software for a set of customers, rather than LOC-based measures. My own opinion is that if productivity were measured this way, the range would be even broader than your cited studies indicate. If a programmer solves a problem better with less code than another, he should be considered more productive than the other, not less. Yet current measures and studies fail us in incorporating this important indicator of value into productivity. I believe your original blog’s point is not that we can consistently reduce productivity to a single, precise, numeric value, but rather that there is a huge range of improvement opportunities, offering an order of magnitude in performance benefits between the lowest-performing individuals and top contributors, and that these benefits are available to projects; more evidence of this range, and of the differences between individuals and projects, can be found at www.pflogging.com/productivity.

Finally, a Wikipedia search of the term unwarranted variation will reveal the extent to which variation occurs in even mature disciplines like health care, a much more structured, regulated, and scientifically studied industry than we find in most of the software projects I've seen. Such studies indicate that large variation in individual and team performance in health care is more frequent than we often realize. Further, to your point, the noise such variation introduces obviously obscures our ability to draw meaningful conclusions about the effects that other interventions we may employ can offer us. Perhaps this is why trends within a fixed population, rather than comparisons of absolute numbers across populations, are so popular in both the literature and our industry. Yet relying on such comparison approaches, rather than absolute values, may be robbing us of the ability to leverage important innovations that are typically crucial to long-term industry improvements (see www.pflogging.com/control). Your blog helps build confidence that such improvements continue to be worth pursuing.

Paul Johnson said:

January 11, 2011 2:28:AM

Taran (above) points out that team and project factors can have a big influence on the performance of individuals in the team.  This is true, and of crucial importance.  If a team were merely the sum of its parts then a sufficiently large team of randomly selected programmers would have a very predictable performance due to averaging effects.  However in practice it seems that variation between teams is as great as variation between individuals.  It follows that a team is not merely the sum of its parts, and that certain crucial things happen well in some teams and not in others.  For more on this, see Peopleware by DeMarco & Lister.

Robin Debreuil said:

January 11, 2011 4:36:AM

I think the disturbing part of the 10X statement isn't the 10X as much as the "best programmers". Any task in computing is reasonably easy once you know how to do it. Certainly everyone has a level they can't get beyond, but then the best/worst ratio is certainly infinity, not 10X. For tasks both programmers have similar experience with, I really doubt you see 10X - of course, similar experience almost by default means similar aptitude, so one could argue that is an unfair sampling. But which is it? What are the task/experience controls? What defines the top/bottom sizes? What even defines finished?

I suspect it's like locals and visitors finding their way around a city - some pairings are greater than 10000X and others less than 2X. You can't average things and conclude the best navigators are 10X better than the worst ones based on the raw horsepower of their brains.

While I don't have evidence for any of this, it does strike me that the question isn't really defined and yet the debate is about the answer.

Laurent Bossavit said:

January 11, 2011 5:53:AM

Hi Steve,

Thanks for taking the time to respond to the essay. Your response gives me several things to think about and possibly address in a follow-up, but I want to touch on a few quick points here.

I goofed on the DeMarco and Lister cite in overlooking the primary source. Fortunately it's easily findable as a PDF online.

I goofed on the Card 1987 cite. It does have one table as you report that confirms the 10x claim, and I overlooked that.

That goof doesn't undermine my critique of your citation policy, which I can succinctly restate: no reader, given ONLY the set of sources you have cited, can reassure themselves as to the validity of the underlying research.

For instance the Card 1987 article does not state how many programmers were included in the study, what the methodology was, or which programming languages were included - a factor which could be quite significant given that the table says "productivity is source lines per staff hour".

I've labeled that Card article an "executive report". In truth I don't quite know the appropriate label, but it certainly isn't a "study" in the usual sense; it's an article about the SEL.

What I've gathered (I'm not a scientist) is that what's usually called a "study" is a paper reporting on some research, where the Abstract states the claim, and where various sections report on the methodology, the conclusions, and most importantly the noted "threats to validity".

The job of follow-up studies is to attempt replications that address such threats to validity. If you follow the chain of citations from a recent publication you can thus reassure yourself that the various ways that the original research could have been biased were properly guarded against, and it is this kind of careful work which eventually establishes a scientific fact.

I have a similar issue with your citing Mills 1983. (Unlike perhaps many of your readers, I own that book. It's a collection of essays, editorials, musings. It's not an academic work.) The main thing is: what is the precise page number where the "10x" claim is to be found in the Mills book?

I fully stand by the term "cheating", for instance, as far as this particular cite is concerned. It is cheating to make your reader have to check a whole book carefully to locate the exact spot where the author is saying what you claim supports the claim of your own article. It is dreadfully unfair, particularly so for that particular book.

I could easily cherry-pick quotes from that Mills book that appear to contradict 10x: "left to programmers, laissez faire, we could expect a productivity improvement of 50%, but if we managed it in we could expect a factor of 3".

Or if I were allowed to cherry-pick, I could point to a section Mills titles "Empirical Evidence" (which in fact cites none) where he alludes to *two* orders of magnitude differences, and attempt to discredit the source for manifest exaggeration.

My point is that if you don't make it dead easy to check your cites, you undermine the quality of follow-up research on your topic.

And if your cites aren't in fact of the sort that can be scrutinized for the "methodological weaknesses" you lambast me for not pointing out, you simply shouldn't cite them!

Here is (what I suppose is) the relevant quote by Mills, on page 265:

"There is a 10 to 1 difference in productivity among practicing programmers today - that is, among programmers certified by their industrial position and pay. That differential is undisputed, and it is a sobering commentary on our ability to mesure and enforce productivity standards in the industry."

That's it. That's the entire extent of empirical evidence OR personal experience bearing on the 10x claim given by Mills in this, the last article in the book; and as far as I can tell in the whole book. Bald assertion, nothing more.

Saying that I am "aspiring to some academic ideal in which the only studies that can be cited are those that are methodologically pure in every respect" is a pure straw man argument. The Mills book is not a study. It is not "methodologically" anything, pure or flawed. It is a container of Mills' opinions - but we are drowning in a superabundance of opinions.

I'll have to stop here before this turns into a lecture. (Perhaps it already has; and who am I to lecture you? "Some agile guy", the ad hominem artists will say. True, but irrelevant.)

So let me close with this. Fact and opinion: they aren't the same thing. Calling an opinion a fact doesn't make it one.

Dave Nicolette said:

January 11, 2011 6:53:AM

The discussion reminds me of a study I read in the late 1980s dealing with the relative productivity of COBOL programmers. Unfortunately, I cannot find the results online now. I think the company doing the research was called Bachman.

In any case, they found a productivity difference of 26:1 between the best and the average COBOL programmers. This was a greater difference than they had expected, so they followed up to find out what accounted for it. They discovered the difference was neither methodology nor technical skills. Instead, the more-productive programmers were those who networked with their colleagues, while the less-productive ones were those who sat alone and struggled through problems without help.

A second comment (opinion, of course) is that "productivity" may not be the right thing to measure. "Productivity" means the quantity of output produced per unit of time. A highly productive team or individual may only be generating a mass of work-in-process inventory and/or defects. Other success factors are more meaningful, and a focus on productivity may lead us down the wrong path altogether, regardless of the validity of the analysis methods employed in a study. Just a thought.

Laurent Bossavit said:

January 11, 2011 6:57:AM

@Bryan

Thoughtful comment above which also has lots of food for thought that I hope to digest over time. However allow me to pick a nit:

"Another study found that 10 percent of the authors had written about 50 percent of the books in the Library of Congress."

I've tried to track that down, because I find it somewhat surprising (which triggers skepticism); D.K. Simonton in "Scientific Genius" doesn't make that exact claim, only that the study cited (Dennis 1955, which seems to be his "Variations in productivity among creative workers") "found the same elitism [...] in the Library of Congress [...] as he did in the scientific disciplines".

The form of your claim is "Sutton and Pfeffer say that D.K. Simonton says that Dennis says that 10 percent of the authors have written about 50 percent of the books in the Library of Congress". Well, it's what Sutton and Pfeffer say all right (Google confirms it). But they do not give a direct supporting citation for that claim. Simonton doesn't, and the 1955 article by Dennis isn't online so I cannot check the claim.

How sure are *you* personally of the final claim, that "10 percent of the authors have written about 50 percent of the books in the Library of Congress"? At what odds would you take a bet on it?

I have no doubt that this kind of ratio obtains where *literary fame* is concerned, and in fact nosing around for confirmation of the Library of Congress thing led me to one study which states this clearly enough in the abstract. I can more easily imagine that 10 percent of the authors *sell* 50 percent of the books. But the process whereby the Library of Congress accumulates books isn't the same, and I would intuitively expect it to be more egalitarian, than the process whereby books are sold.

I want my beliefs to be based on something more solid than "someone says that someone says that someone says so".

Ron Jeffries said:

January 11, 2011 9:22:AM

A, thanks for the compliment. Ancient history but KB and MF and I didn't program on C3. Still, it's a fair cop that just being around those other two guys is worth a lot.

B, I tend to agree with Laurent that the citations are difficult in that they don't point to the relevant material. And in fact I suspect that what science there is around 10X may not be very good.

That said, I can name two programmers who differ by 10X or more in capability. So it may be weak citations, it may be weak science but I am quite sure that there is //at least// a 10X variance in programmer productivity.

Interesting, though. And good job not taking things personally, Steve!

Steve McConnell said:

January 11, 2011 9:48:AM

@Bryan, Thanks for the pointer to the Humphrey book. I had never connected the dots between that book and the 10x topic but I agree with you that Humphrey's research on PSP can be used as another study that supports the 10x claim.

On the Library of Congress point, coincidentally I was looking last weekend at the Publisher's Weekly list of best selling novels by decade. E.g., the 1980s: en.wikipedia.org/.../Publishers_Weekly_list_of_bestselling_novels_in_the_United_States_in_the_1980s. It struck me how often some of the authors showed up on the list. Stephen King, Tom Clancy, Danielle Steel, and Robert Ludlum show up in the top 10 list for the year several times each.

As I point out in my book chapter in Making Software, the 10x difference in productivity in programmers really shouldn't be that surprising, because there are 10x differences in productivity among people in every other profession too. My conclusion about the 10x difference among programmers is really just saying that programming is *not* the exception to the rule.

Steve McConnell said:

January 11, 2011 9:57:AM

@Laurent, I think I outlined the reasons for my citation style clearly in the recent blog post. The (author date) style is very common in research literature (e.g., IEEE Software or IEEE Transactions on Software Engineering), so you are basically holding my work to a different citation standard than other research.

I'm not sure what to say about your definition of "study." Perhaps the equivalent word in French has slightly different connotations than the word does in English. I believe I used the word correctly in English. I didn't say "other flawless, rigorously controlled studies." I said "other studies," and I think my explanation in this blog post sufficiently explains that my citations were in fact "other studies."

I don't agree that the purpose of every study that followed Sackman, Erickson, and Grant was to precisely replicate their methodology and validate their conclusions. The purpose of each study was defined by each study's authors. They overlapped in purpose with SE&G. That does not imply any requirement to be the same, and it doesn't prevent them from supporting the *general* conclusion of SE&G. In my book chapter and blog post I was careful to say that "the *general finding* that 'There are order-of-magnitude differences among programmers' has been confirmed by many other studies." "General finding" means "general", not "exactly the same."

On the rest of the points, I think we've both made our arguments. Other readers can decide for themselves whether the studies really support the 10x claim. As I've stated, I think the 10x claim is supported as well as any other claim in software engineering research.

Laurent Bossavit said:

January 11, 2011 10:38:AM

Steve,

Do you confirm that the quotation by Mills on p265 is what you had in mind when you used that author and date for a citation?

Steve McConnell said:

January 11, 2011 11:20:AM

@Laurent, I don't have the Mills book here, but that quote sounds like what I had in mind. When I cited that book I was also considering that quote to be in the context of the full body of Mills' work (some of which is contained in that book) -- including his own performance as Chief Programmer on the famous Chief Programmer Team project where he was *far more* than 10x as productive as the average programmer of his day (83,000 LOC on punch cards in one year in the 1960s -- wow).  

Patrick Morrison said:

January 11, 2011 6:31:PM

Is it time to invest in a new study (or four) that addresses the question of programmer productivity?  What data would we need to settle these questions, or at least discuss them without frustration?

Ingo said:

January 17, 2011 6:15:AM

As an invited speaker I attended a student programming contest. The students had between 6 and 12 months of preparation time for the contest. The environment was the same for all participants, except that they could bring their own keyboard and two written pages of individual information. There was no internet access to google code. Languages allowed were C, C++ and Java.

In total there were more than 60 teams of 3 persons each. The results were that for a fairly easy task the time to provide a correct solution varied by a factor of 24, with 8 teams failing to provide a solution at all. The more complex the task, the closer the times for providing a correct solution were. On the other hand, for complex tasks the number of teams failing to provide a solution rose as well.

Before that, I was also stuck with the 10x figure, but these results were even more sobering or should I say horrifying (seen from a customer's view).

Laurent Bossavit said:

January 17, 2011 11:59:AM

The 1999 Prechelt paper mentioned above reviews many studies similar in some way to those cited in the original 10x article and the "Making Software" chapter. It's interesting to track those down and see what they actually say.

For instance, "An Empirical View of Inheritance", by Cartwright and Shepperd (1998), only finds 2.2:1 differences between best and worst.

Steve McConnell said:

January 17, 2011 12:48:PM

@Laurent, Re: the Cartwright and Shepperd paper, I hadn't previously seen this report, so I haven't studied it in detail. At first glance it seems to contain several of the more common methodological errors the Sheil paper talked about:

1. Student programmers rather than professionals

2. Different tasks were done by different programmers, so the study doesn't control for the dynamic I discussed in Figure 1

3. Small sample size -- only 5 programmers per group

4. Small assignment -- 31 to 109 minutes total

5. It doesn't look at a full programming task -- only a short maintenance fix (i.e., it's unclear how much design the programmers had to do)

With this small number of subjects (only 5 working in each of 2 styles), I don't think this really says anything one way or the other about the overall 10x issue.

The study does implicitly raise a fascinating question: It shows that a small maintenance change in an inheritance structure takes longer, even though more lines of code get changed in the flat (non-inheritance) structure. The idea that the group that generated more lines of code took less time is not intuitive, and is interesting -- but it is of course still subject to the methodological limitations I mentioned above.

Steve McConnell said:

January 17, 2011 12:49:PM

@Ingo -- fascinating! Can you give a range of times needed to complete the solution? I.e., was this a 10-minute task or a 3-hour task?

Laurent Bossavit said:

January 17, 2011 1:11:PM

> With this small number of subjects (only 5 working in each of 2 styles), I don't think this really says anything one way or the other about the overall 10x issue.

That's 10 in total. Now, here is what you had to say on sample sizes in 2008:

"The other thing to keep in mind is that in most cases these are small sample sets -- 20 to 50 programmers. If we were dealing with huge sample sizes -- thousands of programmers -- comparing best and worst wouldn't be all that interesting because you wouldn't run across similar size groups very often. But when you're looking at results from samples this small, the odds are you *will* run across similar results similar to what is reported in the studies."

Have you changed your mind since?

Steve McConnell said:

January 17, 2011 2:39:PM

@Laurent -- Cartwright and Shepperd did have 10 total subjects, but since the 10 were divided into 2 groups of 5, and since each group worked in a different way, it isn't meaningful to compare results across groups. For purposes of analyzing differences in productivity among individuals, it's more accurate to characterize their study as 2 groups of 5 and analyze the differences within groups, but not across groups. To me their study isn't very interesting in terms of the 10x issue because they looked only at groups of 5 programmers, which I think is too small.

My 2008 remark that you cite wasn't intended to be a general comment about sample sizes. It was a response to a specific comment, which I think is clear if you read the comment in context.

The point I was making was that if the productivity studies looked at variations across huge groups like 10,000 programmers and found 10:1 differences that way, the 10x difference wouldn't be very interesting because it would only apply to groups of ~10,000 in size. But the studies don't look at groups that big. They tend to look at groups of 20-50 programmers and have found 10x differences in groups of that size. That makes the 10x observation relevant to a lot of groups of the sizes people actually work in.

Here's the link to my 2008 comment if anyone wants to follow up on it (scroll way down into the comments section): forums.construx.com/.../productivity-variations-among-software-developers-and-teams-the-origin-of-quot-10x-quot.aspx.

Maybe I also need to point out that I am not arguing (and to my knowledge no one is arguing) that *every* group of programmers demonstrates a 10x difference in productivity from best to worst. The more homogeneous the group is in terms of ability, experience, motivation, etc., the smaller the gap from best to worst will be. All other things being equal, smaller groups will tend to demonstrate smaller variations from best to worst.

Laurent Bossavit said:

January 17, 2011 4:39:PM

Mills 1983 seems even more problematic with a still smaller number of subjects - one, that being Harlan Mills himself.

Daly (1996), "Replication and a Multi-Method Approach to Empirical Software Engineering Research", also cited in Prechelt, with two groups of 10 subjects, finds 4.6:1 and 5.3:1 ratios between best and worst. Total times range between 18 and 111 minutes across the two groups.

(The dataset in Prechelt "contains 1491 observations made for 614 different subjects from 137 experiment groups ranging in size from 2 to 38 persons". So this sample size is middle of the road for the Prechelt dataset.)

> The more homogeneous the group is in terms of ability, experience, motivation, etc., the smaller the gap from best to worst will be

That is in part a tautological argument: the more homogeneous a group is in terms of ability, the more homogeneous the measurements of that ability will be. Well, I wouldn't be surprised.

The original post claimed "10-fold differences in productivity and quality between different programmers with the same levels of experience". So it shouldn't matter how homogeneous or diverse the group is in terms of experience, which is the main factor usually controlled for.

More problematically, what kind of group would you expect *not* to show 10-fold differences, specifically; that is, what empirical data will you then allow to falsify the original claim?

You said above "Bossavit does not cite even one study—flawed or otherwise—that counters the 10x claim, and I haven’t seen any such studies either".

We have now seen two studies so far that counter the claim. (In one sense you have "seen" those studies, insofar as they are referenced in Prechelt's work, of which you were aware in 2008.)

Suppose we were to go on in that vein: look for the primary sources on the studies Prechelt says were included in his large "variance" dataset, and find out what the data said.

Can we come up, ahead of that work, with unambiguous criteria on what studies we ought to consider valid counter-examples of the 10x claim, in terms of group size and makeup, how low a variance we have to see to consider it "not 10x", and so on?

Steve McConnell said:

January 17, 2011 5:46:PM

@Laurent, On the tautology point -- what I was thinking of there is companies that are consistent about hiring from the top end of the talent pool (aka the "ability" pool), the middle, or the bottom. As I stated specifically back in the 2008 discussion, and again earlier in this chain in acknowledging Larry O'Brien's comment, I have observed (informally) that productivity differences within a single company are usually in the range of 2-5x, but not the full 10-20x reported in the research literature. So yes, the more homogeneous the group is, the lower the observed productivity difference will be.

You wrote, "The original post claimed '10-fold differences in productivity and quality between different programmers with the same levels of experience'. So it shouldn't matter how homogenous or diverse the group is in terms of experience, which is the main factor usually controlled for." That logic doesn't quite work. If there are 10x differences when experience is held constant, there could be even larger differences when experience isn't held constant.

You ask, "What kind of group would you expect *not* to show 10-fold differences?" I already answered that. Groups that are more homogeneous in experience, ability, motivation, and so on will tend to show lower differences.

Finally, you say, "What empirical data will you then allow to falsify the original claim?" Fair question, and I do think the 10x claim is "falsifiable".

I would want to see (1) studies that covered as broad a range of circumstances as the studies that support the 10x claim; (2) studies that are on a par methodologically with the 10x studies, i.e., they wouldn't have to be perfect, but their methodological flaws, if any, would need to be minor; (3) studies that include sample sizes similar to the 10x studies, i.e., at least a few that have samples of 50+; (4) in aggregate, they would need to provide more evidence in support of a lack of 10x variability than the studies that support 10x variability; and finally (5) they would need to explain why so many prior studies have found 10-20x variability and why so many people (myself included) have thought that we observed order-of-magnitude differences but really didn't.

Having said all that, I think this is a pretty academic point. The studies that you're citing to disprove 10x still report 2x to 5x differences with small sample sizes. That just proves those studies didn't happen to have anyone really good or really bad in their samples. Those studies don't prove that a 10x difference wouldn't exist in a larger sample.
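To make the sample-size point concrete, here is a minimal simulation sketch (purely illustrative -- the lognormal distribution and the spread parameter are assumptions of mine, not figures taken from any of the studies we've discussed). It shows how the observed best-to-worst ratio grows with group size even when the underlying population of programmers stays exactly the same:

    # Illustrative only: assumes individual productivity is lognormally
    # distributed with an arbitrary spread (sigma). Not data from any study.
    import random
    import statistics

    random.seed(42)

    def median_best_worst_ratio(group_size, trials=1000, sigma=0.7):
        """Median best/worst productivity ratio across simulated groups."""
        ratios = []
        for _ in range(trials):
            group = [random.lognormvariate(0, sigma) for _ in range(group_size)]
            ratios.append(max(group) / min(group))
        return statistics.median(ratios)

    for n in (5, 10, 20, 50):
        print(f"group of {n:2d}: median best/worst ratio ~ {median_best_worst_ratio(n):.1f}x")

With those (arbitrary) parameters, the simulated groups of 5 show only a few-fold spread, while the groups of 20-50 show order-of-magnitude spreads, even though every group is drawn from the same population -- which is why a 2-5x result in a 5-person study doesn't contradict 10x results in studies of 20-50 programmers.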

*****

Taking a step back from the details, my overall feeling at this point in the chain is that I'm not sure why we're still discussing this. Do you really believe there aren't 10x differences in productivity? If so, why? Do you really believe "all programmers' productivity is about the same"? Or do you just think the difference is 2-3x but not 10x? If you're arguing either of these points, that makes me wonder whether you just haven't seen any programmers who are that good -- or maybe you haven't seen any who are all that bad. I have certainly seen plenty of examples on both ends of the spectrum, and I've talked to enough people on this topic to believe that my experience is not at all unusual.

If there's anybody reading this chain who believes "most programmers' productivity is about the same" I'd love to hear from them.

Laurent Bossavit said:

January 17, 2011 7:04:PM

> Do you really believe there aren't 10x differences in productivity?

What I disbelieve is that this claim is "as solid as any research that’s been done in software engineering" (Pete McBreen has weighed in on that recently, to call it "damning with faint praise" - I agree!). It doesn't feel solid, it feels "not even wrong".

What I believe is that this "fact" has brought us no useful understanding of what programming productivity consists of.

What I believe is that to overstate the quality and quantity of the evidence bearing on that claim is to hold back more appropriate research on that topic. (The appropriate way to assess the quality of the research is to report the evidence consistently -- that rules out including personal opinions in a list supposedly of "many studies" while failing to list other studies that don't appear to support the claim.)

Instead of being called "productivity" studies they should more rightly be called "time and motion studies", since what they typically measure is the time taken by some group of programmers (typically students) to complete a task chosen by someone else under somewhat arbitrary circumstances. It's not clear how well that kind of measurement can translate to the real-world effectiveness of the same developers.

An old experiment by Jerry Weinberg showed that when you ask programmers to focus on some aspect of job performance -- time to complete, memory optimization, runtime optimization -- that aspect ends up being where they perform best. So the "studies" may really be measuring how well people manage to optimize their own performance for speed, to the detriment of other aspects of job performance (aspects that do matter for productivity in professional work).

What I also suspect is that the "studies" are basically measuring noise. I am aware of no study that attempted to estimate how much of the variability would still be there if you measured the *same* programmer on different kinds of tasks, or even at different points in time; this kind of elementary checking of "construct validity" is commonplace in psychometrics and its absence in software engineering is quite troublesome.

The point of any "controlled" study is that you are trying to discern a *signal* among the noise; the noise itself is a given, a nuisance but not useful knowledge by itself. If you take a larger sample, you get more noise, and (as several commenters have pointed out before me) with enough noise you can make the best-worst differential arbitrarily high.

I believe that if you got a large sample of programmers AND controlled for all the really relevant variables, you would see a relatively tight clustering of productivity. Programming ability isn't magic, or irreducible. It is built up from smaller elements, and a good experimental design should be able to isolate those.

The lack of correlation with experience should be a big hint that there's something amiss. We should expect that even such a restricted aspect of productivity as time to complete a task should rise with experience. The issue here is that we don't really have access to this variable "experience"; we only have access to an imperfect proxy which is "years of experience reported on CV". It's a common observation that ten years of experience is different from one year of experience repeated ten times.

My hunch is that the variable that really matters is the extent of the programmer's knowledge of design schemas that are relevant to the task. This is what grows with actual experience and deliberate practice. (Norvig's solution of the Sudoku exercise impresses, because he demonstrates ready access to a wide range of design schemas at different levels of abstraction: encoding part of the solution in the data representation, list comprehensions, hash tables, unit tests, constraint propagation, mutual recursion, backtracking.)

Controlling for that is rather difficult, but you could easily set up an experiment to test its impact directly: measure the spread of performance on tasks that are similar in underlying structure, but different in surface detail, solved successively by the same programmers.

What's left is perhaps harder to learn through experience or education: recognizing, in a real-world task, which design schema is best suited to that task. To the extent that this relies on intuition, which we can't measure or even formalize yet, we would still see some left-over variability (probably correlated with measures of fluid intelligence) after controlling for extent of schema knowledge.

Maybe, at this point, we'd have a useful debate at hand: if the variability due to hard-to-teach factors is high, we'll conclude that great programmers are "born not made"; talent spotting, rather than systematic training, will take the lead in the effort to raise the bar. If, on the other hand, the variation is mostly owed to a patient accumulation of effective programming techniques, we'll devote more effort to improving programmer education.

Steve McConnell said:

January 18, 2011 12:12:AM

Laurent, I read McBreen's commentary. He's either misreading or misreporting my point when he characterizes my claim that this is "as solid as any research that's been done in software engineering" as "damning with faint praise." That was not my intent, and I think that was clear in context. I will take up McBreen's other point -- "if there are 10x differences in productivity, why aren't there 10x differences in compensation?" -- separately, since it's an interesting discussion in its own right and deserves a blog post of its own, not just a comment here.

To the degree that you agree with McBreen's point, I would again challenge you to find any other research in software engineering that's based on more solid evidence. You cite lots of aspirations for future research and speculate about what that research might find. That kind of speculation is fine, but until the research has actually been done and produces the findings you expect (if that's what it finds), those points are irrelevant. The existing research supports the 10x claim, and you still have not provided any meaningful evidence to the contrary.

The studies you have cited are either not very relevant or support my point. The most recent study you cited was not published in a peer-reviewed journal. You claim I "failed to list other studies that don't appear to support the claim." But you haven't actually cited any such studies. The authors of the study we've most recently exchanged comments about don't say anywhere in their paper that they think it's about productivity variations between programmers. Their paper is about the effectiveness of different inheritance approaches using C++. It is not about variations in programmer productivity. It is not reasonable to expect me to include every study that might tangentially be construed as having something to do with programmer productivity, especially those that have not actually been published, use student programmers, and use sample sizes of only 5 programmers per group.

That Cartwright and Shepperd study is interesting because the groups of 5 programmers showed 2x differences in productivity, even with a small sample size. The earlier Daly 1996 study that Cartwright and Shepperd were trying to replicate used 10 programmers per group and showed a 5x difference in productivity, still with a small sample size. It is not a huge leap of imagination to suppose that if this study were reproduced with 50 programmers instead of 5 or 10, the researchers might find 10x or higher differences in productivity. Of course that's speculation on both our parts -- the only thing we know for sure about these studies is that they are limited by the fact that they used small groups of student programmers, weren't officially published, didn't directly focus on productivity variations, and therefore aren't very relevant.

Many of the studies I cited were originally published in refereed journals or conference proceedings. The authors of these studies agreed with my conclusion -- all reported high variability in productivity among different programmers. You sound like you don't find those studies credible, but the editors of the journals and the conference organizers who published them did find those papers credible. If you are going to accuse an author of lying with citations every time he cites a paper from a refereed publication that says what he says it says, you're going to be calling a lot of people liars.

Per your original criticism of my article, I did not overstate the quality or quantity of evidence supporting the 10x claim, and I did not "lie with citations." Your repeating that over and over doesn't make it true. I listed my sources. You have not responded to my arguments in support of those sources.

*********

To summarize where we are on the studies after several rounds of comments, here is the support for the 10x claim:

Sackman, Erickson & Grant - doesn't support 28:1 as originally thought, but as I pointed out in my 2008 blog article as well as this one, it does support claims of 10x or higher. You have not provided any reason to question this study other than disbelieving the original 28:1 ratio, which I had already described in my original article.

Curtis 1981. Curtis stated "a statement such as ‘order of magnitude differences in the performance of individual programmers’ seems justified." This was a wholly appropriate citation from a refereed journal, and you have not given any reason to question this specific study.

Curtis 1986. This "study of studies" again provided support for the 10x claim, and was published in a refereed journal. You have not given any reason to question this specific study.

Mills 1983. You insist on oversimplifying my citation of Mills to one sentence, even though my citation was to his longer book that discusses Mills' opinion as well as his many years of research. You insist on characterizing Mills' summary of his research as his "opinion." I have characterized it as a research summary. I believe my characterization is more accurate.

DeMarco and Lister 1985. This directly supports a 5x productivity claim, with the additional proviso that about 10% of their test subjects didn't finish the assignment at all, making the real ratio more than 5x. This was published in refereed conference proceedings. You have not given any reason to question this study either, aside from pointing out that it involved professional programmers in their normal work setting, which I do not agree is a criticism of the study.

Card 1987. Aside from your initial mischaracterization of this as an "executive report," you haven't given any reason to disbelieve this research either. Again, this was published in an archival journal, and its author believes it supports a claim of high productivity variability among programmers.

Boehm 1981. At the team level, supports a 4x difference in productivity. You haven't given any reason to question this research.

Boehm 2000. Same as Boehm 1981. You haven't given any reason to question this research.

Valett and McGarry 1989. This is another study that was published in a refereed journal, which you have given no specific reason to question.

Humphrey 1995. Not one of my original citations, but a good addition from another blog reader. Directly supports the claim of 20x differences in programmer productivity.  I would add this citation to my original list of citations now that it's been brought to my attention.

Prechelt 1999. Not one of my original citations, but it is a relevant (if unpublished) citation from another blog reader. Prechelt himself doesn't seem to believe the 10x claim, but if you avoid the questionable things he does to the data (such as removing the top 12.5% and bottom 12.5% and calling what remains the full difference), his data set also supports the 10x claim. Because it hasn't been published, I would not add it to my original list of citations, though I'm including it here since the discussion thread has already talked about it.

***********

You said, "What I believe is that this 'fact' has brought us no useful understanding of what programming productivity consists of." That's not the point. I never claimed this discussion brings us more understanding of that issue. It brings us understanding of the range of human performance on programming tasks. If programmers all have the same capabilities, that will lead us to structure and manage our teams using one set of approaches. If we can expect moderate to high variation in performance among programmers, we'll structure and manage our teams in different ways.

You claim that the 10x studies are basically just measuring noise. They are not -- that was my point with Figure 1. The *rest* of the studies can be accused of measuring noise, but not the studies of variation among programmers themselves.

***********

Where does that leave us overall? It seems to me that it leaves us back where we started, although we've added a couple more studies supporting the 10x claim. I provided a number of published studies in support of the 10x claim that were directly on point. You have cited a couple of unpublished research reports on tangential topics like the use of object-oriented practices, but no research that's directly on point.

At this point I think we've exhausted any discussion that might be of any conceivable interest to anyone other than the two of us. As I stated in my earlier private email to you, I am willing to continue the dialogue via email if you want to do that.

Application said:

August 13, 2013 4:53:PM

Sorry to be late to the party -- it's one of my faults. I personally measured programmer productivity on two six-month projects, one with six developers and one with 11. Both results support the 10x phenomenon. Each project involved professional developers, and in both cases the entire team developed in the same language under the same methodology. So I've long been a believer in 10x. Not to contradict Larry O'Brien, but yes, I've seen 10x among small teams. It depends on who an organization hires and what they pay. I wonder what percentage of developers are 10x? That's an interesting question. For example, what benefit would a small company enjoy if it paid 20% over average to attract the top producers?

Steve McConnell

Steve McConnell is CEO and Chief Software Engineer at Construx Software where he consults to a broad range of industries, teaches seminars, and oversees Construx’s software development practices. In 1998, readers of Software Development magazine named Steve one of the three most influential people in the software industry along with Bill Gates and Linus Torvalds.

Steve is the author of Software Estimation: Demystifying the Black Art (2006), Code Complete (1993, 2004), Rapid Development (1996), Software Project Survival Guide (1998), and Professional Software Development (2004). His books twice won Software Development magazine's Jolt Excellence award for outstanding software development book of the year.

Steve has served as Editor in Chief of IEEE Software magazine, on the Panel of Experts of the SWEBOK project, and as Chair of the IEEE Computer Society’s Professional Practices Committee.

Steve received a Bachelor's degree from Whitman College, graduating Magna Cum Laude, Phi Beta Kappa, and earned a Master's degree in software engineering from Seattle University.
Contact Steve