Google's Disconcerting Project Aristotle

John Vandivier

The NYT <a href="https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html?smid=pl-share">released an article</a> reporting Google's findings from an internal research project called Project Aristotle. The project found that technical skill doesn't matter much in team building and that the key factor is psychological safety. This article discusses some reasons I find it hard to trust Google's results.

Four key issues are front and center:

  1. The actual data is inaccessible
  2. Replication fails
  3. There is evidence that Google has developed a pattern of rejecting evidence-based research and systematically selecting progressive results
  4. Google seems not to be using their own findings

1. The Data is Inaccessible

This is straightforward. On <a href="https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/introduction/">Google's team effectiveness reWork page</a>, they describe Project Aristotle and link the NYT article, but they do not provide access to the source data. They consider it proprietary. This makes replicating, and even understanding, Google's results exceedingly difficult. Eight important questions come to mind:

  1. What do they mean when they say psychological safety is the key factor?
  2. What share of what outcome(s) were explained?
  3. How robust is the finding?
  4. Compared to what other factors?
  5. What is the degree of confidence?
  6. What actual measures were used to operationalize the concept?
  7. What is the optimal financial amount to invest into psychological safety?
  8. Is this factor even subject to standard diminishing marginal returns?

Without this information it is not possible to consider Google's result scientific, actionable, or frankly even intelligible, except in an intuitive manner that cannot be trusted.

A result is trustworthy to the degree that it is replicable and well-defined, which entails that trustworthy results must be accompanied by open data. As a market leader in technology and data, Google knows this, and open data is entirely feasible for them. The fact that they omit such data makes the result suspect, above and beyond the simple fact that replication fails anyway.

2. Replication Fails

Google isn't the only company that studies team effectiveness. I have a bit of data about programmers I work with. I have also contributed to hiring and firing decisions. I can confidently state the following:

  1. A new hire's technical skill earns management's trust, and that trust creates the preconditions for psychological safety.
  2. Technical skill is the most important factor in team effectiveness for development projects.

Part of the replication failure is that Google and companies like mine don't even agree about what success looks like. For my company, success is:

  1. Completion of work requirements on time
  2. High customer satisfaction
  3. Reasonable profit

#1 drives #2 and #3, so we can simplify and study the factors of timely work completion. They basically fall into two categories:

  1. Understanding requirements
    1. Client research activities
    2. Contracting defined scope early on
    3. Preventing scope creep
    4. Iterative adaptation
  2. Developer effectiveness
    1. Technical skill
    2. Work ethic
    3. Communication skill

So clearly there is a role for soft skills. I'd even say there's a role for psychological safety. High employee turnover drives down profit, and a psychologically uncomfortable environment contributes to turnover. In addition, psychological comfort likely facilitates communication.

At the same time, excessive psychological comfort might contribute to scope creep: the client feels they can ask for anything, and the provider worries about the discomfort that might arise from denying a request. Excessive psychological safety might also result in inappropriate romantic or other relationships at the office. After all, psychological safety just refers to a feeling of safety for interpersonal risk taking, and what's a better example of interpersonal risk taking than an individual's attempt to initiate a non-professional relationship in a professional setting?

While there is some role for soft factors including psychological safety, they are largely non-central. Simply contrast these two extremes:

  1. A friendly, communicative team with no technical skill
  2. A technically skilled, hard-working team that is subject to high penalties for unapproved activities

The second group will be subject to psychological discomfort, in particular the discomfort that arises from the possibility of being fired, fined, sued, or any combination of these, for very slight misbehavior. Yet this group will vastly outperform the first group in terms of timely project completion. Indeed, the second group looks very much like one of the country's largest labor teams: the military. The first looks more like a group of friends or a social club than a professional group of any kind.

My team measures developer effectiveness in a number of ways including:

  1. Story points completed per unit of time
  2. Inverse of bugs generated per unit of time
  3. Ratio of accepted to rejected pull requests
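
As a rough illustration, the three measures listed above could be computed along the following lines. This is only a minimal sketch, not my team's actual tooling: the `SprintStats` record, its field names, the choice of weeks as the time unit, and the sample numbers are all hypothetical, and "inverse bugs" is read here as time elapsed per bug generated.

```python
from dataclasses import dataclass


@dataclass
class SprintStats:
    """Hypothetical per-developer totals for one reporting period."""
    story_points: int    # story points completed in the period
    bugs_generated: int  # bugs traced back to this developer's work
    prs_accepted: int    # pull requests accepted
    prs_rejected: int    # pull requests rejected
    weeks: float         # length of the period, in weeks (assumed unit)


def story_points_per_week(s: SprintStats) -> float:
    """Measure 1: story points completed per unit of time."""
    return s.story_points / s.weeks


def inverse_bug_rate(s: SprintStats) -> float:
    """Measure 2: weeks elapsed per bug generated (higher is better)."""
    if s.bugs_generated == 0:
        return float("inf")  # no bugs generated in the period
    return s.weeks / s.bugs_generated


def pr_accept_ratio(s: SprintStats) -> float:
    """Measure 3: ratio of accepted to rejected pull requests."""
    if s.prs_rejected == 0:
        return float("inf")  # nothing was rejected in the period
    return s.prs_accepted / s.prs_rejected


# Made-up numbers for a two-week period.
stats = SprintStats(story_points=26, bugs_generated=4,
                    prs_accepted=9, prs_rejected=3, weeks=2.0)
print(story_points_per_week(stats))  # 13.0
print(inverse_bug_rate(stats))       # 0.5
print(pr_accept_ratio(stats))        # 3.0
```

Each measure is a plain count or ratio, so two reviewers given the same records would arrive at the same numbers.
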
Google's Project Aristotle has a disappointingly subjective view of effectiveness:

"They looked at lines of code written, bugs fixed, customer satisfaction, and more. But Google’s leaders, who had initially pushed for objective effectiveness measures, realized that every suggested measure could be inherently flawed - more lines of code aren’t necessarily a good thing and more bugs fixed means more bugs were initially created.

"Instead, the team decided to use a combination of qualitative assessments and quantitative measures."

3. There is evidence that Google has developed a pattern of rejecting evidence-based research and systematically selecting progressive results

I love how the explanation at the end of the last section describes the selection of the definition of effectiveness. It states that Google's leaders called for the change from objective to subjective measures. It fits my theory perfectly, which is that company leadership is selecting preferred results, and it seems to indicate that objective measures produced some other result, such that a change in study was required. There is also no justification for the finalized mix of qualitative and quantitative outcomes selected for the definition of effectiveness. It's as if the research team's justification is basically, "Well, that's what leadership picked."

While this appears to be grossly non-rigorous, it's not the first example of such anti-science behavior. I note four additional points in this section, for a total of five, which is enough for me to claim a pattern.

  1. Academic research, such as Epstein and Robertson 2017, has identified a left bias in Google search results
    1. Tangentially, reports are that Facebook is also biased, and therefore it may be a trend in Silicon Valley and among other big name tech firms as well.
  2. Shapiro notes that rotating Google doodles throughout the year favor leftist causes.
  3. Project Oxygen concluded that soft skills matter more than STEM for top employees.
    1. Some of the incorrectness of this result is attributable to misreporting in the media. For example, Project Oxygen specifically looked at managers rather than all employees or developers in particular. No one expects STEM to be more important than a creative writing degree for a professional writer. Most people think STEM matters for developers.
    2. Even so, Google once again failed to permit data access or provide clear meaning about certain factors.
    3. They also engaged in leadership-directed, leading (in the sense of determining the answer beforehand) research questions. As an example, Michelle Donovan, one of the original researchers, said the guiding question shifted from “Do managers matter?” to “What if every Googler had an awesome manager?”
    4. Their findings basically amount to "We discovered good managers don't piss off developers," which should have revivified the original question about whether managers as a pure role are desirable at all. Instead, leadership decided of their own accord that a management layer would exist and that the research team needed to figure out how to ensure it is composed of good managers. They didn't consider alternatives.
  4. Google fired the author of the Google Memo and Google's leadership participated in his public humiliation, despite the fact that the Google Memo is largely correct.
    1. Also see Vice/Motherboard and Wikipedia for memo references, other than the actual PDF copy linked above.

If you haven't read the Google Memo, you should. Its contents defy the media reporting on them, as well as Google's statements about them. Here, for example, are selected excerpts from the first two pages:

"I value diversity and inclusion, am not denying that sexism exists, and don’t endorse using stereotypes...Google’s political bias has equated the freedom from offense with psychological safety, but shaming into silence is the antithesis of psychological safety...This silencing has created an ideological echo chamber where some ideas are too sacred to be honestly discussed...Differences in distributions of traits between men and women may in part explain why we don't have 50% representation of women in tech and leadership...Discrimination to reach equal representation is unfair, divisive, and bad for business...Of course, I may be biased and only see evidence that supports my viewpoint. In terms of political biases, I consider myself a classical liberal and strongly value individualism and reason. I'd be very happy to discuss any of the document further and provide more citations."

The memo alleges a left bias, where left bias is defined according to the included chart, and it seems accurate imo:

<img class="aligncenter size-full wp-image-6507" src="http://www.afterecon.com/wp-content/uploads/2018/01/left-bias.png" alt="" width="556" height="125" />

4. Google seems not to be using their own findings

Google claims to be concerned about psychological safety, but they fired James Damore for creating the Google Memo. That's an example of thought intolerance or policing, and thought intolerance reduces diversity as well as psychological safety. Google also continues to administer technical tests as a condition of getting an interview, which makes little sense if technical skill doesn't matter much.