I’m sort of a statistics idiot. I am endlessly fascinated by statistics, especially as a game developer, where they power everything from animation to scoring systems, but I also have a lot of love for what statistics can tell you about the world at a glance. It’s a comforting feeling to see reality described as a value.
The industry has long searched for reliable metrics by which to gauge success. The traditional metric is “how much did it sell, and at what price, versus our investment”, but that statistic is heavily dependent on time. How do you gauge the success of a product whose lifespan is considerable? The concept of the “indie darling”, where a game turns out to be a success where little was expected, is warm and fuzzy and fun to think about, but completely untenable as a reliable business model.
For analysts, predicting success becomes a progressively sexier proposition as investment increases. In video games, I was recently made aware of so-called “mock reviews”, a practice in which games journalists review unfinished products for the publisher’s internal use, so as to predict how the game will score at final release and, I assume, determine the marketing budget. As a game developer, though, how much real change can you introduce into a game at a point where it is already “reviewable”?
And once a game is released, with the long tail of game sales these days, what determines success?
Unfortunately for everyone, consumer and industry, “success” is currently measured in the Metascore. It’s time for me to ramble.
Averages are wonderful in how much reinterpretation they demand. They are the most boring of statistics, existing only to give rudimentary impressions, smoothing away edges, peaks and valleys. The average of a triangle’s vertices gives you its centroid; a representation of the triangle, certainly, but what a pitiful one. Such a representation only has value through interpretation and contextualization. Consider the average CPU use of an application. It might idle and do nothing, and it might burn every core you have, and so your average, out of context, is almost completely useless. You’ll be left staring at that dull middle value, knowing even less than when you started. Averaging a system with lots of variation is, as far as I can tell, silly; the only thing you can measure is a tendency, and a tendency is not a precise value. The aggregate score of a game would permanently sit in the 70s or 80s; only outlier games would diverge from that average.
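To make the CPU example concrete, here’s a tiny sketch (the utilization trace is invented purely for illustration):

```python
# An invented, bursty CPU trace: long idle stretches, short full-load spikes.
samples = [0, 0, 0, 0, 100, 100, 0, 0, 0, 0]  # percent utilization per tick

average = sum(samples) / len(samples)
print(average)  # 20.0 -- suggests a calm machine, yet hides the 100% spikes
```

The 20% figure describes neither the idle stretches nor the spikes; the tendency is real, but not a single sample actually looks like it.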
Video game media is about as score-driven as media comes. Games are a natural fit for scoring, after all. In playing games, especially games of yore, success is eminently quantifiable, and that quantification of reality and success is a big draw of gaming as a whole. It only makes sense, I suppose, to quantify the success of the game as a product as well.
It turns out, however, that scoring of this sort is a little too complicated for its own good; you might as well task yourself with reviewing a human being, what with all the warts and beauty a game can bring to the table. How would you score your friends, and how would they measure up?
The topic of game review scoring is a hot one. I suppose the basic argument is about what exactly a score means. Is it to determine whether the consumer should make a purchase or not? If so, why not adopt a binary metric such as the thumbs up/down of Siskel & Ebert? Is it to determine how the purchase stacks up to other purchases? At that point you are in the domain of averages, and you end up with lists sorted by score; for a long time The Legend of Zelda: Ocarina of Time was “the best game in the world”, which, regardless of how you feel about that game, is a patently ludicrous notion to anyone who has any interest in the full spectrum of experiences games can offer.
So the choice appears to be between simpler scoring – “good” or “bad” – and more elaborate systems, often resulting in scores with decimals. While there are attempts at walking the middle ground between these two approaches, “guide” versus “data”, these attempts seem to reduce scoring ranges in the belief that this implies leeway for error and should therefore be less contentious. It’s a noble endeavor, but still a compromise rather than a solution to a problem that goes beyond the individual scoring mechanic.
Recently, after reading Destructoid’s 10/10 (in actuality 100/100, counting decimals) review of Halo 4, I was struck by how offensive I found that scoring mechanic versus Giant Bomb’s range of 1–5 stars. The implication, I felt, was that even at 5/5 the broad strokes of the five-star range left implied room for flaw, whereas the 100/100 score was too precise to allow any doubt or reservation, which are profoundly important to as subjective an art form as video games. In a sense, the larger the range, the more I require the full range to be used, lest the values of that range boil away into a skewed average where none of it matters.
I couldn’t tell you when it happened, but at some point, video game scores became practically homogeneous. I’m not opposed to the idea that games themselves have become homogeneous; look no further than the past decade’s love affair with the Modern Military Shooter, possibly the worst, blandest thing to happen to video games for as long as I have been playing them, though judging by the success of the genre that clearly puts me in the minority.
That a game such as XCOM, a moderately simple turn-based tactics game (a genre as common as oxygen in the 90s), can appear as a rescuing angel of innovation in the year 2012 unfortunately speaks less to the merits of XCOM and more to the creative flatline of an industry where ballooning budgets and economic recession have put the fear of death into nearly every publisher in town.
With such enormous budgets, yet so much fear, predicting success is, again, intensely attractive. If X is 100, and Y is like X, Y should be 100 as well, right? Let’s do another one of those Xes.
The answer, it appears, is to guard our investments with aggregate scores. I’m not inherently opposed to score aggregation. As a consumer, I find it highly useful. Rotten Tomatoes is a wonderful thing, probably one of my favorite sites today. It works mostly because film reviews work. While there are scoring systems in place for movies, Rotten Tomatoes does not average the scores it gathers but rather converts every score into a basic thumbs up or down: fresh versus rotten. A gushing review and a middling-to-good review are both fresh; one does not skew the other. In the same sense, a vicious rage fest of a review and a merely disappointed one count the same.
The real crux of the problem with statistics and game reviews is publishers’ willingness to base their business on this skewed aggregate Metascore. I wasn’t shocked to hear Obsidian’s developers would not receive a bonus payout if Fallout: New Vegas didn’t make 90% on Metacritic, but knowing the very first thing about averages and statistics, it didn’t make me any less furious.
Because averages are painfully sensitive to extreme values (the extremes of a data set define its range, and the range sets the scale were you to graph it), so-called outliers can throw off an entire data set. Given 200 scores of 90, a single 10 drags the average down to roughly 89.6, which lands at 89 depending on your rounding. No bonus for you, developers! Why? Because one game reviewer dared have a vigorously divergent opinion.
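The arithmetic is easy to verify (the 201-review scenario is the hypothetical from above, not real data):

```python
scores = [90] * 200 + [10]  # 200 reviews of 90, plus one outlier of 10
mean = sum(scores) / len(scores)

print(round(mean, 2))  # 89.6 -- one outlier pulls 200 steady 90s below 90
print(int(mean))       # 89 if you truncate; round() would still give 90
```

Whether the developers get paid comes down to a rounding mode, which says everything about how fragile the metric is.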
Rotten Tomatoes has eliminated the outlier problem by normalizing the range into a set of binary values. In one fell swoop it has made a range that is intuitive to the viewer yet insensitive to the personality quirks of scoring mechanisms, or even of the reviewers themselves. The resulting percentage score is less a precise metric than the answer to a question: out of all the reviewers counted, how many thought this movie was any good?
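A sketch of that binarization, assuming a 60%-of-scale cutoff for “fresh” (Rotten Tomatoes’ real rules are their own; the cutoff and the review data here are purely illustrative):

```python
def is_fresh(score: float, scale_max: float, cutoff: float = 0.6) -> bool:
    """Collapse a score on any scale into a single thumbs up/down."""
    return score / scale_max >= cutoff

# Invented reviews from outlets with different scales: (score, scale maximum).
reviews = [(4, 5), (9, 10), (95, 100), (2, 5), (1, 10)]
fresh = sum(is_fresh(score, top) for score, top in reviews)
print(f"{100 * fresh / len(reviews):.0f}% fresh")  # 60% fresh
```

Note that the rage-fest 1/10 and the merely disappointed 2/5 weigh exactly the same, as do the gushing 95/100 and the solid 4/5; no single extreme score can skew the result.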
Metacritic instead embraces the whimsical granularity of the games press, adopting Destructoid’s (to me problematic) 100-point range, and as a result outliers are a cause of great concern. The website itself is fine about it, presenting up front the highest-scoring review, the lowest-scoring, and then one from the mid range. As a consumer looking through aggregated reviews, these are the ones I actually care about.
I am much more likely to read “bad” reviews of products, simply because they tend to be the more impassioned. It is easier to disagree with a bad review than with a positive one, though that might just be my personality. Regardless, I look to outliers to gauge where I fall on that spectrum. Games are not as easily quantifiable as film; I’ve been burned far too many times trusting the common consensus (Metal Gear Solid 4 is still the biggest piece of shit in my collection; take that, Metacritic average).
A range is only useful when every value on it has a meaning. Some outlets pride themselves on their willingness to apply the full range, while others take the more politically inoffensive approach of skewing the range towards the positive – everybody knows a game scored 6/10 is pure garbage, right? Combined with the games press’ love affair with granular statistics, this further devalues an average, as nobody seems capable of agreeing on what range they are operating in, while quietly refusing to acknowledge that their scores are being aggregated and used to drive the industry.
There are numerous further issues with Metacritic, such as its normalization of disparate ranges. A 1/5, for instance, translates to a 20/100, which conflicts with sites that use the full 100-point range, where the worst possible score is a 0. I shudder to think how Metacritic would interpret a binary system.
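The conflict is easy to demonstrate. A naive proportional conversion (which is what the 1/5 → 20/100 mapping implies) disagrees with a range-aware one; I won’t claim to know the details of Metacritic’s actual conversion, so both functions here are purely illustrative:

```python
def proportional(score: float, top: float) -> float:
    """Naive conversion: score as a fraction of the scale's maximum."""
    return 100 * score / top

def range_aware(score: float, lo: float, hi: float) -> float:
    """Map the scale's worst score to 0 and its best to 100."""
    return 100 * (score - lo) / (hi - lo)

print(proportional(1, 5))    # 20.0 -- the worst possible star lands at 20/100
print(range_aware(1, 1, 5))  # 0.0  -- the worst possible star lands at 0/100
```

Under the first mapping, a five-star scale can never express anything below a 20, so a site that hands out 1/5s is structurally “nicer” in the aggregate than a site that hands out 10/100s for the same games.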
Yet none of these issues with Metacritic as a platform would affect the industry if publisher analysts hadn’t adopted the aggregate as a metric for success. Because it is not a metric for success. It is statistical guesswork layered on opinionated guesswork, normalized and processed and skewed by a conflicted press. It barely qualifies as statistics.
And so Tom Chick’s 1/5 review of Halo 4, actually a good and informative read if a little personal, becomes controversial, with analysts and game developers up in arms about how he dares write such “look-at-me journalism” (in the words of an enraged David Scott Jaffe) knowing the real-world “value” of the Metascore, or, on the flip side, how Metacritic, knowing the value of its metric, dares include such outliers in its measurement.
For as long as Metacritic’s score average is taken so seriously and given such real-world implications, nobody wins. Not the press, not the developers, and certainly not the consumers.