
Sochi: no conspiracy against Kim, but flawed judging system

PostPosted: Mon Feb 24, 2014 6:33 pm
by geon79
Hi everyone.
So, I took the published score details of the ladies' long program in Sochi and analysed them in some detail. My conclusions are the following, and I'm going to try to prove them:

1. No evidence of conspiracy to make Sotnikova win the medal.
2. The training and selection of judges and/or the scoring system is severely flawed.

1) There has been speculation that one or more judges may have fixed the competition to favour Sotnikova. First of all, at least two judges would need to be involved, because the lowest and highest GOE marks for every technical element and the component marks are removed from the computation. To see whether TWO judges were actively overvaluing Sotnikova and undervaluing Kim, I recalculated the scores:
A) removing the two highest and the two lowest marks from the computation, and
B) removing the two highest- and lowest-marking judges overall (i.e. the ones that gave the highest and lowest total sums of GOE points and component points).
The results are:

Sotnikova
official tech-> 75.54 comp-> 74.41 total-> 149.95
A        tech-> 75.55 comp-> 74.64 total-> 150.19
B        tech-> 75.55 comp-> 74.40 total-> 149.95

Kim
official tech-> 69.69 comp-> 74.50 total-> 144.19
A        tech-> 69.49 comp-> 74.40 total-> 143.89
B        tech-> 69.67 comp-> 74.64 total-> 144.41

As you can see, the difference it makes is very small, which means no subgroup of judges actively favoured Sotnikova, unless it was the WHOLE judging board acting together; but this last possibility is unlikely considering point 2.
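For anyone who wants to check the method, here is a minimal Python sketch of recalculation A for a single element. The marks are made up for illustration, and `trimmed_mean` is my own helper, not ISU code; the real calculation does this per element and then applies the base values and factors.

```python
def trimmed_mean(marks, trim):
    """Panel average after dropping the `trim` highest and `trim` lowest marks."""
    s = sorted(marks)
    kept = s[trim:len(s) - trim] if trim else s
    return sum(kept) / len(kept)

# Hypothetical GOE marks from a panel of 9 judges for one element
marks = [3, 2, 2, 1, 0, 0, -2, 0, 2]

official = trimmed_mean(marks, 1)  # ISU method: drop one high, one low
method_a = trimmed_mean(marks, 2)  # method A above: drop two high, two low

print(official, method_a)
```

With a reasonably tight panel the two averages barely move, which is exactly why recalculation A changed the totals by only fractions of a point.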

2) The scores shown in point 1 are calculated as an average among judges, but the average tells only half of the story. What is really worrying is the HUGE spread of scores. Isolating the marks of single judges (we don't know their identities, but their marks are listed separately on the result sheets) we obtain these scores:
Sotnikova
official        tech-> 75.54 comp-> 74.41 total-> 149.95
highest marking tech-> 80.73 comp-> 77.20 total-> 157.93
lowest marking  tech-> 70.33 comp-> 70.08 total-> 141.13

Kim
official        tech-> 69.69 comp-> 74.50 total-> 144.19
highest marking tech-> 73.79 comp-> 78.00 total-> 151.79
lowest marking  tech-> 66.89 comp-> 67.02 total-> 134.09

So one judge marked Sotnikova's program 16.80 points higher than another one, and this gap reaches 17.70 points for Kim. These differences are huge: the gap is larger than the difference in free skating score between the first and the sixth classified free programs in the event.
This means that the judging system is inherently flawed. It's not a matter of taste; we are beyond the discussions about how jumps should be scored compared to spins, how components should be valued, etc.
However the elements are weighted, the unforgivable flaw is that the results are highly inconsistent.
The purpose of a judging system is to have a STANDARD, and top judges are expected to apply that standard with high consistency. In other words, two high-level judges should mark the same performance within a few points of one another. Of course, a small variation is expected in performance sports, and that's why there is a board of 9 judges to average out and smooth over that small uncertainty. But 17.7 points of difference for the same performance (a gap that could easily accommodate 10 skaters in the final classification) is far too much, and it can't be evened out by averages.
I bet several fans of figure skating, without any specific training or selection, could mark a free skating program within 17 points of its actual value; so how are these judges trained? Is the standard they are supposed to follow clearly defined, or is it just a matter of opinion? A 17.7-point spread from an Olympic-level judging board sounds like a mock scoring system, just smoke and mirrors to make things look accurate while in reality there's been no progress from the 6.0 system.

The ISU seriously needs to rethink this. It isn't fair to fans of figure skating and, even more so, to athletes who devote their lives to this sport. I hope someone in the upper echelons reads this message and asks themselves some serious questions.


Edit: typos

PostPosted: Mon Feb 24, 2014 9:05 pm
by Andy
Thanks for the interesting analysis. I can add some more points of discussion to it. According to rule 1631, the performance of the judges is evaluated as follows:
1) for the TES, the average score is calculated per element from the grades of each of the nine judges, as follows.
Let's say that element A is scored for its GOE +3, +2, +2, +1, 0, 0, -2, 0, +2. The average is then (3+2+2+1+0+0-2+0+2)/9=0.89

Subsequently the score of each judge is compared to this average, so the first judge would be +2.11 above average for element A.

For each judge each element is evaluated as above, then the positive differences and the negative differences are added. Let's say that judge number one scored the seven elements of a short program with the following biases from the averages: +2.11, -0.11, -1.50, +0.20, +0.00, -1.20, +1.00.
The positive differences are +2.11+0.20+0.00+1.00=+3.31
The negative differences are -0.11-1.50-1.20=-2.81

So the total bias for judge nr 1 is 3.31-(-2.81)=6.12
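The whole computation fits in a few lines of Python; this is my own throwaway sketch reproducing the example numbers, not ISU software. Note that the -1.50 deviation has to be counted among the negative differences.

```python
# Per-element average and one judge's deviation (element A above)
marks_element_a = [3, 2, 2, 1, 0, 0, -2, 0, 2]
average = sum(marks_element_a) / len(marks_element_a)
print(round(average, 2))      # 0.89
print(round(3 - average, 2))  # judge nr 1's deviation on element A: 2.11

# Judge nr 1's deviations across the seven SP elements
deviations = [2.11, -0.11, -1.50, 0.20, 0.00, -1.20, 1.00]
positive = sum(d for d in deviations if d > 0)   # 3.31
negative = sum(d for d in deviations if d < 0)   # -2.81 (includes the -1.50)
total_bias = positive - negative                 # 6.12
print(round(total_bias, 2))
```

Even with a 6.12 total bias, this judge stays inside the 7.0-point short-program corridor discussed below.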

The acceptable limit for the TES is one point per element, meaning that in a short program any score within 7.0 GOE points of the average will not be scrutinized. Note at this point that I jotted down random numbers, including a royal 2.11 difference on a single element. Note also that in a singles FS the number of elements goes up to 12 for ladies and more for men, so the allowed difference becomes 12 points or more. We can fit an elephant in this difference, and no further action would be taken towards a judge marking this way.

2) The story is even better for the PCS. A cumulative corridor of 7.5 points is allowed across the five components (1.5 points per component, but only the TOTAL bias is considered). Averages are calculated as above, of course, with a small variation.
In this case the positive and negative differences are added to each other with their signs, so our judge number one, marking the 5 PCS with the following differences: +1.5, +2.0, -2.5, -0.25, -0.10, would score a total bias of +1.5+2.0-2.5-0.25-0.10 = 0.65

Again, in this case no questions would be asked...
Too bad, though, that a 7.5-point difference from the average translates into a fat 15 points, more or less, in a men's FS (where the PCS are multiplied by a factor of 2.0) - and that is just for the PCS.
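The key difference from the TES method is that the PCS deviations are summed WITH their signs, so large biases in opposite directions cancel. A quick sketch (again my own code, using the example deviations):

```python
# Judge nr 1's five PCS deviations from the panel averages
deviations = [1.5, 2.0, -2.5, -0.25, -0.10]

# PCS method: signed sum, so +2.0 and -2.5 largely cancel each other
total_bias = sum(deviations)
print(round(total_bias, 2))  # 0.65 -- comfortably inside the 7.5 corridor

# TES-style magnitude, where positive and negative parts both count
magnitude = sum(abs(d) for d in deviations)
print(round(magnitude, 2))   # 6.35 -- the judge's actual spread
```

So a judge who is wildly off in both directions looks almost perfectly average under the PCS check.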

3) Let's now see how two judges can start influencing the review process. Let's take an absurd example. Skater A receives the following marks for SS: 7, 7, 7, 7, 7, 7, 7, 7, 9

Anybody would question the 9... well, the average mark is 7.22, so the bias is +1.78. Just outside the 1.5 limit.

Now let's say that two judges have a similar idea on how to score the same skater A. The SS marks are now: 7, 7, 7, 7, 7, 7, 7, 9, 9. Again, I think anybody would jump on the two 9s. The average is 7.44, and the bias for each of the two judges is 1.56.
Now the following is enough: 7.25, 7, 7.25, 7, 7.25, 7, 7.25, 9, 9 and we are set: the average is 7.56, and the bias for the two judges is... 1.44. Perfectly acceptable. This translates into about 2.9 extra points (men's FS, where the PCS are doubled), more than enough to accommodate one gold medal.
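The three scenarios are easy to verify with a short script (my own sketch; the marks are the hypothetical ones above):

```python
# For each SS scenario: panel average and the bias of a judge who gave 9
scenarios = [
    [7, 7, 7, 7, 7, 7, 7, 7, 9],              # one lone outlier
    [7, 7, 7, 7, 7, 7, 7, 9, 9],              # two coordinated outliers
    [7.25, 7, 7.25, 7, 7.25, 7, 7.25, 9, 9],  # two outliers + padded panel
]
results = []
for marks in scenarios:
    avg = sum(marks) / len(marks)
    results.append((round(avg, 2), round(9 - avg, 2)))
print(results)  # [(7.22, 1.78), (7.44, 1.56), (7.56, 1.44)]
```

The lone outlier trips the 1.5 corridor; two coordinated outliers, with the averages dragged up slightly, slip under it.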

I think it is now easy to combine points 1-2-3 and draw some conclusions on how it would take only 2 judges to severely impact the outcome of any event, without any consequences.

PostPosted: Wed Feb 26, 2014 12:31 am
by geon79
Thank you very much for the information about how judges are assessed. To me it looks very inadequate for the international standard of an Olympic sport: it means that the final score (and ranking) depends too heavily on the composition of the judging board. Lengthy discussions about a few points more or less for this or that skater, or about the ranking, sound futile, because it's very likely that those few points of difference are simply due to random fluctuations of poorly trained judges.
In this particular case, where the score varies by as much as 17.7 points from one judge to another for the same performance, the bias calculated in the way you described reaches just 8.33 for the technical score and 4.36 for the components... well within the 12 and 7.5 points respectively that the ISU considers acceptable.
This is really disappointing and frankly unacceptable. I daresay the ISU should reduce that tolerance to no more than 4.0 points for the tech score and 2.0 for the components, and train its international judges to that standard.

PostPosted: Wed Feb 26, 2014 3:14 pm
by tennisfan
I think there are issues the ISU should take up in terms of how judges are selected and who gets to be on the panels, but variance in scores is not a complaint I have. Different people will have different opinions, and a large part of skating is subjective: do you like this choreography? Do you think it goes with the music? You can't have a properly judged competition if the judges are limited to marking within a small variation of each other, particularly when the judges don't know what scores the others are giving.

PostPosted: Wed Feb 26, 2014 3:32 pm
by Andy
You make a valid point, Tennisfan, but I have an objection. Skating is first and foremost a sport, at least if it is meant to stay competitive and Olympic.

The whole point of CoP was to reduce subjectivity and increase objectivity in marking. Cultural bias and personal taste should not be playing a major part, at least not in the technical side of the scoring. A flutz is a flutz, and a fall is a fall. Minor disagreement could be possible, but minor. Not a discrepancy from +2 to -1 in GOE.

On the PCS side, SS and TR are (or could be made) very objective in my opinion. The balance of a program is very objective as well: a short program featuring three jump passes in the first minute is clearly less well constructed than one with a jump pass in the first minute, one in the second, and the third in the last 50 seconds. The same goes for the use of space (the rink), just to name a few examples.

If we decide that we don't mind subjectivity, then I see no point in CoP and I'd wish to go back to 6.0. In that case the weight of each judge was relative to the others, with at least 5 judges needed to 'agree' on a placement. It would be enough to smooth it down a tad more (for example, deleting the highest and lowest placements and requiring a majority of 4 on the remaining 7 placements) to further protect the skaters from any potential wrongdoing or judging mistake.
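To make the proposed variant concrete, here is a rough sketch of it in Python. The ordinals are hypothetical and `majority_placement` is my own helper; this is one possible reading of the rule (drop each skater's best and worst ordinal, then award the placement supported by at least 4 of the remaining 7 judges), not an actual ISU procedure.

```python
def majority_placement(ordinals):
    """Placement under the sketched 6.0-style variant for one skater."""
    kept = sorted(ordinals)[1:-1]  # drop the lowest and highest ordinal
    for place in range(1, max(kept) + 1):
        # a judge "supports" a placement if they ranked the skater there or better
        if sum(1 for o in kept if o <= place) >= 4:
            return place
    return max(kept)

print(majority_placement([1, 1, 1, 2, 1, 2, 3, 1, 2]))  # majority at 1st
print(majority_placement([2, 2, 1, 3, 2, 1, 3, 2, 4]))  # majority at 2nd
```

Under this scheme a single extreme ordinal is discarded outright, and two colluding judges still cannot outvote the remaining majority.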

As shown above, two 'agreeing' judges are enough to potentially sway the outcome of a competition in a dramatic way. And at the moment it seems a bit naive to me to attribute every strange deviation to cultural bias.

PostPosted: Thu Feb 27, 2014 5:07 am
by tennisfan
I think the whole point of COP was to ensure that skaters got credit for what they did, which is much easier to see with COP than with 6.0. Personal preference and subjectivity are always going to be part of the judging; you can't have judging without them, and they are a part of all sports. The difficulty with figure skating is that the subjectivity goes beyond "was the position held properly" (which can be a yes-or-no question, to an extent) to whether a skater's program was innovative or reflected the music. The biggest contribution of the system is that skaters can see, particularly on the technical side, exactly how they lost marks and can then take steps to address the issue. It is making the sport better, but it was never going to prevent controversial results.

PostPosted: Sat Mar 14, 2015 10:41 am
by Jessica
I prefer the Code of Points’ technical scoring to 6.0.
Artistry is in the eye of the beholder.
Personally I thought Yuna’s interpretation scores were a little high because she seemed kind of just there, not THERE as in IN THE MOMENT.
Why do people make a stink about this controversy? Why not Carolina’s 2013 Europeans? Why not Virtue and Moir’s Olympic gold in 2010? Why not Evgeni Plushenko’s 3 world titles and 4 Olympic medals that put me to sleep?

PostPosted: Fri Apr 10, 2015 11:14 pm
by Jessica
I’ll bet that the skaters’ reactions changed the judges’ reactions as well. When Adelina got done, she had that I-did-it kind of reaction. She was hopping around the rink and yes, putting some of the ice in her mouth (love her but she’s a nut). She was in tears over her performance, that’s how good it was for her.
When Yuna got done, she was like, “Another nice program from the queen” and that was it. In Vancouver she’d had an excited reaction, so she got higher marks. If the skater’s excited, so are the judges.
The same thing happened with Michelle Kwan and Tara Lipinski at the 1998 Nagano Olympics. Michelle did a beautiful program, but at the end her expression was just…relief. Relief to have skated clean. Tara was screaming and grinning and looking happy-go-lucky like Elena Radionova and it’s-the-best-skate-of-my-life. And, you guessed it: Tara won the gold medal.