Busy with exams/dissertation!

Well, I just had an exam today (Monte Carlo Methods) so I’ve not been updating. Also, my dissertation deadline is in a fortnight and it’s far from finished (yikes!)

I am, however, going to plug my first (first-author) publication: http://www2.warwick.ac.uk/fac/cross_fac/iatl/ejournal/issues/volume5issue1/ang/

This was based on the work that I did 2 years ago during summer.

Also, this article by Karl Friston has been brought to my attention, along with a response to it and a reply from Karl. Enjoy!

Some MATLAB Optimization

So I had a line of code:

>> tic;tmpmat1 = [repmat(-1,1,3) repmat(0,1,3) repmat(1,1,3)]';toc
Elapsed time is 0.000451 seconds.

MATLAB helpfully suggested using the zeros and ones functions instead:

>> tic;tmpmat1 = [-ones(1,3) zeros(1,3) ones(1,3)]';toc
Elapsed time is 0.000040 seconds.

Even multiplying by a constant in front doesn’t change things:

>> tic;tmpmat1 = [repmat(-pi,1,3) repmat(0,1,3) repmat(pi,1,3)]';toc
Elapsed time is 0.000604 seconds.
>> tic;tmpmat1 = pi*[-ones(1,3) zeros(1,3) ones(1,3)]';toc
Elapsed time is 0.000091 seconds.
>> tic;tmpmat1 = 1/3*[-ones(1,3) zeros(1,3) ones(1,3)]';toc
Elapsed time is 0.000080 seconds.
>> tic;tmpmat1 = [repmat(-1/3,1,3) repmat(0,1,3) repmat(1/3,1,3)]';toc
Elapsed time is 0.000245 seconds.
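
A single tic/toc reading at this scale is mostly noise, so a slightly fairer comparison averages over many repetitions. The following is only a rough sketch: nReps is an arbitrary repetition count that I have assumed, and the absolute timings will differ by machine and MATLAB version.

```matlab
nReps = 1e4;  % assumed repetition count, chosen arbitrarily

% Version 1: build the column vector with repmat
tic;
for k = 1:nReps
    a = [repmat(-1,1,3) repmat(0,1,3) repmat(1,1,3)]';
end
tRepmat = toc / nReps;

% Version 2: build the same vector with ones/zeros
tic;
for k = 1:nReps
    b = [-ones(1,3) zeros(1,3) ones(1,3)]';
end
tOnes = toc / nReps;

fprintf('repmat: %.3g s per call, ones/zeros: %.3g s per call\n', tRepmat, tOnes);
```

Both versions produce the same 9-by-1 vector, so only the construction cost differs.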

[Link] NBA shots analysed, spatially.

From: http://gizmodo.com/5892424/five-years-of-every-nba-shot-attempt-visualized

The original is at: http://www.kirkgoldsberry.com/courtvision.htm, the paper by Kirk Goldsberry is at http://www.sloansportsconference.com/wp-content/uploads/2012/02/Goldsberry_Sloan_Submission.pdf

This is pretty interesting and would definitely be of use to NBA coaches! One wonders if there exists sufficient data (I’m pretty confident Opta has the stats) to do so for football, and it will definitely be worth looking at. For example, one could back up whether or not Rooney is a good shooter from distance!

And of course, in other sports, such data if available would greatly help coaches/trainers spot weak/strong areas of a player or his/her opponent for more effective training.

 

A statistical analysis – really?

I try to support alternative media, but sometimes, I do wonder: http://theonlinecitizen.com/2012/03/paps-election-victories-a-statistical-analysis/

A friend linked this on Facebook this morning. Being in a hurry to go to lecture, I skimmed through it in 30 seconds, and spent the hour in lecture intermittently thinking about how dodgy it looked.

Firstly, inequalities in popular vote vs “seats” are not new. The United States Presidential Election uses the electoral college system, and in 2008, Obama won 365 out of 538 “seats” whilst only obtaining 52.9% of the popular vote. And in the United Kingdom, you often hear complaints from supporters of the Liberal Democrats about how their vote share does not translate to seats in parliament, hence the referendum on the Alternative Vote last year. But I digress. For those who are more interested in the mechanics of different voting systems, Tim Gowers has presented a pretty good analysis.

On to the “statistical analysis”.

The first assumption to be made is that if there are only (Single Member Constituencies) SMC’s, what will the probability be given that the PAP has only 60% of the votes yet wins more than 90% of the seats in parliament? Further assumptions are that there are 1million voters, 100 SMCs, and 10,000 voters in each SMC. For the PAP to win in a SMC, it has to have 5001 or more votes. This is a binomial cumulative density function. However, to calculative this distribution for large numbers in binomial distribution is erroneous. We can approximate this distribution using the gaussian distribution if the sample size is large and other conditions are fulfilled. After some calculation, the probability of the PAP having 5001 or more votes in an SMC is 0.6615, which looks reasonable.

Nothing wrong with having a simple model, and the assumptions are reasonable. So n = 10000 and p = 0.6 in the above model, and the normal approximation to the number of votes won has mean np = 6000 and variance np(1-p) = 2400. Plugging in 2400 as the standard deviation does reproduce the quoted 0.6615. Strictly, though, the standard deviation is the square root of 2400, about 49, and with that the probability of 5001 or more votes is essentially 1.

Using a normal approximation at all looks fair enough: no one is going to argue that n is not large enough, and p is certainly not far off from 0.5.
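
The arithmetic can be checked in a few lines of base MATLAB. This is a minimal sketch, assuming a continuity-corrected threshold of 5000.5 (my choice, not necessarily the article's); upperTail is a helper name introduced here for the upper tail of the standard normal. It evaluates the probability twice, once dividing by the standard deviation and once dividing by the variance, because the latter is what reproduces 0.6615.

```matlab
% Upper tail of the standard normal, P(Z >= z), using erfc from base MATLAB
upperTail = @(z) 0.5 * erfc(z / sqrt(2));

n = 10000;           % voters per SMC (from the article)
p = 0.6;             % assumed share voting for the PAP
mu = n * p;          % mean number of votes, 6000
v  = n * p * (1-p);  % variance, 2400

% Dividing by the standard deviation, sqrt(v): the usual normal approximation
pWinSd  = upperTail((5000.5 - mu) / sqrt(v));

% Dividing by the variance v instead: reproduces the quoted figure
pWinVar = upperTail((5000.5 - mu) / v);

fprintf('sd-scaled: %.6f, variance-scaled: %.4f\n', pWinSd, pWinVar);
```

The variance-scaled value comes out at about 0.6615, while the sd-scaled value is indistinguishable from 1.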

Given that the probability of victory in each SMC is only 66%, let us calculate the probability that PAP has 90 or more seats. Again using gaussian distribution as an approximation, the probability of winning the election with 90 or more seats is only 0.1434. This is not a very good chance, and would indicate that the PAP has been performing this spectacularly throughout our nation’s history. We can categorically reject the arguments that the opposition is weak and that they lack the people’s support et, because the truth is only 60% voted for the PAP. Of course there is the caveat that there are some votes that are void, but this is only a small amount.

Sure, this would be true IF the above model was close enough to reality. But this is the reality: the wards are pretty much unequal in size.
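
As an aside, the 0.1434 figure can be checked the same way. This is a rough sketch that takes the article's per-seat probability of 0.6615 and its 100 seats at face value, with no continuity correction (an assumption on my part); upperTail is a helper name introduced here.

```matlab
% Upper tail of the standard normal, P(Z >= z), using erfc from base MATLAB
upperTail = @(z) 0.5 * erfc(z / sqrt(2));

nSeats = 100;               % number of SMCs in the article's model
q = 0.6615;                 % per-seat win probability quoted by the article
mu = nSeats * q;            % mean number of seats won
v  = nSeats * q * (1 - q);  % variance of the number of seats won

% P(90 or more seats), scaling by the standard deviation
pSd  = upperTail((90 - mu) / sqrt(v));

% The same calculation, scaling by the variance instead
pVar = upperTail((90 - mu) / v);

fprintf('sd-scaled: %.3g, variance-scaled: %.4f\n', pSd, pVar);
```

Scaling by the variance gives about 0.1434, matching the article, whereas scaling by the standard deviation gives a probability well below one in a million. Either way, the point about unequal wards stands.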

Most first year university students doing a course in basic probability and statistics would ask the first question, where are the confidence intervals? Also, most first year university students doing a course in basic probability and statistics would probably have come across Simpson’s Paradox. One of the often quoted examples is the case of University of California, Berkeley being sued for gender discrimination against women in terms of admission. The figures quoted were 44% of 8442 men applying being admitted successfully, and only 35% of 4321 women applicants being successful.

However, a study by Bickel et al. found that no department was clearly biased against women; in fact, there was a slight bias in their favour! The original paper is here, or you can read about it on Wikipedia.

Let’s try to apply this to the elections. There were 15 GRCs and 12 SMCs in the 2011 general elections. For simplicity, let’s assume 14 GRCs and 10 SMCs, with 100,000 and 10,000 constituents each, for a total of 1.5 million. Assume the popular vote is split 60-40, so Party A gets 900,000 votes and Party B 600,000.

Scenario 1:
Each constituency gets the vote split 60-40, Party A wins all the seats.

Scenario 2:
Party A wins 49,000 votes in 10 of the GRCs, and 90,000 votes in the other 4. For the SMCs, in 9 out of 10 of them, Party A wins 4,900 votes each, and the remaining 5,900 votes are won in the last SMC.

So, if each GRC is worth 5 seats, and each SMC is worth 1, Party A would have won 21 seats out of 80, so Party B has won nearly three-quarters of the seats in parliament with just 40% of the popular vote!

Scenario 3:
Okay okay, I can hear some of you telling me the above is contrived. Let’s try instead then, for Party A, 85,000 votes in 2 of the GRCs, 80,000 votes in another 2, 75,000 votes in 3, and 40,000 votes in the other 7 GRCs. The remaining 65,000 votes are distributed among the SMCs, with 8,100 each in 5 of them and 4,900 each in the other 5. Neither side has the majority and the seats are split 40-40. This is not as unrealistic as it seems!
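
The three scenarios are easy to tally in MATLAB. This is a minimal sketch under the simplified setup above (14 GRCs of 100,000 voters worth 5 seats each, 10 SMCs of 10,000 voters worth 1 seat each); seatsA is a helper name introduced here. For Scenario 3 it assumes 4,900 votes in each of the five SMCs that Party A loses, which is what makes the total come to 900,000 (with 4,000 each the total would be 895,500).

```matlab
% Seats won by Party A: a majority takes the whole constituency,
% worth 5 seats per GRC and 1 seat per SMC
seatsA = @(grc, smc) 5 * sum(grc > 50000) + sum(smc > 5000);

% Scenario 1: a 60-40 split in every constituency
grc1 = 60000 * ones(1, 14);
smc1 = 6000 * ones(1, 10);

% Scenario 2: narrow losses in most places, landslides in a few
grc2 = [49000 * ones(1, 10), 90000 * ones(1, 4)];
smc2 = [4900 * ones(1, 9), 5900];

% Scenario 3: seven GRCs and five SMCs each way
% (assumes 4,900 in the five SMCs Party A loses)
grc3 = [85000 85000 80000 80000 75000 75000 75000, 40000 * ones(1, 7)];
smc3 = [8100 * ones(1, 5), 4900 * ones(1, 5)];

% Party A's total votes and seats in each scenario
votes = [sum(grc1) + sum(smc1), sum(grc2) + sum(smc2), sum(grc3) + sum(smc3)];
seats = [seatsA(grc1, smc1), seatsA(grc2, smc2), seatsA(grc3, smc3)];

disp([votes; seats]);
```

In all three scenarios Party A has exactly 900,000 votes, yet it ends up with 80, 21 and 40 of the 80 seats respectively.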

One other point to note is the heterogeneity of voting preferences across the country. For a particular ward, it is not necessarily true that on average you will get a 60-40 split. Determining the average proportion is not straightforward, and usually involves exit polling, which is practically illegal under Singapore law. One could estimate it from past voting results, but given the changes in voter trends and the redrawing of electoral boundaries, this is not easy. I have barely scratched the surface of the subject of statistical analysis in elections and I have to admit that I have not read much of the existing literature on the topic.

A small point: I once read somewhere (unfortunately I’ve forgotten where) that there is always a baseline proportion of voters voting for a particular party. The same source quoted 20% each way in this particular case. One would have to remove the relevant part of the tails. (In fact, approx. 25% of the normal distribution in the first model lies in this region when 2400 is used as its standard deviation, as in the 0.6615 calculation; with a standard deviation of about 49, essentially none of it does.) Hence, using a normal might not be so good an approximation as hoped for in a model.

The original article makes the really huge claim that “there is quite a gap between the number of seats they should have won and the actual outcome”, but I would be sceptical that “using these simple sets of assumptions” would give such a conclusive result. In fact, the probability quoted should be treated as junk, and not indicative of anything at all.

As a sidenote: would putting up a screenshot of my MATLAB output (or I could have used R as well) in a nice purple background get me published in TOC?

Update 14 March 2012 11:55:

In case anybody asks why I did not email the TOC, I did. Maybe I should have linked them to here. Here is the reply and my email to them. I am not impressed.

theonlinecitizen toc <theonlinecitizen@gmail.com> Tue, Mar 13, 2012 at 13:08
To: Shen Ting Ang <angshenting@gmail.com>
Thanks Shien Ting for your feedback. You may leave this same feedback at the comments section of that article. We also welcome you to write for us an article with more accurate statistics.
Regards.
On Tue, Mar 13, 2012 at 6:37 PM, Shen Ting Ang <angshenting@gmail.com> wrote:

Dear Editor,

The author makes the following conclusions:
“A parsimonious hypothesis is that somehow voters consistently voted 60-40 for the PAP in each constituency especially in the GRCs so that the PAP scores an overwhelming victory. This disconnect between the people’s will and the election outcome can only be attributed to the fact that the elections are unfairly skewed towards the PAP. Whether there is intent or an unfortunate coincidence is not clear, but it is the responsibility of Parliament to form a committee to look into this and to perhaps level the playing ground so that democracy may take a big step forward.
One may also ask, given that the PAP has the probability of 0.5 to win over 90 seats, what percentage of votes should they have? By using reverse calculation (inverse error function), they should enjoy 0.7441 of votes. If they are to win by that margin with a 75% probability, they should enjoy 0.7719 of votes. Clearly, using these simple sets of assumptions, there is quite a gap between the number of seats they should have won and the actual outcome. “
It is factually incorrect to say this. The author has used a model which is simplified and bears no resemblance to reality. Whilst the model itself is useful in pointing out the flaws of the First-Past-The-Post voting system in general, it does not imply that the elections are unfairly skewed.
It is unclear what the author is trying to say or conclude, but one leaves with an impression that he is asserting the elections are unfairly skewed on the basis of his simple model. Given how his model fails to consider (1) the actual sizes of the constituencies (which can lead to Simpson’s Paradox) (2) different voting proportions in each constituency in reality and (3) the limitations of using a normal approximation as he has suggested, I feel that at least ample warning should be given regarding the interpretation of the results. In fact, I would go as far as to say that the results he has obtained are not informative in any way.
Lastly, the author mentions “forecasting software” in his last paragraph. Any such “forecasting software” will be based on statistical methods published in academic journals, which are freely available to anyone who buys the relevant subscriptions. The point being missed is the fact that polling was done beforehand and not made public.
As a final year student majoring in statistics, I am concerned that articles such as this are being published and misinforming the general public. I personally like the idea and direction that TOC stands for, but I am disappointed that TOC has published an article containing so many inaccuracies which only mislead, rather than educate the general public.
Regards,
Shen Ting

 

 

 

Transport fare increases: poor understanding/presentation?

This is a rather old topic, but I think there are some themes which crop up now and then.

Back in July 2010, the Land Transport Authority (of Singapore) made the following announcement about transport fares: http://app.lta.gov.sg/corp_press_content.asp?start=x3neq0v8k3f2u78mq76fu4i322vziz6ued16cw5iki8yv7k6pz

This news article highlighted one of the main selling points of the new fare structure: http://news.xin.msn.com/en/singapore/article.aspx?cp-documentid=4277276

Dr Lim added that the PTC’s projected impact analysis showed that 63 per cent of all commuters would see fare savings in their weekly public transport spending under the new system.

These commuters would save an average of 48 cents a week (or $25 a year).

Of the 34 per cent of commuters who see an increase, the average increase is 31 cents a week (or $16 a year).

Also:

Mr Lim said this means that in all, operators will bear a permanent reduction in fare revenue of about $88 million per year.

A week later, there was a poll thread in the (in)famous Hardwarezone forums. Of course, the results of such polls should be taken with a pinch of salt, but 80% of the respondents reported an increase in the fares they paid.

Surely this does not quite add up?

First of all, the people who saw fare increases saw mostly increases of less than 10 cents. Those that did enjoy a decrease in their fare saw decreases of more than 10 cents or in some cases, more than 20 cents. This was confirmed by a friend of mine who works in LTA. So it is quite possible that the transport operators are earning less overall.

Now, what about the claim that 63 percent of commuters would see their fare decrease instead? I have no quick and easy answer to this question. One might be inclined to suspect that perhaps some double-counting took place when considering the commuters doing transfers, since these are the ones enjoying fare decreases from the new Distance-based fares.

Unfortunately, it was quite widely touted that this adjustment was a fare decrease, and this led to angry reactions when people found out they were paying more for their daily commute. So while it was quite rightly a decrease in fare revenue for the operators, a significant number of commuters did not see a decrease. This was actually described in both links given above, but the angry crowd either failed to see it or was not given adequate communication about it!

Eventually, some of the angry crowd realised belatedly that an average fare decrease does not equal a fare decrease for everyone. Two years later, I suspect most have come to accept the merits of the “new” system.
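
The gap between an average decrease and a decrease for everyone can be seen from the quoted PTC figures alone. This is a minimal sketch, assuming that the remaining 3 per cent of commuters (those neither in the 63 per cent nor the 34 per cent) see no change, which the article does not state.

```matlab
% Shares of commuters: saving, paying more, unchanged (last one is assumed)
share  = [0.63 0.34 0.03];

% Average weekly change in cents for each group (negative means a saving)
change = [-48 31 0];

% Average weekly change per commuter across the whole population
avgChange = sum(share .* change);

fprintf('Average change: %.1f cents per week\n', avgChange);
```

On these figures the average commuter pays about 19.7 cents less per week, even though roughly a third of commuters pay more.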