As promised in a prior post, I’ve finished the data entry for the Kuchikomi (クチコミ総選挙) election predictions. The Kuchikomi election prediction algorithm was created by hottolink (ホットリンク) and is based on internet chatter (口コミ). The algorithm parses information on the internet, from mass media to blogs, about individual politicians and parties, and produces a predicted vote percentage calibrated on historical data. You can, I guess, think of it as a souped-up Google AdPlanner:
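“We collect chatter about each individual candidate and chatter about the candidate’s party, analyze how much each one affects the vote share based on past national elections, and build a prediction model; the value it computes is the predicted vote share.”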
In terms of very general results, Kuchikomi got about 80% of the single district seats correct (mind you, they didn’t calculate the proportional representation seats, which is a whole other issue). Both the Asahi and Nikkei newspapers predicted more than 94% of the single district seats correctly. But considering Kuchikomi is still new technology, 80% isn’t too bad. As you’ll see, though, there are some major weaknesses.
Party Bias
First of all, Kuchikomi did really badly for minor parties (namely: YP = Your Party (みんなの党), SDP = Social Democratic Party (社民党), PNP = People’s New Party (国民新党)) and independents (= I). See the graph below, where I plotted the predicted vote percentage against the actual vote percentage of each candidate, separated by party:

A perfect prediction would fall on a one-to-one line. But for the aforementioned minor parties and independents, Kuchikomi predicted very low vote shares while the actual results were considerably higher (don’t worry about the mysterious acronyms, some of them are uber-minor parties). In the description of their algorithm, they mention that they gather information on both the candidate and the party. But it looks like they put more weight on the party. For example, Your Party has former LDP members who were fairly well-known before the election. But because Your Party only formed as a political party three weeks before the election, Kuchikomi failed to gather sufficient information; thus, the low predictions. The leader (Watanabe Yoshimi) was predicted highly, but they had only two candidates in that district. Kamei Shizuka, the leader of the People’s New Party, was predicted to get 15% of the vote, but he actually got 60%; a semi-educated human being would have predicted that much better. In a similar vein, Kuchikomi just sucks for independents. Either the media just doesn’t like to talk about them, or again, Kuchikomi doesn’t weigh the individual as much as the party.
Let’s look a little closer at the two main players: the DPJ and the LDP. The following graph is basically a mash-up of the previous graph, where the red dots are the DPJ and the blue ones are the LDP:

A one-to-one black line is drawn through the plot: anything above it is an underestimated prediction and anything below it is an overestimated one. The DPJ estimates are generally underestimated, and more strongly so at lower prediction percentages. In general, the variance of the actual vote percentage is higher at lower prediction percentages. The paradox here is that even though Kuchikomi underestimated the DPJ results in terms of percentages, it actually overestimated the number of seats that the DPJ won.
So what is going on? Well, it is hard to tell. One would think that underestimating the percentages would lead to underestimating the number of seats. This parallel can break if the prediction screws up big (overestimates) for a few seats while underestimating moderately for the rest. I decided to plot the movements (the difference between predicted percentage and actual percentage) for seats that were won by either the LDP or the DPJ and were predicted to be won by either the LDP or the DPJ. Of course there are 4 cases: predict LDP -> observe LDP (upper right), predict LDP -> observe DPJ (upper left), predict DPJ -> observe DPJ (bottom left) and predict DPJ -> observe LDP (bottom right):
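To convince myself that the paradox is arithmetically possible, here is a minimal Python sketch with five made-up districts (the numbers are hypothetical, not taken from the Kuchikomi data, and the function names are my own). It builds the same four-way predicted-vs-observed table and compares the mean DPJ residual with the seat counts:

```python
from collections import Counter

# Hypothetical toy districts, NOT real data:
# (predicted DPJ %, predicted LDP %, actual DPJ %, actual LDP %)
districts = [
    (48.0, 40.0, 52.0, 41.0),  # predicted DPJ, observed DPJ
    (47.0, 41.0, 50.0, 43.0),  # predicted DPJ, observed DPJ
    (45.0, 42.0, 46.0, 51.0),  # predicted DPJ, observed LDP
    (44.0, 43.0, 45.0, 55.0),  # predicted DPJ, observed LDP
    (40.0, 46.0, 41.0, 52.0),  # predicted LDP, observed LDP
]


def winner(dpj, ldp):
    """Return the party with the larger vote share."""
    return "DPJ" if dpj > ldp else "LDP"


def four_way_table(rows):
    """Count districts by (predicted winner, observed winner)."""
    return Counter((winner(pd, pl), winner(ad, al)) for pd, pl, ad, al in rows)


def mean_dpj_residual(rows):
    """Mean of actual minus predicted DPJ share; positive means underestimated."""
    return sum(ad - pd for pd, _, ad, _ in rows) / len(rows)


table = four_way_table(districts)
predicted_dpj_seats = sum(n for (pred, _), n in table.items() if pred == "DPJ")
actual_dpj_seats = sum(n for (_, obs), n in table.items() if obs == "DPJ")

print(table)
print("mean DPJ residual:", mean_dpj_residual(districts))
print("DPJ seats predicted vs actual:", predicted_dpj_seats, actual_dpj_seats)
```

In this toy case the DPJ share is underestimated in every district, yet the DPJ seat count is overestimated, because the LDP share is underestimated far more in the close races.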

The red dots show the predicted percentages of the DPJ and the LDP. The arrow protruding from each red dot “moves” to the actual vote percentages. So if an arrow goes straight up, the prediction underestimated the LDP, and if an arrow goes to the right, the prediction underestimated the DPJ (and vice versa). We see that the red dots sit much closer to the black one-to-one line than the actual votes do. This indicates that the predictions are a lot flatter than the actual results, maybe due to lack of information. The arrows on the bottom right graph (predicted DPJ seat but actually went to the LDP) go straight up. This of course means that the DPJ prediction was fine, but it fucked up on the LDP results big time.
Now who were the candidates for the bottom right graph? Here is the list:
Aomori 2 “Eto Akinori”
Aomori 3 “Ooshima Tadamori”
Aomori 4 “Kimura Taro”
Chiba11 “Mori Eisuke”
Chiba12 “Hamada Yasukazu”
Ehime 1 “Shiozaki Yasuhisa”
Ehime 4 “Yamamoto Kouichi”
Fukui 1 “Inada Tomomi”
Fukui 2 “Yamamoto Taku”
Fukui 3 “Takagi Tsuyoshi”
Fukuoka 7 “Koga Makoto”
Gifu 2 “Tanahashi Yasufumi”
Gunma 4 “Fukuda Yasuo”
Hiroshima 1 “Kishida Fumio”
Hokkaido 7 “Itou Yoshitaka”
Ibaraki 4 “Kajiyama Hiroshi”
Ishikawa 2 “Mori Yoshirou”
Kagoshima 2 “Tokuda Takeshi”
Kagoshima 4 “Ozato Yasuhiro”
Kagoshima 5 “Moriyama Hiroshi”
Kanagawa 2 “Suga Yoshihide”
Kanagawa11 “Koizumi Shinjiro”
Kanagawa15 “Kouno Taro”
Kouchi 1 “Fukui Teru”
Kouchi 2 “Nakatani Gen”
Kouchi 3 “Yamamoto Yuuji”
Kumamoto 3 “Sakamoto Tetsushi”
Kyoto 5 “Tanigaki Sadakazu”
Mie 5 “Mitsuya Norio”
Miyazaki 2 “Etou Taku”
Nara 4 “Tanose Ryoutarou”
Okayama 1 “Aisawa Ichirou”
Okayama 5 “Katou Katsunobu”
Tokushima 3 “Gotouda Masazumi”
Tokyo17 “Hirasawa Katsuei”
Tottori 2 “Akazawa Ryousei”
Wakayama 3 “Nikai Toshihiro”
Yamaguchi 1 “Koumura Masahiko”
I’m not sure anything general can be derived from this list. I do however see former Prime Ministers (Mori, Fukuda, Koizumi (well, his daddy..)) and pols who appear on TV frequently (Kouno, Hirasawa, Gotouda) in this list. This might indicate a big-name bias that can’t be detected from more neutral mass-media articles (although I would think blogs would give some of this info). It could also mean that the LDP really pushed certain candidates in its last rush (many of the big names also hold high positions).
How about the graph on the top right? The graph shows again the same movement: no bias for the DPJ but large underestimation for the LDP.
Fukuoka8 “Asou Tarou”
Gifu4 “Kaneko Kazuyoshi”
Shimane1 “Hosoda Hiroyuki”
Tochigi5 “Motegi Toshimitsu”
Tottori1 “Ishiba Shigeru”
Yamaguchi3 “Kawamura Takeo”
Yamaguchi4 “Abe Shinzou”
Former PMs (Abe, Asou), TV personas (Ishiba) and pols that made frequent media appearances (Kawamura, Hosoda).
Geo Effects
Another thing that I looked at was prediction performance by prefecture and district. Each district has its own color, which may influence the media and blog chatter that ends up on the internet.
I first created an error measure as follows:

This basically gives me a proxy for how “bad” the prediction went. I put an exponential simply because I wanted to penalize mistakes at higher prediction percentages. The squared difference is to penalize large mistakes. What I did was calculate this measure for each candidate of each district. Then I can take aggregate statistics to explore how bad the predictions went given the location and the candidate.
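For readers who want something concrete to play with, here is a minimal Python sketch of one error measure that fits the description above: a squared difference weighted by an exponential of the predicted share. The exact functional form, the name `error_measure`, and the choice of fractions in [0, 1] as units are all assumptions for illustration, so treat it as a stand-in rather than the formula I actually used.

```python
import math


def error_measure(predicted, actual):
    """Hypothetical error measure (an assumed form, not the original formula).

    predicted, actual: vote shares as fractions in [0, 1].
    The squared difference penalizes large misses; the exponential weight
    makes the same miss cost more when the predicted share is higher.
    """
    return math.exp(predicted) * (actual - predicted) ** 2


# Same 10-point miss, but the higher prediction is penalized more.
print(error_measure(0.6, 0.5), error_measure(0.2, 0.1))
# Same prediction, larger miss is penalized more.
print(error_measure(0.2, 0.5), error_measure(0.2, 0.3))
```

Any monotone weight in the predicted share would behave similarly; the exponential is just one convenient choice.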
In this case, I simply took the mean across districts and the candidates to arrive at a Prefecture-wide value. I can plot this (thanks Prof. Aoki (青木繁伸)) on a Japanese map to neatly see this information at a glance:
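The aggregation step can be sketched in a few lines of Python. This assumes a two-stage mean (candidates within a district first, then districts within a prefecture), which is one reading of the description above; the record layout, the function name `prefecture_means`, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def prefecture_means(records):
    """Average per-candidate errors up to one value per prefecture.

    records: iterable of (prefecture, district, error) tuples, one per candidate.
    Stage 1 averages candidates within each district; stage 2 averages
    the district means within each prefecture.
    """
    by_district = defaultdict(list)
    for prefecture, district, error in records:
        by_district[(prefecture, district)].append(error)

    by_prefecture = defaultdict(list)
    for (prefecture, _district), errors in by_district.items():
        by_prefecture[prefecture].append(mean(errors))

    return {pref: mean(vals) for pref, vals in by_prefecture.items()}


# Hypothetical toy errors, not real data.
toy_records = [
    ("PrefA", 1, 0.75),
    ("PrefA", 1, 0.25),
    ("PrefA", 2, 1.0),
    ("PrefB", 1, 0.125),
    ("PrefB", 1, 0.375),
]
toy_result = prefecture_means(toy_records)
print(toy_result)
```

Note that a two-stage mean weights every district equally regardless of how many candidates ran in it, which is part of the loss of detail that comes with averaging twice.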

This basically shows poorly performing predictions as a darker color. Is there some sort of pattern here? Again, it is hard to see. The error measure that I picked was fairly arbitrary, and I take the mean a couple of times anyway, which loses a lot of detail. One thing though: Miyazaki had easily the largest error among the prefectures. Could this be caused by media crowding? Meaning, because the Governor caused a huge ruckus, maybe the information that got out about the election itself was drowned out.
A nice thing about these measures is that we can play with them easily by correlating them with extraneous variables. Because I took prefecture-wide statistics, I can regress them on, say, prefecture-wide population statistics:

We obviously see something interesting here. The prefectures with high populations contained their error a lot better than the countryside. Error measures higher than 0.5 were observed only in low-population prefectures (Akita, Gifu, Kagoshima, Kumamoto, Miyazaki, Okinawa, Ooita, Shimane, Tochigi, Toyama, Yamagata), with the exception of Saitama. Can this be called an urban bias? I am guessing this happens because populated areas have more media outlets, a larger educated populace and a stronger dependence on government policies.
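The regression itself is just a straight-line fit. Here is a minimal Python sketch using ordinary least squares written out by hand; the helper name `ols_fit` and the toy numbers are hypothetical and only there to show the mechanics, not to reproduce the plot.

```python
def ols_fit(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy values, not real data:
# population in millions and the prefecture-wide mean error.
population = [1.0, 2.0, 3.0, 4.0]
mean_error = [0.8, 0.6, 0.4, 0.2]

slope, intercept = ols_fit(population, mean_error)
print(slope, intercept)
```

A negative slope is what an urban bias would look like in this setup: the bigger the prefecture, the smaller the mean error. With real data it would probably be worth trying log population as well, since prefecture sizes are heavily skewed.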
Of course this is all very general, but I decided to stop here since my lower back is hurting. But it is easy to see that further analyses can be done.
Last but not least: Gender
A quickie on prediction performance according to gender. Here is a graph that plots actual vote percentage minus predicted vote percentage, separated into the DPJ and the LDP.

Female politicians sitting above the zero line means that their predictions were underestimated. This is definitely a good thing, since Kuchikomi has stated that they use regression results from past elections to get their predictions, meaning this is slight evidence that people are more likely to vote for women than in the past. Then again, the predictions for LDP women were spot on, so it might just be a DPJ thing. The women in general didn’t perform as well as the men.
ah, So?
Aside from being a statistical exercise, what was the reason for doing this? Once the election is over, who cares about this retrospective I-predicted-better-than-you chest-thumping?
The obvious benefit is that we learn a few things about the real world. Any prediction is, at best, an abstract extrapolation of the real world, and depending on how things are set up, it’s a good learning process to articulate what is happening. In this case, the Kuchikomi data was based on information from the internet. If there are biases like the ones I mentioned (flattening, urban, big-name, gender, party, etc.), then even though they are bad for prediction performance, these imperfections tell us something about the elections.
The most important imperfection, IMO, was that the LDP was underestimated by a fair amount (not that the DPJ was underestimated). This is somewhat surprising considering that Kuchikomi was using much of the data from historical elections where it was always the case that the LDP won. For whatever reason, certain LDP politicians just did way better in real life. I called it big-name bias, but it could also mean that the Regime Change (政権交代) meme was played so much in the media that Kuchikomi bought into the hype. Conversely, it could mean that Kuchikomi got it right, but the media overplaying the Regime Change meme influenced voters more than anything online. These tidbits you would not learn from opinion polls because opinion polls grab data from real people instead of information from the internet.
Another reason we should care about predictions is data checking. Recently, Nate Silver of FiveThirtyEight accused Strategic Vision (a polling company) of having some shoddy, suspicious polling (Nate has huge balls doing this alone). Predictions can be applied to different datasets, and they usually output a general trend. When one dataset departs from the trend, that gives some reason to look into information rigging. This came in handy in the recent Iranian election.
Anyways, for me, the biggest coup was my new ability to graph Japan in R. I also have maps for the districts (but I can’t show both the whole country and the districts at the same time); thus, now I can create pretty pictures on prefectural character (県民性).
Feel free to contact me if interested in the data, code or the map.