A terrible use of data

Last week, I got into a bit of a heated discussion with an admin on the Facebook page of Vegonews[1]. They had shared a graph, attributed to the website www.diseaseproof.com (though I’ve been unable to find it there), which I think is clearly designed to suggest a causal relationship that the data simply does not show.

Here is the graph:

Graph attributed to diseaseproof.com which shows percentage of calories from unrefined plant foods and percentage of deaths from heart disease and cancer for the countries Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, “Korea”, Thailand and Laos. It appears to show that as plant food consumption increases, the risk of dying from heart disease and cancer decreases.


Apart from the scaremongering “KILLER DISEASES” title, the first thing that struck me upon looking at this graph was that the countries on the left generally have a much higher living standard than those on the right, so people in those countries probably live longer, and thus are more likely to develop diseases such as heart disease and cancer, which tend to affect more people later in life. But that was just a hunch, and if I’m critiquing someone else’s use of data, I should probably have my own to counter with. So I headed over to http://esds.ac.uk/international/ and opened up the World Bank macro dataset “World Development Indicators[2]”. After about five minutes of selecting and downloading data, I had the following information:

Country             Life expectancy at birth (years)    GDP per capita, PPP (2005 international $)[3]
Hungary             74                                  16,958
United States       78                                  42,297
Belgium             80                                  32,808
Sweden              81                                  33,771
Finland             80                                  31,493
Portugal            79                                  21,660
Venezuela, RB       74                                  10,973
Greece              80                                  24,206
Mexico              77                                  12,441
Korea, Dem. Rep.    69                                  ..
Korea, Rep.         81                                  27,027
Thailand            74                                  7,673
Lao PDR             67                                  2,288

As you can see, I also included GDP for comparison. GDP as an indicator of development is massively abused, and is something I think we should be moving away from as much as possible[4], but for a quick exercise such as this, I think it is an acceptable shorthand for “can the average person afford to get enough food?”

I’ve also included both Koreas, as the original graph-designer somehow, astonishingly, neglected to specify which one they meant. Is it the famously secretive, dictatorial North Korea, with a life expectancy of 69 and not enough data for the World Bank to estimate its GDP (though Wikipedia handily estimates it at $2.4k per capita)? Or is it the democratic, high-standard-of-living South Korea, where you can expect to live to the ripe old age of 81?

Here’s my graph:

Graph showing life expectancy and GDP for the same countries as the previous graph,  Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, Korea (South and North), Thailand and Lao. There is generally a higher GDP and life expectancy for the countries on the left, but no strong trend.


Not the most conclusive graph in the world, but then I would say the same for the original, and sadly I’m sure there are many people who took it at face value. I took a couple of quick averages, splitting the countries into left-of-Greece (where we eat too few unrefined plant foods and die of heart disease and cancer) and right-of-Greece (where we eat nothing but vegetables and nobody gets cancer!). I excluded both Koreas from this.

Left average life expectancy = 78, GDP = $27k

Right average life expectancy = 73, GDP = $7k
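
For anyone who wants to check the arithmetic, here is a minimal Python sketch of those two averages. The figures are copied from the table above. The variable and function names are just illustrative, and I’m assuming Greece itself sits in neither group, since the split above doesn’t say either way.

```python
from statistics import mean

# (life expectancy in years, GDP per capita PPP in 2005 international $),
# copied from the World Bank table above. Both Koreas are excluded.
# Assumption: Greece is the dividing line and belongs to neither group.
left_of_greece = {
    "Hungary": (74, 16958),
    "United States": (78, 42297),
    "Belgium": (80, 32808),
    "Sweden": (81, 33771),
    "Finland": (80, 31493),
    "Portugal": (79, 21660),
    "Venezuela, RB": (74, 10973),
}
right_of_greece = {
    "Mexico": (77, 12441),
    "Thailand": (74, 7673),
    "Lao PDR": (67, 2288),
}


def group_averages(group):
    """Return (mean life expectancy, mean GDP per capita) for a group."""
    life = mean(values[0] for values in group.values())
    gdp = mean(values[1] for values in group.values())
    return life, gdp


left_life, left_gdp = group_averages(left_of_greece)
right_life, right_gdp = group_averages(right_of_greece)

print(f"Left:  life expectancy {left_life:.0f}, GDP ${left_gdp / 1000:.0f}k")
print(f"Right: life expectancy {right_life:.0f}, GDP ${right_gdp / 1000:.0f}k")
```

These are unweighted averages of country-level figures, so a country of ten million counts the same as one of three hundred million. That is fine for a back-of-the-envelope comparison like this one, but not for much more.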

So the question becomes – would you rather die of heart disease at 78, or of something else (starvation, diarrhoea, pneumonia) at 73?

But the thing that most grates about this graph is the apparently random selection of countries. If you have enough data points (e.g. countries) then you can select the ones you want to make a relationship look like it exists where it doesn’t. So I undertook a similar exercise, and downloaded data for all 220 countries available from the World Bank on forested area (as a percentage of total land area) and risk of maternal death (% over a lifetime). And behold, I have found a terrible relationship! We must plant trees in order to save the poor mothers!

Graph of forested area as a percentage of total land area and likelihood of maternal death for the countries Brazil, Peru, Panama, Ecuador, Vanuatu, Nepal, India, Madagascar, Ghana, Burkina Faso, Rwanda, Uganda and Mali. The trends appear to show that as forested area decreases, the risk of maternal death increases.


(I’d like to say that I didn’t spend a lot of time on this graph, but that would be a lie. It’s actually quite engrossing seeing what you can do once you decide your intention is to abuse the data).
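
The same trick can be shown with entirely made-up numbers. The sketch below is not the World Bank data and not how I picked my thirteen countries by hand. It is a hypothetical illustration: 220 random points with no relationship between the two variables, from which a simple rule selects 13 that fall neatly along a downward line. The function names, the random seed and the selection rule are all my own assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def cherry_pick(points, k):
    """Pick k points that hug a falling line, one from each slice of the x-range."""
    ordered = sorted(points)
    size = len(ordered) // k
    chosen = []
    for i in range(k):
        bucket = ordered[i * size:(i + 1) * size]
        target = 1 - (i + 0.5) / k  # the y-value the fake trend "wants" here
        chosen.append(min(bucket, key=lambda p: abs(p[1] - target)))
    return chosen


# 220 hypothetical "countries": x and y are drawn independently,
# so any relationship between them is pure chance.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(220)]

picked = cherry_pick(points, 13)
r_all = pearson(*zip(*points))
r_picked = pearson(*zip(*picked))

print(f"All 220 points:    r = {r_all:+.2f}")
print(f"Cherry-picked 13:  r = {r_picked:+.2f}")
```

With all 220 points the correlation should sit close to zero, while the hand-picked 13 show a strong negative one. Nothing about the underlying data changed, only which points were put on the graph.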

Edited to add:

BadgerBrian points out in the comments that a scatterplot can be a much better visual tool for identifying whether there is a relationship between two variables. His graph here of the GDP and life expectancy of the 12 countries originally mentioned demonstrates this well: there is a strong positive correlation between increasing GDP and increasing life expectancy up until about $25k, then it flattens out (and the USA does a stellar job of having very high income and quite underwhelming life expectancy!) I’ve not been able to get the graph to display in the comments so here it is:

Scatterplot of GDP vs life expectancy for Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, Thailand and Lao. Graph shows positive correlation between the variables up until $25k, where the relationship flattens out
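
As a rough numerical companion to the scatterplot, here is a small Python sketch that computes the correlation between GDP and life expectancy separately for the countries below and above $25k. The figures are from my table above (both Koreas left out, as in the scatterplot). The helper names are illustrative, and splitting exactly at $25,000 is my assumption about where "about $25k" falls.

```python
from math import sqrt

# (GDP per capita PPP in 2005 international $, life expectancy in years),
# copied from the World Bank table earlier in the post.
countries = {
    "Hungary": (16958, 74),
    "United States": (42297, 78),
    "Belgium": (32808, 80),
    "Sweden": (33771, 81),
    "Finland": (31493, 80),
    "Portugal": (21660, 79),
    "Venezuela, RB": (10973, 74),
    "Greece": (24206, 80),
    "Mexico": (12441, 77),
    "Thailand": (7673, 74),
    "Lao PDR": (2288, 67),
}


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (spread_x * spread_y)


THRESHOLD = 25000  # assumed cut-off for "about $25k"
below = [pair for pair in countries.values() if pair[0] < THRESHOLD]
above = [pair for pair in countries.values() if pair[0] >= THRESHOLD]

r_below = pearson(below)
r_above = pearson(above)

print(f"Below $25k ({len(below)} countries): r = {r_below:+.2f}")
print(f"Above $25k ({len(above)} countries): r = {r_above:+.2f}")
```

With only seven countries in one group and four in the other, these coefficients are illustrative at best. In the upper group in particular, a single country (the USA) can swing the result.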



[1] “Vegonews is about sharing the most up to date information on veg*nism, and to spread the immense benefits of a healthy and natural lifestyle.” http://vegonews.com/

[2] World Bank (2012): World Development Indicators (Edition: April 2012). ESDS International, University of Manchester. DOI: http://dx.doi.org/10.5257/wb/wdi/2012-04

[3] In my original comment on Facebook, I used GDP per capita, constant 2000 US$. I’ve changed this to PPP (Purchasing Power Parity), where the dollar amount of GDP is adjusted to reflect how much it actually costs to afford certain products in that particular country.

[4] For example, I argued at an Oxfam meeting last year that income should only be used as a measurement of a broader dimension, “livelihoods”, rather than be a dimension itself, and will hopefully be using that in the framework for my PhD.

Proxies

Two contrasting topics this week have had me thinking about measuring change, which reminded me of the eloquent arguments given by Ben Goldacre* about the use of proxies in medical research, and have led me to think about their use in other areas.

Firstly, I saw a talk about the inherent complexity in development activities and research. At a fundamental level, we really do not know what policies and activities work best to alleviate poverty, but institutions are so set in how they function that it is hard to do things in new ways. So we identify problems that we can solve (e.g. the lack of bed nets for keeping out mosquitoes) rather than face up to the ones that are more complicated (how to eradicate malaria). By focusing on targeted issues, we can feel that we are making progress, even though our ultimate goal (improved health and wellbeing) is too amorphous to measure.

Then I read this blog post about Rape Crisis Scotland promoting their Reclaim the Night march with banners saying “Women are not for sale”**. That post linked to this blog about the new model in Sweden which is aiming to eventually stigmatise the purchase of sex to the point of eradicating the sex industry. When I look around (the internet, mostly) it seems that many younger feminists (3rd/4th wave) are calling for less prohibition around sex-work, which is in direct opposition to the tide of policy. It occurred to me, again, that perhaps what we have here is an issue of proxies. All feminists want to end violence against women, but there is no known way to do that unilaterally. Women who are sex workers are at greater risk of violence, so prohibition seems like a way to reduce that violence. But really, the thing that we actually want to reduce is patriarchy (well, kyriarchy, but this is a very gendered issue), and violence against women is just one indicator of that. The prevalence of sex-work is a proxy indicator, used because of the correlation between sex-work and violence against women.

The use of proxy indicators is easy, and in some cases essential, when we cannot directly measure the thing that we want to change. But there is a real danger of actually changing our actions, in order to fit within the framework of what can be measured. The ONLY way to really know how to bring about change that we want is to come up with a bunch of ideas, randomise the end-points, try out the ideas, and see what works. This is much harder in social science than medicine of course, but it’s still the only way.


*We want to know about morbidity and mortality, but this takes too long to measure and is affected by too many things, so we measure proxy indicators like blood pressure. But drugs which affect those proxy measures still may not affect the thing we really want, in the way that we want.

**Not all sex-workers are women, obviously. But most violence in sex-work is against women, by men.