A terrible use of data

Last week, I got into a bit of a heated discussion with an admin on the Vegonews[1] Facebook page. They had shared a graph, attributed to the website www.diseaseproof.com (though I’ve been unable to find it there), which I think is clearly designed to suggest a causal relationship that the data simply does not show.

Here is the graph:

Graph attributed to diseaseproof.com which shows percentage of calories from unrefined plant foods and percentage of deaths from heart disease and cancer for the countries Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, “Korea”, Thailand and Laos. It appears to show that as plant food consumption increases, the risk of dying from heart disease and cancer decreases.

Apart from the scaremongering “KILLER DISEASES” title, the first thing that struck me upon looking at this graph was that the countries on the left generally have a much higher living standard than those on the right. People in those countries probably live longer, and are thus more likely to develop heart disease and cancer, which tend to affect people later in life. But that was just a hunch, and if I’m critiquing someone else’s use of data, I should probably have my own to counter with. So I headed over to http://esds.ac.uk/international/ and opened up the World Bank macro dataset “World Development Indicators[2]”. After about five minutes of selecting and downloading data, I had the following information:

Country             Life expectancy at birth (years)   GDP per capita, PPP (2005 international $)[3]
Hungary             74                                 16,958
United States       78                                 42,297
Belgium             80                                 32,808
Sweden              81                                 33,771
Finland             80                                 31,493
Portugal            79                                 21,660
Venezuela, RB       74                                 10,973
Greece              80                                 24,206
Mexico              77                                 12,441
Korea, Dem. Rep.    69                                 ..
Korea, Rep.         81                                 27,027
Thailand            74                                  7,673
Lao PDR             67                                  2,288

As you can see, I also included GDP for comparison. GDP as an indicator of development is massively abused, and is something I think we should be moving away from as much as possible[4], but for a quick exercise such as this, I think it is an acceptable shorthand for “can the average person afford to get enough food?”

I’ve also included both Koreas, as the original graph-designer somehow, astonishingly, neglected to specify which one they meant. Is it the famously secretive, dictatorial North Korea, with a life expectancy of 69 and not enough data for the World Bank to estimate its GDP (though Wikipedia handily estimates it at $2.4k per capita)? Or is it the democratic, high-standard-of-living South Korea, where you can expect to live to the ripe old age of 81?

Here’s my graph:

Graph showing life expectancy and GDP for the same countries as the previous graph: Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, Korea (South and North), Thailand and Laos. There is generally a higher GDP and life expectancy for the countries on the left, but no strong trend.

Not the most conclusive graph in the world, but then I would say the same for the original, and sadly I’m sure there are many people who took it at face value. I took a couple of quick averages, splitting the countries into left-of-Greece (where we eat too few unrefined plant foods and die of heart disease and cancer) and right-of-Greece (where we eat nothing but vegetables and nobody gets cancer!). I excluded both Koreas from this.

Left average life expectancy = 78, GDP = $27k

Right average life expectancy = 73, GDP = $7k
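
For anyone who wants to check these averages, here is a minimal sketch in Python using the figures from the table above. It assumes that Greece itself is left out of both groups, which the post does not state explicitly (including it on the left gives the same rounded figures). The names `DATA`, `LEFT`, `RIGHT` and `group_average` are only illustrative.

```python
from statistics import mean

# (life expectancy in years, GDP per capita PPP in 2005 international $),
# copied from the table in the post. Both Koreas are excluded.
DATA = {
    "Hungary": (74, 16958),
    "United States": (78, 42297),
    "Belgium": (80, 32808),
    "Sweden": (81, 33771),
    "Finland": (80, 31493),
    "Portugal": (79, 21660),
    "Venezuela, RB": (74, 10973),
    "Mexico": (77, 12441),
    "Thailand": (74, 7673),
    "Lao PDR": (67, 2288),
}

# Assumption: Greece is the dividing line and belongs to neither group.
LEFT = ["Hungary", "United States", "Belgium", "Sweden",
        "Finland", "Portugal", "Venezuela, RB"]
RIGHT = ["Mexico", "Thailand", "Lao PDR"]


def group_average(countries):
    """Return (mean life expectancy, mean GDP per capita) for a group."""
    life = mean(DATA[c][0] for c in countries)
    gdp = mean(DATA[c][1] for c in countries)
    return life, gdp


left_life, left_gdp = group_average(LEFT)
right_life, right_gdp = group_average(RIGHT)

print(f"Left:  life expectancy {left_life:.0f}, GDP ${left_gdp / 1000:.0f}k")
print(f"Right: life expectancy {right_life:.0f}, GDP ${right_gdp / 1000:.0f}k")
```

Note that the right-hand group contains only three countries, so its average is very sensitive to any one of them.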

So the question becomes – would you rather die of heart disease at 78, or of something else (starvation, diarrhoea, pneumonia) at 73?

But the thing that most grates about this graph is the apparently random selection of countries. If you have enough data points (e.g. countries) then you can select the ones you want to make a relationship look like it exists where it doesn’t. So I undertook a similar exercise, and downloaded data for all 220 countries available from the World Bank on forested area (as a percentage of total land area) and risk of maternal death (% over a lifetime). And behold, I have found a terrible relationship! We must plant trees in order to save the poor mothers!

Graph of forested area as a percentage of total land area and likelihood of maternal death for the countries Brazil, Peru, Panama, Ecuador, Vanuatu, Nepal, India, Madagascar, Ghana, Burkina Faso, Rwanda, Uganda and Mali. The trends appear to show that as forested area decreases, the risk of maternal death increases.

(I’d like to say that I didn’t spend a lot of time on this graph, but that would be a lie. It’s actually quite engrossing seeing what you can do once you decide your intention is to abuse the data).
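
The cherry-picking trick can be illustrated with a small Python sketch. It does not use the World Bank data at all: the 220 “countries” here are purely synthetic pairs of independent random numbers, so by construction there is no real relationship between them. Selecting the 13 points that happen to lie nearest a chosen line is my own stand-in for picking countries by eye, and `pearson` is a hand-written helper.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic data (NOT real countries): 220 pairs of independent
# uniform random numbers, with a fixed seed for repeatability.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(220)]

# Cherry-pick the 13 points closest to the falling line y = 1 - x,
# i.e. the ones that best fit the story we have decided to tell.
picked = sorted(points, key=lambda p: abs(p[1] - (1 - p[0])))[:13]

r_all = pearson(*zip(*points))
r_picked = pearson(*zip(*picked))

print(f"All 220 points:    r = {r_all:+.2f}")
print(f"13 selected points: r = {r_picked:+.2f}")
```

The full set shows essentially no correlation, while the hand-selected subset shows a strong negative one, even though nothing connects the two variables.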

Edited to add:

BadgerBrian points out in the comments that a scatterplot can be a much better visual tool for identifying whether there is a relationship between two variables. His graph here of the GDP and life expectancy of the 12 countries originally mentioned demonstrates this well: there is a strong positive correlation between increasing GDP and increasing life expectancy up until about $25k, then it flattens out (and the USA does a stellar job of having very high income and quite underwhelming life expectancy!). I’ve not been able to get the graph to display in the comments, so here it is:

Scatterplot of GDP vs life expectancy for Hungary, USA, Belgium, Sweden, Finland, Portugal, Venezuela, Greece, Mexico, Thailand and Laos. Graph shows positive correlation between the variables up until $25k, where the relationship flattens out.
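
As a rough numerical companion to the scatterplot, the sketch below computes a Pearson correlation separately for the countries below and above $25k, using the figures from the table in the post. The split at 25,000 comes from the description above; using Pearson correlation is my own choice for illustration and is not necessarily how BadgerBrian read the plot.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# (GDP per capita PPP in 2005 international $, life expectancy in years)
# for the 11 countries in the scatterplot (both Koreas excluded).
COUNTRIES = {
    "Hungary": (16958, 74),
    "United States": (42297, 78),
    "Belgium": (32808, 80),
    "Sweden": (33771, 81),
    "Finland": (31493, 80),
    "Portugal": (21660, 79),
    "Venezuela, RB": (10973, 74),
    "Greece": (24206, 80),
    "Mexico": (12441, 77),
    "Thailand": (7673, 74),
    "Lao PDR": (2288, 67),
}

THRESHOLD = 25000  # approximate levelling-off point named in the post

below = [v for v in COUNTRIES.values() if v[0] < THRESHOLD]
above = [v for v in COUNTRIES.values() if v[0] >= THRESHOLD]

r_below = pearson(*zip(*below))
r_above = pearson(*zip(*above))

print(f"Below $25k ({len(below)} countries): r = {r_below:+.2f}")
print(f"Above $25k ({len(above)} countries): r = {r_above:+.2f}")
```

With only four countries above the threshold, the second figure is dominated by the USA and should be treated with a good deal of caution.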


[1] “Vegonews is about sharing the most up to date information on veg*nism, and to spread the immense benefits of a healthy and natural lifestyle. http://vegonews.com/”

[2] World Bank (2012): World Development Indicators (Edition: April 2012). ESDS International, University of Manchester. DOI: http://dx.doi.org/10.5257/wb/wdi/2012-04

[3] In my original comment on Facebook, I used GDP per capita, constant 2000 US$. I’ve changed this to PPP (Purchasing Power Parity), where the dollar amount of GDP is adjusted to reflect how much it actually costs to afford certain products in that particular country.

[4] For example, I argued at an Oxfam meeting last year that income should only be used as a measurement of a broader dimension, “livelihoods”, rather than be a dimension itself, and will hopefully be using that in the framework for my PhD.

4 thoughts on “A terrible use of data”

  1. Oh Maeve, thank you for pursuing this. It’s such a pet hate of mine, misrepresentation of data; it gives my profession a bad name.

    I agree with all of your points. Just a few additional things to think about. When you have two variables which you think are related, rather than plotting pairs of bars for each individual (country in this case), it’s best to plot the pairs of data points as a scatterplot; you can label each point to identify the country. There are a couple of reasons why this is better.

    Firstly, when you do a bar chart, the order of the countries can have a huge influence on how the relationship looks, whereas the scatterplot is not order dependent. Secondly, there may not be a linear relationship between the two variables, and the scatterplot may help identify whether this is the case.

    I have taken the GDP/life expectancy data and plotted a scatterplot. Indeed, it looks like there is a relationship between these two variables. High-GDP countries have higher life expectancy; however, there appears to be a certain level of GDP, around $20,000, beyond which more GDP doesn’t produce a greater life expectancy.

    I will send you the graph; perhaps you could add it to my comments, as I can’t seem to do this within this “reply” box.

    Bxxxx

    • Oh, another thought: this kind of levelling off is often seen when there is a limiting factor on the response variable. It’s a kind of diminishing returns. There is a biological limit on how old humans can get, and you cannot get any older than this limit no matter how much money you have. So beyond a certain amount of money, which does help increase your lifespan, extra money adds less and less until there is no benefit at all.

      Bxxxx

    • Very good points BadgerBrian. In terms of the scatterplot, I’m reminded of my MSc dissertation which was looking at the relationship between energy security and wellbeing. I used lots of scatterplots to show various versions of almost this exact relationship – strongly positive and then levelling off at high levels. I’ll try to do a post on it sometime (or alternatively I could just send you the pdf!)

      I can’t work out how to make the graph display in comments (it’s there in the source code but still not showing up) so I’ve edited the original post.

  2. Great analysis. I especially like how you compared the flawed graph to the graph of risk of maternal death and forest area. As you say, it is unfortunate that some would take this sort of disinformation at face value.
