Data is a huge part of today’s world. As a society, we generate quintillions of bytes of this new gold every single day (a quintillion is a 1 followed by 18 zeros!). A lot of it is fairly useless, but a mistake made at any point can cost you dearly.
Today I’ll be looking at two examples of how data inaccuracies led to some catastrophic outcomes.
Using Excel To Store Too Much Data
It’s a well-known fact that during the pandemic, data provided by the government and leading health organisations across the world helped us understand the spread of the virus. The general public was able to access huge swathes of data from countries all over the world and collate it into some pretty nifty-looking dashboards.
How The UK Dealt With New Cases
The UK Government spent a huge sum of taxpayers’ money trying to track the spread, and one step where they went wrong was using Excel to store everything. All new cases were entered into one master spreadsheet. The inaccuracy in this instance? Row limits…
Excel is one of the world’s most used software packages. I was never a big fan of it at school and steered clear of it until I learned how useful its functions, and combinations of functions, could be. While working at another company that sounds like a rainforest, I quickly learned that it struggles with vast amounts of data, which is why there is a limit on the total number of rows.
While a simple CSV (Comma-Separated Values) file can contain any number of rows or columns, Excel can only load 1,048,576 rows and 16,384 columns. If those seem like strange numbers, they follow a pretty basic principle: they are powers of two. Rows are indexed with 20 bits (2^20 = 1,048,576) and columns with 14 bits (2^14 = 16,384). Not such strange numbers after all.
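Since the limits are just powers of two, a quick sketch in Python makes them easy to check:

```python
# Excel's grid limits are plain powers of two:
# a 20-bit row index and a 14-bit column index.
MAX_ROWS = 2 ** 20       # 1,048,576 rows (.xlsx)
MAX_COLS = 2 ** 14       # 16,384 columns (.xlsx)

# The older .xls format used a 16-bit row index and an 8-bit column index.
XLS_MAX_ROWS = 2 ** 16   # 65,536 rows
XLS_MAX_COLS = 2 ** 8    # 256 columns

print(MAX_ROWS, MAX_COLS, XLS_MAX_ROWS, XLS_MAX_COLS)
```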
How Did It Go Wrong Then?
All of the test centres were sending over CSV files, but many of them were then saved in a format you may be familiar with: .xls. This file format is the predecessor of the current .xlsx format, and it can only hold 65,536 rows. Any case recorded past that point was simply dropped when the data was loaded into the main spreadsheet.
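A simple guard would have caught the problem: count the rows in each incoming CSV before squeezing it into .xls. A minimal sketch (check_row_limit is a hypothetical helper, not part of any real pipeline):

```python
import csv

XLS_MAX_ROWS = 65_536  # hard row limit of the legacy .xls format

def check_row_limit(path, limit=XLS_MAX_ROWS):
    """Return (total_rows, rows_lost) if this CSV were saved as .xls."""
    with open(path, newline="") as f:
        total = sum(1 for _ in csv.reader(f))
    return total, max(0, total - limit)
```

Anything past the limit is silently dropped by the conversion, so a non-zero rows_lost should stop the pipeline rather than just log a warning.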
There are reports that this only affected the final figures used for reporting, though that is still one of the most crucial elements of the job. Apparently, those who were infected but left off the final sheet were still contacted by the laboratory that conducted the test.
A file extension was to blame. More precisely, the people who chose the format to save it in.
Missile Defence System Gets Timing Wrong
In 1991, a US Army barracks in Dhahran, Saudi Arabia was hit by an Iraqi Scud missile during the first Gulf War. Tragically, 28 US service members lost their lives, and it was due to a rounding error.
The Patriot missile system had an advanced onboard computer to calculate the trajectory of incoming missiles, but to do this it needed to know the time accurately. The system counted time in tenths of a second, and each tick was stored very slightly short of a true tenth. Once the system had been booted up and the onboard computer left running, this tiny timing error compounded until it became so great that the interceptor missed its target.
Although this is a grim depiction of data inaccuracies, rounding and floating-point errors appear quite a lot throughout the course of our technological advancements. Even NASA has made this kind of mistake, losing the Mars Climate Orbiter, a probe worth hundreds of millions of dollars, to a mix-up between metric and imperial units.
What Can You Do To Avoid Data Inaccuracies?
Knowing they exist is a good start. I had no idea that errors like these had caused a space probe to be lost until I went through university, where I learned the fundamentals of data inaccuracies.
Going over the data with a fine-tooth comb is always worth the time. This is why around 80% of an analyst’s job is cleaning and transforming the data to make sure it’s in good shape.
Check that the data types are all as they should be. If a column of numbers has been stored as strings, you can’t add them up. You’d still be able to create a histogram from the occurrences of each value, but that is another issue.
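As a sketch with pandas (all_sales here is a made-up table, not real data), this is what a numeric column stored as strings looks like, and one way to fix it:

```python
import pandas as pd

# Hypothetical sales data where the numbers were loaded as strings.
all_sales = pd.DataFrame({"revenue": ["100", "250", "75"]})
print(all_sales.dtypes)  # revenue is 'object', i.e. strings, not numbers

# Summing strings would concatenate them; convert to numeric first.
# errors="coerce" turns anything unparseable into NaN instead of crashing.
all_sales["revenue"] = pd.to_numeric(all_sales["revenue"], errors="coerce")
print(all_sales["revenue"].sum())
```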
Exploratory Data Analysis
Data visualisations can help you see if your data is correct. If dates haven’t been formatted correctly, you may see gaps in time where you know something should be happening. The charts don’t need to be complex works of art (you can work on that later); a simple line chart is enough to see if the results are what you were expecting.
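The same gap-hunting can be done programmatically before you even draw the chart. A sketch with pandas and made-up daily counts, comparing the dates you have against the dates you should have:

```python
import pandas as pd

# Hypothetical daily case counts with one day silently missing.
cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-04"]),
    "count": [10, 12, 15],
})

# Every date that should exist between the first and last observation...
expected = pd.date_range(cases["date"].min(), cases["date"].max(), freq="D")
# ...minus the dates we actually have.
missing = expected.difference(pd.DatetimeIndex(cases["date"]))
print(missing)  # 2020-10-03 is absent: investigate before trusting the chart
```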
Running experiments with different types of graphs can show you if something is wrong with the data. If it’s poorly formatted then you’ll more than likely run into errors.
Running simple code such as all_sales.describe() in Python is an excellent way to summarise your data.
Chaining other short pieces of code lets you check for missing data: all_sales.isna().sum() shows how many entries are missing from each column in total. Knowing that 25 out of 1,000 are missing might not be too useful on its own, so take it further and show it as a percentage, round((all_sales.isna().sum()/all_sales.shape[0])*100, 2), to get a better idea.
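Putting those together on a made-up all_sales table (note the shape[0]: dividing by the whole shape tuple rather than the row count is an easy mistake to make):

```python
import pandas as pd

# Hypothetical sales table with a few holes in it.
all_sales = pd.DataFrame({
    "region": ["North", "South", None, "East"],
    "revenue": [100.0, None, 75.0, 50.0],
})

print(all_sales.describe())                # quick numeric summary

missing_counts = all_sales.isna().sum()    # missing entries per column
# Divide by the row count, shape[0], to turn counts into percentages.
missing_pct = round(missing_counts / all_sales.shape[0] * 100, 2)
print(missing_pct)
```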
Data Services From Us
If you need some expert analysis, someone to find the gaps in your data, or simply don’t want to deal with data inaccuracies yourself, then get in touch with a member of the team. You can find me on LinkedIn, posting visuals and sharing some really insightful stuff back to the community, or you could head over to our contact page and send us a message!
Check Out Some Of The Other Musings From Embryo Employees
The benefits of posting on LinkedIn daily – By Dylan Heath
Rome Wasn’t Built In A Day: Why Patience Is The Key To Content Marketing – By Charlie Meyler