What happens to old data?

Of all the buzzwords in IT, “Big Data” is the one that irks me the most. Large volumes of data existed long before we needed to invent a pre-school-sounding term for them (think banking).

My interest was piqued when I noticed a new data centre being built next to my workplace. Every day I see the progress: bulldozers and backhoes excavating dirt, drillers sinking steel reinforcements. When complete, it will house a four-storey data centre for a large, well-known equities trading company.

There is no doubt that we consume and store more data now than we ever have. By 2017, more than 50% of the world’s population were active internet users (1). Throughout the latter part of the 2010s, network data has been consumed at an ever greater rate thanks to data-hungry video services like Netflix and YouTube, which alone account for up to 25% of all internet traffic. Although transitory in nature, streaming still occupies space on the network pipe; once the data flows through to its intended destination – be it pictures on a screen or music from a speaker – it is discarded.

But what about all the data that isn’t streaming? The accumulation is staggering now that more than 2 of every 7 people on Earth use Facebook alone: photos, videos, text and graphics, all stored in chronological order. The digital footprint of each user varies, but over many years it adds up to a sizeable amount of storage. Thought experiment: just storing the first character of the name of everyone in the world would require about 7 gigabytes of storage!
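A rough back-of-the-envelope sketch of that figure, assuming a world population of about seven billion and a single byte per character:

```python
# Back-of-the-envelope check of the "first character of everyone's name" figure.
# Assumptions: ~7 billion people, 1 byte per character (single-byte encoding).
world_population = 7_000_000_000
bytes_per_person = 1                      # first character of each name only
total_bytes = world_population * bytes_per_person
print(f"{total_bytes / 10**9:.0f} GB")    # -> 7 GB
```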

Over time we have moved back and forth between local and remote computing. Thin-client ‘timeshare workstations’ were all but a necessity from the 1950s to the 1970s (BBN was a bare-bones “AWS” in its day); fat-client ‘desktop computers’ took over as local processing and storage grew, and now we are back to a hybrid thin client in the form of the ‘cloud/internet’. We may still keep personal data on our own disks, but storing it with trusted third-party providers is becoming the norm as we embrace SaaS (Software as a Service). Over the last 20 years email, in particular the mailbox, is one instance where a service has morphed from a locally administered product into a cloud-based one. Cloud storage has benefits – redundancy and fault tolerance among other things – but if the service shuts down (however unlikely), the data goes with it. There is an inherent level of trust involved. This is part and parcel of embracing “the cloud”.

Other issues with cloud storage stem from a lack of direct control. We willingly hand over our personal data – text, video and audio – to social media sites like Facebook, Instagram and Twitter, and blithely forgo any rights to our own stuff via the EULA (not to mention access by three-letter agencies). In many cases these companies can use the data, individually or in aggregate, for commercial gain: targeted ads, or swaying public opinion on political matters. This was seen first-hand with the Facebook/Russia scandal in 2017 and the similar Cambridge Analytica scandal, where a firm exploited weakly controlled Facebook app data to obtain information from users who thought they were merely taking a quiz. The result: tailored campaign advertising for the Trump campaign machine.

This also got me thinking about all the data being stored that has no immediate use – a concept I will call “dormant data”. An awful lot of data generated year on year is simply in a state of hibernation: unused, or so infrequently accessed that for all practical purposes it may as well not exist. We call this “archiving”. But consider that each byte still needs to be stored somewhere, in some form, and the amount being archived only grows larger year by year.

A practical example of this concept of “dormant data”:

Google released its Gmail product in 2004. I was one of the first to receive an address during the pre-release phase, and if the number of mis-addressed emails is any indication, I chose a popular one. Getting 1GB of storage with an email address in 2004 was an obscenely, grotesquely massive amount for the time. This was back when Microsoft’s Hotmail offered a mere 2MB of storage. Anyone sending a video file (not that video was as common then as it is now) could end up having their email bounce due to the middling file-size restrictions of the day.

With little warning, Microsoft also wiped all data from the “Sent Items” folder as an easy way to reclaim storage, which made for many pissed-off users (myself included) who weren’t given adequate notice. Microsoft presumably judged that there was little value in keeping sent mail – perhaps seeing it as more or less a duplicate of the Inbox, which alone would suffice. It was a decision driven by storage considerations, and it shows how corporate decisions can affect the implicit contract made with users of a service.

Gmail gave us the freedom to keep whatever we want, for as long as we want. Anything stored more than, say, five years ago has little to no practical value, yet it still must be stored somewhere, by some mechanism. Studies show the amount of data being pumped out is increasing exponentially; some have pegged it to “double every two years at least” by 2020. Even if we assume data growth is only around 7-10% year on year, the rule of 72 says the total will double roughly every 7 to 10 years – meaning the new data produced within a single 7-10 year span will equal all the data ever produced before it. It is scarily reminiscent of other compound increases, such as oil production and “peak oil” – though data, unlike oil, is not a finite resource.
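To make the rule-of-72 arithmetic concrete, here is a small sketch; the 7% and 10% growth rates are the assumed figures from the paragraph above, not measured ones:

```python
import math

# Doubling time under steady compound growth, two ways:
# the rule-of-72 shortcut used above, and the exact logarithm formula.
# The 7% and 10% growth rates are assumptions, not measured figures.
for rate_percent in (7, 10):
    shortcut = 72 / rate_percent                              # rule of 72
    exact = math.log(2) / math.log(1 + rate_percent / 100)    # exact doubling time
    print(f"{rate_percent}% growth: ~{shortcut:.1f} years (rule of 72), "
          f"{exact:.1f} years (exact)")
```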

Now imagine the amount of forgotten data stored with little practical use – data that is, nonetheless, taking up bytes on a disk, tape, NAS drive, optical store and so on. The data being actively viewed or interacted with may constitute only around 5% of everything stored; the other 95% is stored for the sake of being stored, and most of it, I surmise, will never be touched again. In astronomy, the bulk of the universe is made up of dark matter and dark energy – things we can’t see and can only infer indirectly. This unused data is the digital equivalent: vast amounts of it out there, simply dead to us.

Thinking many years down the track, what will happen to all these swathes of unused, forgotten bits of information? Will we keep storing them in perpetuity, building ever more data centres? Or will providers take the Microsoft approach and sweep away anything more than X years old? Will this come to be seen as a “dark ages” for digital data?

Storage isn’t so much of a problem right now, because we can simply build more data centres. But the rate at which we output data means the construction of new centres will likely have to keep pace with the rate of data growth. There are flow-on effects too: more electricity use and more carbon pumped into the atmosphere.

In time we will have a name for this phenomenon, and a plan for what to do with it. Maybe sinking data centres under the sea (2) isn’t such a bad idea after all. Or maybe, like our old room in the family home, we will simply have to “chuck things out”.
