A trending issue, with much recent activity in the headlines, is the thorny topic of what I will call our ‘digital shadow’. By this I mean, collectively, all the data that represents our real self in the virtual world. This digital shadow consists of both explicit data (e-mails you send, web pages you browse, movies/music you stream, etc.) and implicit data (the time of day you visited a web page, how long you spent viewing that page, the location of your cellphone throughout the day, etc.).
Every time you move through the virtual world, you leave a shadow. Some call this your digital footprint. The size of this footprint or shadow is much, much larger than most realize. Consider an example using something as simple as a single corporate e-mail sent to a colleague at another company:
Your original e-mail may have been a few paragraphs of text (5kB) and a two-page Word document (45kB), for a nominal size of 50kB. When you press Send, this is cached in your computer, then copied to your firm’s e-mail server. It is copied again, at least twice, before it even leaves your company: once to the shadow backup service (just about all e-mail backup systems today run a live parallel backup to avoid losing any mail), and again to your firm’s data retention archive – mandated by Sarbanes-Oxley, FRCP (Federal Rules of Civil Procedure), etc.
The message then begins its journey across the internet to the recipient. After leaving the actual e-mail server the message must traverse your corporation’s firewall. Each message is typically inspected for outgoing viruses, and possibly for attachment type or other parameters set by your company’s communications policy. In order to do this, the message is held in memory for a short time.
The e-mail then finally begins its trip on the WAN (Wide Area Network) – which is actually many miles of fiber optic cable with a number of routers to link the segments – that is what the internet is, physically. (OK, it might be copper, or a microwave link, but basically it’s a bunch of pipes and pumps that squirt traffic to where it’s supposed to end up.)
A typical international e-mail will pass through at least 30 routers, each one of which holds the message’s packets in its internal memory for a short while, until they move out of the queue. This is known as ‘store and forward’ technology. Eventually the message gets to the recipient firm, and goes through the same steps as when it first left – albeit in reverse order – finally arriving at the recipient’s desktop, now occupying memory on their laptop.
While it’s true that several of the ‘way-stations’ erase the message after sending it on its way to make room for the next batch of messages, the average amount of memory occupied by traffic in transit is quite large. A modern router must have many GB of RAM to process high volume traffic.
Considering all of the copies, it’s not unlikely for an average e-mail to be copied over 50 times from origin to destination. If even 10% of those copies are held more or less permanently (this is a source of much arguing between legal departments and IT departments – data retention policies are difficult to define), this means that your original 50kB e-mail now requires 250kB of storage. Ok, not much – until you realize that (per the stats published by the Radicati Group in 2010) approximately 294 billion e-mails are sent EACH DAY. Do the math…
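For readers who would rather not do the math by hand, here is a minimal sketch in Python. It uses only the figures quoted above (50kB per e-mail, 50 copies, 10% retained, 294 billion e-mails a day). The function names and the choice of decimal units are my own assumptions, and it treats every e-mail as if it were the 50kB example, which real traffic certainly is not.

```python
# Figures quoted in the text above.
EMAIL_SIZE_KB = 50                 # 5 kB of text plus a 45 kB attachment
COPIES_PER_EMAIL = 50              # copies made between origin and destination
RETAINED_PERCENT = 10              # share of copies kept more or less permanently
EMAILS_PER_DAY = 294_000_000_000   # Radicati Group figure for 2010

# Assumption (not from the text): decimal units, so 1 PB = 10**12 kB.
KB_PER_PB = 10**12


def retained_kb_per_email(size_kb, copies, retained_percent):
    """Long-term storage one e-mail ends up occupying, in kB."""
    return size_kb * copies * retained_percent / 100


def retained_pb_per_day(size_kb, copies, retained_percent, emails_per_day):
    """Long-term storage added per day if every e-mail matched the example."""
    per_email_kb = retained_kb_per_email(size_kb, copies, retained_percent)
    return per_email_kb * emails_per_day / KB_PER_PB


if __name__ == "__main__":
    per_email = retained_kb_per_email(
        EMAIL_SIZE_KB, COPIES_PER_EMAIL, RETAINED_PERCENT
    )
    per_day = retained_pb_per_day(
        EMAIL_SIZE_KB, COPIES_PER_EMAIL, RETAINED_PERCENT, EMAILS_PER_DAY
    )
    print(f"Retained per e-mail: {per_email:.0f} kB")
    print(f"Retained per day:    {per_day:.1f} PB")
```

Under these assumptions the result is 250kB per e-mail and 73.5 petabytes of retained copies per day. Real messages vary widely in size, so treat this as an illustration of scale rather than a measurement.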
Now here is where life gets interesting… the e-mail itself is ‘explicit data’, but many other aspects of the mail (call it metadata), known as ‘implicit data’, are also stored, or at least counted and accumulated.
Unless you fully encrypt your e-mails (becoming more common, but still only practiced by a small fraction of 1% of users) anyone along the way can potentially read or copy your message. While, due to the sheer volume, no one without reason would target an individual message, what is often collected is implicit information: How many mails a day does a user or group of users send? Where do they go? Is there a typical group of recipients? Often this implicit information is fair game even if the explicit data cannot be legally examined.
Many law enforcement agencies are permitted to examine header information (implicit data) without a warrant, while actually ‘reading’ the e-mail would require a search warrant. At a high level, sophisticated analysis using neural networks is what is done by agencies such as the NSA, CSE, MI5, and so on. They monitor traffic patterns – who is chatting to whom, in what groups, how often – and then collate these traffic patterns against real world activities, looking for correlation.
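To make the idea of header-only analysis concrete, here is a small Python sketch that counts who writes to whom using nothing but the From, To and Cc headers. It is plain counting, not the neural-network analysis mentioned above, and the function name, the sample messages and the addresses are all made up for illustration. The message bodies are never inspected.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parseaddr


def count_sender_recipient_pairs(raw_messages):
    """Count (sender, recipient) pairs using header fields only."""
    pairs = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # parseaddr returns (display name, address); keep the address part.
        sender = parseaddr(msg.get("From", ""))[1].lower()
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, address in getaddresses(recipient_headers):
            if sender and address:
                pairs[(sender, address.lower())] += 1
    return pairs


# Hypothetical sample traffic, invented for this example.
SAMPLE_MESSAGES = [
    "From: alice@example.com\nTo: bob@example.org\n"
    "Subject: lunch\n\nThe body is never looked at.\n",
    "From: Alice <alice@example.com>\nTo: bob@example.org\n"
    "Cc: carol@example.net\nSubject: report\n\nNor is this one.\n",
    "From: bob@example.org\nTo: alice@example.com\n"
    "Subject: re: lunch\n\nSame here.\n",
]

if __name__ == "__main__":
    counts = count_sender_recipient_pairs(SAMPLE_MESSAGES)
    for (sender, recipient), n in counts.most_common():
        print(f"{sender} -> {recipient}: {n}")
```

Even this toy version shows the pattern: without reading a word of content, it is clear that alice and bob talk regularly and that carol is on the edge of the group.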
All of this just from looking at what happened to a single e-mail as it moved…
Now add in the history of web pages visited, online purchases, visits to social sites, posts to Facebook, Twitter, Pinterest, LinkedIn, etc. Many people feel that they maintain a degree of privacy by using different e-mail addresses or different ‘personalities’ for different activities. In the past, this may have helped, but today little is gained by this attempt at obfuscation – mainly due to a technique known as orthogonal data mining.
Basically this means drilling into data from various ‘viewpoints’ and collating data that at first glance would appear unrelated. For instance, different social sites may be visited by what appear to be different users (with different usernames) – until a study of the ‘implicit data’ (the IP address of the client computer) shows it to be the same in each case…
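The username example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular data miner works: the function name, the site names, the usernames and the IP addresses (taken from the ranges reserved for documentation) are all hypothetical.

```python
from collections import defaultdict


def link_accounts_by_ip(visits):
    """Group (site, username) pairs by client IP address.

    Returns only the IP addresses seen with more than one distinct account.
    """
    accounts_by_ip = defaultdict(set)
    for site, username, ip in visits:
        accounts_by_ip[ip].add((site, username))
    return {
        ip: accounts
        for ip, accounts in accounts_by_ip.items()
        if len(accounts) > 1
    }


# Hypothetical visit records: (site, username, client IP address).
SAMPLE_VISITS = [
    ("photos.example", "sunny_hiker", "203.0.113.7"),
    ("forum.example", "anon4821", "203.0.113.7"),
    ("shop.example", "j.smith", "203.0.113.7"),
    ("forum.example", "quiet_reader", "198.51.100.23"),
]

if __name__ == "__main__":
    for ip, accounts in link_accounts_by_ip(SAMPLE_VISITS).items():
        print(ip, sorted(accounts))
```

A shared IP address is a strong hint, not proof: an office, a household or a coffee shop can put many people behind one address. That is exactly why miners combine it with other implicit data before drawing conclusions.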
Each web session a user conducts with a web site transmits a lot of implicit data: time and duration of visit, pages visited, cross-links visited, IP address of the client, e-mail address and other ‘cookie’ information contained on the client computer, etc.
The real power of this kind of data mining comes from combining data from multiple web sites that are visited by a user. One can see that seemingly innocuous searches for medical conditions, coupled with subsequent visits to WebMD or other such sites, could be assembled into a profile that may reveal more information to an online ad agency than the user may desire.
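To see how much of this a web server records for a single page request, here is a Python sketch that pulls the implicit data out of one line of a server log. It assumes the widely used ‘combined’ access log layout; the regular expression is a simplified one written for this example, the log line is invented, and cookie contents are not part of this layout, so they do not appear.

```python
import re
from datetime import datetime

# Simplified pattern for one 'combined' format access log line (an assumption
# of this sketch; real logs can contain cases this pattern does not handle).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the implicit data found in one log line, or None if no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp text into a timezone-aware datetime.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical log line, invented for this example.
SAMPLE_LINE = (
    '203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] '
    '"GET /conditions/migraine HTTP/1.1" 200 2326 '
    '"http://search.example/?q=migraine" "Mozilla/5.0"'
)

if __name__ == "__main__":
    for field, value in parse_log_line(SAMPLE_LINE).items():
        print(f"{field}: {value}")
```

One request, and the site already knows the client’s IP address, the exact time, the page read, the page the visitor came from (search terms included) and the browser in use. The time between successive lines from the same address gives the duration of the visit.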
Or how about the fact that Facebook (to use one example) offers an API (programmatic interface) to developers that can be used to trawl the massive database on people (otherwise known as Facebook) for virtually anything that is posted as ‘public’? Since that privacy permission state is the default (unless a user has chosen specifically to restrict it) – and now with the new Facebook Timeline becoming mandatory in the user interface – it is very easy for an automatic program to interrogate the Facebook archives for the personal history of anyone who has public postings – in chronological order.
Better keep all your stories straight… a prospective employer can now zoom right to your timeline and see if what you posted personally matches your resume…
Like most things, there are two sides to all of this: what propels this profiling is targeted advertising. While some of us may hate the concept, as long as goods and service vendors feel that advertising helps them sell – and targeted ads sell more effectively at lower cost – then we all benefit. These wonderful services that we call online apps are not free. The programmers, the servers, the electricity, the equipment all cost a LOT of money – someone has to pay for it.
Giving up some screen real estate to ads is actually a pretty cheap price for most users to pay. However, the flip side can be troubling. It is well known that certain governments routinely collect data from Facebook, Twitter and other sites on their citizens – probably not for these same citizens’ good health and peace of mind… Abusive spouses have tracked and injured their mates by using Foursquare and other location services, including GPS monitoring of mobile phones.
In general we collectively need to come to grips with the management of our ‘digital shadows.’ We cannot blindly give de facto ownership of our implicit or explicit data to others. In most cases today, companies take this data without telling the user, give or sell it without notice, and the user has little or no say in the matter.
And these issues are not just relegated to PCs on your desk… the proliferation of powerful mobile devices running location-based apps has become an advertiser’s dream… and sometimes a user’s nightmare…
No matter what is said or thought by users at this point, the ‘digital genie’ is long out of the bottle and she’s not going back in… our data, our digital shadow, is out there and is growing every day. The only choice left is for us collectively, as a world culture, to accept this and deal with it. As is often the case, technology outstrips law and social norms in terms of speed of adoption. Most attempts at any sort of unified legal regulation on the ‘internet’ have failed miserably.
That doesn’t mean regulation should not happen, but it must be sensible, uniformly enforceable, equitable and fairly applied – with the same sort of due process, ability for appeal and redress, etc. that is available in the ‘real world.’
The first step toward a more equitable and transparent ‘shadow world’ would be a universal recognition that data about a person belongs to that person, not to whoever collected it. There are innumerable precedents for this in the ‘real world’, where a person’s words, music, art, etc. can be copyrighted and protected from unauthorized use. Of course there are exceptions (the ‘fair use’ policy, legitimate journalistic reporting, photography in public, etc.) but these exceptions are defined, and often refined through judicial process.
One such idea is presented here; whether it will gain traction is uncertain, but at least some thought is being directed towards this important issue.
[shortly after first posting this I came across another article so germane to this topic I am including the link here – another interesting story on data mining and targeted advertising]