First glance
While it is now well known that companies worldwide are obliged to participate in mass surveillance by sharing your data with state agencies, here we try to estimate the extent to which ordinary scammers have access to that data.
Perhaps the primary source of data leaks is major data breaches; there’s even a list of data breaches on Wikipedia.
Apparently a common belief is that big companies, unlike, say, Chinese hackers, are unlikely to exploit your data for much more than human web trafficking. But sometimes the latter can access the data of the former – including huge tech companies and services, such as Gmail, Ubuntu, Twitter, Facebook, Apple, eBay, Adobe, Evernote, and Dropbox, as well as financial, retail, healthcare, and government ones. You hand data to those, and they get hacked or accidentally publish[1] it, doing all the collection work for whichever scammers may be interested.
The extent
The Wikipedia list mentions an estimate of 2 billion user entries in total, while the number of internet users is roughly half of the world population – about 3.5 billion users. With the average literacy rate of about 85%[2], that number is pretty high, and it also suggests why so many websites and programs seem to be aimed at rather inexperienced users. LeakedSource[3] mentions a similar figure, 2.2 billion accounts in their database. Probably some breaches remain unreported, but there should be plenty of accounts belonging to the same users, and plenty of spam bots – so the actual number of users whose data was compromised may be, say, an order of magnitude lower than that: hundreds of millions.
That estimate is probably not too high, since some of the individual breaches are reported to contain huge numbers. Here they are, in millions: US Military veterans – 76, TK / TJ Maxx – 94, Target Corporation – 70, 7-Eleven + Nasdaq + others – 160, Heartland – 130, eBay – 145, AOL – 92, Anthem Inc. – 80, Adobe Systems – 152. The LeakedSource website also mentions, among the 20 most recently added databases: Rambler.ru – 98, Last.fm – 43, Dropbox.com – 67.
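The back-of-the-envelope estimate above can be written out as a small sketch. The figures are the ones quoted in the text; the factor of ten is just the "order of magnitude" guess made there (duplicate accounts and spam bots), not a measured value, and the variable names are made up for illustration.

```python
# Figures quoted in the text above.
LEAKED_ENTRIES = 2_000_000_000   # Wikipedia list estimate of leaked user entries
INTERNET_USERS = 3_500_000_000   # rough number of internet users

# Assumption from the text: duplicates and spam bots inflate the entry
# count by about an order of magnitude. This is a guess, not a measurement.
ENTRIES_PER_PERSON = 10

# Individual breach sizes listed above, in millions of records.
BREACHES_MILLIONS = {
    "US Military veterans": 76,
    "TK / TJ Maxx": 94,
    "Target Corporation": 70,
    "7-Eleven + Nasdaq + others": 160,
    "Heartland": 130,
    "eBay": 145,
    "AOL": 92,
    "Anthem Inc.": 80,
    "Adobe Systems": 152,
    "Rambler.ru": 98,
    "Last.fm": 43,
    "Dropbox.com": 67,
}

# Estimated number of distinct people affected, and their share of all users.
affected_users = LEAKED_ENTRIES // ENTRIES_PER_PERSON
affected_share = affected_users / INTERNET_USERS

# The largest single listed breach, for comparison with the estimate.
largest_name, largest_millions = max(BREACHES_MILLIONS.items(), key=lambda kv: kv[1])

print(f"estimated affected users: {affected_users / 1e6:.0f} million")
print(f"share of internet users:  {affected_share:.1%}")
print(f"largest listed breach:    {largest_name}, {largest_millions} million")
```

Under that assumption the estimate comes to 200 million people, which is in the same range as the largest single breaches listed above.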
Causes
Each breach takes multiple stupid decisions or mistakes, from hiring and assigning people who are not suitable for a job to actually not doing the job properly. Computers don’t give data away by themselves unless they are told to – even if unintentionally.
Though it is in human nature to make mistakes, there are techniques to prevent nearly all of them – such as reviews by competent people, formal and machine-verified proofs, layered security, and so on. It is mostly carelessness that leads to breaches.
Data usage
Many things can happen with user data once it is available to scammers: phishing is probably the most basic way to apply it, though one may also use that data to get into user accounts on other services, to spam while impersonating the users[4], to resell it for future targeted attacks, and so on. Knowledge is power, and it’s rather unjust in this case.
Countermeasures
Entertainment services that require registration can simply be avoided, while others may be rather tricky to avoid. For those, one can just use separate identities: obviously different logins and passwords, no private data such as real name or photos when not necessary, preferably using Tor, etc. A single alternative identity for online activities is not sufficient, since it leads to higher accumulation of data in the hands of scammers than the practically achievable minimum.
Basically, just do the opposite of what Vaulin did.
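The "separate identities" advice above can be sketched in a few lines: one unrelated, randomly generated login and password per service, so that a breach of one service reveals nothing reusable on another. This is only an illustration; the `Identity` class, the `new_identity` helper, and the service names are hypothetical, and in practice the generated values would be kept in a password manager.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    service: str
    login: str
    password: str


def new_identity(service: str) -> Identity:
    """Create a fresh identity that shares nothing with any other one."""
    # Random login, so accounts cannot be linked by a common username.
    login = "user" + secrets.token_hex(4)
    # 24 random bytes, URL-safe encoded; never reused across services.
    password = secrets.token_urlsafe(24)
    return Identity(service, login, password)


# Hypothetical services, one independent identity each.
identities = {name: new_identity(name) for name in ("forum.example", "shop.example")}

for identity in identities.values():
    print(identity.service, identity.login)
```

The point of the design is that nothing is derived from a shared secret or a real name: each identity is generated independently, so leaked entries from different services cannot be joined on login or password.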
Unauthorized access laws
As one may guess, those breaches are illegal: that’s unauthorized access to information. Recently Bruce Schneier reposted a couple of related news items: “Visiting a Website against the Owner’s Wishes Is Now a Federal Crime” and “Password Sharing Is Now a Crime”. One of those quotes some guy, who explained “without authorization” as follows:
An unambiguous, non-technical term that, given its plain and ordinary meaning, means accessing a protected computer without permission.
It’s so unambiguous that we have at least one more unambiguous – though rather technical – definition in Wikipedia:
Authorization or authorisation is the function of specifying access rights to resources related to information security and computer security in general and to access control in particular. More formally, “to authorize” is to define an access policy.
At first glance, they don’t seem to contradict each other: a permission is given (or not) by defining an access policy. Computers are precise, and they follow that policy as programmed. Pretty cool, and it’s pretty much impossible to violate such a policy when it’s implemented using computers.
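The technical reading of "authorization" can be illustrated with a minimal sketch, assuming a toy policy that maps resources to the users allowed to read them. The `AccessPolicy` class, the user names, and the file names are all hypothetical; real systems are far more elaborate, but the principle is the same: the machine answers according to the policy as defined.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # Resource name -> set of users explicitly allowed to read it.
    readers: dict = field(default_factory=dict)
    # Resources that anyone may read.
    public: set = field(default_factory=set)

    def is_authorized(self, user: str, resource: str) -> bool:
        """Answer strictly according to the policy as it was defined."""
        if resource in self.public:
            return True
        return user in self.readers.get(resource, set())


# To authorize is to define the policy; the check itself is mechanical.
policy = AccessPolicy(
    readers={"payroll.csv": {"alice"}},
    public={"index.html"},
)

print(policy.is_authorized("alice", "payroll.csv"))  # allowed by the policy
print(policy.is_authorized("bob", "payroll.csv"))    # not listed, so denied
print(policy.is_authorized("bob", "index.html"))     # public, so allowed
```

The check has no notion of intent: it only consults what was written into `readers` and `public`.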
Alas, people read and understand those unambiguous definitions quite differently. What the guy tried to explain there is not plain “permission by a strict and unambiguous program”, but “what one really means or wants”. So, when one messes up and fails to define the access policy as they wanted to, and then somebody uses what was in fact defined, it is usually considered unauthorized access, for which the more competent person is supposed to be punished, since they probably understood what was going on. Orin Kerr, writing in The Washington Post, states:
The main question becomes mens rea: The visit becomes a federal crime when the visitor knows that the computer owner doesn’t want it.
Indeed, that also explains how the more conventional unauthorized access is deemed illegal. Alas, it’s mostly useless against scammers and major data breaches, and may even make those easier by discouraging ethical hackers from penetration testing.
[1] Yes, that’s a thing – pretty common, even.
[2] Ever wondered what it was like to live in the world where most of the human population was savage? Surprise, we’re not far from that.
[3] The website itself looks rather scammy, but appears to indeed have those databases.
[4] Well, that’s what legitimate websites such as LinkedIn are doing already.