2019-10-20

We need better data privacy self-regulation

Some smart people believe concerns about data privacy are illusory and drummed up as an excuse to expand the regulatory reach of the government. There is some truth to that. Government is highly motivated to further its dominion, and has seized the opportunity for several recent theatrical, vacuous meetings with social media giants. However, data privacy is a genuine issue, and privacy advocates see a real and significant need for improvement. Privacy concerns are not an illusion.

That said, improving the situation doesn't require laws. I feel we would be better served by more experimentation and more ethical action by vendors and more education of consumers.

Data privacy concerns are real

There is a large amount of information asymmetry between the users and makers of modern software, especially on the matter of user data collection. Users do not appreciate just how much user data is collected by cloud services and apps, and how that data is analyzed and used to synthesize further information: metrics, predicted intent, sentiment, categorization, etc.

The data analysts who are hands-on with user data know the situation is nothing like a decade or two ago. The volume of data collected today, and the uses to which it can be applied, make legacy notions of privacy seem not only quaint but from a wholly different era. Our notions of the capability to be monitored by third parties seem stuck in about the year 2000. We imagine data collection to be equivalent to a bicycle, but today's data industry is actually rocking a SpaceX BFR.

The magnitude of user data collection, exfiltration, and analysis remains unclear to laypeople. It also has increased more quickly than our metaphors and reasoning aids, meaning many laypeople are prone to apply legacy metaphors about telephone taps and radio waves. I suspect laypeople also don't appreciate how innovative the industry has been with data usage, leaving them imagining it's exclusively about ad targeting. Only quite recently have more consumers become aware of the other ways data are applied to filter the content they see, provide recommendations only to increase engagement, and otherwise "improve the consumption experience." And then there are the more disturbing data uses dabbling at the edge of the data privacy Overton window, where data gathered by third parties is "weaponized" against people, e.g., to make finding employment more difficult or set the price of insurance.

Put simply, most data collection, exfiltration, and analysis is happening invisibly to the users from whom it is gathered. Data collection is a hidden feature of software, and it's not a feature users would ever ask for. The user doesn't see their data being taken. It's not like the old days when you'd see a red "transmit" light flashing to indicate your computer is sending something to a metaphorical cloud. No, today, the now very real "cloud" reaches into your local workstation and just creates and captures data about you using your own CPU.

Even when lip service is given to consent, the type of data is vague ("usage characteristics"), the purpose is vague ("to improve user experience"), the scope is vague ("combined with other services we provide"), the volume and frequency are vague ("periodically"), and the third parties are vague ("trusted partners"). And sometimes, there is simply no opting out as long as you are a user of the product in question.

Data collection policies are often conveyed to users in legalese within user license agreements that no one understands. But thinking that people don't care about privacy because they don't read these agreements carefully is mistaken, bordering on user-hostile.

Nevertheless, some users, especially those who are privacy advocates, are aware of and deeply concerned about both the semi-legitimate uses and abuses of collected user data. It is true this group, of which I am part, is currently failing to communicate the severity of the concern clearly enough to sway broad opinion. Messaging about privacy and data hygiene is still largely memes and legacy metaphors. But to reiterate, this is not evidence that data concerns are illusory.

Collection of data about people has a history of conspiracy theories on one side and "nothing-to-hideism" on the other, making it exhausting for laypeople. Today, both defeatism and well-meaning trust in the "strict controls" put in place at major data harvesters (e.g., Google, Facebook) are common. But those currently downplaying the concern of data privacy today are an echo of those who said the Fourth Amendment would never permit bulk surveillance by the government. We all became conspiracy theorists for a few weeks when we learned better. (How quickly that vanished!)

Who you gonna call?

Given the data specter haunting privacy advocates, calling for government intervention is facile and predictable. Government is the go-to when people feel the market is failing. It's also ultimately foolish because the medicine will be worse than the disease.

But yes, the market can do a lot better with collected user data. A lot better.

There is experimentation in ways to improve matters, primarily in the open source space where questions of revenue streams are less front-and-center. This is the domain of the highly motivated and knowledgeable. It's self-hosted applications that are built and deployed by software developers on servers they operate, with data collection they manage personally for their exclusive purposes.

But some commercial enterprises are taking up privacy as a differentiation factor. High-profile examples are Mozilla (makers of the Firefox web browser), the Brave web browser, and even Apple in some of its more recent iPhone advertising.

Mozilla in particular makes privacy a central focus. They have a manifesto which includes:

Principle 4: Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.

Other companies are making slower moves to catch the shifting wind of public opinion. Google, Facebook, Microsoft, and Amazon will periodically communicate some messaging about respecting privacy. But these four, and others who have become addicted to slurping the user data trough, have a lot of ground to cover before they look remotely like an organization that actually respects user privacy.

There's also experimentation in the middle ground, combining open source with service and commerce. You can find service vendors who will install, host, and maintain "self-hostable" software for you. This is a happy side-effect of a relatively new concept in software: "containerization." For laypeople in the audience, some of the high-profile technologies in play here are Docker and Kubernetes. Originally for business use, these technologies, combined with managed hosting and eventual further usability improvements, should make operating private servers feasible for a broader user base.

I really hope momentum builds here, because I feel strongly we need to return to a world where software runs for the benefit of the individual, according to their preferences and constraints, rather than according to the whims of a centralized third-party.

Fear regulatory capture

It will take more time to realize the benefits of the nascent market movements described above. And things may play out differently. So it's natural to want to correct the present.

Americans almost instinctively call for the government as an outside enforcer when they feel frustrated. But the risk should be familiar to anyone who observes regulation. Established companies secretly love the threat of regulation because they can tune it to their preferences, and make the resulting rules serve their interests. If Facebook helps write the regulations on social networks, while you will of course see some concessions made for appearances and appeasement, the bulk of the regulation will be written to essentially require all social networks to do what Facebook already does and can afford to do. This will reduce innovation in social networks (sadly, further suffocating the innovation and exploration I described in the previous section) and entrench Facebook, ensuring its survival for longer than it would survive otherwise.

Alternatives

Rather than risk regulatory capture and adopt rules that will suppress small innovators, I would prefer to exert pressure on the gross data harvesters from inside and out, both as consumers and engineers.

Below are my (admittedly, half-baked) ideas for consideration.

Ostracize euphemisms

Privacy experts need to find terminology that strikes the right balance of invalidating the euphemisms used by data collectors without getting dismissed for using our own extreme terminology. For example, the word "surveillance" is used by many (myself included) to describe data collection, and from one point of view, this word is accurate. But surveillance implies the purpose of data collection is only to monitor and control behavior, which is not wholly accurate, and maybe not accurate at all. We should find terminology that at least acknowledges the other incentives for collecting all of this user data, such as revenue, feature usage statistics, and bug detection.

The above paragraph is all a measure of caution to balance the following: We should reject and ostracize the euphemisms used by data collectors, such as the words telemetry and analytics.

Telemetry sounds inoffensive and you could be excused for thinking it's simple data such as logging an app's start, stop, and maybe stack traces in the event of an exception. But it often covers so much more, from collecting any/all key-presses (with only "best efforts" to remove sensitive content such as passwords) to collecting all mouse movement to create heatmaps. When you understand what is included under the "telemetry" umbrella in 2019, it is difficult to not immediately characterize it as surveillance. It certainly feels like you're being surveilled when you're the subject.

Once you know it's happening, you may also experience a chilling effect, which further emphasizes the emotional response.

So I routinely criticize the use of euphemisms and I recommend others do the same. I don't know the right alternatives, which is why I still often use "surveillance." It's emphatic and snarky, and makes my position known. But it probably doesn't really move the needle with the unfamiliar.

Be transparent about the value of aftermarket data

We need something like "fair trade" for data—a phrase which conveys an otherwise imbalanced trade that is being corrected.

Data privacy advocates find current practices offensive in part because users do not understand that paying for a product is no longer the full extent of the trade. Although the concept of a "loss leader" isn't new, it has been so widely adopted in modern tech that it seems the norm. The purchase price for most tech products is only the start of a long revenue curve.

A "fair trade" approach might mean we advocate for the option to pay more for products and services in exchange for devices that are set not to exfiltrate data, and for any data that is still collected to be classified as not to be monitored or aggregated. The delta in price would reveal how much the aftermarket data trade is worth to the vendor.

I think it would be tremendously valuable for people to be able to know the price/value of the aftermarket data trade. Either it's $1 and it seems an absolute no-brainer to turn it off. Or it's $100 and you begin to explore the situation further. If aftermarket data trade is worth so much to them, do I feel used? Especially if the software is jank and they don't listen to bug reports? Or if the device just tells me the weather and an occasional bad joke?
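The arithmetic behind that price delta is simple enough to sketch. The example below is only an illustration: the `Offer` type, the function names, and the prices are all hypothetical, and it assumes a vendor that publishes both a standard price and a no-collection price for the same product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical product offer; both prices are made up for illustration."""
    name: str
    price_with_collection: float     # standard price, data is collected
    price_without_collection: float  # "fair trade" price, nothing is exfiltrated


def implied_data_value(offer: Offer) -> float:
    """The price delta: what the vendor expects the user's data to be worth."""
    return offer.price_without_collection - offer.price_with_collection


def data_share_of_price(offer: Offer) -> float:
    """Fraction of the no-collection price that the data trade would otherwise cover."""
    if offer.price_without_collection <= 0:
        raise ValueError("no-collection price must be positive")
    return implied_data_value(offer) / offer.price_without_collection


weather_gadget = Offer("weather gadget", 49.00, 149.00)
print(implied_data_value(weather_gadget))              # 100.0
print(round(data_share_of_price(weather_gadget), 2))   # 0.67
```

With numbers like these, two thirds of what the device is really worth to the vendor would come from the data rather than the sticker price, which is exactly the kind of fact a buyer currently has no way to see.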

Control the data appetite

As application builders, engineers, designers, and product specialists, we should be applying back-pressure on the insatiable data appetite of the modern product-marketing apparatus.

Today, the market is highly saturated with analytics service vendors promising the tremendous value of analytics and telemetry. I'm not convinced. To be sure, there is value in some telemetry, but it's so over-hyped and over-emphasized that the rapidly diminishing marginal utility of more telemetry is not considered. As a thought experiment, if you've been using software for twenty or thirty years, ask yourself: have you seen a notable uptick in software quality or features since the advent of widespread telemetry? I certainly haven't. Software today seems as buggy as ever, and feature changes hardly ever align with user requests, such as seen on forums or fan sites.

Recognize that telemetry is over-promising and under-delivering. Software was made for decades without the need for comprehensive telemetry.
Recognize the cost to user privacy of telemetry. Just because you, as the product maker, don't pay that price, does not make it zero.
Acknowledge the fact that your users suffer from information asymmetry and don't even know what's being taken from them.
Evaluate if comprehensive/invasive telemetry is actually necessary. It probably isn't. You can probably survive, and still make massive profits even, with less.
Allow users to opt-out of telemetry. Maybe even make it opt-in. Maybe even pay users who opt-in or give them other benefits.
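
For engineers wondering what the last two points might look like in practice, here is a minimal sketch of opt-in telemetry restricted to a short allowlist. It does not describe any real product or library: `TelemetryClient`, its event names, and the in-memory queue are assumptions made up for this example, and actually sending the queue anywhere is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class TelemetryClient:
    """Hypothetical client: nothing is recorded unless the user has opted in."""
    opted_in: bool = False  # opt-in: collection is off by default
    allowed_events: frozenset[str] = frozenset({"app_start", "app_stop", "crash"})
    queue: list[dict[str, Any]] = field(default_factory=list)

    def record(self, event: str, **details: Any) -> bool:
        """Queue an event only with consent and only if it is on the allowlist."""
        if not self.opted_in or event not in self.allowed_events:
            return False
        self.queue.append({"event": event, **details})
        return True

    def opt_out(self) -> None:
        """Withdraw consent and drop anything not yet sent."""
        self.opted_in = False
        self.queue.clear()


client = TelemetryClient()             # the user has not opted in
print(client.record("app_start"))      # False: nothing is collected

client.opted_in = True                 # the user explicitly opts in
print(client.record("app_start"))      # True: allowlisted event is queued
print(client.record("keypress", key="a"))  # False: not on the allowlist
```

The design choice worth noticing is that restraint is the default in two places: consent starts off, and anything not explicitly listed is dropped rather than collected "just in case."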

Recognize the folly of analytics and metrics

Obviously analytics do have real value, but there is a clear case to be made that many applications and web sites are over-burdened with analytics, to the point of folding over, imploding on themselves.

If the purpose of analytics is to measure user happiness and increase engagement, it's ironic that the crushing burden of analytics is causing users to leave sites and close apps that are slow to the point of being unusable.

The situation became so bad with news and content sites that Google introduced AMP to counter the unrestrained growth of analytics, monitoring, advertising, and bloat. The backlash to AMP is rational: it's a big player (Google) oppressing small creators. And the backlash to that backlash is also rational: it should not be necessary for Google to come in and force restraint; restraint should have happened voluntarily. Content creators who complain about AMP have only themselves to blame if they've been part of the analytics, advertising, and bloat explosion.

When collected data is aggregated, it becomes useful as "metrics," and often serves as justification for a "metrics-driven" product design philosophy. You know the outcomes well. A feature is dropped because "no one" is using it. More emphasis is put on a social like button because it increases user engagement.

Metrics are often corrosive to long-term success. Metrics emphasize local maxima and can distract product from more important goals. Metrics are the "shareholder value" of product design.
Metrics crowd out genuine user input. Never allow metrics to blind you to what your users are directly telling you.
If you were to just eliminate data metrics entirely, would you really be left flailing? Of course not. You would engage your users and measure usage from volunteers, as was done for decades prior.

Increase transparency and encourage decentralization

Tell your users specifically what you collect and why. Show administrative and third-party interfaces to the data to show how it's consumed internally and externally. To date, very few companies have done this very well.
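
As a rough sketch of what "specifically" could mean, a product might keep a machine-readable list of every collected field and render it for users in plain language. Everything here is hypothetical: the `CollectedField` type, the field names, and the recipient are invented for illustration, and the vague-phrase check simply reuses the euphemisms quoted earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectedField:
    """One item of collected data, described concretely (hypothetical schema)."""
    name: str
    purpose: str
    frequency: str
    shared_with: tuple[str, ...] = ()


# The vague wording criticized earlier in this post.
VAGUE_PHRASES = ("usage characteristics", "improve user experience",
                 "periodically", "trusted partners")


def render_disclosure(fields: list[CollectedField]) -> str:
    """Render one plain-language line per collected field."""
    lines = []
    for f in fields:
        recipients = ", ".join(f.shared_with) or "no one"
        lines.append(f"{f.name}: collected {f.frequency} to {f.purpose}; "
                     f"shared with {recipients}")
    return "\n".join(lines)


def find_vague_wording(fields: list[CollectedField]) -> list[tuple[str, str]]:
    """Return (field name, vague phrase) pairs found in a field's description."""
    hits = []
    for f in fields:
        text = " ".join((f.purpose, f.frequency, *f.shared_with)).lower()
        hits.extend((f.name, phrase) for phrase in VAGUE_PHRASES if phrase in text)
    return hits


fields = [
    CollectedField("crash_report", "diagnose crashes", "once per crash"),
    CollectedField("page_views", "improve user experience", "periodically",
                   ("trusted partners",)),
]
print(render_disclosure(fields))
print(find_vague_wording(fields))
```

The first entry tells a user something; the second is flagged three times, because it says nothing a user could act on.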

The ultimate approach to transparency is to embrace self-hosting. Follow Mozilla's lead in providing open source implementations of your server components. Doing so reveals precisely what the server components gather and how it is used. A majority of users will continue to use the centralized option out of convenience, but the availability of a decentralized alternative and the transparency of the code's behavior will yield a high degree of trust.
