Data doesn’t just flow – it is processed, says digi.me’s CTO Gavin Ray
The world often appears to be based on simple facts: things that are taken as true and never challenged. The sky is blue, grass is green and gravity points downwards, to give a few examples.
Another truism is that “data is a little packet of something and flows easily when needed”. We write code, throw data around at will and push it in and out of databases.
In the vast majority of global data management systems, there will be some kind of database that stores user data, allows you to query that data and fetch it on demand. Yet we can easily forget that we have simply assumed it is easy for data to flow, when in fact something has to request and process that data. This is because a processing system is tied to the storage system to do whatever you need, and since all users’ data for the average global web service is centralised, coders write software that says “get me data ABC for user X”.
You often hear the more modern phrase “the graph”, which is really a structured form of a highly distributed database: it links all the users’ identity records to all their different data. Whether graph or database, the system’s job is to handle all the requests to Create/Read/Update/Delete (CRUD) your data.
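To make the contrast concrete, here is a minimal sketch of that centralised pattern. The table name, schema and helper function are purely illustrative, not any real system’s API:

```python
# Illustrative only: a centralised service where storage and processing live
# together, so "get me data ABC for user X" is one query against one database.
import sqlite3

db = sqlite3.connect(":memory:")  # hypothetical central store holding everyone's data
db.execute("CREATE TABLE IF NOT EXISTS user_data (user_id TEXT, category TEXT, payload TEXT)")

def read_user_data(user_id: str, category: str) -> list:
    """Classic CRUD 'read': fetch all of user X's records in category ABC."""
    rows = db.execute(
        "SELECT payload FROM user_data WHERE user_id = ? AND category = ?",
        (user_id, category),
    )
    return [payload for (payload,) in rows]
```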
The software dance
These so-called CRUD functions are simply a suite of processing “tasks” that the database processing system must support, fetching and delivering data to and from the storage it is networked to. To handle personal data you obviously need a “secure” database, which means you must write code to do a little software “dance” with crypto keys to ensure that any code or app asking to touch data provides a security token and has permission to do what it asks. Once the security dance is over, the processing engine may safely open a connection to the data held on the storage domain it owns.
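As a rough sketch of what that dance looks like in code (the token fields, permission names and objects below are hypothetical, not our actual implementation):

```python
# Sketch of the "security dance": check the caller's token and permissions
# before the processing engine opens any connection to the storage it owns.
def handle_request(request, token_service, storage):
    claims = token_service.verify(request.token)          # is the token genuine?
    if request.operation not in claims["permissions"]:    # is this action actually allowed?
        raise PermissionError("caller is not permitted to perform this operation")
    connection = storage.open(claims["storage_domain"])   # only now do we touch the data
    try:
        return connection.execute(request.operation, request.payload)
    finally:
        connection.close()                                # never leave the door open
```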
But why did I wonder at the beginning whether we should double-check that the sky is still blue, grass green and gravity points downwards? It’s simple really: the digi.me mission wasn’t necessarily achievable. We were challenged to show that we could allow everyone to have all their online data owned, managed and controlled by them, in storage they own, but with no processing attached to it.
Our aim was to “get back” everyone’s data and move it from their online services into their own storage. The reason this turns things upside down comes in several parts:
- Firstly, every user must own independent personal cloud storage, but it doesn’t have any processing attached, so we can’t simply write code to save, search or query user data in all its richness, with its different formats, properties and sizes.
- Second, this is complicated by the fact that our system also will not work like a database where all the data is in one logical place. In our case, the entire database is distributed across each user’s storage, with no central model or the traditional data structure you get with databases.
- It also means we have no central indexes or caches or aggregation datasets, so none of the traditional processing models would work.
- To make matters worse – or better if you are a privacy engineer – we don’t even know who a user is or anything about them.
This means I tend to describe our cloud service that solves all these problems as “stateless” – or, at least, that’s the term I use in documentation. In reality, we describe our cloud services as dumb or stupid: they know nothing, decide nothing and, from a privacy management perspective, are only ever able to act on the user’s command. This is an important point in both regulated and security-centric environments, because it means our services are “a proxy” of the user.
Detangling data and processing
You could argue that keeping user data in the user’s own storage is simple. But when you need to run a global cloud service to connect users with other parties who may wish to offer or share data for some exchange of value, you find all kinds of unusual system design challenges. The logical concept of giving users their data back works well: they own the storage, their apps control the data collection and sharing, and from various regulatory and privacy perspectives the user becomes the data custodian, the data processor and the data controller. Neat and empowering, right?
The only problem is, you need to find a way of joining all these users into an ecosystem powered by supplying and sharing personal data for the benefit of those users.
To explain why this created an engineering and software challenge for digi.me, just think about how distributed user data works if it is stored in Google Drive, Microsoft OneDrive or Dropbox – as we do today. If you have any kind of database and cloud system experience, you may have forgotten that accessing and handling user data is based on a close and entangled relationship between data storage and processing. Most search, fetch and share services simply access a pile of software that holds records of what data is where – usually a form of index that your software scans to answer the question “where are the things like XYZ that have properties ABC?”.
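In toy form, that central index is just a lookup structure your code scans. The layout below is purely illustrative:

```python
# A toy central index: the "pile of software that holds records of what data is
# where". Real systems are far richer, but the shape of the question is the same.
index = {
    ("photo", "holiday-2019"): ["server-3:/blob/8841", "server-7:/blob/1202"],
    ("post", "twitter"):       ["server-1:/blob/0017"],
}

def locate(kind: str, prop: str) -> list:
    """Answer 'where are the things like XYZ that have properties ABC?'."""
    return index.get((kind, prop), [])

print(locate("photo", "holiday-2019"))  # -> the storage locations to fetch from
```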
But imagine now you want to build a cloud service where all the users’ data is held in their OneDrive account and you want to share it with BigInterestingCo who is offering the user something irresistible in return. What kind of code would you write to do this? Perhaps you share with BigInterestingCo the access keys the user has to their OneDrive account, and tell them to only access data XYZ as agreed. Hmm, from now on BigInterestingCo has access to the user data – all of it.
But in BigInterestingCo’s defence, they don’t want this either; they have a practical problem. They don’t want to write software to look at a directory of user data. It’s not that they wouldn’t find marketing people who say “hell yeah, why not”, it’s just that their developer team would never realistically want to access a user’s cloud filesystem, worry about what data is where and in what form, and own the software validation problem, let alone privacy regulation sign-off.
Clearly, the digi.me solution solves this problem by removing processing and database-style access. We untangled the traditional relationship, but had to redesign the way data is stored in a filesystem and define how queries and searches can be done remotely, through secure access.
We have created a cloud service that can securely connect to a user’s filesystem and selectively query or write data on demand, but only when instructed by a user device and within the user’s defined permissions. We designed a filesystem and cloud service access model that supports multiple user devices concurrently, as well as concurrent cloud services, to fetch and save online data while also sharing it.
Proxy processing and dumb-stateless cloud services
After making this leap to say we will store all users’ online data for them, in their own storage, we faced (a great many) security issues and a significant privacy engineering problem: how to build a cloud service that acts, on demand, to get users’ data without seeing, storing or using it, and how to act as a broker between data consumers and the user as a data provider.
The answer was to define a “proxy” model, where the user is king, their on-device/desktop app is their personal manager, and a cloud service is dumb and stateless, doing nothing until the user’s app tells it to.
In fact, the user’s app acts more like a crypto wallet, and passes keys and permissions to the cloud to do various specific things with various specific datasets. At no point does the cloud retain any user profiles or permissions longer than the time it takes to execute the required function.
We had to look into how banking and health industry regulations work and what defines a data processor or controller. In the end it comes down to the notion of decision-makers (users) and slaves (cloud services) that enact commands and then evaporate without trace. At the user’s instruction, via their app, the requested function and security credentials are passed to the cloud service. After every action, the digi.me cloud services flush the memory used and hold no personally identifiable information.
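In pseudo-code terms, the proxy flow looks something like the sketch below; the function and object names are illustrative, not our published API:

```python
# Sketch of the proxy model: the user's app hands over short-lived credentials
# and one instruction; the service executes it and retains nothing afterwards.
def proxy_execute(storage_client, instruction, ephemeral_keys):
    """Run exactly one user-authorised instruction, then forget everything."""
    session = storage_client.connect(ephemeral_keys)  # access exists only because the user's app granted it
    try:
        result = session.run(instruction)             # the single permitted action
    finally:
        session.close()
        ephemeral_keys = None                         # drop the credentials; no user state survives the call
    return result
```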
The Justin Bieber challenge
Given that we can safely and securely run processing on a user’s behalf to access their data in their own storage, the final detanglement challenge was to allow data saves and searches in some kind of efficient way. The problem is this: digi.me handles any and all personal data, from millions of little Twitter tweets and Facebook posts, to health records and megabyte X-rays, from multiple banking records to your Spotify playlists and Fitbit health stats.
We had to find a cloud storage model that dealt with this, and Justin Bieber came to our rescue. We looked at his stats and set out to show how we could handle the massive volume of his daily social media profile (this was three years ago, perhaps Kanye would be today’s benchmark) and ensure we could categorise, save and query any amount of his data in any format from any source.
The result was a data harmonisation system that ingests any data and reforms it into a global standard format (published on https://developers.digi.me), plus a hierarchical categorisation model that allows us to handle arbitrary data types from any source and over any timeline, yet search and query by direct reference to a filesystem without needing a database-class index.
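As a simplified illustration of what that means in practice, the hierarchy itself becomes the query surface. The path scheme below is invented for the example, not our published ontology:

```python
# Sketch: querying a hierarchically categorised filesystem directly, with no
# separate database index. The layout (category/source/year) is illustrative.
from pathlib import Path

def query(library_root: str, category: str, source: str, year: int) -> list:
    """Find items purely by walking the category path, e.g. social/twitter/2019."""
    return sorted(Path(library_root).glob(f"{category}/{source}/{year}/*/*.json"))

# e.g. query("/user-library", category="social", source="twitter", year=2019)
```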
We also had to go and play with building secure journals, file encryption and integrity managers, but that’s worth a blog in its own right.
So, in the end, we agree the sky is blue and the grass is green, but in our cloud computing world, it still feels like gravity points sideways.