Curious what GDPR is short for? Google it! This article describes a neo-bank's journey into the serverless realm of GDPR compliance. It is a wide realm, of course, hence our focus on data anonymization. To consume our products, customers need to share some information with us, such as their personal number, address, or phone number. As soon as we collect that kind of data, GDPR becomes a vital part of our operations.
Bank equals trust. Trust equals obligations. We are obliged to collect, process, and store data in a safe and secure manner. We collect the data using secure transport between the customers' devices running our app or website and our own private little cloud inside the big public 3-letter cloud. The data is processed in microservices, minimizing the risk of data loss or dissemination. Data storage is needed for a number of reasons: one of them is service delivery to our customers, others are regulatory requirements. On the opposite side of the scale, among those regulatory requirements, sits the GDPR "Data minimization" principle from Article 5: "Personal data shall be: (c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed."
We implemented an anonymization process in our databases. Here's a sample for you to enjoy:
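To give a feel for the result, here is a minimal sketch of what an anonymized row might look like next to a regular one. The column names and values are assumptions for illustration, not our actual schema:

```python
# Illustrative only: column names and values are assumptions, not the real schema.
customers = [
    {"PersonalNumber": "000000000000", "Name": "", "IsAnonymized": 1},
    {"PersonalNumber": "198501012345", "Name": "Jane Doe", "IsAnonymized": 0},
]

statuses = []
for row in customers:
    # "1" equals "True": the flag marks a row whose PII has been wiped.
    statuses.append("anonymized" if row["IsAnonymized"] == 1 else "identifiable")

print(statuses)
```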
In the first row, you can see that the customer's data has been anonymized, since we all know that "1" equals "True". It is now impossible to identify the customer.
Not so complicated, is it? Well, when you are running a distributed system with more than 20 databases, each with different rules for scheduling the anonymization process, it becomes a really tough challenge.
Or, more accurately, the determination and ability to use various AWS services came to the rescue. Among those services we mostly used Lambda, DynamoDB, and SNS, combining them into a new microservice that handles the anonymization process. We'd like to focus on the features of these AWS services and give you a walkthrough of how we've used them.
Let's specify the two requirements we were faced with:

- an anonymization has to be scheduled for a future date, once a customer's data is no longer needed;
- a scheduled anonymization has to be possible to postpone, for example when the customer takes a new product.
Knowing these requirements, we figured that "events" might be suitable. We thought about designing one event for scheduling the anonymization and another for postponing it. One of the team members then came up with this brilliant question:
“Okay let’s use events for that, but where are we going to store them?”
The obvious choice at this stage was DynamoDB.
DynamoDB is an excellent choice since:

- it is fully managed and serverless, so there is no infrastructure for us to maintain;
- it integrates natively with Lambda and the rest of the AWS serverless stack;
- you pay only for the capacity you actually use.
Also, it scales efficiently, so we would be able to handle the initial load when we needed to process all of our existing customers. After choosing DynamoDB as our database solution, we faced the first challenge: the choice of capacity mode. We could go with either provisioned capacity or on-demand.
Provisioned capacity is more suitable when:

- traffic is predictable and relatively steady;
- you can forecast your read and write capacity needs in advance;
- you want the lowest cost at consistently high utilization.
On-demand is more suitable when:

- traffic is unpredictable or spiky;
- the workload is new and you have no usage history to base a forecast on;
- you prefer paying per request over managing capacity.
The next challenge was: what happens if the save to DynamoDB fails? We tackled this problem using DLQs (dead-letter queues), so all events can be replayed and nothing is lost.
The DynamoDB table we run is composed of an ID, a SubID, and properties specifying the state of the event. DynamoDB stores all the data in a single table that is queried by its primary key, made up of a partition key and a sort key (ID and SubID in our case).
It's very important to specify the ID and SubID correctly. We needed to go through all the use cases and predict what we would need. If you specify them incorrectly and fail to predict a certain query pattern, you might end up performing scans on the whole table, which is very inefficient.
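A composite key along these lines keeps all of a customer's events in one partition, so they can be queried without a scan. The key format below is a sketch under assumed naming, not our exact schema:

```python
# Sketch of how a composite key might be built (attribute names are assumptions).
def build_event_key(customer_id: str, event_type: str, due_date: str) -> dict:
    # ID is the partition key; SubID is the sort key, so all events for a
    # customer land in one partition and can be queried by key condition.
    return {
        "ID": f"CUSTOMER#{customer_id}",
        "SubID": f"{event_type}#{due_date}",
    }

key = build_event_key("12345", "ANONYMIZE", "2021-06-01")
# With boto3, this key could feed table.get_item(Key=key) or a
# table.query(KeyConditionExpression=...) on the "ID" value.
```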
We needed to save information to DynamoDB about the products used by our customers. We chose Lambda since it allows us to create lightweight functions that are executed on AWS; a lambda is just like a function in code, responsible for one piece of functionality. But we didn't want to slow down the process of applying for a product or, for example, paying out a loan, so we couldn't wait for this function to finish. Instead, we used 'Event'-based execution to invoke the lambda function without waiting for the response:
The only thing we needed to do was specify the 'InvocationType' as 'Event'. If the lambda invocation is unsuccessful, the event is moved to the DLQ (after the initial attempt and two automatic retries). We can trace these events and retry them manually. With this functionality, we were able to decouple the anonymization stack from our product business logic.
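A minimal sketch of such a fire-and-forget invocation; the function name and payload shape are made up for illustration:

```python
import json

def build_async_invoke_args(function_name: str, payload: dict) -> dict:
    return {
        "FunctionName": function_name,
        # "Event" makes Lambda queue the request and return immediately,
        # instead of the default synchronous "RequestResponse".
        "InvocationType": "Event",
        "Payload": json.dumps(payload),
    }

args = build_async_invoke_args("schedule-anonymization", {"customerId": "123"})
# With boto3 installed, the actual call would be:
#   boto3.client("lambda").invoke(**args)
```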
The next thing we needed was to send a message about the anonymization to all of our microservices containing PII (personally identifiable information). A perfect choice for that was Simple Notification Service (SNS), a managed publish/subscribe messaging service. We used it to fan out the notification to all microservices. It is very useful when:

- one message has to reach many subscribers at once (fan-out);
- publishers should stay decoupled from the consumers of their messages.
Let's sum up how the architecture evolved and where we used SNS. In the beginning, we had a simple structure: a CloudWatch event (the trigger) calling a lambda, which saves the anonymization event to DynamoDB.
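The triggered lambda could look roughly like this; the handler, item attributes, and table name are assumptions, and the actual DynamoDB write is left as a comment:

```python
from datetime import datetime, timezone

# Sketch of the scheduled Lambda handler (names are assumptions): triggered by
# a CloudWatch event, it stores an anonymization event in DynamoDB.
def handler(event, context):
    item = {
        "ID": f"CUSTOMER#{event['customerId']}",
        "SubID": f"ANONYMIZE#{event['dueDate']}",
        "State": "SCHEDULED",
        "CreatedAt": datetime.now(timezone.utc).isoformat(),
    }
    # With boto3:
    #   boto3.resource("dynamodb").Table("AnonymizationEvents").put_item(Item=item)
    return item

item = handler({"customerId": "1", "dueDate": "2021-06-01"}, None)
```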
Later, we added a check for active products as a safety measure, making sure not to anonymize a customer with an active product; doing so would not have been good, since we wouldn't know anything about the customer afterwards. We used SNS to spread the anonymization event to the specific microservices containing customer data:
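On the receiving end, each subscribing microservice gets the fan-out message through its own lambda and anonymizes its own copy of the data. A sketch under assumed message shape:

```python
import json

# Sketch of a subscribing microservice's Lambda (names are assumptions): it
# receives the SNS fan-out message and anonymizes its own customer data.
def anonymization_handler(event, context):
    handled = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        # Here the service would verify the customer has no active product,
        # then overwrite the PII columns in its own database.
        handled.append(message["customerId"])
    return handled

# A fake SNS event, for illustration:
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({"customerId": "123", "action": "ANONYMIZE"})}}
    ]
}
```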
To double-check, we also added a verification inside the anonymization lambdas, confirming that the customer does not have an active product. Having this architecture in place allowed us to comply with the GDPR rules.
First, we needed to solve the S3 bucket versioning problem. When versioning is enabled on an S3 bucket, you have to delete all versions of a file to truly get rid of it. In the AWS console, you can see all the versions when you toggle the option to show them:
If you delete a file while its versions are hidden, in reality you just add a delete marker to it; the file is still available:
Only when you delete the file together with all its versions is it completely removed.
How did we handle it from the code perspective? We listed the files with all their versions and then deleted them one by one.
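A sketch of that logic: `list_object_versions` returns both `Versions` and `DeleteMarkers`, and both must be removed. The helper below only builds the batch of (key, version) pairs; the actual boto3 calls are shown in comments, and the bucket and key names are assumptions:

```python
# Collect every version and delete marker of one object so all of them
# can be deleted in a single batch request.
def build_delete_batch(list_versions_response: dict, key: str) -> list:
    targets = []
    for group in ("Versions", "DeleteMarkers"):
        for version in list_versions_response.get(group, []):
            if version["Key"] == key:
                targets.append({"Key": key, "VersionId": version["VersionId"]})
    return targets

# With boto3:
#   resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
#   s3.delete_objects(Bucket=bucket, Delete={"Objects": build_delete_batch(resp, key)})
```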
The next challenge was related to the characteristics of databases. We had a unique key set up on the customer's personal number, but the anonymization process converts every personal number into a string of zeros, so the unique constraint started throwing errors because the anonymized personal numbers were identical. How did we deal with that? We added another column, 'AnonymizationToken', and updated the unique constraint to be based on both columns instead of only the personal number. The AnonymizationToken is a unique UUID, so even after the personal number is anonymized, the unique constraint is still satisfied.
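The trick can be demonstrated on a local SQLite table; the real database engine and column names may differ, but the composite constraint behaves the same way:

```python
import sqlite3
import uuid

# Demonstrates the composite unique constraint (table and column names
# are illustrative, not our production schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        PersonalNumber TEXT NOT NULL,
        AnonymizationToken TEXT NOT NULL,
        UNIQUE (PersonalNumber, AnonymizationToken)
    )
""")

# Two anonymized customers share the same zeroed-out personal number, but
# their unique tokens keep the composite constraint satisfied.
for _ in range(2):
    conn.execute(
        "INSERT INTO Customer VALUES (?, ?)",
        ("000000000000", str(uuid.uuid4())),
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM Customer").fetchone()[0]
print(count)
```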
Since serverless services run only when we need them, they are really inexpensive. We don't need to keep infrastructure running 24/7, because the anonymization is a process executed once a day. The yearly cost of running the microservice that anonymizes the data of all the customers in our bank is about the same as a pair of new AirPods.
Using AWS serverless services enabled us to quickly implement a complicated infrastructure and meet the GDPR requirements. Besides that, we learned a lot during this process and had a lot of fun too.