Utilizing #Presto in projects involving billions of data points

Biddut Sarker Bijoy
Apr 8, 2023 · 3 min read

Today, I’d like to discuss my experience utilizing Presto in projects involving billions of data points. Many of the big data projects I’ve worked on have had computational issues, such as the need to shrink datasets or the challenge of making sense of terabytes of data.

Recently, I worked on a project that required me to clean up a large database. Sometimes we all need to clean up a database, and this one was not straightforward for many reasons. The following paragraphs give you a glimpse of the difficulties we faced and the solution we ultimately chose.

Problem

We need to clean up a big database that contains billions of data points spanning many years.

Why?

  • To reduce the storage cost.
  • To improve the performance of the applications running on the database.

Challenges:

  1. The data needed to decide whether an entry can be deleted does not reside in the same entity collection, or even in the same database.
  2. The data formats of the different sources are not the same.
  3. We cannot delete the documents in one pass. We want to verify that a deletion decision will not break our system’s credibility before acting on it.
  4. The solution should be fast enough that we can lower our storage cost as soon as possible.
  5. The solution should not impact the live databases, so we will need to process the input data offline.
  6. Each step of the solution should be easily testable.
  7. Finally, only when we are sure of everything will we execute the deletion operation (green line in the following figure) and shrink the big database.

Figure: Naive solution

Solution

Among the various options, we found Presto to be the best fit for this problem: its distributed architecture lets it process billions of data points in a very short amount of time and at a relatively low cost. More details about Presto can be found here: https://prestodb.io/

Solution using Presto

  1. Unload data from the primary source and save it in the data lake, preferably in a columnar format and partitioned by date (a sketch of steps 1–2 follows this list).
  2. Unload data from the other sources and save it in a similar format and partitioning, if possible.
  3. Create tables over the data saved in the data lake, preferably using the date partitions.
  4. Run the analytics query and write the final results to a separate path in the data lake (a sketch of steps 3–4 also follows this list).
  5. Read the results and mark the matching entries in the database. The mark can be a separate column, a field, or even a mapping.
  6. Run tests using the new field or mapping. We can run A/B tests with real users too, and iterate over steps 4–6 if we are still not sure of our decision.
  7. Finally, we can safely delete the marked data from the big database and shrink it. Don’t worry: even if we later find something we deleted incorrectly, we still have a backup in the data lake and can always restore from there (the marking and deletion sketch follows this list as well).
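
To make steps 1–2 concrete, here is a minimal sketch, assuming the primary source can be scanned in batches (purely hypothetically, through pymongo) and that pandas with pyarrow is available to write date-partitioned Parquet files into the data lake. The collection name, field names, and S3 path are illustrative placeholders, not our actual schema.

```python
# Sketch of steps 1-2: dump a source collection into the data lake as
# Parquet files partitioned by date. The source database (MongoDB here),
# the field names, and the output path are all hypothetical.
import pandas as pd
from pymongo import MongoClient

# In practice this would read from a snapshot or replica, not the live
# primary, so the live databases are not impacted (challenge 5).
mongo = MongoClient("mongodb://primary-db-replica.example.com:27017")
events = mongo["app"]["events"]  # hypothetical database/collection names

LAKE_PATH = "s3://my-data-lake/primary_events/"  # needs s3fs installed
batch, batch_size = [], 100_000

for doc in events.find({}, {"_id": 1, "payload": 1, "created_at": 1}):
    batch.append({
        "entity_id": str(doc["_id"]),
        "payload": str(doc.get("payload")),
        "dt": doc["created_at"].strftime("%Y-%m-%d"),  # partition column
    })
    if len(batch) >= batch_size:
        # partition_cols writes one subdirectory per distinct dt value
        pd.DataFrame(batch).to_parquet(LAKE_PATH, partition_cols=["dt"])
        batch = []

if batch:
    pd.DataFrame(batch).to_parquet(LAKE_PATH, partition_cols=["dt"])
```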
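
Steps 3–4 are where Presto does the heavy lifting. The sketch below assumes a Hive catalog named hive is configured over the data lake bucket and that queries are submitted through the presto-python-client package; the table names, columns, S3 paths, and the join standing in for our deletion logic are all illustrative.

```python
# Sketch of steps 3-4: expose the unloaded Parquet files as a
# date-partitioned table and run the analytics query with Presto.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="cleanup_job",
    catalog="hive",
    schema="cleanup",
)
cur = conn.cursor()

# Step 3: create an external table over the data saved in the data lake.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hive.cleanup.primary_events (
        entity_id VARCHAR,
        payload   VARCHAR,
        dt        VARCHAR
    )
    WITH (
        format = 'PARQUET',
        partitioned_by = ARRAY['dt'],
        external_location = 's3://my-data-lake/primary_events/'
    )
""")
cur.fetchall()  # the client executes lazily; fetching completes the query

# Existing date partitions may need to be registered with the metastore,
# e.g. via the Hive connector's sync_partition_metadata procedure.
cur.execute(
    "CALL hive.system.sync_partition_metadata('cleanup', 'primary_events', 'ADD')"
)
cur.fetchall()

# Step 4: apply the deletion logic across sources and write the candidate
# ids to a separate path in the data lake (CREATE TABLE AS SELECT).
# Writing to an explicit external location may need to be enabled in the
# Hive connector configuration.
cur.execute("""
    CREATE TABLE hive.cleanup.delete_candidates
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-data-lake/delete_candidates/'
    ) AS
    SELECT e.entity_id
    FROM hive.cleanup.primary_events e
    LEFT JOIN hive.cleanup.reference_data r
        ON e.entity_id = r.entity_id
    WHERE r.entity_id IS NULL   -- illustrative rule: no matching reference
""")
cur.fetchall()
```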
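
For steps 5 and 7, the marking and the final deletion depend on what the primary database actually is; the sketch below assumes, again hypothetically, a document store accessed through pymongo and a boolean marker field called cleanup_candidate. The same pattern applies to any store: stream the candidate ids out of Presto, mark them, run the tests of step 6, and only then delete.

```python
# Sketch of steps 5 and 7: mark the candidate documents in the primary
# store, and delete them only after the tests of step 6 pass.
import prestodb
from pymongo import MongoClient, UpdateOne

presto_conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="cleanup_job", catalog="hive", schema="cleanup",
)
cur = presto_conn.cursor()
cur.execute("SELECT entity_id FROM hive.cleanup.delete_candidates")

mongo = MongoClient("mongodb://primary-db.example.com:27017")
events = mongo["app"]["events"]  # hypothetical database/collection names

# Step 5: mark the candidates instead of deleting them right away.
while True:
    rows = cur.fetchmany(10_000)
    if not rows:
        break
    ops = [
        UpdateOne({"_id": entity_id}, {"$set": {"cleanup_candidate": True}})
        for (entity_id,) in rows
    ]
    events.bulk_write(ops, ordered=False)

# Step 6 happens outside this script: run the tests / A/B experiments
# against documents where cleanup_candidate is True.

# Step 7: only after those tests pass, delete the marked documents.
# Kept commented out on purpose; the data lake copy remains as a backup.
# events.delete_many({"cleanup_candidate": True})
```

Leaving the delete_many call commented out until the tests pass mirrors challenge 3: nothing is removed until we are sure the marked set is correct, and even then the data lake copy lets us restore anything deleted by mistake.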

Written by Biddut Sarker Bijoy

CS PhD Student at Stony Brook University
