Saving $30,000 a month by improving Garbage Collection

Harshal Chaudhari
Mixpanel Engineering
6 min read · Jul 8, 2021


Over the past few years, we’ve been focused on building a sustainable business at Mixpanel, and there’s a lot of emphasis on having sound unit economics. The engineering team plays a big part in this by periodically assessing our data center spend and removing inefficiencies. Earlier this year, we worked on one such project to improve our per-unit data storage costs.

First, some background. Mixpanel is an event-centric product analytics tool. Over the past 12 months, we’ve seen a substantial increase in the number of events ingested by our analytics database, “arb”. To put this in concrete numbers, we now process over a trillion updates every month. While each event is processed once, its storage costs compound over time.

Storing events

“arb”, at its core, is a distributed column-oriented storage and query engine. Incoming data is sharded horizontally across multiple storage machines, keyed on a combination of the user identifier and the time associated with each event. Kafka serves as a distributed log, and each storage node consumes the topics it’s responsible for. Each node writes incoming events to a row-oriented write-ahead log on its attached SSD-based Persistent Disks (PDs). Periodically, this write-ahead log is “compacted” into a columnar representation and stored as an immutable file on disk. Storage for these immutable files is tiered: recent files live on PDs and, as they age, are moved to Google Cloud Storage (GCS). In addition, each storage node maintains a special file called the “manifest”, which stores references and locations of all the currently live files. The manifest is encoded using protocol buffers.
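
To make this a little more concrete, here’s a simplified Go sketch of the kind of bookkeeping the manifest does. The type and field names are illustrative, not our actual protobuf schema.

```go
// Illustrative only: a simplified, hypothetical view of the manifest.
// The real manifest is a protocol buffer; these names are made up.
package arbsketch

type Tier int

const (
	TierPD  Tier = iota // recent columnar files on SSD-based Persistent Disks
	TierGCS             // older files that have graduated to Google Cloud Storage
)

// FileRef locates one immutable columnar file produced by compaction.
type FileRef struct {
	Path      string // object path on PD or GCS
	Tier      Tier
	SizeBytes int64
	MinTime   int64 // smallest event timestamp in the file
	MaxTime   int64 // largest event timestamp in the file
}

// Manifest lists every file that is currently live (i.e., queryable)
// on a single storage node.
type Manifest struct {
	LiveFiles []FileRef
}
```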

Snapshots and time travel

At this point, it’s important to understand how our database can time travel. As an analytics company, it is unacceptable for us to lose any customer data. However, bugs are a part of life, and the worst kind of bugs are those that erroneously modify data at the storage layer. While we have multiple types of data backups, our process to recover from storage-level bugs is fairly straightforward: we take snapshots of PDs every 12 hours and retain 7 days of data in Kafka. To go back in time, all we need to do is mount an older snapshot on a storage node; it will then resume from the right Kafka offsets and replay traffic with the new code.
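
As a rough illustration of that replay step, here’s a minimal sketch of resuming a partition from a saved offset. It uses the segmentio/kafka-go client purely as an example; the broker address, topic name, and the applyToWriteAheadLog hook are placeholders, not our actual consumer.

```go
// A minimal sketch of replaying from a known Kafka offset after restoring a
// snapshot. Everything here is illustrative.
package replay

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

func replayFrom(ctx context.Context, partition int, offset int64) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:   []string{"kafka:9092"}, // placeholder broker
		Topic:     "arb-events",           // placeholder topic
		Partition: partition,
	})
	defer r.Close()

	// Resume exactly where the restored snapshot left off; everything after
	// this offset is replayed against the node's write-ahead log.
	if err := r.SetOffset(offset); err != nil {
		return err
	}
	for {
		msg, err := r.ReadMessage(ctx)
		if err != nil {
			return err // context cancellation or broker error ends the replay
		}
		log.Printf("replaying event at offset %d (%d bytes)", msg.Offset, len(msg.Value))
		// applyToWriteAheadLog(msg) // hypothetical hook into the storage node
	}
}
```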

Orphaned Files

Over time, we’ve realized that having too many files is bad for performance. In a system that separates storage and compute, lots of small files add significant overhead for both the storage and the query layers. As a result, we merge these immutable files periodically: we download the source file set, generate a single combined file, and atomically replace the old files with the new one in the storage node’s “manifest”. The old files are no longer queried and can be deleted to reclaim space.
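
Continuing the manifest sketch from above, the swap itself might look something like this; the locking and function names are illustrative, not our actual code.

```go
// A sketch of the compaction bookkeeping described above: the merged file
// replaces its sources in the manifest in one step, so queries never see a
// partially merged state.
package arbsketch

import "sync"

type node struct {
	mu       sync.Mutex
	manifest Manifest
}

// swapMerged removes the source files from the manifest and adds the single
// merged file, under one lock, then returns the now-unreferenced sources so
// they can be scheduled for deletion.
func (n *node) swapMerged(sources map[string]bool, merged FileRef) []FileRef {
	n.mu.Lock()
	defer n.mu.Unlock()

	var live, orphaned []FileRef
	for _, f := range n.manifest.LiveFiles {
		if sources[f.Path] {
			orphaned = append(orphaned, f) // no longer queried after the swap
		} else {
			live = append(live, f)
		}
	}
	n.manifest.LiveFiles = append(live, merged)
	// In the real system, the updated manifest is serialized (protocol buffers)
	// and persisted before the old files are considered deletable.
	return orphaned
}
```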

Files on PDs can be safely removed because earlier snapshots will have a copy of that file. However, files that have graduated to GCS are harder to get rid of: restoring an earlier snapshot of a PD can resurrect a reference to a now-deleted file in GCS, so it’s not safe to immediately remove an unreferenced file from cloud storage. The easiest way around this problem was to never delete files from GCS. For a long time, this worked fine; the orphaned files were a tiny percentage of our total data. However, as we started allowing conditional changes to historical data through features like Identity Management and support for GDPR/CCPA compliance, the unreferenced files started growing in both number and size.

Solutions

As with most engineering problems, this one could be solved incrementally.

Stop the bleeding

Our first priority was to collect as many orphaned files as possible to stop the storage costs from accumulating. We wrote a one-off tool to go through our GCS buckets and delete all the files that didn’t have an entry in a manifest. We waited 7 days between generating the list of files and checking the manifests to avoid problems with snapshot-based restores. While this worked, it took over 48 hours to generate the list of files using GCS’s APIs, so it wasn’t something we could run very often. Running data deletions via one-off scripts is also extremely error-prone: we once had to restore multiple terabytes of data from our coldline backups because someone ran a deletion script that deleted more data than it should have. These kinds of restores are costly and cause unscheduled downtime for our customers.
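
For a rough idea of what that one-off tool does, here’s a sketch built on the cloud.google.com/go/storage client. The bucket name and the isReferenced lookup are placeholders; the important part is the pause between listing and deleting.

```go
// A rough sketch of the one-off sweep: list everything in the bucket, wait,
// re-check the manifests, and delete anything still unreferenced.
package sweep

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func sweep(ctx context.Context, isReferenced func(object string) bool) error {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	bucket := client.Bucket("arb-columnar-files") // placeholder bucket name
	it := bucket.Objects(ctx, nil)
	var candidates []string
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			return err
		}
		candidates = append(candidates, attrs.Name)
	}

	// ... wait 7 days here before re-checking the manifests, so that a
	// snapshot restore cannot resurrect a reference to a deleted file ...

	for _, name := range candidates {
		if isReferenced(name) {
			continue
		}
		if err := bucket.Object(name).Delete(ctx); err != nil {
			log.Printf("failed to delete %s: %v", name, err)
		}
	}
	return nil
}
```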

Finding a more sustainable solution

The simplest solution to our problem was to temporarily store a list of unreferenced files in the manifest. Periodically, the storage node would loop through this list and delete from GCS any entries older than 7 days. This worked fine in theory, but it significantly increased the size of the manifest, and with it the latency of serializing and deserializing that file. After a bit of napkin math, we figured we could cap the list at about 2000 unreferenced files per manifest without an unacceptable latency hit; any unreferenced files beyond that limit would still be orphaned. With this solution, we collected over 400TB of data in a month.
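
The periodic pass is conceptually simple; a sketch of it might look like this, with illustrative names and the 7-day retention window from our snapshot policy.

```go
// A sketch of the periodic pass over the manifest's tracked unreferenced
// files, deleting those older than the 7-day snapshot window.
package arbsketch

import "time"

type UnreferencedFile struct {
	Path           string
	SizeBytes      int64
	UnreferencedAt time.Time // when the file was dropped from the live set
}

const retention = 7 * 24 * time.Hour

// collect returns the entries that are now safe to delete from GCS and the
// entries that must stay tracked until they age past the snapshot window.
func collect(tracked []UnreferencedFile, now time.Time) (deletable, remaining []UnreferencedFile) {
	for _, f := range tracked {
		if now.Sub(f.UnreferencedAt) > retention {
			deletable = append(deletable, f)
		} else {
			remaining = append(remaining, f)
		}
	}
	return deletable, remaining
}
```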

With the tracking we put in place while implementing the solution above, we realized that we were capturing only about 20% of the unreferenced files. An easy optimization we could implement right away was to heap-ify the stored list and evict the smallest file from the manifest instead of a random one, so that the entries we kept (and eventually deleted) accounted for the most bytes. This change almost doubled the amount of data we could delete every month, to around 800TB. At this point, we were collecting about 60% of the unreferenced data by size.
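
Reusing the UnreferencedFile type from the sketch above, the eviction policy is a straightforward min-heap on file size via container/heap; again, this is an illustration rather than our production code.

```go
// A sketch of the heap-based eviction: when the tracked list exceeds its cap,
// drop the smallest file so the entries we keep (and eventually delete)
// account for the most bytes. The cap of 2000 comes from the napkin math above.
package arbsketch

import "container/heap"

// sizeHeap is a min-heap of tracked unreferenced files, ordered by size.
type sizeHeap []UnreferencedFile

func (h sizeHeap) Len() int           { return len(h) }
func (h sizeHeap) Less(i, j int) bool { return h[i].SizeBytes < h[j].SizeBytes }
func (h sizeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *sizeHeap) Push(x any)        { *h = append(*h, x.(UnreferencedFile)) }
func (h *sizeHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

const maxTracked = 2000

// track adds f to the heap and, if the cap is exceeded, evicts the smallest
// file, which then stays orphaned in GCS.
func track(h *sizeHeap, f UnreferencedFile) (evicted *UnreferencedFile) {
	heap.Push(h, f)
	if h.Len() > maxTracked {
		smallest := heap.Pop(h).(UnreferencedFile)
		return &smallest
	}
	return nil
}
```

Evicting the smallest entry means the bytes we give up on are always the cheapest ones to give up.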

BadgerDB

With the above solution in place, we felt we were in a good spot to approach this problem from first principles and make more fundamental changes to our architecture. The first change we considered was moving the list of unreferenced files into a separate file, so that the manifest, which stores the list of all live files, would not carry the extra serialization and deserialization overhead. But that only moved the bottleneck from one part of the system to another, and it didn’t fundamentally improve the live serving system in any way. The real fix was to stop storing this metadata in flat files and instead put it in a database optimized for these kinds of lookups. After a bunch of research and testing, we settled on BadgerDB to store this information. BadgerDB is an embeddable key-value store written in Go. It works well on SSD-based PDs and has excellent performance for updates, which was a key requirement for us.
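
As a minimal illustration of how BadgerDB fits in, the sketch below records an unreferenced file along with the time it was dropped from the live set. The import path assumes Badger v3, and the key/value layout is made up for the example.

```go
// A minimal sketch of recording an unreferenced file in BadgerDB.
package metadb

import (
	"encoding/binary"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

func openMetaDB(dir string) (*badger.DB, error) {
	return badger.Open(badger.DefaultOptions(dir))
}

func trackInBadger(db *badger.DB, gcsPath string, unreferencedAt time.Time) error {
	val := make([]byte, 8)
	binary.BigEndian.PutUint64(val, uint64(unreferencedAt.Unix()))
	return db.Update(func(txn *badger.Txn) error {
		// Key: the GCS object path; value: when it became unreferenced.
		return txn.Set([]byte("unref/"+gcsPath), val)
	})
}
```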

However, adding an external database into the mix made it harder to keep file entries consistent with the manifest. Specifically, we had to ensure two things: every file removed from the manifest has an entry in BadgerDB, and no file referenced in BadgerDB is still present in the manifest. So, to make sure we never lose a file, it was added to BadgerDB before its manifest entry was removed. However, systems can crash, and we could run into situations where an entry was added to BadgerDB but the file was never removed from the manifest. To guard against this, whenever we picked a file for eventual deletion from BadgerDB, we checked whether it still had an entry in the manifest. If it did, we removed the BadgerDB entry; because the process is self-healing, the next time it ran, the file would get moved over correctly.
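
A sketch of that self-healing pass, with the manifest lookup and deletion helpers stubbed out as function parameters (inManifest, deleteFromGCS, and dropEntry are hypothetical names):

```go
// A sketch of the consistency check described above: before deleting a file
// that BadgerDB says is unreferenced, we re-check the manifest.
package reaper

func reap(tracked []string, inManifest func(path string) bool,
	deleteFromGCS func(path string) error, dropEntry func(path string) error) error {
	for _, path := range tracked {
		if inManifest(path) {
			// A crash left the file both live and tracked; the live file wins.
			// Drop the BadgerDB entry and let the next compaction re-add it.
			if err := dropEntry(path); err != nil {
				return err
			}
			continue
		}
		if err := deleteFromGCS(path); err != nil {
			return err
		}
		if err := dropEntry(path); err != nil {
			return err
		}
	}
	return nil
}
```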

After rolling this change out over a couple of months, we can now collect 100% of our unreferenced files, which amounted to almost 1.5 petabytes per month when this feature launched (and is increasing rapidly). Because these savings compound, we now save close to $30,000 per month in GCS storage costs that we would otherwise have incurred.

Future work

Introducing BadgerDB into our stack has now opened up the opportunity to eliminate file-based manifests and move to a key-value store for all metadata. This would have a huge impact on simplifying our codebase because we’d get many features around crash recovery and fast updates out of the box. It would also reduce the high CPU costs we incur for serializing and deserializing the manifest file whenever we add or remove entries from it. This is something that our infrastructure team is likely to work on over the next few months. If these kinds of technical challenges sound interesting, please reach out to us!
