No one, absolutely no living human being could have predicted this.
Thanks to the incredible success of AI, more and more organizations are implementing machine learning in their pipelines. As data access and collection increase, massive datasets are being used to train giant deep learning models that achieve superhuman performance. This has led to a lot of hype around areas such as data science and big data, fueled even more by the recent boom in large language models.
Big tech companies (and deep learning experts on Twitter/YouTube) have really fallen in love with the “add more data, increase model size, train for months” approach that has become the status quo in machine learning. However, heretics at Meta AI have published research (funded by Satan) showing that this way of doing things is extremely inefficient. And completely unnecessary. In this post, I will review their paper, Beyond Neural Scaling Laws: Beating Power-Law Scaling Through Data Pruning, where they share “evidence” on how smart sample selection can improve your model’s performance without inflating your costs out of control. Although the paper focuses on computer vision, the principles of their research will interest you regardless of your specialization.
Such vastly greater scaling would mean that we could go from 3% to 2% error by adding just a few carefully chosen training examples, rather than collecting 10 times as many random examples.
We’ll cover why adding more random samples is inefficient and the protocol Meta AI researchers have developed to help you choose the most useful data samples to add to your pipelines. Sounds like pagan witchcraft? Read on and find out.
Here we focus on the scaling of error with dataset size and show how, in theory and in practice, we can go beyond power-law scaling and reduce it to exponential scaling instead.
When it comes to small datasets, “adding more data” is one of the go-to ideas you’ll come across. And it works. Unfortunately, people apply this to larger datasets/projects without thinking. As you keep adding new samples to your training data, the expected gain from each additional sample decreases. Very quickly. Especially when your training data is on the order of millions of samples (large research projects use petabytes of data to gain 1-2% performance).
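To put a rough number on those diminishing returns, here is a small sketch. It assumes the error follows a pure power law in dataset size, error ∝ N^(-nu), and backs out the exponent from the “3% to 2% takes 10 times as many random examples” figure quoted above. The function names are mine and purely illustrative, and real learning curves are messier than this.

```python
import math

def power_law_exponent(err_before, err_after, data_multiplier):
    """Exponent nu such that error ~ N**(-nu), given one observed improvement."""
    return math.log(err_before / err_after) / math.log(data_multiplier)

def data_multiplier_needed(err_before, err_after, nu):
    """How many times more random data is needed to move between two error levels."""
    return (err_before / err_after) ** (1.0 / nu)

# Assumption: going from 3% to 2% error costs 10x more random data.
nu = power_law_exponent(0.03, 0.02, 10)

# Under the same (assumed) power law, the next step from 2% to 1% costs even more.
next_step = data_multiplier_needed(0.02, 0.01, nu)

print(f"implied exponent nu = {nu:.3f}")
print(f"2% -> 1% needs about {next_step:.0f}x more random data")
```

Under these assumptions the exponent is small (below 0.2), so halving the error from 2% to 1% would take roughly 50 times more randomly chosen data. That is the slow scaling the paper is trying to beat.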
Why does this happen? Think about how gradient descent works. Your model parameters are updated based on the calculated error. If your error isn’t very high, you won’t see big changes in your model’s weights. That’s logical. If you’re almost right, you don’t want your model to change much.
Classical randomly selected data results in slow power-law error scaling because each additional training example provides less new information about the correct decision boundary than the previous example.
In such a case, adding more standard samples to your training won’t do much. Your model will predict something very close to the correct answer, and the resulting changes to your parameters are negligible. You’ve basically just wasted computation. If you’ve already trained on billions of samples (many of those 99% accuracy projects have), then any given sample won’t add much new information to your overall training. This logic also holds for architectures such as trees, which do not compute gradients but still learn from some other kind of error signal.
Instead, you can save a lot of time, electricity, and compute by choosing only the samples that will add information to your overall training. If you already have a good amount of data, now would be a good time to implement some chaotic data augmentation/inject model noise to make your systems more robust. I’ve touched on these several times in my content, so check out my other articles/videos for more info.
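The gradient descent argument is easy to see in a toy setting. The sketch below assumes a one-parameter model y_hat = w * x trained with squared error. The function name, learning rate, and sample values are all made up for illustration.

```python
def sgd_step_size(w, x, y, lr=0.1):
    """Size of one SGD update for the model y_hat = w * x with squared error loss."""
    error = w * x - y          # prediction error on this sample
    gradient = error * x       # d/dw of 0.5 * error**2
    return abs(lr * gradient)

w = 2.0  # current weight of the toy model

# A sample the model already predicts almost perfectly (a "standard" sample).
easy_update = sgd_step_size(w, x=1.0, y=2.1)

# A sample the model gets badly wrong (an informative sample).
hard_update = sgd_step_size(w, x=1.0, y=5.0)

print(f"update from easy sample: {easy_update:.3f}")
print(f"update from hard sample: {hard_update:.3f}")
```

The update is proportional to the error, so the near-perfect sample barely moves the weight while the badly predicted one moves it thirty times further. You pay the same compute for both.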
So now that we’ve covered the need for a pruning/filtering protocol that picks the samples that add the most information, let’s see how you can choose those samples. The approach the researchers at Meta came up with is different from the system I’ve built and used a lot, but it still gets great results. I will try to compare it with my own approach soon. For now, let’s review their system.
So what kind of procedure did Meta AI develop to select the best samples? Very sophisticated state-of-the-art deep learning calculations? No. It turns out that this research team felt rebellious. Their solution is simple, cost-effective, and applicable to projects of many sizes and capacities.
Their idea is relatively simple. Cluster the samples. A given sample is then easy or hard depending on its distance from the nearest centroid. Quite elegant. In their words:
To compute a self-supervised pruning metric for ImageNet, we perform k-means clustering in the embedding space of a pre-trained ImageNet self-supervised model (here: SwAV), and define the difficulty of each data point by the distance to its nearest cluster centroid, or prototype. Thus the easy (hard) examples are the most (least) prototypical.
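Here is a minimal sketch of that metric, so you can see how little machinery it needs. This is not the authors’ code: the paper clusters SwAV embeddings of ImageNet images, while this toy uses hand-written 2-D points as stand-in “embeddings” and a bare-bones k-means. All function names are my own.

```python
import math
import random

def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest_centroid(p, centroids)
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

def difficulty_scores(points, centroids):
    """Difficulty of each point = distance to its nearest centroid (prototype)."""
    return [nearest_centroid(p, centroids)[1] for p in points]

# Toy 2-D "embeddings": two tight blobs plus one atypical point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
              (2.5, 0.0)]

centroids = kmeans(embeddings, k=2)
scores = difficulty_scores(embeddings, centroids)
hardest = max(range(len(embeddings)), key=scores.__getitem__)
print("hardest example:", embeddings[hardest])
```

The points inside the blobs sit close to a prototype and score as easy, while the stray point between them gets the highest difficulty. No labels are involved at any step, which is the whole appeal.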
I really like this solution. It makes intuitive sense, is not very expensive to use, and does not need explicit labels (a huge plus). Despite this, it holds up very well compared to other, more expensive protocols used to eliminate uninformative samples. It’s exciting and certainly deserves further investigation. Incorporating self-supervision into data quality checks is something I’ve talked about before, and this is a clear indication that the approach has a lot of potential for the future.
However, that’s not what struck me the most. When it comes to clustering, picking the right number of clusters is usually very important. Yet this approach is very robust to the choice of k. Truly exceptional stuff.
This means the researchers did not have to spend a lot of resources setting up the experiments, which I appreciate. Too many papers are impractical because of how much effort goes into getting perfect conditions. This lets them follow through on the promise made at the beginning of the paper and actually come up with a practical and innovative solution.
This paper has a few other interesting aspects to explore. For example, they describe different pruning strategies for small and large datasets (keep the easy and the hard samples, respectively). Their analysis of the different classes and how they cluster together was also very interesting. Another highlight for me was their approach to forcing alignment between (unsupervised) clusters and (supervised) classes. They took all the samples in a class and averaged their embeddings. This approach beats all the rest and has interesting implications for latent space encoding that I’ll cover another time. If you’re interested, be sure to connect with me so you don’t miss a thing. All my relevant links are at the end of this article.
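For the curious, the two ideas from the paper mentioned above (keep easy samples when data is scarce and hard ones when it is plentiful, and build one prototype per class by averaging embeddings) can be sketched in a few lines. This is an illustration under my own assumptions, not the authors’ implementation. The function names, the `keep_fraction` parameter, and the toy numbers are hypothetical.

```python
def prune(difficulty, keep_fraction, keep="hard"):
    """Return sorted indices of the examples to keep.

    keep="hard" keeps the highest-difficulty examples (large datasets);
    keep="easy" keeps the lowest-difficulty ones (small datasets).
    """
    n_keep = max(1, round(len(difficulty) * keep_fraction))
    order = sorted(range(len(difficulty)), key=difficulty.__getitem__,
                   reverse=(keep == "hard"))
    return sorted(order[:n_keep])

def class_prototypes(embeddings, labels):
    """Average the embeddings of each class to get one prototype per class."""
    grouped = {}
    for vector, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
            for label, vectors in grouped.items()}

# Hypothetical difficulty scores for six examples.
difficulty = [0.1, 0.9, 0.4, 0.7, 0.2, 0.5]
kept_hard = prune(difficulty, keep_fraction=0.5, keep="hard")
kept_easy = prune(difficulty, keep_fraction=0.5, keep="easy")
print("keep hard:", kept_hard)
print("keep easy:", kept_easy)

# Hypothetical 2-D embeddings with class labels.
embeddings = [(0.0, 0.0), (2.0, 0.0), (4.0, 4.0), (6.0, 4.0)]
labels = ["cat", "cat", "dog", "dog"]
prototypes = class_prototypes(embeddings, labels)
print(prototypes)
```

With class prototypes in hand, the difficulty of a labeled sample could be taken as its distance to its own class prototype instead of to the nearest k-means centroid. Treat that last step as my reading of the idea, not a spec.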
Do not hesitate to contact me if you also have interesting jobs/projects/ideas for me. Always happy to hear from you.
Use the links below to check out my other content, learn more about tutoring, or just say hello. Check out the free Robinhood referral link. We both get free stock (you don’t have to put any money down), and there’s no risk to you. So not using it is just wasting free money.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Contact me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
If you are preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/
Get free stock on Robinhood: https://join.robinhood.com/fnud75