Dr. Jennifer Prendki is the founder and CEO of Alectio, the first startup focused on DataPrepOps, a portmanteau term that she coined to refer to the nascent field focused on automating the optimization of a training dataset. She and her team are on a fundamental mission to help ML teams build models with less data (leading to both the reduction of ML operations costs and CO2 emissions) and have developed technology that dynamically selects and tunes a dataset that facilitates the training process of a specific ML model. Prior to Alectio, Jennifer was the VP of Machine Learning at Figure Eight; she also built an entire ML function from scratch at Atlassian, and led multiple Data Science projects on the Search team at Walmart Labs. She is recognized as one of the top industry experts on Data Preparation, Active Learning and ML lifecycle management, and is an accomplished speaker who enjoys addressing both technical and non-technical audiences.
Q1: Please tell us about your journey in AI so far.
Dr. Jennifer: The story of how I got into AI is a bit unusual: I actually started my career as a physicist. Understanding how the universe works was my childhood dream. But while I fulfilled that dream, the Great Recession, which started when I graduated with a PhD in Particle Physics, left few opportunities for fundamental researchers like myself to find funding, and that's how fate led me to pivot into a career in industry as a data scientist instead. Still, people are often intrigued by that career shift (which to me isn't really one, since the skills I use working with data are so similar to the ones I used in Physics research), and wonder how I ended up in the ML space from there. As far as I am concerned, there are lots of similarities between the motivations that brought me to Physics and the reasons why I enjoy ML research. Physicists are usually motivated by a deep interest in understanding how things work (how the universe came to exist, how planets revolve around the sun, or how electricity is generated), and data scientists share this fascination for modeling the systems around them, though often at a more modest scale. Granted, identifying emotions from a human face in a picture isn't as grandiose as explaining the Big Bang, but it is still extremely satisfying to figure out which exact facial features are responsible for making someone look happy, sad or angry.
Q2: Tell us about your venture Alectio and the ML technology behind it.
Dr. Jennifer: The story behind Alectio starts long before the day I incorporated the company in January 2019, and it trickles down from my frustration as a Data Science manager, as well as my legendary aversion to inefficiency in general. Back in my days at Walmart Labs, when I first led a group of ambitious data scientists, I quickly realized that my job wasn't so much about guiding my team toward choosing the right models or helping them develop new algorithms, as I had originally imagined: instead, most of my time was spent begging management to grant us additional resources to get our models trained or our data annotated. And back then, every victory was short-lived: it felt like the very moment I received that long-anticipated approval email informing me we'd been granted a 10% increase in budget, I would almost instantly receive another email from a team member stating that the additional budget wouldn't cut the mustard if we wanted to meet the Black Friday deadline. I started cursing Big Data for making my life miserable, until one day it hit me: maybe we should stop thinking of Big Data as a sine qua non condition for the success of ML projects. Even if the entire industry seemed to believe that Big Data was key to building better models, could it be that the need for large volumes of data was just a myth after all? I started researching techniques capable of reducing the amount of necessary training data and experimenting with them, and quickly concluded that working with large volumes of data was what data scientists defaulted to because they really had no idea how to strategically sample their datasets and identify which records benefited the training process of their models. In other terms, relying on Big Data was the easy thing to do back when we didn't know any better.
That’s not to say that large datasets don’t benefit ML (they absolutely do!), but they do so only because collecting more data is the path of least resistance to obtain a large enough variance, and to cover most corner cases. But in a world where Big Data was becoming a challenge (for cost or time reasons) rather than an ally, it was time to become smarter about the data we were using.
This is how I started evangelizing the concept of data curation, promoting the use of Active Learning (a training paradigm where a model is trained incrementally on data that's strategically selected from a raw dataset) in the industry, and advocating for "Machine Teaching", where Machine Learning is adopted to support and improve the training process of another ML model. Alectio's technology is founded on the idea that most datasets contain a large fraction of redundant or irrelevant information which could be flushed out without negatively impacting the performance of the model, and it leverages semi-supervised Machine Learning and Reinforcement Learning to establish a framework that separates useful from useless data. Today, companies rely on this technology not only to reduce the operational costs of ML development, but also to tune their data collection process and even for data explainability.
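To make the Active Learning loop described above concrete, here is a minimal, self-contained sketch. It is not Alectio's actual technology: the synthetic two-blob dataset, the uncertainty-sampling criterion, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical raw pool: two Gaussian blobs standing in for an unlabeled dataset.
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

def active_learning_loop(X, y, batch_size=20, rounds=5):
    """Train incrementally, each round 'labeling' only the most uncertain points."""
    # Small stratified seed set to bootstrap the first model.
    labeled = list(rng.choice(500, size=5, replace=False)) + \
              list(rng.choice(np.arange(500, 1000), size=5, replace=False))
    pool = [i for i in range(len(X)) if i not in labeled]
    model = LogisticRegression()
    for _ in range(rounds):
        model.fit(X[labeled], y[labeled])
        # Uncertainty sampling: select the pool points whose predicted
        # probability is closest to 0.5 (the model is least sure about them).
        probs = model.predict_proba(X[pool])[:, 1]
        picked = np.argsort(np.abs(probs - 0.5))[:batch_size]
        for i in sorted(picked, reverse=True):
            labeled.append(pool.pop(i))
    model.fit(X[labeled], y[labeled])
    return model, labeled

model, labeled = active_learning_loop(X, y)
print(f"Trained on only {len(labeled)} of {len(X)} points")
```

Each round, the model only "pays" for labels on the points it is least sure about, so the final model sees 110 strategically chosen points instead of all 1,000, which is exactly the kind of saving on annotation and training costs described above.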
Q3: What is DataPrepOps and how is Alectio helping companies approach data collection and data preparation?
Dr. Jennifer: To better understand what DataPrepOps is, it might be a good idea to analyze how MLOps came to maturation over the past few years. Practical ML applications have flourished in the past 10 years, mostly because the hardware required to train ML models and collect sufficient training data had finally caught up with the advances made in ML during the previous couple of decades. And yet, until recently, in spite of the large amounts of money deployed on ML initiatives, most projects failed to launch because the people capable of building ML models were not trained to deploy them to production. Instead, it was up to DevOps teams to take on the challenge of scaling, monitoring and managing these models so that they could actually benefit the end user and lead to ROI. And soon, as the industry started establishing expertise and best practices for model deployment and management, entire companies dedicated themselves to building tools to make that process easier, and even attempt to fully automate it: MLOps was born.
Yet, MLOps fails to assist ML experts in what should be the very first step of building a Machine Learning model: constructing and optimizing a training dataset. Even in 2022, data preparation is still done almost entirely by hand, which causes "data prep" to be perceived as a boring, low-tech activity and gives it incredibly bad press among data scientists. That is a shocking thought when you consider that preparing the data should in fact be the main center of interest of anyone working with Machine Learning. Making data a first-class citizen of the ML process is the very mission of the Data-Centric AI movement. DataPrepOps takes the idea one step further and addresses the reason why data prep is so unpopular with data scientists by converting it into a high-tech discipline requiring mathematical models and engineering expertise. Just like MLOps changed the game by enabling ML scientists to deploy models with no prior DevOps expertise, DataPrepOps leverages the most recent advances in ML to make data prep less frustrating, less expensive and overall simpler. It essentially attempts to change the perception of a field traditionally viewed as tedious and 'uncool', turn it into a technical field in its own right, and encourage more researchers to concentrate their efforts on technology-driven data preparation.
Q4: What are some of the biggest challenges in big data, and how can they be solved?
Dr. Jennifer: I believe it would be more accurate to say that it is high time we recognized that Big Data is, in itself, the challenge! The concept of Big Data was born when companies went into a frenzy to collect every single bit of data, at any cost, enabled by the fact that technology had finally made that possible for them after decades of frustration. For the longest time in the history of Machine Learning, researchers struggled to make concrete advances precisely because the hardware available to them did not allow them to collect large enough datasets to train their models on, and Big Data became the welcome antidote to the problem. What this has led to is an exploding number of data warehouses which today are mostly filled with "ROT" (Redundant, Outdated and Trivial) data, benefiting no one but those making money on data storage. Fundamentally, Big Data should be nothing more than a tool for data analysts and ML experts to drive decisions, build models and solve business problems. But the day the industry started treating Big Data as a field in its own right marked the beginning of what could be called the "Data Hoarding Era": the time when the Machine Learning field went from data-starved to data-drowned, and developing ML models suddenly became cost-prohibitive. People often don't realize it, but Big Data actually caused ML to become more exclusive, reserved for the few companies able to afford the preparation and storage of data at scale, as well as the associated compute costs. And that still doesn't account for the environmental challenges posed by Big Data: with ever more data generated and collected comes a growing need for data warehouses, and for the energy to power them.
Without a conscious community effort to fight the "Big Data Lobby", it is only a matter of time before only the largest corporations can afford to train a Machine Learning model, and the advent of super-large, extremely data-greedy models (such as GPT-4, which is coming out soon) is definitely not helping. This is why smart data curation is such an important part of the future of Machine Learning.
Q5: Multiple industries are seeing a rising importance of Big Data and AI. How do you see these emerging technologies impacting Data Privacy?
Dr. Jennifer: Any time a company decides it's okay to collect and store its users' data (especially without being fully transparent about it), that is an issue in terms of Data Privacy, regardless of the volume collected. Sadly, both because users were only made aware of the poor practices adopted by companies worldwide years after arbitrary data collection became the norm, and because the amount of data collected was originally too low to raise the alarm, it is only recently that the world started demanding action and transparency. Is it too late to fully address the problem in a satisfactory way and defend the rights of consumers? Only time will tell. I am still hopeful, though, that as the Machine Learning field keeps maturing, new techniques based on technologies like Federated Learning (a process that allows models to be trained on distributed, decentralized data stored at the point of collection) will allow users to maintain ownership of their data while allowing ML scientists to leverage that same data for training. At Alectio, we actually identified the protection of data ownership as a key value proposition, and have successfully built on top of standard Active Learning in order to select useful data for our users while allowing them to retain their data on their own systems, without the need to export it to the Cloud. So building Privacy-by-Design systems is certainly achievable, including in the MLOps space, which usually raises concerns in that regard. And I certainly hope to see more companies follow our lead and create solutions that enable a more ethical use of users' data in the next couple of years.
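As a rough illustration of the Federated Learning idea mentioned above, here is a toy sketch under simplifying assumptions (it is not Alectio's system): three simulated clients each hold private data that never leaves them, and a server runs plain Federated Averaging, so only model weights are ever shared.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 3 clients, each holding private linear-regression data.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    Xc = rng.normal(size=(100, 2))
    yc = Xc @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((Xc, yc))

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's gradient-descent pass on its own data (the data stays local)."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_averaging(clients, rounds=20):
    """Server broadcasts weights, then averages the locally trained copies."""
    w = np.zeros(2)
    for _ in range(rounds):
        local_ws = [local_update(w, X, y) for X, y in clients]
        w = np.mean(local_ws, axis=0)  # aggregation: only weights cross the wire
    return w

w = federated_averaging(clients)
print(w)  # the averaged weights should land close to true_w = [2, -1]
```

The design point is that the raw data never leaves each client's machine: the server learns a good global model while each user retains ownership of their records, which is the property described above.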
Q6: What would your advice be to budding machine learning and data science candidates? Is your company currently hiring?
Dr. Jennifer: Machine Learning is undoubtedly one of the most exciting fields in Technology nowadays, and it is only natural that so many young people are interested in building careers as data scientists. Unfortunately, after having been dubbed the Sexiest Job of the Century, Data Science is also attracting many people for the wrong reasons. The reality is that, just like any other career, there are some deeply frustrating aspects to a career in Data Science. People routinely believe that all data scientists are ridiculously well paid, enjoy a great work-life balance, and work on the coolest problems, but the reality is somewhat different. For example, as relatively new fields, Machine Learning and Data Science often suffer from a general misunderstanding on the part of people in decision-making positions, which leads to many bosses having unrealistic expectations; almost anyone in the field has a story or two to tell about being asked to build models with virtually no data to work with, or with no data pipelines to collect training data. Also, not every problem out there is necessarily glamorous, and many data scientists spend their time working on detecting credit fraud or predicting when inventory is going to run out, as opposed to developing DeepFake technology. My advice to aspiring ML scientists is to give themselves time to figure out whether this is really the right career for them by taking internships and working on real-life ML problems. It's also important for them to understand that "toy problems", like the ones people work on in school or during Kaggle competitions, don't give a reliable idea of what the life of a data scientist might be like. Being a data scientist can be cool, but it's just not for everyone, and there are many alternative careers in the data space which can be just as rewarding.
As for Alectio, we're almost continuously hiring ML scientists interested in a different kind of challenge. As I often like to say, we technically are the only "true" Data Science company, since our focus is on building a general framework to understand how data affects the learning process of a Machine Learning model, and how models actually learn. Besides, we're one of very few companies using Machine Learning to facilitate Machine Learning (our ML-driven data prep algorithms are effectively ML models controlling other ML models!); in a sense, we're in fact a Meta Machine Learning company. So if you're really interested in fundamental ML research and in the next generation of ML algorithms, reach out to us and let's talk!
Q7: Can you name some AI / data science resources that have influenced your thoughts the most?
Dr. Jennifer: Data scientists are spoiled with an incredible diversity of amazing content out there for them to enjoy and learn from. Regardless of your level of expertise and the specific skills you're looking to improve, there definitely is a great Data Science blog somewhere meant just for people like you. And that is just perfect, because as a data scientist you will constantly need to stay up-to-date with the newest techniques and technology, and cannot afford to overlook the importance of continuous learning to your career. That being said, I personally have a soft spot for the Towards Data Science and BAIR blogs. I also strongly recommend that newcomers to the field read the ML / AI blogs from their favorite companies, which typically provide a more specialized view of the work of ML experts within those industries and can offer more tactical / less theoretical tips for the working ML scientist.