The Proactive Mindset for AI Development, or: How I Learned to Start Worrying So My AI Model Doesn’t Bomb

The first 80% is easy, the next 10% is hard, and the final 10% is almost impossible. That sentiment frequently pops up in AI engineering circles, and while it’s become a bit of a cliché, I’ve found it to hold true more often than not. Having spent my formative years researching AI safety and reliability at Cruise and Waymo, I can say with confidence: In many AI applications, that “impossible” last 10% of model accuracy is everything. In the autonomous vehicles space, it’s a matter of life or death. In the GenAI/LLM space, things are usually less dire on the surface, but when deployed at scale, an unsafe or unreliable model can still cause widespread disruption to potentially critical systems. At Upwork, reliability is key since we deal with real professionals who want to get the best work done across many categories. To serve those professionals, our AI systems like Uma, Upwork’s mindful AI, need to be flexible and robust across many different domains. So let’s get into the weeds and talk about how good AI technicians, and how we at Upwork, think about AI reliability.

Think proactively, not reactively
One of the main reasons the last 10% of a model’s performance can seem nearly impossible to solve lies in the fundamentals of AI design and development. Most AI models have gaps in their knowledge, as well as failure modes that make them vulnerable to weird or unexpected scenarios that abound in the real world. Maybe a 7-foot person dressed in a Santa suit runs across the front of your self-driving car, or someone starts talking to your customer service LLM in broken pig Latin. As a result, practitioners often find themselves reacting to problems as they arise, leading to a frustrating, ad hoc whack-a-mole process of patching real-world failures.
But what if we could proactively predict these issues before they occur?
To get to the crux of how that can be possible, it’s illustrative to ask how these knowledge gaps emerge in the first place. Most AI model development follows this workflow:

Practitioners collect a dataset, adjust the model until it fits the data (according to evaluation metrics), and then deploy. In reality, the process isn’t much different from the one described in the (in)famous XKCD comic above.
Unfortunately, simply training the model with the standard methods until it won’t train anymore leaves the model vulnerable to a key issue: It has likely overindexed on easy or straightforward examples. As many practitioners know, most standard training algorithms reward models equally for learning any particular example in the training data. As a result, a model trained naïvely will prioritize learning easy examples first, while harder examples are ignored. Some modifications to standard training, such as Focal Loss [1] and GradNorm [2], can help mitigate these issues, but they are far from a complete solution.
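To make the Focal Loss [1] idea concrete, here is a minimal sketch in plain Python. It compares standard cross-entropy with the focal variant, which scales the loss by (1 - p)^gamma so that confidently correct examples contribute far less. The two confidence values are made up for illustration, and alpha is set to 1.0 for simplicity rather than tuned.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy given the probability assigned to the correct class."""
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by alpha * (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)

# Hypothetical model confidences on the correct class (illustrative values only).
easy, hard = 0.95, 0.30

for name, p in [("easy", easy), ("hard", hard)]:
    print(f"{name}: cross-entropy={cross_entropy(p):.4f}  focal={focal_loss(p):.6f}")

# How much more the hard example weighs than the easy one under each loss.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
fl_ratio = focal_loss(hard) / focal_loss(easy)
print(f"hard/easy ratio: cross-entropy={ce_ratio:.1f}, focal={fl_ratio:.1f}")
```

With gamma set to 0 the focal term disappears and the function reduces to plain cross-entropy, which is a handy sanity check when wiring this into a real training loop.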
So, what happens next? The practitioner pushes the model to production, and suddenly, the hardest 10% of examples start appearing in the wild. Maybe a customer speaks an unusual dialect, or someone attempts to jailbreak the system in a creative way. Either way, the model is brittle in these scenarios and breaks. The practitioner then scrambles to patch these performance issues. These reactive fixes are often ad hoc, which means they fail to provide meaningful protection against the next set of unexpected failures. Great, the vision system correctly recognizes flamingos now, but 10 days later it starts misclassifying parakeets.
The better approach to handling these challenges is to proactively anchor the model development process around solving hard problems. Training should focus on identifying the most difficult challenges in real-world use cases and designing methods to safeguard model performance specifically in those scenarios. This begins with analyzing the model’s weaknesses, followed by using targeted data and training techniques to address them directly. Instead of the simple workflow described above, a more proactive approach modifies the data and training phases like this:

Each of those new transition arrows represents a large category of AI/ML techniques centered around quantifying and mitigating uncertainty and underperformance in the face of challenging data. But the fact that it takes transitioning through three of four categories to even get to a dataset should expose how important that dataset is within this worldview.
Note that this way of thinking is not new; ideas around mining specifically for hard examples predate the deep learning era, back to when SVMs and much simpler models were still the preferred approach [9].
Proactivity starts at the data layer
Data is the lifeblood of AI, and a bad dataset all but guarantees that a model will also be bad. What many people may not realize is that the best dataset is not just a random sampling of data from the real world, but one that intentionally oversamples strange, unexpected, or challenging examples. In the previous section, we described how a naïve training regimen will cause a model to overfit to easy examples. The first and most effective defense against this is crafting datasets that make it much harder for the model to overfit to those simpler samples.
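As a rough sketch of what intentional oversampling can look like, the snippet below draws a training set in which sampling weight grows with a per-example difficulty score. The function name, the `boost` parameter, and the toy pool are all hypothetical; where the difficulty scores come from is a separate question.

```python
import random
from collections import Counter

def oversample(examples, difficulty, num_samples, boost=4.0, seed=0):
    """Sample with replacement, weighting each example by 1 + boost * difficulty.

    `difficulty` holds one score in [0, 1] per example (assumed to be given).
    """
    if len(examples) != len(difficulty):
        raise ValueError("examples and difficulty must have the same length")
    weights = [1.0 + boost * d for d in difficulty]
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=num_samples)

# Hypothetical pool: 90% easy examples, 10% hard ones.
pool = ["easy"] * 90 + ["hard"] * 10
scores = [0.1] * 90 + [0.9] * 10

sampled = oversample(pool, scores, num_samples=10_000)
counts = Counter(sampled)
hard_share = counts["hard"] / len(sampled)
print(f"hard examples: 10% of the pool, {hard_share:.0%} of the sampled set")
```

Sampling with replacement keeps the sketch short; in practice one might instead cap repeats or collect new hard examples rather than duplicating existing ones.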
But first, how do we determine what counts as a hard example? This task is difficult because what an AI model finds challenging may not align with what a human finds challenging. In other words, understanding a model’s weaknesses requires looking directly at the model itself. While that may seem obvious, most publicly available “hard” datasets are curated based on human intuition, which limits their effectiveness. Instead, we need to identify a model’s “unknown unknowns” or the gaps in its knowledge that we can address.
In the age of GenAI, some might assume it’s enough to ask an LLM to explain what it doesn’t know. Unfortunately, we’re not yet at that Westworld level of AI reasoning. LLMs will almost certainly generate an answer that sounds convincing but is inaccurate. Instead, we need to look for mathematically grounded signals within the inner workings of these models.
One potentially powerful signal of unknown unknowns is the model gradient, a measure of how each training data point influences the model’s adjustments. In the language of the XKCD comic we’ve perhaps already referenced way too much in this post, the gradient represents the exact direction in which the pile is stirred. In previous work at Waymo Research [3], we demonstrated that tracking gradient signals during training can help identify examples that produce unusual gradient vectors that seem at odds with most other observed gradients, which then reveals areas where the model struggles.

When using gradient signals to separate hard and easy examples during training, you get the results above for an image classifier. The top half displays easy exemplars for the classes “black swan,” “volcano,” and “refrigerator,” while the bottom half shows hard exemplars for the same classes.
One interesting takeaway from this visualization is that, for a neural network identifying volcanoes, erupting volcanoes are actually more difficult to recognize than dormant ones against clear weather. If you asked a human to provide an ideal example of a volcano, many would likely picture a fiery red eruption. However, because the dataset contains many dormant volcanoes, the model learns the opposite. This is a clear example of when human intuition fails to align with what a machine learning system finds difficult. To properly improve such a system, it is essential to have robust methods for probing and understanding its deficiencies.
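To give a feel for the gradient signal described above, here is a deliberately simplified sketch. It is not the method from [3]; it only illustrates the underlying intuition on a toy logistic regression, where each example’s gradient is compared (by cosine similarity) against the average gradient, and examples that point the other way are surfaced as candidates for “hard.” The weights and data points are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_example_gradient(w, x, y):
    """Gradient of the logistic loss for one example: (sigmoid(w.x) - y) * x."""
    err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
    return [err * xi for xi in x]

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def rank_by_gradient_agreement(w, data):
    """Return example indices sorted from least to most aligned with the mean gradient."""
    grads = [per_example_gradient(w, x, y) for x, y in data]
    mean_grad = [sum(col) / len(grads) for col in zip(*grads)]
    scores = [cosine(g, mean_grad) for g in grads]
    ranking = sorted(range(len(data)), key=lambda i: scores[i])
    return ranking, scores

# Hypothetical mid-training weights and a toy dataset of (features, label) pairs.
w = [0.5, 0.5]
data = [
    ([1.0, 1.0], 1), ([1.2, 0.8], 1), ([0.9, 1.1], 1),   # typical positives
    ([-1.0, -1.0], 0), ([-0.8, -1.2], 0),                # typical negatives
    ([1.5, 0.5], 0),                                     # atypical: looks positive, labeled negative
]

ranking, scores = rank_by_gradient_agreement(w, data)
for i in ranking:
    print(f"example {i}: cosine with mean gradient = {scores[i]:+.3f}")
```

In a deep network the same comparison would typically be run on gradients from a chosen layer rather than the full parameter vector, but that choice is outside the scope of this sketch.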
At Upwork, we primarily focus on text-based models like LLMs, as our business relies on verbal and textual interactions with customers. Unlike image-based models, text-based models are notoriously challenging to analyze for uncertainty or task difficulty because LLMs conventionally output only the next token in a sentence. When the next token is a common word like “the” or “an,” interpreting conventional uncertainty metrics becomes particularly difficult.
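For reference, one conventional per-token signal is the entropy of the model’s next-token distribution. The sketch below computes it for two made-up distributions; it illustrates the baseline metric only and is not the method we are developing.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sequence_entropies(step_distributions):
    """Entropy at each generation step, given one distribution per step."""
    return [token_entropy(dist) for dist in step_distributions]

# Hypothetical next-token distributions over a four-token vocabulary.
steps = [
    [0.97, 0.01, 0.01, 0.01],   # model is nearly certain about the next token
    [0.25, 0.25, 0.25, 0.25],   # model is maximally unsure
]

entropies = sequence_entropies(steps)
for i, h in enumerate(entropies):
    print(f"step {i}: entropy = {h:.3f} nats")
```

The limitation is visible even here: the number says how spread out the next-token guess is, not whether the model understands the concept being discussed.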
Fortunately, we are hard at work developing new algorithms and showing that they can quantify uncertainty in the LLM setting while avoiding the weird artifacts that can result from overindexing on next-token prediction. Here is some sample output from one of our preliminary candidate methods:

High uncertainty words or tokens are highlighted in yellow, orange, and red, in order of increasing uncertainty. There is a noticeably strong correlation between complex concepts and higher uncertainty, and we are excited to continue refining this method to use as a quick, real-time feedback tool for identifying when our production LLMs may struggle or encounter unfamiliar situations. Analyzing these instances helps us better understand our models’ blind spots, and this discovery is the first fundamental step toward addressing these gaps.
(Once we are done with our explorations with these methods, you can expect a more detailed paper/technical report.)
Turning insights into datasets
Now, we have a robust set of tools to understand the “unknown unknowns” of our models. For instance, one LLM may struggle with precise budget calculations, while another may generate overly long responses. The next step is to refine those insights into a dataset that can be fed into a model training loop to improve performance.
There are several ways to approach this, such as using the uncertainty analysis above to flag certain live conversations as valuable and adding them to a new training set. Or maybe we can search public datasets for textual data that exhibit specific behavioral patterns. At Upwork, we are exploring both of those methods with great interest. However, there is also an oft-overlooked but still gold-standard approach: just label more data.
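The first option above, flagging live conversations by uncertainty, could look something like the following sketch. The `Conversation` class, the scoring rule (mean of the most uncertain 20% of tokens), and the 0.6 threshold are all assumptions made for illustration, not a description of our production pipeline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class Conversation:
    conversation_id: str
    token_uncertainties: List[float]  # one score in [0, 1] per generated token

def uncertainty_score(conv: Conversation, top_fraction: float = 0.2) -> float:
    """Mean of the most uncertain tokens, so filler words do not dilute the signal."""
    if not conv.token_uncertainties:
        return 0.0
    k = max(1, int(len(conv.token_uncertainties) * top_fraction))
    top = sorted(conv.token_uncertainties, reverse=True)[:k]
    return mean(top)

def flag_for_training(conversations, threshold: float = 0.6):
    """Return IDs of conversations whose score meets the (hypothetical) threshold."""
    return [c.conversation_id for c in conversations
            if uncertainty_score(c) >= threshold]

# Made-up conversations for illustration.
conversations = [
    Conversation("a", [0.1, 0.05, 0.1, 0.2, 0.1]),
    Conversation("b", [0.1, 0.9, 0.8, 0.1, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1]),
]

flagged = flag_for_training(conversations)
print("flagged for review:", flagged)
```

Scoring on the most uncertain tokens rather than the overall average is one way to keep a handful of genuinely difficult moments from being washed out by long stretches of routine text.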
As the world’s largest work marketplace, Upwork has a clear advantage in defining dataset parameters and obtaining a high-quality dataset written by domain experts on our platform. We’ve used independent talent from many work categories across Upwork to label novel datasets that help evaluate complex models and build out reinforcement learning from human feedback (RLHF) pipelines. One of the more interesting use cases has been collecting longer conversational interactions to enhance customer engagement.
Early in Upwork’s model development cycle, we identified goal-oriented long conversations as a key weakness in off-the-shelf models. In the language of this post, long conversations were a “known unknown,” and there was a shortage of datasets with the right level of complexity and articulation to infuse that extra layer of humanity into our flagship LLM, Uma. One of our first tasks was to leverage our platform to collect a large volume of data, like the example below.

This data gave us a key advantage in building our models, allowing us to quickly train systems that surpassed the state-of-the-art in the use cases that mattered most to our business. It also enabled us to make Uma more conversational, interactive, and personable: essential factors in building trust in customer interactions with AI. Ultimately, Upwork’s platform for data collection remains one of our biggest advantages, helping us push our AI beyond performance barriers to benefit our end customers.
(You can read more about those results in our Scaling AI Models blog post.)
Training the model
The final step, once we have a dataset full of valuable but difficult examples, is to train a model that makes appropriate use of that data. Training on these unusually distributed datasets can seem daunting at first, but there are many algorithms and approaches to tackle this challenge. We could easily write 10 more blog posts covering the many strategies available to experienced practitioners. At Upwork, we are actively developing novel approaches that fit this category as well, and we expect to share more about those approaches in future posts.
But there are plenty of examples of such approaches in the existing literature. Dynamic weighting techniques such as Focal Loss [1] and its variants [4] allow models to focus more on harder examples. Multitask learning methods, like GradNorm [2] and PCGrad [5], enable isolating the difficult parts of a dataset and assigning different weights to them. Algorithms like LwF [6] and its improvements (such as PROFIT [7]) provide some relief from catastrophic forgetting (where neural networks rapidly forget old data when trained on new data), which enables more flexible and iterative training schedules. In general, continual learning and lifelong learning methods [8] provide flexible frameworks to train a model on an iterative sequence of datasets, which allows for more complex strategies to properly introduce challenging data into an AI model. These are just a few starting points.
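As one small example from that list, the core step of PCGrad [5] is easy to sketch: when two gradients conflict (negative dot product), one is projected onto the plane orthogonal to the other so that the update no longer undoes the other objective. The snippet below shows only that pairwise projection on toy vectors; the full algorithm applies it across all task pairs in random order, and treating “easy” and “hard” subsets as separate tasks is our framing for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pcgrad_pair(g_i, g_j):
    """Project g_i onto the normal plane of g_j if the two gradients conflict.

    Conflict means dot(g_i, g_j) < 0; otherwise g_i is returned unchanged.
    """
    d = dot(g_i, g_j)
    if d >= 0:
        return list(g_i)
    scale = d / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]

# Toy gradients (illustrative values): one from easy data, one from hard data.
g_easy = [1.0, 0.0]
g_hard = [-1.0, 1.0]

projected = pcgrad_pair(g_hard, g_easy)
print("original hard gradient: ", g_hard)
print("projected hard gradient:", projected)
print("dot with easy gradient after projection:", dot(projected, g_easy))
```

After projection, the hard-example gradient keeps its component that is useful for the hard data while no longer pushing directly against the easy-example gradient.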
But instead of waxing poetic about the latest research on model building with hard datasets, I’ll put it simply: Get your dataset right, and the rest will usually fall into place. Among AI researchers, dataset design and collection are often dismissed as tedious or unexciting tasks. But for AI practitioners, they make up eight of the nine innings of the game. A great dataset doesn’t guarantee success, but a poor one almost always guarantees failure. Spend time identifying dataset gaps (unknown unknowns → known unknowns) and carefully executing a dataset collection strategy (known unknowns → hard dataset). That’s the path to building a genuinely robust AI model.
Conclusions
The moral of the story is that if you want to truly evaluate the performance of AI models, don’t focus on where they succeed but focus rather on where they fail. To build AI models that work reliably at scale, the “AI Model Building 101” version of the pipeline is unfortunately far from sufficient. Many of the methods newer practitioners learn for training AI models generally produce models that fall short of the quality bar because they take greedy, naïve approaches to handling edge cases and challenging scenarios.
So, since addressing edge cases is a fundamental challenge often overlooked by traditional training methods, the only way to tackle them properly is to rethink model training and anchor one’s training efforts on edge case mitigation. A large part of this involves crafting the right dataset. From the start, we must identify gaps in our models and move to strategically close them.
At Upwork, this robustness mindset shapes our AI development strategy. We know trust in AI is fragile and must be earned, and that our customers deserve an AI experience that is not only useful but also built on a technology stack designed for safety and reliability. As long as we stay true to those principles, Uma’s potential is limitless, and Uma should continue to deliver better work outcomes for the businesses and freelancers that rely on Upwork every day.
References
[1] Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE international conference on computer vision. 2017.
[2] Chen, Zhao, et al. "GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks." International conference on machine learning. PMLR, 2018.
[3] Chen, Zhao, et al. "GradTail: Learning long-tailed data using gradient-based sample weighting." arXiv preprint arXiv:2201.05938 (2022).
[4] Li, Xiang, et al. "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection." Advances in neural information processing systems 33 (2020): 21002-21012.
[5] Yu, Tianhe, et al. "Gradient surgery for multi-task learning." Advances in neural information processing systems 33 (2020): 5824-5836.
[6] Li, Zhizhong, and Derek Hoiem. "Learning without forgetting." IEEE transactions on pattern analysis and machine intelligence 40.12 (2017): 2935-2947.
[7] Chakravarthy, Anirudh S., et al. "PROFIT: A PROximal FIne Tuning Optimizer for Multi-Task Learning." arXiv preprint arXiv:2412.01930 (2024).
[8] Wang, Liyuan, et al. "A comprehensive survey of continual learning: Theory, method and application." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[9] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05). Vol. 1. IEEE, 2005.












