As data scientists, we have all heard (if not experienced) documentation horror stories: tales of what can go wrong when we don't take the time to thoroughly document our models. As part of my work on improving documentation processes at my company, Verta, I've collected (and anonymized) a "hall of shame" of stories that highlight the pitfalls of poor documentation, along with lessons for avoiding them.
Horror Story #1
The DS team at an online network needed to fix a model used to identify inappropriate content. The model had degraded over time, but when the responsible team looked into it, they realized that they had no documentation, no source code, and no idea how the model was created or what dataset was used to produce it.
The company had to assign a team member to reverse engineer the model that was running in production, then go on a fishing expedition through the data lake to identify which dataset was used to create this model. The team eventually replicated the model to the best of their ability, after which they could work on improving the model. Time lost: Six months.
Lesson learned: Prioritize documentation.
Hey, we get it, no one likes documentation. It’s overhead that comes with the job, like meetings. But documentation should be viewed as an integral part of the job, not as something separate. As data scientists, we need to embrace a culture of documentation within the organization, where it is consistently created, updated and maintained throughout the model's lifecycle.
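The team in Horror Story #1 lost six months because nobody recorded which dataset and code produced the model. As a minimal sketch (not any particular tool's API; the function and field names here are my own invention), a training script could write a small metadata record alongside every model artifact, capturing the dataset by content hash so a future team can verify exactly what was used:

```python
import datetime
import hashlib
import json
from pathlib import Path


def write_model_card(model_path: str, dataset_path: str, params: dict,
                     out_path: str = "model_card.json") -> dict:
    """Record the minimum facts needed to reproduce a model:
    which dataset was used (by content hash), which hyperparameters,
    and when it was trained. Hypothetical schema for illustration."""
    dataset_bytes = Path(dataset_path).read_bytes()
    card = {
        "model_path": model_path,
        "dataset_path": dataset_path,
        # A content hash lets a future team confirm which dataset
        # actually produced the model, even if files get moved around.
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "params": params,
        "trained_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    Path(out_path).write_text(json.dumps(card, indent=2))
    return card
```

Even a plain JSON file like this, committed next to the model, would have turned that six-month fishing expedition into a five-minute lookup.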
Horror Story #2
Facing a crisis, a government agency contracted with a private firm for a tool to sift through social media posts to identify threats based on keywords, with a turnaround time of 72 hours. The firm didn't have time to train a new model, but they had an existing model that would work. However, they needed to understand whether they could legally reuse the model and whether it would perform as required.
The data scientist who made the model 14 months prior was still at the company, but they had recorded all their "documentation" by hand in a notebook that got filed away and forgotten after the project was completed. After a frantic day-long search through old banker's boxes stored in a garage, the notebook eventually turned up. The firm met the deadline – but only because the data scientist hadn't left the company and was able to find the long-lost notebook with the critical information.
Lesson learned: Centralize your documentation.
As a group, data science teams need to agree on a single tool or platform as the system of record for documentation. It should be easy to access and search, a place where we can create, maintain, tag and especially share our documentation both within our own group and with other functions in the organization. Opening up access to our documentation to other groups will ensure that we're preserving the right set of information to meet requirements like regulatory compliance. It will also minimize the disruption of shoulder taps or Slack messages asking about a project that we worked on a year ago.
Horror Story #3
This is really my own horror story, and I've seen it play out again and again. In general, data scientists have a good understanding of what solid documentation looks like. But knowing how much effort that takes can actually discourage us from fully documenting our work, because we are under so much pressure to produce output quickly. Documentation slows us down and, in the short term at least, doesn't seem to add value.
To be blunt, documentation doesn't make people happy, and if you make people unhappy enough, they'll leave. So organizations wind up trying to find a middle ground: capturing the level of documentation they need to de-risk the company, but doing it in a way that doesn't drive people out the door. Unfortunately, in the absence of clear guidelines, the "right level of documentation" often turns out to be no documentation at all.
Lesson learned: Make good documentation easy.
We can make it easier to produce good documentation by agreeing on clear guidelines and standards for what constitutes good documentation. This means capturing information that is essential for our own team but also working with other teams - like IT, legal, governance, and risk management - to ensure we’re capturing the information they need and in an appropriate format.
We need to make sure that the "right level" of documentation is not nothing, but also isn't overkill. Optimally, whatever tool we're using would have documentation baked in or, at the least, would make it easy to capture the right set of information. At my company, Verta, we're even experimenting with using generative AI to create a first draft of documentation that can then be reviewed before publishing to ensure accuracy and completeness.
With a documentation culture, a centralized system of record for documentation, and standardized processes, we can take much of the pain out of documentation, save ourselves time and hassle, and - hopefully - avoid any documentation horror stories of our own.
In the meantime, I’d welcome hearing about your own documentation horror stories or how your organization is using tools like generative AI to prevent documentation nightmares - email me at baasit@verta.ai, or connect with me on LinkedIn at https://www.linkedin.com/in/baasit-sharief/.