My Models Failed. That's How I Became a Better Data Scientist.

My first predictive model in healthcare looked like an easy win.
It answered the business question. The performance metrics were strong. The idea was sound.
And it would have failed spectacularly in production.
That experience changed the way I think about data science and what it takes to succeed in healthcare in the age of AI.
Looking back, that kind of failure would repeat itself throughout my career, but it was crucial to my growth and success as a data scientist: a sophisticated model in a notebook is useless if you don't understand the environment your model is intended for.
Data Analyst
After three grueling months hunting for my first real-world job, in a market with a new appetite for data but also full of talent, I finally got my big break: an entry-level data analyst position on the Business Intelligence team at a large hospital system. There was a lot to learn. A major hurdle, and one that many people entering the healthcare data space will also have to clear, is getting comfortable with the data model of Epic, the largest EHR (electronic health record) vendor by market share. Stretching my SQL legs against the EHR's more complex data was not easy. For the first few months, I depended on my senior colleagues to write the SQL I needed for analysis. This frustrated me: how could I finish a master's degree in mathematics and still struggle to get the hang of SQL?
However, with practice (lots of practice) and patience from my partner (lots of patience), it all eventually started to click. As my comfort grew, I entered the world of Tableau and dashboards, fascinated by the process of building compelling dashboards that told the data stories that needed to be told.
Throughout my first year, my manager was very supportive, constantly checking in, asking what my career goals were and how he could help me achieve them. He knew that my academic background was more technical than the ad-hoc analysis I was doing as an entry-level data analyst, and that I wanted to build predictive models. At the bittersweet end of my first chapter, he arranged a transfer to another group so I could get that experience. That group was the Advanced Analytics group. And I was going to be a Data Scientist.
Data Scientist I
From day one, I worked closely with a data science mentor who had deep healthcare knowledge and the technical skill to match, which let him deliver impressive products and lead the way for our small team. He was the first in our program to develop a custom predictive model and bring it to life in a production environment, generating scores for patients in real time. Those scores were used in clinical practice. When my boss asked me what my professional goals were for the coming year, I had a quick and specific answer: I wanted to get a custom predictive model into production.
I started with a few POCs (proofs of concept). My first model was a logistic regression model that tried to predict the likelihood of complications from diabetes. Although a good first attempt, my method of sampling the data was flawed, and a colleague pointed it out in peer review. One of the key lessons I learned from my first attempt at predictive modeling in healthcare:
When collecting data to train a predictive model, it is important to simulate the conditions, patient context, and workflow in which the model will be used within a production environment.
For example: you can't simply collect a patient's current lab values and use those as features in your model. If you expect the model to make predictions, say, 15 minutes after a patient arrives at the ED, you need to account for that. So when you collect two years of historical data to train the model, you need to collect each patient's lab values as they stood 15 minutes after arrival, i.e., at that simulated point in time, not as they appear today. Failure to do so creates a model that performs better in the POC than it ever will in a real-time production environment, because you've given the model access to data that would not be available at prediction time, a concept known as data leakage.
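The point-in-time discipline above can be sketched with a pandas "as-of" join. This is only a minimal illustration with made-up tables; the column names and values are hypothetical and not from any real EHR schema:

```python
import pandas as pd

# Hypothetical ED arrivals and lab results; all names/values are illustrative.
arrivals = pd.DataFrame({
    "patient_id": [1, 2],
    "arrival_time": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 09:30"]),
})
labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "result_time": pd.to_datetime(
        ["2023-01-01 08:05", "2023-01-01 12:00", "2023-01-01 09:20"]),
    "lactate": [1.1, 4.0, 2.2],
}).sort_values("result_time")

# The model scores 15 minutes after ED arrival, so training features may use
# only labs resulted on or before that cutoff -- an as-of join, not a plain join.
arrivals["cutoff"] = arrivals["arrival_time"] + pd.Timedelta(minutes=15)
features = pd.merge_asof(
    arrivals.sort_values("cutoff"), labs,
    left_on="cutoff", right_on="result_time",
    by="patient_id", direction="backward",
)
print(features[["patient_id", "lactate"]])
# Patient 1 gets the 08:05 lactate (1.1), never the 12:00 value (4.0),
# because the later result would not yet exist at prediction time.
```

A plain join on `patient_id` would happily pull in the 12:00 lactate for patient 1, and that is exactly the leak: a feature the production model could never see.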
Lesson learned, I was ready to try again. I spent the next few weeks developing a model to predict appointment no-shows. I was meticulous about how I collected the data, I used a strong, well-proven algorithm, XGBoost, and I got to the peer review stage. The model's AUC (Area Under the ROC Curve) was amazing, sitting in the low 0.9s and blowing everyone's expectations for a no-show model out of the water. I felt unstoppable. Then, it all came crashing down. During a deep dive into feature importance, I realized the most important feature by far was the scheduled appointment time. Remove that feature, and the AUC dropped to the mid-0.5s, meaning the model's predictions were barely better than random guesses. To investigate this unusual behavior, I jumped into SQL. There it was. Within the database, every patient who missed their appointment had a scheduled appointment time of midnight. Another data process was overwriting the appointment time for every patient who did not complete their appointment. This handed the model a near-perfect feature for predicting no-shows: whenever a patient had a midnight appointment, the model knew the patient was a no-show. Had this model reached production, it would have been making predictions weeks before the appointment, without this magic feature to prop up its performance. Data leakage, my greatest enemy, was back to haunt me. We spent weeks trying to salvage performance: state-of-the-art feature engineering, larger training data sets, more intensive training procedures; nothing helped. This model was not going to work, and it broke my heart.
Finally, a win. My first successful predictive model also had a fun name: the DIVA model. DIVA stands for Difficult Intravenous Access. The model was designed to warn nurses when they might have difficulty placing an IV in a given patient and should contact the IV team for placement instead. The goal was to reduce failed IV attempts, with the hope of increasing patient satisfaction and reducing the complications that can arise from those failures. The model performed well, but not suspiciously so. It passed peer review, and I built out the scripts for production use, a much more difficult process than I had imagined. The IV team loved their new tool, and the model was soon part of clinical workflows across the organization. I had accomplished my goal of getting a model into production, and I was thrilled.

Data Scientist II
After successfully deploying several more models, I was promoted to Data Scientist II. I continued to develop predictive models, but also made time to learn about the ever-expanding world of AI. Soon, the demand for AI solutions grew. Our first formal AI project was an internal departmental challenge in which we used language models to automate the categorization of publicly traded companies. This project, like many AI projects, was very different from the ML model development I was used to, but the variety was welcome. I was excited to dive into the world of ETL processes, live reporting, and automation. While we're still getting our feet wet with AI systems, I'm excited about the new kinds of business problems we can now build solutions for.
My role as a data scientist has changed as AI systems have evolved. Developing DS/ML and AI solutions requires far less hands-on technical effort now, and I think of myself as part data scientist, part AI project manager. The AI systems we have access to can write code, debug, and plan more efficiently when given strategic direction from our end. That said, there is growing concern about the impact and feasibility of AI systems, with various reports suggesting that many AI projects fail before ever reaching production. I believe:
A Data Scientist with a strong technical foundation and subject matter knowledge can be a tremendous asset to combat the high failure rate of AI projects.
Our fundamental understanding of predictive models, coupled with domain knowledge from within our industries (healthcare, in my case), is still very much needed to create solutions that work and deliver value. Gone are the days when we could rely on technical knowledge alone to provide value. Coding is increasingly handled by LLMs. Automation is more accessible through cloud providers. What's needed now is an expert who can translate business needs into a strategic plan that guides AI toward an effective solution. The modern data scientist is an ideal candidate to be that translator.

Wrapping up
Data science, like any career path in technology, is constantly changing and evolving. As you can see, my role has changed a lot in the years since college. I've climbed a few corporate rungs, from entry-level data analyst to Data Scientist II, and I can confidently say that the skills needed to succeed have shifted as the years passed and the technology advanced. But it's just as important to remember the lessons learned along the way.
My models failed.
That failure shaped my career.
In healthcare, especially with the magic of AI at our fingertips, a successful data scientist is not one who can build the most complex models.
A successful data scientist is one who understands the environment for which the model is intended.



