Why Every Health Care Organization Needs a Data Science Strategy

Article · March 22, 2017

Harnessing the full potential of data requires developing an organization-wide data science strategy. Such strategies are now commonplace in industries such as banking and retail. Banks can offer their customers targeted, needs-based services and improved fraud protection because they collect and analyze transactional data. Retailers such as Amazon routinely collect data on shopping habits and preferences to profile their customers and use sophisticated predictive algorithms to tailor marketing strategies to customer demand.

Health care is a glaring exception. Individual pieces of data can have life-or-death importance, but many organizations fail to aggregate big data effectively to gain insights into wider care processes. Without a data science strategy, health care organizations can’t draw on increasing volumes of data and medical knowledge in an organized, strategic way, and individual clinicians can’t use that knowledge to improve the safety, quality, and efficiency of the care they provide.

A comprehensive data science strategy needs to address the quality of the underlying data, effective ways to analyze the data, and a framework for keeping it secure. If an organization tries to aggregate and analyze poor-quality data, it may derive useless or even dangerous conclusions. An inadequate security framework may lead to unauthorized access and undermine the trust of patients and providers.

A carefully developed data science strategy will help achieve both precision medicine (helping to tailor treatments to patients) and the creation of learning health systems (helping to predict outcomes and identify specific areas for improvement). Ideally, every decision a provider makes about a patient should be informed by the data of both that specific patient and other similar patients. In a learning health system, prior experiences improve future choices.

Organizations without an effective data science strategy may never realize returns on their investment in electronic health records (EHRs), may have disillusioned physicians, and may face potentially catastrophic security risks resulting from inadequate data protection. The stakes are high.

We believe that an effective data science strategy for health care organizations has five key components:

Key Components of Data Science Strategy in Healthcare Organizations

1. Repository.  A secure organization-wide data repository allows organizations to keep a complete inventory of their data assets. Planning a repository presents multiple challenges. Substantial groundwork is required to scope existing data, create metadata (that is, detailed descriptions of each data source), explore ways to combine data sources, and develop strategies to keep track of what data is produced, stored, used, and reused, and how, and by whom.

This process may require the formation of new organizational structures, such as designated centers for data science. Organizations have begun to establish such centers, including the Center for Healthcare Delivery Science at Beth Israel Deaconess Medical Center (BIDMC) and the Stanford Biomedical Data Science Initiative. The BIDMC center brings together expertise in health care delivery, analytics, management, epidemiology, biostatistics, and information technology. These experts can draw on a large repository of locally collected electronic health care data for quality improvement purposes.

2. Integration.  Bringing different data sets together involves many challenges in reconciling formats and breaking down silos. Development and consistent use of an enterprise master patient index (EMPI) allows linkage of disparate data sources on individual patients, but achieving this across information systems requires significant organizational and process changes. These include eliminating duplicate records and establishing new procedures surrounding the addition of new patients.
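
To make the linkage idea concrete, the following is a minimal sketch of how an index might map records from different source systems to one shared patient identifier. It is an illustration only: the `PatientRecord` fields, the `EMPI-0001` identifier format, and the exact-match rule on normalized name and birth date are all assumptions made for this example, and production EMPIs generally rely on more tolerant (often probabilistic) matching plus manual review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """A record as it might arrive from one source system (hypothetical fields)."""
    source: str       # name of the originating system, e.g. "ehr" or "lab"
    local_id: str     # identifier used inside that system
    first_name: str
    last_name: str
    birth_date: str   # ISO 8601 date string


def match_key(rec: PatientRecord) -> tuple:
    """Normalize the demographic fields used to decide two records are one person."""
    return (
        rec.first_name.strip().lower(),
        rec.last_name.strip().lower(),
        rec.birth_date,
    )


def build_empi(records):
    """Map every (source, local_id) pair to a shared enterprise identifier.

    Records with the same normalized key receive the same identifier, which is
    how duplicates across systems are collapsed in this simplified sketch.
    """
    key_to_empi = {}
    index = {}
    for rec in records:
        key = match_key(rec)
        if key not in key_to_empi:
            key_to_empi[key] = f"EMPI-{len(key_to_empi) + 1:04d}"
        index[(rec.source, rec.local_id)] = key_to_empi[key]
    return index


records = [
    PatientRecord("ehr", "A1", "Ann", "Lee", "1980-02-01"),
    PatientRecord("lab", "77", " ann ", "LEE", "1980-02-01"),
    PatientRecord("ehr", "A2", "Bob", "Ray", "1975-07-09"),
]
empi_index = build_empi(records)
```

Here the EHR record `A1` and the lab record `77` resolve to the same identifier despite differences in spacing and capitalization, while `A2` receives its own. Exact matching of this kind misses misspellings and name changes, which is one reason the organizational procedures described above matter as much as the software.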

If creating an EMPI for initial data collection poses too many challenges, either administrative or technical, an organization may achieve a reasonable equivalent by using a “data lake,” a technology platform that allows linking of highly disparate data. Data lakes keep source data in its original state for analysis if needed but also allow organizations to navigate across different sources and explore new relationships among them. Mercy Health in St. Louis uses a data lake to integrate data across its locations. This is fed with real-time data from its electronic health record as well as an enterprise resource management system and several other sources. This combination of data from disparate sources pulls together patient-specific information across a range of operational and clinical issues. This information can be fed back in real-time to clinicians at the bedside and can also be used for operational and strategic planning and overall quality analysis.

3. Security.  Protecting privacy and anonymity is always paramount, and that task becomes more complex when an organization uses a patient’s data for purposes that go beyond immediate patient care. This is particularly important given that some health systems, including BIDMC, are moving to the use of private space on public clouds. Organizations need to create data governance frameworks to ensure those protections, and commit money to cybersecurity measures.

For example, organizations need to address issues of staff training, how to handle access to data by visiting workers, how to guard against data breaches, and how to mitigate the damage from any breaches that occur. Existing technologies should meet International Organization for Standardization (ISO) data security standards, and organizations should schedule periodic risk assessment and mitigation of technical, administrative, and physical vulnerabilities.

Large digital data repositories may increase concerns about the security of cloud storage systems and data lakes and may undermine patients’ and clinicians’ trust. Breaches can be costly in both money and institutional reputation. New and potentially expensive approaches may be needed to prevent them, including the development of anonymization algorithms and machine-learning-based security models that can adapt to changing threats and/or circumstances.
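
As one small illustration of what such protections can involve, the sketch below replaces a patient identifier with a keyed hash and drops direct identifiers before a record is shared for analysis. This is pseudonymization, not full anonymization: the remaining fields may still allow re-identification, and the field names, the list of direct identifiers, and the key handling shown here are assumptions made for the example rather than a recommended standard.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for a patient ID using HMAC-SHA256.

    The same ID and key always give the same pseudonym, so records can still
    be linked, but the original ID cannot be read back without the key.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, secret_key: bytes,
               direct_identifiers=("name", "address", "phone")) -> dict:
    """Drop direct identifiers and swap the patient ID for a pseudonym.

    `direct_identifiers` is a hypothetical list for this sketch; a real
    governance framework would define which fields must be removed.
    """
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in direct_identifiers and field != "patient_id"
    }
    cleaned["pseudo_id"] = pseudonymize(record["patient_id"], secret_key)
    return cleaned


demo_key = b"demo-key-for-illustration-only"  # in practice, kept in a secrets store
shared = deidentify(
    {"patient_id": "123", "name": "Ann Lee", "phone": "555-0100", "diagnosis": "CKD"},
    demo_key,
)
```

The resulting `shared` record keeps the clinical field and a pseudonym but no name, phone number, or original identifier. Who holds the key, and who may request re-linkage, are governance questions of the kind the framework above needs to settle.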

4. Support.  Organizations need teams with a wide range of skills in data processing and cleaning, statistics, computer science, visualization, operational research and change management, artificial intelligence, and archiving/curation. Of particular importance are “boundary spanners” who can establish links among data science staff, the organization’s management, and its clinicians. They can identify data query priorities that are both organizationally and clinically relevant and can help users of data understand the full range of analysis that is available to them (such as near real-time queries regarding particular patient populations, medications, or treatment outcomes).

5. Feedback.  An effective data science strategy relies not only on well-structured databases and advanced analytics but also on having solid underlying data. Predictive analytics can be extremely valuable but require high-quality data for reliable insights. Strategic approaches to analysis should create a virtuous cycle in which data are repeatedly scrutinized as they are reused for different purposes, driving improvements in data quality. Such work should harness innovative analytical tools that employ artificial intelligence approaches such as machine and deep learning, and a complete service redesign may be required in which insights from data can inform important organizational and service delivery decisions in real time. To achieve this level of effectiveness, frontline staff may need to change how they work in order to incorporate these insights and act on them at the point of care.
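
A simple form of the scrutiny described above is to measure, each time a data set is reused, how complete its key fields are. The sketch below computes the share of records with a non-empty value for each required field. The field names and the idea of treating `None` and empty strings as missing are assumptions for illustration; real data quality programs also check accuracy, timeliness, and consistency.

```python
def completeness_report(records, required_fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    report = {}
    for field in required_fields:
        filled = sum(1 for rec in records if rec.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


sample = [
    {"mrn": "1", "dob": "1980-01-01"},
    {"mrn": "2", "dob": ""},
    {"mrn": "3"},
    {"mrn": "4", "dob": None},
]
quality = completeness_report(sample, ["mrn", "dob"])
```

In this sample every record has a medical record number, but only one in four has a usable birth date. Feeding results like these back to the teams that enter the data is one way the cycle of reuse can drive improvements in quality.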

Implementation of a data science strategy represents one of the cornerstones of better care, as well as greater operational efficiency and, eventually, more effective approaches to population health. Our health care system will increasingly depend on data to improve care, reduce costs, and expand access.
