Harnessing the full potential of data requires developing an organization-wide data science strategy. Such strategies are now commonplace in many industries, such as banking and retail. Banks can offer their customers targeted needs-based services and improved fraud protection because they collect and analyze transactional data. Retailers such as Amazon routinely collect data on shopping habits and preferences to profile their customers and use sophisticated predictive algorithms to tailor marketing strategies to customer demand.
Health care is a glaring exception. Individual pieces of data can have life-or-death importance, but many organizations fail to aggregate big data effectively to gain insights into wider care processes. Without a data science strategy, health care organizations can’t draw on increasing volumes of data and medical knowledge in an organized, strategic way, and individual clinicians can’t use that knowledge to improve the safety, quality, and efficiency of the care they provide.
A comprehensive data science strategy needs to address the quality of the underlying data, effective ways to analyze the data, and a framework for keeping it secure. If an organization tries to aggregate and analyze poor-quality data, it may derive useless or even dangerous conclusions. An inadequate security framework may lead to unauthorized access and undermine the trust of patients and providers.
A carefully developed data science strategy will help achieve both precision medicine (helping to tailor treatments to patients) and the creation of learning health systems (helping to predict outcomes and identify specific areas for improvement). Ideally, every decision a provider makes about a patient should be informed by the data of both that specific patient and other similar patients. In a learning health system, prior experiences improve future choices.
Organizations without an effective data science strategy may never realize returns on their investment in electronic health records (EHRs), may have disillusioned physicians, and may face potentially catastrophic security risks resulting from inadequate data protection. The stakes are high.
We believe that an effective data science strategy for health care organizations has five key components:
1. Repository. A secure organization-wide data repository allows organizations to keep a complete inventory of their data assets. Planning a repository presents multiple challenges. Substantial groundwork is required to scope existing data, create metadata (that is, detailed descriptions of each data source), explore ways to combine data sources, and develop strategies to track what data is produced, stored, used, and reused, as well as how and by whom.
This process may require the formation of new organizational structures, such as designated centers for data science. Organizations have begun to establish such centers, including the Center for Healthcare Delivery Science at Beth Israel Deaconess Medical Center (BIDMC) and the Stanford Biomedical Data Science Initiative. The BIDMC center brings together expertise in health care delivery, analytics, management, epidemiology, biostatistics, and information technology. These experts can draw on a large repository of locally collected electronic health care data for quality improvement purposes.
2. Integration. Bringing different data sets together involves many challenges in reconciling formats and breaking down silos. Developing and consistently using an enterprise master patient index (EMPI) allows disparate data sources to be linked for individual patients, but it requires significant organizational and process changes across information systems. These include eliminating duplicate records and establishing new procedures for adding new patients.
If creating an EMPI for initial data collection poses too many challenges, either administrative or technical, an organization may achieve a reasonable equivalent by using a “data lake,” a technology platform that allows linking of highly disparate data. Data lakes keep source data in its original state for analysis if needed but also allow organizations to navigate across different sources and explore new relationships among them. Mercy Health in St. Louis uses a data lake to integrate data across its locations. The lake is fed with real-time data from its electronic health record as well as an enterprise resource management system and several other sources. This combination of data from disparate sources pulls together patient-specific information across a range of operational and clinical issues. This information can be fed back in real time to clinicians at the bedside and can also be used for operational and strategic planning and overall quality analysis.
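To make the duplicate-elimination step behind an EMPI more concrete, here is a minimal Python sketch. It assumes three hypothetical source systems ("ehr", "billing", "lab"), invented sample patients, and a deliberately simple rule: records match when normalized first name, last name, and birth date are identical. The `SourceRecord` fields and the `MPI-` identifier format are illustrative assumptions, not part of any real product. A production index would more likely rely on probabilistic matching and human review of uncertain pairs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    source: str      # hypothetical source system name, e.g. "ehr" or "billing"
    local_id: str    # identifier used inside that source system
    first_name: str
    last_name: str
    birth_date: str  # ISO format, YYYY-MM-DD


def match_key(record):
    """Build a deterministic matching key from normalized demographics."""
    first = "".join(record.first_name.lower().split())
    last = "".join(record.last_name.lower().split())
    return (first, last, record.birth_date.strip())


def build_index(records):
    """Group source records that share a matching key under one master ID."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    index = {}
    # Sorting the keys keeps master ID assignment reproducible across runs.
    for number, key in enumerate(sorted(groups), start=1):
        index[f"MPI-{number:06d}"] = groups[key]
    return index


# Invented sample data: the first two rows describe the same person.
records = [
    SourceRecord("ehr", "E-17", "Maria", "Lopez", "1980-04-02"),
    SourceRecord("billing", "B-903", " maria ", "LOPEZ", "1980-04-02"),
    SourceRecord("lab", "L-55", "John", "Smith", "1975-11-30"),
]
index = build_index(records)
for master_id, linked in index.items():
    print(master_id, [(r.source, r.local_id) for r in linked])
```

Exact matching on normalized fields is easy to audit but misses misspellings and name changes, which is one reason the organizational procedures for registering new patients matter as much as the matching logic.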
3. Security. Protecting privacy and anonymity is always paramount, and that task becomes more complex when an organization uses a patient’s data for purposes that go beyond immediate patient care. This is particularly important given that some health systems, including BIDMC, are moving to the use of private space on public clouds. Organizations need to create data governance frameworks to ensure those protections, and commit money to cybersecurity measures.
For example, organizations need to address issues of staff training, how to handle access to data by visiting workers, how to guard against data breaches, and how to mitigate the damage from any breaches that occur. Existing technologies should meet International Organization for Standardization (ISO) data security standards, and organizations should schedule periodic risk assessment and mitigation of technical, administrative, and physical vulnerabilities.
Large digital data repositories may increase concerns about the security of cloud storage systems and data lakes and may undermine patients’ and clinicians’ trust. Breaches can be costly in both money and institutional reputation. New and potentially expensive approaches may be needed to prevent them, including the development of anonymization algorithms and machine-learning-based security models that can adapt to changing threats and circumstances.
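As one small illustration of what work on anonymization can involve, the Python sketch below shows keyed pseudonymization, a weaker technique than full anonymization and named here as a substitute for it. It replaces a patient identifier with an HMAC-SHA256 digest and drops direct identifiers before a record is shared for analysis. The field names, the sample record, and the hard-coded key are assumptions made for illustration only. In practice the key would be held in a managed secret store, and remaining fields such as rare diagnoses could still allow re-identification.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key):
    """Return a keyed SHA-256 digest that stands in for the real identifier."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identifiers(record, secret_key, direct_identifiers=("name", "address", "phone")):
    """Drop direct identifiers and replace the patient ID with a pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["patient_id"] = pseudonymize(record["patient_id"], secret_key)
    return cleaned


# Hypothetical key and record, for demonstration only.
demo_key = b"replace-with-a-managed-secret"
raw = {
    "patient_id": "12345",
    "name": "Maria Lopez",
    "phone": "555-0100",
    "diagnosis_code": "E11.9",
}
shared = strip_identifiers(raw, demo_key)
print(sorted(shared))
```

Because the same identifier and key always produce the same pseudonym, records for one patient can still be linked across data sets without exposing the original identifier to analysts.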
4. Support. Organizations need teams with a wide range of skills in data processing and cleaning, statistics, computer science, visualization, operational research and change management, artificial intelligence, and archiving/curation. Of particular importance are “boundary spanners” who can establish links among data science staff, the organization’s management, and its clinicians. They can identify data query priorities that are both organizationally and clinically relevant and can help users of data understand the full range of analysis that is available to them (such as near real-time queries regarding particular patient populations, medications, or treatment outcomes).
5. Feedback. An effective data science strategy relies not only on well-structured databases and advanced analytics but also on having solid underlying data. Predictive analytics can be extremely valuable but require high-quality data for reliable insights. Strategic approaches to analysis should create a virtuous cycle in which data are repeatedly scrutinized as they are reused for different purposes, driving improvements in data quality. Such work should harness innovative analytical tools that employ artificial intelligence approaches such as machine learning and deep learning. A complete service redesign may also be required so that insights from data can inform important organizational and service delivery decisions in real time. To achieve this level of effectiveness, frontline staff may need to change how they work in order to incorporate these insights and act on them at the point of care.
Implementation of a data science strategy represents one of the cornerstones of better care, as well as greater operational efficiency and, eventually, more effective approaches to population health. Our health care system will increasingly depend on data to improve care, reduce costs, and expand access.