Analysis Medical Research in China

Big data and medical research in China

BMJ 2018; 360 doi: (Published 05 February 2018) Cite this as: BMJ 2018;360:j5910
  1. Luxia Zhang, professor1 2,
  2. Haibo Wang, researcher3 4,
  3. Quanzheng Li, associate professor5,
  4. Ming-Hui Zhao, professor1 6,
  5. Qi-Min Zhan, professor7
  1. 1Renal Division, Department of Medicine, Peking University First Hospital, Peking University Institute of Nephrology, Beijing, China
  2. 2Peking University, Center for Data Science in Health and Medicine, Beijing, China
  3. 3Clinical Trial Unit, First Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China
  4. 4China Standard Medical Information Research Center, Shenzhen, China
  5. 5MGH & BWH Center for Clinical Data Science, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
  6. 6Peking-Tsinghua Center for Life Sciences, Beijing, China
  7. 7Peking University, Health Science Center, Beijing, China
  1. Correspondence to: L Zhang: zhanglx{at}

Luxia Zhang and colleagues discuss the development of big data in Chinese healthcare and the opportunities for its use in medical research

The quantity of data that is routinely generated and collected has increased greatly in the past decade, as has our ability to analyse and interpret these data, particularly in medicine. China’s large population and universal healthcare system provide rich sources of data, and interest in the application of big data to medicine has grown in the past few years. It is hoped that the combined use of large data resources and new technologies will solve many existing medical problems and provide better evidence for decision making.1

What do we mean by big data?

Big data has been defined as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”2

Digital healthcare data are now common. Large volumes of medical data are generated through medical records, regulatory requirements, and medical research.3 Worldwide, the volume of data is projected to double every two years, which will result in 50 times more data in 2020 than in 2011.4
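The arithmetic behind such projections can be checked with a short sketch. The function names below are illustrative and not taken from the cited report. A strict two-year doubling over the nine years from 2011 to 2020 gives roughly 23-fold growth, whereas a 50-fold increase corresponds to doubling about every 1.6 years, so the two quoted figures are probably best read as rounded approximations.

```python
import math


def growth_factor(years: float, doubling_time: float) -> float:
    """Factor by which a quantity grows if it doubles every `doubling_time` years."""
    return 2 ** (years / doubling_time)


def implied_doubling_time(years: float, factor: float) -> float:
    """Doubling time (in years) that yields `factor`-fold growth over `years`."""
    return years * math.log(2) / math.log(factor)


span = 2020 - 2011  # nine years between the two reference points

# Growth under a strict two-year doubling time
print(round(growth_factor(span, 2.0), 1))

# Doubling time that would be needed for a 50-fold increase
print(round(implied_doubling_time(span, 50), 2))
```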

In addition to data volume,5 variety and velocity are also important for usability; together these make up the 3Vs of big data. The variety comes from the multiple sources of data (box 1), both structured and unstructured, which reflect the whole health and disease process.

Box 1

Sources of medical big data

  • Administrative and claims data

  • Routine population statistics and major disease surveillance data

  • Real world data, such as electronic medical records, medical imaging, and data from health examinations

  • Research data, including biomarkers and multiomic information from clinical trials or cohort studies

  • Registries (eg, of devices, procedures, and diseases)

  • Data from mobile medical devices

  • Data reported by patients


Medical data are also being combined with information from social media, occupational information, geographical location, and economic and environmental data.6 Integrating all these information sources into datasets that can be analysed is key to utilising big data. In addition, the speed at which big data are generated and processed should meet the real time demands of preventing and managing disease.

Recently, veracity has been added as a goal of big data,7 although some argue that big data are difficult to validate and can never be completely accurate.5 8 Nonetheless, to make the best use of big data, quality is important.

An important concept of big data is that assembly of the data is not the purpose. Instead, data must be analysed, interpreted, and acted on. Therefore, to get the best value from big data, new technologies and analytical methods (eg, machine learning) are needed and the information generated must be evaluated for clinical effectiveness and translated into tools for use in clinical practice.9

What data are gathered in China and how?

Promoting the use of big data in medicine is a national priority in China. In June 2016, the State Council of China issued an official notice on the development and use of big data in the healthcare sector.10 The council acknowledged that big data in health and medicine were a strategic national resource and that their development could improve healthcare in China, and it set out programmatic development goals, key tasks, and an organisational framework.

After regional health data centres were established in Shanghai and Ningbo, the National Health and Family Planning Commission announced in 2016 that China would establish more regional and national centres and industrial parks focused on big data in health and medicine as part of a national pilot programme to make more meaningful use of these data.11 Four cities in Fujian and Jiangsu provinces in eastern China were chosen as the pilot sites, and the centres are now under construction. The goal is to integrate the following datasets:

  • Regional health data, including claims data from nationally funded basic health insurance that covers over 95% of the Chinese population12

  • Administrative data from local health offices

  • Data from public health services of the Chinese Center for Disease Control and Prevention, especially for women and children, and for surveillance networks of the main non-communicable diseases

  • Birth and death registries

  • Electronic medical records from hospitals, including primary, secondary, and tertiary hospitals.

China is already making use of big data. The country’s personal identification system could be used to link data from various sources. Medical claims data from the national social insurance system have been used to generate a 5% sampling database and an overall database covering over 0.6 billion beneficiaries in the past five years, which are available to scientific researchers. Applications to use these data are managed by organisations such as the Chinese Health Insurance Research Association; there is no public access.
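How a personal identifier might be used both to link sources and to draw a reproducible 5% sample can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how the national databases are actually built: the field names, the salt, the two sources, and the hash-based sampling rule are all hypothetical.

```python
import hashlib


def pseudonymise(person_id: str, salt: str) -> str:
    """Replace a personal identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + person_id).encode("utf-8")).hexdigest()


def in_sample(pseudo_id: str, percent: int = 5) -> bool:
    """Deterministically place about `percent`% of pseudonymised IDs in the sample."""
    return int(pseudo_id, 16) % 100 < percent


def link(claims: list, discharges: list, salt: str) -> dict:
    """Group records from two sources under one pseudonymised key per person."""
    linked = {}
    for source, records in (("claims", claims), ("discharges", discharges)):
        for record in records:
            key = pseudonymise(record["person_id"], salt)
            entry = linked.setdefault(key, {"claims": [], "discharges": []})
            # Drop the raw identifier so that only the pseudonym is retained
            entry[source].append(
                {k: v for k, v in record.items() if k != "person_id"}
            )
    return linked


# Hypothetical example records
claims = [
    {"person_id": "P001", "cost": 120.0},
    {"person_id": "P002", "cost": 35.5},
    {"person_id": "P001", "cost": 60.0},
]
discharges = [{"person_id": "P001", "diagnosis": "N18"}]

linked = link(claims, discharges, salt="demo-salt")
sample = {key: value for key, value in linked.items() if in_sample(key)}
```

Because the sampling rule depends only on the pseudonymised identifier, a person is either always in the sample or never in it, so all of that person's records stay together across sources and over time.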

Since 2016, many academic research projects using these national datasets have been approved to evaluate the current and future clinical and economic burden of chronic diseases such as cardiovascular disease, diabetes, kidney disease, and chronic obstructive pulmonary disease. Furthermore, other national administrative databases, including the national standardised discharge summary of inpatients and the national death registry, with hundreds of millions of patient records, have been used by medical and public health researchers.13 14

China is also focusing on personalising medicine. Since 2016, the Ministry of Science and Technology has initiated and funded many “precision medicine” projects under the national key research and development programme. A centralised and integrated data platform for precision medicine is being developed, which will store all patient and population data as well as biosamples collected from a series of large cohort studies and from biobanks. The platform is expected to include at least 0.7 million participants: 0.4 million from the general population and 0.3 million patients with major non-communicable diseases. China’s large population base and centralised governance mean that very large sample sizes can be reached, which is of great value to personalised medicine initiatives.

As well as the government-led projects, Chinese academic medical societies are leading data-sharing initiatives (box 2). In October 2017, the School of Public Health at Peking University announced the launch of the China Cohort Consortium. Currently 20 cohorts with more than 2 million participants are included. The activities of the consortium include using common data models for data harmonisation, performing individual participant data meta-analyses, and generating new cohorts. Furthermore, disease based data sharing platforms, including for cardiovascular disease, stroke, cancer, and kidney disease, have been established by medical specialists with the support of the government. For example, the China Kidney Disease Network, which launched in 2015, integrates various sources of data on kidney disease and uses new analytic techniques to provide evidence for healthcare policy, strengthen academic research, and promote effective disease management.15

Box 2

Current projects applying big data to medicine in China

Government led

Researcher initiated

  • China Cohort Consortium

  • China Kidney Disease Network

  • Others funded by the government include cardiovascular disease (eg, China Cardiovascular Surgery Registry), stroke (eg, Chinese National Stroke Registry), and cancer (eg, National Central Cancer Registry of China)


What are the challenges and what needs to be done?

Electronic record systems

Electronic medical records, whether collected by one organisation or for individual patients across organisations, are not commonly used for research in China. They are primarily used for clinical practice and largely contain unstructured data. Although over 90% of hospitals in China use electronic records, accessibility to and quality of the data are not optimal.

Adoption of individual electronic health records has been impeded by incompatibility between different hospital systems. China has over 300 commercial providers of hospital information systems with various technical structures and data standards. Furthermore, healthcare systems are not required to exchange data with each other. Some regions are planning to establish regional electronic health records but most are in preliminary stages. To overcome these problems, the interoperability of electronic records needs to be improved, especially for data structures, data standards, and data transfer agreements. Health authorities, hospitals, and electronic record companies must agree on how to improve hospital information systems. Technologies that can integrate data from different sources are also needed. In addition, the government should introduce policies to strengthen data exchange and integration across organisations.

Lack of medical terminology system

The lack of a widely adopted and consistently implemented medical terminology system is another problem for using big data in medical research. For example, since 2002, the National Health and Family Planning Commission has mandated the use of the International Classification of Diseases (ICD-9, and more recently ICD-10) for all hospital patients. However, the growth of hospital information systems has resulted in many variations in the coding of other clinical terms beyond diagnosis, making data exchange difficult. Widely accepted terminology systems, such as the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT), the Unified Medical Language System (UMLS), or the General Architecture for Languages, Encyclopaedias and Nomenclatures in Medicine (GALEN), are not available in China. By integrating and distributing key terminology, classification, and coding standards in medicine, these systems promote more effective and interoperable biomedical information systems and services, including electronic health records. More effort is needed to resolve linguistic differences between Chinese and English beyond the existing translation of terms.
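The harmonisation problem can be illustrated with a small sketch that maps local spelling and language variants to a single code. The local terms and the lookup table are hypothetical, and ICD-10 categories are used only because they are mandated for diagnoses; for clinical terms beyond diagnosis, the target would be whichever terminology system is eventually adopted.

```python
import unicodedata

# Hypothetical local term variants mapped to ICD-10 categories
LOCAL_TO_ICD10 = {
    "chronic kidney disease": "N18",
    "ckd": "N18",
    "慢性肾脏病": "N18",
    "type 2 diabetes": "E11",
    "2型糖尿病": "E11",
    "essential hypertension": "I10",
}


def normalise(term: str) -> str:
    """Unify character width, letter case, and whitespace before lookup."""
    text = unicodedata.normalize("NFKC", term).casefold()
    return " ".join(text.split())


def map_terms(terms: list) -> tuple:
    """Return (mapped, unmapped): codes for known terms and a list for review."""
    mapped, unmapped = {}, []
    for term in terms:
        code = LOCAL_TO_ICD10.get(normalise(term))
        if code is None:
            unmapped.append(term)  # left for manual review, never guessed
        else:
            mapped[term] = code
    return mapped, unmapped


mapped, unmapped = map_terms(
    ["CKD", "  Chronic  Kidney Disease ", "２型糖尿病", "unknown term"]
)
```

Keeping unmapped terms in a separate list, rather than forcing a nearest match, reflects the point that translation alone does not resolve linguistic differences between systems.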

Current medical practice patterns

Medical practice patterns and the infrastructure of health systems in China also impede the meaningful use of big data. The lack of an established referral system and the heterogeneity in the quality of healthcare contribute to “medical migration,” when patients travel to different provinces and cities to seek medical care. In the current Chinese medical system, it is almost impossible to track a patient through electronic record systems for clinical purposes as there is no unified national platform that can consolidate all the data from all healthcare institutions in China. The main barrier to conducting a “deep patient” study,16 where machine learning is used to predict future adverse events using medical data, is obtaining the longitudinal data and outcomes of each patient from electronic records. Furthermore, the wide differences in medical practice raise concerns about the veracity of data.

Data quality

The problems described above affect the quality of big data. It has been shown that, when the quality of clinical data is higher, big data analytics produce more valid, stable, and clinically useful results.17 However, it is difficult to validate high volume datasets. One way of dealing with the data quality problem is to examine the characteristics of the database and judge which variables are likely to be relatively accurate—for example, expenditure from claims data—and to answer questions based on those variables. Improving the veracity of data requires an ongoing and joint effort by multiple sectors to rigorously examine the validity, representativeness, and completeness of data.
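One simple way to profile a database before choosing which variables to rely on is to measure how complete each field is. The sketch below is illustrative only: the field names and the 95% threshold are assumptions, and completeness is just one of the dimensions mentioned above, so validity and representativeness would need separate checks.

```python
def completeness(records: list, fields: list) -> dict:
    """Share of records with a non-missing value for each field."""
    total = len(records)
    scores = {}
    for field in fields:
        # A value of 0 counts as present; None and empty strings count as missing
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        scores[field] = present / total if total else 0.0
    return scores


def usable_fields(records: list, fields: list, threshold: float = 0.95) -> list:
    """Fields whose completeness meets the chosen threshold."""
    scores = completeness(records, fields)
    return [field for field in fields if scores[field] >= threshold]


# Hypothetical claims-style records
records = [
    {"expenditure": 120.5, "egfr": None},
    {"expenditure": 0, "egfr": 55},
    {"expenditure": 80, "egfr": ""},
    {"expenditure": 10},
]

scores = completeness(records, ["expenditure", "egfr"])
reliable = usable_fields(records, ["expenditure", "egfr"])
```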

Privacy concerns

Although privacy is an extremely important topic for big data in health and medicine, there is no specific law or guidance on this in China. Regulations from authorities and research standards on privacy protection are needed, and these should not jeopardise the completeness of the data that can be used.

Opportunities to improve health

The use of big data in medicine includes public health promotion (disease monitoring and population management), healthcare management (quality control and performance measurement), drug and medical device surveillance, routine clinical practice (risk prediction, diagnosis accuracy, and decision support), and research.19

The existing mandatory national administrative databases in China produce big data that can easily be used to monitor trends in major diseases and provide evidence for policy making in healthcare. New data analytics, such as machine learning, could also be used to take over much of the work of radiologists and anatomical pathologists, and this is an active area of research in China.18 However, for applications that need detailed and high quality clinical information and long term follow-up, such as predicting long term outcomes and providing support for clinical decisions, the data systems in China need to be developed further.

In China, discussion on big data in medicine has focused on how to collect, store, integrate, and manage data and has been led by computer scientists and the health information industry. However, the future of big data in medicine is in using new analytic techniques such as machine learning to answer clinical questions, educating doctors and policy makers to understand big data, and promoting the use of tools generated by big data and big data technologies that support clinical decision making.


China’s national campaign to promote the application of big data in health and medicine is likely to change medical research, medical practice, and the development of the healthcare industry in the near future. Despite the great interest in big data, we advocate following Confucian doctrine to ensure that we obtain true value for medicine—that is, to learn extensively, inquire carefully, think deeply, discriminate clearly, and practise faithfully.

Key messages

The application of big data to health and medicine is a national priority for China

Several initiatives to promote big data have been started by the government and researchers

The use of big data and new data technologies has the potential to improve medical research and the understanding of health and disease


We thank Alan Leichtman (Arbor Research Collaborative for Health, and University of Michigan) and Roseanne Yeung (University of Alberta) for their constructive suggestions and editing. We also thank Fan Liu (former chief information officer of Peking University People’s Hospital and Peking University International Hospital) for comments on electronic record systems.


  • Contributors and sources: LZ is a renal epidemiologist and the executive deputy director of Peking University, Center for Data Science in Health and Medicine. HW is the founding director and key architect of several national medical databases in China. QL is a principal investigator focusing on artificial intelligence in health and medicine. MHZ is a nephrologist with substantial experience in experimental research and population based studies in China. QMZ is an academician of the Chinese Academy of Engineering and the chief scientist of the 973 National Fundamental Program in China. His main interest is the translational study of cancer. LZ and HW contributed equally to this work and are the guarantors. This article arose from discussions about the status and future directions of big data in health and medicine in China, and the relationship with traditional medical studies.

  • Competing interests: We have read and understood BMJ policy on declaration of interests and declare that the article was funded by the World Health Organization (WHO Reference 2014/435380-0), the National Key Technology R&D Program of the Ministry of Science and Technology (2016YFC1305400), and the University of Michigan Health System-Peking University Health Science Center Joint Institute for Translational and Clinical Research (BMU20140479).

  • Provenance and peer review: Commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

