Data Sharing

Data is the most important commodity in science, and its management is critical to knowledge discovery and patient care. Publicly available databases are integral to life sciences research. In certain fields there is a strong culture of data sharing. For example, genomics researchers traditionally submit the raw data described in their studies to public databases at the time their studies are published. However, data sharing is uncommon in medical fields, where researchers often treat data as a private preserve. The dearth of databases linking biological and clinical data is a major obstacle to medical progress. Many expert panels and funding agencies have published principles and guidelines underscoring that the raw data described in a published paper should be made publicly available so that other researchers can validate the published findings and re-use the data to promote discovery. However, there is much uncertainty about how to promote adherence to these data sharing principles and guidelines.

Most published HIV drug resistance studies are not initially amenable to re-use or to individual patient-level meta-analyses. Although the genetic sequences described in HIV drug resistance studies may occasionally be submitted to GenBank, these sequences must also be linked to patient data, such as the treatment history of the patient from whom the sequenced virus was obtained, or to laboratory data, such as the results of drug susceptibility testing of the sequenced virus. We systematically review published studies of HIV drug resistance and GenBank submissions to identify the datasets to add to HIVDB. Following this review, we contact the studies' authors and request that they contribute the relevant raw data to HIVDB. We must often also offer authors incentives to make their data available, including co-authorship on meta-analyses that use the data from their original publication. The experience of recruiting data for HIVDB has provided us with an unparalleled understanding of the obstacles to data sharing and the means to overcome them.

HIVDB is a high-profile example of how data sharing can accelerate research and improve patient care. It has shown that aggregating raw data from many studies generates new knowledge that cannot be obtained from individual studies. As the use of genetic sequencing tests in clinical medicine increases, expert systems for interpreting these sequences become increasingly necessary. The public accessibility and transparency of HIVDB and its genotypic drug-resistance interpretation programs have improved patient management and streamlined HIV treatment research globally. Ensuring the long-term sustainability of HIVDB and its data sharing tradition is critical to maintaining this resource for future decades and to demonstrating the viability of data sharing to researchers in other fields.