Figure legend

Figure 1: Least-squares regression coefficients for protease inhibitor (A), nucleoside RT inhibitor (B), and non-nucleoside RT inhibitor (C) treatment-selected mutations

Regression coefficients of the least-squares regression models for A. protease inhibitor (PI) susceptibility using nonpolymorphic PI treatment-selected mutations (TSMs), B. nucleoside RT inhibitor (NRTI) susceptibility using nonpolymorphic NRTI TSMs, and C. non-nucleoside RT inhibitor (NNRTI) susceptibility using nonpolymorphic NNRTI TSMs. The Y-axis indicates the magnitude of the coefficient. Positive coefficients (yellow histograms) indicate mutations that decrease drug susceptibility; negative coefficients (blue histograms) indicate mutations that increase drug susceptibility. The Y-axis has no units because the log-fold susceptibility changes were normalized prior to regression analysis. The error bars indicate the standard deviation of the mean generalized error determined 50 times (10 repetitions of five-fold cross-validation). For the PIs (n=35) and the NRTIs (n=23), the only mutations shown are those that occurred >= 10 times in the dataset and for which the absolute value of the coefficient was >= 3.0 times the standard deviation for one or more drugs. For the NNRTIs (n=24), the only mutations shown are those that occurred >= 2 times in the dataset and for which the absolute value of the coefficient was >= 3.0 times the standard deviation for one or more drugs.

Figure 2: Distribution of the number of nonpolymorphic TSMs and expert panel mutations per isolate

Figure 3: Rate of learning using regression and nonpolymorphic treatment-selected mutations (TSMs)

Mean-squared errors (MSEs) of least-squares regression (grey dotted lines), support vector regression (red dotted lines), and least-angle regression (LARS; blue dotted lines) using the TSMs, plotted against the number of samples used for testing and training. Each point represents the median MSE based on 10 randomly created sets ranging in size from 50 to 600 (or the highest multiple of 50 available for each drug) using five-fold cross-validation. Each set of training examples was stratified according to the proportions of isolates in the complete dataset that were susceptible, low/intermediate, and high-level resistant.
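
The construction of the stratified sets described above can be sketched as follows. This covers only the choice of set sizes and the stratified sampling; model fitting and cross-validation are omitted. The function names, the class labels "S", "I", and "R", and the largest-remainder rounding are assumptions made for illustration, not details taken from the study.

```python
import random
from collections import defaultdict


def subset_sizes(n_available, step=50, max_size=600):
    """Sizes 50, 100, ... up to 600 or the highest multiple of 50 available."""
    top = min(max_size, (n_available // step) * step)
    return list(range(step, top + 1, step))


def stratified_subset(labels, size, rng):
    """Draw `size` indices whose class proportions match the full dataset."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    total = len(labels)
    # Largest-remainder allocation (an assumed rounding rule) so that the
    # per-class quotas sum exactly to `size`.
    quota = {c: size * len(idx) // total for c, idx in by_class.items()}
    remainder = {c: size * len(idx) % total for c, idx in by_class.items()}
    leftover = size - sum(quota.values())
    for c in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[c] += 1
    chosen = []
    for c, idx in by_class.items():
        chosen.extend(rng.sample(idx, quota[c]))
    return sorted(chosen)


# Hypothetical dataset: 300 susceptible, 200 low/intermediate, 100 high-level.
labels = ["S"] * 300 + ["I"] * 200 + ["R"] * 100
rng = random.Random(0)
sizes = subset_sizes(len(labels))
# Ten random stratified sets per size, as in the legend.
sets_of_50 = [stratified_subset(labels, 50, rng) for _ in range(10)]
```

Each index list would then be passed to five-fold cross-validation, and the median MSE over the 10 sets would give one point on the curve.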

Figure 4: Regression coefficients for each of the PI, NRTI and NNRTI TSMs determined by different regression methods