Stanford University HIV Drug Resistance Database - A curated public database designed to represent, store, and analyze the divergent forms of data underlying HIV drug resistance.

Release Notes for HIVdb, HIVseq, HIValg

Last updated on July 8, 2015

Table of Contents

  1. Introduction
  2. User Interfaces
  3. Output Options
  4. Drug Resistance Mutations (DRMs) and Sequence Interpretation
  5. Program Updates
  6. HIVseq
  7. HIValg
  8. User-Submitted Algorithms / ASI
  9. Program Code Downloads
  10. Appendices
 
1. Introduction

The presence of HIV-1 drug resistance before starting a new antiretroviral (ARV) drug treatment regimen is an independent predictor of the virological response to that regimen. Several studies have shown that the use of genotypic resistance testing prior to the start of a new treatment regimen increases the likelihood of virological response to that regimen. However, interpreting the results of HIV-1 drug resistance tests is one of the most difficult tasks facing health care providers. First, there are many different drug resistance mutations. Second, these mutations cause varying levels of decreased susceptibility to different ARVs. Third, standard genotypic resistance tests fail to detect drug-resistance mutations that are present at low levels within a patient's virus quasispecies.

The HIVdb program assesses how active an ARV is likely to be against a particular mutant virus compared with the drug's activity against a wildtype virus. When combined with a sound understanding of the principles of antiretroviral therapy, the interpretations and associated comments help health care providers better understand the results of HIV-1 genotypic resistance tests. However, because these systems do not explicitly consider the relative potencies of different ARV drugs and drug combinations or the results of other relevant clinical data such as previous drug-resistance test results, ARV treatment history, plasma HIV-1 RNA levels, CD4 counts, and drug toxicity, they do not have the logical power to instruct clinicians on which ARV drugs should be used when constructing a salvage therapy regimen.

HIV-1 drug resistance is rarely an all-or-none phenomenon. Clinicians treating infected patients usually need the answers to the following two questions: (i) Does the genotype suggest that the patient will respond to a drug in a manner comparable to a patient with a wild-type isolate? (ii) Does the genotype suggest that the patient will obtain any antiviral benefit from the drug? To answer these questions it is necessary to grade the extent of inferred resistance relative to the wild type and to the most resistant isolates (e.g., low-level, intermediate, and high-level).

There are three programs in the HIV Drug Resistance Database which share a common code base: HIVseq, HIVdb, and HIValg. HIVseq accepts user-submitted protease, RT, and integrase sequences, compares them to the consensus subtype B reference sequence, and uses the differences as query parameters for interrogating the HIV Drug Resistance Database (Shafer R, Jung D, & Betts B, Nat Med 2000; Rhee SY et al., AIDS 2006). The query result provides users with the prevalence of protease, RT, and integrase mutations according to subtype and PI, nucleoside RT inhibitor (NRTI), non-nucleoside RT inhibitor (NNRTI), and integrase inhibitor (INI) exposure. This allows users to detect unusual sequence results immediately so that the person doing the sequencing can check the primary sequence output while it is still on the desktop. In addition, unexpected associations between sequences or isolates can be discovered by immediately retrieving data on isolates sharing one or more mutations with the sequence.

HIVdb is an expert system that accepts user-submitted HIV-1 pol sequences and returns inferred levels of resistance to 22 FDA-approved ARV drugs including 8 PIs, 7 NRTIs, 4 NNRTIs, and 3 INSTIs. In the HIVdb system, each HIV-1 drug resistance mutation is assigned a drug penalty score and a comment; the total score for a drug is derived by adding the scores of each mutation associated with resistance to that drug. Using the total drug score, the program reports one of the following levels of inferred drug resistance: susceptible, potential low-level resistance, low-level resistance, intermediate resistance, and high-level resistance.

HIValg is designed for users interested in comparing the results of different algorithms or in evaluating existing and newly developed algorithms. The ability to develop new algorithms that can be run on the HIV Drug Resistance Database depends on the Algorithm Specification Interface (ASI) compiler (Betts BJ & Shafer RW, J Clin Microbiol 2003).

 
2. User Interfaces
Input              Output                   No. Samples  Input Format    Output Format
Mutation List      Mutation classification  1            Text box        HTML
                   Predicted ARV activity                Drop-down menu
                   Mutation comments
                   Mutation penalty scores
DNA Sequence       Mutation classification  1 to 500     Text box        HTML
                   Predicted ARV activity                File upload     Spreadsheet
                   Mutation comments                                     XML
                   Mutation penalty scores
                   Quality control
Sierra Webservice  Mutation classification  1 to 500     User script     XML
                   Predicted ARV activity
                   Mutation comments

Mutation List Interface

The mutation list interface was developed to help HIV care providers who typically do not have the complete DNA sequence of a patient’s virus sample but who instead have an external genotypic resistance report generated by the laboratory used by their clinic. This external report will usually have a list of drug-resistance mutations. The external report may also have a list of mutations defined as differences from the laboratory reference sequence whether or not they have been associated with drug resistance.

Although most external reports contain predictions of drug resistance, many care providers are also interested in the HIVdb predictions and comments. In addition, this interface allows care providers to type in various mutation permutations such as a composite list of mutations present on more than one genotypic test.

To use the Mutation List Form, select mutations using the drop-down boxes or enter the mutations into the textboxes. When using the textboxes, amino acid mutations must be entered in uppercase, whereas insertions and deletions must be entered using lowercase "ins" or "del". If there is a mixture of more than one amino acid at a position, write both amino acids (intervening slashes are optional). Mutations must be separated by either spaces or commas; preceding the amino acid position with the consensus amino acid residue is optional.
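The textbox rules above can be illustrated with a small validator. This is only a sketch of the stated input conventions, not the parser used by HIVdb; the function name, the regular expression, and the tuple layout are assumptions made for illustration.

```python
import re

# One mutation token: optional consensus residue, a position, then either
# lowercase "ins"/"del" or one or more uppercase amino acids (a mixture may
# be written with or without intervening slashes).
MUTATION_RE = re.compile(
    r"^(?P<cons>[A-Z])?(?P<pos>\d+)(?P<aas>ins|del|[A-Z](?:/?[A-Z])*)$"
)


def parse_mutation_list(text):
    """Split a mutation list on spaces or commas and parse each token.

    Returns a list of (consensus_or_None, position, amino_acids) tuples.
    """
    parsed = []
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognized mutation: {token!r}")
        aas = match.group("aas")
        if aas not in ("ins", "del"):
            aas = aas.replace("/", "")  # slashes in mixtures are optional
        parsed.append((match.group("cons"), int(match.group("pos")), aas))
    return parsed
```

For example, `parse_mutation_list("M184V, K103N 215Y/F 69ins")` yields one tuple per mutation, with the mixture at position 215 stored as `"YF"`. Lowercase amino acids such as `m184v` are rejected, mirroring the uppercase requirement.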

When using the drop down menu, choose the amino acid present in the sequence. If the amino acid is not present, then select the asterisk, which will open a text box allowing you to enter an amino acid that is not on the drop-down list.

Sequence Interface

To use the Sequence Analysis Form, paste one or more non-interleaved sequences in fasta format into the textbox or upload a file containing up to 500 non-interleaved fasta sequences. In accordance with the fasta format each sequence should be preceded by a line containing ">" followed by a sequence name and optionally followed by additional descriptors separated by pipes ("|").
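As a rough illustration of the input format described above, the following sketch reads fasta text, splits each header on pipes, and groups the records into batches of at most 500. The function names and the tuple layout are assumptions for illustration; the program's own input handling may differ.

```python
def parse_fasta(text):
    """Return a list of (name, descriptors, sequence) tuples.

    The header line starts with ">"; the first pipe-separated field is the
    sequence name and any remaining fields are kept as descriptors.
    """
    records = []
    name, descriptors, chunks = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, descriptors, "".join(chunks)))
            fields = [field.strip() for field in line[1:].split("|")]
            name, descriptors, chunks = fields[0], fields[1:], []
        elif name is not None:
            # Sequence text is only collected once a header has been seen.
            chunks.append(line.upper())
    if name is not None:
        records.append((name, descriptors, "".join(chunks)))
    return records


def batches(records, size=500):
    """Yield consecutive slices of at most `size` records (500 per upload)."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

A larger collection of sequences could then be submitted one batch at a time rather than in a single upload.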

Web Service

In addition to its HTML interface, HIVdb can be accessed via a Web service called Sierra. Sierra is a computer-to-computer programmatic interface designed for research and clinical labs that typically upload large numbers of sequences and wish to automate and individualize the manner in which data are extracted from HIVdb’s output. Sequences submitted to HIVdb either via the Web interface or Sierra are not stored.

 
3. Output Options

There are three types of output: HTML, spreadsheet, and XML. As indicated in the table above, HTML is the only option for those using the Mutation List form and XML is the only option for those using the Sierra web service. However, those using the Sequence Analysis form can specify HTML, spreadsheet, or XML.

HTML Output

HTML output contains the output for either one sequence or for many sequences printed in the order in which the sequences were submitted. The HTML output for submitted sequences includes the following types of information:

  1. Header: This contains the SeqID, which is the fasta header when one or more sequences are submitted, and a Date field containing the date the program was run.

  2. Summary Data: This lists which residues in PR, RT, and/or IN were present in the submitted sequence; whether the sequence contained amino acid insertions or deletions; and the closest matching subtype of each submitted gene.

  3. Sequence Quality Assessment: This identifies sections of poor sequence quality as indicated by the presence of stop codons, frame shifts, highly ambiguous nucleotides (B, D, H, V, N), or unusual residues.

  4. Mutation Classification: PR mutations are classified into Major DRMs, Accessory DRMs, and mutations that do not receive mutation penalty scores (Other). RT mutations are classified into NRTI DRMs, NNRTI DRMs, and Other. IN mutations are classified into Major DRMs, Accessory DRMs, and Other.

  5. Drug Resistance Interpretation: For PR, drug-resistance interpretations are provided for each of the ritonavir-boosted PIs. For RT, interpretations are provided for five NRTIs and four NNRTIs. For IN, interpretations are provided for three INSTIs.

  6. Comments: Comments are provided for (i) All DRMs with a mutation penalty score, (ii) Unscored mutations that have been associated with drug resistance but are considered to have minimal or no impact on currently used ARVs, and (iii) Highly unusual mutations at known drug-resistance positions that are not established DRMs.

  7. Scoring Table: One table for each ARV class. The columns contain ARVs. One set of rows lists the scored DRMs and another set of rows lists those combinations of DRMs that cause a sequence to receive additional penalty scores. The last row of each table contains the total penalty score for each ARV. Each score is hyperlinked to data supporting the association between the mutation and each ARV through the Mutation ARV Evidence Listing (MARVEL) program.

Spreadsheet Output

This output is useful primarily for researchers submitting a set of sequences, ideally encompassing the same regions in PR, RT, and/or IN. The output contains tab-delimited text. The first row contains the header fields. Each subsequent row contains one sequence. The fields can be organized into the following types of information:

  1. SeqID: The fasta headers of the submitted sequences

  2. HIVdb version:

  3. Gene coverage: The first and last residue of PR, RT, and/or IN

  4. Sequence Quality Assessment: For each gene, the positions of stop codons, highly ambiguous nucleotides (B, D, H, V, N), highly unusual amino acids, and evidence for G-to-A hypermutation are collected in separate columns and within these columns are separated by commas.

  5. Subtype:

  6. Mutation Classification: PR mutations are classified into Major DRMs, Accessory DRMs, and mutations that do not receive mutation penalty scores (Other). RT mutations are classified into NRTI DRMs, NNRTI DRMs, and Other. IN mutations are classified into Major DRMs, Accessory DRMs, and Other. For each gene in a sequence, there are three comma-separated lists of mutations. Columns contain ‘None’ when there are no mutations belonging to the relevant classification.

  7. Drug Resistance Interpretation: For PR, drug-resistance interpretations are provided for each of the ritonavir-boosted PIs. For RT, interpretations are provided for five NRTIs and four NNRTIs. For IN, interpretations are provided for three INSTIs.

  8. Scored DRMs: One column for each scored mutation. 0 indicates no change from consensus, 1 indicates the presence of the DRM whether in pure form or part of a mixture, and a period ‘.’ indicates that the position was not included in the sequence.

  9. Amino Acid Alignment: One column for each position included in at least one sequence. A dash ‘-’ indicates the presence of the consensus amino acid. A single amino acid indicates a pure mutation or difference from consensus B. Two or more amino acids indicate the presence of a mixture. Amino acid insertions are indicated by … Amino acid deletions are indicated by … * indicates stop. An X indicates.
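Because the spreadsheet output is tab-delimited with one header row and one row per sequence, it can be read with standard tools. The sketch below loads such text and computes how often a scored DRM is present among the sequences that cover its position, using the 0 / 1 / period coding described for the Scored DRMs columns. The column names shown are hypothetical placeholders; the actual header fields should be taken from the downloaded file.

```python
import csv
import io


def read_spreadsheet(text):
    """Parse tab-delimited output into a list of dicts keyed by header field."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def drm_prevalence(rows, column):
    """Fraction of sequences carrying a scored DRM.

    "1" means the DRM is present (pure or in a mixture), "0" means no change
    from consensus, and "." means the position was not sequenced, so those
    rows are left out of the denominator.
    """
    covered = [row[column] for row in rows if row[column] != "."]
    if not covered:
        return None
    return sum(value == "1" for value in covered) / len(covered)


# Hypothetical header names and values, for illustration only.
sample = "SeqID\tDRM_X\tDRM_Y\nseq1\t1\t0\nseq2\t0\t0\nseq3\t.\t1\n"
rows = read_spreadsheet(sample)
print(drm_prevalence(rows, "DRM_X"))
```

Excluding rows coded with a period keeps sequences that do not span a position from diluting the estimate.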

XML Output

This output contains the same information provided by the HTML output. The output and its structure are described on the following page:

 
4. Drug Resistance Mutations (DRMs) and Sequence Interpretation

DRM classification

A DRM can be characterized according to the following criteria:

  1. Polymorphism frequency: its prevalence in virus isolates from ART-naïve patients in regions with low levels of transmitted drug resistance (TDR). Polymorphic DRMs may occur in the absence of selective drug pressure. Polymorphic DRMs usually have little effect on ARV susceptibility when they occur without other DRMs.

  2. Treatment prevalence: its prevalence in virus isolates from patients receiving ART compared with its prevalence in virus isolates from ART-naïve patients. Nonpolymorphic DRMs that occur frequently in patients receiving an ARV are usually associated with clinically significant resistance to that ARV.

  3. In vitro phenotype: its contribution to reduced in vitro susceptibility either alone or in combination with other DRMs.

  4. Association with virological failure (VF): its association with a reduced virological response to an ARV in a new treatment regimen.

The HIVDB report groups mutations within each gene into 3 lists:

  • RT
      NRTI: Mutations in this list nearly always have a penalty score for one or more NRTIs. Rarely, this list may contain an unusual amino acid at a position that is associated with NRTI resistance.

      NNRTI: Mutations in this list nearly always have a penalty score for one or more NNRTIs. Rarely, this list may contain an unusual amino acid at a position that is associated with NNRTI resistance.

      Other: Mutations that do not have a penalty score

  • PR
      Major: Nonpolymorphic DRMs that make a major contribution to reduced susceptibility to one or more PIs. These usually have a penalty score of 30 to 60.

      Accessory (formerly Minor): Nonpolymorphic or minimally polymorphic mutations that contribute to reduced susceptibility in combination with major DRMs. Highly unusual and poorly characterized mutations at major drug-resistance positions are also included in this list.

      Other: Mutations that are not associated with drug resistance and do not receive penalty scores. Highly polymorphic mutations that may be weakly associated with drug resistance, but that are primarily accessory, are also placed in this category. It may also include rare nonpolymorphic PI-selected mutations that have not been studied for their effects on drug susceptibility.

  • IN
      Major: Nonpolymorphic DRMs that make a major contribution to reduced susceptibility to one or more INIs. These usually have a penalty score of 30 to 60.

      Accessory (formerly Minor): Nonpolymorphic or minimally polymorphic mutations that contribute to reduced susceptibility in combination with major DRMs. Highly unusual and poorly characterized mutations at major drug-resistance positions are also included in this list.

      Other: Mutations that are not associated with drug resistance and do not receive penalty scores. Highly polymorphic mutations that may be weakly associated with drug resistance, but that are primarily accessory, are also placed in this category. It may also include rare nonpolymorphic INI-selected mutations that have not been studied for their effects on drug susceptibility.

DRM penalty scores and resistance interpretation

  • The estimated level of resistance to a drug is determined by adding up the penalty scores associated with each of the DRMs present in a submitted sequence.

  • Some combinations of DRMs receive penalty scores that are added to the total score for a drug:
      Example:

  • Some DRMs have negative penalty scores for certain drugs

  • Once the total score is calculated the estimated level of resistance can be calculated as follows:
      Susceptible: Total score 0 to 9

      Potential low-level resistance: Total score 10 to 14

      Low-level resistance: Total score 15 to 29

      Intermediate resistance: Total score 30 to 59

      High-level resistance: Total score >= 60

  • The HIVdb output contains a table in which each of the individual and combination scores associated with a drug is listed. Each scored DRM is hyperlinked to a set of entries in HIVDB that support the DRM’s association with reduced susceptibility. This table is important to examine because it contains more information than the five categories listed at the top of the report. It is not uncommon for an isolate to have intermediate resistance to two PIs, with one PI having a score of 31 (close to low-level resistance) and another having a score of 59 (close to high-level resistance). The scores themselves are also links to information in the database supporting the level of the mutation penalty.

  • The most recent scores are available as tab-delimited files or tables sortable by position or drug:

    Tab-Delimited Files:
      scores for PIs
      scores for NRTIs
      scores for NNRTIs
      scores for INIs
      scores for a combination of PI mutations
      scores for a combination of NRTI mutations
      scores for a combination of NNRTI mutations
      scores for a combination of INI mutations

    Sortable Tables:
      scores for PIs
      scores for NRTIs
      scores for NNRTIs
      scores for INIs
  • The complete list of DRMs can be sorted by position and drug penalty score at this page:
  • To display the effect of our scoring in a concrete manner, we run our algorithm through a set of unique mutation patterns derived from HIV-1 RT sequences from >40,000 persons. The NRTI, NNRTI, and PI spreadsheets show the drug resistance levels for 2,081, 1,104, and 2,556 unique patterns of scored mutations using the latest version of our algorithm.
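The scoring rules in this section can be summarized in a few lines of code. The sketch below adds the individual penalty scores, adds any combination scores whose mutations are all present, and maps the total to one of the five levels using the cutoffs listed above. The mutation names and score values are placeholders rather than real HIVdb scores, and treating a negative total as susceptible is an assumption.

```python
# Lower bound of each level, checked from the highest level down.
LEVELS = [
    (60, "High-level resistance"),
    (30, "Intermediate resistance"),
    (15, "Low-level resistance"),
    (10, "Potential low-level resistance"),
]


def total_score(mutations, single_scores, combination_scores):
    """Sum individual DRM scores plus scores for fully present combinations.

    single_scores maps mutation -> score (scores may be negative);
    combination_scores maps a frozenset of mutations -> additional score.
    """
    present = set(mutations)
    total = sum(single_scores.get(mutation, 0) for mutation in present)
    for combination, score in combination_scores.items():
        if combination <= present:
            total += score
    return total


def resistance_level(score):
    """Map a total score to an inferred resistance level."""
    for lower_bound, label in LEVELS:
        if score >= lower_bound:
            return label
    # Totals of 0 to 9 are susceptible; negative totals are assumed to be too.
    return "Susceptible"


# Placeholder mutations and scores for one hypothetical drug.
single = {"MUT_A": 30, "MUT_B": 10, "MUT_C": -10}
combos = {frozenset({"MUT_A", "MUT_B"}): 5}
score = total_score(["MUT_A", "MUT_B"], single, combos)
print(score, resistance_level(score))
```

Keeping the numeric total alongside the label preserves the distinction noted above between, say, a score of 31 and a score of 59, which both fall in the intermediate category.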

Comments

HIVdb output contains 3 types of comments:

  • Comments on ARV resistance mutations that receive mutation penalty scores. These comments are designed to justify the score and to provide additional information about a mutation that may be clinically relevant, depending on the clinical scenario.

  • Comments on mutations that have been potentially associated with reduced ARV susceptibility but which do not have mutation penalty scores because they are either highly polymorphic or have a minimal, if any, effect on drug susceptibility. Comments on these mutations are designed to alert the user to the presence of these mutations while at the same time justifying the absence of mutation penalty scores.

  • Comments on highly unusual mutations at known drug resistance positions.
  • The most recent comments are available as tab-delimited files or web pages:

    Tab-Delimited Files:
      comments for protease inhibitors
      comments for RT inhibitors
      comments for integrase inhibitors

    Web Pages:
      comments for PIs
      comments for NRTIs
      comments for NNRTIs
      comments for INIs

Clinical Context

  • Susceptible: no evidence of reduced ARV susceptibility compared with a wild-type virus (total score of 0–9).

  • Potential low-level resistance: the virus encoded by the submitted sequence is likely to be fully susceptible; however, the sequence contains mutations that may indicate previous ARV exposure (total score of 10–14).

  • Low-level resistance: the virus encoded by the submitted sequence may have reduced in vitro ARV susceptibility and/or patients harboring viruses with the submitted mutations may have a suboptimal virological response to treatment with the ARV (total score of 15–29).

  • Intermediate resistance: a level of ARV resistance greater than low-level resistance but lower than high-level resistance (total score of 30–59). An ARV to which the virus has intermediate resistance should generally be used only if the ARV has a high genetic barrier to resistance (e.g. some ritonavir-boosted PIs) or if few other active drugs are available.

  • High-level resistance: the mutations present in the submitted sequence of the virus are similar to those observed in viruses with the highest levels of in vitro drug resistance (total score ≥60). Alternatively, clinical data exist demonstrating that patients infected with viruses having such mutations usually have little or no virological response to treatment with the ARV.

As noted in the Introduction, the purpose of this program is to assess how active an ARV is likely to be against a particular mutant virus compared with its activity against wild-type virus. The program does little else to help a health care provider choose therapy. First, it is often wiser to use a highly potent drug assigned intermediate resistance than to use a less potent drug assigned low-level resistance. Second, some drugs such as 3TC and FTC continue to provide some degree of virological benefit even in the presence of high-level resistance, possibly because the mutations usually responsible for resistance, M184V/I, increase HIV-1 susceptibility to other NRTIs and are associated with decreased virus replication. Although a program that could select the appropriate treatment regimen for a patient would be desirable, no such program exists, making it necessary for all health care providers to have a sound understanding of the principles of antiretroviral therapy (http://aidsinfo.nih.gov/Guidelines/Default.aspx?MenuItem=Guidelines).
 
5. Program Updates

The scoring tables, comments, and programs are frequently updated; these updates are tracked in the Updates page. Below is a listing of our current and previous versions linking to the specific improvements since January 2003.

 
6. HIVseq

HIVseq allows users to examine new sequences in the context of previously published sequence data on RT, protease, and integrase (Shafer R, Jung D, and Betts B, Nature Med 2000; Rhee et al AIDS 2006). Like HIVDB, HIVseq can accept either mutations or complete sequences and produces an assessment of quality control.

HIVseq overview:
HIVseq accepts user-submitted RT, protease, and integrase sequences, compares them to a reference sequence (subtype B consensus), and uses the differences to query the database. The program output includes (i) a list of mutations defined as differences from the consensus B amino acid sequence, (ii) the frequency with which each mutation occurs in treated and untreated persons infected with viruses belonging to the eight most common subtypes (A, B, C, D, F, G, CRF01_AE, CRF02_AG), and (iii) hyperlinks to a table containing each report of those mutations associated with a particular treatment status and subtype (the mutation itself is a hyperlink).
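The first step of this process, listing mutations as differences from the consensus B amino acid sequence, can be sketched as follows. The sketch assumes the submitted amino acids are already translated and aligned to the consensus without insertions or deletions, which the real program handles itself; the function name and arguments are illustrative only. The protease consensus is the one given in Appendix 1.

```python
# Subtype B consensus protease (99 amino acids), as listed in Appendix 1.
CONSENSUS_B_PR = (
    "PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI"
    "GGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF"
)


def list_mutations(aligned_aas, consensus=CONSENSUS_B_PR, first_position=1):
    """List differences from consensus for a gap-free, pre-aligned sequence.

    aligned_aas holds one amino acid per position, starting at
    first_position (1-based) of the consensus.
    """
    mutations = []
    for offset, observed in enumerate(aligned_aas):
        position = first_position + offset
        expected = consensus[position - 1]
        if observed != expected:
            mutations.append(f"{expected}{position}{observed}")
    return mutations


# A sequence identical to consensus except for methionine at position 90.
variant = CONSENSUS_B_PR[:89] + "M" + CONSENSUS_B_PR[90:]
print(list_mutations(variant))
```

Each mutation in the resulting list would then serve as a query term for retrieving prevalence data by subtype and treatment status.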

Detailed description of the tabular output of HIVseq:
For RT sequences, the program provides an NRTI table containing mutation frequency for isolates from NRTI-naive and NRTI-treated persons, and an NNRTI table containing mutation frequency for isolates from NNRTI-naive and NNRTI-treated persons. For protease sequences, the program provides a protease table containing mutation frequency for HIV-1 isolates from PI-naive and PI-experienced persons. For integrase sequences, the program provides an integrase table containing mutation frequency for HIV-1 isolates from INI-naive and INI-experienced persons.

Each table contains one row for each mutation and 20 columns. Columns 1 to 4 list the position, the position's consensus amino acid, the submitted nucleotide triplet, and the submitted mutation. Columns 5 to 12 list the frequency of each mutation in subtypes A, B, C, D, F, G, CRF01_AE, and CRF02_AG in drug-class-naive persons. Columns 13 to 20 list the frequency of each mutation in subtypes A, B, C, D, F, G, CRF01_AE, and CRF02_AG in drug-class-experienced persons. Each mutation is also a hyperlink to a separate web page with detailed information on each isolate, including literature references with Medline abstracts, the GenBank accession number, and complete sequence and treatment records.

Note: To minimize reporting bias, the mutation frequency tables contain one sequence per individual. For individuals in whom sequences from multiple isolates were published, the mutation tables include the earliest sequence from untreated persons and the latest sequence (while on therapy) from persons receiving antiretroviral therapy. To exclude technical sequencing errors and cases of circulating virus containing unusual variants, the mutation tables include only mutations present as the predominant form whenever multiple clones from the same isolate were sequenced. Sequences of poor quality and those considered to be possible laboratory contaminants are excluded from the data sets.
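The one-sequence-per-individual rule in this note can be sketched as a simple selection step. The record layout (keys `person`, `date`, `treated`) is hypothetical, and the sketch assumes that a person with both untreated and on-therapy isolates contributes one sequence to each group; it does not cover the clone and quality filters also mentioned in the note.

```python
from datetime import date


def select_representatives(isolates):
    """Keep one isolate per person and treatment status.

    For untreated isolates the earliest is kept; for isolates obtained on
    therapy the latest is kept.
    """
    chosen = {}
    for isolate in isolates:
        key = (isolate["person"], isolate["treated"])
        current = chosen.get(key)
        if current is None:
            chosen[key] = isolate
        elif isolate["treated"] and isolate["date"] > current["date"]:
            chosen[key] = isolate
        elif not isolate["treated"] and isolate["date"] < current["date"]:
            chosen[key] = isolate
    return list(chosen.values())


# Hypothetical records for illustration.
example = [
    {"person": "P1", "date": date(2001, 1, 1), "treated": False},
    {"person": "P1", "date": date(2000, 1, 1), "treated": False},
    {"person": "P2", "date": date(2003, 1, 1), "treated": True},
    {"person": "P2", "date": date(2005, 1, 1), "treated": True},
]
for isolate in select_representatives(example):
    print(isolate["person"], isolate["date"], isolate["treated"])
```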

The following table provides a summary of the number of persons used for the HIVseq output.

 
7. HIValg

7.1 Objectives

The objectives of this program are to 1) identify the extent of agreement between three commonly used genotypic drug resistance interpretation systems; and 2) to identify sequences responsible for disagreements between these systems. It is important to note that two of the three algorithms have been simplified from a five-to-six level output (Rega) or a five level output (HIVdb) to a three level output so that all three algorithms can be roughly compared. It is also important to note that discrepancies of one level (e.g. susceptible vs low/intermediate resistance or low/intermediate resistance vs high-level resistance) can frequently occur by chance if the level of resistance is on the borderline between two levels. Only discrepancies between fully susceptible and high-level resistance should be examined closely.

7.2 Algorithms

The following algorithms are available online in their XML form in the "Algorithm Specification Interface page". They are all encoded using the ASI format, which is also described in the same page.

  • ANRS: Agence Nationale de Recherches sur le SIDA 4,5.
  • HIVDB: The current version of the drug-resistance interpretation program on this site is referred to as the "HIVdb" algorithm.
  • Rega Institute: Courtesy of Professor Anne-Mieke Vandamme 7.

Each of the algorithms reports their results differently. The table below shows how the results of the algorithm are normalized for comparison by the program. Users of HIValg can select whether they prefer to receive output with the original interpretation or with the normalized interpretation ('SIR' option).

Algorithm       S                               I                                R
ANRS            Susceptible                     Possible resistance              Resistance
HIVDB           Susceptible                     Low-level resistance             High-level resistance
                Potential low-level resistance  Intermediate resistance
Rega Institute  Susceptible GSS 1               Intermediate Resistant GSS 0.75  Resistant GSS 0
                Susceptible GSS 1.5             Intermediate Resistant GSS 0.5
                                                Intermediate Resistant GSS 0.25
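The comparison described in this section can be sketched in two steps: reduce each algorithm's result to S, I, or R, then flag drugs for which one algorithm reports S and another reports R, since only those discrepancies warrant close examination. The grouping of the five HIVdb levels shown here (susceptible and potential low-level as S, low-level and intermediate as I, high-level as R) is an assumption based on the description in section 7.1, and the drug names and calls are placeholders.

```python
# Assumed reduction of the five HIVdb levels to the three SIR levels.
HIVDB_TO_SIR = {
    "Susceptible": "S",
    "Potential low-level resistance": "S",
    "Low-level resistance": "I",
    "Intermediate resistance": "I",
    "High-level resistance": "R",
}


def major_discrepancies(calls):
    """Return drugs for which the algorithms disagree by two levels.

    calls maps drug -> {algorithm name -> "S", "I", or "R"}.
    """
    flagged = []
    for drug, by_algorithm in calls.items():
        levels = set(by_algorithm.values())
        if {"S", "R"} <= levels:
            flagged.append(drug)
    return flagged


# Placeholder drugs and results for one hypothetical sequence.
calls = {
    "DRUG_1": {
        "ANRS": "S",
        "HIVDB": HIVDB_TO_SIR["High-level resistance"],
        "Rega": "I",
    },
    "DRUG_2": {
        "ANRS": "I",
        "HIVDB": HIVDB_TO_SIR["Low-level resistance"],
        "Rega": "R",
    },
}
print(major_discrepancies(calls))
```

One-level differences such as S versus I are deliberately ignored, because they often arise when a result sits on the borderline between two levels.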

 
8. User-Submitted Algorithms / ASI

Selecting which algorithms appear in the output report can be done in two different ways. The first technique is to select from the list of algorithms made available on our servers. The second technique allows you to upload an algorithm from your machine, assuming that the algorithm is in proper ASI format as described in the Algorithm Specification Interface page (Betts BJ & Shafer RW J Clin Microbiol 2003). These techniques can be used in combination.

 
9. Program Code Downloads
 
10. Appendices

Appendix 1. Consensus B Sequences

The subtype B consensus sequence is derived from an alignment of subtype B sequences maintained at the Los Alamos HIV Sequence Database (hiv-web.lanl.gov). The consensus B sequence is therefore a commonly used reference sequence to which new sequences are compared. Files containing the consensus PR, consensus RT, and consensus IN are also available.

Consensus B Sequences (Amino Acids)
Protease PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
RT PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
GPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGL
KKKKSVTVLDVGDAYFSVPLDKDFRKYTAFTIPSINNETPGIRYQYNVLP
QGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRT
KIEELRQHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKD
SWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRGTKALTEVIPLTEEAE
LELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLK
TGKYARMRGAHTNDVKQLTEAVQKIATESIVIWGKTPKFKLPIQKETWEA
WWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRET
KLGKAGYVTDRGRQKVVSLTDTTNQKTELQAIHLALQDSGLEVNIVTDSQ
YALGIIQAQPDKSESELVSQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDK
LVSAGIRKVL
integrase FLDGIDKAQEEHEKYHSNWRAMASDFNLPPVVAKEIVASCDKCQLKGEAM
HGQVDCSPGIWQLDCTHLEGKIILVAVHVASGYIEAEVIPAETGQETAYF
LLKLAGRWPVKTIHTDNGSNFTSTTVKAACWWAGIKQEFGIPYNPQSQGV
VESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERI
VDIIATDIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVV
IQDNSDIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED


Appendix 2. Sample Data Sets

A small data set (N=10) has been compiled to provide users with a sample input for running our programs. To view the results for these sequences, copy and paste them into the input form.

A large data set (N=2055) is also available. We ask users to restrict the number of sequences they process at a time using our programs to 100, so this data set cannot be directly submitted to our programs.

A very large data set (N=5838) is available. Again, we ask users to restrict the number of sequences they process at a time using our programs to 100, so this data set cannot be directly submitted to our programs.