Closely related Drosophila September 2016

BP2016000001

We used the D. melanogaster genome annotation (release 6.12) and the genome assemblies of 11 closely related Drosophila species, namely: D. simulans (release 1.4), D. sechellia (release 1.3), D. erecta (release 1.3), D. yakuba (release 1.3), D. suzukii (release 1), D. biarmipes (release 1), D. takahashii (release 1), D. eugracilis (release 1), D. ficusphila (release 1), D. elegans (release 1) and D. rhopaloa (release 1). All files were downloaded from GenBank. First, we transferred the annotation of all D. melanogaster coding sequences (CDS) to the other 11 species, using Compart and Splign, as implemented in BDBM (submitted, link for download). Only fully annotated CDS with no in-frame stop codons and with at most 10% length variation relative to the D. melanogaster reference sequence were kept. Rapidly evolving genes, as well as genes encoding proteins with a variable number of repeats, are likely not annotated this way. It is well known, however, that reliable alignments are difficult to obtain for genes with such features. Moreover, it is well established that the use of highly divergent sequences can lead to an underestimation of the rate of synonymous divergence, which in turn inflates the ratio of the rate of non-synonymous divergence to the rate of synonymous divergence and leads to the identification of false positively selected amino acid sites. This is also one of the reasons why only closely related Drosophila species were used in this project.
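
The retention criteria described above can be sketched in a few lines of Python. This is only an illustration and not the BDBM implementation: the function names are hypothetical, and the check that the length is a multiple of three is an assumption standing in for "fully annotated".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop(cds: str) -> bool:
    """Return True if a stop codon occurs before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last codon is allowed to be the terminal stop codon.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def keep_cds(candidate: str, reference: str,
             max_length_variation: float = 0.10) -> bool:
    """Apply the two retention criteria relative to the reference CDS."""
    # Assumption: a complete CDS has a length that is a multiple of three.
    if not candidate or len(candidate) % 3 != 0:
        return False
    length_variation = abs(len(candidate) - len(reference)) / len(reference)
    if length_variation > max_length_variation:
        return False
    return not has_inframe_stop(candidate)
```

For example, `keep_cds(transferred_cds, melanogaster_cds)` would return False for a transferred CDS that is 40% shorter than the reference, even if it contains no in-frame stop codon.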

Using ClustalW2, MrBayes and codeml, as implemented in ADOPS (Reboiro-Jato et al., 2012; link for download), we then attempted to identify positively selected amino acid positions in all datasets with four or more sequences. The resulting ADOPS project folders are provided here, so all details of the analyses can be inspected using ADOPS. Moreover, using ADOPS and the provided project folders, researchers can run additional analyses on the data, such as assessing the impact of including a given sequence or of using a different alignment algorithm.

HIV1 - ASP dataset

BP2017000005

The existence of an HIV-1 protein translated from an antisense transcript (ASP) has not been completely accepted by the HIV-1 research community, although recent findings support its existence. Therefore, we looked for evidence of positively selected amino acid sites at ASP that could represent new potential targets for anti-retroviral therapies and vaccine strategies. One hundred sequences were retrieved from https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html and the alignment was undone. Sequences with ambiguity codes were then removed, as were stop codons. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination, since recombination could affect the results by creating false positively selected amino acid sites. The phi test did not find statistically significant evidence for recombination (P=0.999). Sequences were aligned using ClustalW2. Eight positively selected amino acid sites were found.
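
The clean-up steps described above (undoing the alignment, removing sequences with ambiguity codes, and removing stop codons) might look roughly like the following Python sketch. The function names are hypothetical. Treating "-" as the only gap character, and reading "stop codons were removed" as trimming a terminal stop codon, are assumptions made for illustration.

```python
UNAMBIGUOUS = set("ACGT")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def undo_alignment(aligned: dict) -> dict:
    """Strip gap characters so the sequences can be realigned later."""
    # Assumption: gaps are written as "-".
    return {name: seq.replace("-", "").upper() for name, seq in aligned.items()}


def strip_terminal_stop(seq: str) -> str:
    """Remove the final codon when it is a stop codon (an assumption)."""
    return seq[:-3] if seq[-3:] in STOP_CODONS else seq


def clean(aligned: dict) -> dict:
    """Drop sequences with ambiguity codes and trim terminal stop codons."""
    cleaned = {}
    for name, seq in undo_alignment(aligned).items():
        if set(seq) <= UNAMBIGUOUS:  # only A, C, G and T are present
            cleaned[name] = strip_terminal_stop(seq)
    return cleaned
```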

Petunia SLF intra-haplotype positive selection

BP2017000006

S-RNase based gametophytic self-incompatibility (GSI) is a genetic mechanism common in flowering plants that prevents self-fertilization, and thus inbreeding depression. A single S-locus, containing separate pistil (the S-RNase gene) and pollen (F-box gene(s)) components, determines specificity differences. In Solanaceae, multiple S-pollen genes, called SLFs, determine S-pollen specificity. In this system each S-protein recognizes and interacts with a subset of non-self S-RNases, to mediate their degradation. Positively selected amino acid sites are expected at those amino acid positions that are involved in specificity determination. Petunia SLF gene sequences for 13 S-haplotypes were downloaded from NCBI (see the "View Names File" tab for each project to see the accession numbers). We used Muscle, MrBayes and codeml, as implemented in ADOPS (Reboiro-Jato et al., 2012), to run the analyses. The parameters used are shown in the "View Properties" tab, while all details can be viewed in the log files.

Zika Virus

BP2017000007

The Zika virus belongs to the Flaviviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). To try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 11 mature proteins, all available nucleotide sequences for each gene were downloaded from the VIPR database, namely 485 for each of the genes. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 53, 82, 89, 214, 179, 156, 94, 191, 81, 150, and 247 sequences for genes C, pr, M, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 15, 22, 19, 57, 69, 48, 14, 52, 15, 45, and 89 sequences for genes C, pr, M, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, since recombination could affect the results by creating false positively selected amino acid sites when using codeML. Using the first protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes pr, M, NS1, NS4A, NS4B and NS5, and did find evidence for recombination (P<0.05) for genes C, E, NS2A, NS2B and NS3. Using the second protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes pr, M, NS2B, NS4A, NS4B and NS5, and did find evidence for recombination (P<0.05) for genes C, E, NS1, NS2A and NS3.
When more than 100 sequences were available for a given gene and no evidence for recombination was found, five datasets with 100 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene and evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
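
The random subsampling step described above can be illustrated with a short Python sketch. It is not the script used to build these projects. The function name, the fixed seed, the independent draws (so replicate datasets may share sequences), and the handling of genes at or below the threshold (returned as a single full dataset) are all assumptions.

```python
import random


def replicate_subsamples(names, sample_size, replicates=5,
                         threshold=100, seed=0):
    """Draw replicate datasets of randomly selected sequence names.

    Subsampling applies only when more than `threshold` sequences exist.
    """
    names = list(names)
    if len(names) <= threshold:
        # Assumption: small datasets are analyzed whole, as one dataset.
        return [names]
    rng = random.Random(seed)  # fixed seed keeps the draws reproducible
    # Each replicate is drawn without replacement, independently of the others.
    return [sorted(rng.sample(names, sample_size)) for _ in range(replicates)]
```

With the numbers given above, a gene with 214 sequences and no evidence for recombination would yield `replicate_subsamples(names, 100)`, while a gene with evidence for recombination would use `replicate_subsamples(names, 50)`.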

Ebolavirus

BP2017000008

The Ebolavirus belongs to the Filoviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). To try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 7 mature proteins and one small secreted glycoprotein (sGP), all available nucleotide sequences for each gene were downloaded from the VIPR database, namely 434, 434, 434, 186, 610, 434, 436 and 436 for genes N, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 95, 61, 51, 37, 76, 52, 53 and 169 sequences for genes N, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 51, 32, 25, 23, 37, 25, 24 and 86 sequences for genes N, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, since recombination could affect the results by creating false positively selected amino acid sites when using codeML. The phi test did not find statistically significant evidence for recombination (P>0.05) for any gene in either protocol. When more than 100 sequences were available for a given gene, five datasets with 100 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS.
Positively selected amino acid sites can be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
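
The two filtering protocols (N prefix and A prefix) can be sketched as follows. This is an illustrative Python sketch, not the code used to prepare the datasets. The function names are hypothetical. Keeping the first occurrence of each set of identical sequences, and treating any stop codon before the last codon as in-frame, are assumptions. The standard genetic code is used for translation.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate an unambiguous, upper-case nucleotide sequence."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def is_usable(seq: str) -> bool:
    """Reject ambiguous nucleotides and in-frame stop codons."""
    if set(seq) - set("ACGT"):
        return False
    return "*" not in translate(seq)[:-1]


def filter_sequences(records: dict, protocol: str) -> dict:
    """Protocol 'N' removes identical nucleotide sequences,
    protocol 'A' removes identical amino acid sequences."""
    kept, seen = {}, set()
    for name, seq in records.items():
        seq = seq.upper()
        if not is_usable(seq):
            continue
        key = seq if protocol == "N" else translate(seq)
        if key not in seen:  # assumption: the first occurrence is kept
            seen.add(key)
            kept[name] = seq
    return kept
```

Because synonymous variants translate to the same protein, the A protocol always retains at most as many sequences as the N protocol, which matches the smaller counts reported above for the second protocol.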

Enterovirus

BP2017000009

The Enterovirus belongs to the Picornaviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). To try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 11 mature proteins, all available nucleotide sequences for each gene were downloaded from the VIPR database, namely 1337, 1337, 1337, 1337, 2587, 2587, 2584, 2587, 1102, 2588 and 8 for genes VP4, VP2, VP3, VP1, 2A, 2B, 2C, 3A, VPg, 3Cpro and RdRp, respectively. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 686, 941, 912, 1004, 1895, 1708, 2064, 1639, 581, 1913 and 8 sequences for genes VP4, VP2, VP3, VP1, 2A, 2B, 2C, 3A, VPg, 3Cpro and RdRp, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 194, 471, 489, 654, 1028, 705, 1074, 730, 165, 929 and 8 sequences for genes VP4, VP2, VP3, VP1, 2A, 2B, 2C, 3A, VPg, 3Cpro and RdRp, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, since recombination could affect the results by creating false positively selected amino acid sites when using codeML. In both protocols, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes VP2, VP3, VP1, 2A, 2B, 2C, 3A, 3Cpro and RdRp, and did find evidence for recombination (P<0.05) for gene VP4. Also in both cases, the VPg gene sequence had too few informative characters and the phi test could not be used.
Even so, this gene was treated as if evidence for recombination (P<0.05) had been found. When more than 100 sequences were available for a given gene and no evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene and either evidence for recombination was found or the gene had too few informative characters, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
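
The choice between codeML and OmegaMap described above amounts to a simple decision rule, sketched here in Python. The function name is hypothetical, a missing P value (None) stands for a gene with too few informative characters for the phi test, and treating P exactly equal to 0.05 as "no evidence" is an assumption, since the text only states P<0.05 and P>0.05.

```python
from typing import Optional


def choose_method(phi_p_value: Optional[float], alpha: float = 0.05) -> str:
    """Pick the positive selection method from the phi test outcome.

    None means the phi test could not be run (too few informative
    characters); such genes are handled like recombining genes.
    """
    if phi_p_value is None or phi_p_value < alpha:
        return "OmegaMap"
    return "codeML"
```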

Hepatitis C virus

BP2017000010

The Hepatitis C virus belongs to the Flaviviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). To try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 10 mature proteins, all available nucleotide sequences for each gene were downloaded from the VIPR database, namely 2182, 2182, 2182, 2181, 2207, 2220, 2220, 2220, 2219 and 2220 for genes C, E1, E2, p7, NS2, NS3, NS4A, NS4B, NS5A and NS5B, respectively. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 1760, 1778, 1738, 1840, 1807, 1714, 1814, 1785, 1748 and 1683 sequences for genes C, E1, E2, p7, NS2, NS3, NS4A, NS4B, NS5A and NS5B, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 974, 1732, 1713, 1246, 1754, 1676, 559, 1589, 1703 and 1643 sequences for genes C, E1, E2, p7, NS2, NS3, NS4A, NS4B, NS5A and NS5B, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, since recombination could affect the results by creating false positively selected amino acid sites when using codeML. Using the first protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, E1, E2, p7, NS3, NS4B, NS5A and NS5B, and did find evidence for recombination (P<0.05) for gene NS2.
Using the second protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, E1, E2, NS3, NS4B, NS5A and NS5B, and did find evidence for recombination (P<0.05) for gene NS2. In both protocols, the NS4A gene sequence had too few informative characters and the phi test could not be used. Also, in the second protocol only, the p7 gene sequence had too few informative characters to be used for the phi test. Even so, these genes were treated as if evidence for recombination (P<0.05) had been found. When more than 100 sequences were available for a given gene and no evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene and either evidence for recombination was found or the gene had too few informative characters, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.

Malus SFBB

BP2017000011

Self-incompatibility (SI) is a major genetic barrier to self-fertilisation in which the female reproductive cells discriminate between genetically related and non-related pollen, and reject the former. In the gametophytic SI (GSI) system, the pollen is rejected when it expresses a specificity that matches either of those expressed in the style. Because of frequency dependent selection, many specificities are maintained in natural populations. In Pyrinae (Malus, Pyrus, and Sorbus) the S-pistil gene product is an extracellular ribonuclease, called S-RNase, and the S-pollen specificity is determined by multiple F-box genes, called SFBBs (S-locus F-box brothers). In this system the S-pollen proteins have higher affinity for non-self S-RNases, and the binding of these proteins implies that the S-RNase is ubiquitylated, and marked for degradation by the proteasome. The collaborative non-self recognition model is also the mechanism proposed for Solanaceae. In the non-self recognition systems, levels of polymorphism at the S-pollen genes are low, but levels of intra-haplotype divergence are similar to the S-RNase diversity. The amino acids under positive selection at the S-pollen genes, those involved in S-pollen specificity determination, are here identified in Malus x domestica by performing intra-haplotypic analyses in 10 S-haplotypes, as described in Pratas et al. 2017. Briefly, using the ADOPS pipeline, the sequences were first aligned with the ClustalW2 and Muscle alignment algorithms, Bayesian trees were obtained using MrBayes, and sites under positive selection were identified using codeML.

Dengue virus

BP2017000012

The Dengue virus belongs to the Flaviviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). To try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 11 mature proteins, all available nucleotide sequences for each gene were downloaded from the VIPR database, namely 4865, 4868, 4864, 4864, 4864, 4864, 4864, 4864, 4864 and 4863 for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. Note that the coding nucleotide sequences for the individual pr and M proteins were not available. To overcome this problem, the coding nucleotide sequences for the uncleaved prM protein were downloaded for further analysis. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 1262, 2111, 3482, 3058, 2663, 1887, 3564, 1984, 2565 and 3820 sequences for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 304, 506, 1289, 1103, 973, 339, 1050, 377, 535 and 1803 sequences for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, since recombination could affect the results by creating false positively selected amino acid sites when using codeML.
Using the first protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A and NS4B, and did find evidence for recombination (P<0.05) for gene NS5. Using the second protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, prM, NS1, NS2A, NS2B, NS3, NS4A and NS4B, and did find evidence for recombination (P<0.05) for genes E and NS5. When more than 100 sequences were available for a given gene and no evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene and either evidence for recombination was found or the gene had too few informative characters, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.