Closely related Drosophila September 2016

BP2016000001

We used the D. melanogaster genome annotation (release 6.12) and the genome assemblies of 11 closely related Drosophila species, namely: D. simulans (release 1.4), D. sechellia (release 1.3), D. erecta (release 1.3), D. yakuba (release 1.3), D. suzukii (release 1), D. biarmipes (release 1), D. takahashii (release 1), D. eugracilis (release 1), D. ficusphila (release 1), D. elegans (release 1) and D. rhopaloa (release 1). All files were downloaded from GenBank. First, we transferred the annotation of all D. melanogaster coding sequences (CDS) to the other 11 species, using Compart and Splign, as implemented in BDBM (submitted, link for download). Only fully annotated CDS with no in-frame stop codons and with a maximum of 10% length variation relative to the D. melanogaster reference sequence were kept. It should be noted that rapidly evolving genes, as well as genes encoding proteins with a variable number of repeats, are unlikely to be annotated this way. It is well known, however, that it is difficult to obtain a reliable alignment for genes with such features. Moreover, it is well established that the use of highly divergent sequences can lead to an underestimation of the rate of synonymous divergence, which in turn inflates the ratio of the non-synonymous to the synonymous divergence rate and leads to the identification of false positively selected amino acid sites. This is also one of the reasons why only closely related Drosophila species were used in this project.
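The two CDS filters described above (no in-frame stop codons, at most 10% length variation relative to the reference) can be illustrated with a minimal Python sketch. This is not the BDBM implementation: the function names are hypothetical, and the use of the standard stop codons and the decision to ignore a stop in the final codon are assumptions made for illustration.

```python
# Stop codons of the standard genetic code (assumed to apply here).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(cds: str) -> bool:
    """Return True if a stop codon occurs in frame before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last codon is skipped: a terminal stop is expected in a complete CDS.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def keep_cds(candidate: str, reference: str, max_variation: float = 0.10) -> bool:
    """Apply both filters to a transferred CDS (hypothetical helper)."""
    if has_in_frame_stop(candidate):
        return False
    variation = abs(len(candidate) - len(reference)) / len(reference)
    return variation <= max_variation
```

For example, `keep_cds(transferred_cds, melanogaster_cds)` would return False for a transferred sequence that is 20% shorter than its D. melanogaster reference.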

Using ClustalW2, MrBayes and codeml, as implemented in ADOPS (Reboiro-Jato et al., 2012; link for download), we then attempted to identify positively selected amino acid positions in all datasets with four or more sequences. The resulting ADOPS project folders are provided here, so all details of the analyses performed can be inspected using ADOPS. Moreover, using ADOPS and the provided project folders, researchers can run additional analyses on the data, such as assessing the impact of including a given sequence or of using a different alignment algorithm.

HIV1 - ASP dataset

BP2017000005

The existence of an HIV-1 protein translated from an antisense transcript (ASP) has not been completely accepted by the HIV-1 research community, although recent findings support its existence. Therefore, we looked for evidence of positively selected amino acid sites at ASP that could represent new potential targets for anti-retroviral therapies and vaccine strategies. A total of 100 sequences were retrieved from https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html and the alignment was undone. Then, sequences with ambiguity codes were removed, as were stop codons. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination that could affect the results by creating false positively selected amino acid sites. The phi test did not find statistically significant evidence for recombination (P=0.999). Sequences were aligned using ClustalW2. Eight positively selected amino acid sites were found.
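As an illustration of the first two preparation steps (undoing the alignment and discarding sequences with ambiguity codes), a minimal Python sketch follows. The function names and the assumption that gaps are written as "-" or "." are hypothetical; stop codon removal is not covered, and this is not the code used to build the dataset.

```python
def unalign(aligned: dict) -> dict:
    """Strip gap characters so the sequences can be realigned later."""
    return {
        name: seq.replace("-", "").replace(".", "").upper()
        for name, seq in aligned.items()
    }


def drop_ambiguous(seqs: dict) -> dict:
    """Keep only sequences made up exclusively of A, C, G and T."""
    return {name: seq for name, seq in seqs.items() if set(seq) <= set("ACGT")}
```

Calling `drop_ambiguous(unalign(records))` on a name-to-sequence dictionary returns the ungapped sequences that contain no ambiguity codes.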

Petunia SLF intra-haplotype positive selection

BP2017000006

S-RNase based gametophytic self-incompatibility (GSI) is a genetic mechanism common in flowering plants that prevents self-fertilization, and thus inbreeding depression. A single S-locus, containing separate pistil (the S-RNase gene) and pollen (F-box gene(s)) components, determines specificity differences. In Solanaceae, multiple S-pollen genes, called SLFs, determine S-pollen specificity. In this system each S-protein recognizes and interacts with a sub-set of non-self S-RNases, to mediate their degradation. Positively selected amino acid sites are expected to be observed at those amino acid positions that are involved in specificity determination. Petunia SLF gene sequences for 13 S-haplotypes were downloaded from NCBI (see the "View Names File" tab for each project to see the accession numbers). We used Muscle, MrBayes and codeml, as implemented in ADOPS (Reboiro-Jato et al., 2012), to run the analyses. The parameters used are shown in the "View Properties" tab, while all details can be viewed in the log files.

Zika Virus

BP2017000007

The Zika virus belongs to the Flaviviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). In order to try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 11 mature proteins, for each gene, all available nucleotide sequences were downloaded from the VIPR database, namely 485 for all of the genes. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 53, 82, 89, 214, 179, 156, 94, 191, 81, 150, and 247 sequences for genes C, pr, M, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 15, 22, 19, 57, 69, 48, 14, 52, 15, 45, and 89 sequences for genes C, pr, M, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, which could affect the results by creating false positively selected amino acid sites when using codeML. Using the first protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes pr, M, NS1, NS4A, NS4B and NS5, and did find evidence for recombination (P<0.05) for genes C, E, NS2A, NS2B and NS3. Using the second protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes pr, M, NS2B, NS4A, NS4B and NS5, and did find evidence for recombination (P<0.05) for genes C, E, NS1, NS2A and NS3. 
When more than 100 sequences were available for a given gene, and no evidence for recombination was found, five datasets with 100 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene, and evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
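The difference between the two filtering protocols (N prefix: identical nucleotide sequences removed; A prefix: identical amino acid sequences removed) can be sketched in Python as follows. This is an illustration only, not the code used to build these datasets: the function names are hypothetical, the standard genetic code is assumed, and a stop in the final codon is assumed to be acceptable.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order (assumed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an unambiguous coding sequence, codon by codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def is_clean(cds: str) -> bool:
    """No ambiguous nucleotides and no stop codon before the final codon."""
    if set(cds) - set("ACGT"):
        return False
    return "*" not in translate(cds)[:-1]


def filter_unique(seqs: dict, by_protein: bool = False) -> dict:
    """N protocol (by_protein=False) or A protocol (by_protein=True)."""
    kept, seen = {}, set()
    for name, cds in seqs.items():
        cds = cds.upper()
        if not is_clean(cds):
            continue
        # Duplicates are judged on the nucleotide or on the protein sequence.
        key = translate(cds) if by_protein else cds
        if key not in seen:
            seen.add(key)
            kept[name] = cds
    return kept
```

Because synonymous variants collapse to a single protein sequence, the A protocol always keeps the same number of sequences as the N protocol or fewer, which matches the counts reported above.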

Ebolavirus

BP2017000008

The Ebolavirus belongs to the Filoviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). In order to try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 7 mature proteins and one small secreted glycoprotein (sGP), for each gene, all available nucleotide sequences were downloaded from the VIPR database, namely 434, 434, 434, 186, 610, 434, 436 and 436 for genes NP, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 95, 61, 51, 37, 76, 52, 53 and 169 sequences for genes NP, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 51, 32, 25, 23, 37, 25, 24 and 86 sequences for genes NP, vp35, vp40, GP, sGP, vp30, vp24 and L, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, which could affect the results by creating false positively selected amino acid sites when using codeML. The phi test did not find statistically significant evidence for recombination (P>0.05) for any gene in either protocol. When more than 100 sequences were available for a given gene, five datasets with 100 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. 
Positively selected amino acid sites can be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
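The random subsampling step described above (five datasets of 100 sequences when more than 100 are available) could look like the following Python sketch. The function name, the seed parameter, and the choice to return a small dataset whole are assumptions; the original sampling procedure is not documented here.

```python
import random


def random_subsets(names, n_subsets=5, size=100, seed=None):
    """Draw independent random subsets of sequence names.

    Sampling is without replacement within a subset; different subsets
    may share sequences.
    """
    names = list(names)
    if len(names) <= size:
        # Assumption: datasets that are already small enough are used whole.
        return [names]
    rng = random.Random(seed)
    return [rng.sample(names, size) for _ in range(n_subsets)]
```

With the 169 L gene sequences of the first protocol, `random_subsets(l_gene_names, seed=1)` would return five lists of 100 names each; fixing the seed only makes the draw repeatable.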

Malus SFBB

BP2017000011

Self-incompatibility (SI) is a major genetic barrier to self-fertilisation in which the female reproductive cells discriminate between genetically related and non-related pollen, and reject the former. In the gametophytic SI (GSI) system, the pollen is rejected when it expresses a specificity that matches either of those expressed in the style. Because of frequency dependent selection, many specificities are maintained in natural populations. In Pyrinae (Malus, Pyrus, and Sorbus) the S-pistil gene product is an extracellular ribonuclease, called S-RNase, and the S-pollen specificity is determined by multiple F-box genes, called SFBBs (S-locus F-box brothers). In this system the S-pollen proteins have higher affinity for non-self S-RNases, and the binding of these proteins implies that the S-RNase is ubiquitylated, and marked for degradation by the proteasome. The collaborative non-self recognition model is also the mechanism proposed for Solanaceae. In the non-self recognition systems, levels of polymorphism at the S-pollen genes are low, but levels of intra-haplotype divergence are similar to the S-RNase diversity. The amino acids under positive selection at the S-pollen genes, those involved in S-pollen specificity determination, are here identified in Malus x domestica by performing intra-haplotypic analyses in 10 S-haplotypes, as described in Pratas et al. 2017. Briefly, using the ADOPS pipeline, the sequences were first aligned with the ClustalW2 and Muscle alignment algorithms, Bayesian trees were obtained using MrBayes, and sites under positive selection were identified using codeML.

Dengue virus

BP2017000012

The Dengue virus belongs to the Flaviviridae family, and is one of the five featured viruses in the VIPR database (www.viprbrc.org). In order to try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 11 mature proteins, for each gene, all available nucleotide sequences were downloaded from the VIPR database, namely 4865, 4868, 4864, 4864, 4864, 4864, 4864, 4864, 4864 and 4863 for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. Note that the coding nucleotide sequences for the pr and M individual proteins were not available. To overcome this problem, the coding nucleotide sequences for the uncleaved prM protein were downloaded for further analysis. Two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 1262, 2111, 3482, 3058, 2663, 1887, 3564, 1984, 2565 and 3820 sequences for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 304, 506, 1289, 1103, 973, 339, 1050, 377, 535 and 1803 sequences for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, which could affect the results by creating false positively selected amino acid sites when using codeML. 
Using the first protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, prM, E, NS1, NS2A, NS2B, NS3, NS4A and NS4B, and did find evidence for recombination (P<0.05) for gene NS5. Using the second protocol final dataset, the phi test did not find statistically significant evidence for recombination (P>0.05) for genes C, prM, NS1, NS2A, NS2B, NS3, NS4A and NS4B, and did find evidence for recombination (P<0.05) for genes E and NS5. When more than 100 sequences were available for a given gene, and no evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. When more than 100 sequences were available for a given gene, and either evidence for recombination or only a few informative characters were found, five datasets with 50 randomly selected sequences were analyzed using OmegaMap, after aligning the sequences with Muscle and inferring a phylogeny with MrBayes, as implemented in ADOPS. In this case, the details of the OmegaMap run are shown in the Notes tab of the corresponding ADOPS project, but positively selected amino acid sites can still be viewed in the PSS tab. As usual, the details of every project can be checked by opening the other tabs.
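The choice between codeML and OmegaMap described above can be summarised as a small decision rule. The Python sketch below is an illustration under stated assumptions: the names `Plan` and `choose_analysis` are hypothetical, and the handling of genes with 100 sequences or fewer (a single run on the whole dataset) is an assumption, since the text only describes the case with more than 100 sequences.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    method: str                 # "codeML" or "OmegaMap"
    subset_size: Optional[int]  # None means the whole dataset is analysed
    n_subsets: int


def choose_analysis(n_sequences: int, phi_p_value: float,
                    few_informative: bool = False,
                    alpha: float = 0.05) -> Plan:
    """Pick the analysis for one gene from its phi test result."""
    recombination = phi_p_value < alpha
    # OmegaMap is used when recombination (or too little signal) is detected.
    method = "OmegaMap" if (recombination or few_informative) else "codeML"
    if n_sequences > 100:
        return Plan(method, 50, 5)  # five random datasets of 50 sequences
    # Assumption: small datasets are analysed whole, in a single run.
    return Plan(method, None, 1)
```

For instance, a gene with 3820 sequences and a significant phi test would be assigned five OmegaMap runs of 50 sequences each.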

Human Immunodeficiency Virus 1 (HIV 1)

BP2018000001

The Human Immunodeficiency Virus 1 belongs to the Retroviridae family, and is one of the two featured viruses in the HIV database (https://www.hiv.lanl.gov). In order to try to detect positively selected amino acid sites (those sites visible to the immune system, for instance) at the 9 mature proteins, for each gene, all available nucleotide sequences were downloaded from the HIV database, namely 6223, 8534, 6788, 4416, 3529, 3366, 4465, 4251 and 5111 for genes ENV, GAG, NEF, POL, REV, TAT, VIF, VPR and VPU, respectively. These sequences were only available for download as aligned files, and were initially processed to produce FASTA format files for further analysis. Because of the sheer number of sequences, and to keep the running time of the subsequent protocol steps manageable, datasets with 1000 sequences each were obtained using independent and random sequence extractions from these FASTA files. Afterwards, two protocols for sequence filtering, namely the removal of identical nucleotide sequences and the removal of identical amino acid sequences, were separately implemented. In the first protocol (N prefix), with the removal of identical nucleotide sequences, as well as those with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 913, 815, 924, 929, 859, 845, 989, 975 and 965 sequences for genes ENV, GAG, NEF, POL, REV, TAT, VIF, VPR and VPU, respectively. In the second protocol (A prefix), with the removal of identical amino acid sequences, as well as the untranslated sequences with ambiguous nucleotides and those presenting in-frame stop codons, we ended up with 913, 815, 924, 929, 859, 844, 989, 974 and 964 sequences for genes ENV, GAG, NEF, POL, REV, TAT, VIF, VPR and VPU, respectively. The phi test for recombination, as implemented in SplitsTree, was used to look for evidence of recombination in these datasets, which could affect the results by creating false positively selected amino acid sites when using codeML. 
In both protocols, the phi test did not find statistically significant evidence for recombination (P>0.05) for any of the genes. When more than 100 sequences were available for a given gene, and no evidence for recombination was found, five datasets with 50 randomly selected sequences were analyzed. In this case, sequences were aligned using Muscle, phylogenies inferred using MrBayes, and positively selected amino acid sites inferred using codeML, as implemented in ADOPS. As usual, the details of every project can be checked by opening the various available tabs.

Identification of positively selected amino acid sites in 19 primate HLA genes

BP2018000003

Sequences were retrieved from the human genome and the genomes of 24 other primates using SEDA, and the new Blast: two-way ortholog identification tool. For two (HLA-G and HLA-DRB4) of the original set of 21 genes, fewer than four useful sequences could be retrieved, and thus these genes were not analysed further.

Rosa S-locus genes

BP2018000004

Gametophytic self-incompatibility (GSI) is a genetic mechanism that determines the rejection of genetically related pollen landing on the stigma of a particular flower. Such a system has been described in diploid Rosa species. The Rosa S-locus is located on chromosome 3, and there is evidence for linkage with economically important phenotypes. In other Rosaceae species the S-pistil gene encodes a glycoprotein with ribonuclease activity, called the S-RNase. Based on phylogenetic inferences, expression patterns and segregation analyses, the putative S-RNase gene has been identified. This gene belongs to the Prunus S-RNase lineage. For this gene, we here identify amino acid sites under positive selection, those, in principle, responsible for the specificity of the reaction (see S-RNase_Rosa). Similar inferences have been made for the S-pollen gene(s). We have shown that none of the Rosa SFB lineage genes determines S-pollen specificity, as described in Prunus species. For the 14 putative R. chinensis S-pollen genes, those in the vicinity of the S-RNase that are expressed in stamen, we also show evidence for positive selection (see 14_F-box_R_chinensis). Since Rchinensis_NC_037090F_box_minus3 codes for a putative protein smaller than the other S-locus F-box genes, in order to analyze most of the coding region, we performed the analyses excluding this sequence (see 13_F-box_R_chinensis). Evidence for positive selection is also shown for the five F-box genes in the vicinity of the R. multiflora S-RNase located in scaffold sca0006888 (see 5_R_multiflora_sca0006888).

The evolution of vitamin C biosynthesis and transport in animals

BP2018000005

In animals, vitamin C (VC) is an indispensable antioxidant and co-factor for optimal function of cells. In the single VC synthesis pathway described in this taxonomic group, the penultimate step is catalysed by Regucalcin. The Regucalcin datasets used here were compiled as described in the article "The evolution of vitamin C biosynthesis and transport in animals", with the aim of identifying positively selected amino acid sites that could give an indication of subfunctionalization and protein docking surfaces.

Mycobacterium leprae

BP2020000001

Using the GenomeFastScreen pipeline (available at https://pegi3s.github.io/dockerfiles/) and the available annotation for 6 Mycobacterium leprae genomes (GCF_000026685.1, GCF_000195855.1, GCF_001648835.1, GCF_001653495.1, GCF_003253775.1, and GCF_003584725.1), 531 genes out of 1601 were identified as being the most likely to show positively selected amino acid sites (PSS), and thus were analyzed using ADOPS. Of these, 31 are identified by ADOPS as showing PSS.

Alphacoronavirus_1 codeML

BP2022000005

Detection of positively selected amino acid sites using codeML in Alphacoronavirus_1 genes

Alphacoronavirus_1 Fubar

BP2022000006

Detection of positively selected amino acid sites using Fubar in Alphacoronavirus_1 genes

Bat coronavirus codeML

BP2022000007

Detection of positively selected amino acid sites, using codeML, in Bat coronavirus genes

Bat coronavirus Fubar

BP2022000008

Detection of positively selected amino acid sites, using Fubar, in Bat coronavirus genes

Bat coronavirus HKU10 codeML

BP2022000009

Detection of positively selected amino acid sites, using codeML, in Bat coronavirus HKU10 genes

Bat coronavirus HKU10 Fubar

BP2022000010

Detection of positively selected amino acid sites, using Fubar, in Bat coronavirus HKU10 genes

Betacoronavirus_1 codeML

BP2022000011

Detection of positively selected amino acid sites using codeML in Betacoronavirus_1 genes

Betacoronavirus_1 Fubar

BP2022000012

Detection of positively selected amino acid sites using Fubar in Betacoronavirus_1 genes

Human coronavirus 229E codeML

BP2022000013

Detection of positively selected amino acid sites, using codeML, in Human coronavirus 229E genes

Human coronavirus 229E Fubar

BP2022000014

Detection of positively selected amino acid sites using Fubar in Human coronavirus 229E genes

Human coronavirus HKU1 codeML

BP2022000015

Detection of positively selected amino acid sites, using codeML, in Human coronavirus HKU1 genes

Human coronavirus HKU1 Fubar

BP2022000016

Detection of positively selected amino acid sites, using Fubar, in Human coronavirus HKU1 genes

Human coronavirus NL63 codeML

BP2022000017

Detection of positively selected amino acid sites, using codeML, in Human coronavirus NL63 genes

Human coronavirus NL63 Fubar

BP2022000018

Detection of positively selected amino acid sites, using Fubar, in Human coronavirus NL63 genes

Murine coronavirus codeML

BP2022000019

Detection of positively selected amino acid sites, using codeML, in Murine coronavirus genes

Murine coronavirus Fubar

BP2022000020

Detection of positively selected amino acid sites, using Fubar, in Murine coronavirus genes

Pipistrellus bat coronavirus HKU5 codeML

BP2022000021

Detection of positively selected amino acid sites, using codeML, in Pipistrellus bat coronavirus HKU5 genes

Pipistrellus bat coronavirus HKU5 Fubar

BP2022000022

Detection of positively selected amino acid sites, using Fubar, in Pipistrellus bat coronavirus HKU5 genes

Porcine coronavirus HKU15 codeML

BP2022000023

Detection of positively selected amino acid sites, using codeML, in Porcine coronavirus HKU15 genes

Porcine coronavirus HKU15 Fubar

BP2022000024

Detection of positively selected amino acid sites, using Fubar, in Porcine coronavirus HKU15 genes

Rhinolophus bat coronavirus HKU2 codeML

BP2022000025

Detection of positively selected amino acid sites, using codeML, in Rhinolophus bat coronavirus HKU2 genes

Rhinolophus bat coronavirus HKU2 Fubar

BP2022000026

Detection of positively selected amino acid sites, using Fubar, in Rhinolophus bat coronavirus HKU2 genes

Rousettus bat coronavirus HKU9 codeML

BP2022000027

Detection of positively selected amino acid sites, using codeML, in Rousettus bat coronavirus HKU9 genes

Rousettus bat coronavirus HKU9 Fubar

BP2022000028

Detection of positively selected amino acid sites, using Fubar, in Rousettus bat coronavirus HKU9 genes

Tylonycteris bat coronavirus HKU4 codeML

BP2022000029

Detection of positively selected amino acid sites, using codeML, in Tylonycteris bat coronavirus HKU4 genes

Tylonycteris bat coronavirus HKU4 Fubar

BP2022000030

Detection of positively selected amino acid sites, using Fubar, in Tylonycteris bat coronavirus HKU4 genes

MERS codeML

BP2024000001

Detection of positively selected amino acid sites, using codeML, in MERS genes

MERS Fubar

BP2024000002

Detection of positively selected amino acid sites, using Fubar, in MERS genes

PEDV codeML

BP2024000003

Detection of positively selected amino acid sites, using codeML, in PEDV genes

PEDV Fubar

BP2024000004

Detection of positively selected amino acid sites, using Fubar, in PEDV genes