Talk:Branching order of bacterial phyla (Genome Taxonomy Database, 2018)

GTDB version

Maybe we should update this page to use a newer GTDB release whenever one is available? Their releases aren't hard to parse: just go to https://data.gtdb.ecogenomic.org/releases/latest/ and get the two .tree files for bacteria and archaea. Trim each one down to the phylum (p__) level by discarding the other nodes, and that's the tree you want. Maybe add some names to the beyond-phylum-level clades we have here. Artoria2e5 🌉 02:43, 11 May 2022 (UTC)

Alright, here's a script to do that.

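#!/usr/bin/env python3
"""Prune a GTDB Newick tree down to its domain (d__) and phylum (p__) nodes."""
import re
import sys

import newick

# Matches a d__/p__ rank at the start of a label or right after a colon.
RANK_REGEX = re.compile(r'(^|:)[dp]__.+')

def do_prune(inname, outname):
    trees = newick.read(inname)
    root = trees[0]
    # Inefficient, so what?
    # Unnamed nodes have name None, so check the name before matching.
    accept = [n for n in root.walk() if n.name and RANK_REGEX.search(n.name)]
    print(accept)
    # inverse=True inverts the selection: nodes outside accept get pruned.
    root.prune(accept, True)
    newick.write(trees, outname)

if __name__ == '__main__':
    do_prune(*sys.argv[1:])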

But! If you think pip3 install newick is enough, it isn't.

  • The GTDB files have a lot of semicolons that the original library doesn't like. Replace all occurrences of ; with / before you start. With sed, maybe, but I was experimenting so I used a graphical text editor. Try sed -i -e 's@;@/@g' bac120.tree.
  • The GTDB files use colons in a way that the library cannot understand at all. To overcome that I "dumbified" the newick.py library by turning off colon handling in the following places:
    • Remove the colon from RESERVED_PUNCTUATION at the top.
    • In _parse_name_and_length, remove the if ':' in s: part completely.
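For the semicolon step, a more cautious variant than a blanket sed is to replace only the semicolons that sit inside single-quoted labels, so the terminating ; of the tree is left alone. A minimal Python sketch, assuming the taxonomy labels in the .tree files are single-quoted; the function name replace_label_semicolons is made up for illustration and is not part of any library:

```python
import re


def replace_label_semicolons(newick_text: str, replacement: str = "/") -> str:
    """Replace ';' inside single-quoted labels only.

    Assumption: labels that carry taxonomy strings are wrapped in single
    quotes. Anything outside quotes, including the final ';', is untouched.
    """
    def fix(match: re.Match) -> str:
        return match.group(0).replace(";", replacement)

    # '[^']*' matches one quoted label at a time.
    return re.sub(r"'[^']*'", fix, newick_text)


if __name__ == "__main__":
    sample = "(A:0.1,(B:0.2,C:0.3)'100.0:p__X; c__Y':0.4)'d__Bacteria';"
    print(replace_label_semicolons(sample))
```

Whether the library accepts the trailing ; may depend on the installed newick version, so treat this as an alternative to try rather than a drop-in fix.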

Yeah. I was too tired to find another library. Anyways I have a tree now with python3 bac120{.tree,.phy}, so that's a start... --Artoria2e5 🌉 03:23, 11 May 2022 (UTC)

Alright, the trees may still be a bit big. What to do?

  • Kill 'em numbers: remove [:\)][0-9]+\.[0-9]+ (keep the parentheses!) and [0-9]+\.[0-9]+:. Sorry, sed is unhappy with me.
  • Kill extra monotypic names: remove /[^']+, although this can cause some confusion with GTDB auto-generated names based on lower taxa
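Since sed was being difficult, the same clean-up can be sketched with Python's re module. This only illustrates the patterns listed above. It assumes the semicolons were already replaced with / and that every number is a plain decimal such as 0.123 (no scientific notation). shrink_tree is a hypothetical name.

```python
import re

NUMBER = r"[0-9]+\.[0-9]+"


def shrink_tree(newick_text: str) -> str:
    """Strip branch lengths, support values and lower-rank names."""
    # Branch lengths such as ":0.123" are removed entirely.
    text = re.sub(":" + NUMBER, "", newick_text)
    # Support values such as ")0.98" are removed, but the ")" is kept.
    text = re.sub(r"\)" + NUMBER, ")", text)
    # Support prefixes inside labels such as "100.0:" are removed.
    text = re.sub(NUMBER + ":", "", text)
    # Everything from the first "/" up to the closing quote is removed,
    # which drops the extra monotypic names ("/ c__Y").
    text = re.sub(r"/[^']+", "", text)
    return text


if __name__ == "__main__":
    sample = "((A:0.1,B:0.2)'100.0:p__X/ c__Y':0.3,C:0.4)0.98:0.05;"
    print(shrink_tree(sample))
```

The order matters: if the "number followed by a colon" pattern ran first, it could eat the tail of an accession-style leaf name that is directly followed by a branch length.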

The end result is something like this for v207:

d  Archaea;

'p Halobacteriota'

'p Thermoplasmatota'

'p Thermoproteota'

'p Asgardarchaeota'

'p Methanobacteriota'

'p Hydrothermarchaeota'

'p Methanobacteriota A'

'p Methanobacteriota B'

'p Hadarchaeota'

'p Micrarchaeota'

'p B1Sed10-29'

'p Iainarchaeota'

'p Altiarchaeota'

'p Nanoarchaeota'

'p Aenigmatarchaeota'

'p Nanohaloarchaeota'

'p EX4484-52'

'p SpSt-1190'

'p Undinarchaeota'

d  Bacteria

'p Spirochaetota'

'p UBA8481'

'p Lindowbacteria'

'p T1Sed10-126'

'p Wallbacteria'

'p Riflebacteria'

'p UBP17'

'p Mcinerneyibacteriota'

'p Fusobacteriota'

'p Calescibacterota'

'p DRYD01'

'p UBP7'

'p Dependentiae'

'p Campylobacterota'

'p Synergistota'

'p Bipolaricaulota'

'p Thermotogota'

'p DUMJ01'

'p Dictyoglomota'

'p JACIXR01'

'p Caldisericota'

'p Thermodesulfobiota'

'p HKB111'

'p Coprothermobacterota'

'p Patescibacteria'

'p 4484-113'

'p UBP15'

'p Cyanobacteria'

'p Margulisbacteria'

'p Firmicutes G'

'p Firmicutes H'

'p Firmicutes'

'p Firmicutes F'

'p Firmicutes A'

'p Firmicutes D'

'p DTU030'

'p Firmicutes B'

'p Firmicutes C'

'p Firmicutes E'

'p Dormibacterota'

'p Chloroflexota'

'p Armatimonadota'

'p CSP1-3'

'p Eremiobacterota'

'p Deinococcota'

'p Actinobacteriota'

'p Atribacterota'

'p Aerophobota'

'p CG03'

'p Elusimicrobiota'

'p UBA6262'

'p FCPU426'

'p Firestonebacteria'

'p Goldbacteria'

'p UBA9089'

'p UBP18'

'p Desantisbacteria'

'p JACRDZ01'

'p PUNC01'

'p UBA6266'

'p JABMQX01'

'p Aureabacteria'

'p JACPWU01'

'p JAFGBW01'

'p Verrucomicrobiota'

'p UBA3054'

'p Chlamydiota'

'p Ratteibacteria'

'p NPL-UPA2'

'p CAIJMQ01'

'p Omnitrophota'

'p Poribacteria'

'p Abyssubacteria'

'p Hydrogenedentota'

'p OLB16'

'p Sumerlaeota'

'p RBG-13-66-14'

'p Planctomycetota'

'p Fibrobacterota'

'p Cloacimonadota'

'p AABM5-125-24'

'p Delongbacteria'

'p Calditrichota'

'p SM23-31'

'p KSB1'

'p JdFR-76'

'p QNDG01'

'p Marinisomatota'

'p CLD3'

'p Bacteroidota'

'p UBP14'

'p WOR-3'

'p Latescibacterota'

'p JAAXHH01'

'p TA06'

'p Edwardsbacteria'

'p Zixibacteria'

'p 4572-55'

'p Gemmatimonadota'

'p Fermentibacterota'

'p Krumholzibacteriota'

'p Eisenbacteria'

'p VGIX01'

'p JABDJQ01'

'p Acidobacteriota'

'p Chrysiogenota'

'p Deferribacterota'

'p Thermosulfidibacterota'

'p Aquificota'

'p Methylomirabilota'

'p Moduliflexota'

'p SZUA-182'

'p Schekmanbacteria'

'p Nitrospinota B'

'p Tectomicrobia'

'p UBA8248'

'p Nitrospinota'

'p Nitrospirota'

'p CG2-30-53-67'

'p B130-G9'

'p CSSED10-310'

'p Desulfobacterota'

'p UBA10199'

'p Bdellovibrionota E'

'p Desulfobacterota G'

'p SZUA-79'

'p Desulfobacterota D'

'p Desulfobacterota C'

'p Desulfobacterota E'

'p BMS3Abin14'

'p FEN-1099'

'p Desulfobacterota B'

'p JACPQY01'

'p Myxococcota'

'p RBG-13-61-14'

'p Desulfobacterota I'

'p UBA2233'

'p Nitrospirota A'

'p Proteobacteria'

'p UBP6'

'p Bdellovibrionota C'

'p SAR324'

'p Bdellovibrionota'

'p Myxococcota A'

An okay start, I'd say. --Artoria2e5 🌉 03:42, 11 May 2022 (UTC)

Further processing

HUGE thanks to Videsh Ramsahai for getting the job done! It's beautiful. I can't even imagine how much work that takes.

Still, we should maybe... get a way to automate the updating, and specifically the relabelling from all the p__Blah names to proper stuff with quotation marks, links, and an explanation of what each grouping includes if it is novel. Ideally we get:

  • a script that takes an old article cladogram and the corresponding "crude" newick (as above) and generates a look-up table for how to rename all the nodes
  • another script that takes a "crude" newick from a newer version and applies the table to it
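A rough sketch of the two pieces, to make the idea concrete. It is heavily simplified: the first function assumes someone has already extracted the crude names and the article labels as two lists in the same order (parsing the article cladogram itself is not attempted), and the second only renames single-quoted labels. build_lookup and apply_lookup are made-up names, and the example mapping is purely illustrative.

```python
import re


def build_lookup(crude_labels, article_labels):
    """Pair crude GTDB names with article labels.

    Assumption: both lists are in the same order, one entry per node.
    """
    if len(crude_labels) != len(article_labels):
        raise ValueError("label lists differ in length")
    return dict(zip(crude_labels, article_labels))


def apply_lookup(newick_text, lookup):
    """Rename quoted labels found in the table.

    Returns the new text plus the labels that had no entry, so a human
    can curate whatever is new in the release.
    """
    missing = []

    def rename(match):
        label = match.group(1)
        if label in lookup:
            return "'" + lookup[label] + "'"
        missing.append(label)
        return match.group(0)

    return re.sub(r"'([^']*)'", rename, newick_text), missing


if __name__ == "__main__":
    table = build_lookup(
        ["p__Firmicutes", "p__Proteobacteria"],
        ["Bacillota", "Pseudomonadota"],
    )
    text, todo = apply_lookup(
        "('p__Firmicutes','p__Proteobacteria','p__UBP6');", table
    )
    print(text)
    print("needs a label:", todo)
```

Returning the unmatched labels separately keeps the manual step small: only phyla that are new or renamed in the newer release need attention.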

--Artoria2e5 🌉 15:07, 14 September 2022 (UTC)

Hi Artoria2e5,
I do not possess the skill to create a bot that updates the GTDB tree whenever there is a newer version. If you can do this, it would be highly appreciated and commendable. If and when one is created, I can manually curate the new tree to remove errors if you would like.
Sincerely, Videsh Ramsahai (talk) 16:50, 14 September 2022 (UTC)
I guess I will do that sometime then. I think the first script can be skipped if we just keep comments describing what each node is in the article source. That would provide traceability even for human editors, although the source code size is... gonna get bigger and arguably the parenthesized stuff (e.g. "(Bacillota C)") will be duplicated unnecessarily. Artoria2e5 🌉 11:48, 15 September 2022 (UTC)

Oh no, Spirochaetota and Lindowbacteria are clearly not in the right place. And Undinarchaeota isn't branching early enough either. I don't want to check everything... --Artoria2e5 🌉 04:01, 27 January 2023 (UTC)