Why are we doing this?

While mammalian genomes maintain strict dosage constraints, where copy number variation (CNV) is often synonymous with disease, plant genomes exhibit extraordinary plasticity, harbouring gene families with copy numbers exceeding 300. The mechanisms allowing plants to bypass the "dosage-balance constraint" remain a fundamental mystery in evolutionary biology. This project will perform the first unbiased, long-read-based meta-analysis of all available plant genomes (>1,000) to quantify the global landscape of gene dosage. By mapping raw Nanopore and PacBio reads back to chromosome-level assemblies and measuring coverage, we will bypass the "assembly collapse" bias that has historically hidden high-copy sequences, revealing the true scale of genomic expansion and its correlation with environmental adaptation.

Stage 1: Curating the database and metadata

<aside> 🏁 Goal 1: Obtain >1,000 genomes, with their associated sequences and metadata, from NCBI

</aside>

The essence of a meta-analysis is unbiased, standardised sampling. We should aim to use NCBI Datasets (which is quite nasty to work with right now because of all the recent changes to the system; see this, this) to obtain genomes that match these criteria:

[ ] Reference genome (in FASTA format)
[ ] Chromosome-level (vs contig, scaffolds)
[ ] Assembled with a long-read technology (i.e. Nanopore or PacBio)
[ ] Gene annotation (in GFF format)
[ ] Protein models (in FASTA format)
[ ] (???) Raw sequences used to assemble the genome (stored in SRA database)
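
The criteria above could be applied automatically to assembly records. Below is a minimal sketch, assuming the records have been exported as JSON lines (one assembly per line) and that each record carries `accession`, `assembly_info.assembly_level`, `assembly_info.sequencing_tech` and `annotation_info` fields. These field names are assumptions and need checking against the actual NCBI Datasets report. The sequencing technology is treated as free text, so the keyword match is only a heuristic, and the function names are hypothetical.

```python
import json

# Heuristic keywords for long-read platforms (assumption: the
# sequencing technology is recorded as free text in the report).
LONG_READ_KEYWORDS = ("nanopore", "minion", "promethion", "pacbio", "hifi", "sequel")

# Assembly levels we accept (assumption: the level is spelled "Chromosome").
ACCEPTED_LEVELS = {"chromosome"}


def passes_criteria(record: dict) -> bool:
    """Return True if a record is chromosome-level, long-read and annotated."""
    info = record.get("assembly_info", {})
    level = str(info.get("assembly_level", "")).lower()
    tech = str(info.get("sequencing_tech", "")).lower()
    is_long_read = any(keyword in tech for keyword in LONG_READ_KEYWORDS)
    has_annotation = bool(record.get("annotation_info"))
    return level in ACCEPTED_LEVELS and is_long_read and has_annotation


def filter_records(jsonl_lines):
    """Return the accessions of all records that pass the criteria."""
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if passes_criteria(record):
            kept.append(record["accession"])
    return kept
```

Records that fail only on the sequencing technology are worth reviewing by hand, since a missing or oddly worded entry would otherwise silently bias the sample.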

We should download all these data to our ARC project space using a standard directory structure, and curate an accompanying metadata table.
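
One possible way to keep the structure standard is to derive every file path from the assembly accession and to append one metadata row per genome. This is only a sketch: the folder layout, file names, metadata columns and function names below are all hypothetical and should be replaced by whatever convention we agree on.

```python
import csv
from pathlib import Path

# Hypothetical metadata columns; adjust once the real schema is agreed.
METADATA_FIELDS = ["accession", "species", "assembly_level", "sequencing_tech", "sra_runs"]


def genome_paths(root, accession):
    """Return the expected location of each file for one genome."""
    base = Path(root) / accession
    return {
        "genome": base / "genome.fa",         # reference genome (FASTA)
        "annotation": base / "annotation.gff",  # gene annotation (GFF)
        "proteins": base / "proteins.faa",    # protein models (FASTA)
        "reads": base / "reads",              # raw reads from SRA
    }


def append_metadata(csv_path, row):
    """Append one genome to the metadata table, writing a header if the file is new."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=METADATA_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Keying everything on the accession means the metadata table and the directory tree can always be cross-checked against each other.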