####################################################### # BIOSCI210, Lab 4: Exploring Genome Size Variation # # In this lab, we will explore genome size, and how it # varies across vertebrates. # # Rather than give you a large written introduction to # the material, please be sure to watch these lectures before # finishing the lab: # # Lecture 25: Quantitative methods in macroevolution # Lecture 28: Nonadaptive evolution # # The purpose of this lab is to *explore* the data, # basically by running the given code, looking at # the plots, and *thinking* about the plots # and what they tell you. # # The purpose is not to teach you how to code in R, # although it will give you a little more experience # in R (perhaps building on other courses). All code should # run "out of the box" by copy-paste into the R # window (or highlighting code in RStudio and # hitting CTRL-RETURN), as long as: # # * you have R packages "ape" and "phytools" installed. # # * you run the commands in order -- if you accidentally # skip some commands, later commands may give errors. # # Also: You might find this script useful as a starting # point, if you ever have to do a phylogenetically-aware analysis of # multispecies data in future work. # # Answer the questions in the Canvas quiz. They are # marked in this script with "Q:". Choose the best # answer! ####################################################### ####################################################### # Genome Size dataset # # Our genome size data are taken (with permission) from: # # Gregory, T. Ryan (2020). Animal Genome Size Database. http://www.genomesize.com . # # C-value is the haploid genome size in picograms. 1 picogram of DNA is roughly # 1 billion bases, i.e. 1 Gigabase or Gb. The human haploid genome size is # about 3.3 Gb. # # # (Actually 1 pg DNA is about 0.978 Gb.) ####################################################### ####################################################### # # LAB 4 SETUP INSTRUCTIONS # # For lab 4, you will need access to these: # # (Note: all of these are available in the computer lab): # # 1. Web browser, and access to CANVAS (for quizzes, # lectures, online disucssions etc.). # # 2. Ability to copy/paste text. This is possible even # on a smartphone, but probably is much easier on # a desktop/laptop/tablet. # # 3. You need to have access to R, and most students # like to use RStudio. These are installed on the # lab computers. These programs are free, so you can # also install these programs on your own computer, # or use them through a browser with the # "RStudio Cloud" website. # # Download R: # http://www.r-project.org/ # # Download RStudio: # https://rstudio.com/products/rstudio/ # # RStudio Cloud (running RStudio through a website) # https://rstudio.cloud/ # (you will want to register a free account) # # 4. You will need 2 R packages installed. These # come installed on the lab computers, but otherwise # you may have to install them. Un-comment these # lines and run these commands in R: # # install.packages("ape") # install.packages("phytools") # library(ape) # library(phytools) # # 5. The TAs and I will take questions on Piazza all # week. I also plan to do live Zoom help sessions, # so if you want to access these, please install # Zoom. # ####################################################### ####################################################### # NOTE: Phylogenetics, and R, are two of the areas # for which New Zealand and the University of # Auckland are well-known in science. For example, # the original publication of R, Ihaka & # Gentleman (1996), now has over 10,000 citations # in the scientific literature. The authors, Ross # Ihaka and Robert Gentleman, are from the # Department of Statistics at the # University of Auckland. # # Ihaka, Ross; Gentleman, Robert (1996). "R: A Language # for Data Analysis and Graphics." Journal of # Computational and Graphical Statistics. Volume 5, # Issue 3, pages 299-314. # https://amstat.tandfonline.com/doi/abs/10.1080/10618600.1996.10474713 # # More on Ross Ihaka: # # Middleton, Juliet (2009). "Academic unfazed by rock star # status." New Zealand Herald, Jan. 10, 2009. # https://www.nzherald.co.nz/nz/academic-unfazed-by-rock-star-status/LMU5EIMP7QCMZFCBM3UNW4C7CI/ # # Ihaka Lecture Series at the University of Auckland # https://www.auckland.ac.nz/en/science/about-the-faculty/department-of-statistics/ihaka-lecture-series.html # ####################################################### ####################################################### # Background on exercises using R # # Real-life computational biologists tend to be # pluralists -- they will use whatever combination of # programs, websites, and programming languages # "works" for their current tasks. # # Due to the Covid situation, we may not have access # to the computer lab. In some ways, this is an # advantage! By being forced to install some programs # on your own computers (with a variety of different # operating systems, etc.), and figuring out how to # make them work on your computer, you are doing # something much closer to the day-to-day work # of a computational biologist. # # The main thing you need to make this work is: # patience. When you get stuck, # # 1. First, take a deep breath, and see if you can # identify the problem. # # 2. Second, google the problem (e.g., google the # error message). If it's an installation # problem, google your question along with the # name of your operating system. (E.g., "problem # installing AliView on Windows" is better than # just "problem installing AliView) # # 3. If you can't figure out the issue after a few # minutes, POST YOUR QUESTION TO PIAZZA. # # a. Note: Your questions will get better answers # if you include sufficient information in your # question. "It's not working!" does not give # us any ability to help. We instructors can't # see what you are seeing, unless you show us! # # b. Therefore, include screenshots of error messages, # or (better) copy-paste the commands you used, AND # the compete error messages that resulted. This # way, other people can search and find the # questions/answers for their own questions. # # 4. If you find a solution, please post that! This # will also help others. # ####################################################### ######################################################## # Advice on navigating the file system # # Every operating system (OS) has a different file # viewer. This is the best place to see where your # folders and files are, and to arrange files/folders # so that you can find them. # # On Windows, the file viewer is Windows Explorer, and # can usually be opened with WindowsKey+E # # On Macs, the file viewer is Finder, and can be opened # by clicking the little blue faces icon: # https://www.lifewire.com/finders-icon-view-options-2260725 # # On Linux -- if you have Linux, you probably know # all about file viewing already. # ######################################################## ######################################################## # Advice on filenames / directory names # # Short version: STICK TO PLAIN-TEXT, NO SPACES # # While modern operating systems often allow filenames # and file paths (directory names) to have spaces in # them, as well as apostrophes, inverted commas, # regular commas, etc., you should avoid these. # # This is because R, and many other pieces of scientific # software, react badly to filenames with spaces, # special characters, etc. This creates endless # difficulties. # # Therefore, keep it simple: # # BAD PATH AND FILENAME: # /Nick's cool documents/Nick's lab 3.txt # # GOOD PATH AND FILENAME # /Documents/Nicks_lab3.txt # ######################################################## ######################################################## # Advice on plain-text editors: ######################################################## # # In computational biology, data, inputs, code, etc., are typically # contained in PLAIN-TEXT (ASCII) files. These files have *no* fonts, # text-sizes, invisible formatting, weird character codes, etc. # # Plain-text files are best viewed in an ASCII text viewer, so that you can # see the difference between spaces and tabs, you can see blank lines at # the end of some files, etc. # # Note also that THE DIFFERENCE BETWEEN TABS AND SPACES IS IMPORTANT IN # THESE TEXT FILES. COMMON PROBLEMS INCLUDE: # # Tabs / spaces at the end of a line, after any text. Delete these in the # text-editor. # A line in the text file that looks blank, but is actually 8 tabs (or # whatever). This can be produced by e.g. copying/pasting a series of # distance matrices from Excel to text. When you do this, just check the # blank lines for tabs, and delete these tabs in the text-editor so the # line is actually blank. # # Good programs: # # TextWranger (mac), BBedit (mac), Notetab (windows), Rstudio or R.app # editors, any code editor # # Bad (bad Bad BAD!!) programs: # Word/Office (mac/windows), WordPad (windows), TextEdit (Mac) # # Marginal: # Notepad (windows) # ######################################################## ####################################################### ####################################################### # R script for analyzing genome size data for vertebrates. # Should run "out of the box", if you are online and # have the R packages "ape" and "phytools" installed. ####################################################### ####################################################### # ABBREVIATIONS # NOTES ON READING NICK'S CODE: # "wd" means "working directory" # "fn" means "filename" # "trfn" means "tree filename" # "data_fn" means "data filename" # "TF" stores a series of TRUE/FALSE results # "pch" means "point character", e.g. for plotting points # "cex" means "character expansion", e.g. char. size # # Scientific notation in R: # 3.65e2 means 3.65 x 10^2 = 365 # 3.65e-2 means 3.65 x 10^-2 = 0.0365 ####################################################### # Get the working directory (set it to a local directory, if you desire) wd = getwd() # Specifying the working directory on Nick's machine # wd = "/drives/GDrive/__classes/BIOSCI210/lab4_genome_size/" # See the working directory wd # Tell R what the working directory is, with the setwd command setwd(wd) # Nick's machine - make PDFs of all plots # pdffn = "Lab4_plots_v1.pdf" # pdf(file=pdffn, width=6, height=6) # Double-check the working directory getwd() ############################################################### # Setup ############################################################### # Load the APE and BioGeoBEARS packages # (APE = Analysis of Phylogenetics and Evolution) # If the library command doesn't work, first type # install.packages("ape") # install.packages("phytools") library(ape) library(phytools) # Get a linear regression function from Nick Matzke's "gist" archive source("https://gist.githubusercontent.com/nmatzke/b21fcccb09b42959897d/raw/5f461b8eb1032fa7aad2e0ae5c13e9f9a1783cdb/_genericR_v1.R") # Get a tree-table function from Nick Matzke's "gist" archive source("https://gist.githubusercontent.com/nmatzke/a40dc1291115072889c81e416953edad/raw/bc3f657bc80aafa885bd39c3fdcd41b5d52d9816/prt_alone_v1.R") ############################################################### # Files -- these are online links; if you like, # download to your working directory (you will then have to # change trfn and data_fn to refer to that local directory) ############################################################### trfn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/birds_mammals_subset_tree.newick" data_fn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/birds_mammals_subset_Cvalue_bodysize.txt" ############################################################### # Read in the tree file ############################################################### # Read a Newick-formatted phylogeny file (which has been subset to birds and mammals # found in the C-value data file) to an APE tree object tr = read.tree(trfn) # Look at the first few tip labels. tr$tip.label[1:10] # Plot the phylogeny (skipping the tip names, as we have so many) plot(tr, show.tip.label=FALSE) axisPhylo() mtext(text="Millions of years ago (Ma)", side=1, line=2.5) title("Phylogeny of (selected) birds and mammals") # Put a blue asterisk where humans are: # Get the tip number for humans: speciesname = "Homo_sapiens" tipnum_of_Homo_sapiens = (1:length(tr$tip.label))[tr$tip.label == speciesname] # Get the height of the tree tr_height = get_max_height_tree(tr) # Plot the blue asterisk points(x=tr_height+4, y=tipnum_of_Homo_sapiens, pch="*", col="blue", cex=1.5) ############################################################### # Read in the data file ############################################################### # Read a table-delimited data file (bird and mammal C-values and body sizes) # to an R data.frame named "datadf" datadf = read.table(file=data_fn, header=TRUE, sep="\t", stringsAsFactors=FALSE) # Order the table to match the phylogeny tip labels datadf = datadf[match(tr$tip.label, table=datadf$name),] # Look at the first few species in the table datadf$name[1:10] head(datadf) # Q1: Approximately what age is the common ancestor of birds and mammals, according # to this phylogeny? # Q2: How many species are in this phylogeny? (Hint: each species is # a "tip" in the tree, you can get the species names with # tr$tip.label, and the length() function will count # how many items an input has.) # Q3: Is this a mostly-complete phylogeny? (i.e. does it contain most of # the known species of birds and mammals?) ####################################################### # Plot body size versus C-value ####################################################### # Raw data plot(x=datadf$mass_g, y=datadf$Cvalue, xlab="Body mass (grams)", ylab="C-value (picograms)") title("Birds+mammals: Body mass versus genome size") # Q4: Is it safe to conclude there is no relationship between genome size # and body mass based on this plot? # Q5: What is the mass of that huge animal, in kilograms? Hint: the "mass_g" column of datadf contains the mass in grams, and the "max" command gives you the maximum value in the input. Try: max(datadf$mass_g) # Q6: Name the species with the largest body size in the data table. # Which row matches the maximum mass? TF = datadf$mass_g == max(datadf$mass_g) # View the matching row: datadf[TF,] # Plot with log(body mass) on x-axis plot(x=datadf$mass_g, y=datadf$Cvalue, xlab="Body mass (grams)", ylab="C-value (picograms)", log="x") title("Birds+mammals: Body mass (log-scaled) versus genome size") # Put a blue asterisk where humans are: rownum = (1:nrow(datadf))[datadf$name == "Homo_sapiens"] points(x=datadf$mass_g[rownum], y=datadf$Cvalue[rownum], pch="*", col="blue", cex=5) # Q7: Do humans have a particularly large genome size? # Q8: Does this plot suggest a relationship between body mass and genome size? # The linear_regression_plot() function (which you loaded from Matzke's # code archive with the source() function, above) plots your "x" and "y" data, # and also calculates & plots a standard linear regression, # returning a slope, intercept, and p-value. # # IMPORTANT CRASH COURSE ON P-VALUES: # # P-values are measures of *inconsistency* with the null # hypothesis. They measure the probability that some # statistic (or a more extreme value of that statistic) # would be produced if the null model is true. # # For these regressions, the null model is that the # slope, m, equals 0.0 (a flat line). # # By common convention, p-values less than 0.05 are taken # to be "statistically significant" evidence against the # null hypothesis, because they indicate that the observed # data would be relatively improbable under the null hypothesis. # # Under this cutoff, P-values > 0.05 mean that we "fail to reject the null # model", which is different than "we accept the null model." # # P-values are commonly misunderstood and mis-used, and they # don't tell you what you intuitively want to know: "is my # model true?". So, there is large debate in statistics and # science over when/whether P-values should be replaced # with alternatives. Google will bring up many discussions. # # Here, we regress log10(body mass) against genome size # (C-value = haploid genome size in picograms) linear_regression_plot(x=log10(datadf$mass_g), y=datadf$Cvalue, xlabel="Birds+mammals: Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(datadf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE) points(x=log10(datadf$mass_g[rownum]), y=datadf$Cvalue[rownum], pch="*", col="blue", cex=5) title("Birds+mammals: Body mass (log-scaled) versus genome size") # Q9: Is the inferred relationship between body mass and genome size positive or negative? # Q10: What is the estimated slope of this relationship? # Q11: The null model is that the slope m = 0.0, i.e. no relationship between body mass and genome size. Is the estimated slope statistically significantly different from 0.0? # Q12: What is the reported p-value? Does does this p-value indicate "barely significant", e.g. near the p=0.05 cutoff, or does it indicate "highly significant," i.e. a p-value very far below the 0.05 cutoff? ####################################################### # Phylogenetic independent contracts ####################################################### # Let's take the same data, but this time plot the # "Phylogenetically Independent Contrasts" (PICs) of # log(body mass) versus C-value. # # Contrasts are just *differences* between pairs of species # (or between pairs of clades). # # See the lectures for more on Phylogenetic Independent Contrasts. # # We calculate the contrasts with APE's "pic" function. # # Then we plot the contrasts, and do the linear regression # on the contrasts, in the usual way. # bodysize_PICs = pic(x=log10(datadf$mass_g), phy=tr) Cvalue_PICs = pic(x=datadf$Cvalue, phy=tr) linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE) title("Birds+mammals: PICs of log(body mass) vs. genome size") # Q13: What is the estimated slope of this relationship? # Q14: The null model is that the slope m = 0.0, i.e. no relationship between # contrast in body mass and contrast in genome size. Is the estimated slope # statistically significantly different from 0.0? # Q15: What is the reported p-value? Does does this p-value indicate a # statistically significant relationship? # Q16: How is it possible that plotting body mass versus C-value could # produce an extremely statistically significant relationship, but plotting body-size # contrasts versus C-value contrasts could produce no statistically significant relationship? # Color dots by Linnaean class cols = rep("black", times=nrow(datadf)) TF = datadf$Class == "Mammalia" cols[TF] = "blue" TF = datadf$Class == "Aves" cols[TF] = "red" linear_regression_plot(x=log10(datadf$mass_g), y=datadf$Cvalue, xlabel="Birds+mammals: Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(datadf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Birds+Mammals: body mass vs. C-value") # Q17: How many species are included in datadf? nrow(datadf) # Q18: What may caused the apparently highly significant relationship between # body mass and C-value in the raw data? ##################################################################### # PART II - DO YOUR OWN INTERPRETATION # # # Below, we do a similar analysis within mammals, and # within birds. # # Then we do an analysis comparing red-blood-cell size # with C-value, focusing on amphibians. # # Examine the results, and in the short answer box, answer these questions: # # Q19: Answer with 1-3 sentences each. (Paste the questions in the # short answer box, then insert your answer after each one.) # # 19A. What do the raw-data regressions in within mammals # and within birds seem to suggest about the relationship between # body mass and C-value? # # 19B. What do the PIC analyses suggest about these relationships? # # 19C. Do the PIC analyses falsify the idea that body mass might have # some relationship to C-value? Why or why not? # # 19D. What do the raw-data and PIC regressions suggest about the # relationship between cell volume and genome size in amphibians? # # 19E. Do you think cell volume and genome size have about the same # relationship across vertebrate classes, or do different rules # seem to apply in different groups? What might explain the # differences? # # 19F. How would you reply to the assertion "humans have very big genomes, # this must have something to do with why they are so advanced and complex"? # ##################################################################### ####################################################### # Subset for just mammals or just birds ####################################################### # Load mammal-specific and bird-specific trees mammstr = read.tree(file="http://phylo.wikidot.com/local--files/explore-genome-size-data/mammal_tree.newick") birdstr = read.tree(file="http://phylo.wikidot.com/local--files/explore-genome-size-data/bird_tree.newick") # Subset and order tables TF = datadf$Class == "Mammalia" mammsdf = datadf[TF,] mammsdf = mammsdf[match(mammstr$tip.label, table=mammsdf$name),] mammsdf$name[1:10] mammstr$tip.label[1:10] TF = datadf$Class == "Aves" birdsdf = datadf[TF,] birdsdf = birdsdf[match(birdstr$tip.label, table=birdsdf$name),] birdsdf$name[1:10] birdstr$tip.label[1:10] ####################################################### # Mammals-only analysis ####################################################### cols = rep("black", times=nrow(mammsdf)) linear_regression_plot(x=log10(mammsdf$mass_g), y=mammsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(mammsdf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Mammals: body mass vs. C-value") ####################################################### # Phylogenetic independent contracts - within mammals ####################################################### bodysize_PICs = pic(x=log10(mammsdf$mass_g), phy=mammstr) Cvalue_PICs = pic(x=mammsdf$Cvalue, phy=mammstr) linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE) title("Mammals: body mass contrasts vs. C-value contrasts") # Color in a particular clade cols = rep("black", times=nrow(mammsdf)) cols[mammsdf$Order == "Chiroptera"] = "blue" linear_regression_plot(x=log10(mammsdf$mass_g), y=mammsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(mammsdf$mass_g)), ylim=c(0,10), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Mammals: body mass vs. C-value") # Add a legend xval = 6 yval = 9 points(x=xval, y=yval, col="blue") text(x=xval, y=yval, col="blue", label="Chiroptera", pos=4) points(x=xval, y=yval+0.4, col="black") text(x=xval, y=yval+0.4, col="black", label="other mammals", pos=4) points(x=log10(mammsdf$mass_g)[rownum], y=mammsdf$Cvalue[rownum], pch="*", col="green3", cex=4) points(x=xval, y=yval-0.4, pch="*", col="green3", cex=4) text(x=xval, y=yval-0.4, col="green3", label="Homo sapiens", pos=4) ####################################################### # Birds-only analysis ####################################################### cols = rep("black", times=nrow(birdsdf)) linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Birds: body mass vs. C-value") ####################################################### # Phylogenetic independent contracts ####################################################### bodysize_PICs = pic(x=log10(birdsdf$mass_g), phy=birdstr) Cvalue_PICs = pic(x=birdsdf$Cvalue, phy=birdstr) linear_regression_plot(x=bodysize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of body mass (log10 of grams)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(bodysize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE) title("Birds: body mass contrasts vs. C-value contrasts") ####################################################### # Birds-only analysis - coloring in a particular clade ####################################################### # Color in a particular clade cols = rep("black", times=nrow(birdsdf)) cols[birdsdf$Family == "Trochilidae"] = "blue" linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Birds: body mass vs. C-value") # Add a legend xval = 4.5 yval = 2.0 points(x=xval, y=yval, col="blue") text(x=xval, y=yval, col="blue", label="Trochilidae", pos=4) points(x=xval, y=yval-0.05, col="black") text(x=xval, y=yval-0.05, col="black", label="other birds", pos=4) ####################################################### # Birds-only analysis - coloring in a few more clades ####################################################### # Color by group cols = rep("black", times=nrow(birdsdf)) cols[birdsdf$Family == "Trochilidae"] = "blue" # point characters (pch) pchs = rep("o", times=nrow(birdsdf)) # Highlight a few of the big birds cols[birdsdf$Family == "Struthionidae"] = "green4" pchs[birdsdf$Family == "Struthionidae"] = "S" cols[birdsdf$Family == "Dromaiidae"] = "green2" pchs[birdsdf$Family == "Dromaiidae"] = "D" linear_regression_plot(x=log10(birdsdf$mass_g), y=birdsdf$Cvalue, xlabel="Body mass (log10 of grams)", ylabel="C-value (picograms)", tmppch=pchs, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(log10(birdsdf$mass_g)), ylim=minmax_pretty(birdsdf$Cvalue), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Birds: body mass vs. C-value") # Add a legend xval = 4.5 yval = 2.0 points(x=xval, y=yval, col="blue") text(x=xval, y=yval, col="blue", label="Trochilidae", pos=4) points(x=xval, y=yval-0.05, col="green4") text(x=xval, y=yval-0.05, col="green4", label="S=Struthionidae", pos=4) points(x=xval, y=yval-0.10, col="green2") text(x=xval, y=yval-0.10, col="green2", label="D=Dromaiidae", pos=4) points(x=xval, y=yval-0.15, col="black") text(x=xval, y=yval-0.15, col="black", label="other birds", pos=4) ####################################################### # Does C-value correlate with Cell Size? ####################################################### # # Here is a dataset of species for which C-value Cell-size were # both available. This data is derived from T. Ryan Gregory's # Genome Size Database, used with permission: # # Gregory, T. Ryan (2020). Animal Genome Size Database. http://www.genomesize.com . # # C-value is the haploid genome size in picograms. 1 picogram of DNA is roughly # 1 billion bases, i.e. 1 Gigabase or Gb. The human haploid genome size is # about 3.3 Gb. # # Cell Size is measured as mean corpuscular volume ("MCV"). "Corpuscles" is another # word for erythrocytes, i.e. red blood cells. # # We are just going to use MCV, but below are some notes from Gregory's database # to indicate how cell size is measured: ####################################################### # Notes on measuring cell size from the Animal Genome Size Database: # ==================================================================== # http://www.genomesize.com/cellsize/amphibians.htm # # NOTE: These are size measurements for amphibian erythrocytes (red blood cells). # Erythrocytes in amphibians are elliptical, and as such two # different diameters are provided: # # CLD ("cell long diameter") # CSD ("cell short diameter") # NLD ("nucleus long diameter") # NSD ("nucleus short diameter") are also provided where available. # # In both cases, the diameters were measured from dry smears. These were used to calculate # dry cell area (CA) and # dry nucleus area (NA) # using the equation for elliptical area: CA = pi * (CLD/2) * (CSD/2). # # MCV ("mean corpuscular volume") is wet volume and is usually measured # with a Coulter counter. # DCV ("dry cell volume") and # DNV ("dry nucleus volume") # # have occasionally been measured from dry smears, and are also provided. # These various measurements (e.g., CA and MCV) should not be compared to each other, # so pick only one for any analyses you wish to do, or run them separately. # # Diameters are in µm, areas are in µm2, and volumes are in µm3. # # # Mammal erythrocyte sizes # http://www.genomesize.com/cellsize/mammals.htm # # NOTE: These are size measurements for mammalian erythrocytes (red blood cells). # Erythrocytes in mammals are unique among vertebrates in that they are enucleated # (i.e., contain no nuclei) and circular rather than elliptical (with the exception # of some artiodactyls). # # Two different measurements are given. # First, DCD ("dry cell diamater"), measured from dry blood smears, and # MCV ("mean corpuscular volume"), is wet volume # and is usually measured with a Coulter counter. # The few diameters measured from wet smears have been converted # to estimated dry values as in Gregory (2000). Diameters are in µm and # volumes are in µm3. # ==================================================================== ####################################################### # Plotting cell size vs. genome size ####################################################### # Download the data cell_size_Cvalue_fn = "http://phylo.wdfiles.com/local--files/explore-genome-size-data/verts_cs_cv_v1.txt" verts_cs_cv = read.table(file=cell_size_Cvalue_fn, header=TRUE, strip.white=TRUE, sep="\t", stringsAsFactors=FALSE) head(verts_cs_cv) dim(verts_cs_cv) # Plot the data (non-logged) colvals = rep("black", times=nrow(verts_cs_cv)) # point colors (col) pchvals = rep(1, times=nrow(verts_cs_cv)) # point characters (pch) plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals) title("Erythrocyte cell volume vs. C-value") # Q: Does there appear to be a relationship between cell volume and C-value? # Plot the data (logging the axes) colvals = rep("black", times=nrow(verts_cs_cv)) # point colors (col) pchvals = rep(1, times=nrow(verts_cs_cv)) # point characters (pch) plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals, log="xy") title("Erythrocyte cell volume vs. C-value (axes logged)") # Q: Does there appear to be a relationship between cell volume and C-value? colvals = rep("black", times=nrow(verts_cs_cv)) # point colors (col) pchvals = rep(1, times=nrow(verts_cs_cv)) # point characters (pch) pchvals[verts_cs_cv$Class == "Fishes"] = "F" colvals[verts_cs_cv$Class == "Fishes"] = "black" pchvals[verts_cs_cv$Class == "Amphibians"] = "A" colvals[verts_cs_cv$Class == "Amphibians"] = "green3" pchvals[verts_cs_cv$Class == "Reptiles"] = "R" colvals[verts_cs_cv$Class == "Reptiles"] = "red" pchvals[verts_cs_cv$Class == "Birds"] = "B" colvals[verts_cs_cv$Class == "Birds"] = "blue" pchvals[verts_cs_cv$Class == "Mammals"] = "M" colvals[verts_cs_cv$Class == "Mammals"] = "grey50" # Plot the data, with letter representing the taxa plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals) title("Erythrocyte cell volume vs. C-value") # Plot the data, with letter representing the taxa (logging the axes) plot(x=verts_cs_cv$cs, y=verts_cs_cv$cv, pch=pchvals, xlab="Cell size (µm3)", ylab="C-value", col=colvals, log="xy") title("Erythrocyte cell volume vs. C-value (axes logged)") # Add an orange asterisk for humans points(x=90, y=3.3, pch="*", col="orange2", cex=5) text(x=90, y=20.0, labels="YOU ARE HERE", pos=3, col="orange2", cex=2.0) arrows(x0=90, x1=90, y0=20, y1=4.5, col="orange2", lwd=3, length=0.1) # Amphibian tree atr_fn = "http://phylo.wikidot.com/local--files/explore-genome-size-data/Pyron_amphibian_tree_subset.newick" atr = read.tree(atr_fn) atr # Let's look at the relationship between cell size and cell volume in amphibians ####################################################### # Amphibians analysis ####################################################### spnames = verts_cs_cv$Species spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names rows_in_tr_TF = spnames %in% atr$tip.label # TRUE/FALSE test for spnames in atr spnames = spnames[rows_in_tr_TF] verts_cs_cv$Species[rows_in_tr_TF] cellsize = verts_cs_cv$cs[rows_in_tr_TF] names(cellsize) = spnames cval = verts_cs_cv$cv[rows_in_tr_TF] names(cval) = spnames ####################################################### # Plot cell-size vs. C-value ####################################################### cols = rep("black", times=length(cellsize)) # dot colors linear_regression_plot(x=cellsize, y=cval, xlabel="Cell size (µm3)", ylabel="C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize), ylim=minmax_pretty(cval), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Amphibians: cell size vs. C-value") ####################################################### # Phylogenetic independent contracts ####################################################### cellsize_PICs = pic(x=cellsize, phy=atr) Cvalue_PICs = pic(x=cval, phy=atr) linear_regression_plot(x=cellsize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of cell size (µm3)", ylabel="P.I.C.s of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE) title("Amphibians: cell-size contrasts vs. C-value contrasts") ####################################################### # Plot log10(cell-size) vs. log10(C-value) ####################################################### spnames = verts_cs_cv$Species spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names rows_in_tr_TF = spnames %in% atr$tip.label # TRUE/FALSE test for spnames in atr spnames = spnames[rows_in_tr_TF] verts_cs_cv$Species[rows_in_tr_TF] cellsize = log10(verts_cs_cv$cs[rows_in_tr_TF]) names(cellsize) = spnames cval = log10(verts_cs_cv$cv[rows_in_tr_TF]) names(cval) = spnames cols = rep("black", times=length(cellsize)) # dot colors linear_regression_plot(x=cellsize, y=cval, xlabel="log of Cell size (µm3)", ylabel="log of C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize), ylim=minmax_pretty(cval), slope1=FALSE, intercept_in_legend=TRUE, col=cols) title("Amphibians: log10(cell size) vs. log10(C-value)") ####################################################### # Phylogenetic independent contracts # Plot constrasts of log10(cell-size) vs. constrasts of log10(C-value) ####################################################### ####################################################### cellsize_PICs = pic(x=cellsize, phy=atr) Cvalue_PICs = pic(x=cval, phy=atr) linear_regression_plot(x=cellsize_PICs, y=Cvalue_PICs, xlabel="P.I.C.s of log cell size (µm3)", ylabel="P.I.C.s of log C-value (picograms)", tmppch=1, printall=TRUE, tmplinecol="black", tmplty=1, tmplwd=1, plottext=TRUE, legend_x="topleft", legend_y=NULL, xlim=minmax_pretty(cellsize_PICs), ylim=minmax_pretty(Cvalue_PICs), slope1=FALSE, intercept_in_legend=TRUE) title("Amphibians: log10(cell-size) contrasts vs. log10(C-value) contrasts") ####################################################### # Ancestral character estimation (ACE) for mammals ####################################################### library(phytools) # Get mammal data, remove NAs NAs_TF1 = is.na(verts_cs_cv$cs) == FALSE NAs_TF2 = is.na(verts_cs_cv$cv) == FALSE spnames = verts_cs_cv$Species spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names rows_in_tr_TF = spnames %in% mammstr$tip.label # TRUE/FALSE test for spnames in mammstr keepTF = (NAs_TF1 + NAs_TF2 + rows_in_tr_TF) == 3 spnames = spnames[keepTF] cellsize = verts_cs_cv$cs[keepTF] names(cellsize) = spnames cval = verts_cs_cv$cv[keepTF] names(cval) = spnames # Subset mammal tree tips_to_keep_TF = mammstr$tip.label %in% spnames mtr = drop.tip(mammstr, tip=mammstr$tip.label[tips_to_keep_TF==FALSE]) # Using phytools, estimate ancestral cell size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=mtr, x=cellsize) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=mtr, x=cellsize, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=mtr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)") title("Mammals: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75) # Using phytools, estimate ancestral genome size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=mtr, x=cval) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=mtr, x=cval, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=mtr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Genome size (pg)") title("Mammals: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75) ####################################################### # Ancestral character estimation (ACE) for birds ####################################################### library(phytools) # Get mammal data, remove NAs NAs_TF1 = is.na(verts_cs_cv$cs) == FALSE NAs_TF2 = is.na(verts_cs_cv$cv) == FALSE spnames = verts_cs_cv$Species spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names rows_in_tr_TF = spnames %in% birdstr$tip.label # TRUE/FALSE test for spnames in birdstr keepTF = (NAs_TF1 + NAs_TF2 + rows_in_tr_TF) == 3 spnames = spnames[keepTF] cellsize = verts_cs_cv$cs[keepTF] names(cellsize) = spnames cval = verts_cs_cv$cv[keepTF] names(cval) = spnames # Subset mammal tree tips_to_keep_TF = birdstr$tip.label %in% spnames btr = drop.tip(birdstr, tip=birdstr$tip.label[tips_to_keep_TF==FALSE]) # Using phytools, estimate ancestral cell size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=btr, x=cellsize) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=btr, x=cellsize, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=btr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)") title("Birds: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75) # Using phytools, estimate ancestral genome size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=btr, x=cval) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=btr, x=cval, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=btr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Genome size (pg)") title("Birds: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75) ####################################################### # Ancestral character estimation (ACE) for amphibians ####################################################### library(phytools) # Re-set the data spnames = verts_cs_cv$Species spnames = gsub(pattern=" ", replacement="_", x=spnames) # add underscores to species names rows_in_tr_TF = spnames %in% atr$tip.label # TRUE/FALSE test for spnames in atr spnames = spnames[rows_in_tr_TF] verts_cs_cv$Species[rows_in_tr_TF] cellsize = verts_cs_cv$cs[rows_in_tr_TF] names(cellsize) = spnames cval = verts_cs_cv$cv[rows_in_tr_TF] names(cval) = spnames # Using phytools, estimate ancestral cell size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=atr, x=cellsize) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=atr, x=cellsize, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=atr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Cell volume (µm3)") title("Amphibians: Ancest. state estimation for cell volume (µm3)", cex=0.15, line=-0.75) # Using phytools, estimate ancestral genome size values, # under the Brownian motion model for continuous traits fit = phytools::fastAnc(tree=atr, x=cval) # projection of the reconstruction onto the edges of the tree obj = contMap(tree=atr, x=cval, plot=FALSE) obj # Plot the tree, with colors representing mean # ancestral state values tree=atr plot(x=obj, type="phylogram", legend=0.7*max(nodeHeights(tree)), sig=2, fsize=c(0.7,0.9), leg.txt="Genome size (pg)") title("Amphibians: Ancest. state estimation for genome size (C-value)", cex=0.15, line=-0.75) # Nick's machine - make PDFs of all plots # (closing PDFs) # dev.off() # cmdstr = paste0("open ", pdffn) # system(cmdstr)