| Title: | Working with Metabarcoding Data in a Tidy Format |
|---|---|
| Description: | A series of R functions that come in handy while working with metabarcoding data. The reasoning of doing this is to have the same functions we use all the time stored in a curated, reproducible way. In a way it is all about putting together the grammar of the 'tidyverse' from Wickham et al.(2019) <doi:10.21105/joss.01686> with the functions we have used in community ecology compiled in packages like 'vegan' from Dixon (2003) <doi:10.1111/j.1654-1103.2003.tb02228.x> and 'phyloseq' McMurdie & Holmes (2013) <doi:10.1371/journal.pone.0061217>. The package includes functions to read sequences from FAST(A/Q) into a tibble ('fasta_reader' and 'fastq_reader'), to process 'cutadapt' Martin (2011) <doi:10.14806/ej.17.1.200> 'info-file' output. When it comes to sequence counts across samples, the package works with the long format in mind (a three column 'tibble' with Sample, Sequence and counts ), with functions to move from there to the wider format. |
| Authors: | Ramón Gallego Simón [aut, cre] |
| Maintainer: | Ramón Gallego Simón <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.0 |
| Built: | 2026-06-02 08:26:44 UTC |
| Source: | https://github.com/cran/eDNAfuns |
Description: Placeholder for AMPURE dataset.
AMPUREAMPURE
A tibble/data.frame with X rows and Y columns:
Description of column1
Description of column2
...
Generated by the Author
An abundance table in long format.
ASV_tableASV_table
A data frame with 9 rows and 3 variables:
Character vector, last digit includes PCR replicate
The sequence found, hashed using sha1 for consistency
Number of times that hash was found in that sample
Contains the number of reads from unique ASVs across different samples.
Generated by the Author
Given a DNA sequence (either a 'DNAString' object or a character string), this function translates it using a specified genetic code and counts the number of stop codons ('*') in the resulting amino acid sequence. Alternatively, the translated sequence itself can be returned.
count_stop_codons( sequence = NULL, format = "DNAString", codon = 1, dictionary = 5, return = "count" )count_stop_codons( sequence = NULL, format = "DNAString", codon = 1, dictionary = 5, return = "count" )
sequence |
A DNA sequence, either as a [Biostrings::DNAString] object or as a character string. |
||||||||||||||||||||||||||||||||||||||||||
format |
Input format, either '"DNAString"' (default) or '"character"'. If '"character"', the sequence must consist of 'A', 'C', 'G', 'T' only. |
||||||||||||||||||||||||||||||||||||||||||
codon |
Integer giving the starting codon position (usually 1, 2, or 3). |
||||||||||||||||||||||||||||||||||||||||||
dictionary |
Integer specifying which translation code to use. See [Biostrings::GENETIC_CODE_TABLE] for all options. Common examples:
|
||||||||||||||||||||||||||||||||||||||||||
return |
Either '"count"' (default) to return the number of stop codons, or any other value to return the translated amino acid sequence. |
The function uses [Biostrings::translate()] with the specified starting position and genetic code. Translation warnings are suppressed.
If 'return = "count"', an integer giving the number of stop codons. Otherwise, a character string with the translated amino acid sequence.
Ramon Gallego, 2021
if (requireNamespace("Biostrings", quietly = TRUE)) { library(Biostrings) seq <- DNAString("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG") count_stop_codons(seq, codon = 1, dictionary = 1) count_stop_codons(seq, codon = 1, dictionary = 1, return = "translation") }if (requireNamespace("Biostrings", quietly = TRUE)) { library(Biostrings) seq <- DNAString("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG") count_stop_codons(seq, codon = 1, dictionary = 1) count_stop_codons(seq, codon = 1, dictionary = 1, return = "translation") }
This function takes a long ASV table (tibble) and creates a new column with the relative proportions scaled to their maximum, to avoid the dominance of species with better primer efficiency
eDNAindex( tibble, Sample_column = Sample, OTU_column = Hash, Counts_column = nReads, Biological_replicate_column = NULL, ... )eDNAindex( tibble, Sample_column = Sample, OTU_column = Hash, Counts_column = nReads, Biological_replicate_column = NULL, ... )
tibble |
A tibble the ASV table in a long format, with at least three columns, Sample_column, OTU_column, Counts_column |
Sample_column |
The column indicating the sample. |
OTU_column |
The column indicating the OTU/ASV. |
Counts_column |
The column (numeric) with the number of sequences from that OTU in that sample. |
Biological_replicate_column |
The column representing replicate measurements of Sample_column. |
... |
Any extra columns that want to be added to the final dataset (either taxonomical information about OTUs, or metadata information about the samples) |
A tibble with at least the columns Sample_column, OTU_column and Normalized.reads
data("training.ASV.table") eDNAindex(training.ASV.table, Sample_column = Sample_name)data("training.ASV.table") eDNAindex(training.ASV.table, Sample_column = Sample_name)
Placeholder description for example_hashes dataset.
example_hashesexample_hashes
A vector or data frame with X rows/columns
Generated by the Author
Functions to read sequence files into a tidy data frame with one row per sequence.
fasta_reader(path_to_fasta) fastq_reader(path_to_fastq, keepQ = FALSE)fasta_reader(path_to_fasta) fastq_reader(path_to_fastq, keepQ = FALSE)
path_to_fasta |
Character. Path to a FASTA file. |
path_to_fastq |
Character. Path to a FASTQ file. |
keepQ |
Logical. If 'TRUE', keep a third column with quality scores when reading FASTQ files. Default is 'FALSE'. |
- 'fasta_reader()': A tibble with columns: - 'header': sequence identifiers (without the '>'). - 'seq': nucleotide sequences.
- 'fastq_reader()': A tibble with columns: - 'header': sequence identifiers (without the '@'). - 'seq': nucleotide sequences. - 'Qscores' (optional): quality scores, if 'keepQ = TRUE'.
fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE)fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE)
Functions to read sequence files into a tidy data frame with one row per sequence.
fasta_writer(df, sequence, header, file.out) fastq_writer(df, sequence, header, Qscores, file.out)fasta_writer(df, sequence, header, file.out) fastq_writer(df, sequence, header, Qscores, file.out)
df |
A dataframe where the sequence information is stored, one sequence per row |
sequence |
The name (unquoted) of the column where the sequence information (as characters) is stored |
header |
The name (unquoted) of the column where the header information is stored |
file.out |
Character. Path to the location where to write the file. |
Qscores |
The name (unquoted) of the column where the Quality information (encoded as characters) is stored |
- 'fasta_reader()': A tibble with columns: - 'header': sequence identifiers (without the '>'). - 'seq': nucleotide sequences.
- 'fastq_reader()': A tibble with columns: - 'header': sequence identifiers (without the '@'). - 'seq': nucleotide sequences. - 'Qscores' (optional): quality scores, if 'keepQ = TRUE'.
fasta_file <- tempfile(fileext = ".fasta") fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) fasta_writer(fasta_df, sequence=seq, header = header, file.out = fasta_file) fastq_file <- tempfile(fileext = ".fastq") fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE) fastq_writer(fastq_df, sequence=seq, header = header,Qscores= Qscores, file.out = fastq_file)fasta_file <- tempfile(fileext = ".fasta") fasta_df <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) fasta_writer(fasta_df, sequence=seq, header = header, file.out = fasta_file) fastq_file <- tempfile(fileext = ".fastq") fastq_df <- fastq_reader(system.file("extdata", "test.fastq", package="eDNAfuns"), keepQ = TRUE) fastq_writer(fastq_df, sequence=seq, header = header,Qscores= Qscores, file.out = fastq_file)
Placeholder description for IndexPCR template web address.
Index_PCRIndex_PCR
A character vector
Generated by the Author, stored in googledrive
Placeholder description for metadata dataset.
metadatametadata
A data frame with X rows and Y columns:
Name of sample
either control or treatment
either 1 or 2
Generated by the Author
Placeholder description for molarity.data dataset.
molarity.datamolarity.data
A tibble/data.frame with X rows and Y columns
Name of sample
in nM
length of amplicon in bp
in ng/l
Generated by the Author
This function takes DNA sequences and generates mutated variants. Useful for testing classification algorithms on sequences with either PCR-induced or naturally occurring mutations.
mutation(sequence = NULL, format = "bin", n.mutations = NA, prob.mutation = NA)mutation(sequence = NULL, format = "bin", n.mutations = NA, prob.mutation = NA)
sequence |
A list of DNA sequences, either as "DNAbin" objects or character vectors. |
format |
Character. Format of the input sequences. "bin" for DNAbin, "char" for character vectors. |
n.mutations |
Integer. Number of mutations to introduce per sequence. Exclusive with prob.mutation. |
prob.mutation |
Numeric. Probability of mutation per position. Exclusive with n.mutations. |
A list of mutated sequences of the same class as the input.
data("test_seqs") mutation(test_seqs, n.mutations = 2) mutation(test_seqs, prob.mutation = 0.1) seqs <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) mutation(seqs$seq, format = "char", n.mutations = 1)data("test_seqs") mutation(test_seqs, n.mutations = 2) mutation(test_seqs, prob.mutation = 0.1) seqs <- fasta_reader(system.file("extdata", "test.fasta", package="eDNAfuns")) mutation(seqs$seq, format = "char", n.mutations = 1)
l for mass
nM for molarityfunctions to translate mass into molarity and vice versa, given we are talking about double stranded DNA
it requires two inputs, the mass (or molarity) and the length of the DNA fragment
It works with the two most common concentrations used in Molecular Ecology labs
ng/l for mass
nM for molarity
ng2nM(ng, length_amplicon) nM2ng(nM, length_amplicon)ng2nM(ng, length_amplicon) nM2ng(nM, length_amplicon)
ng |
Numeric. the concentration in ng per |
length_amplicon |
Integer. The length of the DNA fragment in base pairs. |
nM |
Numeric. The concentration in nmoles per litre |
Numeric. The equivalent concentration in nmoles per litre
Numeric. The equivalent concentration in ng per L
data("molarity.data") ng2nM(ng=molarity.data$mass, length_amplicon = molarity.data$Amp_len) data("molarity.data") nM2ng(nM=molarity.data$Molarity, length_amplicon = molarity.data$Amp_len)data("molarity.data") ng2nM(ng=molarity.data$mass, length_amplicon = molarity.data$Amp_len) data("molarity.data") nM2ng(nM=molarity.data$Molarity, length_amplicon = molarity.data$Amp_len)
Placeholder description for OTU_taxonomy dataset.
OTU_taxonomyOTU_taxonomy
A dataframe of DNA sequences names and their taxonomy, with 3 rows and 7 columns
Sequence identifier
taxon info at that rank
taxon info at that rank
taxon info at that rank
taxon info at that rank
taxon info at that rank
taxon info at that rank
Generated by the Author
This function takes a tibble produced by 'fasta_reader()' or 'fastq_reader()' and plots the distribution of sequence lengths.
plot_seq_len_hist(Hash_tibble, length_col, binwidth = 1, label_interval = 5)plot_seq_len_hist(Hash_tibble, length_col, binwidth = 1, label_interval = 5)
Hash_tibble |
A tibble containing a numeric column with sequence lengths. |
length_col |
The column of 'Hash_tibble' containing sequence lengths (unquoted). |
binwidth |
Width of histogram bins. Default = 1. |
label_interval |
Interval for x-axis labels. Default = 5. |
A ggplot object.
data("example_hashes") plot_seq_len_hist(example_hashes, seq_len)data("example_hashes") plot_seq_len_hist(example_hashes, seq_len)
Placeholder description for phyloseq object.
psps
Formal class 'phyloseq' with five slots
Generated by the Author
Reads data from a Google Sheets-based indexing PCR template. You can access the template here, create a copy in your Google Drive, and then create copies for each of your multiplexing PCR experiments.
Reads data from a Google Sheets-based PCR template. You can access the template here, create your own copy, and then create one copy per PCR reaction you want to keep track of.
read_indexing_PCR(ss) read_step1_PCR(ss, trim = TRUE, name = TRUE)read_indexing_PCR(ss) read_step1_PCR(ss, trim = TRUE, name = TRUE)
ss |
Google Sheet ID or URL of the indexing PCR spreadsheet. |
trim |
Logical. If TRUE, removes rows where 'Sample' is NA (default: TRUE). |
name |
Logical. If TRUE, adds the spreadsheet name as a 'PCR' column in the sample sheet (default: TRUE). |
The function extracts sample information across all plates and returns a tidy dataframe.
Captures reagent mix, cycling conditions, and sample information.
A tibble with columns:
Well position
Sample name
Barcode assigned to the sample
Barcode set identifier
A named list with three elements:
tibble of reagents and volumes.
tibble of PCR cycling conditions.
tibble of samples with columns 'Well', 'Sample', 'Success', 'Notes', and optionally 'PCR'.
[read_step1_PCR()] for reading normal PCR spreadsheets.
[read_indexing_PCR()] for reading multiplexing PCR spreadsheets.
# Examples are not executed because the function requires identification in Google ## Not run: data("Index_PCR") read_indexing_PCR(Index_PCR) ## End(Not run) # Examples are not executed because the function requires identification in Google ## Not run: data("template_PCR") read_step1_PCR(template_PCR) ## End(Not run)# Examples are not executed because the function requires identification in Google ## Not run: data("Index_PCR") read_indexing_PCR(Index_PCR) ## End(Not run) # Examples are not executed because the function requires identification in Google ## Not run: data("template_PCR") read_step1_PCR(template_PCR) ## End(Not run)
Read the '–info-file' output generated by cutadapt for adapter trimming. Column structure differs depending on whether the input came from Nanopore (ONT) or Illumina sequencing.
read_info_file(file, delim = "\t", col_select = NULL, ...)read_info_file(file, delim = "\t", col_select = NULL, ...)
file |
Path to the cutadapt info file (TSV). |
delim |
Field delimiter (default = "\t"). |
col_select |
Optional tidyselect specification of which columns to read. |
... |
Additional arguments passed to [vroom::vroom()]. |
- 'read_info_file()' (Illumina) returns columns: 'Seq.id', 'n_errors', 'start_adap', 'end_adap', 'seq_before_adap', 'matching_seq', 'seq_after_adap', 'adap_name', 'QScores_seq_before', 'QScores_matching', 'QScores_after'.
A tibble with the parsed 'cutadapt' information. Column names are standardized.
test_info <- system.file("extdata", "cutadapt_info_illumina.txt", package="eDNAfuns") df_illumina <- read_info_file(test_info)test_info <- system.file("extdata", "cutadapt_info_illumina.txt", package="eDNAfuns") df_illumina <- read_info_file(test_info)
This function takes a tibble and create human readable contingency tables from two variables, either by showing number of cases in each combination or weighted by the sum of a numerical variable
tally_wide(tibble, rows, cols, wt = NULL, ...)tally_wide(tibble, rows, cols, wt = NULL, ...)
tibble |
A tibble containing at least two columns |
rows |
The column with the levels included as rows in the final table. |
cols |
The column with the levels included as columns in the final table. |
wt |
The column (numeric) whose values to add in order to fill the cells. If wt = NULL (the default), counts are returned instead of weighted sums. |
... |
Any parameters that can be passed to tally_wide 'values_fill' is a useful one |
A tibble
df <- tibble::tibble( group = c("A", "A", "B", "B", "B"), outcome = c("yes", "no", "yes", "yes", "no") ) tally_wide(df, rows = group, cols = outcome)df <- tibble::tibble( group = c("A", "A", "B", "B", "B"), outcome = c("yes", "no", "yes", "yes", "no") ) tally_wide(df, rows = group, cols = outcome)
Placeholder description for Normal PCR template web address.
template_PCRtemplate_PCR
A character vector
Generated by the Author, stored in googledrive
Placeholder description for test_seqs dataset.
test_seqstest_seqs
An object of class DNAbin of length 3.
Generated by the Author
Converts a long-format table of sequence counts into a wide community matrix (samples × taxa) suitable for vegan or other community ecology tools.
Produces a distance matrix between samples using vegan's vegdist().
Optionally applies a data transformation before distance calculation.
Returns the environmental data frame (sample metadata) from a long sequence/OTU tibble by removing taxon and abundance columns.
tibble_to_comm(long.table, taxon, Abundance, sample.name) tibble_to_dist( long.table, taxon, Abundance, sample.name, distance = "bray", transformation = NULL, ... ) tibble_to_env(long.table, taxon, Abundance, sample.name, ...)tibble_to_comm(long.table, taxon, Abundance, sample.name) tibble_to_dist( long.table, taxon, Abundance, sample.name, distance = "bray", transformation = NULL, ... ) tibble_to_env(long.table, taxon, Abundance, sample.name, ...)
long.table |
A tibble with sample, taxon, abundance, and metadata columns. |
taxon |
Column containing taxa/OTU IDs (unquoted). |
Abundance |
Column with counts/abundance values (unquoted). |
sample.name |
Column with sample IDs (unquoted). |
distance |
Distance metric to use (default = "bray"). |
transformation |
Optional transformation (e.g. "hellinger", "log"). See vegan::decostand documentation for a great explanation of all transformations |
... |
Additional metadata columns to retain. |
A numeric matrix with taxa as columns and samples as row names.
A dist object.
A tibble of unique sample-level metadata.
data("ASV_table") tibble_to_comm(ASV_table, taxon = Hash, Abundance = nReads, sample.name = sample_name) data("ASV_table") tibble_to_dist(ASV_table, taxon = Hash, Abundance = nReads, sample.name = sample_name, distance = "bray", transformation = "hellinger") data("ASV_table") data("metadata") dplyr::inner_join(ASV_table,metadata) |> tibble_to_env(taxon = Hash, Abundance = nReads, sample.name = sample_name)data("ASV_table") tibble_to_comm(ASV_table, taxon = Hash, Abundance = nReads, sample.name = sample_name) data("ASV_table") tibble_to_dist(ASV_table, taxon = Hash, Abundance = nReads, sample.name = sample_name, distance = "bray", transformation = "hellinger") data("ASV_table") data("metadata") dplyr::inner_join(ASV_table,metadata) |> tibble_to_env(taxon = Hash, Abundance = nReads, sample.name = sample_name)
This function converts a tidy ASV table, along with optional taxonomy and metadata, into a 'phyloseq' object.
Extracts ASV counts, taxonomy, and metadata from a 'phyloseq' object into tidy data frames.
tidy2phyloseq( ASV_table, OTU_taxonomy = NULL, metadata = NULL, Taxa = "sseqid", Sample = "sample_name", Reads = "nr", tree = NULL ) phyloseq2tidy(phylo_obj, Taxa = "sseqid", Sample = "sample_name", Reads = "nr")tidy2phyloseq( ASV_table, OTU_taxonomy = NULL, metadata = NULL, Taxa = "sseqid", Sample = "sample_name", Reads = "nr", tree = NULL ) phyloseq2tidy(phylo_obj, Taxa = "sseqid", Sample = "sample_name", Reads = "nr")
ASV_table |
A tidy data.frame/tibble of ASV counts. |
OTU_taxonomy |
A data.frame with OTU taxonomy. Optional. |
metadata |
A data.frame with sample metadata. Optional. |
Taxa |
Column name for OTU IDs in the output (default: "sseqid"). |
Sample |
Column name for sample IDs in the output (default: "sample_name"). |
Reads |
Column name for read counts in the output (default: "nr"). |
tree |
A phylogenetic tree of class 'phylo'. Optional. |
phylo_obj |
A phyloseq object. |
A 'phyloseq' object combining OTU table, taxonomy, metadata, and optionally a tree.
A list with three tibbles:
Long-format tibble of counts: Sample, Taxa, Reads.
Taxonomy table as tibble (NULL if none in phyloseq).
Sample metadata as tibble (NULL if none in phyloseq).
data("ASV_table") data("metadata") data("OTU_taxonomy") ps <- tidy2phyloseq(ASV_table = ASV_table, OTU_taxonomy = OTU_taxonomy, metadata = metadata, Taxa = "Hash", Reads = "nReads") data("ps") tidy_list <- phyloseq2tidy(ps)data("ASV_table") data("metadata") data("OTU_taxonomy") ps <- tidy2phyloseq(ASV_table = ASV_table, OTU_taxonomy = OTU_taxonomy, metadata = metadata, Taxa = "Hash", Reads = "nReads") data("ps") tidy_list <- phyloseq2tidy(ps)
Placeholder description for training.ASV.table dataset.
training.ASV.tabletraining.ASV.table
A data frame with 1909 rows and 4 variables:
Character vector, last digit includes PCR replicate
Which locus the sequences are from, numeric
The sequence found, hashed using sha1 for consistency
Number of times that hash was found in that sample
Generated by the Author
Placeholder description for training.metadata dataset.
training.metadatatraining.metadata
A data frame with 95 rows and 8 columns
Character vector, last digit includes PCR replicate
Biological sample
PCR replicate
Transect of origin
Stop in Transect
in the water column
latitude in decimal degrees
longitude in decimal degrees
Generated by the Author
Placeholder description for tree dataset.
treetree
An object of class phylo or similar
Generated by the Author
Placeholder description for UDI_Indices dataset.
UDI_IndicesUDI_Indices
A data frame with 384 rows and 6 columns
Character vector
Seqs in i7
Seqs to include
Seqs to include in some illumina platoforms
Seqs to include in other illumina platoforms
A, B, C, or D
Generated by the Author, data from Illumina
Creates a new Google Sheet for indexing PCRs from a template, fills in the sample information, and writes it into the correct ranges of the sheet.
write_indexing_PCR( data, name, ss_template = "1naS-F_dj4SNmND5nJ5TKhMX3TikmRS00ILKjfg_Ucgc" )write_indexing_PCR( data, name, ss_template = "1naS-F_dj4SNmND5nJ5TKhMX3TikmRS00ILKjfg_Ucgc" )
data |
A tibble or dataframe with at least the columns:
|
name |
Character. Name for the new Google Sheet that will be created. |
ss_template |
Character. ID of the template sheet to copy from (default: '"1naS-F_dj4SNmND5nJ5TKhMX3TikmRS00ILKjfg_Ucgc"'). |
This function: 1. Creates a new Google Sheet with the given 'name'. 2. Copies the template sheet into it. 3. Splits the input 'data' by plate column. 4. Writes each split dataframe into its designated range of the sheet.
A Google Sheet object (as returned by [googlesheets4::gs4_create()]) with the data written into the correct plate layout.
#Examples are not executed because the function requires identification in Google ## Not run: my_data <- tibble::tibble( Well = c("A1","A2"), Sample = c("Sample1","Sample2"), Column = c(1,2) ) write_indexing_PCR(my_data, "PCR_001") ## End(Not run)#Examples are not executed because the function requires identification in Google ## Not run: my_data <- tibble::tibble( Well = c("A1","A2"), Sample = c("Sample1","Sample2"), Column = c(1,2) ) write_indexing_PCR(my_data, "PCR_001") ## End(Not run)