Qiime 2 Workflow
Note
These commands are for Qiime2-2024.5. See Qiime2 Docs or use qiime2 --help
for most up-to-date flags.
The latest version of 2024.10 includes recent releases of SILVA/GTDB databases. It's available on the cluster as a test version and can be loaded through module load test-qiime2
instead of module load qiime2/2024.5
.
1. Set up read directory
Make a specific directory in a project folder to contain reads and barcode files. Copy raw reads into this directory.
mkdir reads
cp Undetermined_S0_L001_R1_001.fastq.gz ./reads
cp Undetermined_S0_L001_R2_001.fastq.gz ./reads
cp Undetermined_S0_L001_I1_001.fastq.gz ./reads
2. Rename forward, reverse, and barcode fastq.gz files
Sometimes sequencing centers give us weirdly named files. This makes it easier to remember which file is which.
cd reads
mv Undetermined_S0_L001_I1_001.fastq.gz barcodes.fastq.gz
mv Undetermined_S0_L001_R1_001.fastq.gz forward.fastq.gz
mv Undetermined_S0_L001_R2_001.fastq.gz reverse.fastq.gz
cd ..
3. Import Sequence files into Qiime 2
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=300
#SBATCH --partition=week
module load qiime2/2024.5
qiime tools import \
--type EMPPairedEndSequences \
--input-path reads/ \
--output-path emp-paired-end-sequences.qza
4. Demultiplex the sequences
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=300
#SBATCH --partition=week
module load qiime2/2024.5
qiime demux emp-paired \
--m-barcodes-file metadata.txt \
--m-barcodes-column barcode-sequence \ --p-no-rev-comp-mapping-barcodes \
--p-no-golay-error-correction \
--i-seqs emp-paired-end-sequences.qza \ --o-per-sample-sequences demux.qza \ --o-error-correction-details demux-details.qza
Breakdown:
Flag | Description |
---|---|
--m-barcodes-file |
a text file that contains barcodes for the samples we want in the feature table. It can also include other important metadata. Make sure the barcode column is named barcode-sequence (or matches whatever is indicted in next --m-barcodes-column ) |
--m-barcodes-column |
barcode-sequence indicates that in the mapping file, barcodes are in a column labeled barcode-sequence |
--p-no-rev-comp-mapping-barcodes |
specifies barcodes don't need to be reverse complemented. This option will vary by sequencing center. If unsure, process reads with the flag and if sequence counts are very low, run again without and compare. This flag will be used for reads from CU Boulder sequencing center. |
--p-no-golay-error-correction |
doesn't perform 12nt Golay error correction on barcode reads. This option will vary by sequencing center. This flag will be used for reads from CU Boulder sequencing center. |
--i-seqs |
.qza file generated in step 3 |
--o-per-sample-sequences |
output file containing reads separated by sample |
--o-error-correction-details |
provides stats about the error correction process on your samples |
5. Visualize sequence reads and sequence quality
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=1G
#SBATCH --partition=week
module load qiime2/2024.5
qiime demux summarize \
--i-data demux.qza \
--o-visualization demux.qzv
You can visualize demux.qzv
in the QIIME2 view tool interface or
through qiime tools view demux.qzv
on Desktop mode - Open OnDemand.
- The overview tab allows you to see the number of forward and reverse reads for each sample.
- Interactive Quality Plot allows you to see the quality scores of the forward and reverse reads based on sequence base pairs. This is useful for deciding where to trim in the following step. Frequently, quality scores will drop before the end of the sequencing run. You will want to choose trim lengths to remove low-quality reads but still allow enough overlap that forward and reverse reads can be merged.
If you are unsure of your sequencing length, this visualization allows you to see how long the forward and reverse reads are as well as look at the average quality of each base pair based on its position in the sequence. We generally see poorer read quality at the beginning and end of reads so we trim that portion away.
6. Run DADA2 to denoise and merge sequence reads
DADA2 is a pipeline that detects and removes Illumina sequencing errors when possible and merges forward and reverse reads. The output is called a feature table (table.qza) consisting of amplicon sequence variants (ASVs) and their count per sample. Additionally, the rep-seqs.qza file contains each ASV and its entire sequence (this file does not have any count or sample info, only sequences matched with ASV id)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=1G
#SBATCH --partition=week
module load qiime2/2024.5
mkdir sample_stats
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 250 \
--p-trunc-len-r 250 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats sample_stats/denoising-stats.qza
Breakdown:
Flag | Description |
---|---|
--p-trim-left-f |
the trim length at the beginning of the forward read |
--p-trim-left-r |
the trim length at the beginning of the reverse read |
--p-trunc-len-f |
the truncation length at the end of the forward read |
--p-trunc-len-r |
the truncation length at the end of the reverse read |
--o-table |
output table that has the number of each ASV per sample |
--o-representative-sequences |
a file that contains every unique ASV |
Optional: Run the following code to get a qzv table of the output. This will allow you to see how many reads were retained during the dada2 filtering step.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=1G
#SBATCH --partition=week
module load qiime2/2024.5
qiime metadata tabulate \
--m-input-file sample_stats/denoising-stats.qza \
--o-visualization sample_stats/denoising-stats.qzv
You can visualize denoising-stats.qzv
in the QIIME2 view tool interface or
through qiime tools view denoising-stats.qzv
on Desktop mode - Open OnDemand.
7. Assign taxonomy to reads
Note
This requires a trained classifier. See below for how to train a classifier using a database and the Earth Microbiome Primers (EMP)(515f/806R).
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=1G
#SBATCH --partition=week
module load qiime2/2024.5
qiime feature-classifier classify-sklearn \
--i-classifier SILVA_138.1_SSURef_NR99_tax_515F806R_CLASSIFIER.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy_silva138_16S.qza
8. Export taxonomy to a table
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:10:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=125
#SBATCH --partition=week
module load qiime2/2024.5
qiime tools export \
--input-path taxonomy_silva138_16S.qza \
--output-path feature_taxonomy_out
9. Export ASV table to a table
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=0-00:30:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mem=125
#SBATCH --partition=week
module load qiime2/2024.5
qiime tools export \
--input-path table.qza \
--output-path feature_table_out
Training a classifier with EMP primers
1. Get Database Files
The default version/target is SILVA 138.1 SSUREF_NR99 on v2024.5, SILVA 138.2 SSUREF_NR99 for v2024.10.
Note
In case you're using a Slurm script, it is recommended to input 2G for memory allocation.
qiime rescript get-silva-data \
--o-silva-sequences SILVA_138.1_SSURef_NR99_RNAseq.qza \
--o-silva-taxonomy taxmap_slv_ssu_ref_nr_138.1.qza
--o-silva-sequences
gets FeatureData[RNASequence], which needs to be transcribed into FeatureData[Sequence]
Note
In case you're using a Slurm script, it is recommended to input 1G for memory allocation.
The default version/target is GTDB 214.1 Bacteria / Archaea SpeciesReps for v2024.5, GTDB 220.0 Bacteria / Archaea SpeciesReps for v2024.10
Note
In case you're using a Slurm script, it is recommended to input 1G for memory allocation.
2. Extract Primer regions from the sequences
Note
In case you're using a Slurm script, it is recommended to input 1G for memory allocation.
Note
In case you're using a Slurm script, it is recommended to input 1G for memory allocation.
3. Train the classifier
Note
In case you're using a Slurm script, it is recommended to input 10G for memory allocation.
Note
In case you're using a Slurm script, it is recommended to input 30G for memory allocation.