
MiND Analysis pipeline developed by: Andreas B. Diendorfer, PhD - Senior Scientist Computational Biology

1 Introduction

This report contains the condensed results from the mRNA sequencing experiment. Additional data (e.g. raw read data, detailed mappings) are available upon request but are not included in this summary report.

  • Project ID: XXX
  • Customer: TAmiRNA
  • TAmiRNA project manager: John Doe
  • Report generated: 2023-01-30 (16:54:01 GMT +0100) by system user “marianne”

Analysis parameters:

  • Species: Homo sapiens (hsa, TXID: 9606)
  • Minimum read length: 17nt
  • Reads quality cutoff: 30 (phred quality score)
  • Significance level: 0.05

Tabular data can be filtered or sorted using the fields and options at the top of each table. To export the data for further processing, please select the desired format (Excel or CSV) at the table.

Nucleic acid species: mRNA

1.1 Sample table

1.2 Gene mappings

1.2.1 Genes RPM table

This table contains all identified genes in each sample. Read counts are normalized to reads per 1 million mapped reads (RPM).
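The RPM normalization can be illustrated with a minimal Python sketch. It assumes that RPM is a gene's raw read count divided by the total number of reads mapped to genes in that sample, multiplied by one million; the pipeline's exact denominator may differ. The function name `rpm_normalize` and the toy counts are hypothetical and not part of the pipeline.

```python
def rpm_normalize(counts):
    """Scale the raw gene counts of one sample to reads per million (RPM).

    Assumption: the denominator is the sum of all reads mapped to genes
    in this sample; the pipeline may use a different total.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Toy sample with hypothetical gene names and raw counts (1000 reads in total).
sample = {"geneA": 150, "geneB": 50, "geneC": 800}
rpm = rpm_normalize(sample)
print(rpm["geneA"])  # 150 of 1000 reads, scaled to one million
```

After normalization, the RPM values of one sample add up to one million, which makes samples of different sequencing depth comparable.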

Please use the download link provided underneath the table to save the gene mappings data. The buttons provided at the top of the table can also be used, but won’t include detailed group information of the samples.

Due to the size of this data it was not embedded in the report but can be found in the file genesRPMTable.xlsx

1.2.2 Genes raw reads table

This table contains all identified genes in each sample. These are raw read counts (without any normalization).

Please use the download link provided underneath the table to save the gene mappings data. The buttons provided at the top of the table can also be used, but won’t include detailed group information of the samples.

Due to the size of this data it was not embedded in the report but can be found in the file genesTable.xlsx

1.2.3 Identified mRNAs comparison

This graph shows the number of distinct mRNAs identified in each sample.

Due to the size of this data it was not embedded in the report but can be found in the file IdentifiedmRNAsComparison.xlsx

1.3 Heatmaps

Data are based on RPM-normalized reads and scaled using the unit variance method for visualization in heatmaps. Clustering is done using the average linkage method of pheatmap, with distances calculated as correlations.
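As an illustration of the two steps named above, the following Python sketch scales one gene's values to unit variance and computes a correlation-based distance. It assumes that scaling is done per gene with the sample standard deviation (n - 1) and that the distance is 1 minus the Pearson correlation; both are assumptions about the settings, and the function names are hypothetical.

```python
from statistics import mean, stdev


def unit_variance_scale(values):
    """Center one gene's RPM values to mean 0 and scale them to standard deviation 1."""
    m, s = mean(values), stdev(values)  # stdev is the sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("gene is constant across samples")
    return [(v - m) / s for v in values]


def correlation_distance(x, y):
    """Distance between two profiles as 1 - Pearson correlation (assumed definition)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return 1 - cov / norm


gene1 = [2.0, 4.0, 6.0]     # hypothetical RPM values in three samples
gene2 = [10.0, 20.0, 30.0]  # same pattern on a different scale
print(unit_variance_scale(gene1))
print(correlation_distance(gene1, gene2))
```

Two genes with the same expression pattern but different absolute levels end up with a distance close to 0, which is why they cluster together in the heatmap.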

1.3.1 Top genes

This heatmap shows only the top 10% of genes, ranked by coefficient of variation (CV%). An additional filter was introduced to increase robustness: only genes that show an RPM above 5 in at least a fraction of 1 / n(groups) of the samples are kept (e.g. with 4 groups, a gene has to have an RPM value above 5 in at least 25% of the samples). This removes genes that have a high CV but are expressed in too few samples to bear any statistical significance or biological relevance.
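The selection described above can be sketched in Python as follows. The sketch assumes that the expression filter is applied before the top 10% are selected and that the number of selected genes is rounded up; the function name `top_cv_genes` and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def top_cv_genes(rpm, n_groups, min_rpm=5.0, top_fraction=0.10):
    """Return the genes with the highest CV% among those passing the expression filter.

    rpm maps a gene name to its list of RPM values (one value per sample).
    """
    cv = {}
    for gene, values in rpm.items():
        expressed = sum(v > min_rpm for v in values)
        # Keep genes with RPM above min_rpm in at least 1 / n_groups of the samples.
        if expressed / len(values) >= 1 / n_groups:
            cv[gene] = stdev(values) / mean(values) * 100  # CV in percent
    n_top = math.ceil(len(cv) * top_fraction)  # assumption: round up
    ranked = sorted(cv, key=cv.get, reverse=True)
    return ranked[:n_top]


# Hypothetical data: 8 samples from 4 groups.
rpm = {
    "g1": [0, 0, 0, 0, 0, 0, 0, 50],          # high CV, but expressed in 1 of 8 samples
    "g2": [10, 12, 9, 11, 40, 45, 38, 42],    # variable and broadly expressed
    "g3": [20, 21, 19, 20, 20, 21, 19, 20],   # broadly expressed, low CV
}
print(top_cv_genes(rpm, n_groups=4))
```

In this toy example `g1` has the highest CV but is dropped by the expression filter, so the most variable remaining gene is selected instead.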

Download data used to generate the heatmap

1.3.2 All genes

10867 genes are shown in the following heatmap, based on the same filters described for the top 10% genes.

Download data used to generate the heatmap

1.4 PCA

Principal component analysis (PCA) uses RPM-normalized gene reads and reduces the data dimensions down to two, so that they can be plotted in a graph. A quick introduction to PCA plots and the underlying principle can be found here.

Samples are colored either by their first group or by the cluster they were assigned to. Clustering is done using the Ward (ward.D2) algorithm of hclust (split at a Euclidean cluster height of 40).

1.4.1 PCA cluster by sample groups

1.4.2 PCA cluster by hierarchical clustering

1.5 t-SNE

t-SNE is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space (like 2 dimensions here). It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. More details can be found in the author’s publication (Maaten and Hinton 2008).

2 Differential expression analysis

Differential expression analysis uses statistical tests to find mRNAs that are over- or underexpressed in a group. For this report, the well-established analysis toolkit edgeR (Robinson, McCarthy, and Smyth 2009) was used.

Annotations in this result are standardized such that, for a contrast of GroupA vs. GroupB, a positive logFC indicates that the mRNA is upregulated in GroupA. For example, a logFC of 2.5 equals an increase of the mRNA by a factor of 2^2.5 = 5.66.
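The conversion in the example above can be written as a short Python sketch. It relies on the logFC being a base-2 logarithm, as implied by the 2^2.5 example; the function name `fold_change` is hypothetical.

```python
def fold_change(log_fc):
    """Convert a log2 fold change (logFC) into a linear fold change."""
    return 2 ** log_fc


print(round(fold_change(2.5), 2))   # 5.66, upregulated in the first group
print(round(fold_change(-1.0), 2))  # 0.5, i.e. halved in the first group
```

A negative logFC therefore corresponds to a fold change between 0 and 1, meaning the mRNA is lower in the first group of the contrast.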

Please select a contrast below to view the differential expression analysis results.

2.1 A versus B

2.1.1 Sample overview

The following two tables give a quick overview of the samples that were part of the two groups compared in this contrast.

2.1.1.1 Samples group A

2.1.1.2 Samples group B

2.1.1.3 Independent filtering

As hundreds or even thousands of genes are tested for each contrast, multiple testing adjustment is required to control the false discovery rate (FDR). This is traditionally done using p-value adjustment methods like Benjamini-Hochberg (BH) with an arbitrary cutoff for lowly expressed genes prior to analysis. In this case, the BH method reliably reduces the number of false positives but, at the same time, removes a large number of valid observations. In addition, the cutoff for lowly expressed genes might remove biologically relevant observations.
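For readers unfamiliar with the BH adjustment, the following Python sketch shows the standard procedure on a handful of p-values. It is a generic illustration, not the pipeline's code; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.001, 0.02, 0.03, 0.5]  # hypothetical raw p-values
print(benjamini_hochberg(p))
```

Each raw p-value is multiplied by the number of tests divided by its rank, so the adjustment becomes stricter the more genes are tested. This is why removing uninformative genes before the adjustment can increase sensitivity.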

Filtering of reads should be done independently of the group assignments. This avoids introducing any bias into the downstream differential expression analysis.

In order to give our analysis the highest sensitivity, we have implemented a method that removes low read count genes from the data set until a statistically relevant set of significant results remains. This approach of independent filtering is also used by DESeq2 and is currently the best-established filtering method prior to FDR adjustment. Assuming that most false positives are caused by low-abundance genes, the algorithm removes quantiles of genes from the low-abundance end and checks whether the number of significant genes increases after BH adjustment. This is the case if mostly false positives have been removed, because BH adjustment is then more sensitive and removes fewer true positives, increasing the overall number of significant results.
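The idea of independent filtering can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline's implementation: p-values are taken as given and are not recomputed after filtering, quantiles are tried in ten equal steps, and ties are resolved in favor of the smaller cutoff. All function names and the toy data are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def independent_filtering(abundance, p_values, alpha=0.05, steps=10):
    """Find the low-abundance fraction whose removal maximizes significant genes.

    Returns (fraction_removed, number_of_significant_genes).
    """
    by_abundance = sorted(range(len(p_values)), key=lambda i: abundance[i])
    best_fraction, best_count = 0.0, -1
    for step in range(steps):
        # Integer arithmetic avoids floating-point surprises in the gene count.
        n_removed = len(by_abundance) * step // steps
        kept = [p_values[i] for i in by_abundance[n_removed:]]
        n_significant = sum(q < alpha for q in bh_adjust(kept))
        if n_significant > best_count:  # ties keep the smaller cutoff
            best_fraction, best_count = step / steps, n_significant
    return best_fraction, best_count


# Hypothetical toy data: five low-abundance genes without signal and
# five high-abundance genes with small p-values.
abundance = [1, 2, 3, 4, 5, 100, 200, 300, 400, 500]
p_values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.02, 0.025, 0.03, 0.035, 0.04]
print(independent_filtering(abundance, p_values))
```

In this toy data set no gene is significant when all ten genes are adjusted together, while removing the lowest-abundance genes makes all five high-abundance genes pass the FDR threshold.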

This method works reliably as long as there are any true positive results. If the result set consists only of false positives, removing the low-abundance results does not increase the number of significant results (as there are no true positives to enrich). For this case the algorithm has a fallback that filters for lowly expressed genes prior to DE analysis and FDR adjustment: in a first step, we filter out genes that are only expressed at very low levels, i.e. with an RPM smaller than 10 divided by the smallest library size in at least half of the samples of the smaller group. Those genes carry no biological or statistical relevance (Chen, Lun, and Smyth 2016), as they have very low read counts in both groups.
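The fallback cutoff can be illustrated with a small Python sketch. It assumes that the library size is expressed in millions of reads, so that the cutoff is an RPM value, and that a gene is kept when it reaches the cutoff in a minimum number of samples; this reading of the rule, the function names and the library sizes are assumptions made for illustration.

```python
def prefilter_cutoff(library_sizes, min_count=10):
    """RPM cutoff: min_count divided by the smallest library size in millions."""
    return min_count / (min(library_sizes) / 1_000_000)


def passes_prefilter(rpm_values, cutoff, min_samples):
    """Keep a gene if it reaches the cutoff in at least min_samples samples."""
    return sum(v >= cutoff for v in rpm_values) >= min_samples


sizes = [8_000_000, 10_000_000, 12_500_000]  # hypothetical library sizes
cutoff = prefilter_cutoff(sizes)
print(cutoff)  # 10 / 8.0 = 1.25 RPM
print(passes_prefilter([0.5, 1.3, 2.0], cutoff, min_samples=2))
```

With a smallest library of 8 million reads, the cutoff of 1.25 RPM corresponds to roughly 10 raw reads in that library.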

This plot visualizes the independent filtering method, based on significant observations, that was used for this contrast. The quantile of reads removed prior to BH p-value adjustment is plotted on the x axis, while the number of significant observations is shown on the y axis. The algorithm optimizes for the maximum number of significant observations and picks the appropriate cutoff.

Prefiltering set cutoff to 1.24 RPM in at least 2 samples. There were 11088 low read count genes removed, accounting for 0.1978% (102370 reads absolute) of the total reads.

FDR based cutoff (see graph) removed 106 low read count genes, accounting for 0.0092% (4755 reads absolute) of the total reads.

2.1.2 Differentially expressed genes

This table shows only genes that are significantly differentially expressed (FDR < 0.05).

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#B_DE_genes.csv

2.1.3 Volcano plot

This graph visualizes the relation between the logFC (how much a gene changed between the groups) and the statistical significance of this change. Genes higher up have a smaller FDR value, while genes further to the left or right of the center show a greater differential expression.

2.1.3.1 FDR based

2.1.4 MA plot

MA plots visualize the relation between the mean expression of a gene (mean of expression counts in both groups, on the x axis = A) and its difference between the two groups (logFC, on the y axis = M). Significantly differentially expressed genes (FDR < 0.05) are shown in red. This plot can be used to check the expression levels of significantly differentially expressed genes.

2.1.5 Top up- and down-regulated

Top up- and down-regulated genes in the given contrast with their CPM values from edgeR. Genes are ordered by logFC (FDR < 0.05 only), starting with the greatest at the top left.

For genes with no reads (CPM = 0) in a sample, the CPM was set to 1, so that they can be displayed in this logarithmic plot as a 0 on the y axis (as the log10 of 0 is undefined).
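The substitution described above amounts to the following minimal Python sketch; the function name `log10_cpm_for_plot` is hypothetical.

```python
import math


def log10_cpm_for_plot(cpm):
    """Replace CPM = 0 by 1 so that the log10 value is defined (and equals 0)."""
    return math.log10(cpm if cpm > 0 else 1.0)


print(log10_cpm_for_plot(0))      # 0.0, shown at the bottom of the plot
print(log10_cpm_for_plot(100.0))  # 2.0
```

Note that a gene with a genuine CPM of 1 is displayed at the same position as a gene without any reads.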

2.1.5.1 Top up-regulated genes

2.1.5.2 Top down-regulated genes

2.1.6 All genes

2.1.6.1 edgeR results

This table contains the results of the differential expression analysis for all tested genes. Additional TMM values calculated by edgeR are provided at the edgeR test statistics table.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#B_edgeRResults.xlsx

2.1.6.2 edgeR test statistics

This table contains the results of edgeR’s glmQLFTest() method.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#B_edgeRAllResults.csv

2.2 A versus C

2.2.1 Sample overview

The following two tables give a quick overview of the samples that were part of the two groups compared in this contrast.

2.2.1.1 Samples group A

2.2.1.2 Samples group B

2.2.1.3 Independent filtering

This plot visualizes the independent filtering method, based on significant observations, that was used for this contrast. The quantile of reads removed prior to BH p-value adjustment is plotted on the x axis, while the number of significant observations is shown on the y axis. The algorithm optimizes for the maximum number of significant observations and picks the appropriate cutoff.

Prefiltering set cutoff to 1.53 RPM in at least 2 samples. There were 11923 low read count genes removed, accounting for 0.2399% (132332 reads absolute) of the total reads.

FDR based cutoff (see graph) removed 0 low read count genes, accounting for 0% (0 reads absolute) of the total reads.

2.2.2 Differentially expressed genes

This table shows only genes that are significantly differentially expressed (FDR < 0.05).

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#C_DE_genes.csv

2.2.3 Volcano plot

This graph visualizes the relation between the logFC (how much a gene changed between the groups) and the statistical significance of this change. Genes higher up have a smaller FDR value, while genes further to the left or right of the center show a greater differential expression.

2.2.3.1 FDR based

2.2.4 MA plot

MA plots visualize the relation between the mean expression of a gene (mean of expression counts in both groups, on the x axis = A) and its difference between the two groups (logFC, on the y axis = M). Significantly differentially expressed genes (FDR < 0.05) are shown in red. This plot can be used to check the expression levels of significantly differentially expressed genes.

2.2.5 Top up- and down-regulated

Top up- and down-regulated genes in the given contrast with their CPM values from edgeR. Genes are ordered by logFC (FDR < 0.05 only), starting with the greatest at the top left.

For genes with no reads (CPM = 0) in a sample, the CPM was set to 1, so that they can be displayed in this logarithmic plot as a 0 on the y axis (as the log10 of 0 is undefined).

2.2.5.1 Top up-regulated genes

2.2.5.2 Top down-regulated genes

2.2.6 All genes

2.2.6.1 edgeR results

This table contains the results of the differential expression analysis for all tested genes. Additional TMM values calculated by edgeR are provided at the edgeR test statistics table.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#C_edgeRResults.xlsx

2.2.6.2 edgeR test statistics

This table contains the results of edgeR’s glmQLFTest() method.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#C_edgeRAllResults.csv

2.3 A versus D

2.3.1 Sample overview

The following two tables give a quick overview of the samples that were part of the two groups compared in this contrast.

2.3.1.1 Samples group A

2.3.1.2 Samples group B

2.3.1.3 Independent filtering

This plot visualizes the independent filtering method, based on significant observations, that was used for this contrast. The quantile of reads removed prior to BH p-value adjustment is plotted on the x axis, while the number of significant observations is shown on the y axis. The algorithm optimizes for the maximum number of significant observations and picks the appropriate cutoff.

Prefiltering set cutoff to 2.49 RPM in at least 2 samples. There were 11920 low read count genes removed, accounting for 0.361% (149465 reads absolute) of the total reads.

FDR based cutoff (see graph) removed 0 low read count genes, accounting for 0% (0 reads absolute) of the total reads.

2.3.2 Differentially expressed genes

This table shows only genes that are significantly differentially expressed (FDR < 0.05).

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#D_DE_genes.csv

2.3.3 Volcano plot

This graph visualizes the relation between the logFC (how much a gene changed between the groups) and the statistical significance of this change. Genes higher up have a smaller FDR value, while genes further to the left or right of the center show a greater differential expression.

2.3.3.1 FDR based

2.3.4 MA plot

MA plots visualize the relation between the mean expression of a gene (mean of expression counts in both groups, on the x axis = A) and its difference between the two groups (logFC, on the y axis = M). Significantly differentially expressed genes (FDR < 0.05) are shown in red. This plot can be used to check the expression levels of significantly differentially expressed genes.

2.3.5 Top up- and down-regulated

Top up- and down-regulated genes in the given contrast with their CPM values from edgeR. Genes are ordered by logFC (FDR < 0.05 only), starting with the greatest at the top left.

For genes with no reads (CPM = 0) in a sample, the CPM was set to 1, so that they can be displayed in this logarithmic plot as a 0 on the y axis (as the log10 of 0 is undefined).

2.3.5.1 Top up-regulated genes

2.3.5.2 Top down-regulated genes

2.3.6 All genes

2.3.6.1 edgeR results

This table contains the results of the differential expression analysis for all tested genes. Additional TMM values calculated by edgeR are provided at the edgeR test statistics table.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#D_edgeRResults.xlsx

2.3.6.2 edgeR test statistics

This table contains the results of edgeR’s glmQLFTest() method.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#D_edgeRAllResults.csv

2.4 A versus E

2.4.1 Sample overview

The following two tables give a quick overview of the samples that were part of the two groups compared in this contrast.

2.4.1.1 Samples group A

2.4.1.2 Samples group B

2.4.1.3 Independent filtering

This plot visualizes the independent filtering method, based on significant observations, that was used for this contrast. The quantile of reads removed prior to BH p-value adjustment is plotted on the x axis, while the number of significant observations is shown on the y axis. The algorithm optimizes for the maximum number of significant observations and picks the appropriate cutoff.

Prefiltering set cutoff to 2.29 RPM in at least 2 samples. There were 12278 low read count genes removed, accounting for 0.342% (164040 reads absolute) of the total reads.

FDR based cutoff (see graph) removed 0 low read count genes, accounting for 0% (0 reads absolute) of the total reads.

2.4.2 Differentially expressed genes

This table shows only genes that are significantly differentially expressed (FDR < 0.05).

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#E_DE_genes.csv

2.4.3 Volcano plot

This graph visualizes the relation between the logFC (how much a gene changed between the groups) and the statistical significance of this change. Genes higher up have a smaller FDR value, while genes further to the left or right of the center show a greater differential expression.

2.4.3.1 FDR based

2.4.4 MA plot

MA plots visualize the relation between the mean expression of a gene (mean of expression counts in both groups, on the x axis = A) and its difference between the two groups (logFC, on the y axis = M). Significantly differentially expressed genes (FDR < 0.05) are shown in red. This plot can be used to check the expression levels of significantly differentially expressed genes.

2.4.5 Top up- and down-regulated

Top up- and down-regulated genes in the given contrast with their CPM values from edgeR. Genes are ordered by logFC (FDR < 0.05 only), starting with the greatest at the top left.

For genes with no reads (CPM = 0) in a sample, the CPM was set to 1, so that they can be displayed in this logarithmic plot as a 0 on the y axis (as the log10 of 0 is undefined).

2.4.5.1 Top up-regulated genes

2.4.5.2 Top down-regulated genes

2.4.6 All genes

2.4.6.1 edgeR results

This table contains the results of the differential expression analysis for all tested genes. Additional TMM values calculated by edgeR are provided at the edgeR test statistics table.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#E_edgeRResults.xlsx

2.4.6.2 edgeR test statistics

This table contains the results of edgeR’s glmQLFTest() method.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#E_edgeRAllResults.csv

2.5 A versus F

2.5.1 Sample overview

The following two tables give a quick overview of the samples that were part of the two groups compared in this contrast.

2.5.1.1 Samples group A

2.5.1.2 Samples group B

2.5.1.3 Independent filtering

This plot visualizes the independent filtering method, based on significant observations, that was used for this contrast. The quantile of reads removed prior to BH p-value adjustment is plotted on the x axis, while the number of significant observations is shown on the y axis. The algorithm optimizes for the maximum number of significant observations and picks the appropriate cutoff.

Prefiltering set cutoff to 1.24 RPM in at least 2 samples. There were 11581 low read count genes removed, accounting for 0.1992% (107061 reads absolute) of the total reads.

FDR based cutoff (see graph) removed 0 low read count genes, accounting for 0% (0 reads absolute) of the total reads.

2.5.2 Differentially expressed genes

This table shows only genes that are significantly differentially expressed (FDR < 0.05).

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#F_DE_genes.csv

2.5.3 Volcano plot

This graph visualizes the relation between the logFC (how much a gene changed between the groups) and the statistical significance of this change. Genes higher up have a smaller FDR value, while genes further to the left or right of the center show a greater differential expression.

2.5.3.1 FDR based

2.5.4 MA plot

MA plots visualize the relation between the mean expression of a gene (mean of expression counts in both groups, on the x axis = A) and its difference between the two groups (logFC, on the y axis = M). Significantly differentially expressed genes (FDR < 0.05) are shown in red. This plot can be used to check the expression levels of significantly differentially expressed genes.

2.5.5 Top up- and down-regulated

Top up- and down-regulated genes in the given contrast with their CPM values from edgeR. Genes are ordered by logFC (FDR < 0.05 only), starting with the greatest at the top left.

For genes with no reads (CPM = 0) in a sample, the CPM was set to 1, so that they can be displayed in this logarithmic plot as a 0 on the y axis (as the log10 of 0 is undefined).

2.5.5.1 Top up-regulated genes

2.5.5.2 Top down-regulated genes

2.5.6 All genes

2.5.6.1 edgeR results

This table contains the results of the differential expression analysis for all tested genes. Additional TMM values calculated by edgeR are provided at the edgeR test statistics table.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#F_edgeRResults.xlsx

2.5.6.2 edgeR test statistics

This table contains the results of edgeR’s glmQLFTest() method.

Due to the size of this data it was not embedded in the report but can be found in the file Group1#A_vs_Group1#F_edgeRAllResults.csv

3 Summary of differential expression analysis

The direction follows the previously mentioned annotation. So “upregulated” (logFC > 0) means that the gene is overexpressed in the first group of the contrast.

4 Appendix

Generation of this report took 2.72 seconds running MEND rev. 86f13e34d52fb7c8933d05e24c34d9f032cc043f on tyrion.

4.1 Methods

The following paragraph describes the methods used for generating this report as required by most publishers. Please consider trimming it down to the parts relevant for your publication and as required by the specific journal. For additional citations please see the “References” section at the end of this report.

Overall quality of the next-generation sequencing data was evaluated automatically and manually with fastQC v0.11.8 (Andrews 2010) and multiQC v1.7 (Ewels et al. 2016). Reads from all passing samples were adapter trimmed and quality filtered using bbduk from the bbmap package v38.69 (Bushnell 2015) and filtered for a minimum length of 17 nt and a phred quality of 30. Alignment steps were performed with STAR v2.7 (Dobin et al. 2013), using samtools v1.9 (Li et al. 2009) for indexing; reads were mapped against the genomic reference GRCh38.p12 provided by Ensembl (Zerbino et al. 2018). Assignment of features to the mapped reads was done with htseq-count v0.13 (Anders, Pyl, and Huber 2015). Differential expression analysis with edgeR v3.30 (Robinson, McCarthy, and Smyth 2009) used the quasi-likelihood negative binomial generalized log-linear model functions provided by the package. The independent filtering method of DESeq2 (Love, Huber, and Anders 2014) was adapted for use with edgeR to remove low-abundance genes and thus optimize the false discovery rate (FDR) correction.

4.2 R session information

devtools::session_info()
## ─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.2 (2020-06-22)
##  os       Debian GNU/Linux 11 (bullseye)
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US:en
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Vienna
##  date     2023-01-30
##  pandoc   2.10.1 @ /mnt/storage/shared/conda/envs/dabf9820504c4a7c89741741bb50afed/bin/ (via rmarkdown)
## 
## ─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package       * version   date (UTC) lib source
##  annotate        1.66.0    2020-04-27 [1] Bioconductor
##  AnnotationDbi   1.52.0    2020-10-27 [1] Bioconductor
##  assertthat      0.2.1     2019-03-21 [1] CRAN (R 4.0.5)
##  Biobase       * 2.50.0    2020-10-27 [1] Bioconductor
##  BiocFileCache   1.14.0    2020-10-27 [1] Bioconductor
##  BiocGenerics  * 0.36.1    2021-04-16 [1] Bioconductor
##  biomaRt       * 2.53.2    2022-09-20 [1] Github (grimbough/biomaRt@edcd48c)
##  bit             4.0.4     2020-08-04 [1] CRAN (R 4.0.3)
##  bit64           4.0.5     2020-08-30 [1] CRAN (R 4.0.3)
##  bitops          1.0-7     2021-04-24 [1] CRAN (R 4.0.5)
##  blob            1.2.3     2022-04-10 [1] CRAN (R 4.0.5)
##  cachem          1.0.6     2021-08-19 [1] CRAN (R 4.0.5)
##  callr           3.7.2     2022-08-22 [1] CRAN (R 4.0.5)
##  cellranger      1.1.0     2016-07-27 [1] CRAN (R 4.0.5)
##  cli             3.4.0     2022-09-08 [1] CRAN (R 4.0.5)
##  colorspace      2.0-3     2022-02-21 [1] CRAN (R 4.0.5)
##  crayon          1.5.1     2022-03-26 [1] CRAN (R 4.0.5)
##  crosstalk       1.2.0     2021-11-04 [1] CRAN (R 4.0.5)
##  curl            4.3.2     2021-06-23 [1] CRAN (R 4.0.5)
##  data.table      1.14.2    2021-09-27 [1] CRAN (R 4.0.5)
##  DBI             1.1.3     2022-06-18 [1] CRAN (R 4.0.5)
##  dbplyr          2.2.1     2022-06-27 [1] CRAN (R 4.0.2)
##  devtools        2.3.2     2020-09-18 [1] CRAN (R 4.0.2)
##  digest          0.6.29    2021-12-01 [1] CRAN (R 4.0.5)
##  dplyr         * 1.0.10    2022-09-01 [1] CRAN (R 4.0.5)
##  DT            * 0.16      2020-10-13 [1] CRAN (R 4.0.3)
##  edgeR         * 3.30.0    2020-04-27 [1] Bioconductor
##  ellipsis        0.3.2     2021-04-29 [1] CRAN (R 4.0.3)
##  evaluate        0.16      2022-08-09 [1] CRAN (R 4.0.5)
##  fansi           1.0.3     2022-03-24 [1] CRAN (R 4.0.5)
##  farver          2.1.1     2022-07-06 [1] CRAN (R 4.0.5)
##  fastmap         1.1.0     2021-01-25 [1] CRAN (R 4.0.3)
##  fs              1.5.2     2021-12-08 [1] CRAN (R 4.0.5)
##  genefilter    * 1.70.0    2020-04-27 [1] Bioconductor
##  generics        0.1.3     2022-07-05 [1] CRAN (R 4.0.5)
##  ggfortify     * 0.4.14    2022-01-03 [1] CRAN (R 4.0.5)
##  ggplot2       * 3.3.6     2022-05-03 [1] CRAN (R 4.0.5)
##  ggrepel       * 0.8.2     2020-03-08 [1] CRAN (R 4.0.0)
##  glue            1.6.2     2022-02-24 [1] CRAN (R 4.0.5)
##  gridExtra     * 2.3       2017-09-09 [1] CRAN (R 4.0.5)
##  gtable          0.3.1     2022-09-01 [1] CRAN (R 4.0.5)
##  highr           0.9       2021-04-16 [1] CRAN (R 4.0.3)
##  hms             1.1.2     2022-08-19 [1] CRAN (R 4.0.5)
##  htmltools       0.5.3     2022-07-18 [1] CRAN (R 4.0.5)
##  htmlwidgets     1.5.4     2021-09-08 [1] CRAN (R 4.0.5)
##  httr            1.4.4     2022-08-17 [1] CRAN (R 4.0.5)
##  IRanges         2.24.1    2020-12-12 [1] Bioconductor
##  jsonlite        1.8.0     2022-02-22 [1] CRAN (R 4.0.5)
##  kableExtra    * 1.2.1     2020-08-27 [1] CRAN (R 4.0.2)
##  knitr           1.40      2022-08-24 [1] CRAN (R 4.0.5)
##  labeling        0.4.2     2020-10-20 [1] CRAN (R 4.0.5)
##  lattice         0.20-45   2021-09-22 [1] CRAN (R 4.0.5)
##  lazyeval        0.2.2     2019-03-15 [1] CRAN (R 4.0.5)
##  lifecycle       1.0.2     2022-09-09 [1] CRAN (R 4.0.5)
##  limma         * 3.44.1    2020-04-28 [1] Bioconductor
##  locfit          1.5-9.4   2020-03-25 [1] CRAN (R 4.0.5)
##  magrittr      * 2.0.3     2022-03-30 [1] CRAN (R 4.0.2)
##  Matrix          1.4-1     2022-03-23 [1] CRAN (R 4.0.2)
##  memoise         2.0.1     2021-11-26 [1] CRAN (R 4.0.5)
##  mime            0.12      2021-09-28 [1] CRAN (R 4.0.5)
##  munsell         0.5.0     2018-06-12 [1] CRAN (R 4.0.5)
##  pcaMethods    * 1.80.0    2020-04-27 [1] Bioconductor
##  pheatmap      * 1.0.12    2019-01-04 [1] CRAN (R 4.0.5)
##  pillar          1.8.1     2022-08-19 [1] CRAN (R 4.0.5)
##  pkgbuild        1.3.1     2021-12-20 [1] CRAN (R 4.0.5)
##  pkgconfig       2.0.3     2019-09-22 [1] CRAN (R 4.0.5)
##  pkgload         1.3.0     2022-06-27 [1] CRAN (R 4.0.5)
##  plotly        * 4.9.4.1   2021-06-18 [1] CRAN (R 4.0.5)
##  prettyunits     1.1.1     2020-01-24 [1] CRAN (R 4.0.5)
##  processx        3.7.0     2022-07-07 [1] CRAN (R 4.0.5)
##  progress        1.2.2     2019-05-16 [1] CRAN (R 4.0.5)
##  ps              1.7.1     2022-06-18 [1] CRAN (R 4.0.5)
##  purrr           0.3.4     2020-04-17 [1] CRAN (R 4.0.3)
##  R6              2.5.1     2021-08-19 [1] CRAN (R 4.0.5)
##  rappdirs        0.3.3     2021-01-31 [1] CRAN (R 4.0.3)
##  RColorBrewer  * 1.1-2     2014-12-07 [1] CRAN (R 4.0.5)
##  Rcpp            1.0.9     2022-07-08 [1] CRAN (R 4.0.5)
##  RCurl           1.98-1.3  2021-03-16 [1] CRAN (R 4.0.3)
##  readr         * 1.4.0     2020-10-05 [1] CRAN (R 4.0.5)
##  readxl        * 1.3.1     2019-03-13 [1] CRAN (R 4.0.5)
##  remotes         2.4.2     2021-11-30 [1] CRAN (R 4.0.5)
##  rlang           1.0.5     2022-08-31 [1] CRAN (R 4.0.5)
##  rmarkdown       2.4       2020-09-30 [1] CRAN (R 4.0.2)
##  RSQLite         2.2.17    2022-09-10 [1] CRAN (R 4.0.2)
##  rstudioapi      0.14      2022-08-22 [1] CRAN (R 4.0.5)
##  Rtsne         * 0.15      2018-11-10 [1] CRAN (R 4.0.5)
##  rvest           1.0.3     2022-08-19 [1] CRAN (R 4.0.5)
##  S4Vectors       0.28.1    2020-12-09 [1] Bioconductor
##  scales        * 1.2.1     2022-08-20 [1] CRAN (R 4.0.5)
##  sessioninfo     1.2.2     2021-12-06 [1] CRAN (R 4.0.5)
##  stringi         1.7.8     2022-07-11 [1] CRAN (R 4.0.2)
##  stringr       * 1.4.1     2022-08-20 [1] CRAN (R 4.0.5)
##  survival        3.4-0     2022-08-09 [1] CRAN (R 4.0.5)
##  tibble        * 3.1.8     2022-07-22 [1] CRAN (R 4.0.2)
##  tidyr         * 1.1.4     2021-09-27 [1] CRAN (R 4.0.5)
##  tidyselect      1.1.2     2022-02-21 [1] CRAN (R 4.0.5)
##  usethis         2.1.6     2022-05-25 [1] CRAN (R 4.0.5)
##  utf8            1.2.2     2021-07-24 [1] CRAN (R 4.0.5)
##  vctrs           0.4.1     2022-04-13 [1] CRAN (R 4.0.5)
##  viridisLite     0.4.1     2022-08-22 [1] CRAN (R 4.0.5)
##  webshot         0.5.3     2022-04-14 [1] CRAN (R 4.0.5)
##  withr           2.5.0     2022-03-03 [1] CRAN (R 4.0.5)
##  WriteXLS      * 5.0.0     2019-05-25 [1] CRAN (R 4.0.0)
##  xfun            0.31      2022-05-10 [1] CRAN (R 4.0.5)
##  XML             3.99-0.10 2022-06-09 [1] CRAN (R 4.0.2)
##  xml2            1.3.3     2021-11-30 [1] CRAN (R 4.0.5)
##  xtable          1.8-4     2019-04-21 [1] CRAN (R 4.0.5)
##  yaml          * 2.2.2     2022-01-25 [1] CRAN (R 4.0.5)
## 
##  [1] /mnt/storage/shared/conda/envs/dabf9820504c4a7c89741741bb50afed/lib/R/library
## 
## ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

4.3 References

The following references are provided for tools that affect the scientific and statistical outcome of this analysis. Many other tools, many of them available as open source, helped in the preparation of this report. Please contact us for a full list of references.

Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. 2015. “HTSeq–a Python framework to work with high-throughput sequencing data.” Bioinformatics 31 (2): 166–69. https://doi.org/10.1093/bioinformatics/btu638.

Andrews, Simon. 2010. “FastQC: A quality control tool for high throughput sequence data.” https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Bushnell, Brian. 2015. “BBMap.” https://sourceforge.net/projects/bbmap/.

Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From reads to genes to pathways: Differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; referees: 5 approved].” F1000Research 5: 1–49. https://doi.org/10.12688/F1000RESEARCH.8987.2.

Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.

Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize analysis results for multiple tools and samples in a single report.” Bioinformatics 32 (19): 3047–8. https://doi.org/10.1093/bioinformatics/btw354.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.

Köster, Johannes, and Sven Rahmann. 2012. “Snakemake-a scalable bioinformatics workflow engine.” Bioinformatics 28 (19): 2520–2. https://doi.org/10.1093/bioinformatics/bts480.

Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map format and SAMtools.” Bioinformatics 25 (16): 2078–9. https://doi.org/10.1093/bioinformatics/btp352.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 1–21. https://doi.org/10.1186/s13059-014-0550-8.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing High-Dimensional Data Using t-SNE.” Journal of Machine Learning Research 9 (August): 2579–2605.

McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research 40 (10): 4288–97. https://doi.org/10.1093/nar/gks042.

Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2009. “edgeR: A Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “pcaMethods - A bioconductor package providing PCA methods for incomplete data.” Bioinformatics 23 (9): 1164–7. https://doi.org/10.1093/bioinformatics/btm069.

Zerbino, Daniel R., Premanand Achuthan, Wasiu Akanni, M. Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, et al. 2018. “Ensembl 2018.” Nucleic Acids Research 46 (D1): D754–D761. https://doi.org/10.1093/nar/gkx1098.