- User Guide
- 1 Introduction
- 2 Requirements
- 3 Installation
- 4 Quick Start
- 5 Differential splicing analysis
- 6 Hierarchical clustering based on splice site usage
- 7 Input
- 8 Missing values
- 9 Output
- 10 Usage
- 11 Parameters
- 13 Complaints
1 Introduction
DIEGO is a software tool to find splice junctions or exons that are differentially used between two conditions. It uses Aitchison's statistics and a parameter-free test (Wilcoxon) to find these junctions/exons. Because it relies on the Wilcoxon test, it can only produce results that remain significant after correction for multiple testing if the number of samples is sufficiently high. DIEGO can cope with hundreds of samples, but using it with fewer than 5 samples per group will drastically reduce the number of significant results. DIEGO takes a table of splice junction usage or exon expression as input. This table can easily be generated, but DIEGO also provides scripts to generate an input matrix from STAR splice junction count files, DEXSeq count files or segemehl bam files. Besides finding differential splice junction/exon usage, DIEGO can also be used to cluster samples with respect to their exon/splice junction usage.
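The dependence on sample size can be made concrete with a small calculation. The sketch below is not part of DIEGO; it assumes an exact two-sided rank-sum test without ties and computes the smallest p-value such a test can reach for given group sizes, before any correction for multiple testing. The function name is made up for this illustration.

```python
from math import comb  # requires Python 3.8 or later


def min_two_sided_p(n1, n2):
    """Smallest two-sided p-value of an exact rank-sum test without ties."""
    # Only the two most extreme rank arrangements (all of group 1 ranked
    # lowest, or all of group 1 ranked highest) reach the minimum.
    return min(1.0, 2.0 / comb(n1 + n2, n1))


for n in (3, 5, 10):
    print(n, min_two_sided_p(n, n))
```

With 3 samples per group the smallest possible p-value is 0.1, so no junction can pass a 0.01 threshold regardless of the effect size. With 5 per group the minimum is 2/252, which is below 0.01 only before correction.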
2 Requirements
DIEGO needs Python (version 2.7 or later) and the numpy (version >=1.11.2), scipy (version >=0.19.0) and matplotlib (version >=1.5.3) Python libraries, as well as Perl (version 5). The Python and Perl scripts should run out of the box on all machines that have these libraries installed.
3 Installation
Download the .zip file from here and unzip it:
$ unzip DIEGO.zip
You should now have a directory called DIEGO containing all the scripts.
4 Quick Start
In order to use DIEGO, you need two files: one containing the junction usage matrix, the other listing the sample names (as used in the junction usage matrix) together with the condition each sample belongs to. To create a junction usage matrix file, you need a genome annotation file and a file that maps each sample name to a count (or bam) file (format: sample-name tab-delimiter path/to/file). The annotation file can be created from a gff (gtf) file with:
$ perl gfftoDIEGObed.pl -g your_gff_file -o your_annotation_bed_file
If for example you have a set of STAR generated junction files, you can create your junction table file using:
$ python pre_STAR.py -l your_list_of_names_and_starjunction_files -d your_annotation_bed_file
This will create a file junction_table.txt that you can use, together with a file that maps the sample names to one of two groups (condition tab-delimiter sample_name), as input for DIEGO.py. One of the two group names has to be given as the base condition (-x):
$ python DIEGO.py -a junction_table.txt -b your_list_of_names_and_group_files -x your_base_condition > Your_output
If you want to create a tree with a clustering, you also have to define a base condition:
$ python DIEGO.py -a junction_table.txt -b your_list_of_names_and_group_files -x your_base_condition -e [-f name_your_dendrogram]
5 Differential splicing analysis
The default mode of DIEGO uses Aitchison's geometry to find differential splicing between two groups of samples. Both splice junction counts and exon expression counts (e.g. as created by HTSeq) can be used to generate a junction table file that combines counts and mappings to genes.
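For readers unfamiliar with Aitchison's geometry, the following is a minimal sketch of the underlying idea, not DIEGO's actual implementation. It treats the junction counts of one gene in one sample as a composition, applies the centered log-ratio (clr) transform, and measures the Euclidean distance between two transformed samples. The function names are chosen for this illustration only.

```python
import math


def clr(counts):
    """Centered log-ratio transform of a vector of strictly positive counts."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Same relative junction usage at twice the sequencing depth: distance ~ 0.
print(aitchison_distance([30, 10, 60], [60, 20, 120]))
# Usage of the first two junctions swapped: distance > 0.
print(aitchison_distance([30, 10, 60], [10, 30, 60]))
```

The distance depends only on the relative usage of the junctions, not on the total coverage. Because the logarithm of zero is undefined, counts of 0 have to be dealt with before such a transform can be applied.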
6 Clustering based on splice site usage
DIEGO can also be used to generate a clustering based on alternative splicing. This is triggered by the -e parameter. DIEGO first identifies the genes with the highest splice form variance. Then, it uses these genes to compute pairwise distances between the individual samples based on Aitchison's geometry. These distances are then used to generate a clustering dendrogram. The dendrogram is colored as follows. For brevity, let t be the color_threshold. All descendant links below a cluster node k get the same color if k is the first node below the cut threshold t. All links connecting nodes with distances greater than or equal to the threshold are colored blue. If t is less than or equal to zero, all nodes are colored blue. If color_threshold is None or 'default', corresponding to MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]), where Z is the linkage matrix and its third column holds the merge distances. By default, the dendrogram is written to cluster_dendrogram.pdf, but this name can be changed using the -f parameter. Please note that, for historical reasons, the -x parameter also has to be set for clustering.
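The default threshold rule can be illustrated with a few lines of plain Python. This is only a sketch of the rule as described above, assuming a SciPy-style linkage matrix in which each row is (cluster_i, cluster_j, distance, size). The rows in Z below are made-up illustration values and the function names are hypothetical.

```python
def default_color_threshold(linkage):
    """Return 0.7 times the largest merge distance (third column)."""
    return 0.7 * max(row[2] for row in linkage)


def stays_blue(linkage, threshold=None):
    """For each merge, True if its link is at or above the threshold."""
    if threshold is None:
        threshold = default_color_threshold(linkage)
    if threshold <= 0:
        # With a non-positive threshold everything is colored blue.
        return [True] * len(linkage)
    return [row[2] >= threshold for row in linkage]


# Made-up linkage rows for four samples: (cluster_i, cluster_j, distance, size)
Z = [(0, 1, 0.5, 2), (2, 3, 1.0, 2), (4, 5, 2.0, 4)]
print(default_color_threshold(Z))
print(stays_blue(Z))
```

In this example only the top-level merge lies above the threshold, so the two lower clusters would each receive their own color.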
7 Input
The input consists of two files. The first is a tab-separated file, which must contain a header line of the format:
| junction | type | sample_1 | sample_2 | [...] | sample_x | sample_y | geneID | geneName |
where the first two and last two columns are fixed, while the sample columns can come in any order. The following tab-separated lines contain the data for each splice junction or exon, depending on the user's choice. Samples are assigned to groups through another file that maps every sample to a group (see below). The input table can contain data of more than two groups; however, only samples that are assigned a group in the affiliations file are considered. The other file is also tab-separated and assigns the sample names to two groups:
| group1 | sample_name1 |
| group2 | sample_namex |
in no particular order. Besides these two files, many scripts in DIEGO also need a tab-separated file that links each sample name to a data file (e.g. STAR junctions file, segemehl bam file, HTSeq count file):
| sample_name1 | /path/to/datafile1|
| sample_namex | /path/to/datafilex|
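Such a mapping file is easy to write by hand, but it can also be generated. The sketch below is not shipped with DIEGO. It derives each sample name from the file name by stripping an assumed suffix (here STAR's SJ.out.tab) and returns one sample_name tab path line per file. The function name and the example paths are placeholders.

```python
import os


def sample_list_lines(data_files, suffix=".SJ.out.tab"):
    """Return one 'sample_name<TAB>path' line per data file."""
    lines = []
    for path in sorted(data_files):
        name = os.path.basename(path)
        # Strip the suffix to obtain the sample name, unless nothing would remain.
        if name.endswith(suffix) and len(name) > len(suffix):
            name = name[: -len(suffix)]
        lines.append("%s\t%s" % (name, path))
    return lines


files = ["/path/to/tumor_1.SJ.out.tab", "/path/to/normal_1.SJ.out.tab"]
print("\n".join(sample_list_lines(files)))
```

The printed lines can be redirected into a file and passed to the pre-processing scripts. The sample names produced here must match the names used in the condition file.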
Generate an input file from STAR junction
We offer an easy way to generate an appropriate input file containing all splice junction counts from STAR junction files (usually having the suffix SJ.out.tab). After generating a file with sample names and the junction files (as described above) and a gene annotation, run:
$ python pre_STAR.py -l your_file_with_names_and_data -d your_annotations [-o output_directory]
This will create a file (junction_table.txt) that can be used as an input for DIEGO.py.
Generate an input file from TCGA junction support files
We offer a way to generate an appropriate input file containing all splice junction counts from TCGA-formatted junction support files. After generating a file with sample names and the junction files (as described above) and a gene annotation, run:
$ python pre_std.py -l your_file_with_names_and_data -d your_annotations [-o output_directory]
This will create a file (junction_table.txt) that can be used as an input for DIEGO.py.
Generate an input file from segemehl 0.1 derived bam files
We also offer a script to generate an appropriate input file containing all splice junction counts from bam files generated by segemehl 0.1. After generating a file with sample names and the bam files (as described above) and a gene annotation, run:
$ python pre_segemehl.py -l your_file_with_names_and_data -a your_annotations -d genome.fa [-t number of threads] [-o output_directory]
|-m , --max_range||maximal distance of the splice junction of a single read to be assigned to a certain splice junction|
|-j , --min_sup||minimum support for a splice junction|
This will also create a file (junction_table.txt) that can be used as an input for DIEGO.py.
Generate an input file from segemehl 0.2 derived splice junction count files
Newer versions of segemehl provide lists of splice events and splice event counts. We also offer a script to generate an appropriate input file from the ".sngl.bed" files generated by segemehl. After generating a file with sample names and the .sngl.bed files (as described above) and a gene annotation, run:
$ python pre_segemehl0_3_0.pl -l your_file_with_names_and_data -a your_annotations [-o output_file_name]
This will create a file (default: junction_table_DIEGO.txt) that can be used as an input for DIEGO.py.
Generate an input file from HTSeq count files
We also offer a script to generate an appropriate input file containing all exon counts from HTSeq count files. After generating a file with sample names and the HTSeq files (as described above) and a gene annotation, run:
$ perl HTseq2DIEGO.pl -i your_file_with_names_and_data [-o output_file_name]
This will also create a file (default: junction_table_dexdas) that can be used as an input for DIEGO.py.
8 Missing values
DIEGO treats splice junction coverages of 0 as missing data and uses a negative binomial distribution to handle these missing values. For any splice junction that passes the filters built into DIEGO (number of samples, minimum coverage), all zeros are replaced by numbers drawn from a negative binomial distribution. Splice junctions where zero replacement was performed are indicated in the output.
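The general idea of zero replacement can be sketched as follows. This is not DIEGO's implementation: how DIEGO chooses the parameters of the negative binomial is not described here, so r and p below are arbitrary placeholders. Redrawing until the value is positive is an assumption of this sketch, made so that log-ratios stay defined. The seed argument shows why a fixed random seed makes results reproducible.

```python
import math
import random


def draw_negative_binomial(rng, r, p):
    """Number of failures before the r-th success (integer r, 0 < p < 1)."""
    total = 0
    for _ in range(r):
        # Geometric draw by inversion: failures before a single success.
        u = 1.0 - rng.random()  # u lies in (0, 1]
        total += int(math.log(u) / math.log(1.0 - p))
    return total


def replace_zeros(counts, r=2, p=0.5, seed=0):
    """Replace every 0 in counts by a positive negative binomial draw."""
    rng = random.Random(seed)
    result = []
    for count in counts:
        while count == 0:  # assumption of this sketch: redraw until positive
            count = draw_negative_binomial(rng, r, p)
        result.append(count)
    return result


print(replace_zeros([12, 0, 7, 0], seed=1))
```

Non-zero counts are left untouched, and calling the function twice with the same seed yields the same replacements.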
9 Output
The output for differential splice site detection has the following format:
| junction id | junction type | abundance change | p-value | q-value | gene ID | gene Name | number of junctions in gene | number of significant junctions | distance of the centres | significant | zero replacement |
The abundance change of a splice junction is used to decide whether statistical tests are run; the higher its absolute value, the more the two groups differ. The distance of the centres gives an idea of how different the splicing in the whole gene is. To show only the results that pass the user-defined filters, you can grep for the word yes (in the significant column):
$ grep -w yes DIEGOoutput_file
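A stricter alternative to grep is to test the significant column itself, which avoids matching the word yes if it appears in another column. The following sketch assumes the output is tab-separated with the columns in the order listed above, so that the significant column is at 0-based position 10. The example rows are made-up placeholder values, not real DIEGO output.

```python
SIGNIFICANT_COLUMN = 10  # 0-based position of "significant" in the format above


def significant_rows(lines):
    """Yield the split fields of every row whose significant column is 'yes'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > SIGNIFICANT_COLUMN and fields[SIGNIFICANT_COLUMN] == "yes":
            yield fields


# Placeholder rows for illustration only.
example = [
    "j1\ttype\t1.8\t0.0001\t0.004\tg1\tGENE1\t4\t1\t0.9\tyes\tno",
    "j2\ttype\t0.2\t0.6\t0.8\tg1\tGENE1\t4\t1\t0.9\tno\tyes",
]
for fields in significant_rows(example):
    print(fields[0], fields[4])
```

To apply this to a real result file, pass an open file handle instead of the example list.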
10 Usage
$ python DIEGO.py [-h] -a MY_A -b MY_B -x MY_BASE_CONDITION [-c MY_MS] [-d MY_MT] [-q MY_QVALUE] [-z MY_FC] [-e] [-f MY_F] [-r]
11 Parameters
| Option | Type | Default | Description |
| DataInputFile | String | | a table file containing the input data |
| -a, --table | String | | table of splice junction supports per sample |
| -b, --list | String | | condition to sample relation in the format: condition tab-delimiter sampleName |
| -x, --base_condition | String | | specify base condition |
| -c, --minsupp | Integer | 10 | min support per splice site (at least -d samples have to show this min support) |
| -d, --minsamples | Integer | 3 | min number of samples showing the min support in at least one of the junctions |
| -q, --significanceThreshold | Double | 0.01 | significance level |
| -z, --foldchangeThreshold | Double | 1.0 | abundance change threshold |
| -e, --cluster | | | enables clustering mode |
| -f, --dendrogram | | | prefix specifying the dendrogram plot |
| -r, --random | Integer | | random seed |
DIEGO uses the table given with -a as input for the splice site/exon coverage data. The tab-separated table has to have a header:
| junction | type | sample_1 | sample_2 | [...] | sample_x | sample_y | geneID | geneName |
It can be generated with pre_std.py, pre_STAR.py, HTseq2DIEGO.pl or pre_segemehl.py.
The list parameter -b gives a tab-delimited table assigning the sample names (as used in the data file) to two groups for differential analysis. Format:
| condition | sample_name |
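Before starting a run, it can help to check that the two input files agree with each other. The sketch below is a hypothetical pre-flight check, not part of DIEGO. It assumes the header layout and the condition tab-delimiter sample_name format described in this guide, and it reports samples missing from the table, a wrong number of conditions, and a base condition that does not occur.

```python
def parse_conditions(lines):
    """Map sample name -> condition from 'condition<TAB>sample_name' lines."""
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        condition, sample = line.split("\t")
        groups[sample] = condition
    return groups


def check_inputs(header_line, groups, base_condition):
    """Return a list of problems; an empty list means the inputs look consistent."""
    columns = header_line.rstrip("\n").split("\t")
    problems = []
    if columns[:2] != ["junction", "type"] or columns[-2:] != ["geneID", "geneName"]:
        problems.append("header must start with junction, type and end with geneID, geneName")
    # The table may hold more samples than the condition file uses.
    table_samples = set(columns[2:-2])
    missing = sorted(s for s in groups if s not in table_samples)
    if missing:
        problems.append("samples not in table: " + ", ".join(missing))
    conditions = set(groups.values())
    if len(conditions) != 2:
        problems.append("expected exactly two conditions, found %d" % len(conditions))
    if base_condition not in conditions:
        problems.append("base condition is not one of the conditions")
    return problems


header = "junction\ttype\ts1\ts2\ts3\tgeneID\tgeneName"
groups = parse_conditions(["control\ts1", "treated\ts2"])
print(check_inputs(header, groups, "control"))
```

An empty list means nothing obvious is wrong; it does not guarantee that DIEGO will accept the files.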
The option -c sets the minimum coverage necessary for a splice junction to be investigated. Every splice junction that does not have a support of at least -c in at least -d samples is ignored.
The option -d sets the minimum number of samples that have to show at least the minimum support (-c) for a splice junction to be investigated. Every splice junction that does not have at least -d samples with this support is ignored.
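The interplay of -c and -d can be sketched in a few lines. This is an illustration of the rule as described here, not DIEGO's code. Whether DIEGO counts supporting samples per group or across all samples is not stated in this guide, so the sketch simply counts across whatever counts it is given. The defaults mirror the documented defaults of 10 and 3.

```python
def passes_filter(counts, min_support=10, min_samples=3):
    """True if at least min_samples samples reach min_support for a junction."""
    supporting = sum(1 for count in counts if count >= min_support)
    return supporting >= min_samples


print(passes_filter([12, 15, 0, 30]))                # three samples reach 10
print(passes_filter([12, 9, 0, 30]))                 # only two samples reach 10
print(passes_filter([12, 9, 0, 30], min_samples=2))  # relaxed -d
```

Lowering either value lets more junctions through to testing, which in turn increases the burden of the multiple testing correction.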
The option -q specifies the corrected p-value threshold below which a result is considered significant (it will have a yes in the significant column).
The option -z gives the minimum abundance change a splice junction must show to be investigated using a Wilcoxon test.
The option -x specifies the base condition (one of the two groups in the -b file).
The option -e triggers the clustering mode, in which a dendrogram is generated based on the usage of alternative splicing (by default called cluster_dendrogram.pdf).
The option -f sets the prefix of the name of the dendrogram generated in -e mode (only used in clustering mode).
The option -r specifies the random seed (for exact reproducibility).
13 Complaints
All complaints go to [berni,steve] at bioinf dot uni-leipzig dot de