DIEGO is a Python program that should run out of the box as long as its prerequisites (the external Python libraries numpy, scipy and matplotlib) are installed on your system. Several Python and Perl scripts for preparing the input are also provided. DIEGO runs on a normal desktop machine. You can download the latest version of DIEGO from here and extract it with
$ unzip DIEGO.zip
Then go to the new directory and run the scripts directly.
You might want to work through our Tutorial.
2 Prepare Input File
There are two input files. The first, containing all splice event (or exon expression) data, is a tab-separated file sorted by gene ID, with the following format and header:
chr:start-stop <tab> type <tab> g1_xxx <tab> g1_xxx <tab> [...] <tab> g2_xxx <tab> g2_xxx <tab> [...]<tab> gene identifier <tab> gene name
where the first column refers to the location of the splice junction or exon. It must have the string:number-number format but is ignored in the computation, so the numbers and chromosome names are arbitrary. The second column refers to the type of splice junction (usually N_w, but also e.g. circular), and the following columns contain the absolute number of reads. Each read count column has a unique name that is used to identify it in the second input file. The second input file assigns each of these column names to a group. It is tab-separated and has the following format:
group identifier <tab> count_column_name
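Because the two files have to agree on the read count column names, a quick consistency check before running DIEGO can save time. The following is a minimal sketch and not part of DIEGO: the function name `check_inputs` is hypothetical, and it assumes that the header line lists the read count columns between the type column and the last two columns (gene identifier and gene name), as in the format shown above. Whether DIEGO itself requires every count column to be classified is an assumption here.

```python
import re

# Location column: string:number-number (the values themselves are arbitrary).
LOCATION_RE = re.compile(r"^[^:\s]+:\d+-\d+$")


def check_inputs(table_lines, classification_lines):
    """Return a list of problems; an empty list means the two inputs agree."""
    problems = []

    # Header layout assumed: location, type, count columns..., gene ID, gene name.
    header = table_lines[0].rstrip("\n").split("\t")
    count_columns = header[2:-2]
    if len(set(count_columns)) != len(count_columns):
        problems.append("read count column names are not unique")

    # Second file: group identifier <tab> count_column_name
    classified = set()
    for line in classification_lines:
        if not line.strip():
            continue
        _group, column = line.rstrip("\n").split("\t")
        classified.add(column)

    for column in count_columns:
        if column not in classified:
            problems.append(f"column {column!r} has no group")
    for column in sorted(classified - set(count_columns)):
        problems.append(f"classified column {column!r} is not in the table")

    # Data rows start on line 2 of the table.
    for number, line in enumerate(table_lines[1:], start=2):
        location = line.split("\t", 1)[0]
        if not LOCATION_RE.match(location):
            problems.append(f"line {number}: bad location {location!r}")

    return problems
```

To use it on real files, pass `open(path).readlines()` for each of the two inputs and print whatever the function returns.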
There are multiple ways to build the first input file. Depending on the type of data you want to analyse, you can use one of the scripts provided with DIEGO. Usually, these scripts need an input file of the format
count_column_name <tab> (PATH_to)file_containing_read_information
Each file_containing_read_information is either a segemehl-derived BAM file, a STAR-derived splice junction count file, or an HTSeq-derived exon count file. Simply run the appropriate script to get the input file containing the read information. You then only have to provide the second input file, and you can run DIEGO.
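The list file for the preparation scripts and the classification file share the same column names, so both can be generated from one description of the experiment. The sketch below is an illustration only: the helper names, the group and column names, and the paths are all hypothetical, and only the two tab-separated line formats are taken from this documentation.

```python
def format_sample_list(groups):
    """Lines of: count_column_name <tab> path to the read information file."""
    lines = []
    for samples in groups.values():
        for column_name, path in samples.items():
            lines.append(f"{column_name}\t{path}")
    return "\n".join(lines) + "\n"


def format_classification(groups):
    """Lines of: group identifier <tab> count_column_name."""
    lines = []
    for group, samples in groups.items():
        for column_name in samples:
            lines.append(f"{group}\t{column_name}")
    return "\n".join(lines) + "\n"


# Hypothetical experiment: two groups with two replicates each.
example_groups = {
    "control": {"g1_rep1": "/data/ctrl_rep1.bam", "g1_rep2": "/data/ctrl_rep2.bam"},
    "treated": {"g2_rep1": "/data/trt_rep1.bam", "g2_rep2": "/data/trt_rep2.bam"},
}
```

Writing the returned strings to two files, for example with `open("list.txt", "w").write(format_sample_list(example_groups))`, gives the list file for the preparation script and a matching classification file.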
3 Detection of differential alternative splicing
To perform a differential splice site analysis, run
$ python DIEGO.py -a readcount_file -b classification_file -x base_condition > your_output_file
Option -a is the read count table, -b the classification file, and -x the name of the condition (as listed in the classification file) that is used as the basis of the differential analysis. Additionally, you might want to change the minimum support necessary for prediction (-c), the minimum number of samples that show minimum support per splice site (-d), the significance level (-q) and the minimum fold change (-z).
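When DIEGO is run from a pipeline rather than by hand, the same call can be assembled in Python. This is a minimal sketch, not part of DIEGO: the function and keyword names are hypothetical, only the flags named above are used, and no default values are assumed, so optional flags are added only when a value is given.

```python
import subprocess
import sys


def build_diego_command(table, classification, base_condition,
                        min_support=None, min_samples=None,
                        significance=None, fold_change=None,
                        script="DIEGO.py"):
    """Assemble the argument list for a differential analysis run."""
    cmd = [sys.executable, script,
           "-a", table, "-b", classification, "-x", base_condition]
    optional = [("-c", min_support), ("-d", min_samples),
                ("-q", significance), ("-z", fold_change)]
    for flag, value in optional:
        if value is not None:  # leave DIEGO's own default untouched
            cmd += [flag, str(value)]
    return cmd


def run_diego(cmd, output_path):
    """Run DIEGO and redirect its standard output to a file."""
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

For example, `run_diego(build_diego_command("readcount_file", "classification_file", "base_condition", significance=0.05), "your_output_file")` mirrors the shell command above with an explicit -q value; the value 0.05 is only an illustration.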
4 Clustering based on alternative splicing
To perform a clustering based on alternative splicing, run
$ python DIEGO.py -a readcount_file -b classification_file -x base_condition -e
Again, option -a is the read count table, -b the classification file, and -x the name of a condition (as listed in the classification file). The -e parameter switches on the clustering mode. You can use -f to specify the name of the dendrogram that is written.
The output of the alternative splice event detection has a tab-delimited format:
location of junction <tab> type <tab> abundance change <tab> p-value <tab> q-value <tab> gene ID <tab> gene name <tab> number of junctions (gene) <tab> number of significant junctions <tab> centre distance <tab> significant yes/no <tab> biologically significant change of splice junction usage
The output is not filtered for significance! Please decide for yourself whether to filter on the p- or q-value, and at which significance level.
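Since the filtering is left to the user, a small post-processing step is usually needed. The following sketch keeps the rows whose q-value is at or below a chosen threshold. It assumes the column order listed above (q-value in the fifth column) and skips any line whose q-value field is not a number, such as a possible header line; whether the output actually contains a header is not stated here. The function name and the threshold are hypothetical.

```python
import csv

Q_VALUE_COLUMN = 4  # fifth column in the layout described above (0-based index)


def filter_by_q_value(lines, threshold):
    """Return the rows (as lists of fields) with q-value <= threshold."""
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= Q_VALUE_COLUMN:
            continue  # too few fields to hold a q-value
        try:
            q_value = float(row[Q_VALUE_COLUMN])
        except ValueError:
            continue  # header line or non-numeric entry
        if q_value <= threshold:
            kept.append(row)
    return kept
```

A typical call would be `filter_by_q_value(open("your_output_file"), 0.05)`. Changing `Q_VALUE_COLUMN` to 3 filters on the p-value instead.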