TileShuffle {TileShuffle}

R Documentation

TileShuffle

Description

The overall procedure to analyze tiling array expression data.

Usage

TileShuffle(input.type, signal.filename, cel.filename, cel2.filename,
    celinc.filename, custom.filename, custom2.filename,
    custominc.filename, bpmap.filename, minhits=8000, group="Hs",
    pmonly=TRUE, normalize=TRUE, mod.tstat=FALSE, noofperms, winsize,
    qvalue, gcmode="fixed", gcnum, score.function="trimmed",
    randomize=FALSE, diff, diff.variant="B", regions.filename,
    zscore.filename, output.filename, verbose=FALSE)

Arguments

`input.type`	The type of input data that contains the probe and intensity information. With `input.type == 1`, the input is given as human-readable signal.txt files created by the Affymetrix Tiling Array Software (TAS), which contain the probe scores, together with Affymetrix BPMAP files comprising the probe mapping and further probe information. With `input.type == 2`, the input is provided as Affymetrix CEL files and Affymetrix BPMAP files. With `input.type == 3`, the input is provided as custom-formatted files that may be created from tiling array data of any design and platform. More details on the format are given in the description of the function `TileReadCustom`. The selection of `input.type == 1` is deprecated since it was specifically designed for the Affymetrix pipeline with TAS on the Human tiling array 1.0R platform and may not work properly in other cases. Instead, please use `input.type == 2` with Affymetrix CEL files or `input.type == 3` with custom-formatted text files.
`signal.filename`	Filename of human-readable signal.txt file (as a `character`) created by the Tiling Array Software (TAS).
`cel.filename`	A `vector` of one or more filenames of Affymetrix CEL files (as `character`) that contain the probe intensities in the first cellular condition. Note that replicates are simply defined by more than one filename and used according to `mod.tstat`.
`cel2.filename`	A `vector` of one or more filenames of Affymetrix CEL files (as `character`) that contain the probe intensities in the second cellular condition. Note that replicates are simply defined by more than one filename and used according to `mod.tstat`.
`celinc.filename`	A `vector` of one or more filenames of Affymetrix CEL files (as `character`) containing probe intensities that should be included in the normalization. This may be desirable in case tiling array data of more than two different cellular states is available and multiple transitions between them are being analyzed.
`custom.filename`	A `vector` of one or more filenames of custom-formatted files (as `character`) that contain the probe intensities in the first cellular condition and are formatted as described. Note that replicates are simply defined by more than one filename and used according to `mod.tstat`.
`custom2.filename`	A `vector` of one or more filenames of custom-formatted files (as `character`) that contain the probe intensities in the second cellular condition and are formatted as described. Note that replicates are simply defined by more than one filename and used according to `mod.tstat`.
`custominc.filename`	A `vector` of one or more filenames of custom-formatted files (as `character`) containing probe intensities that should be included in the normalization. This may be desirable in case tiling array data of more than two different cellular states is available and multiple transitions between them are being analyzed.
`bpmap.filename`	Filename of Affymetrix binary probe mapping (BPMAP) file (as a `character`), which is a binary file containing information on the location of each probe in the reference sequence. Moreover, it stores the probe sequences that are necessary to calculate the GC content.
`minhits`	Minimal number of hits in a BPMAP entry for it to be considered in the further analysis. For historical reasons, there are several entries in the BPMAP file with only around a thousand probes assigned that might overlap with the larger entries or with entries on other tiling arrays. In case of the Affymetrix tiling array 1.0R, a value of 8000 is recommended.
`group`	A group name as the organism abbreviation in order to consider only these entries in the BPMAP file and hence disregard entries such as TIGR, Affymetrix, or bacterial controls.
`pmonly`	Indicates whether only intensities of perfect match (PM) probes on the tiling array are incorporated in the probe intensity estimation. If neither `pmonly` nor `mmonly` is set to `TRUE`, the specific hybridization effect of a probe is estimated by taking `PM-MM`.
`normalize`	Indicates whether the probe intensities of the given files in `cel.filename`, `cel2.filename`, and `celinc.filename` (with `input.type == 2`) or `custom.filename`, `custom2.filename`, and `custominc.filename` (with `input.type == 3`) are normalized by use of full-quantile normalization. The normalization is recommended if replicates are available or a differential analysis is executed and, hence, the transition between cellular states is analyzed.
`mod.tstat`	Indicates the use of replicate information. If `TRUE`, the score is the value of the moderated t-statistic (see `eBayes` of the limma package for further details). Otherwise, the median probe `log2`-intensity among the given replicates or the median of all pairwise `log2`-fold changes between both states will be used as estimate of the probe differential score. Note that the moderated t-statistic can only be used if replicate information is available.
`noofperms`	Number of permutations used to sample the background distribution. With a higher number of permutations, the statistical significance of windows can be assessed more precisely, in particular with more restrictive significance thresholds, i.e., low values of the `qvalue` parameter. Moreover, in case of the differential analysis (`diff` is enabled), the number of permutations should be increased to sample the background distribution more accurately. In general, `noofperms` is recommended to be set to `10000` and `100000` in case of expression and differential expression analyses, respectively.
`winsize`	Maximal width of the windows that are being statistically assessed. The width is defined as the difference in the genomic center positions of the first and last enclosed probe. Due to gaps in the commonly uniform distribution of probes over the genomic sequence, the analyzed windows may be considerably shorter than the defined `winsize` or may even consist of only one probe.
`qvalue`	Maximal permitted q-value that is applied in the statistical analysis. Hence, windows with a q-value above the given value will not be included in the returned `data.frame`.
`gcmode`	Mode of GC content binning. In case of `gcmode` set to "fixed", the classification of probes in bins was predefined considering the GC content effect on probe intensities on the Affymetrix tiling array 1.0R platform. In this case, only the following values of `gcnum` are permitted: 1, 2, 3, 4, or 5. By setting `gcmode` to "automatic", the binning is done automatically, based solely on the distribution of the GC content of the probes, in order to obtain GC content bins that are optimally balanced in terms of their sizes.
`gcnum`	Number of different GC content bins where probes within each bin have a similar expected sequence-specific affinity and are permuted independently from each other. Accordingly, intensities of probes that belong to different affinity bins must not be interchanged. Due to the trade-off between the reduction of the sequence-specific effect and the maintenance of sufficiently large permutation bins, three GC content bins are recommended in the expression analysis. In case of `gcmode` set to "fixed", the classification of probes in bins was predefined considering the GC content effect on probe intensities on the Affymetrix tiling array 1.0R platform. In this case, only the following values are permitted: 1, 2, 3, 4, or 5. By setting `gcmode` to "automatic", the binning is done automatically, based solely on the distribution of the GC content of the probes, in order to obtain GC content bins that are optimally balanced in terms of their sizes. Note that `gcnum` is set to one in case of differential expression analysis (`diff` enabled) since sequence-specific effects cancel out and affinity binning is rendered unnecessary.
`score.function`	Function to calculate window scores over the scores of the corresponding probes, i.e., arithmetic average (`score.function` = "mean"), arithmetic mean trimmed by the minimal and maximal value (`score.function` = "trimmed"), or the median (`score.function` = "median"). Note that this definition of the trimmed mean differs from the common one with given percentile ranges. Moreover, the resulting scores with the trimmed mean may differ from the mean only in case of windows that contain more than two probes. The latter two scoring functions are recommended due to their higher robustness against outliers. However, due to the higher calculation costs, the running time increases by selecting "trimmed" or "median". Note that the function is given as `character`.
`randomize`	Indicates whether an additional permutation is applied prior to the calculation of original window scores. It is a possibility to roughly estimate the false positive rate since, under the assumption of mostly unexpressed probes, no window over permuted intensities is expected to differ significantly from the background distribution.
`diff`	Indicates whether differential expression analysis is applied.
`diff.variant`	The variants of the differential expression analysis differ in score calculation, in the permutation procedure, and in their assignment of statistical significance to windows. The `diff.variant` A is similar to the normal expression analysis but two-tailed p-values are estimated to regard both regulation directions, up and down. The multiple testing correction is then adjusted to account for these additional comparisons. The `diff.variant` B assumes that entire windows are either up- or down-regulated between conditions. The presumed direction of regulation is initially assigned to each window on the basis of its score. Subsequently, all converse probes, i.e., probes with negative score within positive windows or vice versa, are ignored and neither permuted nor incorporated into the score calculation. Consequently, positive and negative windows are compared to different background distributions. The p-value estimation and correction are done in the same way as in the normal expression analysis. Both variants produce fairly similar results while variant B is slightly superior in its performance and hence recommended.
`regions.filename`	Filename of BED-formatted file that contains regions to which the (differential) expression analysis should be limited. Hence, only windows entirely enclosed in the union of these regions are statistically evaluated. Commonly, this parameter is used in order to identify highly and differentially expressed (highdiff) regions by restricting the differential expression analysis (`diff` enabled) to regions identified as highly expressed in either one of the corresponding cellular conditions.
`zscore.filename`	Filename of BED-formatted z-score file that may be written and comprises the analyzed windows of the statistical analysis. Note that it is only written if `zscore.filename` is set, i.e., not `NULL`. In this case, it contains the name of the reference sequence, start and end position, description, estimated z-score, and ‘+’ as strand for each analyzed window. The description is an underscore-delimited string of the number of covered probes, the average GC content of their sequences, the window q-value multiplied by 100, and the window score calculated with the given `score.function` on the probe scores. The z-scores represent normalized window scores that are calculated by use of the sampled background. More precisely, the window z-score `z` is calculated by z = \frac{x - \mu}{\sigma} where `x` is the window score and \mu and \sigma are the mean and standard deviation of the permuted window scores, respectively. Note that negative z-scores indicate down-regulated and positive z-scores indicate up-regulated regions since z-scores are bounded to zero from below or from above in case of up- or down-regulation, respectively. Hence, using common expression analysis (if `diff` is `FALSE`), the z-score cannot be negative. Note that any existing file will be overwritten.
`output.filename`	Filename of BED-formatted output file that is written and comprises information on the segments identified as significantly (differentially) expressed including name of reference sequence, start and end position, description, score that is uniformly set to ‘10’ (better compatibility of BED files e.g. with UCSC genome browser), and ‘+’ as strand. The description is an underscore-delimited string of the name of reference sequence, the start and end position, the number of probes covered by the segment, the average GC content of their sequences, the minimal q-value of windows that were merged into the segment (multiplied by 100), and the segment score calculated with the given `score.function` on the covered probes.
`verbose`	Indicates whether information on progress is printed.
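
The window scoring described for `score.function` and the z-score formula given for `zscore.filename` can be illustrated with a minimal sketch. The helpers `window_score` and `window_zscore` are hypothetical names introduced here and are not part of the TileShuffle package. The sketch assumes that "trimmed" drops exactly one minimal and one maximal probe score and falls back to the plain mean for windows with at most two probes; the package's internal implementation may differ.

```r
## Hypothetical helper (not part of TileShuffle): window score for the
## three documented values of score.function.
window_score <- function(x, score.function = c("trimmed", "mean", "median")) {
  score.function <- match.arg(score.function)
  if (score.function == "mean") return(mean(x))
  if (score.function == "median") return(median(x))
  ## "trimmed" (assumption): remove one minimum and one maximum, then average.
  ## With one or two probes, the result equals the plain mean.
  if (length(x) <= 2) return(mean(x))
  x <- sort(x)
  mean(x[-c(1, length(x))])
}

## Hypothetical helper: z = (x - mu) / sigma, where mu and sigma are the
## mean and standard deviation of the permuted window scores.
window_zscore <- function(x, permuted.scores) {
  (x - mean(permuted.scores)) / sd(permuted.scores)
}

probe.scores <- c(1, 2, 3, 10)         # made-up probe scores with one outlier
window_score(probe.scores, "mean")     # 4
window_score(probe.scores, "trimmed")  # 2.5
window_score(probe.scores, "median")   # 2.5
window_zscore(5, c(1, 2, 3))           # 3
```

The outlier `10` pulls the mean to 4 while the trimmed mean and the median stay at 2.5, which is the robustness property mentioned for these two scoring functions.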

Details

The overall procedure to analyze tiling array expression data: it reads the measured probe scores and further information as input, executes the statistical analysis of this data, and reports the output in terms of one or two BED-formatted files (depending on whether `zscore.filename` is set to `NULL`).
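
As a rough illustration of the permutation step behind `gcnum` and `noofperms`, the following sketch permutes probe intensities only within GC content bins, so that intensities of probes from different bins are never interchanged. All data are made up, `permute_within_bins` is a hypothetical helper, and the rank-based binning is only an assumption that mimics the balanced bins of `gcmode = "automatic"`; it is not the package's actual code.

```r
set.seed(1)

## Made-up GC counts and intensities for nine probes
gc        <- c(8, 10, 12, 14, 16, 18, 9, 11, 13)
intensity <- c(5.1, 6.0, 7.2, 8.3, 9.1, 9.8, 5.5, 6.4, 7.7)
gcnum     <- 3

## Assumed balanced binning: split probes by GC rank into gcnum bins of
## (nearly) equal size. Ties are broken by order of appearance.
bin <- ceiling(rank(gc, ties.method = "first") * gcnum / length(gc))

## Hypothetical helper: shuffle intensities independently within each bin
permute_within_bins <- function(intensity, bin) {
  out <- intensity
  for (b in unique(bin)) {
    idx <- which(bin == b)
    out[idx] <- intensity[idx[sample.int(length(idx))]]
  }
  out
}

permuted <- permute_within_bins(intensity, bin)
table(bin)  # three probes per bin in this toy example
```

Repeating such a permutation `noofperms` times and scoring the windows each time would yield the sampled background distribution against which the original window scores are compared.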

Value

None. All output is written to `output.filename` and, if `zscore.filename` is not `NULL`, to `zscore.filename`. All generated files contain header information marked by ‘#’ at the beginning of the line that lists the used parameters with their values and the description of the columns in the subsequent output. Note that any existing file will be overwritten.
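
Since the header lines start with ‘#’, the written files can be read back with `read.table`, which skips such lines via `comment.char`. The sketch below writes a small stand-in file first; its header and data line are invented for illustration and only follow the column layout described for `output.filename`. The column names are assumptions, and the exact number format of the GC content field is not specified here.

```r
## Stand-in for a file written by TileShuffle (content is made up)
bed.file <- tempfile(fileext = ".bed")
writeLines(c(
  "# hypothetical header line",
  "chr1\t1000\t1400\tchr1_1000_1400_12_0.45_1.2_7.31\t10\t+"
), bed.file)

## Read the segments, skipping the '#' header lines
segments <- read.table(
  bed.file, comment.char = "#", stringsAsFactors = FALSE,
  col.names  = c("chrom", "start", "end", "description", "score", "strand"),
  colClasses = c("character", "integer", "integer",
                 "character", "numeric", "character"))

## Split the underscore-delimited description. Only the last four fields
## are taken so that underscores in reference names do not shift them.
parts <- do.call(rbind, lapply(strsplit(segments$description, "_", fixed = TRUE),
                               tail, n = 4))
segments$n.probes      <- as.integer(parts[, 1])
segments$gc            <- as.numeric(parts[, 2])
segments$qvalue        <- as.numeric(parts[, 3]) / 100  # stored multiplied by 100
segments$segment.score <- as.numeric(parts[, 4])

unlink(bed.file)
```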

Examples

## Note that the following example only executes if the external data
## of the Starr R package is available which includes an artificial
## Affymetrix BPMAP file and corresponding CEL files.
path <- system.file("extdata", package = "Starr")
if (path != ""){
## define Affymetrix BPMAP file for probe mapping
bpmap.filename <- file.path(path, "Scerevisiae_tlg_chr1.bpmap")
## define Affymetrix CEL files
## here: file of control experiment (wt)
wt.filename <- file.path(path, "wt_IP_chr1.cel")
## and files of real experiment with
## tagged Rpb3 with two replicates (ip)
ip.filename <- c(file.path(path, "Rpb3_IP_chr1.cel"),
file.path(path, "Rpb3_IP2_chr1.cel"))
stopifnot(file.exists(bpmap.filename) && all(file.exists(ip.filename)) &&
file.exists(wt.filename))

## identify highly expressed segments in IP state
## (only 100 permutations as example)
## Note that group is here '' (blank) for old Affy chr21/22 arrays
## but commonly it is "Hs" for Human, "Mm" for Mouse or "Dm" for Drosophila
TileShuffle(bpmap.filename=bpmap.filename, cel.filename=ip.filename,
input.type=2, group="", pmonly=TRUE, normalize=TRUE,
noofperms=100, winsize=200, qvalue=0.05,
gcnum=3, diff=FALSE, output.filename="ip_high.bed",
zscore.filename="ip_high_zscore.bed", verbose=FALSE)

## identify highly expressed segments in wt state
## (only 100 permutations as example)
TileShuffle(bpmap.filename=bpmap.filename, cel.filename=wt.filename,
input.type=2, group="", pmonly=TRUE, normalize=TRUE,
noofperms=100, winsize=200, qvalue=0.05,
gcnum=3, diff=FALSE, output.filename="wt_high.bed",
zscore.filename="wt_high_zscore.bed", verbose=FALSE)

## concatenate files
file.create("ip_and_wt_high.bed")
file.append("ip_and_wt_high.bed", "ip_high.bed")
file.append("ip_and_wt_high.bed", "wt_high.bed")

## identify highly and differentially expressed segments (highdiff)
## that are highly expressed in control or real experiment and
## significantly differentially expressed between both conditions
## (only 100 permutations as example)
## Note: common differential analysis without identifying highdiff segments
## is executed simply by omitting the parameter 'regions.filename'
TileShuffle(bpmap.filename=bpmap.filename, cel.filename=wt.filename,
cel2.filename=ip.filename, input.type=2, group="",
pmonly=TRUE, normalize=TRUE, noofperms=100,
winsize=200, qvalue=0.05, gcnum=1, diff=TRUE,
regions.filename="ip_and_wt_high.bed",
output.filename="ip_and_wt_highdiff.bed",
zscore.filename="ip_and_wt_highdiff_zscore.bed",
verbose=FALSE)
## cleanup
rm(bpmap.filename, wt.filename, ip.filename)
}
rm(path)

[Package TileShuffle version 0.2.0 Index]