TileAnalysis {TileShuffle}    R Documentation

TileAnalysis

Description

The statistical analysis of tiling array expression data.

Usage

TileAnalysis(data, noofperms, winsize, qvalue, gcmode="fixed", gcnum,
    score.function="trimmed", randomize=FALSE, diff, diff.variant="B",
    regions.filename, zscore=TRUE, verbose=FALSE)

Arguments

data A data.frame containing information on all probes, i.e., the full name of the reference sequence (organism abbreviation and chromosome name), the chromosome name, the probe center position, the length of the probe sequence, the GC content of the probe sequence, and the probe log2-intensity or log2-fold change. This data.frame is reported by the functions TileReadCel, TileReadSignal, or TileReadCustom.
noofperms Number of permutations used to sample the background distribution. With a higher number of permutations, the statistical significance of windows can be assessed more precisely, in particular with more restrictive significance thresholds, i.e., low values of the qvalue parameter. Moreover, in differential analysis (diff is enabled), the number of permutations should be increased to sample the background distribution more accurately. In general, it is recommended to set noofperms to 10000 for expression analyses and to 100000 for differential expression analyses.
winsize Maximal width of the windows that are statistically assessed. The width is defined as the difference between the genomic center positions of the first and last enclosed probe. Due to gaps in the generally uniform distribution of probes over the genomic sequence, the analyzed windows may be considerably shorter than the defined winsize or may even consist of only one probe.
qvalue Maximal permitted q-value that is applied in the statistical analysis. Hence, windows with a q-value above the given value will not be included in the returned data.frame.
gcmode Mode of GC content binning. With gcmode set to "fixed", the classification of probes into bins is predefined, based on the effect of GC content on probe intensities on the Affymetrix tiling array 1.0R platform. In this case, only the following values of gcnum are permitted: 1, 2, 3, or 4. With gcmode set to "automatic", the binning is derived solely from the distribution of the GC content of the probes in order to obtain GC content bins that are optimally balanced in terms of their sizes.
gcnum Number of different GC content bins. Probes within each bin have a similar expected sequence-specific affinity and are permuted independently of the other bins. Accordingly, intensities of probes that belong to different affinity bins must not be interchanged. Due to the trade-off between reducing the sequence-specific effect and maintaining sufficiently large permutation bins, three GC content bins are recommended in the expression analysis. With gcmode set to "fixed", only the values 1, 2, 3, or 4 are permitted; with gcmode set to "automatic", the bins are balanced in size based on the GC content distribution of the probes (see gcmode). Note that gcnum is set to one in differential expression analysis (diff enabled), since sequence-specific effects cancel out and affinity binning becomes unnecessary.
score.function Function to calculate window scores over the log2-intensities or log2-fold changes of the corresponding probes, i.e., the arithmetic mean (score.function = "mean"), the arithmetic mean trimmed by the minimal and maximal value (score.function = "trimmed"), or the median (score.function = "median"). Note that this definition of the trimmed mean differs from the common one based on percentile ranges. Moreover, the trimmed mean can differ from the mean only for windows that contain more than two probes. The latter two scoring functions are recommended due to their higher robustness against outliers. However, they are more expensive to calculate, so the running time increases when "trimmed" or "median" is selected. Note that the function is given as a character string.
randomize Indicates whether an additional permutation is applied prior to the calculation of the original window scores. This is a possibility to roughly estimate the false positive rate, since under the assumption of mostly unexpressed probes no window over permuted intensities is expected to differ significantly from the background distribution.
diff Indicates whether differential expression analysis is applied.
diff.variant The variants of the differential expression analysis differ in score calculation, in the permutation procedure, and in how statistical significance is assigned to windows. Variant A is similar to the normal expression analysis, but two-tailed p-values are estimated to account for both directions of regulation, up and down. The multiple testing correction is then adjusted to account for these additional comparisons. Variant B assumes that entire windows are either up- or down-regulated between conditions. The presumed direction of regulation is initially assigned to each window on the basis of its score. Subsequently, all converse probes, i.e., probes with a negative log2-fold change within positive windows or vice versa, are ignored and neither permuted nor incorporated into the score calculation. Consequently, positive and negative windows are compared to different background distributions. The p-value estimation and correction are done in the same way as in the normal expression analysis. Both variants produce fairly similar results, but variant B performs slightly better and is hence recommended.
regions.filename Name of a BED-formatted file that contains the regions to which the (differential) expression analysis should be limited. Hence, only windows entirely enclosed in the union of these regions are statistically evaluated. Commonly, this parameter is used to identify highly and differentially expressed (highdiff) regions by restricting the differential expression analysis (diff enabled) to regions identified as highly expressed in either one of the corresponding cellular conditions.
zscore Indicates whether z-scores, i.e., normalized window scores, should be calculated using the sampled background. More precisely, the window z-score z is calculated as z = (x - μ) / σ, where x is the window score and μ and σ are the mean and standard deviation of the permuted window scores, respectively. Note that negative z-scores indicate down-regulated regions and positive z-scores indicate up-regulated regions, since z-scores are bounded by zero from below or from above in case of up- or down-regulation, respectively. Hence, in the common expression analysis (diff is FALSE), the z-score cannot be negative.
verbose Indicates whether progress information is printed.
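
The three scoring options described under score.function can be illustrated with a minimal sketch. The helper window_score below is hypothetical and not part of TileShuffle; in particular, falling back to the plain mean for windows with fewer than three probes is an assumption that is merely consistent with the note that the trimmed mean can differ from the mean only for windows with more than two probes.

```r
# Hypothetical helper (not a TileShuffle function): score of a single window.
window_score <- function(x, score.function = c("trimmed", "mean", "median")) {
  score.function <- match.arg(score.function)
  switch(score.function,
    mean = mean(x),
    median = median(x),
    trimmed = {
      # Drop one minimal and one maximal value. With fewer than three
      # probes nothing is trimmed (assumption) and the plain mean is used.
      if (length(x) < 3) mean(x) else mean(sort(x)[-c(1, length(x))])
    }
  )
}

probes <- c(0.2, 0.4, 0.5, 6.0)   # made-up log2-intensities with one outlier
window_score(probes, "mean")      # pulled up by the outlier
window_score(probes, "trimmed")   # outlier removed before averaging
window_score(probes, "median")
```

In this toy window the single outlier dominates the mean, whereas the trimmed mean and the median stay close to the bulk of the probe values.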

Details

Executes the statistical analysis of tiling array expression data. It identifies (differential) expression as significant deviations from the background distribution while accounting for sequence-specific affinities as well as cross-hybridization. This method returns a list with two data.frames: one contains information on the analyzed windows, including their estimated z-scores if zscore is enabled, while the other one comprises the significantly (differentially) expressed segments.
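
The idea of permuting intensities within GC content bins and normalizing a window score against the resulting background can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the column names, the toy data, the fixed window, and the helper permute_within_bins are all hypothetical, and the plain mean is used as the scoring function for brevity.

```r
set.seed(1)

# Toy probe table; column names and values are assumptions for this sketch.
probes <- data.frame(
  intensity = c(rnorm(30, mean = 0), rnorm(30, mean = 1)),
  gcbin     = rep(c(1L, 2L), each = 30)
)

# Hypothetical helper: shuffle intensities only among probes of the same
# GC bin, so values are never interchanged between affinity bins.
permute_within_bins <- function(intensity, gcbin) {
  ave(intensity, gcbin, FUN = function(v) v[sample.int(length(v))])
}

window <- 1:5                           # probe indices of one window
x <- mean(probes$intensity[window])     # observed window score

# Background: score of the same window over permuted intensities.
noofperms <- 1000
background <- replicate(noofperms, {
  shuffled <- permute_within_bins(probes$intensity, probes$gcbin)
  mean(shuffled[window])
})

# z-score as defined for the zscore argument: z = (x - mu) / sigma.
z <- (x - mean(background)) / sd(background)
```

With more permutations the background mean and standard deviation are estimated more precisely, which is why larger values of noofperms are advised for restrictive thresholds.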

Value

Returns a list with two data.frames: one contains the analyzed windows including the calculated z-scores, and the other one comprises the significantly expressed segments. The first list entry is NULL if zscore is FALSE. Otherwise, the entry holds the z-score data.frame containing the name of the reference sequence, start and end position, description, estimated z-score, and `+' as strand for each analyzed window. The description is an underscore-delimited string of the number of covered probes, the average GC content of their sequences, the window q-value multiplied by 100, and the window score calculated with the given score.function on the probe scores. The second list entry is a data.frame that comprises the significantly (differentially) expressed segments, including the name of the reference sequence, start and end position, description, a score that is uniformly set to `10', and `+' as strand. The description is an underscore-delimited string of the name of the reference sequence, the start and end position, the number of probes covered by the segment, the average GC content of their sequences, the minimal q-value of the windows that were merged into the segment (multiplied by 100), and the segment score calculated with the given score.function on the covered probes.
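
The underscore-delimited window description can be split back into its four fields with a few lines of base R. The helper parse_window_description and the example strings are hypothetical; the exact numeric formatting of the description is an assumption, so this is only a sketch of the field layout stated above.

```r
# Hypothetical helper (not a TileShuffle function): split window descriptions
# of the form <probes>_<GC content>_<q-value * 100>_<window score>.
parse_window_description <- function(description) {
  parts <- strsplit(description, "_", fixed = TRUE)
  fields <- do.call(rbind, lapply(parts, as.numeric))
  data.frame(
    nprobes = as.integer(fields[, 1]),
    gc      = fields[, 2],
    qvalue  = fields[, 3] / 100,   # stored value is the q-value times 100
    score   = fields[, 4]
  )
}

desc <- c("5_0.48_1.2_2.37", "3_0.52_0.4_3.10")   # made-up example strings
parse_window_description(desc)
```

The segment description in the second list entry begins with additional fields (name of the reference sequence, start and end position), so it would need a separate parser.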


[Package TileShuffle version 0.1.0 Index]