TileShuffle {TileShuffle} | R Documentation |
The overall procedure to analyze tiling array expression data.
TileShuffle(input.type, signal.filename, cel.filename, cel2.filename, celinc.filename, custom.filename, custom2.filename, custominc.filename, bpmap.filename, minhits=8000, group="Hs", pmonly=TRUE, normalize=TRUE, mod.tstat=FALSE, noofperms, winsize, qvalue, gcmode="fixed", gcnum, score.function="trimmed", randomize=FALSE, diff, diff.variant="B", regions.filename, zscore.filename, output.filename, verbose=FALSE)
input.type |
The type of input data which contains the probe and
intensity information.
In case of input.type == 1 , the input is given in terms of
human-readable signal.txt files, created by the Affymetrix Tiling
Array Software (TAS) that contain the probe scores, and Affymetrix
BPMAP files comprising the probe mapping and further information on
them.
In case of input.type == 2 , the input is provided as
Affymetrix CEL files and Affymetrix BPMAP files.
On the other hand, by selecting input.type == 3 , the input is
provided as custom-formatted files that may be created from tiling
array data of any design and platform. More details on the format are
given in the description of the function TileReadCustom.
The selection of input.type == 1 is deprecated since it was
specifically designed for the Affymetrix pipeline with TAS on the
Human tiling array 1.0R platform and may not work properly in other
cases. Instead, please use input.type = 2 with Affymetrix CEL
files or input.type = 3 with custom-formatted text files. |
signal.filename |
Filename of human-readable signal.txt file
(as a character ) created by the Tiling Array Software (TAS). |
cel.filename |
A vector of one or more filenames of Affymetrix
CEL files (as character ) that contain the probe intensities in
the first cellular condition. Note that replicates are simply defined
by more than one filename and used according to mod.tstat . |
cel2.filename |
A vector of one or more filenames of Affymetrix
CEL files (as character ) that contain the probe intensities in
the second cellular condition. Note that replicates are simply
defined by more than one filename and used according to
mod.tstat . |
celinc.filename |
A vector of one or more filenames of
Affymetrix CEL files (as character ) containing probe
intensities that should be included in the normalization. This may
be desirable in case tiling array data of more than two different
cellular states is available and multiple transitions between them
are being analyzed. |
custom.filename |
A vector of one or more filenames of custom-
formatted files (as character ) that contain the probe
intensities in the first cellular condition and are formatted as
described. Note that replicates are simply defined by more than
one filename and used according to mod.tstat . |
custom2.filename |
A vector of one or more filenames of custom-
formatted files (as character ) that contain the probe
intensities in the second cellular condition and are formatted as
described. Note that replicates are simply defined by more than
one filename and used according to mod.tstat . |
custominc.filename |
A vector of one or more filenames of
custom-formatted files (as character ) containing probe
intensities that should be included in the normalization. This may
be desirable in case tiling array data of more than two different
cellular states is available and multiple transitions between them
are being analyzed. |
bpmap.filename |
Filename of Affymetrix binary probe mapping (BPMAP)
file (as a character ), which is a binary file containing
information on the location of each probe in the reference sequence.
Moreover, it stores the probe sequences that are necessary to
calculate the GC content. |
minhits |
Minimal number of hits in BPMAP entry to be considered for the further analysis. Due to historical reasons there are several entries in the BPMAP file with only around thousand probes assigned that might overlap with the larger entries or with entries on other tiling arrays. In case of Affy tiling array 1.0R, a value of 8000 is recommended. |
group |
A group name as the organism abbreviation in order to consider only these entries in the BPMAP file and hence disregard entries such as TIGR, Affymetrix, or bacterial controls. |
pmonly |
Indicates whether only intensities of perfect match (PM)
probes on the tiling array are incorporated in the probe intensity
estimation. If neither pmonly nor mmonly is set to
TRUE , the specific hybridization effect of a probe is
estimated by taking PM-MM . |
normalize |
Indicates whether the probe intensities of the given files
in cel.filename , cel2.filename , and
celinc.filename (with input.type == 2 ) or
custom.filename , custom2.filename , and
custominc.filename (with input.type == 3 ) are
normalized by use of full-quantile normalization. The normalization
is recommended if replicates are available or a differential analysis
is executed and, hence, the transition between cellular states is
analyzed. |
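As an illustration of what full-quantile normalization does, here is a minimal base R sketch. It is not the package's own implementation; the helper name quantile_normalize and the toy matrix are hypothetical, and missing values are assumed to be absent.

```r
# Hypothetical helper: full-quantile normalization of a probe-by-array matrix.
# Assumes no missing values; rows are probes, columns are arrays.
quantile_normalize <- function(x) {
  # rank of each probe within its own array
  ranks <- apply(x, 2, rank, ties.method = "first")
  # reference distribution: mean of the sorted intensities across arrays
  ref <- rowMeans(apply(x, 2, sort))
  # replace each intensity by the reference value of its rank
  normalized <- apply(ranks, 2, function(r) ref[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# toy intensities for four probes on three arrays (made-up numbers)
intensities <- cbind(a = c(5, 2, 3, 4), b = c(4, 1, 4, 2), c = c(3, 4, 6, 8))
normalized <- quantile_normalize(intensities)
```

After this step all arrays share the same intensity distribution, while the ordering of probes within each array is kept.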
mod.tstat |
Indicates the use of replicate information. If TRUE ,
the score is the value of the moderated t-statistic (see
eBayes of limma package for further details).
Otherwise, the median probe log2 -intensity among the given
replicates or the median of all pairwise log2 -fold changes
between both states will be used as estimate of the probe
differential score. Note that the moderated t-statistic can only be
used if replicate information is available. |
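The two median-based scores used when mod.tstat is FALSE can be sketched as follows. This is only an illustration under stated assumptions: the matrices hold made-up log2 intensities, the helper pairwise_lfc_median is hypothetical, and the sign convention (second state minus first state) is an assumption that the text above does not specify.

```r
# Made-up log2 intensities: rows are probes, columns are replicates.
state1 <- cbind(rep1 = c(8.0, 10.0), rep2 = c(8.4, 10.2))
state2 <- cbind(rep1 = c(9.0, 10.1), rep2 = c(9.2, 9.9))

# Expression analysis: median log2 intensity per probe among replicates.
expression_score <- apply(state1, 1, median)

# Differential analysis: median of all pairwise log2 fold changes per probe.
# Assumption: fold changes are computed as state 2 minus state 1.
pairwise_lfc_median <- function(a, b) {
  vapply(seq_len(nrow(a)),
         function(i) median(outer(b[i, ], a[i, ], "-")),
         numeric(1))
}
differential_score <- pairwise_lfc_median(state1, state2)
```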
noofperms |
Number of permutations used to sample the background
distribution. With higher number of permutations, the statistical
significance of windows can be assessed more precisely in particular
with more restrictive significance thresholds, i.e., low values of
the qvalue parameter. Moreover, in case of the differential
analysis (diff is enabled), the number of permutations should
be increased to sample the background distribution more accurately.
In general, noofperms is recommended to be set to 10000
and 100000 in case of expression and differential expression
analyses, respectively. |
winsize |
Maximal width of the windows that are being statistically
assessed. The width is defined as the difference in the genomic
center positions of the first and last enclosed probe. Due to
gaps in the commonly uniform distribution of probes over the
genomic sequence, the analyzed windows may be considerably shorter
than the defined winsize or may even consist of only one
probe. |
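The width definition can be made concrete with a small sketch. It assumes that one window starts at every probe and extends to the last probe whose center lies at most winsize away; whether TileShuffle enumerates windows exactly this way is an assumption, and the probe positions are made up.

```r
# Made-up, sorted genomic center positions with a gap after the fourth probe.
centers <- c(100, 135, 170, 205, 600, 635)
winsize <- 100

# For each start probe, index of the last probe still inside the window.
last_probe <- vapply(seq_along(centers),
                     function(i) max(which(centers - centers[i] <= winsize)),
                     integer(1))

# Actual window width and number of enclosed probes.
window_width <- centers[last_probe] - centers
probes_per_window <- last_probe - seq_along(centers) + 1L
```

In this toy case the window starting at the fourth probe contains a single probe and has width 0, because the next probe lies beyond the gap.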
qvalue |
Maximal permitted q-value that is applied in the statistical
analysis. Hence, windows with a q-value above the given value will
not be included in the returned data.frame . |
gcmode |
Mode of GC content binning. In case of gcmode set to
"fixed", the classification of probes in bins was predefined
considering the GC content effect on probe intensities on the
Affymetrix tiling array 1.0R platform. In this case, only the
following values of gcnum are permitted: 1, 2, 3, 4, or 5. By setting
gcmode to "automatic", the binning is done automatically
solely on the distribution of the GC content of the probes in order
to obtain GC content bins that are optimally balanced in terms of
their sizes. |
gcnum |
Number of different GC content bins where probes within each
bin have a similar expected sequence-specific affinity and are
permuted independently from each other. Accordingly, intensities
of probes that belong to different affinity bins must not be
interchanged. Due to the trade-off between the reduction of the
sequence-specific effect and the maintenance of sufficiently large
permutation bins, three GC content bins are recommended in the
expression analysis. In case of gcmode set to "fixed", the
classification of probes in bins was predefined considering the
GC content effect on probe intensities on the Affymetrix tiling
array 1.0R platform. In this case, only the following values are
permitted: 1, 2, 3, 4, or 5. By setting gcmode to "automatic",
the binning is done automatically solely on the distribution of the
GC content of the probes in order to obtain GC content bins that are
optimally balanced in terms of their sizes. Note that
gcnum is set to one in case of differential expression
analysis (diff enabled) since sequence-specific effects cancel
out and affinity binning is rendered unnecessary. |
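The idea of size-balanced GC bins with independent permutation inside each bin can be sketched as below. This is a simplified illustration, not the package's binning code: the rank-based split, the helper permute_within_bins, and all values are assumptions, and a real implementation would probably keep probes with identical GC content in the same bin.

```r
set.seed(1)

# Made-up GC counts and intensities for twelve probes.
gc_count  <- c(8, 10, 11, 12, 12, 13, 14, 15, 17, 9, 16, 13)
intensity <- c(5.1, 6.0, 6.2, 7.3, 7.1, 7.9, 8.4, 8.8, 9.9, 5.5, 9.2, 7.7)
gcnum <- 3

# Rank-based split into gcnum bins of (nearly) equal size.
bin <- ceiling(rank(gc_count, ties.method = "first") * gcnum / length(gc_count))

# Hypothetical helper: shuffle intensities only among probes of the same bin.
permute_within_bins <- function(x, bin) {
  out <- x
  for (b in unique(bin)) {
    idx <- which(bin == b)
    out[idx] <- x[idx[sample.int(length(idx))]]
  }
  out
}
permuted <- permute_within_bins(intensity, bin)
```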
score.function |
Function to calculate window scores over the
scores of the corresponding probes, i.e., arithmetic average
(score.function = "mean"), arithmetic mean trimmed by the
minimal and maximal value (score.function = "trimmed"),
or the median (score.function = "median"). Note that the
definition of trimmed mean differs from the common one with given
percentile ranges. Moreover, the resulting scores with trimmed
mean may differ from the mean only in case of windows that contain
more than two probes. The latter two scoring functions are
recommended due to their higher robustness against outliers. However,
due to the higher calculation costs, the running time increases by
selecting "trimmed" or "median". Note that the function is given as
character . |
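The three scoring functions can be sketched in a few lines of R. The helper window_score is hypothetical; in particular, falling back to the plain mean for windows with at most two probes is an assumption derived from the remark that the trimmed mean can differ from the mean only for windows with more than two probes.

```r
# Hypothetical helper mirroring the three documented choices.
window_score <- function(x, score.function = c("trimmed", "mean", "median")) {
  score.function <- match.arg(score.function)
  switch(score.function,
    mean = mean(x),
    median = median(x),
    # trimmed: drop one minimal and one maximal value, then average;
    # assumed to fall back to the mean for windows with at most two probes
    trimmed = if (length(x) > 2) {
      (sum(x) - min(x) - max(x)) / (length(x) - 2)
    } else {
      mean(x)
    }
  )
}

# Made-up probe scores with one outlier.
probe_scores <- c(1, 2, 3, 4, 50)
score_mean    <- window_score(probe_scores, "mean")
score_trimmed <- window_score(probe_scores, "trimmed")
score_median  <- window_score(probe_scores, "median")
```

The outlier pulls the mean to 12, whereas the trimmed mean and the median both stay at 3, which illustrates the robustness argument made above.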
randomize |
Indicates whether an additional permutation is applied prior to the calculation of the original window scores. This makes it possible to roughly estimate the false positive rate: under the assumption of mostly unexpressed probes, no window over permuted intensities is expected to differ significantly from the background distribution. |
diff |
Indicates whether differential expression analysis is applied. |
diff.variant |
The variants of the differential expression analysis
differ in score calculation, in the permutation procedure as well as
in their assignment of statistical significance to windows. The
diff.variant A is similar to the normal expression analysis
but two-tailed p-values are estimated to regard both regulation
directions, up and down. The multiple testing correction is then
adjusted to account for these additional comparisons. The
diff.variant B assumes that entire windows are either up-
or down-regulated between conditions. The presumed direction of
regulation is initially assigned to each window on the basis of its
score. Subsequently, all converse probes, i.e., probes with negative
score within positive windows or vice versa, are ignored and neither
permuted nor incorporated into the score calculation. Consequently,
positive and negative windows are compared to different background
distributions. The p-value estimation and correction are done
in the same way as in the normal expression analysis. Both
variants produce fairly similar results, while variant B is
slightly superior in its performance and hence recommended. |
regions.filename |
Filename of BED-formatted file that contains regions
to which the (differential) expression analysis should be limited.
Hence, only windows entirely enclosed in the union of these regions
are statistically evaluated. Commonly, this parameter is used in
order to identify highly and differentially expressed (highdiff)
regions by restricting the differential expression analysis
(diff enabled) to regions identified as highly expressed in
either one of the corresponding cellular conditions. |
zscore.filename |
Filename of BED-formatted z-score file that may be
written and comprises the analyzed windows of the statistical
analysis. Note that it is only written if zscore.filename is
set accordingly. In this case, it contains the name of the reference
sequence, start and end position, description, estimated
z-score, and ‘+’ as strand for each analyzed window. The description
is an underscore-delimited string of the number of covered probes,
the average GC content of their sequences, the window q-value
multiplied by 100, and the window score calculated with the given
score.function on the probe scores. The z-scores represent
normalized window scores that are calculated by use of the sampled
background. More precisely, the window z-score z is calculated
by z = \frac{x - \mu}{\sigma} where x is the window
score and \mu and \sigma are the mean and standard
deviation of the permuted window scores, respectively. Note that
negative z-scores indicate down-regulated and positive z-scores
up-regulated regions since z-scores are bounded by zero from below or
from above in case of up- or down-regulation, respectively. Hence,
using common expression analysis (if diff is FALSE ),
the z-score cannot be negative. Note that any existing file will be
overwritten. |
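The z-score formula above translates directly into R. The sketch below uses made-up permuted window scores; the helper window_zscore is hypothetical, and the use of R's sd (which divides by n - 1) is an assumption about how the standard deviation is estimated.

```r
# Hypothetical helper: z = (x - mu) / sigma over the sampled background.
window_zscore <- function(x, permuted_scores) {
  (x - mean(permuted_scores)) / sd(permuted_scores)
}

# Made-up background of permuted window scores and one observed window score.
permuted_scores <- c(1.0, 1.5, 2.0, 2.5, 3.0)
z <- window_zscore(4.0, permuted_scores)
```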
output.filename |
Filename of BED-formatted output file that is written
and comprises information on the segments identified as significantly
(differentially) expressed including name of reference sequence,
start and end position, description, score that is uniformly set to
‘10’ (better compatibility of BED files e.g. with UCSC genome
browser), and ‘+’ as strand. The description is an
underscore-delimited string of the name of reference sequence, the start and
end position, the number of probes covered by the segment, the
average GC content of their sequences, the minimal q-value of windows
that were merged into the segment (multiplied by 100), and the
segment score calculated with the given score.function on
the covered probes. |
verbose |
Indicates whether information on progress is printed. |
The overall procedure to analyze tiling array expression data: it reads the
measured probe scores and further information as input, executes the
statistical analysis of these data, and reports the output in terms of
one or two BED-formatted files (depending on whether zscore.filename
is set to NULL ).
None. All output is written to the output.filename and
zscore.filename if zscore.filename is not NULL .
All generated files contain header information marked by ‘#’ at the
beginning of the line that lists the used parameters with their
values and the description of the columns in the subsequent output.
Note that any existing file will be overwritten.
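A written output file could be read back into R roughly as follows. This sketch rests on several assumptions: the file is tab-delimited, the example line and its values are made up, and the column names are hypothetical. The description is split from the right because a reference sequence name may itself contain underscores.

```r
# Write a tiny stand-in for an output file (made-up content).
bed_file <- tempfile(fileext = ".bed")
writeLines(c(
  "# hypothetical header line listing the used parameters",
  "chr1\t1000\t1200\tchr1_1000_1200_6_0.48_1.25_9.7\t10\t+"
), bed_file)

# Header lines start with '#' and are skipped via comment.char.
segments <- read.table(
  bed_file, sep = "\t", comment.char = "#",
  col.names = c("chrom", "start", "end", "description", "score", "strand"),
  colClasses = c("character", "integer", "integer",
                 "character", "numeric", "character")
)

# The last four description fields: number of probes, average GC content,
# minimal q-value multiplied by 100, and segment score.
fields <- strsplit(segments$description, "_", fixed = TRUE)
last4 <- t(vapply(fields, function(f) as.numeric(tail(f, 4)), numeric(4)))
colnames(last4) <- c("n_probes", "mean_gc", "q_times_100", "segment_score")
segments <- cbind(segments, last4)
segments$qvalue <- segments$q_times_100 / 100
```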
## Note that the following example only executes if the external data
## of the Starr R package is available which includes an artificial
## Affymetrix BPMAP file and corresponding CEL files.
path <- system.file("extdata", package = "Starr")
if (path != ""){
  ## define Affymetrix BPMAP file for probe mapping
  bpmap.filename <- file.path(path, "Scerevisiae_tlg_chr1.bpmap")

  ## define Affymetrix CEL files
  ## here: file of control experiment (wt)
  wt.filename <- file.path(path, "wt_IP_chr1.cel")
  ## and files of real experiment with
  ## tagged Rpb3 with two replicates (ip)
  ip.filename <- c(file.path(path, "Rpb3_IP_chr1.cel"),
                   file.path(path, "Rpb3_IP2_chr1.cel"))
  stopifnot(file.exists(bpmap.filename) &&
            all(file.exists(ip.filename)) &&
            file.exists(wt.filename))

  ## identify highly expressed segments in IP state
  ## (only 100 permutations as example)
  ## Note that group is here '' (blank) for old Affy chr21/22 arrays
  ## but commonly it is "Hs" for Human, "Mm" for Mouse or "Dm" for Drosophila
  TileShuffle(bpmap.filename=bpmap.filename, cel.filename=ip.filename,
              input.type=2, group="", pmonly=TRUE, normalize=TRUE,
              noofperms=100, winsize=200, qvalue=0.05, gcnum=3, diff=FALSE,
              output.filename="ip_high.bed",
              zscore.filename="ip_high_zscore.bed", verbose=FALSE)

  ## identify highly expressed segments in wt state
  ## (only 100 permutations as example)
  TileShuffle(bpmap.filename=bpmap.filename, cel.filename=wt.filename,
              input.type=2, group="", pmonly=TRUE, normalize=TRUE,
              noofperms=100, winsize=200, qvalue=0.05, gcnum=3, diff=FALSE,
              output.filename="wt_high.bed",
              zscore.filename="wt_high_zscore.bed", verbose=FALSE)

  ## concatenate files
  file.create("ip_and_wt_high.bed")
  file.append("ip_and_wt_high.bed", "ip_high.bed")
  file.append("ip_and_wt_high.bed", "wt_high.bed")

  ## identify highly and differentially expressed segments (highdiff)
  ## that are highly expressed in control or real experiment and
  ## significantly differentially expressed between both conditions
  ## (only 100 permutations as example)
  ## Note: common differential analysis without identifying highdiff segments
  ## is executed simply by omitting the parameter 'regions.filename'
  TileShuffle(bpmap.filename=bpmap.filename, cel.filename=wt.filename,
              cel2.filename=ip.filename, input.type=2, group="", pmonly=TRUE,
              normalize=TRUE, noofperms=100, winsize=200, qvalue=0.05,
              gcnum=1, diff=TRUE, regions.filename="ip_and_wt_high.bed",
              output.filename="ip_and_wt_highdiff.bed",
              zscore.filename="ip_and_wt_highdiff_zscore.bed", verbose=FALSE)

  ## cleanup
  rm(bpmap.filename, wt.filename, ip.filename)
}
rm(path)