FuzzyClustering Guidelines

Go to evaluation of the output if you need a quick help to optimally run FuzzyClustering.
If you have not used FuzzyClustering before, go through the following steps :
Introduction
FuzzyClustering Algorithm
Output to the user
Evaluation of the output
Optionally, also read an elaborate discussion on the results of FuzzyClusterings performance in a case study.
Case study: Demonstrate the added value of constructing ensemble motifs using FuzzyClustering on real benchmark datasets.

Introduction

FuzzyClustering constructs ensemble motifs from a list of multiple motif detection solutions (by MotifSampler or another stochastic motif detector) by analyzing the detected motifs at their instance level. An ensemble motif is composed of subsets of instances (cluster) that often occur together in multiple motif detection solutions. Calculating ensemble motifs helps prioritizing the most likely true instances while filtering out spurious instance predictions. FuzzyClustering can be used on any list of motifs (instances) irrespective of the tool that was used to predict these motifs.

In what follows, we give a comprehensive description of FuzzyClustering. We provide guidelines on selecting program parameter settings and evaluating the output. If you want to use this application, go back to the applications webpage or setup the FuzzyClustering commandline in case of standalone use. Note: all program parameters described below have default settings in FuzzyClustering unless stated otherwise.

FuzzyClustering Algorithm

The input for FuzzyClustering is a file (parameter -f) containing a list of detected motifs reported by multiple runs of MotifSampler or any other motif finder in instances format. Each detected motif is described by a motif identifier (e.g. '#id: motif1' or '>motif1') and a set of instances annotated in a particular sequence set. The annotation of an instance in a sequence describes the sequence identifier, the start and end position and strand of the instance in the sequence and its nucleotide description e.g. seq13 122 139 + CTGCCTAC. Optionally, the nucleotide description is followed by a numerical value (instance weight) reflecting the reliability of the detected instance as assessed by the motif finder (if no weight is supplied by the user, the value is set to 1 by FuzzyClustering).

FuzzyClustering assembles the input of the motif detection results in a matrix F as follows (Fig.1):

Fig.1: Representation of the input of the motif detection results in a matrix F.

The columns in matrix F represent all instances described in the input file and the rows represent the identifiers of all detected motifs described in the input file. A matrix entry equals zero when a particular motif does not contain a given instance (e.g. motif i does not contain instance 1), else it equals weight(i,j): the instance weight of the given instance(j) for this particular motif(i). Subsets of instances that frequently occur together in subsets of detected motifs will show as sub-matrices (clusters) consisting of mainly non-zero values in matrix F (Fig.2).

Fig.2: Sub-matrices consisting of mainly non-zero values (clusters) correspond to ensemble motifs.

FuzzyClustering will iteratively extract the most cohesive cluster that each time corresponds to an ensemble motif by means of a Singular Value Decomposition (SVD, based on Joshi et al., 2008) of matrix F.

PREPROCESSING OF MATRIX F
A high number of zeros in matrix F complicates the cluster extraction process (SVD) or will decrease the quality of the extracted ensemble motif (small or in-cohesive clusters). Fig.3 demonstrates how the number of zeros in matrix F drastically increases when detected motifs describe the same motif on a slightly shifted position compared to using a 'merged' instance that represents all shifted versions of the motif in a given sequence and strand.

Fig.3: Impact on the sparsity of matrix F when representing shifted versions of the same motif (green rectangles) in a given sequence/strand as different instances (left) or as one merged instance (right).

On the other hand, treating overlapping instances as members of the same motif (and thus using a merged column in matrix F) despite them belonging to truly different motifs that are located in each other's neighborhood on the sequence, will overestimate the clustering signal of a wrongly merged motif. As a trade off, all instances that overlap for at least 50% with the most frequently detected instance amongst a set of overlapping instances are treated as belonging to the same motif.
For further analysis, overlapping instances of the same motif are represented by one 'merged' instance (see Fig.3, right) with a start and end position that equals the average of the start and end positions of the overlapping instances that this merged instance represents (each instance contributes proportional to its instance weight (weight(i,j)) in the computation of the average start and end). In matrix F, all columns that correspond to overlapping instances of a same motif are replaced by one merged column that now corresponds to the merged instance.
Instances (merged) that have a detection frequency (defined in Fig.4) lower than a threshold set by parameter -p, also correspond to columns in matrix F with many zero values. These spurious instances are removed from the input as they might interfere with the clustering of the more important instances (those with a higher weighted detection frequency).

Fig.4: The contribution of each instance in the computation of the detection frequency of a merged instance j is proportional to its instance weight (= weight(i,j) in matrix F). Formula : the first factor equals the average of the weights of all instances that were merged in instance j, the second factor equals the number of occurences (sum o(i,j)) of (shifted versions) of instance j divided by the number of detected motifs (M) in the input file.

EXTRACTING MOTIFS FROM MATRIX F (Fig.5)
Applying SVD on matrix F results in the extraction of the most cohesive cluster of instances and detected motifs from matrix F that all correspond to the same ensemble motif. Each instance and each motif in the extracted cluster get assigned a score (membership). The membership is a relative measure that reflects how well an instance (motif) represents the ensemble motif compared to the other instances (motifs) in the extracted cluster. The higher the memberships of all instances and motifs in the cluster, the more confidence we can have in the reliability of the ensemble motif. Instances (motifs) with a low instance (motif) membership are removed from the cluster as they are considered to represent unreliable predictions that are not sufficiently well supported by the data. An instance (motif) membership is found low when it is lower than a fraction (set by threshold) of the highest instance (motif) membership observed in the cluster. In default mode, the instance and the motif fraction threshold are (separately) computed by the program as the value that best balances the size of the cluster (number of remaining instances (motifs) in the cluster) with the overall reliability of the cluster (coherence of all instance (motif) memberships in the cluster). In non-default mode, the fraction threshold (in %, applicable for both instance and motif membership) is set by parameter -m.

Fig.5: A cluster extracted from matrix F consists of a set of instances and motifs that all represent the same motif. Each instance and motif is prioritized by a membership score. Only instances and motifs with a membership score higher than an internally computed (or user-defined) membership threshold will be considered.

Before extracting a second cluster (ensemble motif), each entry in matrix F that corresponds to an instance and a motif that were both assigned to the first cluster is set to zero in matrix F. Note that instances (motifs) assigned to a previously extracted ensemble motif can still contribute to a subsequent motif (because only assigned instance-motif entries are set to zero, not the full column (row) of an assigned instance (motif)), referring to the 'fuzzy' aspect of the clustering algorithm. The cluster (ensemble motif) extraction process continues until no more clusters can be extracted (matrix F becomes too sparse to apply SVD).

POSTPROCESS EXTRACTED MOTIFS
>From the ensemble motifs extracted by FuzzyClustering, only the ones that are expected to represent true motifs are reported in the output. First, we consider as true motifs those that are overrepresented in most sequences of the input sequence set (assuming the input sequences are choosen based on high coregulatory evidence): a cluster is only retained if the instances contributing to the cluster occur in a minimal fraction (as set by parameter -i, default 50%) of the sequence set (note that the sequence set is defined by all sequence identifiers listed in the input file supplied to FuzzyClustering - this sequence set is not necessarily the same as the one supplied to MotifSampler as there may be sequences in which no motif was detected by MotifSampler). Secondly, a cluster must be supported by multiple solutions of the motif finder: a cluster is only retained if the weighted count of contributing motifs (C(m)) to the cluster (computed as the sum of all non-zero motif memberships normalized by the highest motif membership in the cluster) is higher than a minimal fraction (as set by parameter -j, default 10%) of the number of supplied motifs to FuzzyClustering.

To construct a PWM representation of an ensemble motif, the instances contributing to the cluster are aligned, starting with the most reliable instance (highest instance membership) and gradually adding instances with lower instance membership with a maximal nucleotide match, minimal shift and at least 50% overlap to the most reliable instance. The PWM is derived from this alignment as follows: each instance contributes a nucleotide count for A, C, G or T (or N if the instance is not described on a particular position in the alignment) that is proportional with the instance membership in the cluster (together with a small pseudocount). We define the core PWM as the one having the same width as the most reliable instance. This core PWM is further extended to the left and right as far as the alignment reaches and the PWM with the highest CS score is chosen as the representative of the extracted ensemble motif. Finally, only ensemble motifs for which the representative PWM has a CS score higher than a minimal setpoint (parameter -c, default 0.5) are reported in the output.

Output to the user

The retrieved ensemble motifs are reported in the order according to which they were iteratively extracted.
FuzzyClustering reports each ensemble motif in two files:

1) in text file -o: the first sections of this file describe information on the clustering process (see below *). The different ensemble motifs are described towards the end of the file in the section '#MotifClusterRun::PrintClusters() :'.

In this section, the description of a motif starts with a motif identifier (#id:) that summarizes useful information on the ensemble motif (read format for a comprehensive description).
Next follows a line starting with '#CS-distribution :'. This line reports the CS scores for the range of PWMs of different widths that were computed for the ensemble motif. The CS score in [brackets] is the score of the PWM with highest CS score. The CS scores at respectively the (left) and (right) of the [CS score] are the scores of the extended PWMs (each time extended with one extra nucleotide to the left respectively right in the alignment).
Example: #CS-distribution: (1.33)[1.44](1.20, 1.34).

The line starting with '#instance memberships' reports the weighted count of instances contributing to the ensemble motif (N(i), computed as the sum of all instance memberships normalized by the highest instance membership in the list) and the weighted average occurrence (Occ(i), indicator for the detection frequency of the ensemble motif, computed as the sum of all instance occurrences (occ) multiplied with their instance membership, normalized with the total sum of all instance memberships):
Example: #instance memberships: (N(i)=13, Occ(i)=24)
This line is followed by a listing of all (merged) instances of the given ensemble motif. For each instance, an annotation is given (the sequence in which it occurs, start, end, strand and nucleotide string), followed by (shift:) its shift compared to the consensus description of the representative (highest CS-scoring) PWM of the ensemble motif, (occ:) the number of times this instance (or a shifted version that represents the same motif) occurred in the input file -f, and -separated by a tab- (membership:) a value that prioritizes how relevant this instance is for the given ensemble motif compared to the other listed instances.
Example: seq10_59_77_+_TAATCCTACAGGCGTAAGA_(shift=0,occ=55) 0.447

The line starting with '#motif memberships' reports the weighted count of contributing motifs to the ensemble motif (N(m), computed as the sum of all motif memberships normalized by the highest motif membership in the list), this value is compared with the threshold derived from parameter -j)
: Example: #motif memberships: (N(m)=45)
The line precedes a listing of motif identifiers (i.e. 'id:' the same as described as identifier in the input file -f), each followed by a value (membership) that prioritizes how well the motif corresponds to the ensemble motif compared to the other listed motifs.
Example: box_24_2_TCATCrrTAyAATmnATGA 0.447

2) in matrix file -O: each ensemble motif is described by the highest scoring (CS) PWM representation (read PWM format). If another PWM representation of longer width (but lower CS score) has flanking regions that show an interesting conserved subregion that is not shown by the highest scoring PWM (e.g. for motifs with a variable spacer), then this PWM is also printed to the output. In case two PWMs are reported, the ensemble motif identifier ('#ID:') of the highest scoring PWM is marked by the symbol 'CS-based' and the second PWM by the symbol 'width-based'.

(*) In addition to the ensemble motifs, the text file (-o) also prints information on the clustering process:

- The section '#MotifClusterRun::CheckDatasetDimension' starts with ('#DatasetDimension LOADED:') a description of the total number of instances, sequences and motifs present in the input file (-f). The second line ('#Dataset dimension FILTERED:') reports the numbers after merging overlapping instances and removing spurious instances, followed by the lines '#::Noisy sequences' and '#::Noisy motifs' that describe the (number) and [identifiers] of all noisy sequences and noisy motifs (i.e. that only had spurious instances) that were removed from the input (preprocessing).
In the section '#List of merged non-spurious instances', each retained (merged) instance is described (on a separate line) by its annotation and in (brackets) its number of (shifted) occurrences in the input file. If the retained instance was detected at shifted versions in the input file, a listing of all shifted versions of this representing merged instance is given in [brackets].
Example:
seq10_12_29_+_TGTACCGATGATTTATAG(2):[seq10_9_26_+_GATTGTACCGATGATTTA(1),seq10_11_28_+_TTGTACCGATGATTTATA(2),seq10_15_32_+_ACCGATGATTTATAGTTT(1)]

-The section #MotifClusterRun::Extraction() describes the number of clusters that was extracted by FuzzyClustering. Not necessarily all clusters correspond to likely true motifs. The number of rejected clusters because of parameter -i (ensemble motifs must be present in a minimal fraction of the sequence set), parameter -j (ensemble motifs must be sufficiently supported by a minimal fraction of the supplied motifs) or parameter -c (ensemble motifs must have a minimal degree of conservation) is described in the subsequent lines.
Example: ##MotifClusterRun::Extraction(): 5 clusters clusters extracted by edgeclust.m
# 1 clusters rejected for low (<= 24) number of sequences (threshold derived from parameter -i)
# 1 clusters rejected for low (<= 10) number of supporting motifs (threshold derived from parameter -j)
# 0 clusters rejected for low (<= 0.5) Consensus score (threshold set by parameter -c).

Evaluation of the output

- '#MotifClusterRun::CheckDatasetDimension' gives valuable information on the quality of the input supplied to FuzzyClustering:
A high number of noisy sequences (reported in '#::Noisy sequences') means that the sequence set supplied to MotifSampler consisted of many noisy sequences (i.e. not containing a motif). Noisy sequences decrease the motif to noise signal for de novo motif detection and consequently also the clustering signal for FuzzyClustering.
A high number of noisy motifs (reported in '#::Noisy motifs') means that many runs in MotifSampler detected only spurious instances (i.e. not corresponding to a motif). This may occur because the true motif is not strongly overrepresented (thus difficult to find by de novo detection), or because MotifSampler was run with suboptimal parameter settings.
A high number of spurious instances (compare the number of instances LOADED with the number of instances FILTERED) could indicate that MotifSampler was run with a parameter setting that overestimated the actual number of instances in each sequence, resulting in an enforced detection of spurious false instances (parameter -p in MotifSampler).
A high number of shifted versions for many instances (described for each instance in '#List of merged non-spurious instances') might indicate that MotifSampler was run with an under-or overestimated setting for the motif width (parameter -w in MotifSampler).
Careful inspection of the output of FuzzyClustering allows removing identified noisy sequences from the sequence set and repeating MotifSampler on the skimmed sequence set with optimized parameter settings to create a new input file -f that is expected to have a stronger true motif signal.

- Changing the setting for parameters -i, -j and -c influences the number of ensemble motifs that is finally reported. If the number of clusters that is rejected because of a low number of sequences (-i), a low number of data-motifs (-j) or a low consensus score (-c) is high, you may want to repeat FuzzyClustering with revised settings for these parameters (reset to a lower value or to zero) to explore the meaning of the ensemble motifs that correspond with these otherwise rejected clusters. The range of -c is [0, 2[, the range for -i and -j is [0, 1[. Note that changing these parameters does only affect the number of ensemble motifs that are reported. It does not change the actual clustering results.

- Changing the setting for parameters -p and -m does affect the clustering, impacting not only (and not necessarily) the number, but mainly the quality of the reported ensemble motifs. Increasing parameter -p will remove more instances from the input as being spurious, shifting the focus on reporting only strong ensemble motifs that are mainly described by their most conserved (easily detectable) instances. Parameter -m is in general best set to default mode and the computed fractional thresholds can be read from the text output file (-o, %InstanceCut and %MotifCut). If instances (most likely less conserved and therefore difficult to detect) are not assigned to the correct cluster (that represents the true motif to which they belong), they will interfere with the clustering of subsequentially extracted motifs. Resetting parameter -m to a value lower than the values computed in default mode therefore not only allows exploring less reliable, but not necessarily untrue predictions, it might also improve the extraction of subsequent (less significant but not necessarily less true) motifs.

- The significance of a reported motif can be assessed 1) by the order in which the motif is reported (strong motifs will be clustered first), 2) by the detection frequency of the instances of the ensemble motif (a high weighted_average_occurrence Occ(i) means that the motif finder detected the ensemble motif instances repeatedly) preferably in combination with a well supporting number of corresponding motifs (an equally high weighted_motif_count N(m) means that the ensemble motif instances were repeatedly detected together) and 3) by the coherence of the reported instance memberships (instances of a strong motif will all have equally high memberships if the detection of the motif is not interfered by noise or weak motifs present in the sequence set).

- The membership of an instance reported by FuzzyClustering is a value to be compared with the memberships of other instances of the same ensemble motif to assess the relative reliability of the respective instance predictions. You should not compare absolute memberships of instances between different ensemble motifs (the same applies for the reported motif memberships). If the difference between the maximum and minimum instance membership score (dM(i)) in an ensemble motif is higher than 0.1, we recommend to repeat FuzzyClustering with an external setting for parameter -m (higher than the %InstanceCut reported in the output) to further remove some unreliable instances (but make sure the weighted_average_occurrence Occ(i) stays of the same order to not loose true instances).

- The distribution of CS scores ('#CSdistribution') of PWM representations of different width of an ensemble motif shows the importance of different regions in that motif. The most conserved part of the motif is represented by the highest scoring PWM in [brackets]. Usually slightly longer motifs (extensions to the left or right of the core motif) have a lower CS: as the true motif width is unknown, shifted instances of the motif detected by the motif finder contribute to these flanking ends, but most instances contribute to the core. A high number of less conserved flanking positions is indicative for an overestimated (in that case most instances of the ensemble motif are longer than the core motif) or an underestimated setting for the motif width used in the motif finder. If the true motif has a variable spacer flanked by two conserved regions, the CS score will first drop upon extending the core motif but will then increase again.

- The solutions of multiple motif detection runs of different motif finders on the same sequence set can also be merged with FuzzyClustering. In that case make sure each motif identifier has a reference to the respective motif finder by which it was detected. The motif memberships for an ensemble motif reported by FuzzyClustering indicate which (runs) of the motif finders detected the ensemble motif best. Finding that an ensemble motif was detected by all of the different motif finders increases the confidence in a reported ensemble motif. If on the other hand an ensemble motif was mainly detected by (one or a few) particular motif finders, this learns that some of the motif finders did not perform well (a revision of their parameter settings may be needed).

Feedback

Contact us if you have comments, questions or suggestions or simply want to react on the contents of this guideline. Thank you.