===================================================================================== ===================================================================================== ====================== User's Guide for dCLIP ===================================== ===================================================================================== ===================================================================================== WHAT IS dCLIP ============= dCLIP is written in Perl for discovering differential binding regions in CLIP-Seq (HITS-CLIP, PAR-CLIP or iCLIP) experiments. It is appropiate in experiments where the common binding regions that are significantly enriched in both conditions tend to have similar binding strength and when researchers are more interested in the difference in binding strength rather than the binary event of whether binding site is common or not. For example, dCLIP will work when researchers would like to know the differential binding sites of AGO protein under a wild-type and miRNA knockdown condition. In dCLIP analysis, duplicate reads that have the same mapping coordinates are collapsed and only one read with the highest number of mutations is retained. Then dCLIP identifies CLIP clusters from the remaining reads of both experiments. Then dCLIP would divide CLIP clusters into bins of the same size, count tag intensities in each bin and normalizes the two conditions by MA plot followed by linear regression. Using the normalized M values, dCLIP implements a Hidden Markov Model and Viterbi algorithm to identify differential and common binding regions. The CLIP-Seq experiments can be done in single-end mode or paired-end mode but must be in a strand-specific manner. dCLIP takes the SAM format files of two experiments as input and produces 10 files as output. Most of the output files can be uploaded to UCSC Genome Browser for visualization. dCLIP only uses tag enrichment for inference of differential binding regions. Mutation information is not used for statistical inference but is still collected and written to output files. In addition, the users can choose a specific mutation type or combination of mutations types as characteristic mutations. For example, "T2C" for PAR-CLIP and "Del" for HITS-CLIP. HOW TO CITE =========== When using dCLIP, you may want to cite: Wang T, Xie Y, Xiao G (2014) dCLIP: a computational approach for comparative CLIP-seq analyses. Genome Biol 15: R11. INSTALLATION ============ You should have a perl version >= 5.16 to run dCLIP. Also you need to install Perl's PDL module and PDL::Stats module to run dCLIP. If you don't have or don't know how to install this package. Please check the MODULE file, which has some suggestions for installing Perl modules. With it installed, just use the following command to decompress the package $ tar zxvf dCLIP_1.6.tar.gz And the dCLIP.pl script will be placed in the bin directory, you can add dCLIP bin folder to your path. To call the software, use $ perl dCLIP.pl [options] If you are to provide dCLIP with background sequencing files, you also need to have the compiler gcc in your directory. For the moment, running in background control mode is only available in Unix-like systems. RUNNING dCLIP ============= Input: -f1 The SAM format file of the first condition. -f2 The SAM format file of the second condition. -pair If the aligned SAM format files are from single-end experiments, leave this option unset. For paired-end files, set this option to the suffix of the names of forward reads and backward reads. For example, "F3,F5-RNA". -m1 The minimum number of tags for the first condition. All tags from both conditions are pooled, collapsed and overlapped to form clusters. Only clusters with at least m1 tags of the first condition or m2 tags of the second condition will be considered. Default: 5. -m2 The minimum number of tags for the second condition. Default: 5. Directory: -temp The temporary directory to store intermediate files. Default: ".". -dir The folder to store final output files. Default: ".". Parameters: -step The step size of profiling tag intensities. This controls the resolution of the Hidden Markov Model. Default: 5. -filter A filter value used for defining regions with significant binding in both conditions. A higher value will be more conservative in calling differential binding sites for regions with small number of tag intensity counts. Should be set >=0. A larger dataset generally should have a larger filter value (like 10 for a dataset > 40M reads). Default: 2. -mut The mutant type(s) of the marker mutations. Can be any one or combination (separated by comma) of "T2C","T2A",...,"A2G","Del","Ins". For example, "T2C,A2G" will include T-to-C and A-to-G mutations as marker mutations. "all" will include all types of mutations. Default: "T2C". -max The maximum number of iterations allowed for the Hidden Markov Model. Default: 10. -pre The precision of the criterion for convergence. Default: 0.001. -process Whether to upload a user-processed tag intensity count file. Default: "". If set, it will override -f1, -f2, -pair, -m1, -m2, -mut. -iCLIP Set if the supplied dataset is from iCLIP experiments. It will override -pair, -m1, -m2, -mut. The nonnegative integer value represents how many nucleotides the algorithm will expand each cDNA count in both upstream and downstream directions. For example, if set to 7, each cDNA count will be expanded into 15 nt counts centered on that cDNA count. If set to 0, no expansion will be done. The expansion is recommended because cDNA counts may have some offset from the real protein binding sites. Suggested value 2 to 7. -control Provide the background abundance control for -f1 and -f2 (Only for HITS-CLIP and PAR-CLIP). The format is "-control background_1.sam,background_2.sam", corresponding to the CLIP-Seq sequencing files in the same order. This will override -m1 and -m2. Instead a minimum number of tags of 20 coming from any one or more SAM files (both CLIP and control) is always used. This will also override -process. -fr The mode of the paired-end sequencing. The default is "fr-second". The alternative is "fr-first". Check the following for more details. Help: -h Print this help message. MORE DETAILS ============ 1) For paired-end experiments, reads are usually named with suffix specifying the forward and backward segment for the same read. For example, 604_829_626_F5-RNA 604_829_626_F3 167_570_1179_F3 167_570_1179_F5-RNA "F3" means forward strand and "F5-RNA" means backward strand, so the pair parameter should be set to F3,F5-RNA in this case. In other cases, the pair paramter may be set to 1,2 or other character strings. Sometimes, the aligner will trim the suffix. For example, "HWI-ST188:8:2217:5190:132924\#0/1" and "HWI-ST188:8:2217:5190:132924\#0/2" are one mate and certain aligners will only write "HWI-ST188:8:2207:5196:132923\#0" for both segments in the alignment file. In such cases, please set the suffix to "" and "" or "\#0" and "\#0". The point is to make the remaining part of the read names the same for a mate. 2) dCLIP treats paired-end sequencing data as fr-secondstrand by default (consult Tophat manul for definition). It can also handle fr-firststrand if you use opposite bases when specifying substitution mutations. For example, instead of specify T2C, you can use A2G. dCLIP cannot handle any other type of fr or any type of ff libraries. 2) The m1 and m2 parameters are designed to limit differential analysis to larger clusters. For those small clusters with only a few reads, they are not very interesting for biologists not matter whether they have differential binding strength or not. Also, bigger m1 and m2 values will make the program run faster. I would recommend a value between 5-15 for m1 and m2. 3) For PAR-CLIP experiments, the characteristic mutation type (mut) is "T2C" or "G2A" depending on the nucleoside analog used. For HITS-CLIP experiments, it has been suggested that "Del" mutation is characteristic of AGO and Nova proteins. However, this may not be the case for other proteins. So if the user is interested in mutation information, it may be a good idea to try different types and combinations of mutations and to see which setting makes the most sense from downstream analysis. 4) Reads with gapped alignments are discarded. However, gapped-aligned reads usually comprise a very small portion of total reads (<2%) so it should be ok to neglect these alginments. 5) For one experimental condition with more than one replicate, it is ok to simply concatenate the SAM format alignment files like this $ cat file1.sam file2.sam file3.sam > file.sam 6) The user can supply their own tag intensity file. This will be convenient if the users want to customize the pre-processing step, such as collapsing reads to tags in a different way or normalizing tag intensity count by expression data if such data are available. The format of the tag intensity file must stick to the following description. * The file is tab-delimited text file, with no column names and no row names. * Each line corresponds to a bin in one cluster. The bins in the same cluster are in continuous rows in increasing order of their positions on the chromosome. * The first column is the id number for cluster, starting from 1. All bins in the same cluster share the same id. * The second column is the chromosome name. * The third column is the genomic position. The position must be a multiple of step. For example, if step=10bp, then the positions for all the bins in one cluster can be 100,110,120,...,230. As neiboring bins are adjacent on the chromosome, the difference in their positions must be step. * The fourth column is strand (+ or -). * Starting from the fifth column, there are step columns, each column corresponding to the total tag count in condition 1 in that bin, starting from the position in the third column. This is how the tag count on each base pair in a bin is stored. * Following are another step columns, each column corresponding to the mutant tag count in condition 1 in that bin, starting from the position in the third column. * Following are another step columns, each column corresponding to the total tag count in condition 2 in that bin, starting from the position in the third column. * Following are another step columns, each column corresponding to the mutant tag count in condition 2 in that bin, starting from the position in the third column. Here is an example of the format of the tag intensity count file (step=3bp): 1 chr18 102 + 33 34 35 4 0 0 33 20 20 2 0 1 1 chr18 105 + 21 24 28 0 15 20 30 50 51 2 0 2 1 chr18 108 + 40 32 20 0 0 0 10 11 1 0 0 1 2 chr18 450 + 1 4 29 0 0 0 100 113 122 20 0 3 2 chr18 453 + 120 98 76 0 0 2 89 111 122 0 0 3 7) If the user is supplying iCLIP datasets: * iCLIP datasets must be single-stranded. * Barcodes should be trimmed before alignment. A supplementary Perl script, remove_barcode.pl, in the bin folder is provided for the ease of users that takes in Fastq files and trims barcodes. * Turning on the -iCLIP option will set dCLIP to takes in SAM alignment files for iCLIP datasets. * In the iCLIP mode, total tag count will be the cDNA count, mutant tag count will be 0. For explanation of cDNA count, please check the following publication: Wang Z, Kayikci M, Briese M et al. iCLIP predicts the dual splicing effects of TIA-RNA interactions, PLoS Biol 2010;8:e1000530. * Each cDNA count is expanded upstream and downstream by a few nucleotides (specified by the user). Therefore, the total tag count in each bin is actually sum of expanded cDNA counts. 8) If you are running without background transcript abundance controls, I would suggest to set the filter parameter between 5-10 depending on your sequencing depth. If background controls are provided, I would suggest to set the parameter filter to a smaller value. The rule of thumb is set it to roughly the ratio of the sequencing depth of the CLIP-Seq samples vs. the control samples. For example, if the CLIP-Seq samples have around 30M reads and the control samples have around 50M reads. You can set filter to around 0.6. 9) The -fr parameter is used only for paired-end sequencing data. There are two possible values, "fr-second" and "fr-first". Usually paired-end sequencing is done in "fr-second" mode (http://tophat.cbcb.umd.edu/manual.shtml). The "fr-second" mode is the same as the "yes" mode defined by htseq (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html). The alternative, less frequently used, is "fr-first" mode by tophat definition or "reverse" mode by htseq definition. OUTPUT FORMAT ============= The program will produce 10 output files in the output folder. 1) 8 of these files named like "File1_Mutant_neg.bedgraph" are bedGraph format files storing the total or mutant tag counts on the + or - strand in the first or second condition. For example, "File1_Mutant_neg.bedgraph" stores the mutant tag count on the - strand in the first condition as a bedGraph format file. The resolution of these files is 1bp. These files can be uploaded to UCSC Genome Browser for visualization. 2) "dCLIP_output.txt" stores the detailed information of the raw data and Hidden Markov Model inference results at the resolution of the step bp. The first id column is an id number of the CLIP clusters. The chrom, strand and position specify the genomic location of each bin of step size. The next two columns, state and probability, specify the inference results of the Hidden Markov Model. The differential column is the normalized difference of the tag intensities between the two conditions. The last four column, tag1, mut1, tag2 and mut2, specify the total tag intensity and mutatant tag number of the two conditions in each bin. Note: For the state column, 0 refers to a bin with more binding in condition 2 than condition 1 1 refers to a bin with equal binding in both conditions 2 refers to a bin with more binding in condition 1 than condition 2 3) "dCLIP_summary.bed" stores the summary of the inference results. Neighbouring bins in the same cluster with the same inference results are collapsed to be represented as one region in the BED file. The fourth column is the same id number as used in the dCLIP_output.txt file. The fifth column is the average binding strength of condition 1 in this continuous region if in state 2, the average binding strength of condition 2 if in state 0 and 0 if in state 1. I recommend to use this value for ranking regions with differential binding (state 0 and state 2 regions). This file can be uploaded to UCSC Genome Browser. Note: For the color coding Red bars are regions where condition 2 has stronger binding than condition 1 Green bars are regions with equal binding strength for both conditions Blue bars are regions where condition 1 has stronger binding than condition 2 EXEMPLARY DATA ============== In the test folder there are two demo SAM files for demonstrating the usage of the dCLIP software. These two files are a small portion of the published dataset GSE41285. Here is a link to the original publication: http://www.sciencedirect.com/science/article/pii/S1097276512008544. Please cd into the dCLIP directory and run the following command. Results will also be placed in the test folder. $ perl bin/dCLIP.pl -f1 test/knockdown.sam -f2 test/wildtype.sam -m1 2 -m2 2 -temp test -dir test -mut "T2C" $ perl bin/dCLIP.pl -f1 test/knockdown.sam -f2 test/wildtype.sam -m1 2 -m2 2 -temp test -dir test -mut "T2C" -control test/wildtype.sam,test/knockdown.sam -filter 0.5 OTHER ===== Version: 1.6.3 Date: 4/29/2014 Author: Tao Wang Email: tao.wang@utsouthwestern.edu