All of the programs I wrote can be accessed from GitHub. The following are detailed descriptions and associated datasets.
Modular XIST Structures and Functions.
Using a combination of orthogonal computational and experimental approaches, we have built the first whole-transcript-level structure and interaction model for the XIST RNP complex. The source data used for plotting figures in two papers are available here: XIST_data.
Psoralen Analysis of RNA Interactions and Structures
We developed a set of tools for the analysis and visualization of RNA structural data generated with PARIS. The programs are available in my GitHub repositories and in Cliff’s. Visualization of the structures is enabled in IGV, thanks to the contribution of Jim Robinson and Jill Mesirov.
The raw data are available from NCBI GEO and SRA (GSE74353). The processed data that are not hosted at NCBI are provided here for people who are interested but do not want to re-process the data. The following processed RNA structurome data can be downloaded from here: HeLa, two replicates (bam and duplex group bed), HEK293, three replicates (bam and duplex group bed) and mouse ES, three replicates (bam and duplex group bed).
The RNA interactome data can be downloaded from these links: HEK293 RNA interactome (Rfam), HEK293 RNA interactome (Rfam + mRNA), mouse ES RNA interactome (Rfam), mouse ES RNA interactome (Rfam + mRNA)
The “genome” references for the interactions can be accessed here: Human Rfam, Human Rfam + mRNA, Mouse Rfam, Mouse Rfam + mRNA
Vicinal — the chimeric read analysis tool
The recent introduction of high-throughput sequencing technology sparked a revolution in biology. Sequencing data (e.g. RNA-seq) contain more information than people have realized, and creative ways of analyzing the large amount of published and to-be-published data are necessary to bring out the full potential of this technology.
Vicinal utilizes locally (which means partially here) mapped reads, derived from self-priming and ligation, to precisely determine the termini of ncRNAs and provide support for predicted terminal stem-loops. The scripts, in Python and shell, can be downloaded from this page (see below). Comments that help make the tool better are welcome.
Initial examination of the chimeric reads derived from ncRNAs reveals two important features, irrespective of whether they arose via self-priming or ligation. First, the two parts of each chimeric read map close to one another, usually within 100 nt. This distance is largely determined by the size of the terminal stem-loop. Second, the two parts of each chimera map to opposite strands of the encoding DNA, unlike reads from spliced RNAs, whose parts map to the same strand.
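The two features above can be turned into a simple test on a pair of mapped read parts. The following is a minimal Python sketch, not the Vicinal code itself: the Segment class, the function names and the hard 100 nt cutoff (max_dist) are assumptions made for illustration, since the text only says the parts are "usually" within 100 nt.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One mapped part of a read (hypothetical container, not from Vicinal)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def gap(a, b):
    """Distance in nt between two intervals; 0 if they touch or overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def looks_like_stemloop_chimera(a, b, max_dist=100):
    """Apply both features: parts close together AND on opposite strands.

    max_dist=100 is an assumed cutoff taken from the "usually within 100 nt"
    observation; adjust it for the RNAs of interest.
    """
    return (
        a.chrom == b.chrom
        and a.strand != b.strand
        and gap(a, b) <= max_dist
    )
```

A pair such as chr1:100-130 (+) and chr1:150-170 (-) passes both checks, whereas two parts on the same strand, as expected for a spliced read, are rejected.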
To obtain chimeric reads, we first mapped raw reads to genome references using bowtie2 in the --sensitive-local mode. Note that TopHat does not work because it does not allow local mapping. We filtered the mapped reads to look for ones that are soft-clipped, i.e., only one part of each read is mappable. Then we mapped the unmappable parts of the soft-clipped reads to the opposite strand of the mappable parts, and close to the mappable parts. Reads with two parts mapped to opposite strands were used to construct a potential terminal stem-loop secondary structure (picture on the right).
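The soft-clip filtering step can be illustrated by reading the CIGAR string that a local aligner writes to its SAM output, where the operation S marks soft-clipped bases. This is a hedged Python sketch rather than the actual Vicinal script: the function names and the min_len threshold of 10 nt are hypothetical, and the read sequence is assumed to be given as it appears in the SAM record.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths from a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can flank soft clips; they are absent from SEQ, so skip them.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def unmapped_segments(seq, cigar, min_len=10):
    """Return the soft-clipped (unmappable) parts of a read.

    min_len is an assumed threshold: very short clipped parts cannot be
    placed reliably in a second mapping round.
    """
    left, right = soft_clip_lengths(cigar)
    segments = []
    if left >= min_len:
        segments.append(("left", seq[:left]))
    if right >= min_len:
        segments.append(("right", seq[len(seq) - right:]))
    return segments
```

For a 35 nt read with CIGAR 20M15S, the last 15 nt are returned as the part to be searched for on the opposite strand near the mapped 20 nt.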
Notes before you use it:
1. The raw reads have to come from RNA-seq libraries that contain ncRNAs.
2. The reverse transcription step in the library construction must come before adapter ligation.
We have analyzed hundreds of RNA-seq datasets and compiled lists of ncRNAs with chimeric reads. We have made a database for them, so that readers can make use of the analyzed data and determine the ends of their RNAs of interest. The ncRNAs with enough chimeric read coverage are mostly snRNAs, scaRNAs, snoRNAs, 7SK, 7SL, RNase P, RNase MRP, 5S rRNA, etc. The method does not work for miRNAs, piRNAs, lincRNAs, etc., because they do not generate chimeric reads during library preparation. The analyzed data files can be downloaded from the links below. A total of 9 groups of files are available, each from a different source (organism, tissue/cell type, etc.): fly_larva3_48nt, fly_ovary_RIP_35nt, fly_pharate_48nt, fly_pupa_48nt, fly_S2_45nt, human_HCT116_50nt, mouse_ES_40nt, mouse_ES_51nt, mouse_satellite_50nt
Each group contains 5 files: two are bedgraph track files, two are the bam file of chimeric reads and its index, and one is the ncRNA list with numbers of chimeric reads (shown below): prefix_1.bg.gz, prefix_2.bg.gz, prefix_chim_sorted.bam, prefix_chim_sorted.bam.bai and prefix_ncRNA.txt