/*-----------------------------------------------------------------------------

   QUASAR - q-gram Alignment based on Suffix ARrays

   Copyright (C) 1998 Stefan Burkhardt
   Author: Stefan Burkhardt <stburk@mpi-sb.mpg.de>
   This file is part of the QUASAR package.

   QUASAR is free software; you can redistribute it and/or
   modify it under the terms of the GNU Library General Public License as
   published by the Free Software Foundation; either version 2 of the
   License, or (at your option) any later version.

   QUASAR is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Library General Public License for more details.

   You should have received a copy of the GNU Library General Public
   License along with the QUASAR package; see the file copying.  If not,
   write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
   Boston, MA 02111-1307, USA.  or contact the author. 

  $File$
  $Revision: 1.8 $
  $Date: Wed, 29 Mar 2000 11:07:45 +0200 $

-----------------------------------------------------------------------------*/

This file contains information on installing and using QUASAR.

Installation:
=============
NOTE:	This version requires a fully installed ncbi-library. It is used
	for compiling the function call to ncbi-blast. The steps required
	for the installation are:

Edit the makefile:
	set the BUILD variable to the path you are building in
	set the NCBI variable to the path where you unpacked the ncbi-library
	(this subdirectory should contain the fully installed library).
type gmake to build quasar and the associated utilities
this will (hopefully) create the following binaries:

quasar		the database search program

f2q		a program that converts fasta files into the quasar sset files
		NOTE: Right now QUASAR only correctly handles files containing
		the LETTERS A, C, G and T. N and other ambiguous characters
		will be ignored (this means cut out of the sequence).

q_sa		a program that builds a suffix array for a certain .sset file
		NOTE: This is a VERY simple suffix array construction 
		implementation based on qsort and strcmp. It does NOT construct
		a REAL suffix array for .sset files, since the sequences in 
		such a file are separated by \0 and strcmp stops comparing
		at sequence ends. However, it still offers full functionality
		for QUASAR, since alignments never extend over sequence
		boundaries. q_sa is rather slow, probably not practical for
		files > 300 MB. Other, more sophisticated implementations are
		needed for this. One pointer I can give is LEDA-SM by
		Andreas Crauser.
		(http://www.mpi-sb.mpg.de/~crauser/leda-sm.html)
		Runtimes for q_sa on one processor of an E10000:
		 0.5 Mbps			   5 seconds
		   5 Mbps			  64 seconds
		  25 Mbps			 986 seconds
		  87 Mbps			8438 seconds
		 213 Mbps		       64249 seconds
		 650 Mbps		>5 days, unfinished....

q_pre		the program that precomputes the search results for a given q
	
Use:
====
Preparation:
------------
In order to be able to search for a sequence in a certain database, the 
database has to be prepared for use with QUASAR. This requires the following
steps:

1. convert the database into the quasar sset files
	the database has to be in fasta format. run the following command:

	f2q database_file_name

	this will create a set of files that QUASAR needs to run

2. build the suffix array for the current database
	in order to construct the suffix array for the given database, run:
	
	q_sa database_file_name bin_size

	this will create the suffix array. bin_size is a number between 0 
	and 12 that defines the length of the bin_size-grams used for the 
	bin_sort step. 1<<(2*bin_size) bins are created, then an in-place 
	bin sort is conducted and the full bins are then processed with 
	qsort. If one chooses bin_size=0, the old version that simply does 
	a qsort on the whole db is used.

3. precompute all searches for q-grams for a given q-gram length
	this step is required to precompute the results of all searches for
	all possible q-grams of a certain length. to do so for a given q run:

	q_pre database_file_name q
	
	where q is the chosen number. Be aware that it is not advisable to
	use q's greater than about 13 due to insane memory/time consumption.
	A choice of q=11 is what I recommend for doing low-sensitivity,
	high-speed searches.
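
	To see why large q gets expensive: over the alphabet A, C, G, T there
	are 4^q possible q-grams, and q_pre precomputes a result for each of
	them. The sketch below only illustrates that growth. The 2-bit
	encoding and the entry_bytes figure are assumptions made for the
	example; the actual layout of the q_pre output files is not described
	in this file.

```python
# Illustrative sketch only. CODE, encode_qgram, num_qgrams, table_bytes and
# the 8-byte entry size are hypothetical; they are not taken from QUASAR.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base


def encode_qgram(qgram):
    """Pack a q-gram over A, C, G, T into an integer, 2 bits per base."""
    value = 0
    for base in qgram:
        value = (value << 2) | CODE[base]
    return value


def num_qgrams(q):
    """Number of distinct q-grams over a 4-letter alphabet, i.e. 4**q."""
    return 1 << (2 * q)


def table_bytes(q, entry_bytes=8):
    """Size of a table holding one entry per q-gram (entry size assumed)."""
    return num_qgrams(q) * entry_bytes


if __name__ == "__main__":
    for q in (11, 12, 13, 15):
        mib = table_bytes(q) // (1 << 20)
        print("q=%d: %d q-grams, about %d MiB" % (q, num_qgrams(q), mib))
```

	Each step from q to q+1 multiplies the table size by 4, whatever the
	real size of an entry is.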

	NAME.fasta	
	    |
	    | f2q
	    |
	NAME.raw	NAME.sset	NAME.headers
	    |
	    | q_sa
	    |
	NAME.sa
	    |
	    | q_pre
	    |
	NAME.srq	NAME.saq
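
The idea behind q_sa (sort all suffixes with a comparison that stops at the
\0 separators, optionally after distributing them into bins by their first
bin_size characters) can be sketched as follows. This is a toy illustration
in Python, not the q_sa code: the function names are made up, the bins are
kept in a dictionary instead of 1<<(2*bin_size) in-place bins, and skipping
the separator positions themselves is an assumption.

```python
# Toy sketch of a strcmp-style suffix sort with an optional binning step.
# suffix_key and build_suffix_array are hypothetical names.

SEP = "\0"  # sequence separator, as in the .sset description


def suffix_key(text, pos):
    """Suffix starting at pos, cut at the next separator (what strcmp sees)."""
    end = text.find(SEP, pos)
    return text[pos:] if end == -1 else text[pos:end]


def build_suffix_array(text, bin_size=0):
    """Return the start positions of all suffixes in sorted order."""
    # Assumption: separator positions are not indexed.
    positions = [i for i, ch in enumerate(text) if ch != SEP]
    if bin_size == 0:
        # Plain sort over the whole database.
        return sorted(positions, key=lambda p: suffix_key(text, p))
    # Distribute suffixes into bins by their first bin_size characters.
    bins = {}
    for p in positions:
        bins.setdefault(suffix_key(text, p)[:bin_size], []).append(p)
    # Sort each bin on its own and concatenate the bins in prefix order.
    result = []
    for prefix in sorted(bins):
        result.extend(sorted(bins[prefix], key=lambda p: suffix_key(text, p)))
    return result


if __name__ == "__main__":
    db = "CA" + SEP + "AC" + SEP
    print(build_suffix_array(db))
    print(build_suffix_array(db, bin_size=1))
```

Both calls give the same order. The binning only splits one large sort into
many small ones, which is where the speedup would come from.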



Search:
-------
Assuming you have built a database as described above, you can search for 
one or several queries given in fasta format by doing the following:

1. convert the fasta file into the sset format
	
	f2q query_file

2. edit the options file to choose parameters
	
	if you want standard behaviour, don't touch the options file and build
	the database with q=11. In order to set the options in a way that makes
	sense, it's probably best to read the QUASAR paper from RECOMB 99. The
	format of the file is plain ASCII, each parameter is an integer followed
	by a \n
	
	The parameters are (with legal ranges):
	q	: chosen length of q-grams ( 1 - 31 )
	i	: length of q-grams used for index ( 0 ... 15 )
	w	: minimum window length for the filter ( > q )
	t	: min number of matching q-grams in a block ( 1 ... w - q + 1 )
	b	: block size (has to be a power of 2, > w )
	filter_mode : should be 0
	output_mode : 0 = write the filtered sequences out to a fasta file
		      1 = call formatdb and blastall on the filtered sequences
		      2 = use my code for formatting the db, call blastall on 
			  the filtered sequences
		      3 = call blast as a function (under development)
	rquery	    : 0 = search for each query, but not for the reverse 
			  complement
		      1 = search for each query and its reverse complement

3. run QUASAR
	call QUASAR by invoking it with:	

	quasar options_file query_file_name database_file_name
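
Before running a search it can help to check an options file against the
legal ranges listed in step 2. The following is a minimal sketch of such a
check, not part of QUASAR. It assumes that the values appear in the file in
the order in which the parameters are listed above, and all function names
are made up. The helper qgram_lemma_threshold uses the standard q-gram lemma
(a window of length w with at most k differences shares at least
w + 1 - (k + 1) * q q-grams) as one possible way to pick t; see the paper
for how the parameters are really meant to be chosen.

```python
# Hypothetical checker for a QUASAR options file (one integer per line).
# Assumption: the values appear in the order of the parameter list above.

NAMES = ["q", "i", "w", "t", "b", "filter_mode", "output_mode", "rquery"]


def parse_options(text):
    """Read one integer per non-empty line and attach the parameter names."""
    values = [int(line) for line in text.splitlines() if line.strip()]
    if len(values) != len(NAMES):
        raise ValueError("expected %d values, got %d" % (len(NAMES), len(values)))
    return dict(zip(NAMES, values))


def check_options(o):
    """Return a list of range violations (empty if everything is legal)."""
    problems = []
    if not 1 <= o["q"] <= 31:
        problems.append("q must be in 1..31")
    if not 0 <= o["i"] <= 15:
        problems.append("i must be in 0..15")
    if not o["w"] > o["q"]:
        problems.append("w must be greater than q")
    if not 1 <= o["t"] <= o["w"] - o["q"] + 1:
        problems.append("t must be in 1..w-q+1")
    if o["b"] <= o["w"] or o["b"] & (o["b"] - 1) != 0:
        problems.append("b must be a power of 2 greater than w")
    if o["filter_mode"] != 0:
        problems.append("filter_mode should be 0")
    if o["output_mode"] not in (0, 1, 2, 3):
        problems.append("output_mode must be 0, 1, 2 or 3")
    if o["rquery"] not in (0, 1):
        problems.append("rquery must be 0 or 1")
    return problems


def qgram_lemma_threshold(w, q, k):
    """Lower bound on shared q-grams for a window of length w with k errors."""
    return w + 1 - (k + 1) * q


if __name__ == "__main__":
    # Made-up example values, NOT the defaults shipped with QUASAR.
    example = "11\n11\n50\n18\n64\n0\n0\n1\n"
    print(check_options(parse_options(example)))
```

The example values are only there to exercise the checks; they are not the
standard settings.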
	

Output:
-------
Depending on which output_mode you choose, you will get different types of
search result output. 

Output mode 0 simply writes the result of the filtration to a fasta file.
However, most of these sequences will most likely not have a good alignment 
with the query. For the query number i, the filtered database is written
to 'quasar_hits_qi' (i.e. for query number 3 to quasar_hits_q3).

Output modes 1 and 2 should return exactly the same output. They write their
results to 'quasar_hits_qi' for query number i. Since this is a call of
blastall using system, the output is the standard BLAST output. It should
be noted that the E-Values are not correct, since they are calculated using
the size of the filtered database.
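
To get a rough idea of what an E-Value would look like against the full
database: in the usual BLAST statistics the E-Value is proportional to the
size of the search space, so to a first approximation it can be scaled by
the ratio of the two database sizes. The sketch below does only that. It is
not something QUASAR does, the function name is made up, and it ignores
BLAST's effective-length corrections, so treat the result as an estimate.

```python
# First-order rescaling of an E-Value from the filtered to the full database.
# rescale_evalue is a hypothetical helper, not part of QUASAR or BLAST.


def rescale_evalue(evalue, filtered_db_len, full_db_len):
    """Scale evalue by full_db_len / filtered_db_len (lengths in bases)."""
    if filtered_db_len <= 0:
        raise ValueError("filtered_db_len must be positive")
    return evalue * full_db_len / filtered_db_len


if __name__ == "__main__":
    # Made-up numbers: filtered database of 1000 bases, full one of 4000.
    print(rescale_evalue(0.5, 1000, 4000))
```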

Output mode 3 uses my own output function. This version calls blast as a
C function. It is the fastest version. Its output is (currently) to stdout.
For each hit to a certain query the program reports the following:

Fasta header (max. 60 char)
Query: AACGTA....AGTT-GGCA 1  - 435, 0   -   0, 436 - 440
Sbjct: AACGTA....AGTTCGGGA 14 - 448, 449 - 449, 450 - 453

Where gaps are denoted by a - and the numbers behind the sequences denote the
start/end of pieces from the original sequences. (Here 0 - 0 denotes a gap).

NOTE:
All of these implementations still pass the results of the filter step 
to blast via disk, therefore the running times are not THAT great. However, 
the version with the hacked direct interface to NCBI-BLAST has some problems
that lie in the NCBI-code and is therefore not distributed right now.


