Automation of Sequence Handling (Staden's pregap4 and gap4)

A while ago I wrote a script that tries to automate most of the work involved in getting sequences from our ABI sequencer into a gap4 database. The script is far from perfect and looks like this:

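    #!/bin/sh
    # Remove leftover files from the sequencing run (ignore missing ones).
    rm -f *.seq *.log "Log file.txt"
    # Truncate each trace file name to its first 27 characters plus ".ab1".
    for i in *.ab1; do
        echo "Renaming $i"
        mv "$i" "$(echo "$i" | sed 's/\(.\{27\}\).*/\1.ab1/')"
    done
    # List the traces and any plain-text reference sequences for pregap4.
    ls *.ab1 *.txt > tracefile.list
    pregap4 -nowin -config /home/jeltsch/bin/pregap4_gap4.conf -fofn tracefile.list
    # Open the resulting database for manual inspection.
    gap4 test.0.aux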

I put the script into ~/bin. It is meant to be invoked from within the directory that holds the sequences. Typically the ab1 trace files from a single sequencing run on our machine are stored in one directory, together with some log files that are meaningless to me.

First the script deletes unnecessary files from the directory. Then it truncates the file names to 27 characters plus the .ab1 ending. Oddly enough, pregap4 handles longer file names well, but gap4 has problems with them. All remaining files are then put into a list (tracefile.list) that is read by pregap4. pregap4 runs non-interactively and gets its instructions (the setup of the modules) from a configuration file, which in my case is also stored in ~/bin.
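The renaming step can be made a little more robust. The following is only a sketch, not what my script currently does: it strips the .ab1 ending before truncating, so that short names are left alone, and it refuses to overwrite a file if two traces end up with the same truncated name. The function names truncate_name and rename_traces are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: cut the part before ".ab1" down to at most
# 27 characters and put the ".ab1" ending back on.
truncate_name() {
    stem=${1%.ab1}
    short=$(printf '%s' "$stem" | cut -c1-27)
    printf '%s.ab1\n' "$short"
}

# Rename every trace file in the current directory whose name is too long.
rename_traces() {
    for f in *.ab1; do
        [ -e "$f" ] || continue            # no .ab1 files at all
        new=$(truncate_name "$f")
        [ "$f" = "$new" ] && continue      # name is already short enough
        if [ -e "$new" ]; then
            echo "Skipping $f: $new already exists" >&2
        else
            echo "Renaming $f -> $new"
            mv "$f" "$new"
        fi
    done
}
```

Calling rename_traces from within the directory that holds the traces would replace the for loop of the script above.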

pregap4 calls the gap4 shotgun assembler and puts the readings into a gap4 database. Unfortunately I haven't figured out how to assign a descriptive name to this database automatically. At the moment all databases get the same name and can only be told apart by the directory they are located in. It would be good to give each one a unique name automatically, e.g. the date when the assembly was done or something similar.
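One possible workaround, which I have not tested in practice, would be to rename the database files after pregap4 has finished. The sketch below assumes that the database consists of the two files test.0 and test.0.aux and that gap4 accepts a database whose files were simply renamed. Both are assumptions, and the function names stamped_name and rename_db are made up for this example.

```shell
#!/bin/sh
# Build a date-stamped database name from a prefix, e.g. "seq20240131".
# The naming scheme is only a suggestion.
stamped_name() {
    printf '%s%s\n' "$1" "$(date +%Y%m%d)"
}

# Rename the two files of a gap4 database (NAME.0 and NAME.0.aux).
# Assumption: gap4 opens a database whose files were renamed like this.
rename_db() {
    old=$1
    new=$2
    # Both source files must exist, and no target may be overwritten.
    [ -e "$old.0" ] && [ -e "$old.0.aux" ] || return 1
    [ ! -e "$new.0" ] && [ ! -e "$new.0.aux" ] || return 1
    mv "$old.0" "$new.0" && mv "$old.0.aux" "$new.0.aux"
}
```

In the script this would go between the pregap4 and the gap4 call, for example as new=$(stamped_name seq); rename_db test "$new" && gap4 "$new.0.aux". The name is kept short on purpose, given that gap4 already has trouble with long file names.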

After all this has been done, gap4 is called and loads the newly created database for manual inspection and editing.

Most of our sequencing is done to check newly made vectors. Therefore we usually know exactly what sequence we expect. If people documented their constructs with a program that includes full sequence information, such as the Gene Construction Kit (which is unfortunately not the case), the trace files could be compared automatically to this "expected" sequence.

At the moment I achieve this by exporting the DNA sequence from the Gene Construction Kit program (GCK) as a plain text file (apparently I could also use EMBL format) and putting this file into the same directory as the ab1 trace files before starting the script. It is thus handled just like any other sequence, and the sequence readings are aligned to it.

However, there are several things that I would like to be set automatically when gap4 is invoked, because I always perform the same sequence of clicks when manually inspecting the alignment in gap4:

Displaying all forward reading frames: I presume that because the "expected sequence" is the longest, it is always present in forward orientation in the assembly, so the reading frames of interest are always forward. This follows from the fact that, for ease of reading, I always keep the vector sequence in GCK in such an orientation that the CDS of the GOI is in a forward frame.

Highlighting disagreements by background color should be switched on.

Upper/lower case character differences should not be treated as disagreements.

Because the quality of our sequencing is modest, the default values are mostly too strict for gap4 to enter the readings into the same contig. Thus I usually end up finding internal repeats and entering them manually. It should be possible to set this automatically in the gap4 assembly, but I haven't got around to figuring out how.