I am trying to use EMBOSS for the assembly of vector sequences. A long time ago I used the CGC seqed program for this purpose, and at the moment I use the Gene Construction Kit. EMBOSS has no direct equivalent of seqed, so one has to combine several other tools to replace its functionality. Look at this comparison between CGC and EMBOSS. Unfortunately, only one of the programs that replace the functionality of seqed is capable of maintaining embl features: seqret. If you use any of the other programs (e.g. union), the features are lost. Features should be maintained (and just renumbered) when they are preserved as a whole during the operation. Additionally, when an operation cuts a feature, this should result in a new feature that records the event (e.g. feature "partial CMV promoter last 400bp of originally 800bp"; or "partial human VEGF165 CDS first 300bp of originally 500bp fused in-frame to partial human VEGF-C CDS last 900bp of 1200bp"). An example of an operation that maintains features is:
    seqret hTIE-1.ebl -sbegin1 200 -send1 400 -feature -osformat embl
This takes the stretch between nucleotides 200 and 400 from the hTIE-1.ebl file (which is in the current directory), writes it to a new file in embl format, and maintains the features. It renumbers the features, but doesn't report if a feature was cut in the middle. You will be prompted for the name of the output file. This functionality alone is not sufficient for the assembly of vector sequences, however. Union would do a better job: it takes a list of sequences with their respective first and last nucleotides and concatenates them:
    union @list.txt -osformat embl
The file list.txt has the following content:
    hTIE-1.ebl[1:200]
    hTIE-1.ebl[1:200]
    hTIE-1.ebl[1:200]
    hTIE-1.ebl[100:500]
    hTIE-1.ebl[200:600]
    hTIE-1.ebl[1:200]
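To make explicit what these entries mean, here is a minimal Python sketch (not EMBOSS code) that parses entries of the form file[start:end] and concatenates the corresponding stretches. It assumes the coordinates are 1-based and inclusive. The function names parse_list and assemble are hypothetical, and the sequence is a made-up 600 bp placeholder, not the real hTIE-1 entry.

```python
import re

# Entries look like "hTIE-1.ebl[100:500]": a file name plus a range.
# Assumption: the range is 1-based and inclusive at both ends.
ENTRY = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<start>\d+):(?P<end>\d+)\]$")


def parse_list(text):
    """Return one (name, start, end) tuple per whitespace-separated entry."""
    segments = []
    for token in text.split():
        match = ENTRY.match(token)
        if match is None:
            raise ValueError(f"unrecognised entry: {token!r}")
        segments.append((match["name"], int(match["start"]), int(match["end"])))
    return segments


def assemble(segments, sequences):
    """Concatenate the requested stretches; `sequences` maps name -> full sequence."""
    parts = []
    for name, start, end in segments:
        # Convert the 1-based inclusive range to a Python slice.
        parts.append(sequences[name][start - 1:end])
    return "".join(parts)


LIST_TXT = """\
hTIE-1.ebl[1:200]
hTIE-1.ebl[1:200]
hTIE-1.ebl[1:200]
hTIE-1.ebl[100:500]
hTIE-1.ebl[200:600]
hTIE-1.ebl[1:200]
"""

segments = parse_list(LIST_TXT)
# Placeholder sequence (assumption): 600 bp of repeated "acgt".
placeholder = {"hTIE-1.ebl": "acgt" * 150}
assembled = assemble(segments, placeholder)
print(len(assembled))  # 4 * 200 + 2 * 401 = 1602
```

Note that a range such as [100:500] contributes 401 nucleotides, not 400, because both end points are included.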
The union command above takes the sequence stretches defined in the list.txt file and assembles them into a new sequence. This should also work without a list file as input (giving the sequences as command-line arguments), but so far I have only got it working with one sequence as input:
    union hTIE-1.ebl -sbegin 5 -send 100
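The feature handling wished for earlier (whole features simply renumbered, cut features replaced by a "partial" feature that records the cut) could look roughly like the following Python sketch. This is not how EMBOSS works internally; the Feature class, the functions clip and union_with_features, and the example coordinates are all hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    key: str    # feature key, e.g. "promoter" or "CDS"
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    note: str


def clip(features, start, end, offset=0):
    """Keep features overlapping [start, end], renumbered for the new sequence.

    `offset` is the length of everything already placed before this stretch.
    A feature that is cut gets a note describing which part survived.
    """
    kept = []
    shift = offset - start + 1  # old position p becomes p + shift
    for f in features:
        if f.end < start or f.start > end:
            continue  # feature lies entirely outside the stretch
        new_start, new_end = max(f.start, start), min(f.end, end)
        note = f.note
        if (new_start, new_end) != (f.start, f.end):
            total = f.end - f.start + 1
            size = new_end - new_start + 1
            first = new_start - f.start + 1
            last = new_end - f.start + 1
            note = f"partial {f.note}, bp {first}-{last} ({size} of originally {total} bp)"
        kept.append(Feature(f.key, new_start + shift, new_end + shift, note))
    return kept


def union_with_features(segments):
    """segments: list of (sequence, features, start, end) tuples."""
    parts, features, offset = [], [], 0
    for seq, feats, start, end in segments:
        parts.append(seq[start - 1:end])
        features.extend(clip(feats, start, end, offset))
        offset += end - start + 1
    return "".join(parts), features


# Made-up 1000 bp source with two features (assumption, for illustration only).
source = "acgt" * 250
annotations = [
    Feature("promoter", 1, 800, "CMV promoter"),
    Feature("CDS", 850, 950, "VEGF CDS"),
]
new_seq, new_features = union_with_features([
    (source, annotations, 401, 900),  # cuts both features
    (source, annotations, 850, 950),  # contains the CDS as a whole
])
for feature in new_features:
    print(feature)
```

In this example the promoter and the first copy of the CDS come out as partial features with a note about the cut, while the second copy of the CDS is kept whole and only renumbered.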
One of the biggest drawbacks of the Gene Construction Kit (GCK) is that it is proprietary and few of my collaborators own a license. Thus I end up sending them the plasmid maps as pdf files, which is not satisfactory. It would be good if GCK could export the sequence data AND the annotation data in embl format, with the annotation (mainly ORFs, promoters, polyAs, primer annealing sites) converted into embl features. GCK can export the sequence as a plain text or CGC file; however, the CGC file has no advantage over the plain text file, as it never contains any annotation or features. GCK can also export the annotation and features ("Export -> Comments as text file") into a separate text file. It shouldn't be that difficult to write a perl script that takes the plain text/CGC file and the comments file as input and generates an embl file with the comments as features.
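As a rough idea of what such a converter would do, here is a minimal sketch, written in Python rather than perl. The layout of the GCK comments export is not described here, so the sketch assumes the annotation has already been brought into a simple tab-separated form (feature key, start, end, label); that input format, the function names parse_annotations and format_embl, and the header values on the ID line are assumptions and placeholders, not something GCK or EMBOSS defines.

```python
def parse_annotations(text):
    """Parse lines of the ASSUMED form: key<TAB>start<TAB>end<TAB>label."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, start, end, label = line.split("\t", 3)
        features.append((key, int(start), int(end), label))
    return features


def format_embl(name, sequence, features):
    """Build a bare-bones EMBL-style record (header fields are placeholders)."""
    seq = "".join(sequence.split()).lower()
    n = len(seq)
    lines = [
        f"ID   {name}; SV 1; circular; unassigned DNA; STD; SYN; {n} BP.",
        "XX",
        f"FH   {'Key':<16}Location/Qualifiers",
        "FH",
    ]
    for key, start, end, label in features:
        # Feature key starts in column 6, location in column 22.
        lines.append(f"FT   {key:<16}{start}..{end}")
        # Labels containing double quotes would need escaping (not handled here).
        lines.append(f'FT   {"":<16}/note="{label}"')
    lines.append("XX")
    counts = {base: seq.count(base) for base in "acgt"}
    other = n - sum(counts.values())
    lines.append(
        f"SQ   Sequence {n} BP; {counts['a']} A; {counts['c']} C; "
        f"{counts['g']} G; {counts['t']} T; {other} other;"
    )
    for i in range(0, n, 60):
        chunk = seq[i:i + 60]
        blocks = " ".join(chunk[j:j + 10] for j in range(0, len(chunk), 10))
        # 60 bases per line in blocks of 10, running position at the line end.
        lines.append(f"     {blocks:<65}{i + len(chunk):>10}")
    lines.append("//")
    return "\n".join(lines) + "\n"


# Made-up example input (assumption, for illustration only).
example_annotations = "promoter\t1\t40\tCMV promoter\nCDS\t41\t80\thuman VEGF165 CDS\n"
record = format_embl("pTEST", "acgt" * 20, parse_annotations(example_annotations))
print(record)
```

The resulting text could then be fed to seqret to check whether EMBOSS accepts it and keeps the features; a real converter would still have to map the GCK comment types onto valid feature keys.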