7.3. Sequence data in FASTA format

For compatibility with downstream analyses, sequence data should be stored in a single FASTA-formatted file. Sequence databases such as GenBank offer FASTA as an output format. An example of FASTA-formatted sequences retrieved from GenBank (abbreviated in length for the sake of space):

>gi|21747902|gb|AY114459.1| Apis mellifera mellifera isolate melli4 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial gene for mitochondrial product
CCCCGAATAAATAATGTTAGATTTTGATTACTTCCTCCCTCATTATTAATACTTTTATTAAGAAATTTATTTTACCCAAGACCAGGAACTGGATGAACAGTATATCC

>gi|14193071|gb|AF153104.1| Apis cerana haplotype 4 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial gene for mitochondrial product
TTTCTAATTGGAGGTTTTGGAAATTGATTAATTCCTTTAATATTAGGATCTCCAGATATAGCATTTCCTCGAATAAATAATATTAGATTCTGATTACTCCCTCCTTC

>gi|67626085|gb|DQ016070.1| Apis dorsata haplotype 7 cytochrome c oxidase subunit 1 (CO1) gene, partial cds; mitochondrial
TTTTTAATTGGAGGATTTGGAAATTGATTAATCCCTTTAATATTAGGGTCTCCAGATATAGCATTTCCTCGAATAAATAATATTAGATTTTGATTATTACCTCCTT

The sequence identifier (e.g. accession number) and title for each entry are preceded by a greater-than symbol “>” and end with a hard return. The line immediately below holds the sequence information and should contain no spaces. The end of the sequence is determined by a hard return. Abbreviate the titles of your sequence entries now, prior to alignment, using the all-important accession number or perhaps just the species name. Alignment programs limit the number of characters allowed in a sequence title to varying degrees, but the limit is typically 30 characters or fewer. Only letters, numbers, underscores “_”, and pipes “|” are typically allowed. The above sequence entries are prepared for alignment like this:

>AY114459_A_mellifera
CCCCGAATAAATAATGTTAGATTTTGATTACTTCCTCCCTCATTATTAATACTTTTATTAAGAAATTTATTTTACCCAAGACCAGGAACTGGATGAACAGTATATCC

>AF153104_A_cerana
TTTCTAATTGGAGGTTTTGGAAATTGATTAATTCCTTTAATATTAGGATCTCCAGATATAGCATTTCCTCGAATAAATAATATTAGATTCTGATTACTCCCTCCTTC

>DQ016070_A_dorsata
TTTTTAATTGGAGGATTTGGAAATTGATTAATCCCTTTAATATTAGGGTCTCCAGATATAGCATTTCCTCGAATAAATAATATTAGATTTTGATTATTACCTCCTT
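For more than a handful of entries, the titles can be abbreviated with a short script instead of by hand. The following is a minimal sketch in Python, not a definitive tool. It assumes every title follows the GenBank layout shown above (gi|number|gb|accession.version| Genus species ...), and the function names abbreviate_header and abbreviate_fasta are hypothetical, chosen only for illustration.

```python
def abbreviate_header(header: str) -> str:
    """Turn a GenBank-style FASTA title into >ACCESSION_G_species.

    Assumed layout: >gi|<number>|gb|<accession>.<version>| Genus species ...
    """
    fields = header.lstrip(">").split("|")
    if len(fields) < 5:
        raise ValueError(f"unexpected title layout: {header!r}")
    accession = fields[3].split(".")[0]   # drop the version suffix, e.g. ".1"
    words = fields[4].split()
    if len(words) < 2:
        raise ValueError(f"no genus and species found in: {header!r}")
    genus, species = words[0], words[1]
    return f">{accession}_{genus[0]}_{species}"


def abbreviate_fasta(text: str) -> str:
    """Abbreviate every title line; sequence and blank lines pass through unchanged."""
    lines = []
    for line in text.splitlines():
        lines.append(abbreviate_header(line) if line.startswith(">") else line)
    return "\n".join(lines) + "\n"
```

Applied to the first GenBank title above, abbreviate_header returns >AY114459_A_mellifera. Titles that do not match the assumed layout raise an error rather than being silently mangled, so they can be fixed by hand.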

Note that you may want to keep two copies of your sequence data files: one with all the original information pertaining to the sequences and a second with just the abbreviated titles prepared for alignment analysis.
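Before running an alignment, the copy with abbreviated titles can be checked against the title constraints described above. The sketch below is one possible check in Python. The 30-character limit and the allowed character set are taken from the typical values given in this section and are assumptions, since actual limits vary by alignment program. The name check_titles is hypothetical.

```python
import re

MAX_TITLE_LENGTH = 30                        # assumed typical limit; varies by program
ALLOWED_TITLE = re.compile(r"[A-Za-z0-9_|]+")  # letters, numbers, underscores, pipes


def check_titles(fasta_text: str) -> list:
    """Return (title, problem) pairs for titles an alignment program may reject."""
    problems = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue                          # skip sequence and blank lines
        title = line[1:].strip()
        if len(title) > MAX_TITLE_LENGTH:
            problems.append((title, "too long"))
        if not ALLOWED_TITLE.fullmatch(title):
            problems.append((title, "disallowed characters"))
    return problems
```

An empty list means every title passed both checks. An unabbreviated GenBank title would typically be reported twice: once for its length and once for the spaces and punctuation it contains.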