Genome Res. 1999 Oct;9(10):960-71.
doi: 10.1101/gr.9.10.960.

Generation and analysis of 25 Mb of genomic DNA from the pufferfish Fugu rubripes by sequence scanning

Free PMC article

G Elgar et al. Genome Res. 1999 Oct.

Abstract

We have generated and analyzed >50,000 shotgun clones from 1059 Fugu cosmid clones. All sequences have been minimally edited and searched against protein and DNA databases. These data are all displayed on a searchable, publicly available web site at. With an average of 50 reads per cosmid, this is virtually nonredundant sequence skimming, covering 30%-50% of each clone. This essentially random data set covers nearly 25 Mb (>6%) of the Fugu genome and forms the basis of a series of whole genome analyses which address questions regarding gene density and distribution in the Fugu genome and the similarity between Fugu and mammalian genes. The Fugu genome, with eight times less DNA but a similar gene repertoire, is ideally suited to this type of study because most cosmids contain more than one identifiable gene. General features of the genome are also discussed. We have made some estimation of the syntenic relationship between mammals and Fugu and looked at the efficacy of ORF prediction from short, unedited Fugu genomic sequences. Comparative DNA sequence analyses are an essential tool in the functional interpretation of complex vertebrate genomes. This project highlights the utility of using the Fugu genome in this kind of study.

Figures

Figure 1
Flow diagram of sequence generation and processing. (Left) How the sequence templates are generated; (right) the processing of the sequences and their subsequent presentation on the web with associated BLAST results.
Figure 2
001H15aC8 is a 619-bp sequence of very poor quality (10% Ns). However, it still shows a good match to the human PATCHED gene across two exons and one intron. Analysis of other clones from 001H15 confirms the presence of this gene on the cosmid.
Figure 3
Batch analysis of Fugu genomic sequences.
Figure 4
(A) A total of 52,668 sequences were used to calculate the base frequencies. Ambiguous bases (Ns) were removed from the analysis on the assumption that they represent a roughly equal proportion of each of the four bases. (B) G+C content of each cosmid was calculated from the sequences derived from that cosmid. The G+C content of all individual sequences is also shown and has a much wider distribution. (C) Dinucleotide frequencies are expressed as Observed (O)/Expected (E) - 1. Negative values therefore correspond to dinucleotides that are suppressed, and positive values to those that are present at a frequency above that expected. Because the sequences have not been edited and have been allowed to run to 650 bp, there was concern that the quality of the data toward the end of sequences was poor. Therefore, the analysis was repeated after clipping all the sequences to 300 bp. The results are very similar to those from the whole data set, the only deviation being that the frequencies of the dinucleotides ApA, CpC, GpG, and TpT are slightly lower in the clipped data. This is due to broad peaks at the end of sequencing runs being miscalled as doublets of the same base.
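The O/E - 1 statistic in panel C can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the expected frequency of a dinucleotide XpY is the product of the single-base frequencies of X and Y, and it skips any base or pair containing an ambiguous character. The function names and the `clip` helper are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def base_frequencies(seqs):
    """Relative frequency of A, C, G and T, ignoring ambiguous bases such as N."""
    counts = Counter()
    for s in seqs:
        counts.update(b for b in s.upper() if b in BASES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def dinucleotide_bias(seqs):
    """Return O/E - 1 for each of the 16 dinucleotides.

    ASSUMPTION: expected frequency is f(X) * f(Y), i.e. independent bases.
    Pairs containing an ambiguous base are skipped. All four bases must
    occur at least once, otherwise the expected frequency would be zero.
    """
    freqs = base_frequencies(seqs)
    pair_counts = Counter()
    for s in seqs:
        s = s.upper()
        for x, y in zip(s, s[1:]):
            if x in BASES and y in BASES:
                pair_counts[x + y] += 1
    total_pairs = sum(pair_counts.values())
    bias = {}
    for x, y in product(BASES, repeat=2):
        observed = pair_counts[x + y] / total_pairs
        expected = freqs[x] * freqs[y]
        bias[x + y] = observed / expected - 1
    return bias


def clip(seqs, length=300):
    """Truncate every read to its first `length` bases (hypothetical helper)."""
    return [s[:length] for s in seqs]
```

Comparing `dinucleotide_bias(reads)` with `dinucleotide_bias(clip(reads))` mirrors the check described in the caption: if poor base calls at read ends inflate same-base doublets, the values for AA, CC, GG, and TT should drop slightly after clipping.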
Figure 5
(A) Microsatellite abundance in the Fugu genome by relative frequency (green) and relative abundance (blue). (B) Table of other repeat families.
Figure 6
Schematic representation of the scanning procedure. Cosmid clone DNA, including vector sequence (yellow), is sonicated, end repaired, and subcloned into EcoRV-cut pBluescript. Recombinant inserts are PCR amplified and sequenced from one end, generating ∼500 bp of sequence. These sequences are randomly distributed across the cosmid clone (small black bars). Low-quality, vector, and E. coli sequences are removed. One 64-lane ABI377 sequencing gel generates ∼50 good insert sequences, providing 30%–50% coverage. An average Fugu cosmid clone will contain five to seven genes (represented as a–e); some are identifiable by BLAST homology at the protein level (fewer at the nucleotide level). The vertical colored bars represent the exons in the five genes, and those with black dots above them are covered by the sequence scanning of this cosmid. Only the smallest genes are liable to be missed (gene b in this case). However, gene d has not yet been identified in other species and so will not be recognized by BLAST searches (although gene prediction programs may detect it). Some identified genes will show only low similarity with homologs across regions of the gene (e.g., the middle of gene a) and so may give only low BLAST scores.
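The 30%–50% coverage figure can be sanity-checked with a short Python sketch. It uses the idealised Lander-Waterman model, in which reads land uniformly at random and the expected covered fraction is 1 - exp(-N·L/G). The read count (∼50) and read length (∼500 bp) come from the caption; the 40 kb cosmid insert size is an assumption made for illustration and is not stated in the text. The function name is hypothetical.

```python
import math


def expected_coverage(n_reads, read_len, target_len):
    """Expected fraction of the target hit by at least one read.

    Idealised Lander-Waterman model: reads are placed uniformly at random,
    so the covered fraction is 1 - exp(-redundancy).
    """
    redundancy = n_reads * read_len / target_len
    return 1.0 - math.exp(-redundancy)


READS_PER_COSMID = 50      # from the caption (approximate)
READ_LEN_BP = 500          # from the caption (approximate)
COSMID_INSERT_BP = 40_000  # ASSUMPTION: typical cosmid insert, not given in the text

coverage = expected_coverage(READS_PER_COSMID, READ_LEN_BP, COSMID_INSERT_BP)
print(f"Expected coverage per cosmid: {coverage:.0%}")
```

Under these assumed inputs the model gives a value inside the quoted 30%–50% range. Shorter usable reads or fewer good reads per gel would push the estimate toward the lower end.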
Figure 7
BLAST output for 168A15bA3. This clone shows similarity to a predicted C. elegans (T08G11.1) gene, two yeast genes, and an E. coli hypothetical gene. There are at least three other clones from cosmid 168A15 that also hit regions of this gene.
Figure 8
156P04cD5 is a 609 bp sequence that spans four exons and three introns of a fatty acid transport protein. All three of the introns are <100 bp in length.
Figure 9
Percentage coding sequence in the Fugu genome as calculated from GeneMark ORF output. The GeneMark program was run on all 52,668 sequences, and the output for 964 of these, from 16 cosmids, was used to calculate efficiency and accuracy. A GeneMark prediction is defined as correct when a listed region of interest, and at least one exon prediction, is in the correct region and the reading frame of the sequence matches the confirmed BLAST hit.
