Identification of Paralog Groups

Paralog groups 1, 2, 3, 8, 9, 10, 11, 12, 13, as well as the members of the combined middle group M (paralog groups 4, 5, 6, and 7) can be readily recognized by their peptide sequences in comparison with other gnathostome sequences.

 

Table 1. Bootstrap support for monophyly of paralog groups from the peptides with n=27.
PG chordata vertebrate gnathostomes
Trees NJ NJ NJ
1 39 71 81
2 48 63 53
3 52 63 65
4 - - -
5+6+7 - - -
8 75* 73 91
9 69 77 92
10 - 24 54
11 97* 99 99
12 91* 90 92
13 99* 96 47
* excluding the Amphioxus sequence.

 

Identification of Middle Group (PG4-PG7) Genes

For comparison we use the teleost homeobox sequences from Prohaska & Stadler (2004) and homeobox sequences from human, shark (Kim et al. 2000), Latimeria (Koh et al. 2003), and bichir (our PCR survey). We use quartet mapping (QM) to identify the middle group PCR fragments as follows. First we determine QM support for paralog groups PG4, PG5, and the combination of PG6 and PG7. For those sequences that are not identified as PG4 homeoboxes, we rerun the analysis, this time computing support for PG5, PG6, and PG7. These two blocks are marked 3 and 3a in the table below.
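The two-stage assignment can be sketched as follows. This is a minimal illustration, not the pipeline used for the survey: the function name `classify_middle_group` and the `rerun` callback are hypothetical, and the QM support values are taken as given rather than computed. The example values are the entries for Hi_4-1 and Hi_6.7-1 in Table 2.

```python
def classify_middle_group(first_pass, rerun):
    """Two-stage assignment of a middle-group fragment (illustrative sketch).

    first_pass: QM support for "PG4", "PG5" and "PG6/7" (block 3).
    rerun: zero-argument callable returning QM support for "PG5", "PG6"
           and "PG7" (block 3a); it is only called when PG4 does not win.
    """
    best = max(first_pass, key=first_pass.get)
    if best == "PG4":
        return "PG4"
    second_pass = rerun()
    return max(second_pass, key=second_pass.get)


# Hi_4-1: PG4 wins the first pass, so the second pass is never run.
hi_4_1 = classify_middle_group(
    {"PG4": 0.4239, "PG5": 0.2726, "PG6/7": 0.3034},
    rerun=lambda: {},
)

# Hi_6.7-1: PG4 does not win, so PG5, PG6 and PG7 are compared directly.
hi_67_1 = classify_middle_group(
    {"PG4": 0.3695, "PG5": 0.2447, "PG6/7": 0.3858},
    rerun=lambda: {"PG5": 0.2381, "PG6": 0.5119, "PG7": 0.2500},
)
```

The sketch only picks the largest value; it does not check how far the best group is ahead of the runner-up.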
In a second experiment we then consider trees of the form
( ({x},R),(U,(V,W)) ) and ( ({x},(R,U)),(V,W) )
where {x} denotes the query sequence from Hiodon and R, U, V, and W are sets of homeobox sequences. There are 6 inequivalent quartets, depending on which pair of paralog groups is lumped together:
{x}, R, U, (V,W)  
{x}, R, V, (U,W)  
{x}, R, W, (U,V)  
{x}, V, W, (R,U)  
{x}, U, W, (R,V)  
{x}, U, V, (R,W)
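The six quartets listed above correspond to the six ways of choosing two of the four groups to lump together. A short sketch of that enumeration, where the helper name `lumped_quartets` is hypothetical and the groups are represented only by their labels:

```python
from itertools import combinations


def lumped_quartets(groups, query="x"):
    """Enumerate quartets (query, single, single, lumped pair).

    Each choice of two groups to lump together yields one quartet;
    the two remaining groups stay separate.
    """
    quartets = []
    for pair in combinations(groups, 2):
        singles = [g for g in groups if g not in pair]
        quartets.append((query, singles[0], singles[1], pair))
    return quartets


quartets = lumped_quartets(["R", "U", "V", "W"])
for q in quartets:
    print(q)
```

With four groups there are C(4,2) = 6 pairs, which matches the six quartets in the list.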

For each of these six quartets we compute the quartet mapping support and consider the quartet with the largest support value. This places either a single paralog group, say R, on the same side of the split as {x}, or a pair of paralog groups, say (R,U). In the table below we report, for every single paralog group that obtains maximum support in at least one of the quartets, its maximum support, the number of quartets in which this paralog group alone is maximally supported, and the number of quartets in which a pair containing this paralog group is maximally supported.
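The bookkeeping behind the bracketed [n|m] counts can be sketched as below. This is an illustration under assumptions, not the original implementation: the function name `tally` is hypothetical, each quartet is represented as a mapping from the possible partner of {x} (a single group label, or a tuple for a lumped pair) to its QM support, and the numbers are made-up toy values, not data from the survey.

```python
from collections import defaultdict


def tally(quartet_supports):
    """Summarise which partner of the query wins in each quartet.

    Returns three dicts keyed by group label: the best support seen when
    the group wins alone, the number of quartets it wins alone, and the
    number of quartets won by a lumped pair containing it.
    """
    best_single = defaultdict(float)
    n_single = defaultdict(int)
    n_pair = defaultdict(int)
    for supports in quartet_supports:
        partner, value = max(supports.items(), key=lambda kv: kv[1])
        if isinstance(partner, tuple):      # a lumped pair wins
            for group in partner:
                n_pair[group] += 1
        else:                               # a single group wins
            n_single[partner] += 1
            best_single[partner] = max(best_single[partner], value)
    return best_single, n_single, n_pair


# Toy support values for the six quartets (illustrative only).
toy = [
    {"R": 0.50, "U": 0.30, ("V", "W"): 0.20},
    {"R": 0.45, "V": 0.30, ("U", "W"): 0.25},
    {"R": 0.40, "W": 0.35, ("U", "V"): 0.25},
    {"V": 0.30, "W": 0.30, ("R", "U"): 0.40},
    {"U": 0.30, "W": 0.30, ("R", "V"): 0.40},
    {"U": 0.40, "V": 0.35, ("R", "W"): 0.25},
]
best_single, n_single, n_pair = tally(toy)
for g in "RUVW":
    print(g, best_single.get(g), f"[{n_single[g]}|{n_pair[g]}]")
```

In this toy case R would be reported as 0.5 [3|2]: it wins alone in three quartets and as part of a pair in two.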
If the QM support for the best supported paralog group is at least 0.05 larger than that for the next best supported paralog group, we color the field. Otherwise we use a lighter color and a gray background to mark the alternative.
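The 0.05 margin rule amounts to comparing the two largest support values. A minimal sketch, with the hypothetical helper name `is_clear_assignment`; the example values are the block 3 entries of Hi_4-1 and Hi_6.7-1 from Table 2:

```python
def is_clear_assignment(supports, margin=0.05):
    """True if the best support exceeds the runner-up by at least `margin`."""
    ranked = sorted(supports.values(), reverse=True)
    return ranked[0] - ranked[1] >= margin


# Hi_4-1, block 3: Hox4 is ahead of the runner-up by more than 0.05.
clear = is_clear_assignment({"Hox4": 0.4239, "Hox5": 0.2726, "Hox6/7": 0.3034})

# Hi_6.7-1, block 3: Hox6/7 and Hox4 are within 0.05 of each other.
ambiguous = is_clear_assignment({"Hox4": 0.3695, "Hox5": 0.2447, "Hox6/7": 0.3858})
```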
For comparison we also analyse neighbor-joining and maximum parsimony trees computed using Felsenstein's PHYLIP package. In the table we list the maximum bootstrap value that groups the query sequence with one known homeobox sequence or with a group of them.

 

Table 2. Quartet Mapping identification of middle group genes.
3 3a 4 Trees
Sequence Hox4 Hox5 Hox6/7 Hox5 Hox6 Hox7 Hox4 Hox5 Hox6 Hox7 NJ Parsi
Hi_4-1 0.4239 0.2726 0.3034 0.4629 [3|2] [0|0] 0.3815 [1|1] [0|1] 4 (0.56) 4 (0.36)
Hi_4-2 0.3857 0.2969 0.3173 0.4170 [3|2] [0|0] 0.3704 [1|1] [0|1] 4 (0.56) 4 (0.36)
Hi_4-3 0.4023 0.2507 0.3470 0.4339 [3|1] [0|0] 0.4077 [2|1] [0|0] 7 (0.10) 6 (0.06)
Hi_4-5 0.4476 0.2613 0.2911 0.4945 [3|2] [0|0] 0.3957 [1|1] [0|1] 4 (0.73) 4 (0.45) matches HoxC4 full length sequence
Hi_5-1 0.3176 0.3860 0.2964 0.3658 0.4012 0.2331 0.3732 [1|2] 0.3871 [2|1] 0.3812 [1|1] [0|0] 5 (0.54) 5 (0.58)
Hi_5-2 0.3257 0.3891 0.2852 0.3719 0.3872 0.2408 0.3774 [1|1] 0.3923 [2|1] 0.3660 [1|1] [0|1] 5 (0.54) 5 (0.58)
Hi_5-3 0.2565 0.4625 0.2810 0.3860 0.2599 0.3541 [0|1] 0.4706 [3|1] [0|0] 0.3937 [2|1] 5 (0.62) 5 (0.38)
Hi_6.7-1 0.3695 0.2447 0.3858 0.2381 0.5119 0.2500 0.4220 [1|1] [0|0] 0.4677 [3|2] [0|1] 6 (0.14) 6 (0.01)
Hi_6.7-2 0.2911 0.3076 0.4013 0.2676 0.5146 0.2179 [0|1] [0|1] 0.4883 [3|3] [0|1] 6 (0.10) 6 (0.03)
Hi_6.7-3 0.3230 0.2739 0.4031 0.2203 0.4766 0.3031 0.3544 [1|1] [0|0] 0.4393 [3|1] [0|1] 6 (0.22) 6 (0.03)
Hi_6.7-5 0.2965 0.3195 0.3840 0.2929 0.5149 0.1923 [0|1] [0|1] 0.4899 [3|2] [0|1] 6 (0.10) 6 (0.03)

The columns for the direct analysis with four paralog groups, marked 4 in the table, are read in the following way. Consider the entries in the first line:
...  |  (Hox4)  0.4629 [3|2]  |  (Hox5)  [0|0]  |  (Hox6)  0.3815 [1|1]  |  (Hox7)  [0|1]  | 
Interpretation: Three quartets place the query sequence {x}="Hi_4-1" next to Hox4, and in addition there are two quartets in which {x} is placed next to the union of Hox4 with another paralog group. Hox5 never received maximal support in any quartet, neither alone nor in combination with another paralog group. In one quartet there was weak support for ({x},Hox6), and in one case the combination of Hox6 and another paralog group was best supported. Hox7 was never best supported on its own, but in one quartet its combination with another paralog group was.
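For readers who want to process the table programmatically, a cell in the block marked 4 can be parsed as sketched below. The helper name `parse_cell` and the regular expression are assumptions about the cell layout as it appears in this table (an optional support value followed by the bracketed counts).

```python
import re

# Optional decimal support value, then "[single|pair]" counts.
CELL = re.compile(r"^(?:(\d*\.\d+)\s+)?\[(\d+)\|(\d+)\]$")


def parse_cell(cell):
    """Parse e.g. '0.4629 [3|2]' into (support, n_single, n_pair)."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    support = float(match.group(1)) if match.group(1) else None
    return support, int(match.group(2)), int(match.group(3))


hox4 = parse_cell("0.4629 [3|2]")   # Hox4 entry of Hi_4-1
hox7 = parse_cell("[0|1]")          # Hox7 entry of Hi_4-1, no support value
```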

 

Identification of Gnathostome Cluster Types

The next step is to determine, within each paralog group, whether the query sequence is orthologous to one of the four gnathostome clusters. Again, we use Quartet Mapping as outlined above.
3 4 Trees
PG Sequence A B C D A B C D NJ pars
1 Hi_1-1 0.3387 * * 0.3505 0.3159 0.2601 0.3455 0.3894 [0|0] [0|0] 0.4311 [2|1] 0.4757 [3|1] ?? ??
Hi_1-2 0.3702 0.4003 0.2295 * 0.3890 [1|1] 0.4606 [3|2] [0|1] [0|0] A1(0.05) ??
2 Hi_2-1 0.2601 0.3604 * 0.3794 B2(0.25) B2(0.31)
Hi_2-2 0.3620 0.2844 * 0.3537 A2(0.13) A2(0.07)
Hi_2-3 0.3410 0.3132 * 0.3458 B2(0.22) ??
3 Hi_3-1 0.3965 [3|2] [0|1]   0.3622 [1|1] B3(0.51) B3(0.50)
Hi_3-1C   [0|1] 0.3853 [1|1] 0.5387 [3|2] D3(0.17) D3(0.19)
Hi_3-2 [0|1] [0|1] [0|1] 0.4350 [3|3] A3(0.17) ??
4 Hi_4-1 [0|1] 0.5620 [3|2] 0.3670 [1|1]   B4(0.52) B4(0.36)
Hi_4-2   0.4953 [3|1] 0.4275 [2|1]   B4(0.52) B4(0.36)
Hi_4-3   0.4126 [2|1] 0.4167 [3|1]   B4(0.17) B4(0.07)
Hi_4-5 [0|1] [0|1] 0.6360 [3|3] [0|1] C4(0.70) C4(0.48)
5 Hi_5-1 0.1851 0.1904 0.6245 *     0.6147 [3|1] 0.5450 [2|1] + C5(0.71) C5(0.64)
Hi_5-2 0.1680 0.1974 0.6346 *     0.6239 [2|1] 0.5371 [2|1]+ C5(0.62) C5(0.62)
Hi_5-3 0.4076 0.3291 0.2633 * 0.4153 [3|3] [0|1] [0|1] [0|1] A5(0.60) A5(0.36)
6 Hi_6.7-1 0.4036 0.3782 0.2183 * B6(0.20) B6(0.14)
Hi_6.7-2 0.3311 0.2495 0.4194 * B6(0.14) ??
Hi_6.7-3 0.2780 0.3512 0.3708 * B6(0.24) B6(0.17)
Hi_6.7-5 0.3366 0.2542 0.4093 * B6(0.14) ??
8 Hi_8-1 * 0.3427 0.3118 0.3455 C? ??
9 Hi_9-1 0.4269 [3|3] [0|1] [0|1] [0|1] A9(0.76) A9(0.45)
Hi_9-2 [0|1] [0|1] 0.4888 [3|3] [0|1] C9(0.29) C9(0.15)
Hi_9-3 [0|1] 0.4948 [3|3] [0|1] [0|1] B9(0.35) B9(0.23)
Hi_9-4   [0|1] 0.4466 [3|2] 0.3545 [1|1] C9(0.29) C9(0.15)
10 Hi_10-1 0.5667 [3|3] [0|1] [0|1] [0|1] A10(0.39) A10(0.17)
Hi_10-3 0.4133 [2|2] [0|1] 0.4021 [1|2] [0|1] ?? ??
Hi_10-4   [0|1] 0.4473 [3|2] 0.3878 [1|1] C10(0.17) C10(0.02)
Hi_10-5     0.4332 [2|1] 0.4596 [3|1] D10(0.69) D10(0.60)
Hi_10-6   [0|1] 0.4277 [1|1] 0.5363 [3|2] D10(0.56) D10(0.27)
11 Hi_11-1 0.4736 * 0.2526 0.2738 A11(0.24) A11(0.19)
Hi_11-2 0.4798 * 0.2609 0.2593 A11(0.24) A11(0.19)
Hi_11-3 0.2613 * 0.5153 0.2234 C11(0.50) C11(0.28)
Hi_11-4 0.3216 * 0.3444 0.3340 D11(0.64) D11(0.50)
Hi_11-5 0.2913 * 0.2648 0.4439 D11(0.64) D11(0.50)
12 Hi_12-1 ** 0.4635(t) 0.2489 C12(0.94) C12(0.88)
Hi_12-2 ** 0.4648(t) 0.1933 C12(0.94) C12(0.88)
13 Hi_13-1 0.5621 [3|3] [0|1] [0|1] [0|1] A13(0.24) A13(0.23)
Hi_13-2 0.5759 [3|1]   0.4059 [2|1]   A13(0.23)
Hi_13-3 [0|1] [0|1] 0.5826 [3|3] [0|1] C13(0.40) C13(0.29)
Hi_13-4 [0|1] [0|1] 0.5729 [3|3] [0|1] C13(0.33) C13(0.28)
Hi_13-5 0.3471 [1|1] 0.4725 [3|1] 0.3980 [1|1]   B13(0.40) B13(0.25)

For PG12 we compare with teleost C12, non-teleost C12, and D12, since A12 and B12 sequences are unknown.
+ ... only a single known sequence in this PG.

Quartet Mapping Tests for Recent Paralogs

candidates out1 out2 together (c1,o1)(c2,o2) (c1,o2)(c2,o1)
Hi_4-1 Hi_4-2 B4-und B4a 0.6943 0.1866 0.1191
Hi_5-1 Hi_5-2 C5-und C5a 0.6840 0.1618 0.1542
Hi_6.7-5 Hi_6.7-2 B6a B6b 0.9539 0.0096 0.0365
Hi_6.7-5 Hi_6.7-2 C6a C6b 0.8857 0.1143 0.0000
Hi_9-2 Hi_9-4 C9-und C9a 0.6153 0.1852 0.1995
Hi_10-1 Hi_10-3 A10a A10b 0.2045 0.3753 0.4202
Hi_10-5 Hi_10-6 D10-und D10a 0.3651 0.1467 0.4882
Hi_11-1 Hi_11-2 A11a A11b 0.5400 0.1709 0.2891
Hi_11-3 Hi_11-4 C11a C11b 0.1958 0.3792 0.4250
Hi_12-1 Hi_12-2 C12a C12b 0.5312 0.0938 0.3750
Hi_13-1 Hi_13-2 A13a A13b 0.4306 0.4465 0.1228
Hi_13-3 Hi_13-4 C13-und C13a 0.3385 0.3677 0.2937
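Each row of this table gives the QM support for the three possible topologies of a quartet made of two candidate sequences (c1, c2) and two outgroup sequences (o1, o2). The sketch below picks the best supported topology for a row. The helper name `best_topology` is hypothetical, and reading "together" as the topology that groups the two candidates with each other is an assumption based on the column headings.

```python
TOPOLOGIES = ("together", "(c1,o1)(c2,o2)", "(c1,o2)(c2,o1)")


def best_topology(supports):
    """Return (label, support) for the best of the three quartet topologies."""
    return max(zip(TOPOLOGIES, supports), key=lambda item: item[1])


# Hi_6.7-5 / Hi_6.7-2 with outgroups B6a and B6b.
label_6, value_6 = best_topology((0.9539, 0.0096, 0.0365))

# Hi_10-5 / Hi_10-6 with outgroups D10-und and D10a.
label_10, value_10 = best_topology((0.3651, 0.1467, 0.4882))
```

Under that reading, the first pair is strongly supported as grouping together, whereas for the second pair the best supported topology separates the two candidates.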

Quartet Mapping of Teleost-Specific Duplication

We compare each query sequence in a QM computation with the sequences from unduplicated gnathostomes and the two teleost-specific paralog groups. In those cases where only one paralog survived after the teleost duplication, we use the sequences from the other three clusters (and their teleost duplicates) as "outgroup".
Orthology with teleost clusters
Sequence Type undup Tel-a Tel-b outgroup # NJ PA
Hi_1-1 D1         0
Hi_1-2 B1 0.4064 0.4107 0.1829 B1a (0.23) ? [LmB]
Hi_2-1 B2 0.2887 0.4541 0.2572 ? -
Hi_2-2 A2 0.3870 0.2342 0.3788 ?
Hi_2-3 A2 0.2666 0.4072 0.3262 ?
Hi_3-1 A3 0.3042 0.3803 0.3156 -
Hi_3-1C D3 0.3860 0.3426 0.2715 -
Hi_3-2 D3 0.3133 0.2987 0.3879 ?-
Hi_3-2 A3 0.3176 0.3062 0.3762 ?-
Hi_4-1 B4 0.3700 0.2954 0.3345 - B4 (0.47) B4 (0.32)
Hi_4-2 B4 0.3136 0.2810 0.4053 - B4 (0.47) B4 (0.32)
Hi_4-3 B4 0.2779 0.1958 0.5264
C4 0.1967 0.3436 0.4597 - C4 [HsC4] C4 [HsC4]
Hi_4-5 C4 0.2594 0.5094 0.2312 - C4a (0.19) C4a (0.10)
Hi_5-1 C5 0.4018 0.2675 0.3308 -
Hi_5-2 C5 0.3677 0.2670 0.3653 -
Hi_5-3 A5 - - - -
Hi_5-3 (B5) 0.5119 0.2249 0.2632 -
Hi_6.7-1 A6 - - - -
Hi_6.7-2 C6 0.2091 0.3080 0.4829 ? ?
Hi_6.7-5 C6 0.2045 0.2633 0.5322 ? ?
Hi_6.7-3 B6? 0.2776 0.2639 0.4585 ? B6 B6
Hi_6.7-3 C6 0.2535 0.2900 0.4565 ? C6a(0.26) ?
Hi_8-1 ? ?
Majority 6 5 4 4
Hi_9-1 A9 0.3497 0.3044 0.3459 A9b [Dr](0.59) A9b [Dr](0.38)
Hi_9-2 C9 0.3142 0.4290 0.2567 - C9? C9?
Hi_9-3 B9 0.4003 0.4026 0.1971 - B9? B9?
Hi_9-4 C9 0.3010 0.4067 0.2923 - C9? C9?
Hi_10-1 A10 0.4733 0.2736 0.2531 A10(0.37) A10(0.06)
Hi_10-3 A10 0.3041 0.3384 0.3576 ?
Hi_10-4 C10 0.3617 0.3412 0.2971 - ? ?
Hi_10-5 D10 0.3769 0.3338 0.2892 - D10a (0.85) [Dr] D10a (0.76) [Dr]
Hi_10-6 D10 0.5836 0.2517 0.1648 - ? ?
Hi_11-1 A11 0.2242 0.3139 0.4619 A11b(0.16) A11b(0.17)
Hi_11-2 A11 0.2315 0.3501 0.4183 A11b(0.16) A11b(0.17)
Hi_11-3 C11 0.4496 0.1885 0.3620 C11a(0.38) ?
Hi_11-4 C11 0.6263 0.1218 0.2519 ? C11 (0.83) C11 (0.75)
D11 0.1679 0.4748 0.3573 ?- D11a (0.68) D11a (0.80)
Hi_11-5 D11 0.1761 0.6085 0.2154 - D11a (0.68) D11a (0.80)
Hi_12-1 C12 0.2370 0.3619 0.4012 ? [Dr12a/b] ? [Dr12a/b]
Hi_12-2 C12 0.3179 0.4166 0.2656 ? [Dr12a/b] ? [Dr12a/b]
Hi_13-1 A13 0.3434 0.4487 0.2080 A13a (0.29) A13a (0.26) matches HoxA13-1 sequence
Hi_13-2 A13 0.3332 0.3231 0.3437 A13b (0.34) ? matches HoxA13-2 sequence
Hi_13-3 C13 0.3540 0.3163 0.3298 - ? ?
Hi_13-4 C13 0.3234 0.3627 0.3139 - ? ?
Hi_13-5 B13 0.3519 0.2901 0.3580 - ? ?
Majority 7.5 7.5 5 1
Total 13.5 12.5 10.0 5.0

Explanation of the symbols in the column #
? ... uncertain paralog group
- ... only one paralog group known in teleosts
0 ... no teleost sequences known

Quartet mapping of the homeoboxes remains inconclusive. The number of sequences that are preferentially classified with one of the two teleost duplicates rather than with the unduplicated clusters is only slightly larger, and the difference is not statistically significant (13.5+5.0=18.5 unduplicated or outgroup versus 12.5+10.0=22.5 for the teleost a and b paralogs together).
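The arithmetic behind this comparison can be checked with a few lines. The source does not state which significance test was applied. The normal approximation to a two-sided sign test used below is therefore an assumption, chosen only because it tolerates the fractional counts that arise from split votes.

```python
import math

# Totals from the last row of the table.
undup_or_outgroup = 13.5 + 5.0      # unduplicated + outgroup
teleost_a_or_b = 12.5 + 10.0        # teleost a + teleost b
total = undup_or_outgroup + teleost_a_or_b

# Assumed test: sign test with p = 0.5, normal approximation.
z = (teleost_a_or_b - total / 2) / math.sqrt(total / 4)
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability

print(undup_or_outgroup, teleost_a_or_b, round(z, 2), round(p_value, 2))
```

Under this assumed test the excess of 4 out of 41 classifications falls well short of the conventional 0.05 level, which agrees with the statement above.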