Abstract

This passage introduces the basic principles of several major clustering methods and compares the advantages and disadvantages of each. It also illustrates how each method is realized in R using several R packages. With these clustering methods, the passage then demonstrates an analysis of genetic distances among Chinese populations based on E11.

I. Introduction

1. About E11

ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.
Specifically, ADMIXTURE uses a block relaxation approach to alternately update allele frequency and ancestry fraction parameters. Each block update is handled by solving a large number of independent convex optimization problems, which are tackled using a fast sequential quadratic programming algorithm. Convergence of the algorithm is accelerated using a novel quasi-Newton acceleration method. The algorithm outperforms EM algorithms and MCMC sampling methods by a wide margin.
With the R package radmixture, we can calculate the specific E11 result of an individual from the raw gene data. A typical E11 result looks like this:

Africa Europe India Malay Southern.China(Thai) Southwestern.China(Yi) Eastern.China Japanese Northern.China(Oroqen) Yakut America
0.073 4.09 5.54 0 7.01 16.4 44.2 7.16 9.63 4.79 1.15

The first row lists the eleven reference populations that E11 uses in its calculation, and the second row gives the corresponding percentages.
Thus, every raw genotype file is compressed into an eleven-dimensional vector. We want to examine the ancestral differences among the groups and evaluate E11 by studying how well it can distinguish different ethnic groups.
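Concretely, such an E11 result can be stored in R as a named numeric vector; since the entries are percentages, they should sum to roughly 100. A minimal sketch using the example values above (the variable name is our own):

```r
# The example E11 result above as a named eleven-dimensional vector.
e11.example <- c(Africa = 0.073, Europe = 4.09, India = 5.54, Malay = 0,
                 "Southern.China(Thai)" = 7.01, "Southwestern.China(Yi)" = 16.4,
                 Eastern.China = 44.2, Japanese = 7.16,
                 "Northern.China(Oroqen)" = 9.63, Yakut = 4.79, America = 1.15)
# Sanity check: the eleven percentages add up to about 100.
sum(e11.example)
```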

2. About Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

II. Materials and methods

1. Data

All the data of the Chinese population come from https://www.wegene.com/question/17467, https://www.wegene.com/question/8033, https://www.wegene.com/crowdsourcing/details/134 and https://www.wegene.com/question/7320. All the raw gene data of foreigners come from https://my.pgp-hms.org/public_genetic_data?utf8=%E2%9C%93&data_type=23andMe&commit=Search, and they were converted into E11 results afterwards using radmixture.

We have samples from ethnic groups including Han, Manchu, Hui, Mongolian, Uyghur, Zhuang, Tibetan, Korean, Kazakh, Bai, Salar, and Bifzivkar, about 400 people in total.

We also have foreign samples including Caucasians, Africans, Native Americans, and Indians, about 200 people in total.

2. Trials of different algorithms

In the following sections, we will try different clustering methods on our data, mainly agglomerative clustering, divisive clustering, partitional clustering, density-based clustering, FCM clustering, and Gaussian clustering. We will also compare the advantages and disadvantages of each clustering method.

III. Results

1. Hierarchical Clustering

(1) Agglomerative Clustering

i. Description

Initially, every element forms its own cluster. The closest clusters, according to a chosen linkage criterion, are merged repeatedly until all elements belong to one single cluster, or certain stopping conditions are fulfilled.
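A minimal illustration on synthetic one-dimensional data (not part of our data set): two well-separated groups are merged bottom-up, and cutting the tree at k = 2 recovers them.

```r
# Six points forming two obvious groups on the line.
x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
# Agglomerative clustering on the distance matrix with Ward linkage.
hc.toy <- hclust(dist(x), method = "ward.D2")
# Cut the dendrogram into two clusters.
cutree(hc.toy, k = 2)
```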

ii. Application

# Compute the distance matrix and build the dendrogram with Ward linkage.
distance <- dist(world)
hc <- hclust(distance, "ward.D2")
plot(hc, hang = 0.1)
# Mark a 7-cluster partition on the dendrogram.
rect.hclust(hc, k = 7, border = "red")

First of all, the outcome is obviously hard to read for large data sets. In this example, the data are clustered well in general. From left to right: Cluster 1 is mainly comprised of Caucasians in Europe and America; Cluster 2 is formed by Han people from Guangdong and Guangxi together with Zhuang people; Cluster 3 is mainly composed of Indians; Cluster 4 comprises the ethnic minorities in Xinjiang, as well as the Native Americans and the African sample; Cluster 5 is formed of Central and Eastern Chinese people; Cluster 6 mainly comprises Tibetans; and Cluster 7 is mainly composed of Northern Chinese people. Therefore, this clustering method partitions our data well, except for some major problems in Cluster 4.

(2) Divisive Clustering

i. Description

DIANA (DIvisive ANAlysis) adopts the ‘top-down’ strategy. All elements are put into one single cluster at first. This cluster is then split into smaller clusters step by step, until every element forms its own cluster, or certain stopping conditions are fulfilled.

ii. Application

# Divisive clustering with DIANA (cluster package).
dia.f <- diana(world)
plot(dia.f, which.plot = 2, main = "diana result", hang = -1)

The outcome is very similar to that of agglomerative clustering, except that some outliers are processed differently. This is because divisive clustering is highly sensitive to outliers, for example, the Africans and the Native Americans: these data are simply split off in the very first hierarchy.

(3) Advantages and Disadvantages

The advantage of hierarchical clustering is that we can do the clustering with only the distance matrix, even without knowing the raw data. However, this method also has some deficits. Firstly, noise points may strongly influence the clustering result. Secondly, it can only be used with data of a small scale. Thirdly, to obtain a flat partition we still have to decide beforehand how many clusters to cut the tree into.

2. Partitional Clustering

(1) Description

For a given data set of n elements, we first fix the number of clusters k. We then optimize the clustering result by minimizing an objective function, usually by iteration.

(2) K-means Clustering

i. Procedures

  • Set k initial centers
  • Assign each element to the cluster of its closest center
  • Set the new k centers as the mean of all elements in each cluster (this is how the objective function is minimized; see below)
  • Repeat the process (re-assigning and re-centering) until the centers barely move
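The steps above can be sketched in a few lines of base R (an illustrative toy implementation, not the stats::kmeans code; the function and argument names are our own, and for simplicity we assume no cluster becomes empty):

```r
kmeans.sketch <- function(X, k, init = sample(nrow(X), k), iter.max = 100) {
  centers <- X[init, , drop = FALSE]                     # step 1: k initial centers
  for (it in 1:iter.max) {
    # step 2: assign every element to its closest center
    cl <- apply(X, 1, function(p) which.min(colSums((t(centers) - p)^2)))
    # step 3: the new centers are the means of the elements in each cluster
    new.centers <- t(sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE])))
    # step 4: stop when the centers barely move
    if (max(abs(new.centers - centers)) < 1e-8) break
    centers <- new.centers
  }
  list(cluster = cl, centers = centers)
}
```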

ii. Objective Function

\[E=\sum_{i=1}^{k}\sum_{p\in C_i}\|p-c_i\|^2\]
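For a partition produced by stats::kmeans, the quantity E above is exactly what the fitted object reports as tot.withinss; a quick check on synthetic data (the variable names are our own):

```r
set.seed(1)
X <- matrix(rnorm(60), ncol = 2)
km <- kmeans(X, centers = 3, nstart = 10)
# Recompute E: sum over clusters of squared distances to the cluster mean.
E <- sum(sapply(1:3, function(i) {
  Xi <- X[km$cluster == i, , drop = FALSE]
  sum(sweep(Xi, 2, km$centers[i, ])^2)
}))
all.equal(E, km$tot.withinss)
```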

iii. Advantages and Disadvantages

K-means clustering is a fast, simple and direct method. However, there are some disadvantages. Firstly, it does not perform well when the clusters are not spherical. Secondly, we must set the number of clusters beforehand, which is sometimes not easy to figure out, though there are R functions which can identify the best number of clusters (shown later). Thirdly, like hierarchical clustering, it is highly sensitive to noise points. Finally, this method only finds a local optimum, not the global one.

iv. Application

set.seed(123456)
df <- scale(world) 
km_result <- kmeans(df, 8, nstart = 30)
fviz_cluster(km_result, data = df,
             main = "k-means result",
             palette = c("magenta", "blue", "brown", "red", "green", "black", "orange", "grey"),
             geom = "point",
             ellipse = FALSE,
             star.plot = TRUE,
             show.clust.cent = TRUE,
             shape = 19,
             ggtheme = theme_minimal(base_size = 10),
             labelsize = 1,
             pointsize = 1,
             xlab = "",
             ylab = "")

We can see from the result that our data are clustered very well by K-means. The four clusters on the left are all Chinese: the red represents Mongolians and ethnic minorities in Xinjiang; the black represents Northern Han Chinese; the magenta represents Central and Eastern Han Chinese; the blue represents Southern Han Chinese (mainly from Guangdong and Guangxi) as well as Zhuang people. The other four clusters on the right are all non-Chinese: the grey represents Native Americans; the brown represents European and American Caucasians; the green represents the African sample; the orange represents Indians.

(3) K-medoids Clustering

i. Description

In K-means clustering, the new center of each cluster is the mean of all its current elements, so it is highly sensitive to outliers. To deal with this, K-medoids clustering selects the element closest to the current center as the new center. This sometimes generates better results than K-means; however, there are also cases, especially when the data set is too small or the data are highly scattered, in which K-means is better than K-medoids.

ii. Procedures

  • Set k initial centers
  • Assign each element to the cluster of its closest center
  • Set each new center to be the element of that cluster closest to its current center
  • Repeat the process (re-assigning and re-centering) until the centers barely move
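The medoid update in step 3 can be sketched as follows (a toy helper of our own, following the rule described above: within one cluster, the actual element closest to the current mean center becomes the new center):

```r
medoid.update <- function(Xc) {
  m <- colMeans(Xc)                 # current (mean) center of the cluster
  d2 <- colSums((t(Xc) - m)^2)      # squared distance of every element to it
  Xc[which.min(d2), ]               # the closest actual element is the new medoid
}
```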

iii. Application in PAM

As an application of K-medoids clustering, PAM (Partitioning Around Medoids) obtains the best clustering centers through iteration.

set.seed(123456)
df <- scale(world) 
pam_result <- pam(df, 8)
fviz_cluster(pam_result, data = df,
             main = "PAM result",
             palette = c("magenta", "blue", "brown", "red", "green", "black", "orange", "grey"),
             geom = "point",
             ellipse = FALSE,
             star.plot = TRUE, 
             show.clust.cent = TRUE,
             shape = 19,
             ggtheme = theme_minimal(base_size = 10),
             labelsize = 1,
             pointsize = 1,
             xlab = "",
             ylab = ""
)

Although K-medoids clustering can alleviate the influence of outliers to some extent, when the data set is too small or all elements of some cluster lie far from its current center, selecting any element of that cluster as the new center causes great error, and the clustering cannot be done correctly. In this example, there are only two Native Americans and three Indians; K-medoids does not perform well on such small groups. This explains why Caucasians, Native Americans, and Indians, which should show large discrepancies, are partitioned into the same cluster by this method.

(4) CLARA

i. Description

Instead of considering the whole data set, we select a small part of it and get the clustering centers by PAM. If we sample the data randomly, the sample should represent the whole data set, and the centers chosen this way are very likely to be similar to the centers we would choose from the whole data set. In short, CLARA draws several samples, applies PAM to each, and keeps the best clustering result.
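The CLARA idea can be sketched in base R (our own toy version; stats::kmeans stands in for PAM so that the sketch stays self-contained): cluster several random samples, score each candidate set of centers on the full data set, and keep the best.

```r
clara.sketch <- function(X, k, nsamples = 5, sampsize = 20) {
  best <- NULL
  best.cost <- Inf
  for (s in 1:nsamples) {
    # Draw a random sample and cluster only that sample.
    idx <- sample(nrow(X), min(sampsize, nrow(X)))
    cent <- kmeans(X[idx, , drop = FALSE], k)$centers
    # Score the candidate centers on the WHOLE data set.
    cost <- sum(apply(X, 1, function(p) min(colSums((t(cent) - p)^2))))
    if (cost < best.cost) {
      best.cost <- cost
      best <- cent
    }
  }
  list(centers = best, cost = best.cost)
}
```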

ii. Application

set.seed(123456)
df <- scale(world) 
clara_result <- clara(df, 8)
fviz_cluster(clara_result, data = df,
             main = "CLARA result",
             palette = c("magenta", "blue", "brown", "red", "green", "black", "orange", "grey"),
             geom = "point",
             ellipse = FALSE,
             star.plot = TRUE, 
             show.clust.cent = TRUE,
             shape = 19,
             ggtheme = theme_minimal(base_size = 10),
             labelsize = 1,
             pointsize = 1,
             xlab = "",
             ylab = ""
)

The result of CLARA is not good: Turkic people and Indians are clustered together, and Native Americans, the African sample and some Europeans end up in the same grey group. The reason for this is that CLARA clusters samples of the data. It applies PAM at every step and gains efficiency through sampling; however, random sampling can cause chaos, as shown here, and since we do not have a lot of data, sampling is completely unnecessary.

3. Density-Based Clustering

(1) DBSCAN (Density-Based Spatial Clustering of Application with Noise)

i. Description

DBSCAN (Density-Based Spatial Clustering of Application with Noise) is a widely adopted clustering method in data mining.

In order to elucidate the method, we should first give some concepts and definitions.

We should first set \(\epsilon\) and M. \(\epsilon\) is called neighborhood radius, and M is called minimum number of points required to form a cluster.

  • \(\epsilon\) neighborhood: Fix x. The set of elements in the sphere with center x and radius \(\epsilon\) is called the \(\epsilon\) neighborhood of x.
  • Density: Fix x. The size of the \(\epsilon\) neighborhood of x is called the density of x.
  • Core point: If the density of x is at least M, then x is a core point.
  • Border point: If x is not a core point, but x is in the \(\epsilon\) neighborhood of some core point(s), then x is a border point.
  • Noise point: If x is neither a core point nor a border point, then x is a noise point.
  • Directly density-reachable: If x is a core point and y is in the \(\epsilon\) neighborhood of x, then y is directly density-reachable from x.
  • Density-reachable: If \(x_1\), \(x_2\), …, \(x_n\) are elements of the set such that, for each i from 1 to n-1, \(x_{i+1}\) is directly density-reachable from \(x_i\), then \(x_n\) is density-reachable from \(x_1\).
  • Density-connected: If y and z are both density-reachable from x, then y and z are density-connected.

Having all these concepts and definitions, we can now define clusters: If C is a cluster, then it satisfies:

    1. If x belongs to C, and y is density-reachable from x, then y belongs to C;
    2. If x and y belong to C, then x and y are density-connected.

To achieve the clustering, we first select a core point, then expand towards all density-reachable points to obtain a maximal area in which any two points are density-connected, and then repeat the process with another unvisited core point.
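The definitions above can be turned into a short base-R helper that labels every point (a sketch of our own, not the dbscan package code; here the density counts the point itself, and a core point needs density at least M):

```r
point.types <- function(X, eps, M) {
  D <- as.matrix(dist(X))
  nb <- D <= eps                               # eps-neighborhood membership matrix
  dens <- rowSums(nb)                          # density of each point
  core <- dens >= M                            # core points
  # border: not core, but inside some core point's neighborhood
  border <- !core & apply(nb, 1, function(row) any(row & core))
  unname(ifelse(core, "core", ifelse(border, "border", "noise")))
}
```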

ii. Application

dbscan_res <- dbscan::dbscan(world, eps = 3.6, minPts = 4)
fviz_cluster(dbscan_res, data = world,
             main = "DBSCAN result",
             palette = c("magenta", "blue", "brown", "red", "green", "black", "orange", "grey", "yellow", "purple"),
             geom = "point",
             ellipse = TRUE,
             star.plot = TRUE,
             show.clust.cent = TRUE,
             shape = 19,
             ggtheme = theme_minimal(base_size = 10),
             labelsize = 1,
             pointsize = 1,
             outlier.color = "black",
             xlab = "",
             ylab = "")

We can see from the result that black isolated points are all identified as noise points. In this example, DBSCAN does not perform well. First, it generates too many noise points. Secondly, the clusters it gives do not match reality. This may be because our data do not satisfy the basic assumptions mentioned above: although there are several groups of people among the Chinese, it is hard to distinguish them well. And since there are very few Indians, Native Americans, and Africans in the data, they are ‘doomed’ to be noise points when \(M = 4\), while for \(M \le 3\) we get too many clusters. In any case, it is not a good method for partitioning our data.

And we even show the ridiculous result when \(M = 1\) here:

dbscan_res <- dbscan::dbscan(world, eps = 3.6, minPts = 1)
fviz_cluster(dbscan_res, data = world,
             main = "DBSCAN result",
             palette = rainbow(140),
             geom = "point",
             ellipse = TRUE,
             star.plot = TRUE,
             show.clust.cent = TRUE,
             shape = 19,
             ggtheme = theme_minimal(base_size = 10),
             labelsize = 1,
             pointsize = 1,
             outlier.color = "black",
             xlab = "",
             ylab = "")

(2) Advantages and Disadvantages

The most important advantages of density-based clustering are that it can find oddly-shaped clusters and that we do not need to set the number of clusters beforehand. It also has another advantage: it can ‘fight against’ noise points, which are identified separately and can be dealt with afterwards. The main disadvantage of density-based methods like DBSCAN is that it is not easy to set \(\epsilon\) and M appropriately.

4. FCM Clustering

(1) Description

FCM (Fuzzy C-Means) clustering is a kind of fuzzy clustering method. Unlike ‘hard’ partitioning, fuzzy clustering partitions the data ‘softly’: as in fuzzy sets, each element can belong to one cluster to some extent, or to several clusters to different extents.

FCM intends to minimize the following function: \[\sum_{v=1}^{k} \frac{\sum_{i,j} u(i,v)^r\, u(j,v)^r\, d(i,j)}{2 \sum_j u(j,v)^r}\]

where \(u(i,v)\) is the membership of observation i in cluster v, \(r\) is the membership exponent, \(k\) is the number of clusters, and \(d(i,j)\) is the dissimilarity between observations i and j.
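The criterion above can be computed directly from a membership matrix and a dissimilarity matrix (a small helper of our own, just to make the formula concrete; for a crisp 0/1 membership matrix the value can be checked by hand):

```r
# U: n x k membership matrix; D: n x n dissimilarity matrix; r: membership exponent.
fcm.objective <- function(U, D, r = 2) {
  sum(sapply(1:ncol(U), function(v) {
    uv <- U[, v]^r
    sum(outer(uv, uv) * D) / (2 * sum(uv))
  }))
}
```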

(2) Application

# Fuzzy C-means with fanny (cluster package), 8 clusters.
fannyz <- fanny(world, 8, metric = "SqEuclidean")
plot(fannyz, main = "FCM result", xlab = "", ylab = "")

We can see from the result that the clustering is correct in general. Like many other methods, it puts outliers, like the African sample, Native Americans and Indians, into the same cluster.

5. Gaussian Clustering

(1) Description

GMM (Gaussian Mixture Model) describes a mixture distribution of several Gaussian distributions. It has a very important property: with enough components, it can approximate any continuous distribution with arbitrary precision.

\[p(x)=\sum_{i=1}^M\pi_i p_i(x;\mu_i,C_i)\] where \(p_i(x;\mu_i,C_i)\) is a Gaussian density with mean \(\mu_i\) and covariance \(C_i\). \(\pi_i\) is called the mixing coefficient, representing the prior probability that a sample is generated by the i-th Gaussian component, and satisfies: \[\sum_{i=1}^M \pi_i =1,\quad 0\leqslant \pi_i \leqslant 1\] The parameters, including the \(\pi_i\), can be estimated by the EM method:

  • E Step: Under the current parameters \(\theta^{(t)}\), compute the expectation of the complete-data log-likelihood \(\log p(X,H\mid\theta)\) with respect to the conditional distribution \(p(H\mid X,\theta^{(t)})\) of the hidden assignments H: \[Q(\theta\mid\theta^{(t)})=\mathbb{E}_{H\mid X,\theta^{(t)}}\left[\log p(X,H\mid\theta)\right]\]
  • M Step: Update \(\theta^{(t)}\) to \(\theta^{(t+1)}\) by maximizing \(Q(\theta\mid\theta^{(t)})\) over \(\theta\)
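The two steps can be illustrated with a minimal one-dimensional, two-component EM sketch (our own toy code with crude initialization; mclust's actual implementation is far more general):

```r
em.gmm2 <- function(x, iter = 100) {
  # Crude initialization: equal weights, extreme points as means, overall variance.
  pi1 <- 0.5
  mu <- range(x)
  s2 <- c(var(x), var(x))
  for (it in 1:iter) {
    # E step: posterior responsibility of component 1 for every point.
    p1 <- pi1 * dnorm(x, mu[1], sqrt(s2[1]))
    p2 <- (1 - pi1) * dnorm(x, mu[2], sqrt(s2[2]))
    g <- p1 / (p1 + p2)
    # M step: re-estimate mixing weight, means and variances.
    pi1 <- mean(g)
    mu <- c(sum(g * x) / sum(g), sum((1 - g) * x) / sum(1 - g))
    s2 <- c(sum(g * (x - mu[1])^2) / sum(g),
            sum((1 - g) * (x - mu[2])^2) / sum(1 - g))
  }
  list(pi = c(pi1, 1 - pi1), mu = mu, sigma2 = s2)
}
```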

(2) Application

# Fit a Gaussian mixture model and project it for plotting (mclust package).
em_res <- Mclust(world)
em_resDR <- MclustDR(em_res, lambda = 1)
plot(em_resDR, what = "classification", ngrid = 200)

The elements on the left are mostly Chinese, and those on the right are mostly Europeans; the remaining elements are identified as outliers. The clustering is generally correct. However, it is not fine-grained enough, mainly because our data do not satisfy the basic assumptions well.

6. Conclusion to Clustering Methods

Based on the clustering methods above, we can see that each method has its premises, as well as its advantages and disadvantages. In this example, we have some extreme classes, for instance, the Native Americans and the African sample. Given this, K-means clustering and K-medoids clustering are regarded as the best methods here.

Generally speaking, hierarchical clustering does not perform well on data sets that are too big, and agglomerative and divisive clustering process the outliers in different ways. As for partitional clustering, K-means performs well when the data are expected to be partitioned by distance, but it is highly sensitive to outliers. K-medoids alleviates the influence of outliers; however, when all elements of some cluster are far from its center, it cannot get a proper result. DBSCAN, based on density, overcomes that problem, but when the densities of different clusters vary too much, it too fails to produce a good result. Gaussian clustering can work well for specific data sets but performs poorly when the data do not satisfy certain conditions. FCM clustering handles cluster boundaries excellently; however, it is too sensitive to outliers and does not work well for smaller data sets.

Discussion

As we have mentioned above, K-means is the best clustering method for our data. In the following part, we will discuss our result further and add some visualization. But first, we want to do some descriptive statistics.

We have labeled each ethnic group with an abbreviation to make the visualizations easier to read. The correspondence is as below:

Abbr. Ethnicity Ethnicity(Chinese)
A Han 汉族
B Bai 白族
Bi Bifzivkar 土家族
Kh Kazakh 哈萨克族
K Korean 朝鲜族
Ma Manchu 满族
Mo Mongolian 蒙古族
S Salar 撒拉族
T Tibetan 藏族
U Hui 回族
Uy Uyghur 维吾尔族
Z Zhuang 壮族
O Others(Unknown/mixed) 其他(包括未知民族和混血儿)

Of course, we have to pre-process our data at the very beginning, and we have written several user-defined functions to simplify the job. As these parts are not the core of our discussion and are a little long, we put them after the main discussion (see the parts My Functions and Data Pre-processing).

Box Plots

First we want to compare the proportion of each ethnic group.

# Read data.
par(mfrow=c(2,2),mex=.5)
e11.list <- read.e11(dataframe = FALSE)
# Visualize the result.
for (i in 1:length(e11.list)) {
  boxplot(e11.list[[i]],main=mynames(mode = "eth.english")[i],names=mynames(mode = "e11.english"))
}

Then we want to compare each ethnic group.

par(mfrow=c(2,2),mex=.5)
for (i in 1:11) {
  # boxplot() accepts a list: one box per ethnic group for the i-th component.
  boxplot(lapply(e11.list, function(x) x[, i]),
          main = mynames(mode = "e11.english")[i],
          names = mynames(mode = "eth.english"))
}

Pie Plots

par(mfrow=c(2,2),mex=.8)
aver <- list()
for(j in 1:length(e11.list)){
  aver[[j]] <- vector(mode="numeric")
  for (i in 1:ncol(e11.list[[j]])) {
    aver[[j]][i] <- mean(unlist(e11.list[[j]][i]))
  }
  names(aver[[j]]) <- mynames(mode = "e11.english")
  pie(aver[[j]], main = mynames(mode = "eth.english")[j])
}

Now, let’s go further to the clustering.

Number of Clusters

library(NbClust)
e11 <- read.e11(Khazakh = FALSE, Uyghur = FALSE)
e11.all <- read.e11()
e11 <- e11[-which(row.names(e11) == "O109"), ]
# Omit the duplicated elements
all.unique <- unique(e11.all)
# Scale the data
all.unique.scaled <- scale(all.unique)
# Determine the best number of clusters
NbClust(all.unique.scaled, distance = "euclidean", min.nc = 2, max.nc = 15, method = "average")

So we can see that the best number of clusters is 7.

ncluster <- 7

Clustering Plots

cluster.ellipse(e11.all, ncluster)

As we can see from the image above, the cluster of Turkic people, the one at the lower left corner, is too far from the others, so they are naturally separated. Since these samples can actually be considered outliers, they strongly influence the other results, because k-means uses mean values and outliers are very unfriendly to means. So, in order to get a better view of the image and avoid this influence, we accept that Turkic people form one separate cluster and recompute the result with them removed.

e11 <- read.e11(Khazakh = FALSE, Uyghur = FALSE)
e11 <- e11[-which(row.names(e11) %in% c("O106")),]
e11 <- unique(e11)
cluster.ellipse(e11,ncluster-1)

This time, the result is much clearer. After comparing with the region information, we can draw some basic conclusions: the red cluster includes mainly Han (A) samples, but all the Manchu (Ma) and Korean (K) samples as well as one Hui (U) sample are also included. This is the cluster of the Manchu and the northern Han.

Similarly, the magenta cluster is for the middle-southern Han and Bifzivkar people; the black cluster is for the southern(mainly Guangdong, Guangxi and Hainan) Han as well as for Zhuang people; the green cluster is for Salar, Hui and some northwestern Han people; the blue one is for Tibetan people and the brown one is for the Mongolian people.

Based on this result, we can figure out that this clustering result is successful and that E11 truly can distinguish different Chinese ethnic groups.
Furthermore, we can judge new samples using this clustering plot: we can say that a person belongs to a certain group if his or her E11 result is clustered into that group. This helps overcome a current shortcoming of the reports given by most gene companies, namely their ambiguity within the vast Chinese population. Many Chinese people, though indeed Chinese, expect a more accurate result than a report saying they are 100% Chinese, or get confused when given a certain amount of foreign lineage although they know their families have lived in China for eight hundred years. Also, many people from ethnic minorities are curious about their genetic structures, especially the differences from the Han majority.

Here are the PCA results of our data, with and without Turkic people respectively.

# Calculate the PCA of E11.
e11.pr <- prcomp(scale(e11.all))
# Visualize the result.
fviz_pca_biplot(e11.pr, palette = "jco", 
        addEllipses = FALSE, col.var = "red",label = "var", repel = TRUE, geom = c("point","label"), labelsize = 6)

# Calculate the PCA of E11.
e11.pr <- prcomp(scale(e11))
# Visualize the result.
fviz_pca_biplot(e11.pr, palette = "jco", 
        addEllipses = FALSE, col.var = "red",label = "var", repel = TRUE, geom = c("point","label"), labelsize = 6)

And here we show the hierarchical clustering results:

distance <- dist(as.matrix(e11.all))
e11.matrix <- hclust(distance)
plot(as.phylo(e11.matrix),cex = 5, label.offset=2, type="fan")

plot(e11.matrix, hang=-1, cex=4.5, labels = NULL, lwd=5)

set.seed(123456)
df <- scale(e11.all) 
km_result <- kmeans(df, ncluster, nstart = 30)
dd <- cbind(e11.all, cluster = km_result$cluster)
row.names(km_result$centers)[dd["U7","cluster"]] <- "Hui, Salar"
row.names(km_result$centers)[dd["Mo1","cluster"]] <- "Mongolian"
row.names(km_result$centers)[dd["T1","cluster"]] <- "Tibetan"
row.names(km_result$centers)[dd["Kh3","cluster"]] <- "Kazakh, Uyghur"
row.names(km_result$centers)[dd["A76","cluster"]] <- "Northern Han, Manchu, Korean"
row.names(km_result$centers)[dd["A56","cluster"]] <- "Southern-Eastern Han, Bifzivkar"
row.names(km_result$centers)[dd["A120","cluster"]] <- "Southern Han, Zhuang"
plot(as.phylo(hclust(dist(as.matrix(km_result$centers)))), cex=1, label.offset = 1, lwd=1)

cluster.result <- list()
rn <- list()
cl <- matrix(nrow = ncluster, ncol = 11)
for(i in 1:ncluster){
  cluster.result[[i]] <- read.table(paste(
    "cluster",i,".csv",sep = ""), sep = ",", as.is = TRUE, header = TRUE, row.names = 1)
  rn[[i]] <- row.names(cluster.result[[i]])
  cluster.result[[i]] <- sapply(cluster.result[[i]], as.numeric)
  cl[i,] <- colMeans(cluster.result[[i]])
  row.names(cluster.result[[i]]) <- rn[[i]]
}
judges <- c("A76","A56","A120","U7","Mo1","Kh3")
rn.order <- list()
rownames(cl) <- vector(mode = "character", length = ncluster)
for(i in seq_along(judges)){
  for(j in 1:ncluster){
    if(judges[i] %in% rn[[j]]){
      rn.order[[i]] <- rn[[j]] 
    }
  } 
}

han <- read.e11.F(Han = TRUE)
han.north <- head(han,0)
han.east <- head(han,0)
han.south <- head(han,0)
cl.eth <- head(han,0)
for(i in 1:nrow(han)){
  if(rownames(han)[i] %in% rn.order[[1]]){
    han.north <- rbind(han.north, han[i,])
  }else if(rownames(han)[i] %in% rn.order[[2]]){
    han.east <- rbind(han.east, han[i,])
  }else if(rownames(han)[i] %in% rn.order[[3]]){
    han.south <- rbind(han.south, han[i,])
  }
}
hui <- read.e11.F(Hui = TRUE)
hui.authentic <- head(hui,0)
for(i in 1:nrow(hui)){
  if(rownames(hui)[i] %in% rn.order[[4]]){
    hui.authentic <- rbind(hui.authentic, hui[i,])
  }
}
cl.eth <- rbind(cl.eth, colMeans(han.north), colMeans(han.east), colMeans(han.south), colMeans(hui.authentic))
colnames(cl.eth) <- colnames(hui)
oth <- read.e11(dataframe = FALSE, Han = FALSE, Hui = FALSE, Others = FALSE)
for(i in 1:length(oth)){
  cl.eth <- rbind(cl.eth, colMeans(oth[[i]]))
}
rownames(cl.eth) <- c("Northern Han", "Central-Eastern Han", "Southern Han","Hui", "Manchu", "Zhuang", "Mongolian", "Salar", "Uyghur", "Bifzivkar", "Korean", "Bai", "Kazakh", "Tibetan")
plot(as.phylo(hclust(dist(scale(cl.eth),method = "manhattan"),method = "ward.D")), cex=1, label.offset = 1, lwd=1.5)

We used the mean E11 value of every ethnic group and a hierarchical clustering algorithm to calculate the genetic difference between ethnic groups. From the image above we can see that, under E11, Manchu people share a very close relationship with northern Han people, then coalesce with Koreans. Bifzivkar people first join the Central-Eastern Han and then the Bai. Meanwhile, Hui and Salar coalesce first, then join the Mongolians, and then the Tibetans. Surprisingly, the closest group to the Southern Han is the Zhuang, while the distance from them to the Northern Han is actually very long. This suggests that a large number of people originally belonging to the Zhuang or similar ethnic groups were culturally assimilated into the Han. Finally, the Turkic people, namely Kazakh and Uyghur here, join the others last, at a very long distance.

Now we are going to do the same things on the world data.

# Read the data of the results of foreigners.
foreigners <- read.table("foreigners.csv", header = TRUE, sep = ",", row.names = 1)
# Have a look at the first few rows.
head(foreigners)
##   Africa Europe  India Malay Southern.China.Thai. Southwestern.China.Yi. Eastern.China Japanese
## 1  3.343 77.362 11.024 0.001                0.001                  0.043         0.001    0.001
## 2  0.412 80.373 12.149 0.001                0.001                  0.001         0.001    0.001
## 3  4.303 69.787 20.824 0.001                1.445                  0.001         0.001    0.001
## 4  3.615 69.454 21.110 0.001                0.202                  0.001         0.001    0.001
## 5  1.193 75.907 15.833 0.001                0.001                  0.001         0.001    0.001
## 6  1.169 79.759 12.345 0.134                0.001                  0.001         0.001    0.001
##   Northern.China.Oroqen. Yakut America
## 1                  0.980 2.034   5.210
## 2                  0.001 0.417   6.643
## 3                  0.001 0.800   2.835
## 4                  1.241 0.001   4.372
## 5                  0.001 1.870   5.191
## 6                  0.503 0.001   6.084
# Make the column names compatible.
colnames(foreigners) <- colnames(e11.all)
# Combine the Chinese samples and the foreign samples.
world <- rbind(e11.all, foreigners)
cluster.number(world)
cluster.ellipse(world, 2, ellipse.c = FALSE)

cluster.ellipse(world,8)

This is the whole k-means result for samples from all around the world. There are two separate columns in the plot. Clusters in the left column are East Asians: the blue cluster is for Mongolian and Turkic people; the red cluster is for northern Han, Manchu, Hui, Salar, Korean, Tibetan and Japanese people; the green cluster is for the Central-Eastern Han and Bifzivkar people; the brown cluster is for the southern Han and Zhuang people. The dark green cluster is for the Native American samples, the magenta cluster for Europeans and American Caucasians, the black cluster for the only African sample, and the yellow cluster for Indian people.

The point is that there are many varieties among European and American Caucasians, but they are not distinguished here, while East Asians are carefully categorized and labeled. I attribute this phenomenon to the reference samples and components E11 uses: every E11 result has eleven dimensions, and seven of them describe groups in East Asia. So when used to distinguish different ethnic groups, E11 performs much better on East Asian samples.

# Calculate the PCA of E11.
world.pr <- prcomp(scale(world))
# Visualize the result.
fviz_pca_biplot(world.pr, palette = "jco", 
        addEllipses = FALSE, col.var = "red",label = "var", repel = TRUE, geom = c("point","label"), labelsize = 6)

My Calculator

Now, we are going to develop our new calculator to identify the components of people’s ancestry, just like what E11 has done. But unlike E11, we want this calculator to be aimed at Chinese people. Thus, the ethnic groups we choose as bases are Northern Han, Middle Han, Southern Han, Mongolian, Tibetan and Turkic. (Note that here Turkic represents Turkic people in China, like Uyghur and Kazakh.) What we intend to do is to figure out the proportion of each ethnic group for a given person. That is to say, we are going to express a given person’s ancestry by the combination of the six ethnic groups above. Since we have the regional information of some of our data, we can check if our new calculator works well.
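
Concretely, the task is a constrained least-squares problem: find weights a with a_i ≥ 0 and Σ a_i = 1 minimizing ||C a − b||, where the columns of C are the six base-group vectors and b is the person's scaled E11 vector. Below is a minimal sketch of this formulation; the matrix `C` and target `b` are made-up stand-ins, not real E11 data.

```r
# Sketch of the calculator's core fit: recover mixture weights over six bases.
# C and b are illustrative stand-ins for the real cluster centres and sample.
set.seed(1)
C <- matrix(rnorm(11 * 6), nrow = 11, ncol = 6)  # 11 E11 dimensions x 6 bases
b <- C %*% c(0.5, 0.3, 0.2, 0, 0, 0)             # a known mixture to recover

# Optimize the first five weights; the sixth is 1 - sum(first five).
objective <- function(a5) {
  a <- c(a5, 1 - sum(a5))
  sum((C %*% a - b)^2)  # squared Euclidean residual
}
fit <- optim(par = rep(0.2, 5), objective,
             lower = rep(0, 5), upper = rep(1, 5), method = "L-BFGS-B")
a.hat <- c(fit$par, 1 - sum(fit$par))
round(a.hat, 3)  # weights summing to 1
```

Because b was built as an exact mixture of the columns of C, the fitted weights should reproduce it almost exactly; real samples will leave a nonzero residual, which is what the `Value` column reports.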

result <- data.frame(Northern.Han = numeric(0), Middle.Han = numeric(0), Southern.Han.Zhuang = numeric(0), Mongolian = numeric(0), Tibetan = numeric(0), Turkic = numeric(0), Value = numeric(0))

setwd("D:\\DNA\\R treatments\\E11_ANALYSIS")
source("myfuncs.R", encoding = "UTF-8")

# 'range' is assumed to be defined beforehand as the row indices of df to fit, e.g. range <- 1:nrow(df).
for(turn in range){
  cat("turn = ")
  print(turn)
  
  ncluster <- 6
  sample <- row.names(df)[turn]
  set.seed(123456)
  df <- scale(df) 
  km_result <- kmeans(df, ncluster+1, nstart = 30)
  dd <- cbind(df, cluster = km_result$cluster)
  #U <- km_result$centers[dd["U7","cluster"],]
  Mo <- km_result$centers[dd["Mo1","cluster"],]
  Anorth <- km_result$centers[dd["A76","cluster"],] # north
  Amiddle <- km_result$centers[dd["A56","cluster"],] # middle
  Asouth <- km_result$centers[dd["A120","cluster"],] # south
  Tur <- km_result$centers[dd["Kh3","cluster"],]
  Ti <- km_result$centers[dd["T1","cluster"],]
  
  coefficent.matrix <- as.matrix(cbind(Amiddle, Asouth, Mo, Ti, Tur, Anorth))
  b.matrix <- as.matrix(scale(df)[sample,])
  func <- function(a){
    s <- a[1] + a[2] + a[3] + a[4] + a[5]
    res <- norm(coefficent.matrix %*% c(a[1], a[2], a[3], a[4], a[5], 1-s) - b.matrix)
    return(res)
  }
  judgep <- 10000
  judgen <- 10000
  seatsp <- c(0,0,0,0,0)
  seatsn <- c(0,0,0,0,0)
  start <- c(0,0,0,0,0)
  for(j1 in 0:4){
    print(start)
    start[1] <- j1/4
    for(j2 in 0:4){
      start[2] <- j2/4
      for(j3 in 0:4){
        start[3] <- j3/4
        for(j4 in 0:4){
          start[4] <- j4/4
          for(j5 in 0:4){
            start[5] <- j5/4
            res <- optim(par = start, func, lower = c(0,0,0,0,0), upper = c(1,1,1,1,1), method = "L-BFGS-B")
            par <- res$par
            if(sum(par) <= 1){
              if(res$value < judgep){
                judgep <- res$value
                seatsp <- par
              }
            }else{
              if(res$value < judgen){
                judgen <- res$value
                seatsn <- par
              }
            }
          }
        }
      }
    }
  }
  seatsp.complete <- vector(mode = "numeric", length = 6)
  for(i in 1:5){
    seatsp.complete[i] <- seatsp[i]
  }
  seatsp.complete[6] <- 1-sum(seatsp)
  seatsn.complete <- vector(mode = "numeric", length = 6)
  for(i in 1:5){
    seatsn.complete[i] <- seatsn[i]
  }
  seatsn.complete[6] <- 1-sum(seatsn)

  stopmarker <- 0
  if(judgep <= judgen | judgep - judgen < 0.5){
    stopmarker <- 1
  }else if(seatsn.complete[6] < 0 & seatsn.complete[6] > -5){
    temp <- seatsn.complete
    ma <- max(seatsn.complete)
    temp[which(temp==ma)] <- ma + temp[6]
    temp[6] <- 0
    judge3 <- func(temp)
    if(judge3 <= judgen | judge3 - judgen < 0.5){
      stopmarker <- 2
    }
  }
  if(stopmarker == 0){
    coefficent.matrix <- as.matrix(cbind(Anorth, Amiddle, Asouth, Mo, Ti, Tur))
    b.matrix <- as.matrix(scale(df)[sample,])
    func <- function(a){
      s <- a[1] + a[2] + a[3] + a[4] + a[5]
      res <- norm(coefficent.matrix %*% c(a[1], a[2], a[3], a[4], a[5], 1-s) - b.matrix)
      return(res)
    }
    judgep <- 10000
    judgen <- 10000
    seatsp <- c(0,0,0,0,0)
    seatsn <- c(0,0,0,0,0)
    start <- c(0,0,0,0,0)
    for(j1 in 0:4){
      print(start)
      start[1] <- j1/4
      for(j2 in 0:4){
        start[2] <- j2/4
        for(j3 in 0:4){
          start[3] <- j3/4
          for(j4 in 0:4){
            start[4] <- j4/4
            for(j5 in 0:4){
              start[5] <- j5/4
              res <- optim(par = start, func, lower = c(0,0,0,0,0), upper = c(1,1,1,1,1), method = "L-BFGS-B")
              par <- res$par
              if(sum(par) <= 1){
                if(res$value < judgep){
                  judgep <- res$value
                  seatsp <- par
                }
              }else{
                if(res$value < judgen){
                  judgen <- res$value
                  seatsn <- par
                }
              }
            }
          }
        }
      }
    }
    seatsp.complete <- vector(mode = "numeric", length = 6)
    for(i in 1:5){
      seatsp.complete[i] <- seatsp[i]
    }
    seatsp.complete[6] <- 1-sum(seatsp)
    seatsn.complete <- vector(mode = "numeric", length = 6)
    for(i in 1:5){
      seatsn.complete[i] <- seatsn[i]
    }
    seatsn.complete[6] <- 1-sum(seatsn)
    for(m in 1:6){
      result[turn,m] <- seatsp.complete[m]
    }
    result[turn,7] <- judgep
    row.names(result)[turn] <- sample
  }else if(stopmarker == 1){
    pro <- vector(mode = "numeric", length = 7)
    pro[1] <- seatsp.complete[6]
    pro[2] <- seatsp.complete[1]
    pro[3] <- seatsp.complete[2]
    pro[4] <- seatsp.complete[3]
    pro[5] <- seatsp.complete[4]
    pro[6] <- seatsp.complete[5]
    pro[7] <- judgep
    result[turn,] <- pro
    row.names(result)[turn] <- sample
  }else if(stopmarker == 2){
    pro <- vector(mode = "numeric", length = 7)
    pro[1] <- seatsp.complete[6]
    pro[2] <- seatsp.complete[1]
    pro[3] <- seatsp.complete[2]
    pro[4] <- seatsp.complete[3]
    pro[5] <- seatsp.complete[4]
    pro[6] <- seatsp.complete[5]
    pro[7] <- judgep
    result[turn,] <- pro
    row.names(result)[turn] <- sample
  }
  if(result[turn, 7] == 10000){
    coefficent.matrix <- as.matrix(cbind(Anorth, Amiddle, Tur, Mo, Ti, Asouth))
    b.matrix <- as.matrix(scale(df)[sample,])
    func <- function(a){
      s <- a[1] + a[2] + a[3] + a[4] + a[5]
      res <- norm(coefficent.matrix %*% c(a[1], a[2], a[3], a[4], a[5], 1-s) - b.matrix)
      return(res)
    }
    judgep <- 10000
    judgen <- 10000
    seatsp <- c(0,0,0,0,0)
    seatsn <- c(0,0,0,0,0)
    start <- c(0,0,0,0,0)
    for(j1 in 0:4){
      print(start)
      start[1] <- j1/4
      for(j2 in 0:4){
        start[2] <- j2/4
        for(j3 in 0:4){
          start[3] <- j3/4
          for(j4 in 0:4){
            start[4] <- j4/4
            for(j5 in 0:4){
              start[5] <- j5/4
              res <- optim(par = start, func, lower = c(0,0,0,0,0), upper = c(1,1,1,1,1), method = "L-BFGS-B")
              par <- res$par
              if(sum(par) <= 1){
                if(res$value < judgep){
                  judgep <- res$value
                  seatsp <- par
                }
              }else{
                if(res$value < judgen){
                  judgen <- res$value
                  seatsn <- par
                }
              }
            }
          }
        }
      }
    }
    seatsp.complete <- vector(mode = "numeric", length = 6)
    for(i in 1:5){
      seatsp.complete[i] <- seatsp[i]
    }
    seatsp.complete[6] <- 1-sum(seatsp)
    seatsn.complete <- vector(mode = "numeric", length = 6)
    for(i in 1:5){
      seatsn.complete[i] <- seatsn[i]
    }
    seatsn.complete[6] <- 1-sum(seatsn)
    
    pro <- vector(mode = "numeric", length = 7)
    pro[1] <- seatsp.complete[1]
    pro[2] <- seatsp.complete[2]
    pro[3] <- seatsp.complete[6]
    pro[4] <- seatsp.complete[4]
    pro[5] <- seatsp.complete[5]
    pro[6] <- seatsp.complete[3]
    pro[7] <- judgep
    result[turn,] <- pro
    row.names(result)[turn] <- sample
  }
}
for(i in 1:nrow(result)){
  for(j in 1:ncol(result)){
    result[i,j] <- round(as.numeric(result[i,j]), 3)
  }
}
View(result)
result.data <- read.table("D:\\DNA\\R treatments\\E11_ANALYSIS\\statistics\\my_calculator sans Hui.csv", row.names = 1, sep = ",", header = TRUE)
result.new <- unique(rbind(result.data, result[range,]))
View(result.new)
write.table(result.new, "D:\\DNA\\R treatments\\E11_ANALYSIS\\statistics\\my_calculator sans Hui.csv", sep = ",", row.names = TRUE, col.names = TRUE)
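
A side note on the rounding loop near the end: rounding a data frame cell by cell works, but the same effect can be had with one vectorized pass over the columns. A toy illustration (the data frame here is made up):

```r
# Vectorized alternative to rounding a data frame cell by cell:
# coerce each column to numeric and round it in a single lapply pass.
df.demo <- data.frame(a = c("0.12345", "1.98765"),
                      b = c("2.5", "3.14159"),
                      stringsAsFactors = FALSE)
df.demo[] <- lapply(df.demo, function(col) round(as.numeric(col), 3))
df.demo
```

Assigning to `df.demo[]` keeps the object a data frame with its row names intact, which is exactly what the loop version preserves.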

We can see from the results that our new calculator performs well. Let’s have a look at a few examples:

Tag Region Northern Han Middle Han Southern Han/Zhuang Mongolian Tibetan Turkic Value
A114 Guangdong(Hakka) 0 0.346 0.654 0 0 0 1.976
A67 Shandong 0.956 0.044 0 0 0 0 2.241
A60 Sichuan * Jiangsu 0.050 0.949 0 0 0 0 2.473
Mo1 Inner Mongolia 0.129 0 0 0.771 0.031 0.069 3.565
Uy1 Xinjiang 0 0 0 0 0 1 5.980
T1 Tibet 0.077 0 0 0 0.923 0 4.255
Kh2 Xinjiang 0 0 0 0.095 0 0.905 2.944
U9 Liaoning 0.597 0.127 0 0 0.036 0.24 4.289
A70 Jiangsu * Shaanxi 0.581 0.419 0 0 0 0 3.268
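
One remark on the calculator code above before moving on: the search over 5^5 starting points is written out three times, with only the column order of the coefficient matrix changing. A sketch of how it could be factored into one helper (simplified here to keep only the feasible, sum ≤ 1, optimum; the full code also tracks the infeasible one):

```r
# Sketch: the repeated multi-start optimization factored into one function.
# coef.mat: matrix of cluster centres (features x 6); b: the scaled sample.
grid.search.fit <- function(coef.mat, b, steps = 0:4 / 4) {
  func <- function(a) {
    norm(coef.mat %*% c(a, 1 - sum(a)) - b)
  }
  best <- list(value = Inf, par = rep(0, 5))
  starts <- expand.grid(rep(list(steps), 5))  # every combination of starts
  for (i in seq_len(nrow(starts))) {
    res <- optim(par = unlist(starts[i, ], use.names = FALSE), func,
                 lower = rep(0, 5), upper = rep(1, 5), method = "L-BFGS-B")
    if (sum(res$par) <= 1 && res$value < best$value) best <- res
  }
  c(best$par, 1 - sum(best$par))  # the six weights, summing to 1
}
```

Each of the three orderings would then be a single call with a reordered `coef.mat`, which makes the intent of the retry logic much easier to follow.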

We have just finished the main discussion of our results. In the following parts, we will come back to the functions we defined ourselves, and to the data pre-processing. Last but not least, we will collect some of the most useful of these functions into an R package (called 'e11clust') and present it as the final part of our report. From here on, the material is more of an appendix than the main body of the passage, so feel free to skip it.

My Functions

# mynames: generate a string vector in which are different names.
# Inputs:
# - mode: string, the kind of names needed. There are five options. "eth.abbr": generate abbreviations of ethnicity names; "eth.chinese": generate full ethnicity names in Chinese; "eth.english": generate full ethnicity names in English; "e11.chinese": generate the E11 element names in Chinese; "e11.english": generate the E11 element names in English.
# - Han: Boolean, TRUE if Han element is needed. Default is TRUE
# - Hui: Boolean, TRUE if Hui element is needed. Default is TRUE
# - Manchu: Boolean, TRUE if Manchu element is needed. Default is TRUE
# - Zhuang: Boolean, TRUE if Zhuang element is needed. Default is TRUE
# - Mongolian: Boolean, TRUE if Mongolian element is needed. Default is TRUE
# - Salar: Boolean, TRUE if Salar element is needed. Default is TRUE
# - Uyghur: Boolean, TRUE if Uyghur element is needed. Default is TRUE
# - Bifzivkar: Boolean, TRUE if Bifzivkar element is needed. Default is TRUE
# - Korean: Boolean, TRUE if Korean element is needed. Default is TRUE
# - Bai: Boolean, TRUE if Bai element is needed. Default is TRUE
# - Khazakh: Boolean, TRUE if Khazakh element is needed. Default is TRUE
# - Tibetan: Boolean, TRUE if Tibetan element is needed. Default is TRUE
# - Others: Boolean, TRUE if Others element is needed. Default is TRUE
# Output: vector, containing names.
mynames <- function(mode, Han = TRUE, Hui = TRUE, Manchu = TRUE, Zhuang = TRUE, Mongolian = TRUE, Salar = TRUE, Uyghur = TRUE, Bifzivkar = TRUE, Korean = TRUE, Bai = TRUE, Khazakh = TRUE, Tibetan = TRUE, Others = TRUE){
  # Initialize the name vector
  name <- vector(mode = "character")
  # Enter the "eth.abbr" mode; every element is considered if requested.
  if(identical(mode, "eth.abbr")){
    if(Hui){
      name <- append(name,"U")
    }
    if(Han){
      name <- append(name,"A")
    }
    if(Manchu){
      name <- append(name,"Ma")
    }
    if(Zhuang){
      name <- append(name,"Z")
    }
    if(Mongolian){
      name <- append(name,"Mo")
    }
    if(Salar){
      name <- append(name,"S")
    }
    if(Uyghur){
      name <- append(name,"Uy")
    }
    if(Bifzivkar){
      name <- append(name,"Bi")
    }
    if(Korean){
      name <- append(name,"K")
    }
    if(Bai){
      name <- append(name,"B")
    }
    if(Khazakh){
      name <- append(name,"Kh")
    }
    if(Tibetan){
      name <- append(name,"T")
    }
    if(Others){
      name <- append(name,"O")
    }
    return(name)
   # Enter the "eth.chinese" mode; every element is considered if requested.
  }else if(identical(mode, "eth.chinese")){
    if(Hui){
      name <- append(name,"回族")
    }
    if(Han){
      name <- append(name,"汉族")
    }
    if(Manchu){
      name <- append(name,"满族")
    }
    if(Zhuang){
      name <- append(name,"壮族")
    }
    if(Mongolian){
      name <- append(name,"蒙古族")
    }
    if(Salar){
      name <- append(name,"撒拉族")
    }
    if(Uyghur){
      name <- append(name,"维吾尔族")
    }
    if(Bifzivkar){
      name <- append(name,"土家族")
    }
    if(Korean){
      name <- append(name,"朝鲜族")
    }
    if(Bai){
      name <- append(name,"白族")
    }
    if(Khazakh){
      name <- append(name,"哈萨克族")
    }
    if(Tibetan){
      name <- append(name,"藏族")
    }
    if(Others){
      name <- append(name,"其他")
    }
    return(name)
   # Enter the "eth.english" mode; every element is considered if requested.
  }else if(identical(mode, "eth.english")){
    if(Hui){
      name <- append(name,"Hui")
    }
    if(Han){
      name <- append(name,"Han")
    }
    if(Manchu){
      name <- append(name,"Manchu")
    }
    if(Zhuang){
      name <- append(name,"Zhuang")
    }
    if(Mongolian){
      name <- append(name,"Mongolian")
    }
    if(Salar){
      name <- append(name,"Salar")
    }
    if(Uyghur){
      name <- append(name,"Uyghur")
    }
    if(Bifzivkar){
      name <- append(name,"Bifzivkar")
    }
    if(Korean){
      name <- append(name,"Korean")
    }
    if(Bai){
      name <- append(name,"Bai")
    }
    if(Khazakh){
      name <- append(name,"Khazakh")
    }
    if(Tibetan){
      name <- append(name,"Tibetan")
    }
    if(Others){
      name <- append(name,"Others")
    }
    return(name)
   # Enter the "e11.chinese" mode: return the fixed E11 element names in Chinese.
  }else if(identical(mode, "e11.chinese")){
    name <- c("非洲","欧洲","印度","马来","中国南部\n(傣)","中国西南部\n(彝)","中国东部","日本","中国北部\n(鄂伦春)","雅库特","美洲")
    return(name)
   # Enter the "e11.english" mode: return the fixed E11 element names in English.
  }else if(identical(mode, "e11.english")){
    name <- c("Africa","Europe","India","Malay","Southern.China(Thai)","Southwestern.China(Yi)","Eastern.China","Japanese","Northern.China(Oroqen)","Yakut","America")
    return(name)
  }
}
# mynames.F: generate a string vector in which are different names.
# Inputs:
# - mode: string, the kind of names needed. There are five options. "eth.abbr": generate abbreviations of ethnicity names; "eth.chinese": generate full ethnicity names in Chinese; "eth.english": generate full ethnicity names in English; "e11.chinese": generate the E11 element names in Chinese; "e11.english": generate the E11 element names in English.
# - Han: Boolean, TRUE if Han element is needed. Default is FALSE
# - Hui: Boolean, TRUE if Hui element is needed. Default is FALSE
# - Manchu: Boolean, TRUE if Manchu element is needed. Default is FALSE
# - Zhuang: Boolean, TRUE if Zhuang element is needed. Default is FALSE
# - Mongolian: Boolean, TRUE if Mongolian element is needed. Default is FALSE
# - Salar: Boolean, TRUE if Salar element is needed. Default is FALSE
# - Uyghur: Boolean, TRUE if Uyghur element is needed. Default is FALSE
# - Bifzivkar: Boolean, TRUE if Bifzivkar element is needed. Default is FALSE
# - Korean: Boolean, TRUE if Korean element is needed. Default is FALSE
# - Bai: Boolean, TRUE if Bai element is needed. Default is FALSE
# - Khazakh: Boolean, TRUE if Khazakh element is needed. Default is FALSE
# - Tibetan: Boolean, TRUE if Tibetan element is needed. Default is FALSE
# - Others: Boolean, TRUE if Others element is needed. Default is FALSE
# Output: vector, containing names.
mynames.F <- function(mode, Han = FALSE, Hui = FALSE, Manchu = FALSE, Zhuang = FALSE, Mongolian = FALSE, Salar = FALSE, Uyghur = FALSE, Bifzivkar = FALSE, Korean = FALSE, Bai = FALSE, Khazakh = FALSE, Tibetan = FALSE, Others = FALSE){
  # Since this function overlaps with the previous one, we implement it by calling mynames with flipped defaults.
  name <- mynames(mode = mode, Han = Han, Hui = Hui, Manchu = Manchu, Zhuang = Zhuang, Mongolian = Mongolian, Salar = Salar, Uyghur = Uyghur, Bifzivkar = Bifzivkar, Korean = Korean, Bai = Bai, Khazakh = Khazakh, Tibetan = Tibetan, Others = Others)
  return(name)
}
# read.e11.all: read the combined E11 data table from ALL.csv.
# Inputs: none
# Output: dataframe, containing all the E11 data.
read.e11.all <- function(){
  e11.all <- read.table("ALL.csv", sep = ",", header = TRUE, row.names = 1)
  return(e11.all)
}
# read.e11: generate a dataframe or list of all the needed E11 data.
# Inputs:
# - dataframe: Boolean, TRUE if result is converted into dataframe. Default is TRUE
# - Han: Boolean, TRUE if Han data should be read in. Default is TRUE
# - Hui: Boolean, TRUE if Hui data should be read in. Default is TRUE
# - Manchu: Boolean, TRUE if Manchu data should be read in. Default is TRUE
# - Zhuang: Boolean, TRUE if Zhuang data should be read in. Default is TRUE
# - Mongolian: Boolean, TRUE if Mongolian data should be read in. Default is TRUE
# - Salar: Boolean, TRUE if Salar data should be read in. Default is TRUE
# - Uyghur: Boolean, TRUE if Uyghur data should be read in. Default is TRUE
# - Bifzivkar: Boolean, TRUE if Bifzivkar data should be read in. Default is TRUE
# - Korean: Boolean, TRUE if Korean data should be read in. Default is TRUE
# - Bai: Boolean, TRUE if Bai data should be read in. Default is TRUE
# - Khazakh: Boolean, TRUE if Khazakh data should be read in. Default is TRUE
# - Tibetan: Boolean, TRUE if Tibetan data should be read in. Default is TRUE
# - Others: Boolean, TRUE if Others data should be read in. Default is TRUE
# Output: List or dataframe, containing E11 data.
read.e11 <- function(dataframe = TRUE, Han = TRUE, Hui = TRUE, Manchu = TRUE, Zhuang = TRUE, Mongolian = TRUE, Salar = TRUE, Uyghur = TRUE, Bifzivkar = TRUE, Korean = TRUE, Bai = TRUE, Khazakh = TRUE, Tibetan = TRUE, Others = TRUE){
  # Initialize vec.name to contain results.
  vec.name <- list()
  # Set the working directory to current folder.
  setwd("D:/DNA/R treatments/E11_ANALYSIS/report")
  # Read each file if requested.
  if(Hui){
    hui<-read.table("Hui.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Hui <- hui
  }
  if(Han){
    han<-read.table("Han.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Han <- han
  }
  if(Manchu){
    manchu<-read.table("Manchu.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Manchu <- manchu
  }
  if(Zhuang){
    zhuang<-read.table("Zhuang.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Zhuang <- zhuang
  }
  if(Mongolian){
    mongolian<-read.table("Mongolian.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Mongolian <- mongolian
  }
  if(Salar){
    salar<-read.table("Salar.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Salar <- salar
  }
  if(Uyghur){
    uyghur<-read.table("Uyghur.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Uyghur <- uyghur
  }
  if(Bifzivkar){
    bifzivkar<-read.table("Bifzivkar.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Bifzivkar <- bifzivkar
  }
  if(Korean){
    korean<-read.table("Korean.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Korean <- korean
  }
  if(Bai){
    bai<-read.table("Bai.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Bai <- bai
  }
  if(Khazakh){
    khazakh<-read.table("Khazakh.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Khazakh <- khazakh
  }
  if(Tibetan){
    tibetan<-read.table("Tibetan.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Tibetan <- tibetan
  }
  if(Others){
    others<-read.table("Others.csv",header = TRUE,row.names = 1,sep=",")
    vec.name$Others <- others
  }
  # Convert every column of each dataframe to numeric, keeping the row names.
  for (i in 1:length(vec.name)) {
    rn <- row.names(vec.name[[i]])
    vec.name[[i]]<-as.data.frame(lapply(vec.name[[i]],as.numeric))
    row.names(vec.name[[i]]) <- rn
  }
  # Convert the result into a single dataframe if requested.
  if(dataframe){
    e11.all<-data.frame()
    for (i in 1:length(vec.name)) {
      e11.all<-rbind(e11.all,vec.name[[i]])
    }
    colnames(e11.all) <- mynames(mode = "e11.english")
    rm(vec.name)
    vec.name <- e11.all 
  }
  # Return results
  return(vec.name)
}
# read.e11.F: generate a dataframe or list of all the needed E11 data.
# Inputs:
# - dataframe: Boolean, TRUE if result is converted into dataframe. Default is TRUE
# - Han: Boolean, TRUE if Han data should be read in. Default is FALSE
# - Hui: Boolean, TRUE if Hui data should be read in. Default is FALSE
# - Manchu: Boolean, TRUE if Manchu data should be read in. Default is FALSE
# - Zhuang: Boolean, TRUE if Zhuang data should be read in. Default is FALSE
# - Mongolian: Boolean, TRUE if Mongolian data should be read in. Default is FALSE
# - Salar: Boolean, TRUE if Salar data should be read in. Default is FALSE
# - Uyghur: Boolean, TRUE if Uyghur data should be read in. Default is FALSE
# - Bifzivkar: Boolean, TRUE if Bifzivkar data should be read in. Default is FALSE
# - Korean: Boolean, TRUE if Korean data should be read in. Default is FALSE
# - Bai: Boolean, TRUE if Bai data should be read in. Default is FALSE
# - Khazakh: Boolean, TRUE if Khazakh data should be read in. Default is FALSE
# - Tibetan: Boolean, TRUE if Tibetan data should be read in. Default is FALSE
# - Others: Boolean, TRUE if Others data should be read in. Default is FALSE
# Output: List or dataframe, containing E11 data.
read.e11.F <- function(dataframe = TRUE, Han = FALSE, Hui = FALSE, Manchu = FALSE, Zhuang = FALSE, Mongolian = FALSE, Salar = FALSE, Uyghur = FALSE, Bifzivkar = FALSE, Korean = FALSE, Bai = FALSE, Khazakh = FALSE, Tibetan = FALSE, Others = FALSE){
  # Since this function overlaps with the previous one, we implement it by calling read.e11 with flipped defaults.
  vec.name <- read.e11(dataframe = dataframe, Han = Han, Hui = Hui, Manchu = Manchu, Zhuang = Zhuang, Mongolian = Mongolian, Salar = Salar, Uyghur = Uyghur, Bifzivkar = Bifzivkar, Korean = Korean, Bai = Bai, Khazakh = Khazakh, Tibetan = Tibetan, Others = Others)
  return(vec.name)
}
read.e11.original <- function(dataframe = TRUE, Han = TRUE, Hui = TRUE, Manchu = TRUE, Zhuang = TRUE, Mongolian = TRUE, Salar = TRUE, Uyghur = TRUE, Bifzivkar = TRUE, Korean = TRUE, Bai = TRUE, Khazakh = TRUE, Tibetan = TRUE, Others = TRUE){
  vec.name <- list()
  if(Hui){
    hui<-read.table("D:/DNA/R treatments/E11 of Hui.csv",sep=",")
    vec.name$Hui <- hui
  }
  if(Han){
    han<-read.table("D:/DNA/R treatments/E11 of Han.csv",sep=",")
    vec.name$Han <- han
  }
  if(Manchu){
    manchu<-read.table("D:/DNA/R treatments/E11 of Manchu.csv",sep=",")
    vec.name$Manchu <- manchu
  }
  if(Zhuang){
    zhuang<-read.table("D:/DNA/R treatments/E11 of Zhuang.csv",sep=",")
    vec.name$Zhuang <- zhuang
  }
  if(Mongolian){
    mongolian<-read.table("D:/DNA/R treatments/E11 of Mongolian.csv",sep=",")
    vec.name$Mongolian <- mongolian
  }
  if(Salar){
    salar<-read.table("D:/DNA/R treatments/E11 of Salar.csv",sep=",")
    vec.name$Salar <- salar
  }
  if(Uyghur){
    uyghur<-read.table("D:/DNA/R treatments/E11 of Uyghur.csv",sep=",")
    vec.name$Uyghur <- uyghur
  }
  if(Bifzivkar){
    bifzivkar<-read.table("D:/DNA/R treatments/E11 of Bifzivkar.csv",sep=",")
    vec.name$Bifzivkar <- bifzivkar
  }
  if(Korean){
    korean<-read.table("D:/DNA/R treatments/E11 of Korean.csv",sep=",")
    vec.name$Korean <- korean
  }
  if(Bai){
    bai<-read.table("D:/DNA/R treatments/E11 of Bai.csv",sep=",")
    vec.name$Bai <- bai
  }
  if(Khazakh){
    khazakh<-read.table("D:/DNA/R treatments/E11 of Khazakh.csv",sep=",")
    vec.name$Khazakh <- khazakh
  }
  if(Tibetan){
    tibetan<-read.table("D:/DNA/R treatments/E11 of Tibetan.csv",sep=",")
    vec.name$Tibetan <- tibetan
  }
  if(Others){
    others<-read.table("D:/DNA/R treatments/E11 of Others.csv",sep=",")
    vec.name$Others <- others
  }
  for (i in 1:length(vec.name)) {
    vec.name[[i]]<-t(vec.name[[i]])
    vec.name[[i]]<-data.frame(vec.name[[i]],stringsAsFactors = FALSE)
  }
  ethnicity <- mynames(mode = "e11.english")
  for (i in 1:length(vec.name)) {
    colnames(vec.name[[i]]) <- ethnicity
    vec.name[[i]] <- vec.name[[i]][-1,]
  }
  for (i in 1:length(vec.name)) {
    vec.name[[i]]<-as.data.frame(lapply(vec.name[[i]],as.numeric))
  }
  for (j in 1:length(vec.name)) {
    element.names<-character(nrow(vec.name[[j]]))
    for(i in 1:nrow(vec.name[[j]])){
      element.names[i]<-paste(mynames(mode = "eth.abbr", Han = Han, Hui = Hui, Manchu = Manchu, Zhuang = Zhuang, Mongolian = Mongolian, Salar = Salar, Uyghur = Uyghur, Bifzivkar = Bifzivkar, Korean = Korean, Bai = Bai, Khazakh = Khazakh, Tibetan = Tibetan, Others = Others)[j],i,sep="")
    }
    rownames(vec.name[[j]])<-element.names
  }
  if(dataframe){
    e11.all<-data.frame()
    for (i in 1:length(vec.name)) {
      e11.all<-rbind(e11.all,vec.name[[i]])
    }
    rm(vec.name)
    vec.name <- e11.all 
  }
  return(vec.name)
}

read.e11.F.original <- function(dataframe = TRUE, Han = FALSE, Hui = FALSE, Manchu = FALSE, Zhuang = FALSE, Mongolian = FALSE, Salar = FALSE, Uyghur = FALSE, Bifzivkar = FALSE, Korean = FALSE, Bai = FALSE, Khazakh = FALSE, Tibetan = FALSE, Others = FALSE){
  vec.name <- read.e11.original(dataframe = dataframe, Han = Han, Hui = Hui, Manchu = Manchu, Zhuang = Zhuang, Mongolian = Mongolian, Salar = Salar, Uyghur = Uyghur, Bifzivkar = Bifzivkar, Korean = Korean, Bai = Bai, Khazakh = Khazakh, Tibetan = Tibetan, Others = Others)
  return(vec.name)
}
# cluster.ellipse: visualize the k-means clustering result; ellipses can be added.
# Inputs:
# - data: dataframe, the data that needs to be clustered. 
# - number.cluster: integer, the number of clusters.
# - seed: numeric, random seed used to make the result reproducible. Default is 123456
# - nstart.c: integer, the number of random starts for k-means (passed to nstart). Default is 30
# - ellipse.c: Boolean, TRUE if ellipses should be drawn. Default is TRUE
# - point.size.c: integer, the size of points. Default is 5
# - text: Boolean, TRUE if labels should be added beside the points. Default is TRUE
# - show.clust.cent.c: Boolean, TRUE if cluster centers should be shown. Default is TRUE
# Output: image, the PCA plot of the clustering result.
cluster.ellipse <- function(data , number.cluster, seed = 123456, nstart.c = 30, ellipse.c = TRUE, point.size.c = 5, text = TRUE, show.clust.cent.c = TRUE){
  library(cluster)
  library(factoextra)
  # Set seed to make the result repeatable
  set.seed(seed)
  # Scale the dataframe
  df <- scale(data) 
  # Calculate the k-means result
  km_result <- kmeans(df, number.cluster, nstart = nstart.c)
  # Add text if is asked to
  if(text){
    geom.c <- c("point", "text")
  }else{
    geom.c <- "point"
  }
  # Visualize the clustering result
  pic <- fviz_cluster(km_result, data = df,
                      main = "CLUSTER PLOT ELLIPSE",
                      palette = c("brown", "blue", "magenta", "red", "black", "green", "orange", "palegreen4", "grey", "palegreen4", "cyan", "pink", "yellow"),
                      #palette = rainbow(14),
                      ellipse = ellipse.c,
                      geom = geom.c,
                      axes = c(1,2),
                      ellipse.type = "euclid",
                      ellipse.level = 0.95,
                      star.plot = TRUE, 
                      show.clust.cent = show.clust.cent.c,
                      repel = TRUE,
                      shape = 19,
                      ggtheme = theme_minimal(base_size = 60),
                      labelsize = 60,
                      pointsize = point.size.c
  )
  # Return the result
  return(pic)
}
# cluster.polygon: visualize the k-means clustering result adding polygons.
# Inputs:
# - data: dataframe, the data that needs to be clustered. 
# - number.cluster: integer, the number of clusters.
# - seed: numeric, random seed used to make the result reproducible. Default is 123456
# - nstart.c: integer, the number of random starts for k-means (passed to nstart). Default is 30
# - point.size.c: integer, the size of points. Default is 5
# Output: image, the PCA plot of the clustering result.
cluster.polygon <- function(data , number.cluster, seed = 123456, nstart.c = 30, point.size.c = 5){
  library(cluster)
  library(factoextra)
  # Set seed to make the result repeatable
  set.seed(seed)
  # Scale the dataframe
  df <- scale(data) 
  # Calculate the k-means result
  km_result <- kmeans(df, number.cluster, nstart = nstart.c)
  # Visualize the clustering result
  pic <- fviz_cluster(km_result, data = df,
                      main = "CLUSTER PLOT POLYGON",
                      palette = c("red", "blue", "green", "magenta", "black", "orange", "cyan", "yellow", "grey"),
                      geom = c("point","text"),
                      star.plot = TRUE, 
                      repel = TRUE,
                      shape = 19,
                      ggtheme = theme_minimal(base_size = 60),
                      labelsize = 60,
                      pointsize = point.size.c
  )
  # Return the result
  return(pic)
}
# cluster.number: Calculate the best cluster number in k-means clustering for data.
# Inputs:
# - data: dataframe, the data that needs to be calculated. 
# Output: list, the values of many cluster-validity indices; the cluster number receiving the most votes should be chosen as the best one.
cluster.number <- function(data){
  library('NbClust')
  # Omit the duplicated elements 
  all.unique <- unique(data)
  # Scale the data
  all.unique.scaled <- scale(all.unique)
  # Let NbClust evaluate candidate cluster numbers (Euclidean distance, average linkage)
  res <- NbClust(all.unique.scaled, distance="euclidean", min.nc=2, max.nc=15, method="average")
  # Return the result
  return(res)
}

Data Pre-processing

e11 <- read.e11(Khazakh = FALSE, Uyghur = FALSE) # Read E11 without Kazakh and Uyghur
outlier <- e11[which(row.names(e11) %in% c("O106")),]
e11 <- e11[-which(row.names(e11) %in% c("O106")),] # Remove an outlier of mixed Russian and Chinese descent
set.seed(123456) # To ensure the result can be replicated
df <- scale(e11) # Scale the dataset
km_result <- kmeans(df, 5, nstart = 30) # Run k-means with 5 clusters
dd <- cbind(e11, cluster = km_result$cluster)
for(i in 1:max(km_result$cluster)){
  ddd <- dd[dd$cluster==i,]
  ddd <- ddd[,-12]
  write.csv(ddd, file = paste(c("D:/DNA/R treatments/E11_ANALYSIS/statistics/cluster",i,".csv"), collapse = ""))
}

turkic <- read.e11.F(Khazakh = TRUE, Uyghur = TRUE)
turkic <- rbind(turkic, outlier)
cluster <- rep(max(km_result$cluster)+1, nrow(turkic))
turkic <- cbind(turkic, cluster)
dd <- rbind(dd, turkic)
write.csv(dd,file = "D:/DNA/R treatments/E11_ANALYSIS/statistics/ALL.csv",row.names = TRUE)
turkic <- turkic[,-12]
write.csv(turkic, file = paste(c("D:/DNA/R treatments/E11_ANALYSIS/statistics/cluster", max(km_result$cluster)+1,".csv"), collapse = ""))
setwd("D:/DNA/R treatments/E11_ANALYSIS/report")
# Read the data
e11.all <- read.e11.all()
# Show the first 6 rows of the data
head(e11.all) 
##     Africa Europe India Malay Southern.China.Thai. Southwestern.China.Yi. Eastern.China Japanese
## U1   0.358   3.36  3.19 0.000                11.20                   15.3          47.2     6.79
## U10  0.724   2.99  3.09 0.000                 5.88                   19.8          43.7     9.98
## U11  0.410   4.25  4.93 0.527                 7.35                   15.6          43.6     8.72
## U12  0.682   5.66  4.04 0.862                 8.93                   14.2          43.4     5.50
## U13  0.485   2.97  4.00 0.516                 8.89                   14.6          42.7     9.29
## U14  0.534   4.29  2.44 2.640                 8.35                   15.7          42.7     9.11
##     Northern.China.Oroqen. Yakut America cluster
## U1                    10.1  1.70   0.829       6
## U10                   11.4  2.28   0.116       6
## U11                   11.6  2.76   0.189       6
## U12                   12.9  2.76   1.050       6
## U13                   12.0  4.13   0.447       6
## U14                   11.3  1.56   1.370       6

We can see that there is an extra column, cluster, which is the clustering result from the previous run, so we need to remove it first.

# Get rid of the 12th column
e11.all <- e11.all[,-12]
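Dropping column 12 by position works, but it silently breaks if the column order ever changes; selecting the column by name is a more defensive alternative. A small sketch on toy data:

```r
# Sketch: removing the leftover cluster column by name, not position.
df <- data.frame(Africa = c(0.1, 0.2), Europe = c(3.3, 2.9), cluster = c(6, 6))
df$cluster <- NULL # same effect as df[, -12] on the real data, but order-proof
names(df)          # the cluster column is gone
```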

Then we want to split the data into separate datasets, one for each ethnic group.

# Get rid of the number in every row name
name <- gsub("\\d","",row.names(e11.all)) 
# Write the data from every ethnic group into separate documents
for(i in 1:length(mynames(mode = "eth.abbr"))){
  pos <- mynames(mode = "eth.abbr")[i] == name
  data <- e11.all[pos, ]
  write.csv(data, 
            file = paste(c(mynames(mode = "eth.english")[i],".csv"),
                              collapse = ""),
            row.names = TRUE)
}
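The same per-group split can be written more compactly with base R's `split()`, which avoids the explicit position lookup; a sketch on made-up rows (our package's `mynames()` helper is not needed for this form):

```r
# Sketch: splitting rows into per-group data frames by row-name prefix.
df  <- data.frame(x = 1:4, row.names = c("Mo1", "Mo2", "U1", "U2"))
grp <- gsub("\\d", "", row.names(df)) # strip digits: "Mo" "Mo" "U" "U"
pieces <- split(df, grp)              # named list, one data frame per group
# Each piece could then be written out, e.g.:
# for (g in names(pieces)) write.csv(pieces[[g]], paste0(g, ".csv"))
```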
# Check the result
read.csv("Mongolian.csv")
##     X Africa Europe India Malay Southern.China.Thai. Southwestern.China.Yi. Eastern.China Japanese
## 1 Mo1  0.086   2.26  4.39  0.00                 4.05                   15.2          31.0    10.50
## 2 Mo2  0.348   2.60  1.14  0.00                 1.08                   18.6          39.1     8.87
## 3 Mo3  0.431   3.25  3.04  0.00                 3.01                   17.1          33.9     7.20
## 4 Mo4  0.883   5.70  5.24  1.81                 0.00                   11.6          25.7     7.90
## 5 Mo5  0.146   3.98  3.62  0.00                 2.30                   13.5          33.2     9.96
## 6 Mo6  0.300   2.23  4.05  0.00                 8.74                   14.5          34.9     7.68
##   Northern.China.Oroqen. Yakut America
## 1                   20.5  9.52   2.590
## 2                   18.7  6.96   2.580
## 3                   20.6  9.58   1.870
## 4                   25.6 13.10   2.440
## 5                   20.4 12.60   0.314
## 6                   19.7  7.51   0.358

With that, the data pre-processing is finished.

R Package

We have collected the utility functions we wrote ourselves into an R package called ‘e11clust’, which we will enclose with our report.

Thanks

We want to thank Prof. Canhong Wen for the excellent lessons on the R language and various algorithms this semester. We also want to thank the tutors who helped our group with this project, and everyone who gave us help and encouragement along the way. It’s really an honor to have you all!

Division of Work

Xiran Zhang (PB17010447): Leader; Main Codes and Thoughts; New Calculator; Report

Daihe Sui (PB17010434): Data Pre-processing; Algorithms Introduction; R Package; Report

Junda Huang (PB17010432): Data Collection; Presentation PPT