Research Interests


In many scientific researches and investigation, the interest lies in the study of effects of many factors simultaneously. One may choose a full factorial design to estimate all possible level combinations of factors, but it involves many unnecessary trials to obtain insignificant high-order interactions. Alternatively, a fractional factorial design, commonly called a regular design, has been the standard choice of these experiments due to its simple construction and analysis via defining relations. On the other hands, researchers know that some nonregular designs, which are designs without defining relations among factors, may have better design properties.

Main Results:

Phoa and Xu (2009) pioneered the construction of nonregular designs, commonly called Quaternary Code (QC) Designs. A quaternary code takes on values from Z4 = {0, 1, 2, 3}. Let G be the generator matrix in Z4. The linear combinations of the rows in G over Z4 form a quaternary linear code denoted by C. Then each Z4 entry of C is transformed into two binary code in its binary image D via the Gray map defined as follows:

0→(1,1); 1→(1,-1); 2→(-1,-1); 3→(-1,1)

By using this construction method and some additional modifications, Phoa and Xu (2009) listed a class of (1/4)th-fraction QC designs optimized under the maximum resolution, minimum aberration and maximum projectivity criteria. Then Phoa, Mukerjee and Xu (2011) continued to investigate in (1/8)th- and (1/16)th-fractions while Phoa (2012) completed the research up to (1/64)th-fractions.

In addition, Zhang et al. (2011) expressed the Gray map in trigonometric representation, which helped in understanding its respective geometric meaning. Phoa (2012) introduced the code-arithmetic approach to derive the unique K-matrix for each fraction class of designs, which paved a way for researchers to continue the search of highly efficient optimal QC designs beyond (1/64)th-fraction. Phoa, Wang and Lin (2016) introduced the integer linear programming as an optimization tool to search for optimal QC designs once K-matrix was derived.

Selected Papers:

Phoa, F.K.H. and Xu, H. (2009). Quarter-fraction factorial design constructed via Quaternary codes. Annals of Statistics, 37, 2561-2581.

Zhang, R., Phoa, F.K.H., Mukerjee, R., and Xu, H. (2011). A trigonometric approach to Quaternary code designs with application to one-eighth and one-sixteenth fractions. Annals of Statistics, 39, 931-955.

Phoa, F.K.H., Mukerjee, R., and Xu, H. (2012). One-eighth- and one-sixteenth-fraction Quaternary code designs with high resolution. Journal of Statistical Planning and Inference, 142, 1073-1080.

Phoa, F.K.H. (2012). A code arithmetic approach for Quaternary code designs and its application to (1/64)th-fractions. Annals of Statistics, 40, 3161-3175.

Phoa, F.K.H., Wang, T.C. and Lin, S.C. (2016). A search of maximum generalized resolution Quaternary-code designs via integer linear programming. Statistics and Computing, 26, 277-283.

(Last Updated: 2018/12/31)


As science and technology have advanced, investigators become more interested in and capable of studying large-scale systems. The cost of probing and studying a large-scale system can be extremely expensive. To address the challenges posed by this technological trend, supersaturated designs are considered for their run-size economy and mathematical novelty. Under the condition of factor sparsity, these screening experiments aims at correctly identifying the subset of those active factors that have significant impact on the response. There are many analysis methods developed in recent years, ranging from stepwise regression to partial least squares.

Main Results:

Phoa, Pan and Xu (2009) proposed the use of Dantzig Selector, which was first proposed by Candes and Tao (2007), to solve the variable selection problem in the analysis of supersaturated designs. They recast the Dantzig Selector as a linear program. Consider a linear model y=Xβ+ε. Suppose an orthonormal transformation is applied to the data, giving y′=Uy, then (UX)T(Uy-UXβhat)=XT(y-Xβhat). Then the Dantzig Selector can be recast as a linear program:

min Σi ui s.t. -u ≤ βhat ≤ u and -δ 1k ≤ XT(y-X βhat) ≤ δ 1k

where the optimization variables are u, βhat is the estimate of β, and 1k is a vectors of k ones. Then a procedure for analyzing supersaturated designs is as follows:
1. Standardize data and compute δ0.
2. Solve the linear program to obtain Dantzig Selector for δ from 0 to $δ0.
3. Make a profile plot and identify important effects at specific δ.

Dantzig Selector was able to provide quick and accurate identification on important factors from the experimental data conducted by a supersaturated design. However, when the number of factors becomes very large, it became infeasible for computer to perform linear programs. This led to an introduction of a two-step method called Stepwise Response Refinement Screener (SRRS). It consists of a Factor Screening step and a Model Searching step. The former step aims at obtaining a factor subset that its elements are potentially important effects (PIEs) to the response, while the latter step builds the best model using the elements in the subset. The SRRS proceeds as follows:

I. SRRS - Factor Screening:
Step 1. Standardize data to obtain y0 and X.
Step 2. Compute the correlation ρ(Xi,y0) for all factors Xi.
Step 3. Choose E0 that ρ(Xi,y0) is maximium and include E0 as the first PIE in the subset.
Step 4. Obtain βE0 by regressing y0 on E0. Set noise threshold γ as 5%-10% of βE0.
Step 5. For the next m PIEs Ej, j=1,...,m, m≤n-3.
Step 5.1. Compute the refined response yj=yj-1-Ej-1βEj-1;
Step 5.2. Do similar steps as (2)-(4) to obtain βEj;
Step 5.3. If |βE0| ≥ γ, include Ej into the subset;
Step 5.4. Repeat up to mth step, where Em is not included in the subset.

II. SRRS - Model Searching:
Step 6. Perform a model search for all Ej for 1 to m factors, where m is the minimum between n/3 and the number of Ej in the subset.
Step 7. Compute mAIC for each model and choose the final model with minimum mAIC.

Selected Papers:

Phoa, F.K.H., Pan, Y.H. and Xu, H. (2009). Analysis of supersaturated designs via Dantzig selector. Journal of Statistical Planning and Inference, 139, 2362-2372.

Phoa, F.K.H. (2014). The Stepwise Response Refinement Screener (SRRS). Statistica Sinica, 24, 197-210.

Phoa, F.K.H. (2014). A graphical user interface platform of the Stepwise Response Refinement Screener for screening experiments. Proceedings of COMPSTAT 2014, 69-79.

(Last Updated: 2018/12/31)


Community is one of the most important features in social networks. It is defined as a group of vertices that share common properties or play similar roles in a network. It drew much attention due to its importance in applications. Traditionally, community detection methods are generally designed by comparing the similarity within the groups and difference between inside and outisde the groups, like thee modularity-based method and its variants. However, these traditional methods do not have a statistical testing procedure to verify cluster existence. Even worse, these methods do not pay attention on the node attributes. For example, homophily suggested that people usually interact with others whom are similar to themselves with some attributes. Although there are some Bayesian models and topic models that considered the information on the node attributes, they are not applied to community detection.

Main Results:

We apply the idea of clustering in social network via a scan statistic approach. A scan statistic is first proposed for detecting clusters in time domain but became popular in spatial domain. It is a useful tool to distinguish the difference when data are mixed with different components, and we apply it to compare two disjoint subsets. In general, it is a likelihood-based test statistic. Support a parameter of interest θ and a subset is selected as Z. We can separately estimate θ in the disjoint subsets Z and Zc, which are defined as θZ and θc respectively, under the independent assumption between Z and Zc. Then a scan statistic is expressed as

λ(Zhat) = supθZ > θc L(Z,ZcZc) / supθZ = θc L(G;θ)

where the Zhat is the region that maximizes the likelihood ratio through scanning all possible subgraph. One crucial feature of this scan statistic method is its ability to consider both structure and attribute clusters at the same time. Assume that an attribute and structure of a network are independent. The test statistic is expressed as:

λSA(Z) = ln(LS(Z)/LS(G))((LA(Z)/LA(G)) = λS(Z) + λA(Z)

After scanning all nodes with a pre-specified maximum radius via scanning windows, all circularly candidate clusters are constructed, and the scan statistics of all subgraphs can be made. Then the largest one is treated as the test statistics. To test the significance of the possible cluster, a monte carlo p-value with R simulated runs is adopted:

p = T+1 / R+1

where T is the number of simulations that λr ≥ λobs, i.e. it is the probability of finding more extreme values than the observed value. If the p-value is smaller than a pre-specified criterion, it is statistically significant to declare thatt there is a cluster.

The scan statistic approach is the core component to a systematic analysis method for community detection in a large-scale network. It consists of three main parts:
1. A brief network exploration via graph coloring.
2. A detailed community detection via the scan statistic.
3. A fine-tune of detected communities via metaheuristics.

Wang, Phoa and Lin (2017) introduced a graph theoretic approach to quickly but roughly identify which nodes belong to which clusters. It is an important step for large-scale networks that are common nowadays. The resulting partition graph provides a rough network partition, so that the complexity in scan statistic can be greatly reduced. Notice that the resulting communities from the scan statistic are unrealistic as they are always circular from a center node with a defined radius, some fine-tune tools are required to detect their irregularity on their cluster edges. Wang and Phoa (2017) proposed to use a feasible approach via metaheuristics to achieve this goal.

Selected Papers:

Wang, T.C., Phoa, F.K.H. and Hsu, T.J. (2015). Power-law distribution in community detection. Social Network Analysis and Modeling, 5: 1-10.

Wang, T.C. and Phoa, F.K.H. (2016). A scanning method for detecting clustering pattern of both attribute and structure in social networks. Physica A, 445, 295-309.

Wang, T.C., Phoa, F.K.H. and Lin, Y.L. (2017). Network exploration by complement graph with graph coloring. Journal of Advanced Statistics, 2, 78-95.

Wang, T.C. and Phoa, F.K.H. (2017). Community detection under exponential random graph model: a metaheuristic approach. Advances in Swarm Intelligence, Springer Volume 10386, 87-98.

(Last Updated: 2018/12/31)


The growth in the big data regime and the popularity of social media have enhaced the research interest in the topic of network pattern recognition and analysis. Similar to finding the mean among a group fo numbers, the network centrality is possibly the most important and fundamental quantity to verify central nodes in a complex network. In general, the centrality measures provide the information on the importance of a target node in a network. The basic idea is to check all nodes in a network and evaluate their centralities. There are many versions of centralities in traditional network analysis, like degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and many others, and these centralities differ by their definitions and their corresponding focuses. However, these traditional centrality measuress did not provide statistical testing for verification, and thus, their recognition accuracy of the central nodes are in questions. Furthermore, some centralities in fact consider only a subset of the whole network in their calculations.

Main Results:

Wang and Phoa (2016) define a new centrality measure called focus degree centrality, which is defined as follows.

Fc = Σd g(d)(o(d) - e(d))

where o(d) and e(d) are the observed and expected number of neighbors with a distance d from the central node vc, and g(d) is a weighted function in exponential form or any monotonic decay shape. Notice that e(d) can be found in either a theoretical approach (Wang and Phoa 2016) or a computatioal approach (Chang, Phoa and Nakano 2019).

Under the null hypothesis that all nodes have the same connection probability by multiplying the weighted function, the statistic of focus degree centrality is

U0 = (Σd g(d)(O(d) - e0(d))√n-1) / √n-1

where O(d) and ec(d) are the observed and expected numbers of neighbors with the distance d. Then the standardized test statistic is U0-E(U0) / √Var(U0). It follows the standardized normal distribution based on the central limit theorem

The focus degree centrality can be obtained under an assumption that the expected number of neighbors can be derived theoretically under some distributions, but if this is not possible due to many reasons, a computational approach is introduced in Chang, Phoa and Nakano (2019). Let vt be a node of interest. Then the network influence, which is a similar measure to focus degree centrality except it handles the directed network with specific properties, is defined as:

Iir g(r)ot(r)

where r is the influence range from vt to all other nodes, o(r) is the observed number of neighboring nodes within the influence range r from vt, and g(r) is a weighted function to represent the decay effect as the influence range increases. g(r) is defined as an inverse of e(r) where e(r) = Σt at(r) / m, where at is the number of nodes whose pathlengths from vi are exactly r. In order to guarantee the monotonic decreasing property of g(r), it becomes a flat function once it reaches its minium value.

Selected Papers:

Wang, T.C. and Phoa, F.K.H. (2016). Focus statistics for testing the degree centrality in social networks. Network Science, 4, 460-473.

Chang, L.L.N., Phoa, F.K.H. and Nakano, J. (2019). A new metric for the analysis of the scientific article citation network. IEEE Access 7, 132027-132032.

(Last Updated: 2020/06/30)


There is considerable scope for reducing resources used in research by designing more efficient studies. Traditionally, Resolution III and IV two-level fractional factorial designs have been widely used for early-stage screening experiments. However, resolution III design fail to disentangle the confuounding between main effects and two-factor interactions, and similar situations exist in resolution IV designs for a pair of two-factor interactions. Although nonregular designs, like Quaternary Code designs, have the ability to disentangle the partial aliased structure, their structural-based analysis is not trivial. Furthermore, two-level designs have no capability for capturing curvature due to pure-quadratic effects, and this leads to a new class of three-level screening designs called Definitive Screening (DS) designs, first proposed by Jones and Nachtsheim (2011). Later, Xiao, Lin and Bai (2012) pointed ot that the DS designs can be constructed by stacking a positive and a negative conference matrix. However, none of them provided a construction of these conference matrices, so if these matrices are not available, their corresponding DS designs cannot be constructed. Unfortunately, these matrices are available only in some even number of columns.

Main Results:

Phoa and Lin (2015) investigated the structure of three-level DS designs and suggested a theoretically-driven approach to construct DS designs for any number of columns, no matter the number is even or odd. Let m be the number of factors, and set m = 2n+2 for even m and 2n+1 for odd m. We need two additional matrices, T and S. T can be obtained cyclically via a generator vector t=(0,t2,...,tn)T. Similarly, S can be obtained cyclicly via a generator vector s=(s1,s2,...,sn)T. When m is even, the conference matrix C can be represented as:

0 δ δT δT
1 0 δT T
Î S -Tδ

When m is odd,

0 T T
S -Tδ

where Î and δ are column vectors of length n with all entries 1 and δ respectively. Then the DS design D can be constructed as:

D=(C -C 0)T

Inspired by the construction of the three-level conference matrix, Lin and Phoa (2016) proposed a construction on two-level near-Hadamard matrix by a new combinatorial tool called general supplementary difference set (GSDS). It is known that the upper bound of the determinant of Hadamard matrix of order n = 2 (mod 4) is achievable only if 2n-2 is a sum of two squares, so there are 6 parameters with 100 that do not fulfill this condution. This method via GSDS can construct Hadamard matrices when the maximum determinant is achievable, and near-Hadamard matrices if not achievable.

Selected Papers:

Phoa, F.K.H. and Lin, D.K.J. (2015). A systematic approach to the construction of definitive screening designs. Statistica Sinica, 25, 853-862.

Lin, Y.L. and Phoa, F.K.H. (2016). Constructing near-Hadamard designs with (almost) D-optimality by General Supplementary Difference Sets. Statistica Sinica, 26, 413-427.

(Last Updated: 2018/12/31)


There are many important problems that require good solultions to improve productivity under some predefined conditions, and these problems are generally called optimization. It becomes a fundamental element of modern science and it has been widely applied to every scientific field. Thus, starting from calculus in old days to linear programming in modern days, optimization has been highly impact to most science. However, when a very large-scale problem is considered, these traditional methods may not be the most efficient choices for optimization purposes. Inspired by the advances in biological technologies, one attempts to solve some complicated optimization problems from a biological point of view, and it leads to the development of natural heuristics. It mimics some natural mechanisms by simulating the composition, evolution, thinking, foraging and many other behaviors of human, nature or animals, and constructs simulation models for searching the optimized solutions. When a multiple of subjects work together, the outcome from their cumulative abilities is more than a summation of every subject's ability. The simulation model of optimization from these collective behaviors is generally called group intelligence. Common group inteligence methods in the literature include the Genetic Algorithm, Memetic Algorithm, Particle Swarm Optimization, Any Colony Algorithm, and many others. Since these methods have been commonly implemented to solve engineering problems, they are designed for searching in continuous domains. There exists some variants of them to search in discrete domains, but they may still not fit the requirements in many mathematical and statistical problems.

Main Results:

Phoa (2017) proposed a new nature-inspired metaheuristic optimization algorithm called the Swarm Intelligence Based (SIB) algorithm. It consists of the following key steps:
1. Initialization: Parameter Inputs and Particle Initialization.
2. Iteration: For each iteration,
2a. MIX Operation: each current particle is MIX with LB and GB particles.
2b. MOVE Operation: three particles are compared based on the objective function.
2c. Update particle if MIX produces better particle, or Random Jump if not.
2d. Update LB and GB particles.
3. Output: the GB particle with its optimzed objective function value.

The above algorithm is commonly called SIB 1.0. It has been widely used in many applications. Phoa et al. (2016) applied it to the optimization of supersaturated designs under E(s2) criterion, Phoa and Chang (2016) optimized the designs of computer experiments under two criteria, Wang and Phoa (2017) appied it to detect communities under ERGM, Lin and Phoa (2017) applied it to efficiently construct confidence regions for target localization, Lin and Phoa (2019) applied it to optimize the scheduling on parallel processing supercomputers, and many others. In addition, Hsu and Phoa (2018) improved the efficiency of SIB 1.0 by choosing initial particles smartly, and applied it to the efficient search of optimal minimum energy designs.

SIB 1.0 worked well under an assumption that the particle size for optimal solution is known in advance. However, this assumption may not be always true in reality. For examples, stock selection in a portfolio, variable selection in a regression model, change point selection on a function, and many others. Thus, Phoa (2020) introduced an augmented version from the standard framework to allow the particle size to change during the optimization procedure. The new framework is widely called SIB 2.0. It consists of the following key steps:

1. Initialization: Parameter Inputs and Particle Initialization.
2. Iteration: For each iteration,
2a. MIX Operation: each current particle is MIX with LB and GB particles.
2b. MOVE Operation: three particles are compared based on the objective function.
2c. If MIX produces better particle, update the current particle.
2d. If not, VARY Operaion: each current particle is VARY with LB and GB particles.
2e. MOVE Operation: four particles are compared based on the objective function.
2f. Update particle if MIX products better particle, or Random Jump if not.
2g. Update LB and GB particles.
3. Output: the GB particle with its optimzed objective function value.

SIB 2.0 is also applied to many applications. Phoa (2020) applied it to the identification of change point numbers and search for the optimal change point locations. Wang, Phoa and Chang (2020) applied it to the fine-tune of the edge of communities once it is detected. Tsai and Phoa (2020) applied it to improve the results of minimum energy designs when the number of sensors is not fixed in prior.

There is an ongoing work for the new SIB 3.0.

Selected Papers:

Phoa, F.K.H., Chen, R.B., Wang, W.C. and Wong, W.K. (2016). Optimizing two-level supersaturated designs via swarm intelligence techniques. Technometrics, 58, 43-49.

Phoa, F.K.H. (2017). A swarm intelligence based (SIB) method for optimization in designs of experiments. Natural Computing, 16, 597-605.

Phoa, F.K.H. and Chang, L.L.H. (2016). A multi-objective implementation in swarm intelligence with applications in designs of computer experiments. Proceedings of ICNC-FSKD 2016, 253-258.

Wang, T.C. and Phoa, F.K.H. (2017). Community detection under exponential random graph model: a metaheuristic approach. Advances in Swarm Intelligence, Springer Volume 10386, 87-98.

Hsu, T.C. and Phoa, F.K.H. (2018). A smart initialization on the swarm intelligence based method for efficient search of optimal minimum energy design. Advances in Swarm Intelligence, Springer Volume 10941, 78-87.

Lin, F.P.C. and Phoa, F.K.H. (2019). Runtime estimation and scheduling on parallel processing supercomputers via instance-based learning and swarm intelligence. International Journal of Machine Learning and Computations 9, 592-598.

Phoa, F.K.H. (2020). An augmented version of the swarm inteligence based method (SIB 2.0) and its application to change point analysis. Natural Computing, in review.

Phoa, F.K.H., Wang, T.C. and Chang, L.L.N. (2020). A metaheuristic approach to fine-tune the irregular-shaped communities for better detections. IEEE Access, in review.

Phoa, F.K.H. and Tsai, T.C. (2020). A two-step approach to the search of minimum energy design via swarm intelligence. Proceeding of ICSI 2020, accepted.

(Last Updated: 2021/03/31)


Rapid event-related functional magnetic resonance imaging (ER-fMRI) allows the shape estimation of hemodynamic response function (HRF) associated with transient brain activation evoked by various mental stimuli. An ER-fMRI design is a sequence of ten to hundred stimuli to be presented to an experimental subject. Each stimulus evokes cerebral neuronal activity, leading to a rise and fall in the ratio of oxy- to deoxy-blood in the cerebral blook vessels a ta brain voxel, and a change in the strength of magnetic field is detected by the MR scanner. Traditionally, an m-sequence is used to precisely estimate HRF and its performance is known to be good. However, the application is unfortunately limited due to the large gap of run size. Recently, Hadamard sequence was proposed for ER-fMRI experiments with one stimulus type.

Main Results:

Cheng, Kao and Phoa (2017) developed analytical results for identifying designs that allow precise estimates of the contrasts between HRFs, each describing the effect of a mental stimulus over time. In addition, a systematic method for constructing optimal and very efficient designs is provided. These fMRI designs obtained by relabeling symbols of m-sequences can be very efficient in comparing the HRFs.

Lin, Phoa and Kao (2017a) developed a new class of designs, namely circulant partial Hadamard matrix (CPHM), for similar purposes. The following model is considered for estimating the HRF:

y = Xh + Sγ + ε

The key part in the construction of CPHM is the use of general difference sets (GDS) for sequence generation. In specific, a (v,k;λ1,...,λv-1)-GDS is a set D={d1,} of distinct elements of Zv such that the difference l appears λl times in the multiset {di - dj (mod) v). If all λs are equal, it can be reduced to (v,k,λ)-GDS. Then the incidence matrix of D is the corresponding CPHM. Then theoretical results indicated the relationship between a GDS and its corresponding CPHM when n is either 0 or 2 (mod 4). Furthermore, a difference variance algorithm is proposed to efficiently search for the GDS of the required designs. However, CPHM is limited to the construction of a matrix class that its entries can only be binary values, and one needs an expanded method when an ER-fMRI experiment has Q choices of stimulus types.

Lin, Phoa and Kao (2017b) formally define a new class of experimental design, called circulant almost orthogonal array (CAOA), for ER-fMRI experiments with Q simulus types. A circulant k × n array A with entries from Zs={0, 1, ..., s-1} is said to be a circulant almost orthongal array (CAOA) with s levels, strength t, and bandwidth b, if each ordered t-tuple α based on Zs occurs λ(α) times as column vectors of any t × n submatrix of A such that |λ(α) - λ(β)| ≤ b for any two t-tuples α and β such array A is denoted by CAOA(n,k,s,t,b). For convenience, its first row is called the generator vector. It is obvious that a CPHM, or formally 0-H(K × n), is equivalent to an CAOA(n,K,2,2,0).

Regarding the generator vector, the efficient GDS method is not applicable in multi-level cases, so Lin, Phoa and Kao (2017b) introduced a difference frequency set (DFS) for ordered pair (Xi,Xj). Then a complete difference system (CDS) that summarizes the information from GDS and DFS helps to understand the whole difference structure. Note that the incident matrix of an (n,k,s,Λ)-CDS is an CAOA of strength two.

To estimate HRF, for two-level CAOA of strength two:
1. If n=0 (mod 4), a CAOA(n,K,2,2,0) is universally optimal.
2. If n=1 (mod 4), a CAOA(4u+1,K,2,2,1) is optimal for all type 1 criteria.
3. If n=3 (mod 4), a CAOA(4u+3,K,2,2,1) is Φp-optimal.
4. If n=2 (mod 4), there exists a class of CAOA that has optimal D-efficiency.

Selected Papers:

Cheng, C.S., Kao, M.H. and Phoa, F.K.H. (2017). Optimal and efficient designs for functional brain imaging experiments. Journal of Statistical Planning and Inference, 181, 71-80.

Lin, Y.L., Phoa, F.K.H. and Kao, M.H. (2017a). Circulant partial Hadamard matrices: construction via general difference sets and its application to fMRI experiments. Statistica Sinica, 27, 1715-1724.

Lin, Y.L., Phoa, F.K.H. and Kao, M.H. (2017b). Optimal design of fMRI experiments using circulant (almost-)orthogonal arrays. Annals of Statistics, 45, 2483-2510.

Lin, Y.L. and Phoa, F.K.H. (2019). Efficient experimental plans for second-order event-related functional magnetic resonance imaging. Statistica Sinica, accepted.

(Last Updated: 2020/12/31)


Bibliometrics is an analysis of research performance that governments and research institutes advocate and emphasize nowadays. Among all bibliometrics, citation plays an important role in the performance analysis. Conventional methods include impact factor, eigengactor, h-index, and many others. In the regime of big data, citation information are gathered from large-scale databases and represented in network forms. In the language of networks, many conventional methods consider only the direct citation and ignore the indirect citation, or in other words, only a subgraph of the whole citation network is actually considered when one conducts an analysis on a specific manuscript or entity of interest. Notice that for the past hundred years, scientific advancement was built on the continuous propagation of one's research results to others, and thus, conventional methods are limited to quantify this phenomenon. Furthermore, in addition to citation analysis, there are many other interesting topics in bibliometrics that are lack of quantitative analysis methods.

Main Results:

All analyses are performed via the Web of Science (WoS). The WoS is one of the finest bibliographic database in the world. It contains relevant attributes of published scientific articles, such as journals where the article published, their publication years, their authors, their reference lists, and many others. Technically, the accessible WoS database is stored in the neo4j graph database. In the current framework, the information of scientific articles during 1981-2016 are accessible.

Chang, Phoa and Nakano (2019) introduced a new measure, namely network influence, to quantify the influence of an article in the scientific community after it was published. When a researcher published a scientific article, others read it and follow his step. When these followers completed their own researches that were influenced or inspired by the original article, they cited it in the reference and this led to a citation. Chang, Phoa and Nakano (2019) recast this relationship from network perspective.

Let G=(V,E) be a citation network, where V is a set of m articles (nodes) and E is a set of citation (edges). Here are some psecial properties of the citation network. (1) All edges in G are directed with only one direction; and (2) there is no self-connection on any nodes in G. Let vt be an article of interest. Then the Article Network Influence (ANI) of vt, denoted as ANIt, is:

ANIt = Σr g(r)ot(r)

where r is the influence range from vt to all other articles in G and the summation is from 1 to a maximum influence range of interest, ot(r) is the observed number of articles with influence range r from vt, and g(r) is a weighted function to represent the decay effect of article citations as the influence range increases. Chang, Phoa and Nakano (2019) applied the network influence to analyze the citation network of statistics in the WoS during 1981-2016, resulting in the tables of top-20 most influential articles in statistics annually from 1981-1991 to 2006-2016. An extended work on how the articles in one particular subject are cited by other subjects are studied in Chang, Phoa and Nakano (2020).

Lai et al. (2020) built a model on the subject type relationship in the WoS. Up to date, there are 20327 journals, which are classified into 275 distinct subjects and 3120 subject types in the database. Comparing to the whole history of scientific discovery, the information stored in WoS is very partial. When one wants to know the relationship between two subject types, it can be biased due to incomplete information, or even worse, infeasible if one of the two subject types does not exist in the database. Thus, a model on this relationship provides an adequately good estimation on two subject types, each is a combination of any of 275 subjects in the WoS.

In order to measure the relationship between two subject types, Lai et al. (2020) proposed a modified version of Pointwise Mutual Information (PMI). Let X and Y be the cite and cited subject types respectively. In the langage of networks, we consider a direct edge X ⟶ Y. Then the PMI of X ⟶ Y is defined as:

PMI(X ⟶ Y) = log2( (C(X ⟶ Y)N / C(X)C(Y) )

where C(X) and C(Y)are the number of articles of X and Y, and C(X ⟶ Y) is the number of articles of X that cite articles of Y. Note that PMI(X ⟶ Y) is not necessarily equal to PMI(Y ⟶ X). Then a two-step method is employed to connect the components in two subject types to their PMI value. In the first step, a three-layer multi-layer perceptron (MLP) method, which is a common method in deep learning, is used to determine if two subject types are independent or not. If the MLP suggests that two subject types are not independent, then their components are used as factors in a regression analysis to result in an estimated value of PMI. The accuracy of this two-step method is high.

Selected Papers:

Chang, L.L.N., Phoa, F.K.H. and Nakano, J. (2019). A new metric for the analysis of the scientific article citation network. IEEE Access 7, 132027-132032.

Chang, L.L.N., Phoa, F.K.H. and Nakano, J. (2020). Citations of statistics articles in Web of Science. Proceedings of the Institute of Statistical Mathematics, accepted.

Lai, H.Y., Phoa, F.K.H., Chang, L.L.N. and Honda, K. (2020). A two-step approach to estimate and predict the subject type relationship in the Web of Science. Proceedings of the Institute of Statistical Mathematics, accepted.

(Last Updated: 2020/12/31)

Manuscripts Under Review and In Preparations (status updated at 2020/05)

Manuscripts Under Review and In Preparations (status updated at 2020/05)

Manuscripts Under Review and In Preparations (status updated at 2020/05)

Manuscripts Under Review and In Preparations (status updated at 2020/05)