**Explanation of data mining methods**

A Bate, R Orre, M Lindquist, I R Edwards

**Uppsala Monitoring Centre, WHO Collaborating Centre for International Drug Monitoring, S-75320 Uppsala, Sweden**

A Bate
*programme leader, signal research methodology*

M Lindquist
*head of research and development*

I R Edwards
*director*

**Department of Mathematics, Stockholm University, Stockholm, Sweden**

R Orre
*research associate*

The Uppsala Monitoring Centre’s main purpose is to find new information on drug safety. From experience it has become clear that if important signals on drug safety are not to be missed, the first analysis of information should be free from prejudice and a priori thinking. Data mining using a neural network is an ideal approach for finding associations, and patterns of associations, in a large amount of data.[1] Bayesian logic is intuitively correct for a process in which additional information is accruing continuously and where probability has to be reconsidered often. Human intelligence and experience are able to operate better with such a transparent method in the generation of hypotheses. Bayesian statistics implemented in a neural network are used to data mine the WHO database of adverse drug reactions. Quantitative filtering of the data focuses clinical review on the potentially most important combinations of drug and adverse reaction.[1][2][3][4][5]

**How a neural network is used**

The network we use is called the bayesian confidence propagation neural network.[3] This is a feed forward neural network where learning and inference are done by the principles of Bayes’s law. For regular routine output we use it as a one layer model,[6] although it has been extended to a multilayer network.[7] Such a multilayer network can be used in further investigations of combinations of several variables in the WHO database and has already been successfully applied to areas like diagnosis,[8] expert systems,[9] and data analysis in pulp and paper manufacturing.[10]

**Why bayesian statistics are used**

The information component measures dependencies between variables in the database. Estimates of precision (standard deviation) are provided for each point estimate of the information component, so both the point estimate of unexpectedness and the level of certainty associated with it can be examined. An advantage of bayesian methods in data mining is that distributions can be constructed in such a way that they adapt quickly to the addition of new data from the first sample; in addition, the interpretation of the probability distributions is intuitive. Despite the presence of missing data, the information component and its standard deviation can be calculated for any combination of variable values.

**Why a neural network is used**

The network is transparent, in that it is easy to see what has been calculated. It is also robust because valid, relevant results can be generated despite missing data. This is an important advantage as most reports in the database contain some empty fields. The results are reproducible, making validation and checking simple. The network is easy to train; it takes only one pass across the data, which makes it time efficient. Only a small proportion of all possible drug and adverse reaction combinations actually have non-zero counts in the database, so use of a sparse matrix method makes searches through the database quick and efficient.
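The single pass and the sparse storage can be illustrated with a minimal Python sketch. It is not the centre's implementation: the record layout (a list of drugs and a list of reactions per report), the function name `count_reports`, and the sample data are all hypothetical. Only combinations that actually occur receive an entry, and a report with an empty field still contributes to the counts that can be derived from it.

```python
from collections import Counter
from itertools import product


def count_reports(reports):
    """Tally marginal and pair counts in a single pass over the reports.

    Each report is a (drugs, reactions) pair; either field may be None or
    empty to represent a missing value (hypothetical record layout).
    """
    total = 0
    drug_counts = Counter()
    reaction_counts = Counter()
    pair_counts = Counter()  # sparse: only observed combinations get a key
    for drugs, reactions in reports:
        total += 1
        drugs = set(drugs or ())
        reactions = set(reactions or ())
        drug_counts.update(drugs)
        reaction_counts.update(reactions)
        # A report with a missing field contributes no pairs, but it is
        # still counted in the total and in whichever marginal is present.
        pair_counts.update(product(drugs, reactions))
    return total, drug_counts, reaction_counts, pair_counts


# Illustrative data only.
reports = [
    (["drug_a"], ["rash"]),
    (["drug_a", "drug_b"], ["rash", "nausea"]),
    (["drug_b"], None),  # reaction field left empty
]
total, drug_counts, reaction_counts, pair_counts = count_reports(reports)
print(total, pair_counts[("drug_a", "rash")], len(pair_counts))  # prints: 3 2 4
```

Looking up a combination that never occurred returns zero without storing anything, so memory grows with the number of observed combinations rather than with the number of possible ones.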

**Value of bayesian statistics in a neural network**

The neural network provides an efficient computational model for the analysis of large amounts of data and combinations of variables, whether real, discrete, or binary. The efficiency is enhanced by the information component being the weight in the neural network. The neural network architecture allows the same framework to be used for analysis of data and data mining as well as for prediction, which is used to recognise patterns and for classification. Bayesian statistical principles fit intuitively into the framework of a neural network approach as both build on the concept of adapting on the basis of new data.
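A toy one layer network can show what it means for the information component to be the weight. The sketch below is an illustration under stated assumptions, not the bayesian confidence propagation neural network itself: the function names, the smoothing of the probability estimates (one pseudo-count for the joint estimate, centred on the product of the marginal estimates), and the example data are all hypothetical. Each weight is the base 2 logarithm of the joint probability over the product of the marginals, each output unit has a bias equal to the logarithm of its own probability, and inputs that are absent simply contribute nothing.

```python
import math
from collections import Counter


def train(examples):
    """One pass over (active_inputs, label) pairs, keeping only counts."""
    n = 0
    c_in, c_out, c_pair = Counter(), Counter(), Counter()
    for inputs, label in examples:
        n += 1
        c_out[label] += 1
        for i in set(inputs):
            c_in[i] += 1
            c_pair[i, label] += 1
    return n, c_in, c_out, c_pair


def weight(i, j, model):
    """Weight between input i and output j: log2 of p_ij / (p_i * p_j)."""
    n, c_in, c_out, c_pair = model
    p_i = (c_in[i] + 1) / (n + 2)
    p_j = (c_out[j] + 1) / (n + 2)
    # Assumed smoothing: with no data the joint estimate equals p_i * p_j,
    # so an unseen combination starts with a weight of zero.
    p_ij = (c_pair[i, j] + 1) / (n + 1 / (p_i * p_j))
    return math.log2(p_ij / (p_i * p_j))


def predict(inputs, model):
    """Normalised output activities for a set of active inputs."""
    n, c_in, c_out, c_pair = model
    support = {}
    for j in c_out:
        bias = math.log2((c_out[j] + 1) / (n + 2))
        support[j] = bias + sum(weight(i, j, model) for i in set(inputs))
    norm = sum(2 ** s for s in support.values())
    return {j: 2 ** s / norm for j, s in support.items()}


# Illustrative data only.
examples = [
    ({"fever", "cough"}, "flu"),
    ({"fever"}, "flu"),
    ({"rash"}, "allergy"),
    ({"rash", "itch"}, "allergy"),
]
model = train(examples)
posterior = predict({"fever"}, model)
print(posterior)
```

Because the counts are the only stored state, the same structure serves both for inspecting individual weights (data mining) and for classification (prediction).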

The method has also been extended to detect dependencies between several variables and is robust in handling missing data. Pattern recognition by the network does not depend on any a priori hypothesis, as an unsupervised learning approach is used. This is useful in detecting new syndromes, finding age profiles of patients with adverse reactions to a drug, and determining groups at high risk and dose relations. The network can thus be used to find complex dependencies that have not necessarily been considered before. Naturally, changes in patterns may also be important.[1]

**How bayesian statistics have been implemented**

In this bayesian analysis the probability distributions ($p_x$, $p_y$) for each marginal event are considered the random variables of interest. The use of a conjugate prior model makes the probability distributions of the events beta distributed and the joint distribution Dirichlet distributed. Priors used for the marginals are uniform, so that with the observed counts each marginal is beta($\alpha$, $\beta$) with hyperparameters $\alpha = c_x + 1$, $\beta = C - c_x + 1$ for variable x (and correspondingly with $c_y$ for variable y), where $C$ = total number of counts, $c_x$ = total number of reports of variable x, and $c_y$ = total number of reports of variable y. For the joint distribution an estimate of the marginal product is used as prior; that is, a Dirichlet distribution where each term is a beta with hyperparameters $\alpha = c_{xy} + 1/(p_x p_y)$, $\beta = C - c_{xy} + 1/(p_x p_y)$, where $c_{xy}$ = total number of reports of variables x and y. The logarithm to base 2 of the quotient between the joint distribution and the product of the marginal distributions gives a density measure of the dependency relation between the marginals. This dependency relation is referred to as the information component and is related to mutual information as used in information theory. The expectation and variance (and thus the standard deviation) are then calculated by integration.
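Written out, the relation to mutual information is as follows. The symbol $IC_{xy}$ and the random variables $X$ and $Y$ are notation introduced here for illustration; the second expression is the standard definition of mutual information in information theory, which is the average of the information component over the joint distribution.

$$
IC_{xy} = \log_2 \frac{p_{xy}}{p_x\,p_y},
\qquad
I(X;Y) = \sum_{x,y} p_{xy} \log_2 \frac{p_{xy}}{p_x\,p_y} = \sum_{x,y} p_{xy}\, IC_{xy}
$$

An information component of zero corresponds to independence, a positive value to a combination reported more often than expected, and a negative value to one reported less often than expected.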

This approach, when routinely applied to drug and adverse reaction combinations where variable x is the drug and variable y is the adverse reaction, can be seen as the calculation of the logarithm of the ratio of observed rate of adverse drug reactions to expected rate of adverse drug reactions under the null hypothesis of no association between drug and adverse reaction. The calculation is, however, done in a bayesian statistical framework.
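As a rough numerical illustration of this log ratio of observed to expected reporting, the Python sketch below computes a point estimate and an approximate standard deviation from four counts. It rests on assumptions that are not taken from the text: the function name is hypothetical, the joint prior is simplified to a single pseudo-count centred on the product of the marginal estimates, and the standard deviation uses a first order approximation that treats the three beta distributions as independent, rather than the integration described above.

```python
import math


def information_component(c_xy, c_x, c_y, total):
    """Approximate information component and its standard deviation.

    c_xy: reports with both drug x and reaction y
    c_x, c_y: reports with drug x, with reaction y
    total: all reports
    """
    # Marginals: uniform prior plus counts gives beta(c + 1, total - c + 1).
    a_x, b_x = c_x + 1, total - c_x + 1
    a_y, b_y = c_y + 1, total - c_y + 1
    mean_x = a_x / (a_x + b_x)
    mean_y = a_y / (a_y + b_y)

    # Joint: assumed prior with mean equal to the product of the marginal
    # means, so that with no data the estimate is exactly zero.
    gamma = 1.0 / (mean_x * mean_y)
    a_xy, b_xy = c_xy + 1, total - c_xy + gamma - 1
    mean_xy = a_xy / (a_xy + b_xy)

    ic = math.log2(mean_xy / (mean_x * mean_y))

    def relative_variance(a, b):
        # Var(p) / E(p)^2 for a beta(a, b) distribution.
        return b / (a * (a + b + 1))

    variance = (
        relative_variance(a_xy, b_xy)
        + relative_variance(a_x, b_x)
        + relative_variance(a_y, b_y)
    ) / math.log(2) ** 2
    return ic, math.sqrt(variance)


# Illustrative counts only.
ic, sd = information_component(c_xy=25, c_x=50, c_y=100, total=10_000)
print(f"IC = {ic:.2f}, sd = {sd:.2f}")
```

With these illustrative counts the combination is reported far more often than the 0.5 reports expected under independence, so the estimate is positive; with the same proportions and ten times as many reports the standard deviation shrinks, reflecting greater certainty.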

1. Hand DJ. Statistics and data mining: intersecting disciplines. *SIGKDD Explorations* 1999;1:16-21.
2. Orre R, Lansner A, Bate A, Lindquist M. Bayesian neural networks with confidence estimations applied to data mining. *Computational Statistics and Data Analysis* 2000;34:473-93.
3. Bate A, Lindquist M, Edwards IR, Olsson S, Orre R, Lansner A, et al. A Bayesian neural network method for adverse drug reaction signal generation. *Eur J Clin Pharmacol* 1998;54:315-21.
4. Lindquist M, Edwards IR, Bate A, Fucik H, Nunes AM, Ståhl M. From association to alert: a revised approach to international signal analysis. *Pharmacoepidemiol Drug Safety* 1999;8:S15-25.
5. Lindquist M, Ståhl M, Bate A, Edwards IR, Meyboom RHB. A retrospective evaluation of a data mining approach to aid finding new adverse drug reaction signals in the WHO international database. *Drug Safety* 2000;23:533-42.
6. Lansner A, Ekeberg O. A one layer feedback artificial neural network with a bayesian learning rule. *Int J Neural Syst* 1989;1:77-87.
7. Holst A. The use of a Bayesian neural network model for classification tasks [thesis]. Stockholm: Royal Institute of Technology, 1997.
8. Holst A, Lansner A. A higher order neural network for classification and diagnosis. In: Gammerman A, ed. *Computational learning and probabilistic reasoning*. Chichester: Wiley, 1996:199-209.
9. Holst A, Lansner A. A flexible and fault tolerant query-reply system based on a Bayesian neural network. *Int J Neural Syst* 1993;4:257-67.
10. Orre R, Lansner A. Pulp quality modelling using Bayesian mixture density neural networks. *J Syst Eng* 1996;6:128-36.