IJCET - Data Mining in Medical Databases
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print)
ISSN 0976 – 6375 (Online)
Volume 4, Issue 6, November - December (2013), pp. 284-289
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com
APPLICATIONS OF DATA MINING IN MEDICAL
DATABASES
P.N.Santosh Kumar1,
Dr. C.Venugopal2,
Dr. C.Sunil Kumar3
Assistant Professor in ECM, SNIST, Hyderabad, A.P., India1
Professor in ECM, SNIST, Hyderabad, A.P., India2
Professor in ECM, SNIST, Hyderabad, A.P., India3
ABSTRACT
With the spread of information systems (ISs), an enormous quantity of data has been
collected in these systems to date. Because strategically vital information can be
concealed in this mass of data, these pieces of information may be very valuable. With the
aid of data mining (DM) and knowledge discovery (KD) techniques, the hidden information in
these huge amounts of data can be extracted. These techniques can be applied to several
areas, e.g. commerce, telecommunication, and healthcare. Hospital information
systems (HIS) are well known around the world [5]. These systems store a great deal of data
pertaining to the patients' physical parameters, laboratory values, treatment modality, and
case history. By applying DM techniques to medical and healthcare data,
unknown relationships among these parameters in the examined population can be
discovered. This procedure includes forming clusters that characterize the patients from the
point of view of clinical outcome, identifying risk factors, analyzing trends in the
changes of clinical parameters, etc. This work discusses the preparation steps that must be taken
before analyzing medical data. The data mining methods that are
practical for different purposes are also covered. Finally, the applicability of these tools
in a particular area of healthcare is discussed.
Keywords: DM, Healthcare, KDD, DW, DB
I. INTRODUCTION
From the late 1980s to the present, a principal research area in information
technology (IT) has been KD, including DM techniques and data warehouses (DWs). DM
itself can be viewed as a result of the natural evolution of ISs. After solving the problem of
creation and design of databases (DBs), an enormous amount of collected data has
accumulated in these DB systems to date. The same holds in the
area of medical sciences. Around the world there are many research projects
based on the application of DM techniques in various disciplines. The Indian
medical DB systems are large and varied enough for some
valuable hidden information to be extracted from them. Research based on DM in the healthcare
field pursues two kinds of goals. On the one hand, these projects can examine the applicability of
DM techniques in health care [1], how the common algorithms can be improved by building in expert
knowledge, and which typical difficulties and mistakes need attention
during this work [3]. On the other hand, applying these techniques to
healthcare DB systems may reveal concealed pieces of information that
can be used in medical practice, e.g. for improving treatment or analyzing risk factors.
IT is now broadly adopted in medical practice, especially for supporting digitized
equipment, administrative tasks, and data organization, but less has been achieved in the use of
computational methods to exploit medical data in research or practice. There is a growing
demand for the integration and exploitation of diverse medical information for improved
medical practice, medical research, and personalized healthcare. Some of the tasks suitable for the
application of DM are categorization, estimation, prediction, affinity grouping, clustering, and
description. Some of them are best approached in a top-down manner (hypothesis testing), while
others are best approached in a bottom-up manner called KD, which may be directed or undirected.
The goal of DM is to discover knowledge in data and present it in a form that is easily
understandable to people [6]. There are several DM methods, such as Cluster Detection (CD),
Decision Trees (DTs), Artificial Neural Networks (ANNs), Genetic Algorithms (GAs), and On-Line
Analytic Processing (OLAP). DTs may be used for categorization, clustering, prediction, or
estimation. There are two approaches in DM: hypothesis testing, where a DB recording
past behavior is used to verify or disprove predefined notions, ideas, and guesses concerning
relationships in the data, and KD, where no prior hypotheses are made and the data is allowed to
speak for itself. KD may be directed or undirected. Directed KD tries to explain or classify
some particular data field, while undirected KD aims at finding patterns or similarities
among groups of records without the use of a particular target field or group of predefined classes [2].
II. DATA MINING TECHNIQUES
Some of the frequently used techniques are the following.
A. Neural Networks
A neural network (NN) may be defined as a model of reasoning based on the human brain.
It is perhaps the most common DM method: a simple model of the neural interconnections in
brains, adapted for use on digital computers. It learns from a training set, generalizing the patterns
inside it for classification and prediction. Neural networks can also be applied to undirected DM and
time-series forecasting.
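
As a minimal illustration of learning from a training set, the sketch below trains a single artificial neuron (a perceptron) on a toy data set. It is not taken from the paper: the function names and the logical-AND training data are assumptions chosen only to keep the example small, and practical networks have many neurons and layers.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=50, lr=1.0):
    """Train one neuron on (inputs, label) pairs with 0/1 labels."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            # Perceptron rule: shift the weights in proportion to the error.
            error = label - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Toy training set (logical AND), invented for illustration only.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
```

After training, `predictions` reproduces the labels of the training set; the learned weights are the "generalized pattern" that would be applied to new inputs.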
B. Decision Trees
DTs are a way of representing a sequence of rules that lead to a class or value.
Therefore, they are used for directed DM, particularly categorization. One of the significant
advantages of DTs is that the model is easy to understand, since it takes the form of explicit
rules. This allows the evaluation of results and the identification of key attributes in the process. The
rules, which can be articulated easily as logic statements, in a language such as SQL, can be applied
directly to new records.
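
The following sketch shows how a small rule sequence of this kind might be applied to new records. The field name `t_score` and the cut-off values are illustrative assumptions and are not taken from this paper; a real tree would be induced from training data and reviewed by clinicians.

```python
def classify_bone_status(record):
    """Apply a hand-written rule sequence to one patient record (a dict)."""
    # Hypothetical cut-offs on a hypothetical densitometry field.
    if record["t_score"] <= -2.5:
        return "osteoporosis"
    if record["t_score"] < -1.0:
        return "osteopenia"
    return "healthy"


# The same rules written as a SQL expression (table and column names are assumed):
#   SELECT CASE WHEN t_score <= -2.5 THEN 'osteoporosis'
#               WHEN t_score <  -1.0 THEN 'osteopenia'
#               ELSE 'healthy' END AS bone_status
#   FROM patients;

new_records = [{"t_score": -3.0}, {"t_score": -1.8}, {"t_score": 0.2}]
labels = [classify_bone_status(r) for r in new_records]
```

Because each rule is an explicit comparison, a domain expert can read the sequence and judge whether the attributes and thresholds are plausible.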
C. Cluster Detection
CD consists of building models that find data records similar to each other. This is inherently
undirected DM, since the objective is to find previously unknown similarities in the data. Clustering
the data may be considered a very good way to start any analysis. Self-similar clusters can
provide the starting point for knowing what is in the data and for figuring out how best to make use
of it.
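
One common way to detect clusters is k-means. The paper does not name a specific algorithm, so the sketch below is only one possible choice, restricted to scalar values for brevity; the sample values and the initial centers are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalar values, starting from the given initial centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value into the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Invented measurements forming two visibly separate groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
```

No target field is used anywhere in the procedure, which is what makes it undirected: the groups emerge from the distances between records alone.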
D. Genetic Algorithms
GAs, which apply the mechanisms of genetics and natural selection to a search, are used for
finding the optimal set of parameters that describe a predictive function. Hence, they are
mainly used for directed DM. GAs use operators such as selection, crossover, and mutation
to evolve successive generations of solutions. As these generations evolve, only the most predictive
survive, until the functions converge on optimal results.
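
The three operators can be sketched on a toy problem. Everything below is an assumption made for illustration: the bit-string encoding, the population size, the mutation rate, and the use of elitism (copying the best individual unchanged into the next generation) are not specified in the paper.

```python
import random


def evolve(fitness, n_bits=10, pop_size=20, generations=30, seed=0):
    """Evolve bit strings toward higher fitness; returns (best, best-per-generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]        # selection: keep the fitter half
        children = [pop[0][:]]                # elitism: the best survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)  # crossover: splice two parents
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:            # mutation: occasionally flip one bit
                i = rng.randrange(n_bits)
                child[i] = 1 - child[i]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history


# Toy fitness: the number of 1-bits in the string.
best, history = evolve(sum)
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next.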
III. KNOWLEDGE DISCOVERY IN DATABASES
The term DM is often used as a synonym of knowledge discovery in databases
(KDD); however, DM is only one crucial step of the KDD process. This process includes the
following key steps: learning the application domain; creating the target data set (DS);
choosing the DM functionalities and the appropriate algorithms; pattern assessment; knowledge
presentation; and assessing the discovered knowledge. DM work in an unfamiliar domain always starts
with understanding the application domain and specifying the problem.
In the domain of healthcare, this means becoming familiar with the major medical terms; then
the available data must be preprocessed, which includes data selection and aggregation,
data cleaning, and data reduction and transformation.
After collecting and preprocessing the data according to the objectives of the application,
the functionality of the DM activity must be chosen and the best DM algorithms identified. DM
functionalities include creating concept or class descriptions, clustering, classification, evolution
analysis, and association analysis. The choice among these possibilities is mainly influenced
by the limitations of the DM system. After executing the DM algorithms, the discovered
patterns need to be presented visually to the experts for analysis. For this purpose, charts, tables,
diagrams, decision trees, rules, etc. are used. By evaluating these results the desired new
knowledge is obtained, which the end-users can utilize during their research work.
IV. SOURCE DATA
Medical data (MD) arises from diverse sources. There are two types of DBs
available in the medical domain. The first type of MD comes from medical experts, e.g.
medical diagnoses, drugs, and so on. Typically for this type of data the number of
records is small, but the number of attributes for each record is relatively large compared with
the number of records, and missing values are infrequent.
The other type of MD comes mainly from HIS. This data is automatically stored in
DBs without any specific purpose. Laboratory test data, for example, belongs to this group.
The source systems of MD are mostly HIS and flat files, but in some special cases DWs can
also provide this kind of information [4]. Regrettably, in many cases important data is stored
in paper format only. The data needed for DM analyses must be integrated before analysis.
Most often the target of the integrated data is a relational DB or a DW. When examining
MD, it is often seen that the base data is not in the appropriate form and/or is dirty,
so that some data transformation (DT) actions may need to be performed on it. Therefore,
before running the DM algorithm, the data must be selected, cleaned, integrated, and transformed
into an appropriate format.
V. DATA PREPROCESSING
Analyzing dirty, wrong data never provides useful information. Before starting any
DM job, the data must be preprocessed. This activity includes solving the problem of dirty
data, handling missing values, managing redundant data, dealing with unstructured
information, and other data preprocessing actions, such as the creation of new features and data
normalization or generalization techniques. In medical information systems (MISs) it often
occurs that some fields are null. The reason may be that the data was not available or was
not stored. To improve the discovery process, it is recommended to obtain and fill in the
missing information. There are also other possibilities, e.g. using a global
constant, the attribute mean, or the most probable value to fill in the missing values. Replacing the
missing value with a global constant (e.g. "unknown") is not a good option, because the
DM algorithms may treat this value as a new concept. In the medical domain it is also not
recommended to replace missing values with the attribute mean or the most probable value of the
field, because the filled-in parameter may be precisely the one that predicts an illness or an
adverse event under analysis.
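
One cautious alternative, sketched below, is to leave missing values unfilled and record their absence explicitly, so that no invented value enters the analysis. The field name `calcium`, the sample values, and both helper functions are hypothetical and serve only as an illustration.

```python
def summarize_missing(records, field):
    """Return (missing_count, mean_of_observed); None marks a missing value."""
    observed = [r[field] for r in records if r[field] is not None]
    missing = len(records) - len(observed)
    mean = sum(observed) / len(observed) if observed else None
    return missing, mean


def with_missing_flag(records, field):
    """Copy the records, adding an explicit '<field>_missing' flag to each."""
    return [{**r, f"{field}_missing": r[field] is None} for r in records]


# Invented laboratory records; the second one has no stored value.
records = [{"calcium": 2.4}, {"calcium": None}, {"calcium": 2.2}]
missing, mean = summarize_missing(records, "calcium")
flags = [r["calcium_missing"] for r in with_missing_flag(records, "calcium")]
```

The summary makes it possible to judge how much of a field is missing before deciding whether to recover the values from the source or to exclude the field.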
Noisy data can occur for a number of reasons: random errors during
recording, or differing units of measurement for laboratory values. Default values can in many cases
cause difficulties because, for example, when a '0' appears in the field of a laboratory parameter,
one cannot decide whether it means that the examination was not performed. Out-of-range data
can be corrected manually or deleted. Outliers (values far from most others in a set of data)
can be detected, e.g., by clustering. Outliers in medical databases (MDBs) may deserve the
attention of the analyst. Redundant data is mostly generated by the aggregation of several
different DBs. For example, physical parameters of patients are usually stored in more than one
database, which requires integration. When the corresponding data of the different DBs is compared,
inconsistencies may be found. The difference among the values may also derive from a change over
time. In this case new information can be obtained from time-series data. For this purpose
each piece of information can be placed in a new database or a DW together with a
timestamp (TS), and then evolution analysis can be executed to find hidden patterns.
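
Outlier detection can be sketched with a simple robust statistic. Note that this is a stand-in for the clustering-based detection mentioned above, not the same technique: it flags values whose modified z-score, based on the median absolute deviation, exceeds a threshold. The sample values and the threshold of 3.5 are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) are identical; no scale to compare against.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Invented laboratory values with one suspicious entry.
lab_values = [4.9, 5.1, 5.0, 5.2, 4.8, 50.0]
suspects = mad_outliers(lab_values)
```

A flagged value is only a candidate for review; in medical data it may be a recording error, a unit mismatch, or a genuinely abnormal result.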
The major difficulty of DM in the healthcare field is that an enormous amount of
data is stored as plain, unstructured text. The analysis of textual data requires
considerably different algorithms from those used for the analysis of continuous, binary, ordinal,
or nominal data. It is therefore recommended to transform this data into some structured form. This
can only be achieved with the help of medical experts, because of the difficulty of the terminology
used. As with other data preprocessing activities, DM applications working on medical data
also require some data conversion procedures. Where the exact value is not
of interest and only its general character matters, the data should be generalized. Typical
examples are blood pressure values or laboratory values.
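
Generalization of this kind amounts to mapping each numeric value onto a coarse label. In the sketch below the function name and the cut-off values are purely illustrative assumptions; real category boundaries would have to be supplied by medical experts.

```python
def generalize_systolic(mm_hg):
    """Map a systolic pressure reading (mmHg) to a coarse label."""
    if mm_hg is None:
        return None  # keep a missing reading missing rather than inventing a label
    # Hypothetical cut-offs, for illustration only.
    if mm_hg < 90:
        return "low"
    if mm_hg < 140:
        return "normal"
    return "high"


readings = [85, 120, 150, None]
labels = [generalize_systolic(r) for r in readings]
```

The generalized field has only a few distinct values, which suits algorithms that operate on nominal or ordinal attributes.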
VI. RESEARCH ON BONE DISEASE: CASE STUDY ON OSTEOPOROSIS
The researchers are carrying out work in the area of osteoporosis (bone disease). So far,
data has been gathered on 25,000 persons who were referred for assessment with suspected
osteoporosis. Naturally, a number of these patients later proved to be healthy. The
examined persons came from different regions of the country. From this large amount
of data, 1200 representative patients were selected for further research.
The available patient data is heterogeneous: it includes data on the personal
and family history of the patients, e.g. birth weight, bone fractures, medication, prior
illnesses, and illnesses of relatives. Possibly the most valuable data is the results of
densitometry examinations going back several years. In the future, the DNA of the patients is to
be examined in association with osteoporosis. All this data was previously stored in paper format.
So, after becoming familiar with the application field, our first task was to provide a
means of recording this data in a DB system. In parallel with this recording, the preprocessing
stage of the pattern discovery (PD) process has also started.
As for the methods, association analysis (AA) is performed to find the association between
osteoporosis and the probable risk factors, and the link between densitometry values and fractures.
With the categorization algorithms, the patients are grouped into three (3)
categories, namely osteoporosis, osteopenia, and healthy status. In this work, the clustering
techniques are based on phenotype, genotype, and assessment results. A number of
evolution analyses, including the examination of the change of bone density over time, are
planned, along with a search for other regularities and trends.
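
Association analysis is usually summarized by the support and confidence of a rule. The sketch below computes both for a rule of the form "risk factor implies finding"; the item names and the five records are invented for illustration and do not come from the study's data.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Each record is a set of items; antecedent and consequent are sets too.
    """
    n = len(records)
    if n == 0:
        return 0.0, 0.0
    with_antecedent = sum(1 for r in records if antecedent <= r)
    with_both = sum(1 for r in records if antecedent <= r and consequent <= r)
    support = with_both / n
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Invented patient records expressed as sets of observed items.
patients = [
    {"prior_fracture", "low_density"},
    {"prior_fracture", "low_density"},
    {"prior_fracture"},
    {"low_density"},
    set(),
]
support, confidence = rule_metrics(patients, {"prior_fracture"}, {"low_density"})
```

Support states how often the rule applies in the whole data set, while confidence states how often the finding is present when the risk factor is present.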
Osteoporosis is a bone disease that causes a decrease in bone density and quality, leading to
weakness of the skeleton and an increased risk of fracture, particularly of the spine, wrist, hip, pelvis, and
upper arm. Osteoporosis and associated fractures represent an important cause of mortality and
morbidity. Bone loss is gradual and shows no obvious symptoms or warning signs until the disease
has advanced to its late stage. Osteoporosis is a global problem because 1 in 4 women and at least 1 in
15 men will develop osteoporosis during their lifetime. For these reasons, osteoporosis is often
referred to as the "silent epidemic". The World Health Organization (WHO) has identified it as a
priority health issue. The costs to national healthcare systems from osteoporosis-related
hospitalization are staggering. In the UK, according to estimates made by the National Osteoporosis
Society,
• there are an estimated 3.5 million citizens in the UK suffering from osteoporosis
• osteoporosis is responsible for nearly 220,000 fractures per year
• osteoporosis costs the NHS and government over £1.5 billion each year.
Although there are some treatments, there is presently no cure for osteoporosis. It can, however,
be effectively prevented. Early detection of bone loss is key to preventing suffering and
containing healthcare costs. However, screening facilities and qualified scientific personnel
remain insufficient in most countries. The UK has only about two DXA bone mass densitometers
per million residents, and fewer than 10% of patients receive treatment. Research has been
conducted on osteoporosis since 1997 with the aim of examining and developing a method to identify
the associated risk factors and to predict the likelihood of developing osteoporosis. The research has
produced some very encouraging initial findings, which have been published in journals and at major
conferences in both the medical and computing fields.
VII. CONCLUSION
In recent decades, the amount of data stored in various ISs has increased significantly.
DM is one of the most popular techniques for analyzing this enormous amount of data. HIS and
other MDBs also store valuable data, raising a need for KD. MD is diverse and offers numerous
possibilities for analysis. These include mining class or concept descriptions of medical terms, searching
for association rules (ARs), classifying patients and predicting medical events for new patients,
searching for clusters from diverse points of view, and carrying out evolution analysis based on time-series data.
REFERENCES
1. Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, The
Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann
Publishers, August 2000. ISBN 1-55860-489-8.
2. H. Galhardas, D. Florescu, D. Shasha, E. Simon, C-A. Saita: Declarative Data
Cleaning: Language, Model, and Algorithms, Proc. of the 27th VLDB, pages 307-316,
Rome, Italy, 2001.
3. Prather JC, Lobach DF, Goodwin LK, Hales JW, Hage ML, Hammond WE: Medical
Data Mining: Knowledge Discovery in a Clinical Data Warehouse, Proc AMIA Annu
Fall Symp. 1997.
4. Tsumoto, S.: Knowledge Discovery in Clinical Databases, Proceedings of the 11th
International Symposium on Foundations of Intelligent Systems, 1999.
5. Tsumoto, S.: Clinical Knowledge Discovery in Hospital Information Systems: Two Case
Studies, PKDD2000, Springer Verlag, pp. 652-656, 2000.
6. M. Last, O. Maimon, A. Kandel: Knowledge Discovery in Mortality Records: An
Info-Fuzzy Approach, Medical Data Mining and Knowledge Discovery, Vol. 60, 2001.
7. P. Fazi, D. Luzi, F. L. Ricci, M. Vignetti: The Conceptual Basis of WITH, a Collaborative
Writer System of Clinical Trials, ISMDA 2002, pp. 86-97.
8. Asst. Prof. Jameelah H. Suad and Wurood A. Jbara, "Subjective Quality Assessment of New
Medical Image Database", International Journal of Computer Engineering & Technology
(IJCET), Volume 4, Issue 5, 2013, pp. 155 - 164, ISSN Print: 0976 – 6367, ISSN Online: 0976
– 6375.
9. R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past,
Present and Future", International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 1, 2012, pp. 1 - 9, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
10. R. Lakshman Naik, D. Ramesh and B. Manjula, "Instances Selection using Advance Data
Mining Techniques", International Journal of Computer Engineering & Technology (IJCET),
Volume 3, Issue 2, 2012, pp. 47 - 53, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
11. Rinal H. Doshi, Dr. Harshad B. Bhadka and Richa Mehta, "Development of Pattern
Knowledge Discovery Framework using Clustering Data Mining Algorithm", International
Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013,
pp. 101 - 112, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.