SlideShare une entreprise Scribd logo
1  sur  4
Télécharger pour lire hors ligne
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. V (Jan – Feb. 2015), PP 24-27
www.iosrjournals.org
DOI: 10.9790/0661-17152427 www.iosrjournals.org 24 | Page
Research on Clustering Method of Related Cases Based
On Chinese Text
Gang Zeng1
(Police Information Department, Liaoning Police Academy, Dalian, China)
Abstract: With the movement of population, crimes are showing mobility and professionalism characteristics,
the cases which the same suspects committed have the same or similar characteristics, it is helpful to find the
related cases. In this article, firstly, analyzes the factors associated with the cases, discusses how to form vector
space of Chinese cases, then proposes a method to associate cases by joint use of FKM and Canopy clustering
algorithm.
Keywords: related cases, Chinese segmentation, Fuzzy K-Means, Canopy.
I. Introduction
With the development of transportation and movement of population,Crimes are showing
mobility,professionalism and other characteristics, the suspects often gather his associates and wander around
to commit the crime. These cases have same characteristics, because the same person or gang do it. if the cases
are clustered, Investigation will make an unexpected progress.
When the case investigation is in trouble, Investigators often merge cases with the same characteristics
and investigate them, to find out the relationship between the cases, to find more clues and speed up the
investigation of cases. Cases often be searched in the cases database, to find out related cases with the same
characteristics, we call it related cases, after cases clustered, there will be a few hundred records in a class, or
even more. Investigators will spend much time and effort. to find out related cases in these cases, In this way it
is trivial and inefficient.
Ning Han presented a clustering method for related cases using FKM algorithm(Fuzzy K-means
algorithm, also known as the fuzzy C-means algorithm ), and analyze the information of the cases. and do the
experiment based on 681 records of cases, this analysis method is feasible in the low-dimensional stand-alone
environment. But his method does not solve the problem of Chinese word segmentation, Under condition of
high-dimensional and big data, the clustering results could not be gotten within an acceptable time using his
method, We propose a method, in the Hadoop environment, we can analyze related cases using Canopy and
FKM clustering algorithm, this method can overcome the lack of FKM clustering algorithm, and improve the
efficiency and quality of analysis.
II. Conditions Of Associated Cases
Conditions of related case includes many factors, such as time and location of the crime, the way into
the scene of crime, tool for criminal purpose, criminal activity sequence, modus operandi, the way and means of
escape, the way of transportation and sales of stolen goods, and so on. Because a variety of factors, such as the
suspect's life experiences, history of crime, level of education, professional features, life skills, mental set, etc.
In the course of committing the crime in different places, means and methods of the suspect committing the
crime have formed a stable characteristic, we call it crime stereotypes of criminal behavior, in the different cases
that the same suspect has made, the factors that mentioned above may are manifested more or less, this shows
that the stability of the crime. To find more investigation clues of cases, we must analyze the relationship of all
cases recorded in database, to find out the information we need.
1. The Same Or Similar Nature Of The Cases
All kinds of cases , such as criminal cases, security cases, economic cases, etc. the suspects are affected
by the above factors, their criminal behavior presents the characteristics of continuity. The suspect who unlock a
lock by tools will steal things in a long time, the suspect who murder will still kill someone.
2. Regularity Of The Time Of Occurrence Of The Case
A person or a group of people will commit crimes , at the same time or similar time. The time when the
suspect commit crime often shows a certain regularity, for example, a type of cases often occur at a certain time
in a day, a certain day in a week, a specific time period in a year. for example , Car theft cases are far more
likely to occur from midnight to 6 a.m. Robbery, theft and other usurpation of cases occurred on the eve of
Chinese New Year.
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 25 | Page
3. Target Suspects Violated Is Same Or Similar
Target suspects violated is closely related to the suspect's motive, Because of different psychological
needs, the choice of target is in a somewhat tendentious. for example, young single woman often become the
object of rape or robbery, Precious industrial raw materials or digital products become the object of theft.
4. Suspect’S Body Features Is Same Or Similar
if suspect’s body features is same or similar in different cases, we can consider the cases is related.
body features including height, weight, gender, countenance, clothing, accent, and so on.
5. Crime Means, Methods, Processes Are Same Or Similar
Because of the suspect’s history of crime, life skills, and other factors, In terms of tools of crime, the
characteristics of the crime, the crime procedures, etc. a certain pattern has been formed, And with its own
specificity and stability. for example, to destroy grille, some suspect prefer to use the jack, some prefer to use a
pipe wrench, some suspect prefer to use stick. to avoid video surveillance, some suspect choose cap, some
suspect choose mask, some suspect choose headgear.
6. Vestige Are Same Or Similar
Vestige is defined as a vestige of something is a very small part that still remains of something that was
once much larger or more important. we can use it as evidence in court, vestige includes footprints, fingerprint,
handprint, shot mark, semen, bloodstain, hair, toxicant, exploder, handwriting, and so on. After comparison,
overlapping, anastomosis, checkout, If identified as same or similar things in different cases, We can believe
that these cases are related.
Now, police manage cases by RDMS, each field describe a characteristic of the case, for example, the
‘time’ field describe the time when the case occurred, the ‘body features’ field describe the suspect’s features of
his(her) body, including common human characteristics, such as left ring finger broken, yellow hair, myopia and
with glasses. The method for the management of the cases made important contributions.
But disadvantages are also obvious, firstly, It is complex to enter data into database, each case must be
recorded in the management system, a full-time police is responsible for data entry, because it is very complex
to describe the case. this is waste of police. secondly, the quality of the data is uneven. Because of different
understanding of the case, different expression level, different degree of familiarity with the management
system. much of the data is wrong, useless. thirdly, it is excessive precision, When associated with the cases, we
can seldom find related cases, It did not play the desired effect to associate with cases.
For the above reasons, we present that interrogation record of the case is adopted as the basis for
association. we can relate cases based on Chinese text, playing a specialty of big data, we can find related cases
nationwide.
III. Cluster Analysis Of Cases
Vector space model (VSM) is a common method to describe cases, the case set is described as a vector
space. Then we analyze the related cases by studying the similarity of cases. Clustering analysis is an important
aspect of the study of the data mining. It is process of division physical and abstracted collection into several
classes composed of similar objects. Clustering analysis is the learning process of a free guide. It can divide
criminal acts into a number of classes or cluster. In same cluster crime acts enjoy high similarity. On the
contrary, in different clusters they have large difference. in this article we will use fuzzy K-means clustering
algorithm to associate with cases.
1. The Choice Of Distance Measure
It is very important, how to calculate the distance between the vectors, and it is directly related to the
accuracy of clustering. how to calculate is the best way? there are many factors which infects the accuracy, but
the most important factor of all factors is the choice of distance measure.
(1) Euclidean Distance Measure
The Euclidean distance is the simplest of all distance measures. It’s the most intuitive and matches our
normal idea of distance. Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... ,
bn) is
d = 𝑎1 − 𝑏1
2 + 𝑎2 − 𝑏2
2 + ⋯+ 𝑎𝑖 − 𝑏𝑖
2
(2). Manhattan Distance Measure
The distance between any two points is the sum of the absolute differences of their coordinates. This
distance measure takes its name from the grid-like layout of streets in Manhattan. we can’t walk from 3th
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 26 | Page
Avenue and 3th Street to 5thAvenue and 5th Street by walking straight through buildings. The real distance
walked is two blocks up and two blocks over. the Manhattan distance between two n-dimensional vectors (a1,
a2, ... , an) and (b1,b2, ... , bn) is
d = 𝑎1 − 𝑏1 + 𝑎2 − 𝑏2 + ⋯ 𝑎 𝑛 − 𝑏 𝑛
(3). Weighted Distance Measure
Weighted distance measure is based on Euclidean and Manhattan distance measures. A weighted
distance measure allows you to give weights to different dimensions in order to either increase or decrease the
effect of a dimension on the value of the distance measure. This will affect specific distance measures
differently but will in general make the distance value more sensitive to differences in a dimension.
2. Representing Chinese Text Documents As Vectors
The vector space model (VSM) is the common way of vectorizing text documents. All the words in the
world form a set of words, each word being assigned a number, the number is the dimension, it’ll occupy in
document vectors. the dimension of these document vectors can be very large. the maximum number of
dimensions possible is the cardinality of the vector. Because the count of all possible words or tokens is
unimaginably large, text vectors are usually assumed to have infinite dimensions.
Chinese text processing involves the Chinese segmentation. Chinese is different from English, English
words in the document are divided into independent word by a space or punctuation, Chinese Document can cut
apart into sentences by Chinese punctuation. but there is no space in sentence. Chinese segmentation is the most
important factor in representing Chinese text documents as vectors. The method commonly used in Chinese
segmentation is "two-way maximum matching", First, maximally splitting in the positive direction, then
maximally splitting in the opposite direction, the combination of the two parts is the final result, during the
procedure, an atomic word base is needed.
The more dimensions in vector space, the more complex the calculations become, It is proved by
theoretical calculations and practical tests. Representing the case with high-dimensional vector require
enormous computing resources, so two questions must be solved to calculate the related cases: firstly, reduce the
dimension of vector space, secondly, provide sufficient computing resources.
Atomic word is usually expressed as a natural word, the natural words are usually not standard, the
same meaning is expressed in different words, atomic word must be regulatory. the method to establish a
thesaurus and stop words dictionary can reduce the dimension of vector space effectively.
Sufficient computing resources in a stand-alone environment is impossible, with the emergence of
Hadoop, MapReduce model can break the huge computing task into small tasks, the small tasks can be executed
at different nodes, Hadoop provides us with sufficient computing resources. so we can calculate the relevance of
the cases in the Hadoop environment.
3. Term Frequency-Inverse Frequency
Term frequency–inverse document frequency (TF-IDF) weighting is a widely used improvement on
simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as
the value in the vector, this value is multiplied by the inverse of the term’s document frequency. That is, its
value is reduced more for words used frequently across all the documents in the dataset than for infrequently
used words.
the TF-IDF weight Wi for a word wi is:
𝑊𝑖 = 𝑇𝐹𝑖 ∙ 𝑙𝑜𝑔
𝑁
𝐷𝐹𝑖
Where: TFi represents the term frequency (TFi) of word wi, N represents the document count. the
document vector will have this value at the dimension for wordi. This is the classic TF-IDF weighting. stop
words get a small weight, and terms that occur infrequently get a large weight. The important words, or the topic
words, usually have a high TF and a somewhat large IDF, so the product of the two becomes a larger value,
thereby giving more importance to these words in the vector produced.
IV. Application Of Clustering Algorithm
1. Fuzzy K-Means Clustering Algorithm
K-means clustering algorithm tries to find the hard clusters (where each point belongs to one cluster),
But fuzzy k-means discovers the soft clusters. In a soft cluster, any point can belong to more than one cluster
with a certain affinity value towards each. So the theory of the fuzzy clustering is more suitable for the nature of
things, and it can reflect objectively the reality. Currently, the fuzzy K-means clustering (FKM) algorithm is the
most widely used.
Research on Clustering Method of Related Cases Based on Chinese Text
DOI: 10.9790/0661-17152427 www.iosrjournals.org 27 | Page
FKM partitions set of n objects X = (𝑥1, 𝑥2, ⋯ , 𝑥 𝑛 ) into K fuzzy clusters with C = 𝑐1, 𝑐2, ⋯ , 𝑐 𝑘
cluster centers. In fuzzy matrix U = (𝑢𝑖𝑗 , ), uij is the membership degree of the ith object with the cluster, The
characters of uij are as follows :
𝑢𝑖𝑗 ∈ 0,1 𝑖 ∈ 1,2, ⋯ , 𝑛; 𝑗𝜖1,2,⋯ , 𝑘
𝑢 𝑖, 𝑗 = 1, 𝑗 = 1, ⋯, 𝑛
𝑘
𝑖=1
Update uij according to (1):
𝑢𝑖𝑗 =
1
𝑑 𝑖𝑗
𝑑 𝑘𝑗
2
𝑚 −1
𝑘
𝑘=1
1
Where: m>1is fuzziness exponent, cj is the clustering center. 𝑑𝑖𝑗 = 𝑥𝑖 − 𝑐𝑗 is the distance between xi
and cj. Update cj according to (2)
𝑐𝑗 =
𝑢𝑖𝑗
𝑚
𝑥𝑖
𝑛
𝑖=1
𝑢𝑖𝑗
𝑚𝑛
𝑖=1
(2)
The objective function is the equation (3):
J U, c1, ⋯, c 𝑘 = 𝑢𝑖𝑗
𝑚
𝑛
𝑗 =1
𝑘
𝑖=1
𝑑𝑖𝑗
2
(3)
The nature of FKM algorithm is to apply the gradient descent method to find out optimal solution, so
there is a local optimization problem. And the algorithm convergence speed is greatly influenced by the initial
value, especially in the case of large number of clusters. For many real-world clustering problems, the number
of clusters isn’t known beforehand, A class of techniques known as approximate clustering algorithms can
estimate the number of clusters as well as the approximate location of the centroids from a given data set. One
such algorithm we can use is called canopy generation.
2. Canopy Clustering Algorithm
Canopy’s advantage lies in its ability to create clusters extremely quickly—it can do this with a single
pass over the data. But this algorithm may not give accurate and precise clusters. But it can give the optimal
number of clusters without even specifying the number of clusters, k, as required by fuzzy k-means.
V. Conclusion
This article proposed a method to associate cases by joint use of FKM and Canopy clustering
algorithm, it overcome the difficult problem of clustering Chinese cases, but this method is relatively complex,
the results are not very accurate, We hope to improve this method in the future.
Acknowledgment
This work was partly supported by general scientific research project of education department of
Liaoning Province of China (No. L2013490).
References
[1]. Ning Han, Wei Chen. Research on clustering analysis on related Cases. Journal of Chinese People’s Public Security
University(Science and Technology), 2012(1):53-58.
[2]. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. Mahout in Action[M]. Manning Publication, Westampton.
[3]. Tony H. Grubesic. On The Application of Fuzzy Clustering for Crime Hot Spot Detection[J]. Journal of Quantitative Criminology,
2006,22(1):77-105.
[4]. [Deguang Wang, Baochang Han, Ming Huang. Application of Fuzzy C-Means Clustering Algorithm Based on Particle Swarm
Optimization in Computer Forensics. 2012 International Conference on Applied Physics and Industrial Engineering, 1186-1191.
[5]. Gang Liu. Programing Hadoop[M]. BeiJing:China Machine Press, 2014.
[6]. Hesam Izakian, Ajith Abraham, Václav Snášel. "Fuzzy Clustering Using Hybrid Fuzzy c-means and Fuzzy Particle Swarm
Optimization" 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), pp.1690-1694,2009.

Contenu connexe

En vedette

En vedette (20)

A012630105
A012630105A012630105
A012630105
 
O1302027889
O1302027889O1302027889
O1302027889
 
K013128090
K013128090K013128090
K013128090
 
B013141316
B013141316B013141316
B013141316
 
Large Span Lattice Frame Industrial Roof Structure
Large Span Lattice Frame Industrial Roof StructureLarge Span Lattice Frame Industrial Roof Structure
Large Span Lattice Frame Industrial Roof Structure
 
H010424954
H010424954H010424954
H010424954
 
O1304039196
O1304039196O1304039196
O1304039196
 
A010420117
A010420117A010420117
A010420117
 
Crack Detection for Various Loading Conditions in Beam Using Hilbert – Huang ...
Crack Detection for Various Loading Conditions in Beam Using Hilbert – Huang ...Crack Detection for Various Loading Conditions in Beam Using Hilbert – Huang ...
Crack Detection for Various Loading Conditions in Beam Using Hilbert – Huang ...
 
C010321630
C010321630C010321630
C010321630
 
Analysis of Stress Distribution in a Curved Structure Using Photoelastic and ...
Analysis of Stress Distribution in a Curved Structure Using Photoelastic and ...Analysis of Stress Distribution in a Curved Structure Using Photoelastic and ...
Analysis of Stress Distribution in a Curved Structure Using Photoelastic and ...
 
I017616468
I017616468I017616468
I017616468
 
H017215057
H017215057H017215057
H017215057
 
G012454144
G012454144G012454144
G012454144
 
I1101047680
I1101047680I1101047680
I1101047680
 
E1302022628
E1302022628E1302022628
E1302022628
 
F010133036
F010133036F010133036
F010133036
 
E1303041923
E1303041923E1303041923
E1303041923
 
P01821104110
P01821104110P01821104110
P01821104110
 
M017218088
M017218088M017218088
M017218088
 

Similaire à Research on Clustering Method of Related Cases Based On Chinese Text

Crime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means ClusteringCrime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means Clustering
Reuben George
 
CRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptxCRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptx
oluobes
 
georgia-tech-atlanta
georgia-tech-atlantageorgia-tech-atlanta
georgia-tech-atlanta
Peter Kim
 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
healdkathaleen
 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
todd271
 
Merseyside Crime Analysis
Merseyside Crime AnalysisMerseyside Crime Analysis
Merseyside Crime Analysis
Parang Saraf
 

Similaire à Research on Clustering Method of Related Cases Based On Chinese Text (20)

Crime
CrimeCrime
Crime
 
Text mining on criminal documents
Text mining on criminal documentsText mining on criminal documents
Text mining on criminal documents
 
Inspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal ApplicationsInspection of Certain RNN-ELM Algorithms for Societal Applications
Inspection of Certain RNN-ELM Algorithms for Societal Applications
 
Investigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysisInvestigating crimes using text mining and network analysis
Investigating crimes using text mining and network analysis
 
Crime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means ClusteringCrime Pattern Detection using K-Means Clustering
Crime Pattern Detection using K-Means Clustering
 
Legal memorandum #1 to student fall 2018-ba 18 (81151)
Legal memorandum #1 to student   fall 2018-ba 18 (81151)Legal memorandum #1 to student   fall 2018-ba 18 (81151)
Legal memorandum #1 to student fall 2018-ba 18 (81151)
 
CRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptxCRIME-INVEST-PPP.pptx
CRIME-INVEST-PPP.pptx
 
Database and Analytics Programming - Project report
Database and Analytics Programming - Project reportDatabase and Analytics Programming - Project report
Database and Analytics Programming - Project report
 
Propose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysisPropose Data Mining AR-GA Model to Advance Crime analysis
Propose Data Mining AR-GA Model to Advance Crime analysis
 
report
reportreport
report
 
Advancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented PolicingAdvancing The Concept Of Problem In Problem-Oriented Policing
Advancing The Concept Of Problem In Problem-Oriented Policing
 
IRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data MiningIRJET- Detecting Criminal Method using Data Mining
IRJET- Detecting Criminal Method using Data Mining
 
georgia-tech-atlanta
georgia-tech-atlantageorgia-tech-atlanta
georgia-tech-atlanta
 
U24149153
U24149153U24149153
U24149153
 
A predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analyticsA predictive model for mapping crime using big data analytics
A predictive model for mapping crime using big data analytics
 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
 
Running head CRIME ANALYSIS TECHNOLOGY .docx
Running head CRIME ANALYSIS TECHNOLOGY                           .docxRunning head CRIME ANALYSIS TECHNOLOGY                           .docx
Running head CRIME ANALYSIS TECHNOLOGY .docx
 
Merseyside Crime Analysis
Merseyside Crime AnalysisMerseyside Crime Analysis
Merseyside Crime Analysis
 
ACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdfACCESS.2020.3028420.pdf
ACCESS.2020.3028420.pdf
 
Crime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los AngelesCrime Data Analysis and Prediction for city of Los Angeles
Crime Data Analysis and Prediction for city of Los Angeles
 

Plus de IOSR Journals

Plus de IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Dernier

+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
Health
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 

Dernier (20)

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 

Research on Clustering Method of Related Cases Based On Chinese Text

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. V (Jan – Feb. 2015), PP 24-27 www.iosrjournals.org DOI: 10.9790/0661-17152427 www.iosrjournals.org 24 | Page Research on Clustering Method of Related Cases Based On Chinese Text Gang Zeng1 (Police Information Department, Liaoning Police Academy, Dalian, China) Abstract: With the movement of population, crimes are showing mobility and professionalism characteristics, the cases which the same suspects committed have the same or similar characteristics, it is helpful to find the related cases. In this article, firstly, analyzes the factors associated with the cases, discusses how to form vector space of Chinese cases, then proposes a method to associate cases by joint use of FKM and Canopy clustering algorithm. Keywords: related cases, Chinese segmentation, Fuzzy K-Means, Canopy. I. Introduction With the development of transportation and movement of population,Crimes are showing mobility,professionalism and other characteristics, the suspects often gather his associates and wander around to commit the crime. These cases have same characteristics, because the same person or gang do it. if the cases are clustered, Investigation will make an unexpected progress. When the case investigation is in trouble, Investigators often merge cases with the same characteristics and investigate them, to find out the relationship between the cases, to find more clues and speed up the investigation of cases. Cases often be searched in the cases database, to find out related cases with the same characteristics, we call it related cases, after cases clustered, there will be a few hundred records in a class, or even more. Investigators will spend much time and effort. to find out related cases in these cases, In this way it is trivial and inefficient. Ning Han presented a clustering method for related cases using FKM algorithm(Fuzzy K-means algorithm, also known as the fuzzy C-means algorithm ), and analyze the information of the cases. and do the experiment based on 681 records of cases, this analysis method is feasible in the low-dimensional stand-alone environment. But his method does not solve the problem of Chinese word segmentation, Under condition of high-dimensional and big data, the clustering results could not be gotten within an acceptable time using his method, We propose a method, in the Hadoop environment, we can analyze related cases using Canopy and FKM clustering algorithm, this method can overcome the lack of FKM clustering algorithm, and improve the efficiency and quality of analysis. II. Conditions Of Associated Cases Conditions of related case includes many factors, such as time and location of the crime, the way into the scene of crime, tool for criminal purpose, criminal activity sequence, modus operandi, the way and means of escape, the way of transportation and sales of stolen goods, and so on. Because a variety of factors, such as the suspect's life experiences, history of crime, level of education, professional features, life skills, mental set, etc. In the course of committing the crime in different places, means and methods of the suspect committing the crime have formed a stable characteristic, we call it crime stereotypes of criminal behavior, in the different cases that the same suspect has made, the factors that mentioned above may are manifested more or less, this shows that the stability of the crime. To find more investigation clues of cases, we must analyze the relationship of all cases recorded in database, to find out the information we need. 1. The Same Or Similar Nature Of The Cases All kinds of cases , such as criminal cases, security cases, economic cases, etc. the suspects are affected by the above factors, their criminal behavior presents the characteristics of continuity. The suspect who unlock a lock by tools will steal things in a long time, the suspect who murder will still kill someone. 2. Regularity Of The Time Of Occurrence Of The Case A person or a group of people will commit crimes , at the same time or similar time. The time when the suspect commit crime often shows a certain regularity, for example, a type of cases often occur at a certain time in a day, a certain day in a week, a specific time period in a year. for example , Car theft cases are far more likely to occur from midnight to 6 a.m. Robbery, theft and other usurpation of cases occurred on the eve of Chinese New Year.
  • 2. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 25 | Page 3. Target Suspects Violated Is Same Or Similar Target suspects violated is closely related to the suspect's motive, Because of different psychological needs, the choice of target is in a somewhat tendentious. for example, young single woman often become the object of rape or robbery, Precious industrial raw materials or digital products become the object of theft. 4. Suspect’S Body Features Is Same Or Similar if suspect’s body features is same or similar in different cases, we can consider the cases is related. body features including height, weight, gender, countenance, clothing, accent, and so on. 5. Crime Means, Methods, Processes Are Same Or Similar Because of the suspect’s history of crime, life skills, and other factors, In terms of tools of crime, the characteristics of the crime, the crime procedures, etc. a certain pattern has been formed, And with its own specificity and stability. for example, to destroy grille, some suspect prefer to use the jack, some prefer to use a pipe wrench, some suspect prefer to use stick. to avoid video surveillance, some suspect choose cap, some suspect choose mask, some suspect choose headgear. 6. Vestige Are Same Or Similar Vestige is defined as a vestige of something is a very small part that still remains of something that was once much larger or more important. we can use it as evidence in court, vestige includes footprints, fingerprint, handprint, shot mark, semen, bloodstain, hair, toxicant, exploder, handwriting, and so on. After comparison, overlapping, anastomosis, checkout, If identified as same or similar things in different cases, We can believe that these cases are related. Now, police manage cases by RDMS, each field describe a characteristic of the case, for example, the ‘time’ field describe the time when the case occurred, the ‘body features’ field describe the suspect’s features of his(her) body, including common human characteristics, such as left ring finger broken, yellow hair, myopia and with glasses. The method for the management of the cases made important contributions. But disadvantages are also obvious, firstly, It is complex to enter data into database, each case must be recorded in the management system, a full-time police is responsible for data entry, because it is very complex to describe the case. this is waste of police. secondly, the quality of the data is uneven. Because of different understanding of the case, different expression level, different degree of familiarity with the management system. much of the data is wrong, useless. thirdly, it is excessive precision, When associated with the cases, we can seldom find related cases, It did not play the desired effect to associate with cases. For the above reasons, we present that interrogation record of the case is adopted as the basis for association. we can relate cases based on Chinese text, playing a specialty of big data, we can find related cases nationwide. III. Cluster Analysis Of Cases Vector space model (VSM) is a common method to describe cases, the case set is described as a vector space. Then we analyze the related cases by studying the similarity of cases. Clustering analysis is an important aspect of the study of the data mining. It is process of division physical and abstracted collection into several classes composed of similar objects. Clustering analysis is the learning process of a free guide. It can divide criminal acts into a number of classes or cluster. In same cluster crime acts enjoy high similarity. On the contrary, in different clusters they have large difference. in this article we will use fuzzy K-means clustering algorithm to associate with cases. 1. The Choice Of Distance Measure It is very important, how to calculate the distance between the vectors, and it is directly related to the accuracy of clustering. how to calculate is the best way? there are many factors which infects the accuracy, but the most important factor of all factors is the choice of distance measure. (1) Euclidean Distance Measure The Euclidean distance is the simplest of all distance measures. It’s the most intuitive and matches our normal idea of distance. Euclidean distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... , bn) is d = 𝑎1 − 𝑏1 2 + 𝑎2 − 𝑏2 2 + ⋯+ 𝑎𝑖 − 𝑏𝑖 2 (2). Manhattan Distance Measure The distance between any two points is the sum of the absolute differences of their coordinates. This distance measure takes its name from the grid-like layout of streets in Manhattan. we can’t walk from 3th
  • 3. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 26 | Page Avenue and 3th Street to 5thAvenue and 5th Street by walking straight through buildings. The real distance walked is two blocks up and two blocks over. the Manhattan distance between two n-dimensional vectors (a1, a2, ... , an) and (b1,b2, ... , bn) is d = 𝑎1 − 𝑏1 + 𝑎2 − 𝑏2 + ⋯ 𝑎 𝑛 − 𝑏 𝑛 (3). Weighted Distance Measure Weighted distance measure is based on Euclidean and Manhattan distance measures. A weighted distance measure allows you to give weights to different dimensions in order to either increase or decrease the effect of a dimension on the value of the distance measure. This will affect specific distance measures differently but will in general make the distance value more sensitive to differences in a dimension. 2. Representing Chinese Text Documents As Vectors The vector space model (VSM) is the common way of vectorizing text documents. All the words in the world form a set of words, each word being assigned a number, the number is the dimension, it’ll occupy in document vectors. the dimension of these document vectors can be very large. the maximum number of dimensions possible is the cardinality of the vector. Because the count of all possible words or tokens is unimaginably large, text vectors are usually assumed to have infinite dimensions. Chinese text processing involves the Chinese segmentation. Chinese is different from English, English words in the document are divided into independent word by a space or punctuation, Chinese Document can cut apart into sentences by Chinese punctuation. but there is no space in sentence. Chinese segmentation is the most important factor in representing Chinese text documents as vectors. The method commonly used in Chinese segmentation is "two-way maximum matching", First, maximally splitting in the positive direction, then maximally splitting in the opposite direction, the combination of the two parts is the final result, during the procedure, an atomic word base is needed. The more dimensions in vector space, the more complex the calculations become, It is proved by theoretical calculations and practical tests. Representing the case with high-dimensional vector require enormous computing resources, so two questions must be solved to calculate the related cases: firstly, reduce the dimension of vector space, secondly, provide sufficient computing resources. Atomic word is usually expressed as a natural word, the natural words are usually not standard, the same meaning is expressed in different words, atomic word must be regulatory. the method to establish a thesaurus and stop words dictionary can reduce the dimension of vector space effectively. Sufficient computing resources in a stand-alone environment is impossible, with the emergence of Hadoop, MapReduce model can break the huge computing task into small tasks, the small tasks can be executed at different nodes, Hadoop provides us with sufficient computing resources. so we can calculate the relevance of the cases in the Hadoop environment. 3. Term Frequency-Inverse Frequency Term frequency–inverse document frequency (TF-IDF) weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement; instead of simply using term frequency as the value in the vector, this value is multiplied by the inverse of the term’s document frequency. That is, its value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words. the TF-IDF weight Wi for a word wi is: 𝑊𝑖 = 𝑇𝐹𝑖 ∙ 𝑙𝑜𝑔 𝑁 𝐷𝐹𝑖 Where: TFi represents the term frequency (TFi) of word wi, N represents the document count. the document vector will have this value at the dimension for wordi. This is the classic TF-IDF weighting. stop words get a small weight, and terms that occur infrequently get a large weight. The important words, or the topic words, usually have a high TF and a somewhat large IDF, so the product of the two becomes a larger value, thereby giving more importance to these words in the vector produced. IV. Application Of Clustering Algorithm 1. Fuzzy K-Means Clustering Algorithm K-means clustering algorithm tries to find the hard clusters (where each point belongs to one cluster), But fuzzy k-means discovers the soft clusters. In a soft cluster, any point can belong to more than one cluster with a certain affinity value towards each. So the theory of the fuzzy clustering is more suitable for the nature of things, and it can reflect objectively the reality. Currently, the fuzzy K-means clustering (FKM) algorithm is the most widely used.
  • 4. Research on Clustering Method of Related Cases Based on Chinese Text DOI: 10.9790/0661-17152427 www.iosrjournals.org 27 | Page FKM partitions set of n objects X = (𝑥1, 𝑥2, ⋯ , 𝑥 𝑛 ) into K fuzzy clusters with C = 𝑐1, 𝑐2, ⋯ , 𝑐 𝑘 cluster centers. In fuzzy matrix U = (𝑢𝑖𝑗 , ), uij is the membership degree of the ith object with the cluster, The characters of uij are as follows : 𝑢𝑖𝑗 ∈ 0,1 𝑖 ∈ 1,2, ⋯ , 𝑛; 𝑗𝜖1,2,⋯ , 𝑘 𝑢 𝑖, 𝑗 = 1, 𝑗 = 1, ⋯, 𝑛 𝑘 𝑖=1 Update uij according to (1): 𝑢𝑖𝑗 = 1 𝑑 𝑖𝑗 𝑑 𝑘𝑗 2 𝑚 −1 𝑘 𝑘=1 1 Where: m>1is fuzziness exponent, cj is the clustering center. 𝑑𝑖𝑗 = 𝑥𝑖 − 𝑐𝑗 is the distance between xi and cj. Update cj according to (2) 𝑐𝑗 = 𝑢𝑖𝑗 𝑚 𝑥𝑖 𝑛 𝑖=1 𝑢𝑖𝑗 𝑚𝑛 𝑖=1 (2) The objective function is the equation (3): J U, c1, ⋯, c 𝑘 = 𝑢𝑖𝑗 𝑚 𝑛 𝑗 =1 𝑘 𝑖=1 𝑑𝑖𝑗 2 (3) The nature of FKM algorithm is to apply the gradient descent method to find out optimal solution, so there is a local optimization problem. And the algorithm convergence speed is greatly influenced by the initial value, especially in the case of large number of clusters. For many real-world clustering problems, the number of clusters isn’t known beforehand, A class of techniques known as approximate clustering algorithms can estimate the number of clusters as well as the approximate location of the centroids from a given data set. One such algorithm we can use is called canopy generation. 2. Canopy Clustering Algorithm Canopy’s advantage lies in its ability to create clusters extremely quickly—it can do this with a single pass over the data. But this algorithm may not give accurate and precise clusters. But it can give the optimal number of clusters without even specifying the number of clusters, k, as required by fuzzy k-means. V. Conclusion This article proposed a method to associate cases by joint use of FKM and Canopy clustering algorithm, it overcome the difficult problem of clustering Chinese cases, but this method is relatively complex, the results are not very accurate, We hope to improve this method in the future. Acknowledgment This work was partly supported by general scientific research project of education department of Liaoning Province of China (No. L2013490). References [1]. Ning Han, Wei Chen. Research on clustering analysis on related Cases. Journal of Chinese People’s Public Security University(Science and Technology), 2012(1):53-58. [2]. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman. Mahout in Action[M]. Manning Publication, Westampton. [3]. Tony H. Grubesic. On The Application of Fuzzy Clustering for Crime Hot Spot Detection[J]. Journal of Quantitative Criminology, 2006,22(1):77-105. [4]. [Deguang Wang, Baochang Han, Ming Huang. Application of Fuzzy C-Means Clustering Algorithm Based on Particle Swarm Optimization in Computer Forensics. 2012 International Conference on Applied Physics and Industrial Engineering, 1186-1191. [5]. Gang Liu. Programing Hadoop[M]. BeiJing:China Machine Press, 2014. [6]. Hesam Izakian, Ajith Abraham, Václav Snášel. "Fuzzy Clustering Using Hybrid Fuzzy c-means and Fuzzy Particle Swarm Optimization" 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), pp.1690-1694,2009.