Format for the population data in forensic genetics ppt
1. PROPOSALS FOR THE FORMAT
FOR POPULATION DATA BASES
AND THEIR ANALYSIS
A. G. Smolyanitsky1, N. N. Khromov-Borisov1, G. B.
Lazzarotto2 and T. B. L. Kist2
1Forensic Medicine Bureau of Leningrad District, Saint
Petersburg, Russia
2Institute of Biosciences, Federal University of Rio Grande do
Sul, Porto Alegre, Brazil
Andrew.Smolyanitsky@yandex.ru
Nikita.KhromovBorisov@gmail.com
Gustavo.Lazzarotto@terra.com.br
Kist@molgen.mpg.de
2. DNA-PCR Data Banks
DNA-PCR Databank:
http://www.uni-duesseldorf.de/WWW/MedFak/Serology/database.html
DB on Nuclear DNA
http://www.ertzaintza.net/cgi-bin/db2www.exe/adn.d2w/INPUT?IDIOMA=INGLES
World population data
J. Forensic Sci. 45 (1) 118-146 (2000)
CODIS STR loci data
J. Forensic Sci. 46 (3) 453-489 (2001)
3. Precision and accuracy
Sometimes relative allele frequencies
are calculated or presented
inaccurately
Precision of up to three significant
digits appears to be insufficient
4. Round-off
Sometimes the sum of the frequencies is not
equal to unity because of low precision or round-off
errors, e.g., 0.879 or 1.123
Sometimes it is difficult to round off correctly
the recalculated absolute frequencies,
e.g., 18.51 or 75.48
As a result their sum may be odd or not equal to
the published value
5. Uncertainties
Some data sets appear to be completely
identical
Such duplications may result from the
fact that they are reproduced in
different publications
SANCT software makes it possible to identify
them automatically in very large databases
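The idea of automatic duplicate detection can be sketched as follows. This is only an illustration and not SANCT's actual method, which is not described here: it assumes each data set is stored as a mapping from allele name to absolute count, and the data set names and counts below are hypothetical.

```python
from collections import defaultdict

def find_duplicates(datasets):
    """Group data set names whose allele counts are exactly identical.

    `datasets` maps a (hypothetical) data set name to a dict of
    allele -> absolute count.
    """
    groups = defaultdict(list)
    for name, counts in datasets.items():
        # Order-independent fingerprint of the allele counts.
        key = tuple(sorted(counts.items()))
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical example: paper_A and paper_B report the same counts.
datasets = {
    "paper_A": {"14": 10, "15": 20},
    "paper_B": {"15": 20, "14": 10},
    "paper_C": {"14": 11, "15": 19},
}
duplicates = find_duplicates(datasets)
print(duplicates)
```

Exact matching of absolute counts only finds verbatim duplicates; data sets published as rounded relative frequencies would need a tolerance-based comparison instead.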
6. Independence
Some data sets seem to be non-independent:
preliminary data published earlier are then
combined with the new data in subsequent
publications
SANCT software facilitates their
detection
7. Collapsibility
Sometimes rare alleles are combined with
the nearest ones, e.g., 14+15+16
SANCT puts this manipulation on solid
statistical ground:
Categories (alleles and/or samples)
are combined (collapsed) not arbitrarily,
but only when they are statistically
homogeneous, e.g., 14+21
8. Precision
Compute relative frequencies with at least
four significant digits (GDA)
Check that their sum equals unity:
Sum(pi) = 1.0000
Check the “re-computability” of the initial
absolute counts:
Sum(pi × N) = N
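Both checks can be sketched in a few lines. The function name and the rounding rule (nearest integer) are assumptions made for illustration. The frequencies and N = 396 are the values shown in the GC locus template in this deck.

```python
def check_frequencies(freqs, n_alleles, digits=4):
    """Return (sum_ok, counts_ok, recovered_counts).

    sum_ok:    the frequencies sum to 1 at the given number of decimals
    counts_ok: the counts recomputed as round(p * N) add up to N
    """
    total = round(sum(freqs), digits)
    recovered = [round(p * n_alleles) for p in freqs]
    return total == 1.0, sum(recovered) == n_alleles, recovered

# Four decimals: both checks pass.
four_digits = check_frequencies([0.3308, 0.1136, 0.5556], 396)

# Three decimals: the sum is 1.001, so the first check fails.
three_digits = check_frequencies([0.331, 0.114, 0.556], 396)

print(four_digits, three_digits)
```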
9. Show individual genotypes
when feasible
ID        Locus A   Locus B   Locus Z
Xx-xxx    3.2/7     --/--     6/6
1         3207      0000      0606
Yy-yyy    6/14      17/18     9/9.3
2         0614      1718      0093
FSTAT is able to detect 0093 as an error
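A consistency check between the two notations might look like the sketch below. The encoding rule is an assumption inferred from the examples in the table (each allele written as two digits, the decimal point dropped, missing alleles as 00). It is not taken from the FSTAT documentation, and the function names are hypothetical.

```python
def encode_allele(allele):
    """Two-digit code: '7' -> '07', '3.2' -> '32', '--' -> '00' (assumed rule)."""
    if allele == "--":
        return "00"
    digits = allele.replace(".", "")
    if not digits.isdigit() or len(digits) > 2:
        raise ValueError(f"cannot encode allele {allele!r} in two digits")
    return digits.zfill(2)

def encode_genotype(genotype):
    """Encode a genotype such as '6/14' as a four-digit string such as '0614'."""
    first, second = genotype.split("/")
    return encode_allele(first) + encode_allele(second)

def is_consistent(genotype, code):
    """True when the four-digit code matches the slash notation."""
    return encode_genotype(genotype) == code

print(encode_genotype("9/9.3"), is_consistent("9/9.3", "0093"))
```

Under this assumed rule 9/9.3 encodes as 0993, so the tabulated 0093 is flagged as inconsistent. The rule is also ambiguous (3.2 and 32 share a code), which is one more reason to publish the slash notation alongside the numeric one.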
12. Show absolute counts
Present genotype counts in the form of a
triangular matrix.
Such a presentation visualizes the
“saturation” of the data and makes it possible to
present important information on the
partial fixation indices in compact form
in the same matrix.
13. Template for genotype and allele counts,
partial fixation indices and relative allele
frequencies
Locus: GC n = 198
Allele    A     B      C      fii     Ni    pi
A         25    0.06   0.08    0.08   131   0.3308
B         14    2      0.06   -0.03    45   0.1136
C         67    27     63      0.04   220   0.5556
Total                          0.044  396   1.0000
GDA software computes fii
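The allele counts Ni and relative frequencies pi in the template can be recomputed from the genotype counts alone, as the sketch below illustrates. The dictionary layout and the function name are assumptions for illustration. The six genotype counts are those in the lower triangle of the GC template, and the fixation indices are not recomputed here.

```python
def allele_counts(genotypes):
    """Sum allele counts from genotype counts keyed by (allele, allele) pairs.

    A homozygote (a, a) contributes its count twice to allele a.
    """
    counts = {}
    for (a, b), n in genotypes.items():
        counts[a] = counts.get(a, 0) + n
        counts[b] = counts.get(b, 0) + n
    return counts

# Genotype counts from the lower triangle of the GC template.
gc = {
    ("A", "A"): 25,
    ("A", "B"): 14, ("B", "B"): 2,
    ("A", "C"): 67, ("B", "C"): 27, ("C", "C"): 63,
}

counts = allele_counts(gc)
total = sum(counts.values())
freqs = {a: round(c / total, 4) for a, c in counts.items()}
print(counts, total, freqs)
```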
14. Availability
“Open and show all your data”,
visualization and “statistification”,
or GSP (Good Statistics Practice),
must be the main principles in building
databases.
Make all your data available to the
users, preferably online or on
request from the authors.