This document discusses definitions of a data scientist and proposes criteria for what qualifies someone for the job title. It argues that a data scientist should be considered a scientist since they apply scientific methods to data. A data scientist requires tertiary education in a relevant field like math, statistics or computer science. Their work involves autonomously applying expertise to analyze data and solve problems using scientific approaches. Definitions should recognize various specializations within data science and that technical skills alone don't make someone a data scientist. Formal education requirements and certification from professional bodies would give the field greater credibility and clarity.
What is a data scientist - a presentation I made to the Canberra IAPA
1. What is a Data Scientist?
Authored By Russell Tibballs MACS CP MSR
2. Caveat
This slideshow does not represent the views
of the company I work for. It represents my
evolving views at this point in time and is
mainly intended to provoke thought and
discussion.
3. Google searches for “data
scientist”
2011 saw the rise of “Big Data” and the term Data Scientist.
2011 saw the release of Moneyball, starring Brad Pitt as a geek.
In 2012 Nate Silver correctly predicted the winner of all 50 states and
the District of Columbia when the pundits were claiming Obama
had lost.
4. So how is a Data Scientist
portrayed - Super Human
‘Data Scientists perform data science. They use
technology and skills to increase awareness,
clarity and direction for those working with
data. The data scientist role is here to
accommodate the rapid changes that occur in
our modern day environment and are bestowed
the task of minimising the disruption that
technology and data is having on the way we
work, play and learn. Data Scientists don’t just
present data, data scientists present data with
an intelligence awareness of the consequences
of presenting that data.’
5. Super Human Continued
A large IT company – ‘What sets the data scientist
apart is strong business acumen, coupled with the
ability to communicate findings to both business
and IT leaders in a way that can influence how an
organization approaches a business challenge.
Good data scientists will not just address business
problems, they will pick the right problems that
have the most value to the organization.’
7. DATA SCIENCE
Mark Biernbaum suggests ‘Data
Science is going 99% too fast’. His
complaint is that the ‘science’ is not
peer-reviewed and the techniques are
often questionable. He believes Data
Scientists should slow down,
specialize, and above all - have the
methodologies peer-reviewed.
8. My Problem with Current
Definitions
I have been to a number of industry briefings
where the supplied definitions are often very ‘pie
in the sky’ and elitist. The definitions are designed
to indicate ‘you can’t possibly do this yourself
and there is no way any of your existing staff
will qualify for the role’. This may not be
intended; however, it is the result.
9. So What can we do about
that?
Recognise that Data Science is a science that
has a broad brush stroke across all industry
sectors.
Recognise that there are many specialty
areas.
Recognise that it is not a technological
implementation.
Recognise that there will be many levels of
expertise.
10. Recognise that there are many
specialty areas.
There is not one version of data science.
There is data science applicable to the
research sectors of Maths, Physics,
Meteorology, and Medicine that will rarely be
applied elsewhere.
There is data science our friends in the NSA
and local equivalents will specialise in.
There is the data science the economic and
financial sectors will specialise in.
Etc, etc … ad nauseam.
11. Recognise that it is not a
technological implementation.
Being able to query unstructured data in
HDFS does not make you a data scientist.
Being able to analyse Splunk data does not
make you a data scientist.
Being able to filter petabytes of data on an
MPP RDBMS does not make you a data
scientist.
12. So What is a Scientist?
13. The important aspects of any
definition of a Job Title.
The most important thing to remember here
is that we are talking about a Job Title, and a
Job Title should be meaningful.
Secondly, what should qualify someone for
that title?
14. So what is important about the
title ‘Data Scientist’?
15. Where did this title
originate?
‘On November 10, 1998, he (Jeff Wu) gave his
inaugural lecture entitled “Statistics = Data
Science?” in honor of his appointment to the H. C.
Carver Collegiate Professorship in Statistics at the
University of Michigan.[14] In this lecture, he first
focused on the identity of statistics in science. He
then characterized statistical work as data
collection, data modeling and analysis, and
problem solving and decision making. In
conclusion, he proposed that statistics be renamed
to Data Science.
16. So What is a Scientist?
From the Oxford Dictionary:
‘A person who is studying or has expert knowledge of one
or more of the natural or physical sciences: a research
scientist’.
Note: a scientist is not necessarily a research scientist; they
can be a practicing expert in a field.
However, all scientists share one feature: they are trained in
a science, they apply scientific method to obtain
understanding of a focus of interest, and their methods and
conclusions are subject to peer review.
17. A comment from a recently
retired Scientist
My neighbor has recently retired after a long
career as a scientist and academic. We were
discussing the growing exclusivity
of the term scientist a few weekends ago. In
his words, ‘In the 1970s a scientist had a
degree, by the mid-80s they needed honors, in the
90s they needed a masters or PhD, now they
need several post-doctoral projects under
their belt to be considered a “real” scientist.’
However, he believes someone who is
qualified (has a science degree) and who is
practicing their studied discipline is a
scientist.
18. A Slight Detour.
What qualifies a professional?
I see the Data Scientist as a specialty of the Computer Science
profession.
We have lawyers who specialise in corporate, family, criminal,
and other aspects of the law.
The accounting, architecture, engineering, teaching and medical
professions have several specialties and recognised levels of
expertise in each field.
These professionals have academic training, and in many cases
acceptance by a professional body is what makes them
acceptable as professionals in the public eye. That is a model I
strongly believe the ICT industry needs to adopt or at least move
towards.
I believe the academic achievement makes the qualification. The
acceptance by a professional body should give standing within the
profession and wider community.
19. The Australian Qualifications
Framework - AQF
The AQF has 10 levels
Level 1 – Certificate I
Level 2 – Certificate II
Level 3 – Certificate III
Level 4 – Certificate IV
Level 5 – Diploma
Level 6 – Advanced Diploma, Associate Degree
Level 7 – Bachelor Degree
Level 8 – Bachelor Honors Degree, Graduate Certificate,
Graduate Diploma
Level 9 – Masters Degree
Level 10 – Doctoral Degree
20. AT THE BOTTOM LEVEL OF THIS
SPECTRUM OF QUALIFICATIONS
Summary: Graduates at this level will have knowledge and skills for
initial work, community involvement and/or further learning
Knowledge: Graduates at this level will have foundational knowledge
for everyday life, further learning and preparation for initial work
Skills: Graduates at this level will have foundational cognitive,
technical and communication skills to:
•undertake defined routine activities
•identify and report simple issues and problems
Application of knowledge and skills: Graduates at this level will apply
knowledge and skills to demonstrate autonomy in highly structured and
stable contexts and within narrow parameters
21. At the highest level of the
spectrum of the AQF 10 – The
Doctorate
Summary: Graduates at this level will have systematic and critical
understanding of a complex field of learning and specialised research skills for
the advancement of learning and/or for professional practice
Knowledge: Graduates at this level will have systemic and critical
understanding of a substantial and complex body of knowledge at the frontier of
a discipline or area of professional practice
Skills: Graduates at this level will have expert, specialised cognitive, technical and
research skills in a discipline area to independently and systematically:
•engage in critical reflection, synthesis and evaluation
•develop, adapt and implement research methodologies to extend and redefine existing
knowledge or professional practice
•disseminate and promote new insights to peers and the community
•generate original knowledge and understanding to make a substantial contribution to a
discipline or area of professional practice
Application of knowledge and skills: Graduates at this level will apply knowledge
and skills to demonstrate autonomy, authoritative judgment, adaptability and
responsibility as an expert and leading practitioner or scholar
22. The Degree
Summary: Graduates at this level will have broad and coherent
knowledge and skills for professional work and/or further learning
Knowledge: Graduates at this level will have broad and coherent
theoretical and technical knowledge with depth in one or more
disciplines or areas of practice
Skills: Graduates at this level will have well-developed cognitive, technical
and communication skills to select and apply methods and technologies
to:
•analyse and evaluate information to complete a range of activities
•analyse, generate and transmit solutions to unpredictable and
sometimes complex problems
•transmit knowledge, skills and ideas to others
Application of knowledge and skills: Graduates at this level will apply
knowledge and skills to demonstrate autonomy, well-developed
judgement and responsibility:
•in contexts that require self-directed work and learning
•within broad parameters to provide specialist advice and functions
23. The Vendor’s Course
The vendor’s course will usually be about how
to apply a tool to a problem.
It is not generally designed to provide you
with knowledge that can be applied outside
the scope of their tool’s environment.
It would generally not qualify within the AQF
guidelines.
24. So how does the AQF apply to
the question of Data Science?
If the person working in the field of applying
‘Data Science’ has a degree (AQF level 6 or
above) in a related subject, i.e. Maths, Statistics,
or Economics, or a higher degree including Grad
Cert and Diplomas, they can be expected to:
apply knowledge and skills to demonstrate autonomy,
well-developed judgment and responsibility:
in contexts that require self-directed work and learning
within broad parameters to provide specialist advice
and functions
25. Cui Bono? Who benefits from
this approach?
The Public - they will have greater confidence in
the profession.
The employer – they get the assurance that
the employee has the skills at the right levels to do
the work.
The employee – because they will know what is
expected of them and know they will be able to
deliver.
The professional body and industry through
greater faith and confidence by the public in the
profession in general.
26. But!!!
There needs to be demand from within the
industry for this to happen.
Some group like the IAPA needs to take on the
responsibility of working out the Professional
specialisations and required frameworks for
acceptance of professionals into those
specialisations.
Editor's notes
The term "data scientist" started to rise rapidly around 2011 and almost caught up with "statistician". Searches for "data scientist" surpassed the searches for "data miner" in 2012. The chart below shows Google Trends for "Statistician", "Data Scientist", and "Data Miner" from Jan 2008 to Dec 2013.
For Graph go to http://www.google.com/trends/explore#q=Statistician%2C%20%22Data%20Scientist%22%2C%20%22Data%20Miner%22&date=1%2F2007%2084m&cmpt=q
http://www.datascientists.net/what-is-data-science
If this had stopped at the first couple of sentences I would have been happier.
I think the picture can do without the ‘go away if’ arm.
All these things are good; however, this expresses an ideal, not a reality.
For the quote. This has been written by a communications specialist who is telling someone ‘if you get the right person they will solve all your problems’. I believe in the Easter Bunny and Santa Claus too.
On the communications side. A friend of the family works as a ‘Science Communicator’ for a large pharmaceutical company in London. Maybe if data science is that important, it will lead to ‘data communicators’ – possibly Nate Silver already falls in that camp.
However, if you go to almost any profession, there are those who can do; those who understand what they can do and can do it; and those who can do and communicate what it is they are doing. The communicator does not always rise to the top of the heap, as technicians tend to respect technical ability above communications ability.
It is impossible to have all these skills, however some of them would be useful. Many of these tools will quickly become redundant as newer tools and methods evolve. The data access components will be merged into simpler interfaces and existing tools.
http://nirvacana.com/thoughts/becoming-a-data-scientist/
To be fair, Swami indicates this is a roadmap to follow and is also getting people to think about what a data scientist is. He also indicates this is far from complete. I may be misinterpreting him; however, it appears to me that you need to be an expert at each stop, which seems a tall ask. By the time you have learnt many of these skills, a fair percentage of what you have learnt will be redundant as new tools and techniques replace them. Which is one of the joys of working in this field; you will never have time to get bored, as you need to maintain continual learning to stay relevant.
http://www.kdnuggets.com/2014/01/biernbaum-data-science-99-percent-too-fast.html
From what I have seen, I tend to agree.
To quote Stephen Brobst (Teradata CTO), Teradata Summit Series, 2 April 2014: ‘IT people love to chase shiny objects’.
I am talking about vendor presentations and a few from Industry Special Interest Groups.
I believe in most organisations there are staff who can be moulded to fill the required Data Science capability.
http://www.abc.net.au/news/2011-12-21/albert-einstein-sticks-out-his-tongue-at-photographers/3742064
I picked this photo because everyone thinks of Einstein when they think of science.
Bottom right is Ed Diener, who has been studying well-being for decades.
These guys graduated as Geologists from the University of Wisconsin – They are science graduates and recognised specialists.
From Wikipedia. Note the problem solving and decision making component. Using that definition, anyone who has a substantial statistics and research component to their degree (such as maths, economics, science, and social science graduates) and who works in an analytics capacity is a data scientist. Therefore, if they are qualified and practicing in an information analytics capacity, they are data scientists.
The Wikipedia quote on the slide continues: ‘… data science and statisticians data scientists.[14] Later, he presented his lecture entitled “Statistics = Data Science?” as the first of his 1998 P.C. Mahalanobis Memorial Lectures.[15]’
C.F. Jeff Wu is the Coca-Cola Chair in Engineering Statistics and Professor in the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology.
Peer review does not make so much sense outside the academic realm. However it does help get rid of silly mistakes and helps ensure that outcomes are repeatable. Good method will ensure that if you repeat a process you will get the same result. A surprisingly rare feat in the wilds of data analysis.
Good analysis has purpose, context, and strong methodology. So outside academia it equates to the ‘active research cycle’ with a ‘business’ focus.
Active Research Cycle from http://creativeeducator.tech4learning.com/v07/articles/Embracing_Action_Research
Why am I bothering with this anecdote? It is designed to show the creep in requirements over time for a skill.
PS: Thomas also noted that the so-called scientists no longer practise their craft – they spend their lives applying for grants and networking. The post grad students do the work. He spent his last years pre-vetting submissions for publications by doctoral students.
The most important part is specialties or streams, each with recognised levels of achievement and expertise. In terms of ‘Data Science’, not having a certification should not mean you will not be able to do certain work; it would mean that the professional body would not be endorsing your ability to do that work.
In Australia we have the Australian Qualifications Framework. ‘The AQF is the national policy for regulated qualifications in Australian education and training. It incorporates the qualifications from each education and training sector into a single comprehensive national qualifications framework. The AQF was first introduced in 1995 to underpin the national system of qualifications in Australia encompassing higher education, vocational education and training and schools.’
Where there are existing frameworks that are working, make use of them.
http://www.aqf.edu.au/aqf/in-detail/aqf-levels/
This is where I see most vendor courses are sitting. They train to use a tool. In regard to ‘Data Science’, a course on Legal Privacy requirements at this level could and possibly should be compulsory.
Obviously people at this level in the hard and soft sciences have a demonstrated capacity to apply qualitative analysis, quantitative analysis, or both through the lens of the research cycle to provide significant insight. These people should be able to communicate exceptionally well. The argument put forward is often that their focus is narrow and they should not be used outside that sphere.
I once heard an Oxford professor state that ‘Oxford PhD graduates can learn any new subject and be an expert within 2 weeks’. Probably an exaggeration; however, it does highlight that this level of achievement is generally an attribute of the graduate, showing a general ability to learn and communicate ideas at a high level.
When I have quizzed a number of speakers after presentations that bemoaned the lack of ‘data science candidates’, I would ask about the tens of thousands of higher degree and research graduates. Then they would agree that the problem is not so much the lack of ‘Data Scientists’ as a lack of managers who can comprehend what data scientists are talking about.
A graduate of the hard and soft sciences should be able to apply analytic tools to evaluate information and transmit solutions to complex problems. I believe this is the starting level for a Data Scientist.
Accreditation to use a tool is just that. It is not really a recognisable qualification. Often it is really telling you how to use a tool and little more. There are some vendors whose courses are embedded in university curricula. However, that is not the norm.
This is the end of the equation where people should qualify as a Professional. Below that we are really applying a tool.
There are many other academic streams that would fit into this model. Basically anything where you have to use data analysis to apply scientific method, e.g. psychology, engineering and others.