Today, data science is enabling companies, governments, research centres and other organisations to turn their volumes of big data into valuable and actionable insights. It is important to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. According to the McKinsey Global Institute, the U.S. alone could face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using big data by 2018. In coming years, data scientists will be vital to all sectors, from law and medicine to media and nonprofits. Has the African continent planned to train the next generation of data scientists required on the continent?
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
1. Data Science a Multifaceted Discipline: Data Science Engineering and Data Science Analytics
A Keynote Address, 24 February 2017
By
Prof. Venansius Baryamureeba, PhD
Chairman and Managing Director, ICT Consults Ltd
www.ict.co.ug
www.baryamureeba.ug; barya@baryamureeba.ug
www.utamu.ac.ug/barya; barya@utamu.ac.ug
Africa Data Forum Johannesburg Conference, 22-24 February 2017.
2. Outline
• Data Science
• Data Science a Multifaceted Discipline
• Foundations of Data Science
• Data (Science) Engineering
• Importance and Evolving Role of Data Science
• Examples of Data Science in Action
• Data (Science) Analytics
• Big Data Analytics
• Conclusion
3. Data Science
• Data Science is an interdisciplinary field concerned with methods and systems
for extracting knowledge or insights from large quantities of data coming in
various forms. Historically, no single practice described the simultaneous
use of so many different skill sets and bases of knowledge.
• Data science has emerged as the field that exists at the intersection of
mathematics and statistics, computer science, and expertise in a particular
science discipline.
• Data science employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, and computer and
information sciences and applies them to a wide range of data-rich
domains such as biomedical sciences, physical science, geoscience, social
science, engineering, business, and education.
4. Data Science a Multifaceted Discipline
• Data science is a very broad and multifaceted field
• Data science combines aspects of computer science, information science,
mathematics and statistics.
• Data Science requires a multidisciplinary skill set (i.e. requires skills in
computer science, analytics, data management, art and design and
entrepreneurship among others).
• Data science uses automated methods to analyze massive amounts of data
and extract knowledge from them.
• Data science applies various tools and techniques to data in order to produce
a data product: an exploitable insight derived from collected facts.
• Data science provides the underlying theory and methods of the data
revolution.
6. Foundations of Data Science
• While there is not yet a consensus on what precisely constitutes data
science, three professional communities, all within computer science
and/or statistics, are emerging as foundational to data science:
• (i) Database Management enables transformation, conglomeration, and
organization of data resources;
• (ii) Statistics and Machine Learning convert data into knowledge; and
• (iii) Distributed and Parallel Systems provide the computational infrastructure
to carry out data analysis.
7. Role of Statistics in Data Science
• In a policy statement issued on October 1, 2015, the American
Statistical Association (ASA) stated that statistics is "foundational to
data science"—along with database management and distributed and
parallel systems—and its use in this emerging field empowers
researchers to extract knowledge and obtain better results from Big
Data and other analytics projects.
• The statement also encouraged "maximum and multifaceted
collaboration" between statisticians and data scientists to maximize
the full potential of big data and data science.
8. Data Scientist Vs Data Engineer
• Data Scientists and Data Engineers may be new job titles, but the core
job roles have been around for a while.
• Traditionally, anyone who analyzed data would be called a “Data
Analyst” and anyone who created backend platforms to support data
analysis would be called a “Business Intelligence (BI) Developer”.
• With the emergence of big data, new roles emerged in corporations,
research centers and governments — namely, Data Scientists and
Data Engineers.
9. Data Analyst
• Data Analysts are experienced data professionals in their organization who
can query and process data, provide reports, summarize and visualize data.
• They have a strong understanding of how to leverage existing tools and
methods to solve a problem, and help people from across the organisation
understand specific queries with ad-hoc reports and charts.
• However, they are not expected to deal with analyzing big data, nor are
they typically expected to have the mathematical or research background
to develop new algorithms for specific problems.
• Skills and Tools: Data Analysts need to have a baseline understanding of
some core skills: statistics, data munging, data visualization, exploratory
data analysis, Microsoft Excel, SPSS, SPSS Modeler, SAS, SAS Miner, SQL,
Microsoft Access, Tableau, SSAS.
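The kind of ad-hoc report described above can be sketched with SQL, one of the listed skills. This is a minimal illustration only: the `sales` table, its columns and the figures are invented, and Python's built-in `sqlite3` module stands in for whatever database an organisation actually uses.

```python
import sqlite3

# Hypothetical sales records (region, amount); invented for illustration.
rows = [("North", 120.0), ("South", 80.0), ("North", 60.0),
        ("South", 40.0), ("East", 100.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Ad-hoc report: number of sales, total and average amount per region.
report = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()

for region, n, total, mean in report:
    print(f"{region}: {n} sales, total {total:.2f}, mean {mean:.2f}")
conn.close()
```

The same query could be pasted into any SQL tool; the Python wrapper is only there to make the example self-contained.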
10. Business Intelligence Developer
• Business Intelligence (BI) Developers are data experts who work closely
with internal stakeholders to understand reporting needs, then collect
requirements and design and build BI and reporting solutions
for the organisation.
• They have to design, develop and support new and existing data
warehouses, ETL (Extract, Transform and Load) packages, cubes,
dashboards and analytical reports.
• They work with databases, both relational and multidimensional, and
should have great SQL development skills to integrate data from different
resources. They use all of these skills to meet the enterprise-wide self-
service needs.
• BI Developers are typically not expected to perform data analyses.
• Skills and tools: ETL, developing reports, OLAP, cubes, web intelligence,
business objects design, Tableau, dashboard tools, SQL, SSAS, SSIS.
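To make the ETL (Extract, Transform and Load) idea concrete, here is a small sketch under stated assumptions: the CSV export, the `payments` table and the cleaning rules are all hypothetical, and an in-memory SQLite database stands in for a data warehouse. Real ETL packages (for example in SSIS) are far richer; this only shows the three stages.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """customer,amount,currency
alice, 10.50 ,USD
bob,,USD
carol,7.25,usd
"""

def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: trim whitespace, upper-case currency codes, drop rows with no amount."""
    cleaned = []
    for r in records:
        amount = r["amount"].strip()
        if not amount:
            continue  # skip incomplete rows
        cleaned.append((r["customer"].strip(), float(amount),
                        r["currency"].strip().upper()))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments "
                 "(customer TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, amount, currency FROM payments ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and replace on its own.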
11. Data Scientist
• A data scientist is the alchemist of the 21st century: someone who can turn
raw data into purified insights. Data scientists apply statistics, machine
learning, and other analytic approaches to solve critical business problems.
Their primary function is to help organizations turn their volumes of big
data into valuable and actionable insights.
• In addition to data analysis skills, Data Scientists are expected to have
strong programming skills, the ability to design new algorithms and handle
big data, and some expertise in the application domain.
• Data Scientists are also expected to interpret and eloquently deliver the
results of their findings, by visualization techniques, building data science
apps, or narrating interesting stories about the solutions to their data
(business) problems.
12. Data Scientist Cont’d
• The problem-solving skills of a data scientist require an understanding of
traditional and new data analysis methods to build statistical models or
discover patterns in data. Examples include creating a recommendation engine,
predicting the stock market, diagnosing patients based on their similarity,
or finding the patterns of fraudulent transactions.
• Data Scientists may sometimes be presented with big data without a
particular business problem in mind. In this case, the curious Data Scientist
is expected to explore the data, come up with the right questions, and
provide interesting findings!
• They should have experience working with different datasets of different
sizes and shapes, and be able to run their algorithms on large data sets
effectively and efficiently, which typically means staying up-to-date with all
the latest cutting-edge technologies.
• Skills and tools: Python, R, Scala, Apache Spark, Hadoop, data mining tools
and algorithms, machine learning, statistics.
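As an illustration of the recommendation engine mentioned above, the following sketch ranks items by how often they are bought together. It is a deliberately simple co-occurrence approach, not a production method, and the purchase baskets are invented for the example.

```python
from collections import Counter

# Hypothetical purchase baskets, invented for illustration.
baskets = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"laptop", "bag"},
    {"phone", "case"},
    {"mouse", "keyboard"},
]

def recommend(item, baskets, top_n=2):
    """Rank other items by how often they share a basket with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    # Sort by count (descending), then by name, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

print(recommend("laptop", baskets))
```

Real recommendation engines usually weight these counts (for example by item popularity) and run over far larger data, which is where tools such as Spark come in.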
14. Data Engineer
• Data engineering includes what some organisations might call Data
Infrastructure or Data Architecture.
• The data engineer gathers and collects the data, stores it, does batch
processing or real-time processing on it, and serves it via an API to a
data scientist who can easily query it.
• A good data engineer has extensive knowledge of databases and best
engineering practices. These include handling and logging errors,
monitoring the system, building human-fault-tolerant pipelines,
understanding what is necessary to scale up, addressing continuous
integration, knowledge of database administration, maintaining data
cleaning, and ensuring a deterministic pipeline.
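Some of these practices (handling and logging errors, data cleaning, and a deterministic pipeline) can be sketched in a few lines. The record layout (`user`, `amount`) and the validation rules below are assumptions made for the example, not part of any particular system.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def clean(record):
    """Validate and normalise one raw record; raise on bad input."""
    user = record["user"].strip().lower()   # KeyError if the field is missing
    amount = float(record["amount"])        # ValueError if not a number
    if amount < 0:
        raise ValueError("negative amount")
    return {"user": user, "amount": amount}

def run_pipeline(raw_records):
    """Clean every record; log and set aside failures instead of crashing."""
    good, rejected = [], []
    for i, record in enumerate(raw_records):
        try:
            good.append(clean(record))
        except (KeyError, ValueError) as exc:
            log.warning("record %d rejected: %r", i, exc)
            rejected.append(record)
    # Sorting makes the output independent of input order (deterministic).
    good.sort(key=lambda r: (r["user"], r["amount"]))
    return good, rejected

raw = [
    {"user": " Bob ", "amount": "5"},
    {"user": "alice", "amount": "oops"},   # not a number
    {"user": "alice", "amount": "3.5"},
    {"amount": "1"},                       # missing user
]
good, rejected = run_pipeline(raw)
print(good, len(rejected))
```

Keeping rejected records rather than silently dropping them lets someone inspect and reprocess them later.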
15. Data Engineer Cont’d
• Data Engineers are the data professionals who prepare the “big data”
infrastructure to be analyzed by Data Scientists.
• They are software engineers who design, build and manage big data systems
and integrate data from various sources. They then write complex queries on
that data, make sure it is easily accessible and works smoothly, and aim to
optimize the performance of their organisation’s big data ecosystem.
• They might also run some ETL (Extract, Transform and Load) on top of big
datasets and create big data warehouses that can be used for reporting or
analysis by data scientists. Beyond that, because Data Engineers focus
more on the design and architecture, they are typically not expected to
know any machine learning or analytics for big data.
• Skills and tools: Hadoop, MapReduce, Hive, Pig, MySQL, MongoDB,
Cassandra, Data streaming, NoSQL, SQL, programming.
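MapReduce, listed among the tools above, follows a map, shuffle and reduce pattern. The sketch below imitates that pattern in plain Python on a word-count task; it runs on one machine and is only meant to show the shape of the computation, not how Hadoop itself is programmed. The input lines are invented.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data big insight", "data science"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle across the network.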
16. Data Scientist and Data Engineer
• There is a great deal of overlap between these two roles.
• For instance, a data scientist might use the Hadoop ecosystem to
serve up answers to their data questions, and a data engineer might
be programming an iterative machine learning algorithm to run over a
Spark cluster.
• Some companies, research centres or governments prefer that
candidates are comfortable with aspects from both data science and
data engineering. Additionally, if a company, research centre or
government has defined these two roles separately, it can be
possible to switch from one role to the other.
17. Key Skill Areas for a Graduate in Data Science
and Engineering
• For a Graduate in Data Science and Engineering, the core computer
science and statistics courses should cover: Process Mining, Data Mining,
Algorithms, Visualization, Real-life data challenges, Statistics for Big Data,
Statistical Learning Theory, and Probability and Stochastic Processes.
• For a Graduate in Data Science the core courses should cover: Database
and Cloud Computing technology for Big Data; Data Mining, Statistics and
Predictive Modeling; Machine Learning and Graph Analytics; Information
Retrieval and Natural Language Processing; Business Intelligence and Visual
Analytics; Data Warehousing and Decision Support; Communication and
Visualization of Results; Privacy, Security and Ethics; and Entrepreneurship
and Data Product Design.
• For a Graduate in Data Engineering, additional courses can be drawn from a
graduate program in software engineering to add to the common courses
of a graduate program in data science and engineering above.
18. Importance of Data Science
• We live in a digitized world in which massive amounts of data are harvested daily
to inform actions and policies for the future.
• We build sophisticated systems to collect, organize, analyze, and share data.
• We each have unlimited access to huge amounts of information and the tools to
interpret it.
• We are more aware than ever how molecules and cells move, how inflation
fluctuates, and how the flu travels, all in real time.
• We can efficiently distribute bus stations and plan transit schedules.
• With the right tools, we can predict how proteins misfold in our brains, or what
our galaxy might look like in a thousand years.
• In a society driven by data, knowledge is a commodity that is created and shared
transparently all over the world.
19. Disciplinary Trends
• Data science is a rapidly growing field with an increasing demand in
industry, research, and government.
• A recent McKinsey Global Institute study states that the US will face a
shortage of about 190,000 data scientists and 1.5 million managers and
analysts who can understand and make decisions using big data by 2018.
• In a recent MIT Sloan Management Review survey, four in ten (43%)
companies report their lack of appropriate analytical skills as a key
challenge.
• The ideal data scientist is a scientist with entrepreneurial skills, who is
used to asking the right business questions, understands the techniques
and is familiar with the tools for solving them.
20. Turning Data into Insight
• From government, social networks and ecommerce sites to sensors,
smart meters and mobile networks, data is being collected at an
unprecedented speed and scale.
• The networked world is generating big data that no human, or group
of humans, can process fast enough.
• This big data has the potential to transform the way business,
government, science and healthcare are carried out.
• Data science holds the key to unlocking that potential, i.e. data
science can put big data to use.
21. The Evolving Role of Data Science
• In the social sciences, modern research problems demand analysis beyond
traditional statistical hypothesis testing. Students are increasingly faced
with the prospect of building their own analysis software and
methodologies.
• In the life sciences, vast quantities of data generated by new
Deoxyribonucleic acid (DNA), Ribonucleic acid (RNA), and protein
sequencing technologies have engulfed biologists and chemists, who rarely
have training in statistics or computer science.
• Physicists, who traditionally have the most computational training, are
tackling data sets of orders of magnitude larger than the previous
generation of researchers ever dealt with. As Bloom says, “big data is when
you have more data than you’re used to.”
22. Evolving Role of Data Science cont’d
• Why would customers go to physical shops if the majority of the
products can be bought online on Amazon - that in turn even
suggests products/articles that are bought by like-minded people?
• Why would future generations go to expensive financial advisors of
established banks, when Google often offers better financial advice by
analyzing search behavior using Google Trends?
• Understanding the needs of the new online society is key for
succeeding in today’s business world, and Data Science is one
approach towards data-driven decision making as opposed to using
“gut feelings”.
23. Scientific Method Vs Analytical Method
• The Scientific Method is a method of procedure that has characterized
natural science since the 17th century, consisting of systematic
observation, measurement, and experiment, and the formulation,
testing, and modification of hypotheses.
• The Analytical Method is a generic process combining the power of the
Scientific Method with the use of a formal process to solve any type of
problem.
24. Analytical Method
• The Analytical Method has nine steps:
• 1. Identify the problem to solve.
• 2. Choose an appropriate process. (THE KEY STEP)
• 3. Use the process to hypothesize analysis or solution elements.
• 4. Design an experiment(s) to test the hypothesis.
• 5. Perform the experiment(s).
• 6. Accept, reject, or modify the hypothesis.
• 7. Repeat steps 3, 4, 5, and 6 until the hypothesis is accepted.
• 8. Implement the solution.
• 9. Continuously improve the process as opportunities arise.
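The iterative core of these steps (steps 3 to 7) can be sketched as a loop. Everything in the example is hypothetical: the stock-level problem, the demand figures and the 5% acceptance threshold are invented to show the accept-or-reject cycle, and real hypotheses would normally be refined rather than simply taken from a fixed list.

```python
def analytical_loop(candidates, experiment, accept):
    """Try hypotheses in turn until one passes the experiment (steps 3 to 7)."""
    for hypothesis in candidates:          # step 3: hypothesise a solution
        result = experiment(hypothesis)    # steps 4-5: design and run the experiment
        if accept(result):                 # step 6: accept or reject
            return hypothesis, result
    return None, None                      # nothing accepted; revisit the process

# Hypothetical problem: which daily stock level keeps shortages to 5% of days or fewer?
demand = [8, 12, 9, 15, 11, 10, 14, 9, 13, 10]   # invented observations

def shortage_rate(stock):
    """Fraction of observed days on which demand exceeded the stock level."""
    return sum(d > stock for d in demand) / len(demand)

chosen, rate = analytical_loop([10, 12, 14, 16], shortage_rate,
                               lambda r: r <= 0.05)
print(chosen, rate)
```

Steps 1, 2, 8 and 9 (choosing the problem and process, implementing, and improving) sit outside the loop and remain human decisions.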
25. Examples of Data Science in Action
Problems that we used to solve using operations research techniques
are now better solved using data science techniques.
• Planning and forecasting:
• identifying possible future developments in telecommunications
• identifying possible future developments in banking
• deciding how much capacity is needed in a holiday business
• Marketing: evaluating the value of sales promotions, developing
customer profiles and computing the lifetime value of a customer.
• Credit scoring: deciding which customers offer the best prospects for
credit companies.
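One of the marketing tasks above, computing the lifetime value of a customer, can be illustrated with a simple discounted-sum calculation. The formula used here is one common simplification, chosen as an assumption for this sketch: a constant yearly margin, a constant retention rate and a constant discount rate. All parameter names and figures are illustrative.

```python
def lifetime_value(margin_per_year, retention, discount_rate, years):
    """Expected discounted margin from one customer over `years` years.

    Assumes a constant yearly margin, a constant probability `retention`
    of the customer staying each year, and a constant discount rate.
    """
    total = 0.0
    for t in range(years):
        survival = retention ** t                     # chance the customer is still active in year t
        discount = (1 + discount_rate) ** t           # time value of money
        total += margin_per_year * survival / discount
    return total

# Invented figures: 100 margin per year, 80% retention, 10% discount rate, 5 years.
print(round(lifetime_value(100.0, 0.8, 0.10, 5), 2))
```

In practice the retention rate and margin would themselves be estimated from customer data, which is where the customer profiles mentioned above come in.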
26. Examples of Data Science in Action cont’d
• Scheduling:
• of aircrews and the fleet for airlines
• of vehicles in supply chains
• of orders in a factory
• of operating theatres in a hospital
• Yield management:
• setting the prices of airline seats and hotel rooms to reflect changing demand
and the risk of no shows
• Facility planning:
• computer simulations of airports for the rapid and safe processing of
travellers
• improving appointments systems for medical practice.
• Defense and peace keeping: finding ways to deploy troops rapidly.
27. Big Data Analytics
• Big data analytics is the process of examining large data sets to
uncover hidden patterns, unknown correlations, market trends,
customer preferences and other useful business information.
• Big data analytics is used in many industries to allow companies and
organizations to make better business decisions and in the sciences to
verify or disprove existing models or theories.
• Computing power is needed for big data analytics
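As a small-scale illustration of "uncovering correlations", the sketch below computes the Pearson correlation coefficient between two series. The advertising and sales figures are invented; on genuinely large data sets the same calculation would be run on distributed infrastructure rather than in a single Python process.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented figures for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
print(pearson(ad_spend, sales))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one; correlation alone does not show that one quantity causes the other.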
28. Why Big Data Analytics is Important
• To maximize the discovery potential, we must employ advanced big
data analytics methods and algorithms, visualization techniques, and
high-performance computing.
• The unprecedented and multifaceted challenges demand
advanced big data analytics skills in statistics, data mining, machine
learning, signal/image processing and visualization, data management
and programming.
• These skills bridge several disciplines and push research frontiers:
from the methods disciplines of computer science, electrical
engineering, applied mathematics, and statistics to domain disciplines
across science and engineering.
29. Why Big Data Analytics is Important Cont’d
• Big data analytics examines large amounts of data to uncover hidden
patterns, correlations and other insights.
• With today’s technology, it’s possible to analyze data and get answers from
it almost immediately whereas the traditional business intelligence
solutions are slower and less efficient.
• Big data analytics helps organizations harness their data and use it to
identify new opportunities. That, in turn, leads to smarter business moves,
more efficient operations, higher profits and happier customers.
• Businesses can learn key insights about their customers to make informed
business decisions.
• Scientists can discover previously unknown patterns hidden deep inside the
mountains of data.
30. Why Big Data Analytics is Important Cont’d
• Cost reduction. Big data technologies such as Hadoop and cloud-based
analytics bring significant cost advantages when it comes to storing large
amounts of data – plus they can identify more efficient ways of doing
business.
• Faster, better decision making. With the speed of Hadoop and in-memory
analytics, combined with the ability to analyze new sources of data,
businesses are able to analyze information immediately – and make
decisions based on what they’ve learned.
• New products and services. With the ability to gauge customer needs and
satisfaction through analytics comes the power to give customers what
they want. With big data analytics, more companies are creating new
products to meet customers’ needs.
31. Emphasis on a Few Application Areas of Big
Data Analytics
• E-Business
• Politics
• Informal Sector
• Healthcare Management
• Mobile Money
32. E-Business Systems
• E-business systems are a set of online technologies, equipment and tools
that a business uses to conduct business via the Internet. These systems
help a company/ organisation connect with customers, process orders and
manage information.
• For instance, one high-profit e-business system is a web-based retail store
where customers can purchase products online.
• Components of Business
• Business Process
• Managing Business and Firm Hierarchies
• The Business Environment
• The Role of Information Systems in Business
• Systems that Span the Enterprise
• Enterprise Applications
• Intranets and Extranets
• E-Business, E-Commerce and E-Government
33. Politics and Big Data Analytics
• Winning politics is now tied to big data analytics
• One of the storylines in the November 2016 US presidential election
is how both major political parties used big data analytics to inform
their decisions and tried to get ahead.
• In winning the 2012 US presidential election, the Obama campaign
successfully employed big data analytics to influence people and get
them to vote. Analytics experts say enterprises can apply these same
tactics to influence customers and drive sales.
• The 2012 US Presidential election was a watershed event for
leveraging technology in the political arena. Both the Obama and
Romney campaigns relied heavily on technology, but many analysts
say the Obama campaign tapped into the power of big data analytics
more effectively.
34. The Informal Sector and Big Data Analytics
• The International Labor Organization's (ILO) Guidelines on Measuring the
Informal Sector use big data analytics techniques
• Knowing the size of the informal sector in any country/continent
helps in planning and deployment of key interventions
• Big data analytics is critical in informing strategies aimed at
transforming the informal sector to the formal sector
35. Healthcare Management and Big Data
Analytics
• The healthcare industry historically has generated large amounts of
data, driven by record keeping, compliance & regulatory
requirements, and patient care.
• Big data in healthcare refers to electronic health data sets so large
and complex that they are difficult (or impossible) to manage with
traditional software and/or hardware; nor can they be easily
managed with traditional or common data management tools and
methods
• Big data analytics in healthcare is evolving into a promising field for
providing insight from very large data sets and improving outcomes
while reducing costs.
• Big data analytics in healthcare has great potential despite the various
challenges to overcome.
36. Mobile Money and Big Data Analytics
• Mobile money providers, particularly mobile operators, are sitting on two
gold mines of data: one from their core GSM operations (Telco Call Detail
Record (CDR) data, detailed coordinates of their Cell IDs, etc.) and one from
their mobile money operation (Know Your Customer (KYC) data for
customers, agent registration forms, transactional databases, etc.).
• Uncovering, analyzing and transforming mobile money data into action:
• Big data analytics can help in understanding how issues like customer demographics,
usage in the first month after sign-up and quality of agents impact ongoing customer
activity.
• Big data analytics can also yield very powerful insights to track mobile money fraud,
how to better manage an agent network, manage float and cash, drive the marketing
expenditures, etc.
• Big data analytics can feed into most of the key business decisions a mobile money
manager can make.
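The first point above, relating usage in the first month after sign-up to ongoing activity, can be sketched as a simple grouped comparison. The customer records, the bucket boundaries and the six-month activity flag are all invented for illustration; a real analysis would draw on the operator's transactional databases.

```python
from collections import defaultdict

# Hypothetical records: (transactions in first month, still active after six months).
customers = [(0, False), (1, False), (1, True), (5, True),
             (6, True), (0, False), (7, True), (2, False)]

def bucket(tx):
    """Group customers by first-month usage (boundaries chosen arbitrarily)."""
    if tx == 0:
        return "0"
    return "1-4" if tx < 5 else "5+"

def activity_by_bucket(customers):
    """Share of customers in each usage bucket who are still active."""
    totals, active = defaultdict(int), defaultdict(int)
    for tx, is_active in customers:
        b = bucket(tx)
        totals[b] += 1
        active[b] += is_active   # True counts as 1, False as 0
    return {b: active[b] / totals[b] for b in totals}

rates = activity_by_bucket(customers)
for b in ("0", "1-4", "5+"):
    print(b, round(rates[b], 2))
```

A breakdown like this could suggest, for instance, whether encouraging early transactions is worth the marketing spend, though the pattern would need to be checked on real data.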
37. Conclusion
• Data Science continues to evolve as a multifaceted discipline
• The demand for Data Scientists and Data Engineers is growing by leaps and
bounds every passing day
• According to the McKinsey Global Institute, the U.S. alone could face a shortage
of about 190,000 professionals with data science skills by 2018.
• McKinsey Global Institute found that sectors such as computer and electronic
products and information, finance and insurance, and government will likely gain
the most value from using big data, and thus employ many of the world’s data
scientists.
• Data scientists will be vital to all sectors in coming years—from law and medicine
to media and nonprofits.
Thank You.
END
Editor's notes
datascience.nyu.edu
Data Science Skill Set, T. Stadelmann et al., Applied Data Science in Europe.