This session covers the key roles in a data-driven organization, the core skill set each role requires, and how the roles depend on each other.
Agenda:
● Introduction
● Key Roles in a Data-Driven Organization
○ Data Analyst
○ Data Engineer
○ Applied ML Engineering
■ Data Scientist
■ Statistician
■ Applied ML Engineer
■ Ethicist
■ Social Scientist
■ Researcher
○ Tech Lead
■ Analytics Manager
■ Decision Maker
Introduction
● Data science jobs are among the hottest jobs
of the 21st century, and demand for them is increasing
by the day
● In industry, we come across many different
data science roles
● It’s tough to get a general understanding of
how they differ in terms of skill sets and what
they work on
● A brief overview of the key job roles, the
responsibilities of each title, and the associated
skills and qualifications helps in understanding
the roles in the data science field.
Data Engineer
● For many organizations, Data Engineers are the first hires on a data team.
● Data Engineers develop, construct, test, and maintain database and system architectures.
● They gather data from websites through web scraping, from APIs, or from IoT devices and ingest it
into the data warehouse.
● Data Engineers create ETL (Extract, Transform, and Load) processes to make sure that the data gets into
the data warehouse.
● They are responsible for building efficient data pipelines.
Skill sets:
● Big data tools: Hadoop, Spark, Kafka, etc.
● SQL and NoSQL databases such as PostgreSQL, Cassandra, and MongoDB.
● Programming languages: R, Python, C/C++.
● Cloud Services
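To make the ETL idea concrete, here is a minimal sketch using only the Python standard library. The sample records, the `readings` table, and the function names are assumptions made for illustration, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3


def extract():
    # Hypothetical raw records, standing in for data pulled from an API,
    # a web scraper, or an IoT device.
    return [
        {"device_id": "a1", "temp_c": "21.5"},
        {"device_id": "b2", "temp_c": "bad"},
        {"device_id": "c3", "temp_c": "19.0"},
    ]


def transform(records):
    # Convert readings to floats and skip rows that do not parse.
    clean = []
    for rec in records:
        try:
            clean.append((rec["device_id"], float(rec["temp_c"])))
        except ValueError:
            continue
    return clean


def load(rows, conn):
    # Write the cleaned rows into the target table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load(transform(extract()), conn)


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_etl(conn)
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(row_count)
```

In practice each stage would usually be scheduled and monitored by a pipeline tool, but the extract, transform, load split stays the same.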
Data Analyst
● A Data Analyst collects and processes data, performs statistical analysis on it, and creates visualizations.
● Analysts perform feature engineering and feature selection, and clean the data using programming
languages, spreadsheets, and business intelligence tools to describe and categorize the data.
● An analyst manages the master data that is collected, including creating, updating, and deleting records and
processing confidential data.
● Analysts create reports and analyses, and provide expertise on data storage structures, data mining, and data
cleaning.
Skill sets:
● Structured Query Language (SQL) and databases
● Data mining and cleaning
● Data analysis and visualization
● R or Python programming
● Presentation skills
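As an illustration of the cleaning and descriptive analysis an analyst performs, the sketch below groups some records and reports simple statistics per group. The `sales` data and the `summarize_by_region` helper are hypothetical, and a real analysis would more likely use SQL or a BI tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records; one row has a missing amount.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 100.0},
    {"region": "south", "amount": None},
]


def summarize_by_region(rows):
    # Cleaning step: drop rows with a missing amount, then group by region.
    groups = defaultdict(list)
    for row in rows:
        if row["amount"] is not None:
            groups[row["region"]].append(row["amount"])
    # Describe each group with a count, a mean, and a total.
    return {
        region: {"count": len(vals), "mean": mean(vals), "total": sum(vals)}
        for region, vals in groups.items()
    }


report = summarize_by_region(sales)
for region, stats in sorted(report.items()):
    print(f"{region}: n={stats['count']} mean={stats['mean']:.1f} total={stats['total']:.1f}")
```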
Statistician
● Statisticians are professionals who apply statistical methods and models to real-world problems.
● They gather, analyze, and interpret data to aid in many business decision-making processes.
● Statisticians are valuable employees in a range of industries, and often seek roles in areas such as business,
health and medicine, government, physical sciences, and environmental sciences.
● Daily tasks are likely to include:
○ Collecting, analyzing, and interpreting data
○ Identifying trends and relationships in data
○ Designing processes for data collection
○ Communicating findings to stakeholders
○ Advising organizational and business strategy
○ Assisting in decision making
Skill sets:
● Statistical theory and methods, data mining, and machine learning
● Distributed Computing (Hadoop)
● Databases (SQL and NoSQL)
● R, Python, and Spark programming
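One common way to identify a relationship between two variables is the Pearson correlation coefficient. The sketch below computes it by hand; the `ad_spend` and `revenue` figures are made up for illustration, and a statistician would usually reach for R or a statistics library instead.

```python
from math import sqrt


def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (the 1/n factors cancel out).
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Hypothetical data: advertising spend versus revenue.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(ad_spend, revenue)
print(f"r = {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none. Correlation alone does not establish causation.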
Applied ML Engineer
● A Machine Learning Engineer bridges the gap between the Data Scientist’s work and the
production environment.
● A Machine Learning Engineer is primarily concerned with deploying production-ready models.
● Removes errors from data sets and finds suitable data representation methods.
● Deploys the machine learning model so it can be integrated into the application or website.
● Scales and optimizes the model for production.
● Monitors and maintains deployed models.
Skill sets:
● Probability & Statistics
● Data Modeling and Evaluation.
● MLOps.
● Applying Machine Learning algorithms and libraries (TensorFlow, PyTorch)
● Software Engineering and system design (AWS, Azure, GCP)
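The deploy-and-monitor responsibilities can be sketched roughly as follows: a trained model is saved as an artifact, reloaded inside a service, and wrapped with input validation and simple counters. `LinearModel` and `ModelService` are hypothetical names invented for this example, and the toy linear model stands in for whatever the data scientist actually trained.

```python
import json


class LinearModel:
    """Toy model: prediction = bias + sum(weight * feature)."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        if len(features) != len(self.weights):
            raise ValueError(
                f"expected {len(self.weights)} features, got {len(features)}"
            )
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def to_json(self):
        # Serialize the parameters so the model can be shipped as an artifact.
        return json.dumps({"weights": self.weights, "bias": self.bias})

    @classmethod
    def from_json(cls, payload):
        data = json.loads(payload)
        return cls(data["weights"], data["bias"])


class ModelService:
    """Wraps a loaded model and keeps simple monitoring counters."""

    def __init__(self, model):
        self.model = model
        self.request_count = 0
        self.error_count = 0

    def handle(self, features):
        self.request_count += 1
        try:
            return {"ok": True, "prediction": self.model.predict(features)}
        except ValueError as exc:
            self.error_count += 1
            return {"ok": False, "error": str(exc)}


# "Training" side: export the artifact. "Production" side: load and serve it.
artifact = LinearModel([0.5, 2.0], 1.0).to_json()
service = ModelService(LinearModel.from_json(artifact))
good = service.handle([2.0, 1.0])
bad = service.handle([2.0])  # wrong number of features
print(good, bad, service.request_count, service.error_count)
```

The counters are a stand-in for real monitoring. In production the same idea is usually handled by an MLOps stack that tracks latency, error rates, and data drift.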
Data Scientist
● A Data Scientist builds on the visualizations provided by the data analytics team to build and
optimize classifiers using machine learning techniques.
● Thoroughly cleans data to discard irrelevant information and prepares the data for preprocessing and
modeling.
● Performs exploratory data analysis (EDA) to determine how to handle missing data.
● Discovers new algorithms to solve problems and builds programs to improve current strategies.
● Performs feature engineering and feature selection, and applies analytical, machine learning, and
statistical methods to prepare data for use in predictive and prescriptive modeling.
Skill sets:
● Programming: Python, Java
● Applying Machine Learning algorithms and libraries (scikit-learn, TensorFlow, PyTorch)
● Predictive Modeling
● Mathematics and Statistics
● Effective Communication
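As a small example of handling missing data during EDA, the sketch below counts missing values per column and then fills them with the column mean. Mean imputation is only one of several possible strategies, and the `rows` data and helper names are assumptions made for illustration.

```python
from statistics import mean


def missing_counts(rows, columns):
    # EDA step: count how many values are missing in each column.
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}


def impute_mean(rows, column):
    # Replace missing values in one column with the mean of the observed ones.
    observed = [r[column] for r in rows if r.get(column) is not None]
    if not observed:
        raise ValueError(f"no observed values in column {column!r}")
    fill = mean(observed)
    # Build new dicts so the original rows are left untouched.
    return [
        {**r, column: fill if r.get(column) is None else r[column]}
        for r in rows
    ]


# Hypothetical records with gaps.
rows = [
    {"age": 30, "income": 50000.0},
    {"age": None, "income": 62000.0},
    {"age": 40, "income": None},
]

counts = missing_counts(rows, ["age", "income"])
cleaned = impute_mean(impute_mean(rows, "age"), "income")
print(counts)
print(cleaned)
```

Whether to impute, drop, or flag missing values depends on how much data is missing and why, which is exactly what the EDA step is meant to reveal.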
Ethicist
● Data ethics is a cross-cutting discipline that assesses the wider societal impact of technology, producing
recommendations for technologists and data professionals. It involves thinking about fairness,
accountability, the law, moral dilemmas, and the risks involved in creating technology and data products
and policies.
● Embedding a Data Ethicist in teams enables Data Engineers and Data Scientists to innovate responsibly and respond to
the ongoing demand for implementing data ethics best practices.
● This critical role has been extremely successful in recent years in the private sector, and has been
instrumental in the development of high-risk data and artificial intelligence (AI) products.
● Skill Sets:
○ communication skills (data)
○ applied knowledge of social sciences
○ stakeholder relationship management
○ analysis and synthesis (data ethics)
○ bridging the gap between the technical and non-technical (data ethics)
○ product development (data ethics)
○ empathy and inclusivity
○ ethics and privacy
○ problem-solving
○ facilitating decisions and risks
Social Scientist
● AI has the potential to bring diverse benefits for our health, safety, and general well-being.
● A Social Scientist researches the link between AI and its societal impact.
● They can identify potential uses of AI by considering the societal implications of these technologies.
● Such individuals may be especially equipped to spot problems in AI that aggravate long-ingrained
prejudices.
● They have proper domain knowledge of the problem statement for which AI is used.
Researcher
● AI researchers conceptualize and explore new ways of leveraging data by developing new AI algorithms,
i.e., they create and ask new questions that can be answered using AI.
● AI researchers focus on finding innovative ways to analyze data for automated decision-making
and action.
● AI researchers research novel forms of AI technology to create new applications that use data to drive
independent actions.
● Skill Set:
○ AI programming skills: Coding skills are a given for any professional in
the AI and data science domain. Commonly used programming languages for AI development include
Python, Lisp, Prolog, R, C/C++, and Java. Of these, Python is the most preferred by both tech
companies and AI researchers, possibly because of its ease of use.
○ Analytical thinking: Since artificial intelligence is closely intertwined with data analysis, analytical skills
are necessary for potential AI researchers. Having good analytical skills translates into the ability to
■ make sense of data
■ verify the validity of the data gathered
■ identify connections between different variables, and
■ form logical conclusions based on the available data.
Analytics Manager
● The complete cycle revolves around the enterprise goal.
● Identify the key business variables that the analysis needs to predict.
● Define the project goals by asking and refining "sharp" questions that are relevant, specific, and
unambiguous.
● Find the relevant data that helps you answer the questions that define the objectives of the
project.
● An Analytics Manager manages a team of analysts and data scientists.
Skill sets:
● R, Python, SQL, SAS, and Java programming
● Leadership & project management
● Data Mining & Predictive modeling
● Interpersonal Communication
Decision Maker
● Real-world data sets are often noisy, are missing values, or have a host of other discrepancies.
● The aim is to produce a clean, high-quality data set whose relationship to the target variables is
understood.
● Develop a solution architecture for a data pipeline that refreshes and scores the data regularly.