A Hybrid Approach to Data Science Project Management
Elaine Lee
Open Data Science Conference, Boston 2015
@opendatasci
Building a Data-Driven World™
Open Data Science Conference
elee@civisanalytics.com
@elaineklee
Open Data Science Conference #ODSC

A Common Problem With Many Faces
Organizations want to be data-driven, but many obstacles stand in their way:
• Communication not trickling up to executives and key decision makers
• Silos between departments, making it difficult to share and collaborate on analysis
• Data ingestion (ETL, or Extract-Transform-Load) is difficult and time-consuming
• Lack of meaningful yet customizable visual reporting
• Inability to flexibly scale technological needs up or down at a reasonable cost
• Inadequate or overwhelming learning resources about data science

Mapping the Uninsured in America
Where should Enroll America direct its insurance signup efforts?

We ran the first individualized presidential campaign
As a company, Civis traces its origins to the 2012 Obama for America analytics team.
We built a scientific understanding of each voter.
Our data science influenced every strategy and tactic: voter targeting, messaging, media buys, and fundraising.
This meant the campaign could allocate resources where impact would be greatest.
Civis Analytics | Proprietary and Confidential

Today, we leverage data science to help our clients in politics, non-profits, and the corporate world.

Introducing Civis
An easy-to-use, end-to-end, incredibly extendable data science platform in the cloud for teams who want to make great data-driven decisions to drive their organizations forward.

The Civis Approach
Product, Consulting, R&D
Applied Data Science
• Tackles the toughest data science problems we can find
Data Science R&D
• Generalizes and automates the solution for many scenarios
Software Engineering
• Integrates solutions into user-empowering software
Across departments
• Highly collaborative departments
• All departments contribute to both our services arm and product development

The Civis Approach
Our unique team structure allows us to solve your biggest problems with custom solutions and the technology to scale them.

Data Science R&D
Strategies and philosophies
• Teams based on Civis’s product and consulting needs: Modeling Methodology, Unstructured Data, Engineering
• “Built around code”
• Semi-annual, day-long departmental off-sites to plan upcoming R&D initiatives
• Academia-influenced: evidence-based approaches to finding and reporting best solutions
• Software development-influenced: standups, code review
• Favorite tools: Hipchat, GitHub, Jupyter, Google Drive, Asana

Data Science R&D
Tools: Hipchat, GitHub
• Share and discuss data science news
• Receive feedback from colleagues using our tools
• Discuss implementation
• Lower communication costs compared to email

Data Science R&D
Tools: Jupyter, Google Drive
• Prototype new workflows
• Used like a log book to record and present results
• Share preliminary results with members of other departments

Data Science R&D
Tools: Google Drive, Asana
• Department heads set milestones, check progress, and make project staffing decisions
• Collaboratively plan development on new functionality or organizational processes (e.g. recruiting)

R&D <-> ADS
Tools: Trello
Strategies
• Designate a “tag team” on R&D as the default R&D resource for client engagements
• This is the Modeling Methodology team
• Other R&D teams’ members may be staffed on engagements depending on the expertise required
• An R&D team member always serves as the Consulted in the RACI model
• Transparency about challenges is paramount
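
The RACI roles mentioned on this slide can be written down explicitly. The following is a minimal sketch, assuming the role assignments given in the speaker notes (Applied Data Scientists as Responsible, the Applied Data Science Manager as Accountable, an R&D data scientist as Consulted, the client as Informed). The names `Raci`, `ENGAGEMENT_ROLES`, and `validate_raci` are hypothetical, and the "exactly one Accountable" check is a common RACI convention rather than something stated in the talk.

```python
from enum import Enum


class Raci(Enum):
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Illustrative assignment for a client engagement, following the speaker notes.
ENGAGEMENT_ROLES = {
    "Applied Data Scientists": Raci.RESPONSIBLE,
    "Applied Data Science Manager": Raci.ACCOUNTABLE,
    "R&D data scientist": Raci.CONSULTED,
    "Client (Enroll America)": Raci.INFORMED,
}


def validate_raci(roles):
    """Return a list of problems; an empty list means the chart looks usable."""
    problems = []
    # Assumed convention: a single Accountable party per engagement.
    accountable = [who for who, role in roles.items() if role is Raci.ACCOUNTABLE]
    if len(accountable) != 1:
        problems.append(f"expected exactly one Accountable, found {len(accountable)}")
    # The slide requires a Consulted R&D member; someone must also do the work.
    for needed in (Raci.RESPONSIBLE, Raci.CONSULTED):
        if not any(role is needed for role in roles.values()):
            problems.append(f"no one assigned as {needed.name.title()}")
    return problems


if __name__ == "__main__":
    print(validate_raci(ENGAGEMENT_ROLES) or "RACI chart is complete")
```

A check like this could run whenever a project is staffed, so that an engagement without an R&D Consulted is flagged before work starts.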

R&D <-> ADS: A Case Study
Mapping the Uninsured in America
1. Assemble a project team of R&D data scientists and Applied Data Scientists.
2. Work with Enroll America to refine requirements and come up with a plan of analysis, ultimately resulting in the design and execution of a phone survey on a sample of individuals, followed by building a predictive model for the rest of the country.
3. The Applied Data Science Manager has weekly calls with Enroll America and status meetings with the project team.
4. The project team delivers the predictions and analysis to Enroll America.
The project team completes a postmortem and determines which activities could be automated: model building.
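
The postmortem finding (similar models rebuilt for each survey wave, with only the input data changing and the data processing steps duplicated across scripts) can be illustrated with a small sketch. This is not Civis's actual pipeline or model: the field names `age_group` and `uninsured`, the functions `preprocess`, `build_model`, and `predict`, and the toy per-group rate "model" are all assumptions made for illustration, and the survey rows are made-up data.

```python
from collections import defaultdict


def preprocess(rows):
    """Shared cleaning step: drop incomplete rows and normalize field types."""
    cleaned = []
    for row in rows:
        if row.get("age_group") is None or row.get("uninsured") is None:
            continue
        cleaned.append({
            "age_group": str(row["age_group"]).strip().lower(),
            "uninsured": bool(row["uninsured"]),
        })
    return cleaned


def build_model(rows):
    """Fit a toy 'model': the observed uninsured rate per age group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [uninsured count, total]
    for row in preprocess(rows):
        tally = counts[row["age_group"]]
        tally[0] += row["uninsured"]
        tally[1] += 1
    return {group: uninsured / total for group, (uninsured, total) in counts.items()}


def predict(model, age_group, default=0.0):
    """Look up the estimated uninsured rate for a group."""
    return model.get(age_group.strip().lower(), default)


# Made-up survey waves: the second wave extends the first with new responses.
wave_1 = [
    {"age_group": "18-34", "uninsured": True},
    {"age_group": "18-34", "uninsured": False},
    {"age_group": "35-64", "uninsured": False},
    {"age_group": None, "uninsured": True},  # incomplete response, dropped
]
wave_2 = wave_1 + [
    {"age_group": " 18-34 ", "uninsured": True},
    {"age_group": "35-64", "uninsured": True},
]

# Every wave goes through the same code path; only the input data differs.
models = {name: build_model(rows) for name, rows in
          {"wave_1": wave_1, "wave_2": wave_2}.items()}
```

Keeping preprocessing and fitting behind one function means a new wave of results only requires a new call, not a new script.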

R&D <-> Tech
Tools: Hipchat, GitHub
Strategies
• Designate teams at the interface to triage issues and plan new development:
• R&D: “Engineering” team
• Tech: “Modeling” team
• Use module- or project-specific chatrooms to get answers to ad hoc questions quickly
• Identify opportunities to form cross-functional teams, e.g.:
• Developing apps using the Platform’s API
• Knowledge sharing on best practices

R&D <-> Tech: A Case Study
Mapping the Uninsured in America
1. After the postmortem for the Enroll America engagement, R&D begins prototyping automated modeling functionality and discussing its implementation with the Tech department.
2. R&D’s Engineering team finishes the prototype and works with Tech’s Modeling team to integrate it as a new feature in the Platform.
3. During integration, ad hoc discussions occur on GitHub and Hipchat to address usability questions, e.g. resource usage and input/output specifications.
The integration team successfully builds and integrates the Build Model module in the Platform.

Conclusion
Our approach to data science consulting and product development is enriched by the valuable perspectives of our employees, who come from a wide array of backgrounds, making our project management strategies a hybrid of more conventional techniques.

Editor's notes

  1. Hi everyone, it’s great to be here. My name is Elaine Lee. I am a Data Scientist in the R&D department at Civis Analytics. Civis is a Chicago-based data science consulting and software startup, and I’m excited to tell you a little bit about our company and the work that we do. In particular, I’ll be talking about how the R&D department juggles concurrent development of both our consulting services and our cloud-based data science platform. I’ll be emphasizing approaches borrowed from other, more established industries as they pertain to department projects as well as interdepartmental collaborations.
  2. Many of you are already familiar with data science and the potential it has to change the way things are done. However, data science has a high barrier to entry for some teams, from both a technical and an organizational standpoint. It can be difficult to wrap your head around the technical needs and quantitative concepts that go into data science. In addition, it can be hard to assemble the right team to do data science and to keep the work organized. Picture a team of data scientists working on the same project. Some of them have written R or Python scripts to process the data, do feature engineering, and build models on it. Some of them have taken the results of the models and produced charts and visualizations in Excel, Tableau, or D3. All the work is being kept in a few different places: Dropbox, Google Drive, GitHub, MySQL, … It is difficult for this hypothetical team to figure out what exactly has been done and, even worse, what efforts have been duplicated. It is also incredibly difficult to validate the analysis. Does this sound familiar to anyone? Many of us at Civis Analytics have faced these challenges in our previous work, but fortunately we’ve made those challenges a thing of the past! It didn’t happen overnight, but we were constantly coming up with new ideas to improve the data science workflow by, well, working on a variety of consulting projects and researching new methods. Today I will talk about what some of these ideas are. In addition, I will tell the story of how one client engagement provided us with a valuable exercise in collaboration and in data science best practices that we’ve since internalized.
  3. Throughout my talk today, I will be using our project with Enroll America to illustrate a lot of my concepts. Enroll America was one of our first clients in 2013. They wanted our help identifying Americans without health insurance so they knew where to direct their outreach. This was a challenging problem because of its large scope (they wanted to do outreach throughout the country!) and because it wasn’t obvious what was predictive of being uninsured. Why did Enroll America specifically seek us out to solve this problem?
  4. Let’s talk a little about what expertise Civis has for tackling problems like Enroll America’s. The founding members of Civis Analytics were part of the Obama for America analytics team during the 2012 re-election campaign. There, we developed the beginnings of a framework for doing person-level analytics (which is highly relevant for Enroll America). With scientific levels of rigor, we built models to understand all sorts of relevant vote-related behaviors in order to better identify and persuade supporters, which translated to optimizing how the campaign’s resources were used. The campaign spanned many months, and during that time lots of models were being built and refined; their results were constantly being sent to those in the field to act upon. Developing an organized and repeatable workflow was crucial for minimizing costs, time spent (the staff was small), and inadvertent human error, especially with models being built at such a large scale.
  5. After the campaign ended in 2012, we re-examined the strategies we employed and the problems they solved. We realized that if we generalized them, we could solve similar problems for clients in the political, non-profit, and corporate worlds. Which is exactly what Civis did. What you see here is a sample of clients, in addition to Enroll America, that we have helped better target their advertising dollars, identify potential customers for greener sources of electricity, and determine public awareness and sentiment on their brand or cause. In the past year, we took it a step further and we formed a partnership with Discovery Communications to inform more sophisticated audience targeting approaches, ratings forecasting, and marketing spend. We anticipate making more partnerships like this in the future. The examples I gave are all problems with a similar flavor to what Civis successfully solved in 2012 – identifying and reaching the people you care about most.
  6. Our diverse client portfolio, innovative approaches, and proven track record have made Civis Analytics’ consulting services highly sought after in the predictive analytics space. However, we’re equally passionate about removing obstacles to doing data science. Our steady client pipeline enables us to formalize our approach in the form of a cloud-based data science application. Our software, Civis, or “the Platform”, supports the entire workflow of a typical data science project, from data warehousing to data processing to predictive modeling to reporting. This enables organizations to easily take control of their own data and unlock their insights.
  7. This is how we turn our client work experiences into software. We select novel problems brought forth by our clients and work with them to deliver a solution. This is primarily addressed by our Applied Data Science department. At the same time, we conduct research and experiment with different methods to solve the problem, with one eye towards determining how to generalize the solution. This is primarily done by the Data Science R&D department. Finally, solutions are integrated into our software platform by the Software Engineering, or Tech, department. Users of our software platform (clients and our Applied Data Scientists) provide us with valuable feedback, which is continuously incorporated. This unique, synergistic cycle enables us to deliver high-quality results to our customers.
  8. In our day-to-day work, all departments pitch in on both lines of business, ensuring fluency on all the company’s offerings and thus better decision making. We also collaborate across departments on all projects, big or small. Today I will be focusing on how my department, the DS R&D department, manages its workload and how it works with the Applied Data Science and Tech departments.
  9. The R&D department is the only department that is intimately aligned with both lines of business. We’re split into 3 different teams. Modeling Methodology focuses on developing new modeling workflows. Unstructured Data specializes in data that can’t neatly be summarized by a flat file, like text data. Engineering is responsible for managing our production codebases of new features for our software product. Our department is “built around code”: “We're trying to build up knowledge and best practices, and being built around code lowers our communication costs, errors, redundancy, and facilitates us making software.” To roadmap what we build, based on what we’ve learned from recent client engagements, we have day-long semi-annual department off-sites. When developing new methodologies, we use an academia-influenced approach: empirical and thorough, such that our recommended solution covers all the edge cases. When building out workflows, we follow guidelines common to most software development projects, including some ideas from the Agile methodology: we have daily standups to make sure everyone’s on the same page about the status of the codebases, and we do code reviews before any changes are shipped. Our standups are on a per-repository basis, so they don’t waste anyone’s time. To do our work, these are our favorite tools. Let’s take a look at how we use them.
  10. Hipchat and GitHub form the backbone of our communications. For those not familiar with these tools, Hipchat is an instant messaging tool for organizations. GitHub is a web interface, built on top of the version control system git, for teams to collaborate on a codebase. These tools are crucial to our philosophy of being built around code. They enable members across the company to participate by asking questions and generally weighing in. Departmental members use them to discuss implementation. These tools are much faster than email: it is easier to ask questions and get answers, since anyone who knows the answer can see the request and respond.
  11. When developing new methods, we like to use Jupyter and Google Drive. We use Jupyter for its IPython Notebook capabilities. It allows us to run Python code, especially modules from our codebase, interactively, and to chain components together to make new workflows. Jupyter also has presentation functionality, so we also use it as a log book to record and present results in internal meetings. Sometimes we also use Google Drive to record and share results with members of other departments, such as Applied Data Scientists, who have a vested interest in the project but don’t require all the details.
  12. Finally, to take the “pulse” of the R&D department as a whole, department heads use Google Drive and Asana for big-picture planning. Asana is a project management tool which gives department heads a bird’s-eye view of what each team member is working on and how each project is progressing. Google Drive tools are used to collaborate on planning documents, be it plans for new functionality to build or revisions to organizational processes, such as rewriting our hiring exam.
  13. That was how we, the R&D department, work together. How do we work with the Applied Data Scientists, the data scientists in our consulting arm? To make project staffing seamless, we designate a tag team to serve as the first point of contact for client engagements. This is the Modeling Methodology team. However, other R&D data scientists may be staffed on a project depending on the expertise required. The R&D data scientist always serves as the Consulted in the RACI model. The RACI model is a popular project management model used in consulting. It emphasizes explicit roles for each team member to ensure accountability. R is for Responsible, a role held by the Applied Data Scientists. A is for Accountable; this is the Applied Data Science Manager or project manager. C is for Consulted. And I is for Informed (the client). Lastly, we are open with Applied Data Scientists about R&D challenges in order to avoid schedule slips on the client engagement. The project plan is often tracked in Trello, a popular bulletin board app, with bulletin boards for each milestone’s requirements.
  14. Let’s revisit our client story, Mapping the Uninsured in America, to illustrate concretely how we work together. After Enroll America shared their problem with us, we assembled a project team of R&D data scientists and Applied Data Scientists to solve it. We worked with Enroll to refine the problem statement into a set of requirements, ultimately resulting in the design and execution of a phone survey on a sample of individuals, followed by building a model to capture the rest of the country. The project got under way. Throughout the project, the Applied Data Science Manager had weekly status calls with Enroll and with the project team to make sure we were on schedule. Occasionally we staffed a couple of extra data scientists on the project to make sure we delivered results on time when there was risk of a schedule slip. For example, we brought in an extra data scientist towards the end of the project to help produce graphs and visualizations of the results. Finally, we finished our analysis and presented our predictions to Enroll America. Afterwards, we did a post mortem and realized that automated model building would have made us more efficient. This is because we conducted our experiment in waves and built similar models as the results came in, with the only difference being the input data. Also, the analysts were each working on individual components of the analysis and writing their own R scripts, which had a lot of overlap (such as the data processing steps), so a lot of time was wasted.
  15. So that’s how we work with the Applied Data Scientists on consulting projects. How do we work with the Tech department? Much as we do with the Applied Data Science department, we’ve designated a team to interface with the Tech department, and they have done the same. That would be the Engineering team on our side and the Modeling team on their side. The Engineering team in Data Science are data scientists who speak software development, and the Modeling team in the Tech department are software engineers who speak data science. Most of our communication happens in module- or project-specific chatrooms and GitHub issue tickets, which gets answers quickly. To promote really inspired product development, we identify opportunities to form cross-functional teams, such as using the Platform’s API to develop new apps and teaching each other best practices for software development via brownbag sessions.
  16. Let’s revisit the Enroll America project for an example of how the R&D data scientists work with the software engineers. After the post mortem for the Enroll engagement, we began prototyping automated modeling functionality, communicating to the Tech department the motivation for it and including them in discussions about implementation and feasibility. Once we finished the prototype and ensured that it passed all the tests and code review, the Engineering team in R&D worked with the Modeling team in Tech to integrate it as a new feature in the Platform. We used GitHub and HipChat to discuss questions that came up, such as resource usage, input/output specifications, and the data visualizations we wanted to provide to the end user. Together, the R&D department and the Tech department successfully built and integrated the Build Model module that exists today in the Platform.
  17. In summary, a lot of our approaches have a common theme, which is minimizing communication costs within the R&D department and with other departments. This is evidenced by our embrace of some free or open-source tools for collaboration and our general belief in transparency about challenges. We also emphasize collaborative opportunities between departments to strengthen our cohesiveness as a team, be it working on client engagements together or learning best practices in a seminar format. A lot of our ideas come from the valuable perspectives of our employees, who come from a wide array of backgrounds. Thus, our project management strategies are a hybrid of techniques seen in more established fields such as software engineering, consulting, and academia. I hope the tips presented in my talk today have made doing data science more manageable for your team. Thank you for your time.