wt2084 final presentation slides
1. STATISTICAL MEASUREMENTS,
ANALYSIS & RESEARCH
Final Presentation
Weixi Tan
Net ID: wt2084
NYU SPS Integrated Marketing
Professor: Luyao Zhang
2. Outline
Part I: Introduction
Part II: Summary of course takeaway
Part III: Market research report: Regression analysis
Part IV: Appendix
3. Part I: Introduction
Weixi (Vicky) Tan comes from Chongqing, China. She received her Bachelor's degree in statistics.
She interned as a reporter for CQNEWS.Net, Chongqing's first and largest news portal
website, independently completing several interviewing and news reporting tasks at the 2019 Smart
China Exposition for her client, China Telecom Co., Ltd. In the process, she became involved in
news media marketing. She was one of the people in charge of a photography studio called Match,
a university students' innovation and entrepreneurship program. Its main business is taking
professional photographs and doing commercial shoots for large-scale events and each year's graduation
season. Her work at Match helped the studio grow from a small studio in its early
start-up phase into one that dominates the university market and is gaining popularity beyond
campus. Through this work, she came to appreciate the significance of marketing planning and execution for businesses
and began to pursue a marketing career after a systematic study of marketing at the graduate
level at NYU. She is passionate about photography, volunteering, traveling, and exploring new things.
LinkedIn URL: https://www.linkedin.com/in/weixi-tan-5384911a4/
GitHub Repo URL: https://github.com/WeixiTan/NYU_Integrated_Marketing
Kaggle Notebook URL: https://www.kaggle.com/weixitan/customer-segementation-wt2084
4. Part II: Summary of course takeaway
I drew several simple mind maps:
5. Use tools such as Google Data Studio, GitHub, and Kaggle, and apply Python code.
The t-test is a basic and important tool in hypothesis testing for continuous variables.
If we want to show how strongly variables are related, we should use correlation analysis. Regression analysis
should be used if we want to estimate how much one variable affects another.
Cluster analysis is a class of techniques used to classify and segment our target customers.
Different methods suit different types of data sets, and each model has its own applicable conditions. As
a market analyst, you may try many methods and models, and fail many times, before finding the most appropriate way
to help your company make the best marketing decisions.
Statistical analysis and research give me quantitative and qualitative techniques for developing consumer
insights, determining market potential, maximizing market share and building customer relationships in an
integrated-marketing environment.
If I want to work in analytical positions in the future, I still need more statistical analysis knowledge and
skill to improve my statistical thinking. This course really helped me and made me realize the importance of
statistical analysis in my future work.
In today's big data age, mastering the data analytics techniques learned in this course can enhance our
competitiveness when applying for a great job.
Key learning and my takeaway for personal and professional growth:
6. Executive Summary
The URL to the data source:
https://www.kaggle.com/rafailmahammadli/regression
This is a Kaggle advertising data set containing the sales of a product in 200 different
markets and the advertising budgets for the product on TV, radio and newspaper. Sales are measured in
thousands of units, the advertising budgets in thousands of dollars.
We use linear regression in this report to test the correlations. The results show that TV and sales are
correlated, and radio and sales are correlated, which implies we should pay more attention to TV and
radio advertising.
GitHub Repo Link: https://github.com/WeixiTan/NYU_Integrated_Marketing
Part III: Market research report: Regression analysis
7. Research Design and The Data
Google Data Studio Link:
https://datastudio.google.com/embed/reporting/02f262af-782f-4785-8904-e45a433b21ee/page/1tUrB
Abstract: This data covers the sales of a product in 200
different markets and the advertising budgets for the
product on TV, radio and newspaper. By visualizing
secondary data from Kaggle, we find that the TV advertising
budget is much larger than the radio and newspaper budgets.
We want to study the relationship between sales
and the other three variables, and which media channel has
the greater impact on product sales.
We use linear regression in this report to test the
correlations. The results show that TV and sales are
correlated, and radio and sales are correlated.
Through such research, product managers can make
better marketing decisions, such as which channel should
receive more budget.
8. Scatter plots
We draw three scatter plots: the first for TV and sales, the second for radio and sales, and the third for newspaper and sales.
Scatter plot 1 shows that TV advertising and sales have a linear relationship.
Scatter plot 2 shows that radio advertising and sales have a linear relationship.
Scatter plot 3 shows that newspaper advertising and sales have a linear relationship, but a less clear one.
9. Regression result
From the regression result, the p-value for TV is
reported as 0, which is smaller than 0.05, so we can reject
the null hypothesis that TV and sales are not
correlated.
The p-value for radio is reported as 0, which is smaller than 0.05,
so we can reject the null hypothesis that radio and
sales are not correlated.
The p-value for newspaper is 0.860, which is greater
than 0.05, so we cannot reject the null hypothesis
that newspaper and sales are not correlated.
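A minimal sketch of how such a regression can be fitted and read, assuming synthetic data in place of the Kaggle file. The simulated coefficients (0.045 for TV, 0.19 for radio, 0 for newspaper), the noise level and all variable names are illustrative assumptions, not results from this report.

```python
import numpy as np

# Synthetic stand-in for the advertising data (NOT the Kaggle file):
# 200 markets, budgets in thousands of dollars, sales in thousands of units.
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
# Assumed data-generating process: newspaper has no effect on sales.
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), tv, radio, newspaper])
names = ["intercept", "TV", "radio", "newspaper"]

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = sales - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

for name, b, t in zip(names, beta, t_stats):
    # |t| above roughly 1.97 corresponds to p < 0.05 with 196 degrees of freedom.
    print(f"{name:10s} coef={b:8.4f}  t={t:8.2f}  significant={abs(t) > 1.97}")
```

Each coefficient is tested against the null hypothesis that it equals zero, which is what the per-channel p-values in a regression table report.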
10. Insights
From the regression result, we can conclude that TV advertising and radio advertising
have a significant effect on product sales at the 95% confidence level, while we find no
significant relationship between newspaper advertising and sales.
This conclusion implies that traditional media still play an important role in marketing
campaigns, and the company should pay more attention to TV and radio advertising than
to newspaper advertising.
11. Assumptions Check
We then check the 6 assumptions of the linear model.
The results show that assumptions 2 and 3 are likely to be satisfied, but assumptions 1, 4 and 6 are not likely to be
satisfied.
For assumption 1, the error term is not normally
distributed: for each fixed value of X, the distribution
of Y is not normal.
For assumption 3, the mean of the error term is 0.
12. Assumptions Check
For assumption 2, from the scatter plots above, the means of the distributions of Y, given X,
lie on a straight line. So TV and sales have a linear relationship, radio and sales have a linear relationship, and
newspaper and sales have a linear relationship.
• For assumption 4, the variance of the error term is not constant. The variance depends on the values
taken by X.
For assumption 5, the data set is not time
series data, so we omit it here.
13. Assumptions Check
• For assumption 6, TV and radio are not correlated, and TV and newspaper are not correlated,
but radio and newspaper are correlated, so there may be a multicollinearity issue.
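A minimal sketch of one way to check assumption 6, assuming synthetic budgets in which newspaper spend partly follows radio spend. The data, the 1.2 slope and the variable names are illustrative assumptions. The variance inflation factors (VIF) are read from the diagonal of the inverse correlation matrix.

```python
import numpy as np

# Synthetic predictors (NOT the Kaggle file).
rng = np.random.default_rng(1)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
# Assumption for illustration: newspaper budget partly follows radio budget.
newspaper = 1.2 * radio + rng.normal(0, 15, n)

X = np.column_stack([tv, radio, newspaper])
names = ["TV", "radio", "newspaper"]

# Pairwise correlations between the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the predictor correlation matrix.
vif = np.diag(np.linalg.inv(corr))

for i, name in enumerate(names):
    print(f"{name:10s} VIF={vif[i]:.2f}")
print("corr(radio, newspaper) =", round(float(corr[1, 2]), 3))
```

A VIF close to 1 suggests a predictor is nearly unrelated to the others, while larger values point to overlapping predictors.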
14. Further research:
From the regression result, the p-value for newspaper is
0.860, which is greater than 0.05, so we cannot reject the null
hypothesis that newspaper and sales are not correlated.
The assumption check also shows that some assumptions
are not likely to be satisfied.
We should therefore treat this linear regression model as
not fully valid. One option is to remove the variable
(newspaper) that does not have a significant impact on
product sales.
In addition, based on the scatter plot of residuals against
predictions, we can consider non-linear regression for
the research.
This research focuses on traditional media, but we can
also collect data on new media, such as social media, to
analyze.
16. Capstone Project Milestone 3: Hypothesis Testing
Data source:
https://data.world/data-society/bank-marketing-data
The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution.
https://stats.oecd.org
The data is the quarterly growth rate of GDP in volume for G20 countries.
Tests: paired t-test; two-sample t-test; Pearson test of correlation.
All of the tests showed statistically significant results.
Github:
https://github.com/WeixiTan/NYU_Integrated_Marketing
17. Paired T-test
Because the data consists of before-and-after observations on the same sample (measured twice,
resulting in pairs of observations), we pick the paired t-test.
Conclusion: The p-value is reported as 0.0, which is below 0.05, so we can reject the null hypothesis that there is no
significant difference between the mean GDP level in 2018 and in 2020.
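A minimal sketch of the paired t statistic, assuming made-up growth rates for five hypothetical countries rather than the OECD figures. The function name `paired_t` and the data are illustrative. In practice a library routine such as `scipy.stats.ttest_rel` would also return the p-value.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    # The test works on the per-pair differences only.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Made-up GDP growth rates (percent) for five hypothetical countries.
growth_2018 = [2.9, 1.8, 6.7, 1.1, 2.4]
growth_2020 = [-3.4, -4.6, 2.2, -4.5, -3.1]

t_stat, dof = paired_t(growth_2018, growth_2020)
# With 4 degrees of freedom the two-sided 5% critical value is about 2.776.
print(f"t = {t_stat:.2f}, dof = {dof}, reject H0: {abs(t_stat) > 2.776}")
```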
18. Because the data is metric and not paired, and there are two groups, we pick the two-sample t-test.
Conclusion: The p-value is below 0.05, so at the 0.05 significance level we can reject the null hypothesis that the mean
balance is the same for clients who have a loan and clients who do not.
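A minimal sketch of a two-sample comparison, assuming made-up account balances instead of the bank marketing data. It uses Welch's version of the test, which does not assume equal variances. That choice is an assumption here, since the slides do not say which variant was run (`scipy.stats.ttest_ind` with `equal_var=False` is the library counterpart).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Made-up balances for clients with and without a loan.
balance_with_loan = [120, 340, 90, 410, 230, 150]
balance_no_loan = [900, 1500, 760, 1320, 1100, 980, 1250]

t_stat, dof = welch_t(balance_with_loan, balance_no_loan)
print(f"t = {t_stat:.2f}, dof = {dof:.1f}")
```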
19. Because the two variables are normally distributed with no outliers, we pick the Pearson correlation test.
Conclusion: We get a p-value of 0.0, which is below 0.05, so we can reject the null hypothesis that the GDP for the
same country in 2018 and in 2020 are not correlated.
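A minimal sketch of the Pearson correlation and its test statistic, assuming made-up values for five hypothetical countries. The function name `pearson_r` and the data are illustrative. `scipy.stats.pearsonr` would return the coefficient together with a p-value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up GDP figures for five hypothetical countries.
gdp_2018 = [2.9, 1.8, 6.7, 1.1, 2.4]
gdp_2020 = [-3.4, -4.6, 2.2, -4.5, -3.1]

r = pearson_r(gdp_2018, gdp_2020)
n = len(gdp_2018)
# Under H0 (no correlation), t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
# With 3 degrees of freedom the two-sided 5% critical value is about 3.182.
print(f"r = {r:.3f}, t = {t_stat:.2f}, reject H0: {abs(t_stat) > 3.182}")
```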
20. • Conclusion: For an effect size of 0.12, a power of 0.8, and a type Ⅰ error rate of 0.05, we need a
sample size of 25.
For the Two Sample T-test based on bank marketing data:
• Limitations: The paired t-test cannot be applied to the bank marketing data because the data is not paired.
• Future research plan: We can decide the sample size through power analysis and collect additional data on
bank clients before the marketing campaign, to measure the difference before and after the campaign
and thereby measure the effect of the marketing campaign (phone calls or similar).
21. Capstone Project Milestone 4: Regression
Data source:
https://www.kaggle.com/c/customer-churn-prediction-2020/overview
This is a Kaggle data set from a 2020 competition. The original purpose is to predict whether a
customer will change telecommunications provider, something known as "churning". "total_day_calls",
"total_eve_calls" and "total_night_calls" mean "total number of day calls", "total number of evening calls" and
"total number of night calls".
We use linear regression in this report to test the correlations. The results show that total day calls and total night
calls are not significantly correlated, and neither are total evening calls and total night calls, which implies we should
design customized packages for day calls and night calls.
Github:
https://github.com/WeixiTan/NYU_Integrated_Marketing
Executive Summary
22. Scatter plots
We draw two scatter plots: one for total day calls and
total night calls, the other for total evening calls and
total night calls.
Scatter plot 1 shows that total day calls and total night
calls have a linear relationship.
Scatter plot 2 shows that total evening calls and total
night calls have a linear relationship.
23. Regression result
From the regression result, we can see that
the p-values are 0.756 and 0.438, both
greater than 0.05. We cannot reject the
null hypotheses that total day calls and
total night calls are not correlated, and that
total evening calls and total night calls
are not correlated.
24. Insights
From the regression result, we can conclude that total day calls and total evening calls
do not have a significant effect on total night calls at the 95% confidence level.
This conclusion implies that we should design customized packages for day calls,
evening calls and night calls.
25. Assumptions Check
For assumptions 1 and 3, the error term is normally
distributed: for each fixed value of X, the distribution of Y is
normal. The mean of the error term is 0.
For assumption 2, from the scatter plots above, the means of
the distributions of Y, given X, lie on a straight
line. So total day calls and total night calls have a linear
relationship, and total evening calls and total night calls
also have a linear relationship.
We then check the 6 assumptions of the linear model.
The results show that all the assumptions are likely to be satisfied.
26. For assumption 4, the variance of the error term
is constant. The variance does not depend on the
values taken by X.
For assumption 5, the data set is not time series
data, so we omit it here.
For assumption 6, the independent variables in X
are not correlated. There is no multicollinearity issue.
27. Further research:
Although we can design customized packages for different periods, we should do
further and more detailed research to see how to design the different packages. For
example, we can run another linear regression to see whether total day
minutes or total day calls has the more significant effect on total day charge, and so
whether we should increase the price per minute or per call.
28. Capstone Project Milestone 5: Customer
Segmentation
Executive Summary
• Data source:
https://www.kaggle.com/hellbuoy/online-retail-customer-clustering
This is an online retail transnational data set that contains all the transactions occurring between
01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells
unique all-occasion gifts. Many customers of the company are wholesalers. The business goal is to build an
RFM clustering and choose the best set of customers for the company to target.
We use K-means clustering and hierarchical clustering. K-means clustering returns
57 customers and hierarchical clustering returns 2 customers, a much smaller group than the
one that K-means clustering returns.
• Kaggle Notebook:
https://www.kaggle.com/weixitan/customer-segementation-wt2084
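A minimal sketch of how recency, frequency and monetary (RFM) values can be derived before clustering, assuming a handful of hypothetical transactions rather than the Kaggle file. The customer IDs, amounts, snapshot date and the function name `rfm_table` are illustrative assumptions, and frequency is counted per transaction row here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("C1", date(2011, 12, 1), 120.0),
    ("C1", date(2011, 11, 20), 80.0),
    ("C2", date(2011, 3, 5), 40.0),
    ("C3", date(2011, 12, 8), 500.0),
    ("C3", date(2011, 10, 2), 300.0),
    ("C3", date(2011, 6, 15), 250.0),
]
# Assumed reference date: the last day of the data set's date range.
snapshot = date(2011, 12, 9)

def rfm_table(rows, today):
    """Map customer -> (recency in days, frequency, total amount)."""
    last_purchase = {}
    frequency = defaultdict(int)
    amount = defaultdict(float)
    for cust, day, value in rows:
        last_purchase[cust] = max(day, last_purchase.get(cust, day))
        frequency[cust] += 1
        amount[cust] += value
    return {
        cust: ((today - last_purchase[cust]).days, frequency[cust], amount[cust])
        for cust in last_purchase
    }

rfm = rfm_table(transactions, snapshot)
for cust, (recency, freq, total) in sorted(rfm.items()):
    print(cust, recency, freq, total)
```

The three columns are usually rescaled before clustering so that the amount does not dominate the distance calculation.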
29. K-Means Clustering: Finding the best k - The Elbow Method
Since the first k does not give a usable result, we choose the second k as
the best k; with metric="silhouette", we get
best k=3.
30. K-Means Clustering: Interpreting the Clustering
By the RFM criteria, we should choose the customer clusters with
a lower recency, a higher frequency and amount. From the K-
means clustering results, we can see that customers with
Cluster_Id 2 best fit the criteria.
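A minimal sketch of how the RFM criteria can be applied to cluster averages, assuming hypothetical customers that already carry a cluster label. The data and the rank-sum scoring rule in `best_cluster` are illustrative assumptions, not the method used in the notebook.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical labelled customers: (cluster_id, recency_days, frequency, amount).
customers = [
    (0, 200, 2, 150.0), (0, 180, 3, 220.0),
    (1, 60, 10, 900.0), (1, 45, 12, 1100.0),
    (2, 5, 40, 9000.0), (2, 9, 55, 12000.0),
]

def cluster_profiles(rows):
    """Map cluster id -> (mean recency, mean frequency, mean amount)."""
    grouped = defaultdict(list)
    for cid, recency, freq, amount in rows:
        grouped[cid].append((recency, freq, amount))
    return {cid: tuple(mean(col) for col in zip(*vals)) for cid, vals in grouped.items()}

def best_cluster(profiles):
    """Pick the cluster with the lowest rank sum (assumed scoring rule)."""
    ids = list(profiles)

    def rank(index, reverse):
        ordered = sorted(ids, key=lambda c: profiles[c][index], reverse=reverse)
        return {cid: pos for pos, cid in enumerate(ordered)}

    r_rank = rank(0, reverse=False)  # lower recency is better
    f_rank = rank(1, reverse=True)   # higher frequency is better
    m_rank = rank(2, reverse=True)   # higher amount is better
    return min(ids, key=lambda c: r_rank[c] + f_rank[c] + m_rank[c])

profiles = cluster_profiles(customers)
target = best_cluster(profiles)
print("target cluster:", target, profiles[target])
```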
33. Hierarchical Clustering: Visualize and Interpret Result
By the RFM criteria, we should choose the customer cluster with a lower recency, a higher
frequency and amount.
From the hierarchical clustering results, we can see that customers with Cluster_Id 1 best fit the
high Frequency criterion, but customers with Cluster_Id 2 best fit the high Amount criterion.
34. Hierarchical Clustering: Interpreting the Clustering
We can see that hierarchical clustering returns 2 customers, a much smaller group than the
one that K-means clustering returns.
If the manager values frequency more, we choose Cluster_Id 2, and the company can offer
daily discounts to these customers in future marketing campaigns. If the manager cares more about amount,
we choose Cluster_Id 1, and the company can offer a discount on purchases over a certain amount.