This document analyzes the relationship between Farmville players' levels and the number of trees they have on their virtual farms. Data was collected from 25 random players and showed a moderate positive correlation between level and number of trees. Higher-level players generally had more trees, though some outliers existed. Descriptive statistics and box plots showed lower-level players owned fewer trees on average. A chi-squared test rejected the hypothesis that level and number of trees were independent variables.
1. Tenorio 1
IB Math SL Internal Assessment:
Farmville Statistics
Arielle Tenorio
Period 6
Farmville is a popular computer game that is hosted by the social networking website, Facebook.
This game allows players to manage a virtual farm by plowing, planting, growing, and
harvesting on their virtual farmland. Crops, trees, and livestock can be purchased with the
“FarmCoins” that are earned by harvesting. There are also levels in this game that are achieved
by reaching a certain amount of experience points. Players of higher levels tend to have larger
farms, more crops, and more FarmCoins than those of lower levels.
This assignment will examine the relationship between the number of trees a Farmville player
has and what level they are on in the game. It is predicted that there will be a positive
relationship. This assumption can be confirmed or denied by analyzing and processing collected
data. First, a scatter plot will be produced with a line of linear regression to display the trend of
the data. The correlation coefficient for the two variables will also be determined. A box
and whisker plot will compare the number of trees owned by the lowest-ranking players surveyed
with the number owned by the highest-ranking players. A chi-squared test for independence will
then determine whether the two variables are related or independent.
Data samples were collected from 25 random Farmville players after logging onto Facebook and
opening the Farmville game. After visiting the virtual farms of 25 “Friends” and counting the
number of trees on each farm, a table was drawn up to organize the collected values.
Figure 1: Collected Data
# Farmville Level Number of Trees
1 7 7
2 8 9
3 9 9
4 9 16
5 10 11
6 13 17
7 13 23
8 13 45
9 15 19
10 15 20
11 16 33
12 16 35
13 16 16
14 18 21
15 19 18
16 20 20
17 22 79
18 22 35
19 23 41
20 23 28
21 25 62
22 26 35
23 28 44
24 31 94
25 34 40
Figure 1: This table displays the data that was collected.
From the table, it can be observed that the number of trees generally increases as the level
increases. These values will now be plotted on a scatter plot.
A scatter plot is used to visually display the relationship between two variables on a
two-dimensional graph. A line of linear regression, or trend line, can be found to confirm the
observed relationship. The more closely the data points cluster around the trend line, the
stronger the correlation between the variables.
Figure 2: Scatter Plot and Linear Regression Line
[Scatter plot titled "The Relationship Between Level and Number of Trees"; x-axis: Farmville Level (0 to 40); y-axis: Number of Trees (0 to 100); trend line: y = 2.0783x - 6.4127]
Figure 2: This scatter plot shows a positive relationship between a player’s Farmville level and the
number of trees the player has. The line of linear regression was produced using Microsoft Excel.
The calculations to find this equation manually are shown below.
Line of Linear Regression:
The formula for finding the linear regression line for y on x is
$$y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x})$$
where $\bar{y}$ is the average of the Y values, $\bar{x}$ is the average of the X values, $S_{xy}$ is the covariance of X
and Y, and $S_x^2$ is the standard deviation of X, squared.
In order to find these values, the data was organized into a table, below.
Figure 3: Table for Linear Regression Line
# Level (x) Trees (y) xy x² y²
1 7 7 49 49 49
2 8 9 72 64 81
3 9 9 81 81 81
4 9 16 144 81 256
5 10 11 110 100 121
6 13 17 221 169 289
7 13 23 299 169 529
8 13 45 585 169 2025
9 15 19 285 225 361
10 15 20 300 225 400
11 16 33 528 256 1089
12 16 35 560 256 1225
13 16 16 256 256 256
14 18 21 378 324 441
15 19 18 342 361 324
16 20 20 400 400 400
17 22 79 1738 484 6241
18 22 35 770 484 1225
19 23 41 943 529 1681
20 23 28 644 529 784
21 25 62 1550 625 3844
22 26 35 910 676 1225
23 28 44 1232 784 1936
24 31 94 2914 961 8836
25 34 40 1360 1156 1600
∑ = 451 777 16671 9413 35299
mean = 18.04 31.08 666.84 376.52 1411.96
Figure 3: The sums and averages of x, y, xy, x² and y² were found and listed. Organizing the
data in this manner made it easier to find the values for Sxy and Sx². The calculations are
shown below.
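$$\sum x = 451, \quad \sum y = 777, \quad \sum xy = 16671, \quad \sum x^2 = 9413, \quad n = 25$$
To find the average of x:
$$\bar{x} = \frac{\sum x}{n} = \frac{451}{25} = 18.04$$
To find the average of y:
$$\bar{y} = \frac{\sum y}{n} = \frac{777}{25} = 31.08$$
To find Sxy:
$$S_{xy} = \frac{\sum xy}{n} - \bar{x}\,\bar{y}$$
$$S_{xy} = \frac{16671}{25} - (18.04)(31.08)$$
$$S_{xy} \approx 106.16$$
To find Sx²:
$$S_x^2 = \frac{\sum x^2}{n} - \bar{x}^2$$
$$S_x^2 = \frac{9413}{25} - 18.04^2$$
$$S_x^2 \approx 51.08$$
To find the equation of the line of linear regression, using $\bar{y} = 31.08$, $\bar{x} = 18.04$, $S_{xy} = 106.16$ and $S_x^2 = 51.08$:
$$y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x})$$
$$y - 31.08 = \frac{106.16}{51.08}(x - 18.04)$$
$$y - 31.08 = 2.078x - 37.493$$
$$y = 2.078x - 6.413$$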
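The manual calculation can be cross-checked with a short script. The following is only a minimal sketch in Python: the function name `regression_line` and the variable names are illustrative assumptions, and the data lists are copied from Figure 1. It applies the same covariance and variance formulas used above.

```python
# Level (x) and number of trees (y) for the 25 surveyed players, copied from Figure 1.
levels = [7, 8, 9, 9, 10, 13, 13, 13, 15, 15, 16, 16, 16,
          18, 19, 20, 22, 22, 23, 23, 25, 26, 28, 31, 34]
trees = [7, 9, 9, 16, 11, 17, 23, 45, 19, 20, 33, 35, 16,
         21, 18, 20, 79, 35, 41, 28, 62, 35, 44, 94, 40]


def regression_line(xs, ys):
    """Return (slope, intercept) of the regression line of y on x.

    Hypothetical helper: uses S_xy / S_x^2 for the slope, as in the text.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: S_xy = (sum of xy) / n - mean_x * mean_y
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    # Variance: S_x^2 = (sum of x^2) / n - mean_x^2
    s_x2 = sum(x * x for x in xs) / n - mean_x ** 2
    slope = s_xy / s_x2
    # Rearranging y - mean_y = slope * (x - mean_x) gives the intercept.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = regression_line(levels, trees)
print(f"y = {slope:.4f}x + ({intercept:.4f})")
```

The printed coefficients should agree, to rounding, with both the Excel trend line in Figure 2 and the hand calculation.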
The correlation between the two values can also be found. Pearson’s correlation coefficient
formula is used to find this value. If r = 1, then it is said that the x and y values are perfectly
correlated. If r = 0, then x and y are not correlated. If r = -1, then x and y are perfectly negatively
correlated. By calculating the correlation coefficient, the degree of linearity between X and Y can
be determined.
Pearson’s Correlation Coefficient Formula:
The formula for finding the correlation coefficient is
$$r = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$$
Most of the values have already been determined while finding the linear regression line
equation.
To find the correlation coefficient, r:
$$n = 25, \quad \sum xy = 16671, \quad \sum x^2 = 9413, \quad \sum y^2 = 35299$$
$$\bar{x} = 18.04, \quad \bar{y} = 31.08, \quad \bar{x}^2 = 325.44, \quad \bar{y}^2 = 965.97$$
$$r = \frac{16671 - 25 \cdot 18.04 \cdot 31.08}{\sqrt{(9413 - 25 \cdot 325.44)(35299 - 25 \cdot 965.97)}}$$
r = 0.70334
r² = 0.49468
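As a check on the arithmetic, Pearson's formula can also be evaluated directly from the raw data. This is a hedged sketch rather than part of the original method: `pearson_r` is an illustrative name, and the lists are copied from Figure 1.

```python
import math

# Level (x) and number of trees (y) for the 25 surveyed players, copied from Figure 1.
levels = [7, 8, 9, 9, 10, 13, 13, 13, 15, 15, 16, 16, 16,
          18, 19, 20, 22, 22, 23, 23, 25, 26, 28, 31, 34]
trees = [7, 9, 9, 16, 11, 17, 23, 45, 19, 20, 33, 35, 16,
         21, 18, 20, 79, 35, 41, 28, 62, 35, 44, 94, 40]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum(xy) - n * mean_x * mean_y
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    # Denominator: square root of the product of the two corrected sums of squares
    ss_x = sum(x * x for x in xs) - n * mean_x ** 2
    ss_y = sum(y * y for y in ys) - n * mean_y ** 2
    return numerator / math.sqrt(ss_x * ss_y)


r = pearson_r(levels, trees)
print(f"r = {r:.5f}, r^2 = {r * r:.5f}")
```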
The correlation value can be rounded to 0.703. It can be stated that there is a moderate, positive
correlation between x and y. The positive r value means that as the level of a Farmville player (x)
increases, so does the number of trees (y). The graph also shows the positive
relationship. However, some data points do not cluster as closely to
the trend line as the others, such as (22, 79) and (31, 94). These points are
considered outliers. They might appear as a result of the freedom every player has to purchase a
wide variety of items other than trees (animals, seeds, decorations, buildings, etc.). Not all players
have the same desire to purchase trees. Parallel boxplots can be used to display some of the
descriptive statistics of the number of trees owned by lower-level and higher-level players.
The parallel boxplots will present a visual comparison of the distribution of the data as well as
the descriptive statistics. These descriptive statistics are the median, range, interquartile range,
minimum, and maximum. The spread of data for the number of trees owned by the lowest-ranking
Farmville players surveyed (levels 7-15) will be compared to that of the highest-ranking
players from the group of 25 players (levels 16-34). It is predicted that the lower-level
players will have fewer trees while higher-level players will have a greater number of trees, but there
may be some overlapping data.
Figure 4: Number of Trees for Levels 7-15 and 16-34
Statistic Levels 7-15 Levels 16-34
Minimum 7 16
Quartile 1 9 21
Median 16.5 35
Quartile 3 20 44
Maximum 45 94
Figure 4: This table shows the five-number summaries of the number of trees for each level group. The data
organized here will be shown in the box and whisker plot.
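The five-number summaries can be reproduced with a few lines of Python. This sketch assumes that the quartiles are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the group size is odd. The paper does not state which quartile convention was used, so that choice, like the name `five_number_summary`, is an assumption.

```python
from statistics import median

# Number of trees per player, split by level group (values copied from Figure 1).
low_group = [7, 9, 9, 16, 11, 17, 23, 45, 19, 20]            # levels 7-15
high_group = [33, 35, 16, 21, 18, 20, 79, 35, 41, 28,
              62, 35, 44, 94, 40]                            # levels 16-34


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the sorted data; the middle value is excluded when n is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


print("Levels 7-15: ", five_number_summary(low_group))
print("Levels 16-34:", five_number_summary(high_group))
```

Other quartile conventions (for example, interpolated percentiles) can give slightly different values for Quartile 1 and Quartile 3 on samples this small.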
Figure 5: Box and Whisker Plot
[Parallel box plots of the number of trees (vertical axis, 0 to 100) for Levels 7-15 and Levels 16-34, marking the minimum, quartile 1, median, quartile 3, and maximum of each group]
Figure 5: The box and whisker plot compares the spread of the number of trees owned by the two
groups of Farmville players. The middle fifty percent of the highest-ranking players in the group that was
surveyed own anywhere from 21 to 44 trees, whereas the middle fifty percent of the lowest-ranking
players own from 9 to 20 trees. Some beginner players, however, seem to own as many trees as
the higher-level players.
By comparing the descriptive statistics for the number of trees that the highest-ranking
players own with those of the lowest-ranking players, it can be seen that while higher-ranking players tend to
have more trees, it is not necessarily true that lower-ranking players cannot surpass them in
number of trees owned. This can be seen on the plot, as the top twenty-five percent of the lower-level
players own about as many trees as the higher-level group’s middle fifty percent. However, the
higher-level group has a greater median than the lower-level group, which suggests that
they own more trees than most of the beginner players.
A chi-squared test will now be performed to determine if the number of trees a player has and
their level in the game are independent or dependent events. The equation for the chi-squared
test is
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
where $f_o$ is the observed frequency and $f_e$ is the expected frequency. Contingency tables will be
constructed to show the results of the 25 surveyed players. One table displays the observed
values, while another displays the expected values.
Observed values table:
               Trees 7-30   Trees >30   Total
Level 7-15     10           0           10
Level 16-34    4            11          15
Total          14           11          25
Expected values table:
               Trees 7-30   Trees >30   Total
Level 7-15     5.6          4.4         10
Level 16-34    8.4          6.6         15
Total          14           11          25
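To find the expected value (for the cell Level 7-15, Trees 7-30):
$$f_e = \frac{\text{row total} \times \text{column total}}{\text{grand total}} = \frac{10 \cdot 14}{25}$$
$$f_e = 5.6$$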
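The same rule gives all four expected frequencies at once. The sketch below is illustrative only: `expected_frequencies` is an assumed name, and the observed counts are the ones reported in the observed values table.

```python
# Observed counts as reported in the observed values table:
# rows are level groups (7-15, 16-34), columns are tree groups (7-30, >30).
observed = [[10, 0],
            [4, 11]]


def expected_frequencies(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 1) for value in row])
```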
Before performing the chi-squared test, the null and alternative hypotheses are formed, the
degrees of freedom are calculated, and the significance level is stated.
H0 (null hypothesis) states that game level and number of trees are independent events.
H1 (alternative hypothesis) states that the two events are not independent.
There is 1 degree of freedom.
At a 5% (0.05) significance level with df = 1, the critical value is $\chi^2_{0.05} = 3.84$.
To find degrees of freedom for a 2 x 2 contingency table:
df = (r-1)(c-1)
df = (2-1)(2-1)
df= 1
Using the contingency tables, χ² is found using the equation quoted above. The table below
organizes the values needed for the calculation.
Figure 6: χ² Calculation
fo    fe     fo − fe    (fo − fe)²    (fo − fe)²/fe
10    5.6    4.4        19.36         3.457142857
0     4.4    -4.4       19.36         4.4
4     8.4    -4.4       19.36         2.304761905
11    6.6    4.4        19.36         2.933333333
Total = 13.0952381
Figure 6: This table shows how the chi-squared value was found.
$$\chi^2 \approx 13.1$$
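The calculation in Figure 6 can be expressed compactly in code. This is a minimal sketch under stated assumptions: the function name `chi_squared` and the constant `CRITICAL_VALUE` are illustrative, the observed counts are taken from the observed values table, and 3.84 is the critical value quoted earlier for df = 1 at the 5% level.

```python
# Observed counts as reported in the observed values table.
observed = [[10, 0],
            [4, 11]]

# Critical value quoted in the text for df = 1 at the 5% significance level.
CRITICAL_VALUE = 3.84


def chi_squared(table):
    """Sum of (f_o - f_e)^2 / f_e over every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for f_o, col_total in zip(row, col_totals):
            f_e = row_total * col_total / grand_total  # expected frequency
            statistic += (f_o - f_e) ** 2 / f_e
    return statistic


statistic = chi_squared(observed)
reject_null = statistic > CRITICAL_VALUE
print(f"chi-squared = {statistic:.4f}, reject H0: {reject_null}")
```

Comparing the statistic against a fixed critical value mirrors the table-lookup approach used here; a statistics library could report a p-value instead.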
Because χ² ≈ 13.1 is greater than the critical value of 3.84, we reject the null hypothesis, which states that the
Farmville player’s level and number of trees are independent events.
According to the scatter plot and the line of linear regression, there is a positive relationship
between the number of trees a Farmville player has and what level they are on in the game. By
finding Pearson’s correlation coefficient, it was determined that there is a moderate correlation
between the two variables. As stated before, this could be because more experienced players tend
to have more “FarmCoins” to purchase trees. Lower-level players and beginners may be more likely
to buy smaller, cheaper plants. The boxplot also showed that higher-level players tend to own more
trees, but also suggested that some lower-level players own more trees than some high-level
players. The chi-squared test showed that the level of
a Farmville player and the number of trees they own in the game are dependent events. They
have a positive correlation, suggesting that as a player rises in level, they buy more trees.
There were a couple of data points that did not cluster as closely to the linear regression line as
the other data points did. These data points are considered to be outliers. Each player has the
freedom to spend their “FarmCoins” on various accessories for their farms, such as animals, seeds,
and decorations, and not all players are interested in buying the same items for their virtual farm.
Some players may buy more trees than seeds or animals. To determine if these outliers skew the
results significantly, the chi-squared test will be performed again with the outliers
removed. The table below displays the data samples without the two outliers, (22, 79) and (31, 94).
Figure 7: Data without Outliers
Farmville Level Number of Trees
7 7
8 9
9 9
9 16
10 11
13 17
13 23
13 45
15 19
15 20
16 33
16 35
16 16
18 21
19 18
20 20
22 35
23 41
23 28
25 62
26 35
28 44
34 40
Figure 7: This data will be used to perform a second chi-squared test.
Observed values table:
               Trees 7-30   Trees >30   Total
Level 7-15     10           0           10
Level 16-34    4            9           13
Total          14           9           23
Expected values table:
               Trees 7-30   Trees >30   Total
Level 7-15     6.086957     3.913043    10
Level 16-34    7.913043     5.086957    13
Total          14           9           23
H0 (null hypothesis) states that game level and number of trees are independent events.
H1 (alternative hypothesis) states that the two events are not independent.
There is 1 degree of freedom.
At a 5% (0.05) significance level with df = 1, the critical value is $\chi^2_{0.05} = 3.84$.
Using the contingency tables, χ² is found using the equation quoted above. The table below
organizes the values needed for the calculation.
Figure 8: χ² Calculation without Outliers
fo    fe     fo − fe    (fo − fe)²    (fo − fe)²/fe
10    6.1    3.9        15.21         2.493443
0     3.9    -3.9       15.21         3.9
4     7.9    -3.9       15.21         1.925316
9     5.1    3.9        15.21         2.982353
Total = 11.30111
Figure 8: This table shows how the chi-squared value was found.
$$\chi^2 \approx 11.3$$
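Figure 8 works with expected frequencies rounded to one decimal place. The sketch below, which is only an illustration, repeats the calculation both with that rounding and with unrounded expected frequencies, to see how much the rounding matters. The `decimals` parameter and the function name are assumptions introduced for this example.

```python
# Observed counts with the two outliers removed, as reported in the table above.
observed_no_outliers = [[10, 0],
                        [4, 9]]

# Critical value quoted in the text for df = 1 at the 5% significance level.
CRITICAL_VALUE = 3.84


def chi_squared(table, decimals=None):
    """Chi-squared statistic for a contingency table.

    decimals: if given, each expected frequency is rounded to that many
    decimal places first (hypothetical option, to mimic Figure 8).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for f_o, col_total in zip(row, col_totals):
            f_e = row_total * col_total / grand_total
            if decimals is not None:
                f_e = round(f_e, decimals)
            statistic += (f_o - f_e) ** 2 / f_e
    return statistic


rounded_stat = chi_squared(observed_no_outliers, decimals=1)
exact_stat = chi_squared(observed_no_outliers)
print(f"rounded f_e: {rounded_stat:.4f}, unrounded f_e: {exact_stat:.4f}")
```

The two versions differ only in the second decimal place, and both stay well above the 3.84 critical value, so the rounding does not change the outcome of the test.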
Because χ² ≈ 11.3 is greater than the critical value of 3.84, we again reject the null hypothesis, which states that the
Farmville player’s level and number of trees are independent events. This suggests that the
outliers did not have a significant effect on the outcome of the processed data, and did not skew
the results.