a) The independent variable is time played and the dependent variable is points scored, since the time played affects how many points are scored, not the other way around.
b) The scatterplot will look as follows:
c) The points show a positive linear trend. They are not tightly clustered around that trend, so the relationship is moderate.
d) First, sketch the line of best fit:
We can see that the line passes through the points (20,11) and (25,20). These two points are used to find the equation of the line, starting with the gradient:
\[ m = \dfrac{y_2 - y_1}{x_2 - x_1} \]
\[ m = \dfrac{20-11}{25-20} = \dfrac{9}{5} \]
Then, by using the point (20,11), the equation of the line was found:
\[ y - y_1 = m(x - x_1) \]
\[ y - 11 = \dfrac{9}{5}(x-20) \quad \Rightarrow \quad y = \dfrac{9}{5}x - 25 \]
e) Substituting \( x = 16 \) into the equation obtained in part (d) gives:
\[ y = \dfrac{9}{5} \times 16 - 25 = 3.8 \]
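The two-point calculation in parts (d) and (e) can be checked with a short Python sketch. The helper name `line_through` is an assumption made for this illustration; exact fractions are used so the gradient stays as 9/5 rather than a rounded decimal.

```python
from fractions import Fraction

def line_through(p, q):
    """Return (gradient, intercept) of the straight line through p and q."""
    (x1, y1), (x2, y2) = p, q
    gradient = Fraction(y2 - y1, x2 - x1)   # m = (y2 - y1) / (x2 - x1)
    intercept = y1 - gradient * x1          # from y - y1 = m(x - x1) at x = 0
    return gradient, intercept

# Points read off the sketched best-fit line in part (d).
m, c = line_through((20, 11), (25, 20))
prediction = m * 16 + c                     # part (e): x = 16

print(m, c)                # 9/5 -25
print(float(prediction))   # 3.8
```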
Spearman's correlation coefficient should be used when outliers are known to be present, as it first ranks all data points. Because it works with ranks rather than raw values, the coefficient is not affected by an abnormally large or small value of a data point, as that point's rank remains the same.
a) Pearson's correlation coefficient can be calculated by entering the given data into the GDC, which gives \( r = 0.937 \).
b) Spearman's correlation coefficient is calculated after first ranking the data points from smallest to largest, with tied values sharing the average of their ranks. Entering the ranks into the GDC gives \( r_s = 0.847 \).
Experience (rank) | 1 | 2 | 4 | 3 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Weekly salary (rank) | 1.5 | 3 | 1.5 | 4 | 5 | 6 | 7 |
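As a rough check of the ranking step, the sketch below ranks the experience and salary data (tied values share the mean of their ranks) and then applies Pearson's formula to the ranks. The function names `average_ranks` and `pearson` are assumptions for this illustration, not GDC functions.

```python
from math import sqrt

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1    # first position of v in sorted order
        count = ordered.count(v)        # how many values tie with v
        ranks.append(first + (count - 1) / 2)
    return ranks

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

experience = [10, 12, 18, 15, 21, 24, 35]          # months
salary = [600, 650, 600, 800, 950, 1200, 1500]     # dollars per week

rx = average_ranks(experience)
ry = average_ranks(salary)
r_s = pearson(rx, ry)    # Spearman = Pearson applied to the ranks

print(rx)                # [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 7.0]
print(ry)                # [1.5, 3.0, 1.5, 4.0, 5.0, 6.0, 7.0]
print(round(r_s, 3))     # 0.847
```

Applying Pearson's formula to the ranks is used here because the salary data contain a tie, where the shortcut formula based on squared rank differences is only approximate.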
c) The regression equation after inputting data into GDC is equal to \( y = 37.897x + 169.122 \)
d) The value can be predicted by replacing \( x \) with 25. Then our solution is equal to \( y = 37.897 \times 25 + 169.122 = 1116.547 \)
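For readers without a GDC, the least-squares line can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual formulas slope = S_xy / S_xx and intercept = mean(y) - slope × mean(x); the name `least_squares` is hypothetical. Because the code keeps the coefficients unrounded, the prediction differs slightly in the last digits from a hand calculation that uses coefficients rounded to three decimal places.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

experience = [10, 12, 18, 15, 21, 24, 35]          # months (x)
salary = [600, 650, 600, 800, 950, 1200, 1500]     # dollars per week (y)

slope, intercept = least_squares(experience, salary)
prediction = slope * 25 + intercept                # part (d): x = 25

print(round(slope, 3), round(intercept, 2))   # 37.897 169.12
print(round(prediction, 2))                   # 1116.56
```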
e) The formula for the percentage error is the following:
\[\text{Percentage Error} = \left| \frac{V_a - V_e}{V_e} \right| \times 100\% \]
So by plugging in the values we get:
\[\text{Percentage Error} = \left| \frac{950 - 1116.547}{1116.547} \right| \times 100\% = 14.916\% \]
Large values of Y compared to X are not a problem for a linear regression. That would simply mean that Y will change relatively more than X, which can be reflected in the coefficient.
On the other hand, extrapolation can be problematic, as it means making predictions outside the range of the provided dataset, where the fitted relationship does not necessarily hold. Low correlation is also a problem, since a linear model is only appropriate when the correlation is sufficiently high. Finally, predicting X from Y can also lead to wrong results, as a regression of Y on X is fitted differently from a regression of X on Y, so rearranging one does not give the other. Additionally, it very often does not make logical sense to swap the variables.
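The point about predicting X from Y can be illustrated numerically. The sketch below, which reuses the experience and salary data from this set purely as an example, compares the slope of the x-on-y regression with the slope obtained by simply rearranging the y-on-x line; the variable names are assumptions for this illustration.

```python
experience = [10, 12, 18, 15, 21, 24, 35]          # x
salary = [600, 650, 600, 800, 950, 1200, 1500]     # y

n = len(experience)
mean_x, mean_y = sum(experience) / n, sum(salary) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
s_xx = sum((x - mean_x) ** 2 for x in experience)
s_yy = sum((y - mean_y) ** 2 for y in salary)

b_yx = s_xy / s_xx      # slope of the regression of y on x
b_xy = s_xy / s_yy      # slope of the regression of x on y
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(b_xy, 4))        # 0.0232  (fitted x-on-y slope)
print(round(1 / b_yx, 4))    # 0.0264  (slope from rearranging y-on-x)
```

The two slopes agree only when the correlation is perfect, since their product `b_yx * b_xy` equals \( r^2 \).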
a) For Pearson's correlation, entering the values into the GDC gives \( r = -0.77 \).
b) To calculate the mean of \( x \), we sum all observations and divide by 7, as there are 7 observations in total. This gives \( \frac{995}{7} = 142.14 \). Doing the same with \( y \) gives a mean of 20.86.
c) For this task, first, the regression equation needs to be calculated. By plugging the values into the GDC we get that \( y = -8.43x + 317.93 \). It needs to be remembered that temperature is our \( x \) variable, and distance covered is \( y \) (the dependent variable), as the temperature outside will affect the distance travelled in a car.
Then, we know that the temperature outside was 21 degrees, meaning that the estimated distance travelled by car can be calculated using the regression equation and it is equal to: \( -8.43 \times 21 + 317.93 = 140.9 \). Finally, we are also told that Jack covered 15km on foot, resulting in the total distance of \( 140.9 + 15 = 155.9 \).
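The steps in part (c) can be sketched in Python as a check, with temperature as the independent variable and distance by car as the dependent variable. The helper name `least_squares` is an assumption for this illustration; the 15 km walked is added after the prediction, as in the text.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

temperature = [15, 19, 22, 16, 21, 25, 28]         # degrees C (x)
distance = [220, 200, 140, 145, 100, 90, 100]      # km by car (y)

slope, intercept = least_squares(temperature, distance)
by_car = slope * 21 + intercept    # predicted km by car at 21 degrees
total = by_car + 15                # plus the 15 km covered on foot

print(round(slope, 2), round(intercept, 2))   # -8.43 317.93
print(round(by_car, 1), round(total, 1))      # 140.9 155.9
```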
d) Applying the formula for the percentage error:
\[\text{Percentage Error} = \left| \frac{V_a - V_e}{V_e} \right| \times 100\% \]
We get that:
\[\text{Percentage Error} = \left| \frac{175 - 155.9}{155.9} \right| \times 100\% = 12.25\% \]
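The percentage error formula is simple enough to wrap in a small helper. In this sketch the function name `percentage_error` and its argument names are assumptions; the argument order mirrors the substitutions made in the worked solutions, with \( V_a \) first and \( V_e \) second.

```python
def percentage_error(v_a, v_e):
    """Return |(v_a - v_e) / v_e| * 100, as a percentage."""
    return abs((v_a - v_e) / v_e) * 100

# Values substituted in the two percentage-error calculations in these solutions.
print(round(percentage_error(175, 155.9), 2))      # 12.25
print(round(percentage_error(950, 1116.547), 3))   # 14.916
```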
a) After inputting the values into the GDC, the obtained correlation coefficient is \( r = 0.90 \).
b) From the GDC it can be found that the regression equation is equal to \( y = 7.40x + 1237.85 \)
c) The value of the coefficient \( a \) means that each additional week of studying is associated with an increase in score of approximately 7.4 points.
d) By plugging 40 as the number of weeks into the regression equation, we get:
\[ y = 7.40 \times 40 + 1237.85 \]
\[ y = 1533.85 \]
e) No, predictions like that should not be relied on, because this would be extrapolation: the sample used to fit the regression had a minimum value of 22 for the number of weeks.
a) By inputting the values into the GDC we get that the correlation coefficient is equal to \( r = 0.94 \).
b) From the GDC it can be found that the regression equation is equal to \( y = 0.20x + 8.27 \).
c) By swapping the independent and dependent variables, we get that the regression equation is \( x = 4.33y - 32.50 \).
d) We are trying to predict the age, so the regression equation from part (c) should be used:
\[ x = 4.33 \times 13.7 - 32.50 \]
\[ x = 26.82 \]
a) (i) \( \overline{x} = 679.14 \ ($) \)
a) (ii) \( \overline{y} = 3116.43 \ ($) \)
b) From the GDC we can find that \( r = 0.96 \).
c)
Advertising money spent ($) | 520 | 511 | 356 | 679 | 823 | 765 | 1100 |
---|---|---|---|---|---|---|---|
Sales revenue ($) | 2310 | 2600 | 2246 | 3129 | 3840 | 3561 | 4129 |

Ranking each variable from smallest (1) to largest (7) gives:

Advertising money spent (rank) | 3 | 2 | 1 | 4 | 6 | 5 | 7 |
---|---|---|---|---|---|---|---|
Sales revenue (rank) | 2 | 3 | 1 | 4 | 6 | 5 | 7 |
\[ r_s = 0.96 \]
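Since neither variable has tied values here, the ranks and \( r_s \) can be checked with the rank-difference formula \( r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \). The sketch below is one way to do this in Python; the helper name `ranks` is an assumption and it only handles data without ties.

```python
def ranks(values):
    """Rank from smallest (1) to largest; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

advertising = [520, 511, 356, 679, 823, 765, 1100]       # dollars
revenue = [2310, 2600, 2246, 3129, 3840, 3561, 4129]     # dollars

rx, ry = ranks(advertising), ranks(revenue)
n = len(rx)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))       # sum of squared rank differences
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(rx)                       # [3, 2, 1, 4, 6, 5, 7]
print(ry)                       # [2, 3, 1, 4, 6, 5, 7]
print(sum_d2, round(r_s, 2))    # 2 0.96
```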
d) In the GDC we can find that \( y = 2.93x + 1125.32 \).
e) The answer to this question can be found in the value for the gradient, as each additional dollar spent on advertising will lead to a 2.93 dollar increase in sales revenue.
f) To check this, substitute the point into the regression equation:
\[ y = 2.93 \times 850 + 1125.32 \]
\[ y = 3615.82 \]
Thus, we have shown that this point lies on the regression line.
g)
\[ y = 2.93 \times 930 + 1125.32 \]
\[ y = 3850.22 \]
a) To find the two regression equations, the respective values need to be entered into the GDC. For boys the regression equation is \( y = 3.10x + 59.67 \), and for girls it is \( y = 5.23x + 47.31 \).
b) To find the answer to this part the x-coefficients need to be analyzed for both regressions. For boys, the x-coefficient of 3.10 means that an additional hour of studying leads to an estimated 3.10 increase in the exam score. For girls, the x-coefficient of 5.23 means that an additional hour of studying on average increases their score by 5.23. So, girls benefit more from an additional hour compared to boys.
c) To find the intersection of those two equations we have to equate them. Thus, we get the following system of two equations:
\[ (1) \quad y = 3.10x + 59.67 \]
\[ (2) \quad y = 5.23x + 47.31 \]
Then, by equating them:
\[ 3.10x + 59.67 = 5.23x + 47.31 \]
\[ 12.36 = 2.13x \]
\[ x = \dfrac{12.36}{2.13} \approx 5.803 \]
\[ y = 5.23 \times 5.803 + 47.31 = 77.66 \]
So the intersection point is approximately \( (5.80, 77.66) \).
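Equating the two lines can also be done in code. The following is a minimal sketch with a hypothetical helper `intersection`. Because x is kept unrounded here, the second decimal of y can differ from a hand calculation that substitutes a rounded x.

```python
def intersection(m1, c1, m2, c2):
    """Intersection of y = m1*x + c1 and y = m2*x + c2 (requires m1 != m2)."""
    x = (c1 - c2) / (m2 - m1)    # from m1*x + c1 = m2*x + c2
    y = m1 * x + c1
    return x, y

# Boys: y = 3.10x + 59.67, girls: y = 5.23x + 47.31
x, y = intersection(3.10, 59.67, 5.23, 47.31)

print(round(x, 2), round(y, 2))   # 5.8 77.66
```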
Minutes played | 25 | 18 | 29 | 31 | 35 |
---|---|---|---|---|---|
Points scored | 20 | 13 | 21 | 36 | 30 |

Experience (months) | 10 | 12 | 18 | 15 | 21 | 24 | 35 |
---|---|---|---|---|---|---|---|
Weekly salary ($) | 600 | 650 | 600 | 800 | 950 | 1200 | 1500 |

Distance (km) | 220 | 200 | 140 | 145 | 100 | 90 | 100 |
---|---|---|---|---|---|---|---|
Temperature (C) | 15 | 19 | 22 | 16 | 21 | 25 | 28 |