Sample donated: Boyd Warren
Last updated: September 25, 2019
Kanaan Almanasrah (8801092514), DAMI II (Fall/2017), Take-home Exam, 21/Jan/2018

Question I (1 page, 2 points)

(a) Explain (in your words) the bias/variance trade-off and compare and contrast bagging and boosting in terms of the trade-off.

Bias and variance are the two terms that make up a model's prediction error.
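For squared-error loss, the statement above is commonly written as the decomposition below. This is the standard textbook form rather than something taken from the exam text. It assumes $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the symbols $f$ (true function), $\hat{f}$ (fitted model) and $\sigma^2$ (irreducible noise) are notation introduced here for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```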
High bias means that the model is too simple to learn the dataset: it makes many assumptions and underfits the data. High variance, by contrast, means that the model is too complex and overlearns the training data; it consequently performs badly on a test or cross-validation set, i.e. it overfits and generalizes poorly to unseen data. The trade-off expresses that, to reach a model complexity that both learns effectively from the data and generalizes well to unseen data, a balance between the two terms has to be found, since increasing complexity increases variance and decreases bias, and vice versa. The Swedish term lagom captures the spirit of this objective: we want a model that is lagom in terms of bias and variance. We are always looking for a sweet spot where the two are balanced and the error is low, which can be reached by changing the model complexity, the data complexity, the sample size, etc. (Galarnyk, 2017).

Bagging and boosting are ensemble methods that are useful for enhancing the predictive performance of tree models.
Bagging combines complex models and targets the variance term: it takes a collection of low-bias, high-variance models and averages them, which reduces the total variance without significantly increasing the bias. Boosting combines many simple (weak) learners with high bias and tries to reduce that bias, scaling up the complexity so that the ensemble has less bias than the individual learners (James et al., 2017).

(b) Discuss how the bias-variance trade-off can be used to diagnose the performance of machine learning models.

We can use the bias-variance trade-off to diagnose machine learning models by looking at the error on both the training set and the test or cross-validation set for the data that the model is trying to learn (Geng and Shih, 2017). If we fit a simple model (e.g. linear regression) to data that requires a more complex model, and the training error and the test or cross-validation error are both high, then we have an underfitting problem, i.e. high bias (the model is too simple for the data). However, if we use a complex model and there is a significant gap between the training error (low) and the test or cross-validation error (high), then the model suffers from high variance: it is too complex for our dataset.

Question II (1 page, 2 points)

(a) Each rule can be extracted by following a root-to-leaf path.
Each attribute-value pair along a path is joined by the AND operator to form a conjunction (the antecedent), and the leaf holds the class prediction (the consequent). As the extracted rules come directly from the tree, they are mutually exclusive and exhaustive. Because both properties hold, the order of the rules does not matter, unless pruning was applied, in which case a class-based rule ordering can be used to handle the resulting conflicts (Han, Kamber and Pei, 2012).

A ruleset can be ordered based on:

Size: the rules with the highest number of conjuncts, i.e. number of conditions, are prioritized and ordered accordingly. This yields unordered rules that can be executed regardless of their order.

Class: classes are sorted in order of decreasing “importance”, such as by decreasing frequency, so that rules for the more frequent classes come earlier. Within each class, no ordering of the rules is applied, since there cannot be a class conflict when all rules predict the same class.

Rule: rules are ordered by measures of rule quality, such as accuracy, coverage or size (number of attribute tests in the rule antecedent), or according to input from subject matter experts. The better the quality of a rule, the higher its priority, and the ruleset is then triggered sequentially (top-down) in the specified order.

(b) Since the rules are extracted by following independent paths directly from the tree, both properties are guaranteed. The rules are mutually exclusive because there is one specific antecedent combination per leaf (representing its unique path), so no two rules are triggered for the same tuple. They are exhaustive because there is one rule for each possible attribute-value combination, so the ruleset does not require a default catch-all rule.
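The argument above can be checked on a small example. The sketch below uses a hypothetical toy tree (the attributes `outlook` and `windy` and the class labels are invented for illustration, not taken from the exam data), extracts one rule per root-to-leaf path, and counts how many rules fire for every possible attribute-value combination.

```python
from itertools import product

# Hypothetical toy tree (not from the exam): an internal node is
# (attribute, {value: subtree}); a leaf is just a class label.
TREE = ("outlook", {
    "sunny": ("windy", {"yes": "stay", "no": "play"}),
    "rainy": "stay",
})

# Hypothetical attribute domains matching the toy tree.
DOMAINS = {"outlook": ["sunny", "rainy"], "windy": ["yes", "no"]}


def extract_rules(node, path=()):
    """Return one (antecedent, consequent) rule per root-to-leaf path."""
    if not isinstance(node, tuple):  # leaf: the path so far is the antecedent
        return [(dict(path), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, path + ((attribute, value),)))
    return rules


def fires(antecedent, example):
    """A rule fires when every condition in its antecedent matches."""
    return all(example[a] == v for a, v in antecedent.items())


RULES = extract_rules(TREE)

# Number of rules that fire for each possible attribute-value combination.
FIRE_COUNTS = [
    sum(fires(antecedent, dict(zip(DOMAINS, values))) for antecedent, _ in RULES)
    for values in product(*DOMAINS.values())
]

for antecedent, label in RULES:
    conditions = " AND ".join(f"{a} = {v}" for a, v in antecedent.items())
    print(f"IF {conditions} THEN class = {label}")
print(FIRE_COUNTS)
```

A count of exactly 1 for every combination corresponds to a ruleset that is both mutually exclusive (never more than one rule fires) and exhaustive (never zero).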
Rule simplification affects these properties because it drops some of the split conditions that were checked during induction. The unsimplified rules are mutually exclusive and exhaustive by construction, since each leaf is covered by exactly one rule. After simplification this no longer holds: an antecedent may no longer be unique and can match tuples belonging to several leaves (violating mutual exclusivity), and some examples may lose the rule that covered them, so that no rule fires for them. In other words, there is no longer one rule for each possible attribute-value combination, and the ruleset then requires a default catch-all rule.

Question III (1 page, 2 points)

(a) AR(p): An autoregressive process of order p can be defined in the following way (course materials):

$X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t$

where $\phi_j$ are the autoregressive coefficients and $\varepsilon_t$ is the error term at time t. For autoregressive models, it is assumed that the value X at time t can be predicted from its previous values at specific lags, where p is the number of lags. In that sense the process is usually used to model time series that exhibit long-term dependencies on past data. As an example, the increase in sales at Christmas is explained by the increase in sales the previous Christmas (a 12-month lag) rather than by last month or the current quarter (Q4 of that year).
The order of an AR model can be chosen from the partial ACF (PACF) plot, by finding the largest lag p after which the partial autocorrelations are not significantly different from zero at the chosen significance level.

MA(q): A moving average process of order q can be defined in the following way (course materials):

$X_t = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$

where $\theta_j$ are the moving average coefficients. A moving average process predicts a point in a time series from a linear combination of the error terms at the previous q time points, with an emphasis on representing short-term trends immediately preceding the point whose value we want to predict. As an example, MA can model market behaviour such as “mean reversion”, i.e. the assumption that a stock’s price tends to move towards its average price over time (En.wikipedia.org, 2018). The order of an MA model can be chosen from the ACF plot: q is the largest lag with a significant autocorrelation, and from lag q+1 onwards the ACF values are not significantly different from zero.

ARMA(p,q): An autoregressive moving-average model of orders p and q can be defined in the following way (course materials):

$X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t$

ARMA tries to capture both short-term trends and long-term cycles by combining autoregressive and moving average processes to predict the value at time t.
An example of ARMA is a stock price that is affected by fundamental information as well as by, for instance, its tendency to move towards the average price over time due to market participants. For ARMA, fitting starts with checking stationarity; if the series is not stationary, it should be made stationary iteratively. We then compute both the ACF and the PACF: in the ACF we look for the lag at which the values settle inside the significance bounds, and in the PACF we look for the lag after which all correlation is explained. In this way we find the orders p and q used to fit the model (Quantstart.com, 2015).

(b) An MA(q) process is said to be invertible if it can be written as an AR(∞) process. In the first-order case this holds under the condition that the moving average parameter satisfies $|\theta| < 1$, which ensures that the coefficients of the equivalent AR process decrease to 0 as we move back in time. By contrast, when $|\theta| > 1$ the effect of past observations increases with their distance (Anon, 2018).
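The behaviour of the equivalent AR coefficients can be seen numerically. The sketch below assumes the first-order case, $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$, for which repeated substitution gives $X_t = \sum_{j \ge 1} \pi_j X_{t-j} + \varepsilon_t$ with $\pi_j = -(-\theta)^j$. The function name and the two values of $\theta$ are chosen here purely for illustration.

```python
def ar_weights_from_ma1(theta, n_lags):
    """AR(infinity) weights pi_j of the MA(1) process
    X_t = eps_t + theta * eps_{t-1}, rewritten as
    X_t = sum_j pi_j * X_{t-j} + eps_t, where pi_j = -(-theta)**j."""
    return [-((-theta) ** j) for j in range(1, n_lags + 1)]


# |theta| < 1: the weights shrink towards 0 as the lag grows.
INVERTIBLE = ar_weights_from_ma1(0.5, 6)

# |theta| > 1: the weights grow, so distant observations matter more.
NON_INVERTIBLE = ar_weights_from_ma1(2.0, 6)

print(INVERTIBLE)
print(NON_INVERTIBLE)
```

The magnitude of the weight at lag j is $|\theta|^j$, which is why the condition on $|\theta|$ decides whether the influence of the past dies out or explodes.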
An AR(p) process is invertible by definition; therefore, an ARMA(p,q) process is invertible if its MA(q) part is invertible (Zaiontz, 2014).

Question IV (2 pages, 2 × 2 points)

(a) Using the provided data, define a null-hypothesis for testing whether there is any significant difference between BOSS and COTE using the paired t-test at the α = 0.05 significance level.
You should carefully state the null-hypothesis and the alternative hypothesis and show the steps required for performing the significance test.

$H_0: \mu_{COTE} - \mu_{BOSS} = 0$, or $\mu_d = 0$
$H_1: |\mu_{COTE} - \mu_{BOSS}| > 0$, or $\mu_d \neq 0$

Our null hypothesis is that the mean performance of COTE and BOSS is equal, and our alternative hypothesis is that the two means differ. The provided dataset contains the results of each algorithm on 85 datasets (matched samples). We treat the performance results on each dataset as a trial, similar to the case presented in the lecture, where two classifiers are tested on the same partitions over 10 runs of 10-fold cross-validation and the result of each 10-fold CV is considered a single result, i.e. each run is a different trial. The t-test would not be suitable if the normality assumption were difficult to establish, which requires more than 30 domains; here the t-test is applicable, as there are more than 30 domains.
1. We take n = 85 as the number of trials, i.e. the number of results. The degrees of freedom are df = n - 1 = 84, which together with the α = 0.05 significance level give the critical value (Student's t-value for a given probability and degrees of freedom). The two-tailed critical t-value, ±1.98860968, is obtained from the table of critical values.
2. We calculate $\bar{d}$, i.e. the mean of the differences between the performance of the two algorithms over the matched samples: $\bar{d} = -0.02448148$.
3. We calculate the standard deviation of the differences: $s_d = 0.04730315$.
4. We calculate t: $t = \bar{d} / (s_d / \sqrt{n}) = -0.02448148 / (0.04730315 / \sqrt{85}) = -4.7715$.

The null-hypothesis can be rejected at the 0.05 significance level, as the absolute value of the statistic, 4.7715, is greater than the critical value 1.98860968 from the table.
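The arithmetic of the steps above can be reproduced from the quoted summary statistics alone. The sketch below is a minimal check, not the procedure used to obtain the original numbers; the variable names are chosen here, and the critical value is copied from the text rather than computed.

```python
import math

# Summary statistics quoted in the text (85 matched datasets).
n = 85
mean_diff = -0.02448148   # mean of the paired differences
sd_diff = 0.04730315      # standard deviation of the paired differences
t_critical = 1.98860968   # two-tailed, alpha = 0.05, df = 84 (from the text)

degrees_of_freedom = n - 1
standard_error = sd_diff / math.sqrt(n)
t_statistic = mean_diff / standard_error

# Two-tailed test: compare the absolute value with the critical value.
reject_null = abs(t_statistic) > t_critical

print(f"df = {degrees_of_freedom}, t = {t_statistic:.4f}, reject H0: {reject_null}")
```

Comparing the absolute value of t with the critical value is what makes the test two-tailed, which matches the alternative hypothesis that the means merely differ.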
As our p-value is below 0.05 (p-value = 0.000007627), we can also conclude that the mean difference between the two paired samples is significant.

(b) Using the provided data, define a null-hypothesis for testing whether there are any significant differences between BOSS, COTE, LS and TSBF using a Friedman test at the α = 0.01 significance level. You should carefully state the null-hypothesis and alternative hypothesis and show the steps required for performing the significance test.

We can state the null and alternative hypotheses for the Friedman test as follows:

$H_0$: $R_{BOSS}$, $R_{COTE}$, $R_{LS}$, $R_{TSBF}$ have no significant difference
$H_1$: $R_{BOSS}$, $R_{COTE}$, $R_{LS}$, $R_{TSBF}$ have a significant difference

We start by setting n = 85 and k = 4, ranking the models' performance on each dataset, and calculating the mean rank of each model, which gives $R_{BOSS} = 2.3$, $R_{COTE} = 1.5$, $R_{LS} = 3.094118$ and $R_{TSBF} = 3.105882$. We then calculate the Friedman statistic:

$\chi^2_F = \frac{12n}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right] = \frac{12 \cdot 85}{4 \cdot 5} \left[ (2.3^2 + 1.5^2 + 3.094118^2 + 3.105882^2) - \frac{4 \cdot 5^2}{4} \right] = 89.763$

with p-value < 2.2e-16.

Then we compute the Iman-Davenport statistic:

$F_F = \frac{(n-1)\chi^2_F}{n(k-1) - \chi^2_F} = \frac{84 \cdot 89.763}{85 \cdot 3 - 89.763} = 45.632$
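Both statistics can be recomputed from the mean ranks quoted in the text. The sketch below only checks the arithmetic; it starts from the reported mean ranks rather than from the per-dataset results, and the variable names are chosen here for illustration.

```python
# Values quoted in the text.
n, k = 85, 4
mean_ranks = {"BOSS": 2.3, "COTE": 1.5, "LS": 3.094118, "TSBF": 3.105882}

# Friedman statistic computed from the mean ranks.
sum_of_squares = sum(rank ** 2 for rank in mean_ranks.values())
chi2_f = (12 * n) / (k * (k + 1)) * (sum_of_squares - k * (k + 1) ** 2 / 4)

# Iman-Davenport correction of the Friedman statistic.
f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)

# Degrees of freedom of the F-distribution that F_F is compared against.
df1, df2 = k - 1, (k - 1) * (n - 1)

print(f"chi2_F = {chi2_f:.3f}, F_F = {f_f:.3f}, df = ({df1}, {df2})")
```

A quick sanity check on the inputs is that the mean ranks sum to k(k+1)/2 = 10, as they must when every dataset assigns the ranks 1 to 4.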
The Iman-Davenport statistic is compared with the critical values of the F-distribution with k-1 = 3 and (k-1)(n-1) = 252 degrees of freedom, which for α = 0.01 gives a critical F-value of 3.86024650. As 45.632 > 3.86, the null-hypothesis can be rejected. The obtained p-value likewise indicates that we can reject the null hypothesis that all the models perform the same. Hence, we can perform the Nemenyi post-hoc test to find out where the differences are. We compute the Nemenyi critical difference, given α = 0.01 and $q_\alpha = 3.113$, as $CD = q_\alpha \sqrt{k(k+1)/(6n)} = 3.113 \sqrt{(4 \cdot 5)/(6 \cdot 85)} = 0.61646$, and compare it with the difference in mean rank for every pair. For the pairs BOSS-COTE, BOSS-LS, BOSS-TSBF, COTE-LS and COTE-TSBF the difference is larger than CD, and therefore we can reject the hypothesis that the two models have the same rank.
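The critical difference and the pairwise comparisons can be checked with a short sketch. It starts from the mean ranks and the critical value $q_\alpha$ quoted in the text (neither is recomputed here), and the variable names are chosen for illustration.

```python
import math
from itertools import combinations

# Values quoted in the text.
n, k = 85, 4
q_alpha = 3.113  # Nemenyi critical value for alpha = 0.01, k = 4 (from the text)
mean_ranks = {"BOSS": 2.3, "COTE": 1.5, "LS": 3.094118, "TSBF": 3.105882}

critical_difference = q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# A pair differs significantly when its mean-rank gap exceeds the CD.
significant = {
    (a, b): abs(mean_ranks[a] - mean_ranks[b]) > critical_difference
    for a, b in combinations(mean_ranks, 2)
}

print(f"CD = {critical_difference:.5f}")
for (a, b), differs in significant.items():
    gap = abs(mean_ranks[a] - mean_ranks[b])
    print(f"{a}-{b}: |difference| = {gap:.5f}, significant: {differs}")
```

With four models there are six pairs in total, so the comparison covers every pair exactly once.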
As for LS and TSBF, the difference in mean rank is less than CD (0.01176 < 0.61646), and therefore we cannot reject the hypothesis that they have the same rank.

References

Galarnyk, M. (2017). Machine Learning. [online] GitHub. Available at: https://github.com/mGalarnyk/datasciencecoursera/blob/master/Stanford_Machine_Learning/Week6/AdviceQuiz.md [Accessed 15 Jan. 2018].

Geng, D. and Shih, S. (2017). Machine Learning Crash Course: Part 4 – The Bias-Variance Dilemma. [online] Ml.berkeley.edu. Available at: https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/ [Accessed 12 Jan. 2018].

Mallick, S. (2017). Bias-Variance Tradeoff in Machine Learning | Learn OpenCV. [online] Learnopencv.com. Available at: https://www.learnopencv.com/bias-variance-tradeoff-in-machine-learning/ [Accessed 21 Jan. 2018].

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2017). An introduction to statistical learning. 8th ed. Springer, pp. 33-36.

Han, J., Kamber, M. and Pei, J. (2012). Data Mining. 3rd ed. Elsevier, pp. 355-359.

Anon, (2018). [online] Available at: https://onlinecourses.science.psu.edu/stat510/node/48 [Accessed 18 Jan. 2018].

Zaiontz, C. (2014). Invertibility of MA(q) Process | Real Statistics Using Excel. [online] Real-statistics.com. Available at: http://www.real-statistics.com/time-series-analysis/moving-average-processes/invertibility-ma-processes/ [Accessed 13 Jan. 2018].

Quantstart.com. (2015). Autoregressive Moving Average ARMA(p, q) Models for Time Series Analysis – Part 1 | QuantStart. [online] Available at: https://www.quantstart.com/articles/Autoregressive-Moving-Average-ARMA-p-q-Models-for-Time-Series-Analysis-Part-1 [Accessed 18 Jan. 2018].

En.wikipedia.org. (2018). Mean reversion (finance). [online] Available at: https://en.wikipedia.org/wiki/Mean_reversion_(finance) [Accessed 17 Jan. 2018].
Note: in addition to the course materials, the references above were used. As they were not copied verbatim, they are cited within the answer text wherever possible. Course materials are not referenced.