Solve the optimization problem of tree, should we make each rectangle contains exactly one training data...
$begingroup$
I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:
"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."
My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?
machine-learning predictive-models optimization cart
$endgroup$
add a comment |
$begingroup$
I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:
"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."
My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?
machine-learning predictive-models optimization cart
$endgroup$
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago
add a comment |
$begingroup$
I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:
"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."
My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?
machine-learning predictive-models optimization cart
$endgroup$
I was reading Trevor Hastie and Robert Tibshirani's book "An Introduction to Statistical Learning with Applications in R". In page 306, when talking about the objective function of tree model, the book says:
"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by"
$$sum_{j=1}^Jsum_{iin R_j}(y_i-hat{y}_{R_j})^2,$$
where $hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into $J$ boxes."
My question is: isn't the optimal solution to this RSS very obvious? We just partition the whole feature into $N$ rectangles such that each rectangle only contains one data point, then we achieve zero RSS. Let's put the test performance aside. For now, if we just want to find the $J$ and ${R_j}_{j=1}^J$ that minimizes the above RSS, then shouldn't we just make partitions of the feature space such that each rectangle only contains one training data point?
machine-learning predictive-models optimization cart
machine-learning predictive-models optimization cart
asked 5 hours ago
ftxxftxx
1233
1233
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago
add a comment |
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.
This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f393018%2fsolve-the-optimization-problem-of-tree-should-we-make-each-rectangle-contains-e%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.
This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.
$endgroup$
add a comment |
$begingroup$
You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.
This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.
$endgroup$
add a comment |
$begingroup$
You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.
This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.
$endgroup$
You're correct that partitioning with a single training point per 'box' would achieve zero error on the training set. But, in the optimization problem Hastie and Tibshirani described, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter--we can choose its value initially, but must consider it fixed when solving for parameters that define the boxes. If $J$ is set less than the number of data points, then using one box per data point is not a possible solution.
This is a good thing because we typically wouldn't want to end up with one box per data point. The problem with this solution is overfitting--if the data is noisy, perfect training set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen/future data. The number of leaf nodes (i.e. boxes)--and related hyperparameters governing tree size--control the tradeoff between over- and underfitting. They're typically tuned to optimize some measure of expected generalization performance. But, minimizing training set error isn't a valid way to do this.
answered 2 hours ago
user20160user20160
16.8k12758
16.8k12758
add a comment |
add a comment |
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f393018%2fsolve-the-optimization-problem-of-tree-should-we-make-each-rectangle-contains-e%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are third and fourth authors.
$endgroup$
– Nick Cox
18 mins ago