Review of "The geometry of non-linear stochastic gradient descent"
a paper by K.A. Krakowski, R.E. Mahony, R.C. Williamson, M.K. Warmuth
---------
Reviewer 1 recommends accepting the paper after major revisions.
Reviewer 2 recommends accepting the paper as is.
Reviewer 3 recommends rejecting the paper but is willing to reconsider
it after major revisions.
As JMLR does not allow for a "Reconsider after major revisions"
decision, I am rejecting the paper.
I encourage the authors to rewrite the paper from scratch, taking into
account the comments of the reviewers. If you decide to resubmit the
paper to JMLR, please include in the cover letter a reference to this
submission.
------------------------------------------------------------
review 1
This paper studies a generalization of the famous "gradient
descent" online-learning algorithm to the setting where the domain,
or "decision set", of the online learner is a general Riemannian
manifold.
Geometry is intimately linked to online learning, and in
particular to the geometry of the decision set. As such, the paper
has the potential to be a superb research paper, but in its current
state I think it is half-baked. I appreciate the theoretical
contribution presented, and think that the introduction of
Riemannian geometry could prove to be extremely useful in the
future; I would thus recommend acceptance subject to the comments
and recommendations below.
1. There is no mention of computational efficiency in the
paper, and this is a major caveat. When is the geometric
variant of GD efficient? If the manifold is not a convex set, I
doubt that this, or any algorithm for that matter, can be made
efficient. Please address this, stating when the method is
efficient and when lower bounds can be shown.
2. Page 4 references some "sampling process". In the online
setting this is irrelevant. I suggest using game-theoretic
notation, with regret as the performance measure. You can note
that this generalizes the PAC and other statistical learning
models.
3. Presentation - I consider the major contribution of this
paper to be the introduction of the geometric concepts. As such,
these are not well presented. A general rule: do not define
concepts until the place where you use them. Is there any use
for "Gauss's great theorem"? If yes, you can explain it where
it is used. If not, you can put this and other irrelevant
(though interesting) facts in a separate section, warning the
reader that these concepts are not used. (If they do provide
crucial insight, then such facts should be stated, but again
with a clear separation between what is used and what is not.)
On the same issue, the basic concepts are not well presented.
Section 2 is extremely dense, and on a first reading it is very
hard to make out almost anything from it. Give plenty of examples;
this is perhaps the most important section of the paper, given that
its contribution is conceptual. Give examples of everything,
from manifolds to "exponential maps".
4. Results - It is not clear whether the paper gives algorithms
with better bounds than the state of the art. I imagine this
should be the case at least for some manifolds. If so, state it
prominently. Define the parameters of manifolds that are
important for performance bounds, and state the bounds of previous
and current algorithms in such terms.
I suggest giving a short formal statement of the results at the
beginning.
5. Related work - The geometry of the decision set was recently
considered in a related paper of Abernethy, Hazan, and Rakhlin
in COLT 2008. Their notion of geometry was to use the
characterization of self-concordant barriers, and in their
application this gave a substantial benefit in performance (which
suggests that this paper's geometric framework may be beneficial
too; after all, the motivation should be better performance). At
the very least that paper should be referenced, but it would be
great to draw connections (I realize this may not be time-feasible
for the current paper).
-----------------------------------------------------------------
review 2
Short Summary: The paper provides a study of gradient descent algorithms which respect the Riemannian geometry of the manifold. Diffeomorphisms (isomorphisms for smooth manifolds) between manifolds naturally suggest an equivalence relation on algorithms. Regret bounds are proved for explicit and implicit updates, closely following the corresponding results in R^n dating back to the 1990s. These algorithms originate from minimizing the loss plus a divergence to the previous point. Finally, the authors show that link functions (studied previously) provide a diffeomorphism between the manifold and R^n if the manifold has zero curvature.
Conclusion: I think the paper provides a nice generalization of known results to manifolds and is suitable for publication.
Informally, the ideas can be summarized as follows. Look at a manifold with its Riemannian metric. This induces a notion of distance on the manifold. Take the divergence to be this distance squared (which leads to GD in R^n). A previously studied idea of minimizing a "loss plus divergence" leads to an algorithm with a bound on its regret; the proof, at a high level, is very similar to the proof for the Euclidean metric. Here, the geometric interpretation of the explicit and implicit updates is very nice. The divergence terms in the regret bound need, at the end of the day, to be bounded using properties of the specific manifold. Link functions turn out to provide a diffeomorphism with R^n, thus giving a handle on the regret terms. Not surprisingly, the authors recover the known results for EG and GD (gradient descent).
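For concreteness, here is a minimal sketch of the explicit update on the unit sphere, where one step follows a geodesic via the exponential map. This is my own illustration, not code from the paper; the projection helper, step size, and linear loss are all assumptions made for the example.

```python
import numpy as np

def riemannian_grad(w, euclid_grad):
    """Project the Euclidean gradient onto the tangent space of the
    unit sphere at w (illustrative helper, my own naming)."""
    return euclid_grad - np.dot(euclid_grad, w) * w

def exp_map_sphere(w, v):
    """Exponential map on the unit sphere: follow the geodesic from w
    in tangent direction v for arc length ||v||."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return w
    return np.cos(norm) * w + np.sin(norm) * (v / norm)

def explicit_update(w, euclid_grad, eta):
    """One explicit (Riemannian) gradient step:
    w_{t+1} = Exp_{w_t}(-eta * grad), the manifold analogue of
    w_{t+1} = w_t - eta * grad in R^n."""
    g = riemannian_grad(w, euclid_grad)
    return exp_map_sphere(w, -eta * g)

# Example: minimize the linear loss <a, w> over the unit sphere.
# The minimizer is -a / ||a||.
a = np.array([1.0, 2.0, 2.0])
w = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    w = explicit_update(w, a, eta=0.1)
print(np.allclose(w, -a / np.linalg.norm(a), atol=1e-4))  # prints True
```

Note that the iterate stays exactly on the manifold at every step, which is precisely what distinguishes this update from projected Euclidean GD.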
Novelty: In some sense, the results are straightforward, yet require some basic knowledge of differential geometry. I would characterize the main contribution of the paper as "generalizing the known results to manifold distances". The connection between isometries and link functions is nice. Overall, the paper serves as a starting point for obtaining interesting results in specific settings.
One interesting algorithm is introduced, which arises from looking at the simplex as diffeomorphic to the positive orthant of a sphere. In addition to my concerns about the validity of this construction (see below), it is not clear why this leads to a better algorithm.
Significance: I believe this work provides an important generalization of known results to the manifold domain.
Clarity: The paper is well-written.
Specific comments:
1. p. 9, before Example 3: It would be good to provide more information about the Fisher information metric of the multinomial distribution.
2. Please define the pullback map.
3. In Example 3, the simplex is mapped only to the positive orthant of the sphere. This fact is missing from the notation. Furthermore, this means that the mapping is not a diffeomorphism with the whole sphere. What happens if the algorithm for the sphere produces a point w_{t+1} outside of the positive orthant? How is it mapped back to the simplex? Is Corollary 5 still valid?
4. The assumption in Theorem 7 that the image of \gamma_t is \lambda_t-convex essentially means that it must hold for every geodesic, as w^* is unknown.
5. The implicit update algorithm is essentially what has been termed the "be the leader" algorithm, as it "knows the future". It has been shown many times before that this algorithm suffers only constant regret (independent of T), which is similar to the statement of Theorem 9. It would be nice to mention these connections to the known results.
6. Please define E_l and E_m on page 20.
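Regarding comment 3 above: the construction in question is presumably the standard square-root embedding of the simplex (possibly up to a scale factor), under which the Fisher metric becomes the round metric on the sphere. A small sketch, in my own notation rather than the paper's, of why the image is only the positive orthant:

```python
import numpy as np

# Assumed construction: the square-root map w -> sqrt(w) sends the
# probability simplex onto the positive orthant of the unit sphere,
# since sum_i w_i = 1 implies sum_i sqrt(w_i)^2 = 1.
rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(4))   # a point on the simplex
u = np.sqrt(w)                  # its image on the sphere
print(np.isclose(np.linalg.norm(u), 1.0))  # prints True: u is on the sphere
print(np.all(u >= 0))                      # prints True: only the positive orthant

# The concern: a sphere point with a negative coordinate, e.g. one
# produced by an update step, has no preimage under this map.
# Squaring it back, v -> v**2, does land on the simplex, but it
# silently flips the sign, so the inverse is not well defined
# outside the positive orthant.
v = u.copy()
v[0] = -v[0]                          # still on the sphere, outside the orthant
print(np.isclose(np.sum(v**2), 1.0))  # prints True
```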
Typos:
1. p.2, 3rd paragraph: "the view that the geometric algorithm are" -> "algorithms"
2. p. 6, Remark 2: on-linear -> on-line
3. p. 6, 2nd paragraph in 3.2: "for given a sample" -> "for a given sample"
4. p. 13, last line: "it make sense" -> "it makes sense"
5. p. 22, last line: "it follows the the distance between two consecutive point" -> "it follows that the distance between two consecutive points"
----------------------------------------------------------
review 3
The paper studies stochastic gradient descent, and a variant based on
an implicit update, in the case when the weights belong to a Riemannian
manifold. Stochastic gradient descent (with implicit or explicit updates)
is well studied in the case of weights belonging to Euclidean space
or to some RKHS. A different generalization is provided by link functions
(introduced by one of the authors), by which one can reparametrize the
weights of stochastic gradient descent, obtaining variants such as p-norm
gradient descent or exponentiated gradient.
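For reference, the exponentiated gradient variant mentioned above has the familiar multiplicative form; the following minimal sketch (standard EG, not code from the paper) shows one such reparametrized update on the simplex:

```python
import numpy as np

def eg_update(w, grad, eta):
    """One exponentiated-gradient (EG) step on the simplex:
    w_{t+1,i} proportional to w_{t,i} * exp(-eta * grad_i),
    renormalized. Equivalently, gradient descent after
    reparametrizing the weights through the entropic link function."""
    v = w * np.exp(-eta * grad)
    return v / v.sum()

# Example: EG on a fixed linear loss <a, w> concentrates the weight
# on the coordinate with the smallest loss.
a = np.array([3.0, 1.0, 2.0])
w = np.ones(3) / 3.0
for _ in range(100):
    w = eg_update(w, a, eta=0.5)
print(np.argmax(w))  # prints 1 -- the coordinate with the smallest loss
```

Note that, in contrast to plain GD, the iterate stays on the simplex by construction; this is the kind of behaviour the manifold framework is meant to capture intrinsically.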
In a previous paper by two of the authors, the framework of stochastic
gradient descent on manifolds was introduced, and link functions were
shown to be special cases. This paper continues the investigation,
proving relative loss bounds for stochastic gradient descent with
both types of updates. Moreover, the relationship with link functions
is made further explicit, and the known relative loss bounds are recovered
as special cases of the manifold framework.
My overall impression of this paper is negative, for three main reasons.
First, the algorithmic framework described in the paper fails to provide
concrete, interesting instances of new algorithms. The generality of the
approach is not exploited to design new updates that are shown to be
effective in specific domains. Also, no genuinely new bounds are presented,
and no new proof techniques are introduced.
Second, although one of the paper's main contributions is the explanation of
the connection between Riemannian manifolds and learning, the writing is not
particularly friendly to the uninitiated reader. In my opinion this is a paper
written for people who are already quite familiar with the subject of
Riemannian manifolds (not the majority of JMLR readers, perhaps).
Third, although this paper has been submitted seven years after the initial
work on preferential structures for online learning, much of it appears to
reiterate concepts and notions that already appeared in that earlier paper.
This might be a false impression, but in any case I think the presentation
should be improved in order to better distinguish the conceptual novelties
that mark the progress and the insights gained in this paper.
In conclusion, even though the last two comments essentially involve
presentational issues, the first one is meant to be more substantial. My
recommendation is thus rejection, although I am willing to reconsider the
paper after a major revision addressing all three comments.