What if you could improve your AI algorithm's performance by up to twenty percent? Describe natural phenomena better? And enhance classical math theorems? Perhaps with a new Calculus operator, on top of the derivative and the integral?
Here we present a new and simple operator in basic Calculus. It renders prominent deep learning frameworks and various AI applications more numerically robust. It is further advantageous in continuous domains, where it classifies trends accurately regardless of functions’ differentiability or continuity.
I used to teach my Advanced Calculus students to infer a function's momentary trend from its gradient sign. Years later, in our AI research, my team encountered this notion often. When analyzing custom loss functions numerically, we want to treat momentary trends concisely and efficiently:
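In numerical code, that boils down to classifying a step's direction from the sign of a finite difference. A minimal sketch (the helper names are ours, not from any framework):

```python
def sign(a):
    # Returns +1, -1, or 0, like the mathematical sgn function.
    return (a > 0) - (a < 0)

def trend(f, x, h=1e-8):
    # Momentary trend of f at x: the sign of the finite difference,
    # with no division by h.
    return sign(f(x + h) - f(x))

# x^2 decreases to the left of 0 and increases to the right of it:
print(trend(lambda x: x**2, -1.0))   # -1
print(trend(lambda x: x**2, 1.0))    # 1
```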
Whenever you glance at your car's dashboard, you notice Calculus. The mileage reflects the definite integral of your speed over the trip so far, and the speedometer reflects the derivative of your position with respect to time. Both physical instruments merely approximate abstract notions.
The gear stick indicates your travel direction. Often, its matching mathematical concept is the derivative sign. If the car moves forward, in reverse, or stands still, then the derivative is positive, negative, or zero, respectively. However, calculating the derivative just to evaluate its sign is occasionally superfluous. As Aristotle and Newton famously argued, nature does nothing in vain. Following their approach, we probably needn't go through a rate calculation to define the instantaneous trend of change. If the trend of change is an important term in the analysis of processes, perhaps we ought to capture it directly rather than as a by-product of the derivative?
This occasional superfluousness of the derivative causes the aforementioned issues in numeric and analytic trend-classification tasks. To tackle them, let's attempt to simplify the derivative sign as follows:
$$\require{color}\begin{align*}
sgn\left[f_{\pm}’\left(x\right)\right] & =sgn\underset{{\scriptscriptstyle h\rightarrow0^{\pm}}}{\lim}\left[\frac{f\left(x+h\right)-f\left(x\right)}{h}\right]\\
& \colorbox{yellow}{$\texttip{\neq}{Apply suspension of disbelief: this deliberate illegal transition contributes to the below discussion }$}\underset{{\scriptscriptstyle h\rightarrow0^{\pm}}}{\lim}sgn\left[\frac{f\left(x+h\right)-f\left(x\right)}{h}\right]\\
& =\pm\underset{{\scriptscriptstyle h\rightarrow0^{\pm}}}{\lim}sgn\left[f\left(x+h\right)-f\left(x\right)\right]
\end{align*}$$
Note the deliberately erroneous transition in the second line. Switching the limit and the sign operators is wrong because the sign function is discontinuous at zero. Therefore the resulting operator, the limit of the sign of $\Delta y$, doesn't always agree with the derivative sign. Further, the multiplicative nature of the sign function allows us to cancel out the division operation. These facts may work in our favor, given the issues we saw earlier with the derivative sign. Perhaps it's worth scrutinizing the limit of the change's sign in trend classification tasks?
Clearly, the numerical approximations of both the (right-) derivative sign and of $\underset{{\scriptscriptstyle h\rightarrow0^{+}}}{\lim}sgn\left[f\left(x+h\right)-f\left(x\right)\right]$ equal the sign of the finite difference, $sgn\left[f\left(x+h\right)-f\left(x\right)\right]$, for some small value of $h$. However, the sign of the difference quotient goes through a redundant division by $h$. This roundtrip costs an extra logarithmic- or linearithmic-time division (depending on the precision) and might overflow since $h$ is small. In that sense, it pays to think of trend approximations as $\underset{{\scriptscriptstyle h\rightarrow0^{\pm}}}{\lim}sgn\left[f\left(x+h\right)-f\left(x\right)\right]$ rather than as the derivative sign. On average, this saves about 30% of the runtime of the trend calculation itself, relative to the numerical derivative sign. Accounting for the overhead of the algorithm's other atomic operations, the overall time saved can still reach 20%, depending on the algorithm.
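The overflow point is easy to demonstrate: once $h$ is small enough, the difference quotient exceeds the float64 range, while the sign of the difference is unaffected. A toy illustration:

```python
def sign(a):
    return (a > 0) - (a < 0)

h = 1e-309      # a tiny (subnormal) step size
delta = 2.0     # some finite difference f(x+h) - f(x)

quotient = delta / h          # overflows float64 to infinity
print(quotient)               # inf
print(sign(delta))            # 1, with no division involved
```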
Similar considerations lead us to apply the quantization directly to $\Delta y$ when we're after a generic derivative quantization rather than its sign. That is, instead of quantizing the derivative, we calculate $Q\left[f\left(x+h\right)-f\left(x\right)\right]$, where $Q$ is a custom quantization function. In contrast with the trend operator, this quantization doesn't preserve the derivative value upon skipping the division operation. However, the technique is good enough in algorithms such as gradient descent. If the algorithmic framework entails multiplying the gradient by a constant (e.g., the learning rate), we may spare the division by $h$ in each iteration and fold it into the pre-calculated constant itself.
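In the coarsest case, where $Q = sgn$ and $h>0$, we have $sgn\left[\Delta y/h\right]=sgn\left[\Delta y\right]$, so the update needs no division at all. A toy sketch of such a sign-driven descent (an illustration, not a production optimizer):

```python
def sign(a):
    return (a > 0) - (a < 0)

def sign_descent(f, x0, lr=0.01, h=1e-6, steps=500):
    # Gradient-descent-like updates driven by the trend of f alone:
    # since sgn[(f(x+h) - f(x)) / h] == sgn[f(x+h) - f(x)] for h > 0,
    # we never divide by h.
    x = x0
    for _ in range(steps):
        x -= lr * sign(f(x + h) - f(x))
    return x

# Minimizing (x - 3)^2 from x0 = 0 drifts toward the minimum at 3:
x_star = sign_descent(lambda x: (x - 3.0) ** 2, 0.0)
print(round(x_star, 1))   # 3.0
```

For a generic linear quantizer $Q$, the division by $h$ can instead be folded once into the step-size constant, i.e., stepping by $\left(lr/h\right)\cdot Q\left[\Delta y\right]$.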
We can coarsely estimate the percentage of computation time saved by considering the computations other than the trend operator itself. For example, suppose we can spare a single division operation in each iteration of gradient descent, and assume the function is defined numerically on a finite discrete set of points. Then avoiding the division in the optimization iterations is significant, since division is the most time-consuming operation in each iteration. In contrast, if the optimized function is computationally involved, for example, one that includes logical operators, then the merit of sparing a division, while still present, would be more modest.
Another numerical advantage of this operator over the derivative sign is the bound on its estimation error. This matters when the first derivative vanishes. While the estimation error of the first derivative is $\mathcal{O}\left(\Delta x\right)$, that of a $k$-th order derivative is orders of magnitude smaller, $\mathcal{O}\left(\Delta x^{k}\right)$. As we show below via the Taylor series expansion of this trend operator, it equals the sign of a higher-order derivative up to a sign coefficient (more precisely, the first one that doesn't vanish). It thus turns out that when estimating the trend operator via the sign of the first non-vanishing derivative, the estimation error has a tighter bound than that of the derivative sign.
The exploding-gradient issue also hinders the derivative sign, and we can mitigate it with this trend operator. Other numerical issues with the derivative naturally transfer to the derivative sign and are often tackled with the trend operator as well.
Given this operator’s practical merit in discrete domains, let’s proceed with theoretical aspects in continuous ones.
We can establish a more rigorous justification by noticing how the definition of local extrema coincides with that of the operator at stake. In contrast with its basic Calculus analog, the following claim provides both a sufficient and a necessary condition for stationary points:
Next, let's check in which scenarios this operator is well defined. We'll cherry-pick functions with different characteristics around $x=0$. For each such function, we ask which of the properties (continuity, differentiability, and the existence of the operator at stake, that is, the existence of a local trend from both sides) hold at $x=0$. Explore the following animation to gain intuition from some basic examples:
We would also like to find out which of those properties hold across an entire interval (for example, $[-1,1]$). To that end, we add two interval-related properties: Lebesgue and Riemann integrability. Feel free to explore those properties in the following widget, which introduces slightly more involved examples than the above animation. Switch between the granularity levels, and hover over the sections or click on the functions in the table, to confirm which conditions hold in each case:
Function |
---|
$f_1\left(x\right)=x^2$ |
$f_2\left(x\right)=\left(\frac{1}{2}-\boldsymbol{1}_{\mathbb{Q}}\right)x^{2}$ |
$f_3\left(x\right)=x^{\frac{1}{3}}$ |
$f_4\left(x\right)=t\left(x-\sqrt{2}\right),$
where $t$ is Thomae's function |
$f_5\left(x\right)=sgn\left(x\right)$ |
$f_6\left(x\right)=\begin{cases} \sin\left(\frac{1}{x}\right), & x\neq0\\ 0, & x=0 \end{cases}$ |
$f_7\left(x\right)= x^{1+2\cdot\boldsymbol{1}_{\mathbb{Q}}}$ |
$f_8\left(x\right)=R\left(x\right),$
where $R$ is Riemann's function ([43]) |
$f_{9}\left(x\right)=\begin{cases} \sin\left(\frac{1}{x}\right), & x\neq0\\ 2, & x=0 \end{cases}$ |
Function |
---|
$f_1\left(x\right)=\boldsymbol{1}_{\mathbb{Q}}$ (Dirichlet) |
$f_2\left(x\right)=\begin{cases} \sin\left(\frac{1}{x}\right), & x\neq0\\ 0.5, & x=0 \end{cases}$ |
$f_3\left(x\right)=\begin{cases} \sin\left(\frac{1}{x}\right), & x\neq0\\ 0, & x=0 \end{cases}$ |
$f_4\left(x\right)=\begin{cases} x^2\sin\left(\frac{1}{x}\right), & x\neq0\\ 0, & x=0 \end{cases}$ |
$f_5\left(x\right)=x$ |
$f_6\left(x\right)=\sqrt{\left|x\right|}$ |
$f_{7}\left(x\right)=sgn\left(x\right)$ |
$f_{8}\left(x\right)=\begin{cases} \frac{1}{\sqrt{\left|x\right|}}, & x\neq0\\ 0, & x=0 \end{cases}$ |
$f_{9}\left(x\right)=\begin{cases} \frac{1}{x}, & x\neq0\\ 0, & x=0 \end{cases}$ |
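The table's distinctions can also be probed numerically, at least roughly: if the sign of $f\left(x+h\right)-f\left(x\right)$ stabilizes as $h\rightarrow0^{+}$, a right trend plausibly exists there. A sketch (a heuristic probe, not a substitute for the analytic definitions):

```python
import math

def sign(a):
    return (a > 0) - (a < 0)

def right_trend_signs(f, x, hs):
    # Signs of f(x+h) - f(x) along a sequence of shrinking h > 0.
    return [sign(f(x + h) - f(x)) for h in hs]

hs = [10.0 ** (-k) for k in range(1, 8)]

# sqrt(|x|) isn't differentiable at 0, yet its right trend is clear:
print(right_trend_signs(lambda x: math.sqrt(abs(x)), 0.0, hs))
# -> [1, 1, 1, 1, 1, 1, 1]

# sin(1/x) (with f(0) = 0) has no one-sided trend at 0; signs keep flipping:
print(right_trend_signs(lambda x: math.sin(1 / x) if x else 0.0, 0.0, hs))
```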
We may extend the discussion to properties that hold in intervals almost everywhere, rather than throughout the interval. This is beyond this article's scope, but as a taste we'll mention the function $f(x)=\sum\limits_{n=1}^{\infty}f_n(x)$, where $f_n(x)=2^{-n}\phi\left(\frac{x-a_n}{b_n-a_n}\right)$, $\phi$ is the Cantor-Vitali function, and $\{(a_n,b_n):n\in\mathbb{N}\}$ is the set of all intervals in $\left[0,1\right]$ with rational endpoints. It's possible to show that this function has a trend everywhere and is strictly monotonic, yet its derivative vanishes almost everywhere. In this example, the notion of an instantaneous trend is coherent with the function's monotonic behavior in the interval, in contrast with the vanishing derivative. Further, according to the Baire category theorem, almost all continuous functions are nowhere differentiable. Therefore, we could use an extension to the set of functions whose monotonicity can be analyzed concisely. Finally, we mention the function $g\left(x\right)=x^{1+\boldsymbol{1}_{\mathbb{Q}}}$ defined over $\left(0,1\right)$. In its definition domain, $g$ is discontinuous everywhere, but detachable from the left almost everywhere.
Equipped with a concise definition of the instantaneous trend of change, we may formulate analogs to Calculus theorems with the following trade-off: these simple corollaries inform us of the function's trend rather than its rate. In return, they hold for a broad set of detachable yet non-differentiable functions. They are also outlined in [31].
Don’t worry about the sum rule holding only for functions whose detachments aren’t additive inverses. We will handle these cases, assuming differentiability, later on with Taylor series.
Let's begin this discussion with a simple example. If $f^{;}=g^{;}=+1$ and $f=g=0$ at a point $x$, then the detachment of the product is $\left(fg\right)^{;}=+1$ as well. Following this simple scenario, one might be tempted to think that the product rule for the detachment is simply $\left(fg\right)^{;}=f^{;}g^{;}$. However, upon shifting $g$ vertically downwards, that is, requiring that $g(x)<0$, we obtain a setting where $\left(fg\right)^{;}=-1$, even though the detachments of $f,g$ remain as before. It means that the detachment of the product $\left(fg\right)^{;}$ necessarily depends also on the signs of $f,g$ at the point of interest. Indeed, assuming $g$ is continuous, the product's detachment is $-1$ in this new setting.
However, let us recall that in Semi-discrete Calculus we aren't restricted to differentiable or even continuous functions. What if $g$ is discontinuous? As long as $g$ maintains its sign in the neighborhood of the point, it's possible to show that the detachment of the product remains $-1$. But if $g$ changes its sign, meaning there exists a neighborhood of $x$ where $g$ is zeroed or one where $g$ is positive, then $\left(fg\right)^{;}$ can be either $0$ or $+1$, respectively. This implies that we should somehow bound the functions' signs. I suggest paying attention to an intuitive trait of the functions $f,g$: sign-continuity.
Given a function $f$, we will say that it is sign-continuous (s.c.) at a point $x$ if $sgn\left(f\right)$ is continuous there. If all the partial limits of $sgn\left(f\right)$ are different from its sign at $x$, then we will say that $f$ is sign-discontinuous (s.d.) there. Observe that the function's sign-continuity at a point may be determined given its sign and detachment there. That is, if $f^{;}=0$ or $ff^{;}>0$, then $f$ is sign-continuous. We will say that $f$ is inherently sign-continuous (i.s.c.) in such cases. In contrast, if $f^{;}\neq0$ and $f=0$, or $ff^{;}<0$, then $f$ is inherently sign-discontinuous (i.s.d.).
A function that is either sign-continuous or sign-discontinuous will be called sign-consistent. We restrict ourselves in the following discussion to sign-consistent functions.
In the process of formulating the product rule, we repeatedly conduct analyses similar to the above, with different combinations of $f^{;},g^{;}$, their signs, and assumptions on their sign-continuity or lack thereof.
You’ll find here a simulation code for the creation of the data required to extract those rules. The final results obtained from the simulation are also available, in two separate Google sheets, here.
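The linked simulation isn't reproduced here, but its gist can be sketched: realize a combination of signs and detachments at a point with concrete functions, and record the product's detachment numerically. For instance, the case from the discussion above (all names illustrative):

```python
def sign(a):
    return (a > 0) - (a < 0)

def right_detachment(f, x=0.0, h=1e-8):
    # Numerical proxy for the right detachment of f at x.
    return sign(f(x + h) - f(x))

f = lambda x: x          # f(0) = 0,  f^; = +1
g = lambda x: x - 1.0    # g(0) = -1, g^; = +1 (g shifted downwards)

# The naive guess (fg)^; = f^; * g^; = +1 fails here:
print(right_detachment(lambda x: f(x) * g(x)))   # -1
```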
Recall that the product rule for derivatives dictates the following combination of the original functions' derivatives: $\left(fg\right)'=f'g+fg'$. Evidently, the results we gather regarding the product of detachments follow a similar rule in some of the cases and another intuitive formula in others. Recall that the following product and quotient rules hold for detachable, not necessarily continuous, functions.
We already introduced an analog to Fermat’s stationary points theorem in a previous section. Let us formulate analogs to other basic results.
First direction. Without loss of generality, assume that $f$ is strictly increasing in the interval. We'll show that $f_{-}^{;}=f_{+}^{;}=+1.$ Assume, for contradiction and without loss of generality, that there's an $x$ in the interval for which $f_{+}^{;}\left(x\right)\neq+1.$ By the definition of the one-sided detachment, every right neighborhood of $x$ contains a point $\bar{x}$ with $f\left(\bar{x}\right)\leq f\left(x\right).$ But $\bar{x}>x,$ contradicting the strict monotonicity.
Second direction. Without loss of generality, let us assume that $f^{;}\equiv+1$ in the interval. Then clearly $f_{+}^{;}=+1$ there. It must also hold that $f_{-}^{;}=+1$ in the interval, as otherwise, there would exist a point with $f_{-}^{;}=0,$ and $f$ would be constant in the left neighborhood of that point, hence there would be another point with $f_{+}^{;}=0.$
Let $x_{1},x_{2}\in\left(a,b\right)$ with $x_{1}<x_{2}.$ We would like to show that $f\left(x_{1}\right)<f\left(x_{2}\right).$ By the definition of the one-sided detachment, there exists a left neighborhood of $x_{2}$ such that $f\left(x\right)<f\left(x_{2}\right)$ for each $x$ in that neighborhood. Let $t\neq x_{2}$ be an element of that neighborhood, and let $s=\sup\left\{ x\,|\,x_{1}\leq x\leq t,\,f\left(x\right)\geq f\left(x_{2}\right)\right\}.$ Assume, for contradiction, that $f\left(x_{1}\right)\geq f\left(x_{2}\right).$ Then $s\geq x_{1},$ and the supremum exists. If $f\left(s\right)\geq f\left(x_{2}\right)$ (i.e., the supremum is attained in the defined set), then since $f\left(x\right)<f\left(x_{2}\right)\leq f\left(s\right)$ for any $x>s,$ we get $f_{+}^{;}\left(s\right)=-1,$ contradicting $f_{+}^{;}\equiv+1$ in $\left(a,b\right).$ Hence the supremum is not attained. In particular, $s\neq x_{1}.$ Therefore, by the definition of the supremum, there exists a sequence $x_{n}\rightarrow s$ with $\left\{ x_{n}\right\} _{n=1}^{\infty}\subset\left(x_{1},s\right)$ such that $f\left(x_{n}\right)\geq f\left(x_{2}\right)>f\left(s\right),$ i.e., $f\left(x_{n}\right)>f\left(s\right),$ contradicting our assumption that $f^{;}\left(s\right)=+1$ (which implies that $f_{-}^{;}\left(s\right)\neq-1$). Hence $f\left(x_{1}\right)<f\left(x_{2}\right).$
If those conditions hold and the interval is closed, then assume without loss of generality that the function strictly increases in the interval. Then clearly by the definition of the one-sided detachments,
$f^{;}=f^{;}_{+}\left(a\right)=f^{;}_{-}\left(b\right)=+1.\,\,\,\,\blacksquare$
Note that claims 3, 4, and 5 (the sum and difference, product, and quotient rules, respectively) impose varied conditions. However, in the special case where the functions $f,g$ are both detachable and differentiable infinitely many times, the following corollaries of claim 14 hold independently of these conditions:
For example, consider the functions $f\left(x\right)=x^{2},g\left(x\right)=-x^{4}$ at $x=0$. Rule 3 does not yield an indication regarding $\left(f+g\right)^{;}$ since $f^{;}g^{;}=-1\notin\left\{ 0,1\right\}$. However, the aforementioned statement lets us know that $\left(f+g\right)_{\pm}^{;}\left(x\right)=\left(\pm1\right)^{2+1}sgn\left[2+0\right]=\pm1$.
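For polynomials, this corollary is mechanical to apply: find the first non-vanishing derivative at the point and plug its sign into the formula. A small sketch for the example above (the formula $\left(\pm1\right)^{k+1}sgn\left[f^{\left(k\right)}\right]$ is claim 14's; the helper names are ours):

```python
import math

def detachments_at_zero(coeffs):
    # (left, right) detachment at 0 of a polynomial {degree: coefficient},
    # via the first non-vanishing derivative there: f^(k)(0) = k! * a_k.
    k = min(d for d, c in coeffs.items() if d > 0 and c != 0)
    s = 1 if math.factorial(k) * coeffs[k] > 0 else -1
    return ((-1) ** (k + 1) * s, (+1) ** (k + 1) * s)

# f(x) = x^2, g(x) = -x^4, so (f + g)(x) = x^2 - x^4:
print(detachments_at_zero({2: 1, 4: -1}))   # (-1, 1): a local minimum at 0
```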
Theorem 15. Analog to L'Hôpital's Rule. Let $f,g:\mathbb{R}\rightarrow\mathbb{R}$ be a pair of functions, and let $x$ be a point in their definition domain. Assume that $\underset{t\rightarrow x^{\pm}}{\lim}f^{;}\left(t\right)$ and $\underset{t\rightarrow x^{\pm}}{\lim}g^{;}\left(t\right)$ exist. If $\underset{t\rightarrow x^{\pm}}{\lim}\left|f\left(t\right)\right|=\underset{t\rightarrow x^{\pm}}{\lim}\left|g\left(t\right)\right|\in\left\{ 0,\infty\right\},$ then:
$$\underset{t\rightarrow x^{\pm}}{\lim}sgn\left[f\left(t\right)g\left(t\right)\right]=\underset{t\rightarrow x^{\pm}}{\lim}f^{;}\left(t\right)g^{;}\left(t\right).$$
We prove a more generic claim. Let $\left\{ f_{i}:\mathbb{R}\rightarrow\mathbb{R}\right\} _{1\leq i\leq n}$ be a set of functions, and let $x$ be a point in their definition domain. Assume that $\underset{t\rightarrow x^{\pm}}{\lim}f_{i}^{;}\left(t\right)$ exists for each $i.$ If $\underset{t\rightarrow x^{\pm}}{\lim}\left|f_{i}\left(t\right)\right|=L\in\left\{ 0,\infty\right\},$ then we will show that:$$\underset{t\rightarrow x^{\pm}}{\lim}sgn\underset{i}{\prod}f_{i}\left(t\right)=\left(\pm C\right)^{n}\underset{t\rightarrow x^{\pm}}{\lim}\underset{i}{\prod}f_{i}^{;}\left(t\right),$$
where $C$ equals $+1$ or $-1$ according to whether $\underset{t\rightarrow x^{\pm}}{\lim}\left|f_{i}\left(t\right)\right|$ is $0$ or $\infty$; we refer to these cases below as parts 1 and 2, respectively.
We apply induction on $n.$ For $n=1,$ denote $f=f_{1}$ for simplicity. Without loss of generality we focus on right limits and assume that $\underset{t\rightarrow x^{+}}{\lim}f^{;}\left(t\right)=+1.$ Then $f^{;}=+1$ at each point in a right $\delta$-neighborhood of $x.$ According to lemma 8, $f$ is strictly increasing in $\left(x,x+\delta\right).$ Therefore:$$\inf\left\{ f\left(t\right)\,|\,t\in\left(x,x+\delta\right)\right\} =\underset{t\rightarrow x^{+}}{\lim}f\left(t\right).$$
Proof of part 1. According to our assumption, $\inf\left\{ f\left(t\right)\,|\,t\in\left(x,x+\delta\right)\right\} =0.$ Thus $f\left(t\right)\geq0$ for $t\in\left(x,x+\delta\right).$ Clearly $f$ can't vanish in $\left(x,x+\delta\right),$ because that would contradict the strict monotonicity. Thus $f>0$ there, and $\underset{t\rightarrow x^{+}}{\lim}sgnf\left(t\right)=\underset{t\rightarrow x^{+}}{\lim}f^{;}\left(t\right)=+1.$ If $\underset{t\rightarrow x^{+}}{\lim}f^{;}\left(t\right)=0,$ then $f$ is constant in a right-neighborhood of $x,$ and since $\underset{t\rightarrow x^{+}}{\lim}\left|f\left(t\right)\right|=0,$ $f\equiv0$ there. Thus $\underset{t\rightarrow x^{+}}{\lim}sgnf\left(t\right)=\underset{t\rightarrow x^{+}}{\lim}f_{+}^{;}\left(t\right)=0.$ The signs switch for left limits, hence the $\pm$ coefficient on the right-hand side.
Proof of part 2. Since $f$ is strictly increasing in a right-neighborhood of $x$ and $\underset{t\rightarrow x^{+}}{\lim}\left|f\left(t\right)\right|=\infty,$ the infimum characterization above forces $\underset{t\rightarrow x^{+}}{\lim}f\left(t\right)=-\infty,$ and $\underset{t\rightarrow x^{+}}{\lim}sgnf\left(t\right)=-\underset{t\rightarrow x^{+}}{\lim}f^{;}\left(t\right)=-1.$ The signs switch for left limits, hence the $\mp$ coefficient on the right-hand side.
Assume the theorem holds for $n,$ and we show its correctness for $n+1:$$$\begin{align*}\underset{t\rightarrow x^{\pm}}{\lim}sgn\underset{1\leq i\leq n+1}{\prod}f_{i}\left(t\right) & =\underset{t\rightarrow x^{\pm}}{\lim}sgn\underset{1\leq i\leq n}{\prod}f_{i}\left(t\right)\cdot\underset{t\rightarrow x^{\pm}}{\lim}sgnf_{n+1}\left(t\right)\\ & =\left(\pm C\right)^{n}\underset{t\rightarrow x^{\pm}}{\lim}\underset{1\leq i\leq n}{\prod}f_{i}^{;}\left(t\right)\cdot\underset{t\rightarrow x^{\pm}}{\lim}sgnf_{n+1}\left(t\right)\\ & =\left(\pm C\right)^{n+1}\underset{t\rightarrow x^{\pm}}{\lim}\underset{1\leq i\leq n+1}{\prod}f_{i}^{;}\left(t\right),\end{align*}$$where the second transition follows from the induction hypothesis, and the third follows from the induction base.$\,\,\,\,\blacksquare$
Let’s calculate the detachments directly according to the detachment definition, or according to claim 13. In the following examples, we scrutinize cases where the detachable functions are either continuous but not differentiable, discontinuous, or differentiable. As a side note, we’ll also examine a case where the trend doesn’t exist.
Finally, let's calculate trends based on claim 14 (in case the function is differentiable infinitely many times). These are the explicit calculations that are otherwise obfuscated by Taylor-series-based theorems on critical point classification. For instance, consider the function $f\left(x\right)=-3x^{5}+5x^{3},$ whose critical points are at $0,\pm1$:
$$\begin{align*}f_{\pm}^{;}\left(x\right) & =\left(\pm1\right)^{k+1}sgn\left[f^{\left(k\right)}\left(x\right)\right]\\
f_{\pm}^{;}\left(0\right) & =\left(\pm1\right)^{3+1}sgn\left(30\right)=+1\\
f_{\pm}^{;}\left(-1\right) & =\left(\pm1\right)^{2+1}sgn\left(30\right)=\pm1\\
f_{\pm}^{;}\left(1\right) & =\left(\pm1\right)^{2+1}sgn\left(-30\right)=\mp1,
\end{align*}$$
where the transition on the first row is due to claim 14. We gather that $0, +1$ and $-1$ are inflection, maximum and minimum points, respectively.
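The same recipe is easy to automate for any polynomial critical point, not just at the origin. A sketch (again, the classification formula is claim 14's; the code is ours):

```python
def deriv(coeffs):
    # Differentiate a polynomial represented as {degree: coefficient}.
    return {d - 1: d * c for d, c in coeffs.items() if d > 0}

def eval_poly(coeffs, x):
    return sum(c * x ** d for d, c in coeffs.items())

def detachments(coeffs, x):
    # (left, right) detachment at x via the first non-vanishing
    # derivative there: f_±^; = (±1)^(k+1) * sgn(f^(k)(x)).
    d = coeffs
    for k in range(1, max(coeffs) + 1):
        d = deriv(d)
        v = eval_poly(d, x)
        if v != 0:
            s = 1 if v > 0 else -1
            return ((-1) ** (k + 1) * s, s)
    return (0, 0)

f = {5: -3, 3: 5}              # f(x) = -3x^5 + 5x^3
print(detachments(f, 0.0))     # (1, 1):  rising through the inflection
print(detachments(f, -1.0))    # (-1, 1): local minimum
print(detachments(f, 1.0))     # (1, -1): local maximum
```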
The detachment indirectly serves to generalize a familiar integration algorithm (originating in computer vision) to generic continuous domains.
The Integral Image algorithm calculates sums over rectangles in images efficiently. It was introduced in 2001, in a prominent AI article ([36]). The algorithm states that to calculate the integral of a function over a rectangle in the plane, it's possible to pre-calculate the antiderivative, then in real time sum its values at the rectangle's corners with alternating signs. It can be thought of as an extension of the Fundamental Theorem of Calculus to the plane. The following theorem further generalized it in 2007 ([38]). While in our preliminary discussion we introduced works in which the instantaneous trend of change is applied explicitly (either numerically or analytically), in the following theorem the detachment is leveraged only implicitly.
Theorem 16. (Wang et al.) Let $D\subset\mathbb{R}^{2}$ be a generalized rectangular domain (polyomino), and let $f:\mathbb{R}^{2}\rightarrow\mathbb{R}$ admit an antiderivative $F.$ Then:
$$\underset{{\scriptscriptstyle D}}{\iint}f\overrightarrow{dx}=\underset{{\scriptscriptstyle x\in\nabla D}}{\sum}\alpha_{D}\left(x\right)F\left(x\right),$$
where $\nabla D$ is the set of corners of $D$, and $\alpha_{D}$ takes values in $\left\{ 0,\pm1\right\}$, determined according to the type of corner that $x$ belongs to.
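In the discrete setting of [36], the theorem reduces to the familiar summed-area-table recipe: pre-compute cumulative sums once, then answer any rectangle query from its four corners with alternating signs. A minimal sketch:

```python
def integral_image(img):
    # S[i][j] holds the sum of img over the top-left i-by-j block.
    h, w = len(img), len(img[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        for j in range(w):
            S[i + 1][j + 1] = img[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S

def box_sum(S, top, left, bottom, right):
    # Sum over rows top..bottom-1 and cols left..right-1: four corner
    # lookups, signed alternately (inclusion-exclusion in the plane).
    return S[bottom][right] - S[top][right] - S[bottom][left] + S[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
S = integral_image(img)
print(box_sum(S, 1, 1, 3, 3))   # 5 + 6 + 8 + 9 = 28
```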
The derivative sign doesn't reflect trends at cusp points such as corners; therefore it doesn't classify corner types concisely. In contrast, it turns out that the detachment operator classifies corners (and thus defines the parameter $\alpha_{D}$ coherently) independently of the curve's parameterization. That's a multidimensional generalization of our discussion above regarding the relation between the two operators. Feel free to interact with the following widget to gain intuition regarding theorem 16. The textboxes toggle the antiderivative at the annotated point, which is reflected in the highlighted domain. Watch the inclusion-exclusion principle in action as the domains finally aggregate to the domain bounded by the vertices. While this theorem is defined over continuous domains, it is limited to generalized rectangular ones. Let's attempt to alleviate this limitation.
We leverage the detachment to define a novel integration method that extends theorem 16 to non-rectangular domains by combining it with the mean value theorem, as follows. Given a simple closed detachable curve $C$, let $D$ be the region bounded by the curve. Let $R\subseteq D$ be a rectangular domain inscribed in $C$. Let $\left\{ C_{i}\right\}$ be the sub-curves of $C$ between each pair of its consecutive touching points with $R.$ Let $D_{i}\subset D\backslash R$ be the sub-domain bounded between $C_{i}$ and $R$. Let $\partial D_{i}$ be the boundary of $D_{i}$, and let $\nabla D_{i}$ be the intersection between $D_{i}$ and the vertices of $R$. According to the mean value theorem in two dimensions, $\forall i:\exists x_{i}\in D_{i}$ such that $\underset{\scriptscriptstyle D_{i}}{\iint}f\overrightarrow{dx}=\beta_{i}f\left(x_{i}\right),$ where $\beta_{i}$ is the area of $D_{i}.$
Our semi-discrete integration method accumulates a linear combination of the function and its antiderivative along the sub-domains $D_{i}$, as follows:
Theorem 17. Let $D\subset\mathbb{R}^{2}$ be a closed, bounded, and connected domain whose boundary is detachable. Let $f:\mathbb{R}^{2}\rightarrow\mathbb{R}$ be a continuous function that admits an antiderivative $F$ there. Then: $$\underset{\scriptscriptstyle D}{\iint}f\overrightarrow{dx}=\underset{i}{\sum}\left[\overrightarrow{\alpha_{i}}\cdot F\left(\overrightarrow{x_{i}}\right)+\beta_{i}f\left(x_{i}\right)\right],$$where $F\left(\overrightarrow{x_{i}}\right)=\left(F\left(x\right)\,|\,x\in\nabla D_{i}\right)^{T}$ is the vector of antiderivative values at the vertices of the subdomain $D_{i}$, $\overrightarrow{\alpha_{i}}=\left(\alpha\left(\partial D_{i}^{;},x\right)\,|\,x\in\nabla D_{i}\right)^{T}$ is a vector containing the results of applying a function $\alpha$ to the detachments of the curve $\partial D_{i}^{;}$ at its vertices $\nabla D_{i}$, and $\beta_{i}$ is the area of $D_{i}$, which we incorporate via the mean value theorem for integrals along with its matching point in $D_{i}$, denoted by $x_{i}$. The function $\alpha$ is constructed within the integration method (see [31]). This formula holds for any detachable division of the curve. We don't assume the differentiability of the curve's parameterization, only the continuity of the function in $\underset{i}{\bigcup}D_{i}$, for the mean value theorem to hold.
Feel free to gain intuition from the following widget. Initially, the domain is rectangular: $D=R$. Tuning the vertices' locations exposes the yellow sub-domains $D_i$, whose areas are $\beta_i$. The integral over $R$ is calculated by aggregating the antiderivative's values along the vertices of each $D_i$ (the highlighted points in this diagram). The antiderivative values are added to the aggregation with coefficients determined by the curve's detachments at each vertex (the function $\alpha$).
Modern Physics is abundant with phenomena described by everywhere-continuous and nowhere-differentiable functions. A prominent example is fluctuations: random, invisible movements of objects around their seemingly steady state. These are studied in Thermodynamics and Quantum Mechanics, to name a few fields. Another prominent example is Brownian motion, whose statistical model, the Wiener process, resembles fluctuations.
Researchers have suggested several possible characterizations of this process, and we'll specify one of them. Its value $W\left(t\right)$ adheres to the following:
This characterization of the Wiener process results in the following qualitative properties, as stated here:
Note how we can reformulate the above conditions in terms of the detachment operator:
This description isn't just more concise and elegant than the former. It also asserts the existence of a property (detachability) on a dense subset (the local optima), rather than the absence of differentiability everywhere.
On top of its theoretical rigor, the numerical detachment is also about 30% more efficient than the numerical derivative in the task of detecting fluctuations.
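As a toy illustration of that task, detecting a trend flip in a sampled path needs only the signs of consecutive increments; no difference quotient is ever formed. A sketch with a seeded random walk standing in for sampled Brownian motion:

```python
import random

def sign(a):
    return (a > 0) - (a < 0)

def count_trend_flips(path):
    # Direction changes along a sampled path, detected from the signs
    # of consecutive increments alone (no division by the step size).
    trends = [sign(b - a) for a, b in zip(path, path[1:])]
    return sum(1 for t, u in zip(trends, trends[1:]) if t and u and t != u)

random.seed(0)
walk = [0.0]
for _ in range(1000):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))

print(count_trend_flips(walk))   # roughly half of the increment pairs flip
```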
This was a short Semi-discrete Calculus tour where we presented the main definitions and results. Hope you’ve learned and enjoyed it. Good luck with your Calculus adventures and stay safe!
[1] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709-1720, 2017.
[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981-3989, 2016.
[3] Sven Behnke. Hierarchical neural networks for image interpretation, volume 2766. Springer, 2003.
[4] Jeffrey Byrne. Nested motion descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 502-510, 2015.
[5] Zhixiang Chen, Xin Yuan, Jiwen Lu, Qi Tian, and Jie Zhou. Deep hashing via discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6838-6847, 2018.
[6] Kai Fan. Unifying the stochastic spectral descent for restricted boltzmann machines with bernoulli or gaussian inputs. arXiv preprint arXiv:1703.09766, 2017.
[7] Mathias Gallardo, Daniel Pizarro, Adrien Bartoli, and Toby Collins. Shape-from-template in flatland. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2847-2854, 2015.
[8] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180-1189, 2015.
[9] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681-3688, 2019.
[10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[11] Zhe Hu, Sunghyun Cho, Jue Wang, and Ming-Hsuan Yang. Deblurring low-light images with light streaks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3382-3389, 2014.
[12] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869-6898, 2017.
[13] Xiangyu Kong, Bo Xin, Yizhou Wang, and Gang Hua. Collaborative deep reinforcement learning for joint object search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1695-1704, 2017.
[14] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[15] Jean Lafond, Hoi-To Wai, and Eric Moulines. On the online Frank-Wolfe algorithms for convex and non-convex optimizations. arXiv preprint arXiv:1510.01171, 2015.
[16] Debang Li, Huikai Wu, Junge Zhang, and Kaiqi Huang. A2-rl: Aesthetics aware reinforcement learning for image cropping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8193-8201, 2018.
[17] Wei Li and Xiaogang Wang. Locally aligned feature transforms across views. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3594-3601, 2013.
[18] Li Liu, Ling Shao, Fumin Shen, and Mengyang Yu. Discretely coding semantic rank orders for supervised image hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1425-1434, 2017.
[19] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574-2582, 2016.
[20] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9078-9086, 2019.
[21] Ali Mousavi, Arian Maleki, Richard G Baraniuk, et al. Consistent parameter estimation for lasso and approximate message passing. The Annals of Statistics, 46(1):119-148, 2018.
[22] Vidhya Navalpakkam and Laurent Itti. Optimal cue selection strategy. In Advances in neural information processing systems, pages 987-994, 2006.
[23] Jinshan Pan, Zhouchen Lin, Zhixun Su, and Ming-Hsuan Yang. Robust kernel estimation with outliers handling for image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800-2808, 2016.
[24] Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016-2016 IEEE Military Communications Conference, pages 49-54. IEEE, 2016.
[25] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360-3368, 2016.
[26] Aaditya Ramdas and Aarti Singh. Algorithmic connections between active learning and stochastic convex optimization. In International Conference on Algorithmic Learning Theory, pages 339-353. Springer, 2013.
[27] Miriam Redi, Neil O’Hare, Rossano Schifanella, Michele Trevisiol, and Alejandro Jaimes. 6 seconds of sound and vision: Creativity in micro-videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4272-4279, 2014.
[28] Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 645-652, 2010.
[29] Swami Sankaranarayanan, Arpit Jain, Rama Chellappa, and Ser Nam Lim. Regularizing deep networks using efficient layerwise adversarial training. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[30] Hee Seok Lee and Kuoung Mu Lee. Simultaneous super-resolution of depth and images using a single camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 281-288, 2013.
[31] Amir Shachar. Applying semi-discrete operators to calculus. arXiv preprint arXiv:1012.5751, 2010.
[32] Eero Siivola, Aki Vehtari, Jarno Vanhatalo, Javier González, and Michael Riis Andersen. Correcting boundary over-exploration deficiencies in bayesian optimization with virtual derivative sign observations. In 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1-6. IEEE, 2018.
[33] Michel Silva, Washington Ramos, Joao Ferreira, Felipe Chamone, Mario Campos, and Erickson R Nascimento. A weighted sparse sampling and smoothing frame transition approach for semantic fast-forward first-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2383-2392, 2018.
[34] Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou. Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5323-5332, 2018.
[35] Roberto Tron and Kostas Daniilidis. On the quotient representation for the essential manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1574-1581, 2014.
[36] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, volume 1, pages II. IEEE, 2001.
[37] Haohan Wang and Bhiksha Raj. On the origin of deep learning. arXiv preprint arXiv:1702.07800, 2017.
[38] Xiaogang Wang, Gianfranco Doretto, Thomas Sebastian, Jens Rittscher, and Peter Tu. Shape and appearance context modeling. In 2007 ieee 11th international conference on computer vision, pages 18. Ieee, 2007.
[39] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509-1519, 2017.
[40] Shaodan Zhai, Tian Xia, Ming Tan, and Shaojun Wang. Direct 0-1 loss minimization and margin maximization with boosting. In Advances in Neural Information Processing Systems, pages 872-880, 2013.
[41] Yu Zhu, Yanning Zhang, Boyan Bonev, and Alan L Yuille. Modeling deformable gradient compositions for single-image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5417-5425, 2015.
[42] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
[43] Bernhard Riemann. Uber die Darstellbarkeit einer Function durch eine trigonometrische Reihe, Gott. Abh. (1854/1868); also in: Gesammelte Mathematische Werke, Springer-Verlag, Berlin 1990, 259–296.