- Optimization
- Root finding
- Gradient descent
- Newton's method
- Fisher scoring
- Coordinate descent
- Optimization in R: optimize(), optim(), and nls()
Given an objective function \(f\), find \[ \theta^* = \arg\min_{\theta} f(\theta) \]
Produce sequence of points \(\theta^{(k)}, k=0,1,2,\dots\) with \(\theta^{(k)} \to \theta^*\).
Can be interpreted as iterative methods for solving the optimality condition \[ \nabla f(\theta^*)=0 \]
Example: bisection search for a root of \(f(y) = a^2 + y^2 + 2ay/(n-1) - (n-2)\) on the interval \([0, 5n]\):

f <- function(y, a, n){
  a^2 + y^2 + 2*a*y/(n-1) - (n-2)
}
a <- 0.5
n <- 20
b0 <- 0
b1 <- 5*n
iter <- 0
eps <- .Machine$double.eps^0.25
# endpoints, midpoint, and the function values there
y  <- seq(b0, b1, length.out = 3)
fv <- c(f(y[1], a, n), f(y[2], a, n), f(y[3], a, n))
if(fv[1]*fv[3] > 0) stop("f does not have opposite sign at endpoints.\n")

while(iter < 1000 & abs(fv[2]) > eps){
  iter <- iter + 1
  # keep the half-interval over which f changes sign
  if(fv[1]*fv[2] < 0){
    y[3]  <- y[2]
    fv[3] <- fv[2]
  } else{
    y[1]  <- y[2]
    fv[1] <- fv[2]
  }
  # new midpoint
  y[2]  <- (y[1]+y[3])/2
  fv[2] <- f(y[2], a, n)
  print(c(y[1], fv[1], fv[3]-fv[2]))
}
## [1] 0.000 -17.750 1876.316
## [1] 0.0000 -17.7500 469.4079
## [1] 0.0000 -17.7500 117.5164
## [1] 0.00000 -17.75000 29.46135
## [1] 3.125000 -7.819901 17.172081
## [1] 3.125000 -7.819901 6.754986
## [1] 3.906250 -2.285619 3.530081
## [1] 3.906250 -2.285619 1.650599
## [1] 4.1015625 -0.7113133 0.8348365
## [1] 4.1015625 -0.7113133 0.4102657
## [1] 4.1503906 -0.3058160 0.2057289
## [1] 4.1748047 -0.1012793 0.1030135
## [1] 4.17480469 -0.10127926 0.05139497
## [1] 4.1809082 -0.0499588 0.0257068
## [1] 4.18395996 -0.02427063 0.01285573
## [1] 4.185485840 -0.011419556 0.006428445
## [1] 4.186248779 -0.004992275 0.003214368
## [1] 4.186630249 -0.001778197 0.001607221
## [1] 4.1868209839 -0.0001710497 0.0008036194
## [1] 4.1868209839 -0.0001710497 0.0004018029
## [1] 4.1868209839 -0.0001710497 0.0002008997
y[2]
## [1] 4.186845
fv[2]
## [1] 2.984885e-05
iter
## [1] 21
R's uniroot() is based on Brent's method, a hybrid that combines the reliability of a bracketing method with the speed of open methods such as inverse quadratic interpolation:
\[ x = \frac{[y-f(a)][y-f(b)]c}{[f(c)-f(a)][f(c)-f(b)]} + \frac{[y-f(b)][y-f(c)]a}{[f(a)-f(b)][f(a)-f(c)]} + \frac{[y-f(c)][y-f(a)]b}{[f(b)-f(c)][f(b)-f(a)]} \]
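As an illustration (not the exact code inside uniroot()), a single inverse quadratic interpolation step can be written directly from the formula above, evaluated at \(y = 0\); the helper name iqi_step and the bracket (3, 4, 5) used in the call are illustrative choices, applied to the same f as before.

# Fit x as a quadratic function of y = f(x) through three points and
# evaluate that quadratic at y = 0 to propose the next root estimate.
iqi_step <- function(f, x1, x2, x3, ...){
  f1 <- f(x1, ...); f2 <- f(x2, ...); f3 <- f(x3, ...)
  y <- 0
  (y - f1)*(y - f2)*x3 / ((f3 - f1)*(f3 - f2)) +
    (y - f2)*(y - f3)*x1 / ((f1 - f2)*(f1 - f3)) +
    (y - f3)*(y - f1)*x2 / ((f2 - f3)*(f2 - f1))
}
# One step from the bracket (3, 4, 5) already lands near the root found by
# bisection above; Brent's method falls back on bisection when such a
# proposal is poor.
iqi_step(f, 3, 4, 5, a = 0.5, n = 20)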
out <- uniroot(f, lower = 0, upper = 5*n, a = a, n = n)
unlist(out)
## root f.root iter init.it estim.prec
## 4.186870e+00 2.381408e-04 1.400000e+01 NA 6.103516e-05
out <- uniroot(f, interval = c(-5*n, 0), a = a, n = n)
unlist(out)
## root f.root iter init.it estim.prec
## -4.239501e+00 2.382050e-04 1.400000e+01 NA 6.103516e-05
\[ \theta^{(k+1)} = \theta^{(k)} - \eta^{(k)} \nabla f(\theta^{(k)})\]
\[ \min_\beta \|y-X\beta\|^2 \]
For \(y_i = 2 x_i\), minimize \(f(b) = \sum_i (y_i - b\,x_i)^2\); the gradient is \(2\left(\sum_i x_i^2\, b - \sum_i x_i y_i\right)\), and here the step is scaled by the inverse of the exact curvature \(h = 2\sum_i x_i^2\):

x <- rnorm(100)
y <- 2*x
iter <- 0
b <- 0
while(abs(sum(x^2)*b - sum(x*y)) > 1e-3 & iter < 1000){
  iter <- iter + 1
  h <- 2*sum(x^2)                            # curvature of f(b)
  b <- b - 2*h^-1*(sum(x^2)*b - sum(x*y))    # step = gradient / curvature
  cat(iter, b, "\n")
}
## 1 2
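For contrast, a sketch of plain gradient descent on the same problem with a fixed step size, reusing x and y from the chunk above; the value of eta below is an illustrative choice, not from the notes, and with a constant step it needs many iterations instead of one.

# Plain gradient descent: b <- b - eta * gradient, with
# gradient of f(b) = sum((y - x*b)^2) equal to 2*(sum(x^2)*b - sum(x*y)).
eta  <- 0.001   # illustrative fixed step size
b    <- 0
iter <- 0
while(abs(sum(x^2)*b - sum(x*y)) > 1e-3 & iter < 1000){
  iter <- iter + 1
  b    <- b - eta * 2*(sum(x^2)*b - sum(x*y))
}
c(iter = iter, b = b)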
Taylor-expand for the value at the minimum \(\theta^*\) \[ f(\theta^*) \approx f(\theta) + (\theta^*-\theta)^\top \nabla f(\theta) + \frac{1}{2}(\theta^*-\theta)^\top \mathbf{H}(\theta) (\theta^*-\theta) \]
Take gradient, set to zero, solve for \(\theta^*\): \[ 0 = \nabla f(\theta) + \mathbf{H}(\theta) (\theta^*-\theta) \\ \theta^* = \theta - {\left(\mathbf{H}(\theta)\right)}^{-1} \nabla f(\theta) \] Works exactly if \(f\) is quadratic and \(\mathbf{H}^{-1}\) exists, etc.
If \(f\) isn't quadratic, keep pretending it is until we get close to \(\theta^*\), when it will be nearly true
\[ \theta^{(k+1)} = \theta^{(k)} - \left(\mathbf{H}(\theta^{(k)})\right)^{-1} \nabla f(\theta^{(k)})\]
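A minimal sketch of this iteration using numerical derivatives from the numDeriv package; the helper newton(), the test function, the starting point, and the tolerance are all illustrative choices, not from the notes.

library(numDeriv)   # for grad() and hessian()

# Newton's method: repeatedly solve H(theta) step = grad f(theta)
# and move to theta - step.
newton <- function(f, theta, tol = 1e-6, maxit = 100){
  for(k in 1:maxit){
    g <- grad(f, theta)
    if(sum(abs(g)) < tol) break
    H <- hessian(f, theta)
    theta <- theta - solve(H, g)
  }
  list(theta = theta, iter = k)
}

# Illustrative non-quadratic objective with minimum at (0, 2)
f_test <- function(theta) exp(theta[1]) - theta[1] + (theta[2] - 2)^2
newton(f_test, c(2, 0))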
Want to use the Hessian to improve convergence, but don't want to have to keep computing the Hessian at each step
Approaches:
Fisher scoring: for maximum-likelihood problems, replace the observed Hessian of the negative log-likelihood by its expectation \[ I(\theta) = E\{\mathbf{H}[-\log l(\theta)]\} = - E\left\{\frac{\partial^2 \log l(\theta)}{\partial \theta \partial\theta^\top}\right\}, \] the Fisher information matrix, and iterate \[ \theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1}\, \nabla \log l(\theta^{(k)}). \]
Under mild regularity conditions the same matrix equals the variance of the score: \[ I(\theta) = E\left\{\left(\frac{\partial \log l(\theta)}{\partial\theta}\right)\left(\frac{\partial \log l(\theta)}{\partial\theta}\right)^{\!\top}\right\} = \mathrm{Var}\left(\frac{\partial \log l(\theta)}{\partial\theta}\right) \]
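A quick Monte Carlo check of this identity, assuming (for illustration) a single observation from an Exponential(\(\lambda\)) model, where the score is \(1/\lambda - x\) and the second derivative of the log-likelihood is the constant \(-1/\lambda^2\):

# log l(lambda) = log(lambda) - lambda*x
#   score          = 1/lambda - x
#   2nd derivative = -1/lambda^2   (does not depend on x)
set.seed(42)
lambda <- 2
x <- rexp(1e5, rate = lambda)
c(var_score = var(1/lambda - x),   # Monte Carlo estimate of Var(score)
  info      = 1/lambda^2)          # -E[2nd derivative], exactly 1/lambda^2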
Exponential family: a general framework for parameterizing distributions \[ f(x|\theta) = g(x) e^{\beta(\theta) + h(x)^\top v(\theta)} \]
Fisher information for general exponential family \[ I(\theta) = \nabla v(\theta)^\top \Sigma(\theta)\nabla v(\theta), \quad \Sigma(\theta) = Var(h(x)) \]
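As a concrete sketch (not from the notes), Fisher scoring for logistic regression on simulated data; the sample size, true coefficients, and iteration cap are illustrative. With the canonical logit link the expected and observed information coincide, so this is also Newton's method.

# Fisher scoring for logistic regression:
#   score  = X' (y - p)
#   I(beta) = X' W X, with W = diag(p*(1-p))
#   beta <- beta + I(beta)^{-1} score
set.seed(7)
n <- 500
X <- cbind(1, rnorm(n))
beta_true <- c(-1, 2)
p_true <- 1/(1 + exp(-X %*% beta_true))
y <- rbinom(n, 1, p_true)

beta <- c(0, 0)
for(k in 1:25){
  p <- as.vector(1/(1 + exp(-X %*% beta)))
  score  <- t(X) %*% (y - p)
  I_beta <- t(X) %*% (X * (p*(1-p)))   # expected information X' W X
  step   <- solve(I_beta, score)
  beta   <- beta + as.vector(step)
  if(sum(abs(step)) < 1e-8) break
}
beta   # should be close to beta_true = c(-1, 2)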
n <- 56
x <- c(20, 9, 1, 26)
p <- rep(1, 4)/4
p0 <- rep(0, 4)
iter <- 0
while(sum(abs(p-p0)) > 1e-3 & iter < 10){
  p0 <- p
  iter <- iter + 1
  p <- p0 + x/n
  p <- p/sum(p)
  cat(iter, p, "\n")
}
## 1 0.3035714 0.2053571 0.1339286 0.3571429
## 2 0.3303571 0.1830357 0.07589286 0.4107143
## 3 0.34375 0.171875 0.046875 0.4375
## 4 0.3504464 0.1662946 0.03236607 0.4508929
## 5 0.3537946 0.1635045 0.02511161 0.4575893
## 6 0.3554688 0.1621094 0.02148438 0.4609375
## 7 0.3563058 0.1614118 0.01967076 0.4626116
## 8 0.3567243 0.1610631 0.01876395 0.4634487
## 9 0.3569336 0.1608887 0.01831055 0.4638672
## 10 0.3570382 0.1608015 0.01808384 0.4640765
Gradient descent, Newton's method, etc., adjust all coordinates of \(\theta\) at once — gets harder as the number of dimensions \(p\) grows
Coordinate descent: never do more than 1D optimization
## Coordinate Descent
Pros:
- Each update is a simple one-dimensional problem, sometimes solvable in closed form (see the sketch below)
- No need to handle the full gradient or Hessian of a high-dimensional objective at once

Cons:
- May need many sweeps through the coordinates when they are strongly coupled
- Can stall at non-optimal points for some non-smooth objectives
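A minimal sketch of coordinate descent for a least-squares problem on simulated data; each update is the exact one-dimensional minimizer with the other coordinates held fixed, and the design, true coefficients, and tolerance below are illustrative.

# Coordinate descent for min_beta ||y - X beta||^2: cycle through the
# coordinates, minimizing exactly in one coordinate at a time.
set.seed(3)
n <- 100; p <- 5
X <- matrix(rnorm(n*p), n, p)
beta_true <- c(3, -2, 0, 1, 0.5)
y <- X %*% beta_true + rnorm(n)

beta <- rep(0, p)
for(sweep in 1:50){
  beta_old <- beta
  for(j in 1:p){
    r <- y - X[, -j] %*% beta[-j]                 # partial residual without coordinate j
    beta[j] <- sum(X[, j] * r) / sum(X[, j]^2)    # exact 1D least-squares update
  }
  if(sum(abs(beta - beta_old)) < 1e-8) break
}
cbind(beta, coef(lm(y ~ X - 1)))   # compare with the full least-squares fit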
optimize()
Maximize the function \[
f(x) = \frac{\log(1+\log(x))}{\log(1+x)}
\]
f <- function(x) log(1+log(x))/log(1+x)
optimize(f, lower = 4, upper = 8, maximum = TRUE)
## $maximum
## [1] 4.00006
##
## $objective
## [1] 0.5404008
optim()
optim(par, fn, gr, method, control, hessian)
- fn: function to be minimized; mandatory
- par: initial parameter guess; mandatory
- gr: gradient function; only needed for some methods
- method: optimization algorithm, could be BFGS (Newton-ish)
- control: optional list of control settings
- hessian: should the final Hessian be returned? Default FALSE

Return contains the location ($par) and the value ($value) of the optimum, diagnostics, and possibly $hessian.
gmp <- read.table("data/gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.15)  # initialization
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)
fit1: Newton-ish BFGS method
fit1[1:3]
## $par
## [1] 6493.2563738 0.1276921
##
## $value
## [1] 61853983
##
## $counts
## function gradient
##       63       11
fit1: Newton-ish BFGS method
fit1[4:6]
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
##              [,1]         [,2]
## [1,] 5.25021e+01      4422070
## [2,] 4.42207e+06 375729087977
Nonlinear least squares estimate
nls(formula, data, start, control, [[many other options]])
- formula: Mathematical expression with response variable, predictor variable(s), and unknown parameter(s)
- data: Data frame with variable names matching the formula
- start: Guess at parameters (optional)
- control: Like with optim (optional)
- algorithm: Optimization algorithm; the default is a Gauss-Newton algorithm (a version of Newton's method)

Returns an nls object, with fitted values, prediction methods, etc.
fit2: Fitting the Same Model with nls()
fit2 <- nls(pcgmp ~ y0*pop^a, data=gmp, start=list(y0=5000, a=0.1))
summary(fit2)
##
## Formula: pcgmp ~ y0 * pop^a
##
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)
## y0 6.494e+03  8.565e+02   7.582 2.87e-13 ***
## a  1.277e-01  1.012e-02  12.612  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7886 on 364 degrees of freedom
##
## Number of iterations to convergence: 5
## Achieved convergence tolerance: 1.789e-07
fit2: Fitting the Same Model with nls()
plot(pcgmp ~ pop, data=gmp)
pop.order <- order(gmp$pop)
lines(gmp$pop[pop.order], fitted(fit2)[pop.order])
curve(fit1$par[1]*x^fit1$par[2], add=TRUE, lty="dashed", col="blue")
MLE example: the log-likelihood of an i.i.d. Gamma(shape \(r\), rate \(\lambda\)) sample is \[ \log L(\theta|x) = nr\log(\lambda) + (r-1) \sum_{i=1}^n \log x_i - \lambda \sum_{i=1}^n x_i - n\log \Gamma(r) \]
logL <- function(theta, sx, slogx, n){
  r <- theta[1]
  lambda <- theta[2]
  loglik <- n*r*log(lambda) + (r-1)*slogx - lambda*sx - n*log(gamma(r))
  -loglik   # optim() minimizes, so return the negative log-likelihood
}
n <- 200
r <- 5
lambda <- 2
x <- rgamma(n, shape = r, rate = lambda)
optim(c(1,1), logL, sx = sum(x), slogx = sum(log(x)), n = n)
## $par
## [1] 4.809551 1.983855
##
## $value
## [1] 289.2557
##
## $counts
## function gradient
##       71       NA
##
## $convergence
## [1] 0
##
## $message
## NULL
mlests <- replicate(100, expr = {
  x <- rgamma(n, shape = r, rate = lambda)
  optim(c(1,1), logL, sx = sum(x), slogx = sum(log(x)), n = n)$par
})
rowMeans(mlests)
## [1] 5.049364 2.009791