
# 6.867 Machine learning, lecture 6 (Jaakkola)

**Lecture topics:**

- Active learning
- Non-linear predictions, kernels

## Active learning

We can use the expressions for the mean squared error to actively select the input points $x_1, \ldots, x_n$, when possible, so as to reduce the resulting estimation error. This is an active learning (experiment design) problem. By letting the method guide the selection of the training examples (inputs), we will generally need far fewer examples than if we selected them at random from some underlying distribution or database, or tried available experiments at random.

To develop this further, recall that we continue to assume that the responses $y$ come from a linear model

$$ y = \theta^{*T} x + \theta_0^* + \epsilon, \quad \epsilon \sim N(0, \sigma^{*2}) $$

Nothing is assumed about the distribution of $x$, as the choice of the inputs is in our control. For any given set of inputs $x_1, \ldots, x_n$, we derived last time an expression for the mean squared error of the maximum likelihood parameter estimates $\hat{\theta}$ and $\hat{\theta}_0$:

$$ E\left[\, \left\| \begin{bmatrix} \hat{\theta} \\ \hat{\theta}_0 \end{bmatrix} - \begin{bmatrix} \theta^* \\ \theta_0^* \end{bmatrix} \right\|^2 \,\middle|\, \mathbf{X} \right] = \sigma^{*2}\, \mathrm{Tr}\!\left[ (\mathbf{X}^T \mathbf{X})^{-1} \right] \tag{1} $$

where the expectation is relative to the responses generated from the underlying linear model, i.e., over different training sets generated from that model. We do not know the noise variance $\sigma^{*2}$ of the correct model, but it appears only as a multiplicative constant in the above expression and therefore won't affect how we should choose the inputs. When the choice of inputs is indeed up to us (e.g., which experiments to carry out), we can select them so as to minimize $\mathrm{Tr}\!\left[ (\mathbf{X}^T \mathbf{X})^{-1} \right]$. One caveat of this approach is that it relies on the underlying relationship between the inputs and the responses being linear; when this is no longer the case, we may end up with clearly suboptimal selections.

Given the selection criterion, how should we find, say, $n$ input examples $x_1, \ldots, x_n$ that minimize it?
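As a sanity check on equation (1), the following numpy sketch estimates the left-hand side by Monte Carlo and compares it to $\sigma^{*2}\,\mathrm{Tr}[(\mathbf{X}^T\mathbf{X})^{-1}]$. The dimensions, true parameters, and noise level are illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 2-d inputs, assumed "true" parameters and noise level.
n, d = 20, 2
theta_star = np.array([1.5, -0.7])
theta0_star = 0.3
sigma_star = 0.5

x = rng.uniform(-1.0, 1.0, size=(n, d))
X = np.hstack([x, np.ones((n, 1))])          # each row is [x^T, 1]

# Right-hand side of (1): sigma*^2 Tr[(X^T X)^{-1}]
criterion = sigma_star**2 * np.trace(np.linalg.inv(X.T @ X))

# Left-hand side of (1): average squared parameter error of the ML
# (least-squares) estimate over many training sets drawn from the model.
true_params = np.append(theta_star, theta0_star)
errs = []
for _ in range(10000):
    y = X @ true_params + sigma_star * rng.standard_normal(n)
    est = np.linalg.lstsq(X, y, rcond=None)[0]
    errs.append(np.sum((est - true_params) ** 2))

print(criterion, np.mean(errs))  # the two values should agree closely
```

With enough simulated training sets the empirical average settles onto the trace expression, which is what licenses using the trace alone as the design criterion.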
One simple approach is to select them one after the other, merely optimizing the selection of the next input in light of what we already have. Let's assume then that we already have $\mathbf{X}$, and thus have $\mathbf{A} = (\mathbf{X}^T \mathbf{X})^{-1}$ (assuming $\mathbf{X}^T \mathbf{X}$ is already invertible). We are trying to select another input example $x$ that adds a row $[x^T, 1]$ to $\mathbf{X}$. In an applied context we are typically constrained in what $x$ can be (e.g., due to the experimental setup); we will discuss simple constraints below. Let's now evaluate the effect of adding a new (valid) row:

$$ \begin{bmatrix} \mathbf{X} \\ \mathbf{v}^T \end{bmatrix}^T \begin{bmatrix} \mathbf{X} \\ \mathbf{v}^T \end{bmatrix} = \mathbf{X}^T \mathbf{X} + \mathbf{v}\mathbf{v}^T = \mathbf{A}^{-1} + \mathbf{v}\mathbf{v}^T \tag{2} $$

where $\mathbf{v} = [x^T, 1]^T$. We would like to find a valid $\mathbf{v}$ that minimizes

$$ \mathrm{Tr}\!\left[ (\mathbf{A}^{-1} + \mathbf{v}\mathbf{v}^T)^{-1} \right] \tag{3} $$

The matrix inverse can actually be carried out in closed form (easy enough to check):

$$ (\mathbf{A}^{-1} + \mathbf{v}\mathbf{v}^T)^{-1} = \mathbf{A} - \frac{1}{1 + \mathbf{v}^T \mathbf{A} \mathbf{v}}\, \mathbf{A}\mathbf{v}\mathbf{v}^T \mathbf{A} \tag{4} $$

so that the trace becomes

$$ \mathrm{Tr}\!\left[ (\mathbf{A}^{-1} + \mathbf{v}\mathbf{v}^T)^{-1} \right] = \mathrm{Tr}[\mathbf{A}] - \frac{1}{1 + \mathbf{v}^T \mathbf{A} \mathbf{v}}\, \mathrm{Tr}\!\left[ \mathbf{A}\mathbf{v}\mathbf{v}^T \mathbf{A} \right] \tag{5} $$

$$ = \mathrm{Tr}[\mathbf{A}] - \frac{1}{1 + \mathbf{v}^T \mathbf{A} \mathbf{v}}\, \mathrm{Tr}\!\left[ \mathbf{v}^T \mathbf{A} \mathbf{A} \mathbf{v} \right] \tag{6} $$

$$ = \mathrm{Tr}[\mathbf{A}] - \frac{\mathbf{v}^T \mathbf{A} \mathbf{A} \mathbf{v}}{1 + \mathbf{v}^T \mathbf{A} \mathbf{v}} \tag{7} $$

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
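The greedy rule implicit in equations (3)–(7) can be sketched in numpy: among a pool of candidate inputs, pick the $\mathbf{v}$ maximizing the trace reduction $\mathbf{v}^T \mathbf{A}\mathbf{A}\mathbf{v} / (1 + \mathbf{v}^T \mathbf{A}\mathbf{v})$, then update $\mathbf{A}$ in closed form via equation (4). The candidate pool, initial design, and number of greedy steps below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical candidate pool: the experimental setup only allows these x's.
candidates = rng.uniform(-1.0, 1.0, size=(100, 2))
V = np.hstack([candidates, np.ones((100, 1))])   # rows v^T = [x^T, 1]

# Start from a small initial design so that X^T X is invertible.
X = np.vstack([np.eye(3), -np.eye(3)])           # X^T X = 2 I, so Tr[A] = 1.5
A = np.linalg.inv(X.T @ X)

def trace_reduction(A, v):
    # Reduction of Tr[(X^T X)^{-1}] from appending row v^T, per eq. (7):
    # v^T A A v / (1 + v^T A v)
    Av = A @ v
    return (Av @ Av) / (1.0 + v @ Av)

for _ in range(5):                               # greedily add 5 inputs
    v = V[int(np.argmax([trace_reduction(A, v) for v in V]))]
    Av = A @ v
    A = A - np.outer(Av, Av) / (1.0 + v @ Av)    # rank-one update, eq. (4)

print(np.trace(A))  # criterion after the greedy additions; strictly below 1.5
```

Each rank-one update costs only $O(d^2)$ rather than a fresh matrix inversion, which is the practical payoff of the closed-form inverse in (4); note the greedy rule may select the same candidate more than once, which simply corresponds to repeating an experiment.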
