(videos)"International Conference on Learning Representations (ICLR) 2016, San Juan | VideoLectures"
Deep Structured Energy Based Models for Anomaly Detection
In this paper, we attack the anomaly detection problem by directly modeling the data distribution with deep architectures. We propose deep structured energy based models (DSEBMs), where the energy function is the output of a deterministic deep neural network with structure. We develop novel model architectures to integrate EBMs with different types of data, such as static, sequential, and spatial data, and apply appropriate model architectures to adapt to the data structure. Our training algorithm builds on the recent development of score matching \cite{sm}, which connects an EBM with a regularized autoencoder and eliminates the need for complicated sampling methods. Statistically sound decision criteria for anomaly detection can be derived from the perspective of the energy landscape of the data distribution. We investigate two such criteria: the energy score and the reconstruction error. Extensive empirical studies on benchmark tasks demonstrate that our proposed model consistently matches or outperforms all the competing methods.
Simultaneous Sparse Dictionary Learning and Pruning
How priors of initial hyperparameters affect Gaussian process regression models
Bayesian Variable Selection and Estimation Based on Global-Local Shrinkage Priors
Information Matrix Splitting
Efficient statistical estimation via the maximum likelihood method requires the observed information, the negative of the Hessian of the underlying log-likelihood function. Computing the observed information is expensive; therefore, the expected information matrix (the Fisher information matrix) is often preferred due to its simplicity. In this paper, we prove that the average of the observed and Fisher information of the restricted/residual log-likelihood function for the linear mixed model can be split into two matrices. The expectation of one part is the Fisher information matrix, but the part itself has a simpler form than the Fisher information matrix. The other part, which involves a lot of computation, is a zero random matrix and is thus negligible. Leveraging such a splitting can simplify evaluation of the approximate Hessian of a log-likelihood function.
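For orientation, the quantities named in this abstract can be written out as follows. This is only a sketch of the standard definitions: the symbols S and Z for the two parts are hypothetical labels not taken from the paper, and "zero random matrix" is read here as a random matrix with zero expectation, which is an interpretation rather than something the abstract states.

```latex
\begin{align}
I_O(\theta) &= -\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^{\top}}
  && \text{(observed information)} \\
I(\theta) &= \mathbb{E}\left[ I_O(\theta) \right]
  && \text{(Fisher information)} \\
I_A(\theta) &= \tfrac{1}{2}\left( I_O(\theta) + I(\theta) \right) = S(\theta) + Z(\theta),
  && \mathbb{E}[S(\theta)] = I(\theta), \quad \mathbb{E}[Z(\theta)] = 0
\end{align}
```

Under this reading, dropping Z leaves S as an approximate Hessian whose expectation still equals the Fisher information.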
A Concise Overview of Standard Model-fitting Methods
"A Concise Overview of Standard Model-fitting Methods - Fitting a model via closed-form equations vs. Gradient Descent vs Stochastic Gradient Descent vs Mini-Batch Learning. What is the difference?" by Sebastian Raschka
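To make the comparison in that title concrete, here is a minimal NumPy sketch on a toy linear regression. It is not taken from the linked article; the data generator, the function names `make_data`, `fit_closed_form`, and `fit_gradient`, and the learning rate and epoch count are all assumptions chosen for illustration. The only difference between batch gradient descent, stochastic gradient descent, and mini-batch learning in this sketch is the `batch_size` argument.

```python
import numpy as np


def make_data(n=200, seed=0):
    """Toy data (assumed for illustration): y = 2x + 0.5 + small Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=n)
    X = np.column_stack([np.ones(n), x])  # bias column, then the feature
    return X, y


def fit_closed_form(X, y):
    """Least-squares solution computed directly, with no iteration."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def fit_gradient(X, y, batch_size, lr=0.1, epochs=300, seed=0):
    """Gradient descent on the mean squared error.

    batch_size == len(y)  -> (full) batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch learning
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w


if __name__ == "__main__":
    X, y = make_data()
    print("closed form:", fit_closed_form(X, y))
    print("batch GD:   ", fit_gradient(X, y, batch_size=len(y)))
    print("mini-batch: ", fit_gradient(X, y, batch_size=32))
    print("SGD:        ", fit_gradient(X, y, batch_size=1))
```

With a constant learning rate, the stochastic and mini-batch variants typically hover near the closed-form solution rather than settling exactly on it, while full-batch descent approaches it smoothly.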
《Qs - Deep Gaussian Processes》by Neil Lawrence
FLAG: Fast Linearly-Coupled Adaptive Gradient Method
The celebrated Nesterov’s accelerated gradient method offers great speed-ups compared to the classical gradient descent method, as it attains the optimal first-order oracle complexity for smooth convex optimization. On the other hand, the popular AdaGrad algorithm competes with mirror descent under the best regularizer by adaptively scaling the gradient. Recently, it has been shown that accelerated gradient descent can be viewed as a linear combination of gradient descent and mirror descent steps. Here, we draw upon these ideas and present a fast linearly-coupled adaptive gradient method (FLAG) as an accelerated version of AdaGrad, and show that our algorithm can indeed offer the best of both worlds. Like Nesterov’s accelerated algorithm and its proximal variant, FISTA, our method has a convergence rate of
Cognitive Dynamic Systems: A Technical Review of Cognitive Radar
We start with the history of cognitive radar, covering the origins of the Perception-Action Cycle (PAC), Fuster's research on cognition, and the principles of cognition. Fuster describes five cognitive functions: perception, memory, attention, language, and intelligence. We describe the PAC as it applies to cognitive radar, and then discuss long-term memory, memory storage, memory retrieval, and working memory. We also compare memory in human cognition with memory in cognitive radar. Attention is another function described by Fuster, and we compare attention in human cognition with attention in cognitive radar. We discuss the four functional blocks of the PAC: the Bayesian filter, feedback information, dynamic programming, and the state-space model for the radar environment. Then, to show that the PAC improves the tracking accuracy of cognitive radar over traditional active radar, we provide simulation results. The simulation compares three nonlinear filters: the Cubature Kalman Filter (CKF), the Unscented Kalman Filter (UKF), and the Extended Kalman Filter (EKF). Based on the results, radars implemented with the CKF perform better than radars implemented with the UKF or the EKF. Further, the radar with the EKF has the worst accuracy and the largest computational load because of the derivation and evaluation of Jacobian matrices. We suggest using the concept of risk management to better control parameters and improve performance in cognitive radar. We believe spectrum sensing is of potential interest for cognitive radar, and we propose a new approach, Probabilistic ICA, which will presumably reduce noise based on estimation error in cognitive radar. Parallel computing is based on the divide-and-conquer mechanism, and we suggest using it in cognitive radar to reduce the processing time of complicated calculations or tasks.
Predict or classify: The deceptive role of time-locking in brain signal classification
Several experimental studies claim to be able to predict the outcome of simple decisions from brain signals measured before subjects are aware of their decision. Often, these studies use multivariate pattern recognition methods with the underlying assumption that the ability to classify the brain signal is equivalent to predicting the decision itself. Here we show instead that it is possible to correctly classify a signal even if it does not contain any predictive information about the decision. We first define a simple stochastic model that mimics the random decision process between two equivalent alternatives, and generate a large number of independent trials that contain no choice-predictive information. The trials are first time-locked to the time point of the final event and then classified using standard machine-learning techniques. The resulting classification accuracy is above chance level long before the time point of time-locking. We then analyze the same trials using information theory. We demonstrate that the high classification accuracy is a consequence of time-locking and that its time behavior is simply related to the large relaxation time of the process. We conclude that when time-locking is a crucial step in the analysis of neuronal activity patterns, both the emergence and the timing of the classification accuracy are affected by structural properties of the network that generates the signal.
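The time-locking effect described in this abstract can be illustrated with a toy simulation. The sketch below is not the paper's model: it assumes a symmetric Gaussian random walk that starts at zero and stops at the first crossing of an upper or lower bound, and it replaces the machine-learning classifier with a simple sign rule. The names `simulate_trial` and `timelocked_accuracy`, the bound, and the lags are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_trial(rng, bound=10.0, max_steps=100_000):
    """One trial: an unbiased Gaussian random walk started at 0 that stops
    at the first crossing of +bound or -bound (an assumed toy model)."""
    x = 0.0
    path = [x]
    for _ in range(max_steps):
        x += rng.normal()
        path.append(x)
        if abs(x) >= bound:
            break
    return np.array(path)


def timelocked_accuracy(paths, lag):
    """Align every trial to its final sample (the 'decision' event) and guess
    the outcome from the sign of the signal `lag` steps before that event.

    Trials too short to have a post-onset sample at that lag are skipped.
    Returns (accuracy, number_of_trials_used).
    """
    correct = 0
    used = 0
    for p in paths:
        if len(p) <= lag + 1:
            continue
        outcome = np.sign(p[-1])      # which bound was crossed
        guess = np.sign(p[-1 - lag])  # sign rule, `lag` steps before the event
        correct += int(guess == outcome)
        used += 1
    accuracy = correct / used if used else float("nan")
    return accuracy, used


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    paths = [simulate_trial(rng) for _ in range(1000)]
    for lag in (5, 20, 50):
        acc, used = timelocked_accuracy(paths, lag)
        print(f"lag={lag:3d}  accuracy={acc:.2f}  trials={used}")
```

In this toy setting neither outcome is favored at trial onset, yet aligning trials to the final event selects, at each lag, only those stretches of signal that go on to reach a given bound, so the sign rule scores above 50% well before the event.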
Discrete Deep Feature Extraction: A Theory and New Architectures
First steps towards a mathematical theory of deep convolutional neural networks for feature extraction were made (for the continuous-time case) in Mallat, 2012, and Wiatowski and Bölcskei, 2015. This paper considers the discrete case, introduces new convolutional neural network architectures, and proposes a mathematical framework for their analysis. Specifically, we establish deformation and translation sensitivity results of local and global nature, and we investigate how certain structural properties of the input signal are reflected in the corresponding feature vectors. Our theory applies to general filters and general Lipschitz-continuous non-linearities and pooling operators. Experiments on handwritten digit classification and facial landmark detection, including feature importance evaluation, complement the theoretical findings.