stats.stackexchange.com
I'm currently trying to wrap my head around the t-SNE math. Unfortunately, there is still one question I can't answer satisfactorily: What is the actual meaning of the axes in a t-SNE graph? If I were to give a presentation on this topic or include it in any publication: How would I label the axes appropriately? P.S: I read this Reddit question but the answers given there (such as "it depends on i
What is the practical difference between Wasserstein metric and Kullback-Leibler divergence? Wasserstein metric is also referred to as Earth mover's distance. From Wikipedia: Wasserstein (or Vaserstein) metric is a distance function defined between probability distributions on a given metric space M. and Kullback–Leibler divergence is a measure of how one probability distribution diverges from a s
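As a rough illustration of the practical difference, here is a minimal sketch (my own, assuming SciPy is available; the distributions are made up) comparing the two quantities on two discrete distributions: the Wasserstein distance depends on how far mass must move along the support, while KL only compares the probabilities assigned to the same points.

```python
# Minimal sketch (not from the question): contrast the two quantities
# on two simple discrete distributions. Assumes scipy is installed.
import numpy as np
from scipy.stats import wasserstein_distance, entropy

support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.4, 0.4, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.4, 0.4])

# KL divergence ignores the geometry of the support: it only compares
# probabilities assigned to the same points.
kl_pq = entropy(p, q)

# The Wasserstein (earth mover's) distance depends on how far mass must be
# moved along the support, i.e. on the underlying metric.
w_pq = wasserstein_distance(support, support, u_weights=p, v_weights=q)

print(f"KL(p || q)        = {kl_pq:.4f}")
print(f"Wasserstein(p, q) = {w_pq:.4f}")
```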
Ok, this is a quite basic question, but I am a little bit confused. In my thesis I write: The standard errors can be found by calculating the inverse of the square root of the diagonal elements of the (observed) Fisher Information matrix: \begin{align*} s_{\hat{\mu},\hat{\sigma}^2}=\frac{1}{\sqrt{\mathbf{I}(\hat{\mu},\hat{\sigma}^2)}} \end{align*} Since the optimization command in R minimizes $-\l
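A minimal numerical sketch of the usual recipe (my own illustration, not the thesis code): invert the observed Fisher information (the Hessian of the negative log-likelihood at the optimum) and take the square roots of its diagonal.

```python
# Sketch (assumed example, not from the question): standard errors from the
# observed Fisher information for a normal sample, using the Hessian of the
# negative log-likelihood at the MLE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)
n = x.size

# MLEs for (mu, sigma^2)
mu_hat = x.mean()
s2_hat = x.var()  # ML estimate (divides by n)

# Observed Fisher information = Hessian of -log L evaluated at the MLE.
# For the normal model this has a known closed form.
I_obs = np.array([
    [n / s2_hat,            0.0],
    [0.0,        n / (2 * s2_hat**2)],
])

# Standard errors: sqrt of the diagonal of the *inverse* information matrix
# (not the elementwise reciprocal square root of the diagonal).
se = np.sqrt(np.diag(np.linalg.inv(I_obs)))
print("SE(mu_hat), SE(s2_hat):", se)
```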
Kernel is a way of computing the dot product of two vectors $\mathbf x$ and $\mathbf y$ in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called "generalized dot product". Suppose we have a mapping $\varphi \, : \, \mathbb R^n \to \mathbb R^m$ that brings our vectors in $\mathbb R^n$ to some feature space $\mathbb R^m$. Then the dot product of $\ma
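A small numerical check (my own sketch, not part of the answer) of this idea for the degree-2 polynomial kernel on $\mathbb R^2$: the kernel value equals an ordinary dot product under an explicitly constructed feature map.

```python
# Sketch (illustrative): verify that the degree-2 polynomial kernel
# k(x, y) = (x . y)^2 on R^2 equals the dot product of an explicit
# feature map phi: R^2 -> R^3.
import numpy as np

def phi(v):
    # Explicit feature map for k(x, y) = (x . y)^2 on R^2.
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = np.dot(x, y) ** 2            # computed without leaving R^2
feature_dot  = np.dot(phi(x), phi(y))       # dot product in feature space

print(kernel_value, feature_dot)  # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```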
I've found that ImageNet and other large CNNs make use of local response normalization layers. However, I cannot find much information about them. How important are they, and when should they be used? From http://caffe.berkeleyvision.org/tutorial/layers.html#data-layers: "The local response normalization layer performs a kind of “lateral inhibition” by normalizing over local input regions. In
Thank you for the interesting question! Difference: One limitation of standard count models is that the zeros and the nonzeros (positives) are assumed to come from the same data-generating process. With hurdle models, these two processes are not constrained to be the same. The basic idea is that a Bernoulli probability governs the binary outcome of whether a count variate has a zero or positive re
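A small simulation sketch (my own, with made-up parameters) of that two-part data-generating process: a Bernoulli draw decides zero versus positive, and positives come from a zero-truncated count distribution.

```python
# Sketch (assumed parameters, not from the answer): simulate a hurdle model.
# A Bernoulli process decides whether the count is zero or positive; positive
# counts are drawn from a zero-truncated Poisson.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
p_positive = 0.3   # Bernoulli probability of clearing the "hurdle"
lam = 2.5          # rate of the count process for positives

def zero_truncated_poisson(lam, size, rng):
    # Simple rejection sampler: redraw until the value is positive.
    out = rng.poisson(lam, size)
    while np.any(out == 0):
        zeros = out == 0
        out[zeros] = rng.poisson(lam, zeros.sum())
    return out

is_positive = rng.random(n) < p_positive
y = np.zeros(n, dtype=int)
y[is_positive] = zero_truncated_poisson(lam, is_positive.sum(), rng)

print("share of zeros:", np.mean(y == 0))          # ~ 1 - p_positive
print("mean of positives:", y[is_positive].mean()) # ~ lam / (1 - exp(-lam))
```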
In today's pattern recognition class my professor talked about PCA, eigenvectors and eigenvalues. I understood the mathematics of it. If I'm asked to find eigenvalues etc. I'll do it correctly like a machine. But I didn't understand it. I didn't get the purpose of it. I didn't get the feel of it. I strongly believe in the following quote: You do not really understand something unless you can expla
Principal component analysis (PCA) is usually explained via an eigen-decomposition of the covariance matrix. However, it can also be performed via singular value decomposition (SVD) of the data matrix $\mathbf X$. How does it work? What is the connection between these two approaches? What is the relationship between SVD and PCA? Or in other words, how to use SVD of the data matrix to perform dimen
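A short numerical sketch (my own, on synthetic data) of the connection the question asks about: the right singular vectors of the centered data matrix are the principal axes, and the squared singular values divided by $n$ are the eigenvalues of the covariance matrix.

```python
# Sketch (illustrative): check that PCA via eigendecomposition of the
# covariance matrix and PCA via SVD of the centered data matrix agree.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)               # center the data
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix C = X^T X / n
C = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data, X = U S V^T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of C are the squared singular values divided by n, and the
# right singular vectors V are the principal axes (up to sign).
print(np.allclose(eigvals, S**2 / n))
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
```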
I have just heard that it's a good idea to choose the initial weights of a neural network from the range $(\frac{-1}{\sqrt d} , \frac{1}{\sqrt d})$, where $d$ is the number of inputs to a given neuron. It is assumed that the inputs are normalized - mean 0, variance 1 (I don't know if this matters). Why is this a good idea?
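A quick numerical sketch (my own) of why that range behaves well: with unit-variance inputs, weights drawn uniformly from $(-1/\sqrt d, 1/\sqrt d)$ have variance $1/(3d)$, so the variance of a neuron's pre-activation stays near $1/3$ regardless of $d$.

```python
# Sketch (illustrative): with inputs of mean 0 and variance 1, weights drawn
# uniformly from (-1/sqrt(d), 1/sqrt(d)) give a pre-activation whose variance
# stays around 1/3 no matter how many inputs d the neuron has.
import numpy as np

rng = np.random.default_rng(0)

for d in (10, 100, 1000):
    n_trials = 20_000
    x = rng.normal(size=(n_trials, d))                    # normalized inputs
    w = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(n_trials, d))
    pre_activation = np.sum(w * x, axis=1)
    print(d, pre_activation.var())  # ~ d * (1/(3d)) * 1 = 1/3 for every d
```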
What are the similarities and differences between these three methods: Bagging, Boosting, and Stacking? Which is the best one, and why? Can you give me an example of each?
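For one concrete, hedged example of each (assuming scikit-learn's ensemble module; the dataset and settings are arbitrary), the sketch below builds a bagging, a boosting, and a stacking classifier on a toy problem.

```python
# Sketch (scikit-learn assumed): one bagging, one boosting and one stacking
# ensemble on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    # Bagging: many models trained on bootstrap resamples, predictions averaged.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: models trained sequentially, each focusing on previous errors.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Stacking: a meta-learner combines the predictions of base models.
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:9s} accuracy ~ {score:.3f}")
```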
I would like to implement an algorithm for automatic model selection. I am thinking of doing stepwise regression, but anything will do (it has to be based on linear regressions, though). My problem is that I am unable to find a methodology, or an open source implementation (I am working in Java). The methodology I have in mind would be something like: calculate the correlation matrix of all the facto
Many authors of papers I read affirm that SVMs are a superior technique for their regression/classification problem, aware that they couldn't get similar results through NNs. Often the comparison states that SVMs, unlike NNs: have a strong founding theory; reach the global optimum thanks to quadratic programming; have no issue choosing a proper number of parameters; are less prone to overfitting; nee
The wikipedia page claims that likelihood and probability are distinct concepts. In non-technical parlance, "likelihood" is usually a synonym for "probability," but in statistical usage there is a clear distinction in perspective: the number that is the probability of some observed outcomes given a set of parameter values is regarded as the likelihood of the set of parameter values given the obser
I'm confused about how to calculate the perplexity of a holdout sample when doing Latent Dirichlet Allocation (LDA). The papers on the topic breeze over it, making me think I'm missing something obvious... Perplexity is seen as a good measure of performance for LDA. The idea is that you keep a holdout sample, train your LDA on the rest of the data, then calculate the perplexity of the holdout. The
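As a hedged sketch of the computation itself (not tied to any particular LDA implementation; the numbers below are invented): perplexity is the exponent of the negative average per-token log-likelihood of the held-out documents.

```python
# Sketch (illustrative, made-up numbers): perplexity from held-out
# log-likelihoods. Given the total log-likelihood the trained LDA model
# assigns to the holdout and the number of tokens in it, perplexity is
# exp(-log-likelihood / token count).
import numpy as np

# Assume these came from evaluating the trained model on the holdout set.
holdout_log_likelihood = -152_340.7   # sum of log p(w | trained model)
holdout_token_count = 25_000          # total number of word tokens held out

perplexity = np.exp(-holdout_log_likelihood / holdout_token_count)
print(f"holdout perplexity: {perplexity:.1f}")
```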
An autoencoder is a simple 3-layer neural network where the output units are directly connected back to the input units. E.g. in a network like this: output[i] has an edge back to input[i] for every i. Typically, the number of hidden units is much smaller than the number of visible (input/output) ones. As a result, when you pass data through such a network, it first compresses (encodes) the input vector to "fit" in a smaller
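A minimal sketch (my own, with arbitrary layer sizes) of that encode/decode shape: a tiny autoencoder forward pass in NumPy compressing 20-dimensional inputs through a 5-unit hidden layer and reconstructing them.

```python
# Sketch (illustrative): the forward pass of a tiny autoencoder. 20 input
# units are compressed (encoded) into 5 hidden units and then reconstructed
# (decoded) back to 20 outputs; training would minimize the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 20, 5

W_enc = rng.normal(scale=0.1, size=(n_visible, n_hidden))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # decoder weights

def forward(x):
    hidden = np.tanh(x @ W_enc)          # encode: compress to 5 numbers
    reconstruction = hidden @ W_dec      # decode: expand back to 20 numbers
    return hidden, reconstruction

x = rng.normal(size=(1, n_visible))
code, x_hat = forward(x)
print("code shape:", code.shape)                       # (1, 5)
print("reconstruction error:", np.mean((x - x_hat)**2))
```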
Let your (centered) data be stored in a $n\times d$ matrix $\mathbf X$ with $d$ features (variables) in columns and $n$ data points in rows. Let the covariance matrix $\mathbf C=\mathbf X^\top \mathbf X/n$ have eigenvectors in columns of $\mathbf E$ and eigenvalues on the diagonal of $\mathbf D$, so that $\mathbf C = \mathbf E \mathbf D \mathbf E^\top$. Then what you call "normal" PCA whitening tr
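A short numerical sketch (my own, synthetic data) of the whitening transform in the answer's notation: with $\mathbf C = \mathbf E \mathbf D \mathbf E^\top$, multiplying the centered data by $\mathbf E \mathbf D^{-1/2}$ yields components with identity covariance.

```python
# Sketch (illustrative): PCA whitening following the answer's notation,
# with centered data X (n x d) and covariance C = X^T X / n = E D E^T.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated data
X = X - X.mean(axis=0)                                   # center

C = X.T @ X / n
eigvals, E = np.linalg.eigh(C)          # D holds eigvals, E the eigenvectors
D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

# "Normal" PCA whitening: rotate onto the eigenbasis, then rescale each
# component by the inverse square root of its eigenvalue.
X_white = X @ E @ D_inv_sqrt

# The whitened data now has (approximately) identity covariance.
print(np.round(X_white.T @ X_white / n, 3))
```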
EDIT: The Web Technologies and Services CRAN task view contains a much more comprehensive list of data sources and APIs available in R. You can submit a pull request on github if you wish to add a package to the task view. I'm making a list of the various data feeds that are already hooked into R or that are easy to setup. Here's my initial list of packages, and I was wondering what else I'm missi
I found this confusing when using the neural network toolbox in Matlab. It divided the raw data set into three parts: training set, validation set, and test set. I notice that in many training or learning algorithms, the data is often divided into two parts, the training set and the test set. My questions are: what is the difference between the validation set and the test set? Is the validation set really specific to ne
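A small sketch (my own, assuming scikit-learn; the Matlab toolbox performs an analogous split internally) of carving one dataset into the three parts: fit on the training set, tune on the validation set, and report once on the test set.

```python
# Sketch (illustrative, scikit-learn assumed): split data into training,
# validation and test sets. The validation set is used to choose settings
# (e.g. when to stop, which hyperparameters); the test set is touched only
# once, for the final performance estimate.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = (X.ravel() % 2)

# First split off the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```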
I'm using libsvm in C-SVC mode with a polynomial kernel of degree 2 and I'm required to train multiple SVMs. Each training set has 10 features and 5000 vectors. During training, I am getting this warning for most of the SVMs that I train: WARNING: reaching max number of iterations optimization finished, #iter = 10000000 Could someone please explain what this warning implies and, perhaps, how
I am searching for [free] software that can produce nice looking graphical models, e.g. Any suggestions would be appreciated.
I need to determine the KL-divergence between two Gaussians. I am comparing my results to these, but I can't reproduce their result. My result is obviously wrong, because the KL is not 0 for KL(p, p). I wonder where I am making a mistake and ask if anyone can spot it. Let $p(x) = N(\mu_1, \sigma_1)$ and $q(x) = N(\mu_2, \sigma_2)$. From Bishop's PRML I know that $$KL(p, q) = - \int p(x) \log q(x) d
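For reference, the closed-form divergence between univariate Gaussians is $KL(p, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$. The following sketch (my own) evaluates it and checks the sanity condition that fails in the question, namely KL(p, p) = 0.

```python
# Sketch (illustrative): closed-form KL divergence between two univariate
# Gaussians p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2), with a sanity
# check that KL(p, p) = 0.
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0  (KL(p, p) must vanish)
print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # > 0
```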
I understand the formal differences between them; what I want to know is when it is more relevant to use one vs. the other. Do they always provide complementary insight about the performance of a given classification/detection system? When is it reasonable to provide them both in a paper, say, instead of just one? Are there any alternative (maybe more modern) descriptors that capture the relevant
Say I want to estimate a large number of parameters, and I want to penalize some of them because I believe they should have little effect compared to the others. How do I decide what penalization scheme to use? When is ridge regression more appropriate? When should I use lasso?
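A quick hedged illustration (assuming scikit-learn, on synthetic data with arbitrary penalty strength) of the practical difference: the lasso's L1 penalty drives many coefficients exactly to zero, while ridge's L2 penalty only shrinks them.

```python
# Sketch (illustrative): ridge shrinks coefficients, lasso sets many of them
# exactly to zero - convenient when you believe some effects are negligible.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge: coefficients exactly zero:", np.sum(ridge.coef_ == 0))
print("lasso: coefficients exactly zero:", np.sum(lasso.coef_ == 0))
```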
Data analysis cartoons can be useful for many reasons: they help communicate; they show that quantitative people have a sense of humor too; they can instigate good teaching moments; and they can help us remember important principles and lessons. This is one of my favorites: As a service to those who value this kind of resource, please share your favorite data analysis cartoon. They probably don't
Lots of people use a main tool like Excel or another spreadsheet, SPSS, Stata, or R for their statistics needs. They might turn to some specific package for very special needs, but a lot of things can be done with a simple spreadsheet or a general stats package or stats programming environment. I've always liked Python as a programming language, and for simple needs, it's easy to write a short pro