Velocity fields are of great importance for understanding the dynamics and structure of the solar atmosphere. The line-of-sight velocities are encoded in the wavelength shifts of the spectral lines, thanks to the Doppler effect, and are relatively easy to measure. The orthogonal ("horizontal") components of the velocity vector, on the other hand, cannot be measured directly.
The most popular method for estimating the horizontal velocities is the so-called local correlation tracking (LCT; November & Simon, 1988). It is based on comparing successive continuum images of the solar surface and transforming the differences between them into information about the horizontal velocity field. However, the LCT algorithm suffers from several limitations.
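To make the principle concrete, here is a minimal sketch of tile-by-tile correlation tracking: for each subregion, the displacement between two consecutive continuum frames is taken from the peak of their cross-correlation and converted into a horizontal velocity. The tile size, cadence and pixel scale below are illustrative values of my own, not those of any particular instrument or of the November & Simon implementation.

import numpy as np

def lct_sketch(frame1, frame2, tile=16, dt=30.0, pix_km=40.0):
    """Toy correlation tracking: one horizontal velocity vector per tile,
    estimated from the cross-correlation peak of two consecutive continuum
    images. tile (px), dt (s) and pix_km (km per pixel) are placeholders."""
    ny, nx = frame1.shape
    vx = np.zeros((ny // tile, nx // tile))
    vy = np.zeros_like(vx)
    for ty in range(ny // tile):
        for tx in range(nx // tile):
            a = frame1[ty*tile:(ty+1)*tile, tx*tile:(tx+1)*tile]
            b = frame2[ty*tile:(ty+1)*tile, tx*tile:(tx+1)*tile]
            a = a - a.mean()   # remove the mean intensity so that the
            b = b - b.mean()   # granulation pattern dominates the correlation
            # FFT-based cross-correlation; its peak marks the shift of b w.r.t. a
            cc = np.fft.fftshift(np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)).real)
            dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
            dy, dx = dy - tile // 2, dx - tile // 2   # displacement in pixels
            vx[ty, tx] = dx * pix_km / dt             # km/s
            vy[ty, tx] = dy * pix_km / dt
    return vx, vy

A real LCT implementation uses overlapping, apodized windows and subpixel fitting of the correlation peak; the crude tiling above only illustrates the idea and shares the basic limitations mentioned here.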
In a paper by Andres Asensio Ramos and Iker S. Requerey (with a small contribution from my side), accepted by A&A and posted on arXiv a few weeks ago (2017arXiv170305128A), this problem is tackled with a deep-learning approach. A deep fully convolutional neural network is trained on synthetic observations from 3D MHD simulations of the solar photosphere and then applied to real observations from the IMaX instrument on board the SUNRISE balloon (Martinez Pillet et al., 2011; Solanki et al., 2010). The method is validated using quiet-Sun simulation snapshots produced with the MANCHA code that I have been developing over the last couple of years.
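For orientation only, the following PyTorch sketch shows the kind of fully convolutional mapping involved: two consecutive continuum frames enter as two image channels and two per-pixel velocity components come out. The depth, width and the training comment are placeholders of mine, not the actual architecture or loss used in the paper.

import torch
import torch.nn as nn

class VelocityFCN(nn.Module):
    """Toy fully convolutional network: 2 input channels (two consecutive
    continuum frames) -> 2 output channels (vx, vy). Depth and width are
    arbitrary placeholders, not the architecture of the paper."""
    def __init__(self, width=32, depth=5):
        super().__init__()
        layers = [nn.Conv2d(2, width, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(width, 2, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (batch, 2, ny, nx)
        return self.net(x)       # (batch, 2, ny, nx): per-pixel (vx, vy)

# Training would pair synthetic continuum frames from the MHD simulations with
# the known horizontal velocities, e.g. by minimizing a pixel-wise MSE loss.
model = VelocityFCN()
frames = torch.randn(1, 2, 64, 64)   # two consecutive (mock) frames
v = model(frames)                    # predicted (vx, vy) maps

Being fully convolutional, such a network is not tied to a fixed image size, so it can be trained on simulation patches and later evaluated on observed maps.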
Monday, 20 March 2017
K-means clustering
The problem of clustering is a rather general one: given $m$ observations or measurements in $n$-dimensional space, how does one identify $k$ clusters (classes, groups, types) of measurements and their centroids (representatives)?
The k-means method is extremely simple, rather robust and widely used in its numerous variants. It is essentially very similar (but not identical) to Lloyd's algorithm (also known as Voronoi iteration or relaxation, used in computer science).
k-means
Let's use the following indices: $i$ counts measurements, $i \in [0, m-1]$; $j$ counts dimensions, $j \in [0, n-1]$; $l$ counts clusters, $l \in [0, k-1]$.
Each measurement in the $n$-dimensional space is represented by a vector $x_i = \{x_{i, 0}, \dots, x_{i, n-1}\}$, where the index $i$ counts the different measurements ($i = 0, \dots, m-1$). The algorithm can be summarized as follows:
1. Randomly choose $k$ measurements as the initial cluster centers $c_0, \dots, c_{k-1}$. Each cluster center is, of course, also an $n$-dimensional vector.
2. Compute the Euclidean distance $D_{i, l}$ between every measurement $x_i$ and every cluster center $c_l$:
$$D_{i, l} = \sqrt{\sum_{j=0}^{n-1} (x_{i, j} - c_{l, j})^2}.$$
3. Assign every measurement $x_i$ to the cluster represented by the closest cluster center $c_l$.
4. Compute new cluster centers by averaging all the measurements assigned to each cluster.
5. Go back to step 2 and keep iterating until no measurement changes its cluster between two successive iterations.
This procedure is initialized randomly, so the result will be slightly different in every run. The outcome of the clustering (and the number of iterations needed) depends significantly on the initial choice of cluster centers. The easiest way to improve the algorithm is therefore to improve the initial choice, i.e. to alter only step 1 and then iterate as before. There are two simple alternatives for the initialization.
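To make the procedure concrete, here is a compact NumPy sketch of the basic algorithm as summarized in steps 1 to 5 above, with the plain random initialization of step 1; the function name and the toy data at the end are mine, for illustration only.

import numpy as np

def kmeans(x, k, seed=None):
    """Basic k-means: x is an (m, n) array of m measurements in n dimensions,
    k is the number of clusters. Returns the cluster centers and the labels."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    # step 1: pick k distinct measurements as the initial cluster centers
    centers = x[rng.choice(m, size=k, replace=False)]
    labels = np.full(m, -1)
    while True:
        # step 2: Euclidean distances D[i, l] between measurements and centers
        d = np.sqrt(((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        # step 3: assign every measurement to the closest cluster center
        new_labels = d.argmin(axis=1)
        # step 5: stop when no measurement changes its cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 4: new centers = mean of the measurements in each cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = x[labels == l].mean(axis=0)
    return centers, labels

# toy usage: three well-separated 2D blobs
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(size=(100, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = kmeans(pts, k=3, seed=0)

Running the function with different seeds illustrates the dependence on the initial choice of cluster centers discussed above.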
Labels: astronomy, data mining, IDL, statistics
Location: Santa Cruz de Tenerife, Spain