Compute and return an asymptotically optimal set of cluster centers for the input sample
, cluster membership
IDs, and sample
distances-squared from their corresponding cluster centers.
More...
Compute and return an asymptotically optimal set of cluster centers for the input sample
, cluster membership
IDs, and sample
distances-squared from their corresponding cluster centers.
See the documentation of pm_clusKmeans for more information on the Kmeans++ clustering algorithm.
The metric used within this generic interface is the Euclidean distance.
- Parameters
-
[in,out] | rng | : The input/output scalar that can be an object of,
-
type rngf_type, implying the use of intrinsic Fortran uniform RNG.
-
type xoshiro256ssw_type, implying the use of xoshiro256** uniform RNG.
|
[out] | membership | : The output vector of shape (1:nsam) of type integer of default kind IK, containing the membership of each input sample in sample from its nearest cluster center , such that cluster(membership(i)) is the nearest cluster center to the i th sample sample(:, i) at a squared-distance of disq(i) .
|
[out] | disq | : The output vector of shape (1:nsam) of the same type and kind as the input argument sample , containing the Euclidean squared distance of each input sample in sample from its nearest cluster center .
|
[out] | csdisq | : The output vector of shape (1:nsam+1) of the same type and kind as the input argument sample , containing the cumulative sum of the Euclidean squared distance of each input sample in sample from its nearest cluster center .
While the output contents are mostly useless, this argument can aid the algorithm efficiency to resolving the need for internal space allocation.
This potential speed-up is particularly relevant when the procedure is called repeatedly many times on samples of the same size.
|
[in] | sample | : The input vector, or matrix of,
-
type
real of kind any supported by the processor (e.g., RK, RK32, RK64, or RK128),
containing the sample of nsam points in a ndim -dimensional space whose corresponding cluster centers must be computed.
-
If
sample is a vector of shape (1 : nsam) and center is a vector of shape (1 : ncls) , then the input sample must be a collection of nsam points (in univariate space).
-
If
sample is a matrix of shape (1 : ndim, 1 : nsam) and center is a matrix of shape (1 : ndim, 1 : ncls) , then the input sample must be a collection of nsam points (in ndim -dimensional space).
|
[in] | ncls | : The input scalar of type integer of default kind IK, containing the number of the desired clusters to be identified in the sample.
(optional, default = ubound(center, 2) . It must be present if and only if the output arguments center , size , and potential are all missing.) |
[out] | center | : The output vector of shape (1:ncls) or matrix of shape (1 : ndim, 1 : ncls) of the same type and kind as the input argument sample , containing the set of ncls unique random cluster centers (centroids) selected from the input sample based on the computed memberships and minimum sample-cluster distances disq .
(optional. If missing, no cluster center information will be output.) |
[out] | size | : The output vector of shape (1:ncls) type integer of default kind IK, containing the sizes (number of members) of the clusters with the corresponding centers output in the argument center .
(optional. If missing, no cluster size information will be output.) |
[out] | potential | : The output vector of shape (1:ncls) of the same type and kind as the input argument sample , the i th element of which contains the sum of squared distances of all members of the i th cluster from the cluster center as output in the i th element of center .
(optional. If missing, no cluster potential information will be output.) |
Possible calling interfaces ⛓
call setKmeansPP(rng, membership(
1 : nsam), disq(
1 : nsam), csdisq(
0 : nsam), sample(
1 : ndim,
1 : nsam), ncls)
call setKmeansPP(rng, membership(
1 : nsam), disq(
1 : nsam), csdisq(
0 : nsam), sample(
1 : ndim,
1 : nsam), center(
1 : ndim,
1 : ncls),
size(
1 : ncls), potential(
1 : ncls))
Compute and return an asymptotically optimal set of cluster centers for the input sample,...
This module contains procedures and routines for the computing the Kmeans clustering of a given set o...
- Warning
- The condition
ubound(center, rank(center)) > 0
must hold for the corresponding input arguments.
The condition ubound(sample, rank(sample)) == size(disq, 1)
must hold for the corresponding input arguments.
The condition ubound(sample, rank(sample)) == size(csdisq, 1) - 1
must hold for the corresponding input arguments.
The condition ubound(center, rank(center)) == size(size, 1)
must hold for the corresponding input arguments.
The condition ubound(center, rank(center)) == size(potential, 1)
must hold for the corresponding input arguments.
The condition ubound(sample, rank(sample)) == size(membership, 1)
must hold for the corresponding input arguments.
The condition ubound(center, rank(center)) <= size(sample, rank(sample))
must hold for the corresponding input arguments (the number of clusters must be less than or equal to the sample size).
The condition ubound(sample, 1) == ubound(center, 1)
must hold for the corresponding input arguments.
These conditions are verified only if the library is built with the preprocessor macro CHECK_ENABLED=1
.
-
By definition, the number of points in the input
sample
must be larger than the specified number of clusters.
-
The
pure
procedure(s) documented herein become impure
when the ParaMonte library is compiled with preprocessor macro CHECK_ENABLED=1
.
By default, these procedures are pure
in release
build and impure
in debug
and testing
builds. The procedures of this generic interface are always impure
when the input rng
argument is set to an object of type rngf_type.
- Note
- Dropping the optional arguments can aid runtime performance.
This is particularly relevant when the output of this generic interface is directly passed to the k-means algorithm.
- See also
- setKmeans
setCenter
setMember
setKmeansPP
Arthur, D.; Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding
Example usage ⛓
12 integer(IK) :: ndim, nsam, ncls, itry
13 real(RKG) ,
allocatable :: sample(:,:), center(:,:), disq(:), csdisq(:), potential(:)
14 integer(IK) ,
allocatable :: membership(:),
size(:)
15 type(display_type) :: disp
20 call disp%show(
"!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
21 call disp%show(
"! Compute cluster centers based on an input sample and cluster memberships and member-center distances.")
22 call disp%show(
"!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%")
27 call disp%show(
"ndim = getUnifRand(1, 5); ncls = getUnifRand(1, 5); nsam = getUnifRand(ncls, 2 * ncls);")
31 call disp%show(
"sample = getUnifRand(0., 5., ndim, nsam) ! Create a random sample.")
35 call disp%show(
"call setResized(disq, nsam)")
37 call disp%show(
"call setResized(csdisq, nsam + 1_IK)")
39 call disp%show(
"call setResized(membership, nsam)")
41 call disp%show(
"call setResized(center, [ndim, ncls])")
43 call disp%show(
"call setResized(potential, ncls)")
45 call disp%show(
"call setResized(size, ncls)")
49 call disp%show(
"call setKmeansPP(rngf, membership, disq, csdisq, sample, center, size, potential) ! compute the new clusters and memberships.")
50 call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
71 integer(IK) :: funit, i
83 call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
84 open(newunit
= funit, file
= "setKmeansPP.center.txt")
86 write(funit,
"(*(g0,:,','))") i, center(:,i)
89 open(newunit
= funit, file
= "setKmeansPP.sample.txt")
91 write(funit,
"(*(g0,:,','))") membership(i), sample(:,i)
Allocate or resize (shrink or expand) an input allocatable scalar string or array of rank 1....
Generate and return a scalar or a contiguous array of rank 1 of length s1 of randomly uniformly distr...
This is a generic method of the derived type display_type with pass attribute.
This is a generic method of the derived type display_type with pass attribute.
This module contains procedures and generic interfaces for resizing allocatable arrays of various typ...
This module contains classes and procedures for computing various statistical quantities related to t...
type(rngf_type) rngf
The scalar constant object of type rngf_type whose presence signified the use of the Fortran intrinsi...
This module contains classes and procedures for input/output (IO) or generic display operations on st...
type(display_type) disp
This is a scalar module variable an object of type display_type for general display.
This module defines the relevant Fortran kind type-parameters frequently used in the ParaMonte librar...
integer, parameter LK
The default logical kind in the ParaMonte library: kind(.true.) in Fortran, kind(....
integer, parameter IK
The default integer kind in the ParaMonte library: int32 in Fortran, c_int32_t in C-Fortran Interoper...
integer, parameter SK
The default character kind in the ParaMonte library: kind("a") in Fortran, c_char in C-Fortran Intero...
integer, parameter RKS
The single-precision real kind in Fortran mode. On most platforms, this is an 32-bit real kind.
Generate and return an object of type display_type.
Example Unix compile command via Intel ifort
compiler ⛓
3ifort -fpp -standard-semantics -O3 -Wl,-rpath,../../../lib -I../../../inc main.F90 ../../../lib/libparamonte* -o main.exe
Example Windows Batch compile command via Intel ifort
compiler ⛓
2set PATH=..\..\..\lib;%PATH%
3ifort /fpp /standard-semantics /O3 /I:..\..\..\include main.F90 ..\..\..\lib\libparamonte*.lib /exe:main.exe
Example Unix / MinGW compile command via GNU gfortran
compiler ⛓
3gfortran -cpp -ffree-line-length-none -O3 -Wl,-rpath,../../../lib -I../../../inc main.F90 ../../../lib/libparamonte* -o main.exe
Example output ⛓
12+4.67906761,
+1.89873314,
+4.87618637,
+4.92659283,
+1.82102025
20call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
22+0.00000000,
+0.603929395E-2,
+0.254081073E-2,
+0.00000000,
+0.00000000
24+0.00000000,
+0.612687320E-1,
+0.673080236E-1,
+0.698488355E-1,
+0.698488355E-1,
+0.698488355E-1
28+0.603929395E-2,
+0.254081073E-2,
+0.00000000
30+1.82102025,
+4.92659283,
+4.67906761
40+4.19636393,
+3.43198967,
+3.17646265,
+3.53995275,
+2.83120489,
+0.310877562
41+4.40605450,
+3.56519103,
+3.20387983,
+3.82957911,
+0.219681859E-1,
+1.60333455
42+4.48841381,
+3.11326718,
+3.35133982,
+2.04096079,
+1.27075970,
+1.53446734
50call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
52+0.00000000,
+0.00000000,
+0.252518415,
+0.00000000,
+0.00000000,
+0.00000000
54+0.00000000,
+3.18234777,
+3.18234777,
+3.43486619,
+3.43486619,
+3.43486619,
+3.43486619
58+0.00000000,
+0.252518415,
+0.00000000,
+0.00000000,
+0.00000000
60+2.83120489,
+3.43198967,
+0.310877562,
+3.53995275,
+4.19636393
61+0.219681859E-1,
+3.56519103,
+1.60333455,
+3.82957911,
+4.40605450
62+1.27075970,
+3.11326718,
+1.53446734,
+2.04096079,
+4.48841381
82call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
86+0.00000000,
+0.00000000
104+4.88991022,
+2.53621292,
+2.52460837,
+2.10078573
105+4.19053268,
+2.82458758,
+2.79259229,
+2.46770835
106+2.20974803,
+1.29334474,
+1.06490874,
+2.44260454
107+0.912711620,
+4.63725138,
+0.304335654,
+4.82903814
115call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
117+0.00000000,
+0.00000000,
+9.22966766,
+1.67453992
119+0.00000000,
+0.00000000,
+22.1176872,
+31.3473549,
+57.4865303
123+9.22966766,
+1.67453992
125+4.88991022,
+2.53621292
126+4.19053268,
+2.82458758
127+2.20974803,
+1.29334474
128+0.912711620,
+4.63725138
138+2.01086187,
+2.52779198,
+2.64334178
139+4.19251156,
+4.94817638,
+1.48467541
140+1.65522122,
+3.27386475,
+3.62396979
148call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
150+0.00000000,
+0.00000000,
+0.00000000
152+0.00000000,
+0.00000000,
+0.00000000,
+11.6083775
156+0.00000000,
+0.00000000,
+0.00000000
158+2.01086187,
+2.52779198,
+2.64334178
159+4.19251156,
+4.94817638,
+1.48467541
160+1.65522122,
+3.27386475,
+3.62396979
170+0.572903752,
+1.96094215,
+2.58151531
178call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
180+0.00000000,
+0.00000000,
+0.00000000
182+0.00000000,
+0.00000000,
+0.385111064,
+0.385111064
186+0.00000000,
+0.00000000,
+0.00000000
188+0.572903752,
+2.58151531,
+1.96094215
198+3.96224713,
+3.25298929,
+2.88011456
206call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
208+0.00000000,
+0.00000000,
+0.00000000
210+0.00000000,
+0.00000000,
+0.139035568,
+0.139035568
214+0.00000000,
+0.00000000,
+0.00000000
216+3.96224713,
+2.88011456,
+3.25298929
226+0.940437615,
+3.03366470
227+1.38300800,
+2.12286234
235call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
237+4.92898417,
+0.00000000
239+0.00000000,
+4.92898417,
+4.92898417
256+3.21397519,
+3.77438998,
+4.40075779,
+1.44634104,
+2.70580196,
+3.76576853,
+4.70713234
264call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
266+0.00000000,
+0.743294731E-4,
+0.938653648E-1,
+0.00000000,
+0.00000000,
+0.00000000,
+0.00000000
268+0.00000000,
+0.258240014,
+0.258314341,
+0.352179706,
+0.352179706,
+0.352179706,
+0.352179706,
+0.352179706
270+5,
+1,
+3,
+2,
+4,
+1,
+3
272+0.743294731E-4,
+0.00000000,
+0.938653648E-1,
+0.00000000,
+0.00000000
274+3.76576853,
+1.44634104,
+4.70713234,
+2.70580196,
+3.21397519
284+3.64627814,
+1.52552938,
+2.11188316,
+2.35274196,
+3.75406837,
+2.77198815,
+3.52616692
285+2.00119615,
+4.50615692,
+3.18104506,
+0.974512100E-1,
+3.89269066,
+2.95771623,
+3.77801681
286+2.29501033,
+0.977418125,
+2.03021860,
+2.84445333,
+1.56335926,
+1.03065431,
+4.48922062
287+1.48458695,
+4.49082422,
+3.59534144,
+4.80123711,
+1.01136088,
+4.20556879,
+0.105085969
295call setKmeansPP(
rngf, membership, disq, csdisq, sample, center, size, potential)
297+0.00000000,
+4.03553295,
+1.85712051,
+0.00000000,
+4.34862614,
+0.00000000,
+0.00000000
299+0.00000000,
+0.00000000,
+4.03553295,
+5.89265347,
+17.8942223,
+22.2428474,
+22.2428474,
+22.2428474
301+3,
+1,
+1,
+4,
+3,
+1,
+2
303+5.89265347,
+0.00000000,
+4.34862614,
+0.00000000
305+2.77198815,
+3.52616692,
+3.64627814,
+2.35274196
306+2.95771623,
+3.77801681,
+2.00119615,
+0.974512100E-1
307+1.03065431,
+4.48922062,
+2.29501033,
+2.84445333
308+4.20556879,
+0.105085969,
+1.48458695,
+4.80123711
Postprocessing of the example output ⛓
3import matplotlib.pyplot
as plt
11fig = plt.figure(figsize = 1.25 * np.array([6.4, 4.8]), dpi = 200)
14parent = os.path.basename(os.path.dirname(__file__))
15pattern = parent +
"*.txt"
17fileList = glob.glob(pattern)
22 kind = file.split(
".")[1]
23 prefix = file.split(
".")[0]
24 df = pd.read_csv(file, delimiter =
",", header =
None)
27 ax.scatter ( df.values[:, 1]
34 legends.append(
"center")
35 elif kind ==
"sample":
36 ax.scatter ( df.values[:, 1]
41 legends.append(
"sample")
43 sys.exit(
"Ambiguous file exists: {}".format(file))
45 ax.legend(legends, fontsize = fontsize)
46 plt.xticks(fontsize = fontsize - 2)
47 plt.yticks(fontsize = fontsize - 2)
48 ax.set_xlabel(
"X", fontsize = 17)
49 ax.set_ylabel(
"Y", fontsize = 17)
50 ax.set_title(
"Membership Scatter Plot", fontsize = fontsize)
53 plt.grid(visible =
True, which =
"both", axis =
"both", color =
"0.85", linestyle =
"-")
54 ax.tick_params(axis =
"y", which =
"minor")
55 ax.tick_params(axis =
"x", which =
"minor")
56 ax.set_axisbelow(
True)
59 plt.savefig(prefix +
".png")
61 sys.exit(
"Ambiguous file list exists.")
Visualization of the example output ⛓
- Test:
- test_pm_clusKmeans
Final Remarks ⛓
If you believe this algorithm or its documentation can be improved, we appreciate your contribution and help to edit this page's documentation and source file on GitHub.
For details on the naming abbreviations, see this page.
For details on the naming conventions, see this page.
This software is distributed under the MIT license with additional terms outlined below.
-
If you use any parts or concepts from this library to any extent, please acknowledge the usage by citing the relevant publications of the ParaMonte library.
-
If you regenerate any parts/ideas from this library in a programming environment other than those currently supported by this ParaMonte library (i.e., other than C, C++, Fortran, MATLAB, Python, R), please also ask the end users to cite this original ParaMonte library.
This software is available to the public under a highly permissive license.
Help us justify its continued development and maintenance by acknowledging its benefit to society, distributing it, and contributing to it.
- Copyright
- Computational Data Science Lab If you use or redistribute ideas based on this generic interface implementation, you should also cite the original k-means++ article:
Arthur, D.; Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding
- Author:
- Amir Shahmoradi, September 1, 2012, 12:00 AM, National Institute for Fusion Studies, The University of Texas Austin
Definition at line 1181 of file pm_clusKmeans.F90.