PQk-means is a Python library for efficient clustering of large-scale data. Standard k-means clustering is slow and inefficient on large-scale data, whereas PQk-means is an efficient clustering method for billion-scale feature vectors.
PQk-means achieves its speed and efficiency by first compressing the input vectors into short product-quantized (PQ) codes.
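To make the compression step concrete, here is a minimal NumPy sketch of product quantization under simplifying assumptions. It is not the library's implementation: the helpers `train_codebooks` and `encode` are hypothetical names introduced for illustration, and the per-subspace training is a plain Lloyd k-means on toy data. Each vector is split into `num_subdim` parts, and each part is replaced by the index of its nearest codeword.

```python
import numpy as np

def train_codebooks(X, num_subdim, Ks, n_iter=10, seed=0):
    """Hypothetical helper: learn Ks codewords per subspace with plain k-means."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d % num_subdim == 0, "dimension must be divisible by num_subdim"
    ds = d // num_subdim  # dimensionality of each sub-vector
    codebooks = np.empty((num_subdim, Ks, ds))
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        # Initialise the codewords from randomly chosen training sub-vectors.
        centers = sub[rng.choice(n, size=Ks, replace=False)].copy()
        for _ in range(n_iter):
            dists = ((sub[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for k in range(Ks):
                members = sub[labels == k]
                if len(members):  # keep the old codeword if a cluster is empty
                    centers[k] = members.mean(axis=0)
        codebooks[m] = centers
    return codebooks

def encode(X, codebooks):
    """Hypothetical helper: map each sub-vector to the index of its nearest codeword."""
    num_subdim, Ks, ds = codebooks.shape
    assert Ks <= 256, "uint8 codes can only index up to 256 codewords"
    codes = np.empty((X.shape[0], num_subdim), dtype=np.uint8)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes

# Toy data: 500 vectors of dimension 16, split into 4 sub-vectors of dimension 4.
rng = np.random.default_rng(0)
X = rng.random((500, 16))
codebooks = train_codebooks(X, num_subdim=4, Ks=16)
codes = encode(X, codebooks)  # one uint8 index per sub-vector
```

Each row of `codes` stands in for a full vector, so later steps can work on a few bytes per sample instead of the original floating-point values.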
Requisites
CMake
brew install cmake (for OS X)
sudo apt install cmake (for Ubuntu)
OpenMP (Optional)
If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.
Build & install
You can install the library from PyPI:
pip install pqkmeans
Or, if you would like to use the current master version, you can build and install the library manually:
git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install
Run samples
# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100
Test
python setup.py test
Usage
For PQk-means
import pqkmeans
import numpy as np
X = np.random.random((100000, 128)) # 100,000 samples, each 128-dimensional
# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000]) # Use a subset of X for training
# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and transform the input vectors to PQ codes in advance.
X_pqcode = encoder.transform(X)
# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)
# Then, clustered[0] is the id of assigned center for the first input PQ code (X_pqcode[0]).
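The code-size arithmetic in the comments above can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes the input is stored as float64 (the default dtype of `np.random.random`) and that each PQ code occupies four uint8 values, as stated in the example. The variable names are illustrative only.

```python
import math
import numpy as np

n_samples, dim = 100_000, 128   # shape of X in the usage example
num_subdim, Ks = 4, 256         # encoder settings from the usage example

# Each sub-vector index needs log2(Ks) bits; a full code uses num_subdim of them.
bits_per_code = num_subdim * int(math.log2(Ks))

# Memory for the raw float64 vectors versus the uint8 PQ codes.
raw_bytes = n_samples * dim * np.dtype(np.float64).itemsize
pq_bytes = n_samples * num_subdim * np.dtype(np.uint8).itemsize

print(bits_per_code)          # 32
print(raw_bytes, pq_bytes)    # 102400000 400000
print(raw_bytes // pq_bytes)  # 256
```

Under these assumptions the PQ codes take 256 times less memory than the original vectors. The trained codebooks add a small fixed cost that is not counted here.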
More details:
GitHub: https://github.com/DwangoMediaVillage/pqkmeans
Paper: https://arxiv.org/pdf/1709.03708.pdf
Project: http://yusukematsui.me/project/pqkmeans/pqkmeans.html
Tutorial: https://github.com/DwangoMediaVillage/pqkmeans/tree/master/tutorial
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.