PQk-means is a Python library for efficient clustering of large-scale data. Standard k-means clustering is slow and inefficient on large-scale data, whereas PQk-means can cluster billion-scale feature vectors efficiently.
PQk-means achieves its speed and efficiency by first compressing input vectors into short product-quantized (PQ) codes.
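To make the compression step concrete, here is a minimal NumPy sketch of product quantization. It is not the library's implementation: the helper names (`train_codebooks`, `encode`, `decode`) are hypothetical, the per-subspace codebooks are learned with a plain Lloyd k-means loop, and the data is random. Each vector is split into `num_subdim` parts, and each part is replaced by the index of its nearest codeword.

```python
import numpy as np

def train_codebooks(X, num_subdim, Ks, n_iter=10, seed=0):
    """Learn Ks codewords per subspace with a plain Lloyd k-means loop."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d % num_subdim == 0, "dimension must be divisible by num_subdim"
    ds = d // num_subdim  # dimension of each sub-vector
    codebooks = np.empty((num_subdim, Ks, ds), dtype=X.dtype)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        # Initialize the codewords with Ks distinct training sub-vectors.
        centers = sub[rng.choice(n, size=Ks, replace=False)].copy()
        for _ in range(n_iter):
            # Squared Euclidean distances, shape (n, Ks).
            dists = ((sub[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for k in range(Ks):
                members = sub[labels == k]
                if len(members) > 0:  # keep the old codeword if the cell is empty
                    centers[k] = members.mean(axis=0)
        codebooks[m] = centers
    return codebooks

def encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest codeword (needs Ks <= 256)."""
    num_subdim, Ks, ds = codebooks.shape
    codes = np.empty((X.shape[0], num_subdim), dtype=np.uint8)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def decode(codes, codebooks):
    """Approximate the original vectors by concatenating the selected codewords."""
    return np.hstack([codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])])

# Small illustrative run: 2,000 random 16-dimensional vectors, 4 sub-vectors, 16 codewords each.
X = np.random.default_rng(0).random((2000, 16)).astype(np.float32)
codebooks = train_codebooks(X, num_subdim=4, Ks=16)
codes = encode(X, codebooks)
X_approx = decode(codes, codebooks)
```

In this sketch each 16-dimensional float vector ends up as four `uint8` indices, and `decode` recovers a lossy approximation of it from the codebooks.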
CMake is needed for the build. Install it with:
brew install cmake (OS X)
sudo apt install cmake (Ubuntu)
If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.
Build & install
You can install the library from PyPI:
pip install pqkmeans
Or, if you would like to use the current master version, you can build and install the library manually:
git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install
# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100
python setup.py test
import pqkmeans
import numpy as np

X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions each

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and transform the input vectors to PQ codes in advance.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the ID of the center assigned to the first input PQ code (X_pqcode[0]).
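As a rough check of what the compression in the example above saves in memory, the arithmetic can be written out as follows. The sizes follow from the example's settings (100,000 float64 vectors of 128 dimensions, `num_subdim=4`, `Ks=256`). The variable names are illustrative only, and the codebook size assumes float64 codewords, which is an assumption rather than something the library documents here.

```python
import numpy as np

n, d = 100_000, 128        # number of samples and their dimensionality
num_subdim, Ks = 4, 256    # PQ settings used in the example

# Raw input: n x d float64 values (np.random.random returns float64).
raw_bytes = n * d * np.dtype(np.float64).itemsize

# PQ codes: one uint8 index per sub-vector.
code_bytes = n * num_subdim * np.dtype(np.uint8).itemsize

# Codebooks: Ks codewords of d / num_subdim dimensions per sub-space
# (assumed to be stored as float64 for this estimate).
codebook_bytes = num_subdim * Ks * (d // num_subdim) * np.dtype(np.float64).itemsize

print(raw_bytes, code_bytes, codebook_bytes)
print("compression ratio of the codes:", raw_bytes // code_bytes)
```

Under these assumptions the codes take 400,000 bytes instead of 102,400,000 bytes for the raw vectors, a factor of 256, plus a fixed codebook overhead that does not grow with the number of samples.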
More details are available on GitHub: https://github.com/DwangoMediaVillage/pqkmeans