PQk-means is a Python library for efficient clustering of large-scale data. Standard k-means clustering is slow and inefficient on large-scale data, whereas PQk-means is an efficient clustering method for billion-scale feature vectors.
PQk-means achieves its speed and efficiency by first compressing the input vectors into short product-quantized (PQ) codes.
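To make the compression step concrete, here is a minimal sketch of product quantization in plain Python. It is not the library's implementation: the helper names (`train_codebook`, `pq_fit`, `pq_encode`) are hypothetical, the per-subspace codebooks are learned with an ordinary Lloyd k-means, and the vector dimension is assumed to be divisible by the number of sub-vectors.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, m):
    """Cut a vector into m contiguous sub-vectors (len(vec) must divide by m)."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train_codebook(points, ks, rng, iters=10):
    """Plain Lloyd k-means on one subspace; returns ks centroids."""
    centroids = [list(p) for p in rng.sample(points, ks)]
    for _ in range(iters):
        buckets = [[] for _ in range(ks)]
        for p in points:
            nearest = min(range(ks), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def pq_fit(vectors, m, ks, seed=0):
    """Learn one codebook per subspace (hypothetical helper, not the pqkmeans API)."""
    rng = random.Random(seed)
    parts = [split(v, m) for v in vectors]
    return [train_codebook([p[j] for p in parts], ks, rng) for j in range(m)]

def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    parts = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(part, cb[c]))
            for part, cb in zip(parts, codebooks)]

# Toy data: 200 vectors of dimension 8, split into 4 parts with 16 centroids each.
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = pq_fit(data, m=4, ks=16)
codes = [pq_encode(v, codebooks) for v in data]
```

Each vector of eight floats is reduced to four small integers, one per sub-vector. With 256 centroids per subspace, as in the library example further down, each of those integers fits in a single byte.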
CMake is required to build the library. Install it with:
brew install cmake      # for OS X
sudo apt install cmake  # for Ubuntu
If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.
Build & install
You can install the library from PyPI:
pip install pqkmeans
Or, if you would like to use the current master version, you can manually build and install the library by:
git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install
# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100
python setup.py test
import pqkmeans
import numpy as np

X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8 values.
# You can train the encoder and transform the input vectors to PQ codes in advance.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the id of the assigned center for the first input PQ code (X_pqcode[0]).
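The storage saving in the example above can be checked with a little arithmetic. The sketch below assumes the raw vectors are stored as 64-bit floats (the default for np.random.random) and counts only the codes themselves, not the encoder's codebooks.

```python
# Sizes taken from the example above.
num_samples = 100_000
dim = 128
bytes_per_float = 8      # assumption: float64, NumPy's default dtype
bytes_per_code = 4       # four uint8 values = 32 bits per PQ code

raw_bytes = num_samples * dim * bytes_per_float
code_bytes = num_samples * bytes_per_code
ratio = raw_bytes // code_bytes

print(f"raw vectors: {raw_bytes / 1e6:.1f} MB")
print(f"PQ codes:    {code_bytes / 1e6:.1f} MB")
print(f"compression: {ratio}x")
```

Under these assumptions the raw array takes 102.4 MB while the PQ codes take 0.4 MB, a 256-fold reduction.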
More details at: GitHub (https://github.com/DwangoMediaVillage/pqkmeans)