PQk-means is a Python library for efficient clustering of large-scale data. While standard k-means clustering is slow and inefficient on large-scale data, PQk-means is an efficient clustering method for billion-scale feature vectors.
PQk-means achieves its speed and efficiency by first compressing the input vectors into short product-quantized (PQ) codes.
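To make the compression step concrete, here is a minimal NumPy sketch of product quantization. It is an illustration only, not the code pqkmeans uses internally: the helper names `train_codebooks`, `encode`, and `decode` are hypothetical, the codebooks are learned with a plain Lloyd k-means loop, and the small sizes (16 dimensions, 16 codewords per sub-space) are chosen only to keep the example fast.

```python
import numpy as np

def train_codebooks(X, num_subdim=4, Ks=16, iters=10, seed=0):
    """Learn one codebook of Ks codewords per sub-space (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    assert d % num_subdim == 0, "dimension must be divisible by num_subdim"
    ds = d // num_subdim  # dimensions per sub-vector
    codebooks = np.empty((num_subdim, Ks, ds), dtype=X.dtype)
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        # Initialise the codewords with Ks distinct training sub-vectors.
        centers = sub[rng.choice(n, size=Ks, replace=False)].copy()
        for _ in range(iters):
            # Squared Euclidean distance from every sub-vector to every codeword.
            dists = ((sub[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for k in range(Ks):
                members = sub[labels == k]
                if len(members):  # leave empty cells unchanged
                    centers[k] = members.mean(axis=0)
        codebooks[m] = centers
    return codebooks

def encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    num_subdim, Ks, ds = codebooks.shape
    codes = np.empty((X.shape[0], num_subdim), dtype=np.uint8)  # requires Ks <= 256
    for m in range(num_subdim):
        sub = X[:, m * ds:(m + 1) * ds]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes

def decode(codes, codebooks):
    """Approximate the original vectors by concatenating the chosen codewords."""
    return np.hstack([codebooks[m][codes[:, m]] for m in range(codebooks.shape[0])])

rng = np.random.default_rng(0)
X = rng.random((500, 16))            # 500 toy vectors, 16 dimensions
codebooks = train_codebooks(X)       # 4 sub-spaces, 16 codewords each
codes = encode(X, codebooks)         # one uint8 index per sub-space
X_approx = decode(codes, codebooks)  # lossy reconstruction

print("code shape:", codes.shape, "dtype:", codes.dtype)
print("bytes raw vs. PQ:", X.nbytes, codes.nbytes)
print("mean squared reconstruction error:", np.mean((X - X_approx) ** 2))
```

Each vector is stored as a handful of small integers instead of a row of floats, which is what makes clustering very large collections feasible in memory. The reconstruction is approximate, so the compression is lossy.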
The library requires CMake, which can be installed with:
brew install cmake (for OS X)
sudo apt install cmake (for Ubuntu)
If OpenMP is installed, it is used automatically to parallelize the algorithm for faster computation.
Build & install
You can install the library from PyPI:
pip install pqkmeans
Or, if you would like to use the current master version, you can manually build and install the library by:
git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install
To run the sample experiments:
# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100
To run the tests:
python setup.py test
import pqkmeans
import numpy as np

X = np.random.random((100000, 128))  # 100,000 samples, 128 dimensions each

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log2(256) = 8 bits, resulting in a 32-bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8.
# You can train the encoder and transform the input vectors to PQ codes beforehand.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the id of the assigned center for the first input PQ code (X_pqcode[0]).
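The code-size arithmetic in the comments above can be checked with a short calculation. This sketch uses only NumPy and the sizes from the example (100,000 vectors, 128 dimensions, `num_subdim=4`, `Ks=256`). It assumes the raw vectors are float64, which is what `np.random.random` returns, and the variable names are illustrative.

```python
import numpy as np

n_samples, dim = 100_000, 128
num_subdim, Ks = 4, 256

# Each sub-vector is replaced by an index into a codebook of Ks entries.
bits_per_subcode = int(np.log2(Ks))            # 8 bits address 256 codewords
bits_per_code = num_subdim * bits_per_subcode  # 4 * 8 = 32 bits per vector

# Storage for the raw float64 vectors versus the uint8 PQ codes.
raw_bytes = n_samples * dim * np.dtype(np.float64).itemsize
pq_bytes = n_samples * num_subdim * np.dtype(np.uint8).itemsize

# The codebooks themselves are a fixed cost, independent of n_samples.
codebook_bytes = num_subdim * Ks * (dim // num_subdim) * np.dtype(np.float64).itemsize

print(f"bits per PQ code: {bits_per_code}")
print(f"raw vectors: {raw_bytes / 1e6:.1f} MB")
print(f"PQ codes:    {pq_bytes / 1e6:.1f} MB")
print(f"codebooks:   {codebook_bytes / 1e6:.2f} MB")
print(f"compression ratio (codes only): {raw_bytes // pq_bytes}x")
```

Under these assumptions the raw data takes about 102.4 MB while the PQ codes take 0.4 MB, a 256-fold reduction before counting the small fixed cost of the codebooks.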
More details at: GitHub (https://github.com/DwangoMediaVillage/pqkmeans)
Asif Razzaq is an AI Journalist and Cofounder of Marktechpost, LLC. He is a visionary, entrepreneur and engineer who aspires to use the power of Artificial Intelligence for good.
Asif's latest venture is the development of an Artificial Intelligence Media Platform (Marktechpost) that will revolutionize how people can find relevant news related to Artificial Intelligence, Data Science and Machine Learning.
Asif was featured by Onalytica in its 'Who's Who in AI? (Influential Voices & Brands)' as one of the 'Influential Journalists in AI' (https://onalytica.com/wp-content/uploads/2021/09/Whos-Who-In-AI.pdf). His interview was also featured by Onalytica (https://onalytica.com/blog/posts/interview-with-asif-razzaq/).