PQk-means: Billion-scale Clustering for Product-quantized Codes

PQk-means is a Python library for efficient clustering of large-scale data. While k-means clustering is slow and not much efficient to handle large scale data, PQk-means is an efficient clustering method for billion-scale feature vectors.

In terms of PQk-means, it achieves its speed and efficiency by first compressing input vectors into short product-quantized (PQ) codes.



  • brew install cmake for OS X
  • sudo apt install cmake for Ubuntu

OpenMP (Optional)

If openmp is installed, it will be automatically used to parallelize the algorithm for faster calculation.

Build & install

You can install the library from PyPI:

pip install pqkmeans

Or, if you would like to use the current master version, you can manually build and install the library by:

git clone --recursive https://github.com/DwangoMediaVillage/pqkmeans.git
cd pqkmeans
python setup.py install

Run samples

# with artificial data
python bin/run_experiment.py --dataset artificial --algorithm bkmeans pqkmeans --k 100
# with texmex dataset (http://corpus-texmex.irisa.fr/)
python bin/run_experiment.py --dataset siftsmall --algorithm bkmeans pqkmeans --k 100


python setup.py test


For PQk-means

import pqkmeans
import numpy as np
X = np.random.random((100000, 128)) # 128 dimensional 100,000 samples

# Train a PQ encoder.
# Each vector is divided into 4 parts and each part is
# encoded with log256 = 8 bit, resulting in a 32 bit PQ code.
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(X[:1000])  # Use a subset of X for training

# Convert input vectors to 32-bit PQ codes, where each PQ code consists of four uint8.
# You can train the encoder and transform the input vectors to PQ codes preliminary.
X_pqcode = encoder.transform(X)

# Run clustering with k=5 clusters.
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=5)
clustered = kmeans.fit_predict(X_pqcode)

# Then, clustered[0] is the id of assigned center for the first input PQ code (X_pqcode[0]).

More details at: Github

Github: https://github.com/DwangoMediaVillage/pqkmeans

Paper: https://arxiv.org/pdf/1709.03708.pdf

Project: http://yusukematsui.me/project/pqkmeans/pqkmeans.html

Tutorial: https://github.com/DwangoMediaVillage/pqkmeans/tree/master/tutorial

Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

🐝 [FREE AI WEBINAR] 'Beginners Guide to LangChain: Chat with Your Multi-Model Data' Dec 11, 2023 10 am PST