pgvector for Python developers

Python in a fast car with an elephant in a cannon, with vectors coming out of them

About me

Python Cloud Advocate at Microsoft

Formerly: UC Berkeley, Coursera, Khan Academy, Google


Find me online at:

Mastodon @pamelafox@fosstodon.org
Twitter @pamelafox
GitHub www.github.com/pamelafox
Website pamelafox.org

Vectors 101

Vector space diagram with an oranges vector and apples vector

What are vectors?

Vector space diagram with an oranges vector and apples vector

Vectors are lists of numbers that represent items in a high-dimensional space.

For example, a vector representing the string "apple" might be [0.3, 0.5, 0.8].

Each number in the vector is a dimension of the space.

Generating vectors

Use a model to generate vectors for items:

Input Model Vector
"dog" word2vec [0.017198, -0.007493, -0.057982, ..]
"cat" word2vec [0.004059, 0.06719, -0.093874, ...]

Popular models (find more on HuggingFace):

Model Input types Dimensions
Word2Vec Word 50-300
OpenAI text-embedding-ada-002 Text 1536
OpenAI text-embedding-3 Text 256-3072
Azure Computer Vision Multi-modal Text or Image 1024

Why care about vector embeddings?

Similarity

Find similar items in a large dataset, useful for recommendations.

Vector space diagram with telephone and similar items highlighted

Search

Find items that are similar to a query.

Vector space diagram with devices and search query highlighted

Visualizing vectors

Vector distance metrics

Manhattan (L1)

Manhattan distance formula
[1, 2, 3]
vs.
[3, 1, 2]

Euclidean (L2)

Euclidean distance formula

Generally preferred over Manhattan.

Cosine distance

Two vectors with cosine distance calculated

Same as inner product for normalized vectors.

Storing vectors in PostgreSQL

Screenshot of a table in pgvector with an id and embedding column

Using the pgvector extension

To use the pgvector extension in PostgreSQL:

  1. Install the extension per the instructions at:
    github.com/pgvector/pgvector
  2. Enable the extension:
    
                        CREATE EXTENSION IF NOT EXISTS vector;
                      

🔗 Try the extension in the pgvector dev container @ github.com/pamelafox/pgvector-playground

Storing and querying vectors

  • Create a table with a vector column:
    
                      CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
                      
  • Insert vectors into the table:
    
                        INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
                      
  • Query using the distance operators:
    
                        SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;
                      
    
                        SELECT embedding <=> '[3,1,2]' AS distance FROM items;
                      

Distance operators

Manhattan (L1)

Manhattan distance formula
'[1, 2, 3]' <+> '[3, 1, 2]'
or
l1_distance('[1, 2, 3]', '[3, 1, 2]')

Negative inner (dot) product

Inner product formula
v1 <#> v2
or
inner_product(v1, v2)

* pgvector multiplies the dot product by -1, so 11 becomes -11

Euclidean (L2)

Euclidean distance formula
v1 <-> v2
or
l2_distance(v1, v2)

Cosine distance

Two vectors with cosine distance calculated
v1 <=> v2
or
cosine_distance(v1, v2)

Approximate nearest neighbors (ANN) indexes

Use ANN indexes for fast vector searches:

HNSW (Hierarchical Navigable Small World)

Multi-level search space for HNSW

                CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);
              

Based on algorithm from arxiv.org/pdf/1603.09320

Learn more from this tutorial: https://github.com/brtholomy/hnsw

IVFflat (Inverted File with Flat indexes)


                  CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
                

Using pgvector with Python

Python in a fast car with an elephant in a cannon, with vectors coming out of them

Using the pgvector package

Install the pgvector package:


                pip install pgvector
              

Then use it with one of these libraries:

  • psycopg2, psycopg3
  • asyncpg
  • SQLAlchemy
  • SQLModel
  • Django
  • Peewee

Learn more @ https://github.com/pgvector/pgvector-python

Building a similar items API


                query = select(Item).where(Item.id == target_id)
                target_item = session.execute(query).scalars().first()

                closest = session.scalars(
                    select(Item).order_by(Item.embedding.cosine_distance(target_item.embedding)).limit(5)
                )
                for item in closest:
                    print(item.title)
              

Full code in:
github.com/Azure-Samples/rag-postgres-openai-python

Building a search API

Search for items similar to a query:


                target_embedding = get_embedding_from_text(query)

                closest = session.scalars(
                  select(Item).order_by(Item.embedding.cosine_distance(target_embedding)).limit(5)
                )
                for item in closest:
                    print(item.title)
              

Full code in:
github.com/Azure-Samples/rag-postgres-openai-python

Building a hybrid search API

Combine full-text search with vector search:


              WITH semantic_search AS (
                  SELECT id, RANK () OVER (ORDER BY embedding <=> %(embedding)s) AS rank FROM documents
                  ORDER BY embedding <=> %(embedding)s LIMIT 20
              ),
              keyword_search AS (
                  SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC)
                  FROM documents, plainto_tsquery('english', %(query)s) query
                  WHERE to_tsvector('english', content) @@ query
                  ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC LIMIT 20
              )
              SELECT
                  COALESCE(semantic_search.id, keyword_search.id) AS id,
                  COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0) +
                  COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
              FROM semantic_search
              FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
              ORDER BY score DESC LIMIT 5
              

Full code in:
github.com/Azure-Samples/rag-postgres-openai-python

Thank you!

Photo of Pamela smiling with a stuffed elephant

Grab the slides @
pamelafox.github.io/my-py-talks/pgvector-python/

Find me online at:

Mastodon @pamelafox@fosstodon.org
Twitter @pamelafox
GitHub www.github.com/pamelafox
Website pamelafox.org

Let me know about your experiences with PostgreSQL and pgvector!