Python Cloud Advocate at Microsoft
Formerly: UC Berkeley, Coursera, Khan Academy, Google
Find me online at:

Platform | Handle
---|---
Mastodon | @pamelafox@fosstodon.org
 | @pamelafox
GitHub | www.github.com/pamelafox
Website | pamelafox.org
Vectors are lists of numbers that represent items in a high-dimensional space.
For example, a vector representing the string "apple" might be [0.3, 0.5, 0.8].
Each number in the vector is a dimension of the space.
Use a model to generate vectors for items:
Input | Model | Vector
---|---|---
"dog" | word2vec | [0.017198, -0.007493, -0.057982, ...]
"cat" | word2vec | [0.004059, 0.06719, -0.093874, ...]
Popular models (find more on HuggingFace):
Model | Input types | Dimensions |
---|---|---|
Word2Vec | Word | 50-300 |
OpenAI text-embedding-ada-002 | Text | 1536 |
OpenAI text-embedding-3 | Text | 256-3072 |
Azure Computer Vision Multi-modal | Text or Image | 1024 |
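For example, a hedged sketch of generating an embedding with the openai Python package (assumes an OPENAI_API_KEY environment variable; the model and input are illustrative):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(model="text-embedding-3-small", input="apple")
vector = response.data[0].embedding
print(len(vector))  # 1536 by default; text-embedding-3 models also accept a dimensions parameter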
Use cases for vector search:
- Recommendations: find similar items in a large dataset.
- Search: find items that are similar to a query.
Distance metrics measure how far apart two vectors are, e.g. [1, 2, 3] and [3, 1, 2]:
- Euclidean (L2) distance: generally preferred over Manhattan (L1).
- Cosine: same as inner product for normalized vectors.
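To make the numbers concrete, a quick check of each metric in plain numpy (not pgvector) on those example vectors:

import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([3, 1, 2])

manhattan = np.sum(np.abs(v1 - v2))   # |1-3| + |2-1| + |3-2| = 4
euclidean = np.linalg.norm(v1 - v2)   # sqrt(4 + 1 + 1) ≈ 2.449
dot = np.dot(v1, v2)                  # 3 + 2 + 6 = 11
cosine_similarity = dot / (np.linalg.norm(v1) * np.linalg.norm(v2))  # 11/14 ≈ 0.786
cosine_distance = 1 - cosine_similarity                              # ≈ 0.214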
To use the pgvector extension in PostgreSQL:
CREATE EXTENSION IF NOT EXISTS vector;
🔗 Try the extension in the pgvector dev container @ github.com/pamelafox/pgvector-playground
Create a table with a vector column:
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

-- Insert rows with vector values
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- Find the 5 nearest neighbors by cosine distance
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;

-- Compute the cosine distance from each row to the given vector
SELECT embedding <=> '[3,1,2]' AS distance FROM items;
With v1 = '[1, 2, 3]' and v2 = '[3, 1, 2]':

Metric | Operator | Function
---|---|---
Manhattan (L1) | v1 <+> v2 | l1_distance(v1, v2)
Inner product* | v1 <#> v2 | inner_product(v1, v2)
Euclidean (L2) | v1 <-> v2 | l2_distance(v1, v2)
Cosine | v1 <=> v2 | cosine_distance(v1, v2)

* pgvector's <#> operator multiplies the dot product by -1, so 11 becomes -11.
Use approximate nearest neighbor (ANN) indexes for fast vector searches:
HNSW:

CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);

Based on the algorithm from arxiv.org/pdf/1603.09320
Learn more from this tutorial: https://github.com/brtholomy/hnsw

IVFFlat:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

The operator class (vector_ip_ops, vector_l2_ops, vector_cosine_ops) must match the distance operator used in your queries, or the index won't be used.
Install the pgvector package:
pip install pgvector
Then use it with one of the supported libraries, such as SQLAlchemy (used in the examples below), Django, psycopg, or asyncpg:
Learn more @ https://github.com/pgvector/pgvector-python
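The SQLAlchemy snippets below assume a model roughly like this sketch (the Item and title names match the queries that follow; the dimension count depends on your embedding model):

from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[str] = mapped_column()
    embedding = mapped_column(Vector(1536))  # 1536 matches text-embedding-ada-002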
Find items similar to a given item:

from sqlalchemy import select

# Fetch the target item by its ID
query = select(Item).where(Item.id == target_id)
target_item = session.execute(query).scalars().first()

# Order items by cosine distance to the target item's embedding
closest = session.scalars(
    select(Item).order_by(Item.embedding.cosine_distance(target_item.embedding)).limit(5)
)
for item in closest:
    print(item.title)
Full code in:
github.com/Azure-Samples/rag-postgres-openai-python
Search for items similar to a query:
# Compute an embedding for the query text
target_embedding = get_embedding_from_text(query)

# Order items by cosine distance to the query embedding
closest = session.scalars(
    select(Item).order_by(Item.embedding.cosine_distance(target_embedding)).limit(5)
)
for item in closest:
    print(item.title)
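get_embedding_from_text is a helper from the sample app; a minimal version, assuming OpenAI embeddings and the same model that embedded the items, might look like:

from openai import OpenAI

client = OpenAI()

def get_embedding_from_text(text: str) -> list[float]:
    # Embed the query with the same model used for the items' embedding column
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding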
Full code in:
github.com/Azure-Samples/rag-postgres-openai-python
Combine full-text search with vector search, merging the two result lists with Reciprocal Rank Fusion (RRF):
WITH semantic_search AS (
    SELECT id, RANK () OVER (ORDER BY embedding <=> %(embedding)s) AS rank
    FROM documents
    ORDER BY embedding <=> %(embedding)s
    LIMIT 20
),
keyword_search AS (
    SELECT id, RANK () OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, plainto_tsquery('english', %(query)s) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 20
)
SELECT
    COALESCE(semantic_search.id, keyword_search.id) AS id,
    COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0) +
    COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC
LIMIT 5
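The %(...)s placeholders are psycopg named parameters. A sketch of running the query from Python, where hybrid_sql is a hypothetical variable holding the SQL above, query_embedding comes from your embedding model, and k is the RRF constant (commonly 60):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=postgres") as conn:
    register_vector(conn)  # allows numpy arrays to be passed as vector parameters
    results = conn.execute(
        hybrid_sql,  # the hybrid search SQL shown above
        {"embedding": np.array(query_embedding), "query": "some search text", "k": 60},
    ).fetchall()
    for doc_id, score in results:
        print(doc_id, score)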
Full code in:
github.com/Azure-Samples/rag-postgres-openai-python
Grab the slides @
pamelafox.github.io/my-py-talks/pgvector-python/
Find me online at:

Platform | Handle
---|---
Mastodon | @pamelafox@fosstodon.org
 | @pamelafox
GitHub | www.github.com/pamelafox
Website | pamelafox.org
Let me know about your experiences with PostgreSQL and pgvector!