
Commit aaebdaa

Sharpen up the blog post
- Give a snappier title
- Shorten the background
1 parent 97ef68e commit aaebdaa

1 file changed (+57, -53 lines)


content/posts/2025-niche-targeting-updates.md

Lines changed: 57 additions & 53 deletions
@@ -1,60 +1,62 @@
-Title: Contextual Topic Targeting with Embedding Centroids
+Title: Improving AI Ad Targeting with Embeddings
 Date: August 5, 2025
 description: This post shows how we improved our contextual targeting to handle hundreds of developer-specific topic niches with embeddings, pgvector, and centroids.
 tags: content-targeting, engineering, postgresql
 authors: David Fischer
 image: /images/posts/2025-embedding-map.png
-image_credit: <span>Image generated with <a href="https://matplotlib.org/">Matplotlib</a> from embeddings and centroids</span>
+image_credit: <span>Image generated with <a href="https://matplotlib.org/">Matplotlib</a> from embeddings and centroids and <a href="https://github.com/lmcinnes/umap">umap</a> for dimension reduction</span>
 
 
-Going back to our [original vision for EthicalAds]({filename}../pages/vision.md),
-our goal has always been to show the best ad possible based on page context rather than any data about the user.
-By delivering the best possible ad on a given page,
-this will result in great advertiser performance with high earnings for the sites
-where the ads appear; without compromising privacy.
-
-However, our approach to best fulfill that vision has changed over time.
-The tools available to target contextually are rapidly improving
-with the advances in language models (LLMs).
-This post is going to delve into how to use those advances for ad targeting
-but similar approaches can be used for many types of classifiers.
+Large language models and their surrounding tools are evolving fast
+and they are a powerful way to improve ad targeting and content classification,
+which is great when you're building a contextual ad network that doesn't track people.
+However, LLM prompts and responses can be inconsistent or unpredictable.
+We've taken a more reliable approach.
+By using (more) deterministic embeddings, we were able to sharpen up our targeting
+and boost performance with less guesswork, all without relying on any user-specific data.
+The method in this post should work well for many multi-label classification tasks,
+particularly when the set of classes evolves or grows over time.
 
 
 ## Historical context and scaling topic classification
 
+First, a little bit of background.
 A few years back, we built [our first topic classifier](https://www.ethicalads.io/blog/2022/11/a-new-approach-to-content-based-targeting-for-advertising/)
-that essentially bundled content and keywords together into topics that advertisers could target and buy.
-To give a few examples, in addition to our [core audiences]({filename}../pages/advertisers.md#audiences),
-this allowed advertisers to target database related content or blockchain related content with relevant ads.
-This approach scaled well up to about 15-20 topics which was great for ad performance.
-However, adding another topic to target involved not just adding training set examples for that topic
-but also making sure any of our existing examples that also applied to the new topic were marked appropriately.
-Scaling became a pain.
-
-Last year, we built a more advanced way of targeting very specific content with language model embeddings
-that we called [niche targeting]({filename}../pages/niche-targeting.md)
-(see our [blog]({filename}../posts/2024-niche-ad-targeting.md) with more details).
-This approach worked by targeting pages similar to an advertiser's specific landing and product pages.
-Using this approach, we saw ad performance 25-30% better in most cases.
-However, campaign sizes were very limited, because there just aren't enough very similar pages and
-it was hard to fill campaign sizes advertisers wanted to run.
-It also was harder to explain how this worked to marketers which made it harder to sell despite strong performance.
+that essentially bundled content and keywords together into topics that advertisers could target and buy
+similar to what they do for search ads.
+To give an example, this allowed advertisers to target DevOps-related content with relevant ads.
+This approach scaled well up to about 10-15 topics
+and gave advertisers an easily understandable way to get good contextual targeting for their campaigns.
+
+Last year, we built a more advanced way to target content using language model embeddings,
+a strategy we called [niche targeting]({filename}../pages/niche-targeting.md)
+(see our [blog]({filename}../posts/2024-niche-ad-targeting.md) with more dev details).
+It works by comparing embedding vectors to find pages semantically similar to an advertiser's landing or product page.
+The results were strong, about 25% better performance on average, but scale was a challenge.
+There simply weren't enough closely related pages to build large campaigns.
+Also, while the results were great, explaining embeddings and page similarity to marketers proved difficult,
+making the approach harder to sell despite its effectiveness.
 
 
 ## Hybrid approach with embedding centroids
 
 After generating embeddings for nearly a million pages across our network,
-clusters started to emerge of related content.
-Think of Kubernetes related content clustering together
-and Python related content clustering together in a different section.
-A centroid is simply the average of these embeddings: a single vector that represents the center of that topic cluster.
-
-New content that's semantically similar will automatically fall close to related content in the embedding space.
-Just as before with our topic classifier model, this let us sell advertisers on the topic they're looking for.
-But unlike the previous approach, you only need to classify 15-20 pages of content for a new centroid to start taking shape. This scales much better to hundreds of topics or more.
-It's also far easier to explain to advertisers that we are targeting content related to the right topic for their product.
-
-To show some concrete code examples, here's a code example of generating a centroid for a number of manually classified embeddings with [pgvector](https://github.com/pgvector/pgvector-python) and Django:
+clear clusters of related content began to emerge.
+For example, pages about Kubernetes tended to group closely together,
+while Python-related content formed its own nearby cluster in a different region of the embedding space.
+One of the powerful things about embeddings is that you can apply standard math to them,
+like taking an average of a group of vectors.
+A **centroid** is just that: the average of a set of related embeddings,
+representing the semantic center of a topic.
+
+New content that's semantically related lands near similar content in embedding space
+(as shown in the 2D projection graphic in this post).
+Like our earlier topic classifier, this allows us to target ads based on the topics advertisers care about.
+But unlike the old model, this approach requires only a few examples to form a new centroid,
+making it far more scalable to hundreds of topics or more.
+It's also far easier to explain this type of classification to advertisers.
+
+To show some concrete code examples, here's code to generate a centroid for a number of manually classified embeddings with [pgvector](https://github.com/pgvector/pgvector-python) and Django:
 
 ```python
 from django.db import models
@@ -69,9 +71,10 @@ centroid = embeddings.aggregate(
 )["centroid"]
 ```
 
-When classifying new content (a new embedding), it's easy to see how similar it is to all of the topic centroids.
+When classifying new content, it's easy to see how similar the content's embedding is to each of the topic centroids.
 This essentially answers the question of "how DevOps-ey is this content" or "how Frontend-ey is this content"
-for all possible topics.
+for an arbitrary number of topics.
+To steal a term, it's a vibes-based classifier.
 
 ```python
 from pgvector.django import CosineDistance
@@ -82,27 +85,28 @@ from .models import TopicCentroid
 vector = [-1.457664e-02, 3.473443e-02, ...]
 
 # Closer than this threshold implies the content is related.
-# This threshold differs based on your embedding model.
-distance_threshold = 0.45
+# This threshold differs based on your embedding model
+distance_threshold = 0.40
 
 TopicCentroid.objects.all().annotate(
     distance=CosineDistance("vector", vector)
 ).filter(distance__lte=distance_threshold).order_by("distance")
 ```
 
-This approach yields all the benefits of using embeddings like much better semantic relevance than simple keywords
-while still being explainable like simple keyword targeting used in search ads.
-It also scales perfectly fine with any number of topics
-and new content just gets an embedding and gets matched and clustered automatically.
-As more content is manually classified and added to the centroid, the centroid better reflects that topic
-and classifications for that topic improve over time.
-Adding new topics for classification
+This approach offers the best of both worlds.
+It has the semantic depth of embeddings, far beyond what simple keywords can capture,
+with the clarity and explainability of keyword-style targeting.
+It scales to any number of topics, since new content just gets an embedding and is automatically matched to the right clusters.
+Adding a new topic is as simple as providing 15-20 example pages.
+From there, a new centroid forms and begins matching relevant content automatically.
+As more content is manually classified and added to a centroid, that topic representation improves,
+making future classifications even more accurate.
 
 
 ## Conclusion
 
-From the moment we started using embeddings for ad targeting,
-we recognized they had great potential for improving contextual targeting performance for advertisers.
+From the moment we started using embeddings for contextual ad targeting,
+we recognized they had great potential for improving performance for advertisers.
 Better ad performance means we can generate more money for the sites that host our ads
 which is a great virtuous cycle.

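Since the diff only shows a couple of context lines around each change, the first code block above appears truncated. Here's a minimal, self-contained sketch of how that centroid aggregation could fit together, assuming pgvector-python's Django `Avg` aggregate; the `PageEmbedding` and `TopicCentroid` models, their field names, and the 384-dimension vectors are illustrative guesses rather than the actual EthicalAds schema:

```python
from django.db import models

from pgvector.django import Avg, VectorField


class PageEmbedding(models.Model):
    # Hypothetical model: one embedding per manually classified page.
    url = models.URLField()
    topic = models.CharField(max_length=100)  # manually assigned topic label
    vector = VectorField(dimensions=384)  # dimensions depend on the embedding model


class TopicCentroid(models.Model):
    # Hypothetical model: the average of all embeddings classified under one topic.
    topic = models.CharField(max_length=100, unique=True)
    vector = VectorField(dimensions=384)


# Average the manually classified embeddings for one topic into a centroid,
# mirroring the aggregate call shown in the diff above.
embeddings = PageEmbedding.objects.filter(topic="devops")
centroid = embeddings.aggregate(
    centroid=Avg("vector"),
)["centroid"]

# Store (or refresh) the centroid so new content can be matched against it.
TopicCentroid.objects.update_or_create(
    topic="devops", defaults={"vector": centroid}
)
```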
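Similarly, here's a hedged sketch that wraps the second block's `CosineDistance` query into a reusable multi-label classifier; the `classify` helper and its `(topic, distance)` return shape are our illustration, not code from the post:

```python
from pgvector.django import CosineDistance

# Closer than this threshold implies the content is related.
# The right value depends on which embedding model you use.
DISTANCE_THRESHOLD = 0.40


def classify(vector):
    # Return every topic whose centroid is within the distance threshold.
    # A page can match several topics at once: "how DevOps-ey" and
    # "how Frontend-ey" a page is are independent distances.
    matches = TopicCentroid.objects.annotate(
        distance=CosineDistance("vector", vector)
    ).filter(distance__lte=DISTANCE_THRESHOLD).order_by("distance")
    return [(match.topic, match.distance) for match in matches]
```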
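Finally, the updated `image_credit` line mentions Matplotlib and umap. As a rough sketch of how the post's header image could be generated, the embeddings and centroids might be projected to 2D with umap-learn and scatter-plotted; the plot parameters and file name here are guesses, and the models come from the sketch above:

```python
import matplotlib.pyplot as plt
import numpy as np
import umap  # provided by the umap-learn package

# Collect page embeddings and topic centroids as arrays.
pages = np.array([page.vector for page in PageEmbedding.objects.all()])
centroids = np.array([c.vector for c in TopicCentroid.objects.all()])

# Project pages and centroids together so they share one 2D space.
coords = umap.UMAP(n_components=2).fit_transform(np.vstack([pages, centroids]))
page_xy = coords[: len(pages)]
centroid_xy = coords[len(pages):]

plt.scatter(page_xy[:, 0], page_xy[:, 1], s=2, alpha=0.3, label="pages")
plt.scatter(centroid_xy[:, 0], centroid_xy[:, 1], marker="x", color="red", label="centroids")
plt.legend()
plt.savefig("2025-embedding-map.png")
```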