Title: Improving AI Ad Targeting with Embeddings
Date: August 5, 2025
description: This post shows how we improved our contextual targeting to handle hundreds of developer-specific topic niches with embeddings, pgvector, and centroids.
tags: content-targeting, engineering, postgresql
authors: David Fischer
image: /images/posts/2025-embedding-map.png
image_credit: <span>Image generated with <a href="https://matplotlib.org/">Matplotlib</a> from embeddings and centroids, and <a href="https://github.com/lmcinnes/umap">umap</a> for dimension reduction</span>

Large language models and their surrounding tools are evolving fast,
and they are a powerful way to improve ad targeting and content classification,
which is great when you're building a contextual ad network that doesn't track people.
However, LLM prompts and responses can be inconsistent or unpredictable,
so we've taken a more reliable approach.
By using (more) deterministic embeddings, we were able to sharpen our targeting
and boost performance with less guesswork, all without relying on any user-specific data.
The method in this post should work well for many multi-label classification tasks,
particularly when the set of classes evolves or grows over time.
## Historical context and scaling topic classification
First, a little bit of background.
A few years back, we built [our first topic classifier](https://www.ethicalads.io/blog/2022/11/a-new-approach-to-content-based-targeting-for-advertising/)
that essentially bundled content and keywords together into topics that advertisers could target and buy,
similar to what they do for search ads.
To give an example, this allowed advertisers to target DevOps-related content with relevant ads.
This approach scaled well up to about 10-15 topics
and gave advertisers an easily understandable way to get good contextual targeting for their campaigns.

Last year, we built a more advanced way to target content using language model embeddings,
a strategy we called [niche targeting]({filename}../pages/niche-targeting.md)
(see our [blog post]({filename}../posts/2024-niche-ad-targeting.md) for more technical details).
It works by comparing embedding vectors to find pages semantically similar to an advertiser's landing or product page.
The results were strong, about 25% better performance on average, but scale was a challenge.
There simply weren't enough closely related pages to build large campaigns.
Also, explaining embeddings and page similarity to marketers proved difficult,
making the approach harder to sell despite its effectiveness.
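
To give a feel for what that comparison looks like, here's a rough sketch using [pgvector](https://github.com/pgvector/pgvector-python)'s Django integration.
The `Page` model and field names are illustrative assumptions, not our exact schema:

```python
from pgvector.django import CosineDistance

# Hypothetical `Page` model with a pgvector `embedding` VectorField,
# and `landing_page_embedding` holding the advertiser page's embedding.
# Smaller cosine distance means more semantically similar content.
similar_pages = (
    Page.objects
    .annotate(distance=CosineDistance("embedding", landing_page_embedding))
    .order_by("distance")[:10]
)
```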
## Hybrid approach with embedding centroids
After generating embeddings for nearly a million pages across our network,
clear clusters of related content began to emerge.
For example, pages about Kubernetes tended to group closely together,
while Python-related content formed its own nearby cluster in a different region of the embedding space.
One of the powerful things about embeddings is that you can apply standard math to them,
like taking an average of a group of vectors.
A **centroid** is just that: the average of a set of related embeddings,
representing the semantic center of a topic.
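
Concretely, computing a centroid is just an element-wise mean.
Here's a toy example in numpy (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Three embeddings manually classified under the same topic.
embeddings = np.array([
    [0.12, 0.88, 0.31],
    [0.10, 0.91, 0.28],
    [0.15, 0.85, 0.35],
])

# The centroid is the element-wise average of the vectors.
centroid = embeddings.mean(axis=0)
print(centroid)  # -> approximately [0.1233, 0.88, 0.3133]
```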
New content that's semantically related lands near similar content in embedding space
(as shown in the 2D projection graphic in this post).
Like our earlier topic classifier, this allows us to target ads based on the topics advertisers care about.
But unlike the old model, this approach requires only a few examples to form a new centroid,
making it far more scalable to hundreds of topics or more.
It's also far easier to explain this type of classification to advertisers.
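
Classification then amounts to finding the centroids closest to a page's embedding.
Here's a sketch under the same caveat, assuming a hypothetical `Topic` model that stores one pgvector centroid per topic:

```python
from pgvector.django import CosineDistance

# Hypothetical `Topic` model with a pgvector `centroid` VectorField.
# Find the topics whose centroids are closest to a new page's embedding,
# keeping only reasonably confident matches.
closest_topics = (
    Topic.objects
    .exclude(centroid=None)
    .annotate(distance=CosineDistance("centroid", page_embedding))
    .filter(distance__lt=0.35)  # threshold is illustrative; tune against labeled data
    .order_by("distance")[:3]
)
```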
To show some concrete code examples, here's a simplified sketch of generating a centroid from a set of manually classified embeddings with [pgvector](https://github.com/pgvector/pgvector-python) and Django (the model names below are illustrative):
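
```python
from django.db import models
from django.db.models import Avg

from pgvector.django import VectorField


class Topic(models.Model):
    """A targetable topic (hypothetical model for illustration)."""

    slug = models.SlugField(unique=True)
    # Dimensions depend on the embedding model used; 1536 is an assumption.
    centroid = VectorField(dimensions=1536, null=True)


class PageEmbedding(models.Model):
    """One embedding per analyzed page (hypothetical model for illustration)."""

    url = models.URLField(unique=True)
    embedding = VectorField(dimensions=1536)
    # Manually classified topic labels; a page can belong to several topics.
    topics = models.ManyToManyField(Topic, related_name="pages")


def build_centroid(topic):
    """Average a topic's manually classified page embeddings into a centroid."""
    # pgvector can average vectors directly in Postgres,
    # so the embeddings never need to be pulled into Python.
    centroid = topic.pages.aggregate(Avg("embedding"))["embedding__avg"]
    topic.centroid = centroid
    topic.save(update_fields=["centroid"])
    return centroid
```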