
Setting Up Vector Search in Uli Community

Aatman Vaidya edited this page Sep 24, 2025 · 4 revisions

Notes on Python and Elixir (Phoenix) integration, scripts, and how to set up vector search end-to-end.

Important

Make sure the PR that adds the Python setup to the Dockerfile (https://github.com/tattle-made/Uli/pull/812) is merged before following the instructions below.

How Python is set up and how it communicates with Elixir Phoenix

  • Python location: Python code lives in lib/python/ with modules like text_vec.py, video_vec.py, and clustering.py. A project-local virtualenv is created by the build using uv - see more here.
  • Interop library: We are using the Export library (which wraps ErlPort) to start a Python interpreter and call Python functions from Elixir.
  • Model download: A GenServer, UliCommunity.MediaProcessing.TextVecRepVyakyarth, downloads the ML model and loads it into RAM.
  • Config: The Python executable and path are read from config/* via Application.compile_env(:uli_community, [:python, :python_path]). The Dockerfile installs Python 3.10 and ships the virtualenv + HF cache dirs so this works inside containers.
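
As a rough illustration of the Python side of this interop, the sketch below shows the shape of a function that could be called from Elixir through ErlPort. ErlPort delivers Elixir strings (binaries) to Python 3 as bytes, so the function decodes them first and returns a plain list of floats, which maps back to an Elixir list. The names embed_text, to_text, and DIM are hypothetical, and the hash-based vector is only a stand-in for the real model call; none of this is taken from text_vec.py.

```python
import hashlib
import math

DIM = 8  # hypothetical toy dimension; the real model's dimension is not stated here


def to_text(value):
    """Decode ErlPort input: Elixir binaries arrive in Python 3 as bytes."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def embed_text(value):
    """Hypothetical stand-in for a model call.

    Returns a unit-length list of DIM floats derived from a hash of the
    normalized text. A real implementation would run the ML model instead.
    """
    text = to_text(value).strip().lower()
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Each byte is an integer, so (byte - 127.5) is never zero and the norm is positive.
    raw = [byte - 127.5 for byte in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]
```

Returning only built-in types (lists, floats) keeps the result easy to convert on the Elixir side without custom encoders.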

Step-by-step commands to set up vector search

  1. Exec into the running container (or pod) and open an IEx remote shell:
bin/uli_community remote
  2. Enqueue embedding jobs to extract vectors for unique, unprocessed slurs. Run in IEx:
Scripts.ExtractCrowdsourcedSlurEmbedding.enqueue_unprocessed_texts_batch()
  • What it does: queries crowdsourced_slurs left-joined with text_vec_store_vyakyarth to find items without embeddings (deduplicated by lowercased, trimmed label), then enqueues them in batches of 128 to the Oban queue :text_index.
  3. Cluster the stored embeddings. Run in IEx:
Scripts.ClusterTextVecStore.run()
  • What it does: groups all the slurs into clusters, which helps show which slurs are similar to each other.
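
The selection logic in step 2 (skip items that already have embeddings, deduplicate by lowercased, trimmed label, then split into batches of 128) can be sketched in Python as follows. This is only an in-memory illustration: the actual script does this with a database query and Oban jobs, and the function names here are hypothetical. Matching already-embedded items by normalized label is also an assumption, since the real join key is not described on this page.

```python
def normalize_label(label):
    """Lowercase and trim a label, mirroring the deduplication key described above."""
    return label.strip().lower()


def unique_unprocessed(labels, already_embedded):
    """Return labels with no embedding yet, keeping the first of each normalized duplicate.

    Assumption: already-embedded items are matched by normalized label.
    """
    seen = {normalize_label(label) for label in already_embedded}
    result = []
    for label in labels:
        key = normalize_label(label)
        if key not in seen:
            seen.add(key)
            result.append(label)
    return result


def batches(items, size=128):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


pending = unique_unprocessed(["Foo", " foo ", "Bar", "baz"], already_embedded=["BAR"])
print(pending)                       # ['Foo', 'baz']
print([len(b) for b in batches(list(range(300)))])  # [128, 128, 44]
```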

Vector search is now set up, and you can start using it from the UI.


Available scripts and what they do

  • Scripts.SeedCrowdsourcedSlurData210525.run()

    • Inserts seed data from priv/crowdsourced-21-14-2025/slur_metadata.json into domain tables.
    • Run: Scripts.SeedCrowdsourcedSlurData210525.run()
  • Scripts.ExtractCrowdsourcedSlurEmbedding.enqueue_unprocessed_texts_batch()

    • Enqueues Oban jobs to compute embeddings for slurs missing entries in text_vec_store_vyakyarth (deduplicated by normalized label).
    • Run: Scripts.ExtractCrowdsourcedSlurEmbedding.enqueue_unprocessed_texts_batch()
  • Scripts.ClusterTextVecStore.run()

    • Performs clustering over all stored embeddings by delegating to Python clustering.get_clusters, then persists cluster IDs back to text_vec_store_vyakyarth.
    • Run: Scripts.ClusterTextVecStore.run()
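
To give a feel for what "clustering embeddings and persisting cluster IDs" means, here is a minimal sketch that assigns an integer cluster ID to each vector. The algorithm used by clustering.get_clusters is not described on this page, so this example swaps in a simple greedy cosine-similarity threshold method purely for illustration. The function names and the 0.9 threshold are hypothetical, and the sketch assumes non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_clusters(vectors, threshold=0.9):
    """Greedy clustering: join the first cluster whose representative is similar enough.

    Returns one cluster ID per input vector, in input order. The first vector
    of each cluster serves as that cluster's representative.
    """
    representatives = []
    cluster_ids = []
    for vec in vectors:
        for cluster_id, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                cluster_ids.append(cluster_id)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(vec)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids


ids = assign_clusters([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]])
print(ids)  # [0, 0, 1, 1]
```

The returned list lines up one-to-one with the input vectors, which is the shape needed to write a cluster ID back next to each stored embedding row.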
