Engineering

Existence vs. Guarantee: How We Handle Search Result Selection

Semantic search powers Echosaw's media intelligence. This post dives into our vector search architecture, distance-based filtering, and the trade-offs between recall and precision.

Echosaw Team | May 28, 2026 | 8 min read

Semantic search is at the heart of Echosaw's media intelligence platform. When you search your library with natural language queries, you expect relevant results—but what does "relevant" actually mean in the context of vector similarity search? This post dives into our search architecture and the trade-offs between recall (finding everything) and precision (finding the right things).

The Vector Index

Echosaw uses Amazon S3 Vectors with a 1024-dimensional embedding index populated by Amazon Titan Text Embeddings V2. Each analyzed media item is converted to a vector that captures its semantic content: transcripts, summaries, labels, locations, and other metadata are all encoded into this high-dimensional space. When you search, your query is embedded with the same model, and we find the nearest vectors using cosine distance.
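To make "nearest vectors using cosine distance" concrete, here is a minimal sketch in plain Python. It is illustrative only: the function names are ours, the brute-force scan stands in for what the managed index does internally, and real vectors would have 1024 dimensions rather than the toy sizes used here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Brute-force top-k: score every (key, vector) pair, smallest distance first."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in items]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

Because the distance depends only on the angle between vectors, scaling a vector does not change its score; only direction matters.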

But here's the crucial distinction: just because a document exists in your library doesn't guarantee it will appear in search results. The vector index returns the top-K nearest neighbors, but we apply multiple filtering layers before presenting results to you.

Distance-Based Filtering

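Cosine distance is 1 minus cosine similarity, so it runs from 0 (same direction, effectively identical) through 1 (orthogonal, effectively unrelated) and can reach 2 for vectors pointing in opposite directions. For text embeddings, scores typically fall between 0 and 1, which is the range our thresholds target. We don't simply return the top 10 matches; we apply a multi-stage filtering pipeline: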

Absolute Distance Ceiling (0.85): We reject any result with a distance above 0.85, regardless of its relative ranking. This hard cutoff prevents truly irrelevant noise from appearing in results, even for conceptual queries like "comedic content" that naturally produce looser matches. Specific queries (proper nouns, exact phrases) naturally produce tighter distances.

Adaptive Relative Threshold (0.05 gap): After applying the absolute ceiling, we calculate an adaptive threshold based on the best match's distance. We only include results within 0.05 of that best score. This ensures that results are clustered around the highest-quality matches rather than including progressively weaker candidates.

Keyword Boost Ceiling (0.95): For results that contain exact keyword matches (proper nouns, locations, names), we relax the distance ceiling to 0.95. This rescues results that might have weaker semantic similarity but contain the exact terms you're looking for, which is particularly important for entity searches where semantic models sometimes struggle.
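The three thresholds can be written down as two small rules. This is a sketch rather than our production code: the constant values come from the descriptions above, but the function names are hypothetical, and we assume "above 0.85" means a result at exactly 0.85 is kept.

```python
ABSOLUTE_CEILING = 0.85  # hard cutoff for ordinary results
KEYWORD_CEILING = 0.95   # relaxed cutoff for results with an exact keyword match
RELATIVE_GAP = 0.05      # allowed distance beyond the best match


def passes_ceiling(distance: float, has_keyword_match: bool) -> bool:
    """Apply the absolute ceiling, relaxed to the keyword ceiling on a keyword hit."""
    ceiling = KEYWORD_CEILING if has_keyword_match else ABSOLUTE_CEILING
    return distance <= ceiling


def adaptive_cutoff(best_distance: float) -> float:
    """Largest distance still considered part of the cluster around the best match."""
    return best_distance + RELATIVE_GAP
```

For example, if the best match sits at 0.40, the adaptive cutoff is roughly 0.45, so a candidate at 0.60 is dropped even though it is comfortably under the absolute ceiling.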

The Filtering Pipeline in Action

Here's how a query flows through our system:

  1. Generate query embedding via Bedrock Titan
  2. Query S3 Vectors with scope-based filter (personal, org, or public)
  3. Apply absolute distance ceiling (remove clearly irrelevant results)
  4. Rescue keyword-matched results within boost ceiling
  5. Apply adaptive relative threshold (cluster around best match)
  6. Cap at maximum results (10) to prevent noise flooding

This multi-stage approach balances recall and precision. The absolute ceiling prevents catastrophic failures where completely irrelevant content appears. The adaptive threshold ensures result quality is consistent regardless of query specificity. The keyword boost handles edge cases where semantic similarity misses but exact matches exist.
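Steps 3 through 6 can be sketched as one function over the hits returned by the index. Steps 1 and 2 (embedding and the index query) are left out because they require live AWS calls. Everything here is an illustration under assumptions: the `Hit` type, its `text` field, and the substring-based keyword check are hypothetical, and we assume the adaptive threshold applies to keyword-rescued hits as well, since it comes after the rescue step in the list above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hit:
    key: str
    distance: float
    text: str  # hypothetical: searchable text used for the keyword check


def select_results(
    hits: Sequence[Hit],
    query_terms: Sequence[str],
    *,
    absolute_ceiling: float = 0.85,
    keyword_ceiling: float = 0.95,
    relative_gap: float = 0.05,
    max_results: int = 10,
) -> List[Hit]:
    def has_keyword(hit: Hit) -> bool:
        # Assumption: a case-insensitive substring match counts as a keyword hit.
        lowered = hit.text.lower()
        return any(term.lower() in lowered for term in query_terms)

    # Steps 3 and 4: keep hits under the absolute ceiling, plus keyword
    # matches that fall under the relaxed keyword ceiling.
    admitted = [
        h for h in hits
        if h.distance <= absolute_ceiling
        or (h.distance <= keyword_ceiling and has_keyword(h))
    ]
    if not admitted:
        return []

    # Step 5: keep only hits within the relative gap of the best distance.
    best = min(h.distance for h in admitted)
    clustered = [h for h in admitted if h.distance <= best + relative_gap]

    # Step 6: order by distance and cap the result count.
    clustered.sort(key=lambda h: h.distance)
    return clustered[:max_results]
```

One consequence of this assumed ordering: a keyword-rescued hit at 0.90 survives only when the best admitted hit is itself at roughly 0.85 or worse, so the rescue mainly matters for queries that have no strong semantic match at all.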

Scope-Based Filtering

Beyond distance-based relevance, we also filter by ownership scope. Your searches can target your personal media (mine), your organization's shared library (org), or the public library across all users (public). Each scope uses different filter criteria on the vector index, ensuring you only see content you're authorized to access.
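As a rough sketch of how a scope could translate into a metadata filter on the index query: the metadata keys (`owner_id`, `org_id`, `visibility`) are hypothetical and not our actual schema, and the operator-style dictionary shape is an assumption modeled on common metadata filter syntax, so check it against the S3 Vectors documentation before relying on it.

```python
from typing import Any, Dict, Optional


def build_scope_filter(
    scope: str,
    user_id: str,
    org_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Map a search scope to a metadata filter (hypothetical field names)."""
    if scope == "mine":
        # Personal media: only items owned by the requesting user.
        return {"owner_id": {"$eq": user_id}}
    if scope == "org":
        if org_id is None:
            raise ValueError("org scope requires an org_id")
        # Organization library: items shared within the user's organization.
        return {"org_id": {"$eq": org_id}}
    if scope == "public":
        # Public library: items marked public, regardless of owner.
        return {"visibility": {"$eq": "public"}}
    raise ValueError(f"unknown scope: {scope!r}")
```

Building the filter in one place and rejecting unknown scopes keeps authorization decisions out of the relevance code: an unrecognized scope fails loudly instead of falling back to a broader search.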

The distinction between existence and guarantee is fundamental to our search philosophy. We guarantee that results we return are relevant, but we don't guarantee that every relevant document in your library will appear. This trade-off prioritizes result quality over exhaustive recall—which we believe is the right choice for a media intelligence platform where trust in results matters most.
