Dec 10 and 11 2024

Started Kaggle/Google Day 2 - Embeddings and Vector Stores

  • Summary podcast episode for this unit
  • Whitepaper on Embeddings and Vector stores
    • The Continuous bag of words (CBOW) approach: Tries to predict the middle word, using the embeddings of the surrounding words as input. This method is agnostic to the order of the surrounding words in the context. This approach is fast to train and is slightly more accurate for frequent words.
    • The skip-gram approach: The setup is the inverse of CBOW's, with the middle word being used to predict the surrounding words within a certain range. This approach is slower to train but works well with small data and is more accurate for rare words (see the gensim sketch after this list).
    • Then covered GloVe and Word2Vec, with a visual comparing the word embeddings produced by each algorithm.
    • Figure 9. Some architectures of dual encoders: siamese dual encoder (SDE), asymmetric dual encoder (ADE), asymmetric dual encoder with shared token embedder (ADE-STE), asymmetric dual encoder with frozen token embedder (ADE-FTE), and ADE-SPL.
  • Comparing a query embedding to the text embeddings
    • Euclidean distance (i.e., L2 distance) is a geometric measure of the distance between two points in a vector space. This works well for lower dimensions (word embeddings, tabular data, or an image embedding AFTER feature extraction).
    • Cosine similarity is a measure of the angle between two vectors.
    • Inner/dot product is the projection of one vector onto another; it is equivalent to cosine similarity when the vector norms are 1. It seems to work better for higher-dimensional data (raw text representations, raw image pixels, genomics data, user-item recommendation systems). A small numpy comparison of these three measures appears after this list.
  • Vector databases store, manage, and operationalize the complexity of vector search at scale, while also addressing common database needs.
  • Options for searching across all the Vectors entries -
    • Brute force – look at each and every one
    • Approximate nearest neighbor (ANN)
    • Locality-sensitive hashing (LSH) – think of grouping residences by postal code
    • and Trees (KD, Ball, etc)
    • Hashing- and tree-based approaches can be combined and extended to obtain the desired trade-off between recall and latency for the search algorithm (e.g., FAISS with HNSW, and ScaNN); a brute-force vs. HNSW sketch follows this list.
  • Vector Embedding Challenges
    • Vector embeddings are NOT STATIC - unlike traditional content, embeddings can change over time. The same text, image, video, or other content may need to be re-embedded with different embedding models to keep downstream applications performing well. This is especially true for embeddings from supervised models after the model is retrained to account for various drifts or changing objectives, and likewise for unsupervised models when they are updated to a newer version.
    • Vector Embeddings and NON-Semantic data - while embeddings are great at representing semantic information, sometimes they can be suboptimal at representing literal or syntactic information. This is especially true for domain-specific words or IDs. These values are potentially missing or underrepresented in the data the embeddings models were trained on.
      • For example, if a user enters a query that contains a specific ID along with a lot of text, the model might find semantically similar neighbors that match the meaning of the text closely but not the ID, even though the ID is the most important component in this context. You can overcome this by using full-text search to pre-filter or post-filter the search space before passing it on to the semantic search module (a small pre-filter sketch appears after this list).
      • For OLTP workloads that require frequent read/write operations, an operational database like Postgres or Cloud SQL is the better choice; for large-scale OLAP analytical workloads and batch use cases, BigQuery's vector search is preferable.
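
To make the CBOW vs. skip-gram distinction concrete, here is a minimal sketch using gensim's Word2Vec, where the `sg` flag switches between the two training objectives; the tiny corpus and parameters are made up for illustration.

```python
# Minimal CBOW vs. skip-gram sketch, assuming the gensim package is installed.
from gensim.models import Word2Vec

corpus = [
    ["embeddings", "map", "words", "to", "dense", "vectors"],
    ["similar", "words", "get", "similar", "vectors"],
    ["vector", "stores", "index", "embeddings", "for", "search"],
]

# sg=0 -> CBOW: predict the middle word from its (order-agnostic) context.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the surrounding words from the middle word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["embeddings"][:5])                    # first 5 dims of one embedding
print(skipgram.wv.most_similar("vectors", topn=3))  # nearest neighbors by cosine
```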
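
And the small numpy comparison of the three measures referenced above, including the equivalence of cosine similarity and dot product for unit-norm vectors:

```python
import numpy as np

query = np.array([0.2, 0.7, 0.1])
doc = np.array([0.3, 0.6, 0.2])

euclidean = np.linalg.norm(query - doc)                        # L2 distance
dot = np.dot(query, doc)                                       # inner product
cosine = dot / (np.linalg.norm(query) * np.linalg.norm(doc))   # cosine similarity

# Once both vectors are normalized to unit length, dot product == cosine similarity.
q_n = query / np.linalg.norm(query)
d_n = doc / np.linalg.norm(doc)
assert np.isclose(np.dot(q_n, d_n), cosine)

print(euclidean, dot, cosine)
```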
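
And the brute-force vs. HNSW sketch referenced above, assuming the faiss-cpu package; the data and index parameters are made up, so the recall number will vary run to run.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64                                                   # embedding dimensionality
xb = np.random.random((10_000, d)).astype("float32")     # "database" vectors
xq = np.random.random((5, d)).astype("float32")          # query vectors

# Brute force: exact nearest neighbors, compares against every entry.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, exact_ids = flat.search(xq, 5)

# ANN via HNSW: graph-based index trading a little recall for much lower latency.
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 = neighbors per node in the graph
hnsw.add(xb)
_, approx_ids = hnsw.search(xq, 5)

# Recall@5 of the approximate index measured against the exact results.
recall = np.mean([len(set(a) & set(e)) / 5 for a, e in zip(approx_ids, exact_ids)])
print(f"recall@5 ~ {recall:.2f}")
```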
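
And the pre-filter sketch referenced above: a literal (full-text) filter on the ID narrows the candidate set before the semantic step. TfidfVectorizer stands in for a real embedding model here just to keep the example self-contained; the documents and the ID are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Order SKU-98231 was delayed due to a warehouse issue.",
    "Order SKU-55510 shipped on time and arrived early.",
    "General shipping policy: most orders ship within two business days.",
]
query = "Why was my order SKU-98231 late?"
query_id = "SKU-98231"   # the literal token that pure semantic search tends to blur

# 1) Full-text pre-filter: keep only documents containing the exact ID.
candidates = [d for d in docs if query_id in d] or docs

# 2) Semantic-style search over the reduced candidate set.
vec = TfidfVectorizer().fit(candidates + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(candidates))[0]
print(candidates[scores.argmax()])
```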

Summary / Key Takeaways

  • Choose your embedding model wisely for your data and use case. Ensure the data used at inference is consistent with the data used in training. The distribution shift from training to inference can come from several sources, including domain distribution shift or downstream-task distribution shift. If no existing embedding model fits the current inference data distribution, fine-tuning an existing model can significantly improve performance.
  • For most business applications, using a pre-trained embedding model provides a good baseline, which can be further fine-tuned or integrated into downstream models.
  • In case the data has an inherent graph structure, graph embeddings can provide superior performance.
  • ScaNN and HNSW have proven to provide some of the best accuracy and performance trade-offs, in that order.
  • Popular applications for embedding models:

| Embedding Task | Purpose | When to Use |
| --- | --- | --- |
| "retrieval_document" | Embed documents for retrieval. | When storing content to be searched. |
| "retrieval_query" | Embed user queries for retrieval tasks. | When embedding user search inputs. |
| "semantic_similarity" | Measure semantic similarity between texts. | For detecting duplicates, paraphrases, or similar content. |
| "classification" | Create embeddings for input classification tasks. | For categorizing data (e.g., spam detection, sentiment analysis). |
| "clustering" | Generate embeddings for grouping similar objects. | For clustering tasks (e.g., grouping reviews or articles). |
| "reranking" | Adjust the ranking of retrieved objects. | When refining search results or recommendation rankings. |
| "knowledge_embedding" | Embed entities and relationships in a knowledge graph. | When working with knowledge graphs or reasoning tasks. |
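
As a usage sketch for the table above: the Gemini embeddings API exposes these tasks via a `task_type` parameter. This assumes the `google-generativeai` SDK and the `text-embedding-004` model used in the course labs; the API key is a placeholder.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Documents stored for search are embedded with "retrieval_document"...
doc_emb = genai.embed_content(
    model="models/text-embedding-004",
    content="Vector databases operationalize vector search at scale.",
    task_type="retrieval_document",
)["embedding"]

# ...while the user's search input is embedded with "retrieval_query".
query_emb = genai.embed_content(
    model="models/text-embedding-004",
    content="What do vector databases do?",
    task_type="retrieval_query",
)["embedding"]

print(len(doc_emb), len(query_emb))  # 768 dimensions each for this model
```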

Kaggle Day 2 code labs

  • Build a RAG question-answering system over a custom document
    • Additional resources - https://ai.google.dev/gemini-api/docs/semantic_retrieval, https://developers.google.com/machine-learning/crash-course/embeddings, https://ai.google.dev/gemini-api/docs/embeddings (has detailed examples for different types of embedding tasks)
  • Explore text similarity with embeddings
    • Additional resources - Anomaly Detection with Embeddings, Search Re-ranking with Embeddings
  • Build a neural classification network with Keras using embeddings – trained on a dataset with 20 categories and used Keras for the deep learning
    • Additional resources - Training with Built-in Methods (TensorFlow + Keras), scikit-learn labeled data for the 20 Newsgroups categories, ScaNN for AlloyDB whitepaper
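
A bare-bones sketch of what the RAG lab wires together (embed documents, retrieve the closest one for a query, then ground the answer in it). This assumes the same `google-generativeai` SDK; the documents, model names, and API key are placeholders rather than the lab's exact code.

```python
import numpy as np
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

docs = [
    "ScaNN is Google's library for efficient approximate nearest neighbor search.",
    "Cosine similarity measures the angle between two embedding vectors.",
]

def embed(texts, task_type):
    """Embed a list of texts with the assumed text-embedding-004 model."""
    return [
        genai.embed_content(model="models/text-embedding-004",
                            content=t, task_type=task_type)["embedding"]
        for t in texts
    ]

doc_vecs = np.array(embed(docs, "retrieval_document"))
query = "Which library does Google provide for ANN search?"
q_vec = np.array(embed([query], "retrieval_query")[0])

# Retrieve the document with the highest cosine similarity to the query.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
best_doc = docs[int(np.argmax(sims))]

# Ground the model's answer in the retrieved passage.
answer = genai.GenerativeModel("gemini-1.5-flash").generate_content(
    f"Answer using only this passage:\n{best_doc}\n\nQuestion: {query}"
)
print(answer.text)
```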