--- title: "Getting Started with TidyVec" author: "Nick Gauthier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with TidyVec} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) ``` ## Introduction TidyVec is a lightweight package that brings vector embeddings and similarity search to the tidyverse ecosystem. It allows you to store embeddings alongside your data in tibbles and perform vector search operations while maintaining the ability to use all your familiar dplyr verbs. This vignette will show you how to: 1. Create vector collections 2. Generate embeddings for text and images 3. Find similar items using vector search 4. Visualize embedding spaces 5. Combine vector search with tidyverse operations ## Installation You can install TidyVec from GitHub: ```{r eval=FALSE} # install.packages("remotes") remotes::install_github("flmnh-ai/tidyvec") ``` Load the required packages: ```{r setup} library(dplyr) library(ggplot2) library(tidyvec) ``` ## Creating a Vector Collection At its core, TidyVec treats vector collections as enhanced tibbles. Any tibble can be converted to a vector collection using the `vec()` function: ```{r} # Create a simple dataset books <- tibble( id = c("book1", "book2", "book3", "book4", "book5"), title = c( "The Art of Data Science", "Advanced R Programming", "Tidy Data Visualization", "Statistical Learning Methods", "Machine Learning with R" ), author = c("Smith", "Jones", "Brown", "Davis", "Wilson"), year = c(2018, 2020, 2019, 2021, 2022), description = c( "A comprehensive guide to data analysis using modern techniques", "Deep dive into R programming for advanced users", "Creating beautiful visualizations with ggplot2 and the tidyverse", "Introduction to statistical learning methods and their applications", "Practical machine learning approaches with R examples" ) ) # Convert to a vector collection books_vec <- vec(books) books_vec ``` By default, a column named `embedding` will be created to store vector embeddings. ## Generating Embeddings For this example, we'll use a simple TF-IDF embedder for our text data: ```{r} # Create a TF-IDF embedder from our descriptions embedder <- embedder_tfidf(books$description) # Update our collection with the embedding function books_vec <- vec(books, embedding_fn = embedder) # Generate embeddings for the description column books_vec <- embed(books_vec, content_column = "description") # Look at the result books_vec ``` Each book now has an embedding vector derived from its description. ## Finding Similar Items Now we can find books similar to a text query: ```{r} # Find books similar to a query query_results <- books_vec %>% nearest("machine learning and statistics", n = 3) # View the results with similarity scores query_results %>% select(title, author, similarity) ``` We can also query directly with the `nearest()` function: ```{r} # Using the nearest() function nearest(books_vec, "programming in R") %>% select(title, author, similarity) ``` ## Combining with Tidyverse Operations What makes TidyVec special is how seamlessly it integrates with tidyverse workflows. You can use standard dplyr operations before or after vector search: ```{r} # Filter to recent books, then find similar ones books_vec %>% filter(year >= 2020) %>% nearest("R methods", n = 2) %>% select(title, year, similarity) ``` You can also perform vector search first, then filter the results: ```{r} # Find similar books, then filter by similarity threshold books_vec %>% nearest("R methods", n = 5) %>% filter(similarity > 0.2) %>% select(title, similarity) ``` ## Persistence Save collections to avoid re-computing embeddings: ```{r eval=FALSE} # Save to disk write_vec(books_vec, "my_books.qs") # Load later books_vec <- read_vec("my_books.qs") ``` ## Hybrid Search Combine semantic search with keyword matching: ```{r eval=FALSE} # Hybrid: 70% vector, 30% keyword matching nearest(books_vec, "deep learning", keyword_weight = 0.3, keyword_column = "description") ``` This is useful for "known-item" searches where you remember specific words. ## Clustering Discover semantic groups: ```{r eval=FALSE} books_vec %>% cluster_embeddings(n_clusters = 5) %>% group_by(cluster) %>% summarize(example = first(title)) ``` ## Working with Images TidyVec excels at working with multimodal data, including images. For this, we typically use neural embedding models like CLIP via HuggingFace. > Note: Python dependencies are automatically provisioned on first use. HuggingFace embedders batch process for performance. ```{r eval=FALSE} # Create a CLIP embedder clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal") # Get paths to example images included with the package img_paths <- c( cat = system.file("images/cat.jpeg", package = "tidyvec"), dog = system.file("images", "dog.jpeg", package = "tidyvec"), beach = system.file("images", "beach.jpeg", package = "tidyvec"), mountain = system.file("images", "mountain.jpeg", package = "tidyvec"), city = system.file("images", "city.jpeg", package = "tidyvec") ) # Create an image collection images <- tibble( id = names(img_paths), path = unname(img_paths), category = c("pet", "pet", "nature", "nature", "urban") ) %>% vec(embedding_fn = clip_embedder) %>% embed(content_column = "path") # Find images similar to text nearest(images, "a cat playing") %>% select(id, path, similarity) # Find images similar to another image nearest(images, system.file("images", "dog-on-beach.jpeg", package = "tidyvec"), n = 2) # Find images similar to text and visualize nearest(images, "a dog on a mountain") %>% viz_images(path_column = "path", label_columns = c("id", "category"), n = 2) nearest(images, "a dog on a beach") %>% viz_images(path_column = "path", label_columns = c("id", "category"), n = 2) ``` ## Visualizing Embedding Spaces TidyVec provides a simple way to visualize your embedding spaces using dimensionality reduction techniques: ```{r eval=FALSE} # Visualize our book embeddings images %>% viz_embeddings(method = "tsne", labels = "id", color = "category", perplexity = 1) ``` ## Advanced Use Cases ### RAG (Retrieval-Augmented Generation) TidyVec is perfect for creating simple RAG systems: ```{r eval=FALSE} # Split document into chunks document_chunks <- tibble( id = paste0("chunk", 1:10), text = c( "R is a programming language for statistical computing.", "The tidyverse is a collection of R packages for data science.", "ggplot2 is used for data visualization in R.", "dplyr provides functions for data manipulation.", "tidyr helps to create tidy data.", "purrr enhances R's functional programming capabilities.", "readr provides functions to read rectangular data.", "tibble is a modern reimagining of the data frame.", "stringr provides functions for string manipulation.", "forcats provides tools for working with categorical variables." ), source = "R Documentation" ) %>% vec(embedding_fn = embedder_tfidf(.$text)) %>% embed(content_column = "text") # User query query_results <- document_chunks %>% nearest("How do I visualize data in R?", n = 3) # Use results to generate answer with an LLM query_results %>% select(text, similarity) ``` ### Custom Embedders You can easily create custom embedding functions: ```{r} # Create a simple embedder that counts word frequencies word_freq_embedder <- function(vocabulary = c("r", "data", "programming", "statistics", "visualization")) { function(text) { text <- tolower(text) vapply(vocabulary, function(word) { sum(gregexpr(word, text)[[1]] > 0) }, numeric(1)) } } # Use custom embedder simple_embedder <- word_freq_embedder() books_vec <- books %>% vec(embedding_fn = simple_embedder) %>% embed(content_column = "description") # Query with custom embedder nearest(books_vec, "data visualization") %>% select(title, similarity) ``` ## Conclusion TidyVec provides a lightweight, tidyverse-friendly way to work with vector embeddings. By treating embeddings as just another column in your tibbles, you get all the power of vector search while maintaining the flexibility and familiarity of the tidyverse. Key benefits: 1. Seamless integration with dplyr, ggplot2, and other tidyverse packages 2. Support for multiple modalities (text, images) 3. Fast batch processing (10-50x speedup) 4. Persistence support for saving/loading collections 5. Hybrid search and clustering capabilities 6. Simple, intuitive API 7. Visualization capabilities For production use cases with >100K items or sub-millisecond queries, consider using FAISS, Chroma, or other specialized vector databases. For learning, prototyping, and personal-scale projects, TidyVec provides an elegant and easy-to-use solution.