2.10 Semantic Search on Big Data

In the first chapter of the course, we used the SBERT (Sentence Transformers) library to compute similarities between texts, which can be used to build smart search engines and recommender systems or to cluster texts. In this lesson, we’ll learn how to use those models efficiently over a large number of texts.

Models from the SBERT Library

The SBERT library uses pre-trained models that are available on the Hugging Face Hub. Have a look at them and learn how they differ.

Figure: pre-trained SBERT models (sbert_models.png)

The all-mpnet-base-v2 model achieves good scores on both embedding tasks (e.g. text classification and text similarity) and semantic search tasks, making it a good default for most use cases. If your specific use case needs a smaller and faster model, you can trade some output quality for speed and size by choosing a model such as all-MiniLM-L6-v2.
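To make the size/speed trade-off a bit more concrete, here is a minimal sketch of brute-force semantic search over precomputed embeddings with NumPy. Random vectors stand in for real embeddings so that the block runs without downloading a model; the dimension of 384 matches the output size of all-MiniLM-L6-v2 (all-mpnet-base-v2 produces 768-dimensional vectors, so each comparison costs roughly twice as much). The helper names `normalize` and `top_k_cosine` are hypothetical and not part of the SBERT library.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    q = normalize(query.reshape(1, -1))
    c = normalize(corpus)
    # A single vectorized matrix-vector product scores the whole corpus at once.
    scores = (c @ q.T).ravel()
    k = min(k, len(scores))
    # argpartition finds the top k without fully sorting all scores...
    top = np.argpartition(-scores, k - 1)[:k]
    # ...then only those k candidates are sorted by descending score.
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Assumption: random vectors replace real sentence embeddings for illustration.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(10_000, 384)).astype(np.float32)

# The query is a slightly perturbed copy of row 42, so row 42 should rank first.
noise = 0.01 * rng.normal(size=384).astype(np.float32)
query_embedding = corpus_embeddings[42] + noise

indices, scores = top_k_cosine(query_embedding, corpus_embeddings, k=3)
print(indices, scores)
```

With a real model, the random arrays would be replaced by the output of something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` from the `sentence_transformers` package. The search code stays the same; only the embedding dimension and the time spent encoding change with the model.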

Code Exercises

Quiz

What should I typically do if the embedding model that I want to use is too slow?

  1. Use a different smaller and faster embedding model, even if it may produce lower quality embeddings.

  2. Increase the speed of the model by optimizing the architecture and hyperparameters.

  3. Implement caching to enable faster embedding retrieval.

Why are operations like dot or cosine similarity fast on CPU?

  1. Because of vectorization.

  2. Because the operations are simple and easy to calculate.

  3. Because the calculations are done without involving the memory.

  4. Because of cache locality.

What are data structures that split spaces into cells to optimize computations called?

  1. Grid-based data structures.

  2. Spatial indexing data structures.

  3. Space-partitioning data structures.

What are two popular Python libraries used to perform fast semantic search?

  1. Pattern and Numpy.

  2. Spacy and NLTK.

  3. Gensim and Scikit-Learn.

  4. Faiss and Annoy.

  5. NLTK and TextBlob.

Questions and Feedback

Have questions about this lesson? Would you like to exchange ideas? Or would you like to point out something that needs to be corrected? Join the NLPlanet Discord server and interact with the community! There’s a specific channel for this course called practical-nlp-nlplanet.