Hugging Face and Pre-trained Models
First Steps with Hugging Face#
What’s the Hugging Face library that provides APIs and easy-to-use functions to utilize transformers or download pre-trained models?
transformers.
tokenizers.
accelerate.
Answer
The correct answer is 1.
Complete the sentence with the correct option. Hugging Face provides a _____ interface for working with NLP models than PyTorch and TensorFlow.
lower-level.
higher-level.
Answer
The correct answer is 2.
True or False. All the models available on the Hugging Face Hub have been trained by Hugging Face.
Answer
The correct answer is False. Anyone can create an account and upload a model to the Hugging Face Hub.
True or False. On the Hugging Face Hub you can find models for NLP tasks only.
Answer
The correct answer is False. There are models for Computer Vision and Reinforcement Learning as well. However, the majority of models today are for NLP tasks.
True or False. Hugging Face Spaces lets you deploy and host your ML demos on Hugging Face, but you need a premium account.
Answer
The correct answer is False. Most demos on Hugging Face run on a free tier with limited CPUs, which is enough for most of them (they are demos, after all). It’s possible to pay for better hardware (e.g. GPUs).
Hugging Face Hub: Models and Model Cards#
Select the option that describes something not typically found on a model card.
The evaluation results.
The model’s intended uses and potential limitations, including biases and ethical considerations.
Training datasets.
The model price.
The model architecture (e.g. BERT or RoBERTa).
The training configuration and experimental info.
Answer
The correct answer is 4.
True or False. All the models in the Hugging Face Hub can be used commercially.
Answer
The correct answer is False. Whether a model can be used commercially or not is specified in the model license.
True or False. If you have a question about a model on the Hugging Face Hub, you should send a message to its author.
Answer
The correct answer is False. While you may be able to send a message to the author, the best way to proceed is to ask the question in the Community section.
True or False. On the Hugging Face Hub you can often test models directly from the browser.
Answer
The correct answer is True, thanks to the Hosted Inference API. Keep in mind that the number of free calls you can make per day is limited. Moreover, some models can’t be tested directly from the browser if they require complex input (e.g. Reinforcement Learning models).
Hugging Face Hub: Datasets#
Choose the incorrect option. On the Hugging Face Hub, datasets can be filtered by:
License.
Languages of the samples contained.
Number of models that have been trained on the dataset.
Multilinguality, i.e. whether the dataset can be used for multilingual models or not.
Size of the dataset (i.e. the number of samples contained).
Tasks for which the dataset has been created.
Answer
The correct answer is 3.
True or False. It is possible to see some samples directly on the dataset page.
Answer
The correct answer is True. There is a “Dataset Preview” section on the dataset page.
What is a dataset card and what does it typically contain?
A document that lists useful information about a dataset, like its creation process or how to responsibly use the data.
A document that lists useful information about a dataset, like the best models trained with it and their scores.
A document that lists useful information about a dataset, like its authors and other similar datasets.
Answer
The correct answer is 1.
Hugging Face Spaces#
What are the libraries that can be used to publish apps with Hugging Face Spaces?
Streamlit and Gradio.
Bokeh and Plotly.
Matplotlib and Seaborn.
Answer
The correct answer is 1.
Hugging Face Pipeline for Quick Prototyping#
Choose the best option. In the transformers library, the Pipeline class is a _____ class.
High-level.
Low-level.
Answer
The correct answer is 1.
True or False. In the transformers library, the Pipeline class executes the same code for each task.
Answer
The correct answer is False. Each task has a corresponding task-specific pipeline (e.g. “audio-classification” has an AudioClassificationPipeline, “question-answering” has a QuestionAnsweringPipeline, etc).
Evaluating a Sentiment Analysis Model#
Select the option that is not related to a sentiment analysis dataset.
IMDb
SST-2
SQuAD.
Answer
The correct answer is 3. SQuAD is a popular dataset for question answering.
What’s the name of the function from the datasets library that allows downloading and computing metrics, such as accuracy?
accuracy_score
load_metric
evaluate
compute_metric
Answer
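The correct answer is 2. Note that load_metric has since been deprecated in the datasets library; metrics now live in the separate evaluate library, where they are loaded with evaluate.load.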
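To make the metric concrete, here is a minimal pure-Python sketch of what an accuracy metric computes: the fraction of predictions that match the reference labels. The function name accuracy and the sample labels are illustrative assumptions, not the library's implementation.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Return the fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical binary sentiment labels: 1 = positive, 0 = negative.
preds = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
score = accuracy(preds, labels)  # 3 of the 4 predictions match
```

A metric loaded from a library typically performs the same comparison, with extra handling for batching and input validation.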
What type of data is contained in the IMDb dataset?
Tweets
General info about each movie
Movie ratings
Movie reviews
Answer
The correct answer is 4.
When should someone fine-tune a model on new data instead of using a pre-trained model directly, even if it was trained on similar data?
When the benefits of the improvements that fine-tuning brings outweigh the costs of building a dataset and fine-tuning the model.
When the data you have is specialized to a particular domain or task.
When the pre-trained model does not have enough capacity to capture the complexity of your data.
When the pre-trained model has not been trained on a sufficiently large dataset.
Answer
The correct answer is 1.
What’s the meaning of the device parameter of a Hugging Face pipeline?
It specifies the size of the output data to be produced by the model.
It specifies which type of algorithm to use for natural language processing tasks.
It specifies the size of the input data to be processed by the model.
It specifies the device (e.g. CPU or GPU) to use for computations by the model.
Answer
The correct answer is 4.
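As a rough sketch of how the device parameter is usually set, the snippet below maps GPU availability to the integer convention accepted by pipeline (-1 for CPU, 0 for the first GPU). The helper names pick_device and build_sentiment_pipeline are hypothetical, and the pipeline itself needs the transformers library, a backend such as PyTorch, and a model download on first use.

```python
def pick_device(cuda_available: bool) -> int:
    """Map GPU availability to the integer convention used by `pipeline`.

    -1 selects the CPU; 0 selects the first GPU.
    """
    return 0 if cuda_available else -1


def build_sentiment_pipeline(cuda_available: bool = False):
    """Create a sentiment-analysis pipeline on the chosen device.

    The import is kept local because it requires `transformers` to be
    installed, and the call downloads a default model on first use.
    """
    from transformers import pipeline

    return pipeline("sentiment-analysis", device=pick_device(cuda_available))
```

Calling build_sentiment_pipeline() and then passing a string to the result returns a list of dictionaries with label and score keys. With PyTorch installed, torch.cuda.is_available() is a common way to obtain the cuda_available flag.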
Project: Detecting Emotions from Text#
What is Emotion Detection in NLP?
A technique that allows understanding the context of a text.
A way to predict the sentiment (positive or negative) of a text.
A technique that allows classifying texts with human emotions.
A tool to detect the level of understanding of a text.
Answer
The correct answer is 3.
Project: Language Detection#
What is Language Detection commonly used for?
To develop a better understanding of the structure of a language.
To detect text that has been plagiarized from another source.
Automatically translating a text from one language to another.
As a preprocessing step before passing texts to mono-lingual models.
Answer
The correct answer is 4.
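The preprocessing role of language detection can be sketched as a small router: detect the language first, then hand the text to the matching mono-lingual model. Everything here is a hypothetical stand-in (route_by_language, toy_detector, and toy_models are not real library functions); a real setup would use a language-identification model as the detector.

```python
from typing import Callable, Dict


def route_by_language(
    text: str,
    detect_language: Callable[[str], str],
    models: Dict[str, Callable[[str], str]],
) -> str:
    """Detect the language of `text` and pass it to the matching model."""
    lang = detect_language(text)
    if lang not in models:
        raise KeyError(f"no model registered for language {lang!r}")
    return models[lang](text)


# Toy detector: an assumption for illustration only.
def toy_detector(text: str) -> str:
    return "it" if "ciao" in text.lower() else "en"


# Toy "models" that simply tag the text with the language they handle.
toy_models = {
    "en": lambda text: f"[en model] {text}",
    "it": lambda text: f"[it model] {text}",
}

routed = route_by_language("Ciao a tutti", toy_detector, toy_models)
```

Raising an error for unsupported languages keeps texts from silently reaching a model that was never trained on them.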
How many languages are present in the Language Identification dataset?
10
20
50
200
Answer
The correct answer is 2.
Semantic Search on Big Data#
What should I typically do if the embedding model that I want to use is too slow?
Use a different, smaller and faster embedding model, even if it may produce lower-quality embeddings.
Increase the speed of the model by optimizing the architecture and hyperparameters.
Implement caching to enable faster embedding retrieval.
Answer
The correct answer is 1.
Why are operations like dot product or cosine similarity fast on CPU?
Because of vectorization.
Because the operations are simple and easy to calculate.
Because the calculations are done without involving the memory.
Because of cache locality.
Answer
The correct answer is 1.
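A minimal sketch of vectorization, assuming NumPy is installed: cosine similarity between one query and every corpus embedding is computed with a single matrix-vector product instead of a Python loop. The function name cosine_scores and the toy vectors are illustrative.

```python
import numpy as np


def cosine_scores(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of `corpus`."""
    query_norm = query / np.linalg.norm(query)
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    # One matrix-vector product replaces a Python loop over all rows.
    return corpus_norm @ query_norm


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
query = np.array([3.0, 0.0, 0.0])
scores = cosine_scores(query, corpus)
best = int(np.argmax(scores))  # index of the most similar corpus row
```

Normalizing the vectors first means the dot product alone gives the cosine similarity, so the whole comparison stays inside optimized array routines.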
What are data structures that split spaces into cells to optimize computations called?
Grid-based data structures.
Spatial indexing data structures.
Space-partitioning data structures.
Answer
The correct answer is 3.
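One of the simplest space-partitioning structures is a uniform grid. The sketch below, using only the standard library, stores 2D points in square cells and searches only the query's cell and its eight neighbours. The class GridIndex is a hypothetical illustration, and production libraries use more elaborate schemes suited to high-dimensional embeddings.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Uniform grid over 2D points: each point is stored in one square cell."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, point):
        x, y = point
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def add(self, point) -> None:
        self.cells[self._cell(point)].append(point)

    def nearest(self, query):
        """Closest stored point among the query's cell and its 8 neighbours.

        The result is approximate: the true nearest neighbour is missed if
        it lies outside these nine cells. Returns None if they are empty.
        """
        cx, cy = self._cell(query)
        candidates = [
            p
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for p in self.cells.get((cx + dx, cy + dy), [])
        ]
        return min(candidates, key=lambda p: dist(p, query), default=None)


index = GridIndex(cell_size=1.0)
for p in [(0.2, 0.3), (0.9, 0.8), (5.0, 5.0)]:
    index.add(p)
result = index.nearest((1.1, 1.0))  # only nearby cells are inspected
```

The point (5.0, 5.0) is never compared against the query, which is where the speed-up comes from: the cost depends on how many points sit in nearby cells, not on the size of the whole collection.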
What are two popular Python libraries used to perform fast semantic search?
Pattern and NumPy.
spaCy and NLTK.
Gensim and scikit-learn.
Faiss and Annoy.
NLTK and TextBlob.
Answer
The correct answer is 4.