What is Hugging Face? 🤗
Hugging Face is an open-source software company that offers a wide range of natural language processing (NLP) tools and libraries built on top of the PyTorch and TensorFlow frameworks. It provides access to pre-trained language models like BERT, GPT-2, and RoBERTa, as well as models for various NLP tasks like text classification and question answering. In this guide, we will explore how to use Hugging Face with several different examples.
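For a quick feel for the library before we dive in, here is a minimal sketch using the high-level pipeline API (once the library is installed as described below). When no model name is given, the pipeline downloads a default checkpoint for the task:
from transformers import pipeline

# Build a sentiment-analysis pipeline; with no model specified, the
# library downloads a default checkpoint for this task.
classifier = pipeline('sentiment-analysis')
print(classifier('Hugging Face makes NLP remarkably accessible.'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]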
Prerequisites:
💡 Before we begin, you should have a basic understanding of Python programming language and NLP concepts. You should also have the following installed on your computer:
- Python 3.6 or higher
- pip package manager
- Hugging Face Transformers library
You can install the Transformers library, along with the Datasets library used in Example 1, by running the following command in your terminal:
pip install transformers datasets
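To confirm that the installation worked, you can print the library version from Python:
import transformers
print(transformers.__version__)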

Example 1: Text Classification
In this example, we’ll use Hugging Face to perform text classification on a dataset of movie reviews. The goal is to classify each review as either positive or negative.
Step 1: Load Data
First, we need to load our data. We’ll use the IMDB movie review dataset, which is included in the datasets package from Hugging Face. Here’s an example of how to load the data:
from datasets import load_dataset
dataset = load_dataset('imdb')
In this example, we’re using the load_dataset() function from the datasets package to load the IMDB dataset.
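If you want to sanity-check what was loaded, a couple of print statements help. The IMDB dataset ships with train, test, and unsupervised splits, and each example is a dict with a text string and an integer label:
print(dataset)              # shows the available splits and their sizes
print(dataset['train'][0])  # one example: {'text': '...', 'label': 0}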
Step 2: Tokenize Data
Next, we need to tokenize our input data. We’ll use the BERT tokenizer from the Transformers library to tokenize the movie reviews. Here’s an example of how to tokenize the data:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
encoded_dataset = dataset.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True), batched=True)
In this example, we’re using the AutoTokenizer class from the Transformers library to load the BERT tokenizer. We’re then using the map() method from the datasets package to tokenize each review in the dataset. The padding='max_length' and truncation=True arguments tell the tokenizer to pad or truncate every review to the model’s maximum input length, so all examples end up with the same shape.
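To see what the tokenizer actually produces, you can run it on a single sentence; max_length=16 here is just to keep the printout compact:
example = tokenizer('This movie was great!', padding='max_length', truncation=True, max_length=16)
print(example['input_ids'])       # token IDs, padded out to length 16
print(example['attention_mask'])  # 1 for real tokens, 0 for padding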
Step 3: Load Model and Train Classifier
Next, we need to load our pre-trained BERT model and train a classifier on the tokenized data. Here’s an example of how to do this:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)
training_args = TrainingArguments('test_trainer')
trainer = Trainer(model=model, args=training_args, train_dataset=encoded_dataset['train'])
trainer.train()
In this example, we’re using the AutoModelForSequenceClassification class from the Transformers library to load the pre-trained BERT model with a sequence-classification head on top. We’re also using the TrainingArguments class to set up our training parameters, and the Trainer class to perform the training. Note that we pass only the train split (encoded_dataset['train']) to the Trainer, not the whole dataset dictionary.
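TrainingArguments accepts many more options than just the output directory. As a sketch, a more explicit configuration might look like this; the values are illustrative starting points, not tuned settings:
training_args = TrainingArguments(
    output_dir='test_trainer',
    num_train_epochs=1,             # a single pass is enough for a first experiment
    per_device_train_batch_size=8,  # lower this if you run out of GPU memory
    learning_rate=2e-5,             # a common starting point for BERT fine-tuning
    weight_decay=0.01,
)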
Step 4: Test Classifier
Finally, we need to test our classifier on some test data. Here’s an example of how to do this:
test_dataset = encoded_dataset['test'].select(range(100))
predictions = trainer.predict(test_dataset)
In this example, we’re using the select() method from the datasets package to extract a small subset of the test split as our test dataset (note that we evaluate on the test split, not on the data we trained on). We’re then using the predict() method from the Trainer class to make predictions on it.
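The predict() method returns an object whose predictions field holds the raw logits and whose label_ids field holds the true labels. A small sketch of turning those into an accuracy score:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)  # pick the higher-scoring class
accuracy = (preds == predictions.label_ids).mean()
print(f'Accuracy: {accuracy:.3f}')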
Example 2: Question Answering
In this example, we’ll use Hugging Face to perform question answering on a passage of text. The goal is to find the answer to a given question within the passage.
Step 1: Load Data
First, we need to load our data. We’ll use a sample passage of text and a sample question. Here’s an example of how to do this:
context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889 as the entrance to the 1889 World's Fair, it was initially criticized by some of France's leading artists and intellectuals for its design, but it has become a global cultural icon of France and one of the most recognizable structures in the world."
question = "Who designed and built the Eiffel Tower?"
Step 2: Tokenize Data
Next, we need to tokenize our input data. We’ll use the DistilBERT tokenizer from the Transformers library to tokenize the passage and the question. Here’s an example of how to tokenize the data:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
encoded_inputs = tokenizer(question, context, padding=True, truncation=True, return_tensors='pt')
In this example, we’re using the AutoTokenizer class from the Transformers library to load the DistilBERT tokenizer, taking it from the same checkpoint as the model we’ll load in the next step so the two always match. We’re then calling the tokenizer on the question and the passage together. The padding=True and truncation=True arguments tell the tokenizer to pad or truncate the inputs as needed, and the return_tensors='pt' argument returns PyTorch tensors.
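It can be instructive to decode the encoded input back to text; the tokenizer packs the question and the passage into a single sequence separated by special tokens:
print(tokenizer.decode(encoded_inputs['input_ids'][0]))
# e.g. "[CLS] Who designed and built the Eiffel Tower? [SEP] The Eiffel Tower is ... [SEP]"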
Step 3: Load Model and Get Answer
Next, we need to load our pre-trained DistilBERT model and use it to get the answer to our question. Here’s an example of how to do this:
from transformers import AutoModelForQuestionAnswering
import torch

model = AutoModelForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
with torch.no_grad():  # inference only, so no gradients are needed
    output = model(**encoded_inputs)
start_scores, end_scores = output.start_logits, output.end_logits
start_index = start_scores.argmax(dim=-1).item()
end_index = end_scores.argmax(dim=-1).item()
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(encoded_inputs['input_ids'][0][start_index:end_index + 1]))
print(answer)
In this example, we’re using the AutoModelForQuestionAnswering class from the Transformers library to load a DistilBERT model fine-tuned on the SQuAD question-answering dataset. The output.start_logits and output.end_logits attributes hold the model’s score for each token being the start or end of the answer. We’re then using the argmax() method to get the index of the highest-scoring start and end token, and finally using the convert_ids_to_tokens() and convert_tokens_to_string() methods from the tokenizer to turn that token span back into a string.
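If you don’t need token-level control, the same result can be had with far less code via the pipeline API; this is an equivalent sketch of the steps above, not a different method:
from transformers import pipeline

# The pipeline handles tokenization, inference, and span decoding for us.
qa = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')
result = qa(question=question, context=context)
print(result['answer'], result['score'])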
Limitations of using Hugging Face:
Firstly, Hugging Face models are pre-trained on large datasets and may not perform as well on specific tasks or with certain types of data. Fine-tuning is often required to improve model performance for specific use cases, which can be time-consuming and require significant computational resources.
Secondly, while the Transformers library itself is open-source and available to the community, many models hosted on the Hugging Face Hub are published without full details of their training data and procedure. This can limit the ability of researchers and developers to fully understand how those models were trained and to modify them to suit their specific needs.
Finally, Hugging Face models are only as good as the data they are trained on. If the data used to train a model is biased or limited, this can impact the accuracy and fairness of the model’s predictions. It’s important to carefully consider the data used to train models and ensure that it is diverse and representative.
Conclusion
Hugging Face is a powerful tool for building and integrating NLP models into your applications. In this guide, we covered two examples of using Hugging Face: text classification and question answering. Keep in mind that the specific steps and code will vary depending on the pre-trained model and NLP task, so it’s important to refer to the documentation for your specific use case. With Hugging Face, you can easily implement and fine-tune NLP models to suit your needs. We hope this guide has helped you get started with using Hugging Face for NLP tasks.
Additional Resources
If you want to learn more about Hugging Face and NLP, here are some additional resources to check out:
- Transformers documentation: https://huggingface.co/docs/transformers
- Datasets documentation: https://huggingface.co/docs/datasets
- The Hugging Face NLP course: https://huggingface.co/learn/nlp-course
Please let me know in the comment section if you have any comments or questions.