Complete Guide to Building an AI Summarizer with Python

AI summarizers condense long passages into short, readable overviews. Under the hood, they rely on Natural Language Processing (NLP) models, and Python is the most popular language for building them.

However, developing such a tool on your own can be a difficult task: you need to know which libraries are required and how to structure and execute the code.

Thankfully, you've landed on the right article to learn the entire process of developing an AI summarizer model with Python. We will try to cover each aspect in detail so you don't have to worry about anything. Let's get started.


Understanding Text Summarization

Mainly, there are two types of text summarization techniques: extractive and abstractive. In extractive summarization, a few key sentences are picked verbatim from the original passage and stitched together into a boiled-down version.

The revised version still carries the same meaning as the source, but with far fewer words and less redundancy.

Abstractive summarization works completely differently. It requires understanding the context of the original work and restating it in one's own words, which is how humans normally write summaries. For example, given "The meeting, which ran for three hours, ended without reaching a decision," an extractive summarizer would reuse that sentence verbatim, while an abstractive one might write "The lengthy meeting was inconclusive."

The text summarizer we'll develop today produces abstractive summaries, because the pre-trained transformer model running at the backend generates new sentences rather than copying them from the input, which gives the output a more human touch.


Setting up the Environment

To begin, you'll need to set up a virtual environment. This keeps the project's packages isolated from your global Python installation and avoids version conflicts with other projects.

Start by opening the command prompt with administrative privileges. Then, change the directory to where you'll save the project files and run the following commands.

python -m venv text_summarization
text_summarization\Scripts\activate

The first command creates the virtual environment, and the second activates it. Now, open the project in a suitable IDE, such as Atom, Spyder, or Visual Studio Code, to work inside the newly created text_summarization environment.
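Note that the activation command above is for Windows. On macOS or Linux, the equivalent commands would be:

python3 -m venv text_summarization
source text_summarization/bin/activate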


Installing Libraries

After you're done with setting up the basics, it's time to download and install the required libraries.

To develop the AI summarizer model, enter the following line in your command prompt (with the virtual environment still active).

pip install pandas transformers torch sentencepiece

This single command installs all the required libraries onto your PC or laptop: pandas for loading the dataset, transformers for the pre-trained T5 model, torch as the deep learning backend, and sentencepiece for the subword tokenizer that T5 depends on.
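If you want to confirm that everything installed correctly, a quick sanity check is to print the library versions from a Python shell (the exact version numbers on your machine will differ):

import pandas
import torch
import transformers

print("pandas:", pandas.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)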


Importing Libraries

Moving on, open your IDE and import the required libraries into the workspace using the following code.

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # recent versions of transformers removed their AdamW, so we use torch's
from transformers import T5ForConditionalGeneration, T5Tokenizer

The Pandas library will handle the dataset inputs, while Hugging Face's pre-trained T5 transformer will help us quickly get accurate summarization results without training a model from scratch.

Importing everything up front ensures that all your tools are at your disposal, so we don't face import errors later in the execution phase.


Loading the Dataset for Model Training

Once you've imported the libraries, the next step is to load and preprocess the data for the abstractive summarization process. To do so, follow the code given below.

# Paths to your dataset
train_file = 'C:/Users/Common/Desktop/train.csv'
test_file = 'C:/Users/Common/Desktop/test.csv'
val_file = 'C:/Users/Common/Desktop/validation.csv'

# Load datasets
train_data = pd.read_csv(train_file, usecols=['article', 'highlights'])
test_data = pd.read_csv(test_file, usecols=['article', 'highlights'])
val_data = pd.read_csv(val_file, usecols=['article', 'highlights'])

# Ensure that the columns are correct
assert 'article' in train_data.columns and 'highlights' in train_data.columns, "Required columns not found in the train dataset."

Here, we're using the CNN/DailyMail dataset as input. Its 'article' and 'highlights' columns give the model plenty of article-summary pairs to train on and produce valid results.

We've hard-coded the file paths to our Desktop directory. You can also take the paths dynamically from users, which requires only a small change to the program, as shown below.
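As a rough sketch of what that change could look like, you could prompt for the file paths instead of hard-coding them (the prompt strings below are purely illustrative):

# Ask the user for the dataset locations instead of hard-coding them
train_file = input("Path to training CSV: ").strip()
test_file = input("Path to test CSV: ").strip()
val_file = input("Path to validation CSV: ").strip()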


Tokenizing the Dataset

After you're done getting the dataset in the workspace, it is time to initialize and run the transformer for the tokenization process. Start by inputting the following lines of code in the IDE.

# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Function to tokenize data
def tokenize_data(texts, max_length=512):
    return tokenizer(texts, max_length=max_length, truncation=True, padding="max_length", return_tensors="pt")

# Custom dataset class
class TextSummaryDataset(Dataset):
    def __init__(self, articles, summaries):
        self.articles = articles
        self.summaries = summaries

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, idx):
        article = self.articles.iloc[idx]
        summary = self.summaries.iloc[idx]

        # Tokenize the inputs and outputs (article and summary)
        encodings = tokenizer(article, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
        labels = tokenizer(summary, max_length=150, truncation=True, padding="max_length", return_tensors="pt")

        # Flatten the tensors (get rid of the extra batch dimension)
        encodings = {key: val.squeeze(0) for key, val in encodings.items()}
        labels = {key: val.squeeze(0) for key, val in labels.items()}

        encodings["labels"] = labels["input_ids"]
        return encodings

Using the above code, you initialize the model and tokenizer and define how each article-summary pair is tokenized. Wrapping the tokenizer call in a helper function and a custom Dataset class keeps the code readable and ensures that every item comes back as the tensors the model expects: input_ids, attention_mask, and labels.
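To see what the dataset class actually returns, you can inspect a single item; assuming train_data is already loaded, this should print the tensor shapes:

# Quick sanity check on one sample (assumes train_data is loaded)
sample = TextSummaryDataset(train_data['article'], train_data['highlights'])[0]
for key, tensor in sample.items():
    print(key, tuple(tensor.shape))
# Expected output: input_ids (512,), attention_mask (512,), labels (150,)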


Creating Dataset Objects

The next step is to create dataset objects and prepare the data loader for the validation loop. This will establish the AI summarizer model, which we will test later on.

# Create dataset objects
train_dataset = TextSummaryDataset(train_data['article'], train_data['highlights'])
test_dataset = TextSummaryDataset(test_data['article'], test_data['highlights'])
val_dataset = TextSummaryDataset(val_data['article'], val_data['highlights'])

# DataLoader: decrease batch size to reduce memory usage
batch_size = 4  # Reduce this if you run into memory issues
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Check if a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)  # Lower learning rate for stability

We've currently kept the batch size at 4; you can decrease this value if your machine runs out of memory. A smaller batch size lowers memory usage, though each epoch takes longer to run, and it shouldn't meaningfully change the quality of the output.

Along with the dataset objects and data loaders, you'll also need to initialize the optimizer; it will update the model's weights during the training loop in the next step.
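As an aside, if lowering the batch size alone isn't enough for your hardware, gradient accumulation is a common alternative: it processes several small batches before each optimizer step, keeping the effective batch size while using less memory. A minimal sketch of the idea (not part of the main walkthrough) looks like this:

# Sketch: accumulate gradients over several small batches before stepping
accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    outputs = model(input_ids=batch['input_ids'].to(device),
                    attention_mask=batch['attention_mask'].to(device),
                    labels=batch['labels'].to(device))
    (outputs.loss / accumulation_steps).backward()  # scale the loss before backprop
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()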


Running a Training Loop

We must run a training loop with the data loaders and objects created in the previous step. This step is essential and cannot be skipped.

# Training loop
epochs = 3
model.train()

for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()

        # Move the batch to the correct device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Print the average loss for each epoch
    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Training Loss: {avg_train_loss}")

Here, the outer loop runs for 3 epochs, meaning the model sees the entire training set three times. Within each epoch, every batch goes through a forward pass (to compute the loss) and a backward pass (so the optimizer can update the weights).

You can reduce the number of epochs to cut training time, but expect some drop in summary quality.

The next part is, logically, running a validation loop to see how well your text summarizer model handles unseen data. To keep things simple, we'll skip it in the main walkthrough, but a minimal sketch is included below if you want to add it.
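The sketch reuses the val_loader created earlier and simply averages the loss over the validation set without updating any weights:

# Optional: validation loop (no gradient updates)
model.eval()
val_loss = 0
with torch.no_grad():
    for batch in val_loader:
        outputs = model(input_ids=batch['input_ids'].to(device),
                        attention_mask=batch['attention_mask'].to(device),
                        labels=batch['labels'].to(device))
        val_loss += outputs.loss.item()
print(f"Validation Loss: {val_loss / len(val_loader)}")
model.train()  # switch back to training mode if you train further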


Extracting Results

The final step in the entire procedure is to define a summary-generating function that will give us the output.

# Function to generate summaries
def generate_summary(text, max_length=150):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
    summary_ids = model.generate(inputs.input_ids, max_length=max_length, length_penalty=1.5, num_beams=6, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Test with an example from the test set
test_article = test_data['article'].iloc[0]
print("Original Article: ", test_article)
print("Generated Summary: ", generate_summary(test_article))

When the model is done training, these lines of code print the original article along with its generated summary.

And that is pretty much it for the entire development process. You can also take user input to generate summaries, as shown below, and play around with the base version of the code until you find the perfect fit for your needs.
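For instance, a simple interactive loop built on the generate_summary function defined above could look like this (the prompt text is just an example):

# Simple interactive loop: paste text, get a summary back
while True:
    user_text = input("Paste text to summarize (or 'quit' to exit): ").strip()
    if user_text.lower() == 'quit':
        break
    print("Summary:", generate_summary(user_text))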


A Deployed AI Text Summarizer

In our development process, we didn't deploy the AI summarizer on the internet. However, many online tools today do make their models available to web users.

A case in point is AI Summarizer, which is worth mentioning due to its high operational efficiency and better training parameters.

The reason for mentioning this tool here is to show you how an industry-grade AI text summarizer works, and to serve as motivation for creating a highly accurate model with an attractive UI.

Just like the mentioned tool, you can also add features like 'Show Bullets' to your model to create a one-stop solution for users looking to make information concise; a rough sketch of the idea follows below. A production version would require more coding and deployment effort on both the front end and the back end.
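As a simple illustration of how a 'Show Bullets' option might work on the backend, you could split the generated summary into sentences and prefix each with a bullet. This naive version splits on periods, so a real implementation would want proper sentence segmentation:

# Naive 'Show Bullets' helper: one bullet per sentence
def to_bullets(summary):
    sentences = [s.strip() for s in summary.split('.') if s.strip()]
    return '\n'.join(f"- {s}." for s in sentences)

print(to_bullets(generate_summary(test_article)))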


Conclusion

AI summarizers are essential tools that condense content effectively using NLP models and Python. Developing one independently can be challenging, but this guide covers everything you need to get started.

The process begins with setting up the environment, then installing and importing the required libraries, and then preparing the data alongside the model and optimizer setup.

Finally, creating and running a summary-generating function provides the user with the original article along with its shortened text.