Text Summarization with BART and T5 models

April 02, 2020    Text Summarization BART T5 HuggingFace

I use the HuggingFace Transformers pipeline to summarize a Wikipedia page, and the results are mind-blowing. The pipeline uses models that have already been fine-tuned on a summarization task, namely 'bart-large-cnn' and 't5-large'. The maximum length of the generated summary is set to 150 tokens.

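BART and T5 are both pretrained encoder-decoder (sequence-to-sequence) models: BART is trained as a denoising autoencoder, and T5 casts every task as text-to-text. That makes both well suited to conditional generation tasks such as summarization.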


from transformers import pipeline
from bs4 import BeautifulSoup
import requests
import re

wiki_data = requests.get("https://en.wikipedia.org/wiki/Coronavirus").text
soup = BeautifulSoup(wiki_data, 'lxml')

data = []
for k in soup.select('p'):
    # strip bracketed and parenthesized spans, e.g. citation markers like [1]
    data.append(re.sub(r"[\(\[].*?[\)\]]", "", k.getText()))

# join the paragraphs, drop leftover special characters, and add a space after each period
data = ''.join(data)
spchar_list = ['\n', '/', '\\', '[', ']']
data = data.translate({ord(x): '' for x in spchar_list})
data = data.replace(".", ". ")
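
The cleaning steps above can be tried without fetching the page. The following is a minimal sketch, assuming a hypothetical helper named clean_paragraphs (it is not part of the original script) and two made-up sample paragraphs in place of the Wikipedia text:

```python
import re

# Hypothetical helper (not in the original post): the same cleaning steps,
# wrapped in a function so they can be tried on plain strings.
SPECIAL_CHARS = ['\n', '/', '\\', '[', ']']

def clean_paragraphs(paragraphs):
    """Strip bracketed/parenthesized spans, join, and normalize the text."""
    stripped = [re.sub(r"[\(\[].*?[\)\]]", "", p) for p in paragraphs]
    text = ''.join(stripped)
    text = text.translate({ord(c): '' for c in SPECIAL_CHARS})
    return text.replace(".", ". ")

# Made-up sample paragraphs standing in for the scraped <p> elements.
sample = [
    "Coronaviruses are a group of viruses.[1]\n",
    "They infect mammals (including humans) and birds.\n",
]
cleaned = clean_paragraphs(sample)
print(cleaned)
```

Note that the regex removes parenthetical text as well as citation markers, which can leave a double space behind, and that the final replace also splits decimals (for example, "3.5" becomes "3. 5"). For a quick summarization demo that is probably acceptable.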


# note: newer transformers releases may require the namespaced id "facebook/bart-large-cnn"
smr_bart = pipeline(task="summarization", model="bart-large-cnn")
smbart = smr_bart(data, max_length=150)
print(smbart[0]['summary_text'])

=> """
Coronaviruses are enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry. The name coronavirus is derived from the Latin corona, meaning "crown" or "halo", which refers to the characteristic appearance reminiscent of a crown or a solar corona around the virions. In humans, coronavirus cause respiratory tract infections that can be mild, such as some cases of the common cold. In chickens, they cause an upper respiratory tract disease, while in cows and pigs they cause diarrhea.
"""

# framework="tf" loads the TensorFlow version of the model, so TensorFlow must be installed
smr_t5 = pipeline(task="summarization", model="t5-large", framework="tf")
smt5 = smr_t5(data, max_length=150)
print(smt5[0]['summary_text'])

=> """
coronaviruses are a group of related viruses that cause diseases in mammals and birds . they are enveloped viruses with a positive-sense single-stranded RNA genome . there are yet to be vaccines or antiviral drugs to prevent or treat infections .
"""

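To compare the two summaries more concretely, one option is to look at how much each model compresses the input. The following is a rough sketch, assuming a hypothetical helper named summary_stats (not part of the original script); toy strings stand in for the real text so the snippet runs on its own:

```python
def summary_stats(source, summary):
    """Return word counts and the summary-to-source length ratio."""
    n_source = len(source.split())
    n_summary = len(summary.split())
    # Guard against an empty source to avoid dividing by zero.
    ratio = n_summary / n_source if n_source else 0.0
    return {"source_words": n_source, "summary_words": n_summary, "ratio": ratio}

# Toy strings stand in for `data` and the pipeline outputs above.
toy_source = "one two three four five six seven eight nine ten"
toy_summary = "one two three"
stats = summary_stats(toy_source, toy_summary)
print(stats)
```

With the variables from the script, the calls would be summary_stats(data, smbart[0]['summary_text']) and summary_stats(data, smt5[0]['summary_text']). The counts here are whitespace-separated words, not model tokens, so they are not directly comparable to max_length=150.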
