Improving Punctuation Restoration for Speech Transcripts via External Data
Deep Learning Explainer
The paper proposes a data sampling technique and a two-stage fine-tuning approach, which let you sample external training data similar to our in-domain ASR transcripts and improve model performance on punctuation restoration.
0:00 - How to make a model more accurate
1:02 - I published a paper
3:05 - Punctuation restoration
5:32 - In-domain data
7:29 - Annotated data is expensive
8:47 - Opensubtitles
10:04 - Data sampling via LM
11:34 - Two-stage fine-tuning
14:55 - Layer reduction
16:49 - Takeaway
18:10 - EMNLP 2021
Connect
LinkedIn: https://www.linkedin.com/in/xue-yong-fu-955723a6/
Twitter: https://twitter.com/home
Email: edwindeeplearning@gmail.com
Paper: Improving Punctuation Restoration for Speech Transcripts via External Data
https://arxiv.org/abs/2110.00560?context=cs
Abstract
Automatic Speech Recognition (ASR) systems generally do not produce punctuated transcripts. To make transcripts more readable and follow the expected input format for downstream language models, it is necessary to add punctuation marks. In this paper, we tackle the punctuation restoration problem specifically for the noisy text (e.g., phone conversation scenarios). To leverage the available written text datasets, we introduce a data sampling technique based on an n-gram language model to sample more training data that are similar to our in-domain data. Moreover, we propose a two-stage fine-tuning approach that utilizes the sampled external data as well as our in-domain dataset for models based on BERT. Extensive experiments show that the proposed approach outperforms the baseline with an improvement of 1.12% F1 score. ...

https://www.youtube.com/watch?v=jxOpu4hXPJY
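To make the data sampling idea concrete, here is a minimal sketch of scoring external sentences with an n-gram language model trained on in-domain text and keeping the most similar ones. The bigram order, add-one smoothing, and keep-fraction are illustrative assumptions, not the paper's actual settings.

```python
# Sketch: sample external data that an in-domain n-gram LM finds familiar.
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenized in-domain sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Per-token perplexity under the bigram LM with add-one smoothing."""
    padded = ["<s>"] + tokens + ["</s>"]
    vocab = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))

def sample_similar(external, unigrams, bigrams, keep_fraction=0.2):
    """Keep the external sentences with the lowest in-domain perplexity."""
    scored = sorted(external, key=lambda s: perplexity(s, unigrams, bigrams))
    return scored[: int(len(scored) * keep_fraction)]

# Usage: score OpenSubtitles-style lines against an LM built from transcripts.
in_domain = [["so", "how", "are", "you", "doing"], ["yeah", "that", "works"]]
external = [["hello", "how", "are", "you"], ["the", "treaty", "was", "ratified"]]
uni, bi = train_bigram_lm(in_domain)
print(sample_similar(external, uni, bi, keep_fraction=0.5))
```

In practice a smoothed higher-order LM (e.g., KenLM) over millions of OpenSubtitles lines would replace this toy bigram model; the selection logic stays the same.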
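And a minimal sketch of the two-stage fine-tuning recipe: stage 1 trains a BERT token classifier on the sampled external data, stage 2 continues from those weights on the in-domain transcripts. The label set, hyperparameters, and toy tensors below are assumptions standing in for the paper's real data and settings.

```python
# Sketch: two-stage fine-tuning of a BERT punctuation-restoration model.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForTokenClassification

LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # assumed punctuation tag set

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

def toy_dataset(n_examples, seq_len=32):
    """Stand-in for real tokenized data: ids, attention mask, per-token tags."""
    return TensorDataset(
        torch.randint(1000, 5000, (n_examples, seq_len)),      # input_ids
        torch.ones(n_examples, seq_len, dtype=torch.long),     # attention_mask
        torch.randint(0, len(LABELS), (n_examples, seq_len)))  # labels

def run_stage(dataset, epochs, lr):
    """One fine-tuning stage; the model object carries weights across stages."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            loss = model(input_ids=input_ids, attention_mask=attention_mask,
                         labels=labels).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Stage 1: the sampled external data (e.g., the OpenSubtitles subset).
run_stage(toy_dataset(64), epochs=1, lr=5e-5)
# Stage 2: the smaller in-domain ASR transcripts, at a lower learning rate.
run_stage(toy_dataset(16), epochs=1, lr=2e-5)
```

The point of the second stage is that the model keeps what it learned from the large external corpus while adapting to the conversational, noisy style of the in-domain transcripts.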