In 2018, Google researchers published the BERT (Bidirectional Encoder Representations from Transformers) model, which has since served as a starting point for hundreds of NLP (Natural Language Processing) experiments and derivative models. BERT was pre-trained on masked language modeling and next-sentence prediction, but its capabilities extend to common downstream NLP tasks such as language inference and text classification. Naralytics is a company that seeks to use natural language to assign the users who write a given text to one of multiple categories, a modified version of text classification. However, the texts that Naralytics draws from exceed the maximum sequence length of 512 tokens that BERT supports. This report therefore surveys several BERT derivatives that seek to address this limitation, and then implements a solution that addresses the concerns attached to this kind of model. A minimal illustration of the token-length constraint and one common workaround appears below.
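The sketch below is a hypothetical illustration of the constraint described above, not the pipeline implemented in the thesis: it splits a document longer than BERT's 512-token window into overlapping chunks using the Hugging Face `transformers` tokenizer. The model name (`bert-base-uncased`), the stride value, and the placeholder text are assumptions chosen for demonstration.

```python
# A minimal sketch (assumed setup, not the thesis's actual pipeline) showing
# BERT's 512-token limit and one common workaround: splitting a long document
# into overlapping chunks that each fit within the model's window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stand-in for a raw text entry far longer than 512 tokens.
long_text = "raw user text entry " * 500

# Tokenize with a sliding window: each chunk holds at most 512 tokens
# (including the [CLS]/[SEP] special tokens) and overlaps the previous
# chunk by `stride` tokens so context is not cut off at chunk boundaries.
encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,
    return_overflowing_tokens=True,
)

print(f"Document split into {len(encoded['input_ids'])} chunks")
# Each chunk can then be classified separately and the per-chunk
# predictions pooled (e.g., by majority vote or averaged logits).
```

Chunking is only one approach; derivative models such as Longformer and BigBird instead modify the attention mechanism to accept longer inputs directly, which is the family of solutions the report examines.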
Details
- Examining the usage of New NLP Techniques to process Raw Text Data Entries
- Ngo, Nicholas (Author)
- Carter, Lynn (Thesis director)
- Lee, Gyou-Re (Committee member)
- Barrett, The Honors College (Contributor)
- Computer Science and Engineering Program (Contributor)
- Economics Program in CLAS (Contributor)