Abstract

In such a paradigm, the role of data will be re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing.

A good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access. We achieve this by pre-training models over restructured data consisting of a variety of valuable information instead of raw data, after overcoming several engineering challenges.

<aside> 💡 How exactly is the data restructured?

</aside>
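These notes do not cover the restructuring recipe itself. As a rough, hypothetical sketch of the idea (the class, function, and prompt templates below are illustrative assumptions, not the paper's actual pipeline), one raw sentence plus an extracted signal could be rewritten into prompted (input, output) training pairs:

```python
# Minimal sketch: turn one raw sentence plus an extracted "signal" (here,
# named entities) into prompted input/output pairs that a text-to-text
# model could be pre-trained on. Names and templates are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Signal:
    kind: str                # e.g. "named_entity", "sentiment", "summary"
    value: Tuple[str, str]   # signal-specific payload; here (mention, entity type)


def restructure(sentence: str, signals: List[Signal]) -> List[Tuple[str, str]]:
    """Map (raw text, signals) to prompted (input, output) training pairs."""
    pairs = []
    for s in signals:
        if s.kind == "named_entity":
            mention, ent_type = s.value
            prompt = f'{sentence} What type of entity is "{mention}"?'
            pairs.append((prompt, ent_type))
        # ... other signal kinds would each get their own prompt template
    return pairs


raw = "Barack Obama was born in Hawaii."
sigs = [Signal("named_entity", ("Barack Obama", "person")),
        Signal("named_entity", ("Hawaii", "location"))]
for inp, out in restructure(raw, sigs):
    print(inp, "->", out)
```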

Hypothesis of NLP technique evolution


[Figure: hypothesis of NLP technique evolution]

1 Introduction

We argue that the ultimate goal of data storage is to better serve human life, and how data is accessed is as important as how it is stored. However, there are often differences in the way that data is stored and accessed.

The authors argue that the ultimate goal of storing data is to better serve people's lives, so how data is accessed is just as important as how it is stored.

Although prompting methods have narrowed the difference between data storage and access, they do not fundamentally eliminate the gap, as the way models store data in the pre-training stage is not transparent to diverse downstream tasks.

Although prompting methods have reduced the difference between storing and accessing data, they do not fundamentally eliminate the gap, because the way a model stores data during pre-training is not transparent to the various downstream tasks.

In other words, a downstream task does not know which method (i.e., which prompts) can better retrieve the desired data from the pre-trained model.

For example, in a sentiment classification task, to predict a sentence's sentiment with the help of a pre-trained model, we have to pick a question format the model is familiar with; however, the system designer does not know which question format the model prefers, because the distribution and structure of the pre-training data are not interpretable. The figure below illustrates this example:

[Figure: which question format does the pre-trained model prefer in the sentiment classification example?]
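A quick way to feel this prompt-format sensitivity is to probe a masked language model with different templates and compare the fillers it prefers. The snippet below is a toy probe, not anything from the paper; bert-base-uncased and the two templates are arbitrary choices:

```python
# Toy probe of prompt-format sensitivity: the same review, two different
# templates, possibly different answers from the same pre-trained model.
from transformers import pipeline  # assumes the Hugging Face `transformers` library

unmasker = pipeline("fill-mask", model="bert-base-uncased")

review = "The film was a complete waste of two hours."
templates = [
    f"{review} It was [MASK].",                    # template A
    f"{review} Overall, the movie felt [MASK].",   # template B
]

for t in templates:
    top = unmasker(t)[0]  # highest-scoring filler for the [MASK] slot
    print(f"{t!r} -> {top['token_str']} ({top['score']:.3f})")
```

Without knowing how the pre-training data was distributed, the system designer has no principled way to tell in advance which template the model will handle better.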

Methodologically, we present a new way to look at data that contains various types of information, which could be regarded as pre-training signals that can instruct models for parameter optimization. We structurally represent data in the unit of signals and claim that a good PLM should mark various signals during pre-training in a way that expected information could be accessed efficiently by downstream tasks.

The authors treat the different kinds of information contained in the data as pre-training signals that guide the model's parameter optimization, and represent the data structurally in units of signals.

A good PLM should mark the various kinds of signals during pre-training so that downstream tasks can efficiently access the information they need.
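One concrete (and again hypothetical) reading of this claim: give every signal kind a fixed prompt template, write the pre-training data with those templates, and have downstream tasks build their queries with the exact same templates, so the access format matches the storage format. The templates and helper below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of "store and access in the same format": each signal kind has a
# fixed prompt template; pre-training examples and downstream queries are
# both rendered from it, so access mirrors storage.
from typing import Tuple

TEMPLATES = {
    "sentiment": "Text: {text}\nQuestion: Is the sentiment positive or negative?\nAnswer:",
    "named_entity": 'Text: {text}\nQuestion: What type of entity is "{span}"?\nAnswer:',
}


def to_pair(kind: str, target: str, **fields) -> Tuple[str, str]:
    """Render one signal occurrence into a (prompt, target) pair."""
    return TEMPLATES[kind].format(**fields), target


# "Storage": signals are written into the pre-training data via the shared templates.
pretraining_data = [
    to_pair("sentiment", "positive", text="A delightful, moving film."),
    to_pair("named_entity", "location", text="She flew to Paris.", span="Paris"),
]

# "Access": a downstream sentiment task reuses the same template for its query,
# so it asks the model in exactly the format the information was stored in.
query, _ = to_pair("sentiment", "", text="The plot made no sense at all.")
print(query)
```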