Jaewoo Song



For simple practice, we can download datasets provided by the framework itself, process them into loaders, and feed them into our models.

But in most cases, we use external datasets that we have to obtain ourselves.

Preprocessing external datasets with the PyTorch framework is quite simple.


As the above document says, if we set $x$ and $y$ to whatever we want and make __getitem__(index) and __len__() return the proper values, the data loader automatically batches the data into the shape the model expects, using the given batch size.

In other words, as long as the class returns the correct data for each index, the framework accepts it, so we can customize datasets very easily.
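The protocol described above can be sketched in a few lines. This is a toy example (the ToyDataset name and the random stand-in data are mine, not from the project), assuming PyTorch is installed:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A minimal custom dataset: each item is an (x, y) tensor pair."""
    def __init__(self, num_samples, seq_len):
        # random integer "token id" sequences standing in for real data
        self.x = torch.randint(0, 100, (num_samples, seq_len))
        self.y = torch.randint(0, 100, (num_samples, seq_len))

    def __getitem__(self, index):
        # return one (x, y) pair; the DataLoader takes care of batching
        return self.x[index], self.y[index]

    def __len__(self):
        return len(self.x)

loader = DataLoader(ToyDataset(8, 5), batch_size=4)
xb, yb = next(iter(loader))
print(xb.shape)  # torch.Size([4, 5])
```

The DataLoader only ever calls __len__() and __getitem__(index), so any class implementing these two methods works.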

This is the Korean dataset code I wrote.

import re

import torch
from torch.utils.data import Dataset

class KoreanDataset(Dataset):
    def __init__(self, data_path, data_list, opt):
        # store the option
        self.opt = opt

        text_list = []
        for data in data_list:
            with open(data_path + data, 'r', encoding='utf-8') as f:
                text = f.read()

            # each file holds a context and a comment separated by '<c>'
            if '<c>' in text:
                pair = text.split('<c>')
                pair_tuple = (re.sub(r'\s{2,}', repl=' ', string=pair[0]), \
                    re.sub(r'\s{2,}', repl=' ', string=pair[1]))
                text_list.append(pair_tuple)

        # cut off the remainder so the data size is a multiple of the batch size
        if len(text_list) % opt.batch_size != 0:
            r = len(text_list) % opt.batch_size
            text_list = text_list[:len(text_list)-r]

        self.x_data_list = []
        self.y_data_list = []
        for text_pair in text_list:
            context = text_pair[0]
            comment = text_pair[1]

            # tokenize() is defined elsewhere in the project
            x_data = [token for token in tokenize(context)]
            y_data = [token for token in tokenize(comment)]

            # pad (or truncate) each sequence to the fixed length;
            # '<pad>' is absent from the vocabulary, so it maps to index 0 below
            while len(x_data) < opt.seq_len:
                x_data.append('<pad>')
            x_data = x_data[:opt.seq_len]

            while len(y_data) < opt.seq_len:
                y_data.append('<pad>')
            y_data = y_data[:opt.seq_len]

            # map each token to its vocabulary index (0 for unknown tokens)
            x_data = [opt.vocab_ctoi[token] \
                if token in opt.vocab_ctoi else 0 for token in x_data]
            y_data = [opt.vocab_ctoi[token] \
                if token in opt.vocab_ctoi else 0 for token in y_data]

            self.x_data_list.append(torch.LongTensor(x_data))
            self.y_data_list.append(torch.LongTensor(y_data))

    def __getitem__(self, index):
        input = self.x_data_list[index]
        label = self.y_data_list[index]
        return input, label

    def __len__(self):
        return len(self.x_data_list)
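Incidentally, the manual trimming in __init__ (cutting the data down to a multiple of the batch size) can also be delegated to the DataLoader itself via drop_last=True. A small sketch with a hypothetical TinyDataset:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    """Ten scalar items, just to demonstrate batching behavior."""
    def __init__(self, n):
        self.items = torch.arange(n)

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)

# 10 samples with batch_size=4: drop_last=True discards the final partial batch
batches = list(DataLoader(TinyDataset(10), batch_size=4, drop_last=True))
print(len(batches))  # 2
```

This keeps the dataset class itself independent of the batch size.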

As you can see from the above code, there is nothing special.

I just tokenized the original data, split it into input ($x$) and output ($y$), and made them into tensors stored sequentially in lists.

Then it is enough to make __len__() return the length and __getitem__(index) return the $x$, $y$ pair for each index.

Actually, I'm not sure whether the data has to be converted into tensors before going into the loader.

In my personal opinion, it is easier to work with later if we make the data into tensors in advance.
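For what it's worth, the DataLoader's default collate function converts plain Python numbers into tensors on its own, so returning tensors from __getitem__ appears to be a convenience rather than a requirement. A small sketch with a hypothetical SquareDataset:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquareDataset(Dataset):
    """__getitem__ returns plain Python ints, not tensors."""
    def __getitem__(self, index):
        return index, index * index

    def __len__(self):
        return 4

xb, yb = next(iter(DataLoader(SquareDataset(), batch_size=4)))
# the default collate function stacks the ints into LongTensors
print(yb)  # tensor([0, 1, 4, 9])
```

Still, converting in advance makes the item types explicit and avoids surprises at collation time.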


This is a link to the original GitHub repository I forked.

It is a project about writing Korean relay novels.