# Data preparation in DeepPavlov
Learn how to read and prepare data for trainable components.

## Data
The DeepPavlov library can download and decompress data. For this purpose, use the `download_decompress` function from `deeppavlov.core.data.utils`. 
The following cell downloads the CoNLL-2003 data for the Named Entity Recognition (NER) task and puts it in the `data/` folder.

In [None]:
import deeppavlov
from deeppavlov.core.data.utils import download_decompress
download_decompress('http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz', 'data/')

### Parsing text data into a machine-readable dataset 

We will work with the CoNLL-2003 corpus, which contains news text annotated with NE tags. A typical file with NER data contains lines that pair a token (a word or punctuation symbol) with its tag, separated by whitespace. In many cases additional information, such as POS tags, is included. 

Documents are separated by lines **starting** with the **-DOCSTART-** token. Sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER

In this tutorial we will focus only on tokens and tags (the first and last elements of each line) and drop the POS and chunk information located between them.
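
The format described above can be parsed in a few lines. The following is a minimal sketch, not the library's reader: the function name `parse_conll` and the in-memory `example` string are assumptions made for illustration, and the real data should be read from a file opened with UTF-8 encoding.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) samples."""
    samples = []
    tokens, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('-DOCSTART-'):
            # Document boundary marker: it carries no token, so skip it
            continue
        if not line:
            # An empty line closes the current sentence, if there is one
            if tokens:
                samples.append((tokens, tags))
                tokens, tags = [], []
            continue
        # The first column is the token, the last column is the NE tag
        columns = line.split()
        tokens.append(columns[0])
        tags.append(columns[-1])
    if tokens:
        # Flush the last sentence when the text has no trailing empty line
        samples.append((tokens, tags))
    return samples


# Hypothetical in-memory example in the format shown above
example = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

samples = parse_conll(example)
print(samples)
```

Each sample is a tuple of two equally long lists, which is the structure the rest of the tutorial relies on.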

We start by building a class *NerDatasetReader* that provides functionality for reading the dataset. Its *read* method returns a dictionary with the fields *train*, *valid*, and *test*. Each field stores a list of samples. Each sample is a tuple of tokens and tags. Both tokens and tags are lists. The following example shows the structure that the *read* method should return:

    {'train': [(['Mr.', 'Dwag', 'are', 'derping', 'around'], ['B-PER', 'I-PER', 'O', 'O', 'O']), ....],
     'valid': [...],
     'test': [...]}

There are three separate parts in the dataset:
 - *train* data for training the model;
 - *valid* data for evaluation and hyperparameter tuning;
 - *test* data for final evaluation of the model.
 

Each of these parts is stored in a separate `.txt` file.


In [None]:
from pathlib import Path

class NerDatasetReader:
    def read(self, data_path):
        data_parts = ['train', 'valid', 'test']
        extension = '.txt'
        dataset = {}
        for data_part in data_parts:
            file_path = Path(data_path) / Path(data_part + extension)
            dataset[data_part] = self.read_file(str(file_path))
        return dataset
            
    @staticmethod
    def read_file(file_path):
        
        # Use utf-8 encoding when opening the file
        ######################################
        ########## YOUR CODE HERE ############
        ######################################
        return samples

In [None]:
dataset_reader = NerDatasetReader()

In [None]:
dataset = dataset_reader.read('data/')
assert len(dataset) == 3, 'The dataset must be a dict with three fields: train, test, and valid'
assert len(set(dataset) & {'train', 'test', 'valid'}) == 3, 'The dataset keys must be exactly train, test, and valid'
assert isinstance(dataset['train'][0][0][0], str) and isinstance(dataset['train'][0][1][0], str), 'Both tokens and tags must be strings'
assert len(dataset['train']) == 14041, 'there must be exactly 14041 samples in train'
assert len(dataset['valid']) == 3250, 'there must be exactly 3250 samples in valid'
assert len(dataset['test']) == 3453, 'there must be exactly 3453 samples in test'

You should always understand what kind of data you are dealing with. To inspect it, print a few samples by running the code in the following cell:

In [None]:
for sample in dataset['train'][:2]:
    for token, tag in zip(*sample):
        print('%s\t%s' % (token, tag))
    print()

The library contains a dataset reader that implements the same interface: [Conll2003DatasetReader](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/dataset_readers/conll2003_reader.py). It offers more functionality than the code above, and its `register` wrapper allows the component to be used as part of a config file (discussed later).

### Prepare dictionaries

To train a neural network, we will use two mappings: 
- {token}$\to${token id}: the token id addresses the row of the embedding matrix for the current token;
- {tag}$\to${tag id}: the tag id is used to create one-hot ground-truth probability distribution vectors for computing the loss at the output of the network.
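
To make the role of the two mappings concrete, here is a small sketch. The toy dictionaries `token2id` and `tag2id`, the embedding size, and the random embedding matrix are all assumptions chosen for illustration; real mappings are built from the training data.

```python
import numpy as np

tokens = ['EU', 'rejects', 'German', 'call']
tags = ['B-ORG', 'O', 'B-MISC', 'O']

# Hypothetical toy mappings; real ones are built from the training data
token2id = {'<UNK>': 0, 'EU': 1, 'rejects': 2, 'German': 3, 'call': 4}
tag2id = {'O': 0, 'B-ORG': 1, 'B-MISC': 2}

token_ids = [token2id.get(token, token2id['<UNK>']) for token in tokens]
tag_ids = [tag2id[tag] for tag in tags]

# Each token id selects one row of the embedding matrix
embedding_dim = 3
embeddings = np.random.rand(len(token2id), embedding_dim)
token_vectors = embeddings[token_ids]        # shape: (4, 3)

# Each tag id becomes a one-hot row: the ground-truth distribution
one_hot_tags = np.eye(len(tag2id))[tag_ids]  # shape: (4, 3)

print(token_ids)
print(one_hot_tags)
```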

Now implement the *Vocab* class, which maps {token or tag}$\to${index} and back. 
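
A two-way lookup has to decide, from the type of its input, which direction to map. The sketch below shows one possible way to do this on a toy vocabulary; the names `i2t`, `t2i`, `lookup`, and `lookup_batch` are hypothetical and are not part of the class below. Checking for `np.integer` as well as `int` is a design choice that keeps the lookup working when indices come from a NumPy array.

```python
import numpy as np

# Hypothetical toy vocabulary
i2t = ['<UNK>', 'EU', 'rejects']
t2i = {token: index for index, token in enumerate(i2t)}


def lookup(key):
    """Return an index for a token, or a token for an index."""
    if isinstance(key, (int, np.integer)):
        return i2t[key]
    if isinstance(key, str):
        # Unknown tokens fall back to the <UNK> index
        return t2i.get(key, t2i['<UNK>'])
    raise TypeError(f'Unsupported key type: {type(key)}')


def lookup_batch(batch):
    """Apply lookup to every item of every utterance in the batch."""
    return [[lookup(item) for item in utterance] for utterance in batch]


print(lookup_batch([['EU', 'rejects'], ['turnip']]))
print(lookup_batch([np.array([0, 2])]))
```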

In [None]:
from collections import defaultdict, Counter
from itertools import chain
import numpy as np

In [None]:
class Vocab:
    def __init__(self,
                 special_tokens=tuple()):
        self.special_tokens = special_tokens
        self._t2i = defaultdict(lambda: 0)  # unknown keys map to index 0, the first special token
        self._i2t = []
        
    def fit(self, tokens):
        count = 0
        self.freqs = Counter(chain(*tokens))
        # The first special token will be the default token
        for special_token in self.special_tokens:
            self._t2i[special_token] = count
            self._i2t.append(special_token)
            count += 1
        for token, freq in self.freqs.most_common():
            if token not in self._t2i:
                self._t2i[token] = count
                self._i2t.append(token)
                count += 1

    def __call__(self, batch, **kwargs):
        # Implement the vocab() method. The input could be a batch of tokens
        # or a batch of indices. A batch is a list of utterances where each
        # utterance is a list of tokens
        pass
        ######################################
        ########## YOUR CODE HERE ############
        ######################################

    def __getitem__(self, key):
        # Implement the vocab[] method. The key can be a token
        # (string) or an index. Detect the type of the key and
        # return the corresponding index or token.
        pass
        ######################################
        ########## YOUR CODE HERE ############
        ######################################
    
    def __len__(self):
        return len(self._i2t)


After implementing the *Vocab* class you can build vocabularies for tokens and tags. The special tokens in our case are:
 - `<UNK>` for out-of-vocabulary tokens;
 - `'O'` for the tag vocabulary, so that the outside tag (no entity) takes the first place with index 0.

In [None]:
special_tokens = ['<UNK>']
special_tags = ['O']

token_vocab = Vocab(special_tokens)
tag_vocab = Vocab(special_tags)

Now we will fit the vocabularies on the *train* part of the data.

In [None]:
all_tokens_by_sentences = [tokens for tokens, tags in dataset['train']]
all_tags_by_sentences = [tags for tokens, tags in dataset['train']]

token_vocab.fit(all_tokens_by_sentences)
tag_vocab.fit(all_tags_by_sentences)

assert len(token_vocab) == 23624, 'There must be exactly 23624 tokens in the token vocab!'
assert len(tag_vocab) == 9, 'There must be exactly 9 tags in the tag vocab!'

Try to get the indices. Keep in mind that we are working with batches of the following structure:
    
    [['utt0_tok0', 'utt0_tok1', ...], ['utt1_tok0', 'utt1_tok1', ...], ...]

In [None]:
indices_batch = token_vocab([['How', 'to', 'cook', 'a', 'turnip', '?']])

assert len(indices_batch) == 1, 'the batch length must be 1'
assert isinstance(indices_batch[0][0], int), 'The batch must contain lists of ints!'

print(indices_batch)

In [None]:
tag_indices_batch = tag_vocab([['O', 'O', 'O'], ['B-PER']])

assert len(tag_indices_batch) == 2, 'the batch length must be 2'
assert isinstance(tag_indices_batch[0][0], int), 'The batch must contain lists of ints!'

print(tag_indices_batch)

Now we will try converting from indices to tokens.

In [None]:
token_vocab([np.random.randint(0, 512, size=10)])

A similar vocabulary is already implemented in the [library](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/core/data/simple_vocab.py). It has extended functionality:
- token cutoff by frequency
- limitation of the vocabulary size
- saving and loading
- dict-like dunder methods (\_\_contains\_\_, \_\_len\_\_, etc.)
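
The first two features in this list can be illustrated with a short sketch. This is not the library's implementation: the function `build_vocab` and its parameters `min_freq` and `max_size` are hypothetical names used only to show the idea of a frequency cutoff and a size limit.

```python
from collections import Counter
from itertools import chain


def build_vocab(sentences, special_tokens=('<UNK>',), min_freq=1, max_size=None):
    """Build an index-to-token list with a frequency cutoff and a size limit."""
    freqs = Counter(chain.from_iterable(sentences))
    i2t = list(special_tokens)
    seen = set(i2t)
    for token, freq in freqs.most_common():
        if freq < min_freq:
            # most_common() is sorted by frequency, so the rest is rarer still
            break
        if max_size is not None and len(i2t) >= max_size:
            # The size limit counts special tokens too
            break
        if token not in seen:
            i2t.append(token)
            seen.add(token)
    return i2t


sentences = [['a', 'b', 'a'], ['a', 'c', 'b']]
print(build_vocab(sentences, min_freq=2))  # ['<UNK>', 'a', 'b']
print(build_vocab(sentences, max_size=2))  # ['<UNK>', 'a']
```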

### Dataset Iterator

Neural networks are usually trained on batches: each weight update is based on several sequences at once. You have to iterate over the dataset and generate `x` and `y` batch by batch. A batch of `x`-s is a list of tokenized sentences like

    [['Yan', 'is', 'a', 'good', 'fellow'],
     ['For', 'instance']]

and the corresponding batch of tag sequences is:

    [['B-PER', 'O', 'O', 'O', 'O'],
     ['O', 'O']]

An important concept in batch generation is shuffling: taking samples from the dataset in random order. Training on shuffled data is important because a large number of consecutive samples of the same class may result in poor model quality.
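
As an illustration, the sketch below shuffles a list of indices rather than the samples themselves, which is one common way to randomize order while leaving the stored dataset untouched. The toy `samples` list is an assumption, and the fixed seed is there only to make the sketch reproducible.

```python
import random

# Hypothetical toy dataset of (tokens, tags) samples
samples = [(['Yan', 'is', 'a', 'good', 'fellow'], ['B-PER', 'O', 'O', 'O', 'O']),
           (['For', 'instance'], ['O', 'O']),
           (['EU', 'rejects'], ['B-ORG', 'O'])]

# Shuffle the indices, not the data, so the original order stays intact
order = list(range(len(samples)))
random.Random(42).shuffle(order)  # fixed seed only for reproducibility

# Tokens and tags stay paired because whole samples are reordered
shuffled = [samples[i] for i in order]
print(order)
```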
    
The idea behind the iterator is to perform computation lazily. Use the `yield` keyword to create a generator. An example of using `yield` to create a generator is provided below:

In [None]:
def iterator():
    data = [1, 2, 3]
    for d in data:
        yield d
            
print(iterator)
    
for i in iterator():
    print(i)
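
The same `yield` pattern extends to batches. The following sketch, with the hypothetical function name `batches` and a toy `samples` list, slices a list of (tokens, tags) samples into chunks and splits each chunk into `x` and `y`. It leaves out shuffling and the choice of data part.

```python
def batches(samples, batch_size):
    """Lazily yield (x, y) batches from a list of (tokens, tags) samples."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        # zip(*chunk) separates the pairs into token lists and tag lists
        x, y = zip(*chunk)
        yield list(x), list(y)


# Hypothetical toy dataset of (tokens, tags) samples
samples = [(['Yan', 'is', 'a', 'good', 'fellow'], ['B-PER', 'O', 'O', 'O', 'O']),
           (['For', 'instance'], ['O', 'O']),
           (['EU', 'rejects'], ['B-ORG', 'O'])]

for x, y in batches(samples, batch_size=2):
    print(x, y)
```

No batch is built until the loop asks for it, and the last batch may be smaller than `batch_size`.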

Now create the `DatasetIterator`:

In [None]:
class DatasetIterator:
    def __init__(self, data):
        self.data = {
            'train': data['train'],
            'valid': data['valid'],
            'test': data['test']
        }

    def gen_batches(self, batch_size, data_type='train', shuffle=True):
        ######################################
        ########## YOUR CODE HERE ############
        ######################################


Create the dataset iterator from the loaded dataset:

In [None]:
data_iterator = DatasetIterator(dataset)

Try it out:

In [None]:
x, y = next(data_iterator.gen_batches(2))

assert len(x) == 2, 'There must be two examples in the batch!'
assert len(y) == 2, 'There must be two examples in the batch!'
assert len(x[0]) == len(y[0]), 'The numbers of tokens and tags are different!'
assert isinstance(x[0][0], str), 'Token must be a string!'

This is a typical part of the data preprocessing pipeline. These parts will be used in the following tutorials. 