
Using BigEarthNet to Develop an Image Classification Model pt. 2

Hello! This is the follow-up to my last post. Over the past little while I have learned some new things and seriously changed my preprocessing setup. Initially I thought I would simply edit the original post, but I figured why not show my errors instead.

Furthermore, while reviewing deep neural networks I have gained a renewed appreciation for linear algebra and calculus. I plan to write about that in the coming weeks, so be ready to see some of those posts coming out!

This post will illustrate how I rejigged my data processing and will look at the start of my dive into building a multi-label image classification model to identify BigEarthNet images. In the coming weeks I will publish a compact version of this whole project that goes through the exact steps and the intuition behind the decisions I've made, but for now this is just an update, and it will go into the specifics of each operation I perform.

Preprocessing

As I mentioned, the preprocessing method I developed in the last post wasn't very good. In fact, it glossed over a big idea: I need to make the data easy for Keras to understand (Keras is fairly rigid about input formats), not just easy for me to understand. In my last custom data loader I was generating a tuple containing a matrix with the image itself and a list with the labels for that image. Unfortunately, Keras doesn't consume data loaders that way, so I had to change it up, but this was a blessing in disguise since I was able to clean up the code.

I didn't mention this in the last post but it's worth mentioning now: the reason we have to create a data generator rather than simply loading everything into a massive dictionary or list is that the dataset is far too large to fit in memory. A generator instead makes the data accessible on demand: each time it is called, it loads and returns only a small subset (a batch) of the full dataset.
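As a rough illustration of that idea (not the loader we will actually use, which comes later as a Keras Sequence), a plain Python generator only produces one batch at a time:

# A toy sketch of on-demand loading; toy_batch_generator is just a made-up example
def toy_batch_generator(image_ids, batch_size=32):
    for start in range(0, len(image_ids), batch_size):
        batch_ids = image_ids[start:start + batch_size]
        # In the real loader each ID would be read from disk and turned into
        # an image array right here, then yielded as a ready-to-use batch
        yield batch_ids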

To avoid having a huge wall of code I’m going to be breaking up the code into smaller chunks so it can be a bit more accessible. Overall, the preprocessing consists of three sections: a method that gets all of the labels; a method that splits the data into training, validation, and test sets; and the data generator itself.

import os
import json

import numpy as np
from PIL import Image
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow import keras  # or `import keras`, depending on your install

def label_finder(data_dir):
    """Collect the unique label names used across the dataset."""
    labels = []
    for subdir in os.listdir(data_dir):
        # BigEarthNet has 43 classes, so we can stop once we've seen them all
        if len(labels) == 43:
            break
        subdir_path = os.path.join(data_dir, subdir)
        if os.path.isdir(subdir_path):
            json_path = os.path.join(subdir_path, subdir + '_labels_metadata.json')
            # Load this patch's labels from its JSON metadata file
            with open(json_path) as json_file:
                json_data = json.load(json_file)
                json_labels = json_data['labels']
            for label in json_labels:
                if label not in labels:
                    labels.append(label)
    return labels

unique_labels = label_finder('BigEarthNet-v1.0')
unique_labels.sort()

# Initialize MultiLabelBinarizer to convert lists of labels to binary vectors
mlb = MultiLabelBinarizer(classes=unique_labels)

# Fit the binarizer on the full set of unique labels so it can be reused
# later in the data generator
mlb.fit([unique_labels])

def get_json(data_dir, folder):
    """Return the list of labels for a single image folder."""
    subdir = os.path.join(data_dir, folder)
    json_path = os.path.join(subdir, folder + '_labels_metadata.json')

    with open(json_path) as json_file:
        json_data = json.load(json_file)
        json_labels = json_data['labels']

    return json_labels

Let's walk through this code step by step. The first method gets the names of all the unique labels. These are things like “Sea and Ocean”, “Rice Fields”, “Water Bodies”, etc. From there we sort the list into alphabetical order to keep it consistent, and then use sklearn's MultiLabelBinarizer to create a one-hot encoding of the labels, which will be used in our data generator. The last bit of code is the get_json method, a helper that returns all the labels for an image given its folder name. This method is really just there to clean things up later on in the data generator.
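To make the encoding concrete, here is a small sketch (my own illustration, not part of the pipeline itself) of what the fitted binarizer produces for a single image's label list:

# Illustrative only: pull two known label names straight from unique_labels
sample_labels = [unique_labels[0], unique_labels[5]]
encoded = mlb.transform([sample_labels])

print(encoded.shape)  # (1, 43): one row, one column per possible class
print(encoded.sum())  # 2: a 1 in each column corresponding to a present label

Each image's label list becomes a 43-element binary vector, which is exactly the format the y array in the data generator will expect.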

On to the next stage – splitting the data:

def split_test_set(data_length, validation_percent, test_percent):
    """Randomly split indices into test, validation, and training sets."""
    shuffled_indices = np.random.permutation(data_length)
    
    test_set_size = int(data_length * test_percent)
    validation_set_size = int(data_length * validation_percent)
    
    test_set_indices = shuffled_indices[:test_set_size]
    validation_set_indices = shuffled_indices[test_set_size:validation_set_size + test_set_size]
    train_set_indices = shuffled_indices[validation_set_size + test_set_size:]
    return (np.sort(test_set_indices), np.sort(validation_set_indices), np.sort(train_set_indices))

def dictionary_creator(data_dir):
    """Build the train/validation/test partition and the image-to-labels map."""
    dataset = os.listdir(data_dir)
    
    test_indices, valid_indices, train_indices = split_test_set(len(dataset), 0.1, 0.01)
    # Sets make the membership checks below much faster than numpy arrays
    test_indices, valid_indices = set(test_indices), set(valid_indices)
    
    partition = {'test': [], 'validation': [], 'train': []}
    labels = {}
    
    for i, image in enumerate(dataset):
        if image == '.DS_Store':  # skip macOS metadata files
            continue
        if i in test_indices:
            partition['test'].append(image)
        elif i in valid_indices:
            partition['validation'].append(image)
        else:
            partition['train'].append(image)
            
        labels[image] = get_json(data_dir, image)
        
    return partition, labels

From here on I drew a lot of inspiration from a Stanford blog post, which you can find here. These two methods build dictionaries that divide the data into training, validation, and test sets, and link each image with its respective labels. This makes the data easy to access during the data generation process.
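Here is a quick sanity check (just an illustration of the structures, with the printed values left generic) showing what the two dictionaries hold once they're built:

partition, labels = dictionary_creator('BigEarthNet-v1.0')

# Roughly 89% train / 10% validation / 1% test with the percentages above
print(len(partition['train']), len(partition['validation']), len(partition['test']))

first_id = partition['train'][0]
print(first_id)          # a BigEarthNet patch folder name
print(labels[first_id])  # that patch's list of labels from its metadata JSON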

Finally, we get to our data generator itself:

class DataGenerator(keras.utils.Sequence):
    'Generates batches of BigEarthNet RGB images and their one-hot labels for Keras'
    
    def __init__(self, list_IDs, labels, batch_size=32, dim=(120, 120), n_channels=3,
             n_classes=43, shuffle=True):
        # Initialization
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()
    
    def on_epoch_end(self):
        # Updates (and optionally shuffles) indexes after each epoch
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        # Generates data containing batch_size samples # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size, self.n_classes))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample: load the three RGB bands (B02, B03, B04) for this patch
            subdir_path = os.path.join('BigEarthNet-v1.0/', ID)
            # Sorting keeps the band order consistent from patch to patch
            rgb_files = sorted(f for f in os.listdir(subdir_path)
                               if f.endswith(('B02.tif', 'B03.tif', 'B04.tif')))
            rgb_images = []
            for filename in rgb_files:
                image_path = os.path.join(subdir_path, filename)
                with Image.open(image_path) as img:
                    rgb_images.append(np.array(img))

            # Stack the bands into one (120, 120, 3) image and scale by the max pixel value
            combo_image = np.stack(rgb_images, axis=-1)
            combo_image = combo_image / 20566.0
            X[i] = combo_image

            # Store class: one-hot encode this patch's labels
            y[i] = mlb.transform([self.labels[ID]])
            
        return X, y
    
    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))
    
    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        
        return X, y

This class is what will later get fed into Keras and will allow us to train and test our model! Starting from the top, I initialize the class with the __init__ method. This method is mandatory when creating classes and simply sets up the object. The on_epoch_end method reshuffles the indices so that we generate novel batches every epoch; it's an easy helper that keeps the model from seeing the exact same batch composition over and over, which helps it generalize.

The __data_generation method is the big dog in this whole class because it is what actually returns our data. It goes through our dataset and creates two arrays that are the length of our batch: array X, which holds the images, and array y, which holds the one-hot-encoded labels. The operation is pretty straightforward and basically iterates through the whole dataset batch by batch. You may notice there is a line where I divide the combo_image variable by 20566.0; this is a way of scaling (normalizing) the data. In general, you don't want your input features to span wildly different ranges, because features with larger numeric values can end up dominating the weights the model learns. In this case, I simply divided all of the pixel values by the maximum value that appears in the images, which happens to be 20566, so everything lands between 0 and 1.

The __len__ method returns the number of batches per epoch. Finally, the __getitem__ method is a getter that uses __data_generation to return a single batch. And voilà, we are finished with the preprocessing!
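To show how everything plugs together, here is a minimal sketch of how the generators would be handed to Keras once a model exists. The model itself is the subject of the next post, so it is only a placeholder here:

training_generator = DataGenerator(partition['train'], labels)
validation_generator = DataGenerator(partition['validation'], labels)

# model = ...  # built and compiled in the next post
# model.fit(training_generator,
#           validation_data=validation_generator,
#           epochs=10)

Because DataGenerator subclasses keras.utils.Sequence, it can be passed straight to model.fit, and Keras will request one batch at a time via __getitem__.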

In the next post I'm going to go through using the neural network to develop a model that accurately classifies our images! That post has taken me a while because I want to go through the technical background that drives all of the operations, which has required a bit more reading than I expected. But keep your eyes peeled!

Resources

Data:
https://bigearth.net/#downloads

Research Paper:
G. Sumbul, A. de Wall, T. Kreuziger, F. Marcelino, H. Costa, P. Benevides, M. Caetano, B. Demir, and V. Markl, “BigEarthNet-MM: A Large-Scale, Multi-Modal, Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval,” IEEE Geoscience and Remote Sensing Magazine, 2021, doi: 10.1109/MGRS.2021.3089174.

Additional Resources:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
https://builtin.com/data-science/when-and-why-standardize-your-data
https://towardsdatascience.com/introduction-to-data-preprocessing-in-machine-learning-a9fa83a5dc9d
https://www.simplilearn.com/data-preprocessing-in-machine-learning-article
