Introduction to machine learning, neural networks and deep learning#

Objectives#

  • Understand the fundamental goals of machine learning and a bit of the field’s history

  • Gain familiarity with the mechanics of a neural network, convolutional neural networks, the U-Net architecture, and Transformers

  • Discuss considerations for choosing a deep learning architecture for a particular problem

What is Machine Learning?#

Machine learning (ML) is a subset of artificial intelligence (AI). In broad terms, AI refers to the simulation of human intelligence in machines to mimic human actions and reasoning. Machine learning narrows this scope by focusing on the ability of machines to learn and improve from experiences without being explicitly programmed.

https://human-centered.ai/wordpress/wp-content/uploads/2017/11/Deep-Learning-subset-of-Machine-Learning-subset-of-Artificial-Intelligence.jpg

Fig. 1 Figure from https://human-centered.ai#

In contrast to traditional programming paradigms, where explicit instructions are required for every possible outcome, machine learning offers several advantages:

  1. Efficiency: Reduces the time spent by programmers in writing exhaustive rules or instructions.

  2. Automation: Minimizes the need for manual interpretation or analysis of data.

  3. Reduction of human error: By learning from data, machines can avoid biases or errors introduced by humans.

  4. Scalability: ML models can handle large volumes of data and make decisions at a scale that would be impossible for humans.

However, human intervention is still essential in many steps of the machine learning process:

  1. Data Collection and Preparation: Humans are responsible for collecting and curating high-quality datasets, which are critical for training performant machine learning models.

  2. Algorithm Selection and Tuning: ML practitioners must develop data preprocessing pipelines, select suitable algorithms, and adjust parameters to optimize model performance.

  3. Model Evaluation and Selection: Humans play a key role in evaluating model performance, comparing different models, and selecting the best one.

  4. Deployment and Monitoring: After a model is trained, it needs to be deployed into production and monitored over time. Developing models that can remain relevant with changing data distributions, or developing new models to meet new challenges, is a significant engineering and research effort.

There are several subcategories of machine learning, but for analysis involving remote sensing data, we are typically concerned with two types:

  1. Supervised Learning: This approach involves training a model on labeled data. Here, the model learns to predict targets from features, based on example input-output pairs.

  2. Unsupervised Learning: In contrast, unsupervised learning models attempt to find inherent patterns in data without any labels or guidance. This is often used for clustering or anomaly detection tasks.

What is Deep Learning?#

Deep learning is a specialized subset of machine learning characterized by the use of artificial neural networks with several layers, hence the term “deep”. These deep neural networks attempt to simulate the behavior of the human brain (albeit far from matching its ability) in order to ‘learn’ from large amounts of data. In a neural network, additional layers can learn more complex relationships and make predictions with higher accuracy than simpler models. In an image recognition task, for instance, lower layers might identify edges, while deeper layers identify concepts relevant to a human, such as digits, letters, or faces. Once trained, the deepest layers of a network can infer highly abstract concepts, such as what differentiates a farm from a forest in satellite imagery.

Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention. It is responsible for advances in image recognition, speech recognition, natural language processing, and other AI fields.

Cost of deep learning

Deep learning requires a lot of data to learn and usually a significant amount of computing power. Developing datasets and training deep learning models can be expensive depending on the scope of the problem.

In the upcoming sections, we will dive deeper into the mechanics of deep learning architectures that are especially relevant for computer vision: neural networks, convolutional neural networks, the U-Net architecture, and transformers.

What are Neural Networks?#

Artificial neural networks (ANNs) are a specific, biologically-inspired class of machine learning algorithms. They are modeled after the structure and function of the human brain.

https://github.com/developmentseed/tensorflow-eo-training/blob/main/ds_book/docs/images/neuron-structure.jpg?raw=1

Fig. 2 Biological neuron (from https://training.seer.cancer.gov/anatomy/nervous/tissue.html).#

ANNs are essentially programs that make decisions by weighing the evidence and responding to feedback. By varying the input data, the types of parameters, and their values, we can obtain different models of decision-making.

https://miro.medium.com/max/1100/1*x6KWjKTOBhUYL0MRX4M3oQ.png

Fig. 3 Basic neural network from https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9.#

In network architectures, neurons are grouped in layers, with synapses traversing the interstitial space between neurons in one layer and the next.

Training and Testing Data#

The dataset (e.g. all images and their labels) is split into training, validation and testing sets. A common ratio is 70:20:10 percent, train:validation:test. If the split is random, it is important to check that all class labels exist in all sets and are well represented.
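As a minimal sketch (using a hypothetical toy dataset of 100 samples and 3 classes, not the datasets used later in this training), a random 70:20:10 split and a per-class check might look like this:

import numpy as np

# Toy example: 100 samples with hypothetical class labels 0, 1, or 2.
# In practice these would be image chips and their label masks.
rng = np.random.default_rng(42)
labels = rng.integers(0, 3, size=100)
indices = rng.permutation(len(labels))

train_idx = indices[:70]   # 70% training
val_idx = indices[70:90]   # 20% validation
test_idx = indices[90:]    # 10% testing

# After a random split, verify that every class appears in each subset
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, np.unique(labels[idx], return_counts=True))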

Important

Why do we need both validation and test data? Aren't they redundant? We need separate test data to evaluate the final performance of the model, because the validation data is used during training to measure error and therefore inform updates to the model parameters. The validation data is therefore not unbiased with respect to the model, and new, wholly unseen data is required for testing.

Forward and backward propagation, hyper-parameters, and learnable parameters#

Neural networks train in cycles: the input data passes through the network, a relationship between the input data and the target values is learned, a prediction is made, the prediction is measured for error relative to its true value, and the errors are used to update the parameters in the network, feeding into the next cycle of learning and prediction. This happens through a two-step process called forward propagation and back propagation, in which the first step passes information forward to produce a prediction and the second step propagates the error backward to correct the model’s parameters.

https://thumbs.gfycat.com/BitesizedWeeBlacklemur-max-1mb.gif

Fig. 4 Forward and back propagation.#
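As a rough illustration, here is a minimal Keras sketch of one such training cycle, using random toy data and a hypothetical two-layer model (not the models used later in this training): a forward pass, a loss measurement, and a backward pass that updates the learnable parameters.

import tensorflow as tf

# Hypothetical two-layer model and random toy data, for illustration only
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x_batch = tf.random.normal((32, 8))   # 32 samples, 8 features
y_batch = tf.random.normal((32, 1))   # 32 target values

with tf.GradientTape() as tape:
    predictions = model(x_batch, training=True)   # forward propagation
    loss = loss_fn(y_batch, predictions)          # measure prediction error

# back propagation: compute gradients of the loss and update the parameters
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))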

The activation function decides whether or not the output from a neuron is useful, based on a threshold value, and therefore whether it will be carried from one layer to the next.

Weights control the signal (or the strength of the connection) between two neurons in consecutive layers. These parameters determine how much input will be passed from one neuron to the next. For each neuron, the operation input * weights is computed. Weights affect the magnitude of the activation.

Biases are values which help determine whether or not the weighted output from a neuron is going to be passed forward through the network. After all of the incoming inputs are weighted and summed, a bias value is added. Thus, the operation extends to output = (inputs * weights) + bias. Bias terms shift the activation.

In a neural network, neurons in one layer are connected to neurons in the next layer. As information passes from one neuron to the next, the information is conditioned by the weight of the synapse and is subjected to a bias. The weights and biases determine if the information passes further beyond the current neuron.

https://cdn-images-1.medium.com/max/651/1*UA30b0mJUPYoPvN8yJr2iQ.jpeg

Fig. 5 Weights, bias, activation.#

During training, the weights and biases are learned and updated using the training and validation dataset to fit the data and reduce error of prediction values relative to target values.
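To make the output = (inputs * weights) + bias operation concrete, here is a minimal single-neuron sketch with made-up values:

import numpy as np

# One neuron with three inputs (all values are illustrative)
inputs = np.array([0.5, -1.2, 2.0])
weights = np.array([0.8, 0.1, -0.4])   # strength of each connection
bias = 0.3                             # shifts the activation

weighted_sum = np.dot(inputs, weights) + bias   # (inputs * weights) + bias
output = max(0.0, weighted_sum)                 # threshold-style activation (ReLU)
print(weighted_sum, output)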

Important

  • Activation function: decides whether the output from a neuron is useful or not, based on a threshold value

  • Weights: control the signal between neurons in consecutive layers and determine the magnitude of the activation (i.e. how important a feature is)

  • Biases: a value that helps determine the activation of each neuron

  • Weights and biases are the learnable parameters of a deep learning model

The learning rate controls how much we want the model to change in response to the estimated error after each training cycle.

../_images/localvsglobalmin.png

Fig. 6 Local vs. global minimum (the optimal point to reach).#
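As a minimal sketch (with made-up numbers), the learning rate scales the size of each parameter update in gradient descent:

# Illustrative gradient descent update for a single weight
learning_rate = 0.01   # hyper-parameter: step size
weight = 0.5           # current value of a learnable parameter
gradient = 2.3         # estimated error gradient for this weight

weight = weight - learning_rate * gradient   # small step toward a loss minimum
print(weight)   # 0.477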

The batch size determines the portion of our training dataset that can be fed to the model during each cycle. Stated otherwise, batch size controls the number of training samples to work through before the model’s internal parameters are updated.

https://www.baeldung.com/wp-content/uploads/sites/4/2020/12/epoch-batch-size.png

Fig. 7 Modulating batch size determines how many iterations are within one epoch.#

An epoch is defined as one complete pass of all training samples, i.e. the entire dataset, through the neural network. The number of epochs controls how many times the entire dataset is cycled through and analyzed by the neural network. Related, though not strictly a hyper-parameter, is an iteration: the pass of one batch through the network. If the batch size is smaller than the size of the whole dataset, then there are multiple iterations in one epoch.
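The relationship between these quantities is simple arithmetic; here is a minimal sketch with hypothetical numbers:

import math

# Hypothetical dataset and hyper-parameters, for illustration only
num_samples = 1000   # size of the training set
batch_size = 32      # samples per parameter update
epochs = 10          # full passes through the dataset

iterations_per_epoch = math.ceil(num_samples / batch_size)   # 32 iterations
total_updates = iterations_per_epoch * epochs                # 320 parameter updates
print(iterations_per_epoch, total_updates)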

The optimization function (optimizer) is what we use to change the attributes of the neural network, such as the weights and biases, in order to reduce the loss. The goal of an optimization function is to minimize the error produced by the model.

The loss function, also known as the cost function, measures how much the model needs to improve based on the prediction errors relative to the true values during training.

https://miro.medium.com/max/810/1*UUHvSixG7rX2EfNFTtqBDA.gif

Fig. 8 Loss curve.#

An accuracy metric measures the performance of a model, for example, a pixel-to-pixel comparison of predicted and true classes.
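In Keras, the optimization function, loss function, and accuracy metric are typically wired together when a model is compiled. A minimal sketch, using a hypothetical small classifier rather than the segmentation models used later:

import tensorflow as tf

# Hypothetical small classifier, for illustration only
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # optimization function
    loss="sparse_categorical_crossentropy",                   # loss (cost) function
    metrics=["accuracy"],                                      # accuracy metric
)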

Note: the activation function is also a hyper-parameter.

Types of Machine Learning for Computer Vision#

  • Image classification: classifying whole images, e.g. image with clouds, image without clouds

  • Object detection: identifying locations of objects in an image and classifying them, e.g. identify bounding boxes of cars and planes in satellite imagery

  • Semantic segmentation: classifying individual pixels in an image, e.g. land cover classification

  • Instance segmentation: classifying individual pixels in an image in terms of both class and individual membership, e.g. detecting unique agricultural field polygons and classifying them

  • Generative Adversarial Networks (GANs): a type of image generation where synthetic images are created from real ones, e.g. creating synthetic landscapes from real landscape images

  • LSTMs (Long Short-Term Memory networks): a type of network that treats time as a feature

  • Transformers: networks that leverage attention over learned embeddings to capture complex features

For computer vision tasks, CNNs have played a large role in the advancement of the field. Let’s dive into what CNNs are and how they learn. Later in this training we’ll learn more about Transformers and how these architectures are starting to compete with CNNs in computer vision domains.

What are Convolutional Neural Networks?#

A Convolutional Neural Network (ConvNet/CNN) is a deep learning architecture inspired by the organization of the human visual cortex, in which individual neurons respond to stimuli within a constrained region of the visual field known as the receptive field. Several receptive fields overlap to account for the entire visual area.

In artificial CNNs, importance is assigned to various aspects and objects within an input matrix (such as an image) through a moving, convolving receptive field. Very little pre-processing is required for CNNs relative to other classification methods, as the need for upfront feature engineering is removed. Rather, CNNs learn features on their own, provided enough training time and informative examples.

https://miro.medium.com/max/1400/1*Fw-ehcNBR9byHtho-Rxbtw.gif

Fig. 9 Convolution of a kernel over an input matrix from https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1.#

What is a kernel/filter?#

A kernel is a matrix smaller than the input. It acts as a receptive field that moves over the input matrix from left to right and top to bottom and filters for features in the image. Depending on the contents of the kernel, it may activate when passed over different edges or shapes.

What is stride?#

Stride refers to the number of pixels that the kernel shifts at each step in its navigation of the input matrix.

What is a convolution operation?#

The convolution operation is the combination of two functions to produce a third function as a result. It merges two sets of information, the kernel and the input matrix.

../_images/convop.png

Fig. 10 Convolution of a kernel over an input matrix from http://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html.#
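Here is a minimal NumPy sketch of this operation (strictly speaking, the cross-correlation that deep learning frameworks implement under the name “convolution”), sliding a toy 2x2 kernel over a toy 4x4 input with stride 1:

import numpy as np

def convolve2d(x, kernel, stride=1):
    """Slide the kernel over x and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input matrix
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])               # toy 2x2 kernel
print(convolve2d(x, kernel))                   # 3x3 output feature map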

Convolution operation using 3D filter#

An input image is often represented as a 3D matrix with dimensions for width (pixels), height (pixels), and depth (channels). In the case of an optical image with red, green, and blue channels, the kernel/filter is shaped with the same channel depth as the input, and the element-wise products are summed across all three dimensions to produce a single value at each position of the output feature map.

What is padding?#

After a convolution operation, the feature map is by default smaller than the original input matrix.

https://www.researchgate.net/profile/Sheraz-Khan-14/publication/321586653/figure/fig4/AS:568546847014912@1512563539828/The-LeNet-5-Architecture-a-convolutional-neural-network.png

Fig. 11 Progressive downsizing of feature maps in a multi-layer CNN.#

To maintain the same spatial dimensions between the input matrix and the output feature map, we may pad the input matrix with a border of zeroes (zero padding). There are two types of padding:

  1. Same padding: a border of zeroes is added so that the output feature map has the same spatial dimensions as the input

  2. Valid padding: no border is added, so the output feature map is smaller than the input

https://miro.medium.com/max/666/1*noYcUAa_P8nRilg3Lt_nuA.png

Fig. 12 Padding an input matrix with zeroes.#
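A minimal Keras sketch of the effect of the two padding modes on output shape, using a toy single-channel 8x8 input:

import tensorflow as tf

x = tf.random.normal((1, 8, 8, 1))   # (batch, height, width, channels)

same = tf.keras.layers.Conv2D(4, kernel_size=3, padding="same")(x)
valid = tf.keras.layers.Conv2D(4, kernel_size=3, padding="valid")(x)

print(same.shape)    # (1, 8, 8, 4): spatial dimensions preserved
print(valid.shape)   # (1, 6, 6, 4): feature map shrinks by kernel_size - 1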

Semantic Segmentation#

To pair with the content of these tutorials, we will demonstrate supervised semantic segmentation to map crop types in Germany and land cover using the LandCoverNet South America dataset. Datasets are sourced from Source Cooperative (previously Radiant Earth).

  • Semantic = of or relating to meaning (class)

  • Segmentation = division (of image) into separate parts

U-Net Segmentation Architecture#

Semantic segmentation is often distilled into the combination of an encoder and a decoder. An encoder compresses the input data into increasingly abstract feature representations, and a decoder takes those representations and translates them back into output data with the same spatial form as the input.

The U-Net model, which is one of many deep learning segmentation architectures, provides a clear illustration of this structure.

../_images/unet_arch_ronn.png

Fig. 13 U-Net architecture (from Ronneberger et al., 2015).#

In Fig. 13, the encoder is on the left side of the model. It consists of consecutive convolutional layers, each followed by ReLU and a max pooling operation to encode feature representations at multiple scales. The encoder can be represented by most feature extraction networks designed for classification.

The decoder, on the right side of the Fig. 13 diagram, is tasked with semantically projecting the discriminative features learned by the encoder onto the original pixel space to render a dense classification. The decoder consists of upsampling (deconvolution) and concatenation steps followed by regular convolution operations.

Following the decoder is the final classification layer, which computes the pixel-wise classification for each cell in the final feature map.
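As a rough, heavily downsized sketch of this encoder/decoder pattern in Keras (two resolution levels only; the input shape, channel counts, and 5-class output are illustrative, not the configuration used later in this training):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(128, 128, 3))

# Encoder: convolution + ReLU, then max pooling to downsample
c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D(2)(c1)
c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)

# Decoder: upsample with a transposed convolution, concatenate the skip
# connection from the encoder, then convolve again to refine the features
u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c2)
u1 = layers.Concatenate()([u1, c1])
c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

# Final pixel-wise classification layer (e.g. 5 hypothetical classes)
outputs = layers.Conv2D(5, 1, activation="softmax")(c3)

model = tf.keras.Model(inputs, outputs)
model.summary()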

ReLU is an operation, an activation function to be specific, that induces non-linearity. This function intakes the feature map from a convolution operation and remaps it such that any positive value stays exactly the same, and any negative value becomes zero.

https://miro.medium.com/max/3200/1*w48zY6o9_5W9iesSsNabmQ.gif

Fig. 14 ReLU activation function.#

https://miro.medium.com/max/1000/1*cmGESKfSZLH2ksqF_kBgfQ.gif

Fig. 15 ReLU applied to an input matrix.#
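A minimal sketch of ReLU applied element-wise to a toy feature map:

import numpy as np

feature_map = np.array([[ 2.0, -1.5],
                        [-0.3,  4.0]])

relu = np.maximum(0.0, feature_map)   # negatives become zero, positives unchanged
print(relu)                           # [[2. 0.] [0. 4.]]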

Max pooling is used to summarize a feature map, retaining only the important structural elements and foregoing the more granular detail that may not be significant to the modeling task. This helps to denoise the signal and improves computational efficiency. It works similarly to convolution in that a kernel with a stride is applied to the feature map, but only the maximum value within each patch is retained.

https://thumbs.gfycat.com/FirstMediumDalmatian-size_restricted.gif

Fig. 16 Max pooling with a kernel over an input matrix.#
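A minimal Keras sketch of 2x2 max pooling with stride 2 on a toy 4x4 feature map, where each 2x2 patch is reduced to its maximum value:

import tensorflow as tf

feature_map = tf.constant([[1., 3., 2., 1.],
                           [4., 6., 5., 2.],
                           [7., 2., 8., 3.],
                           [1., 2., 3., 4.]])
feature_map = tf.reshape(feature_map, (1, 4, 4, 1))   # (batch, h, w, channels)

pooled = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(feature_map)
print(tf.squeeze(pooled))   # [[6. 5.] [7. 8.]]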

Transformers and Self Attention#

The transformer is a powerful deep learning model that was first introduced for sequence processing in text, speech, and natural language in the 2017 paper “Attention Is All You Need”. The paper motivates the transformer and self-attention for a few reasons:

  • For modeling sequences, like pixel time series or lengths of text, RNNs (recurrent neural networks) are hard to parallelize and difficult to train.

  • In CNN-based methods, the number of computations needed to relate two input positions grows with the distance between them, so it is harder for a CNN-based sequence model to capture long-term dependencies.

  • In contrast, multi-head self-attention and the transformer architecture are easier to train, easy to parallelize, and can relate input values to each other in a constant number of operations no matter how far apart they are.

Attention, the primary mechanism in a transformer, is a generalization of a multi-layer perceptron. With attention, any input pixel in an image can attend to any number of other pixels in the image. This is in contrast to a CNN, where pixels in one convolutional layer only attend to a neighborhood of pixels, and is also different from a multi-layer perceptron, where the connections (or “attention”) between nodes in a layer are fixed and cannot change.
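A minimal NumPy sketch of (single-head) scaled dot-product attention, the core computation behind self-attention; the sequence length, feature size, and random projection matrices are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 sequence elements (e.g. pixels), 8 features each

# Random query/key/value projections, for illustration only
W_q, W_k, W_v = rng.normal(size=(8, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(K.shape[-1])                                 # every element scores every other
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                                                     # weighted mix of all positions
print(output.shape)                                                      # (4, 8)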

Attention is a very flexible mechanism that scales well. Since it makes so few assumptions about the relationships between elements in a sequence, it can eventually learn efficient and powerful representations of the inputs given enough data. But because it makes fewer assumptions than other mechanisms, it typically requires more data and takes longer to train. We’ll revisit transformers and attention later in the lesson to discuss how recent architectures (such as the Vision Transformer) address these computation and data constraints in computer vision, and specifically in remote sensing.

References#

Below is a 2021 video recording of a previous version of this lesson. It was given by Lilly Thomas, ML Engineer at Development Seed.

from IPython.display import YouTubeVideo

def display_youtube_video(url, **kwargs):
    # Extract the video id from the URL and embed the player in the notebook
    id_ = url.split("=")[-1]
    return YouTubeVideo(id_, **kwargs)

display_youtube_video("https://www.youtube.com/watch?v=-C3niPVd-zU", width=800, height=600)