Long Short-Term Memory (LSTM) Networks: A Comprehensive Guide

Long Short-Term Memory (LSTM) networks are a specialized type of Recurrent Neural Network (RNN), designed to handle sequential data and learn long-term dependencies. This article explores the theoretical foundations of LSTMs, their architecture, and their applications in various fields.

Introduction to LSTM Networks

LSTM networks help you make predictions and find patterns in data that arrives in a particular order, such as speech, text, or video. This makes them useful in many areas, including language, audio, and video processing.

Hochreiter and Schmidhuber first published the algorithm in 1997; it has since been refined and remains a very effective neural network architecture for a wide variety of problems.

LSTMs address the limitations of traditional RNNs, which struggle with learning long-term dependencies due to the vanishing or exploding gradient problem. By introducing a memory cell and gates to control the flow of information, LSTMs can selectively retain or discard information, enabling them to learn long-term patterns in sequential data.

Why Learn LSTM?

Learning about LSTMs can give you skills that are in high demand and open the door to exciting projects. The LSTM architecture transforms a plain RNN into a potent deep learning model by preserving information over long spans of a sequence.

LSTM Usage Over Time

Statistics on LSTM usage over time in published papers, compared with other neural networks, show that time series analysis and NLP are the tasks that most often employ this model. We will choose two different projects to show how to use the model in these kinds of applications.

The Problem with RNNs: Vanishing Gradients

RNNs address the memory issue with a feedback mechanism: the previous output is fed back in and serves as a kind of memory. A sentence or phrase only holds meaning when every word in it is associated with the words before and after it.

The main problem with the RNN model is vanishing gradients. In practice, the model fails to train when time lags exceed roughly 5-10 steps, and it struggles to keep information from earlier steps for long periods.

For example, consider the following sentence: "Creating a snowman is a great way to keep children happy and entertained, but we need to wait for proper .... to be able to create it."

If you are asked to fill in the blank, you do not need to analyze every word in the sentence; you just see "snowman" and say "snow". The words close to the blank, such as "proper", "children", "happy", and "entertained", do not help us. We need to go back to "snowman" to understand which word belongs in the blank. Unfortunately, RNNs cannot carry information across that many time steps.

Just imagine if the RNN model could remember important information for a long period, keeping most of it in a long-term memory.

The LSTM model was created to do exactly that: keep essential information for a long period.

Vanishing and Exploding Gradients Explained

RNNs have convincingly proved their performance in sequence learning. Recurrent neural networks use the hyperbolic tangent activation function, what we call the tanh function. The range of this activation function lies in [-1, 1], with its derivative ranging over [0, 1]. An RNN unrolled over a sequence is a deep network, so as the input sequence grows, the number of matrix multiplications in the network grows with it. Hence, when we apply the chain rule of differentiation during backpropagation, the network keeps multiplying gradients by these small derivatives. And guess what happens when you repeatedly multiply numbers smaller than 1 together? The product becomes exponentially smaller, squeezing the final gradient to almost 0; the weights are no longer updated and model training halts.

  • Vanishing Gradient: When training a model over time, the gradients which help the model learn can shrink as they pass through many steps. This makes it hard for the model to learn long-term patterns since earlier information becomes almost irrelevant.
  • Exploding Gradient: Sometimes gradients can grow too large causing instability. This makes it difficult for the model to learn properly as the updates to the model become erratic and unpredictable.

Both of these issues make it challenging for standard RNNs to effectively capture long-term dependencies in sequential data.
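The effect of repeated multiplication is easy to see with a toy calculation. This is not an actual RNN, just the arithmetic behind the two failure modes:

```python
# Toy illustration: repeatedly multiplying a gradient by a factor < 1
# makes it vanish; by a factor > 1 makes it explode.
def repeated_product(factor, steps):
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = repeated_product(0.9, 100)   # factor < 1: shrinks toward 0
exploding = repeated_product(1.1, 100)   # factor > 1: blows up

print(f"after 100 steps, factor 0.9 -> {vanishing:.2e}")
print(f"after 100 steps, factor 1.1 -> {exploding:.2e}")
```

After 100 steps the 0.9 case has collapsed to roughly 10^-5 while the 1.1 case has grown past 10^4, which is exactly the vanishing/exploding behavior described above.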

LSTM Architecture: An In-Depth Look

LSTM stands for Long Short-Term Memory network, a more elaborate form of recurrent neural network (RNN) built to address the shortcomings of plain RNN algorithms. Oh boy, I can already hear you asking "what's the deal with RNNs?"; we covered that problem above, so let's dive into the LSTM itself.

The LSTM is a recurrent neural network, similar to a plain RNN, except that it has an additional input and output. The new input is named C, the first letter of "Cell State", and it stands for long-term memory.

This new input (C_{t-1}) is connected directly to the output (C_t) and stays connected for the whole process. This line functions as a memory from which we can later add or remove information.

Core Components of an LSTM Cell

An LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate.

  1. Cell: Every unit of the LSTM network is known as a “cell”. The LSTM cell is a building block that you can use to build a larger neural network. The cell memory and hidden state can be initialized to zero at the beginning. Within the LSTM cell, $x$, $c$, and $h$ are multiplied by separate weight tensors and passed through activation functions several times. The result is the updated cell memory and hidden state, and these updated $c$ and $h$ are used on the next time step of the input tensor.

  2. Gates: LSTM uses a special mechanism to control the memorizing process, popularly referred to as the gating mechanism. The gates store the memory components in analog form: point-wise multiplication with a sigmoid activation function turns each component into a probabilistic score in the range 0-1. The LSTM architecture involves a memory cell controlled by three gates:

    • Forget Gate: Determines what information is removed from the memory cell; information that is no longer useful in the cell state is discarded here. Two inputs, x_t (the input at the current time) and h_{t-1} (the previous cell output), are fed to the gate, multiplied by weight matrices, and a bias is added. The result is passed through a sigmoid activation function, which gives an output in the range [0, 1]. If the output for a particular cell-state element is 0 or near 0, that piece of information is forgotten; for an output of 1 or near 1, the information is retained for future use.

      • The equation for the forget gate is:

        • f_t = σ ( W_f ⋅ [h_{t-1}, x_t] + b_f )

        • Where:

          • W_f represents the weight matrix associated with the forget gate.

          • [h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input.

          • b_f is the bias of the forget gate.

          • σ is the sigmoid activation function.

    • Input Gate: Controls what information is added to the memory cell. This gate decides which information from the current input is relevant and which isn't, using the sigmoid activation function, and stores the useful part in the cell state. First, the information is regulated by a sigmoid over the inputs h_{t-1} and x_t, filtering the values to be remembered, just as in the forget gate. Then a tanh creates a vector of candidate values in the range -1 to +1 from h_{t-1} and x_t. Finally, the regulated values and the candidate vector are multiplied to obtain the useful information.

      • The equation for the input gate is:

        • i_t = σ ( W_i ⋅ [h_{t-1}, x_t] + b_i )
        • Ĉ_t = tanh ( W_c ⋅ [h_{t-1}, x_t] + b_c )
      • We multiply the previous state by f_t effectively filtering out the information we had decided to ignore earlier. Then we add i_t ⊙ Ĉ_t which represents the new candidate values scaled by how much we decided to update each state value.

        • C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t where ⊙ denotes element-wise multiplication
    • Output Gate: Controls what information is output from the memory cell. This gate updates and finalizes the next hidden state. The output gate decides what part of the current cell state should be sent as the hidden state (output) for this time step. First, the gate uses a sigmoid function to determine which information from the current cell state will be output. This is done using the previous hidden state h_{t-1} and the current input x_t:

      • o_t = σ ( W_o ⋅ [h_{t-1}, x_t] + b_o )

      • Next, the current cell state C_t​ is passed through a tanh activation to scale its values between -1 and +1. Finally, this transformed cell state is multiplied element-wise with o_t​ to produce the hidden state h_t:

      • h_t = o_t ⊙ tanh(C_t)

      • Here:

        • o_t​ is the output gate activation.

        • C_t​ is the current cell state.

        • ⊙ represents element-wise multiplication.

        • σ is the sigmoid activation function.

      • This hidden state h_t is then passed to the next time step and can also be used for generating the output of the network.
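Putting the three gates together, a single LSTM time step can be sketched in plain NumPy. This is a minimal illustration of the equations above, not a production implementation; the weight/bias containers (`W`, `b`, keyed by gate letter) are made-up names for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the gate equations above.
    Each W[k] maps the concatenated [h_{t-1}, x_t] to a gate pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat         # C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # h_t = o_t ⊙ tanh(C_t)
    return h_t, c_t

# Tiny example with random weights (hidden size 3, input size 2).
rng = np.random.default_rng(0)
hidden, inp = 3, 2
W = {k: rng.standard_normal((hidden, hidden + inp)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_cell(rng.standard_normal(inp), h, c, W, b)
print(h.shape, c.shape)  # both (3,)
```

Note that every element of h must lie strictly between -1 and 1, since it is the product of a sigmoid output and a tanh output.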

Step-by-Step Explanation of LSTM Operations

  1. Forget Gate Operation:

    • In this gate we have two inputs: C_{t-1} and a second input that passes through a sigmoid function, so its elements lie between 0 and 1. The gate is a function that combines the inputs and produces an output with the same shape as C_{t-1}, though slightly different; in other words, the gate acts on C_{t-1} through the other input, f_t.
    • If an element of f_t is between 0 and 1, it scales the corresponding element of C_{t-1}; if an element of f_t is 0, the corresponding element of C_{t-1} is forgotten entirely.
    • h_{t-1} and x_t are the two inputs of the forget gate, which is responsible for removing unnecessary information from long-term memory. The two inputs are combined by a fully connected (MLP) layer and then passed through the sigmoid function, which converts each element into a number between 0 and 1; the output is f_t.
    • As mentioned, f_t is a vector with the same shape as C_{t-1}. It specifies how much each element of C_{t-1} should be multiplied by, ranging from 0 to 1. If a part of f_t is close to 1, the corresponding portion of C_{t-1} should be maintained; if it is near 0, that part of C_{t-1} is thrown away.
  2. Remembering Gate Operation:

    • So far we have C_t as the output of the previous step. In this part, the network adds information to C_t, our long-term memory.
    • The gate responsible for adding data has two inputs: C_t and a second input, which, as in the last step, must be a vector with the same shape as C_t. This second input matters because it tells us which information to add to the long-term memory, so let's define how it is created.
    • The initial inputs are the same as for the forget gate: h_{t-1} and x_t. These two inputs are again connected by a fully connected (MLP) layer with their own weights (W_ig, W_hg). Everything up to this point is identical to the previous stage, but the result is now passed through a hyperbolic tangent, which maps all elements of g_t to values between -1 and 1. The tanh lets us tune the influence of specific components of C_t by scaling them between -1 and 1.
    • From the last step we obtain a new vector, g_t, with which we can update C_t. It seems everything is perfect, but we don't want to add too much information to long-term memory; we only need the essential parts, just as our mind keeps only the important information when we read a sentence. So we gate g_t with a mechanism similar to the forget gate, which we call the Input Gate.
    • Combining these pieces, the updated cell state is obtained as C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t.
  3. Output Gate Operation:

    • Don't give up, you have almost learned everything about LSTM; now it is time for the last part. As you might guess, in this part we learn how to produce h_t and why we need it!
    • We need h_t because an LSTM, like an RNN, is a recurrent network, meaning our outputs become the inputs for the next turn, so h_t plays the input role for the next cell.
    • To produce h_t, we already have an updated C_t that contains valuable and important information, which we pass through a tanh (to restrict values to the range of -1 to 1).
    • There is still one thing we haven't figured out yet: C_t contains a bunch of information, but we only need some specific parts of it. The answer is easy, we just need an output gate; yes, a gate again! It is like a key that unlocks the door to the right information.
    • But don't worry, everything is similar to the previous step. Our two inputs, h_{t-1} and x_t, are connected by a fully connected layer and passed through a sigmoid function, producing a vector o_t with the same shape as C_t; multiplying o_t element-wise with tanh(C_t) gives h_t.
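In PyTorch, `torch.nn.LSTMCell` wraps exactly this per-step computation, so you can watch h_t and C_t being carried from one time step to the next. A minimal sketch, with the sizes chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

# nn.LSTMCell performs exactly one time step: given x_t and the
# previous (h, c), it applies the three gates and returns the new (h, c).
torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=8)

batch = 2
h = torch.zeros(batch, 8)   # hidden state (short-term memory)
c = torch.zeros(batch, 8)   # cell state (long-term memory)

sequence = torch.randn(5, batch, 4)   # 5 time steps of input
for x_t in sequence:                  # feed one step at a time
    h, c = cell(x_t, (h, c))          # gates update c, then h = o_t * tanh(c)

print(h.shape, c.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
```

The loop makes the recurrence explicit: the (h, c) pair produced at each step is exactly what the next step consumes, which is the role of h_t described above.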

Practical Applications of LSTM Networks

LSTMs are particularly effective for working with sequential data, which can vary in length, and for learning long-term dependencies between time steps of that data. Some of the best-known applications of LSTMs include:

  • Language Modeling: Used in tasks like language modeling, machine translation and text summarization. These networks learn the dependencies between words in a sentence to generate coherent and grammatically correct sentences.
  • Speech Recognition: Used in transcribing speech to text and recognizing spoken commands. By learning speech patterns they can match spoken words to corresponding text.
  • Time Series Forecasting: Used for predicting stock prices, weather and energy consumption. They learn patterns in time series data to predict future events.
  • Anomaly Detection: Used for detecting fraud or network intrusions. These networks can identify patterns in data that deviate drastically and flag them as potential anomalies.
  • Recommender Systems: In recommendation tasks like suggesting movies, music and books. They learn user behavior patterns to provide personalized suggestions.
  • Video Analysis: Applied in tasks such as object detection, activity recognition and action classification.
  • Signal processing: Signals are naturally sequential data, as they are often collected from sensors over time. Automatic classification and regression on large signal data sets allow prediction in real time. Raw signal data can be fed into deep networks or preprocessed to focus on specific features, such as frequency components.
  • Natural language processing (NLP): Language is naturally sequential, and pieces of text vary in length.

LSTM for Time Series Prediction in PyTorch

LSTM is useful for data such as time series or strings of text. An LSTM cell is a building block that you can use to build a larger neural network. Since the LSTM cell expects the input $x$ in the form of multiple time steps, each input sample should be a 2D tensor: one dimension for time and another for features.

Usually time series prediction is done on a window. That is, given data from time $t-w$ to time $t$, you are asked to predict for time $t+1$ (or deeper into the future). The window size $w$ governs how much data you are allowed to look at when you make the prediction. On a long enough time series, multiple overlapping windows can be created: a time series of $L$ time steps can produce roughly $L$ windows, because a window can start from any time step as long as it does not go beyond the boundary of the series. Within one window there are multiple consecutive time steps of values, and each time step can carry multiple features.

It is intentional that the "feature" and the "target" have the same shape: for a window of three time steps, the feature is the time series from $t$ to $t+2$ and the target is from $t+1$ to $t+3$. Note that the input time series is a 2D array, while the output of the create_dataset() function is a pair of 3D tensors.
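The original create_dataset() function is not shown here, so the following is a plausible sketch of such a windowing helper under the convention just described (the target is the feature shifted by one step); treat the exact signature as an assumption:

```python
import torch

def create_dataset(series, window):
    """Slice a 2D time series (time, features) into overlapping windows.
    Feature and target have the same shape: the target window is the
    feature window shifted forward by one time step."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])          # t .. t+window-1
        y.append(series[i + 1 : i + window + 1])  # t+1 .. t+window
    return (torch.tensor(X, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32))

# A toy series of 10 time steps with 1 feature each.
series = [[float(t)] for t in range(10)]
X, y = create_dataset(series, window=3)
print(X.shape, y.shape)  # torch.Size([7, 3, 1]) torch.Size([7, 3, 1])
```

A series of 10 steps with a window of 3 yields 7 overlapping windows, and the 2D input becomes 3D tensors of shape (windows, time, features), matching the description above.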

The output of nn.LSTM() is a tuple. The first element is the generated hidden states, one for each time step of the input. These hidden states are further processed by a fully connected layer to produce a single regression result per step. In the resulting plot, the training set is drawn in red, the test set in green, and the blue curve shows what the actual data looks like.
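A minimal model along these lines might look as follows; the class name and layer sizes are illustrative choices, not taken from the original:

```python
import torch
import torch.nn as nn

class TimeSeriesLSTM(nn.Module):
    """nn.LSTM returns (hidden states for every time step, (h_n, c_n));
    a linear layer maps each hidden state to one regression value."""
    def __init__(self, n_features=1, hidden_size=50):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, x):         # x: (batch, time, features)
        out, _ = self.lstm(x)     # out: (batch, time, hidden_size)
        return self.linear(out)   # one value per time step

torch.manual_seed(0)
model = TimeSeriesLSTM()
x = torch.randn(8, 3, 1)          # batch of 8 windows, 3 time steps each
pred = model(x)
print(pred.shape)  # torch.Size([8, 3, 1])
```

Because the prediction has one value per time step, it matches the target shape produced by the windowing scheme described above, so a plain MSE loss can be applied directly.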

Advantages of LSTM Networks

  • Handles Long-Term Dependencies: LSTMs excel in tasks where long-term dependencies are crucial, outperforming traditional RNNs.
  • Mitigates Vanishing Gradient Problem: The gating mechanism in LSTMs helps to alleviate the vanishing gradient problem, allowing for more effective training.
  • Versatile Applications: LSTMs can be applied to a wide range of tasks, including language modeling, speech recognition, time series forecasting, and video analysis.

BiLSTM: Capturing Bidirectional Dependencies

A bidirectional LSTM (BiLSTM) learns bidirectional dependencies between time steps of time-series or sequence data. These dependencies can be useful when you want the network to learn from the complete time series at each time step. A BiLSTM consists of two LSTM components: the forward LSTM and the backward LSTM. The forward LSTM operates from the first time step to the last time step. The backward LSTM operates from the last time step to the first time step.
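In PyTorch, a BiLSTM is obtained by passing `bidirectional=True` to `nn.LSTM`; the forward and backward hidden states are concatenated, doubling the output feature dimension. The sizes below are arbitrary, chosen only for the sketch:

```python
import torch
import torch.nn as nn

# bidirectional=True runs a forward and a backward LSTM over the sequence
# and concatenates their hidden states at every time step.
torch.manual_seed(0)
bilstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True,
                 bidirectional=True)

x = torch.randn(2, 5, 4)          # (batch, time, features)
out, (h_n, c_n) = bilstm(x)

print(out.shape)  # torch.Size([2, 5, 16]) - 2 * hidden_size per step
print(h_n.shape)  # torch.Size([2, 2, 8]) - (directions, batch, hidden)
```

At every time step, `out` combines what the forward LSTM has seen of the past with what the backward LSTM has seen of the future, which is exactly the "complete time series at each time step" behavior described above.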

tags: #LSTM #MachineLearning #Tutorial
