Relation networks consist of two important functions: an embedding function and a relation function. The embedding function extracts features from the input. If our input is an image, we can use a convolutional network as the embedding function, which gives us the feature vectors (embeddings) of the image. If our input is text, we can use an LSTM network to get the embeddings of the text. Let's say we have a support set containing three classes, {lion, elephant, dog}, as shown below:

And let's say we have a query image, as shown in the following diagram, and we want to predict the class of this query image:

First, we take each image from the support set and pass it to the embedding function to extract its features. Since our support set has images, we can use a convolutional network as our embedding...
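To make the embedding step concrete, here is a minimal sketch in plain Python of passing each support image through a convolution-style embedding function. Everything in it is an assumption made for illustration: the names `conv2d_valid` and `embed`, the random untrained 3x3 filters, and the 8x8 random "images" standing in for the lion, elephant, and dog pictures. A real relation network would use a trained, multi-layer convolutional network instead.

```python
import random


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def embed(image, kernels):
    """Toy embedding function: convolution -> ReLU -> global average pooling.

    Returns a feature vector with one entry per kernel.
    """
    features = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in feature_map for v in row]  # ReLU
        features.append(sum(activated) / len(activated))  # average pooling
    return features


rng = random.Random(0)

# Hypothetical, untrained 3x3 filters (a trained network would learn these).
kernels = [
    [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]
    for _ in range(4)
]

# Hypothetical support set: one random 8x8 grayscale "image" per class.
support_set = {
    name: [[rng.random() for _ in range(8)] for _ in range(8)]
    for name in ("lion", "elephant", "dog")
}

# Pass every support image through the same embedding function.
support_embeddings = {
    name: embed(image, kernels) for name, image in support_set.items()
}
```

Each class in the support set ends up with a feature vector of the same length (here four, one per filter), because every image is passed through the same embedding function.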