Construction Notebook for:
GPT Transformer Trained on BookCorpus Data

NetModel Access

This Notebook

NetModel["GPT Transformer Trained on BookCorpus Data","ConstructionNotebook"]

Untrained Net

NetModel["GPT Transformer Trained on BookCorpus Data","UninitializedEvaluationNet"]

Trained Net

NetModel["GPT Transformer Trained on BookCorpus Data"]

Net Construction

Internal Functions

In[]:=
keyValueAttention[numHiddens_, causal_ : True] := NetGraph[
  <|
   "key" -> NetMapOperator[numHiddens],
   "value" -> NetMapOperator[numHiddens],
   "query" -> NetMapOperator[numHiddens],
   "elem" -> ElementwiseLayer[#/Sqrt[numHiddens] &],
   "attention" -> AttentionLayer["Dot", "ScoreRescaling" -> None, "Mask" -> If[causal, "Causal", None]]
  |>,
  {
   NetPort["Input"] -> "key" -> NetPort["attention", "Key"],
   NetPort["Input"] -> "value" -> NetPort["attention", "Value"],
   NetPort["Query"] -> "query" -> "elem" -> NetPort["attention", "Query"]
  }
 ]
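Each head implements scaled dot-product attention, softmax(Q.Transpose[K]/Sqrt[numHiddens]).V. The rescaling by 1/Sqrt[numHiddens] is applied to the query by the "elem" layer, which is why the AttentionLayer is built with "ScoreRescaling" -> None; "Mask" -> "Causal" (the default here, since causal_:True) keeps each position from attending to later positions, as required for a decoder.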
In[]:=
multiHeadAttention[numHeads_, numHiddens_, args___] := NetGraph[
  Join[
   Table[keyValueAttention[numHiddens, args], numHeads],
   {CatenateLayer[2], NetMapOperator[numHiddens*numHeads]}
  ],
  {Range[numHeads] -> numHeads + 1 -> numHeads + 2}
 ]
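Each of the numHeads heads produces a sequence of numHiddens-dimensional vectors; CatenateLayer[2] joins them along the feature dimension, and the final NetMapOperator projects each position back to numHeads*numHiddens features. With the values used under Final Net below, that is 12*64 = 768, the embedding size.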
In[]:=
multiHeadAttentionBlock[args__] := NetGraph[
  <|
   "attention" -> multiHeadAttention[args],
   "dropout" -> DropoutLayer[0.1],
   "add" -> ThreadingLayer[Plus],
   "norm" -> NormalizationLayer[2 ;;, "Same", "Epsilon" -> 0.00001]
  |>,
  {"attention" -> "dropout", {NetPort["Query"], "dropout"} -> "add" -> "norm"}
 ];
In[]:=
selfAttentionBlock[args__] := With[{multihead = multiHeadAttentionBlock[args]},
  NetGraph[
   Normal[multihead],
   ReplaceAll[EdgeList[multihead], NetPort["Query"] -> NetPort["Input"]]
  ]
 ];
In[]:=
feedForwardBlock[inputDim_, numHiddens_, normalization_ : True] := NetGraph[
  <|
   "linear1" -> NetMapOperator[numHiddens],
   "gelu" -> ElementwiseLayer[(0.5 #*(1 + Tanh[Sqrt[(2/Pi)]*(# + 0.044715*#^3)])) &], (* GELU, tanh approximation *)
   (* "gelu" -> Ramp, *)
   "linear2" -> NetMapOperator[inputDim],
   "dropout" -> DropoutLayer[0.1],
   "add" -> ThreadingLayer[Plus], (* residual connection *)
   "norm" -> NormalizationLayer[2 ;;, "Same", "Epsilon" -> 0.00001] (* LayerNorm *)
  |>,
  {"linear1" -> "gelu" -> "linear2" -> "dropout", {NetPort["Input"], "dropout"} -> "add" -> "norm"},
  "Input" -> {"Varying", Automatic}
 ];
In[]:=
decoderBlock[attentionHeads_, attentionHiddens_, feedForwardHiddens_] := NetChain[{
  selfAttentionBlock[attentionHeads, attentionHiddens, True],
  feedForwardBlock[attentionHeads*attentionHiddens, feedForwardHiddens]
 }]
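A single block can be built on its own to inspect its structure. This is a sketch (not an original cell of this notebook) using the hyperparameter values defined under Final Net below:

decoderBlock[12, 64, 3072]

It returns an uninitialized NetChain whose first element is the self-attention NetGraph and whose second is the position-wise feed-forward NetGraph.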

Final Net

In[]:=
numStackDecoder = 12;
embeddingSize = 768;
feedForwardHiddens = 3072;
attentionHiddens = 64;
attentionHeads = 12;
numTokens = 40478;
nSpecial = 2;
numPos = 512;
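These settings match the original GPT configuration: 12 decoder blocks with 12 attention heads of size 64 each (attentionHeads*attentionHiddens = 12*64 = 768 = embeddingSize), a 3072-unit feed-forward layer, a 40478-token BPE vocabulary, and a maximum of 512 positions.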
In[]:=
decoder := NetGraph[
  Table[decoderBlock[attentionHeads, attentionHiddens, feedForwardHiddens], numStackDecoder],
  Table[i -> i + 1, {i, numStackDecoder - 1}]
 ]
In[]:=
embedding := NetGraph[
  <|
   "embeddingtokens" -> EmbeddingLayer[embeddingSize, numTokens],
   "posembed" -> NeuralNetworks`SequenceIndicesLayer[512], (* internal layer giving each element's position index (1..512) *)
   "embeddingpos" -> EmbeddingLayer[embeddingSize, numPos],
   "inputCombine" -> ThreadingLayer[#1 + #2 &],
   "dropout" -> DropoutLayer[0.1]
  |>,
  {NetPort["Input"] -> "embeddingtokens", NetPort["Input"] -> "posembed" -> "embeddingpos", {"embeddingtokens", "embeddingpos"} -> "inputCombine" -> "dropout"}
 ];

transformer = NetChain[<|
  "embedding" -> embedding,
  "decoder" -> decoder
 |>]
Out[]=
NetChain (uninitialized)
    Input port: vector of n indices (range: 1..40478)
    Output port: matrix (size: n×768)
    Number of layers: 2
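At this point transformer is the feature-extractor form of the network: it maps a sequence of n token indices to an n×768 matrix of contextual embeddings. The language-modeling head is attached next.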

In[]:=
pos={"embedding","embeddingtokens"};​​snet=NetReplacePart[transformer,pos->NetInsertSharedArrays[NetExtract[transformer,pos]]];​​lmnet=​​ NetAppend[snet,​​ "last"->SequenceLastLayer[],​​ "classifier"->LinearLayer[numTokens,​​ "Weights"->NetSharedArray["Weights"],​​ "Biases"->None],​​ "probabilities"->SoftmaxLayer[]]
Out[]=
NetChain (uninitialized)
    Input port: vector of n indices (range: 1..40478)
    Output port: vector (size: 40478)
    Number of layers: 5
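The NetSharedArray["Weights"] reference makes the "classifier" LinearLayer share its weight matrix with the token EmbeddingLayer (whose array is exposed by NetInsertSharedArrays), so the output projection is tied to the input embedding. For illustration only, here is a sketch (not an original cell of this notebook; randomly initialized weights, hypothetical variable names) of applying the language model to a short list of token indices to obtain next-token probabilities over the 40478-token vocabulary:

toy = NetInitialize[lmnet];      (* random weights, NOT the trained model *)
probs = toy[{101, 2045, 317}];   (* arbitrary token indices in 1..40478 *)
Dimensions[probs]                (* {40478} *)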


Training

(Performed separately)