Construction Notebook for:
GPT Transformer Trained on BookCorpus Data
NetModel Access
This Notebook
NetModel["GPT Transformer Trained on BookCorpus Data","ConstructionNotebook"]
Untrained Net
NetModel["GPT Transformer Trained on BookCorpus Data","UninitializedEvaluationNet"]
Trained Net
NetModel["GPT Transformer Trained on BookCorpus Data"]
Net Construction
Internal Functions
In[]:=
keyValueAttention[numHiddens_, causal_:True] := NetGraph[
 <|
  "key" -> NetMapOperator[numHiddens],
  "value" -> NetMapOperator[numHiddens],
  "query" -> NetMapOperator[numHiddens],
  "elem" -> ElementwiseLayer[#/Sqrt[numHiddens] &],
  "attention" -> AttentionLayer["Dot", "ScoreRescaling" -> None, "Mask" -> If[causal, "Causal", None]]
 |>,
 {
  NetPort["Input"] -> "key" -> NetPort["attention", "Key"],
  NetPort["Input"] -> "value" -> NetPort["attention", "Value"],
  NetPort["Query"] -> "query" -> "elem" -> NetPort["attention", "Query"]
 }]
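Here the "elem" layer rescales the query by 1/Sqrt[numHiddens] before the AttentionLayer, whose own rescaling is disabled via "ScoreRescaling" -> None; the two stages together compute the standard scaled dot-product attention, with d = numHiddens:

  Attention(Q, K, V) = softmax(Q Kᵀ / √d) V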
In[]:=
multiHeadAttention[numHeads_, numHiddens_, args___] := NetGraph[
 Join[
  Table[keyValueAttention[numHiddens, args], numHeads],
  {CatenateLayer[2], NetMapOperator[numHiddens*numHeads]}
 ],
 {Range[numHeads] -> numHeads + 1 -> numHeads + 2}]
In[]:=
multiHeadAttentionBlock[args__] := NetGraph[
 <|
  "attention" -> multiHeadAttention[args],
  "dropout" -> DropoutLayer[0.1],
  "add" -> ThreadingLayer[Plus],
  "norm" -> NormalizationLayer[2 ;;, "Same", "Epsilon" -> 0.00001]
 |>,
 {"attention" -> "dropout", {NetPort["Query"], "dropout"} -> "add" -> "norm"}];
In[]:=
selfAttentionBlock[args__] := With[{multihead = multiHeadAttentionBlock[args]},
 NetGraph[
  Normal[multihead],
  ReplaceAll[EdgeList[multihead], NetPort["Query"] -> NetPort["Input"]]
 ]];
In[]:=
feedForwardBlock[inputDim_, numHiddens_, normalization_:True] := NetGraph[
 <|
  "linear1" -> NetMapOperator[numHiddens],
  "gelu" -> ElementwiseLayer[(0.5 # (1 + Tanh[Sqrt[2/Pi] (# + 0.044715 #^3)])) &],
  (* "gelu" -> Ramp, *)
  "linear2" -> NetMapOperator[inputDim],
  "dropout" -> DropoutLayer[0.1],
  "add" -> ThreadingLayer[Plus], (* residual connection *)
  "norm" -> NormalizationLayer[2 ;;, "Same", "Epsilon" -> 0.00001] (* layer normalization *)
 |>,
 {"linear1" -> "gelu" -> "linear2" -> "dropout", {NetPort["Input"], "dropout"} -> "add" -> "norm"},
 "Input" -> {"Varying", Automatic}];
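The "gelu" entry implements the tanh approximation of GELU used by GPT. As a sanity check (a sketch added here, not part of the original notebook), it can be compared against the exact erf-based definition:

In[]:=
geluApprox[x_] := 0.5 x (1 + Tanh[Sqrt[2/Pi] (x + 0.044715 x^3)]);
geluExact[x_] := x (1 + Erf[x/Sqrt[2]])/2;
Max[Abs[geluApprox[#] - geluExact[#]] & /@ Range[-4., 4., 0.1]] (* the maximum discrepancy is small *)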
In[]:=
decoderBlock[attentionHeads_, attentionHiddens_, feedForwardHiddens_] := NetChain[{
  selfAttentionBlock[attentionHeads, attentionHiddens, True],
  feedForwardBlock[attentionHeads*attentionHiddens, feedForwardHiddens]}]
Final Net
In[]:=
numStackDecoder = 12; embeddingSize = 768; feedForwardHiddens = 3072; attentionHiddens = 64; attentionHeads = 12; numTokens = 40478; nSpecial = 2; numPos = 512;
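These match the original GPT hyperparameters: 12 decoder blocks of width 768, each with 12 attention heads of 64 hidden units, a 40478-token BPE vocabulary, and 512 positions. The decoder blocks rely on the concatenated heads recovering the embedding width, which can be checked directly (a check added here, not part of the original notebook):

In[]:=
attentionHeads*attentionHiddens == embeddingSize
Out[]= True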
In[]:=
decoder:=NetGraph[Table[decoderBlock[attentionHeads,attentionHiddens,feedForwardHiddens],numStackDecoder],Table[i->i+1,{i,numStackDecoder-1}]]
In[]:=
embedding := NetGraph[
 <|
  "embeddingtokens" -> EmbeddingLayer[embeddingSize, numTokens],
  "posembed" -> NeuralNetworks`SequenceIndicesLayer[numPos],
  "embeddingpos" -> EmbeddingLayer[embeddingSize, numPos],
  "inputCombine" -> ThreadingLayer[#1 + #2 &],
  "dropout" -> DropoutLayer[0.1]
 |>,
 {NetPort["Input"] -> "embeddingtokens", NetPort["Input"] -> "posembed" -> "embeddingpos", {"embeddingtokens", "embeddingpos"} -> "inputCombine" -> "dropout"}];
transformer = NetChain[<|"embedding" -> embedding, "decoder" -> decoder|>]
Out[]=
NetChain
In[]:=
pos={"embedding","embeddingtokens"};snet=NetReplacePart[transformer,pos->NetInsertSharedArrays[NetExtract[transformer,pos]]];lmnet= NetAppend[snet, "last"->SequenceLastLayer[], "classifier"->LinearLayer[numTokens, "Weights"->NetSharedArray["Weights"], "Biases"->None], "probabilities"->SoftmaxLayer[]]
Out[]=
NetChain
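A smoke test (a sketch added here, not part of the original notebook): initializing the untrained language-model head with random weights and evaluating it on a few token indices should return a probability vector over the full vocabulary.

In[]:=
rnd = NetInitialize[lmnet];
Dimensions[rnd[{5, 10, 15}]] (* expect {numTokens}, i.e. {40478}: a next-token distribution over the BPE vocabulary *)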
Training
(Performed separately)