Aran Komatsuzaki

GPT-J-6B: 6B JAX-Based Transformer



Below, we refer to GPT-J-6B as GPT-J for short.

Why does this project matter?

Credit assignment:


We would like to thank everyone who has helped with this project (in alphabetical order):

Model design:

Our model design and hyperparameter choices closely follow those of the 6.7B GPT-3, with some differences. Notably,

We have made two minor architectural improvements:


Sample Outputs:

Below, we list some interesting samples generated with GPT-J, along with their prompts.


Prompt ------- 

What is 75*10?




What is -0.002 take away 72.75?


Calculate -0.5 - 1039.


What is the difference between -1360 and 2?


What is -27.95 less than -20?


Calculate -0.3 + -169.


What is 0.7 minus 0.05?


Calculate -2 + 0.0899.


In this setting, where the model generates the questions itself, it performs addition and subtraction perfectly.
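For reference, the exact values of the questions listed above can be computed with Python's decimal module. This is only a sketch for checking such samples by hand: how each English phrase maps to an expression (reading "a take away b" as a - b, "a less than b" as b - a, and "difference" as the absolute difference) is an assumption, and none of the values below are taken from the model's outputs.

```python
from decimal import Decimal as D

# Each entry pairs a question from the samples above with an exact reference value.
# The reading of each phrase as an arithmetic expression is an assumption.
reference_answers = {
    "What is 75*10?": D("75") * D("10"),
    "What is -0.002 take away 72.75?": D("-0.002") - D("72.75"),
    "Calculate -0.5 - 1039.": D("-0.5") - D("1039"),
    "What is the difference between -1360 and 2?": abs(D("-1360") - D("2")),
    "What is -27.95 less than -20?": D("-20") - D("-27.95"),
    "Calculate -0.3 + -169.": D("-0.3") + D("-169"),
    "What is 0.7 minus 0.05?": D("0.7") - D("0.05"),
    "Calculate -2 + 0.0899.": D("-2") + D("0.0899"),
}

for question, answer in reference_answers.items():
    print(f"{question} -> {answer}")
```

Decimal arithmetic is used instead of floats so that values such as 0.7 - 0.05 come out exactly rather than with binary rounding error.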

Theorem Proving

Prompt -------

Theorem 3.2: Any cyclic group is abelian.




Let G be a cyclic group. If the order is 1, then it is trivial. If the order is 1, then G has only one element, so it is abelian.

If the order is not 1, let G be a cyclic group of order n.
Then G is generated by a single element a. Let a = gt. 

Note that G is an abelian group, so gt^n = e. But t^n = e, so gt^n = t^n = e, so t^n = g^n. 

For any element g in G, (gt)^n = t^n, so g^n = (gt)^n = t^n = g^n. 

So G is abelian.

While the model can imitate the style of a proof of this simple theorem to some extent, its reasoning is not valid, and there is still a huge gap from human-level accuracy.
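For comparison, a standard proof of the same statement is short. The version below is a textbook argument written out for reference, not something the model produced; the symbols $a$, $x$, $y$, $i$, and $j$ are our own notation.

```latex
\textbf{Claim.} Every cyclic group is abelian.

\textbf{Proof.} Let $G = \langle a \rangle$ be a cyclic group with generator $a$.
Take any $x, y \in G$. Then $x = a^{i}$ and $y = a^{j}$ for some $i, j \in \mathbb{Z}$, so
\[
  xy = a^{i} a^{j} = a^{i+j} = a^{j+i} = a^{j} a^{i} = yx .
\]
Hence $G$ is abelian. \qed
```

The key step, that every element is a power of one generator and that powers of the same element commute, never appears in the model's output.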


Prompt ------

Passage: Badgers are short-legged omnivores in the family Mustelidae, which also includes the otters, polecats, weasels, and wolverines. They belong to the caniform suborder of carnivoran mammals. The 11 species of badgers are grouped in three subfamilies: Melinae (Eurasian badgers), Mellivorinae (the honey badger or ratel), and Taxideinae (the American badger). The Asiatic stink badgers of the genus Mydaus were formerly included within Melinae (and thus Mustelidae), but recent genetic evidence indicates these are actually members of the skunk family, placing them in the taxonomic family Mephitidae. 

Question: is a wolverine the same as a badger?


Output (greedy):

A: The wolverine is a member of the Mustelidae family, but it is not a badger.


Output (p = 0.9):


You'll first need to establish what you mean by a wolverine and what you mean by a badger. Wolverines are fur-bearing members of the canid order Carnivora, which includes dogs, foxes, and weasels. They are not the same thing as badgers, and neither are badgers what wolverines were originally described as. They're different animals. Badgers are canids; wolverines are canids.
As to what the difference is, there is none.

Completion on a question from BoolQ (SuperGLUE). Both sampling methods reach the same correct conclusion, but the nucleus-sampled output hallucinates and contains incorrect reasoning, while the greedy output answers concisely and reasonably. In general, we observed that greedy sampling is more accurate and produces fewer hallucinations than nucleus sampling when the output is supposed to be short, as it is here. This is to be expected, given that classification tasks are usually done with greedy sampling.
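The difference between the two decoding strategies can be sketched on a single next-token step. The code below is a minimal illustration in plain Python, not the decoding code used for GPT-J: the function names and the toy logits are hypothetical. Greedy decoding always takes the highest-scoring token, while nucleus (top-p) sampling draws at random from the smallest set of top tokens whose cumulative probability reaches p.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def nucleus_candidates(logits, p=0.9):
    """Return (token indices, renormalized probabilities) for the smallest
    set of most-probable tokens whose cumulative probability reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return kept, [probs[i] / mass for i in kept]

def nucleus_pick(logits, p=0.9, rng=random):
    """Sample one token index from the nucleus."""
    kept, weights = nucleus_candidates(logits, p)
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -3.0]  # hypothetical scores for a 4-token vocabulary
print(greedy_pick(toy_logits))                   # always the top token
print(nucleus_candidates(toy_logits, p=0.9)[0])  # tokens eligible under top-p
print(nucleus_pick(toy_logits, p=0.9, rng=random.Random(0)))
```

Because nucleus sampling can pick a lower-probability token at every step, errors have more chances to enter a completion, which is consistent with the behavior observed in the sample above.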


Prompt ------

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):



        super(MultiHeadedAttention, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.h = nn.ModuleList([nn.Linear(d_model, h) for _ in range(h)])
        self.d = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x, h_att_mask, return_h_att_mask=False):
        :param x: a sequence of word tokens of shape (batch, seq, vocab),
        :param h_att_mask: a mask of attention heads, with shape (batch, seq, head)
        :param return_h_att_mask: whether to return the attention mask

As with theorem proving, the model knows roughly what attention code looks like, but it does not reproduce the precise mechanism.
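For contrast, the standard mechanism splits each d_model-sized vector into h heads of size d_model / h and applies scaled dot-product attention within each head, rather than building h linear layers of output size h as the sample does. The sketch below shows this in plain Python under simplifying assumptions: the learned query, key, value, and output projections are replaced by the identity, there is no causal mask or dropout, and all function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    d_k = len(k[0])
    out = []
    for q_i in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def split_heads(x, h):
    """Split each d_model-sized vector into h contiguous chunks of size d_model // h."""
    d_head = len(x[0]) // h
    return [[vec[i * d_head:(i + 1) * d_head] for vec in x] for i in range(h)]

def multi_head_self_attention(x, h):
    """Multi-head self-attention with identity projections (learned weights omitted)."""
    heads = [attention(part, part, part) for part in split_heads(x, h)]
    # Concatenate the head outputs back to d_model values per position.
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # toy input: 2 positions, d_model = 4
y = multi_head_self_attention(x, h=2)
print(len(y), len(y[0]))
```

A real implementation would add the learned projections and, for an autoregressive model, mask out attention to later positions.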


Prompt ------

Fun Facts About Geese:
1. Geese have impressive visual capabilities!



They can see a human face and react to it.
2. Geese produce a lot of noise!
They can be heard and seen flying over 100 miles away!
3. Geese are the biggest birds of prey in the world!
They have an average wingspan of 45 inches.
4. Geese can fly 1,000 miles per hour!
They can travel up to 40 miles per hour for short distances.
5. Geese usually mate for life!
They will mate with any other goose with which they are
closely associated.
6. Geese have a variety of vocalizations.
The most prominent one is the honking call, which is
used mainly for communication.
7. Geese can be found on every continent except Antarctica.

Geese are birds of the genus Anser. They are native to the Northern Hemisphere, and
are found in a wide variety of environments.
Geese are typically very solitary birds and seldom form breeding pairs.
Geese are characterized by their long necks, distinctive head crests, and heavy body.
In temperate regions, geese are a common sight, for they form large flocks.
The most common of these



To cite the GPT-J-6B pretrained model:

  author = {Wang, Ben and Komatsuzaki, Aran},
  title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
  howpublished = {\url{}},
  year = 2021,
  month = May

To cite Mesh Transformer JAX, the codebase used for training:

  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{}},
  year = 2021,
  month = May