Building GPT from scratch

With all the talk and articles about GPT, a few weeks ago I decided to spend a couple of days coding one from scratch.


The training scale can't be the same (for $$$ reasons), but the model can follow an architecture very similar to the one used by GPT-3. As long as we narrow the training data to a single domain, we can still get amazing results. In my case I used a US legal corpus, which is widely available online.
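To give a feel for the gap in scale, here is a rough back-of-the-envelope sketch, not the code from this post. It uses the common approximation of about 12 x d_model^2 weights per transformer block plus the token embedding table. The GPT-3 figures (96 layers, width 12288, vocabulary 50257) are the published ones; the small configuration and the function name are hypothetical, picked only for illustration.

```python
def approx_gpt_params(n_layer: int, d_model: int, vocab_size: int) -> int:
    """Rough weight count for a GPT-style decoder.

    Ignores biases, LayerNorm and positional embeddings, so treat the
    result as an order-of-magnitude estimate only.
    """
    per_block = 12 * d_model * d_model   # ~4*d^2 for attention + ~8*d^2 for the MLP
    embeddings = vocab_size * d_model    # token embedding table
    return n_layer * per_block + embeddings


# Published GPT-3 (175B) shape
gpt3 = approx_gpt_params(n_layer=96, d_model=12288, vocab_size=50257)

# Hypothetical laptop-sized configuration (an assumption, not the author's)
small = approx_gpt_params(n_layer=6, d_model=384, vocab_size=50257)

print(f"GPT-3 scale: ~{gpt3 / 1e9:.0f}B parameters")
print(f"small model: ~{small / 1e6:.0f}M parameters")
print(f"ratio:       ~{gpt3 / small:,.0f}x")
```

Same building blocks, just a few thousand times fewer weights, which is what makes training on a single machine realistic.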

I found this experience very valuable for better understanding how these new models can improve and be used in the future. If you want to follow along, I recommend reading the paper GPT is based upon, and also the great implementation from Andrej.

If you're interested in this area and know the basics of AI / Python, I definitely recommend the experience. Personally, I think the impact on society will be very significant, although probably not in the obvious ways we're imagining.

It’s amazing to think all of this is based on math. We’ve come a long way…


Here’s the code I wrote, for reference. I optimized it for M-series Macs to take advantage of their GPUs, so it trains a lot faster. Note that the faster your GPU, the faster it will train; training time also depends on the amount of training data. I used an M1 Max with 32 GPU cores and 64 GB of RAM, and it took about two hours to train.
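As a minimal sketch of what targeting the Mac GPU can look like: this assumes PyTorch (the text above doesn't name the framework), whose MPS backend exposes Apple-silicon GPUs. `pick_device` is a hypothetical helper name, and it falls back to the CPU when no GPU or no PyTorch install is found.

```python
def pick_device() -> str:
    """Return 'mps' on Apple-silicon GPUs, 'cuda' on NVIDIA GPUs, else 'cpu'."""
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        return "cpu"  # PyTorch not installed, so nothing to accelerate

    # torch.backends.mps only exists in newer PyTorch builds, hence the getattr guard
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"training on: {device}")
```

With PyTorch, the model and each batch would then be moved over with `.to(device)` before the training loop starts.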

Comments / Feedback? Find me at or @MarcMasVi

Happy coding,