Latent Space Podcast 5/20/23 [Summary] - MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Dive into MosaicML's 9-day, $200k "llongboi" MPT-7B training, data prep insights, & the rise of open AI models with experts Frankle & Venigalla.

Prof. Otto Nomos · Oct 05, 2023 · 6 min read

Original Link: MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Summary

In an episode of the Latent Space podcast, Alessio and co-host Swyx welcome guests Jonathan and Abhinav (Abhi) from MosaicML.

Key Takeaways:

  • Introductions:

    • Jonathan, who studied at Princeton and MIT, is best known for his 2018 research on the "lottery ticket hypothesis." He delayed his PhD defense for two years due to his commitments at Mosaic and has been appointed an assistant professor at Harvard.

    • Abhinav, an MIT graduate and former researcher at Cerebras, now works at Mosaic. He describes Cerebras' wafer-scale computing approach, which uses an entire silicon wafer as a single system for training models.

  • Mosaic's Journey:

    • Mosaic ventured into building its own model, the MPT-7B, after profiling various models and realizing training costs could be considerably lowered. Mosaic's focus is not on a single standout model but on empowering clients to create their own optimized models using Mosaic’s tools.

  • Training and Model Creation:

    • Mosaic built MPT-7B as a base model, inspired by LLaMA-7B and trained on one trillion tokens. Abhinav invokes the Chinchilla scaling laws, which prescribe the compute-optimal ratio of training tokens to model parameters; Mosaic intentionally trained well past that optimum, trading extra training compute for a model that is cheaper and more effective at inference (see the back-of-the-envelope sketch after this list).

  • Data Choices:

    • The duo discussed the challenge of finding the right balance between quality and quantity of data for model training. Repeating high-quality data may be as effective as, or more effective than, a larger volume of diverse but lower-quality data.
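As a rough illustration of the Chinchilla trade-off mentioned above, here is a back-of-the-envelope calculation. The ~20-tokens-per-parameter ratio is the commonly cited rule of thumb from the Chinchilla paper, not an exact law.

```python
# Back-of-the-envelope Chinchilla math for MPT-7B (illustrative only).
params = 7e9                       # MPT-7B parameter count
chinchilla_tokens = 20 * params    # ~20 tokens/param is the usual rule of thumb
actual_tokens = 1e12               # MPT-7B was trained on ~1T tokens

print(f"Chinchilla-optimal budget: {chinchilla_tokens:.1e} tokens")  # ~1.4e11
print(f"Actual training budget:    {actual_tokens:.1e} tokens")
print(f"Over-trained by roughly    {actual_tokens / chinchilla_tokens:.0f}x")  # ~7x
```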


Deciphering the Ingredients for Optimal Language Models: Data Mixes, Evaluations, and Technical Innovations

Discussion Topic: Mix of Data Sets for Large Language Models (LLMs)

Swyx asks about the essential question of what mix of data sets should be used for training LLMs. Jonathan shares his experience consulting with law students from Georgetown Law on the optimal data mix. Some considerations include:

  • The volume of English text.

  • Multilingual data.

  • The inclusion of code in the dataset.

  • The value of using reputable sources like Wikipedia.

Jonathan further elaborates on specific data sets, including C4 and mC4, discussing their unexpected performance and preprocessing anomalies. He emphasizes the importance of diverse data for different user purposes and the challenges of model evaluation.
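To make the "data mix" question concrete, here is a minimal sketch of weighted sampling across corpora. The source names and weights below are hypothetical placeholders, not MPT-7B's actual recipe.

```python
import random

# Hypothetical source weights -- placeholders, not MPT-7B's published mix.
sources = {
    "mc4_english": 0.33,
    "c4": 0.30,
    "code": 0.10,
    "wikipedia": 0.04,
    "other_curated": 0.23,
}

def sample_source(rng: random.Random) -> str:
    """Choose which corpus the next training document comes from."""
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```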

Evaluation Challenges for LLMs

Swyx mentions benchmarks like MMLU and BIG-bench. Jonathan notes the difficulty of evaluating models that are primarily used for open-ended generation rather than multiple-choice questions: existing metrics don't capture what users actually expect from the models. Abhinav highlights challenges related to model scale, the diversity of evaluation metrics, and the variance in user purposes.
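For context on what a multiple-choice benchmark actually measures, here is the common log-likelihood scoring pattern: pick the option the model assigns the highest probability. This is a sketch, not any benchmark's exact harness, and the model name is a stand-in for any Hugging Face causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-prob the model assigns to `option` following `prompt`.

    Assumes the tokenizer splits cleanly at the prompt/option boundary,
    which a leading space in each option usually ensures for BPE vocabs.
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    # Row j-1 of `logprobs` predicts token j, so score each option token
    # from the row just before it.
    return sum(
        logprobs[j - 1, ids[0, j]].item() for j in range(prompt_len, ids.shape[1])
    )

question = "Q: What is 2 + 2?\nA:"
print(max([" 3", " 4", " 5"], key=lambda o: option_logprob(question, o)))
```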

Flash Attention

Abhinav introduces FlashAttention, a faster implementation of full attention developed by Stanford's Hazy Research lab, which provides speed improvements during both training and inference. At the time, shipping this implementation set MPT apart from other models available on Hugging Face.
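As a rough sketch of the idea (not MPT's exact integration), PyTorch 2.x exposes fused attention kernels, including a FlashAttention-style one, behind a single call:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); half precision is where fused kernels shine
q = torch.randn(1, 16, 2048, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# The fused kernel computes softmax(QK^T / sqrt(d)) V without ever
# materializing the full 2048 x 2048 attention matrix, which is the source
# of FlashAttention's memory and speed savings.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 2048, 64])
```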

ALiBi Position Encodings

Abhinav explains the ALiBi position encoding, which removes the need for learned positional embeddings and lets models handle longer context lengths more stably. It works by adding a bias to the attention map that can simply be stretched to longer positions at inference time. Jonathan points out that with enough memory the context length could technically be infinite; 84k tokens was the longest length they ran in practice.
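A minimal sketch of the bias ALiBi adds to the attention map, following Press et al.; per-head slope details vary between implementations:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """ALiBi: penalize each attention score by slope * distance, per head.

    No positional embeddings are learned, so the same biases can simply be
    extended to longer sequences at inference time.
    """
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]  # j - i: zero or negative below the diagonal
    # Score(i, j) gets -slope * (i - j); entries above the diagonal are
    # removed by the causal mask anyway.
    return slopes[:, None, None] * distance[None, :, :]  # (heads, seq, seq)

bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0])  # head 0: zeros on the diagonal, -0.5, -1.0, ... to the left
```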

The discussion also touched upon other topics like the choice of evaluation metrics, the emergence phenomenon, and training stabilities.


Fine-Tuning for Creative Outputs & Open Source Ethical Dilemmas

Swyx and Jonathan discuss the nuances of fine-tuning AI models for creative tasks. Alex had once fine-tuned a model on books to produce a story-writing model, which underscored the significance of feedback in fine-tuning. Swyx draws parallels with computer vision, hinting at potential applications in the language domain, even though some strategies do not translate well from vision to text.

Alessio introduces the complications around redistributing content, and the conversation turns to licensing: Jonathan emphasizes the unforeseen issues surrounding AI-generated content and the implications of using open-source licenses in ways they weren't originally intended for. Swyx raises the question of what counts as "fair use" in training models, while Jonathan encourages a community-driven decision on comfort levels in using AI outputs.

The conversation then shifts to technical details with Abhinav as they discuss training stability. Mosaic's solution offers stability improvements to mitigate loss spikes during training (see the sketch below). Both Jonathan and Abhinav emphasize the significance of the infrastructure, noting the challenges and breakthroughs they faced, particularly with hardware failures during model training. The segment ends with a comparison between CPU and GPU failure management.
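One common spike-mitigation pattern is sketched below; this is a generic illustration of the idea, not Mosaic's actual stabilization recipe, and the threshold is hypothetical.

```python
import collections

recent_losses = collections.deque(maxlen=100)  # rolling window of batch losses
SPIKE_FACTOR = 3.0                             # hypothetical threshold

def is_loss_spike(loss: float) -> bool:
    """Flag a batch whose loss jumps far above the recent average."""
    if len(recent_losses) == recent_losses.maxlen:
        avg = sum(recent_losses) / len(recent_losses)
        if loss > SPIKE_FACTOR * avg:
            return True  # skip this optimizer step; don't pollute the window
    recent_losses.append(loss)
    return False

# In a training loop: if is_loss_spike(loss.item()), zero the gradients and
# move on to the next batch instead of calling optimizer.step().
```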


Data Readiness & Training Preparation

  • Synchronous Data Parallelism: The process where every GPU works in sync and the GPUs average their gradients (see the sketch after this list). The aim is determinism: the ability to replicate a run perfectly.

  • Inference, Training, and Composer Products: The starting point for many customers, evolving from the traditional MLOps stack. There are two main usage patterns: starting from Mosaic's checkpoints or starting from Mosaic's configurations.

  • Data Preparedness: Abhinav emphasizes the importance of having proper evaluation metrics and cleaned data before training models. The goal is predictability in outcomes before significant financial investments are made.
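A minimal sketch of the gradient-averaging step at the heart of synchronous data parallelism; in practice torch.nn.parallel.DistributedDataParallel does this fused and overlapped with the backward pass.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce so every rank ends the step with identical gradients.

    Assumes the process group is already initialized (e.g. via torchrun).
    Determinism additionally requires fixed seeds, data order, and kernels.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across ranks
            p.grad /= world_size                           # then average

# Per training step: loss.backward(); average_gradients(model); optimizer.step()
```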

Dynamic Real-time Model Evaluation

  • Evaluation Techniques: Discussion on the significance of various evaluation metrics, such as human eval and vibe-based evaluation. Jonathan emphasizes the importance of real-world prompts for evaluating model performance.

  • Live Evaluations: LLM Foundry provides tools for model evaluation, boasting a fast framework compatible with multiple GPUs. It allows real-time monitoring of model training.
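The general pattern looks something like the loop below. This is an illustrative sketch of interleaving evaluation with training, not LLM Foundry's actual API (see github.com/mosaicml/llm-foundry for that); the callables are supplied by the caller.

```python
from typing import Callable, Iterable

def train_with_live_eval(
    train_step: Callable[[object], float],  # runs one optimizer step, returns loss
    evaluate: Callable[[], dict],           # scores the model on held-out tasks
    batches: Iterable,
    eval_every: int = 1000,
) -> None:
    """Interleave held-out evaluation with training for live monitoring."""
    for step, batch in enumerate(batches, start=1):
        loss = train_step(batch)
        if step % eval_every == 0:
            print(f"step {step}: train_loss={loss:.3f} eval={evaluate()}")
```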

Open Science for Affordable AI Research

  • Cost and Efficiency: The high cost of training has limited experiments in the past. Mosaic's objective is to reduce costs to enable more experiments and thus better insights.

  • Value of Openness: Mosaic emphasizes the importance of being open with the community. They believe in sharing insights as their primary aim is to help clients train better models and provide superior infrastructure.


Balancing Open Approaches and the Dynamic Journey of Mosaic in AI Evolution

The Open Approach:

  • Alessio questions the requirements and feasibility of the open approach, mentioning GPT-4 and its state-of-the-art capabilities. He seeks insights on whether current methods will suffice or if more advancements are needed.

  • Jonathan believes in the coexistence of different technologies, using Linux and Windows as an analogy. He highlights the importance of having domain-specific models and expresses concern over the diminishing open-source ecosystem. He fears the stagnation of innovation due to reduced openness and shares that the collaborative exchange of ideas was crucial for technological progress.

The Future of Mosaic:

  • Swyx brings up Mosaic's evolution and plans, remarking on the company's recent dive into inference despite its earlier stated focus on training.

  • Jonathan emphasizes the importance of focus for startups, sharing Mosaic's initial reluctance towards inference. Despite that, they ventured into it due to customer demand and business benefits. He recalls Mosaic's vision of efficient training and its realization of the changing needs of inference. He stresses the continuous nature of model training and improvement.

  • Abhinav points out the evolving nature of data and the continuous need to update models accordingly. He notes that the current models aren't the endpoint; they will evolve, improving in various aspects.

  • Jonathan ends by debunking the idea that training is a one-time cost. He cites a stat from Google illustrating the split between inference and training costs, emphasizing that training remains a significant, recurring investment.

This transcript sheds light on the evolving world of AI, emphasizing the importance of the open approach, the need for continuous learning and improvement, and the dynamic nature of startups in the tech industry.


The Evolution of Speed and Efficiency in AI Training

Swyx delves into efficiency and speed in AI model training, asking how fast the process can get given current durations of three to ten days. Abhinav highlights this year as crucial for training efficiency, pointing to new hardware like NVIDIA's H100 and the new floating-point format FP8. These innovations significantly reduce training time and cost. The discussion traces the evolution from 32-bit to 16-bit precision in training, foreseeing even greater improvements from FP8.

Furthermore, Abhinav touches on Mosaic's advances using FP8 on H100s and the architectural applications in the pipeline. By year's end, the cost of training specific models could drop significantly, from $500k to an ambitious $100k. Jonathan chimes in, noting that cost reductions for new models can be substantial because new models usually start out inefficient; as an example, Mosaic reduced costs by 12x for Stable Diffusion due to its newness and inherent inefficiencies.
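To ground the precision discussion: 16-bit mixed precision is essentially one toggle away in today's frameworks, while FP8 needs extra library and hardware support. A minimal bfloat16 sketch:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")

# Matmuls run in bfloat16 while numerically sensitive ops stay in float32.
# FP8 training on H100s needs additional support (e.g. NVIDIA's Transformer
# Engine) and is not shown here.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```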

On the topic of model contexts, Alessio brings up the challenge of determining the right number, suggesting customer needs and specific tasks as potential determinants. Jonathan offers a critical view of long contexts and attention methods, emphasizing the importance of firsthand implementation and testing. He raises concerns about relying solely on published papers, indicating that practical results often differ from promises made in research.

Toward the end, the group draws parallels between RAM in computers and context in AI models. Swyx mentions a comparison made by Andrej Karpathy, suggesting context is the RAM of AI. Jonathan humorously alludes to Bill Gates' famed statement on RAM and speculates about the future. As AI gets more ambitious, so will the demands on context. Future models might process complex data forms, like images and videos, further intensifying the need for efficient and rapid training.


The Evolution and Endurance of Transformers in AI

During the "Trends and Transformers" discussion, Swyx and Jonathan explore the longevity of transformer models in AI, suggesting that, just like Convolutional Neural Networks, they could remain in use for a long time. They delve into the nuances of attention mechanisms, with Jonathan emphasizing the importance of historical model architectures and how challenging it is to find fundamental improvements. He also mentions the significance of demos for their business, as they demonstrate capabilities without giving everything away.

Discussing the open-research ethos, Jonathan explains the balance they strike between open work and proprietary customer solutions. Abhinav introduces the topic of model sparsity, pointing out the current lack of hardware to accelerate sparse models, though he acknowledges advancements like those from Cerebras. The conversation concludes with hardware-software co-evolution: new architectures need to pair well with emerging hardware, much like how transformers are apt for GPUs despite being initially designed for TPUs.
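For a concrete picture of the sparsity being discussed (the kind studied in Jonathan's lottery-ticket work), here is a minimal unstructured magnitude-pruning sketch; without sparse-aware hardware or kernels, the zeros bring no speedup, which is exactly the gap Abhinav describes.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(weight.numel() * sparsity)                 # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(512, 512)
w_sparse = magnitude_prune(w, sparsity=0.9)
print((w_sparse == 0).float().mean().item())           # ~0.9
```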


Unraveling AI's Unexpected Strides and the Road Ahead

AI's Surprising Progress and Future Predictions

In a lively discussion, Alessio prompted the guests to share their views on AI's trajectory. Jonathan expressed surprise at the swift progress, particularly in models like GPT-3, and admitted to being wrong about the scalability and usefulness of such models. Abhinav was similarly astonished by the rapid adoption and the emotional connections users are making with large-scale chatbots. They both reflected on the boundaries of AI, from Abhinav's curiosity about the potential of quantizing models down to analog levels to Jonathan's desire to explore efficient paths to AI capabilities without massive scale.

The conversation ended with a call for balance in the AI discourse. Jonathan urged listeners to remain grounded, avoiding hyperbolic fear and embracing the constructive tools AI provides. Abhinav emphasized the importance of open research, championing collaborative oversight for AI's safety and real-world implications. The segment closed with gratitude for keeping AI discussions and advancements open to the public.
