
The Week in AI: Anthropic's Prompt Caching, GitHub's Copilot Autofix, Neural Magic's LLM Compressor, More

This edition of our weekly roundup of AI products and services includes GitHub's Copilot Autofix, Lambda Labs and Nous Research's Hermes 3, Neural Magic's LLM Compressor, Primate Labs' Geekbench AI 1.0, and more.

AnswerAI unveiled answerai-colbert-small-v1, a new proof-of-concept model that shows the potential of multi-vector models when combined with advanced training techniques. Developed using the JaColBERTv2.5 training recipe and additional optimizations, the compact model, at just 33 million parameters and with a footprint comparable to MiniLM, has surpassed all previous models of similar size on common benchmarks, its developers said. It has even outperformed larger, widely used models, including e5-large-v2 and bge-base-en-v1.5, an achievement they said underscores the potential of AnswerAI's approach to pushing the boundaries of what's possible with smaller, more efficient AI models. The model was designed with future compatibility in mind, particularly for the upcoming RAGatouille overhaul.
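Given that focus on RAGatouille compatibility, the quickest way to try the model is likely as a reranker through RAGatouille itself. The sketch below is illustrative only: it assumes the checkpoint is published on Hugging Face as "answerdotai/answerai-colbert-small-v1" and that RAGatouille's current rerank API and result fields apply; check the model card for exact details.

```python
# Illustrative only: reranking a few candidate passages with
# answerai-colbert-small-v1 via RAGatouille (pip install ragatouille).
# The Hugging Face model ID below is an assumption; check the model card.
from ragatouille import RAGPretrainedModel

model = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

docs = [
    "ColBERT scores a query against a document token by token (late interaction).",
    "MiniLM is a compact single-vector embedding model.",
    "Multi-vector retrievers keep one embedding per token rather than one per text.",
]

# Score and reorder the candidates for the query; the result field names
# follow RAGatouille's search-result format (an assumption here).
results = model.rerank(query="What is multi-vector retrieval?", documents=docs, k=3)
for r in results:
    print(round(r["score"], 2), r["content"])
```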

Anthropic announced that "prompt caching," which allows developers to cache frequently used context between API calls, is now available on the Anthropic API. The new feature improves the response times of its Claude GenAI models. "With prompt caching, customers can provide Claude with more background knowledge and example outputs," the company said in a blog post, "all while reducing costs by up to 90% and latency by up to 85% for long prompts." Early adopters of prompt caching have reported substantial improvements in speed and cost efficiency across various use cases, including complex multi-turn conversations and many-shot prompting, the company said. Prompt caching is available now in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.
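In practice, a developer marks the reusable part of a prompt with a cache_control block. The sketch below follows the beta documentation Anthropic published at launch; the beta header and model string are copied from that announcement and may change as the feature evolves.

```python
# Minimal prompt-caching sketch, based on Anthropic's public-beta docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = "..."  # placeholder for a large, frequently reused document

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_context,
            # Mark this block for caching so subsequent calls can reuse it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points."}],
    # Beta header from the launch announcement; may change after beta.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```

Note that, per the documentation, cached blocks must exceed a minimum token length, so a real cached document would be far longer than the placeholder above.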

GitHub has announced the general availability of Copilot Autofix, an AI-driven tool for vulnerability remediation within GitHub Advanced Security. Initially introduced in a public beta in May, Copilot Autofix uses generative AI to detect and address vulnerabilities in new code during pull requests, offering fixes before the code is pushed to production. The tool serves as a virtual security expert, scanning existing code, identifying vulnerabilities, and providing fixes with detailed explanations. GitHub reported that Copilot Autofix significantly speeds up remediation: cross-site scripting fixes took just 22 minutes compared with 2.8 hours manually, and SQL injection vulnerabilities were resolved in 18 minutes versus 3.7 hours manually.

AI startups Lambda Labs and Nous Research launched Hermes 3, a new large language model described as a "personalized, unrestricted" version of Meta's open-source Llama 3.1 model. Hermes 3, available in parameter sizes of 8 billion, 70 billion, and 405 billion, was designed to be more adaptable and customizable than existing models, boasting enhanced reasoning, creativity, and long-term context retention. Hermes 3 stands out for its open weights, which allow users to tailor its responses to specific needs, an approach that contrasts with the rigidity of many leading large language models (LLMs). The model also features agentic capabilities, enabling it to perform tasks such as generating code, providing detailed explanations, and engaging in visual communication using Mermaid diagrams. The model is available through the Lambda Chat Completions API and chat interface, and Lambda and Nous Research are encouraging users to engage with Hermes 3 and explore its capabilities. Hermes 3 can be deployed on a single Lambda node or scaled to a multi-node configuration for further fine-tuning.
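For developers who want to try it, the Lambda Chat Completions API follows the familiar OpenAI-style request schema, so an OpenAI-compatible client should work. The base URL and model identifier below are assumptions for illustration; Lambda's documentation has the exact values.

```python
# Calling Hermes 3 through Lambda's OpenAI-compatible Chat Completions API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LAMBDA_API_KEY",
    base_url="https://api.lambdalabs.com/v1",  # assumed endpoint
)

resp = client.chat.completions.create(
    model="hermes-3-llama-3.1-405b-fp8",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are Hermes 3, a helpful assistant."},
        {"role": "user", "content": "Sketch a Mermaid diagram of a login flow."},
    ],
)
print(resp.choices[0].message.content)
```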

Researchers at DeepSeek-AI announced the release of DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4. It enhances DeepSeek-Prover-V1 by optimizing both training and inference. Pre-trained on DeepSeekMath-Base with a specialization in formal mathematical languages, the model undergoes supervised fine-tuning on an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. At inference time, the method begins with whole-proof generation: the language model produces complete proof code from the theorem statement, and the Lean prover verifies it. If an error is detected, the code is truncated at the first error message, the successfully generated portion serves as a prompt for the next proof segment, and the latest state from the Lean 4 prover is appended as a comment to improve accuracy. This truncate-and-resume mechanism is integrated into Monte-Carlo tree search (MCTS), allowing flexible truncation points determined by the tree search policy. Finally, the researchers propose a reward-free exploration algorithm to address the reward sparsity of proof search, assigning intrinsic motivation to the tree search agent for extensive exploration of the tactic state space.
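In outline, the truncate-and-resume loop works as sketched below. This is a simplified illustration, not DeepSeek's code: generate_proof and lean_verify are hypothetical stand-ins for the language model and the Lean 4 checker, and the MCTS integration is omitted.

```python
# Hypothetical sketch of truncate-and-resume; not DeepSeek's actual interfaces.
from dataclasses import dataclass

@dataclass
class LeanError:
    offset: int      # character position of the first error in the proof code
    goal_state: str  # Lean's remaining proof goal at that point

def generate_proof(statement: str, prefix: str) -> str: ...  # stand-in for the LLM
def lean_verify(statement: str, code: str) -> tuple[bool, LeanError | None]: ...  # stand-in for Lean 4

def prove(statement: str, max_attempts: int = 8) -> str | None:
    """Regenerate whole proofs, resuming from the last verified prefix."""
    prefix = ""
    for _ in range(max_attempts):
        # Whole-proof generation conditioned on the statement and verified prefix.
        candidate = prefix + generate_proof(statement, prefix)
        ok, err = lean_verify(statement, candidate)
        if ok:
            return candidate  # Lean accepted the complete proof
        # Truncate at the first error message, keeping the verified portion,
        # and append the latest prover state as a comment to guide the model.
        prefix = candidate[: err.offset] + f"\n-- tactic state: {err.goal_state}\n"
    return None  # attempt budget exhausted
```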

Neural Magic released the LLM Compressor, a new tool designed to optimize large language models by enabling faster inference through advanced compression techniques. The LLM Compressor integrates various model compression tools, unifying state-of-the-art algorithms like GPTQ, SmoothQuant, and SparseGPT into a single library. With this approach, the company aims to address the fragmented landscape of existing tools and simplify the application of compression algorithms. The LLM Compressor comes with support for activation and weight quantization, which leverages INT8 and FP8 tensor cores optimized for NVIDIA's latest GPU architectures. This quantization can double performance in inference tasks, particularly under high server loads, as demonstrated by models like Llama 3.1 70B. In addition to quantization, the tool supports structured sparsity and weight pruning, significantly reducing model size while maintaining accuracy. This allows for efficient deployment on resource-constrained hardware and enhances the performance of large language models in production environments.
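A typical workflow applies a compression "recipe" to a model in a single pass. The snippet below is adapted from the quickstart Neural Magic published with the release; module paths and argument names are taken from that example and may have shifted in later versions.

```python
# One-shot INT8 (W8A8) quantization with LLM Compressor, adapted from the
# project's published quickstart; verify names against the current docs.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Recipe: smooth activation outliers, then apply GPTQ weight/activation quantization.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # small model for a quick test
    dataset="open_platypus",                     # calibration data
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",  # quantized model, ready to serve
    max_seq_length=2048,
    num_calibration_samples=512,
)
```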

Primate Labs, known for its cross-platform benchmarking software, launched Geekbench AI 1.0, a new tool designed to assess the real-world performance of AI workloads on mobile and desktop platforms. Previously called Geekbench ML, the software has been rebranded to align with industry naming conventions. It measures performance across machine learning and deep learning tasks and introduces a three-metric scoring system to provide a more nuanced picture of AI performance across different hardware. The tool lets developers see how different hardware optimizations affect specific AI tasks, and it includes workload accuracy measurements to help them fine-tune models based on performance and error rates. It supports a range of AI frameworks, including OpenVINO on Linux and Windows and TensorFlow Lite on Android, and is available for Windows, macOS, and Linux as well as on the Google Play Store and Apple App Store. The tool also integrates with Geekbench Browser for cross-platform benchmarking.
