Awesome Azure OpenAI LLM Overview

"Awesome-LLM: a curated list of Azure OpenAI & Large Language Models" 🔎References to Azure OpenAI, 🦙Large Language Models, and related 🌌 services and 🎋libraries.


Azure OpenAI + LLMs (Large Language Models)

This repository contains references to Azure OpenAI, Large Language Models (LLM), and related services and libraries. It follows a similar approach to the ‘Awesome-list’.

🔹Each item is summarized in as few lines as possible.
🔹Dates are based on the commit history, the article's publication date, or the paper's initial (v1) issue date.
🔹The aim is to capture a chronicle and the key terms of this rapidly advancing field.
🔹Disclaimer: Please be aware that some content may be outdated.

What's the difference between Azure OpenAI and OpenAI?

  1. OpenAI offers the latest features and models, while Azure OpenAI provides a reliable, secure, and compliant environment with seamless integration into other Azure services.
  2. Azure OpenAI supports private networking, role-based authentication, and responsible AI content filtering.
  3. Azure OpenAI does not use customer input as training data for other customers. See: Data, privacy, and security for Azure OpenAI.

Table of contents

Section 1: RAG, LlamaIndex, and Vector Storage

What is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation: Research Papers

RAG Pipeline & Advanced RAG

The Problem with RAG

RAG Solution Design & Application

LlamaIndex

Expand: 4 RAG techniques
  1. SQL Router Query Engine: Query router that can reference your vector database or SQL database

  2. Sub Question Query Engine: Break a complex question down into sub-questions (see the sketch after this list)

  3. Recursive Retriever + Query Engine: Reference node relationships, rather than only finding a node (chunk) that is most relevant.

  4. Self Correcting Query Engines: Use an LLM to evaluate its own output.
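
The Sub Question technique (item 2 above) is straightforward to try in LlamaIndex. Below is a minimal sketch assuming llama-index >= 0.10 with an OpenAI key in the environment; module paths vary across releases, and the `./data` folder is a hypothetical document source.

```python
# Minimal sketch of the Sub Question Query Engine (assumes llama-index >= 0.10).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Build a plain vector-store query engine over local documents (hypothetical folder).
docs = SimpleDirectoryReader("./data").load_data()
base_engine = VectorStoreIndex.from_documents(docs).as_query_engine()

tools = [
    QueryEngineTool(
        query_engine=base_engine,
        metadata=ToolMetadata(name="docs", description="Company documents"),
    )
]

# The engine decomposes a complex question into sub-questions, answers each
# against the registered tools, then synthesizes a final answer.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("Compare revenue growth across the last two fiscal years."))
```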

Vector Database Comparison

Vector Database Options for Azure

Note: Azure Cache for Redis Enterprise: the Enterprise SKU series cannot be deployed via templates such as Bicep or ARM.


Lucene-based search engine with OpenAI embeddings

Section 2: Azure OpenAI and Reference Architecture

Microsoft Azure OpenAI relevant LLM Framework

LLM Integration Frameworks

  1. Semantic Kernel (Feb 2023): An open-source SDK for integrating AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages such as C# and Python. It's an LLM orchestrator, similar to LangChain. / git (⭐21k)
  2. Kernel Memory (⭐1.5k) (Jul 2023): An open-source service and plugin for efficient dataset indexing through custom continuous data hybrid pipelines.
  3. Azure ML Prompt Flow (Jun 2023): A visual designer for prompt crafting using Jinja as a prompt template language. / ref / git (⭐9.1k)

Prompt Optimization

  1. Prompt Engine (⭐2.5k) (Jun 2022): A tool for crafting prompts for large language models in Python. / Python (⭐204)
  2. PromptBench (⭐2.3k) (Jun 2023): A unified evaluation framework for large language models.
  3. SAMMO (⭐309) (Apr 2024): A general-purpose framework for prompt optimization. / ref
  4. Prompty (⭐353) (Apr 2024): A template language for integrating prompts with LLMs and frameworks, enhancing prompt management and evaluation.
  5. guidance (⭐19k) (Nov 2022): A domain-specific language (DSL) for controlling large language models, focusing on model interaction and implementing the "Chain of Thought" technique.
  6. LMOps (⭐3.6k) (Dec 2022): A toolkit for improving text prompts used in generative AI models, including tools like Promptist for text-to-image generation and Structured Prompting.
  7. LLMLingua (⭐4.4k) (Jul 2023): A tool for compressing prompts and KV-Cache, achieving up to 20x compression with minimal performance loss. LLMLingua-2 was released in Mar 2024.
  8. TypeChat (Apr 2023): A tool that replaces prompt engineering with schema engineering, designed to build natural language interfaces using types. / git (⭐8.1k)

Agent Frameworks

  1. JARVIS (⭐24k) (Mar 2023): An interface for LLMs to connect numerous AI models for solving complex AI tasks.
  2. Autogen (⭐31k) (Mar 2023): A customizable and conversable agent framework. / ref / Autogen Studio (June 2024)
  3. TaskWeaver (⭐5.2k) (Sep 2023): A code-first agent framework for converting natural language requests into executable code with support for rich data structures and domain-adapted planning.
  4. UFO (⭐7.5k) (Mar 2024): A UI-focused agent for Windows OS interaction.
  5. Semantic Workbench (⭐42) (Aug 2024): A development tool for creating intelligent agents. / ref

Deep learning

  1. DeepSpeed (⭐35k) (May 2020): A deep learning optimization library for easy, efficient, and effective distributed training and inference, featuring the Zero Redundancy Optimizer.
  2. FLAML (⭐3.8k) (Dec 2020): A lightweight Python library for efficient automation of machine learning and AI operations, offering interfaces for AutoGen, AutoML, and hyperparameter tuning.

Risk Identification & Ops

  1. PyRIT (⭐1.7k) (Dec 2023): Python Risk Identification Tool for generative AI, focusing on LLM robustness against issues like hallucination, bias, and harassment.
  2. AI Central (⭐77) (Oct 2023): An AI Control Center for monitoring, authenticating, and providing resilient access to multiple OpenAI services.

Data processing

Microsoft Copilot Product Lineup

  1. Copilot Products

    • Microsoft Copilot in Windows vs Microsoft Copilot (= Copilot in Windows + Commercial Data Protection) vs Microsoft 365 Copilot (= Microsoft Copilot + M365 Integration) [Nov 2023]
    • Copilot Scenario Library
    1. Azure
    2. Microsoft 365 (Incl. Dynamics 365 and Power Platform)
    3. Windows, Bing and so on
  2. Customize Copilot

    1. Microsoft AI and AI Studio
    2. Copilot Studio
    3. Microsoft Office Copilot: Natural Language Commanding via Program Synthesis: [cnt]: Semantic Interpreter, a natural language-friendly AI system for productivity software such as Microsoft Office that leverages large language models (LLMs) to execute user intent across application features. [6 Jun 2023]
    4. NL2KQL: From Natural Language to Kusto Query [3 Apr 2024]
    5. SpreadsheetLLM: Introduces an efficient method to encode Excel sheets, outperforming previous approaches with 25 times fewer tokens. [12 Jul 2024]
    6. GraphRAG (by Microsoft): RAG with a graph-based approach to efficiently answer both specific and broad questions over large text corpora. ref git (⭐17k) [24 Apr 2024]
    7. AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems [9 Aug 2024]

Azure Reference Architectures

  • Azure OpenAI Embeddings QnA (⭐44) [Apr 2023]
  • Azure Cosmos DB + OpenAI ChatGPT (⭐254): C# Blazor [Mar 2023]
  • C# Implementation (⭐582): ChatGPT + Enterprise data with Azure OpenAI and Cognitive Search [Apr 2023]
  • Simple ChatGPT UI application (⭐116): TypeScript, React, and Flask [Apr 2023]
  • Azure Video Indexer demo: Azure Video Indexer + OpenAI [Apr 2023]
  • Miyagi (⭐714): Integration demo for multiple LangChain libraries [Feb 2023]
  • ChatGPT + Enterprise data RAG (Retrieval-Augmented Generation) (⭐5.8k)🏆 [Feb 2023]
  • Chat with your data - Solution accelerator (⭐769) [Jun 2023]

Azure Enterprise Services

Section 3: Microsoft Semantic Kernel and Stanford NLP DSPy

Semantic Kernel

Feature Roadmap

Code Recipes

Semantic Kernel Planner

Semantic Function

1. Variables: use the {{$variableName}} syntax, e.g., Hello {{$name}}, welcome to Semantic Kernel!
2. Function calls: use the {{namespace.functionName}} syntax, e.g., The weather today is {{weather.getForecast}}.
3. Function parameters: use the {{namespace.functionName $varName}} and {{namespace.functionName "value"}} syntax, e.g., The weather today in {{$city}} is {{weather.getForecast $city}}.
4. Prompts needing literal double curly braces: {{ "{{" }} and {{ "}}" }} are special SK sequences.
5. Values that include quotes need escaping: for instance, ... {{ 'no need to \\"escape" ' }} ... is equivalent to ... {{ 'no need to "escape" ' }} ...
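
To illustrate the {{$variable}} substitution above, here is a toy renderer in Python. It is not Semantic Kernel itself; the `render` helper is a hypothetical hand-rolled sketch that covers the variable syntax only.

```python
import re

def render(template: str, variables: dict) -> str:
    """Toy renderer for the {{$variable}} syntax only (not Semantic Kernel itself)."""
    return re.sub(
        r"\{\{\$(\w+)\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )

print(render("Hello {{$name}}, welcome to Semantic Kernel!", {"name": "Ada"}))
# -> Hello Ada, welcome to Semantic Kernel!
```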

Semantic Kernel Glossary

DSPy

DSPy Glossary

DSPy optimizer

Expand

Section 4: LangChain Features, Usage, and Comparisons

LangChain Feature Matrix & Cheatsheet

LangChain chain type: Chains & Summarizer

LangChain Agent & Memory

LangChain Agent

  1. If you're using a text LLM, first try zero-shot-react-description.
  2. If you're using a Chat Model, try chat-zero-shot-react-description.
  3. If you're using a Chat Model and want to use memory, try conversational-react-description.
  4. self-ask-with-search: Measuring and Narrowing the Compositionality Gap in Language Models [7 Oct 2022]
  5. react-docstore: ReAct: Synergizing Reasoning and Acting in Language Models [6 Oct 2022]
  6. Agent Type
```python
from enum import Enum

class AgentType(str, Enum):
    """Enumerator with the Agent types."""

    ZERO_SHOT_REACT_DESCRIPTION = "zero-shot-react-description"
    REACT_DOCSTORE = "react-docstore"
    SELF_ASK_WITH_SEARCH = "self-ask-with-search"
    CONVERSATIONAL_REACT_DESCRIPTION = "conversational-react-description"
    CHAT_ZERO_SHOT_REACT_DESCRIPTION = "chat-zero-shot-react-description"
    CHAT_CONVERSATIONAL_REACT_DESCRIPTION = "chat-conversational-react-description"
    STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION = (
        "structured-chat-zero-shot-react-description"
    )
    OPENAI_FUNCTIONS = "openai-functions"
    OPENAI_MULTI_FUNCTIONS = "openai-multi-functions"
```
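
A minimal usage sketch tying the list above to the enum. It assumes the classic `langchain` package (where `initialize_agent` lived before its deprecation) and an OpenAI key in the environment.

```python
# Minimal sketch using the classic langchain API (deprecated in later releases).
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)                # reads OPENAI_API_KEY from the environment
tools = load_tools(["llm-math"], llm=llm)  # a simple calculator tool

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # first choice for a text LLM
    verbose=True,
)
agent.run("What is 37 raised to the 0.5 power?")
```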

LangChain Memory

  1. ConversationBufferMemory: Stores the entire conversation history.
  2. ConversationBufferWindowMemory: Stores recent messages from the conversation history.
  3. Entity Memory: Stores and retrieves entity-related information.
  4. Conversation Knowledge Graph Memory: Stores entities and relationships between entities.
  5. ConversationSummaryMemory: Stores summarized information about the conversation.
  6. ConversationSummaryBufferMemory: Stores summarized information about the conversation with a token limit.
  7. ConversationTokenBufferMemory: Stores tokens from the conversation.
  8. VectorStore-Backed Memory: Leverages vector space models for storing and retrieving information.
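
A minimal sketch of the first memory type, assuming the classic `langchain` package (these memory classes were deprecated in later releases):

```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({"input": "Hi, I'm Alice."}, {"output": "Hello Alice!"})
memory.save_context({"input": "What's my name?"}, {"output": "Your name is Alice."})

# The buffer returns the entire conversation history as one string.
print(memory.load_memory_variables({})["history"])
```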

Criticism to LangChain

LangChain vs Competitors

Prompting Frameworks

LangChain vs LlamaIndex

LangChain vs Semantic Kernel

| LangChain | Semantic Kernel |
|---|---|
| Memory | Memory |
| Toolkit | Plugin (previously Skill) |
| Tool | LLM prompts (semantic functions) or native C# or Python code (native functions) |
| Agent | Planner |
| Chain | Steps, Pipeline |
| Tool | Connector |

LangChain vs Semantic Kernel vs Azure Machine Learning Prompt flow

Prompt Template Language

| | Handlebars.js | Jinja2 | Prompt Template (SK) |
|---|---|---|---|
| Conditions | {{#if user}}<br>  Hello {{user}}!<br>{{else}}<br>  Hello Stranger!<br>{{/if}} | {% if user %}<br>  Hello {{ user }}!<br>{% else %}<br>  Hello Stranger!<br>{% endif %} | Branching features such as "if", "for", and code blocks are not part of SK's template language. |
| Loop | {{#each items}}<br>  Hello {{this}}<br>{{/each}} | {% for item in items %}<br>  Hello {{ item }}<br>{% endfor %} | By using a simple language, the kernel can also avoid complex parsing and external dependencies. |
| Library | guidance, LangChain.js | LangChain, Azure ML prompt flow | Semantic Kernel |
| URL | ref | ref | ref |
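
To see the Jinja2 column in action, the snippet below renders the condition example with the `jinja2` package; the variable value is arbitrary.

```python
from jinja2 import Template

template = Template(
    "{% if user %}Hello {{ user }}!{% else %}Hello Stranger!{% endif %}"
)
print(template.render(user="Ada"))  # -> Hello Ada!
print(template.render())            # -> Hello Stranger!
```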

Section 5: Prompt Engineering, Finetuning, and Visual Prompts

Prompt Engineering

  1. Zero-shot

  2. Few-shot Learning

  3. Chain of Thought (CoT): Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [cnt]: ReAct and Self Consistency also inherit the CoT concept. [28 Jan 2022]

  4. Self-Consistency: The three steps in the self-consistency method: 1) prompt the language model using CoT prompting, 2) sample a diverse set of reasoning paths from the language model, and 3) marginalize over the reasoning paths to aggregate final answers and choose the most consistent one (a sketch follows this list). [21 Mar 2022]

  5. Recursively Criticizes and Improves (RCI): [cnt] [30 Mar 2023]

    • Critique: Review your previous answer and find problems with your answer.
    • Improve: Based on the problems you found, improve your answer.
  6. ReAct (Reason + Act): [cnt]: Combines reasoning and acting; grounds reasoning in external sources. ref [6 Oct 2022]

  7. Tree of Thought: [cnt]: Self-evaluate the progress intermediate thoughts make towards solving a problem [17 May 2023] git (⭐4.5k) / Agora: Tree of Thoughts (ToT) git (⭐4.2k)

    • tree-of-thought\forest_of_thought.py: Forest of thought Decorator sample
    • tree-of-thought\tree_of_thought.py: Tree of thought Decorator sample
    • tree-of-thought\react-prompt.py: ReAct sample without LangChain
  8. Graph of Thoughts (GoT): [cnt] Solving Elaborate Problems with Large Language Models git (⭐2k) [18 Aug 2023]

  9. Retrieval Augmented Generation (RAG): [cnt]: To address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. [22 May 2020]

  10. Zero-shot, one-shot and few-shot cite [28 May 2020]

  11. Prompt Engineering overview cite [10 Jul 2023]

    • Prompt Concept

      1. Question-Answering
      2. Role-play: Act as a [ROLE] perform [TASK] in [FORMAT]
      3. Reasoning
      4. Prompt-Chain
  12. Chain-of-Verification reduces Hallucination in LLMs: [cnt]: A four-step process that consists of generating a baseline response, planning verification questions, executing verification questions, and generating a final verified response based on the verification results. [20 Sep 2023]

  13. Plan-and-Solve Prompting: Develop a plan, and then execute each step in that plan. [6 May 2023]

  14. Reflexion: [cnt]: Language Agents with Verbal Reinforcement Learning. 1. Reflexion that uses verbal reinforcement to help agents learn from prior failings. 2. Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode. 3. It is lightweight and doesn’t require finetuning the LLM. [20 Mar 2023] / git (⭐2.3k)

  15. Large Language Models as Optimizers: [cnt]: Optimization by PROmpting (OPRO); the discovered prompt "Take a deep breath and work on this problem step-by-step" improved accuracy. [7 Sep 2023]

  16. Promptist

    • Promptist: Microsoft's researchers trained an additional language model (LM) that optimizes text prompts for text-to-image generation.
      • For example, instead of simply passing "Cats dancing in a space club" as a prompt, an engineered prompt might be "Cats dancing in a space club, digital painting, artstation, concept art, soft light, hdri, smooth, sharp focus, illustration, fantasy."
  17. Power of Prompting

    • GPT-4 with Medprompt: GPT-4, using a method called Medprompt that combines several prompting strategies, has surpassed MedPaLM 2 on the MedQA dataset without the need for fine-tuning. ref [28 Nov 2023]
    • promptbase (⭐5.3k): Scripts demonstrating the Medprompt methodology [Dec 2023]
  18. Adversarial Prompting

    • Prompt Injection: Ignore the above directions and ...
    • Prompt Leaking: Ignore the above instructions ... followed by a copy of the full prompt with exemplars:
    • Jailbreaking: Bypassing a safety policy; the model can be induced to follow unethical instructions if the request is contextualized in a clever way. ref
  19. Prompt Principle for Instructions: 26 prompt principles: e.g., 1) No need to be polite with LLM so there .. 16) Assign a role.. 17) Use Delimiters.. [26 Dec 2023]

  20. ChatGPT: "user", "assistant", and "system" messages.

    To be specific, the ChatGPT API allows for differentiation between "user", "assistant", and "system" messages (a sketch follows this list).

    1. The model always obeys "system" messages.
    2. All end-user input goes into "user" messages.
    3. "assistant" messages hold the assistant's previous chat responses.

    Presumably, the model is trained to treat user messages as human input, system messages as system-level configuration, and assistant messages as its own previous responses. ref [2 Mar 2023]

  21. Many-Shot In-Context Learning: Transitioning from few-shot to many-shot In-Context Learning (ICL) can lead to significant performance gains across a wide variety of generative and discriminative tasks [17 Apr 2024]

  22. Skeleton Of Thought: Skeleton-of-Thought (SoT) reduces generation latency by first creating an answer's skeleton, then filling each skeleton point in parallel via API calls or batched decoding. [28 Jul 2023]

  23. NLEP (Natural Language Embedded Programs) for Hybrid Language Symbolic Reasoning: Use code as a scaffold for reasoning. NLEP achieves over 90% accuracy when prompting GPT-4. [19 Sep 2023]

  24. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications: a summary detailing the prompting methodology, its applications.🏆Taxonomy of prompt engineering techniques in LLMs. [5 Feb 2024]

  25. Is the new norm for NLP papers "prompt engineering" papers?: "how can we make LLM 1 do this without training?" Is this the new norm? The CL section of arXiv is overwhelming with papers like "how come LLaMA can't understand numbers?" [2 Aug 2024]
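
Two sketches for items above. First, self-consistency (item 4): `complete()` is a hypothetical helper that returns one sampled chain-of-thought completion per call (temperature > 0 in practice), and the answer-extraction step is deliberately naive.

```python
from collections import Counter

def self_consistent_answer(complete, question: str, n_paths: int = 5) -> str:
    answers = []
    for _ in range(n_paths):
        # 1) CoT prompt, 2) sample one diverse reasoning path per call
        reasoning = complete(f"{question}\nLet's think step by step.")
        # Naive parse: take the final line as the answer (task-specific in practice).
        answers.append(reasoning.strip().splitlines()[-1])
    # 3) Marginalize over paths: majority vote on the final answers.
    return Counter(answers).most_common(1)[0][0]
```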
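
Second, role-separated messages (item 20), using the OpenAI Python SDK (openai >= 1.0); the model name is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical model choice
    messages=[
        {"role": "system", "content": "You are a terse assistant."},    # system-level configuration
        {"role": "user", "content": "What is the capital of France?"},  # end-user input
        {"role": "assistant", "content": "Paris."},                     # prior assistant turn
        {"role": "user", "content": "And of Japan?"},
    ],
)
print(resp.choices[0].message.content)
```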

Prompt Tuner

  1. Automatic Prompt Engineer (APE): Automatically optimizing prompts. APE has discovered zero-shot Chain-of-Thought (CoT) prompts superior to human-designed prompts like “Let’s think through this step-by-step” (Kojima et al., 2022). The prompt “To get the correct answer, let’s think step-by-step.” triggers a chain of thought. Two approaches to generate high-quality candidates: forward mode and reverse mode generation. [3 Nov 2022] git (⭐1.1k) / ref [Mar 2024]

  2. Claude Prompt Engineer (⭐9.3k): Simply input a description of your task and some test cases, and the system will generate, test, and rank a multitude of prompts to find the ones that perform the best. [4 Jul 2023] / Anthropic Helper metaprompt ref / Claude Sonnet 3.5 for Coding

  3. Cohere’s new Prompt Tuner: Automatically improve your prompts [31 Jul 2024]

Prompt Guide & Leaked prompts

Finetuning

LLM Pre-training and Post-training Paradigms X-ref

PEFT: Parameter-Efficient Fine-Tuning (Youtube) [24 Apr 2023]

Llama Finetuning

RLHF (Reinforcement Learning from Human Feedback) & SFT (Supervised Fine-Tuning)

Model Compression for Large Language Models

Quantization Techniques

Pruning and Sparsification

Knowledge Distillation: Reducing Model Size with Textbooks

Memory Optimization

Other techniques and LLM patterns

Visual Prompting & Visual Grounding

Section 6: Large Language Model: Challenges and Solutions

OpenAI's Roadmap and Products

OpenAI's plans according to Sam Altman

OpenAI o1-preview

GPT-4 details leaked unverified

OpenAI Products

GPT series release date

Context constraints

Numbers LLM

Trustworthy, Safe and Secure LLM

Large Language Model Is: Abilities

Section 7: Large Language Model: Landscape

Large Language Models (in 2023)

  1. Change in perspective is necessary because some abilities only emerge at a certain scale. Some conclusions from the past are invalidated and we need to constantly unlearn intuitions built on top of such ideas.
  2. From first-principles, scaling up the Transformer amounts to efficiently doing matrix multiplications with many, many machines.
  3. Further scaling (think 10000x GPT-4 scale). It entails finding the inductive bias that is the bottleneck in further scaling.

Evolutionary Tree of Large Language Models

A Taxonomy of Natural Language Processing

Open-Source Large Language Models

Expand: Llama variants emerged in 2023

GPT for Domain Specific

MLLM (multimodal large language model)

Generative AI Landscape

| Model | Description | Strengths | Weaknesses |
|---|---|---|---|
| GANs | Two neural networks, a generator and a discriminator, work together. The generator creates synthetic samples, and the discriminator distinguishes between real and generated samples. | Unsupervised learning, able to mimic data distributions without labeled data, and versatile in applications like image synthesis, super-resolution, and style transfer. | Known for potentially unstable training and less diversity in generation. |
| VAEs | Consist of an encoder and a decoder. The encoder maps input data into a low-dimensional representation, and the decoder reconstructs the original input from this representation. e.g., DALL-E | Efficient at learning latent representations; usable for data denoising and anomaly detection, in addition to data generation. | Dependent on an approximate loss function. |
| Diffusion Models | Consist of forward and reverse diffusion processes. Forward diffusion adds noise to input data until white noise is obtained; the reverse process removes the noise to recover the original data. e.g., Stable Diffusion | Capable of producing high-quality, step-by-step samples. | Multi-step (often ~1000) generation process. |

Section 8: Survey and Reference

Survey on Large Language Models

Build an LLM from scratch: picoGPT and lit-gpt

LLM Materials for East Asian Languages

Japanese

Korean

Learning and Supplementary Materials

Section 9: Applications and Frameworks

Applications, Frameworks, and User Interface (UI/UX)

Agents: AutoGPT and Communicative Agents

Agent Design Patterns

Tool use: LLM to Master APIs

Agent Applications and Libraries

OSS Alternatives for OpenAI Code Interpreter (aka Advanced Data Analysis)

Caching

Defensive UX

LLM for Robotics: Bridging AI and Robotics

Awesome demo

Section 10: General AI Tools and Extensions

Section 11: Datasets for LLM Training

Pretrain for a base model

```json
{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
```

databricks-dolly-15k: Instruction-tuned. git: SFT training with QA pairs or dialog

```json
{
  "prompt": "What is the capital of France?",
  "response": "The capital of France is Paris."
},
{
  "prompt": "Can you give me a recipe for chocolate chip cookies?",
  "response": "Sure! ..."
}
```

Anthropic human-feedback: RLHF training - Chosen and Rejected pairs

```json
{
  "chosen": "I'm sorry to hear that. Is there anything I can do to help?",
  "rejected": "That's too bad. You should just get over it."
}
```

Section 12: Evaluating Large Language Models & LLMOps

Evaluating Large Language Models

LLM Evaluation Benchmarks

Expand

Language Understanding and QA

  1. MMLU (Massive Multitask Language Understanding) (⭐1.1k): Over 15,000 questions across 57 diverse tasks. [Published in 2021]
  2. TruthfulQA: Truthfulness. [Published in 2022]
  3. BigBench (⭐2.8k): 204 tasks; aims to predict future model capabilities. [Published in 2023]
  4. GLUE & SuperGLUE: GLUE (General Language Understanding Evaluation)

Coding

  1. HumanEval (⭐2.3k): Challenges coding skills. [Published in 2021]
  2. CodeXGLUE (⭐1.5k): Programming tasks.
  3. SWE-bench: Software Engineering Benchmark. Real-world software issues sourced from GitHub.
  4. MBPP (⭐34k): Mostly Basic Python Programming. [Published in 2021]

Chatbot Assistance

  1. Chatbot Arena: Human-ranked Elo rating.
  2. MT Bench (⭐36k): Multi-turn open-ended questions - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [9 Jun 2023]

Reasoning

  1. HellaSwag (⭐174): Commonsense reasoning. [Published in 2019]
  2. ARC (AI2 Reasoning Challenge) (⭐3.3k): Measures general fluid intelligence.
  3. DROP: Evaluates discrete reasoning.
  4. LogicQA (⭐105): Evaluates logical reasoning skills.

Translation

  1. WMT: Evaluates translation skills.

Math

  1. MATH (⭐813): Tests ability to solve math problems. [Published in 2021]
  2. GSM8K (⭐986): Arithmetic Reasoning. [Published in 2021]

Evaluation metrics

  1. Automated evaluation of LLMs
  2. Human evaluation of LLMs (possibly automated by LLM-based metrics): Evaluate the model's performance on NLU and NLG tasks, including relevance, fluency, coherence, and groundedness.

  3. Built-in evaluation methods in Prompt flow: ref [Aug 2023] / ref

LLMOps: Large Language Model Operations

Challenges in evaluating AI systems

  1. Pretraining on the Test Set Is All You Need: [cnt]
    • On that note, in the satirical Pretraining on the Test Set Is All You Need paper, the author trains a small 1M parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model. This is achieved by training the model on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated" intentionally or unintentionally (due to data contamination). cite [13 Sep 2023]
  2. Challenges in evaluating AI systems: The challenges and limitations of various methods for evaluating AI systems, such as multiple-choice tests, human evaluations, red teaming, model-generated evaluations, and third-party audits. doc [4 Oct 2023]
  3. Your AI Product Needs Evals [29 Mar 2024] / How to Evaluate LLM Applications: The Complete Guide [7 Nov 2023]

Contributors

https://github.com/kimtth all rights reserved.