
AI-powered applications they build, and present roadblocks for bringing AI into production. If the LLM produces inappropriate answers, it also erodes consumers' trust in the company itself, damaging the brand.
As one corporate LLM user told me: "We want an easy way to evaluate and test the accuracy of different models and applications instead of taking the 'looks good to me' approach." From evaluation to ongoing monitoring, observability is increasingly important to any organisation using AI applications.
AI observability gives the owners of AI applications the power to monitor, measure, and correct performance, helping in three different aspects of corporate AI use:
Evaluation and experimentation
With so many AI models and tools on the market, it is important that enterprises can easily determine which elements work best for their specific AI app use case. Observability is critical for evaluating different LLMs, configuration choices, code libraries, and more, enabling users to optimise their tech choices for each project.
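To make that concrete, here is a minimal sketch of such an evaluation harness in Python. The call_llm stub, the tiny evaluation set, and the substring-match scorer are all illustrative assumptions rather than any particular vendor's API; in practice you would wire in your own model client and a richer scoring method.

```python
import statistics

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical placeholder; wire in your real model client here."""
    raise NotImplementedError("supply your LLM provider call")

# A tiny illustrative evaluation set: each item pairs a prompt
# with a reference answer the model is expected to produce.
EVAL_SET = [
    {"prompt": "What year was the company founded?", "reference": "1998"},
    {"prompt": "Which region had the highest Q3 revenue?", "reference": "EMEA"},
]

def score(answer: str, reference: str) -> float:
    """Crude correctness check: does the answer contain the reference?"""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def evaluate(models: list[str]) -> dict[str, float]:
    """Run every candidate model over the eval set and average its scores."""
    results = {}
    for model in models:
        scores = [score(call_llm(model, item["prompt"]), item["reference"])
                  for item in EVAL_SET]
        results[model] = statistics.mean(scores)
    return results
```

Even a crude scored harness like this replaces the 'looks good to me' check with numbers that make model and configuration choices comparable and repeatable.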
Monitoring and iteration
Once an AI app has been deployed and is in use, observability helps with logging execution traces and monitoring its ongoing performance. When problems crop up, observability is crucial for diagnosing the source, fixing it, and then validating that the fix worked, an iterative process of continuous improvement familiar to anyone who has worked with cloud software.
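As an illustration, a minimal execution-trace logger can be built with nothing but the Python standard library. The JSON field names below are assumptions for the sketch, not any particular observability tool's schema.

```python
import functools
import json
import time
import uuid

def traced(fn):
    """Log one JSON line per call: inputs, output, latency, and any error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {
            "trace_id": str(uuid.uuid4()),
            "step": fn.__name__,
            "inputs": [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()],
        }
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            record["output"] = repr(result)[:500]  # truncate long outputs
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
            print(json.dumps(record))  # in production, ship to a log store instead

@traced
def answer_question(question: str) -> str:
    return "stub answer"  # stand-in for the real LLM call
```

With every step of the application emitting records like this, diagnosing a bad answer becomes a matter of reading the trace rather than guessing.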
Tracking costs and latency
Technology leaders are becoming increasingly practical about their AI efforts. Gone are the days of unchecked AI spending; leaders are now deeply concerned with the ROI of their AI investments and with understanding which use cases are delivering business results.
From this perspective, the two essential dimensions to measure are how much an application costs to run and how long it takes to deliver answers, known as latency.
Throwing more GPUs and servers at an application can reduce latency, but it drives up cost. You cannot find the right balance for your application unless you can measure both accurately. Observability gives enterprises a clearer picture of both of these elements, enabling them to maximise results and minimise costs.
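A minimal sketch of measuring both dimensions per request might look like this. The price table is purely illustrative, and call_llm and count_tokens are assumed helpers you would supply; real per-token prices vary by provider and model.

```python
import time

# Illustrative per-1,000-token prices; substitute your provider's real rates.
PRICE_PER_1K = {"model-a": {"input": 0.0005, "output": 0.0015}}

def measure(model: str, prompt: str, call_llm, count_tokens) -> dict:
    """Time one call and estimate its cost from token counts.

    call_llm and count_tokens are assumed helpers supplied by the caller.
    """
    start = time.perf_counter()
    answer = call_llm(model, prompt)
    latency_s = time.perf_counter() - start

    prices = PRICE_PER_1K[model]
    cost = (count_tokens(prompt) / 1000 * prices["input"]
            + count_tokens(answer) / 1000 * prices["output"])
    return {"latency_s": round(latency_s, 3), "cost_usd": round(cost, 6)}
```

Logging these two numbers for every request is what turns the cost-versus-latency tradeoff from a hunch into a measurable dial.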
As enterprises bring AI applications into production, they must expect and demand more than good enough. For AI to become a reliable, trustworthy component of business infrastructure, LLM application answers must align with the 3H rule: they must be honest, harmless, and helpful.
They need to be honest, meaning factually accurate and free of hallucinations. Enterprises must be able to use them for tasks where their generalisation is desirable: summarising, generating inferences, and planning. Honest AI also means the system recognises and acknowledges when it cannot accurately answer a question. For example, if the answer simply does not exist, the LLM should say 'I cannot answer that' rather than spitting out something random.
For tasks where memorisation of facts is more important, we need to supplement LLMs with additional information and data sources to ensure that responses are accurate. This is an active field of research known as retrieval-augmented generation, or RAG: combining LLMs with databases of factual data from which they can retrieve material to answer specific questions.
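The retrieve-then-generate loop at the heart of RAG is easy to sketch. The word-overlap retriever below is a deliberately crude stand-in for a real embedding-based vector search, and call_llm is again a hypothetical placeholder for your model client.

```python
def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use embeddings and a vector index instead."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_answer(query: str, documents: list[str], call_llm) -> str:
    """Ground the model's answer in retrieved context."""
    context = "\n".join(retrieve(query, documents))
    prompt = ("Answer using ONLY the context below. If the context does not "
              "contain the answer, say you cannot answer.\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return call_llm(prompt)
```

Note how the prompt instructs the model to refuse when the context falls short, the same honesty requirement described above.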
AI needs to be harmless, meaning answers do not leak personally identifiable information and are not vulnerable to jailbreak attacks designed to circumvent their designers’ guardrails. Those guardrails must ensure that the answers do not embody bias, hurtful stereotypes, or toxicity.
Finally, AI needs to be helpful. It needs to deliver answers that match the queries users give it, that are concise and coherent, and that provide useful results.
RAG triad: context relevance, groundedness, answer relevance
The RAG Triad is one example of a set of metrics that helps evaluate RAG applications to ensure that they are honest and helpful. It includes three metrics, context relevance, groundedness, and answer relevance, to measure the quality of the three steps of a typical RAG application.
• Context relevance measures how relevant each piece of retrieved context from a knowledgebase is to the query that was asked.
• Groundedness measures how well the final response is grounded in, or supported by, the retrieved pieces of context.
• Answer relevance measures how relevant the final response is to the query that was asked.
By decomposing a composite RAG system into its components, query, context, and response, this evaluation framework can triage failure points, provide a clearer understanding of where improvements are needed, and guide targeted optimisation.
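One way to wire the triad together is sketched below. The judge function is a hypothetical stand-in for whatever scoring method you choose, such as an LLM-as-judge prompt or an embedding-similarity model; only the three-way decomposition comes from the triad itself.

```python
import statistics

def judge(instruction: str, a: str, b: str) -> float:
    """Hypothetical scorer returning 0.0-1.0; plug in an LLM judge
    or an embedding-similarity model here."""
    raise NotImplementedError("supply a scoring method")

def rag_triad(query: str, contexts: list[str], answer: str) -> dict[str, float]:
    # Context relevance: score each retrieved chunk against the query.
    context_relevance = statistics.mean(
        judge("How relevant is this context to the query?", query, c)
        for c in contexts)
    # Groundedness: is the final answer supported by the retrieved context?
    groundedness = judge("Is this answer supported by this context?",
                         "\n".join(contexts), answer)
    # Answer relevance: does the answer actually address the query?
    answer_relevance = judge("How relevant is this answer to the query?",
                             query, answer)
    return {"context_relevance": context_relevance,
            "groundedness": groundedness,
            "answer_relevance": answer_relevance}
```

A low score on any one leg points to the failing component: retrieval, generation grounding, or query understanding.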