Actuarial Intelligence with Generative AI – A Framework Illustrated Through Critical Illness Claims


Critical Illness, Disability

23 June 2026 Wendy Low, Samuel Lim


Many discussions of generative AI in insurance focus on the technology. The more interesting question is who is equipped to embed it in our workflow, and on that count actuaries have an unusual advantage. The discipline already trains its practitioners to take complex, ambiguous problems apart into structured components that can be tested and validated, and to do so with statistical rigour, an emphasis on explainability, and an eye on governance. Combined with a deep understanding of the insurance value chain, from product design and pricing through underwriting, claims, capital management, and experience analysis, these are precisely the strengths needed to turn powerful models into accountable decisions that satisfy both regulators and customers. Generative AI in this framework supplements rather than replaces human decision-making. It supports the actuary’s ability to think critically, structure problems clearly, and apply sound judgement at scale.

At Gen Re, our team began this work while developing a Critical Illness (CI) claims classification tool to support the long-standing CI Data Insights Study (also known as the Dread Disease Survey). Through this initiative, we recognised that applying generative AI effectively requires not just technology, but a principled, structured framework that reflects human reasoning.

CI claims classification has long been a challenge in CI analytical work, as the raw data typically consists of unstructured free-text narratives, and CI products often differ in their definitions and coverage criteria. Classifying CI claims accurately requires a nuanced blend of technical precision, medical literacy, and disciplined professional judgement.

The scale of this work is also significant. For example, Gen Re’s latest published CI Data Insights Hong Kong 2019–2023 covers over 69,000 claims collected over a five-year period for the Hong Kong market. We use the CI claims free-text classification problem to show how a sequential, multi-agent workflow can emulate the way actuaries break down and evaluate a claim.

Although CI classification serves as the illustration here, the underlying principles extend far beyond this domain, offering actuaries a repeatable framework for designing robust, explainable, and scalable AI‑enabled solutions.

From Raw Information to Structured Reasoning – A CI Claims Example

A common pitfall is treating generative AI as an all-knowing oracle: feeding it raw claim text and expecting a correct CI classification. This lack of structured guidance increases the model’s risk of hallucination,¹ reduces transparency, and misaligns the model’s reasoning with human actuarial logic.

To address this, we divide the problem into a sequence of explicit steps. Consider a simple claim scenario:

Claim Status: Accepted
Product Name: Happy Life CI Rider
Gender: Male
Age at Claim: 55
Raw Claim Description (with abbreviations):
Dx w acute ant STEMI 2/11/25
+ECG ST elev V2‑5
+Trop 12.8 ng/mL
s/p PCI LAD 100% occ
EF 38%
55M smoker HTN
Claim Type: Major CI with 100% Sum Assured payout

A claims assessor would typically first expand this raw description into plain language. For instance:

Diagnosed with acute anterior ST‑elevation myocardial infarction on 2/11/25
+Electrocardiogram showed ST‑elevation in V2‑5 consistent with anterior infarction
+Troponin elevated at 12.8 ng/mL
+Status post percutaneous coronary intervention to left anterior descending artery for 100% occlusion
+Post myocardial infarction ejection fraction reduced at 38%
55‑year-old male smoker with background hypertension

This is then followed by a structured, question-based thought process. The reasoning steps can be used to derive a sequential, agent-based workflow that “guides” the AI to mimic actuarial thought patterns. Each agent performs one specific task that answers one of the questions asked.

From the CI claims classification example, we can turn the questions into steps like this:

No.	Human Thought Process (Question)	Workflow Step
1	What medical condition does this narrative represent?	Transform the claim description into an easily comprehensible English sentence covering the diagnosis, surgeries, or disability
2	Since the claim is accepted, the condition must correspond to a covered CI event under the Happy Life CI Rider product	Obtain the list of CIs covered by the Happy Life CI Rider product and compare the claim description against the list to find the closest match
3	What CI definitions apply? Which definitions could plausibly correspond to the described condition?	Check that the closest match can, in practice, fulfil the CI definition listed in the product document for the Happy Life CI Rider
4	Is this claim something realistically possible? Is it an unusual case that warrants deeper scrutiny?	Take the final CI classification and compare it with all other available information, such as gender, age at claim, and claim type, to assess whether the claim is sensible
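The four steps in the table above can be sketched as a sequential pipeline in which each "agent" is a small function with one narrow job. This is a minimal illustration only: the lookup tables, evidence keywords, and age rule below are hypothetical placeholders standing in for model calls and real product documents, and none of them reflects an actual product definition or Gen Re's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical lookup tables standing in for model calls and product documents.
ABBREVIATIONS = {
    "Dx": "Diagnosed",
    "w": "with",
    "ant": "anterior",
    "STEMI": "ST-elevation myocardial infarction",
}
COVERED_CI = {
    "Happy Life CI Rider": {
        "Heart Attack": ["myocardial infarction"],
        "Stroke": ["stroke"],
    }
}
# Placeholder evidence keywords, NOT a real product definition.
DEFINITION_EVIDENCE = {"Heart Attack": ["ecg", "trop"]}


@dataclass
class ClaimState:
    """Carries the claim and every intermediate result so each step is auditable."""
    raw_text: str
    product: str
    age: int
    plain_text: str = ""
    matched_ci: Optional[str] = None
    definition_met: bool = False
    plausible: bool = False
    trace: list = field(default_factory=list)


def expand_abbreviations(state):
    # Step 1: turn the shorthand narrative into plain English, token by token.
    words = [ABBREVIATIONS.get(w, w) for w in state.raw_text.split()]
    state.plain_text = " ".join(words)
    return state


def match_covered_ci(state):
    # Step 2: look for the first covered CI whose keywords appear in the text.
    text = state.plain_text.lower()
    for ci_name, keywords in COVERED_CI.get(state.product, {}).items():
        if any(k in text for k in keywords):
            state.matched_ci = ci_name
            break
    return state


def check_definition(state):
    # Step 3: confirm the narrative mentions the evidence the definition needs.
    required = DEFINITION_EVIDENCE.get(state.matched_ci, [])
    text = state.plain_text.lower()
    state.definition_met = bool(required) and all(r in text for r in required)
    return state


def check_plausibility(state):
    # Step 4: illustrative sanity rule (a match exists and the age is in range).
    state.plausible = state.matched_ci is not None and 18 <= state.age <= 100
    return state


STEPS = [expand_abbreviations, match_covered_ci, check_definition, check_plausibility]


def run_pipeline(state):
    # Run the agents in order and log what each one left behind.
    for step in STEPS:
        state = step(state)
        state.trace.append(
            f"{step.__name__}: matched_ci={state.matched_ci}, "
            f"definition_met={state.definition_met}, plausible={state.plausible}"
        )
    return state


claim = ClaimState(
    raw_text="Dx w acute ant STEMI 2/11/25 +ECG ST elev V2-5 +Trop 12.8 ng/mL",
    product="Happy Life CI Rider",
    age=55,
)
result = run_pipeline(claim)
```

Because every step writes to the shared state and the trace, a reviewer can see which stage produced which conclusion, and any single function can be swapped for a model-backed agent without touching the others.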

As an extension to the claim scenario above, we can design a workflow that expands these four reasoning steps into five operational stages; the final stage emits the classified result with a confidence score, as shown in the diagram below:

AI‑assisted CI claims classification process

We now have a multi-agent workflow that provides a structured breakdown that reduces hallucinations, improves consistency, and increases transparency in a way that is both intuitive and auditable. This architecture reflects the way actuaries and claims professionals naturally approach CI assessment – breaking down a complex judgemental task into clear components.

It offers several benefits:

Reduced hallucinations through narrowly defined scopes
Improved auditability by making each step explicit
Modularity, allowing each agent to be tested or improved independently
Human-friendly traceability, showing exactly how the conclusion was formed
Compliance alignment with actuarial and regulatory expectations for explainability

This workflow transforms generative AI from a “black box predictor” into an explainable, governed, and traceable decision-support partner.

Reliability by Design – Evaluation and Governance of Generative AI

AI reliability has been a primary focus of regulators and policy-setters in the financial sector,² with much attention given to how it can be deployed in a stable, trustworthy manner. Considerations such as bias, fairness, explainability, transparency, and interpretability are all important when implementing generative AI solutions. However, reliability should not be viewed as a feature of the model alone. Rather, it is a property of the entire workflow in which the model operates. A robust generative AI solution therefore requires not only well-designed agents, but also disciplined evaluation, governance, and continuous monitoring of the entire workflow.

One important safeguard is the establishment of a human-in-the-loop review framework. While agentic workflows can reduce error and improve consistency, human oversight remains essential.

To implement an efficient human-oversight framework, predefined metrics suited to the use case can be set so that edge cases are routed to human review. Given that there is no universal standard for what proportion of generative AI outputs should be reviewed by humans, this framework can help organisations adopt thresholds that are appropriate to their risk appetite and operational needs.

For example, during the testing phase, confidence scores can be assessed on a validation dataset to determine the level below which outputs should be flagged for human review. Additional review rules may also be introduced for terminology that is uncommon, novel, or particularly sensitive to error and bias. In the CI claims classification context, this may include low-confidence classifications, unfamiliar medical terminology, contradictory or ambiguous extracted features, and claims that fall outside known data distributions.
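One way to calibrate such a cut-off on a validation dataset is sketched below. It assumes each validation output carries a confidence score and a flag for whether the classification was correct; the function names, the sample data, and the 95% target are all illustrative assumptions rather than recommended values.

```python
def pick_review_threshold(scores, correct, target_accuracy=0.95):
    """Return the lowest confidence cut-off at which the outputs that would be
    auto-accepted (score >= cut-off) reach the target accuracy on validation data.
    Returns None when no cut-off meets the target."""
    for cutoff in sorted(set(scores)):
        kept = [c for s, c in zip(scores, correct) if s >= cutoff]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return cutoff
    return None


def needs_review(score, cutoff):
    # With no usable cut-off, everything goes to a human reviewer.
    return cutoff is None or score < cutoff


# Made-up validation results: confidence scores and whether each output was right.
val_scores = [0.55, 0.60, 0.72, 0.80, 0.88, 0.91, 0.95, 0.99]
val_correct = [False, True, False, True, True, True, True, True]

cutoff = pick_review_threshold(val_scores, val_correct)
```

On this toy data the function returns 0.80, since every output scoring 0.80 or above was correct. Lowering the target accuracy moves the cut-off down and sends fewer cases to review, which is the trade-off each organisation would set against its own risk appetite.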

More deterministic metrics can also be incorporated. For instance, in a classification setting, the cosine similarity between text embeddings may be monitored, with cases falling below a specified threshold escalated for review. Cosine similarity is the cosine of the angle between two vectors; it captures direction independent of magnitude, so two texts of very different lengths can still register as similar if their feature distributions point the same way. With careful methodology, generative AI supports, rather than supplants, professional judgement.
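A minimal sketch of this check follows. It assumes the two vectors being compared are the embedding of the claim description and the embedding of the assigned CI definition; that pairing, the toy vectors, and the 0.8 threshold are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def needs_escalation(claim_vec, definition_vec, threshold=0.8):
    # Escalate to human review when the embeddings point in different directions.
    return cosine_similarity(claim_vec, definition_vec) < threshold


# Toy vectors: the second is the first scaled by two, so the direction is identical.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The scaled pair scores 1.0 despite the difference in magnitude, while the orthogonal pair scores 0.0 and would be escalated under any positive threshold.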

A second component of reliability is the creation of a rigorous validation pipeline supported by a “golden dataset” and other established metrics that do not depend on generative AI. A golden dataset in this context refers to a curated set of high-quality, often hand-labelled data containing the “ground truth” against which the generative AI system is tested. This dataset should be sufficiently large and diverse to reflect the range of practical scenarios the workflow may encounter, and it should be retained for ongoing testing as the solution evolves.

Although time-consuming to build, a golden dataset is central to a multi-layered evaluation workflow for a generative AI system. In the CI claims classification example, this workflow may include back-testing against a golden dataset curated from historical claim descriptions; comparison with representation-based natural language processing methods, such as embedding models or encoder models like BERT (Bidirectional Encoder Representations from Transformers); statistical reasonableness and outlier checks; and adversarial testing using rare edge cases or intentionally misleading descriptions.

Such a layered validation approach strengthens reliability and governance by ensuring that performance is tested from multiple angles rather than judged solely on headline accuracy.
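The back-testing layer can be sketched as follows. The example assumes the golden dataset and the workflow's outputs are both available as mappings from a claim identifier to a CI label; the identifiers and labels are made up for illustration.

```python
from collections import Counter


def backtest(golden, predictions):
    """Compare predictions with golden labels (both dicts of claim_id -> label).
    Returns overall accuracy and the recall for each golden class."""
    if not golden:
        raise ValueError("golden dataset is empty")
    total = Counter()
    hits = Counter()
    for claim_id, truth in golden.items():
        total[truth] += 1
        # A missing prediction counts as a miss.
        if predictions.get(claim_id) == truth:
            hits[truth] += 1
    accuracy = sum(hits.values()) / sum(total.values())
    recall = {label: hits[label] / n for label, n in total.items()}
    return accuracy, recall


# Illustrative data only.
golden = {"c1": "Heart Attack", "c2": "Heart Attack", "c3": "Stroke", "c4": "Cancer"}
predicted = {"c1": "Heart Attack", "c2": "Stroke", "c3": "Stroke", "c4": "Cancer"}

accuracy, recall = backtest(golden, predicted)
```

Here the headline accuracy is 0.75, but the per-class view shows that the entire shortfall sits in the Heart Attack class (recall 0.5), which is the kind of detail a single accuracy figure hides.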

A third mechanism is to use a stronger AI model to evaluate the output of another AI model. An independent generative AI agent can serve as an evaluation layer for the outputs produced by a smaller or less costly model used in production. This can be effective when stronger models with reasoning capabilities are guided through well-designed prompts to assess whether a result is plausible, complete, and aligned with expectations, while also assigning a confidence level for the evaluation. This creates a form of model-based quality assurance that can improve scalability without relying entirely on manual review. Nevertheless, human review remains necessary for cases where the evaluation agent identifies contradictions, uncertainty, or disagreement with the original output.
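The routing logic around such an evaluation layer might look like the sketch below. The evaluator is passed in as a plain function so that no particular model or vendor API is assumed; `stub_evaluator` is a keyword-based stand-in for a call to a stronger model, and the 0.8 confidence floor is an illustrative value.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    agrees: bool       # does the evaluator accept the production label?
    confidence: float  # the evaluator's own confidence in its assessment


def route(claim_text: str, production_label: str,
          evaluator: Callable[[str, str], Verdict],
          min_confidence: float = 0.8) -> str:
    """Send a case to human review when the evaluator disagrees or is unsure."""
    verdict = evaluator(claim_text, production_label)
    if not verdict.agrees or verdict.confidence < min_confidence:
        return "human_review"
    return "auto_accept"


def stub_evaluator(claim_text: str, label: str) -> Verdict:
    # Hypothetical stand-in for a stronger model: it agrees only when a keyword
    # tied to the label appears in the claim text.
    keyword = {"Heart Attack": "infarction", "Stroke": "stroke"}.get(label)
    agrees = keyword is not None and keyword in claim_text.lower()
    return Verdict(agrees=agrees, confidence=0.9 if agrees else 0.6)


text = "Diagnosed with acute anterior ST-elevation myocardial infarction"
accepted = route(text, "Heart Attack", stub_evaluator)
disputed = route(text, "Stroke", stub_evaluator)
```

Keeping the evaluator behind a function signature means the production model and the evaluation model can be changed independently, while the escalation rule itself stays fixed and easy to audit.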

Transparent documentation of decisions is a fourth element. Clear documentation of prompts, model settings, reference materials, validation methods, escalation criteria, and workflow decisions supports auditability, reproducibility, and compliance with internal governance requirements. It also aligns closely with the actuarial profession’s emphasis on transparency, accountability, and professional judgement. In high-stakes insurance applications, documentation serves as a control that enables others to understand how and why a conclusion was reached. Moreover, as the field of generative AI continues to evolve rapidly, such documentation is essential for ensuring that enhancement and maintenance remain effective when changes occur that may affect the availability or performance of deployed models.
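As one possible way to make such documentation reproducible, the settings behind each run can be captured in a single record with a stable fingerprint. The field names and example values below are hypothetical; a real record would list whatever prompts, settings, and reference materials the workflow actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Settings that should be written down alongside every batch of outputs."""
    model_name: str
    temperature: float
    prompt_text: str
    product_document: str
    review_threshold: float

    def fingerprint(self) -> str:
        # Serialise with sorted keys so identical settings always hash the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Example values are placeholders, not real model or document names.
record = RunRecord(
    model_name="example-model-v1",
    temperature=0.0,
    prompt_text="Classify the claim into one covered CI.",
    product_document="happy_life_ci_rider_definitions.pdf",
    review_threshold=0.8,
)
revised = RunRecord(
    model_name="example-model-v1",
    temperature=0.0,
    prompt_text="Classify the claim into exactly one covered CI.",
    product_document="happy_life_ci_rider_definitions.pdf",
    review_threshold=0.8,
)
```

Storing the fingerprint with each output lets an auditor confirm which prompt and settings produced a given classification; any change to the record, such as the reworded prompt in `revised`, yields a different fingerprint.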

Finally, reliability depends not only on how a workflow is built, but also on how it is maintained over time. A simple and well-governed upgrading pipeline supports continued reliability. As models, prompts, product definitions, and source documents evolve, the workflow must be easy to update, retest, and redeploy in a controlled manner. A modular architecture is especially valuable for this purpose, as individual agents or components can be improved independently without disrupting the entire system. This allows organisations to continuously refine performance while preserving governance standards and maintaining confidence in the process.
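A controlled upgrade can include a simple regression gate: before a new model, prompt, or agent is deployed, its results on the retained test data are compared with the current version's. The sketch below assumes per-class recall figures are available for both versions; the figures and the one-percentage-point tolerance are illustrative.

```python
def approve_upgrade(baseline_recall, candidate_recall, tolerance=0.01):
    """Approve only if no class drops by more than `tolerance` versus baseline.
    Returns (approved, regressions), where regressions maps each failing class
    to its (baseline, candidate) recall pair."""
    regressions = {}
    for label, base in baseline_recall.items():
        # A class missing from the candidate results counts as zero recall.
        new = candidate_recall.get(label, 0.0)
        if new < base - tolerance:
            regressions[label] = (base, new)
    return (not regressions), regressions


# Illustrative figures only.
baseline = {"Heart Attack": 0.90, "Stroke": 0.85}
candidate = {"Heart Attack": 0.93, "Stroke": 0.80}

approved, regressions = approve_upgrade(baseline, candidate)
```

In this example the candidate improves on Heart Attack but falls back on Stroke, so the gate blocks the upgrade and names the class that regressed. Checking each class separately prevents a gain in one area from masking a loss in another.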

Taken together, these practices shift generative AI from being an experimental tool to a dependable decision-support capability. By combining human oversight, golden dataset validation, model-based evaluation, transparent documentation, and controlled upgrade pathways, actuaries can design generative AI workflows that are effective, explainable, auditable, and aligned with enterprise risk management expectations.

Summary

As the AI landscape continues to evolve, the range of tasks the technology can support to improve productivity has expanded. Limitations that once made organisations cautious about adopting AI are increasingly being overcome as the technology matures.

Generative AI can deliver more value in insurance when it is engineered to reason the way actuaries do. This article has set out two connected arguments. First, decomposing a complex judgemental task into a sequence of specialised agents mirrors how actuaries break down and evaluate a problem, reducing hallucination and improving auditability. Second, reliability is a property of the entire workflow rather than the model alone, earned through human oversight, golden-dataset validation, model-based evaluation, transparent documentation, and controlled upgrades. Although illustrated through CI claims classification, these principles give actuaries a repeatable framework for building robust, explainable, and scalable AI‑enabled solutions, positioning them not merely as adopters of these tools but as the people who design and govern them.

Speak to your Gen Re associate today to learn how we can help implement AI‑enabled solutions across the insurance value chain.