

Notes from the Applied Machine Learning Conference 2026

19 Apr 2026

Outside the UVA School of Data Science -- day 2 venue for AMLC 2026
MacKenzye Leroy (S&P Global) -- "Organizational Intelligence with Property Graphs, Agentic AI, and MCP"

The Applied Machine Learning Conference (AMLC) ran April 17–18 in Charlottesville, Virginia. Day one was keynotes and talks at Violet Crown Cinema downtown; day two was four hands-on 90-minute workshops at UVA’s School of Data Science. I’m covering the workshops first, since I was particularly keen on capturing the practical takeaways before getting into the talks.


Day 2: Workshops

Workshop room at UVA School of Data Science
Room 305, UVA School of Data Science -- tiered seating, plenty of power, and floor-to-ceiling windows

Workshop: Docling – Document Parsing into Property Graphs (IBM)

The Docling workshop covered building a document-parsing pipeline: financial documents were parsed into a NetworkX property graph using Docling, IBM's open-source document-processing library. Docling uses AI models internally to parse documents into a structured format defined by a schema, making the resulting data far more queryable than raw text.

The workflow: define a (Pydantic) schema describing the entities and relationships in your document domain → feed documents to Docling → it extracts structured data conforming to the schema → load into a graph for querying. For financial documents, for instance, nodes might be companies, filings, executives, and financial metrics, with edges representing relationships like “filed by”, “reported”, “employed”.
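To make the workflow concrete, here is a minimal stdlib-only sketch of the schema-to-graph step. In the workshop the schema was a Pydantic model and the graph a NetworkX DiGraph; dataclasses and a dict-plus-edge-list stand in for both here, and all entity and field names are invented.

```python
from dataclasses import dataclass

# Hypothetical schema for a financial-filings domain. In the real
# pipeline, Docling extracts records that validate against the schema;
# here we construct the records by hand.

@dataclass
class Company:
    name: str
    ticker: str

@dataclass
class Filing:
    filing_id: str
    filed_by: str  # ticker of the filing company

def load_graph(companies, filings):
    """Load schema-conforming records into a property-graph shape."""
    nodes, edges = {}, []
    for c in companies:
        nodes[c.ticker] = {"kind": "company", "name": c.name}
    for f in filings:
        nodes[f.filing_id] = {"kind": "filing"}
        edges.append((f.filing_id, f.filed_by, "filed_by"))
    return nodes, edges

nodes, edges = load_graph(
    [Company("Acme Corp", "ACME")],
    [Filing("10K-2025", "ACME")],
)
print(edges[0])  # ('10K-2025', 'ACME', 'filed_by')
```

Once the data is in this shape, "who filed what" becomes a direct edge lookup rather than a text search.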

The schema problem. The catch – and I raised this directly – is that a schema must be defined upfront. For many real-world applications, the schema is genuinely unknown ahead of time. The presenter acknowledged this tension: for open-ended or exploratory retrieval, traditional RAG is probably the better fit. The proposed middle ground: use AI to draft and iteratively critique the schema itself before ingestion.


Workshop: Evidence-Based Agentic Engineering – Amir Feizpour (ai.science)

Amir Feizpour runs ai.science and normally teaches a multi-day bootcamp on AI engineering. For AMLC he distilled it into a workshop built around a set of copy-paste prompts that walk you through building the components of an agentic pipeline for knowledge management and spec-driven development. All the prompts from the workshop are available at sherpa-b.ai.science/docs/guide/onboarding.

The prompts covered two main areas:

Spec-driven development with AI. Rather than going straight from idea to code, the workflow uses LLMs to help write, critique, and refine specs before any implementation begins. The prompts guide you through generating a spec, having the AI critique it, iterating, and only then generating code against the agreed spec. This keeps the AI grounded in an explicit contract instead of letting it improvise requirements mid-implementation.

Extracting lessons learned from AI CLI transcripts. After a coding session with an AI agent, the transcripts contain a lot of implicit knowledge – what worked, what didn’t, what the agent had to backtrack on. Amir’s prompts extract these lessons and feed them into a knowledge ops database, building up an institutional memory over time. This is closely related to what the CMM project does – in fact, Amir mentioned he was looking into integrating with CMM.

Crucially, the workshop wasn’t just about prompts as static text – it was about building AI agents that assist at each distinct stage of the development workflow:

  • Ideation agents – evaluate ideas before any spec is written
  • Spec agents – draft and critique specifications iteratively
  • Knowledge ops agents – store lessons learned and retrieve past experience for future sessions

The insight is that each stage has different needs, so a single general-purpose agent is the wrong abstraction. Composing purpose-built agents per workflow stage, with shared knowledge storage underneath, is more robust.
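A toy sketch of that composition idea: stage-specific agents, each with its own narrow instruction, sharing one knowledge store underneath. All class and method names here are invented for illustration, not taken from the sherpa prompts.

```python
# Hypothetical composition of purpose-built agents over a shared
# knowledge store. Each stage agent carries only its own narrow prompt.

class KnowledgeStore:
    def __init__(self):
        self.lessons = []
    def add(self, lesson):
        self.lessons.append(lesson)
    def recall(self, keyword):
        return [l for l in self.lessons if keyword in l]

class StageAgent:
    """Base class: each workflow stage gets its own system prompt."""
    system_prompt = ""
    def __init__(self, store):
        self.store = store

class IdeationAgent(StageAgent):
    system_prompt = "Evaluate this idea before any spec is written."

class SpecAgent(StageAgent):
    system_prompt = "Draft and critique a specification iteratively."

store = KnowledgeStore()
store.add("migration scripts: always backtrack on schema errors")
pipeline = [IdeationAgent(store), SpecAgent(store)]
print([type(a).__name__ for a in pipeline])
print(store.recall("schema"))
```

The point of the shape: swapping or tuning one stage's prompt never touches the others, while every stage can read from (and write to) the same accumulated lessons.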

The framing he used for the overall discipline: “MLOps” – like DevOps, but with knowledge and prompts sitting at the center of the engineering lifecycle instead of source code. Just as DevOps brought versioning, testing, CI/CD, and monitoring to software deployment, this applies the same rigor to prompt and knowledge management.

Most teams today treat prompts like config files at best – no versioning, no regression tests, no rollback. The MLOps framing argues these should be first-class artifacts.
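What "prompts as first-class artifacts" might look like in its smallest form: a versioned prompt with regression invariants that CI can check. The prompt text, version tag, and checks below are all invented for illustration.

```python
# Minimal sketch of prompt regression testing: pin each prompt to a
# version and assert invariants that must survive future edits.

PROMPTS = {
    "summarize_incident@v2": (
        "You are an incident analyst. Summarize the log below in at "
        "most three bullet points. Do not speculate beyond the log."
    ),
}

def check_prompt(name):
    text = PROMPTS[name]
    # Regression invariants: edits that drop these should fail CI.
    assert "Do not speculate" in text, "guardrail clause removed"
    assert len(text) < 500, "prompt grew past token budget"
    return True

print(check_prompt("summarize_incident@v2"))  # True
```

Real prompt regression suites would also run the prompt against pinned inputs and score the outputs, but even string-level invariants like these catch silent guardrail deletions.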


Workshop: Building an Incident Investigation Agent (Python)

This was a practical workshop building a Python agent to investigate and fix issues in an e-commerce website. The agent's task: read application logs, identify the root cause of an incident, and produce a structured report. So how is this different from just using the claude CLI? The agent built here is narrow: it has a fixed set of "functions" it can call to do everything it needs, so it is purpose-built and hence efficient for the required task.

A few things I took away:

Completions APIs: The agentic loop essentially calls a "completions API" endpoint over stateless HTTP requests, passing it the entire messages array. Tool call requests come back as responses from the LLM API and are appended to the array, along with the tool outputs they produce.

APIs are stateless – the entire messages array travels with every request. This is the fundamental constraint of agentic loops. There is no server-side session: every turn requires sending the full conversation history from the beginning – user messages, assistant messages, tool calls, tool results, everything. The practical implication: careful context window management and cost control are required. Some LLM APIs even reject a request if you don't pass back the reasoning tokens generated so far.
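A stripped-down sketch of that statelessness, with a stub standing in for the completions endpoint (no real API is called; `fake_llm` is invented):

```python
# The stateless agentic loop: every call sends the entire messages
# array. fake_llm stands in for an HTTP POST to a completions API.

def fake_llm(messages):
    # A real endpoint sees only what is in `messages` -- nothing else.
    return {"role": "assistant",
            "content": f"reply after {len(messages)} messages"}

messages = [{"role": "user", "content": "Investigate the 500 errors."}]
for _ in range(3):
    reply = fake_llm(messages)  # full history travels on every turn
    messages.append(reply)
    messages.append({"role": "user", "content": "Continue."})

print(len(messages))  # 7 -- history only grows; nothing lives server-side
```

Since the array only ever grows, token cost per turn grows with it, which is why context management (summarization, truncation, caching) matters so much in long runs.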

How AI API tools are wired up in Python. Tools were defined as plain Python functions with typed inputs and outputs, then passed to the OpenAI (or Anthropic) API as a JSON schema via the tools parameter. The API returns a tool call request when it wants to invoke one; your code executes the function and appends the result to the messages array; the loop continues.
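A minimal sketch of that wiring, following the common OpenAI-style `tools` parameter shape. The `read_log` tool and its canned output are invented; only the JSON-schema structure reflects the real API convention.

```python
import json

# A typed Python function exposed to the API as a JSON schema, plus
# the dispatch step your loop runs when the model requests the tool.

def read_log(path: str) -> str:
    """Stand-in for reading an application log."""
    return f"ERROR in {path}: connection pool exhausted"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_log",
        "description": "Read an application log file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# When the model returns a tool call, execute it and append the result
# to the messages array so the next turn can see it.
tool_call = {"name": "read_log",
             "arguments": json.dumps({"path": "app.log"})}
args = json.loads(tool_call["arguments"])
result = {"role": "tool", "content": read_log(**args)}
print(result["content"])
```

The loop then re-sends the whole array (tool call plus tool result included) and lets the model decide the next step.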


Workshop: MCP Toolbox for Databases (Google)

Google’s MCP Toolbox for Databases connects various backends – Cloud SQL, Spanner, AlloyDB, and others – to a single MCP server. The idea: instead of wiring up separate database connections and query APIs for each data source, you point MCP Toolbox at them and let AI agents query through a unified interface.

The hands-on part was straightforward. But a more interesting point came up around prompt caching and tool calls.

When an agent calls a tool, you get back a result. On the next turn, you need to include both the tool call (what the LLM requested) and the tool result (what came back) in the messages array. The presenter said he preferred to send only the tool outputs, not the inputs.

I pushed back: the tool inputs matter for caching. When you include the full tool call + result pair, the completion API can use the assistant’s tool-call message as a cache key. This means the prefix up to and including that tool call can be cached, significantly reducing token costs in long agentic runs. If you strip the tool inputs and send only outputs, you break the cache prefix – every subsequent turn reprocesses from scratch. This is easy to get wrong and the cost implications in production compound quickly.
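The caching argument is easy to demonstrate mechanically: prefix caching keys on the longest shared leading run of messages, so dropping the assistant's tool-call message shortens that run. The message contents below are invented; the prefix logic is the point.

```python
# Why stripping tool inputs breaks prefix caching: the cacheable
# prefix is the longest identical leading run of messages.

def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

turn1 = [
    {"role": "user", "content": "Which sites are near incident 42?"},
    {"role": "assistant",
     "tool_call": {"name": "query_sites", "args": {"incident": 42}}},
    {"role": "tool", "content": "site-7"},
]

# Keep the full tool call + result pair: turn1 is an exact prefix.
turn2_full = turn1 + [{"role": "user", "content": "Who owns site-7?"}]
print(shared_prefix_len(turn1, turn2_full))      # 3 -- fully cacheable

# Strip the tool inputs: the histories diverge at message index 1.
turn2_stripped = [turn1[0], {"role": "tool", "content": "site-7"},
                  {"role": "user", "content": "Who owns site-7?"}]
print(shared_prefix_len(turn1, turn2_stripped))  # 1 -- cache broken
```

In a long agentic run, every tool exchange extends the cached prefix, so the savings (and the cost of breaking it) compound turn over turn.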

A few other things I took away:

MCP tools: In the original MCP integration style, which is still the most common, MCP tools are injected into the regular tools parameter (as described in the previous workshop), with names like "mcp__servername__toolname". This is called "client bridging": the LLM is unaware that MCP sits behind these tool calls.

Text-to-SQL: MCP Toolbox essentially converts tool calls into SQL queries against whichever backing database it routes to. The flow: LLM → MCP tool call with param1, param2 → MCP Toolbox → a pre-determined SQL query with param1 and param2 interpolated in. The appeal is that an AI agent can interact with a wide range of databases through uniform tool calls, without ever dealing with query languages.
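The pattern is easy to sketch with sqlite3: a fixed, parameterized SQL statement that a uniform tool call fills in. The table, columns, and tool name below are invented; MCP Toolbox targets Cloud SQL, Spanner, AlloyDB, and the like rather than SQLite.

```python
import sqlite3

# Pre-determined SQL with agent-supplied parameters: the agent never
# writes SQL, it only fills in the blanks of a vetted query.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "shipped"), (2, "pending"), (3, "pending")])

# The fixed query the tool maps to; `?` placeholders are bound safely.
QUERY = "SELECT id FROM orders WHERE status = ?"

def tool_list_orders(status: str):
    """What an MCP tool call like list_orders(status=...) might route to."""
    return [row[0] for row in conn.execute(QUERY, (status,))]

print(tool_list_orders("pending"))  # [2, 3]
```

Binding parameters rather than string-interpolating them also keeps the agent from injecting arbitrary SQL, which matters when the "user" writing the parameters is an LLM.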


Day 1: Keynotes and Talks

Keynote: Graphics, AI, and the Quest for New Experiences at NVIDIA

David Luebke co-founded NVIDIA Research in 2006 after eight years on the faculty at UVA. He’s now VP of Research, running a group of ~35 researchers focused on “New Experiences” – generative neural networks, graphics, VR. His keynote was a great talk on the nature of industrial research.

A few things that stuck:

Fail fast, and don't be married to ideas. Don't get emotionally attached to a research direction. The willingness to abandon a failing idea quickly is a competitive advantage. Successful researchers fall in love with their problem, not their approach.

Industrial Research has no value without impact. In an academic setting you can justify work by its intellectual merit alone. In industry, that’s not enough. Research must eventually connect to something real – a product, a capability, a competitive position. That’s not a constraint to resent; it’s what makes industrial research exciting.

Publish carefully. Luebke gave ray tracing as a concrete example. NVIDIA published extensively in the early days of real-time ray tracing. Later, once they were building ray tracing hardware (what became RTX), publishing slowed and eventually stopped as the focus shifted to products and competitive advantage. The lesson: publication serves scrutiny and scientific progress, but timing matters.


Agents Everywhere – Challenges at Scale

This talk covered the messy reality of deploying agentic AI systems in production. The headline observation: agents fail in ways that are qualitatively different from traditional software bugs.

Authentication and security. Agents need to act on behalf of users across multiple systems, which creates a tangle of credentials, OAuth flows, and permission scopes. The impact of a misconfigured agent is large. Agents are also usually short-lived and lack the stable identities humans have, which makes all of this harder.

Governance. Several case studies of companies that took significant financial or operational losses because an agent made an unchecked decision.

The “workaround” failure mode. Agents are remarkably good at finding paths to accomplish their immediate goal while missing the bigger picture. An agent tasked with fixing a test failure might comment out the test. This is a known problem with RL-style optimization, but it manifests in LLM agents too.

Agent registries. One proposed mitigation: a centralized registry of vetted, high-quality agents that teams can draw from, similar to package registries for software. A smaller set of well-tested agents causes less harm than a proliferation of ad-hoc ones.

A note on ROI. Research cited in the talk suggested that spending disproportionately on raw token budget – relative to human training and upskilling – has diminishing returns. Getting humans to understand what agents can and can’t do reliably is at least as important as giving agents more compute.

The last slide grouped the key challenges into four areas: identity, governance, security, and people – a reasonable set to note for anyone thinking about agentic deployment seriously.


Breast Cancer Screening for Developing Countries: Efficient Vision Models via Distillation

This talk was about making powerful medical AI models accessible in resource-constrained environments – specifically, deploying breast cancer screening tools in countries that can’t run large foundation models locally.

The core technique was knowledge distillation: training a small, efficient student model to mimic the outputs of a large teacher model. Here they used SAM2 (Segment Anything Model 2) as the teacher. The finding: the distilled student matched SAM2’s performance on their benchmark, showing that distillation can be highly effective even for complex segmentation tasks.

I also picked up a few techniques I hadn’t spent much time with:

  • LoRA (Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning) – fine-tuning large models without updating all parameters, dramatically reducing compute requirements.
  • Distillation – training a smaller student model to reproduce a larger teacher model's outputs, transferring knowledge into far fewer parameters.
  • CNN + Vision Transformer hybrids. Pure transformers capture global context well but are expensive. CNNs capture local spatial features efficiently. Combining them gives you benefits of both – an active research direction for medical imaging where local texture and global structure both matter.
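The distillation objective itself is compact enough to sketch: the student is trained to match the teacher's temperature-softened output distribution, typically via KL divergence. The logits below are made up; the actual work distilled SAM2 on segmentation outputs, not a three-class toy.

```python
import math

# Toy distillation loss: KL divergence between teacher and student
# softmax distributions, both softened by a temperature T.

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [4.0, 1.0, 0.5]   # frozen large model's outputs
student_logits = [3.5, 1.2, 0.6]   # small model being trained

T = 2.0  # temperature > 1 exposes the teacher's "dark knowledge"
loss = kl_divergence(softmax(teacher_logits, T),
                     softmax(student_logits, T))
print(round(loss, 4))  # small, since the student already tracks the teacher
```

Training minimizes this loss (often mixed with the ordinary hard-label loss), pushing the student's distribution toward the teacher's without copying any parameters.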

The framing of the talk was compelling: distillation makes powerful models accessible. A model that requires a large GPU cluster may not be a model that helps a hospital in a rural area.


Cognitive Memory for AI Coding Agents – CMM

This talk, and a subsequent hallway conversation with presenters Sazan Khalid and Amit Arora, was one of the most directly relevant sessions to my own work.

The project is CMM (Cognitive Memory Manager). The core problem: AI coding assistants are powerful but amnesiac. Every new session starts from scratch. The agent re-discovers the same codebase quirks, falls into the same debugging traps, and repeats work another team member’s agent already solved last week.

CMM builds a persistent reasoning memory by ingesting session transcripts from coding agents (Claude Code, Cursor, Windsurf), extracting reasoning patterns, and consolidating them into durable knowledge: architectural insights, known pitfalls, proven debugging strategies.

The extraction pipeline models each session as a directed graph of reasoning nodes:

  • HYPOTHESIS → agent forms a theory
  • INVESTIGATION → agent examines evidence
  • DISCOVERY → unexpected finding
  • PIVOT → change of approach
  • DEAD_END → failed approach
  • SOLUTION → working resolution

These get clustered and promoted into a shared team knowledge base, with human review required before anything becomes team-visible – a deliberate design choice to keep knowledge specific to a single project from being pushed to the entire team.
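To make the session-graph model concrete, here is a toy version of one session as a directed graph of typed reasoning nodes. The node contents, the `SessionGraph` class, and the example chain are all invented; CMM's actual representation is surely richer.

```python
# A session as a directed graph of typed reasoning nodes, using the
# node types from the talk. Everything else here is illustrative.

NODE_TYPES = {"HYPOTHESIS", "INVESTIGATION", "DISCOVERY",
              "PIVOT", "DEAD_END", "SOLUTION"}

class SessionGraph:
    def __init__(self):
        self.nodes, self.edges = {}, []
    def add(self, node_id, node_type, text):
        assert node_type in NODE_TYPES, f"unknown type: {node_type}"
        self.nodes[node_id] = {"type": node_type, "text": text}
    def link(self, src, dst):
        self.edges.append((src, dst))
    def dead_ends(self):
        return [n for n, d in self.nodes.items() if d["type"] == "DEAD_END"]

g = SessionGraph()
g.add("h1", "HYPOTHESIS", "cache invalidation bug")
g.add("i1", "INVESTIGATION", "read cache layer logs")
g.add("d1", "DEAD_END", "logs were rotated; no evidence")
g.add("p1", "PIVOT", "reproduce locally instead")
g.add("s1", "SOLUTION", "stale TTL config")
for a, b in [("h1", "i1"), ("i1", "d1"), ("d1", "p1"), ("p1", "s1")]:
    g.link(a, b)
print(g.dead_ends())  # ['d1']
```

Queries like "which dead ends preceded this solution" become path lookups, which is exactly the kind of structure a flat transcript can't support.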

My hallway conversation with Amit:

On overfitting during extraction – what if the memory captures something true in one environment but not another? Amit’s answer: temporal decay. If a memory item doesn’t get reinforced by future sessions, its weight decreases and it eventually gets retired. Situational knowledge fades; durable patterns persist.
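A toy model of that decay dynamic: each session multiplies a memory's weight down, a reinforcement adds it back, and falling below a threshold retires the item. All constants here are invented, not CMM's actual parameters.

```python
# Temporal decay sketch: weight decays every session unless the
# memory is reinforced; below a threshold it is retired.

DECAY, BOOST, RETIRE_BELOW = 0.5, 1.0, 0.3

def step(weight, reinforced):
    weight = weight * DECAY + (BOOST if reinforced else 0.0)
    return min(weight, 5.0)  # cap so reinforced items can't grow forever

w = 1.0
history = []
for reinforced in [False, False, True, False, False]:
    w = step(w, reinforced)
    history.append(round(w, 3))

print(history)                                      # one reinforcement mid-run
print("retired" if w < RETIRE_BELOW else "alive")   # retired
```

The single `True` mid-run shows the mechanism Amit described: situational knowledge fades monotonically, while anything future sessions keep touching stays above the retirement line.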

On team-level vs. individual-level promotion. At the individual level, automatic promotion is fine. At the team level, human review matters – it’s not clear that one developer’s insight generalizes to everyone. The exception: patterns that surface repeatedly across multiple developers are strong candidates for team-level promotion, because repetition is itself a signal of generalizability.

Amit showed productivity and satisfaction metrics from AWS deployments that were encouraging. This is not purely theoretical work.


Property Graphs for AI: Building Smarter Knowledge Retrieval

This talk by MacKenzye Leroy, Lead Data Scientist at S&P Global Market Intelligence, covered using property graphs as an alternative to vector RAG for structured knowledge retrieval. The specific use case was representing hierarchies of entities in a company or organization.

Instead of chunking documents and embedding them into a vector store, you parse documents into a graph where nodes are entities (people, organizations, locations, products) and edges are typed relationships between them. Queries traverse the graph rather than doing nearest-neighbor search over embeddings.

Why this works well for the right use cases:

  • Relational queries are natural. “Find the site owner nearest to this incident location” is trivial as a graph traversal and painful as a similarity search.
  • Organization structures, geographic hierarchies, and dependency trees are naturally graph-shaped and get mangled when flattened into vectors.
  • Fewer tokens per query – you retrieve the exact subgraph needed, not potentially irrelevant passages.

The talk used the Graph MCP project to expose a property graph through a single MCP server, consolidating multiple knowledge sources. The pitch: instead of one MCP server per knowledge base, load everything into a graph and serve it from one endpoint – fewer tool calls, fewer tokens.

My takeaways

  • Property graphs are rigid where semantic search is flexible, so there's a tradeoff: we give up flexibility for more efficient, more precise lookups.
  • The two approaches can be combined: for example, find the nodes closest to a query vector, then run graph traversals starting from those seed nodes.
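The hybrid pattern from the second takeaway fits in a few lines: score nodes against a query vector, take the closest as a seed, then follow graph edges from there. The vectors, node names, and toy graph below are all invented.

```python
import math

# Hybrid retrieval sketch: semantic search picks the seed node,
# graph traversal does the precise relational hop.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

node_vecs = {"incident-9": [1.0, 0.1],   # made-up embeddings
             "site-7": [0.2, 1.0],
             "hq": [0.9, 0.2]}
edges = {"incident-9": ["site-7"],        # made-up typed relationships
         "site-7": ["owner-ann"],
         "hq": []}

query = [1.0, 0.0]  # embedding of e.g. "the incident near the plant"
seed = max(node_vecs, key=lambda n: cosine(node_vecs[n], query))
neighbors = edges.get(seed, [])
print(seed, "->", neighbors)  # incident-9 -> ['site-7']
```

One more hop from `site-7` would reach the owner – the "find the site owner nearest to this incident" query that is painful as pure similarity search.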

Closing Thoughts

AMLC was well worth attending. My main takeaway was learning about property graphs – several talks and workshops explored this (CMM, Neo4J MCP, Docling). I also learnt about embeddings, how LLM APIs work, how agentic loops work, and tool calling. Finally, David’s keynote was quite insightful about practical industrial research.


Notes from AMLC 2026, Charlottesville, Virginia, April 17–18.


