How do you make sense of security, governance, and risk in an age of black-box AI? This week, Raj is joined by Preetam Joshi, founder of AIMon Labs and a machine learning veteran with experience at DRDO, Yahoo, Netflix, and Thumbtack. Together, they break down the technical evolution behind large language models (LLMs), explore the real challenges of explainability, and discuss why GRC teams must rethink risk in the age of autonomous reasoning systems.

Preetam brings a rare mix of hands-on ML expertise and practical experience deploying LLMs in enterprise environments. If you’ve been wondering how transformers work, what explainability really means, or why AI governance is still a mess — this episode is for you.

5 Key Takeaways:

- From DRDO to Netflix to AIMon Labs — Preetam's career journey shows the intersection of machine learning, security, and entrepreneurship.
- How Transformers Work — A simple breakdown of encoder/decoder architecture, embeddings, and attention mechanisms.
- Explainability in AI — What it meant in traditional ML… and why it's nearly impossible with today's LLMs.
- Rule-Based Logic Isn't Dead — In high-stakes environments, deterministic systems still matter.
- Bridging AI & GRC — Practical steps for model security, auditing, and compliance in non-deterministic systems.

📌 Take Action

Security & GRC Decoded is brought to you by ComplianceCow — the platform for proactive, automated compliance.

🎧 Subscribe, rate, and share if this episode sparked a thought.

⏱ Timestamps (approx.)

00:00 – Intro
01:11 – Welcome Preetam to the show
03:20 – What has been your favorite experience working in AI so far?
07:08 – What is transformer architecture and how does it work?
10:23 – How do LLMs solve problems like math or reasoning?
12:38 – Where do agents fit in the LLM ecosystem?
16:07 – How does reinforcement learning apply to AI models?
21:33 – What does explainability mean in ML?
24:55 – Can you explain the limitations of SHAP and parameter-level reasoning?
27:33 – What does GRC look like in the LLM age?
30:58 – What does AIMon Labs actually do?
35:00 – Why is reliability a challenge with LLMs?
39:15 – Where does GRC intersect with AI deployment and compliance?
41:30 – What is fine-tuning and when is it useful?
44:43 – Is Retrieval Augmented Generation (RAG) still relevant with longer context windows?
47:29 – How do we guard against LLM misuse and toxic output?
49:43 – How can LLMs overexpose sensitive company data?
53:28 – Advice for those starting a career in AI or ML
55:34 – What are your favorite models right now?

[00:01] Raj: Hey, hey, welcome to Security and GRC Decoded. I'm your favorite host, Raj Krishnamurthy. And today we have an awesome guest, Preetam Joshi. Many times we talk about large language models and generative AI, and many times we have a surface-level conversation. Today is not one of those conversations. Preetam has lived and breathed machine learning and AI, and we are going to dive deeper, right? He sits at the perfect intersection of generative AI, machine learning, security, and GRC, and we would love to uncover all of that with Preetam. Preetam, welcome to the Security and GRC Decoded show. Awesome.

[00:41] Preetam: Thanks Raj. Thanks for having me here. ah You know, I always love to be on a podcast with experts like yourself. So looking forward to our conversation today.

[00:51] Raj: Preetam, you have an awesome background, right? You founded this company, AIMon Labs, and I think we want to talk about that. You've worked with companies like Yahoo. In fact, you started way back in the early 2000s with DRDO in India, doing machine learning, right? Some very interesting stuff. And then you went on to work at companies like Yahoo, you built the data engineering and machine learning practice at Thumbtack, you worked at Netflix and did some awesome projects there, and now you're at AIMon. That's a fantastic journey. I've been honored to be part of it. It's been a learning experience all throughout. I could talk a little bit more about the journey if you'd like, Raj. Yeah, so interestingly, I started off around 2005 to 2007, when machine learning wasn't even a thing. It was just a few people who were enthusiastic about solving math problems getting together and working on projects. And at the time, the whole MLOps thing wasn't a thing. Hadoop was just coming on, if you remember those days, and the whole big data thing was just coming on. So back then, it used to be called data mining. That was the popular terminology. And then we moved on to having this whole MLOps, big data explosion with Hadoop and Spark and all of that. And that's where I got to work with a lot of those systems at Yahoo. I actually started off in security, working on machine learning applied to security problems at Georgia Tech, and then moved on to building these big data systems at Yahoo for recommendations and such. Which of those, and maybe it's very hard to pick, which of those is your favorite experience? Yeah, you know, surprisingly, my favorite experience is the one that you pointed out: a lab called DRDO, which was a small lab. The reason why I really like that... Yeah. Yeah, it's a small government lab, right? It was called the Centre for Artificial Intelligence and Robotics. It was a small unit of a government lab back in India, and their focus was essentially working on innovations related to machine learning and doing a bunch of different things. I specifically focused on text-based technologies. So, how do you do parts-of-speech tagging? Back then, it was a big thing. You didn't have modern LLMs where you can ask it a prompt and it gives you the parts of speech. Back then, you actually had to do a lot of work to get high-quality parts of speech from a piece of text. Just for completeness, parts of speech are things like nouns and verbs: you would extract whether this text contained a company, whether it contained an animal, things like that. So that's what parts of speech were. And back then, we were actually innovating with an algorithm called hidden Markov models. So it was a lot of fun. It was way ahead of its time, and learning through that experience was a lot of fun. Let me ask you this. You have been a practitioner over the years, and the idea of machine learning, which in my opinion is now much more democratized in what we are calling language models, whether large or small, has fundamentally changed. What made that change, right? And maybe can you double-click on that?

[04:48] Preetam: Absolutely. It's a fantastic question, and we could probably talk about this all day, but one of the things that I personally think changed the game was better compute. Neural networks weren't a new thing; neural networks have been around since the 80s, maybe even earlier, right? The concept of a neural network has been around since then, but the only reason we couldn't harness their power was a lack of compute. It was impossible to run them on a CPU machine. I mean, try running one of these big large language models on a CPU. It'll not work, right? Or it'll be horrendously slow. So I think the leap in compute was a major thing. That's what triggered making these language models, these neural networks, bigger and bigger, right? So you had the RNNs that came along; Karpathy has a fantastic blog on that. By the way, if you have not followed Karpathy, he's a fantastic person to follow. And then moved on to... Andrej Karpathy, yeah, exactly. He wrote a fantastic blog about RNNs and LSTMs back in 2013, 2014. And then Google DeepMind sort of innovated on some of their classic BERT and BART models, and that's where things started coming up. And "Attention Is All You Need"; I think everybody's familiar with that paper. The core piece was attention, right? And that key insight led to the modern-day LLMs, if you have to put it that way. If you don't mind, let's take a step back. "Attention Is All You Need" is sort of a seminal paper. It came out, what, 10 years ago, and talks about the transformer architecture. Can you simplify your view of the transformer architecture and why you think it is seminal? Yeah, I think it came around 2016, 2017, from a group of researchers, and their key insight was this concept of attention. And we'll talk about the transformer architecture itself. The simplest way to think about a transformer architecture: if you are aware of how neural networks work, you have all of these various neurons; it's sort of modeled after the brain, if I have to simplify it. You have all these neurons interacting with each other, there are connections between these neurons, and whatnot. So now, if you think about the evolution of a neural network into a transformer architecture, there are two pieces to it. There's the encoder layer, and all that means is essentially you take, say, a gigantic piece of text and map it into some array of floating-point numbers. That's what we call vectors: an array of floating-point numbers, which is smaller in terms of dimensions. So you would go from a vocabulary size of, say, 10,000 into an array of dimension 32. That was the encoder. And then you have the decoder, which is the second part of the transformer architecture, which takes that small embedding, as it's popularly known, and translates it into generated text. So that's what the transformer architecture is, at a very, very high, simplified level. The key piece to it is attention. And what attention does is essentially just say, for all of the different input tokens that you have, tokens as in the words, which ones are the most important? It's an algorithm that allows you to figure out which are the more important words. Yeah, and we can get into the math of it, but it's probably better on a whiteboard to talk about the math of that.
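To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. It illustrates the mechanism described above rather than any particular production model; the token embeddings, dimensions, and projection matrices are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_tokens, d) arrays of query, key, and value vectors.
    d = Q.shape[-1]
    # Scores say how strongly each token should "look at" every other token.
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted mix of value vectors

# Toy example: 4 tokens, each embedded as an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # stand-in token embeddings
# In a real transformer, Q, K, V come from learned linear projections of X.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(attn.round(2))                     # which tokens attend to which
```

In a real model, these projections, stacked across multiple heads and layers together with the feed-forward blocks, are what the billions of trained parameters actually parameterize.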

[08:38] Raj: Maybe explain to us, Preetam. I think the transformer architecture essentially became very interesting as it started predicting the next word, right? The probability distribution of the next set of tokens. Maybe I'm oversimplifying it, but the idea is that I type in a set of words, that's the context, and the next word comes in and gets fed back, right? So the next word becomes a probability distribution over the previous combination of words, and so on and so forth. How has that resulted in the emergent behaviors we are seeing with large language models right now? Because earlier language models were not this sophisticated, so how have we been able to accomplish so much so quickly? Fantastic question.
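What Raj describes, sampling the next word from a probability distribution over the previous words and feeding it back in, is the autoregressive loop sketched below. This is a toy illustration: the hand-crafted `next_token_probs` function stands in for a real model, and no actual LLM API is used.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    # Stand-in for the LLM: returns p(next token | previous tokens).
    # A real model computes this with billions of parameters; here we
    # just hand-craft a tiny distribution that prefers plausible words.
    if context and context[-1] == "the":
        return {"cat": 0.6, "mat": 0.4}
    if context and context[-1] == "cat":
        return {"sat": 0.9, ".": 0.1}
    return {w: 1.0 / len(VOCAB) for w in VOCAB}

def generate(prompt, max_new_tokens=5):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Sample the next token, then feed it back in as new context.
        next_tok = random.choices(list(probs), weights=probs.values())[0]
        tokens.append(next_tok)
        if next_tok == ".":
            break
    return " ".join(tokens)

print(generate(["the"]))
```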

[09:25] Preetam: So you're absolutely right, right? I mean, in the end, all the language models are doing is predicting the next word given a sequence of n previous words. Now, n can depend on the model and architecture, and we won't get into all of that, but it's essentially that: you want to predict the next word given a sequence of n previous words. Now, how does it actually go ahead and solve Olympiad-level math problems, or play chess well enough to beat the best player in the world, and all of that? That's a fascinating question. And this was one of the reasons I had originally looked into Andrej Karpathy's work a lot, because he did a ton of original work. I would say his work is even more important than the "Attention Is All You Need" work. So how does that actually work? You must have heard of this term called weights, right? The number of parameters; everybody talks about that. OpenAI has a three-trillion-parameter, four-trillion-parameter model, right? Now, all of these parameters are essentially the number of items that exist inside a neural network. You have multiple layers of neurons, right, which are interacting with each other. What happens is, during the training stage, when you are taking a piece of text and training this model to predict the next text, these intermediate layers are learning things, learning complex relationships between these words. And those complex relationships can be things like math, or how to sum two numbers. Because think about it: if you are trying to predict "two plus two equals four," there is no way a model can predict 4 after the "2 plus 2" and the equals sign without actually knowing math. So you are actually teaching math to the algorithm. And it's quite amazing that the training process can actually make it work like that. Hopefully that makes sense. I think that leads to an interesting question, right? And maybe in that context, where do you think agents fit in? Because I think this is where agents are going to start doing a much better job, right? Because some of these deterministic things can come through the agents. What is your take on it? Where do agents fit into the large language model ecosystem? Yeah, agents. So just for completeness, what an agent means: essentially, an agent has a tool, which you can think of like an API call, something as simple as sending an email, right? It has three main components. One is the tool, which is this API call. The second is the language model itself, or some sort of machine learning model inside it. The third is memory, keeping track of things so that it can keep track of state, right? So an agent is sort of a higher-level abstraction on top of a large language model. Language models by themselves can't keep track of state and all those things. That's why you have this paradigm of an agent. Now, the nice thing about an agent is you can use this amazing concept, which I love, called separation of concerns. You can have one agent do a very specific task, and do that very, very well. Then you can have multiple agents work together to solve a larger problem. Now, this used to happen even before agents, actually. People used to break problems down into tasks. They used to solve one piece.
Let's say you want to compute a summarization of a text, right? Or answer a very complex user query, which needs sub-tasks to be executed. So they would split those into sub-tasks and then aggregate them to answer the top-level query. That's why I think agents are super powerful. And I think, in the example that you gave, 2 plus 2 equals 4: so "2 plus 2" can essentially go to the language model, and it invokes an agent that takes these two parameters, produces 4, and comes back to the language model to produce whatever response it can produce in terms of summarization and response. Is that a fair description of how agents can fit in as well? Yeah, in the 2-plus-2 case, you could probably have one LLM do the 2 plus 2, because all you do is take those and tokenize them, right? 2 would be one token, plus would be another token, the equals would be the third token. You would give the sequence of tokens to an LLM and it would give you 4, right? But what the agent could do is keep track of what the previous set of computations was. So, like, 2 plus 2, maybe there was 10 plus 9, and whatnot. You have a sequence of these computations together, and you might need them to solve a larger problem. So, yeah. I want to maybe push this conversation a little bit forward, right? So we have the large language models. And I don't know if DeepSeek is a milestone moment or not, but then we get into large reasoning models. Explain to us what a large reasoning model is and how it is different from a large language model. Yeah, it's a great question. Reasoning models, or so-called thinking models: the simplest way to think about it, and I'm actually oversimplifying this, so I apologize to the ML purists who might be listening, but the simplest way to describe it is this concept of traces. If you have looked at how any of these thinking models work, they basically generate a set of steps, or traces, as they call them. And those traces tend to help the model figure out if it is moving in the right direction toward the right task. Now, there's this concept of reinforcement learning, and it's a term of art, but what it means is essentially that a certain subset of these traces gives the best reward, and reward is again an RL term, but reward toward solving the larger task, the actual task. So you keep going through the traces, picking the best one, and then picking the best one after that, and the best one after that. You must have seen how o3-mini or o3 works, right? It finds a trace, it thinks about a problem, it figures out, okay, this is what I'm going to go with. Then it goes with the next step, and the next step, and the next step, and it's making a sequence of choices, right? And then those choices bubble up into the larger task, which is why it's so much more effective at complex reasoning problems, where you have to do really complex math or logic problems. Those are very well solved by thinking models. The trade-off with these models is that they are slow. You will have to sacrifice a lot of latency, and compute also. I know OpenAI spends a ton of money running compute for these kinds of thinking models. So reinforcement learning is a very old phenomenon, right? What has fundamentally changed that has created this reasoning or thinking model? Yeah, it hasn't changed much, to be honest.
It's just an application, in my opinion at least, an application of reinforcement learning to these large language models. Reinforcement learning has always been popular, even in recommendation systems. If you go back to that world, you had all of these different users interacting with, say, videos that you put up on your website. And then you want to find the best set of videos that should be shown to the next person, right? The way you would do that is this thing called explore and exploit: explore some video, somebody clicked on it, that's the reward, then start pushing more and more of the same kind of video to that person. So yeah, I think it's a similar concept if you look at it from a language model point of view; it's just an application. And you have built recommendation systems before. Yes, yes, I have. Yeah. In news feed recommendations, we built a recommendation system back in the day called Slingstone, which was powering, if you had Apple News, all of the news feed that you see there; it was being powered by that system back then. And obviously at Netflix, I was working quite closely with the recommendation systems team, helping them power their models for these recommendations. Got it. So there was this recent write-up from the engineers at Apple, "The Illusion of Thinking," I'm not sure of the exact title, but the idea basically states that reasoning models perform okay at low-complexity and medium-complexity tasks, but when you get into high complexity, they fail. What is your take on that? I mean, what does that mean to you, and what do you think we as a community, I mean the machine learning community, are going to do about it? Yeah, I think everybody had suspected this, and all of us had some specific set of examples that we personally have seen where these models fail. So we had these kinds of anecdotal examples of what the paper talks about. Now, what they did was a more thorough, elaborate quantitative study, which actually proved this concept. And so I'm actually interested to learn more. I will say that I haven't read the paper fully, but yeah, the gist of it is, like you said, they don't actually think. And because of that, solving these sorts of complex problems makes them even more compute-heavy. You need to run a lot more compute, a lot more traces, to actually solve those kinds of complex logic problems, which is impractical, right? Yeah. And related to that, there was a study from Anthropic, actually. A lot of people talk about these traces, the thinking traces, as explanations. They would use them as, okay, this is how the model reasoned, that's how it actually thought about the problem, and all of that. Now, what Anthropic said was that they found it wasn't actually true. It wasn't actually what the model thought. It might just be showing those traces because it's trained to do that, but it is not actually the internal thought process of the model. Can you double-click on that? That's a very interesting view. And I think this is leading us into the questions on the intersection of security and GRC and the idea of large language models. And I think we are getting there.
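Earlier in this exchange, Preetam breaks an agent into three components: a tool, a model, and memory. Here is a minimal sketch of that loop. The `call_llm` helper and the `TOOL ...` convention are hypothetical placeholders for illustration, not any specific framework's API.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM API call.
    # Here we pretend the model decides to use the calculator tool.
    return "TOOL calculator 2+2"

def calculator(expression: str) -> str:
    # A "tool": a plain deterministic function the agent can invoke.
    return str(eval(expression, {"__builtins__": {}}))  # toy only; never eval untrusted input

TOOLS = {"calculator": calculator}

@dataclass
class Agent:
    memory: list = field(default_factory=list)  # keeps track of state across steps

    def step(self, task: str) -> str:
        # 1. Ask the model what to do, giving it the task plus prior context.
        decision = call_llm(f"Task: {task}\nHistory: {self.memory}")
        self.memory.append(decision)
        # 2. If the model asked for a tool, run it and remember the result.
        if decision.startswith("TOOL"):
            _, name, arg = decision.split(" ", 2)
            result = TOOLS[name](arg)
            self.memory.append(f"{name}({arg}) = {result}")
            return result
        return decision

agent = Agent()
print(agent.step("What is 2 + 2?"))   # -> "4"
```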

[20:59] Raj: But Preetam, help explain to our leaders: traditionally, when you build your machine learning models, you have spent a lot of time on them, and you can actually build explainability, because you are building it on a set of features that is at least countable in relative terms, right? Now you say that the traces are not explanations; double-click on that for us. If that is not explainable, how do we achieve explainability in large language models or large reasoning models? And before we get to that, can we take a step back? What does explainability mean? Maybe we'll start from there and then go into this. Yeah, yeah, let's start from there. Let's talk about explainability, and we could probably do a completely new podcast session on explainability alone, by the way, but we'll try to keep the discussion concise. So, explainability. Let's take a very simple classification model. Let's say you're a bank and you want to approve loans, right? You get a bunch of different loan applications, and now you're getting thousands, maybe 10,000 loan applications every week; it's impossible to put humans in there to approve all of those applications. So now you think about adding a machine learning model there. In banking and these kinds of critical sectors, they actually don't even use LLMs. They use traditional models, like tree-based models, decision trees, and whatnot. So let's just pick a decision tree, or maybe even a linear regression model. The idea is that this model would take in attributes out of these applications and then decide: yes, I should approve this loan, or I should not approve this loan. Simple setup. So now, as a banking person running this sort of model, let's say the model says this loan is rejected. You want to know why, right? Why was this loan rejected? That's where explainability comes in. In traditional machine learning, you would have these features, probably about 30, or maybe 100 of them. That's typically the case. And the reason you had such a small set of features was this lost art of feature engineering, which nobody seems to care about these days. You would spend a meticulous amount of time figuring out the right features for this model. So now that you have these features, they will be able to tell you what contributed to the decision. And a feature could be something as simple as: what was the age of the person? Did they have a credit history? How old was their credit history? Have they defaulted on previous loans or similar things? Have they had previous loans in the past, which increases the probability of success? So now that you have those features, the model can tell you why it rejected or accepted an application. That is what explainability is. In the modern world, Raj, if you come to LLMs: you have three trillion parameters. How will you tell which parameter contributed to something? Let's say you just take this three-trillion-parameter model and apply it to this loan application problem. You can't use those traditional techniques to figure out which parameters, or set of parameters, contributed to this application being approved or not. Has there been any research done, Preetam, if you look at how the probability is distributed over those parameters? Is there a pattern that we see?
Is there a… to at least, I know it is a massive task, but has there been any work done toward that? Yeah, you know, for complex models like these, there have been classic algorithms like LIME or SHAP, which have attempted to give you an overview of what contributed to a decision. But even there, even when LIME or SHAP is applied to neural networks, it's important to have a clear, distinct set of features. Because the parameters, if you look at the more modern language models, don't mean much by themselves. In combination they mean a lot to the LLM, but it's not something you can decipher as a human. So that's the problem. Even if you run SHAP on top of these parameters, it'll be garbage. I mean, you will not be able to understand it. That's where people started getting into this aspect of asking the LLM to explain itself. That was one form of explainability. The second form is what you were talking about, Raj: the thinking process, the set of steps a thinking model went through. You could use that as an explanation, and maybe we can talk more about that. Got it. So, given the evolution of what you have seen, Preetam, are rule-based systems out? I don't think so. I think there is still a lot of value in high-precision rules. In the end, these machine learning models are probabilistic in nature, which means there's a high degree of non-determinism, and people don't like that, right? Especially, you know, take that banking example again. If you run the model two times, and the first time it says not approved and the second time it says approved, what would you do with that? You cannot realistically run such a system. So I think rules have their place. I think even the classic machine learning models have their place. Having said that, the more complex problems are better solved by LLMs, given their capabilities. I think that's a good segue for me to ask you: one of the challenges is that security, governance, risk and compliance, and cybersecurity have traditionally been very deterministic disciplines. Either you have turned backup on or you have turned backup off; you have turned logging on or you have turned logging off, on the virtual machine, on the Kubernetes cluster, whatever it is. And everybody is moving toward the idea of using large language models in general and applying them to cybersecurity and GRC. So how do you bridge these two worlds of probabilistic inputs and deterministic outputs? Yeah, it's a tough problem, I will say. And this is why, and you know this really well, Raj, since you all are experts in the compliance space, one of the reasons the AI governance frameworks haven't fully borne fruit is a lack of, I would say, consensus on what should be governed and how you actually govern it. And like you said, in traditional controls, you will have yes or no: whether encryption is enabled or not, or whether you have two-factor authentication enabled or not. We went through SOC 2 recently, by the way, which is why all of that is fresh in my head right now. I would say taking a more pragmatic, incremental approach toward governance for AI, for non-deterministic systems, is very important.
And you could even start simple; there has been some good work by a few people here on simple things like model security. Is the model secure or not? Does somebody have access to the model? Can an external attacker influence the model somehow? Are there enough guardrails in place for the model? Can it spit out various things? Do you have PII handled properly in your model, especially in high-risk industries? So stuff like that; what we see in our applications also applies to the governance, risk, and compliance space. We see people adopting it on a piece-by-piece, piecemeal basis, solving one problem at a time. I think that's the right approach. Now let's talk about evals. I want to talk about the idea of what I would call unit testing, because I think we are all now going from being software companies to being AI companies in some ways. So let's talk about AIMon. What do you do at AIMon? How did this happen? This is your brainchild. How did this happen? What are you doing? Yeah, so AIMon actually originated as an idea when I was at Netflix. One of the things that happened, and now I forget the year, it was probably 2023 or so, was when ChatGPT really took off. Everybody was going crazy talking about the capabilities, and there was this massive virality moment for ChatGPT. Because of that, even the enterprise side of OpenAI, the models, started picking up a lot. So everybody was interested in getting OpenAI into the enterprise, deploying it, and seeing how it could add value for their use case. So that's what happened. There was a limited study done during my time at Netflix, and so we were evaluating OpenAI. We had a bunch of hackathons and things like that. What we saw was essentially people building cool AI applications using OpenAI APIs. But then the problem was they didn't really know how to improve those applications. They got to a prototype stage very, very quickly, hacked something up. It was great to see.
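To ground the loan-approval explainability example from earlier in this exchange, here is a minimal sketch of per-decision feature attribution with SHAP on a small tree-based model, assuming the scikit-learn and shap packages. The features, values, and labels are invented for illustration.

```python
import numpy as np
import shap
from sklearn.tree import DecisionTreeClassifier

# Made-up loan applications: [age, credit_history_years, past_defaults, income_k]
X = np.array([
    [25,  1, 0,  40],
    [40, 15, 0,  90],
    [35,  8, 2,  60],
    [55, 25, 0, 120],
    [30,  3, 1,  45],
    [48, 20, 3,  70],
])
y = np.array([0, 1, 0, 1, 0, 0])  # 1 = approved, 0 = rejected (toy labels)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# SHAP attributes each individual prediction back to the input features,
# which is the kind of per-decision explanation a bank can act on when the
# feature set is small and hand-engineered.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print(shap_values)  # contribution of each feature to this applicant's decision
```

With a three-trillion-parameter language model there is no comparable small, named feature set to attribute against, which is the gap Preetam describes.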

[30:43] Preetam: But in order to get from, say, a 45% accuracy rate to an 80% accuracy rate, which is absolutely required to go to production and make it even a viable solution, they didn't know how to do that. So then they would come back to us on the machine learning team and say, hey, how do you do that? We would have to go back and forth with them and give them tools and metrics to measure: okay, how do you measure accuracy? How do you measure hallucinations? What are the issues with this model? How do you measure conciseness? Somebody was talking about tone: I want this model to be talking in a certain tone, because that's how these models come across. All of those things we had to really hand-hold them on. And so that's where the idea for AIMon came into the picture in my head. My co-founder, Puneet, and I went out and talked to a bunch of enterprises and realized this was a similar pattern everywhere. So we went out and built AIMon. Essentially, AIMon is an AI reliability platform. It helps you, so you can continue building your applications and leave the part about improving your applications, or measuring how good your applications are across a wide variety of metrics, things like accuracy metrics, traditional business metrics, and even security metrics like prompt injection and toxicity metrics, to our APIs. So continue building your apps, and use AIMon to improve your apps and measure how good they are. And is this a distillation of language models, or is this the traditional feature-based machine learning model that you… Yeah, great question. So what we did was, in order to compute these metrics, we didn't really need all of the capabilities of language models. Language models are way too large, and too expensive also, for computing, say, a hallucination metric, or checking if your LLM followed all the instructions in the prompt; it's called the instruction-following evaluation problem. Those are very specific tasks. And in my experience, building models for specific tasks is much better than trying to take a horizontal model, a very generic model, and solve it for that task. So we have a distilled version of some language models. We have a specific architecture also. It is still based on the transformer architecture, but some parts of it have been removed; we don't really need elaborate decoding layers and things like that. Which is why the AIMon models work at extremely low latency and can run on more commodity hardware, as we would call it. No, that is beautiful. And are you a proxy, meaning do you sit on the transaction path? Yeah, so the way you can implement AIMon is to use it in this sort of proxy behavior, where you go to OpenAI, get the result back, and then review that with AIMon. That's completely up to you, how you implement it. We don't have a reverse proxy or a full proxy as such. That's not what we do. We allow you to be flexible, because a lot of people don't want to proxy all the time. They just want to do this offline. They want to grab a certain set of things, and they also want to control what kind of data goes through this proxy and things like that.
So that's why we have a more customizable API for doing these kinds of metric evaluations and guardrailing and whatnot. And I think one of the things that you touched on is reliability, right? And I think the basic principles of systems engineering will now also apply to AI-based systems. So how do you see reliability? I mean, are there specific metrics? Is it very contextual to the type of solution? How do you think about these things? Yeah, I think reliability is a holistic problem. I mean, you need a holistic solution to that problem. The system aspects of it are also super important, by the way: when you are building an LLM application, or any sort of AI application, you need to follow the usual best practices of building good systems, make sure you have high availability, and make sure you have the systems properly backed up, especially if you have stateful systems in your AI application. All of those good things still exist in AI applications, and I think there is enough tooling and infrastructure for that. When you get into the quality of results, and we talk about data quality a lot, but this is the quality of the results from a language model, that's where a lot of the tooling is lacking right now, and it's still very early. A lot of people use LLM judges and things like that, but LLM judges have their own problems: they can be biased, and they're also subject to being probabilistic in nature. They might give a relevance score of five for an input once and a relevance score of 10 for the same input the next time. So I think it's about having a holistic set of metrics, and this is where tools like AIMon could potentially help. We basically divide it into a few pillars of reliability. The first pillar is making sure your output quality is great. The second pillar is making sure the toxicity or safety metrics related to your output are good. The third is your data itself. Is the data that you're feeding to your LLM good or not? Does it have conflicting information? Does it have poor formatting? All of that can affect your quality. Those are the main pieces we recommend thinking about, at least, when you're thinking about reliability in your LLM applications. Got it. Now, in the traditional machine learning world, you typically create this confusion matrix, right? It talks about the accuracy of the outcomes and the precision of the outcomes, based on what is expected and what is actual. How do you do that in the large language model world? Yeah, and that's always been the problem. When you deploy an AI application, first of all, you need to measure how good it is. What is the precision? A lot of people do what we call vibe checks: they take three or four different queries, check if those work, and then call it a success. Then, when it actually goes to production, all hell breaks loose. So the first important thing is doing a holistic evaluation: figuring out what your data set should be for your particular domain. Take the finance banking application for loans. Are you covering all possible types of loans that could be input into your system? That would be your quote-unquote golden data set that you will use to evaluate your AI application. And once you've done that, you have a high degree of confidence that it's good.
The next step after that is to do it continuously. You might still get inputs that may never have been in your evaluation data set, your golden data set. So it's important to do continuous monitoring, like how you have continuous compliance controls. Similar to that, it's very important to do continuous monitoring of your application. This is where it gets interesting, Raj: the traditional machine learning world, where they used to ensure model output quality, and the compliance world are sort of intersecting, right? Having poor output quality is also a risk, a pretty important business risk. So that's where I see these two worlds meeting. Got it. And there's this idea of fine-tuning: especially as we continue to leverage these large language models, we see that the reliability or the output metrics of some of these large language models do not suffice, and we have to work on fine-tuning. Maybe explain to our listeners what fine-tuning is, how you see fine-tuning applied in the world of large language models, and what challenges and solutions you see. Yeah, fine-tuning. You would use fine-tuning when you realize that your model is not working properly. Let's say you did all of these evaluations and you found that it's not actually working very well for your data set because, and again coming back to this finance banking loan application example, for whatever reason the language model you're using has never seen finance or loan-related data in the past. So when you give this new form of data, what we call out-of-bounds samples, to this model, it will not perform well. It doesn't really know how that works. So what fine-tuning does is, unlike full training, you would basically take, say, 100 to 1,000 examples, or maybe 1,000 to 5,000 depending on your use case, and again this varies on a use-case-by-use-case basis. You would run that through a fine-tuning algorithm of your choice and then figure out how to make sure the model is working very well for the specific metrics you're calculating. That's the basic process. It's just teaching a new capability to the model; think of it as adding on a new capability. And that's basically what fine-tuning is. I would in fact suggest to listeners here: do not use fine-tuning unless you absolutely need to. Prompt engineering with a few-shot examples, and "few-shot" can be anywhere between 2 and 50, lets you give enough examples in your prompt, given the large context windows that you have, and you can make faster progress through prompt engineering itself. Fine-tuning only applies when, like I said, for some reason your LLM hasn't seen this kind of data in the past. Let's take a slightly different direction: retrieval-augmented generation, right? RAG, which is maybe the most common deployment that we typically see.
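Here is a minimal sketch of the golden-data-set evaluation loop described above, which would be rerun continuously rather than once. The `ask_llm` and `grade_answer` functions are hypothetical placeholders; in practice the grading metric might be an exact match, a similarity score, or a purpose-built evaluation model.

```python
GOLDEN_SET = [
    # (question, reference answer) pairs covering the domain you care about.
    {"query": "What loan types do we offer?", "expected": "Personal, auto, and mortgage loans."},
    {"query": "What is the minimum credit score for a mortgage?", "expected": "620"},
]

def ask_llm(query: str) -> str:
    # Hypothetical placeholder for your deployed LLM application.
    return "Personal, auto, and mortgage loans."

def grade_answer(answer: str, expected: str) -> float:
    # Toy metric: word overlap with the reference answer (0.0 - 1.0).
    got, want = set(answer.lower().split()), set(expected.lower().split())
    return len(got & want) / max(len(want), 1)

def evaluate(golden_set, threshold=0.7):
    scores = [grade_answer(ask_llm(item["query"]), item["expected"]) for item in golden_set]
    accuracy = sum(s >= threshold for s in scores) / len(scores)
    return accuracy, scores

accuracy, scores = evaluate(GOLDEN_SET)
print(f"passed {accuracy:.0%} of the golden set")  # rerun on a schedule, not just once
```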

[41:46] Raj: Can you explain what RAG is and how it is used? Yeah, so RAG is Retrieval-Augmented Generation, like you said. Essentially, the idea is that you have some enterprise knowledge and the LLM has never been trained with that knowledge. So let's take one step back and ask it a question like: who is Tom Cruise? If you go to ChatGPT and ask who Tom Cruise is, because it has been trained with a lot of Wikipedia data, it knows who Tom Cruise is. It will tell you about Tom Cruise being an actor, his amazing career, amazing movies, and whatnot. But now you ask the same LLM something specific inside your organization, like, what is ABAC? With that acronym, it'll probably come up with something from the open internet. What RAG does is actually provide context from your internal organization to the LLM. And that would be sitting in the prompt. So when you're giving a prompt to the LLM, you would say: okay, this is what I have from my internal knowledge sources; consider this when you are doing your inference. This is what is called in-context learning, ICL, very popularly called prompt engineering. So you would take this retrieved information that you get from your internal knowledge base, put it in the prompt, and then ask the LLM to make an inference. So now this acronym, ABAC, is in your knowledge base; the LLM will pull that ABAC definition and give you the answer. That's how it works. As for the process of building the RAG pipeline, there are so many popular frameworks these days: there's this database called ApertureDB, there's Weaviate, there's Milvus, and a bunch of other ones which are just vector databases, Pinecone and whatnot. You would basically create vector embeddings out of your knowledge documents and store them in there. Then at runtime, when the user sends a query, you pull the relevant document from your vector DB, give it to the LLM's context, and let the LLM give you the answer. Got it, got it. And given the increasing context window sizes, do you think RAG is still relevant? Yeah, it's a hot debate, I will say. And I'll tell you where I stand in that hot debate: I think RAG is still very relevant. A lot of people would disagree with me. But the reason is, I still think providing precise information to an LLM for answering a certain question is more important than giving it the entire world's knowledge. If you take a knowledge dump that you have, that's a million tokens, maybe more. And nowadays the context size is increasing too. You take the entire one million tokens and give it to the LLM; it's unlikely to find the needle in the haystack. That's the problem. That is the main issue. And there's well-documented research about this: LLMs tend to focus only on the tail ends of the context that you give them. They forget about a lot of things in the middle. And there are certain other LLMs which focus on things in the middle and forget about the tail-end aspects. So, you know, again, I prefer pragmatism: having some system that can give you precise information and feed that information to the LLM to make an inference. Let's talk about something very specific, maybe a use case in security or GRC in the context of LLMs and evals. Yeah. So, security and GRC in the context of LLMs and evals, right? I think from the security perspective, there are a few metrics that are very popular these days.
Everybody is concerned about them. One of them is prompt injection. Prompt injection essentially means that an attacker could make the LLM do something that it's not supposed to do. For instance, make the LLM behave like a completely random persona when it's not supposed to behave that way in the context in which it's deployed. Or maybe make the LLM talk about the competition: say you're talking to a Tesla chatbot and the LLM is now praising another competitor, like Cruise or something. That's where prompt injection attacks come in. And there are more serious implications of it: you could also have SQL injection inside that, or you could have some malware injected into that. So from an evaluation perspective, from security and GRC, one of the things people care about is a metric called CBRN, which is chemical, biological, radiological, and nuclear risk. All that metric means is that you want to ensure the LLM doesn't give out techniques for how to, say, make a weapon that can be used for mass destruction. Those kinds of safety metrics are important from an evaluation perspective, from the security angle and from a GRC angle. So evaluation is one piece, and I like to see it as an offline tool. You have some set of queries, and you evaluate your LLM on how it fares against all of these different risks. Is it susceptible to prompt injection attacks? Does it produce CBRN content? Does it produce toxic output? All of these different pieces. But for these specific metrics, you also want to put guardrails in. That's where, for the security people, and even from a compliance perspective, actually implementing this control in place as a guardrail is super important. I had seen a very, very cool demo with AIMon. And by the way, at ComplianceCow we integrate with AIMon; we work very closely with you guys. We had done something very interesting on access policy using AIMon models. Explain to our listeners what that is. Yeah, first of all, fantastic. It's always fantastic working with you all and collaborating with you all on these problems. I think it's been a great collaboration so far. With regard to access control: one of the pieces is that access control on unstructured data is becoming increasingly hard. There are all of these different systems that already exist to implement RBAC, role-based access control, or attribute-based access control, ABAC, in your systems. But all of them rely on very traditional approaches: whether you have an identity for a user, and whether that identity can be properly verified. If it's a structured table sitting in a Postgres database, and this particular person has access to a particular column or table, you can do all of those things with that setup. But now think about it this way. You have an LLM which is giving you unstructured data. Somebody trained an LLM inside your organization and deployed it on finance data. And now anyone in the company, even a contractor, could ask this LLM, what is the compensation of person X, when they shouldn't actually have access to that kind of compensation data. If they tried to make that query via a traditional system like a SQL system, they wouldn't have access to it, and they wouldn't even have access to the interface to make that query in the first place. But with an LLM, all of that is now sort of sidestepped, and they have access to this information.
So how do you guard against that, right? What we have implemented, in the demo you just talked about, helps you take this sort of Okta-like policy, where you have all of these different attributes and permission sets for a user stored in a system like Okta, and use that to actually enforce those policies on unstructured data. And the way we do that is using a specific model that we have built for this, which can analyze the unstructured data's properties and see if it contains any of the specific attributes that are allowed or not allowed by those input permission sets. And I think what is very, very cool about it, Preetam, and I don't know if we are collectively realizing it or not, is this: there are a bunch of ways in which we have done this traditionally, right? You use OPA, Open Policy Agent, or other engines like that. What this is actually doing is democratizing the idea of writing these policies and executing them at runtime much more easily. Because how many people know Rego? And how many people want to read Rego and learn Rego is a big question. And I think that is super cool. But I think a lot of that hinges on how reliable you think the response is going to be, right? Talk about that a little bit. How do you ensure the reliability of these responses? Because underneath, you are using a model.
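A minimal sketch of the pattern described above: pull a user's permission attributes from an Okta-like policy and check the LLM's unstructured answer against them before returning it. The `detect_attributes` classifier here is a naive keyword stand-in for the purpose-built model Preetam mentions, and the policy format is invented for illustration.

```python
# Invented, Okta-style permission sets: which data attributes each role may see.
POLICY = {
    "finance_analyst": {"compensation", "revenue"},
    "contractor": {"public_docs"},
}

def detect_attributes(text: str) -> set:
    # Naive stand-in for a trained classifier over unstructured text.
    keywords = {"compensation": "compensation", "salary": "compensation", "revenue": "revenue"}
    return {attr for word, attr in keywords.items() if word in text.lower()}

def guarded_answer(user_role: str, llm_answer: str) -> str:
    allowed = POLICY.get(user_role, set())
    found = detect_attributes(llm_answer)
    # Block the response if it contains attributes the role is not entitled to see.
    if found - allowed:
        return "Blocked: response contains data outside your permission set."
    return llm_answer

print(guarded_answer("contractor", "Person X's compensation is $250k."))       # blocked
print(guarded_answer("finance_analyst", "Person X's compensation is $250k."))  # allowed
```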

[51:12] Preetam: Yeah, yeah, absolutely. And it's a great point, and a very important point, because these systems are still non-deterministic. Underneath, we also use a model to figure out whether this unstructured text matches the permission sets in the Okta policy that you retrieved. The way we handle this and ensure high precision is as follows. First, because we have tuned these models to do really well on these tasks, we can already achieve really high precision, and we optimize for precision. Whenever you think about machine learning models deployed in real-world systems, there's always this precision-recall trade-off when you think about accuracy. If you optimize too much for recall, you sometimes suffer on precision, and vice versa. What we did was optimize for precision. While we may not be able to catch everything, we make sure that you don't have false positives. That's what we have optimized these models for. Now, having said that, there isn't a 100% guarantee that it will always work, unlike a traditional rule, which is very deterministic. And that is okay, at least for implementing the system right now, because without this, you have zero percent access control. But with this, you have a high degree, like a 95% accuracy rate. There will be some things that might slip through, and that's where we come in and fine-tune the models against those cases to get it up to, say, 98% for your organization. That is super cool. That is super cool.
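For reference, the precision-recall trade-off Preetam describes comes down to two ratios over the guardrail's decisions. A quick sketch with made-up counts:

```python
def precision(tp, fp):
    # Of everything we flagged or blocked, how much was actually a violation?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all real violations, how many did we catch?
    return tp / (tp + fn)

# Made-up counts for a policy-enforcement model tuned for precision:
tp, fp, fn = 95, 1, 5      # it rarely blocks something it shouldn't (fp is low)
print(precision(tp, fp))   # ~0.99 -> blocks you can trust
print(recall(tp, fn))      # 0.95 -> a few violations may still slip through
```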

[52:50] Raj: We are approaching almost the end of our segment, Preetam. For a person who's graduating and watching this podcast: how can they enter this fascinating world of machine learning and language models? What would your advice be? Yeah, I would say there's so much information out there. Twenty years ago, this wasn't the case. You had books, and I have a ton of them lying around here, so we had to pore through all of those books. But now you could even have an LLM create a study plan for you to go through your favorite pieces of machine learning. Some people might be interested in linear algebra and going deep into it. Some people might just be interested in the applications part of it. So I would encourage you all to pursue that path. If you really want some recommendations, I would again come back to Andrej Karpathy's videos. There are three main videos, and I'm happy to share them offline. Three main videos on his YouTube channel: one that talks about general applications of LLMs and what he uses LLMs for. In the second one, he actually goes through a three-hour-long video where he replicates GPT-3 or GPT-2 from scratch. That's amazing. I mean, yes, it's a bit long, but I would highly recommend sitting through it. And then there's a shorter version of that video where he builds an LLM from scratch. So I highly recommend looking at those, and also look at the traditional machine learning algorithms like linear regression and logistic regression; start there. Having a very good base in those traditional machine learning algorithms is very important before you jump into the cool-kid stuff, the large language models and neural networks and crazy things like that. So yeah. Maybe you don't want to answer this question, but what is your favorite model? Okay, yeah. Let me see. I have a ranked list of favorite models. I think my first one is the Claude model, the Sonnet. Claude 3.7 Sonnet is pretty cool, mainly because I use it with Cursor, which is a coding assistant. I found a lot of improvement in my own productivity because of that. The second one I would say is o3. I think it's a really fascinating model. Also one of the most... yes, o3 is OpenAI's thinking model. And soon after that follows Qwen, Qwen 2.5 or 3. And then, yes, correct, it originated from Alibaba. Yeah, it's completely open source. And then also the DeepSeek ones, right? I think open source is doing really, really well, by the way. And also, unpopular opinion, but I'll share it anyway, Raj: I think the next frontier of models will be more generalizable models, which use less data to do the same things that even a five-trillion-parameter model from OpenAI can do. Less data to train LLMs would actually make these LLMs more accessible than having more data and ever-larger models that need massive hardware to run them. That is a very fascinating thing to say. That is almost another two podcast episodes that we need to have. We haven't even talked about MCP, and we need to double-click and go into more detail, or A2A, or whatever the trend is going to be tomorrow. So this is fascinating, Preetam, and thanks for being on the show. Sincerely appreciate it. No, thank you, Raj. Thank you for having me. It was a fun conversation. And yeah, I can chat about this all day. So I'm looking forward to more conversations in the future, and I always enjoy working with you.
So thank you, Raj. Take care. I think we’ll stay on the line, Preetam.

Listen on your favorite platform and stay updated with the latest insights, stories, and interviews.


Want to see how we can help your team or project?

Book a Demo