Join Ted as he engages with Damien Riehl to explore the transformative intersection of AI and the legal profession. This conversation unpacks the evolving role of large language models in legal reasoning, how they reshape traditional tasks, and the philosophical implications of trusting AI outputs. Whether you’re curious about the limits of AI’s “reasoning” or its practical benefits for attorneys, this episode offers fresh perspectives and insights you won’t want to miss.
In this episode, Damien shares insights on how to:
Assess the reasoning capabilities of large language models in legal contexts
Leverage generative AI to enhance legal document drafting and research
Interpret the limitations of AI outputs in subjective fields like law
Test the reliability and objectivity of AI systems in complex decision-making
Navigate ethical considerations in adopting AI-driven tools in legal practice
Key takeaways:
AI’s reasoning in law aligns with its backward-looking nature, as legal tasks often rely on precedent and existing data, making large language models effective for linking facts to statutes and regulations.
The “anesthesia test” emphasizes evaluating AI by its practical outputs, as effectiveness matters more than understanding its internal workings.
While large language models excel at connecting legal concepts, they struggle with objective tasks like math, reflecting their contextual strengths.
Studies like Stanford’s highlight AI’s limitations, but the legal industry should focus on practical applications for everyday workflows rather than edge cases.
About the guest, Damien Riehl:
Damien Riehl is a lawyer and technologist with experience in complex litigation, digital forensics, and software development. A lawyer since 2002 and coder since 1985, Damien clerked for the chief judges of state and federal courts, practiced in complex litigation for over a decade, has led cybersecurity teams and world-spanning digital forensics investigations, and has built AI-backed legal software.
Co-Chair of the Minnesota Governor’s Council on Connected and Automated Vehicles, he is helping recommend changes to Minnesota statutes, rules, and policies — all related to connected and autonomous vehicles. Damien is Chair of the Minnesota State Bar Association’s AI Committee, which oversees an AI Sandbox to promote Access to Justice (A2J).
At SALI, the legal data standard, Damien built and greatly expanded the taxonomy of over 18,000 legal tags that matter, supporting the legal industry's development of Generative AI, analytics, and interoperability.
At vLex Group — which includes Fastcase, NextChapter, and Docket Alarm — Damien helps lead the design, development, and expansion of various products, integrating AI-backed technologies (e.g., GPT) into a billion-document dataset from 100+ countries, all to improve legal workflows.
“How much of the law is looking backward—that is, looking to precedent? You’re always looking to statutes; you’re always looking to something that is in the data set. So, if it is in the data set, really, all of our reasoning that is legal is backward-looking, not forward-looking.”– Damien Riehl
1
00:00:02,328 --> 00:00:04,179
Damien, how are you this afternoon?
2
00:00:04,179 --> 00:00:04,761
Couldn't be better.
3
00:00:04,761 --> 00:00:05,475
Life is really good.
4
00:00:05,475 --> 00:00:06,326
How are you Ted?
5
00:00:06,326 --> 00:00:07,196
I'm doing great.
6
00:00:07,196 --> 00:00:07,926
I'm doing great.
7
00:00:07,926 --> 00:00:10,786
I appreciate you joining me this afternoon.
8
00:00:10,786 --> 00:00:23,341
We were kicking around a really interesting topic via LinkedIn and I figured, you know
what, I've been overdue to have you on the podcast anyway.
9
00:00:23,341 --> 00:00:27,472
So I figured this is a good opportunity to, uh, to riff a little bit.
10
00:00:27,472 --> 00:00:31,533
Um, but before we do, let's, let's get you introduced.
11
00:00:31,533 --> 00:00:35,104
So I went and looked at your, your LinkedIn profile.
12
00:00:35,104 --> 00:00:36,294
Interestingly,
13
00:00:36,344 --> 00:00:44,477
I didn't realize you started your legal career as a clerk and you started practicing in
the early two thousands.
14
00:00:44,477 --> 00:00:46,717
You worked for TR and Fastcase.
15
00:00:46,717 --> 00:00:49,098
That's now VLex, right?
16
00:00:49,098 --> 00:00:54,420
And, um, you're still at vLex and I know you do a lot of work with SALI.
17
00:00:54,420 --> 00:00:56,010
That's how you and I actually first met.
18
00:00:56,010 --> 00:01:01,352
But, um, why don't you tell us a little bit about who you are, what you do and where you
do it.
19
00:01:01,363 --> 00:01:08,423
Sure, I've been a lawyer since 2002, I clerked for chief judges at the state appellate
court and the federal district court.
20
00:01:08,423 --> 00:01:16,583
Then I worked for a big law firm, Robbins Kaplan, where I represented Best Buy in much of
their commercial litigation, represented victims of Bernie Madoff, helped sue JPMorgan
21
00:01:16,583 --> 00:01:18,053
over the mortgage-backed security crisis.
22
00:01:18,053 --> 00:01:24,633
So I spent a pretty long time, some would say too long, as a litigator, but then I've also
been a coder since '85.
23
00:01:24,633 --> 00:01:29,275
So I have the law plus technology background, and anyone who works with me will tell you
that
24
00:01:29,275 --> 00:01:31,315
I am probably the worst coder you've ever met.
25
00:01:31,315 --> 00:01:38,640
I say I'm a coder not as a badge of honor, but a shroud of shame where I'm not very good
at coding at all.
26
00:01:38,640 --> 00:01:42,523
But with large language models, one can be actually better at coding than one actually is.
27
00:01:42,523 --> 00:01:49,267
So after litigating for a bunch of years, I joined TR, building a big thing for them, did
cybersecurity for a while.
28
00:01:49,267 --> 00:01:57,171
But since 2019, I've been working with Fastcase, which is now VLex, essentially playing in
a playground of a billion legal documents.
29
00:01:57,171 --> 00:02:05,374
cases, statutes, regulations, motions, briefs, pleadings, extracting what matters from
them using SALI tags and otherwise, and then running large language models across those
30
00:02:05,374 --> 00:02:06,105
things.
31
00:02:06,294 --> 00:02:06,925
Interesting.
32
00:02:06,925 --> 00:02:13,946
And is that how your involvement in SALI came to be, through the work that you're doing at
vLex?
33
00:02:14,259 --> 00:02:18,949
It actually came to be that I met Toby Brown, who founded SALI in 2017.
34
00:02:18,949 --> 00:02:23,839
I met him at ILTACON and we just happened to sit at the same breakfast table.
35
00:02:23,839 --> 00:02:28,279
And I'd known of Toby but had not actually met Toby before.
36
00:02:28,279 --> 00:02:34,629
But then we started talking a bit about SALI and I said, you haven't really
chased any litigation things.
37
00:02:34,629 --> 00:02:36,099
He said, no, we haven't.
38
00:02:36,099 --> 00:02:36,689
I said, why not?
39
00:02:36,689 --> 00:02:38,079
I said, would you like some help on that?
40
00:02:38,079 --> 00:02:39,475
And he's like, well, it's too hard.
41
00:02:39,475 --> 00:02:40,215
Do you want to do it?
42
00:02:40,215 --> 00:02:41,675
And I said, yeah, I totally want to do it.
43
00:02:41,675 --> 00:02:47,327
So we met in 2019, August of 2019, and I've been working on SALI ever since.
44
00:02:47,382 --> 00:02:48,123
Interesting.
45
00:02:48,123 --> 00:02:49,999
And what were you coding in 85?
46
00:02:49,999 --> 00:02:52,125
I've been, I started coding in like 82.
47
00:02:52,125 --> 00:02:53,651
What were you coding, BASIC?
48
00:02:53,651 --> 00:02:56,591
I was coding BASIC on my Commodore 128.
49
00:02:56,911 --> 00:03:02,731
I didn't start with the Commodore 64 because I wanted to get the thing that had 128K of
RAM instead of 64K of RAM.
50
00:03:02,731 --> 00:03:03,931
So I was coding BASIC.
51
00:03:03,931 --> 00:03:11,111
I was getting magazines where I would take the magazine on paper and I would retype the code
from the magazine and then try to tweak the code.
52
00:03:11,111 --> 00:03:13,902
So yeah, I was a very nerdy 10-year-old.
53
00:03:13,902 --> 00:03:14,382
Yeah.
54
00:03:14,382 --> 00:03:15,742
So it's funny.
55
00:03:15,742 --> 00:03:17,502
A lot of parallels there.
56
00:03:17,502 --> 00:03:24,722
Um, I started off with a Commodore 32, so I had one fourth of the memory that you did.
57
00:03:24,722 --> 00:03:30,342
And you know, I used to have to, uh, store my programs on audio cassette.
58
00:03:30,342 --> 00:03:40,322
This is before I could afford a floppy and you know, um, gosh, so this would have been,
yeah, probably 82, 83.
59
00:03:40,322 --> 00:03:42,252
Then I graduated to a
60
00:03:42,252 --> 00:03:48,557
I had a TI-99/4A with an Extended BASIC cartridge and a book about that thick.
61
00:03:48,557 --> 00:03:54,132
And I literally read every page of it to understand all the new commands.
62
00:03:54,132 --> 00:03:58,285
I totally geeked out on it and then was totally into it.
63
00:03:58,285 --> 00:04:04,461
And then during middle school, you know, the girls didn't think it was cool to be a
computer programmer.
64
00:04:04,461 --> 00:04:07,834
So I kind of ditched it for a while until college.
65
00:04:07,834 --> 00:04:11,006
So I had a break in there, but
66
00:04:11,022 --> 00:04:17,502
Then when I picked up computers again, it would have been early nineties, like 91 ish.
67
00:04:17,502 --> 00:04:26,822
And by then it was Visual Basic, you know, doing native Windows development, like
VB4.
68
00:04:27,161 --> 00:04:28,582
God, I can't remember.
69
00:04:28,582 --> 00:04:32,642
I think it was Visual InterDev, and I used it to compile Windows programs.
70
00:04:32,642 --> 00:04:33,652
I did a lot of SQL.
71
00:04:33,652 --> 00:04:38,862
I was actually on the SQL team at Microsoft in the late nineties, early 2000s.
72
00:04:38,862 --> 00:04:39,675
So
73
00:04:39,675 --> 00:04:42,300
I can still hold my own on SQL, but otherwise I'm like you, man.
74
00:04:42,300 --> 00:04:47,990
If I had to code an app, I'd be so lost right now.
75
00:04:48,307 --> 00:04:53,051
True, but I really query how important it is these days to be a really hardcore coder.
76
00:04:53,051 --> 00:05:01,677
I know people that are really good hardcore coders that use things like Cursor and use
large language models to be a bicycle for the mind, like Steve Jobs would say,
77
00:05:01,677 --> 00:05:03,618
and make them go better, faster, and stronger.
78
00:05:03,618 --> 00:05:14,486
But even for people that are rusty or really awful, like you and me, it's still, I can't
go 10 times as fast as a normal coder can with a large language model, but I can maybe do
79
00:05:14,486 --> 00:05:16,231
1x what they used to be able to do.
80
00:05:16,231 --> 00:05:16,382
Right.
81
00:05:16,382 --> 00:05:20,440
There's, there's really, um, it really evens the playing field on what is possible.
82
00:05:20,440 --> 00:05:21,350
Yeah.
83
00:05:21,430 --> 00:05:33,095
Well, you and I were riffing on a topic that I think is super interesting and I was kind
of surprised to hear your perspective on it and I thought it was really interesting and we
84
00:05:33,095 --> 00:05:39,618
were talking about the question on whether or not LLMs can reason.
85
00:05:39,618 --> 00:05:45,340
I've always, you know, understanding the architecture, I've always just had the default
assumption.
86
00:05:45,340 --> 00:05:49,688
That's kind of where I started my position on this with
87
00:05:49,688 --> 00:05:53,771
There's no way they can just based on, on the architecture, right?
88
00:05:53,771 --> 00:05:55,322
It predicts the next token.
89
00:05:55,322 --> 00:06:00,046
It has no concept of, um, comprehension.
90
00:06:00,046 --> 00:06:05,809
Therefore reasoning seems to be far out of reach, but it does create the illusion of
reasoning.
91
00:06:05,809 --> 00:06:09,012
And you had an interesting argument, which was, does it matter?
92
00:06:09,012 --> 00:06:17,138
Um, so, I mean, let's start with: do LLMs reason, or do they create the illusion of reasoning?
93
00:06:17,631 --> 00:06:19,872
And yes, let's talk about that.
94
00:06:19,872 --> 00:06:25,133
I think a good precursor to that question is are LLMs conscious or are they not conscious?
95
00:06:25,133 --> 00:06:28,914
And that's another kind of academic exercise question that people have been thinking
about.
96
00:06:28,914 --> 00:06:31,675
You know, it gives the illusion of consciousness, right?
97
00:06:31,675 --> 00:06:35,606
And so, but of course, large language models, in my opinion, are not conscious, right?
98
00:06:35,606 --> 00:06:38,057
Because they are just mimicking consciousness.
99
00:06:38,217 --> 00:06:44,599
But philosophers for millennia have been saying consciousness is undefinable.
100
00:06:44,753 --> 00:06:48,245
Like, the only thing I can be conscious of is I know that I am conscious.
101
00:06:48,245 --> 00:06:53,468
But whether you are conscious or not or just a figment of my imagination is something I
will never know.
102
00:06:53,568 --> 00:06:56,590
All I know is that my own consciousness is a thing.
103
00:06:56,590 --> 00:07:05,555
So I think the question of whether large language models are conscious or not is kind of
just an academic exercise that really doesn't matter, right?
104
00:07:05,676 --> 00:07:11,098
Any more than I know whether Ted is conscious or not. We as a
105
00:07:11,609 --> 00:07:14,681
science, and we as philosophers, have never defined consciousness.
106
00:07:14,681 --> 00:07:24,646
Therefore the debate about consciousness is just an academic exercise. So let's now set
consciousness aside and let's talk about reasoning. The real question is, when I'm
107
00:07:24,646 --> 00:07:35,332
speaking with you, Ted, I have no idea whether your brain is reasoning or not. That's
because often we ourselves don't know how our brains are reasoning or not. The only
108
00:07:35,332 --> 00:07:40,729
way I can tell whether Ted is reasoning or not is through the words that come out of Ted's
mouth
109
00:07:40,729 --> 00:07:44,381
or the words that come out of Ted's keyboard as Ted is typing.
110
00:07:44,381 --> 00:07:51,925
And if those words look like reasoning, and if they quack like reasoning, then I could be
able to say Ted is probably reasoning.
111
00:07:51,985 --> 00:07:55,427
So maybe shouldn't we judge large language models in the same way?
112
00:07:55,427 --> 00:08:00,950
That if the output of the large language models looks like reasoning and quacks like
reasoning, then maybe it's reasoning.
113
00:08:00,950 --> 00:08:06,653
And that's what the large language model folks, the machine learning scientists, the data scientists,
call the duck test.
114
00:08:06,733 --> 00:08:10,269
That is, they know what goes into the black box.
115
00:08:10,269 --> 00:08:15,244
They have no idea what happens inside the black box and they know what comes out of the
black box.
116
00:08:15,244 --> 00:08:24,753
But if the output looks like reasoning and quacks like reasoning, maybe whether the black
box is reasoning or not matters not, just like it doesn't matter if I know how you are
117
00:08:24,753 --> 00:08:25,854
reasoning in your brain.
118
00:08:25,854 --> 00:08:27,695
All I know is your output too.
119
00:08:28,098 --> 00:08:29,200
Interesting.
120
00:08:29,584 --> 00:08:31,540
Can we test for reasoning?
121
00:08:32,177 --> 00:08:34,028
Yes, I think we can.
122
00:08:35,009 --> 00:08:38,691
the question is, what are the tasks that you're testing on?
123
00:08:38,712 --> 00:08:41,614
There are objective tasks, mathematical tasks.
124
00:08:41,614 --> 00:08:43,876
So you can imagine a mathematical proof.
125
00:08:43,876 --> 00:08:47,919
You could be able to test whether it's making its way through the mathematical proof or
not.
126
00:08:47,919 --> 00:08:50,761
You can test whether that is reasoning or not reasoning.
127
00:08:50,761 --> 00:08:51,962
Same with science.
128
00:08:51,962 --> 00:08:53,133
Is it getting science correct?
129
00:08:53,133 --> 00:08:54,994
Is it doing the scientific method correctly?
130
00:08:54,994 --> 00:08:55,985
Is it reasoning?
131
00:08:55,985 --> 00:08:59,237
Is it providing true causation rather than being a correlation?
132
00:08:59,237 --> 00:09:02,569
I think those are objective truths that you could be able to see reasoning.
133
00:09:02,569 --> 00:09:06,831
And I would say that the outputs for law are much, much different than that.
134
00:09:06,971 --> 00:09:12,735
That is, whether I made a good argument or not in front of this court is not objective.
135
00:09:12,735 --> 00:09:14,055
That is subjective.
136
00:09:14,055 --> 00:09:22,600
So I can't do a proof as to validity or invalidity any more than you could do a proof as
to lawyer one made a better argument than lawyer two.
137
00:09:22,600 --> 00:09:28,499
Ask 10 lawyers and you might get a 50-50 split on whether lawyer one made a better
argument or lawyer two made a better argument.
138
00:09:28,499 --> 00:09:36,899
going over to the transactional side, the contractual side, lawyer one might love this
clause, but lawyer two says that's the worst clause on the planet.
139
00:09:36,899 --> 00:09:40,759
There's no objective standard as to what is good legal work.
140
00:09:40,759 --> 00:09:49,999
And absent any objective standard as to good legal work, maybe what is good legal
reasoning is in the eye of the beholder, much like beauty is in the eye of the beholder.
141
00:09:49,999 --> 00:09:57,201
That is, absent any objective way to be able to say this was good legal reasoning or bad
legal reasoning,
142
00:09:57,201 --> 00:10:05,774
I guess the question of whether a large language model is providing good legal reasoning
or bad legal reasoning is unanswerable in the same way to say whether that human is doing
143
00:10:05,774 --> 00:10:07,995
good legal reasoning or bad legal reasoning.
144
00:10:07,995 --> 00:10:15,717
So I think this whole debate about reasoning or not reasoning is academic at best because
we should judge it by its outputs.
145
00:10:15,717 --> 00:10:22,479
And different lawyers will judge the outputs differently with humans, and they'll judge it
differently with large language models.
146
00:10:22,990 --> 00:10:23,190
Okay.
147
00:10:23,190 --> 00:10:34,290
I think that's true to an extent, but let's say I come in as an attorney and to make
my closing argument, I sing the theme song to Gilligan's Island.
148
00:10:34,290 --> 00:10:43,330
Um, I think that would universally be graded as bad legal
reasoning, right?
149
00:10:43,330 --> 00:10:53,062
So, so there is a spectrum and you know, obviously that's an extreme case, but I think
extreme cases are good to evaluate whether or not something's true.
150
00:10:53,106 --> 00:11:05,180
And so, yeah, I mean, if something is just universally looked at and every attorney, every
reasonable person that would evaluate it, says it's bad.
151
00:11:05,321 --> 00:11:09,701
Does that throw a monkey wrench into what you're putting forward there?
152
00:11:09,701 --> 00:11:10,406
No.
153
00:11:10,407 --> 00:11:11,097
Yeah, that's right.
154
00:11:11,097 --> 00:11:16,971
So you're right that it is a spectrum, that you have the worst argument on the planet,
which is just gibberish.
155
00:11:16,971 --> 00:11:21,914
And then there's the best argument on the planet that is going to win 100 out of 100
times.
156
00:11:21,914 --> 00:11:23,374
And same thing with contracts.
157
00:11:23,374 --> 00:11:26,366
There's the contract that's going to get the deal done 100 out of 100 times.
158
00:11:26,366 --> 00:11:29,518
And there's the contract that is going to fail 100 out of 100 times.
159
00:11:29,518 --> 00:11:32,720
So everything is along that spectrum.
160
00:11:32,720 --> 00:11:39,121
And then if you add a y-axis to that spectrum, there is a most common thing, that is the
head.
161
00:11:39,121 --> 00:11:42,573
And then there's a long tail of rare things that happen.
162
00:11:42,574 --> 00:11:47,717
So if you think about what the large language models are doing is largely giving you the
head distribution.
163
00:11:47,717 --> 00:11:53,261
That is the most common things because it's giving you a compressed version of the
training data set.
164
00:11:53,261 --> 00:11:57,384
so the head is almost never going to be Gilligan's Island.
165
00:11:57,504 --> 00:12:01,928
And the head is almost never going to be some of the worst contractual arguments ever
made.
166
00:12:01,928 --> 00:12:04,430
It's going to fall on the average on that side.
167
00:12:04,430 --> 00:12:06,749
And that actually is probably
168
00:12:06,749 --> 00:12:09,931
the right thing to do for the large language model in the legal task.
169
00:12:09,931 --> 00:12:17,605
Because you want the average, because you want 100 out of 100 lawyers, you want most of
the lawyers to say that's probably right.
170
00:12:17,725 --> 00:12:20,226
And that is the average distribution of this.
171
00:12:20,346 --> 00:12:30,242
And so really then, if we then say the x-axis and the y-axis and you have the head, the
most common things, and then you have the long tail, and you now say, OK, the large
172
00:12:30,242 --> 00:12:35,865
language models are going to take the head, not the long tail, then you have to say, OK,
what is that head?
173
00:12:35,865 --> 00:12:36,567
Is that
174
00:12:36,567 --> 00:12:39,288
Does it require legal reasoning or not?
175
00:12:39,508 --> 00:12:44,819
So let's talk about mathematics and science. We want to find new science,
right?
176
00:12:44,819 --> 00:12:48,200
We want to be able to create new cures to cancer, right?
177
00:12:48,200 --> 00:12:54,052
And we want to be able to do things that have never been done before. So does the
large language model need reasoning for that?
178
00:12:54,052 --> 00:12:57,023
Absolutely, because that's not part of the training set, right?
179
00:12:57,023 --> 00:13:04,195
That's not part of something that we can look backward at, so we need reasoning for new
science. We need reasoning for new mathematics.
180
00:13:04,195 --> 00:13:12,061
You need reasoning for something that's never been done before, where you need somebody like
Einstein, somebody who is once in a generation, to be able to go forward and
181
00:13:12,061 --> 00:13:13,342
leap forward.
182
00:13:13,342 --> 00:13:15,083
Contrast that with the law.
183
00:13:15,664 --> 00:13:19,707
How much new thinking do we really need to do in the law?
184
00:13:19,828 --> 00:13:24,531
In contrast, how much of the law is looking backward that is looking to precedent?
185
00:13:24,612 --> 00:13:32,938
If I am a lawyer arguing in court and I say, judge, I've got this really brand
new idea that nobody's ever won on before, it just sprouted out of my brain.
186
00:13:32,938 --> 00:13:33,693
What do you think?
187
00:13:33,693 --> 00:13:35,364
The judge is going to say, show me a case.
188
00:13:35,364 --> 00:13:41,809
And if I can't show him a case, if I can't show her a statute, I lose because it's not
based on precedent.
189
00:13:42,009 --> 00:13:45,212
So do we really need new things in litigation?
190
00:13:45,212 --> 00:13:47,693
Do we really need new things in transactional work?
191
00:13:47,693 --> 00:13:49,875
Do we really need new things in advisory work?
192
00:13:49,875 --> 00:13:51,856
Do we need new things in regulatory work?
193
00:13:51,856 --> 00:13:55,339
And I think the answer to all four of those is no, because you're always looking to
precedent.
194
00:13:55,339 --> 00:13:56,580
You're always looking to statutes.
195
00:13:56,580 --> 00:13:59,441
You're always looking to something that is in the data set.
196
00:13:59,522 --> 00:14:01,453
So if it is in the data set,
197
00:14:01,777 --> 00:14:08,954
Really, all of our reasoning that is legal is backward looking, not forward looking like
in mathematics or in science.
198
00:14:08,954 --> 00:14:10,616
It is all backward looking.
199
00:14:10,616 --> 00:14:18,113
So if it's all backward looking, is all legal reasoning really just recombining the data set
that we have?
200
00:14:18,648 --> 00:14:23,852
Well, what about novel pieces of regulation that now have to be interpreted?
201
00:14:23,852 --> 00:14:33,260
Is there not new legal thinking that has to take place to evaluate the applicability in
those scenarios?
202
00:14:34,008 --> 00:14:39,633
There is, but I would say that the data is taken care of through what's called
interpolation.
203
00:14:39,633 --> 00:14:45,028
And so with the large language models, they connect concepts.
204
00:14:45,028 --> 00:14:48,172
I'm going to share my screen on this.
205
00:14:48,172 --> 00:14:48,964
is it possible?
206
00:14:48,964 --> 00:14:49,233
It's cool.
207
00:14:49,233 --> 00:14:57,831
So I'm going to pull up a PowerPoint to actually demonstrate a real live case that I had.
208
00:14:58,077 --> 00:15:05,372
So for the less sophisticated, and maybe the more sophisticated, we'll recap how large language
models work, which is that they pull out concepts.
209
00:15:05,573 --> 00:15:08,734
And they pull out concepts and put them into what's called vector space.
210
00:15:08,955 --> 00:15:19,963
And so you can imagine a two-dimensional vector space that the ideas of a faucet and a
sink and a vanity are probably pretty close together in that two-dimensional vector space.
211
00:15:19,963 --> 00:15:24,306
And then you could be able to say, OK, let's go ahead and put that in three-dimensional
vector space with a z-axis.
212
00:15:24,306 --> 00:15:25,903
And then you could be able to say, OK, these
213
00:15:25,903 --> 00:15:29,085
All similar things are kind of clustered together as ideas.
214
00:15:29,085 --> 00:15:34,048
And now add a fourth dimension, and our brains can't even figure out what that fourth
dimension would look like.
215
00:15:34,048 --> 00:15:36,169
Now add a 10th dimension.
216
00:15:36,169 --> 00:15:37,910
Now add a 100th dimension.
217
00:15:37,910 --> 00:15:41,831
Now add a 1,000th dimension and add a 12,000th dimension.
218
00:15:42,412 --> 00:15:45,634
And 12,000 dimensional vector space is where large language models live.
219
00:15:45,634 --> 00:15:55,123
And somewhere in that 12,000 dimensional vector space lives Ernest Hemingwayness and Bob
Dylanness and Pablo Picassoness.
220
00:15:55,123 --> 00:15:57,803
that lives in 12,000 dimensional vector space.
221
00:15:57,803 --> 00:16:03,803
So all of the things that are legal concepts live somewhere in that 12,000 dimensional
vector space.
222
00:16:03,803 --> 00:16:09,163
And all the facts in the world live somewhere in 12,000 dimensional vector space.
223
00:16:09,163 --> 00:16:16,683
And so what you can imagine, to your question, is: isn't it going to combine some novel
things?
224
00:16:16,683 --> 00:16:19,103
I would say, yes, it will combine them.
225
00:16:19,103 --> 00:16:22,831
But the thing is, how many of those things are
226
00:16:22,831 --> 00:16:25,592
already in the large language model's vector space?
227
00:16:25,592 --> 00:16:33,994
And then combining those is what's called, the data scientists would say, connecting the
latent space between those two disparate concepts.
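A toy sketch of the vector-space idea described here, assuming nothing beyond the explanation above: the four-dimensional vectors below are invented for illustration, and the embed step they stand in for would be a real embedding model with thousands of dimensions. Nearby concepts (faucet, sink) score high on cosine similarity, while "connecting the latent space" is the move of combining two concepts that sit far apart.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 4-dimensional stand-ins for real embeddings.
toy_vectors = {
    "faucet":              [0.9, 0.1, 0.0, 0.1],
    "sink":                [0.8, 0.2, 0.1, 0.1],
    "privacy law":         [0.1, 0.9, 0.3, 0.0],
    "emotion recognition": [0.0, 0.7, 0.8, 0.1],
}

# Close concepts cluster; distant ones can still be combined.
print(cosine(toy_vectors["faucet"], toy_vectors["sink"]))
print(cosine(toy_vectors["privacy law"], toy_vectors["emotion recognition"]))
```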
228
00:16:33,994 --> 00:16:40,276
So now, as I'm sharing my screen, this is the concept to think through.
229
00:16:40,451 --> 00:16:42,877
A friend of mine works for an insurance company.
230
00:16:42,877 --> 00:16:48,991
And she asked, what do you think about this thing called affective computing?
231
00:16:48,991 --> 00:16:51,139
What do you think of affective computing?
232
00:16:51,247 --> 00:16:55,411
And I said, I'm a pretty technical guy, but I'm sad to say I don't know what affective
computing is.
233
00:16:55,411 --> 00:17:01,815
So what I did is I went to the large language model and said, define affective computing
in the context of insurance and the law.
234
00:17:02,236 --> 00:17:05,739
And she's an insurance in-house lawyer.
235
00:17:05,739 --> 00:17:13,466
So it says, well, affective computing is how computers recognize human emotions and facial
expressions and voice patterns to create emotionally aware agents.
236
00:17:13,466 --> 00:17:14,226
I said, cool.
237
00:17:14,226 --> 00:17:20,621
Now analyze how affective computing can be used in an insurance call center, because
that's how my friend's company was thinking about using it.
238
00:17:20,959 --> 00:17:28,185
It said, well, you could use it for emotional recognition, figuring out the caller's
emotional state, figuring out their choice of words, how quickly they're speaking, how
239
00:17:28,185 --> 00:17:38,604
emotional they are after an accident or loss. I said, cool, now give me a list of potential
legal issues that could stem from using affective computing in a call center. And it said,
240
00:17:38,604 --> 00:17:48,802
have you thought about privacy law, like GDPR or CCPA? Have you thought about
consent, and whether that caller consented to you analyzing their emotions? Have you
241
00:17:48,802 --> 00:17:50,203
thought about if you get hacked?
242
00:17:50,203 --> 00:17:54,796
What if all of your client's emotional data is in the hands of a hacker?
243
00:17:54,796 --> 00:17:55,987
What's that going to do legally?
244
00:17:55,987 --> 00:17:57,748
What's that going to do with PR?
245
00:17:57,748 --> 00:17:59,870
These are all good legal concepts.
246
00:17:59,870 --> 00:18:07,655
And I would guess that zero times has anyone ever asked about the legal aspects of
affective computing.
247
00:18:07,675 --> 00:18:15,140
But what it's done is it knows what affective computing is, and it knows what privacy law
is, it knows what consent is, it knows what data security is.
248
00:18:15,140 --> 00:18:19,409
So it's connecting the latent space between the concept of affective computing
249
00:18:19,409 --> 00:18:21,280
and the concept of privacy law.
250
00:18:21,280 --> 00:18:23,681
And I then said, give me some sub-bullets.
251
00:18:23,681 --> 00:18:28,604
And now it's going to continue expanding upon the concepts of which jurisdictions people
are calling in from.
252
00:18:28,604 --> 00:18:29,525
What types of data?
253
00:18:29,525 --> 00:18:30,395
Third party sharing.
254
00:18:30,395 --> 00:18:32,086
Are you minimizing the data?
255
00:18:32,126 --> 00:18:35,628
Each one of these things lives somewhere in vector space.
256
00:18:35,628 --> 00:18:43,152
So it's merely combining the concept of affective computing with the concepts of privacy law and
consent and data security.
257
00:18:43,152 --> 00:18:48,705
That way we can then combine those aspects in new ways that haven't been in the training
set.
258
00:18:48,829 --> 00:18:50,059
So I think that's where it is.
259
00:18:50,059 --> 00:18:58,343
Almost everything that we do as lawyers, everything we do, is connecting my
client's facts to the existing laws.
260
00:18:58,503 --> 00:19:01,724
And your client's facts are almost certainly in the training set.
261
00:19:01,724 --> 00:19:09,648
And the existing laws, if you are training on actual non-hallucinated cases, statutes, and
regulations, those are also in the training set.
262
00:19:09,648 --> 00:19:18,329
So really, reasoning is just being able to connect those existing facts in the data set
with the existing laws in the data set and saying how they relate to each other.
263
00:19:18,329 --> 00:19:21,945
if you have the actual non-hallucinated cases, statutes, and regulations.
264
00:19:22,604 --> 00:19:23,615
That's super interesting.
265
00:19:23,615 --> 00:19:30,978
So I find it, I have to think through this, but it seems shocking to me that there are no
novel concepts.
266
00:19:30,978 --> 00:19:39,113
Um, what you've just described is combining two things that currently exist in the
training material, right?
267
00:19:39,113 --> 00:19:51,509
That the LLM has vectorized and plotted in 12,000 dimensions, and it knows the
associations and the latent space between them.
268
00:19:51,830 --> 00:19:52,588
But
269
00:19:52,588 --> 00:20:08,665
What about new areas of law like when we start selling real estate on the moon, that
obviously at some point will make its way in, but until it does, how will it navigate
270
00:20:08,665 --> 00:20:10,366
scenarios like that?
271
00:20:10,579 --> 00:20:13,270
So I guess the question is where do those areas of law come from?
272
00:20:13,270 --> 00:20:14,660
And they come from regulations.
273
00:20:14,660 --> 00:20:15,861
They come from statutes.
274
00:20:15,861 --> 00:20:17,321
They come from cases.
275
00:20:17,701 --> 00:20:21,262
And those cases, statutes, and regulations are reflected in documents.
276
00:20:21,402 --> 00:20:30,765
And if the system has those documents, the cases, the statutes, and the regulations, then
the system will be able to plot those in vector space and then be able to take those legal
277
00:20:30,765 --> 00:20:35,446
concepts and apply them to the factual concepts that are also in vector space.
278
00:20:35,446 --> 00:20:39,067
So really, every single area of law is written somewhere.
279
00:20:39,067 --> 00:20:40,990
It has to be, otherwise it's not a law.
280
00:20:40,990 --> 00:20:43,073
And if it's written, it can be vectorized.
281
00:20:43,073 --> 00:20:45,286
So really everything that we do is part of the training set.
282
00:20:45,286 --> 00:20:53,837
There is really no novelty that is needed in the law because everything is necessarily
backward looking at the cases, the statutes, the regulations that are binding.
283
00:20:54,316 --> 00:20:54,987
Interesting.
284
00:20:54,987 --> 00:21:00,852
You had a metaphor I had not heard before with anesthesia.
285
00:21:00,852 --> 00:21:07,478
And I think you had a friend who was an anesthesiologist.
286
00:21:07,478 --> 00:21:08,058
Yes.
287
00:21:08,058 --> 00:21:10,170
And I have trouble saying that word.
288
00:21:10,170 --> 00:21:13,683
So I'll just say anesthesiology.
289
00:21:13,683 --> 00:21:17,376
Explain that, because I thought that was an interesting metaphor.
290
00:21:17,511 --> 00:21:19,001
Yeah, she told me something.
291
00:21:19,001 --> 00:21:26,744
We were over a campfire and it freaked me out, and it may freak out your listeners. But
yeah, she said, Damien.
292
00:21:26,744 --> 00:21:28,124
Do you realize we have no idea?
293
00:21:28,124 --> 00:21:29,435
She's a nurse anesthetist, right?
294
00:21:29,435 --> 00:21:39,037
So she puts people under every single day, and she has, I think, a master's degree in
anesthesiology. So she said, do you realize we have no idea how anesthesia works?
295
00:21:39,298 --> 00:21:40,978
I said, wait, say that again.
296
00:21:40,978 --> 00:21:44,143
She said, yeah, one of two options. Option number one
297
00:21:44,143 --> 00:21:51,830
is it does what everybody thinks that it does, is that it puts us to sleep and we don't
feel that scalpel going into our bellies and then we come out and we're all fine, right?
298
00:21:51,830 --> 00:21:53,271
That's option number one.
299
00:21:53,271 --> 00:21:57,134
Option number two is we feel every single cut.
300
00:21:57,234 --> 00:22:01,317
And what anesthesia does is to give us amnesia to make us forget.
301
00:22:01,698 --> 00:22:04,381
We don't know whether it's option one or option two.
302
00:22:04,381 --> 00:22:07,503
That scares the crap out of me and it might well scare the crap out of you.
303
00:22:07,503 --> 00:22:12,187
But the question is, do we not use anesthesia because we don't know how it works?
304
00:22:12,891 --> 00:22:20,977
No, of course we use anesthesia because the real question is does it work and is it
effective as to what we would like it to do?
305
00:22:20,977 --> 00:22:28,281
If the answer to both those things is yes, then how it works maybe matters less than the
fact that it does work.
306
00:22:28,502 --> 00:22:32,124
So apply that anesthesia test to reasoning.
307
00:22:32,525 --> 00:22:41,661
And just like I can't tell whether you're reasoning in Ted's brain or not, I can gauge you
by your output, by your speech,
308
00:22:41,661 --> 00:22:44,133
by your words coming out of your keyboard.
309
00:22:44,193 --> 00:22:47,856
And if that works, I say you're reasoning.
310
00:22:48,597 --> 00:22:51,000
whether I know how your brain works doesn't matter.
311
00:22:51,000 --> 00:22:54,302
And whether I know how anesthesia works doesn't matter.
312
00:22:54,423 --> 00:22:58,106
I'm sorry, whether I know how anesthesia works doesn't matter.
313
00:22:58,106 --> 00:23:00,008
The fact that it does work matters.
314
00:23:00,008 --> 00:23:07,934
So the fact is that a large language model does create output that seems like it is
reasonable and is reasoning, just like a human is reasoning.
315
00:23:08,763 --> 00:23:19,705
If the large language model output is indistinguishable from Ted's output as
reasonable, then I would say whether it is actual reasoning and how it's reasoning doesn't
316
00:23:19,705 --> 00:23:23,929
really matter, any more than it matters whether we know how anesthesia works.
317
00:23:24,322 --> 00:23:26,784
Yeah, that is disturbing to think about.
318
00:23:27,705 --> 00:23:31,109
But it's a valuable metaphor.
319
00:23:31,109 --> 00:23:33,791
Now here's what I would say in response to that.
320
00:23:33,791 --> 00:23:40,577
Did you have a chance to look at the Apple Intelligence team's study with the GSM 8K?
321
00:23:41,688 --> 00:23:43,371
Only in the two minutes before you sent it.
322
00:23:43,371 --> 00:23:45,395
So why don't you describe it and maybe I can react to it.
323
00:23:45,395 --> 00:23:45,825
Yeah.
324
00:23:45,825 --> 00:24:00,567
So, um, it's only five weeks old, so it's very new, but one benchmark that has been
used pretty widely to test reasoning in large language models is the GSM-8K, which
325
00:24:00,567 --> 00:24:06,011
stands for grade school math; the 8K because there's 8,000 of these questions.
326
00:24:06,252 --> 00:24:14,198
And what Apple did was modify these questions ever so slightly
327
00:24:14,198 --> 00:24:19,000
and evaluate the LLMs' performance against those modifications.
328
00:24:19,000 --> 00:24:20,780
And it was pretty dramatic.
329
00:24:20,880 --> 00:24:34,704
So their conclusions were, and I quote, the performance of all models declines when only the
numerical values in the question are altered in the GSM-Symbolic benchmark.
330
00:24:34,885 --> 00:24:37,705
That's pretty interesting.
331
00:24:38,326 --> 00:24:39,202
It says,
332
00:24:39,202 --> 00:24:45,283
their performance significantly deteriorates as the number of clauses in the question
increases.
333
00:24:45,604 --> 00:24:54,726
And then its conclusion is we hypothesize that this decline is due to the fact that
current LLMs are not capable of genuine logical reasoning.
334
00:24:55,205 --> 00:25:06,349
And I thought there were a few examples in this specifically that really, I guess, were
telling.
335
00:25:06,389 --> 00:25:08,736
So let me see if I can find this here.
336
00:25:08,736 --> 00:25:23,130
So, um, one of these, uh, these are word problems and in one of the word problems, they,
I'm not going to be able to find it, but I remember enough about it to, um, articulate it.
337
00:25:23,170 --> 00:25:31,473
What they did was, in the problem, they threw in a sentence that had nothing to do with the
problem itself, and it completely blew up the problem.
338
00:25:31,473 --> 00:25:36,302
Um, the sentence that they put in there, the question, was something like,
339
00:25:36,302 --> 00:25:53,402
You know, if the current prices of keyboards and mouse pads are $5 and $10 respectively, and
inflation has increased by 10% each year, and that was the part that had nothing to do with it.
340
00:25:53,402 --> 00:25:55,592
Tell us what the current price is, right?
341
00:25:55,592 --> 00:25:57,352
It's already given you the information.
342
00:25:57,352 --> 00:26:01,302
The fact that inflation increased 10% has nothing to do with it.
343
00:26:01,302 --> 00:26:05,362
And it plummeted the
344
00:26:05,390 --> 00:26:10,092
accuracy of the large language models' responses by something like 65%.
345
00:26:10,092 --> 00:26:14,253
It varied wildly as you would expect.
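A rough sketch of the kind of perturbation being described, under stated assumptions: the template, the names and prices, and the ask_model call are all invented here, with ask_model standing in for whatever LLM call you would actually use. The idea is simply to regenerate the same word problem with different names and numbers, optionally appending an irrelevant clause, and check whether the model's answer survives.

```python
import random

TEMPLATE = ("{name} buys {n_keyboards} keyboards at ${kb_price} each and "
            "{n_pads} mouse pads at ${pad_price} each. {distractor} "
            "How much does {name} spend in total?")

def make_variant(seed):
    # Build one perturbed question plus its gold answer.
    rng = random.Random(seed)
    n_kb, n_pad = rng.randint(2, 9), rng.randint(2, 9)
    kb, pad = rng.choice([5, 6, 7]), rng.choice([10, 12, 15])
    question = TEMPLATE.format(
        name=rng.choice(["Sophie", "Lisa", "Omar"]),   # name swap
        n_keyboards=n_kb, kb_price=kb,
        n_pads=n_pad, pad_price=pad,
        distractor=rng.choice(["", "Inflation has been 10% per year."]),  # irrelevant clause
    )
    gold = n_kb * kb + n_pad * pad  # the distractor never changes the answer
    return question, gold

def ask_model(question):
    # Placeholder for a real LLM call.
    raise NotImplementedError

def accuracy(n_variants=100):
    correct = 0
    for seed in range(n_variants):
        q, gold = make_variant(seed)
        if ask_model(q) == gold:
            correct += 1
    return correct / n_variants
```

A robust solver should score the same with or without the distractor; a large gap is the kind of drop the study reports.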
346
00:26:15,614 --> 00:26:30,560
The latest models, the ones using chain of thought, did the best, but it seemed to me that
this really pokes a hole in the whole concept.
347
00:26:30,560 --> 00:26:35,412
Because what that points to is, if you throw in a sentence that has nothing to do
with the problem,
348
00:26:35,486 --> 00:26:38,390
and I can't handle it, that means I haven't comprehended the problem.
349
00:26:38,390 --> 00:26:40,934
I don't know what the problem is, right?
350
00:26:40,934 --> 00:26:48,694
I'm simply reciting answers, and you know, it's what I honestly would expect, but I
don't know.
351
00:26:48,694 --> 00:26:50,366
What is your response to that?
352
00:26:50,483 --> 00:26:59,883
Yeah, so I would say two responses, one of which is the idea that mathematics has a right
answer and a wrong answer, whereas legal often does not.
353
00:26:59,883 --> 00:27:05,943
That is, in litigation, it's whatever argument happens to win, and in transactional work,
it's whatever gets the deal done.
354
00:27:05,943 --> 00:27:14,623
So, where in the mathematical proof, you have a right answer or a wrong answer, whereas in
legal, there is the eye of the beholder, where there is no objective, there's merely the
355
00:27:14,623 --> 00:27:15,163
subjective.
356
00:27:15,163 --> 00:27:16,863
So that's thing number one.
357
00:27:16,863 --> 00:27:19,275
Thing number two is, of course,
358
00:27:19,275 --> 00:27:29,501
With mathematics you want to be able to create new mathematics and be able to go forward
with new scenarios. But again, law never has. It's always looking backward to precedent,
359
00:27:29,501 --> 00:27:38,016
looking backward to cases, looking backward to the contracts, like we've always done the
contract in this way. And we know that in this industry and this jurisdiction, force majeure
360
00:27:38,016 --> 00:27:48,281
clauses need to be in this way. This is always backward looking. So really, two things:
non-objectivity in the law where there is objectivity in math,
361
00:27:48,281 --> 00:27:52,613
and backward looking in the law rather than forward looking with mathematics.
362
00:27:52,754 --> 00:28:01,139
So yes, it'll throw off the mathematics by throwing in that inflationary detail, and it
won't really reason in that way.
363
00:28:01,179 --> 00:28:11,036
But I think for our use cases in the law, whether it's a transactional use case, a
litigation use case, an advisory use case or regulatory use case, all of the stuff is
364
00:28:11,036 --> 00:28:11,806
there.
365
00:28:11,806 --> 00:28:17,843
And if we use the chain of thought like you've talked about, then it could probably
overcome the lack of true
366
00:28:17,843 --> 00:28:19,784
quote unquote reasoning that we have.
367
00:28:19,784 --> 00:28:23,767
And we as humans are really good at separating wheat from chaff.
368
00:28:23,767 --> 00:28:30,691
And so you can imagine, you know, scenario one is everybody takes the robot's output and
doesn't touch it.
369
00:28:30,892 --> 00:28:33,634
That's a bad scenario under anybody's estimation.
370
00:28:33,634 --> 00:28:39,838
But almost everybody's in scenario two where it gives an output and then you look over
that output and get it out the door.
371
00:28:39,838 --> 00:28:43,040
Under scenario two, you're going to separate that wheat from the chaff.
372
00:28:43,080 --> 00:28:47,357
And so until we have autonomous legal bots, which
373
00:28:47,357 --> 00:28:49,691
God help us if we have that, right?
374
00:28:49,733 --> 00:28:52,811
But until we have that, you're always gonna have that human oversight.
375
00:28:52,811 --> 00:28:57,291
So really, whether it's reasoning or not, is gonna be pretty easily flagged.
376
00:28:57,528 --> 00:28:58,199
Yeah.
377
00:28:58,199 --> 00:29:02,102
And they, they, it wasn't just, um, there were other ways that they tested it.
378
00:29:02,102 --> 00:29:04,344
They actually changed some of the numbers.
379
00:29:04,344 --> 00:29:06,525
What was interesting is that that also threw it off.
380
00:29:06,525 --> 00:29:07,967
And this part surprised me.
381
00:29:07,967 --> 00:29:11,039
I thought AI would, I thought LLMs would figure this out.
382
00:29:11,039 --> 00:29:12,711
They changed the names.
383
00:29:12,711 --> 00:29:15,873
So instead of Sophie, they put Lisa, right?
384
00:29:15,873 --> 00:29:17,655
But they did it consistently throughout.
385
00:29:17,655 --> 00:29:21,858
Like it should be able to, so anyway, it's a new study.
386
00:29:21,858 --> 00:29:26,968
There's still a lot to be analyzed
387
00:29:26,968 --> 00:29:29,559
from it, but I did think it was interesting.
388
00:29:30,480 --> 00:29:37,805
Speaking of studies, the Stanford study, there's been a lot of conversation about it.
389
00:29:37,986 --> 00:29:43,068
The second iteration of that came out in May.
390
00:29:43,068 --> 00:29:56,738
you know, there was a, obviously there's companies out there that put a lot of money and
effort into these tools and Stanford was pretty pointed in their, in their commentary and
391
00:29:56,770 --> 00:30:01,013
You know, there was a lot of feedback that the study was biased.
392
00:30:01,013 --> 00:30:03,054
I read it multiple times.
393
00:30:03,054 --> 00:30:06,857
It's about 30 pages and it's a really, it's an easy read.
394
00:30:06,857 --> 00:30:09,549
Like reading scientific papers is usually rough going.
395
00:30:09,549 --> 00:30:11,660
That one was really easy to read.
396
00:30:11,680 --> 00:30:15,523
And I thought, I didn't see the bias.
397
00:30:15,523 --> 00:30:19,125
They did try and trick the tools, and it was upfront about that.
398
00:30:19,125 --> 00:30:23,178
Just, just like the, Apple study tried to trick AI, right?
399
00:30:23,178 --> 00:30:25,846
That's kind of part of testing is, you know,
400
00:30:25,846 --> 00:30:30,392
evaluating, you're going to throw curveballs and see how the model responds.
401
00:30:30,392 --> 00:30:32,595
But, you know, what was your take on the study?
402
00:30:32,595 --> 00:30:37,792
Did you feel there were biases or did you think it was fair?
403
00:30:38,411 --> 00:30:49,384
Two thoughts on that and not to throw shade on the Stanford folks, you can imagine that
one issue I have with them is that the terminology that they used for hallucinations, I
404
00:30:49,384 --> 00:30:53,516
think they conflated hallucinations with just getting the wrong legal answer.
405
00:30:53,516 --> 00:30:54,906
Those are two different things, right?
406
00:30:54,906 --> 00:31:06,131
There is a hallucination, where it just makes things up, and then there is where Ted
and I disagree as to where the law ends up, and number two is not hallucination.
407
00:31:06,131 --> 00:31:08,331
That is just us disagreeing.
408
00:31:08,331 --> 00:31:10,671
again, with the law, there may not be a right answer.
409
00:31:10,671 --> 00:31:16,261
And the reason there is litigation is because the reasonable minds can disagree as to what
is the right answer or not the right answer.
410
00:31:16,261 --> 00:31:18,951
So a court has to be able to resolve that dispute.
411
00:31:19,431 --> 00:31:24,071
a disagreement as to the output is not hallucination.
412
00:31:24,071 --> 00:31:32,711
So number one, the quibble I had is with the terminology, that they call everything
hallucination, where really we should focus that on the confabulations that the large language
413
00:31:32,711 --> 00:31:33,271
models do.
414
00:31:33,271 --> 00:31:34,771
That's thing number one.
415
00:31:34,771 --> 00:31:38,730
Thing number two goes to trying to trick the model in the ways that you talked about.
416
00:31:38,730 --> 00:31:41,101
And this goes to the product side of me.
417
00:31:41,101 --> 00:31:42,051
I'm a product guy.
418
00:31:42,051 --> 00:31:43,410
You're a product guy.
419
00:31:43,951 --> 00:31:48,011
We, as product people, say, what are the most common user pathways?
420
00:31:48,011 --> 00:31:49,731
What are the most common user behaviors?
421
00:31:49,731 --> 00:31:53,611
And we want to be able to build products that are based on those most common user
behaviors.
422
00:31:53,611 --> 00:32:01,467
And going back to my x and y-axis, this is the head and the long tail, where you have the
most common things done are the head.
423
00:32:01,467 --> 00:32:06,710
And the weirdest, strangest things that you would never think a user would ever do is in
the long tail.
424
00:32:06,911 --> 00:32:15,717
And so the things that they were asking were things like, when Justice Ruth Bader Ginsburg
dissented in this case, what does that mean?
425
00:32:15,717 --> 00:32:24,234
Where a user would never ask that, because that user would know that Ruth Bader Ginsburg
didn't dissent in that case.
426
00:32:24,234 --> 00:32:26,365
She was the concurrence in that case.
427
00:32:26,365 --> 00:32:29,143
So asking a question like that is
428
00:32:29,143 --> 00:32:32,245
way, way down on the long tail distribution curve.
429
00:32:32,245 --> 00:32:34,347
That is not the most common use case.
430
00:32:34,347 --> 00:32:42,793
So really, if they were to do the study correctly, they would take, they would say, what
are the most common questions made by lawyers?
431
00:32:42,793 --> 00:32:50,848
The most common questions made by law students, the most common questions, and then
collect those most common questions, randomly distribute those most common questions, and
432
00:32:50,848 --> 00:32:55,742
then say, based on those most common questions, or I guess not even most common, they
would take the entire distribution curve.
433
00:32:55,742 --> 00:32:57,723
They would take the head and the tail.
434
00:32:57,723 --> 00:32:59,404
Mix that up in a randomized study.
435
00:32:59,404 --> 00:33:03,907
So there will be some long tail questions, some head questions.
436
00:33:03,907 --> 00:33:10,072
And then from that random distribution, then run those through and see how many
confabulations slash hallucinations are there.
437
00:33:10,072 --> 00:33:12,013
That would be a reasonable way to do it.
438
00:33:12,013 --> 00:33:15,315
That would be most aligned with how users use the tools.
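A minimal sketch of the sampling approach being proposed, assuming you had a log of real queries and their frequencies (the log below is invented for illustration): sample evaluation questions in proportion to how often users actually ask them, so the head dominates but the long tail still shows up occasionally.

```python
import random

# Hypothetical query log: question -> how many times real users asked it.
query_log = {
    "What is the statute of limitations for breach of contract?": 500,
    "What are the elements of negligence?": 350,
    "Summarize the holding of this case.": 200,
    "Did Justice Ginsburg dissent in this case? (trick question)": 1,  # long tail
}

def sample_eval_set(log, k, seed=0):
    # Weighted sampling: head questions appear often, tail questions rarely.
    rng = random.Random(seed)
    questions = list(log.keys())
    weights = list(log.values())
    return rng.choices(questions, weights=weights, k=k)

eval_set = sample_eval_set(query_log, k=20)
```

The resulting set mirrors the real distribution of use, which is the point of the critique: evaluate on what users actually ask, head and tail mixed in their true proportions.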
439
00:33:15,315 --> 00:33:25,803
So by asking a long tail question that arguably would zero times out of 1,000 ever be
asked, zero times out of 1 million ever be asked, nobody would ever ask a question to
440
00:33:25,803 --> 00:33:26,703
trick it.
441
00:33:26,811 --> 00:33:32,754
I would say that's really not doing what the system should be doing for its purpose.
442
00:33:32,754 --> 00:33:43,088
The purpose of legal research is to answer legal questions and it's not to gauge the
reasoning of the underlying large language models, GPT-4 or GPT-3.5 or whatever was in
443
00:33:43,088 --> 00:33:44,339
place at the time.
444
00:33:44,339 --> 00:33:46,920
And what they were doing was tricking GPT-3.5.
445
00:33:46,920 --> 00:33:50,861
They weren't tricking the legal research tools that were relying on GPT-3.5.
446
00:33:50,961 --> 00:33:55,097
So I would say that you should test the thing that is the product,
447
00:33:55,097 --> 00:33:58,605
not the underlying model that the product is using.
448
00:33:59,032 --> 00:34:08,042
So how is that different than throwing in a sentence about inflation being 10 % in the GSM
8K scenario?
449
00:34:08,711 --> 00:34:15,776
So I guess if we're looking at a legal product, like are we testing reasoning or are we
testing how well the product works?
450
00:34:15,776 --> 00:34:17,597
Because those are two different goals.
451
00:34:17,597 --> 00:34:22,440
Because really testing reasoning is testing the foundational model, GPT-4, GPT-3.5.
452
00:34:22,440 --> 00:34:31,386
But if you're testing how well the product works for its intended purpose, then the
question then would be, would the user input usually include that inflationary number?
453
00:34:31,446 --> 00:34:37,950
If the answer is yes, the user input would include that, then yes, we should definitely
include that in the distribution of the user input.
454
00:34:38,988 --> 00:34:46,838
But if zero times out of 100 million would they include that inflationary number, then that
doesn't seem right, because if you're really testing the product, that is not a use case
455
00:34:46,838 --> 00:34:48,581
that the users would ever use.
456
00:34:48,728 --> 00:34:50,409
Yeah, I guess, yeah.
457
00:34:50,409 --> 00:34:51,659
OK, fair.
458
00:34:51,659 --> 00:35:00,383
So in section 6.2 of the study, and I do agree, they say hallucinations can be insidious.
459
00:35:00,423 --> 00:35:10,907
And then some of the scenarios that they document don't seem like hallucination scenarios,
like misunderstanding holdings.
460
00:35:10,907 --> 00:35:17,690
Systems do not seem capable of consistently making out the holding of a case,
distinguishing between legal actors.
461
00:35:17,822 --> 00:35:28,089
systems fail to distinguish between arguments made by litigants and statements made by the
court, respecting order of authority, models strain in grasping hierarchies of legal
462
00:35:28,089 --> 00:35:28,550
authority.
463
00:35:28,550 --> 00:35:30,461
Yeah, those aren't hallucinations.
464
00:35:30,461 --> 00:35:34,764
Those are just limitations of the model itself, it seems.
465
00:35:34,827 --> 00:35:39,469
That's right, limitations of the model and perhaps limitations of the system that is using
that model.
466
00:35:39,469 --> 00:35:49,352
So you could imagine that if the system were to say as part of the metadata that a trial
court is below this particular appellate court, which is below this particular Supreme
467
00:35:49,352 --> 00:36:01,055
Court, and models that hierarchy of courts in a symbolic way, not a large language model
way, but in a symbolic coded up way, then that system could avoid the,
468
00:36:01,139 --> 00:36:10,279
confabulation between the district court and the appellate court level because the
guardrails of the symbolic AI would prevent that kind of misunderstanding.
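A toy illustration of the symbolic guardrail idea just described, not any product's actual implementation: encode the court hierarchy as plain data and check a generated claim against it, rather than trusting the model's text. The court names and the checker are illustrative only.

```python
# Map each court to the court directly above it; None marks the top.
HIERARCHY = {
    "U.S. District Court (D. Minn.)": "U.S. Court of Appeals (8th Cir.)",
    "U.S. Court of Appeals (8th Cir.)": "Supreme Court of the United States",
    "Supreme Court of the United States": None,
}

def is_above(higher, lower):
    """Return True if `higher` sits somewhere above `lower` in the chain."""
    court = HIERARCHY.get(lower)
    while court is not None:
        if court == higher:
            return True
        court = HIERARCHY.get(court)
    return False

# A generated sentence claiming the district court reviews the circuit court
# would fail this check and could be flagged before it reaches the user.
assert is_above("Supreme Court of the United States",
                "U.S. District Court (D. Minn.)")
assert not is_above("U.S. District Court (D. Minn.)",
                    "U.S. Court of Appeals (8th Cir.)")
```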
469
00:36:10,919 --> 00:36:21,399
So is Stanford analyzing the large language model output or are they analyzing the
system's coding, that is the hard coding, to be able to say that this trial court is below
470
00:36:21,399 --> 00:36:23,659
the appellate court, which is below the Supreme Court?
471
00:36:23,659 --> 00:36:30,773
I think that that is maybe a reasonable critique that if the system
472
00:36:30,963 --> 00:36:34,764
is not recognizing that hierarchy, then maybe the system should.
473
00:36:34,784 --> 00:36:40,026
So I would say that's maybe a reasonable critique if you're really looking at that.
474
00:36:40,666 --> 00:36:42,327
So yeah, maybe two scenarios.
475
00:36:42,327 --> 00:36:52,500
The Ruth Bader Ginsburg question is unreasonable, but knowing whether the district court was
overruled by the appellate court, which was then reversed by the Supreme Court, a legal
476
00:36:52,500 --> 00:36:56,611
system that is a legal research system should know those things in a symbolic AI way.
477
00:36:56,728 --> 00:36:57,910
Yeah, exactly.
478
00:36:57,910 --> 00:37:00,053
And the last category was fabrications.
479
00:37:00,053 --> 00:37:02,065
that is a hallucination.
480
00:37:03,689 --> 00:37:10,348
So what does this study mean for the future of AI and legal research?
481
00:37:11,085 --> 00:37:12,416
The Stanford study?
482
00:37:13,217 --> 00:37:22,305
I would say that that Stanford study is out of one side of Stanford,
and then there's another side of Stanford called the CodeX.
483
00:37:22,305 --> 00:37:29,751
Megan Ma helps lead that CodeX, and you probably know Megan, and Megan's one of the
smartest minds right now in legal AI.
484
00:37:29,751 --> 00:37:39,119
She's running another study that is comparing human generated output with humans plus
machine generated output.
485
00:37:39,119 --> 00:37:45,102
and doing a double blind study to see what the large law firm partners prefer.
486
00:37:45,102 --> 00:37:50,904
The partners don't know which is human created versus human plus machine, and she's going to
be doing this.
487
00:37:50,904 --> 00:37:57,127
So that seems like a reasonable way because that is really taking actual use cases.
488
00:37:57,127 --> 00:38:06,661
So she's taking actual contractual questions or actual litigation questions and being able
to actually take those common use cases, the head in the distribution curve, not the long
489
00:38:06,661 --> 00:38:07,847
tail, but the head.
490
00:38:07,847 --> 00:38:15,609
And then asking how legal tools that are built for this actually perform on these
more likely legal tasks.
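As a rough illustration of the blinding step in that kind of study, here is a small Python sketch; the sample data, field names, and two-option format are hypothetical, and it simplifies the actual protocol to a single blinded preference vote per question.

import random

# Hypothetical pair of answers to the same legal question.
sample = {
    "question": "Draft an indemnification clause for a services agreement.",
    "human": "...clause drafted by an associate...",
    "human_plus_machine": "...LLM draft reviewed and edited by an associate...",
}

def blind_pair(item):
    """Strip the labels and shuffle the order so the reviewer cannot tell which is which."""
    labeled = [("human", item["human"]),
               ("human_plus_machine", item["human_plus_machine"])]
    random.shuffle(labeled)
    shown = {f"Option {i + 1}": text for i, (_, text) in enumerate(labeled)}
    key = {f"Option {i + 1}": label for i, (label, _) in enumerate(labeled)}
    return shown, key  # `shown` goes to the reviewing partner; `key` stays with the researchers

shown, key = blind_pair(sample)
# A partner who prefers "Option 1" is really voting for whichever source key["Option 1"] names:
print(key["Option 1"])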
491
00:38:15,609 --> 00:38:22,311
So I would say that the Stanford study is a bright shining light as to the way things
should be done.
492
00:38:22,311 --> 00:38:25,212
The other Stanford study, Megan Ma's Stanford study.
493
00:38:25,212 --> 00:38:29,413
Second thing is that similar studies are being done by Dan Schwartz out of Minnesota.
494
00:38:29,413 --> 00:38:35,655
Dan, you might have seen two of his other studies, one of which was this:
495
00:38:36,871 --> 00:38:44,315
they interspersed large language model-created essays with human-created essays, and
in a double-blind setup the professors graded both.
496
00:38:44,315 --> 00:38:46,657
And so that was his study number one.
497
00:38:46,657 --> 00:38:52,280
Study number two was, I forget what study number two was, but it was in a similar vein.
498
00:38:52,280 --> 00:39:01,746
But then study number three is doing kind of the same thing that Megan is doing, but with
a different twist on it, to do a double- or triple-blind study of human-created
499
00:39:01,746 --> 00:39:04,529
things along with machine-created things.
500
00:39:04,529 --> 00:39:08,282
And mixing them up and having human evaluators see what they prefer.
501
00:39:08,282 --> 00:39:10,403
So that's evaluation number two.
502
00:39:10,403 --> 00:39:11,984
That is a bright shining light.
503
00:39:11,984 --> 00:39:14,686
Evaluation number three is Legal Technology Hub.
504
00:39:14,686 --> 00:39:26,114
Nikki Shaver and her team are working with vals.ai on a similar study involving
Harvey, Thomson Reuters, LexisNexis, and us at vLex, where it is a similar kind of John Henry
505
00:39:26,114 --> 00:39:28,497
kind of test to evaluate the outputs.
506
00:39:28,497 --> 00:39:35,456
So I would say that the old Stanford study is old news and is probably chasing the wrong
things for the reason we've just discussed.
507
00:39:35,456 --> 00:39:43,937
Whereas the new Stanford study and the Minnesota-Michigan study and the Legal Technology
Hub vals.ai study, those are going to give us some hope going forward.
508
00:39:44,162 --> 00:39:44,602
Interesting.
509
00:39:44,602 --> 00:39:45,082
It's funny.
510
00:39:45,082 --> 00:39:49,987
We're in a space that moves so fast that May is old news and it's November.
511
00:39:49,987 --> 00:39:52,269
But yeah, I don't disagree.
512
00:39:53,490 --> 00:40:04,540
So the Gartner hype curve does an absolutely phenomenal job, in my opinion, of mapping out the
trajectory of new technologies in many cases.
513
00:40:04,540 --> 00:40:09,284
And I think it's really playing out interestingly in AI right now.
514
00:40:09,284 --> 00:40:11,726
So the Goldman study
515
00:40:11,726 --> 00:40:17,466
came out saying that 44% of legal tasks could be automated by gen AI, which freaked everybody
out.
516
00:40:17,466 --> 00:40:21,566
I think that number was very aspirational.
517
00:40:24,286 --> 00:40:38,638
I think I might have heard you talk about how when GPT-3.5 took the bar, it scored in the
mid-60s, and GPT-4 scored 90-plus.
518
00:40:38,638 --> 00:40:41,878
That number has since been revised down significantly.
519
00:40:41,878 --> 00:40:43,958
So I've heard, which is interesting.
520
00:40:43,958 --> 00:41:01,698
Um, there's a Wharton survey out that I saw in Peter Duffy's newsletter that surveyed
inside counsel and showed that only 28% of inside counsel anticipate a high impact of gen
521
00:41:01,698 --> 00:41:02,898
AI in their role.
522
00:41:02,898 --> 00:41:08,230
I found that super interesting, and 25% anticipate a low impact.
523
00:41:08,460 --> 00:41:10,441
Which again, that's kind of mind blowing.
524
00:41:10,441 --> 00:41:12,895
But where do you think we are on this hype curve?
525
00:41:12,895 --> 00:41:20,032
Do you feel like we're in the trough of disillusionment or are we still, do we still have
further to go?
526
00:41:20,377 --> 00:41:22,588
I think we're going up the slope actually.
527
00:41:22,588 --> 00:41:28,129
I just gave a talk with Pablo Arredondo, the Casetext co-founder.
528
00:41:28,189 --> 00:41:36,011
He and I gave a presentation, and he spent a lot of his part of the presentation walking us
through the hype cycle and through the trough.
529
00:41:36,011 --> 00:41:41,733
And he thinks that law firms and others are doing the hard yards of going up the slope
slowly but surely.
530
00:41:41,733 --> 00:41:45,034
And I think that he's probably right.
531
00:41:45,034 --> 00:41:48,375
As to a couple of the things that you mentioned, you know, the
532
00:41:49,223 --> 00:41:59,090
bar exam: of course, my friend Pablo was actually one of the guys who did that bar
exam paper; Mike Bommarito and Dan Katz were the other co-authors, and so
533
00:41:59,090 --> 00:42:08,436
in that paper that they wrote, they actually put in the caveat saying that, one,
those results are never publicly announced.
534
00:42:08,436 --> 00:42:18,155
So they're kind of doing replicas of the bar exam. So this is always, you know, until the
multistate bar exam publishes its numbers, of course,
535
00:42:18,155 --> 00:42:20,317
there's no definitive objective number.
536
00:42:20,317 --> 00:42:27,042
It's all largely kind of a statistical likelihood rather than a definitive
objective number.
537
00:42:27,042 --> 00:42:27,932
That's thing number one.
538
00:42:27,932 --> 00:42:38,920
Thing number two, they had also put in footnote caveats saying that this was
during COVID times and, you know, of course there are fewer takers during
539
00:42:38,920 --> 00:42:46,045
COVID times and maybe those COVID people were less likely to do well and maybe, you know,
there's all sorts of, you know, scientifically
540
00:42:46,045 --> 00:42:48,908
kind of nebulous things that make up that number.
541
00:42:48,908 --> 00:42:53,682
But anyway, so they put that 90% number with all of those caveats in the initial paper.
542
00:42:53,682 --> 00:42:57,400
So the subsequent papers that say, no, it's way lower than 90%.
543
00:42:57,400 --> 00:42:58,216
Like, come on.
544
00:42:58,216 --> 00:43:00,278
Like, they put that in the footnotes.
545
00:43:00,278 --> 00:43:01,870
So that's that.
546
00:43:01,870 --> 00:43:08,155
And then as to the other thing, saying that only 28% of inside counsel anticipate a high
impact on their role.
547
00:43:09,737 --> 00:43:11,978
I've heard that, but there's also
548
00:43:12,093 --> 00:43:22,727
a lot of studies saying that 80% of inside counsel expect their external
counsel's bills to be reduced because of large language models.
549
00:43:22,848 --> 00:43:30,331
So even though 28% of them think that it's going to impact their role, 80% think it's going
to impact external counsel's role.
550
00:43:30,331 --> 00:43:33,472
So that is an expectation from the buy side, the client side.
551
00:43:33,492 --> 00:43:40,935
And another thing is that the Clio Cloud Conference announced their Clio survey, where
they survey their users, and
552
00:43:41,011 --> 00:43:44,771
their distribution curve is mostly solo and small up to midsize law firms.
553
00:43:44,771 --> 00:43:46,581
And they did a survey last year.
554
00:43:46,581 --> 00:43:50,651
During that survey, they asked:
555
00:43:50,651 --> 00:43:53,451
How many of you are using large language models for legal work?
556
00:43:53,451 --> 00:43:56,331
And the answer in 2023 was about 25%.
557
00:43:56,331 --> 00:43:58,271
They asked the same question in 2024.
558
00:43:58,271 --> 00:44:01,371
And the answer jumped to about 80%.
559
00:44:01,371 --> 00:44:07,471
That is, 80% of solo and small up to midsize Clio users are using AI for legal work.
560
00:44:07,471 --> 00:44:10,803
That's a dramatic jump from 25% to 80%.
561
00:44:10,803 --> 00:44:16,803
And so that shows me that the future is already here, it's just not evenly distributed.
562
00:44:16,803 --> 00:44:25,023
That is, solos and smalls are using this, they're already on the slope of enlightenment, and
they're already using it for real use cases, where the big law folks maybe aren't telling
563
00:44:25,023 --> 00:44:26,493
anybody that they're using it.
564
00:44:26,493 --> 00:44:34,603
And maybe the associates in big law, where their law firms prohibit them from using
it, use shadow IT: they use it on their personal devices, and they're not
565
00:44:34,603 --> 00:44:35,979
telling anybody about it.
566
00:44:36,014 --> 00:44:48,034
Yeah, you know, all these numbers that are flying around, I don't know if you saw the ILTA
Tech Survey, which found that 74% of law firms with more than 700 attorneys are using gen AI in
567
00:44:48,034 --> 00:44:49,834
business use cases.
568
00:44:50,594 --> 00:44:53,354
That seems very aspirational to me.
569
00:44:53,354 --> 00:44:56,184
And I had Steve Embry on the podcast a while back.
570
00:44:56,184 --> 00:45:02,114
He wrote an article, this was before the Tech Survey came out, called Mind the Gap.
571
00:45:02,114 --> 00:45:04,522
And the gap he was talking about is
572
00:45:04,522 --> 00:45:15,731
between surveys like this that report gen AI usage and the anecdotal observations of people
like him and me who work with law firms all day long and just don't see it.
573
00:45:16,292 --> 00:45:20,376
So I think a lot of these numbers are conflicting, aspirational.
574
00:45:20,376 --> 00:45:25,580
Maybe you have a lawyer who Googles what ChatGPT is and he can check the box.
575
00:45:25,580 --> 00:45:28,062
He or she can check the box, but I don't know.
576
00:45:28,183 --> 00:45:28,904
That's right.
577
00:45:28,904 --> 00:45:35,827
I would say that, yeah, every survey can have holes poked in it based on the way it's asked.
578
00:45:35,827 --> 00:45:45,713
Because you can imagine if the way it's asked is, have you used large language models in
your practice, if I used it for one thing in the full year, then I could answer yes to
579
00:45:45,713 --> 00:45:46,433
that.
580
00:45:46,433 --> 00:45:54,958
But really, if the question were instead, what percentage of your work
involved large language models, that number would be totally different, right?
581
00:45:54,958 --> 00:45:56,779
And that number would be way lower.
582
00:45:56,871 --> 00:46:02,288
Have you ever used it in the past year for one thing, versus what percentage of your work
have you used it for?
583
00:46:02,288 --> 00:46:05,176
Those are two very different questions that will give very different answers.
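To make that difference concrete, here is a tiny Python sketch with made-up numbers; the usage shares are purely hypothetical and only illustrate how the two question framings can diverge.

# Illustrative only: hypothetical fraction of each lawyer's work that involved an LLM last year.
usage_share = [0.02, 0.00, 0.10, 0.01, 0.00, 0.05, 0.30, 0.00, 0.03, 0.01]

# Framing 1: "Have you used an LLM in your practice this year?" (yes if used even once)
ever_used = sum(share > 0 for share in usage_share) / len(usage_share)

# Framing 2: "What percentage of your work involved an LLM?"
average_share = sum(usage_share) / len(usage_share)

print(f"'Ever used' rate: {ever_used:.0%}")            # 70%
print(f"Average share of work: {average_share:.0%}")   # 5%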
584
00:46:05,176 --> 00:46:06,907
Yeah, agreed.
585
00:46:06,947 --> 00:46:09,607
Well, this has been a super fun conversation.
586
00:46:09,607 --> 00:46:12,086
I really appreciate you taking a few minutes with me.
587
00:46:12,086 --> 00:46:17,993
I think the LLM reasoning conversation is really just beginning.
588
00:46:17,993 --> 00:46:21,055
Do you, do you know Yann LeCun from
589
00:46:21,055 --> 00:46:23,853
I do, yeah, he's one of the smartest guys around.
590
00:46:23,884 --> 00:46:24,995
Yeah, from Meta.
591
00:46:24,995 --> 00:46:31,957
So I, again, this is more conflicting information that we as individuals have to make
sense of.
592
00:46:31,957 --> 00:46:40,040
He talked about how, uh, currently a house cat is smarter than large language models,
which I thought was interesting.
593
00:46:40,040 --> 00:46:46,614
And then I heard another, there was a talk at a local EO, um, Entrepreneurs'
Organization function here in St.
594
00:46:46,614 --> 00:46:48,525
Louis last night.
595
00:46:48,525 --> 00:46:52,446
And I got the notes from it and it said that current
596
00:46:52,838 --> 00:47:06,422
large language models are operating at, I think the number was, the equivalent of an IQ of
100, and that in the next year and a half, they will operate at an IQ of 1,000.
597
00:47:06,422 --> 00:47:16,795
And, I don't even, those numbers don't make sense to me, but you know, when I hear
Yann say that it's dumber than a house cat.
598
00:47:16,795 --> 00:47:20,618
And then I hear that we're operating today at IQ 100.
599
00:47:20,618 --> 00:47:22,246
There's lots of
600
00:47:22,508 --> 00:47:24,701
You know, there's lots of things to make sense of.
601
00:47:24,701 --> 00:47:25,803
Um, I don't know.
602
00:47:25,803 --> 00:47:27,901
What is your take on that before we wrap up?
603
00:47:27,901 --> 00:47:34,955
Yeah, I really like and respect Yann, and I think that he's right that if we want to have
robots, they need to understand the world.
604
00:47:34,955 --> 00:47:45,000
So when he talks about it being as dumb as a house cat, he's talking about the idea
that if you put a ball into a cup and then you flip the cup upside down, what is going to
605
00:47:45,000 --> 00:47:46,500
happen to that ball?
606
00:47:46,560 --> 00:47:49,862
The large language model should know that the ball should fall out of the cup, right?
607
00:47:49,862 --> 00:47:52,343
But large language models often get that wrong.
608
00:47:52,363 --> 00:47:57,477
So if we want robots to be able to figure out how the world works, we definitely need that
kind of spatial reasoning.
609
00:47:57,477 --> 00:47:59,369
And that's what he's talking about when he says dumber than a house cat.
610
00:47:59,369 --> 00:48:01,971
House cats know that the ball falls out of the cup.
611
00:48:02,111 --> 00:48:10,378
But what Yann isn't saying is there are use cases like the law, where we don't have to deal
with cups turning over and balls.
612
00:48:10,458 --> 00:48:17,163
Every single thing that a lawyer does, every single task, whether you're a litigator or
transactional lawyer, every single task is based on words.
613
00:48:17,404 --> 00:48:21,551
We ingest words, we analyze words, and we output words.
614
00:48:21,551 --> 00:48:23,132
We don't deal with the physical world.
615
00:48:23,132 --> 00:48:26,674
We are merely word-based in every single task that we do.
616
00:48:26,674 --> 00:48:30,237
So, set that aside; a cat doesn't know words.
617
00:48:30,237 --> 00:48:32,958
We don't need to know whether a ball falls out of a cup.
618
00:48:32,958 --> 00:48:35,060
All we need to know is how words work.
619
00:48:35,060 --> 00:48:40,983
And I would say for this use case, the legal use case, Yann's criticisms are maybe
inapplicable.
620
00:48:41,038 --> 00:48:43,218
Yeah, yeah, and you're right.
621
00:48:43,218 --> 00:48:44,288
That's what he was getting at.
622
00:48:44,288 --> 00:48:52,978
You know, house cats can plan and anticipate and they have spatial awareness that large
language models don't.
623
00:48:52,978 --> 00:48:55,328
Well, this has been a lot of fun. Before we wrap up:
624
00:48:55,328 --> 00:48:58,268
How do folks find out more about what you do?
625
00:48:58,268 --> 00:49:01,058
Your work with SALI and vLex?
626
00:49:01,058 --> 00:49:03,475
How do people find out more about that?
627
00:49:03,475 --> 00:49:04,675
Yeah, the best place is on LinkedIn.
628
00:49:04,675 --> 00:49:05,805
I hang out there a lot.
629
00:49:05,805 --> 00:49:08,596
It's Damien Riehl, and you have it in the show notes.
630
00:49:08,596 --> 00:49:09,367
Awesome.
631
00:49:09,367 --> 00:49:10,728
Well, good stuff.
632
00:49:11,070 --> 00:49:14,011
Hopefully, are you going to be in Miami for TLTF?
633
00:49:14,011 --> 00:49:15,543
I will see you at TLTF.
634
00:49:15,543 --> 00:49:17,576
That's one of my favorite conferences.
635
00:49:17,576 --> 00:49:18,922
Yeah, looking forward to seeing you there.
636
00:49:18,922 --> 00:49:19,453
Absolutely.
637
00:49:19,453 --> 00:49:22,287
We'll be on stage pitching on Wednesday afternoon.
638
00:49:22,768 --> 00:49:23,270
Good.
639
00:49:23,270 --> 00:49:24,852
We'll see you in Miami.
640
00:49:25,274 --> 00:49:25,995
All right.
641
00:49:25,995 --> 00:49:26,836
Thanks, Damien.
642
00:49:26,836 --> 00:49:27,858
Take care.
00:00:04,179
Damien, how are you this afternoon?
2
00:00:04,179 --> 00:00:04,761
Couldn't be better.
3
00:00:04,761 --> 00:00:05,475
Life is really good.
4
00:00:05,475 --> 00:00:06,326
How are you Ted?
5
00:00:06,326 --> 00:00:07,196
I'm doing great.
6
00:00:07,196 --> 00:00:07,926
I'm doing great.
7
00:00:07,926 --> 00:00:10,786
I appreciate you joining me this afternoon.
8
00:00:10,786 --> 00:00:23,341
We were kicking around a really interesting topic via LinkedIn and I figured, you know
what, this is, I I've been overdue to have you on the podcast anyway.
9
00:00:23,341 --> 00:00:27,472
So I figured this is a good opportunity to, uh, to riff a little bit.
10
00:00:27,472 --> 00:00:31,533
Um, but before we do, let's, let's get you introduced.
11
00:00:31,533 --> 00:00:35,104
So I went and looked at your, your LinkedIn profile.
12
00:00:35,104 --> 00:00:36,294
Interestingly,
13
00:00:36,344 --> 00:00:44,477
I didn't realize you started your legal career as a clerk and you started practicing in
the early two thousands.
14
00:00:44,477 --> 00:00:46,717
You worked for TR and fast case.
15
00:00:46,717 --> 00:00:49,098
That's now VLex, right?
16
00:00:49,098 --> 00:00:54,420
And, um, you're still at VLex and I know you do a lot of work through with Sally.
17
00:00:54,420 --> 00:00:56,010
That's how you and I actually first met.
18
00:00:56,010 --> 00:01:01,352
But, um, why don't you tell us a little bit about who you are, what you do and where you
do it.
19
00:01:01,363 --> 00:01:08,423
Sure, I've been a lawyer since 2002, I clerked for chief judges at the state appellate
court and the federal district court.
20
00:01:08,423 --> 00:01:16,583
Then I worked for a big law firm, Robbins Kaplan, where I represented Best Buy and much of
their commercial litigation, represented victims of Bernie Madoff, helped sue JPMorgan
21
00:01:16,583 --> 00:01:18,053
over the mortgage-backed security crisis.
22
00:01:18,053 --> 00:01:24,633
So I have a pretty long time, some would say too long as a litigator, but then I've also
been a coder since 85.
23
00:01:24,633 --> 00:01:29,275
So I have the law plus technology background, and anyone who works with me will tell you
that
24
00:01:29,275 --> 00:01:31,315
I am probably the worst coder you've ever met.
25
00:01:31,315 --> 00:01:38,640
I say I'm a coder not as a badge of honor, but a shroud of shame where I'm not very good
at coding at all.
26
00:01:38,640 --> 00:01:42,523
But with large language models, one can be actually better at coding than one actually is.
27
00:01:42,523 --> 00:01:49,267
So after litigating for a bunch of years, I joined TR, building a big thing for them, did
cybersecurity for a while.
28
00:01:49,267 --> 00:01:57,171
But since 2019, I've been working with Fastcase, which is now VLex, essentially playing in
a playground of a billion legal documents.
29
00:01:57,171 --> 00:02:05,374
cases, statutes, regulations, motions, briefs, pleadings, extracting what matters from
them using Sally tags and otherwise, and then running large language models across those
30
00:02:05,374 --> 00:02:06,105
things.
31
00:02:06,294 --> 00:02:06,925
Interesting.
32
00:02:06,925 --> 00:02:13,946
And is that how your involvement in Sally came to be was the work that you're doing at
Vlex?
33
00:02:14,259 --> 00:02:18,949
It actually came to be that I met Toby Brown who founded Sally in 2017.
34
00:02:18,949 --> 00:02:23,839
I met him at Ilticon and we just happened to sit in the same breakfast table.
35
00:02:23,839 --> 00:02:28,279
And I'd known of Toby but had not actually met Toby before.
36
00:02:28,279 --> 00:02:34,629
But then we started talking a bit about Sally and he said, I said, you haven't really
chased any litigation things.
37
00:02:34,629 --> 00:02:36,099
He said, no, we haven't.
38
00:02:36,099 --> 00:02:36,689
said, why not?
39
00:02:36,689 --> 00:02:38,079
I said, would you like some help on that?
40
00:02:38,079 --> 00:02:39,475
And he's like, well, it's too hard.
41
00:02:39,475 --> 00:02:40,215
Do you want to do it?
42
00:02:40,215 --> 00:02:41,675
And I said, yeah, I totally want to do it.
43
00:02:41,675 --> 00:02:47,327
So we met in 2019, August of 2019, and I've been working on Sally ever since.
44
00:02:47,382 --> 00:02:48,123
Interesting.
45
00:02:48,123 --> 00:02:49,999
And what were you coding in 85?
46
00:02:49,999 --> 00:02:52,125
I've been, I started coding in like 82.
47
00:02:52,125 --> 00:02:53,651
What were you coding basic?
48
00:02:53,651 --> 00:02:56,591
I was coding basic in my Commodore 128.
49
00:02:56,911 --> 00:03:02,731
I didn't start with the Commodore 64 because I wanted to get the thing that had 128K of
RAM instead of 64K of RAM.
50
00:03:02,731 --> 00:03:03,931
So I was coding basic.
51
00:03:03,931 --> 00:03:11,111
was getting magazines where I would take the magazine on paper and I would recode the code
in the magazine and then tried to tweak the code.
52
00:03:11,111 --> 00:03:13,902
So yeah, I was a very nerdy 10-year-old.
53
00:03:13,902 --> 00:03:14,382
Yeah.
54
00:03:14,382 --> 00:03:15,742
So it's funny.
55
00:03:15,742 --> 00:03:17,502
A lot of parallels there.
56
00:03:17,502 --> 00:03:24,722
Um, I started off with a Commodore 32, so I had one fourth of the memory that you did.
57
00:03:24,722 --> 00:03:30,342
And you know, I used to have to, uh, store my programs on audio cassette.
58
00:03:30,342 --> 00:03:40,322
This is before I could afford a floppy and you know, um, gosh, so this would have been,
yeah, probably 82, 83.
59
00:03:40,322 --> 00:03:42,252
Then I graduated to a
60
00:03:42,252 --> 00:03:48,557
I had a TI-994A with a extended basic cartridge and a book about that thick.
61
00:03:48,557 --> 00:03:54,132
And I literally read every page of it to understand all the new commands.
62
00:03:54,132 --> 00:03:58,285
I totally geeked out on it and then was totally into it.
63
00:03:58,285 --> 00:04:04,461
And then during middle school, you know, the girls didn't think it was cool to be a
computer programmer.
64
00:04:04,461 --> 00:04:07,834
So I kind of ditched it for a while until college.
65
00:04:07,834 --> 00:04:11,006
So I had a break in there, but
66
00:04:11,022 --> 00:04:17,502
Then when I picked up computers again, it would have been early nineties, like 91 ish.
67
00:04:17,502 --> 00:04:26,822
And by then it was a visual basic, you know, doing native windows development like VB
four.
68
00:04:27,161 --> 00:04:28,582
God, I can't remember.
69
00:04:28,582 --> 00:04:32,642
I think it was visual interdev and used to compile windows programs.
70
00:04:32,642 --> 00:04:33,652
did a lot of SQL.
71
00:04:33,652 --> 00:04:38,862
I was actually on the SQL team at Microsoft in late nineties, early 2000.
72
00:04:38,862 --> 00:04:39,675
So
73
00:04:39,675 --> 00:04:42,300
I can still hold my own on SQL, I'm like you, man.
74
00:04:42,300 --> 00:04:47,990
If I had to code an app, I'd be so lost right now.
75
00:04:48,307 --> 00:04:53,051
True, really query how important that is these days to be really a hardcore coder.
76
00:04:53,051 --> 00:05:01,677
I know people that are really good hardcore coders that use things like cursor and use
large language models to be able to be a bicycle for the mind, like Steve Jobs would say,
77
00:05:01,677 --> 00:05:03,618
and make them go better, faster, and stronger.
78
00:05:03,618 --> 00:05:14,486
But even for people that are rusty or really awful, like you and me, it's still, I can't
go 10 times as fast as a normal coder can with a large language model, but I can maybe do
79
00:05:14,486 --> 00:05:16,231
1x what they used to be able to do.
80
00:05:16,231 --> 00:05:16,382
Right.
81
00:05:16,382 --> 00:05:20,440
There's, there's really, um, it really evens the playing field on what is possible.
82
00:05:20,440 --> 00:05:21,350
Yeah.
83
00:05:21,430 --> 00:05:33,095
Well, you and I were riffing on a topic that I think is super interesting and I was kind
of surprised to hear your perspective on it and I thought it was really interesting and we
84
00:05:33,095 --> 00:05:39,618
were talking about the question on whether or not LLMs can reason.
85
00:05:39,618 --> 00:05:45,340
I've always, you know, understanding the architecture, I've always just had the default
assumption.
86
00:05:45,340 --> 00:05:49,688
That's kind of where I started my position on this with
87
00:05:49,688 --> 00:05:53,771
There's no way they can just based on, on the architecture, right?
88
00:05:53,771 --> 00:05:55,322
It predicts the next token.
89
00:05:55,322 --> 00:06:00,046
It has no concept of, um, comprehension.
90
00:06:00,046 --> 00:06:05,809
Therefore reasoning seems to be far out of reach, but it does create the illusion of
reasoning.
91
00:06:05,809 --> 00:06:09,012
And you had an interesting argument, which was, does it matter?
92
00:06:09,012 --> 00:06:17,138
Um, so, I mean, let's start with do LLM's reason or create the illusion of reasoning.
93
00:06:17,631 --> 00:06:19,872
And yes, let's talk about that.
94
00:06:19,872 --> 00:06:25,133
I think a good precursor to that question is are LLMs conscious or are they not conscious?
95
00:06:25,133 --> 00:06:28,914
And that's another kind of academic exercise question that people have been thinking
about.
96
00:06:28,914 --> 00:06:31,675
You know, it gives the illusion of consciousness, right?
97
00:06:31,675 --> 00:06:35,606
And so, but of course, large language models, in my opinion, are not conscious, right?
98
00:06:35,606 --> 00:06:38,057
Because they are just mimicking consciousness.
99
00:06:38,217 --> 00:06:44,599
But their philosophers for millennia have been saying consciousness is undefinable.
100
00:06:44,753 --> 00:06:48,245
Like, the only thing I can be conscious of is I know that I am conscious.
101
00:06:48,245 --> 00:06:53,468
But whether you are conscious or not or just a figment of my imagination is something I
will never, know.
102
00:06:53,568 --> 00:06:56,590
All I know is that my own consciousness is a thing.
103
00:06:56,590 --> 00:07:05,555
So I think the question of whether large language models are conscious or not is kind of
just an academic exercise that really doesn't matter, right?
104
00:07:05,676 --> 00:07:11,098
So any more than I know whether Ted is conscious or not, that we is a f-
105
00:07:11,609 --> 00:07:14,681
Science and we as philosophers have never defined consciousness.
106
00:07:14,681 --> 00:07:24,646
Therefore the debate about consciousness is just an academic exercise So let's now set
consciousness aside and now let's talk about reasoning and the real question is I when I'm
107
00:07:24,646 --> 00:07:35,332
speaking with you Ted I have no idea whether your brain is reasoning or not I've that's
because often we ourselves don't know how our brains are reasoning or not I'm all the only
108
00:07:35,332 --> 00:07:40,729
way I can tell whether Ted is reasoning or not is through the words that come out of Ted's
mouths
109
00:07:40,729 --> 00:07:44,381
or the words that come out of Ted's keyboard as Ted is typing.
110
00:07:44,381 --> 00:07:51,925
And if those words look like reasoning, and if they quack like reasoning, then I could be
able to say Ted is probably reasoning.
111
00:07:51,985 --> 00:07:55,427
So maybe shouldn't we judge large language models in the same way?
112
00:07:55,427 --> 00:08:00,950
That if the output of the large language models looks like reasoning and quacks like
reasoning, then maybe it's reasoning.
113
00:08:00,950 --> 00:08:06,653
And that's what large language models, machine learning scientists, data scientists, call
that the duck test.
114
00:08:06,733 --> 00:08:10,269
That is, they know what goes into the black box.
115
00:08:10,269 --> 00:08:15,244
They have no idea what happens inside the black box and they know what comes out of the
black box.
116
00:08:15,244 --> 00:08:24,753
But if the output looks like reasoning and quacks like reasoning, maybe whether the black
box is reasoning or not matters not, just like it doesn't matter if I know how you are
117
00:08:24,753 --> 00:08:25,854
reasoning in your brain.
118
00:08:25,854 --> 00:08:27,695
All I know is your output too.
119
00:08:28,098 --> 00:08:29,200
Interesting.
120
00:08:29,584 --> 00:08:31,540
Can we test for reasoning?
121
00:08:32,177 --> 00:08:34,028
Yes, I think we can.
122
00:08:35,009 --> 00:08:38,691
the question is, what are the tasks that you're testing on?
123
00:08:38,712 --> 00:08:41,614
There are objective tasks, mathematical tasks.
124
00:08:41,614 --> 00:08:43,876
So you can imagine a mathematical proof.
125
00:08:43,876 --> 00:08:47,919
You could be able to test whether it's making its way through the mathematical proof or
not.
126
00:08:47,919 --> 00:08:50,761
You can test whether that is reasoning or not reasoning.
127
00:08:50,761 --> 00:08:51,962
Same with science.
128
00:08:51,962 --> 00:08:53,133
Is it getting science correct?
129
00:08:53,133 --> 00:08:54,994
Is it doing the scientific method correctly?
130
00:08:54,994 --> 00:08:55,985
Is it reasoning?
131
00:08:55,985 --> 00:08:59,237
Is it providing true causation rather than being a correlation?
132
00:08:59,237 --> 00:09:02,569
I think those are objective truths that you could be able to see reasoning.
133
00:09:02,569 --> 00:09:06,831
And I would say that the outputs for law are much, much different than that.
134
00:09:06,971 --> 00:09:12,735
That is, whether I made a good argument or not in front of this court is not objective.
135
00:09:12,735 --> 00:09:14,055
That is subjective.
136
00:09:14,055 --> 00:09:22,600
So I can't do a proof as to validity or invalidity any more than you could do a proof as
to lawyer one made a better argument than lawyer two.
137
00:09:22,600 --> 00:09:28,499
Ask 10 lawyers and you might get a 50-50 split on whether lawyer one made a better
argument or lawyer two made a better argument.
138
00:09:28,499 --> 00:09:36,899
going over to the transactional side, the contractual side, lawyer one might love this
clause, but lawyer two says that's the worst clause on the planet.
139
00:09:36,899 --> 00:09:40,759
There's no objective standard as to what is good legal work.
140
00:09:40,759 --> 00:09:49,999
And absent any objective standard as to good legal work, maybe what is good legal
reasoning is in the eye of the beholder, much like beauty is in the eye of the beholder.
141
00:09:49,999 --> 00:09:57,201
That with absent any objective way to be able to say this was good legal reasoning or bad
legal reasoning.
142
00:09:57,201 --> 00:10:05,774
I guess the question of whether a large language model is providing good legal reasoning
or bad legal reasoning is unanswerable in the same way to say whether that human is doing
143
00:10:05,774 --> 00:10:07,995
good legal reasoning or bad legal reasoning.
144
00:10:07,995 --> 00:10:15,717
So I think this whole debate about reasoning or not reasoning is academic at best because
we should judge it by its outputs.
145
00:10:15,717 --> 00:10:22,479
And different lawyers will judge the outputs differently with humans, and they'll judge it
differently with large language models.
146
00:10:22,990 --> 00:10:23,190
Okay.
147
00:10:23,190 --> 00:10:34,290
I think that's true to an extent, but let's say I come in as a, as an attorney and to make
my closing arguments, I see, I sing the theme song to Gilligan's Island.
148
00:10:34,290 --> 00:10:43,330
Um, I think that would universally, um, be graded as that's a bad, this is bad legal
reasoning, right?
149
00:10:43,330 --> 00:10:53,062
So, so there is a spectrum and you know, obviously that's an extreme case, but I think
extreme cases are good to evaluate whether or not something's true.
150
00:10:53,106 --> 00:11:05,180
And, so yeah, mean, if something is just universally looked at every attorney reasonable
person that would evaluate it says, it says it's bad.
151
00:11:05,321 --> 00:11:09,701
Is that, does that monkey wrench what, what you're putting forward there?
152
00:11:09,701 --> 00:11:10,406
No.
153
00:11:10,407 --> 00:11:11,097
Yeah, that's right.
154
00:11:11,097 --> 00:11:16,971
So you're right that it is a spectrum, that you have the worst argument on the planet,
which is just gibberish.
155
00:11:16,971 --> 00:11:21,914
And then there's the best argument on the planet that is going to win 100 out of 100
times.
156
00:11:21,914 --> 00:11:23,374
And same thing with contracts.
157
00:11:23,374 --> 00:11:26,366
There's the contract that's going to get the deal done 100 out of 100 times.
158
00:11:26,366 --> 00:11:29,518
And there's the contract that is going to fail, 100.
159
00:11:29,518 --> 00:11:32,720
So everything is along that spectrum.
160
00:11:32,720 --> 00:11:39,121
And then if you add a y-axis to that spectrum, there is a most common thing, that is the
head.
161
00:11:39,121 --> 00:11:42,573
And then there's a long tail of rare things that happen.
162
00:11:42,574 --> 00:11:47,717
So if you think about what the large language models are doing is largely giving you the
head distribution.
163
00:11:47,717 --> 00:11:53,261
That is the most common things because it's giving you a compressed version of the
training data set.
164
00:11:53,261 --> 00:11:57,384
so the head is almost never going to be Gilligan's Island.
165
00:11:57,504 --> 00:12:01,928
And the head is almost never going to be some of the worst contractual arguments ever
made.
166
00:12:01,928 --> 00:12:04,430
It's going to fall on the average on that side.
167
00:12:04,430 --> 00:12:06,749
And that actually is probably
168
00:12:06,749 --> 00:12:09,931
the right thing to do for the large language model in the legal task.
169
00:12:09,931 --> 00:12:17,605
Because you want the average, because you want 100 out of 100 lawyers, you want most of
the lawyers to say that's probably right.
170
00:12:17,725 --> 00:12:20,226
And that is the average distribution of this.
171
00:12:20,346 --> 00:12:30,242
And so really then, if we then say the x-axis and the y-axis and you have the head, the
most common things, and then you have the long tail, and you now say, OK, the large
172
00:12:30,242 --> 00:12:35,865
language models are going to take the head, not the long tail, then you have to say, OK,
what is that head?
173
00:12:35,865 --> 00:12:36,567
Is that
174
00:12:36,567 --> 00:12:39,288
Does it require legal reasoning or not?
175
00:12:39,508 --> 00:12:44,819
So let's take about let's talk about mathematics and science We want to find new science,
right?
176
00:12:44,819 --> 00:12:48,200
We want to be able to create new cures to cancer, right?
177
00:12:48,200 --> 00:12:54,052
And we want to be able to do things that have never done been done before So does the
large language model need reasoning for that?
178
00:12:54,052 --> 00:12:57,023
Absolutely, because that's not part of the training to set right?
179
00:12:57,023 --> 00:13:04,195
That's not part of something that we can look backward at so we need reasoning for new
science We need reasoning for new mathematics
180
00:13:04,195 --> 00:13:12,061
You need reasoning for something that's never been done before that you need somebody like
Einstein or somebody to somebody who is once in a generation to be able to go forward and
181
00:13:12,061 --> 00:13:13,342
leap forward.
182
00:13:13,342 --> 00:13:15,083
Contrast that with the law.
183
00:13:15,664 --> 00:13:19,707
How much new thinking do we really need to do in the law?
184
00:13:19,828 --> 00:13:24,531
In contrast, how much of the law is looking backward that is looking to precedent?
185
00:13:24,612 --> 00:13:32,938
If I am a lawyer arguing in court, if I say, I've got judge, I've got this really brand
new idea that nobody's ever won on before, but I just sprouted out of my brain.
186
00:13:32,938 --> 00:13:33,693
What do you think?
187
00:13:33,693 --> 00:13:35,364
The judge is going to say, me a case.
188
00:13:35,364 --> 00:13:41,809
And if I can't show him a case, if I can't show her a statute, I lose because it's not
based on precedent.
189
00:13:42,009 --> 00:13:45,212
So do we really need new things in litigation?
190
00:13:45,212 --> 00:13:47,693
Do we really need new things in transactional work?
191
00:13:47,693 --> 00:13:49,875
Do we really need new things in advisory work?
192
00:13:49,875 --> 00:13:51,856
Do we need new things in regulatory work?
193
00:13:51,856 --> 00:13:55,339
And I think the answer to all four of those is no, because you're always looking to
precedent.
194
00:13:55,339 --> 00:13:56,580
You're always looking to statutes.
195
00:13:56,580 --> 00:13:59,441
You're always looking to something that is in the data set.
196
00:13:59,522 --> 00:14:01,453
So if it is in the data set,
197
00:14:01,777 --> 00:14:08,954
Really, all of our reasoning that is legal is backward looking, not forward looking like
in mathematics or in science.
198
00:14:08,954 --> 00:14:10,616
It is all backward looking.
199
00:14:10,616 --> 00:14:18,113
So if it's all backward looking, is every legal reasoning really recombining the data set
that we're having?
200
00:14:18,648 --> 00:14:23,852
Well, what about novel pieces of regulation that now have to be interpreted?
201
00:14:23,852 --> 00:14:33,260
Is there not new legal thinking that has to take place to evaluate the applicability in
those scenarios?
202
00:14:34,008 --> 00:14:39,633
There is, but I would say that the data is taken care of through what's called
interpolation.
203
00:14:39,633 --> 00:14:45,028
And so with the large language models, they connect concepts.
204
00:14:45,028 --> 00:14:48,172
I'm going to share my screen on this.
205
00:14:48,172 --> 00:14:48,964
is it possible?
206
00:14:48,964 --> 00:14:49,233
It's cool.
207
00:14:49,233 --> 00:14:57,831
So I'm going to pull up a PowerPoint to actually demonstrate a real live case that I had
where there's so.
208
00:14:58,077 --> 00:15:05,372
for the less sophisticated and maybe more sophisticated, we'll recap on how large language
model works, is that they pull out concepts.
209
00:15:05,573 --> 00:15:08,734
And they pull out concepts and put them into what's called vector space.
210
00:15:08,955 --> 00:15:19,963
And so you can imagine a two-dimensional vector space that the ideas of a faucet and a
sink and a vanity are probably pretty close together in that two-dimensional vector space.
211
00:15:19,963 --> 00:15:24,306
And then you could be able to say, OK, let's go ahead and put that in three-dimensional
vector space with a z-axis.
212
00:15:24,306 --> 00:15:25,903
And then you could be able to say, OK, these
213
00:15:25,903 --> 00:15:29,085
All similar things are kind of clustered together as ideas.
214
00:15:29,085 --> 00:15:34,048
And now add a fourth dimension, and our brains can't even figure out what that fourth
dimension would look like.
215
00:15:34,048 --> 00:15:36,169
Now add a 10th dimension.
216
00:15:36,169 --> 00:15:37,910
Now add a 100th dimension.
217
00:15:37,910 --> 00:15:41,831
Now add a 1,000th dimension and add a 12,000th dimension.
218
00:15:42,412 --> 00:15:45,634
And 12,000 dimensional vector space is where large language models live.
219
00:15:45,634 --> 00:15:55,123
And somewhere in that 12,000 dimensional vector space lives Ernest Hemingwayness and Bob
Dylanness and Pablo Picassoness.
220
00:15:55,123 --> 00:15:57,803
that lives in 12,000 dimensional vector space.
221
00:15:57,803 --> 00:16:03,803
So all of the things that are legal concepts live somewhere in that 12,000 dimensional
vector space.
222
00:16:03,803 --> 00:16:09,163
And all the facts in the world live somewhere in 12,000 dimensional vector space.
223
00:16:09,163 --> 00:16:16,683
And so what you can imagine is, to your question, is isn't going to combine some novel
things.
224
00:16:16,683 --> 00:16:19,103
I would say, yes, it will combine them.
225
00:16:19,103 --> 00:16:22,831
But the thing is, how many of those things are
226
00:16:22,831 --> 00:16:25,592
already in the large language models vector space.
227
00:16:25,592 --> 00:16:33,994
And then combining those is what's called, the data scientists would say, connecting the
latent space between those two disparate concepts.
228
00:16:33,994 --> 00:16:40,276
So now as I'm going to be sharing my screen, this concept is to think through.
229
00:16:40,451 --> 00:16:42,877
A friend of mine works for an insurance company.
230
00:16:42,877 --> 00:16:48,991
And she asked, what do you think of the thing about that called effective computing?
231
00:16:48,991 --> 00:16:51,139
What do you think of effective computing?
232
00:16:51,247 --> 00:16:55,411
And I said, I'm a pretty technical guy, but I'm sad to say I don't know what effective
computing is.
233
00:16:55,411 --> 00:17:01,815
So what I did is I went to the large language model and said, define effective computing
in the context of insurance and the law.
234
00:17:02,236 --> 00:17:05,739
And she's an insurance in-house lawyer.
235
00:17:05,739 --> 00:17:13,466
So I says, well, effective computing is how computers recognize human emotions and facial
expressions and voice patterns to create emotionally aware agents.
236
00:17:13,466 --> 00:17:14,226
I said, cool.
237
00:17:14,226 --> 00:17:20,621
Now analyze how effective computing can be used in an insurance call center, because
that's how my friend's company was thinking about using it.
238
00:17:20,959 --> 00:17:28,185
They said well you could use it for emotional recognition figuring out the caller's
emotional state figuring out their choice of words How quickly they're speaking how
239
00:17:28,185 --> 00:17:38,604
emotional they are after an accident or loss I said cool now give me a list of potential
legal issues that could stem from using effective computing in a call center and they said
240
00:17:38,604 --> 00:17:48,802
have you thought about privacy law like GDPR or yeah, or CCPA I've you thought about
consent and whether that that caller consented to you analyzing their emotions Have you
241
00:17:48,802 --> 00:17:50,203
thought about if you get hacked?
242
00:17:50,203 --> 00:17:54,796
What if all of your client's emotional data is in the hands of a hacker?
243
00:17:54,796 --> 00:17:55,987
What's that going to do legally?
244
00:17:55,987 --> 00:17:57,748
What's that going to do with PR?
245
00:17:57,748 --> 00:17:59,870
These are all good legal concepts.
246
00:17:59,870 --> 00:18:07,655
And I would guess that zero times has anyone ever asked about the legal aspects of
emotional, of effective computing.
247
00:18:07,675 --> 00:18:15,140
But what it's done is it knows what effective computing is, and it knows what privacy law
is, it knows what consent is, it knows what data security is.
248
00:18:15,140 --> 00:18:19,409
So it's connecting the latent space between the concept of effective computing
249
00:18:19,409 --> 00:18:21,280
and the concept of privacy law.
250
00:18:21,280 --> 00:18:23,681
And it then says, give me some sub bullets.
251
00:18:23,681 --> 00:18:28,604
And now it's going to continue expanding upon the concepts of which jurisdictions people
are calling in from.
252
00:18:28,604 --> 00:18:29,525
What types of data?
253
00:18:29,525 --> 00:18:30,395
Third party sharing.
254
00:18:30,395 --> 00:18:32,086
Are you minimizing the data?
255
00:18:32,126 --> 00:18:35,628
Each one of these things that I had live somewhere in vector space.
256
00:18:35,628 --> 00:18:43,152
So merely combining the concept of effective computing with the concept of privacy law and
consent and data security.
257
00:18:43,152 --> 00:18:48,705
That way we can then combine those aspects in new ways that haven't been in the training
set.
258
00:18:48,829 --> 00:18:50,059
So I think that's where it is.
259
00:18:50,059 --> 00:18:58,343
Almost everything that we do as laws, as lawyers, everything we do is connecting my
client's facts to the existing laws.
260
00:18:58,503 --> 00:19:01,724
And your client's facts are almost certainly in the training set.
261
00:19:01,724 --> 00:19:09,648
And the existing laws, if you are training on actual non-hallucinated cases, statutes, and
regulations, those are also in the training set.
262
00:19:09,648 --> 00:19:18,329
So really, reasoning is just being able to connect those existing facts in the data set
with the existing laws in the data set and saying how they relate to each other.
263
00:19:18,329 --> 00:19:21,945
if you have the actual non-hallucinated cases, statutes, and regulations.
264
00:19:22,604 --> 00:19:23,615
That's super interesting.
265
00:19:23,615 --> 00:19:30,978
So I find it, I have to think through this, but it seems shocking to me that there are no
novel concepts.
266
00:19:30,978 --> 00:19:39,113
Um, that what you've just described two things that currently exist in the, in the
training material, right?
267
00:19:39,113 --> 00:19:51,509
That, that the LLM has vectorized and plotted in 12,000 dimensions and it knows the
associations and, and the latent space between them.
268
00:19:51,830 --> 00:19:52,588
But
269
00:19:52,588 --> 00:20:08,665
What about new areas of law like when we start selling real estate on the moon, that
obviously at some point will make its way in, but until it does, how will it navigate
270
00:20:08,665 --> 00:20:10,366
scenarios like that?
271
00:20:10,579 --> 00:20:13,270
So I guess the question is where do those areas of law come from?
272
00:20:13,270 --> 00:20:14,660
And they come from regulations.
273
00:20:14,660 --> 00:20:15,861
They come from statutes.
274
00:20:15,861 --> 00:20:17,321
They come from cases.
275
00:20:17,701 --> 00:20:21,262
And of those cases, statutes and regulations are reflected in documents.
276
00:20:21,402 --> 00:20:30,765
And if the system has those documents, the cases, the statutes, and the regulations, then
the system will be able to plot those in vector space and then be able to take those legal
277
00:20:30,765 --> 00:20:35,446
concepts and apply them to the factual concepts that are also in vector space.
278
00:20:35,446 --> 00:20:39,067
So really, every single area of law is written somewhere.
279
00:20:39,067 --> 00:20:40,990
It has to be, otherwise it's not a law.
280
00:20:40,990 --> 00:20:43,073
And if it's written, it can be vectorized.
281
00:20:43,073 --> 00:20:45,286
So really everything that we do is part of the training set.
282
00:20:45,286 --> 00:20:53,837
There is really no novelty that is needed in the law because everything is necessarily
backward looking at the cases, the statutes, the regulations that are binding.
283
00:20:54,316 --> 00:20:54,987
Interesting.
284
00:20:54,987 --> 00:21:00,852
You had a metaphor I had not heard before with anesthesia.
285
00:21:00,852 --> 00:21:07,478
And I think you had a friend who was an anesthesiologist.
286
00:21:07,478 --> 00:21:08,058
Yes.
287
00:21:08,058 --> 00:21:10,170
And I have trouble saying that word.
288
00:21:10,170 --> 00:21:13,683
So I'll just say anesthesiology.
289
00:21:13,683 --> 00:21:17,376
To explain that, because I thought that was an interesting metaphor.
290
00:21:17,511 --> 00:21:19,001
Yeah, she told me something.
291
00:21:19,001 --> 00:21:26,744
I was over a campfire and it freaked me out and I may freak out your recent listeners, but
Yeah, she said she said Damien.
292
00:21:26,744 --> 00:21:28,124
Do you realize we have no idea?
293
00:21:28,124 --> 00:21:29,435
She's a nurse anesthetist, right?
294
00:21:29,435 --> 00:21:39,037
So she puts people under every single day and she has a I think a master's degree in
anesthesiology So she said do you realize we have no idea how anesthesia works?
295
00:21:39,298 --> 00:21:40,978
I said wait to say that again.
296
00:21:40,978 --> 00:21:44,143
She said yeah one of two options option number one
297
00:21:44,143 --> 00:21:51,830
is it does what everybody thinks that it does, is that it puts us to sleep and we don't
feel that scalpel going into our bellies and then we come out and we're all fine, right?
298
00:21:51,830 --> 00:21:53,271
That's option number one.
299
00:21:53,271 --> 00:21:57,134
Option number two is we feel every single cut.
300
00:21:57,234 --> 00:22:01,317
And what anesthesia does is to give us amnesia to make us forget.
301
00:22:01,698 --> 00:22:04,381
We don't know whether it's option one or option two.
302
00:22:04,381 --> 00:22:07,503
That scares the crap out of me and it might well scrape the crap out of you.
303
00:22:07,503 --> 00:22:12,187
But the question is, do we not use anesthesia because we don't know how it works?
304
00:22:12,891 --> 00:22:20,977
No, of course we use anesthesia because the real question is does it work and is it
effective as to what we would like it to do?
305
00:22:20,977 --> 00:22:28,281
If the answer to both those things is yes, then how it works maybe matters less than the
fact that it does work.
306
00:22:28,502 --> 00:22:32,124
So apply that anesthesia test to reasoning.
307
00:22:32,525 --> 00:22:41,661
And just like I can't tell whether you could, you're reasoning in Ted's brain or not, but
I can gauge you by your output, by your, by your speech.
308
00:22:41,661 --> 00:22:44,133
by your words coming out of your keyboard.
309
00:22:44,193 --> 00:22:47,856
And if that works, I say you're reasoning.
310
00:22:48,597 --> 00:22:51,000
whether I know how your brain works doesn't matter.
311
00:22:51,000 --> 00:22:54,302
And whether I know how anesthesia works doesn't matter.
312
00:22:54,423 --> 00:22:58,106
I'm sorry, whether I know how anesthesia works doesn't matter.
313
00:22:58,106 --> 00:23:00,008
The fact that it does work matters.
314
00:23:00,008 --> 00:23:07,934
So the fact that a large-language model does create output that seems like it is
reasonable and is reasoning, just like a human is reasoning.
315
00:23:08,763 --> 00:23:19,705
If the human, if the large language model output is indistinguishable from Ted's output as
reasonable, then I would say whether it is actual reasoning and how it's reasoning doesn't
316
00:23:19,705 --> 00:23:23,929
really matter any more than anesthesia doesn't matter if we know how anesthesia works.
317
00:23:24,322 --> 00:23:26,784
Yeah, that is disturbing to think about.
318
00:23:27,705 --> 00:23:31,109
But it's a valuable metaphor.
319
00:23:31,109 --> 00:23:33,791
Now here's what I would say in response to that.
320
00:23:33,791 --> 00:23:40,577
Did you have a chance to look at the Apple Intelligence team's study with the GSM 8K?
321
00:23:41,688 --> 00:23:43,371
Only in the two minutes before you sent it.
322
00:23:43,371 --> 00:23:45,395
So why don't you describe it and maybe I can react to it.
323
00:23:45,395 --> 00:23:45,825
Yeah.
324
00:23:45,825 --> 00:24:00,567
So, um, it's only five weeks old, so it's, it's very new, but one benchmark that has been
used pretty widely to test reasoning in, um, large language models is the, the GSM, which
325
00:24:00,567 --> 00:24:06,011
stands for grade school math, AK there's 8,000 of these questions.
326
00:24:06,252 --> 00:24:14,198
And what Apple did was modified these questions ever so slightly.
327
00:24:14,198 --> 00:24:19,000
and evaluated the LLM's performance against those modifications.
328
00:24:19,000 --> 00:24:20,780
And it was pretty dramatic.
329
00:24:20,880 --> 00:24:34,704
So their conclusions were, I said, the performance of all models decline when only the
numerical values in the question are altered in the GSM symbolic benchmark.
330
00:24:34,885 --> 00:24:37,705
That's pretty interesting.
331
00:24:38,326 --> 00:24:39,202
It says,
332
00:24:39,202 --> 00:24:45,283
their performance significantly deteriorates as the number of clauses in the question
increases.
333
00:24:45,604 --> 00:24:54,726
And then its conclusion is we hypothesize that this decline is due to the fact that
current LLMs are not capable of genuine logical reasoning.
334
00:24:55,205 --> 00:25:06,349
And I thought there were a few examples in this specifically that really, I guess, were
telling.
335
00:25:06,389 --> 00:25:08,736
So let me see if I can find this here.
336
00:25:08,736 --> 00:25:23,130
So, um, one of these, uh, these are word problems and in one of the word problems, they,
I'm not going to be able to find it, but I remember enough about it to, um, articulate it.
337
00:25:23,170 --> 00:25:31,473
What they did was in the problem, they threw a sentence that had nothing to do with the
problem itself and it completely blew up the problem.
338
00:25:31,473 --> 00:25:36,302
Um, the sentence that they put in there was it, the question was something like,
339
00:25:36,302 --> 00:25:53,402
You know, if the current price of keyboards and mouse pads are five and $10 respectively,
and inflation has increased by 10 % each year, that was the part that had nothing.
340
00:25:53,402 --> 00:25:55,592
Tell us what the current price is, right?
341
00:25:55,592 --> 00:25:57,352
It's already given you the information.
342
00:25:57,352 --> 00:26:01,302
The fact that inflation increased 10 % has nothing to do.
343
00:26:01,302 --> 00:26:05,362
And it, it plummeted the, the,
344
00:26:05,390 --> 00:26:10,092
accuracy of the large language models responses, something like 65%.
345
00:26:10,092 --> 00:26:14,253
It varied wildly as you would expect.
346
00:26:15,614 --> 00:26:30,560
The latest models did, you know, in the chain of thought that they did the best, but it
was, it seemed to me that this really pokes a hole in the whole concept of if these,
347
00:26:30,560 --> 00:26:35,412
because what that points to, if that, if you throw it a sentence that has nothing to do
with the problem,
348
00:26:35,486 --> 00:26:38,390
in and I can't, that means I haven't comprehended the problem.
349
00:26:38,390 --> 00:26:40,934
I don't know what the problem is, right?
350
00:26:40,934 --> 00:26:48,694
I'm, simply reciting answers and you know, it's what I honestly would expect from, but I
don't know.
351
00:26:48,694 --> 00:26:50,366
What is your response to that?
352
00:26:50,483 --> 00:26:59,883
Yeah, so I would say two responses, one of which is the idea that mathematics has a right
answer and a wrong answer, whereas legal often does not.
353
00:26:59,883 --> 00:27:05,943
That is, in litigation, it's whatever argument happens to win, and in transactional work,
it's whatever gets the deal done.
354
00:27:05,943 --> 00:27:14,623
So, wherein the mathematical proof, you have a right answer or a wrong answer, whereas in
legal, there is the eye of the beholder, where there is no objective, there's merely the
355
00:27:14,623 --> 00:27:15,163
subjective.
356
00:27:15,163 --> 00:27:16,863
So that's thing number one.
357
00:27:16,863 --> 00:27:19,275
Thing number two is, of course,
358
00:27:19,275 --> 00:27:29,501
With mathematics you want to be able to create new mathematics and be able to go forward
with new scenarios But again law never has It's always looking backward to precedent
359
00:27:29,501 --> 00:27:38,016
looking backward to cases looking backward to the contracts like we've always done the
contract in this way And we know that in this industry and this jurisdiction force measure
360
00:27:38,016 --> 00:27:48,281
clauses need to be in this way This is always backward looking so really so two things non
objectivity in the law where there is objectivity in math
361
00:27:48,281 --> 00:27:52,613
and backward looking in the law rather than forward looking with mathematics.
362
00:27:52,754 --> 00:28:01,139
That yes, it'll throw off the mathematics by throwing in the inflationary tool and it
won't really reason in that way.
363
00:28:01,179 --> 00:28:11,036
But I think for our use cases in the law, whether it's a transactional use case, a
litigation use case, an advisory use case or regulatory use case, all of the stuff is
364
00:28:11,036 --> 00:28:11,806
there.
365
00:28:11,806 --> 00:28:17,843
And if we use the chain of thought like you've talked about, then it could probably
overcome the lack of true
366
00:28:17,843 --> 00:28:19,784
quote unquote reasoning that we have.
367
00:28:19,784 --> 00:28:23,767
And we as humans are really good at separating wheat from chaff.
368
00:28:23,767 --> 00:28:30,691
And so you can imagine, you know, scenario one is everybody takes the robot's output and
doesn't touch it.
369
00:28:30,892 --> 00:28:33,634
That's a bad scenario under anybody's estimation.
370
00:28:33,634 --> 00:28:39,838
But almost everybody's in scenario two where it gives an output and then you look over
that output and get it out the door.
371
00:28:39,838 --> 00:28:43,040
Under scenario two, you're going to separate that wheat from the chaff.
372
00:28:43,080 --> 00:28:47,357
And so until we have autonomous legal bots, which
373
00:28:47,357 --> 00:28:49,691
God help us if we have that, right?
374
00:28:49,733 --> 00:28:52,811
But until we have that, you're always gonna have that human oversight.
375
00:28:52,811 --> 00:28:57,291
So really, whether it's reasoning or not, is gonna be pretty easily flagged.
376
00:28:57,528 --> 00:28:58,199
Yeah.
377
00:28:58,199 --> 00:29:02,102
And they, they, it wasn't just, um, there were other ways that they tested it.
378
00:29:02,102 --> 00:29:04,344
They actually changed some of the numbers.
379
00:29:04,344 --> 00:29:06,525
What was interesting is that that also threw it off.
380
00:29:06,525 --> 00:29:07,967
And this part surprised me.
381
00:29:07,967 --> 00:29:11,039
I thought AI would, I thought LLMs would figure this out.
382
00:29:11,039 --> 00:29:12,711
They changed the names.
383
00:29:12,711 --> 00:29:15,873
So instead of Sophie, they put Lisa, right?
384
00:29:15,873 --> 00:29:17,655
But they did it consistently throughout.
385
00:29:17,655 --> 00:29:21,858
Like it should be able to, so anyway, it's a new study.
386
00:29:21,858 --> 00:29:26,968
There's still a lot to be analyzed
387
00:29:26,968 --> 00:29:29,559
from it, but I did think it was interesting.
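As a rough illustration of the kind of perturbation testing being described, a sketch like the following swaps names and numbers in a GSM8K-style word problem and checks whether the answer tracks the change; ask_model is a hypothetical stand-in for the model under test, and the word problem itself is invented.

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError

TEMPLATE = (
    "{name} picks {a} apples in the morning and {b} more in the afternoon. "
    "How many apples does {name} have?"
)

def run_perturbations() -> None:
    cases = [
        {"name": "Sophie", "a": 3, "b": 4},  # original phrasing
        {"name": "Lisa", "a": 3, "b": 4},    # same math, different name
        {"name": "Sophie", "a": 5, "b": 9},  # same name, different numbers
    ]
    for case in cases:
        expected = case["a"] + case["b"]
        answer = ask_model(TEMPLATE.format(**case))
        # A solver that truly comprehends the problem should be right in every row.
        print(case, "expected:", expected, "model said:", answer)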
388
00:29:30,480 --> 00:29:37,805
Speaking of studies, the Stanford study, there's been a lot of conversation about it.
389
00:29:37,986 --> 00:29:43,068
The second iteration of that came out in May.
390
00:29:43,068 --> 00:29:56,738
You know, obviously there are companies out there that put a lot of money and
effort into these tools, and Stanford was pretty pointed in their commentary, and
391
00:29:56,770 --> 00:30:01,013
You know, there was a lot of feedback that the study was biased.
392
00:30:01,013 --> 00:30:03,054
I read it multiple times.
393
00:30:03,054 --> 00:30:06,857
It's about 30 pages, and it's a really easy read.
394
00:30:06,857 --> 00:30:09,549
Like reading scientific papers is usually rough going.
395
00:30:09,549 --> 00:30:11,660
That one was really easy to read.
396
00:30:11,680 --> 00:30:15,523
And I didn't see the bias.
397
00:30:15,523 --> 00:30:19,125
It did try to trick the tools, and it was upfront about that.
398
00:30:19,125 --> 00:30:23,178
Just like the Apple study tried to trick AI, right?
399
00:30:23,178 --> 00:30:25,846
That's kind of part of testing and, you know,
400
00:30:25,846 --> 00:30:30,392
evaluating, you're going to throw curveballs and see how the model responds.
401
00:30:30,392 --> 00:30:32,595
But, you know, what was your take on the study?
402
00:30:32,595 --> 00:30:37,792
Did you feel there were biases or did you think it was fair?
403
00:30:38,411 --> 00:30:49,384
Two thoughts on that, and not to throw shade on the Stanford folks, but one issue I have
with them is the terminology that they used for hallucinations. I
404
00:30:49,384 --> 00:30:53,516
think they conflated hallucinations with just getting the wrong legal answer.
405
00:30:53,516 --> 00:30:54,906
Those are two different things, right?
406
00:30:54,906 --> 00:31:06,131
There is a hallucination, where it just makes some things up, and then there is where Ted
and I disagree as to where the law ends up, and number two is not hallucination.
407
00:31:06,131 --> 00:31:08,331
That is just us disagreeing.
408
00:31:08,331 --> 00:31:10,671
And again, with the law, there may not be a right answer.
409
00:31:10,671 --> 00:31:16,261
And the reason there is litigation is because reasonable minds can disagree as to what
is the right answer.
410
00:31:16,261 --> 00:31:18,951
So a court has to be able to resolve that dispute.
411
00:31:19,431 --> 00:31:24,071
A disagreement as to the output is not hallucination.
412
00:31:24,071 --> 00:31:32,711
So number one, the quibble I had is on the terminology, that they call everything
hallucination, where really we should focus that term on the confabulations that the large language
413
00:31:32,711 --> 00:31:33,271
models do.
414
00:31:33,271 --> 00:31:34,771
That's thing number one.
415
00:31:34,771 --> 00:31:38,730
Thing number two goes to trying to trick the model in the ways that you talked about.
416
00:31:38,730 --> 00:31:41,101
And this goes to the product side of me.
417
00:31:41,101 --> 00:31:42,051
I'm a product guy.
418
00:31:42,051 --> 00:31:43,410
You're a product guy.
419
00:31:43,951 --> 00:31:48,011
We, as product people, say, what are the most common user pathways?
420
00:31:48,011 --> 00:31:49,731
What are the most common user behaviors?
421
00:31:49,731 --> 00:31:53,611
And we want to be able to build products that are based on those most common user
behaviors.
422
00:31:53,611 --> 00:32:01,467
And going back to my x- and y-axis, this is the head and the long tail, where the
most common things done are the head.
423
00:32:01,467 --> 00:32:06,710
And the weirdest, strangest things that you would never think a user would ever do is in
the long tail.
424
00:32:06,911 --> 00:32:15,717
And so the things that they were asking were things like, when Justice Ruth Bader Ginsburg
dissented in this case, what does that mean?
425
00:32:15,717 --> 00:32:24,234
Where a user would never ask that, because the user
would know that Ruth Bader Ginsburg didn't dissent in that case.
426
00:32:24,234 --> 00:32:26,365
She was the concurrence in that case.
427
00:32:26,365 --> 00:32:29,143
So asking a question like that is
428
00:32:29,143 --> 00:32:32,245
way, way down on the long tail distribution curve.
429
00:32:32,245 --> 00:32:34,347
That is not the most common use case.
430
00:32:34,347 --> 00:32:42,793
So really, if they were to do the study correctly, they would say, what
are the most common questions made by lawyers?
431
00:32:42,793 --> 00:32:50,848
The most common questions made by law students, and then
collect those most common questions, randomly distribute them, and
432
00:32:50,848 --> 00:32:55,742
then say, based on those most common questions, or I guess not even most common, they
would take the entire distribution curve.
433
00:32:55,742 --> 00:32:57,723
They would take the head and the tail.
434
00:32:57,723 --> 00:32:59,404
Mix that up in a randomized study.
435
00:32:59,404 --> 00:33:03,907
So there will be some long tail questions, some head questions.
436
00:33:03,907 --> 00:33:10,072
And then from that random distribution, then run those through and see how many
confabulations slash hallucinations are there.
437
00:33:10,072 --> 00:33:12,013
That would be a reasonable way to do it.
438
00:33:12,013 --> 00:33:15,315
That would be most aligned with how users use the tools.
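A bare-bones sketch of that sampling idea might look like the following; the question pools and the count_confabulations checker are assumptions for illustration, not the Stanford team's actual protocol.

import random

HEAD_QUESTIONS = [  # the common, everyday queries
    "What is the statute of limitations for breach of contract in Minnesota?",
    "What elements must a plaintiff prove to establish negligence?",
]
TAIL_QUESTIONS = [  # rare, adversarial, or premise-flawed queries
    "Why did Justice Ginsburg dissent in a case in which she actually concurred?",
]

def count_confabulations(question: str, answer: str) -> int:
    """Hypothetical checker: number of fabricated citations or holdings in the answer."""
    raise NotImplementedError

def evaluate(tool, n: int = 100, head_weight: float = 0.95) -> float:
    # Sample across the whole distribution curve: mostly head, a little tail.
    random.seed(0)
    population = HEAD_QUESTIONS + TAIL_QUESTIONS
    weights = (
        [head_weight / len(HEAD_QUESTIONS)] * len(HEAD_QUESTIONS)
        + [(1 - head_weight) / len(TAIL_QUESTIONS)] * len(TAIL_QUESTIONS)
    )
    sample = random.choices(population, weights=weights, k=n)
    errors = sum(count_confabulations(q, tool(q)) for q in sample)
    return errors / n  # average confabulations per question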
439
00:33:15,315 --> 00:33:25,803
So by asking a long tail question that arguably would zero times out of 1,000 ever be
asked, zero times out of 1 million ever be asked, nobody would ever ask a question to
440
00:33:25,803 --> 00:33:26,703
trick it.
441
00:33:26,811 --> 00:33:32,754
I would say that's really not doing what the system should be doing for its purpose.
442
00:33:32,754 --> 00:33:43,088
The purpose of legal research is to answer legal questions and it's not to gauge the
reasoning of the underlying large language models, GPT-4 or GPT-3.5 or whatever was in
443
00:33:43,088 --> 00:33:44,339
place at the time.
444
00:33:44,339 --> 00:33:46,920
And what they were doing was tricking GPT-3.5.
445
00:33:46,920 --> 00:33:50,861
They weren't tricking the legal research tools that were relying on GPT-3.5.
446
00:33:50,961 --> 00:33:55,097
So I would say, test the thing that is the product,
447
00:33:55,097 --> 00:33:58,605
not the underlying model that the product is using.
448
00:33:59,032 --> 00:34:08,042
So how is that different than throwing in a sentence about inflation being 10% in the
GSM8K scenario?
449
00:34:08,711 --> 00:34:15,776
So I guess if we're looking at a legal product, like are we testing reasoning or are we
testing how well the product works?
450
00:34:15,776 --> 00:34:17,597
Because those are two different goals.
451
00:34:17,597 --> 00:34:22,440
Because really testing reasoning is testing the foundational model, GPT-4, GPT-3.5.
452
00:34:22,440 --> 00:34:31,386
But if you're testing how well the product works for its intended purpose, then the
question then would be, would the user input usually include that inflationary number?
453
00:34:31,446 --> 00:34:37,950
If the answer is yes, the user input would include that, then yes, we should definitely
include that in the distribution of the user input.
454
00:34:38,988 --> 00:34:46,838
But if zero times out of 100 million they would include that inflationary number, then that
doesn't seem right, because if you're really testing the product, that is not a use case
455
00:34:46,838 --> 00:34:48,581
that the users would ever use.
456
00:34:48,728 --> 00:34:50,409
Yeah, I guess, yeah.
457
00:34:50,409 --> 00:34:51,659
OK, fair.
458
00:34:51,659 --> 00:35:00,383
So in section 6.2 in the study, and I do agree, they say hallucinations can be insidious.
459
00:35:00,423 --> 00:35:10,907
And then some of the scenarios that they document don't seem like hallucination scenarios,
like misunderstanding holdings.
460
00:35:10,907 --> 00:35:17,690
Systems do not seem capable of consistently making out the holding of a case,
distinguishing between legal actors.
461
00:35:17,822 --> 00:35:28,089
Systems fail to distinguish between arguments made by litigants and statements made by the
court. Respecting order of authority: models strain in grasping hierarchies of legal
462
00:35:28,089 --> 00:35:28,550
authority.
463
00:35:28,550 --> 00:35:30,461
Yeah, those aren't hallucinations.
464
00:35:30,461 --> 00:35:34,764
Those are just limitations of the model itself, it seems.
465
00:35:34,827 --> 00:35:39,469
That's right, limitations of the model and perhaps limitations of the system that is using
that model.
466
00:35:39,469 --> 00:35:49,352
So you could imagine that if the system were to say as part of the metadata that a trial
court is below this particular appellate court, which is below this particular Supreme
467
00:35:49,352 --> 00:36:01,055
Court, and models that hierarchy of courts in a symbolic way, not a large language model
way, but in a symbolic coded up way, then that system could avoid the,
468
00:36:01,139 --> 00:36:10,279
confabulation between the district court and the appellate court level because the
guardrails of the symbolic AI would prevent that kind of misunderstanding.
469
00:36:10,919 --> 00:36:21,399
So is Stanford analyzing the large language model output or are they analyzing the
system's coding, that is the hard coding, to be able to say that this trial court is below
470
00:36:21,399 --> 00:36:23,659
the appellate court, which is below the Supreme Court?
471
00:36:23,659 --> 00:36:30,773
I think that that is maybe a reasonable critique that if the system
472
00:36:30,963 --> 00:36:34,764
is not recognizing that hierarchy, then maybe the system should.
473
00:36:34,784 --> 00:36:40,026
So I would say that's maybe a reasonable critique if you're really looking at that.
474
00:36:40,666 --> 00:36:42,327
So yeah, maybe two scenarios.
475
00:36:42,327 --> 00:36:52,500
The Ruth Bader Ginsburg question is unreasonable, but knowing whether the district court was
overruled by the appellate court, which was then ruled back by the Supreme Court, a legal
476
00:36:52,500 --> 00:36:56,611
system that is a legal research system should know those things in a symbolic AI way.
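A minimal sketch of that symbolic guardrail, assuming invented court names and a structured claim already extracted from the model's draft; this is an illustration of the idea, not how any vendor actually implements it.

# Court ranks are hard-coded symbolically rather than left to the language model.
COURT_RANK = {
    "U.S. District Court (D. Minn.)": 1,    # trial court
    "U.S. Court of Appeals (8th Cir.)": 2,  # appellate court
    "U.S. Supreme Court": 3,                # highest court
}

def may_overrule(higher: str, lower: str) -> bool:
    """Hard-coded rule: only a court ranked above another can overrule it."""
    return COURT_RANK[higher] > COURT_RANK[lower]

def passes_guardrail(claim: dict) -> bool:
    # `claim` is a structured assertion pulled from the model's draft, e.g.
    # {"overruling": "U.S. Court of Appeals (8th Cir.)",
    #  "overruled": "U.S. District Court (D. Minn.)"}.
    # If the hierarchy says the claim cannot happen, flag it for human review
    # instead of letting the confabulation through.
    return may_overrule(claim["overruling"], claim["overruled"])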
477
00:36:56,728 --> 00:36:57,910
Yeah, exactly.
478
00:36:57,910 --> 00:37:00,053
And the last category was fabrications.
479
00:37:00,053 --> 00:37:02,065
That is a hallucination.
480
00:37:03,689 --> 00:37:10,348
So what does this study mean for the future of AI and legal research?
481
00:37:11,085 --> 00:37:12,416
The Stanford study?
482
00:37:13,217 --> 00:37:22,305
I would say, so that Stanford study is out of one side of Stanford,
and then there's another side of Stanford called the CodeX.
483
00:37:22,305 --> 00:37:29,751
Megan Ma helps lead that CodeX, and you probably know Megan; she's one of the
smartest minds right now in legal AI.
484
00:37:29,751 --> 00:37:39,119
She's running another study that is comparing human-generated output with
human-plus-machine-generated output,
485
00:37:39,119 --> 00:37:45,102
and doing a double-blind study to see what the large law firm partners prefer.
486
00:37:45,102 --> 00:37:50,904
The partners don't know which is human-created versus human-plus-machine-created.
487
00:37:50,904 --> 00:37:57,127
So that seems like a reasonable way because that is really taking actual use cases.
488
00:37:57,127 --> 00:38:06,661
So she's taking actual contractual questions or actual litigation questions and being able
to actually take those common use cases, the head in the distribution curve, not the long
489
00:38:06,661 --> 00:38:07,847
tail, but the head.
490
00:38:07,847 --> 00:38:15,609
and then seeing how the legal tools that are built for this actually perform on these
more likely legal tasks.
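As a sketch of what that blinded comparison can look like mechanically; the data shapes and function names are assumptions for illustration, and the real study design is of course far more careful.

import random

def blind_pair(human_draft: str, human_plus_machine_draft: str) -> dict:
    # Shuffle so the reviewing partner cannot tell which workflow produced which draft.
    pair = [("human", human_draft), ("human+machine", human_plus_machine_draft)]
    random.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}

def record_preference(labels: dict, reviewer_choice: str) -> str:
    # Unblind only after the reviewer has committed to "A" or "B".
    source, _draft = labels[reviewer_choice]
    return source  # e.g., "human" or "human+machine"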
491
00:38:15,609 --> 00:38:22,311
So I would say that that Stanford study is a bright shining light as to the way things
should be done:
492
00:38:22,311 --> 00:38:25,212
the other Stanford study, Megan Ma's Stanford study.
493
00:38:25,212 --> 00:38:29,413
The second thing is that similar studies are being done by Dan Schwartz out of Minnesota.
494
00:38:29,413 --> 00:38:35,655
Dan, you might have seen two of his other studies, one of which was to be able to say,
495
00:38:36,871 --> 00:38:44,315
They interspersed large language model-created essays with human-created essays, and
in a double-blind setup the professors graded both.
496
00:38:44,315 --> 00:38:46,657
And so that was his study number one.
497
00:38:46,657 --> 00:38:52,280
Study number two was to, I forget what study number two was, but it was a similar vein.
498
00:38:52,280 --> 00:39:01,746
But then study number three is doing kind of the same thing that Megan is doing, but just
a different twist on it to be able to do a double or triple blind study of human created
499
00:39:01,746 --> 00:39:04,529
things along with machine created things.
500
00:39:04,529 --> 00:39:08,282
and mixing them up and having human evaluators see what they prefer.
501
00:39:08,282 --> 00:39:10,403
So that's evaluation number two.
502
00:39:10,403 --> 00:39:11,984
That is a bright shining light.
503
00:39:11,984 --> 00:39:14,686
Evaluation number three is Legal Technology Hub.
504
00:39:14,686 --> 00:39:26,114
Nikki Shaver and her team are working with vals.ai on a similar study involving
Harvey, Thomson Reuters, LexisNexis, and us at vLex, where it is a similar kind of John Henry
505
00:39:26,114 --> 00:39:28,497
kind of test to evaluate the outputs.
506
00:39:28,497 --> 00:39:35,456
So I would say that the old Stanford study is old news and is probably chasing the wrong
things for the reason we've just discussed.
507
00:39:35,456 --> 00:39:43,937
Whereas the new Stanford study and the Minnesota-Michigan study and the Legal Technology
Hub vals.ai study, those are going to give us some hope going forward.
508
00:39:44,162 --> 00:39:44,602
Interesting.
509
00:39:44,602 --> 00:39:45,082
It's funny.
510
00:39:45,082 --> 00:39:49,987
We're in a space that moves so fast that May is old news and it's November.
511
00:39:49,987 --> 00:39:52,269
But yeah, I don't disagree.
512
00:39:53,490 --> 00:40:04,540
So the Gartner hype curve does an absolutely phenomenal job, in my opinion, of mapping out the
trajectory of new technologies in many cases.
513
00:40:04,540 --> 00:40:09,284
And I think it's really playing out interestingly in AI right now.
514
00:40:09,284 --> 00:40:11,726
So the Goldman study
515
00:40:11,726 --> 00:40:17,466
came out saying that 44% of legal tasks could be automated by gen AI, and that freaked everybody
out.
516
00:40:17,466 --> 00:40:21,566
I think that number was very aspirational.
517
00:40:24,286 --> 00:40:38,638
I think I might have heard you talk about when GPT-3.5 took the bar and scored in the
mid-60s, and GPT-4 scored 90 plus.
518
00:40:38,638 --> 00:40:41,878
That number has since been revised down significantly.
519
00:40:41,878 --> 00:40:43,958
So I've heard, which is interesting.
520
00:40:43,958 --> 00:41:01,698
Um, there's a Wharton survey out that I saw in Peter Duffy's newsletter that surveyed
inside counsel and showed that only 28% of inside counsel anticipate a high impact of gen
521
00:41:01,698 --> 00:41:02,898
AI in their role.
522
00:41:02,898 --> 00:41:08,230
I found that super interesting, and 25% anticipate a low impact.
523
00:41:08,460 --> 00:41:10,441
Which again, that's kind of mind blowing.
524
00:41:10,441 --> 00:41:12,895
But where do you think we are on this hype curve?
525
00:41:12,895 --> 00:41:20,032
Do you feel like we're in the trough of disillusionment or are we still, do we still have
further to go?
526
00:41:20,377 --> 00:41:22,588
I think we're going up the slope actually.
527
00:41:22,588 --> 00:41:28,129
I just gave a talk with Pablo Arredondo, the Casetext co-founder.
528
00:41:28,189 --> 00:41:36,011
He and I gave a presentation, and he spent a lot of his part of the presentation walking us
through the hype cycle and through the trough.
529
00:41:36,011 --> 00:41:41,733
And he thinks that law firms and others are doing the hard yards of going up the slope
slowly but surely.
530
00:41:41,733 --> 00:41:45,034
And I think that he's probably right.
531
00:41:45,034 --> 00:41:48,375
And to a couple of the things that you mentioned, you know, the
532
00:41:49,223 --> 00:41:59,090
bar exam: of course, my friend Pablo was actually one of the guys who did that bar
exam paper, and Mike Bommarito and Dan Katz were the other co-authors. And so
533
00:41:59,090 --> 00:42:08,436
they, in that paper that they wrote, actually put in the caveat saying that, one,
those results are never publicly announced.
534
00:42:08,436 --> 00:42:18,155
So they're kind of doing replicas of the bar exam. So this is always, you know, until the
bar exam, the multistate bar exam, publishes its numbers, of course,
535
00:42:18,155 --> 00:42:20,317
there's no definitive objective number.
536
00:42:20,317 --> 00:42:27,042
It's all largely taking kind of a statistical likelihood rather than a definitive
objective number.
537
00:42:27,042 --> 00:42:27,932
That's thing number one.
538
00:42:27,932 --> 00:42:38,920
Thing number two, they had also put in footnote caveats saying that this was
during COVID times, and, you know, of course there are fewer takers during
539
00:42:38,920 --> 00:42:46,045
COVID times, and maybe those COVID-era takers were less likely to do well, and maybe, you know,
there's all sorts of, you know, scientifically
540
00:42:46,045 --> 00:42:48,908
kind of nebulous things that go into that number.
541
00:42:48,908 --> 00:42:53,682
But anyway, so they put that 90% number with all of those caveats in the initial paper.
542
00:42:53,682 --> 00:42:57,400
So the subsequent papers say, no, it's way lower than 90%.
543
00:42:57,400 --> 00:42:58,216
Like, come on.
544
00:42:58,216 --> 00:43:00,278
Like, they put that in the footnotes.
545
00:43:00,278 --> 00:43:01,870
So that's that.
546
00:43:01,870 --> 00:43:08,155
And then to the other thing, saying that only 28% of inside counsel anticipate high
impact on their role.
547
00:43:09,737 --> 00:43:11,978
I've heard that, but there's also
548
00:43:12,093 --> 00:43:22,727
there are a lot of studies saying that 80% of inside counsel expect their external
counsel's bills to be reduced because of large language models.
549
00:43:22,848 --> 00:43:30,331
So even though 28% of them think that it's going to impact their own role, 80% think it's going
to impact external counsel's role.
550
00:43:30,331 --> 00:43:33,472
So that is an expectation from the buy side, the client side.
551
00:43:33,492 --> 00:43:40,935
And another thing is that the Clio Cloud Conference announced their Clio survey.
552
00:43:41,011 --> 00:43:44,771
Their distribution curve is mostly the solo small up to the midsize law firms.
553
00:43:44,771 --> 00:43:46,581
And they did a survey last year.
554
00:43:46,581 --> 00:43:50,651
And during that survey, they asked:
555
00:43:50,651 --> 00:43:53,451
How many of you are using large language models for legal work?
556
00:43:53,451 --> 00:43:56,331
And the answer in 2023 was about 25%.
557
00:43:56,331 --> 00:43:58,271
They asked the same question in 2024.
558
00:43:58,271 --> 00:44:01,371
And the answer jumped to about 80%.
559
00:44:01,371 --> 00:44:07,471
That is, 80% of solo-small up to midsize Clio users are using AI for legal work.
560
00:44:07,471 --> 00:44:10,803
That's a dramatic jump from 25% to 80%.
561
00:44:10,803 --> 00:44:16,803
And so that shows me that the future is already here, it's just not evenly distributed.
562
00:44:16,803 --> 00:44:25,023
That is, solo-smalls are using this, they're already on the slope of enlightenment,
they're already using it for real use cases, where the big law folks maybe aren't telling
563
00:44:25,023 --> 00:44:26,493
anybody that they're using it.
564
00:44:26,493 --> 00:44:34,603
And maybe the associates in big law, if their law firms prohibit them from using
it, use shadow IT, where they use it on their personal devices, and they're not
565
00:44:34,603 --> 00:44:35,979
telling anybody about it.
566
00:44:36,014 --> 00:44:48,034
Yeah, you know, all these numbers that are flying around, I don't know if you saw the ILTA
Tech Survey, that 74% of law firms with more than 700 attorneys are using gen AI in
567
00:44:48,034 --> 00:44:49,834
business use cases.
568
00:44:50,594 --> 00:44:53,354
That seems very aspirational to me.
569
00:44:53,354 --> 00:44:56,184
And I had Steve Embry on the podcast a while back.
570
00:44:56,184 --> 00:45:02,114
He wrote an article, this was before the Tech Survey came out, called Mind the Gap.
571
00:45:02,114 --> 00:45:04,522
And the gap he was talking about is
572
00:45:04,522 --> 00:45:15,731
between surveys like this that report gen AI usage and the anecdotal observation of people
like him and me who work with law firms all day long and just don't see it.
573
00:45:16,292 --> 00:45:20,376
So I think a lot of these numbers are conflicting, aspirational.
574
00:45:20,376 --> 00:45:25,580
Maybe you have a lawyer who Googles what ChatGPT is and he can check the box.
575
00:45:25,580 --> 00:45:28,062
He or she can check the box, but I don't know.
576
00:45:28,183 --> 00:45:28,904
That's right.
577
00:45:28,904 --> 00:45:35,827
I would say that, yeah, every survey can have holes poked in it based on the way it's asked.
578
00:45:35,827 --> 00:45:45,713
Because you can imagine, if the way it's asked is, have you used large language models in
your practice, and I used it for one thing in the full year, then I could answer yes to
579
00:45:45,713 --> 00:45:46,433
that.
580
00:45:46,433 --> 00:45:54,958
But really, the question is, if the question were instead, what percentage of your work
involved large language models, that number would be totally different, right?
581
00:45:54,958 --> 00:45:56,779
And that number would be way lower.
582
00:45:56,871 --> 00:46:02,288
Have you ever used it in the past year for one thing versus what percentage of your work
have you used it?
583
00:46:02,288 --> 00:46:05,176
Those are two very different questions that will give very different answers.
584
00:46:05,176 --> 00:46:06,907
Yeah, agreed.
585
00:46:06,947 --> 00:46:09,607
Well, this has been a super fun conversation.
586
00:46:09,607 --> 00:46:12,086
I really appreciate you taking a few minutes with me.
587
00:46:12,086 --> 00:46:17,993
I think the LLM reasoning conversation is really just beginning.
588
00:46:17,993 --> 00:46:21,055
Do you, do you know Yann LeCun from
589
00:46:21,055 --> 00:46:23,853
I do, yeah, he's one of the smartest guys around.
590
00:46:23,884 --> 00:46:24,995
Yeah, from Meta.
591
00:46:24,995 --> 00:46:31,957
So I, again, this is more conflicting information that we as individuals have to make
sense of.
592
00:46:31,957 --> 00:46:40,040
He talked about how, uh, currently a house cat is smarter than large language models,
which I thought was interesting.
593
00:46:40,040 --> 00:46:46,614
And then there was a talk at a local EO, um, Entrepreneurs' Organization function here in St.
594
00:46:46,614 --> 00:46:48,525
Louis last night.
595
00:46:48,525 --> 00:46:52,446
And I got the notes from it and it said that current
596
00:46:52,838 --> 00:47:06,422
large language models are operating at, I think the number was, the equivalent of an IQ of
100, and that in the next year and a half, it will operate at an IQ of 1000.
597
00:47:06,422 --> 00:47:16,795
Which, I don't know, those numbers don't make sense to me, but you know, when I hear
Yann say that it's dumber than a house cat.
598
00:47:16,795 --> 00:47:20,618
And then I hear that we're operating today at IQ 100.
599
00:47:20,618 --> 00:47:22,246
There's lots of
600
00:47:22,508 --> 00:47:24,701
You know, there's lots of things to make sense of.
601
00:47:24,701 --> 00:47:25,803
Um, I don't know.
602
00:47:25,803 --> 00:47:27,901
What is your take on that before we wrap up?
603
00:47:27,901 --> 00:47:34,955
Yeah, I really like and respect Yann, and I think he's right that if we want to have
robots, they need to understand the world.
604
00:47:34,955 --> 00:47:45,000
So when he talks about it being as dumb as a house cat, he's talking about the idea
that if you put a ball into a cup and then you flip the cup upside down, what is going to
605
00:47:45,000 --> 00:47:46,500
happen to that ball?
606
00:47:46,560 --> 00:47:49,862
The large language model should know that the ball should fall out of the cup, right?
607
00:47:49,862 --> 00:47:52,343
But large language models often get that wrong.
608
00:47:52,363 --> 00:47:57,477
So if we want robots to be able to figure out how the world works, we definitely need that
kind of spatial reasoning.
609
00:47:57,477 --> 00:47:59,369
And that's what he's talking about with dumber than a house cat.
610
00:47:59,369 --> 00:48:01,971
House cats know that the ball falls out of the cup.
611
00:48:02,111 --> 00:48:10,378
But what Yann isn't saying is there are use cases like the law, where we don't have to deal
with cups turning over and balls.
612
00:48:10,458 --> 00:48:17,163
Every single thing that a lawyer does, every single task, whether you're a litigator or
transactional lawyer, every single task is based on words.
613
00:48:17,404 --> 00:48:21,551
We ingest words, we analyze words, and we output words.
614
00:48:21,551 --> 00:48:23,132
We don't deal with the physical world.
615
00:48:23,132 --> 00:48:26,674
We are merely word based in every single task that we do.
616
00:48:26,674 --> 00:48:30,237
So set that aside; a cat doesn't know words.
617
00:48:30,237 --> 00:48:32,958
We don't need to know whether a ball falls out of a cup.
618
00:48:32,958 --> 00:48:35,060
All we need to know is how the words work.
619
00:48:35,060 --> 00:48:40,983
And I would say for this use case, the legal use case, Yann's criticisms are maybe
inapplicable.
620
00:48:41,038 --> 00:48:43,218
Yeah, yeah, and you're right.
621
00:48:43,218 --> 00:48:44,288
That's what he was saying.
622
00:48:44,288 --> 00:48:52,978
You know, house cats can plan and anticipate and they have spatial awareness that large
language models don't.
623
00:48:52,978 --> 00:48:55,328
Well, this has been a lot of fun before we wrap up.
624
00:48:55,328 --> 00:48:58,268
How do folks find out more about what you do?
625
00:48:58,268 --> 00:49:01,058
Your work with SALI and vLex?
626
00:49:01,058 --> 00:49:03,475
How do people find out more about that?
627
00:49:03,475 --> 00:49:04,675
Yeah, the best place is on LinkedIn.
628
00:49:04,675 --> 00:49:05,805
I hang out there a lot.
629
00:49:05,805 --> 00:49:08,596
It's Damien Riehl, and you have it in the show notes.
630
00:49:08,596 --> 00:49:09,367
Awesome.
631
00:49:09,367 --> 00:49:10,728
Well, good stuff.
632
00:49:11,070 --> 00:49:14,011
Hopefully, are you going to be in Miami for TLTF?
633
00:49:14,011 --> 00:49:15,543
I will see you at TLTF.
634
00:49:15,543 --> 00:49:17,576
That's one of my favorite conferences.
635
00:49:17,576 --> 00:49:18,922
Yeah, looking forward to seeing you there.
636
00:49:18,922 --> 00:49:19,453
Absolutely.
637
00:49:19,453 --> 00:49:22,287
We'll be on stage pitching on Wednesday afternoon.
638
00:49:22,768 --> 00:49:23,270
Good.
639
00:49:23,270 --> 00:49:24,852
We'll see you in Miami.
640
00:49:25,274 --> 00:49:25,995
All right.
641
00:49:25,995 --> 00:49:26,836
Thanks, Damien.
642
00:49:26,836 --> 00:49:27,858
Take care.