Join Ted as he engages with Damien Riehl to explore the transformative intersection of AI and the legal profession. This conversation unpacks the evolving role of large language models in legal reasoning, how they reshape traditional tasks, and the philosophical implications of trusting AI outputs. Whether you’re curious about the limits of AI’s “reasoning” or its practical benefits for attorneys, this episode offers fresh perspectives and insights you won’t want to miss.
In this episode, Damien shares insights on how to:
Assess the reasoning capabilities of large language models in legal contexts
Leverage generative AI to enhance legal document drafting and research
Interpret the limitations of AI outputs in subjective fields like law
Test the reliability and objectivity of AI systems in complex decision-making
Navigate ethical considerations in adopting AI-driven tools in legal practice
Key takeaways:
AI’s reasoning in law aligns with its backward-looking nature, as legal tasks often rely on precedent and existing data, making large language models effective for linking facts to statutes and regulations.
The “anesthesia test” emphasizes evaluating AI by its practical outputs, as effectiveness matters more than understanding its internal workings.
While large language models excel at connecting legal concepts, they struggle with objective tasks like math, reflecting their contextual strengths.
Studies like Stanford’s highlight AI’s limitations, but the legal industry should focus on practical applications for everyday workflows rather than edge cases.
About the guest, Damien Riehl:
Damien Riehl is a lawyer and technologist with experience in complex litigation, digital forensics, and software development. A lawyer since 2002 and coder since 1985, Damien clerked for the chief judges of state and federal courts, practiced in complex litigation for over a decade, has led cybersecurity teams and world-spanning digital forensics investigations, and has built AI-backed legal software.
Co-Chair of the Minnesota Governor’s Council on Connected and Automated Vehicles, he is helping recommend changes to Minnesota statutes, rules, and policies — all related to connected and autonomous vehicles. Damien is Chair of the Minnesota State Bar Association’s AI Committee, which oversees an AI Sandbox to promote Access to Justice (A2J).
At SALI, the legal data standard, Damien built and greatly expanded the taxonomy of over 18,000 legal tags that matter, supporting the legal industry's development of Generative AI, analytics, and interoperability.
At vLex Group — which includes Fastcase, NextChapter, and Docket Alarm — Damien helps lead the design, development, and expansion of various products, integrating AI-backed technologies (e.g., GPT) into a billion-document dataset from 100+ countries, all to improve legal workflows.
“How much of the law is looking backward—that is, looking to precedent? You’re always looking to statutes; you’re always looking to something that is in the data set. So, if it is in the data set, really, all of our reasoning that is legal is backward-looking, not forward-looking.”– Damien Riehl
1
00:00:02,328 --> 00:00:04,179
Damien, how are you this afternoon?
2
00:00:04,179 --> 00:00:04,761
Couldn't be better.
3
00:00:04,761 --> 00:00:05,475
Life is really good.
4
00:00:05,475 --> 00:00:06,326
How are you Ted?
5
00:00:06,326 --> 00:00:07,196
I'm doing great.
6
00:00:07,196 --> 00:00:07,926
I'm doing great.
7
00:00:07,926 --> 00:00:10,786
I appreciate you joining me this afternoon.
8
00:00:10,786 --> 00:00:23,341
We were kicking around a really interesting topic via LinkedIn and I figured, you know
what, I've been overdue to have you on the podcast anyway.
9
00:00:23,341 --> 00:00:27,472
So I figured this is a good opportunity to, uh, to riff a little bit.
10
00:00:27,472 --> 00:00:31,533
Um, but before we do, let's, let's get you introduced.
11
00:00:31,533 --> 00:00:35,104
So I went and looked at your, your LinkedIn profile.
12
00:00:35,104 --> 00:00:36,294
Interestingly,
13
00:00:36,344 --> 00:00:44,477
I didn't realize you started your legal career as a clerk and you started practicing in
the early two thousands.
14
00:00:44,477 --> 00:00:46,717
You worked for TR and Fastcase.
15
00:00:46,717 --> 00:00:49,098
That's now VLex, right?
16
00:00:49,098 --> 00:00:54,420
And, um, you're still at vLex and I know you do a lot of work with SALI.
17
00:00:54,420 --> 00:00:56,010
That's how you and I actually first met.
18
00:00:56,010 --> 00:01:01,352
But, um, why don't you tell us a little bit about who you are, what you do and where you
do it.
19
00:01:01,363 --> 00:01:08,423
Sure, I've been a lawyer since 2002, I clerked for chief judges at the state appellate
court and the federal district court.
20
00:01:08,423 --> 00:01:16,583
Then I worked for a big law firm, Robbins Kaplan, where I represented Best Buy in much of
their commercial litigation, represented victims of Bernie Madoff, helped sue JPMorgan
21
00:01:16,583 --> 00:01:18,053
over the mortgage-backed security crisis.
22
00:01:18,053 --> 00:01:24,633
So I spent a pretty long time, some would say too long, as a litigator, but then I've also
been a coder since '85.
23
00:01:24,633 --> 00:01:29,275
So I have the law plus technology background, and anyone who works with me will tell you
that
24
00:01:29,275 --> 00:01:31,315
I am probably the worst coder you've ever met.
25
00:01:31,315 --> 00:01:38,640
I say I'm a coder not as a badge of honor, but a shroud of shame where I'm not very good
at coding at all.
26
00:01:38,640 --> 00:01:42,523
But with large language models, one can be actually better at coding than one actually is.
27
00:01:42,523 --> 00:01:49,267
So after litigating for a bunch of years, I joined TR, building a big thing for them, did
cybersecurity for a while.
28
00:01:49,267 --> 00:01:57,171
But since 2019, I've been working with Fastcase, which is now VLex, essentially playing in
a playground of a billion legal documents.
29
00:01:57,171 --> 00:02:05,374
cases, statutes, regulations, motions, briefs, pleadings, extracting what matters from
them using SALI tags and otherwise, and then running large language models across those
30
00:02:05,374 --> 00:02:06,105
things.
31
00:02:06,294 --> 00:02:06,925
Interesting.
32
00:02:06,925 --> 00:02:13,946
And is that how your involvement in SALI came to be, through the work that you're doing at
vLex?
33
00:02:14,259 --> 00:02:18,949
It actually came to be that I met Toby Brown, who founded SALI in 2017.
34
00:02:18,949 --> 00:02:23,839
I met him at ILTACON and we just happened to sit at the same breakfast table.
35
00:02:23,839 --> 00:02:28,279
And I'd known of Toby but had not actually met Toby before.
36
00:02:28,279 --> 00:02:34,629
But then we started talking a bit about SALI and I said, you haven't really
chased any litigation things.
37
00:02:34,629 --> 00:02:36,099
He said, no, we haven't.
38
00:02:36,099 --> 00:02:36,689
I said, why not?
39
00:02:36,689 --> 00:02:38,079
I said, would you like some help on that?
40
00:02:38,079 --> 00:02:39,475
And he's like, well, it's too hard.
41
00:02:39,475 --> 00:02:40,215
Do you want to do it?
42
00:02:40,215 --> 00:02:41,675
And I said, yeah, I totally want to do it.
43
00:02:41,675 --> 00:02:47,327
So we met in 2019, August of 2019, and I've been working on SALI ever since.
44
00:02:47,382 --> 00:02:48,123
Interesting.
45
00:02:48,123 --> 00:02:49,999
And what were you coding in 85?
46
00:02:49,999 --> 00:02:52,125
I've been, I started coding in like 82.
47
00:02:52,125 --> 00:02:53,651
What were you coding, BASIC?
48
00:02:53,651 --> 00:02:56,591
I was coding BASIC on my Commodore 128.
49
00:02:56,911 --> 00:03:02,731
I didn't start with the Commodore 64 because I wanted to get the thing that had 128K of
RAM instead of 64K of RAM.
50
00:03:02,731 --> 00:03:03,931
So I was coding BASIC.
51
00:03:03,931 --> 00:03:11,111
I was getting magazines where I would take the magazine on paper and I would retype the code
from the magazine and then try to tweak the code.
52
00:03:11,111 --> 00:03:13,902
So yeah, I was a very nerdy 10-year-old.
53
00:03:13,902 --> 00:03:14,382
Yeah.
54
00:03:14,382 --> 00:03:15,742
So it's funny.
55
00:03:15,742 --> 00:03:17,502
A lot of parallels there.
56
00:03:17,502 --> 00:03:24,722
Um, I started off with a Commodore 32, so I had one fourth of the memory that you did.
57
00:03:24,722 --> 00:03:30,342
And you know, I used to have to, uh, store my programs on audio cassette.
58
00:03:30,342 --> 00:03:40,322
This is before I could afford a floppy and you know, um, gosh, so this would have been,
yeah, probably 82, 83.
59
00:03:40,322 --> 00:03:42,252
Then I graduated to a
60
00:03:42,252 --> 00:03:48,557
I had a TI-99/4A with an Extended BASIC cartridge and a book about that thick.
61
00:03:48,557 --> 00:03:54,132
And I literally read every page of it to understand all the new commands.
62
00:03:54,132 --> 00:03:58,285
I totally geeked out on it and then was totally into it.
63
00:03:58,285 --> 00:04:04,461
And then during middle school, you know, the girls didn't think it was cool to be a
computer programmer.
64
00:04:04,461 --> 00:04:07,834
So I kind of ditched it for a while until college.
65
00:04:07,834 --> 00:04:11,006
So I had a break in there, but
66
00:04:11,022 --> 00:04:17,502
Then when I picked up computers again, it would have been early nineties, like 91 ish.
67
00:04:17,502 --> 00:04:26,822
And by then it was Visual Basic, you know, doing native Windows development, like
VB4.
68
00:04:27,161 --> 00:04:28,582
God, I can't remember.
69
00:04:28,582 --> 00:04:32,642
I think it was Visual InterDev, and I used it to compile Windows programs.
70
00:04:32,642 --> 00:04:33,652
I did a lot of SQL.
71
00:04:33,652 --> 00:04:38,862
I was actually on the SQL team at Microsoft in the late nineties, early 2000s.
72
00:04:38,862 --> 00:04:39,675
So
73
00:04:39,675 --> 00:04:42,300
I can still hold my own on SQL, but otherwise I'm like you, man.
74
00:04:42,300 --> 00:04:47,990
If I had to code an app, I'd be so lost right now.
75
00:04:48,307 --> 00:04:53,051
True, but I really query how important it is these days to be a really hardcore coder.
76
00:04:53,051 --> 00:05:01,677
I know people that are really good hardcore coders that use things like Cursor and use
large language models to be a bicycle for the mind, like Steve Jobs would say,
77
00:05:01,677 --> 00:05:03,618
and make them go better, faster, and stronger.
78
00:05:03,618 --> 00:05:14,486
But even for people that are rusty or really awful, like you and me, it's still, I can't
go 10 times as fast as a normal coder can with a large language model, but I can maybe do
79
00:05:14,486 --> 00:05:16,231
1x what they used to be able to do.
80
00:05:16,231 --> 00:05:16,382
Right.
81
00:05:16,382 --> 00:05:20,440
There's, there's really, um, it really evens the playing field on what is possible.
82
00:05:20,440 --> 00:05:21,350
Yeah.
83
00:05:21,430 --> 00:05:33,095
Well, you and I were riffing on a topic that I think is super interesting and I was kind
of surprised to hear your perspective on it and I thought it was really interesting and we
84
00:05:33,095 --> 00:05:39,618
were talking about the question on whether or not LLMs can reason.
85
00:05:39,618 --> 00:05:45,340
I've always, you know, understanding the architecture, I've always just had the default
assumption.
86
00:05:45,340 --> 00:05:49,688
That's kind of where I started my position on this with
87
00:05:49,688 --> 00:05:53,771
There's no way they can just based on, on the architecture, right?
88
00:05:53,771 --> 00:05:55,322
It predicts the next token.
89
00:05:55,322 --> 00:06:00,046
It has no concept of, um, comprehension.
90
00:06:00,046 --> 00:06:05,809
Therefore reasoning seems to be far out of reach, but it does create the illusion of
reasoning.
91
00:06:05,809 --> 00:06:09,012
And you had an interesting argument, which was, does it matter?
92
00:06:09,012 --> 00:06:17,138
Um, so, I mean, let's start with: do LLMs reason, or do they create the illusion of reasoning?
93
00:06:17,631 --> 00:06:19,872
And yes, let's talk about that.
94
00:06:19,872 --> 00:06:25,133
I think a good precursor to that question is are LLMs conscious or are they not conscious?
95
00:06:25,133 --> 00:06:28,914
And that's another kind of academic exercise question that people have been thinking
about.
96
00:06:28,914 --> 00:06:31,675
You know, it gives the illusion of consciousness, right?
97
00:06:31,675 --> 00:06:35,606
And so, but of course, large language models, in my opinion, are not conscious, right?
98
00:06:35,606 --> 00:06:38,057
Because they are just mimicking consciousness.
99
00:06:38,217 --> 00:06:44,599
But philosophers for millennia have been saying consciousness is undefinable.
100
00:06:44,753 --> 00:06:48,245
Like, the only thing I can be conscious of is I know that I am conscious.
101
00:06:48,245 --> 00:06:53,468
But whether you are conscious or not or just a figment of my imagination is something I
will never know.
102
00:06:53,568 --> 00:06:56,590
All I know is that my own consciousness is a thing.
103
00:06:56,590 --> 00:07:05,555
So I think the question of whether large language models are conscious or not is kind of
just an academic exercise that really doesn't matter, right?
104
00:07:05,676 --> 00:07:11,098
Any more than I know whether Ted is conscious or not. We as a
105
00:07:11,609 --> 00:07:14,681
science, and we as philosophers, have never defined consciousness.
106
00:07:14,681 --> 00:07:24,646
Therefore the debate about consciousness is just an academic exercise. So let's now set
consciousness aside and let's talk about reasoning. The real question is, when I'm
107
00:07:24,646 --> 00:07:35,332
speaking with you, Ted, I have no idea whether your brain is reasoning or not. That's
because often we ourselves don't know how our brains are reasoning or not. The only
108
00:07:35,332 --> 00:07:40,729
way I can tell whether Ted is reasoning or not is through the words that come out of Ted's
mouth
109
00:07:40,729 --> 00:07:44,381
or the words that come out of Ted's keyboard as Ted is typing.
110
00:07:44,381 --> 00:07:51,925
And if those words look like reasoning, and if they quack like reasoning, then I could be
able to say Ted is probably reasoning.
111
00:07:51,985 --> 00:07:55,427
So maybe shouldn't we judge large language models in the same way?
112
00:07:55,427 --> 00:08:00,950
That if the output of the large language models looks like reasoning and quacks like
reasoning, then maybe it's reasoning.
113
00:08:00,950 --> 00:08:06,653
And that's what the large language model folks, the machine learning scientists, the data scientists,
call the duck test.
114
00:08:06,733 --> 00:08:10,269
That is, they know what goes into the black box.
115
00:08:10,269 --> 00:08:15,244
They have no idea what happens inside the black box and they know what comes out of the
black box.
116
00:08:15,244 --> 00:08:24,753
But if the output looks like reasoning and quacks like reasoning, maybe whether the black
box is reasoning or not matters not, just like it doesn't matter if I know how you are
117
00:08:24,753 --> 00:08:25,854
reasoning in your brain.
118
00:08:25,854 --> 00:08:27,695
All I know is your output too.
119
00:08:28,098 --> 00:08:29,200
Interesting.
120
00:08:29,584 --> 00:08:31,540
Can we test for reasoning?
121
00:08:32,177 --> 00:08:34,028
Yes, I think we can.
122
00:08:35,009 --> 00:08:38,691
the question is, what are the tasks that you're testing on?
123
00:08:38,712 --> 00:08:41,614
There are objective tasks, mathematical tasks.
124
00:08:41,614 --> 00:08:43,876
So you can imagine a mathematical proof.
125
00:08:43,876 --> 00:08:47,919
You could be able to test whether it's making its way through the mathematical proof or
not.
126
00:08:47,919 --> 00:08:50,761
You can test whether that is reasoning or not reasoning.
127
00:08:50,761 --> 00:08:51,962
Same with science.
128
00:08:51,962 --> 00:08:53,133
Is it getting science correct?
129
00:08:53,133 --> 00:08:54,994
Is it doing the scientific method correctly?
130
00:08:54,994 --> 00:08:55,985
Is it reasoning?
131
00:08:55,985 --> 00:08:59,237
Is it providing true causation rather than being a correlation?
132
00:08:59,237 --> 00:09:02,569
I think those are objective truths that you could be able to see reasoning.
133
00:09:02,569 --> 00:09:06,831
And I would say that the outputs for law are much, much different than that.
134
00:09:06,971 --> 00:09:12,735
That is, whether I made a good argument or not in front of this court is not objective.
135
00:09:12,735 --> 00:09:14,055
That is subjective.
136
00:09:14,055 --> 00:09:22,600
So I can't do a proof as to validity or invalidity any more than you could do a proof as
to lawyer one made a better argument than lawyer two.
137
00:09:22,600 --> 00:09:28,499
Ask 10 lawyers and you might get a 50-50 split on whether lawyer one made a better
argument or lawyer two made a better argument.
138
00:09:28,499 --> 00:09:36,899
going over to the transactional side, the contractual side, lawyer one might love this
clause, but lawyer two says that's the worst clause on the planet.
139
00:09:36,899 --> 00:09:40,759
There's no objective standard as to what is good legal work.
140
00:09:40,759 --> 00:09:49,999
And absent any objective standard as to good legal work, maybe what is good legal
reasoning is in the eye of the beholder, much like beauty is in the eye of the beholder.
141
00:09:49,999 --> 00:09:57,201
That is, absent any objective way to be able to say this was good legal reasoning or bad
legal reasoning,
142
00:09:57,201 --> 00:10:05,774
I guess the question of whether a large language model is providing good legal reasoning
or bad legal reasoning is unanswerable in the same way to say whether that human is doing
143
00:10:05,774 --> 00:10:07,995
good legal reasoning or bad legal reasoning.
144
00:10:07,995 --> 00:10:15,717
So I think this whole debate about reasoning or not reasoning is academic at best because
we should judge it by its outputs.
145
00:10:15,717 --> 00:10:22,479
And different lawyers will judge the outputs differently with humans, and they'll judge it
differently with large language models.
146
00:10:22,990 --> 00:10:23,190
Okay.
147
00:10:23,190 --> 00:10:34,290
I think that's true to an extent, but let's say I come in as an attorney and to make
my closing argument, I sing the theme song to Gilligan's Island.
148
00:10:34,290 --> 00:10:43,330
Um, I think that would universally be graded as bad legal
reasoning, right?
149
00:10:43,330 --> 00:10:53,062
So, so there is a spectrum and you know, obviously that's an extreme case, but I think
extreme cases are good to evaluate whether or not something's true.
150
00:10:53,106 --> 00:11:05,180
And so, yeah, I mean, if something is just universally looked at and every attorney, every
reasonable person that would evaluate it, says it's bad.
151
00:11:05,321 --> 00:11:09,701
Does that throw a monkey wrench into what you're putting forward there?
152
00:11:09,701 --> 00:11:10,406
No.
153
00:11:10,407 --> 00:11:11,097
Yeah, that's right.
154
00:11:11,097 --> 00:11:16,971
So you're right that it is a spectrum, that you have the worst argument on the planet,
which is just gibberish.
155
00:11:16,971 --> 00:11:21,914
And then there's the best argument on the planet that is going to win 100 out of 100
times.
156
00:11:21,914 --> 00:11:23,374
And same thing with contracts.
157
00:11:23,374 --> 00:11:26,366
There's the contract that's going to get the deal done 100 out of 100 times.
158
00:11:26,366 --> 00:11:29,518
And there's the contract that is going to fail 100 out of 100 times.
159
00:11:29,518 --> 00:11:32,720
So everything is along that spectrum.
160
00:11:32,720 --> 00:11:39,121
And then if you add a y-axis to that spectrum, there is a most common thing, that is the
head.
161
00:11:39,121 --> 00:11:42,573
And then there's a long tail of rare things that happen.
162
00:11:42,574 --> 00:11:47,717
So if you think about what the large language models are doing is largely giving you the
head distribution.
163
00:11:47,717 --> 00:11:53,261
That is the most common things because it's giving you a compressed version of the
training data set.
164
00:11:53,261 --> 00:11:57,384
so the head is almost never going to be Gilligan's Island.
165
00:11:57,504 --> 00:12:01,928
And the head is almost never going to be some of the worst contractual arguments ever
made.
166
00:12:01,928 --> 00:12:04,430
It's going to fall on the average on that side.
167
00:12:04,430 --> 00:12:06,749
And that actually is probably
168
00:12:06,749 --> 00:12:09,931
the right thing to do for the large language model in the legal task.
169
00:12:09,931 --> 00:12:17,605
Because you want the average, because you want 100 out of 100 lawyers, you want most of
the lawyers to say that's probably right.
170
00:12:17,725 --> 00:12:20,226
And that is the average distribution of this.
171
00:12:20,346 --> 00:12:30,242
And so really then, if we then say the x-axis and the y-axis and you have the head, the
most common things, and then you have the long tail, and you now say, OK, the large
172
00:12:30,242 --> 00:12:35,865
language models are going to take the head, not the long tail, then you have to say, OK,
what is that head?
173
00:12:35,865 --> 00:12:36,567
Is that
174
00:12:36,567 --> 00:12:39,288
Does it require legal reasoning or not?
175
00:12:39,508 --> 00:12:44,819
So let's talk about mathematics and science. We want to find new science,
right?
176
00:12:44,819 --> 00:12:48,200
We want to be able to create new cures to cancer, right?
177
00:12:48,200 --> 00:12:54,052
And we want to be able to do things that have never been done before. So does the
large language model need reasoning for that?
178
00:12:54,052 --> 00:12:57,023
Absolutely, because that's not part of the training set, right?
179
00:12:57,023 --> 00:13:04,195
That's not part of something that we can look backward at, so we need reasoning for new
science. We need reasoning for new mathematics.
180
00:13:04,195 --> 00:13:12,061
You need reasoning for something that's never been done before, where you need somebody like
Einstein, somebody who is once in a generation, to be able to go forward and
181
00:13:12,061 --> 00:13:13,342
leap forward.
182
00:13:13,342 --> 00:13:15,083
Contrast that with the law.
183
00:13:15,664 --> 00:13:19,707
How much new thinking do we really need to do in the law?
184
00:13:19,828 --> 00:13:24,531
In contrast, how much of the law is looking backward that is looking to precedent?
185
00:13:24,612 --> 00:13:32,938
If I am a lawyer arguing in court and I say, judge, I've got this really brand
new idea that nobody's ever won on before, it just sprouted out of my brain.
186
00:13:32,938 --> 00:13:33,693
What do you think?
187
00:13:33,693 --> 00:13:35,364
The judge is going to say, show me a case.
188
00:13:35,364 --> 00:13:41,809
And if I can't show him a case, if I can't show her a statute, I lose because it's not
based on precedent.
189
00:13:42,009 --> 00:13:45,212
So do we really need new things in litigation?
190
00:13:45,212 --> 00:13:47,693
Do we really need new things in transactional work?
191
00:13:47,693 --> 00:13:49,875
Do we really need new things in advisory work?
192
00:13:49,875 --> 00:13:51,856
Do we need new things in regulatory work?
193
00:13:51,856 --> 00:13:55,339
And I think the answer to all four of those is no, because you're always looking to
precedent.
194
00:13:55,339 --> 00:13:56,580
You're always looking to statutes.
195
00:13:56,580 --> 00:13:59,441
You're always looking to something that is in the data set.
196
00:13:59,522 --> 00:14:01,453
So if it is in the data set,
197
00:14:01,777 --> 00:14:08,954
Really, all of our reasoning that is legal is backward looking, not forward looking like
in mathematics or in science.
198
00:14:08,954 --> 00:14:10,616
It is all backward looking.
199
00:14:10,616 --> 00:14:18,113
So if it's all backward looking, is all legal reasoning really just recombining the data set
that we have?
200
00:14:18,648 --> 00:14:23,852
Well, what about novel pieces of regulation that now have to be interpreted?
201
00:14:23,852 --> 00:14:33,260
Is there not new legal thinking that has to take place to evaluate the applicability in
those scenarios?
202
00:14:34,008 --> 00:14:39,633
There is, but I would say that the data is taken care of through what's called
interpolation.
203
00:14:39,633 --> 00:14:45,028
And so with the large language models, they connect concepts.
204
00:14:45,028 --> 00:14:48,172
I'm going to share my screen on this.
205
00:14:48,172 --> 00:14:48,964
is it possible?
206
00:14:48,964 --> 00:14:49,233
It's cool.
207
00:14:49,233 --> 00:14:57,831
So I'm going to pull up a PowerPoint to actually demonstrate a real live case that I had.
208
00:14:58,077 --> 00:15:05,372
So for the less sophisticated, and maybe the more sophisticated, we'll recap how large language
models work, which is that they pull out concepts.
209
00:15:05,573 --> 00:15:08,734
And they pull out concepts and put them into what's called vector space.
210
00:15:08,955 --> 00:15:19,963
And so you can imagine a two-dimensional vector space that the ideas of a faucet and a
sink and a vanity are probably pretty close together in that two-dimensional vector space.
211
00:15:19,963 --> 00:15:24,306
And then you could be able to say, OK, let's go ahead and put that in three-dimensional
vector space with a z-axis.
212
00:15:24,306 --> 00:15:25,903
And then you could be able to say, OK, these
213
00:15:25,903 --> 00:15:29,085
All similar things are kind of clustered together as ideas.
214
00:15:29,085 --> 00:15:34,048
And now add a fourth dimension, and our brains can't even figure out what that fourth
dimension would look like.
215
00:15:34,048 --> 00:15:36,169
Now add a 10th dimension.
216
00:15:36,169 --> 00:15:37,910
Now add a 100th dimension.
217
00:15:37,910 --> 00:15:41,831
Now add a 1,000th dimension and add a 12,000th dimension.
218
00:15:42,412 --> 00:15:45,634
And 12,000 dimensional vector space is where large language models live.
219
00:15:45,634 --> 00:15:55,123
And somewhere in that 12,000 dimensional vector space lives Ernest Hemingwayness and Bob
Dylanness and Pablo Picassoness.
220
00:15:55,123 --> 00:15:57,803
that lives in 12,000 dimensional vector space.
221
00:15:57,803 --> 00:16:03,803
So all of the things that are legal concepts live somewhere in that 12,000 dimensional
vector space.
222
00:16:03,803 --> 00:16:09,163
And all the facts in the world live somewhere in 12,000 dimensional vector space.
223
00:16:09,163 --> 00:16:16,683
And so what you can imagine, to your question, is: isn't it going to combine some novel
things?
224
00:16:16,683 --> 00:16:19,103
I would say, yes, it will combine them.
225
00:16:19,103 --> 00:16:22,831
But the thing is, how many of those things are
226
00:16:22,831 --> 00:16:25,592
already in the large language model's vector space?
227
00:16:25,592 --> 00:16:33,994
And then combining those is what's called, the data scientists would say, connecting the
latent space between those two disparate concepts.
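A toy sketch of the vector-space idea described here, assuming nothing beyond the explanation above: the four-dimensional vectors below are invented for illustration, and the embed step they stand in for would be a real embedding model with thousands of dimensions. Nearby concepts (faucet, sink) score high on cosine similarity, while "connecting the latent space" is the move of combining two concepts that sit far apart.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 4-dimensional stand-ins for real embeddings.
toy_vectors = {
    "faucet":              [0.9, 0.1, 0.0, 0.1],
    "sink":                [0.8, 0.2, 0.1, 0.1],
    "privacy law":         [0.1, 0.9, 0.3, 0.0],
    "emotion recognition": [0.0, 0.7, 0.8, 0.1],
}

# Close concepts cluster; distant ones can still be combined.
print(cosine(toy_vectors["faucet"], toy_vectors["sink"]))
print(cosine(toy_vectors["privacy law"], toy_vectors["emotion recognition"]))
```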
228
00:16:33,994 --> 00:16:40,276
So now, as I'm sharing my screen, this is the concept to think through.
229
00:16:40,451 --> 00:16:42,877
A friend of mine works for an insurance company.
230
00:16:42,877 --> 00:16:48,991
And she asked, what do you think about this thing called affective computing?
231
00:16:48,991 --> 00:16:51,139
What do you think of affective computing?
232
00:16:51,247 --> 00:16:55,411
And I said, I'm a pretty technical guy, but I'm sad to say I don't know what affective
computing is.
233
00:16:55,411 --> 00:17:01,815
So what I did is I went to the large language model and said, define affective computing
in the context of insurance and the law.
234
00:17:02,236 --> 00:17:05,739
And she's an insurance in-house lawyer.
235
00:17:05,739 --> 00:17:13,466
So it says, well, affective computing is how computers recognize human emotions and facial
expressions and voice patterns to create emotionally aware agents.
236
00:17:13,466 --> 00:17:14,226
I said, cool.
237
00:17:14,226 --> 00:17:20,621
Now analyze how affective computing can be used in an insurance call center, because
that's how my friend's company was thinking about using it.
238
00:17:20,959 --> 00:17:28,185
It said, well, you could use it for emotional recognition, figuring out the caller's
emotional state, figuring out their choice of words, how quickly they're speaking, how
239
00:17:28,185 --> 00:17:38,604
emotional they are after an accident or loss. I said, cool, now give me a list of potential
legal issues that could stem from using affective computing in a call center. And it said,
240
00:17:38,604 --> 00:17:48,802
have you thought about privacy law, like GDPR or CCPA? Have you thought about
consent, and whether that caller consented to you analyzing their emotions? Have you
241
00:17:48,802 --> 00:17:50,203
thought about if you get hacked?
242
00:17:50,203 --> 00:17:54,796
What if all of your client's emotional data is in the hands of a hacker?
243
00:17:54,796 --> 00:17:55,987
What's that going to do legally?
244
00:17:55,987 --> 00:17:57,748
What's that going to do with PR?
245
00:17:57,748 --> 00:17:59,870
These are all good legal concepts.
246
00:17:59,870 --> 00:18:07,655
And I would guess that zero times has anyone ever asked about the legal aspects of
affective computing.
247
00:18:07,675 --> 00:18:15,140
But what it's done is it knows what affective computing is, and it knows what privacy law
is, it knows what consent is, it knows what data security is.
248
00:18:15,140 --> 00:18:19,409
So it's connecting the latent space between the concept of affective computing
249
00:18:19,409 --> 00:18:21,280
and the concept of privacy law.
250
00:18:21,280 --> 00:18:23,681
And I then said, give me some sub-bullets.
251
00:18:23,681 --> 00:18:28,604
And now it's going to continue expanding upon the concepts of which jurisdictions people
are calling in from.
252
00:18:28,604 --> 00:18:29,525
What types of data?
253
00:18:29,525 --> 00:18:30,395
Third party sharing.
254
00:18:30,395 --> 00:18:32,086
Are you minimizing the data?
255
00:18:32,126 --> 00:18:35,628
Each one of these things lives somewhere in vector space.
256
00:18:35,628 --> 00:18:43,152
So it's merely combining the concept of affective computing with the concepts of privacy law and
consent and data security.
257
00:18:43,152 --> 00:18:48,705
That way we can then combine those aspects in new ways that haven't been in the training
set.
258
00:18:48,829 --> 00:18:50,059
So I think that's where it is.
259
00:18:50,059 --> 00:18:58,343
Almost everything that we do as lawyers, everything we do, is connecting my
client's facts to the existing laws.
260
00:18:58,503 --> 00:19:01,724
And your client's facts are almost certainly in the training set.
261
00:19:01,724 --> 00:19:09,648
And the existing laws, if you are training on actual non-hallucinated cases, statutes, and
regulations, those are also in the training set.
262
00:19:09,648 --> 00:19:18,329
So really, reasoning is just being able to connect those existing facts in the data set
with the existing laws in the data set and saying how they relate to each other.
263
00:19:18,329 --> 00:19:21,945
if you have the actual non-hallucinated cases, statutes, and regulations.
264
00:19:22,604 --> 00:19:23,615
That's super interesting.
265
00:19:23,615 --> 00:19:30,978
So I find it, I have to think through this, but it seems shocking to me that there are no
novel concepts.
266
00:19:30,978 --> 00:19:39,113
Um, what you've just described is combining two things that currently exist in the
training material, right?
267
00:19:39,113 --> 00:19:51,509
That the LLM has vectorized and plotted in 12,000 dimensions, and it knows the
associations and the latent space between them.
268
00:19:51,830 --> 00:19:52,588
But
269
00:19:52,588 --> 00:20:08,665
What about new areas of law like when we start selling real estate on the moon, that
obviously at some point will make its way in, but until it does, how will it navigate
270
00:20:08,665 --> 00:20:10,366
scenarios like that?
271
00:20:10,579 --> 00:20:13,270
So I guess the question is where do those areas of law come from?
272
00:20:13,270 --> 00:20:14,660
And they come from regulations.
273
00:20:14,660 --> 00:20:15,861
They come from statutes.
274
00:20:15,861 --> 00:20:17,321
They come from cases.
275
00:20:17,701 --> 00:20:21,262
And those cases, statutes, and regulations are reflected in documents.
276
00:20:21,402 --> 00:20:30,765
And if the system has those documents, the cases, the statutes, and the regulations, then
the system will be able to plot those in vector space and then be able to take those legal
277
00:20:30,765 --> 00:20:35,446
concepts and apply them to the factual concepts that are also in vector space.
278
00:20:35,446 --> 00:20:39,067
So really, every single area of law is written somewhere.
279
00:20:39,067 --> 00:20:40,990
It has to be, otherwise it's not a law.
280
00:20:40,990 --> 00:20:43,073
And if it's written, it can be vectorized.
281
00:20:43,073 --> 00:20:45,286
So really everything that we do is part of the training set.
282
00:20:45,286 --> 00:20:53,837
There is really no novelty that is needed in the law because everything is necessarily
backward looking at the cases, the statutes, the regulations that are binding.
283
00:20:54,316 --> 00:20:54,987
Interesting.
284
00:20:54,987 --> 00:21:00,852
You had a metaphor I had not heard before with anesthesia.
285
00:21:00,852 --> 00:21:07,478
And I think you had a friend who was an anesthesiologist.
286
00:21:07,478 --> 00:21:08,058
Yes.
287
00:21:08,058 --> 00:21:10,170
And I have trouble saying that word.
288
00:21:10,170 --> 00:21:13,683
So I'll just say anesthesiology.
289
00:21:13,683 --> 00:21:17,376
Explain that, because I thought that was an interesting metaphor.
290
00:21:17,511 --> 00:21:19,001
Yeah, she told me something.
291
00:21:19,001 --> 00:21:26,744
We were over a campfire and it freaked me out, and it may freak out your listeners. But
yeah, she said, Damien.
292
00:21:26,744 --> 00:21:28,124
Do you realize we have no idea?
293
00:21:28,124 --> 00:21:29,435
She's a nurse anesthetist, right?
294
00:21:29,435 --> 00:21:39,037
So she puts people under every single day, and she has, I think, a master's degree in
anesthesiology. So she said, do you realize we have no idea how anesthesia works?
295
00:21:39,298 --> 00:21:40,978
I said, wait, say that again.
296
00:21:40,978 --> 00:21:44,143
She said, yeah, one of two options. Option number one
297
00:21:44,143 --> 00:21:51,830
is it does what everybody thinks that it does, is that it puts us to sleep and we don't
feel that scalpel going into our bellies and then we come out and we're all fine, right?
298
00:21:51,830 --> 00:21:53,271
That's option number one.
299
00:21:53,271 --> 00:21:57,134
Option number two is we feel every single cut.
300
00:21:57,234 --> 00:22:01,317
And what anesthesia does is to give us amnesia to make us forget.
301
00:22:01,698 --> 00:22:04,381
We don't know whether it's option one or option two.
302
00:22:04,381 --> 00:22:07,503
That scares the crap out of me and it might well scare the crap out of you.
303
00:22:07,503 --> 00:22:12,187
But the question is, do we not use anesthesia because we don't know how it works?
304
00:22:12,891 --> 00:22:20,977
No, of course we use anesthesia because the real question is does it work and is it
effective as to what we would like it to do?
305
00:22:20,977 --> 00:22:28,281
If the answer to both those things is yes, then how it works maybe matters less than the
fact that it does work.
306
00:22:28,502 --> 00:22:32,124
So apply that anesthesia test to reasoning.
307
00:22:32,525 --> 00:22:41,661
And just like I can't tell whether you're reasoning in Ted's brain or not, I can gauge you
by your output, by your speech,
308
00:22:41,661 --> 00:22:44,133
by your words coming out of your keyboard.
309
00:22:44,193 --> 00:22:47,856
And if that works, I say you're reasoning.
310
00:22:48,597 --> 00:22:51,000
whether I know how your brain works doesn't matter.
311
00:22:51,000 --> 00:22:54,302
And whether I know how anesthesia works doesn't matter.
312
00:22:54,423 --> 00:22:58,106
I'm sorry, whether I know how anesthesia works doesn't matter.
313
00:22:58,106 --> 00:23:00,008
The fact that it does work matters.
314
00:23:00,008 --> 00:23:07,934
So the fact is that a large language model does create output that seems like it is
reasonable and is reasoning, just like a human is reasoning.
315
00:23:08,763 --> 00:23:19,705
If the large language model output is indistinguishable from Ted's output as
reasonable, then I would say whether it is actual reasoning and how it's reasoning doesn't
316
00:23:19,705 --> 00:23:23,929
really matter, any more than it matters whether we know how anesthesia works.
317
00:23:24,322 --> 00:23:26,784
Yeah, that is disturbing to think about.
318
00:23:27,705 --> 00:23:31,109
But it's a valuable metaphor.
319
00:23:31,109 --> 00:23:33,791
Now here's what I would say in response to that.
320
00:23:33,791 --> 00:23:40,577
Did you have a chance to look at the Apple Intelligence team's study with the GSM 8K?
321
00:23:41,688 --> 00:23:43,371
Only in the two minutes before you sent it.
322
00:23:43,371 --> 00:23:45,395
So why don't you describe it and maybe I can react to it.
323
00:23:45,395 --> 00:23:45,825
Yeah.
324
00:23:45,825 --> 00:24:00,567
So, um, it's only five weeks old, so it's very new, but one benchmark that has been
used pretty widely to test reasoning in large language models is the GSM-8K, which
325
00:24:00,567 --> 00:24:06,011
stands for grade school math; the 8K because there's 8,000 of these questions.
326
00:24:06,252 --> 00:24:14,198
And what Apple did was modify these questions ever so slightly
327
00:24:14,198 --> 00:24:19,000
and evaluate the LLMs' performance against those modifications.
328
00:24:19,000 --> 00:24:20,780
And it was pretty dramatic.
329
00:24:20,880 --> 00:24:34,704
So their conclusions were, and I quote, the performance of all models declines when only the
numerical values in the question are altered in the GSM-Symbolic benchmark.
330
00:24:34,885 --> 00:24:37,705
That's pretty interesting.
331
00:24:38,326 --> 00:24:39,202
It says,
332
00:24:39,202 --> 00:24:45,283
their performance significantly deteriorates as the number of clauses in the question
increases.
333
00:24:45,604 --> 00:24:54,726
And then its conclusion is we hypothesize that this decline is due to the fact that
current LLMs are not capable of genuine logical reasoning.
334
00:24:55,205 --> 00:25:06,349
And I thought there were a few examples in this specifically that really, I guess, were
telling.
335
00:25:06,389 --> 00:25:08,736
So let me see if I can find this here.
336
00:25:08,736 --> 00:25:23,130
So, um, one of these, uh, these are word problems and in one of the word problems, they,
I'm not going to be able to find it, but I remember enough about it to, um, articulate it.
337
00:25:23,170 --> 00:25:31,473
What they did was, in the problem, they threw in a sentence that had nothing to do with the
problem itself, and it completely blew up the problem.
338
00:25:31,473 --> 00:25:36,302
Um, the sentence that they put in there, the question, was something like,
339
00:25:36,302 --> 00:25:53,402
You know, if the current prices of keyboards and mouse pads are $5 and $10 respectively, and
inflation has increased by 10% each year, and that was the part that had nothing to do with it.
340
00:25:53,402 --> 00:25:55,592
Tell us what the current price is, right?
341
00:25:55,592 --> 00:25:57,352
It's already given you the information.
342
00:25:57,352 --> 00:26:01,302
The fact that inflation increased 10% has nothing to do with it.
343
00:26:01,302 --> 00:26:05,362
And it plummeted the
344
00:26:05,390 --> 00:26:10,092
accuracy of the large language models' responses by something like 65%.
345
00:26:10,092 --> 00:26:14,253
It varied wildly as you would expect.
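A rough sketch of the kind of perturbation being described, under stated assumptions: the template, the names and prices, and the ask_model call are all invented here, with ask_model standing in for whatever LLM call you would actually use. The idea is simply to regenerate the same word problem with different names and numbers, optionally appending an irrelevant clause, and check whether the model's answer survives.

```python
import random

TEMPLATE = ("{name} buys {n_keyboards} keyboards at ${kb_price} each and "
            "{n_pads} mouse pads at ${pad_price} each. {distractor} "
            "How much does {name} spend in total?")

def make_variant(seed):
    # Build one perturbed question plus its gold answer.
    rng = random.Random(seed)
    n_kb, n_pad = rng.randint(2, 9), rng.randint(2, 9)
    kb, pad = rng.choice([5, 6, 7]), rng.choice([10, 12, 15])
    question = TEMPLATE.format(
        name=rng.choice(["Sophie", "Lisa", "Omar"]),   # name swap
        n_keyboards=n_kb, kb_price=kb,
        n_pads=n_pad, pad_price=pad,
        distractor=rng.choice(["", "Inflation has been 10% per year."]),  # irrelevant clause
    )
    gold = n_kb * kb + n_pad * pad  # the distractor never changes the answer
    return question, gold

def ask_model(question):
    # Placeholder for a real LLM call.
    raise NotImplementedError

def accuracy(n_variants=100):
    correct = 0
    for seed in range(n_variants):
        q, gold = make_variant(seed)
        if ask_model(q) == gold:
            correct += 1
    return correct / n_variants
```

A robust solver should score the same with or without the distractor; a large gap is the kind of drop the study reports.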
346
00:26:15,614 --> 00:26:30,560
The latest models, the ones using chain of thought, did the best, but it seemed to me that
this really pokes a hole in the whole concept.
347
00:26:30,560 --> 00:26:35,412
Because what that points to is, if you throw in a sentence that has nothing to do
with the problem,
348
00:26:35,486 --> 00:26:38,390
and I can't handle it, that means I haven't comprehended the problem.
349
00:26:38,390 --> 00:26:40,934
I don't know what the problem is, right?
350
00:26:40,934 --> 00:26:48,694
I'm simply reciting answers, and you know, it's what I honestly would expect, but I
don't know.
351
00:26:48,694 --> 00:26:50,366
What is your response to that?
352
00:26:50,483 --> 00:26:59,883
Yeah, so I would say two responses, one of which is the idea that mathematics has a right
answer and a wrong answer, whereas legal often does not.
353
00:26:59,883 --> 00:27:05,943
That is, in litigation, it's whatever argument happens to win, and in transactional work,
it's whatever gets the deal done.
354
00:27:05,943 --> 00:27:14,623
So, where in the mathematical proof, you have a right answer or a wrong answer, whereas in
legal, there is the eye of the beholder, where there is no objective, there's merely the
355
00:27:14,623 --> 00:27:15,163
subjective.
356
00:27:15,163 --> 00:27:16,863
So that's thing number one.
357
00:27:16,863 --> 00:27:19,275
Thing number two is, of course,
358
00:27:19,275 --> 00:27:29,501
With mathematics you want to be able to create new mathematics and be able to go forward
with new scenarios. But again, law never has. It's always looking backward to precedent,
359
00:27:29,501 --> 00:27:38,016
looking backward to cases, looking backward to the contracts, like we've always done the
contract in this way. And we know that in this industry and this jurisdiction, force majeure
360
00:27:38,016 --> 00:27:48,281
clauses need to be in this way. This is always backward looking. So really, two things:
non-objectivity in the law where there is objectivity in math,
361
00:27:48,281 --> 00:27:52,613
and backward looking in the law rather than forward looking with mathematics.
362
00:27:52,754 --> 00:28:01,139
So yes, it'll throw off the mathematics by throwing in that inflationary detail, and it
won't really reason in that way.
363
00:28:01,179 --> 00:28:11,036
But I think for our use cases in the law, whether it's a transactional use case, a
litigation use case, an advisory use case or regulatory use case, all of the stuff is
364
00:28:11,036 --> 00:28:11,806
there.
365
00:28:11,806 --> 00:28:17,843
And if we use the chain of thought like you've talked about, then it could probably
overcome the lack of true
366
00:28:17,843 --> 00:28:19,784
quote unquote reasoning that we have.
367
00:28:19,784 --> 00:28:23,767
And we as humans are really good at separating wheat from chaff.
368
00:28:23,767 --> 00:28:30,691
And so you can imagine, you know, scenario one is everybody takes the robot's output and
doesn't touch it.
369
00:28:30,892 --> 00:28:33,634
That's a bad scenario under anybody's estimation.
370
00:28:33,634 --> 00:28:39,838
But almost everybody's in scenario two where it gives an output and then you look over
that output and get it out the door.
371
00:28:39,838 --> 00:28:43,040
Under scenario two, you're going to separate that wheat from the chaff.
372
00:28:43,080 --> 00:28:47,357
And so until we have autonomous legal bots, which
373
00:28:47,357 --> 00:28:49,691
God help us if we have that, right?
374
00:28:49,733 --> 00:28:52,811
But until we have that, you're always gonna have that human oversight.
375
00:28:52,811 --> 00:28:57,291
So really, whether it's reasoning or not, is gonna be pretty easily flagged.
376
00:28:57,528 --> 00:28:58,199
Yeah.
377
00:28:58,199 --> 00:29:02,102
And they, they, it wasn't just, um, there were other ways that they tested it.
378
00:29:02,102 --> 00:29:04,344
They actually changed some of the numbers.
379
00:29:04,344 --> 00:29:06,525
What was interesting is that that also threw it off.
380
00:29:06,525 --> 00:29:07,967
And this part surprised me.
381
00:29:07,967 --> 00:29:11,039
I thought AI would, I thought LLMs would figure this out.
382
00:29:11,039 --> 00:29:12,711
They changed the names.
383
00:29:12,711 --> 00:29:15,873
So instead of Sophie, they put Lisa, right?
384
00:29:15,873 --> 00:29:17,655
But they did it consistently throughout.
385
00:29:17,655 --> 00:29:21,858
Like it should be able to, so anyway, it's a new study.
386
00:29:21,858 --> 00:29:26,968
There's still a lot to be analyzed
387
00:29:26,968 --> 00:29:29,559
from it, but I did think it was interesting.
388
00:29:30,480 --> 00:29:37,805
Speaking of studies, the Stanford study, there's been a lot of conversation about it.
389
00:29:37,986 --> 00:29:43,068
The second iteration of that came out in May.
390
00:29:43,068 --> 00:29:56,738
you know, there was a, obviously there's companies out there that put a lot of money and
effort into these tools and Stanford was pretty pointed in their, in their commentary and
391
00:29:56,770 --> 00:30:01,013
You know, there was a lot of feedback that the study was biased.
392
00:30:01,013 --> 00:30:03,054
I read it multiple times.
393
00:30:03,054 --> 00:30:06,857
It's about 30 pages and it's a really, it's an easy read.
394
00:30:06,857 --> 00:30:09,549
Like reading scientific papers is usually rough going.
395
00:30:09,549 --> 00:30:11,660
That one was really easy to read.
396
00:30:11,680 --> 00:30:15,523
And I thought, I didn't see the bias.
397
00:30:15,523 --> 00:30:19,125
They did try and trick the tools, and it was upfront about that.
398
00:30:19,125 --> 00:30:23,178
Just, just like the, Apple study tried to trick AI, right?
399
00:30:23,178 --> 00:30:25,846
That's kind of part of testing is, you know,
400
00:30:25,846 --> 00:30:30,392
evaluating, you're going to throw curveballs and see how the model responds.
401
00:30:30,392 --> 00:30:32,595
But, you know, what was your take on the study?
402
00:30:32,595 --> 00:30:37,792
Did you feel there were biases or did you think it was fair?
403
00:30:38,411 --> 00:30:49,384
Two thoughts on that and not to throw shade on the Stanford folks, you can imagine that
one issue I have with them is that the terminology that they used for hallucinations, I
404
00:30:49,384 --> 00:30:53,516
think they conflated hallucinations with just getting the wrong legal answer.
405
00:30:53,516 --> 00:30:54,906
Those are two different things, right?
406
00:30:54,906 --> 00:31:06,131
There is a hallucination, where it just makes things up, and then there is where Ted
and I disagree as to where the law ends up, and number two is not hallucination.
407
00:31:06,131 --> 00:31:08,331
That is just us disagreeing.
408
00:31:08,331 --> 00:31:10,671
again, with the law, there may not be a right answer.
409
00:31:10,671 --> 00:31:16,261
And the reason there is litigation is because the reasonable minds can disagree as to what
is the right answer or not the right answer.
410
00:31:16,261 --> 00:31:18,951
So a court has to be able to resolve that dispute.
411
00:31:19,431 --> 00:31:24,071
a disagreement as to the output is not hallucination.
412
00:31:24,071 --> 00:31:32,711
So number one, the quibble I had is with the terminology, that they call everything
hallucination, where really we should focus that on the confabulations that the large language
413
00:31:32,711 --> 00:31:33,271
models do.
414
00:31:33,271 --> 00:31:34,771
That's thing number one.
415
00:31:34,771 --> 00:31:38,730
Thing number two goes to trying to trick the model in the ways that you talked about.
416
00:31:38,730 --> 00:31:41,101
And this goes to the product side of me.
417
00:31:41,101 --> 00:31:42,051
I'm a product guy.
418
00:31:42,051 --> 00:31:43,410
You're a product guy.
419
00:31:43,951 --> 00:31:48,011
We, as product people, say, what are the most common user pathways?
420
00:31:48,011 --> 00:31:49,731
What are the most common user behaviors?
421
00:31:49,731 --> 00:31:53,611
And we want to be able to build products that are based on those most common user
behaviors.
422
00:31:53,611 --> 00:32:01,467
And going back to my x and y-axis, this is the head and the long tail, where you have the
most common things done are the head.
423
00:32:01,467 --> 00:32:06,710
And the weirdest, strangest things that you would never think a user would ever do is in
the long tail.
424
00:32:06,911 --> 00:32:15,717
And so the things that they were asking were things like, when Justice Ruth Bader Ginsburg
dissented in this case, what does that mean?
425
00:32:15,717 --> 00:32:24,234
Where a user would never ask that, because that user would know that Ruth Bader Ginsburg
didn't dissent in that case.
426
00:32:24,234 --> 00:32:26,365
She was the concurrence in that case.
427
00:32:26,365 --> 00:32:29,143
So asking a question like that is
428
00:32:29,143 --> 00:32:32,245
way, way down on the long tail distribution curve.
429
00:32:32,245 --> 00:32:34,347
That is not the most common use case.
430
00:32:34,347 --> 00:32:42,793
So really, if they were to do the study correctly, they would take, they would say, what
are the most common questions made by lawyers?
431
00:32:42,793 --> 00:32:50,848
The most common questions made by law students, the most common questions, and then
collect those most common questions, randomly distribute those most common questions, and
432
00:32:50,848 --> 00:32:55,742
then say, based on those most common questions, or I guess not even most common, they
would take the entire distribution curve.
433
00:32:55,742 --> 00:32:57,723
They would take the head and the tail.
434
00:32:57,723 --> 00:32:59,404
Mix that up in a randomized study.
435
00:32:59,404 --> 00:33:03,907
So there will be some long tail questions, some head questions.
436
00:33:03,907 --> 00:33:10,072
And then from that random distribution, then run those through and see how many
confabulations slash hallucinations are there.
437
00:33:10,072 --> 00:33:12,013
That would be a reasonable way to do it.
438
00:33:12,013 --> 00:33:15,315
That would be most aligned with how users use the tools.
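A minimal sketch of the sampling approach being proposed, assuming you had a log of real queries and their frequencies (the log below is invented for illustration): sample evaluation questions in proportion to how often users actually ask them, so the head dominates but the long tail still shows up occasionally.

```python
import random

# Hypothetical query log: question -> how many times real users asked it.
query_log = {
    "What is the statute of limitations for breach of contract?": 500,
    "What are the elements of negligence?": 350,
    "Summarize the holding of this case.": 200,
    "Did Justice Ginsburg dissent in this case? (trick question)": 1,  # long tail
}

def sample_eval_set(log, k, seed=0):
    # Weighted sampling: head questions appear often, tail questions rarely.
    rng = random.Random(seed)
    questions = list(log.keys())
    weights = list(log.values())
    return rng.choices(questions, weights=weights, k=k)

eval_set = sample_eval_set(query_log, k=20)
```

The resulting set mirrors the real distribution of use, which is the point of the critique: evaluate on what users actually ask, head and tail mixed in their true proportions.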
439
00:33:15,315 --> 00:33:25,803
So by asking a long tail question that arguably would zero times out of 1,000 ever be
asked, zero times out of 1 million ever be asked, nobody would ever ask a question to
440
00:33:25,803 --> 00:33:26,703
trick it.
441
00:33:26,811 --> 00:33:32,754
I would say that's really not doing what the system should be doing for its purpose.
442
00:33:32,754 --> 00:33:43,088
The purpose of legal research is to answer legal questions and it's not to gauge the
reasoning of the underlying large language models, GPT-4 or GPT-3.5 or whatever was in
443
00:33:43,088 --> 00:33:44,339
place at the time.
444
00:33:44,339 --> 00:33:46,920
And what they were doing was tricking GPT-3.5.
445
00:33:46,920 --> 00:33:50,861
They weren't tricking the legal research tools that were relying on GPT-3.5.
446
00:33:50,961 --> 00:33:55,097
So I would say that you should test the thing that is the product,
447
00:33:55,097 --> 00:33:58,605
not the underlying model that the product is using.
448
00:33:59,032 --> 00:34:08,042
So how is that different than throwing in a sentence about inflation being 10 % in the GSM
8K scenario?
449
00:34:08,711 --> 00:34:15,776
So I guess if we're looking at a legal product, like are we testing reasoning or are we
testing how well the product works?
450
00:34:15,776 --> 00:34:17,597
Because those are two different goals.
451
00:34:17,597 --> 00:34:22,440
Because really testing reasoning is testing the foundational model, GPT-4, GPT-3.5.
452
00:34:22,440 --> 00:34:31,386
But if you're testing how well the product works for its intended purpose, then the
question then would be, would the user input usually include that inflationary number?
453
00:34:31,446 --> 00:34:37,950
If the answer is yes, the user input would include that, then yes, we should definitely
include that in the distribution of the user input.
454
00:34:38,988 --> 00:34:46,838
But if zero times out of 100 million would they include that inflationary number, then that
doesn't seem right, because if you're really testing the product, that is not a use case
455
00:34:46,838 --> 00:34:48,581
that the users would ever use.
456
00:34:48,728 --> 00:34:50,409
Yeah, I guess, yeah.
457
00:34:50,409 --> 00:34:51,659
OK, fair.
458
00:34:51,659 --> 00:35:00,383
So in section 6.2 of the study, and I do agree, they say hallucinations can be insidious.
459
00:35:00,423 --> 00:35:10,907
And then some of the scenarios that they document don't seem like hallucination scenarios,
like misunderstanding holdings.
460
00:35:10,907 --> 00:35:17,690
Systems do not seem capable of consistently making out the holding of a case,
distinguishing between legal actors.
461
00:35:17,822 --> 00:35:28,089
systems fail to distinguish between arguments made by litigants and statements made by the
court, respecting order of authority, models strain in grasping hierarchies of legal
462
00:35:28,089 --> 00:35:28,550
authority.
463
00:35:28,550 --> 00:35:30,461
Yeah, those aren't hallucinations.
464
00:35:30,461 --> 00:35:34,764
Those are just limitations of the model itself, it seems.
465
00:35:34,827 --> 00:35:39,469
That's right, limitations of the model and perhaps limitations of the system that is using
that model.
466
00:35:39,469 --> 00:35:49,352
So you could imagine that if the system were to say as part of the metadata that a trial
court is below this particular appellate court, which is below this particular Supreme
467
00:35:49,352 --> 00:36:01,055
Court, and models that hierarchy of courts in a symbolic way, not a large language model
way, but in a symbolic coded up way, then that system could avoid the,
468
00:36:01,139 --> 00:36:10,279
confabulation between the district court and the appellate court level because the
guardrails of the symbolic AI would prevent that kind of misunderstanding.
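A toy illustration of the symbolic guardrail idea just described, not any product's actual implementation: encode the court hierarchy as plain data and check a generated claim against it, rather than trusting the model's text. The court names and the checker are illustrative only.

```python
# Map each court to the court directly above it; None marks the top.
HIERARCHY = {
    "U.S. District Court (D. Minn.)": "U.S. Court of Appeals (8th Cir.)",
    "U.S. Court of Appeals (8th Cir.)": "Supreme Court of the United States",
    "Supreme Court of the United States": None,
}

def is_above(higher, lower):
    """Return True if `higher` sits somewhere above `lower` in the chain."""
    court = HIERARCHY.get(lower)
    while court is not None:
        if court == higher:
            return True
        court = HIERARCHY.get(court)
    return False

# A generated sentence claiming the district court reviews the circuit court
# would fail this check and could be flagged before it reaches the user.
assert is_above("Supreme Court of the United States",
                "U.S. District Court (D. Minn.)")
assert not is_above("U.S. District Court (D. Minn.)",
                    "U.S. Court of Appeals (8th Cir.)")
```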
469
00:36:10,919 --> 00:36:21,399
So is Stanford analyzing the large language model output or are they analyzing the
system's coding, that is the hard coding, to be able to say that this trial court is below
470
00:36:21,399 --> 00:36:23,659
the appellate court, which is below the Supreme Court?
471
00:36:23,659 --> 00:36:30,773
I think that that is maybe a reasonable critique that if the system
472
00:36:30,963 --> 00:36:34,764
is not recognizing that hierarchy, then maybe the system should.
473
00:36:34,784 --> 00:36:40,026
So I would say that's maybe a reasonable critique if you're really looking at that.
474
00:36:40,666 --> 00:36:42,327
So yeah, maybe two scenarios.
475
00:36:42,327 --> 00:36:52,500
The Ruth Bader Ginsburg question is unreasonable, but knowing whether the district court was
overruled by the appellate court, which was then reversed by the Supreme Court, a legal
476
00:36:52,500 --> 00:36:56,611
system that is a legal research system should know those things in a symbolic AI way.
477
00:36:56,728 --> 00:36:57,910
Yeah, exactly.
478
00:36:57,910 --> 00:37:00,053
And the last category was fabrications.
479
00:37:00,053 --> 00:37:02,065
that is a hallucination.
480
00:37:03,689 --> 00:37:10,348
So what does this study mean for the future of AI and legal research?
481
00:37:11,085 --> 00:37:12,416
The Stanford study?
482
00:37:13,217 --> 00:37:22,305
I would say that that Stanford study is out of one side of Stanford,
and then there's another side of Stanford called the CodeX.
483
00:37:22,305 --> 00:37:29,751
Megan Ma helps lead that CodeX, and you probably know Megan, and Megan's one of the
smartest minds right now in legal AI.
484
00:37:29,751 --> 00:37:39,119
She's running another study that is comparing human generated output with humans plus
machine generated output.
485
00:37:39,119 --> 00:37:45,102
and doing a double blind study to see what the large law firm partners prefer.
486
00:37:45,102 --> 00:37:50,904
The partners don't know which is human created versus human plus machine, and she's going to
be doing this.
487
00:37:50,904 --> 00:37:57,127
So that seems like a reasonable way because that is really taking actual use cases.
488
00:37:57,127 --> 00:38:06,661
So she's taking actual contractual questions or actual litigation questions and being able
to actually take those common use cases, the head in the distribution curve, not the long
489
00:38:06,661 --> 00:38:07,847
tail, but the head.
490
00:38:07,847 --> 00:38:15,609
And then asking how legal tools that are built for this actually perform on these
more likely legal tasks.
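As a rough illustration of the blinding step in that kind of study, here is a small Python sketch; the sample data, field names, and two-option format are hypothetical, and it simplifies the actual protocol to a single blinded preference vote per question.

import random

# Hypothetical pair of answers to the same legal question.
sample = {
    "question": "Draft an indemnification clause for a services agreement.",
    "human": "...clause drafted by an associate...",
    "human_plus_machine": "...LLM draft reviewed and edited by an associate...",
}

def blind_pair(item):
    """Strip the labels and shuffle the order so the reviewer cannot tell which is which."""
    labeled = [("human", item["human"]),
               ("human_plus_machine", item["human_plus_machine"])]
    random.shuffle(labeled)
    shown = {f"Option {i + 1}": text for i, (_, text) in enumerate(labeled)}
    key = {f"Option {i + 1}": label for i, (label, _) in enumerate(labeled)}
    return shown, key  # `shown` goes to the reviewing partner; `key` stays with the researchers

shown, key = blind_pair(sample)
# A partner who prefers "Option 1" is really voting for whichever source key["Option 1"] names:
print(key["Option 1"])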
491
00:38:15,609 --> 00:38:22,311
So I would say that the Stanford study is a bright shining light as to the way things
should be done.
492
00:38:22,311 --> 00:38:25,212
The other Stanford study, Megan Ma's Stanford study.
493
00:38:25,212 --> 00:38:29,413
Second thing is that similar studies are being done by Dan Schwartz out of Minnesota.
494
00:38:29,413 --> 00:38:35,655
Dan, you might have seen two of his other studies, one of which was this:
495
00:38:36,871 --> 00:38:44,315
they interspersed large language model-created essays with human-created essays, and
in a double-blind setup the professors graded both.
496
00:38:44,315 --> 00:38:46,657
And so that was his study number one.
497
00:38:46,657 --> 00:38:52,280
Study number two was, I forget what study number two was, but it was in a similar vein.
498
00:38:52,280 --> 00:39:01,746
But then study number three is doing kind of the same thing that Megan is doing, but with
a different twist on it, to do a double- or triple-blind study of human-created
499
00:39:01,746 --> 00:39:04,529
things along with machine-created things.
500
00:39:04,529 --> 00:39:08,282
And mixing them up and having human evaluators see what they prefer.
501
00:39:08,282 --> 00:39:10,403
So that's evaluation number two.
502
00:39:10,403 --> 00:39:11,984
That is a bright shining light.
503
00:39:11,984 --> 00:39:14,686
Evaluation number three is Legal Technology Hub.
504
00:39:14,686 --> 00:39:26,114
Nikki Shaver and her team are working with vals.ai on a similar study involving
Harvey, Thomson Reuters, LexisNexis, and us at vLex, where it is a similar kind of John Henry
505
00:39:26,114 --> 00:39:28,497
kind of test to evaluate the outputs.
506
00:39:28,497 --> 00:39:35,456
So I would say that the old Stanford study is old news and is probably chasing the wrong
things for the reason we've just discussed.
507
00:39:35,456 --> 00:39:43,937
Whereas the new Stanford study and the Minnesota-Michigan study and the Legal Technology
Hub vals.ai study, those are going to give us some hope going forward.
508
00:39:44,162 --> 00:39:44,602
Interesting.
509
00:39:44,602 --> 00:39:45,082
It's funny.
510
00:39:45,082 --> 00:39:49,987
We're in a space that moves so fast that May is old news and it's November.
511
00:39:49,987 --> 00:39:52,269
But yeah, I don't disagree.
512
00:39:53,490 --> 00:40:04,540
So the Gartner hype curve does an absolutely phenomenal job, in my opinion, of mapping out the
trajectory of new technologies in many cases.
513
00:40:04,540 --> 00:40:09,284
And I think it's really playing out interestingly in AI right now.
514
00:40:09,284 --> 00:40:11,726
So the Goldman study
515
00:40:11,726 --> 00:40:17,466
came out saying that 44% of legal tasks could be automated by gen AI, which freaked everybody
out.
516
00:40:17,466 --> 00:40:21,566
I think that number was very aspirational.
517
00:40:24,286 --> 00:40:38,638
I think I might have heard you talk about how when GPT-3.5 took the bar, it scored in the
mid-60s, and GPT-4 scored 90-plus.
518
00:40:38,638 --> 00:40:41,878
That number has since been revised down significantly.
519
00:40:41,878 --> 00:40:43,958
So I've heard, which is interesting.
520
00:40:43,958 --> 00:41:01,698
Um, there's a Wharton survey out that I saw in Peter Duffy's newsletter that surveyed
inside counsel and showed that only 28% of inside counsel anticipate a high impact of gen
521
00:41:01,698 --> 00:41:02,898
AI in their role.
522
00:41:02,898 --> 00:41:08,230
I found that super interesting, and 25% anticipate a low impact.
523
00:41:08,460 --> 00:41:10,441
Which again, that's kind of mind blowing.
524
00:41:10,441 --> 00:41:12,895
But where do you think we are on this hype curve?
525
00:41:12,895 --> 00:41:20,032
Do you feel like we're in the trough of disillusionment or are we still, do we still have
further to go?
526
00:41:20,377 --> 00:41:22,588
I think we're going up the slope actually.
527
00:41:22,588 --> 00:41:28,129
I just gave a talk with Pablo Arredondo, the Casetext co-founder.
528
00:41:28,189 --> 00:41:36,011
He and I gave a presentation, and he spent a lot of his part of the presentation walking us
through the hype cycle and through the trough.
529
00:41:36,011 --> 00:41:41,733
And he thinks that law firms and others are doing the hard yards of going up the slope
slowly but surely.
530
00:41:41,733 --> 00:41:45,034
And I think that he's probably right.
531
00:41:45,034 --> 00:41:48,375
As to a couple of the things that you mentioned, you know, the
532
00:41:49,223 --> 00:41:59,090
bar exam: of course, my friend Pablo was actually one of the guys who did that bar
exam paper; Mike Bommarito and Dan Katz were the other co-authors, and so
533
00:41:59,090 --> 00:42:08,436
in that paper that they wrote, they actually put in the caveat saying that, one,
those results are never publicly announced.
534
00:42:08,436 --> 00:42:18,155
So they're kind of doing replicas of the bar exam. So this is always, you know, until the
multistate bar exam publishes its numbers, of course,
535
00:42:18,155 --> 00:42:20,317
there's no definitive objective number.
536
00:42:20,317 --> 00:42:27,042
It's all largely kind of a statistical likelihood rather than a definitive
objective number.
537
00:42:27,042 --> 00:42:27,932
That's thing number one.
538
00:42:27,932 --> 00:42:38,920
Thing number two, they had also put in footnote caveats saying that this was
during COVID times and, you know, of course there are fewer takers during
539
00:42:38,920 --> 00:42:46,045
COVID times and maybe those COVID people were less likely to do well and maybe, you know,
there's all sorts of, you know, scientifically
540
00:42:46,045 --> 00:42:48,908
kind of nebulous things that make up that number.
541
00:42:48,908 --> 00:42:53,682
But anyway, so they put that 90% number with all of those caveats in the initial paper.
542
00:42:53,682 --> 00:42:57,400
So the subsequent papers that say, no, it's way lower than 90%.
543
00:42:57,400 --> 00:42:58,216
Like, come on.
544
00:42:58,216 --> 00:43:00,278
Like, they put that in the footnotes.
545
00:43:00,278 --> 00:43:01,870
So that's that.
546
00:43:01,870 --> 00:43:08,155
And then as to the other thing, saying that only 28% of inside counsel anticipate a high
impact on their role.
547
00:43:09,737 --> 00:43:11,978
I've heard that, but there's also
548
00:43:12,093 --> 00:43:22,727
a lot of studies saying that 80% of inside counsel expect their external
counsel's bills to be reduced because of large language models.
549
00:43:22,848 --> 00:43:30,331
So even though 28% of them think that it's going to impact their role, 80% think it's going
to impact external counsel's role.
550
00:43:30,331 --> 00:43:33,472
So that is an expectation from the buy side, the client side.
551
00:43:33,492 --> 00:43:40,935
And another thing is that the Clio Cloud Conference announced their Clio survey, where
they survey their users, and
552
00:43:41,011 --> 00:43:44,771
their distribution curve is mostly solo and small up to midsize law firms.
553
00:43:44,771 --> 00:43:46,581
And they did a survey last year.
554
00:43:46,581 --> 00:43:50,651
During that survey, they asked:
555
00:43:50,651 --> 00:43:53,451
How many of you are using large language models for legal work?
556
00:43:53,451 --> 00:43:56,331
And the answer in 2023 was about 25%.
557
00:43:56,331 --> 00:43:58,271
They asked the same question in 2024.
558
00:43:58,271 --> 00:44:01,371
And the answer jumped to about 80%.
559
00:44:01,371 --> 00:44:07,471
That is, 80% of solo and small up to midsize Clio users are using AI for legal work.
560
00:44:07,471 --> 00:44:10,803
That's a dramatic jump from 25% to 80%.
561
00:44:10,803 --> 00:44:16,803
And so that shows me that the future is already here, it's just not evenly distributed.
562
00:44:16,803 --> 00:44:25,023
That is, solos and smalls are using this, they're already on the slope of enlightenment, and
they're already using it for real use cases, where the big law folks maybe aren't telling
563
00:44:25,023 --> 00:44:26,493
anybody that they're using it.
564
00:44:26,493 --> 00:44:34,603
And maybe the associates in big law, where their law firms prohibit them from using
it, use shadow IT: they use it on their personal devices, and they're not
565
00:44:34,603 --> 00:44:35,979
telling anybody about it.
566
00:44:36,014 --> 00:44:48,034
Yeah, you know, all these numbers that are flying around, I don't know if you saw the ILTA
Tech Survey, which found that 74% of law firms with more than 700 attorneys are using gen AI in
567
00:44:48,034 --> 00:44:49,834
business use cases.
568
00:44:50,594 --> 00:44:53,354
That seems very aspirational to me.
569
00:44:53,354 --> 00:44:56,184
And I had Steve Embry on the podcast a while back.
570
00:44:56,184 --> 00:45:02,114
He wrote an article, this was before the Tech Survey came out, called Mind the Gap.
571
00:45:02,114 --> 00:45:04,522
And the gap he was talking about is
572
00:45:04,522 --> 00:45:15,731
between surveys like this that report gen AI usage and the anecdotal observations of people
like him and me who work with law firms all day long and just don't see it.
573
00:45:16,292 --> 00:45:20,376
So I think a lot of these numbers are conflicting, aspirational.
574
00:45:20,376 --> 00:45:25,580
Maybe you have a lawyer who Googles what ChatGPT is and he can check the box.
575
00:45:25,580 --> 00:45:28,062
He or she can check the box, but I don't know.
576
00:45:28,183 --> 00:45:28,904
That's right.
577
00:45:28,904 --> 00:45:35,827
I would say that, yeah, every survey can have holes poked in it based on the way it's asked.
578
00:45:35,827 --> 00:45:45,713
Because you can imagine if the way it's asked is, have you used large language models in
your practice, if I used it for one thing in the full year, then I could answer yes to
579
00:45:45,713 --> 00:45:46,433
that.
580
00:45:46,433 --> 00:45:54,958
But really, if the question were instead, what percentage of your work
involved large language models, that number would be totally different, right?
581
00:45:54,958 --> 00:45:56,779
And that number would be way lower.
582
00:45:56,871 --> 00:46:02,288
Have you ever used it in the past year for one thing, versus what percentage of your work
have you used it for?
583
00:46:02,288 --> 00:46:05,176
Those are two very different questions that will give very different answers.
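To make that difference concrete, here is a tiny Python sketch with made-up numbers; the usage shares are purely hypothetical and only illustrate how the two question framings can diverge.

# Illustrative only: hypothetical fraction of each lawyer's work that involved an LLM last year.
usage_share = [0.02, 0.00, 0.10, 0.01, 0.00, 0.05, 0.30, 0.00, 0.03, 0.01]

# Framing 1: "Have you used an LLM in your practice this year?" (yes if used even once)
ever_used = sum(share > 0 for share in usage_share) / len(usage_share)

# Framing 2: "What percentage of your work involved an LLM?"
average_share = sum(usage_share) / len(usage_share)

print(f"'Ever used' rate: {ever_used:.0%}")            # 70%
print(f"Average share of work: {average_share:.0%}")   # 5%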
584
00:46:05,176 --> 00:46:06,907
Yeah, agreed.
585
00:46:06,947 --> 00:46:09,607
Well, this has been a super fun conversation.
586
00:46:09,607 --> 00:46:12,086
I really appreciate you taking a few minutes with me.
587
00:46:12,086 --> 00:46:17,993
I think the LLM reasoning conversation is really just beginning.
588
00:46:17,993 --> 00:46:21,055
Do you, do you know Yann LeCun from
589
00:46:21,055 --> 00:46:23,853
I do, yeah, he's one of the smartest guys around.
590
00:46:23,884 --> 00:46:24,995
Yeah, from Meta.
591
00:46:24,995 --> 00:46:31,957
So I, again, this is more conflicting information that we as individuals have to make
sense of.
592
00:46:31,957 --> 00:46:40,040
He talked about how, uh, currently a house cat is smarter than large language models,
which I thought was interesting.
593
00:46:40,040 --> 00:46:46,614
And then I heard another, there was a talk at a local EO, um, Entrepreneurs'
Organization function here in St.
594
00:46:46,614 --> 00:46:48,525
Louis last night.
595
00:46:48,525 --> 00:46:52,446
And I got the notes from it and it said that current
596
00:46:52,838 --> 00:47:06,422
large language models are operating at, I think the number was, the equivalent of an IQ of
100, and that in the next year and a half, they will operate at an IQ of 1,000.
597
00:47:06,422 --> 00:47:16,795
And, I don't even, those numbers don't make sense to me, but you know, when I hear
Yann say that it's dumber than a house cat.
598
00:47:16,795 --> 00:47:20,618
And then I hear that we're operating today at IQ 100.
599
00:47:20,618 --> 00:47:22,246
There's lots of
600
00:47:22,508 --> 00:47:24,701
You know, there's lots of things to make sense of.
601
00:47:24,701 --> 00:47:25,803
Um, I don't know.
602
00:47:25,803 --> 00:47:27,901
What is your take on that before we wrap up?
603
00:47:27,901 --> 00:47:34,955
Yeah, I really like and respect Yann, and I think that he's right that if we want to have
robots, they need to understand the world.
604
00:47:34,955 --> 00:47:45,000
So when he talks about it being as dumb as a house cat, he's talking about the idea
that if you put a ball into a cup and then you flip the cup upside down, what is going to
605
00:47:45,000 --> 00:47:46,500
happen to that ball?
606
00:47:46,560 --> 00:47:49,862
The large language model should know that the ball should fall out of the cup, right?
607
00:47:49,862 --> 00:47:52,343
But large language models often get that wrong.
608
00:47:52,363 --> 00:47:57,477
So if we want robots to be able to figure out how the world works, we definitely need that
kind of spatial reasoning.
609
00:47:57,477 --> 00:47:59,369
And that's what he's talking about when he says dumber than a house cat.
610
00:47:59,369 --> 00:48:01,971
House cats know that the ball falls out of the cup.
611
00:48:02,111 --> 00:48:10,378
But what Yann isn't saying is there are use cases like the law, where we don't have to deal
with cups turning over and balls.
612
00:48:10,458 --> 00:48:17,163
Every single thing that a lawyer does, every single task, whether you're a litigator or
transactional lawyer, every single task is based on words.
613
00:48:17,404 --> 00:48:21,551
We ingest words, we analyze words, and we output words.
614
00:48:21,551 --> 00:48:23,132
We don't deal with the physical world.
615
00:48:23,132 --> 00:48:26,674
We are merely word-based in every single task that we do.
616
00:48:26,674 --> 00:48:30,237
So, set that aside; a cat doesn't know words.
617
00:48:30,237 --> 00:48:32,958
We don't need to know whether a ball falls out of a cup.
618
00:48:32,958 --> 00:48:35,060
All we need to know is how words work.
619
00:48:35,060 --> 00:48:40,983
And I would say for this use case, the legal use case, Yann's criticisms are maybe
inapplicable.
620
00:48:41,038 --> 00:48:43,218
Yeah, yeah, and you're right.
621
00:48:43,218 --> 00:48:44,288
That's what he was getting at.
622
00:48:44,288 --> 00:48:52,978
You know, house cats can plan and anticipate and they have spatial awareness that large
language models don't.
623
00:48:52,978 --> 00:48:55,328
Well, this has been a lot of fun. Before we wrap up:
624
00:48:55,328 --> 00:48:58,268
How do folks find out more about what you do?
625
00:48:58,268 --> 00:49:01,058
Your work with SALI and vLex?
626
00:49:01,058 --> 00:49:03,475
How do people find out more about that?
627
00:49:03,475 --> 00:49:04,675
Yeah, the best place is on LinkedIn.
628
00:49:04,675 --> 00:49:05,805
I hang out there a lot.
629
00:49:05,805 --> 00:49:08,596
It's Damien Riehl, and you have it in the show notes.
630
00:49:08,596 --> 00:49:09,367
Awesome.
631
00:49:09,367 --> 00:49:10,728
Well, good stuff.
632
00:49:11,070 --> 00:49:14,011
Hopefully, are you going to be in Miami for TLTF?
633
00:49:14,011 --> 00:49:15,543
I will see you at TLTF.
634
00:49:15,543 --> 00:49:17,576
That's one of my favorite conferences.
635
00:49:17,576 --> 00:49:18,922
Yeah, looking forward to seeing you there.
636
00:49:18,922 --> 00:49:19,453
Absolutely.
637
00:49:19,453 --> 00:49:22,287
We'll be on stage pitching on Wednesday afternoon.
638
00:49:22,768 --> 00:49:23,270
Good.
639
00:49:23,270 --> 00:49:24,852
We'll see you in Miami.
640
00:49:25,274 --> 00:49:25,995
All right.
641
00:49:25,995 --> 00:49:26,836
Thanks, Damien.
642
00:49:26,836 --> 00:49:27,858
Take care.
00:00:04,179
Damien, how are you this afternoon?
2
00:00:04,179 --> 00:00:04,761
Couldn't be better.
3
00:00:04,761 --> 00:00:05,475
Life is really good.
4
00:00:05,475 --> 00:00:06,326
How are you Ted?
5
00:00:06,326 --> 00:00:07,196
I'm doing great.
6
00:00:07,196 --> 00:00:07,926
I'm doing great.
7
00:00:07,926 --> 00:00:10,786
I appreciate you joining me this afternoon.
8
00:00:10,786 --> 00:00:23,341
We were kicking around a really interesting topic via LinkedIn and I figured, you know
what, this is, I I've been overdue to have you on the podcast anyway.
9
00:00:23,341 --> 00:00:27,472
So I figured this is a good opportunity to, uh, to riff a little bit.
10
00:00:27,472 --> 00:00:31,533
Um, but before we do, let's, let's get you introduced.
11
00:00:31,533 --> 00:00:35,104
So I went and looked at your, your LinkedIn profile.
12
00:00:35,104 --> 00:00:36,294
Interestingly,
13
00:00:36,344 --> 00:00:44,477
I didn't realize you started your legal career as a clerk and you started practicing in
the early two thousands.
14
00:00:44,477 --> 00:00:46,717
You worked for TR and fast case.
15
00:00:46,717 --> 00:00:49,098
That's now VLex, right?
16
00:00:49,098 --> 00:00:54,420
And, um, you're still at VLex and I know you do a lot of work through with Sally.
17
00:00:54,420 --> 00:00:56,010
That's how you and I actually first met.
18
00:00:56,010 --> 00:01:01,352
But, um, why don't you tell us a little bit about who you are, what you do and where you
do it.
19
00:01:01,363 --> 00:01:08,423
Sure, I've been a lawyer since 2002, I clerked for chief judges at the state appellate
court and the federal district court.
20
00:01:08,423 --> 00:01:16,583
Then I worked for a big law firm, Robbins Kaplan, where I represented Best Buy and much of
their commercial litigation, represented victims of Bernie Madoff, helped sue JPMorgan
21
00:01:16,583 --> 00:01:18,053
over the mortgage-backed security crisis.
22
00:01:18,053 --> 00:01:24,633
So I have a pretty long time, some would say too long as a litigator, but then I've also
been a coder since 85.
23
00:01:24,633 --> 00:01:29,275
So I have the law plus technology background, and anyone who works with me will tell you
that
24
00:01:29,275 --> 00:01:31,315
I am probably the worst coder you've ever met.
25
00:01:31,315 --> 00:01:38,640
I say I'm a coder not as a badge of honor, but a shroud of shame where I'm not very good
at coding at all.
26
00:01:38,640 --> 00:01:42,523
But with large language models, one can be actually better at coding than one actually is.
27
00:01:42,523 --> 00:01:49,267
So after litigating for a bunch of years, I joined TR, building a big thing for them, did
cybersecurity for a while.
28
00:01:49,267 --> 00:01:57,171
But since 2019, I've been working with Fastcase, which is now VLex, essentially playing in
a playground of a billion legal documents.
29
00:01:57,171 --> 00:02:05,374
cases, statutes, regulations, motions, briefs, pleadings, extracting what matters from
them using Sally tags and otherwise, and then running large language models across those
30
00:02:05,374 --> 00:02:06,105
things.
31
00:02:06,294 --> 00:02:06,925
Interesting.
32
00:02:06,925 --> 00:02:13,946
And is that how your involvement in Sally came to be was the work that you're doing at
Vlex?
33
00:02:14,259 --> 00:02:18,949
It actually came to be that I met Toby Brown who founded Sally in 2017.
34
00:02:18,949 --> 00:02:23,839
I met him at Ilticon and we just happened to sit in the same breakfast table.
35
00:02:23,839 --> 00:02:28,279
And I'd known of Toby but had not actually met Toby before.
36
00:02:28,279 --> 00:02:34,629
But then we started talking a bit about Sally and he said, I said, you haven't really
chased any litigation things.
37
00:02:34,629 --> 00:02:36,099
He said, no, we haven't.
38
00:02:36,099 --> 00:02:36,689
said, why not?
39
00:02:36,689 --> 00:02:38,079
I said, would you like some help on that?
40
00:02:38,079 --> 00:02:39,475
And he's like, well, it's too hard.
41
00:02:39,475 --> 00:02:40,215
Do you want to do it?
42
00:02:40,215 --> 00:02:41,675
And I said, yeah, I totally want to do it.
43
00:02:41,675 --> 00:02:47,327
So we met in 2019, August of 2019, and I've been working on Sally ever since.
44
00:02:47,382 --> 00:02:48,123
Interesting.
45
00:02:48,123 --> 00:02:49,999
And what were you coding in 85?
46
00:02:49,999 --> 00:02:52,125
I've been, I started coding in like 82.
47
00:02:52,125 --> 00:02:53,651
What were you coding basic?
48
00:02:53,651 --> 00:02:56,591
I was coding basic in my Commodore 128.
49
00:02:56,911 --> 00:03:02,731
I didn't start with the Commodore 64 because I wanted to get the thing that had 128K of
RAM instead of 64K of RAM.
50
00:03:02,731 --> 00:03:03,931
So I was coding basic.
51
00:03:03,931 --> 00:03:11,111
was getting magazines where I would take the magazine on paper and I would recode the code
in the magazine and then tried to tweak the code.
52
00:03:11,111 --> 00:03:13,902
So yeah, I was a very nerdy 10-year-old.
53
00:03:13,902 --> 00:03:14,382
Yeah.
54
00:03:14,382 --> 00:03:15,742
So it's funny.
55
00:03:15,742 --> 00:03:17,502
A lot of parallels there.
56
00:03:17,502 --> 00:03:24,722
Um, I started off with a Commodore 32, so I had one fourth of the memory that you did.
57
00:03:24,722 --> 00:03:30,342
And you know, I used to have to, uh, store my programs on audio cassette.
58
00:03:30,342 --> 00:03:40,322
This is before I could afford a floppy and you know, um, gosh, so this would have been,
yeah, probably 82, 83.
59
00:03:40,322 --> 00:03:42,252
Then I graduated to a
60
00:03:42,252 --> 00:03:48,557
I had a TI-994A with a extended basic cartridge and a book about that thick.
61
00:03:48,557 --> 00:03:54,132
And I literally read every page of it to understand all the new commands.
62
00:03:54,132 --> 00:03:58,285
I totally geeked out on it and then was totally into it.
63
00:03:58,285 --> 00:04:04,461
And then during middle school, you know, the girls didn't think it was cool to be a
computer programmer.
64
00:04:04,461 --> 00:04:07,834
So I kind of ditched it for a while until college.
65
00:04:07,834 --> 00:04:11,006
So I had a break in there, but
66
00:04:11,022 --> 00:04:17,502
Then when I picked up computers again, it would have been early nineties, like 91 ish.
67
00:04:17,502 --> 00:04:26,822
And by then it was a visual basic, you know, doing native windows development like VB
four.
68
00:04:27,161 --> 00:04:28,582
God, I can't remember.
69
00:04:28,582 --> 00:04:32,642
I think it was visual interdev and used to compile windows programs.
70
00:04:32,642 --> 00:04:33,652
did a lot of SQL.
71
00:04:33,652 --> 00:04:38,862
I was actually on the SQL team at Microsoft in late nineties, early 2000.
72
00:04:38,862 --> 00:04:39,675
So
73
00:04:39,675 --> 00:04:42,300
I can still hold my own on SQL, I'm like you, man.
74
00:04:42,300 --> 00:04:47,990
If I had to code an app, I'd be so lost right now.
75
00:04:48,307 --> 00:04:53,051
True, really query how important that is these days to be really a hardcore coder.
76
00:04:53,051 --> 00:05:01,677
I know people that are really good hardcore coders that use things like cursor and use
large language models to be able to be a bicycle for the mind, like Steve Jobs would say,
77
00:05:01,677 --> 00:05:03,618
and make them go better, faster, and stronger.
78
00:05:03,618 --> 00:05:14,486
But even for people that are rusty or really awful, like you and me, it's still, I can't
go 10 times as fast as a normal coder can with a large language model, but I can maybe do
79
00:05:14,486 --> 00:05:16,231
1x what they used to be able to do.
80
00:05:16,231 --> 00:05:16,382
Right.
81
00:05:16,382 --> 00:05:20,440
There's, there's really, um, it really evens the playing field on what is possible.
82
00:05:20,440 --> 00:05:21,350
Yeah.
83
00:05:21,430 --> 00:05:33,095
Well, you and I were riffing on a topic that I think is super interesting and I was kind
of surprised to hear your perspective on it and I thought it was really interesting and we
84
00:05:33,095 --> 00:05:39,618
were talking about the question on whether or not LLMs can reason.
85
00:05:39,618 --> 00:05:45,340
I've always, you know, understanding the architecture, I've always just had the default
assumption.
86
00:05:45,340 --> 00:05:49,688
That's kind of where I started my position on this with
87
00:05:49,688 --> 00:05:53,771
There's no way they can just based on, on the architecture, right?
88
00:05:53,771 --> 00:05:55,322
It predicts the next token.
89
00:05:55,322 --> 00:06:00,046
It has no concept of, um, comprehension.
90
00:06:00,046 --> 00:06:05,809
Therefore reasoning seems to be far out of reach, but it does create the illusion of
reasoning.
91
00:06:05,809 --> 00:06:09,012
And you had an interesting argument, which was, does it matter?
92
00:06:09,012 --> 00:06:17,138
Um, so, I mean, let's start with do LLM's reason or create the illusion of reasoning.
93
00:06:17,631 --> 00:06:19,872
And yes, let's talk about that.
94
00:06:19,872 --> 00:06:25,133
I think a good precursor to that question is are LLMs conscious or are they not conscious?
95
00:06:25,133 --> 00:06:28,914
And that's another kind of academic exercise question that people have been thinking
about.
96
00:06:28,914 --> 00:06:31,675
You know, it gives the illusion of consciousness, right?
97
00:06:31,675 --> 00:06:35,606
And so, but of course, large language models, in my opinion, are not conscious, right?
98
00:06:35,606 --> 00:06:38,057
Because they are just mimicking consciousness.
99
00:06:38,217 --> 00:06:44,599
But their philosophers for millennia have been saying consciousness is undefinable.
100
00:06:44,753 --> 00:06:48,245
Like, the only thing I can be conscious of is I know that I am conscious.
101
00:06:48,245 --> 00:06:53,468
But whether you are conscious or not or just a figment of my imagination is something I
will never, know.
102
00:06:53,568 --> 00:06:56,590
All I know is that my own consciousness is a thing.
103
00:06:56,590 --> 00:07:05,555
So I think the question of whether large language models are conscious or not is kind of
just an academic exercise that really doesn't matter, right?
104
00:07:05,676 --> 00:07:11,098
So any more than I know whether Ted is conscious or not, that we is a f-
105
00:07:11,609 --> 00:07:14,681
Science and we as philosophers have never defined consciousness.
106
00:07:14,681 --> 00:07:24,646
Therefore the debate about consciousness is just an academic exercise So let's now set
consciousness aside and now let's talk about reasoning and the real question is I when I'm
107
00:07:24,646 --> 00:07:35,332
speaking with you Ted I have no idea whether your brain is reasoning or not I've that's
because often we ourselves don't know how our brains are reasoning or not I'm all the only
108
00:07:35,332 --> 00:07:40,729
way I can tell whether Ted is reasoning or not is through the words that come out of Ted's
mouths
109
00:07:40,729 --> 00:07:44,381
or the words that come out of Ted's keyboard as Ted is typing.
110
00:07:44,381 --> 00:07:51,925
And if those words look like reasoning, and if they quack like reasoning, then I could be
able to say Ted is probably reasoning.
111
00:07:51,985 --> 00:07:55,427
So maybe shouldn't we judge large language models in the same way?
112
00:07:55,427 --> 00:08:00,950
That if the output of the large language models looks like reasoning and quacks like
reasoning, then maybe it's reasoning.
113
00:08:00,950 --> 00:08:06,653
And that's what large language models, machine learning scientists, data scientists, call
that the duck test.
114
00:08:06,733 --> 00:08:10,269
That is, they know what goes into the black box.
115
00:08:10,269 --> 00:08:15,244
They have no idea what happens inside the black box and they know what comes out of the
black box.
116
00:08:15,244 --> 00:08:24,753
But if the output looks like reasoning and quacks like reasoning, maybe whether the black
box is reasoning or not matters not, just like it doesn't matter if I know how you are
117
00:08:24,753 --> 00:08:25,854
reasoning in your brain.
118
00:08:25,854 --> 00:08:27,695
All I know is your output too.
119
00:08:28,098 --> 00:08:29,200
Interesting.
120
00:08:29,584 --> 00:08:31,540
Can we test for reasoning?
121
00:08:32,177 --> 00:08:34,028
Yes, I think we can.
122
00:08:35,009 --> 00:08:38,691
the question is, what are the tasks that you're testing on?
123
00:08:38,712 --> 00:08:41,614
There are objective tasks, mathematical tasks.
124
00:08:41,614 --> 00:08:43,876
So you can imagine a mathematical proof.
125
00:08:43,876 --> 00:08:47,919
You could be able to test whether it's making its way through the mathematical proof or
not.
126
00:08:47,919 --> 00:08:50,761
You can test whether that is reasoning or not reasoning.
127
00:08:50,761 --> 00:08:51,962
Same with science.
128
00:08:51,962 --> 00:08:53,133
Is it getting science correct?
129
00:08:53,133 --> 00:08:54,994
Is it doing the scientific method correctly?
130
00:08:54,994 --> 00:08:55,985
Is it reasoning?
131
00:08:55,985 --> 00:08:59,237
Is it providing true causation rather than being a correlation?
132
00:08:59,237 --> 00:09:02,569
I think those are objective truths that you could be able to see reasoning.
133
00:09:02,569 --> 00:09:06,831
And I would say that the outputs for law are much, much different than that.
134
00:09:06,971 --> 00:09:12,735
That is, whether I made a good argument or not in front of this court is not objective.
135
00:09:12,735 --> 00:09:14,055
That is subjective.
136
00:09:14,055 --> 00:09:22,600
So I can't do a proof as to validity or invalidity any more than you could do a proof as
to lawyer one made a better argument than lawyer two.
137
00:09:22,600 --> 00:09:28,499
Ask 10 lawyers and you might get a 50-50 split on whether lawyer one made a better
argument or lawyer two made a better argument.
138
00:09:28,499 --> 00:09:36,899
going over to the transactional side, the contractual side, lawyer one might love this
clause, but lawyer two says that's the worst clause on the planet.
139
00:09:36,899 --> 00:09:40,759
There's no objective standard as to what is good legal work.
140
00:09:40,759 --> 00:09:49,999
And absent any objective standard as to good legal work, maybe what is good legal
reasoning is in the eye of the beholder, much like beauty is in the eye of the beholder.
141
00:09:49,999 --> 00:09:57,201
That with absent any objective way to be able to say this was good legal reasoning or bad
legal reasoning.
142
00:09:57,201 --> 00:10:05,774
I guess the question of whether a large language model is providing good legal reasoning
or bad legal reasoning is unanswerable in the same way to say whether that human is doing
143
00:10:05,774 --> 00:10:07,995
good legal reasoning or bad legal reasoning.
144
00:10:07,995 --> 00:10:15,717
So I think this whole debate about reasoning or not reasoning is academic at best because
we should judge it by its outputs.
145
00:10:15,717 --> 00:10:22,479
And different lawyers will judge the outputs differently with humans, and they'll judge it
differently with large language models.
146
00:10:22,990 --> 00:10:23,190
Okay.
147
00:10:23,190 --> 00:10:34,290
I think that's true to an extent, but let's say I come in as a, as an attorney and to make
my closing arguments, I see, I sing the theme song to Gilligan's Island.
148
00:10:34,290 --> 00:10:43,330
Um, I think that would universally, um, be graded as that's a bad, this is bad legal
reasoning, right?
149
00:10:43,330 --> 00:10:53,062
So, so there is a spectrum and you know, obviously that's an extreme case, but I think
extreme cases are good to evaluate whether or not something's true.
150
00:10:53,106 --> 00:11:05,180
And, so yeah, mean, if something is just universally looked at every attorney reasonable
person that would evaluate it says, it says it's bad.
151
00:11:05,321 --> 00:11:09,701
Is that, does that monkey wrench what, what you're putting forward there?
152
00:11:09,701 --> 00:11:10,406
No.
153
00:11:10,407 --> 00:11:11,097
Yeah, that's right.
154
00:11:11,097 --> 00:11:16,971
So you're right that it is a spectrum, that you have the worst argument on the planet,
which is just gibberish.
155
00:11:16,971 --> 00:11:21,914
And then there's the best argument on the planet that is going to win 100 out of 100
times.
156
00:11:21,914 --> 00:11:23,374
And same thing with contracts.
157
00:11:23,374 --> 00:11:26,366
There's the contract that's going to get the deal done 100 out of 100 times.
158
00:11:26,366 --> 00:11:29,518
And there's the contract that is going to fail, 100.
159
00:11:29,518 --> 00:11:32,720
So everything is along that spectrum.
160
00:11:32,720 --> 00:11:39,121
And then if you add a y-axis to that spectrum, there is a most common thing, that is the
head.
161
00:11:39,121 --> 00:11:42,573
And then there's a long tail of rare things that happen.
162
00:11:42,574 --> 00:11:47,717
So if you think about what the large language models are doing is largely giving you the
head distribution.
163
00:11:47,717 --> 00:11:53,261
That is the most common things because it's giving you a compressed version of the
training data set.
164
00:11:53,261 --> 00:11:57,384
so the head is almost never going to be Gilligan's Island.
165
00:11:57,504 --> 00:12:01,928
And the head is almost never going to be some of the worst contractual arguments ever
made.
166
00:12:01,928 --> 00:12:04,430
It's going to fall on the average on that side.
167
00:12:04,430 --> 00:12:06,749
And that actually is probably
168
00:12:06,749 --> 00:12:09,931
the right thing to do for the large language model in the legal task.
169
00:12:09,931 --> 00:12:17,605
Because you want the average, because you want 100 out of 100 lawyers, you want most of
the lawyers to say that's probably right.
170
00:12:17,725 --> 00:12:20,226
And that is the average distribution of this.
171
00:12:20,346 --> 00:12:30,242
And so really then, if we then say the x-axis and the y-axis and you have the head, the
most common things, and then you have the long tail, and you now say, OK, the large
172
00:12:30,242 --> 00:12:35,865
language models are going to take the head, not the long tail, then you have to say, OK,
what is that head?
173
00:12:35,865 --> 00:12:36,567
Is that
174
00:12:36,567 --> 00:12:39,288
Does it require legal reasoning or not?
175
00:12:39,508 --> 00:12:44,819
So let's take about let's talk about mathematics and science We want to find new science,
right?
176
00:12:44,819 --> 00:12:48,200
We want to be able to create new cures to cancer, right?
177
00:12:48,200 --> 00:12:54,052
And we want to be able to do things that have never done been done before So does the
large language model need reasoning for that?
178
00:12:54,052 --> 00:12:57,023
Absolutely, because that's not part of the training to set right?
179
00:12:57,023 --> 00:13:04,195
That's not part of something that we can look backward at so we need reasoning for new
science We need reasoning for new mathematics
180
00:13:04,195 --> 00:13:12,061
You need reasoning for something that's never been done before that you need somebody like
Einstein or somebody to somebody who is once in a generation to be able to go forward and
181
00:13:12,061 --> 00:13:13,342
leap forward.
182
00:13:13,342 --> 00:13:15,083
Contrast that with the law.
183
00:13:15,664 --> 00:13:19,707
How much new thinking do we really need to do in the law?
184
00:13:19,828 --> 00:13:24,531
In contrast, how much of the law is looking backward that is looking to precedent?
185
00:13:24,612 --> 00:13:32,938
If I am a lawyer arguing in court, if I say, I've got judge, I've got this really brand
new idea that nobody's ever won on before, but I just sprouted out of my brain.
186
00:13:32,938 --> 00:13:33,693
What do you think?
187
00:13:33,693 --> 00:13:35,364
The judge is going to say, me a case.
188
00:13:35,364 --> 00:13:41,809
And if I can't show him a case, if I can't show her a statute, I lose because it's not
based on precedent.
189
00:13:42,009 --> 00:13:45,212
So do we really need new things in litigation?
190
00:13:45,212 --> 00:13:47,693
Do we really need new things in transactional work?
191
00:13:47,693 --> 00:13:49,875
Do we really need new things in advisory work?
192
00:13:49,875 --> 00:13:51,856
Do we need new things in regulatory work?
193
00:13:51,856 --> 00:13:55,339
And I think the answer to all four of those is no, because you're always looking to
precedent.
194
00:13:55,339 --> 00:13:56,580
You're always looking to statutes.
195
00:13:56,580 --> 00:13:59,441
You're always looking to something that is in the data set.
196
00:13:59,522 --> 00:14:01,453
So if it is in the data set,
197
00:14:01,777 --> 00:14:08,954
Really, all of our reasoning that is legal is backward looking, not forward looking like
in mathematics or in science.
198
00:14:08,954 --> 00:14:10,616
It is all backward looking.
199
00:14:10,616 --> 00:14:18,113
So if it's all backward looking, is every legal reasoning really recombining the data set
that we're having?
200
00:14:18,648 --> 00:14:23,852
Well, what about novel pieces of regulation that now have to be interpreted?
201
00:14:23,852 --> 00:14:33,260
Is there not new legal thinking that has to take place to evaluate the applicability in
those scenarios?
202
00:14:34,008 --> 00:14:39,633
There is, but I would say that the data is taken care of through what's called
interpolation.
203
00:14:39,633 --> 00:14:45,028
And so with the large language models, they connect concepts.
204
00:14:45,028 --> 00:14:48,172
I'm going to share my screen on this.
205
00:14:48,172 --> 00:14:48,964
is it possible?
206
00:14:48,964 --> 00:14:49,233
It's cool.
207
00:14:49,233 --> 00:14:57,831
So I'm going to pull up a PowerPoint to actually demonstrate a real live case that I had
where there's so.
208
00:14:58,077 --> 00:15:05,372
for the less sophisticated and maybe more sophisticated, we'll recap on how large language
model works, is that they pull out concepts.
209
00:15:05,573 --> 00:15:08,734
And they pull out concepts and put them into what's called vector space.
210
00:15:08,955 --> 00:15:19,963
And so you can imagine a two-dimensional vector space that the ideas of a faucet and a
sink and a vanity are probably pretty close together in that two-dimensional vector space.
211
00:15:19,963 --> 00:15:24,306
And then you could be able to say, OK, let's go ahead and put that in three-dimensional
vector space with a z-axis.
212
00:15:24,306 --> 00:15:25,903
And then you could be able to say, OK, these
213
00:15:25,903 --> 00:15:29,085
All similar things are kind of clustered together as ideas.
214
00:15:29,085 --> 00:15:34,048
And now add a fourth dimension, and our brains can't even figure out what that fourth
dimension would look like.
215
00:15:34,048 --> 00:15:36,169
Now add a 10th dimension.
216
00:15:36,169 --> 00:15:37,910
Now add a 100th dimension.
217
00:15:37,910 --> 00:15:41,831
Now add a 1,000th dimension and add a 12,000th dimension.
218
00:15:42,412 --> 00:15:45,634
And 12,000 dimensional vector space is where large language models live.
219
00:15:45,634 --> 00:15:55,123
And somewhere in that 12,000 dimensional vector space lives Ernest Hemingwayness and Bob
Dylanness and Pablo Picassoness.
220
00:15:55,123 --> 00:15:57,803
that lives in 12,000 dimensional vector space.
221
00:15:57,803 --> 00:16:03,803
So all of the things that are legal concepts live somewhere in that 12,000 dimensional
vector space.
222
00:16:03,803 --> 00:16:09,163
And all the facts in the world live somewhere in 12,000 dimensional vector space.
223
00:16:09,163 --> 00:16:16,683
And so what you can imagine is, to your question, is isn't going to combine some novel
things.
224
00:16:16,683 --> 00:16:19,103
I would say, yes, it will combine them.
225
00:16:19,103 --> 00:16:22,831
But the thing is, how many of those things are
226
00:16:22,831 --> 00:16:25,592
already in the large language models vector space.
227
00:16:25,592 --> 00:16:33,994
And then combining those is what's called, the data scientists would say, connecting the
latent space between those two disparate concepts.
228
00:16:33,994 --> 00:16:40,276
So now as I'm going to be sharing my screen, this concept is to think through.
229
00:16:40,451 --> 00:16:42,877
A friend of mine works for an insurance company.
230
00:16:42,877 --> 00:16:48,991
And she asked, what do you think of the thing about that called effective computing?
231
00:16:48,991 --> 00:16:51,139
What do you think of effective computing?
232
00:16:51,247 --> 00:16:55,411
And I said, I'm a pretty technical guy, but I'm sad to say I don't know what effective
computing is.
233
00:16:55,411 --> 00:17:01,815
So what I did is I went to the large language model and said, define effective computing
in the context of insurance and the law.
234
00:17:02,236 --> 00:17:05,739
And she's an insurance in-house lawyer.
235
00:17:05,739 --> 00:17:13,466
So I says, well, effective computing is how computers recognize human emotions and facial
expressions and voice patterns to create emotionally aware agents.
236
00:17:13,466 --> 00:17:14,226
I said, cool.
237
00:17:14,226 --> 00:17:20,621
Now analyze how effective computing can be used in an insurance call center, because
that's how my friend's company was thinking about using it.
238
00:17:20,959 --> 00:17:28,185
They said well you could use it for emotional recognition figuring out the caller's
emotional state figuring out their choice of words How quickly they're speaking how
239
00:17:28,185 --> 00:17:38,604
emotional they are after an accident or loss I said cool now give me a list of potential
legal issues that could stem from using effective computing in a call center and they said
240
00:17:38,604 --> 00:17:48,802
have you thought about privacy law like GDPR or yeah, or CCPA I've you thought about
consent and whether that that caller consented to you analyzing their emotions Have you
241
00:17:48,802 --> 00:17:50,203
thought about if you get hacked?
242
00:17:50,203 --> 00:17:54,796
What if all of your client's emotional data is in the hands of a hacker?
243
00:17:54,796 --> 00:17:55,987
What's that going to do legally?
244
00:17:55,987 --> 00:17:57,748
What's that going to do with PR?
245
00:17:57,748 --> 00:17:59,870
These are all good legal concepts.
246
00:17:59,870 --> 00:18:07,655
And I would guess that zero times has anyone ever asked about the legal aspects of
emotional, of effective computing.
247
00:18:07,675 --> 00:18:15,140
But what it's done is it knows what effective computing is, and it knows what privacy law
is, it knows what consent is, it knows what data security is.
248
00:18:15,140 --> 00:18:19,409
So it's connecting the latent space between the concept of effective computing
249
00:18:19,409 --> 00:18:21,280
and the concept of privacy law.
250
00:18:21,280 --> 00:18:23,681
And it then says, give me some sub bullets.
251
00:18:23,681 --> 00:18:28,604
And now it's going to continue expanding upon the concepts of which jurisdictions people
are calling in from.
252
00:18:28,604 --> 00:18:29,525
What types of data?
253
00:18:29,525 --> 00:18:30,395
Third party sharing.
254
00:18:30,395 --> 00:18:32,086
Are you minimizing the data?
255
00:18:32,126 --> 00:18:35,628
Each one of these things that I had live somewhere in vector space.
256
00:18:35,628 --> 00:18:43,152
So merely combining the concept of effective computing with the concept of privacy law and
consent and data security.
257
00:18:43,152 --> 00:18:48,705
That way we can then combine those aspects in new ways that haven't been in the training
set.
258
00:18:48,829 --> 00:18:50,059
So I think that's where it is.
259
00:18:50,059 --> 00:18:58,343
Almost everything that we do as laws, as lawyers, everything we do is connecting my
client's facts to the existing laws.
260
00:18:58,503 --> 00:19:01,724
And your client's facts are almost certainly in the training set.
261
00:19:01,724 --> 00:19:09,648
And the existing laws, if you are training on actual non-hallucinated cases, statutes, and
regulations, those are also in the training set.
262
00:19:09,648 --> 00:19:18,329
So really, reasoning is just being able to connect those existing facts in the data set
with the existing laws in the data set and saying how they relate to each other.
263
00:19:18,329 --> 00:19:21,945
if you have the actual non-hallucinated cases, statutes, and regulations.
264
00:19:22,604 --> 00:19:23,615
That's super interesting.
265
00:19:23,615 --> 00:19:30,978
So I find it, I have to think through this, but it seems shocking to me that there are no
novel concepts.
266
00:19:30,978 --> 00:19:39,113
Um, that what you've just described two things that currently exist in the, in the
training material, right?
267
00:19:39,113 --> 00:19:51,509
That, that the LLM has vectorized and plotted in 12,000 dimensions and it knows the
associations and, and the latent space between them.
268
00:19:51,830 --> 00:19:52,588
But
269
00:19:52,588 --> 00:20:08,665
What about new areas of law like when we start selling real estate on the moon, that
obviously at some point will make its way in, but until it does, how will it navigate
270
00:20:08,665 --> 00:20:10,366
scenarios like that?
271
00:20:10,579 --> 00:20:13,270
So I guess the question is where do those areas of law come from?
272
00:20:13,270 --> 00:20:14,660
And they come from regulations.
273
00:20:14,660 --> 00:20:15,861
They come from statutes.
274
00:20:15,861 --> 00:20:17,321
They come from cases.
275
00:20:17,701 --> 00:20:21,262
And of those cases, statutes and regulations are reflected in documents.
276
00:20:21,402 --> 00:20:30,765
And if the system has those documents, the cases, the statutes, and the regulations, then
the system will be able to plot those in vector space and then be able to take those legal
277
00:20:30,765 --> 00:20:35,446
concepts and apply them to the factual concepts that are also in vector space.
278
00:20:35,446 --> 00:20:39,067
So really, every single area of law is written somewhere.
279
00:20:39,067 --> 00:20:40,990
It has to be, otherwise it's not a law.
280
00:20:40,990 --> 00:20:43,073
And if it's written, it can be vectorized.
281
00:20:43,073 --> 00:20:45,286
So really everything that we do is part of the training set.
282
00:20:45,286 --> 00:20:53,837
There is really no novelty that is needed in the law because everything is necessarily
backward looking at the cases, the statutes, the regulations that are binding.
283
00:20:54,316 --> 00:20:54,987
Interesting.
284
00:20:54,987 --> 00:21:00,852
You had a metaphor I had not heard before with anesthesia.
285
00:21:00,852 --> 00:21:07,478
And I think you had a friend who was an anesthesiologist.
286
00:21:07,478 --> 00:21:08,058
Yes.
287
00:21:08,058 --> 00:21:10,170
And I have trouble saying that word.
288
00:21:10,170 --> 00:21:13,683
So I'll just say anesthesiology.
289
00:21:13,683 --> 00:21:17,376
To explain that, because I thought that was an interesting metaphor.
290
00:21:17,511 --> 00:21:19,001
Yeah, she told me something.
291
00:21:19,001 --> 00:21:26,744
I was over a campfire and it freaked me out and I may freak out your recent listeners, but
Yeah, she said she said Damien.
292
00:21:26,744 --> 00:21:28,124
Do you realize we have no idea?
293
00:21:28,124 --> 00:21:29,435
She's a nurse anesthetist, right?
294
00:21:29,435 --> 00:21:39,037
So she puts people under every single day and she has a I think a master's degree in
anesthesiology So she said do you realize we have no idea how anesthesia works?
295
00:21:39,298 --> 00:21:40,978
I said wait to say that again.
296
00:21:40,978 --> 00:21:44,143
She said yeah one of two options option number one
297
00:21:44,143 --> 00:21:51,830
is it does what everybody thinks that it does, is that it puts us to sleep and we don't
feel that scalpel going into our bellies and then we come out and we're all fine, right?
298
00:21:51,830 --> 00:21:53,271
That's option number one.
299
00:21:53,271 --> 00:21:57,134
Option number two is we feel every single cut.
300
00:21:57,234 --> 00:22:01,317
And what anesthesia does is to give us amnesia to make us forget.
301
00:22:01,698 --> 00:22:04,381
We don't know whether it's option one or option two.
302
00:22:04,381 --> 00:22:07,503
That scares the crap out of me and it might well scrape the crap out of you.
303
00:22:07,503 --> 00:22:12,187
But the question is, do we not use anesthesia because we don't know how it works?
304
00:22:12,891 --> 00:22:20,977
No, of course we use anesthesia because the real question is does it work and is it
effective as to what we would like it to do?
305
00:22:20,977 --> 00:22:28,281
If the answer to both those things is yes, then how it works maybe matters less than the
fact that it does work.
306
00:22:28,502 --> 00:22:32,124
So apply that anesthesia test to reasoning.
307
00:22:32,525 --> 00:22:41,661
And just like I can't tell whether you could, you're reasoning in Ted's brain or not, but
I can gauge you by your output, by your, by your speech.
308
00:22:41,661 --> 00:22:44,133
by your words coming out of your keyboard.
309
00:22:44,193 --> 00:22:47,856
And if that works, I say you're reasoning.
310
00:22:48,597 --> 00:22:51,000
whether I know how your brain works doesn't matter.
311
00:22:51,000 --> 00:22:54,302
And whether I know how anesthesia works doesn't matter.
312
00:22:54,423 --> 00:22:58,106
I'm sorry, whether I know how anesthesia works doesn't matter.
313
00:22:58,106 --> 00:23:00,008
The fact that it does work matters.
314
00:23:00,008 --> 00:23:07,934
So the fact that a large-language model does create output that seems like it is
reasonable and is reasoning, just like a human is reasoning.
315
00:23:08,763 --> 00:23:19,705
If the human, if the large language model output is indistinguishable from Ted's output as
reasonable, then I would say whether it is actual reasoning and how it's reasoning doesn't
316
00:23:19,705 --> 00:23:23,929
really matter any more than anesthesia doesn't matter if we know how anesthesia works.
317
00:23:24,322 --> 00:23:26,784
Yeah, that is disturbing to think about.
318
00:23:27,705 --> 00:23:31,109
But it's a valuable metaphor.
319
00:23:31,109 --> 00:23:33,791
Now here's what I would say in response to that.
320
00:23:33,791 --> 00:23:40,577
Did you have a chance to look at the Apple Intelligence team's study with the GSM 8K?
321
00:23:41,688 --> 00:23:43,371
Only in the two minutes before you sent it.
322
00:23:43,371 --> 00:23:45,395
So why don't you describe it and maybe I can react to it.
323
00:23:45,395 --> 00:23:45,825
Yeah.
324
00:23:45,825 --> 00:24:00,567
So, um, it's only five weeks old, so it's, it's very new, but one benchmark that has been
used pretty widely to test reasoning in, um, large language models is the, the GSM, which
325
00:24:00,567 --> 00:24:06,011
stands for grade school math, AK there's 8,000 of these questions.
326
00:24:06,252 --> 00:24:14,198
And what Apple did was modified these questions ever so slightly.
327
00:24:14,198 --> 00:24:19,000
and evaluated the LLM's performance against those modifications.
328
00:24:19,000 --> 00:24:20,780
And it was pretty dramatic.
329
00:24:20,880 --> 00:24:34,704
So their conclusions were, I said, the performance of all models decline when only the
numerical values in the question are altered in the GSM symbolic benchmark.
330
00:24:34,885 --> 00:24:37,705
That's pretty interesting.
331
00:24:38,326 --> 00:24:39,202
It says,
332
00:24:39,202 --> 00:24:45,283
their performance significantly deteriorates as the number of clauses in the question
increases.
333
00:24:45,604 --> 00:24:54,726
And then its conclusion is we hypothesize that this decline is due to the fact that
current LLMs are not capable of genuine logical reasoning.
334
00:24:55,205 --> 00:25:06,349
And I thought there were a few examples in this specifically that really, I guess, were
telling.
335
00:25:06,389 --> 00:25:08,736
So let me see if I can find this here.
336
00:25:08,736 --> 00:25:23,130
So, um, one of these, uh, these are word problems and in one of the word problems, they,
I'm not going to be able to find it, but I remember enough about it to, um, articulate it.
337
00:25:23,170 --> 00:25:31,473
What they did was in the problem, they threw a sentence that had nothing to do with the
problem itself and it completely blew up the problem.
338
00:25:31,473 --> 00:25:36,302
Um, the sentence that they put in there was it, the question was something like,
339
00:25:36,302 --> 00:25:53,402
You know, if the current price of keyboards and mouse pads are five and $10 respectively,
and inflation has increased by 10 % each year, that was the part that had nothing.
340
00:25:53,402 --> 00:25:55,592
Tell us what the current price is, right?
341
00:25:55,592 --> 00:25:57,352
It's already given you the information.
342
00:25:57,352 --> 00:26:01,302
The fact that inflation increased 10 % has nothing to do.
343
00:26:01,302 --> 00:26:05,362
And it, it plummeted the, the,
344
00:26:05,390 --> 00:26:10,092
accuracy of the large language models responses, something like 65%.
345
00:26:10,092 --> 00:26:14,253
It varied wildly as you would expect.
346
00:26:15,614 --> 00:26:30,560
The latest models did, you know, in the chain of thought that they did the best, but it
was, it seemed to me that this really pokes a hole in the whole concept of if these,
347
00:26:30,560 --> 00:26:35,412
because what that points to, if that, if you throw it a sentence that has nothing to do
with the problem,
348
00:26:35,486 --> 00:26:38,390
in and I can't, that means I haven't comprehended the problem.
349
00:26:38,390 --> 00:26:40,934
I don't know what the problem is, right?
350
00:26:40,934 --> 00:26:48,694
I'm, simply reciting answers and you know, it's what I honestly would expect from, but I
don't know.
351
00:26:48,694 --> 00:26:50,366
What is your response to that?
352
00:26:50,483 --> 00:26:59,883
Yeah, so I would say two responses, one of which is the idea that mathematics has a right
answer and a wrong answer, whereas legal often does not.
353
00:26:59,883 --> 00:27:05,943
That is, in litigation, it's whatever argument happens to win, and in transactional work,
it's whatever gets the deal done.
354
00:27:05,943 --> 00:27:14,623
So, wherein the mathematical proof, you have a right answer or a wrong answer, whereas in
legal, there is the eye of the beholder, where there is no objective, there's merely the
355
00:27:14,623 --> 00:27:15,163
subjective.
356
00:27:15,163 --> 00:27:16,863
So that's thing number one.
357
00:27:16,863 --> 00:27:19,275
Thing number two is, of course,
358
00:27:19,275 --> 00:27:29,501
With mathematics you want to be able to create new mathematics and be able to go forward
with new scenarios But again law never has It's always looking backward to precedent
359
00:27:29,501 --> 00:27:38,016
looking backward to cases looking backward to the contracts like we've always done the
contract in this way And we know that in this industry and this jurisdiction force measure
360
00:27:38,016 --> 00:27:48,281
clauses need to be in this way This is always backward looking so really so two things non
objectivity in the law where there is objectivity in math
361
00:27:48,281 --> 00:27:52,613
and backward looking in the law rather than forward looking with mathematics.
362
00:27:52,754 --> 00:28:01,139
That yes, it'll throw off the mathematics by throwing in the inflationary tool and it
won't really reason in that way.
363
00:28:01,179 --> 00:28:11,036
But I think for our use cases in the law, whether it's a transactional use case, a
litigation use case, an advisory use case or regulatory use case, all of the stuff is
364
00:28:11,036 --> 00:28:11,806
there.
365
00:28:11,806 --> 00:28:17,843
And if we use the chain of thought like you've talked about, then it could probably
overcome the lack of true
366
00:28:17,843 --> 00:28:19,784
quote unquote reasoning that we have.
367
00:28:19,784 --> 00:28:23,767
And we as humans are really good at separating wheat from chaff.
368
00:28:23,767 --> 00:28:30,691
And so you can imagine, you know, scenario one is everybody takes the robot's output and
doesn't touch it.
369
00:28:30,892 --> 00:28:33,634
That's a bad scenario under anybody's estimation.
370
00:28:33,634 --> 00:28:39,838
But almost everybody's in scenario two where it gives an output and then you look over
that output and get it out the door.
371
00:28:39,838 --> 00:28:43,040
Under scenario two, you're going to separate that wheat from the chaff.
372
00:28:43,080 --> 00:28:47,357
And so until we have autonomous legal bots, which
373
00:28:47,357 --> 00:28:49,691
God help us if we have that, right?
374
00:28:49,733 --> 00:28:52,811
But until we have that, you're always gonna have that human oversight.
375
00:28:52,811 --> 00:28:57,291
So really, whether it's reasoning or not, is gonna be pretty easily flagged.
376
00:28:57,528 --> 00:28:58,199
Yeah.
377
00:28:58,199 --> 00:29:02,102
And they, they, it wasn't just, um, there were other ways that they tested it.
378
00:29:02,102 --> 00:29:04,344
They actually changed some of the numbers.
379
00:29:04,344 --> 00:29:06,525
What was interesting is that that also threw it off.
380
00:29:06,525 --> 00:29:07,967
And this part surprised me.
381
00:29:07,967 --> 00:29:11,039
I thought AI would, I thought LLMs would figure this out.
382
00:29:11,039 --> 00:29:12,711
They changed the names.
383
00:29:12,711 --> 00:29:15,873
So instead of Sophie, they put Lisa, right?
384
00:29:15,873 --> 00:29:17,655
But they did it consistently throughout.
385
00:29:17,655 --> 00:29:21,858
Like it should be able to, so anyway, it's a new study.
386
00:29:21,858 --> 00:29:26,968
There's still a lot to be analyzed
387
00:29:26,968 --> 00:29:29,559
from it, but I did think it was interesting.
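As a rough illustration of the kind of perturbation testing being described, a sketch like the following swaps names and numbers in a GSM8K-style word problem and checks whether the answer tracks the change; ask_model is a hypothetical stand-in for the model under test, and the word problem itself is invented.

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError

TEMPLATE = (
    "{name} picks {a} apples in the morning and {b} more in the afternoon. "
    "How many apples does {name} have?"
)

def run_perturbations() -> None:
    cases = [
        {"name": "Sophie", "a": 3, "b": 4},  # original phrasing
        {"name": "Lisa", "a": 3, "b": 4},    # same math, different name
        {"name": "Sophie", "a": 5, "b": 9},  # same name, different numbers
    ]
    for case in cases:
        expected = case["a"] + case["b"]
        answer = ask_model(TEMPLATE.format(**case))
        # A solver that truly comprehends the problem should be right in every row.
        print(case, "expected:", expected, "model said:", answer)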
388
00:29:30,480 --> 00:29:37,805
Speaking of studies, the Stanford study, there's been a lot of conversation about it.
389
00:29:37,986 --> 00:29:43,068
The second iteration of that came out in May.
390
00:29:43,068 --> 00:29:56,738
You know, obviously there are companies out there that put a lot of money and
effort into these tools, and Stanford was pretty pointed in their commentary, and
391
00:29:56,770 --> 00:30:01,013
You know, there was a lot of feedback that the study was biased.
392
00:30:01,013 --> 00:30:03,054
I read it multiple times.
393
00:30:03,054 --> 00:30:06,857
It's about 30 pages, and it's a really easy read.
394
00:30:06,857 --> 00:30:09,549
Like reading scientific papers is usually rough going.
395
00:30:09,549 --> 00:30:11,660
That one was really easy to read.
396
00:30:11,680 --> 00:30:15,523
And I didn't see the bias.
397
00:30:15,523 --> 00:30:19,125
It did try to trick the tools, and it was upfront about that.
398
00:30:19,125 --> 00:30:23,178
Just like the Apple study tried to trick AI, right?
399
00:30:23,178 --> 00:30:25,846
That's kind of part of testing and, you know,
400
00:30:25,846 --> 00:30:30,392
evaluating, you're going to throw curveballs and see how the model responds.
401
00:30:30,392 --> 00:30:32,595
But, you know, what was your take on the study?
402
00:30:32,595 --> 00:30:37,792
Did you feel there were biases or did you think it was fair?
403
00:30:38,411 --> 00:30:49,384
Two thoughts on that, and not to throw shade on the Stanford folks, but one issue I have
with them is the terminology that they used for hallucinations. I
404
00:30:49,384 --> 00:30:53,516
think they conflated hallucinations with just getting the wrong legal answer.
405
00:30:53,516 --> 00:30:54,906
Those are two different things, right?
406
00:30:54,906 --> 00:31:06,131
There is a hallucination, where it just makes some things up, and then there is where Ted
and I disagree as to where the law ends up, and number two is not hallucination.
407
00:31:06,131 --> 00:31:08,331
That is just us disagreeing.
408
00:31:08,331 --> 00:31:10,671
And again, with the law, there may not be a right answer.
409
00:31:10,671 --> 00:31:16,261
And the reason there is litigation is because reasonable minds can disagree as to what
is the right answer.
410
00:31:16,261 --> 00:31:18,951
So a court has to be able to resolve that dispute.
411
00:31:19,431 --> 00:31:24,071
A disagreement as to the output is not hallucination.
412
00:31:24,071 --> 00:31:32,711
So number one, the quibble I had is on the terminology, that they call everything
hallucination, where really we should focus that term on the confabulations that the large language
413
00:31:32,711 --> 00:31:33,271
models do.
414
00:31:33,271 --> 00:31:34,771
That's thing number one.
415
00:31:34,771 --> 00:31:38,730
Thing number two goes to trying to trick the model in the ways that you talked about.
416
00:31:38,730 --> 00:31:41,101
And this goes to the product side of me.
417
00:31:41,101 --> 00:31:42,051
I'm a product guy.
418
00:31:42,051 --> 00:31:43,410
You're a product guy.
419
00:31:43,951 --> 00:31:48,011
We, as product people, say, what are the most common user pathways?
420
00:31:48,011 --> 00:31:49,731
What are the most common user behaviors?
421
00:31:49,731 --> 00:31:53,611
And we want to be able to build products that are based on those most common user
behaviors.
422
00:31:53,611 --> 00:32:01,467
And going back to my x- and y-axis, this is the head and the long tail, where the
most common things done are the head.
423
00:32:01,467 --> 00:32:06,710
And the weirdest, strangest things that you would never think a user would ever do is in
the long tail.
424
00:32:06,911 --> 00:32:15,717
And so the things that they were asking were things like, when Justice Ruth Bader Ginsburg
dissented in this case, what does that mean?
425
00:32:15,717 --> 00:32:24,234
Where a user would never ask that, because the user
would know that Ruth Bader Ginsburg didn't dissent in that case.
426
00:32:24,234 --> 00:32:26,365
She was the concurrence in that case.
427
00:32:26,365 --> 00:32:29,143
So asking a question like that is
428
00:32:29,143 --> 00:32:32,245
way, way down on the long tail distribution curve.
429
00:32:32,245 --> 00:32:34,347
That is not the most common use case.
430
00:32:34,347 --> 00:32:42,793
So really, if they were to do the study correctly, they would say, what
are the most common questions made by lawyers?
431
00:32:42,793 --> 00:32:50,848
The most common questions made by law students, and then
collect those most common questions, randomly distribute them, and
432
00:32:50,848 --> 00:32:55,742
then say, based on those most common questions, or I guess not even most common, they
would take the entire distribution curve.
433
00:32:55,742 --> 00:32:57,723
They would take the head and the tail.
434
00:32:57,723 --> 00:32:59,404
Mix that up in a randomized study.
435
00:32:59,404 --> 00:33:03,907
So there will be some long tail questions, some head questions.
436
00:33:03,907 --> 00:33:10,072
And then from that random distribution, then run those through and see how many
confabulations slash hallucinations are there.
437
00:33:10,072 --> 00:33:12,013
That would be a reasonable way to do it.
438
00:33:12,013 --> 00:33:15,315
That would be most aligned with how users use the tools.
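A bare-bones sketch of that sampling idea might look like the following; the question pools and the count_confabulations checker are assumptions for illustration, not the Stanford team's actual protocol.

import random

HEAD_QUESTIONS = [  # the common, everyday queries
    "What is the statute of limitations for breach of contract in Minnesota?",
    "What elements must a plaintiff prove to establish negligence?",
]
TAIL_QUESTIONS = [  # rare, adversarial, or premise-flawed queries
    "Why did Justice Ginsburg dissent in a case in which she actually concurred?",
]

def count_confabulations(question: str, answer: str) -> int:
    """Hypothetical checker: number of fabricated citations or holdings in the answer."""
    raise NotImplementedError

def evaluate(tool, n: int = 100, head_weight: float = 0.95) -> float:
    # Sample across the whole distribution curve: mostly head, a little tail.
    random.seed(0)
    population = HEAD_QUESTIONS + TAIL_QUESTIONS
    weights = (
        [head_weight / len(HEAD_QUESTIONS)] * len(HEAD_QUESTIONS)
        + [(1 - head_weight) / len(TAIL_QUESTIONS)] * len(TAIL_QUESTIONS)
    )
    sample = random.choices(population, weights=weights, k=n)
    errors = sum(count_confabulations(q, tool(q)) for q in sample)
    return errors / n  # average confabulations per question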
439
00:33:15,315 --> 00:33:25,803
So by asking a long tail question that arguably would zero times out of 1,000 ever be
asked, zero times out of 1 million ever be asked, nobody would ever ask a question to
440
00:33:25,803 --> 00:33:26,703
trick it.
441
00:33:26,811 --> 00:33:32,754
I would say that's really not doing what the system should be doing for its purpose.
442
00:33:32,754 --> 00:33:43,088
The purpose of legal research is to answer legal questions and it's not to gauge the
reasoning of the underlying large language models, GPT-4 or GPT-3.5 or whatever was in
443
00:33:43,088 --> 00:33:44,339
place at the time.
444
00:33:44,339 --> 00:33:46,920
And what they were doing was tricking GPT-3.5.
445
00:33:46,920 --> 00:33:50,861
They weren't tricking the legal research tools that were relying on GPT-3.5.
446
00:33:50,961 --> 00:33:55,097
So I would say, test the thing that is the product,
447
00:33:55,097 --> 00:33:58,605
not the underlying model that the product is using.
448
00:33:59,032 --> 00:34:08,042
So how is that different than throwing in a sentence about inflation being 10% in the
GSM8K scenario?
449
00:34:08,711 --> 00:34:15,776
So I guess if we're looking at a legal product, like are we testing reasoning or are we
testing how well the product works?
450
00:34:15,776 --> 00:34:17,597
Because those are two different goals.
451
00:34:17,597 --> 00:34:22,440
Because really testing reasoning is testing the foundational model, GPT-4, GPT-3.5.
452
00:34:22,440 --> 00:34:31,386
But if you're testing how well the product works for its intended purpose, then the
question then would be, would the user input usually include that inflationary number?
453
00:34:31,446 --> 00:34:37,950
If the answer is yes, the user input would include that, then yes, we should definitely
include that in the distribution of the user input.
454
00:34:38,988 --> 00:34:46,838
But if zero times out of 100 million they would include that inflationary number, then that
doesn't seem right, because if you're really testing the product, that is not a use case
455
00:34:46,838 --> 00:34:48,581
that the users would ever use.
456
00:34:48,728 --> 00:34:50,409
Yeah, I guess, yeah.
457
00:34:50,409 --> 00:34:51,659
OK, fair.
458
00:34:51,659 --> 00:35:00,383
So in section 6.2 in the study, and I do agree, they say hallucinations can be insidious.
459
00:35:00,423 --> 00:35:10,907
And then some of the scenarios that they document don't seem like hallucination scenarios,
like misunderstanding holdings.
460
00:35:10,907 --> 00:35:17,690
Systems do not seem capable of consistently making out the holding of a case,
distinguishing between legal actors.
461
00:35:17,822 --> 00:35:28,089
Systems fail to distinguish between arguments made by litigants and statements made by the
court. Respecting order of authority: models strain in grasping hierarchies of legal
462
00:35:28,089 --> 00:35:28,550
authority.
463
00:35:28,550 --> 00:35:30,461
Yeah, those aren't hallucinations.
464
00:35:30,461 --> 00:35:34,764
Those are just limitations of the model itself, it seems.
465
00:35:34,827 --> 00:35:39,469
That's right, limitations of the model and perhaps limitations of the system that is using
that model.
466
00:35:39,469 --> 00:35:49,352
So you could imagine that if the system were to say as part of the metadata that a trial
court is below this particular appellate court, which is below this particular Supreme
467
00:35:49,352 --> 00:36:01,055
Court, and models that hierarchy of courts in a symbolic way, not a large language model
way, but in a symbolic coded up way, then that system could avoid the,
468
00:36:01,139 --> 00:36:10,279
confabulation between the district court and the appellate court level because the
guardrails of the symbolic AI would prevent that kind of misunderstanding.
469
00:36:10,919 --> 00:36:21,399
So is Stanford analyzing the large language model output or are they analyzing the
system's coding, that is the hard coding, to be able to say that this trial court is below
470
00:36:21,399 --> 00:36:23,659
the appellate court, which is below the Supreme Court?
471
00:36:23,659 --> 00:36:30,773
I think that that is maybe a reasonable critique that if the system
472
00:36:30,963 --> 00:36:34,764
is not recognizing that hierarchy, then maybe the system should.
473
00:36:34,784 --> 00:36:40,026
So I would say that's maybe a reasonable critique if you're really looking at that.
474
00:36:40,666 --> 00:36:42,327
So yeah, maybe two scenarios.
475
00:36:42,327 --> 00:36:52,500
The Ruth Bader Ginsburg question is unreasonable, but knowing whether the district court was
overruled by the appellate court, which was then ruled back by the Supreme Court, a legal
476
00:36:52,500 --> 00:36:56,611
system that is a legal research system should know those things in a symbolic AI way.
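A minimal sketch of that symbolic guardrail, assuming invented court names and a structured claim already extracted from the model's draft; this is an illustration of the idea, not how any vendor actually implements it.

# Court ranks are hard-coded symbolically rather than left to the language model.
COURT_RANK = {
    "U.S. District Court (D. Minn.)": 1,    # trial court
    "U.S. Court of Appeals (8th Cir.)": 2,  # appellate court
    "U.S. Supreme Court": 3,                # highest court
}

def may_overrule(higher: str, lower: str) -> bool:
    """Hard-coded rule: only a court ranked above another can overrule it."""
    return COURT_RANK[higher] > COURT_RANK[lower]

def passes_guardrail(claim: dict) -> bool:
    # `claim` is a structured assertion pulled from the model's draft, e.g.
    # {"overruling": "U.S. Court of Appeals (8th Cir.)",
    #  "overruled": "U.S. District Court (D. Minn.)"}.
    # If the hierarchy says the claim cannot happen, flag it for human review
    # instead of letting the confabulation through.
    return may_overrule(claim["overruling"], claim["overruled"])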
477
00:36:56,728 --> 00:36:57,910
Yeah, exactly.
478
00:36:57,910 --> 00:37:00,053
And the last category was fabrications.
479
00:37:00,053 --> 00:37:02,065
That is a hallucination.
480
00:37:03,689 --> 00:37:10,348
So what does this study mean for the future of AI and legal research?
481
00:37:11,085 --> 00:37:12,416
The Stanford study?
482
00:37:13,217 --> 00:37:22,305
I would say, so that Stanford study is out of one side of Stanford,
and then there's another side of Stanford called the CodeX.
483
00:37:22,305 --> 00:37:29,751
Megan Ma helps lead that CodeX, and you probably know Megan; she's one of the
smartest minds right now in legal AI.
484
00:37:29,751 --> 00:37:39,119
She's running another study that is comparing human-generated output with
human-plus-machine-generated output,
485
00:37:39,119 --> 00:37:45,102
and doing a double-blind study to see what the large law firm partners prefer.
486
00:37:45,102 --> 00:37:50,904
The partners don't know which is human-created versus human-plus-machine-created.
487
00:37:50,904 --> 00:37:57,127
So that seems like a reasonable way because that is really taking actual use cases.
488
00:37:57,127 --> 00:38:06,661
So she's taking actual contractual questions or actual litigation questions and being able
to actually take those common use cases, the head in the distribution curve, not the long
489
00:38:06,661 --> 00:38:07,847
tail, but the head.
490
00:38:07,847 --> 00:38:15,609
and then seeing how the legal tools that are built for this actually perform on these
more likely legal tasks.
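As a sketch of what that blinded comparison can look like mechanically; the data shapes and function names are assumptions for illustration, and the real study design is of course far more careful.

import random

def blind_pair(human_draft: str, human_plus_machine_draft: str) -> dict:
    # Shuffle so the reviewing partner cannot tell which workflow produced which draft.
    pair = [("human", human_draft), ("human+machine", human_plus_machine_draft)]
    random.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}

def record_preference(labels: dict, reviewer_choice: str) -> str:
    # Unblind only after the reviewer has committed to "A" or "B".
    source, _draft = labels[reviewer_choice]
    return source  # e.g., "human" or "human+machine"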
491
00:38:15,609 --> 00:38:22,311
So I would say that that Stanford study is a bright shining light as to the way things
should be done:
492
00:38:22,311 --> 00:38:25,212
the other Stanford study, Megan Ma's Stanford study.
493
00:38:25,212 --> 00:38:29,413
The second thing is that similar studies are being done by Dan Schwartz out of Minnesota.
494
00:38:29,413 --> 00:38:35,655
Dan, you might have seen two of his other studies, one of which was to be able to say,
495
00:38:36,871 --> 00:38:44,315
They interspersed large language model-created essays with human-created essays, and
in a double-blind setup the professors graded both.
496
00:38:44,315 --> 00:38:46,657
And so that was his study number one.
497
00:38:46,657 --> 00:38:52,280
Study number two was to, I forget what study number two was, but it was a similar vein.
498
00:38:52,280 --> 00:39:01,746
But then study number three is doing kind of the same thing that Megan is doing, but just
a different twist on it to be able to do a double or triple blind study of human created
499
00:39:01,746 --> 00:39:04,529
things along with machine created things.
500
00:39:04,529 --> 00:39:08,282
and mixing them up and having human evaluators see what they prefer.
501
00:39:08,282 --> 00:39:10,403
So that's evaluation number two.
502
00:39:10,403 --> 00:39:11,984
That is a bright shining light.
503
00:39:11,984 --> 00:39:14,686
Evaluation number three is Legal Technology Hub.
504
00:39:14,686 --> 00:39:26,114
Nikki Shaver and her team are working with vals.ai on a similar study involving
Harvey, Thomson Reuters, LexisNexis, and us at vLex, where it is a similar kind of John Henry
505
00:39:26,114 --> 00:39:28,497
kind of test to evaluate the outputs.
506
00:39:28,497 --> 00:39:35,456
So I would say that the old Stanford study is old news and is probably chasing the wrong
things for the reason we've just discussed.
507
00:39:35,456 --> 00:39:43,937
Whereas the new Stanford study and the Minnesota-Michigan study and the Legal Technology
Hub vals.ai study, those are going to give us some hope going forward.
508
00:39:44,162 --> 00:39:44,602
Interesting.
509
00:39:44,602 --> 00:39:45,082
It's funny.
510
00:39:45,082 --> 00:39:49,987
We're in a space that moves so fast that May is old news and it's November.
511
00:39:49,987 --> 00:39:52,269
But yeah, I don't disagree.
512
00:39:53,490 --> 00:40:04,540
So the Gartner hype curve does an absolutely phenomenal job, in my opinion, of mapping out the
trajectory of new technologies in many cases.
513
00:40:04,540 --> 00:40:09,284
And I think it's really playing out interestingly in AI right now.
514
00:40:09,284 --> 00:40:11,726
So the Goldman study
515
00:40:11,726 --> 00:40:17,466
came out saying that 44% of legal tasks could be automated by gen AI, and that freaked everybody
out.
516
00:40:17,466 --> 00:40:21,566
I think that number was very aspirational.
517
00:40:24,286 --> 00:40:38,638
I think I might have heard you talk about when GPT-3.5 took the bar and scored in the
mid-60s, and GPT-4 scored 90 plus.
518
00:40:38,638 --> 00:40:41,878
That number has since been revised down significantly.
519
00:40:41,878 --> 00:40:43,958
So I've heard, which is interesting.
520
00:40:43,958 --> 00:41:01,698
Um, there's a Wharton survey out that I saw in Peter Duffy's newsletter that surveyed
inside counsel and showed that only 28% of inside counsel anticipate a high impact of gen
521
00:41:01,698 --> 00:41:02,898
AI in their role.
522
00:41:02,898 --> 00:41:08,230
I found that super interesting, and 25% anticipate a low impact.
523
00:41:08,460 --> 00:41:10,441
Which again, that's kind of mind blowing.
524
00:41:10,441 --> 00:41:12,895
But where do you think we are on this hype curve?
525
00:41:12,895 --> 00:41:20,032
Do you feel like we're in the trough of disillusionment or are we still, do we still have
further to go?
526
00:41:20,377 --> 00:41:22,588
I think we're going up the slope actually.
527
00:41:22,588 --> 00:41:28,129
I just gave a talk with Pablo Arredondo, the Casetext co-founder.
528
00:41:28,189 --> 00:41:36,011
He and I gave a presentation, and he spent a lot of his part of the presentation walking us
through the hype cycle and through the trough.
529
00:41:36,011 --> 00:41:41,733
And he thinks that law firms and others are doing the hard yards of going up the slope
slowly but surely.
530
00:41:41,733 --> 00:41:45,034
And I think that he's probably right.
531
00:41:45,034 --> 00:41:48,375
And to a couple of the things that you mentioned, you know, the
532
00:41:49,223 --> 00:41:59,090
bar exam: of course, my friend Pablo was actually one of the guys who did that bar
exam paper, and Mike Bommarito and Dan Katz were the other co-authors. And so
533
00:41:59,090 --> 00:42:08,436
they, in that paper that they wrote, actually put in the caveat saying that, one,
those results are never publicly announced.
534
00:42:08,436 --> 00:42:18,155
So they're kind of doing replicas of the bar exam. So this is always, you know, until the
bar exam, the multistate bar exam, publishes its numbers, of course,
535
00:42:18,155 --> 00:42:20,317
there's no definitive objective number.
536
00:42:20,317 --> 00:42:27,042
It's all largely taking kind of a statistical likelihood rather than a definitive
objective number.
537
00:42:27,042 --> 00:42:27,932
That's thing number one.
538
00:42:27,932 --> 00:42:38,920
Thing number two, they had also put in footnote caveats saying that this was
during COVID times, and, you know, of course there are fewer takers during
539
00:42:38,920 --> 00:42:46,045
COVID times, and maybe those COVID-era takers were less likely to do well, and maybe, you know,
there's all sorts of, you know, scientifically
540
00:42:46,045 --> 00:42:48,908
kind of nebulous things that go into that number.
541
00:42:48,908 --> 00:42:53,682
But anyway, so they put that 90% number with all of those caveats in the initial paper.
542
00:42:53,682 --> 00:42:57,400
So the subsequent papers say, no, it's way lower than 90%.
543
00:42:57,400 --> 00:42:58,216
Like, come on.
544
00:42:58,216 --> 00:43:00,278
Like, they put that in the footnotes.
545
00:43:00,278 --> 00:43:01,870
So that's that.
546
00:43:01,870 --> 00:43:08,155
And then to the other thing, saying that only 28% of inside counsel anticipate high
impact on their role.
547
00:43:09,737 --> 00:43:11,978
I've heard that, but there's also
548
00:43:12,093 --> 00:43:22,727
there are a lot of studies saying that 80% of inside counsel expect their external
counsel's bills to be reduced because of large language models.
549
00:43:22,848 --> 00:43:30,331
So even though 28% of them think that it's going to impact their own role, 80% think it's going
to impact external counsel's role.
550
00:43:30,331 --> 00:43:33,472
So that is an expectation from the buy side, the client side.
551
00:43:33,492 --> 00:43:40,935
And another thing is that the Clio Cloud Conference announced their Clio survey.
552
00:43:41,011 --> 00:43:44,771
Their distribution curve is mostly the solo small up to the midsize law firms.
553
00:43:44,771 --> 00:43:46,581
And they did a survey last year.
554
00:43:46,581 --> 00:43:50,651
And during that survey, they asked:
555
00:43:50,651 --> 00:43:53,451
How many of you are using large language models for legal work?
556
00:43:53,451 --> 00:43:56,331
And the answer in 2023 was about 25%.
557
00:43:56,331 --> 00:43:58,271
They asked the same question in 2024.
558
00:43:58,271 --> 00:44:01,371
And the answer jumped to about 80%.
559
00:44:01,371 --> 00:44:07,471
That is, 80% of solo-small up to midsize Clio users are using AI for legal work.
560
00:44:07,471 --> 00:44:10,803
That's a dramatic jump from 25% to 80%.
561
00:44:10,803 --> 00:44:16,803
And so that shows me that the future is already here, it's just not evenly distributed.
562
00:44:16,803 --> 00:44:25,023
That is, solo-smalls are using this, they're already on the slope of enlightenment,
they're already using it for real use cases, where the big law folks maybe aren't telling
563
00:44:25,023 --> 00:44:26,493
anybody that they're using it.
564
00:44:26,493 --> 00:44:34,603
And maybe the associates in big law, if their law firms prohibit them from using
it, use shadow IT, where they use it on their personal devices, and they're not
565
00:44:34,603 --> 00:44:35,979
telling anybody about it.
566
00:44:36,014 --> 00:44:48,034
Yeah, you know, all these numbers that are flying around, I don't know if you saw the ILTA
Tech Survey, that 74% of law firms with more than 700 attorneys are using gen AI in
567
00:44:48,034 --> 00:44:49,834
business use cases.
568
00:44:50,594 --> 00:44:53,354
That seems very aspirational to me.
569
00:44:53,354 --> 00:44:56,184
And I had Steve Embry on the podcast a while back.
570
00:44:56,184 --> 00:45:02,114
He wrote an article, this was before the Tech Survey came out, called Mind the Gap.
571
00:45:02,114 --> 00:45:04,522
And the gap he was talking about is
572
00:45:04,522 --> 00:45:15,731
between surveys like this that report gen AI usage and the anecdotal observation of people
like him and me who work with law firms all day long and just don't see it.
573
00:45:16,292 --> 00:45:20,376
So I think a lot of these numbers are conflicting, aspirational.
574
00:45:20,376 --> 00:45:25,580
Maybe you have a lawyer who Googles what ChatGPT is and he can check the box.
575
00:45:25,580 --> 00:45:28,062
He or she can check the box, but I don't know.
576
00:45:28,183 --> 00:45:28,904
That's right.
577
00:45:28,904 --> 00:45:35,827
I would say that, yeah, every survey can have holes poked in it based on the way it's asked.
578
00:45:35,827 --> 00:45:45,713
Because you can imagine, if the way it's asked is, have you used large language models in
your practice, and I used it for one thing in the full year, then I could answer yes to
579
00:45:45,713 --> 00:45:46,433
that.
580
00:45:46,433 --> 00:45:54,958
But really, the question is, if the question were instead, what percentage of your work
involved large language models, that number would be totally different, right?
581
00:45:54,958 --> 00:45:56,779
And that number would be way lower.
582
00:45:56,871 --> 00:46:02,288
Have you ever used it in the past year for one thing versus what percentage of your work
have you used it?
583
00:46:02,288 --> 00:46:05,176
Those are two very different questions that will give very different answers.
584
00:46:05,176 --> 00:46:06,907
Yeah, agreed.
585
00:46:06,947 --> 00:46:09,607
Well, this has been a super fun conversation.
586
00:46:09,607 --> 00:46:12,086
I really appreciate you taking a few minutes with me.
587
00:46:12,086 --> 00:46:17,993
I think the LLM reasoning conversation is really just beginning.
588
00:46:17,993 --> 00:46:21,055
Do you, do you know Yann LeCun from
589
00:46:21,055 --> 00:46:23,853
I do, yeah, he's one of the smartest guys around.
590
00:46:23,884 --> 00:46:24,995
Yeah, from Meta.
591
00:46:24,995 --> 00:46:31,957
So I, again, this is more conflicting information that we as individuals have to make
sense of.
592
00:46:31,957 --> 00:46:40,040
He talked about how, uh, currently a house cat is smarter than large language models,
which I thought was interesting.
593
00:46:40,040 --> 00:46:46,614
And then there was a talk at a local EO, um, Entrepreneurs' Organization function here in St.
594
00:46:46,614 --> 00:46:48,525
Louis last night.
595
00:46:48,525 --> 00:46:52,446
And I got the notes from it and it said that current
596
00:46:52,838 --> 00:47:06,422
large language models are operating at, I think the number was, the equivalent of an IQ of
100, and that in the next year and a half, it will operate at an IQ of 1000.
597
00:47:06,422 --> 00:47:16,795
Which, I don't know, those numbers don't make sense to me, but you know, when I hear
Yann say that it's dumber than a house cat.
598
00:47:16,795 --> 00:47:20,618
And then I hear that we're operating today at IQ 100.
599
00:47:20,618 --> 00:47:22,246
There's lots of
600
00:47:22,508 --> 00:47:24,701
You know, there's lots of things to make sense of.
601
00:47:24,701 --> 00:47:25,803
Um, I don't know.
602
00:47:25,803 --> 00:47:27,901
What is your take on that before we wrap up?
603
00:47:27,901 --> 00:47:34,955
Yeah, I really like and respect Yann, and I think he's right that if we want to have
robots, they need to understand the world.
604
00:47:34,955 --> 00:47:45,000
So when he talks about it being as dumb as a house cat, he's talking about the idea
that if you put a ball into a cup and then you flip the cup upside down, what is going to
605
00:47:45,000 --> 00:47:46,500
happen to that ball?
606
00:47:46,560 --> 00:47:49,862
The large language model should know that the ball should fall out of the cup, right?
607
00:47:49,862 --> 00:47:52,343
But large language models often get that wrong.
608
00:47:52,363 --> 00:47:57,477
So if we want robots to be able to figure out how the world works, we definitely need that
kind of spatial reasoning.
609
00:47:57,477 --> 00:47:59,369
And that's what he's talking about with dumber than a house cat.
610
00:47:59,369 --> 00:48:01,971
House cats know that the ball falls out of the cup.
611
00:48:02,111 --> 00:48:10,378
But what Yann isn't saying is there are use cases like the law, where we don't have to deal
with cups turning over and balls.
612
00:48:10,458 --> 00:48:17,163
Every single thing that a lawyer does, every single task, whether you're a litigator or
transactional lawyer, every single task is based on words.
613
00:48:17,404 --> 00:48:21,551
We ingest words, we analyze words, and we output words.
614
00:48:21,551 --> 00:48:23,132
We don't deal with the physical world.
615
00:48:23,132 --> 00:48:26,674
We are merely word based in every single task that we do.
616
00:48:26,674 --> 00:48:30,237
So set that aside; a cat doesn't know words.
617
00:48:30,237 --> 00:48:32,958
We don't need to know whether a ball falls out of a cup.
618
00:48:32,958 --> 00:48:35,060
All we need to know is how the words work.
619
00:48:35,060 --> 00:48:40,983
And I would say for this use case, the legal use case, Yann's criticisms are maybe
inapplicable.
620
00:48:41,038 --> 00:48:43,218
Yeah, yeah, and you're right.
621
00:48:43,218 --> 00:48:44,288
That's what he was saying.
622
00:48:44,288 --> 00:48:52,978
You know, house cats can plan and anticipate and they have spatial awareness that large
language models don't.
623
00:48:52,978 --> 00:48:55,328
Well, this has been a lot of fun before we wrap up.
624
00:48:55,328 --> 00:48:58,268
How do folks find out more about what you do?
625
00:48:58,268 --> 00:49:01,058
Your work with SALI and vLex?
626
00:49:01,058 --> 00:49:03,475
How do people find out more about that?
627
00:49:03,475 --> 00:49:04,675
Yeah, the best place is on LinkedIn.
628
00:49:04,675 --> 00:49:05,805
I hang out there a lot.
629
00:49:05,805 --> 00:49:08,596
It's Damien Riehl, and you have it in the show notes.
630
00:49:08,596 --> 00:49:09,367
Awesome.
631
00:49:09,367 --> 00:49:10,728
Well, good stuff.
632
00:49:11,070 --> 00:49:14,011
Hopefully, are you going to be in Miami for TLTF?
633
00:49:14,011 --> 00:49:15,543
I will see you at TLTF.
634
00:49:15,543 --> 00:49:17,576
That's one of my favorite conferences.
635
00:49:17,576 --> 00:49:18,922
Yeah, looking forward to seeing you there.
636
00:49:18,922 --> 00:49:19,453
Absolutely.
637
00:49:19,453 --> 00:49:22,287
We'll be on stage pitching on Wednesday afternoon.
638
00:49:22,768 --> 00:49:23,270
Good.
639
00:49:23,270 --> 00:49:24,852
We'll see you in Miami.
640
00:49:25,274 --> 00:49:25,995
All right.
641
00:49:25,995 --> 00:49:26,836
Thanks, Damien.
642
00:49:26,836 --> 00:49:27,858
Take care.