﻿WEBVTT

00:00:07.750 --> 00:00:10.329
The LectAuRep project

00:00:10.330 --> 00:00:13.780
focuses on the record books of Parisian notaries

00:00:13.781 --> 00:00:16.070
from the period 1803-1940

00:00:16.071 --> 00:00:18.170
kept at the National Archives.

00:00:18.171 --> 00:00:20.425
A digital image sample

00:00:20.426 --> 00:00:22.645
was processed by printed and handwritten

00:00:22.646 --> 00:00:26.590
character recognition and (exploratorily)

00:00:26.591 --> 00:00:28.630
by natural language processing,

00:00:28.631 --> 00:00:31.115
named entity recognition

00:00:31.116 --> 00:00:33.600
and editorialization.

00:00:34.019 --> 00:00:36.878
A record book is a registry

00:00:36.879 --> 00:00:40.225
in which the notary records in chronological order

00:00:40.226 --> 00:00:42.399
the notarial deeds he has drawn up.

00:00:42.400 --> 00:00:45.270
The contents of these columns can be viewed as metadata

00:00:45.271 --> 00:00:48.390
relating to the acts described.

00:00:48.391 --> 00:00:53.379
They can be categorized as types of acts,

00:00:53.380 --> 00:00:55.739
dates, names, occupations,

00:00:55.740 --> 00:00:58.090
geographical names, keywords.

00:00:58.091 --> 00:01:01.345
As finding aids, record books are

00:01:01.346 --> 00:01:03.355
research corpora in themselves,

00:01:03.356 --> 00:01:05.950
where it is possible to isolate

00:01:05.951 --> 00:01:09.210
homogeneous batches of data of interest
to economic and social history.

00:01:10.799 --> 00:01:13.569
Our goal is to facilitate

00:01:13.570 --> 00:01:16.570
and consolidate access to this content

00:01:16.571 --> 00:01:19.385
by offering our users an enriched reading

00:01:19.386 --> 00:01:21.150
and data mining service.

00:01:21.400 --> 00:01:24.250
We also want to share the results

00:01:24.251 --> 00:01:26.140
of our work and our feedback

00:01:26.141 --> 00:01:29.875
in order to promote the pooling and interoperability

00:01:29.876 --> 00:01:32.639
of data and metadata produced by HTR.

00:01:32.640 --> 00:01:33.699
This involves

00:01:33.700 --> 00:01:36.850
agreeing on common best practices

00:01:36.851 --> 00:01:39.775
or even standards between GLAM institutions, researchers,

00:01:39.776 --> 00:01:43.590
genealogists, library/archive information service software providers

00:01:43.591 --> 00:01:45.260
and HTR service providers.

00:01:45.261 --> 00:01:48.158
Finally, we wanted to take into account

00:01:48.159 --> 00:01:51.275
fifty years of digital heritage resulting

00:01:51.276 --> 00:01:55.245
from both retrospective digitization of microfilms

00:01:55.246 --> 00:01:57.959
and digitization from originals.

00:01:57.960 --> 00:02:00.489
Concretely, we sampled

00:02:00.490 --> 00:02:03.850
two batches of images in black and white

00:02:03.851 --> 00:02:05.965
and color, then we opened

00:02:05.966 --> 00:02:07.779
two work streams that were more homogeneous

00:02:07.780 --> 00:02:09.935
in material terms and therefore

00:02:09.936 --> 00:02:13.540
less difficult to process technically:

00:02:13.541 --> 00:02:16.959
a century of marriage contract

00:02:16.960 --> 00:02:18.730
registration registers

00:02:18.731 --> 00:02:20.440
for merchants, and the record books of

00:02:20.441 --> 00:02:22.225
an 18th century notary.

00:02:22.226 --> 00:02:25.495
Across these four batches, representing 250 hands

00:02:25.496 --> 00:02:27.820
at least, about two thousand pages

00:02:27.821 --> 00:02:30.730
were transcribed, i.e. a few dozen

00:02:30.731 --> 00:02:34.330
scribal hands, of which a good quarter

00:02:34.331 --> 00:02:37.090
was proofread. Satisfactory HTR

00:02:37.091 --> 00:02:38.800
models could be refined

00:02:38.801 --> 00:02:42.670
on the basis of quality ground truth

00:02:42.671 --> 00:02:45.150
to reduce the character error rate.

00:02:53.420 --> 00:02:55.318
For the transcription task,

00:02:55.319 --> 00:02:57.315
the technological environment

00:02:57.316 --> 00:02:58.890
in which the project takes place

00:02:58.891 --> 00:03:00.735
is essential. It is necessary to distinguish

00:03:00.736 --> 00:03:03.165
between the most visible part, the software,

00:03:03.166 --> 00:03:05.370
and the hardware.

00:03:05.371 --> 00:03:06.855
The LectAuRep project is part

00:03:06.856 --> 00:03:08.430
of an open science approach

00:03:08.431 --> 00:03:10.459
which is naturally reflected in the choice

00:03:10.460 --> 00:03:11.605
of software.

00:03:11.606 --> 00:03:13.365
First, kraken, an HTR

00:03:13.366 --> 00:03:16.230
program developed by Benjamin

00:03:16.231 --> 00:03:19.320
Kiessling since 2015. Now under
the umbrella of SCRIPTA PSL, kraken is

00:03:19.321 --> 00:03:20.835
compatible with many alphabetic

00:03:20.836 --> 00:03:22.485
or non-alphabetic writing systems

00:03:22.486 --> 00:03:25.170
and several reading directions.

00:03:25.171 --> 00:03:26.910
It makes it possible to train

00:03:26.911 --> 00:03:28.170
different types of models

00:03:28.171 --> 00:03:30.569
for transcription but also for segmentation,

00:03:30.570 --> 00:03:32.519
i.e. the detection of lines

00:03:32.520 --> 00:03:34.890
and/or text regions on the image.

00:03:34.891 --> 00:03:37.260
Second, eScriptorium,

00:03:37.261 --> 00:03:40.410
a web application developed by SCRIPTA PSL
since 2018, is a virtual workbench

00:03:40.413 --> 00:03:43.095
for managing transcription

00:03:43.096 --> 00:03:45.810
projects that serves as a shell

00:03:45.811 --> 00:03:47.840
or graphical interface

00:03:47.841 --> 00:03:50.040
for HTR engines, in this case kraken.
Since 2019, the LectAuRep project

00:03:50.043 --> 00:03:52.395
has therefore relied on an eScriptorium

00:03:52.396 --> 00:03:54.270
application deployed by ALMAnaCH

00:03:54.271 --> 00:03:56.055
on its servers: first on an INRIA

00:03:56.056 --> 00:03:58.125
virtual machine, i.e. a minimalist server

00:03:58.126 --> 00:04:00.120
with little processing power

00:04:00.121 --> 00:04:02.670
whose purpose was to allow all

00:04:02.671 --> 00:04:05.180
the members of the project to work

00:04:05.181 --> 00:04:07.170
together on the same database,

00:04:07.171 --> 00:04:09.180
then a first scale-up

00:04:09.181 --> 00:04:11.609
took place in 2020 with a migration

00:04:11.610 --> 00:04:14.579
to the more powerful Traces6 server,
equipped in particular with graphics cards

00:04:14.580 --> 00:04:16.760
essential for effective

00:04:16.761 --> 00:04:19.605
training of the models. Finally,
LectAuRep is one of the projects

00:04:19.606 --> 00:04:21.570
that will be able to take advantage

00:04:21.571 --> 00:04:24.180
of a new scale-up thanks to

00:04:24.181 --> 00:04:27.015
the Cremma server financed by the Dim Map.
In addition to being more powerful,

00:04:27.016 --> 00:04:28.680
with more GPUs and more memory,

00:04:28.681 --> 00:04:30.660
it offers a modular architecture

00:04:30.661 --> 00:04:33.630
which is compatible with future scaling-up.

00:04:33.631 --> 00:04:36.530
Without this environment, the task
of transcription is not possible.

00:04:36.860 --> 00:04:39.749
In 2021, the results of LectAuRep

00:04:39.750 --> 00:04:41.985
are numerous. We have established

00:04:41.986 --> 00:04:45.075
a method of producing transcription
models which works well. It is based

00:04:45.076 --> 00:04:46.620
on the use of

00:04:46.621 --> 00:04:49.125
so-called generic models whose character error rate

00:04:49.126 --> 00:04:50.788
is below 10%,

00:04:50.789 --> 00:04:53.205
i.e. the model makes fewer than one error

00:04:53.206 --> 00:04:55.005
in 10 characters.

00:04:55.006 --> 00:04:57.479
These models are trained on batches

00:04:57.480 --> 00:05:00.690
of varied hands, generally at least ten.

00:05:00.691 --> 00:05:02.310
We have two of them trained

00:05:02.311 --> 00:05:04.410
on different sets of transcriptions

00:05:04.411 --> 00:05:06.050
which are more or less final.

00:05:06.051 --> 00:05:08.760
These models serve several purposes:
to produce a first overall transcription

00:05:08.761 --> 00:05:10.275
which allows either

00:05:10.276 --> 00:05:12.720
the publication of a transcription

00:05:12.721 --> 00:05:14.640
which is admittedly imperfect but compatible

00:05:14.641 --> 00:05:16.920
with an exploration of the corpora
in the context of a fuzzy search,

00:05:16.923 --> 00:05:18.960
or the pre-annotation of documents,

00:05:18.961 --> 00:05:22.229
which saves time during the manual
transcription since instead of deciphering,

00:05:22.230 --> 00:05:24.595
one only has to correct.

00:05:24.596 --> 00:05:26.940
In addition, these models serve as a basis

00:05:26.941 --> 00:05:29.670
for refining so-called specialized models

00:05:29.671 --> 00:05:31.620
which are re-trained on small

00:05:31.621 --> 00:05:33.540
batches of uniform data.

00:05:33.541 --> 00:05:35.910
In this way, we quickly reach

00:05:35.911 --> 00:05:38.960
character error rates of 5% or even less,

00:05:38.961 --> 00:05:40.770
that is, one error every twenty characters.

00:05:40.771 --> 00:05:42.780
Of course, within an open science approach,

00:05:42.781 --> 00:05:46.110
the conventions
and practices of transcription developed

00:05:46.111 --> 00:05:48.180
are documented, as well

00:05:48.181 --> 00:05:51.020
as the experiments with the data,
the models and the infrastructure.

00:05:54.320 --> 00:05:57.045
Transcription data is the sinews of war:

00:05:57.046 --> 00:05:59.669
it is the basis for training models,

00:05:59.670 --> 00:06:02.055
and LectAuRep has produced a lot of it, either

00:06:02.056 --> 00:06:05.070
by doing the transcription entirely by hand

00:06:05.071 --> 00:06:07.440
or by correcting automatic

00:06:07.441 --> 00:06:08.940
transcriptions. Part of this data

00:06:08.941 --> 00:06:11.310
is considered gold,

00:06:11.311 --> 00:06:12.750
i.e. it has been checked

00:06:12.751 --> 00:06:14.700
and corrected. It is made public

00:06:14.701 --> 00:06:17.220
through the HTR-United organization

00:06:17.221 --> 00:06:18.780
and may also be

00:06:18.781 --> 00:06:20.205
published through

00:06:20.206 --> 00:06:22.260
the data.culture.gouv.fr platform.

00:06:22.261 --> 00:06:23.760
The rest of the transcriptions

00:06:23.761 --> 00:06:26.610
still require corrections from the DMC

00:06:26.611 --> 00:06:28.910
and will gradually be integrated into

00:06:28.911 --> 00:06:30.900
the gold corpus.

00:06:30.901 --> 00:06:33.480
Other deliverables had not been
anticipated by the project, such as

00:06:33.481 --> 00:06:35.475
a direct and continuous contribution

00:06:35.476 --> 00:06:37.455
to the SCRIPTA PSL project

00:06:37.456 --> 00:06:40.170
in the form of user feedback and use cases,

00:06:40.171 --> 00:06:42.105
in the form of development
of functionalities which

00:06:42.106 --> 00:06:45.990
are integrated into the source code
of the application (we will mention here

00:06:45.991 --> 00:06:47.760
the work of Yves Tadjo

00:06:47.761 --> 00:06:49.650
whose contract is financed by LectAuRep)

00:06:49.651 --> 00:06:52.440
and, more generally, in the form
of documentation which is made available

00:06:52.441 --> 00:06:53.450
to all users

00:06:53.451 --> 00:06:54.535
of the application.

00:06:54.536 --> 00:06:56.790
More broadly, the members of the project

00:06:56.791 --> 00:06:58.800
are engaged in a process

00:06:58.801 --> 00:07:00.450
of sharing expertise

00:07:00.451 --> 00:07:02.310
with users carrying out projects

00:07:02.311 --> 00:07:04.050
involving HTR, with working groups

00:07:04.051 --> 00:07:05.370
such as CREMMALab

00:07:05.371 --> 00:07:07.320
and with the GLAM community.

00:07:07.321 --> 00:07:09.780
This sharing of expertise also takes the form

00:07:09.781 --> 00:07:12.825
of scientific publications on the
questions of interest to the project,

00:07:12.826 --> 00:07:14.775
part of which is the subject

00:07:14.776 --> 00:07:17.160
of blog posts on the Hypotheses platform

00:07:17.161 --> 00:07:19.750
opened during the lockdown and the internship

00:07:19.751 --> 00:07:21.260
of Lucas Terriel in 2020.

00:07:22.310 --> 00:07:24.359
Reflecting an appropriation

00:07:24.360 --> 00:07:26.445
of model performance metrics

00:07:26.446 --> 00:07:28.350
through contact with the field,

00:07:28.351 --> 00:07:30.060
the KaMI tool makes it possible

00:07:30.061 --> 00:07:33.030
to better assess the performance of the models.

00:07:33.031 --> 00:07:36.690
It provides more metrics by combining,
for example, character error rate and

00:07:36.691 --> 00:07:37.980
word error rate

00:07:37.981 --> 00:07:40.470
with Levenshtein distance and operations

00:07:40.471 --> 00:07:42.015
such as deletions,

00:07:42.016 --> 00:07:44.505
insertions and substitutions.

00:07:44.506 --> 00:07:48.630
KaMI is agnostic, that is to say, it
can be used to compare two character strings

00:07:48.631 --> 00:07:51.420
regardless of the HTR software used
to generate them. Finally,

00:07:51.423 --> 00:07:53.835
and above all, it lets users play

00:07:53.836 --> 00:07:55.755
with filters in order to adjust

00:07:55.756 --> 00:07:57.735
the severity of the evaluation according

00:07:57.736 --> 00:07:59.775
to the criteria considered important.

00:07:59.776 --> 00:08:02.040
One can, for example, ignore the errors

00:08:02.041 --> 00:08:04.020
relating to the recognition of digits

00:08:04.021 --> 00:08:06.030
in a case where one is mainly

00:08:06.031 --> 00:08:09.360
concerned with knowing whether the letters and the words

00:08:09.361 --> 00:08:12.720
are correctly recognized. Such an evaluation
is very useful to anticipate the difficulty

00:08:12.721 --> 00:08:16.070
of the correction task after applying
a transcription model.

00:08:16.700 --> 00:08:18.689
The corpus of texts produced

00:08:18.690 --> 00:08:21.090
during the LectAuRep project

00:08:21.091 --> 00:08:24.540
raises challenges which may be of interest
to specialists in natural language processing

00:08:24.541 --> 00:08:26.250
because the language used

00:08:26.251 --> 00:08:29.190
in the pages of the record books
is far from natural: it contains

00:08:29.193 --> 00:08:31.470
many abbreviations and named

00:08:31.471 --> 00:08:33.430
entities and is made

00:08:33.431 --> 00:08:34.610
of verbless sentences.

00:08:36.560 --> 00:08:39.615
Finally, the documents of interest

00:08:39.616 --> 00:08:41.115
to the project prompted us

00:08:41.116 --> 00:08:42.750
to reflect on how to make

00:08:42.751 --> 00:08:44.970
documents with complex layouts accessible.

00:08:44.971 --> 00:08:46.320
LectAuRep provides part

00:08:46.321 --> 00:08:47.430
of the examples studied

00:08:47.431 --> 00:08:49.515
by the ALMAnaCH team to integrate

00:08:49.516 --> 00:08:51.210
an application like TEI Publisher

00:08:51.211 --> 00:08:53.430
into a general-purpose processing pipeline

00:08:53.431 --> 00:08:55.230
dedicated to HTR and based

00:08:55.231 --> 00:08:57.290
on a more systematic use of TEI.

00:08:59.880 --> 00:09:03.719
The LectAuRep project showed

00:09:03.720 --> 00:09:06.450
at the scale of part of the initial sample

00:09:06.451 --> 00:09:08.445
that broad-spectrum HTR

00:09:08.446 --> 00:09:10.305
models work well enough

00:09:10.306 --> 00:09:12.120
to allow fuzzy search

00:09:12.121 --> 00:09:14.250
on pages with widely spaced lines of writing

00:09:14.251 --> 00:09:16.485
in the black and white corpus

00:09:16.486 --> 00:09:18.950
from the 19th century and part of the corpus

00:09:18.951 --> 00:09:21.540
in color from the 20th century.

00:09:21.541 --> 00:09:24.225
Segmentation models currently come up
against a threshold when the lines

00:09:24.226 --> 00:09:26.700
are too tightly spaced,

00:09:26.701 --> 00:09:29.175
i.e. in a significant part of the 20th century corpus.

00:09:29.176 --> 00:09:31.680
So it would be useful to have tools

00:09:31.681 --> 00:09:33.930
to assess the quality

00:09:33.931 --> 00:09:35.910
of segmentation models, on which the quality

00:09:35.911 --> 00:09:38.340
of the HTR depends, in order to be able

00:09:38.341 --> 00:09:39.625
to refine these models

00:09:39.626 --> 00:09:40.910
based on metrics.

00:09:40.970 --> 00:09:43.829
The target corpus of LectAuRep, estimated

00:09:43.830 --> 00:09:47.065
at more than one million images, is monumental.

00:09:47.066 --> 00:09:49.410
Producing less than one percent

00:09:49.411 --> 00:09:51.705
of this corpus, for example one of the 122

00:09:51.706 --> 00:09:54.960
notarial offices or a chronological

00:09:54.961 --> 00:09:57.630
slice of one year out of 140,

00:09:57.631 --> 00:10:00.330
requires participatory logistics,

00:10:00.331 --> 00:10:01.470
infrastructure

00:10:01.471 --> 00:10:03.405
and project engineering

00:10:03.406 --> 00:10:05.510
concerning in particular the flow

00:10:05.511 --> 00:10:07.285
of images and data.

00:10:07.286 --> 00:10:09.865
To refine a sample, it can be useful

00:10:09.866 --> 00:10:11.950
to deepen the (diplomatics) knowledge

00:10:11.951 --> 00:10:13.780
of physical sources

00:10:13.781 --> 00:10:15.565
and digital sources

00:10:15.566 --> 00:10:18.370
and to get an idea of the number of hands

00:10:18.371 --> 00:10:20.875
per register. Being able to specify

00:10:20.876 --> 00:10:23.850
the quantitative distribution between black and white

00:10:23.851 --> 00:10:25.920
and color, thanks to better control

00:10:25.921 --> 00:10:28.350
of metadata, would perhaps make it possible

00:10:28.351 --> 00:10:30.000
to optimize and save the resources

00:10:30.001 --> 00:10:31.910
needed for artificial

00:10:31.911 --> 00:10:33.195
intelligence.

00:10:33.196 --> 00:10:34.479
Thank you.

00:10:34.480 --> 00:10:41.369
[Applause]