Professor Printz and his phenology data

Community Article · Published May 6, 2025

A word about us

At Findable AS we specialize in document understanding and analysis for the building industry. This work has allowed us to hone our skills in document-related engineering, and as a small token of appreciation for all the cool tools and models that have been made available to us, we wanted to share our phenology data set. It has nothing to do with the type of data we normally work with, but is a “labor of love” that we feel deserves a larger audience.

What’s in this article?

This project essentially deals with digitizing tables like these:

[Figure: a scanned page showing one of the handwritten observation tables]

As you’ll see, the contents are handwritten and the table layout is quite complex. If you wanted to digitize a table like this, your first idea might be to give it to one of the many open- or closed-source LLMs out there and ask for a conversion into, say, some markdown format. This might work, but in our experience, having tried it, even the most powerful models struggle with the complex layout and the fact that the data are handwritten. Our guess is that in a year or two we will indeed have models that can successfully transcribe these tables, but for now we need to resort to other methods in order to succeed.

In this article we will explain more about the origin of these tables, what they contain and how to interpret the data. We will also show how to bring the data in the tables into a form that is much easier for ML models to read. In a second article we will show how to actually fine-tune a vision-language model (VLM) to produce the best possible results for data like this.

But first, let’s understand what’s in these tables and why their contents might be interesting.

Why should you care?

Basically, these tables contain something called phenological data. Phenology is, according to Wikipedia:

the study of periodic events in biological life cycles and how these are influenced by seasonal and interannual variations in climate, as well as habitat factors (such as elevation)

It is arguably one of humanity’s oldest sciences, since the very survival of our species depended (and depends) on this knowledge. Knowing about the migratory patterns of animals, for instance, allowed for effective hunting; observations about plants allowed for effective sowing and harvesting, and thereby planning of the food supply. A lot is known about phenology, of course, but there have been relatively few systematic and official efforts dedicated to collecting such information.

In Norway, a large effort at collecting phenological data concerning plants, birds and agricultural phenomena was undertaken in 1928 by Henrik Printz, a botanist at the University of Oslo. He established an extensive network of observers all over Norway and tasked them with observing a large number of different phenological phases, such as the flowering and budburst of different plants, the arrival of certain migratory birds, etc.

In his 1959 publication

A. Lauscher, F. Lauscher, and H. Printz, Die Phänologie Norwegens, Teil II. Phänologische Mittelwerte für 260 Orte, Skr. Det Norske Videnskaps-Akademi i Oslo, I. Mat.-Naturv. Kl., No. 1, 1959, 1-176.

he published observations from 278 observation stations for the years 1928 to 1952. The publication, 182 pages long and written in German, contains the data from each observation station in the form of handwritten tables like the one above.

For instance, the cell indexed as row i and column 1 (shown in blue in the figure above) represents the Julian date of the first flowering of a tiny plant called coltsfoot (in Latin, Tussilago farfara). Julian dates, as used in this article, are simply the day number counted from the 1st of January, so in a non-leap year the Julian date 137 corresponds to the 17th of May.
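
To make the convention concrete, here is a tiny Python check of that arithmetic (1930 is just an arbitrary non-leap year from the observation period):

```python
from datetime import date, timedelta

def julian_to_date(day_number: int, year: int = 1930) -> date:
    """Convert a day number (1 = January 1st) to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_number - 1)

print(julian_to_date(137))  # 1930-05-17, i.e. the 17th of May
```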

The coltsfoot is interesting since it is a so-called phenologically plastic plant: it basically starts its life cycle whenever local climatic conditions permit. If spring is early in a given year it will blossom early; if spring is cold and/or arrives late it will blossom later. So you can think of these tiny plants as climatic laboratories spread all over Norway.

So why should we care about the flowering dates of these plants? Well, the observation period coincides with a period when the human contribution of potential greenhouse gases was much lower than today (see here, for instance, for an overview of the evolution of CO2 in the atmosphere since 1750). In this sense, Henrik Printz’s data represent a time capsule of indirect climatic observations that is soon to be a hundred years old. It serves as a very simple baseline against which we can compare today’s conditions in Norway and verify whether nature’s phenological phases have indeed been influenced by a possible climatic change.

The data

A few observations about the data:

  1. There are 278 tables corresponding to the 278 observation locations.
  2. Every table contains 4 metadata fields (shown in yellow below) and 292 data fields (shown in blue):

[Figure: table layout with the 4 metadata fields highlighted in yellow and the 292 data fields in blue]

  3. There are 83,956 fields all in all, but as can be observed above, not all observations were made in every location.
  4. There are a total of 33,905 cells that are not blank.
  5. For now, you will find an Excel sheet with all the observations here. Note that the readings of the cells have been performed by a fine-tuned VLM. Our plan is to later also provide a manually verified table.
  6. The code for extracting all this data is available here.

An alternative approach to reading the data

The original data are page spreads scanned on a commercial scanner. The spreads look like this:

[Figure: a scanned page spread containing four tables]

So each spread contains four separate tables, and our first thought is that it would at least help if we could separate them out. There are several challenges that must be overcome in order to do this. Let’s start by splitting the spreads in the middle and also getting rid of some of the white space. Since we used great care when scanning, this can be done quite simply by indexing into the row and column coordinates of the images. So basically we are just clipping the spreads like this:

[Figure: clipping coordinates overlaid on a page spread]

This will produce single pages like this:

[Figure: a single page after splitting the spread]
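
As a minimal sketch of the clipping step (the file name and coordinates below are hypothetical; the real values depend on the scanner setup and can be found in the repository):

```python
import numpy as np
from PIL import Image

spread = np.array(Image.open("spread.png").convert("L"))  # hypothetical file name

# Hypothetical clipping coordinates, chosen by inspecting a few spreads.
TOP, BOTTOM = 100, 3300
LEFT_MARGIN, RIGHT_MARGIN = 150, 150
MIDDLE = spread.shape[1] // 2

left_page = spread[TOP:BOTTOM, LEFT_MARGIN:MIDDLE]
right_page = spread[TOP:BOTTOM, MIDDLE:spread.shape[1] - RIGHT_MARGIN]
```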

We still have problems here. First of all there is some white space left, and the tables are slightly rotated. Let’s deal with the rotations first. This can be done in several ways using techniques from classical image processing, such as the Hough (Radon) transform, or techniques from mathematical morphology, but there is a much simpler and more intuitive approach (a small code sketch follows the list). Here is the basic idea:

  1. Invert the images so that they are white text on a black background.
  2. Sum the values of the pixels vertically to get the column sums.
  3. Get the maximal value of the column sums.
  4. Rotate the image by a very small amount and repeat from 2.
  5. The rotation that produces the largest peak value corresponds to the optimal rotation.
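
A minimal sketch of this search, using scipy for the rotation (the angle range and step size below are our illustrative choices):

```python
import numpy as np
from scipy.ndimage import rotate

def best_rotation(page: np.ndarray) -> float:
    """Find the rotation angle (in degrees) that maximizes the column-sum peak."""
    inverted = 255 - page                      # white text on black background
    best_angle, best_peak = 0.0, -np.inf
    for angle in np.arange(-2.0, 2.05, 0.05):  # hypothetical search range
        rotated = rotate(inverted, angle, reshape=False, order=1)
        peak = rotated.sum(axis=0).max()       # largest column sum
        if peak > best_peak:
            best_angle, best_peak = angle, peak
    return best_angle
```

The intuition: when the vertical table lines are perfectly vertical, all their ink lands in just a few image columns, producing the sharpest possible peak.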

This is what a plot of the column sums looks like for a not-yet-optimal rotation:

[Figure: plot of the column sums for a not-yet-optimal rotation]

This is of course very “ad hoc” and would not work for all such problems, but it works wonderfully well in our case, so this is how we did it.

We are, however, not quite there yet. We also need to trim off all the white space surrounding the tables. For this we will use some simple tricks from mathematical morphology. We start by defining two so-called structuring elements looking like this:

[Figure: the two structuring elements, with their origins marked by red dots]

The red dots represent the origins of the structuring elements. We will now slide each of these two elements over the inverted image and basically ask the following question: does the structuring element “fit” in the inverted image or not? The leftmost structuring element can be seen to fit upper-left corners well, while the rightmost one fits lower-right corners. Obviously, the elements will fit in many locations in the images, but we are looking for the uppermost-leftmost fit as well as the lowermost-rightmost fit. We are sweeping quite a lot of technicalities under the carpet here; if you want to see the details of how this is done, take a look in our GitHub repository.

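The “does it fit” question is exactly what binary erosion answers, so a rough sketch of the idea (not the repository’s actual code; the element size is a guess) could look like this:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def find_table_corners(binary: np.ndarray, k: int = 15):
    """Locate the table corners in a binarized, inverted page
    (binary is True where there is ink)."""
    # Upper-left element: a horizontal and a vertical arm meeting at the origin.
    ul = np.zeros((k, k), dtype=bool)
    ul[0, :] = True
    ul[:, 0] = True
    lr = ul[::-1, ::-1]  # lower-right element: the 180-degree rotation

    # Erosion is True exactly where the whole element fits inside the foreground.
    half = k // 2
    ul_fits = binary_erosion(binary, structure=ul, origin=(-half, -half))
    lr_fits = binary_erosion(binary, structure=lr, origin=(half, half))

    ys, xs = np.nonzero(ul_fits)
    i = (ys + xs).argmin()          # uppermost-leftmost fit
    upper_left = (ys[i], xs[i])

    ys, xs = np.nonzero(lr_fits)
    i = (ys + xs).argmax()          # lowermost-rightmost fit
    lower_right = (ys[i], xs[i])
    return upper_left, lower_right
```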

In this way it is possible to locate the upper-left and lower-right corners, as shown below:

[Figure: the detected upper-left and lower-right corners]

Having detected these corners, it is a simple task to cut away the remaining white space. After we have done this, we decide on a mean number of rows and columns for the pages and normalize (by resampling) the pages to this standard size. Since the pages are now in a normalized format, we can get the two separate tables per page by just indexing into the rows and columns (a small sketch of this step follows the figure). To check how this went, we take a look at nine randomly selected tables:

[Figure: nine randomly selected tables after normalization and splitting]
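
As a sketch, and assuming the two tables on a page sit one above the other (the standard size and split row below are hypothetical):

```python
from skimage.transform import resize

STD_ROWS, STD_COLS = 3000, 2000   # hypothetical normalized page size
SPLIT_ROW = STD_ROWS // 2         # hypothetical boundary between the two tables

def split_page(cropped):
    """Resample a cropped page to the standard size and split it into its two tables."""
    page = resize(cropped, (STD_ROWS, STD_COLS), preserve_range=True)
    return page[:SPLIT_ROW], page[SPLIT_ROW:]
```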

This looks good, and we now get to the final step in our processing. Using the same trick as we did for getting the separate tables, we also index into the tables to get small subimages for each cell. We basically create two lists of row and column coordinates like this:

[Figure: the lists of row and column coordinates]

Then we can define the data for a specific observation like this:

[Figure: extracting the subimage for a specific observation]
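
In spirit, the extraction looks something like the sketch below; the coordinate values and the margin are hypothetical stand-ins for the tuned lists in the repository:

```python
import numpy as np

# Hypothetical, evenly spaced cell boundaries for the normalized tables.
row_coords = np.linspace(100, 1400, 38).astype(int)
col_coords = np.linspace(80, 1900, 9).astype(int)
MARGIN = 4   # safety margin, since the tables never align perfectly

def cell_image(table, i, j):
    """Cut out the subimage for the cell in row i, column j."""
    r0, r1 = row_coords[i] - MARGIN, row_coords[i + 1] + MARGIN
    c0, c1 = col_coords[j] - MARGIN, col_coords[j + 1] + MARGIN
    return table[r0:r1, c0:c1]
```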

With that, we’re done. To check how this went, here is a random selection of 100 images of the coltsfoot flowering observation (note: it is normal that some of the observations are not filled in; that just means that no observation was available for this species in a given location):

[Figure: 100 randomly selected cell images of the coltsfoot flowering observation]

You’ll notice that this is far from perfect. The tables are still sufficiently different that we do not always hit precisely, so we have added a safety margin, which also means that quite a lot of table borders end up in the crops. But think about this from the perspective of the VLM that is to read the text: it basically sees a frame, some background and a short piece of text. Modern VLMs, when prompted correctly, are very good at solving such problems.

We used the very capable Qwen2.5-VL model to perform an initial reading of the texts. We then manually verified and corrected a subset of the readings and used them to fine-tune the same model so that it becomes even better at this task. That process is described in a separate article.
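
For reference, a minimal reading of a single cell with Qwen2.5-VL through Hugging Face transformers looks like the sketch below; the model size, the file name and the prompt wording are our illustrative choices, not necessarily the exact setup we used:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"   # hypothetical size choice
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "cell.png"},   # one extracted cell image
        {"type": "text", "text": "The image shows a single handwritten table "
                                 "cell. Transcribe the number in it, or answer "
                                 "'blank' if the cell is empty."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=16)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```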

You will find all the code needed to experiment with this in our GitHub repository.
