A scientist, especially a data scientist, needs data to do their work.
Over the years, I have analysed a large variety of data types. In my various work and study experiences I came across laboratory data; field measurements of rainfall; earthquake records; online shopping data; and now, geospatial data. That is, spatial data that is used to understand the natural phenomena happening on our planet.
I have only just arrived in the geospatial community and must admit that geospatial data is different from everything I knew before. Nothing else that I know is structured the way geospatial data is.
It starts with the fact that it contains different levels of information, all centred around the geographical location with which the other values are associated. In Geographical Information Systems (GIS), the data itself is represented by so-called data models: mathematical and digital structures used to represent different types of phenomena. These are far more complex than the humble tables I previously worked with.
Then, there is the whole world of data standards, such as the INSPIRE directive, for dealing with data and metadata in a regulated way and for publishing and sharing standardised geospatial datasets. A real labyrinth of rules and regulations for those unfamiliar with the subject. However, with that complexity comes great added value: data interoperability across the pan-European community.
There is a real hunger in the air for high-quality data. Science is theory based on, and confirmed by, empirical observation and measurement. In the age of digitalisation and the big data boom, with their unbridled use of models to describe the most disparate phenomena, scientists want data. For the best results, they need that data to be complete, abundant, and as homogeneous as possible: ready to be fed to numerical models that simulate or predict a wide range of phenomena.
It is therefore easy to understand why various scientific communities are moving to collect their data in a single pool, from which they can pull data when needed. This approach has a very strong impact on the quality of the models and theories that are generated from this data, because it is possible to reproduce and refine the results.
Machine learning repositories are already maintained by the data science community in an effort to make it easier to train models by having more data centrally available. Now, this need to share available data from a single pool, in a secure and regulated way, is also emerging as a hot topic in the environmental data community. The most striking innovation is to be found in the exciting new technology surrounding data spaces, which are poised to become the secure pools of standardised and homogeneous data that scientists crave.
This need for standardisation and homogeneity is also addressed by the INSPIRE directive, a standardised Spatial Data Infrastructure. INSPIRE provides spatial data with common, standardised terminologies so that it can be exchanged across EU borders. For example, INSPIRE encodes addresses from different countries in a single unified format, in spite of regional differences in how addresses are stored.
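To give a feel for what such harmonisation means in practice, here is a deliberately simplified sketch. The field names and the flat dictionary structure are my own invention for illustration; the real INSPIRE Addresses theme uses a much richer, GML-based application schema.

```python
# Simplified illustration of address harmonisation. Field names are
# hypothetical -- the actual INSPIRE Addresses schema is far richer.

def to_common_schema(raw: dict, mapping: dict) -> dict:
    """Rename country-specific fields to a shared vocabulary."""
    return {common: raw[local] for local, common in mapping.items()}

# Country-specific records, each using its own local field names.
italian = {"via": "Via Roma", "civico": "12", "cap": "39100", "comune": "Bolzano"}
austrian = {"strasse": "Ringstrasse", "hausnummer": "5", "plz": "1010", "ort": "Wien"}

# Per-country mappings onto one shared set of field names.
it_map = {"via": "thoroughfare", "civico": "locator", "cap": "postcode", "comune": "locality"}
at_map = {"strasse": "thoroughfare", "hausnummer": "locator", "plz": "postcode", "ort": "locality"}

for record, mapping in [(italian, it_map), (austrian, at_map)]:
    print(to_common_schema(record, mapping))
```

After the mapping, both records expose the same keys, so downstream code no longer cares which country the address came from.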
Thus the INSPIRE and data science communities have similar needs and values, like those surrounding the interoperability of data, which is also a common theme in the broader scientific community.
One of our projects at wetransform fits exactly into that niche: the intersection of standardised data and data science. In our FutureForests project, recommendations are based on standardised data such as forest inventories, weather, soil, pest development, and air pollution.
You could say that I opened Pandora’s box when I first tried to imagine how to deduce the health of a forest from data.
The first step is to develop a model that can predict, from the data provided, which species of trees inhabit a certain study area. From there, more complex models can be built. Models that are not only able to classify observations, but also try to predict the health of a forest in, say, 20 years' time, under various climate change scenarios.
The initial tree species classification alone can require data from many different fields, especially when using cartographic variables: information on the chemical composition of the soil; cartographic or remotely sensed historical data providing the so-called labels (which species were identified by a human eye in the past, the oracle for our model); topographic data, such as elevation, distance to water sources, or the hill-shade index; and many others.
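The classification step can be sketched with a toy example. Everything below is hypothetical: made-up feature values, two placeholder species, and a minimal nearest-centroid classifier written from scratch so the idea stays visible. A real workflow would use a proper machine learning library and far more data.

```python
import math

# Hypothetical training plots: (elevation_m, soil_pH, hill-shade index)
# paired with a species label assigned by a human in the past.
training = [
    ((1500.0, 4.5, 0.70), "spruce"),
    ((1420.0, 4.8, 0.60), "spruce"),
    ((450.0, 6.2, 0.30), "beech"),
    ((520.0, 6.0, 0.40), "beech"),
]

def centroids(samples):
    """Compute the mean feature vector for each species label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lbl: tuple(v / counts[lbl] for v in acc) for lbl, acc in sums.items()}

def classify(features, cents):
    """Assign the label of the nearest centroid (Euclidean distance)."""
    return min(cents, key=lambda lbl: math.dist(features, cents[lbl]))

cents = centroids(training)
# A high-elevation plot with acidic soil lands near the spruce centroid.
print(classify((1480.0, 4.6, 0.65), cents))  # → spruce
```

Note that this only works because every plot carries the same features in the same units: exactly the homogeneity that standardised data provides.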
I have learned that understanding, analysing, and reproducing any environmental phenomenon involves many different and complex factors. This means that one needs different data, from different sources, possibly in different formats. Which brings us back to the value offered by INSPIRE, the standardisation process, and data spaces.
One can easily see how being able to access so much different data at a single collection point can be a huge advantage for a scientist. It means one doesn't have to look for different sources, get necessary permissions, and make a large variety of datasets homogeneous. Not to mention the advantage of being able to access data that is already standardised and cleaned – in AI modelling, data cleaning and processing often takes up around 80% of the total project time. Standardised data cuts this time down significantly, because all the source data starts from a much better, more uniform point, even if some work is still left for the scientist to do.
Let me give you a concrete example to better understand the advantages of standardised and interoperable data for the average data scientist.
Let us suppose that we are studying a forest that geographically covers two different countries, Italy and Austria, for example. Now imagine that you do not have access to standardised data. This means that you first need to find a correspondence between variables described in Italian and variables described in German. Before you can use the data, you must make certain that specific values in one dataset describe the same parameter in the other dataset. With standardised data, this certainty is built in, because everything is structured the same regardless of language.
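Concretely, that correspondence work happens at the level of values, not just column names. The sketch below is hypothetical (invented plot records and a hand-built lookup table), but it shows the kind of mapping a scientist must construct by hand when the two inventories record the same species under different local names:

```python
# Hypothetical value-level harmonisation: the same tree species recorded
# under different local names in an Italian and an Austrian inventory.
SPECIES_TO_LATIN = {
    "abete rosso": "Picea abies",     # Italian: Norway spruce
    "faggio": "Fagus sylvatica",      # Italian: European beech
    "Fichte": "Picea abies",          # German: Norway spruce
    "Rotbuche": "Fagus sylvatica",    # German: European beech
}

def harmonise(records):
    """Replace local species names with a shared scientific name."""
    return [{**r, "species": SPECIES_TO_LATIN[r["species"]]} for r in records]

italian_plots = [{"plot_id": "IT-01", "species": "abete rosso"}]
austrian_plots = [{"plot_id": "AT-07", "species": "Fichte"}]

merged = harmonise(italian_plots + austrian_plots)
print(merged)  # both rows now carry "Picea abies"
```

With standardised data, a table like `SPECIES_TO_LATIN` is already baked into the dataset, so the two inventories can be merged directly.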
Can you imagine how much easier the life of a data scientist or machine learning model developer could be with data collection platforms where standardisation is guaranteed? I definitely can! That's why I am really looking forward to seeing the birth of our first environmental data space. To learn more and get involved, check out the environmental data spaces community!