A scientist, especially a data scientist, needs data to do their work.

Over the years, I have analysed a large variety of data types. In my various work and study experiences I came across laboratory data; field measurements of rainfall; earthquake records; online shopping data; and now, geospatial data. That is, spatial data that is used to understand the natural phenomena happening on our planet.

I have only just arrived in the geospatial community and must admit that geospatial data is different from everything I knew before. Nothing else that I know is structured the way geospatial data is.

It starts with the fact that it contains different levels of information, all centred around the geographical location with which the other values are associated. In Geographical Information Systems (GIS), the data itself is represented by so-called data models: mathematical and digital structures that represent different types of phenomena. These are far more complex than the humble tables I previously worked with.

Then there is the whole world of data standards, such as the INSPIRE directive, which governs how that data and its metadata are handled, published, and shared as standardised geospatial datasets. A real labyrinth of rules and regulations for those unfamiliar with the subject. However, with that complexity comes great added value: data interoperability in the pan-European community.

There is a real hunger in the air for high-quality data. Science is theory based on, and confirmed by, empirical observation of measurements. In the age of digitalisation and the big data boom, with its unbridled use of models to describe the most disparate phenomena, scientists want data. For the best results, they need that data to be complete, abundant, and as homogeneous as possible, ready to be fed to numerical models that simulate or predict a wide range of phenomena.

It is therefore easy to understand why various scientific communities are moving to collect their data in a single pool, from which they can pull data when needed. This approach has a very strong impact on the quality of the models and theories that are generated from this data, because it is possible to reproduce and refine the results.

The data science community already maintains machine learning repositories to make model training easier by having more data centrally available. Now, this need to share available data from a single pool, in a secure and regulated way, is also emerging as a hot topic in the environmental data community. The most striking innovation is to be found in the exciting new technology surrounding data spaces, which are poised to become the secure pools of standardised and homogeneous data that scientists crave.

This need for standardisation and homogeneity is also reflected in the INSPIRE directive, which establishes a standardised Spatial Data Infrastructure. INSPIRE provides common, standardised terminologies for spatial data so that it can be exchanged across EU borders. For example, INSPIRE encodes addresses from different countries in a single unified format, in spite of regional differences in how addresses are stored.

Thus the INSPIRE and data science communities have similar needs and values, like those surrounding the interoperability of data, which is also a common theme in the broader scientific community.

One project at wetransform fits exactly into that niche, i.e. the intersection of standardised data and data science. In our FutureForests project, recommendations are based on standardised data such as forest inventories, weather, soil, pest development, and air pollution.

You could say that I opened Pandora’s box when I first tried to imagine how to deduce the health of a forest from data.

The first step is to develop a model that can predict, from the data provided, which species of trees inhabit a certain study area. From there, more complex models can be built: models that not only classify observations, but also try to predict the health of a forest in, say, 20 years’ time under different climate change scenarios.

The initial tree species classification alone could require data from many different fields, especially when using cartographic variables: information on the chemical composition of the soil; cartographic or remotely sensed historical data providing the so-called labels (which species a human eye classified in the past, the oracle for our model); topographic data, such as elevation or distance to water sources; the hill-shade index; and many others.
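To make this concrete, here is a minimal sketch of what such a classification model could look like in Python. The feature columns and values are invented for illustration, and scikit-learn’s random forest simply stands in for whatever model a real project would choose.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy training table: each row is one surveyed plot, each column one of the
# variables mentioned above. All names and values are made up for illustration.
plots = pd.DataFrame({
    "elevation_m":     [420, 980, 1350, 610, 1120, 770, 530, 1240],
    "dist_to_water_m": [150, 900, 300, 75, 640, 220, 410, 180],
    "hillshade_index": [180, 120, 200, 160, 140, 190, 170, 130],
    "soil_ph":         [5.6, 4.8, 6.1, 5.9, 4.5, 6.3, 5.2, 4.9],
    # Labels: the species a human classified in the past (the "oracle").
    "tree_species":    ["beech", "spruce", "pine", "beech",
                        "spruce", "pine", "beech", "spruce"],
})

X = plots.drop(columns="tree_species")
y = plots["tree_species"]

# Hold out part of the data to check how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("Held-out accuracy:", model.score(X_test, y_test))
```

A real version would of course draw these columns from the harmonised datasets described above rather than from a hand-typed table.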

I have learned that understanding, analysing, and reproducing any environmental phenomenon involves many different and complex factors. This means that one needs different data, from different sources, possibly in different formats. Which brings us back to the value offered by INSPIRE, the standardisation process, and data spaces.

One can easily see how being able to access so much different data at a single collection point can be a huge advantage for a scientist. It means one doesn’t have to hunt down different sources, obtain the necessary permissions, and make a large variety of datasets homogeneous. Not to mention the advantage of being able to access data that is already standardised and cleaned – in AI modelling, data cleaning and processing often takes up around 80% of the total project time. Standardised data cuts this time down significantly, even if some work is still left for the scientist to do, because all the source data starts from a much better, more uniform point.

Let me give you a concrete example to better understand the advantages of standardised and interoperable data for the average data scientist.

Let us suppose that we are studying a forest that geographically covers two different countries, Italy and Austria for example. Now imagine that you do not have access to standardised data. This means that you first need to find a correspondence between variables described in Italian and variables described in German. Before you can use the data, you must make certain that specific values in one dataset describe the same parameter in the other dataset. With standardised data, this certainty is built in, because everything is structured the same regardless of language.
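To illustrate the manual work this implies, here is a rough sketch of aligning two such datasets without a standard. Every field and value name below is invented; the point is that each correspondence has to be researched and verified by hand.

```python
import pandas as pd

# Hypothetical field-name mappings from each national dataset to a common schema.
italian_to_common = {"specie": "tree_species", "quota_m": "elevation_m", "ph_suolo": "soil_ph"}
german_to_common  = {"Baumart": "tree_species", "Hoehe_m": "elevation_m", "Boden_pH": "soil_ph"}

# The values need mapping too: the same species is labelled differently in each language.
species_to_common = {"faggio": "beech", "Buche": "beech",
                     "abete rosso": "spruce", "Fichte": "spruce"}

italy   = pd.DataFrame({"specie": ["faggio", "abete rosso"], "quota_m": [420, 1350], "ph_suolo": [5.6, 4.8]})
austria = pd.DataFrame({"Baumart": ["Buche", "Fichte"],      "Hoehe_m": [610, 1120], "Boden_pH": [5.9, 4.5]})

combined = pd.concat([
    italy.rename(columns=italian_to_common),
    austria.rename(columns=german_to_common),
], ignore_index=True)
combined["tree_species"] = combined["tree_species"].map(species_to_common)

print(combined)
```

With standardised sources, both datasets would already share one schema and one code list, and none of the dictionaries above would be needed.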

Can you imagine how much easier the life of a data scientist or machine learning model developer could be with data collection platforms where standardisation is guaranteed? I definitely can! That’s why I am really looking forward to seeing the birth of our first environmental data space. To learn more and get involved, check out the environmental data spaces community!

Should INSPIRE be a priority?
03.05.2022 by John Boudewijn, Akshat Bajaj

To INSPIRE or not to INSPIRE – implementing the directive is often a much-debated question within organisations. Some see it as a resource-intensive project, some see it as the future of Spatial Data Infrastructures (SDI), and some see it as both.

We’ve seen a number of organisations deprioritise INSPIRE for a variety of reasons.

Chief among them:

  • The ROI: INSPIRE implementation can seem like a resource-draining activity. There is a need to create and update metadata, harmonise data, and publish view and download services. In the short term, it can be difficult to see how this benefits the implementing organisation beyond enabling data exchange and ensuring compliance with the directive.
  • Limited Resources: Public sector GIS analysts deal with a plethora of digitisation tasks and projects at any given time and are usually understaffed. The priorities of these tasks are often set by high-level decision makers who are themselves not very involved in geospatial data analysis.
  • Project Complexity: INSPIRE is very intricate. Transforming and validating gigabytes’ worth of shapefiles into nested GML files and ensuring a high quality of published data is no easy feat. Coupled with the lack of resources most GIS departments have, the probability of a successful INSPIRE project can seem low.

However, INSPIRE implementation has been a success for many organisations! Just on our own platform, we see 150 million hits a month. Most of the issues mentioned above can be mitigated by engaging expert assistance.

Some of the organisations we’ve spoken with have gone so far as to call INSPIRE the “Rolls-Royce of SDIs” and have committed to making INSPIRE a reality.

Why are there such divergent perspectives on INSPIRE?

The short answer – it’s because of short-term thinking vs. long-term perspectives.

The long-term perspective on data management always contains INSPIRE.

In our modern world, the amount of data collected is staggering. It can be as technologically advanced as remote sensing, or as simple as a national bird count. Even our fridges and vacuum cleaners are becoming tiny data hubs. As a matter of fact, the global data pool is expected to almost triple, from 64.2 zettabytes in 2020 to over 180 zettabytes in 2025. For some perspective – storing 180 zettabytes would need nearly eight trillion Blu-ray discs. That amount would get you to the moon 23 times, just by stacking them!
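For the curious, the back-of-the-envelope arithmetic behind that comparison looks roughly like this, assuming 25 GB single-layer discs that are 1.2 mm thick:

```python
# Rough sanity check of the Blu-ray comparison above.
ZETTABYTE = 1e21                       # bytes
DISC_CAPACITY = 25e9                   # 25 GB per single-layer Blu-ray disc (assumed)
DISC_THICKNESS_M = 1.2e-3              # 1.2 mm per disc (assumed)
MOON_DISTANCE_M = 384_400e3            # average Earth-Moon distance

discs = 180 * ZETTABYTE / DISC_CAPACITY            # ~7.2 trillion discs
stack_height_m = discs * DISC_THICKNESS_M          # ~8.6 million km
trips_to_moon = stack_height_m / MOON_DISTANCE_M   # ~22-23, depending on the specs assumed

print(f"{discs:.2e} discs, stacked {stack_height_m/1e3:,.0f} km high, "
      f"about {trips_to_moon:.0f} times the distance to the moon")
```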

So, what is this growing data pool good for, except for selling targeted advertising? If all we needed was massive quantities of data, surely by now we should be able to solve climate change and save the world at least ten times over by lunchtime? Unfortunately, data collection is merely the beginning of a much more complex process.

The way data can flow from its initial collection point to national institutions and international decision-makers is by using agreed-upon standards to format and work with this data. Standards like INSPIRE.

INSPIRE enables interoperability, which is key in enabling vast amounts of information to flow in a way that supports multiple parties, without being hampered by differences in language, formatting, and the like.

While the benefits of interoperability may not be immediately visible to the naked eye, in the grand scheme of things improving it can save up to 500bn EUR. If you’re hesitant about allocating resources towards improved interoperability, check out this article on the quantified benefits for citizens, governments, and businesses.

According to a marketing study we conducted with over 100 INSPIRE implementers, 80% of respondents said that transforming data is the most complex part of the INSPIRE process. They also found it hard to acquire the necessary knowledge in a time-sensitive manner to deal with this complexity. However, with the right tools and knowledge, harmonisation and publishing processes can be a relative walk in the park compared to going it alone. We even see an increasing number of clients who have previously implemented INSPIRE turn to us in order to improve their efficiency!

The value of harmonisation is huge. The past is full of challenges in which greater interoperability would have made a significant difference. The Icelandic volcanic eruptions that cost US $1.7 billion could have been handled more effectively if geodata could easily be transferred between different parties. This very incident was one of the strongest cases for INSPIRE to become a priority!

As further challenges to our ecosystem loom on the horizon, you can bet that harmonised data will be a part of the good fight, not least by enabling further innovation such as Artificial Intelligence.

In short, implementing INSPIRE is very much what you make of it. It does not have to be a complicated process that you tackle alone, and the future impact of improved interoperability is significant.

For people who work with data all day, every day, sometimes it’s almost too easy to forget why enabling interoperability is so important. Words like “compliance” become key and priorities are set based on what will or won’t incur fines.

Sure, for a single institution that’s important, but in the bigger picture it is even more important that we avoid the data-equivalent of bowling shuttlecocks at baseball players. If we all actually try to play the same game, we stand a much better chance of winning.

Improved access to clean data enables the creation of better AI models to save forests and set up drone corridors. Implementing something like INSPIRE may be a challenge for a while, though significantly less so when utilising our tools, but it has big implications for the real world.

When people hear “science”, they expect that somewhere, someone smart is going to look at what has been provided, go “A-ha!” and then solve a problem.

Is that reality? Of course not, but we get closer to it with every step towards making collected data more valuable and interoperable.

So, the next time a question like “Should INSPIRE be a priority for us?” comes up, remember that it is a priority. For everyone.

For Users

New Features

  • hale»connect now supports the upload and schema validation of GeoPackages. Users can export GeoPackage schemas as json.hsd files from hale»studio for use in hale»connect. The functionality enables END (Environmental Noise Directive) implementers to validate END-conformant GeoPackages on hale»connect.
  • GeoPackage publishing support is now a part of hale»connect. Users can publish GeoPackages as datasets and automatically generate associated view and download services, and metadata.
  • GeoPackage is now available as a download format in Atom feeds. Independent of the dataset’s file type, hale»connect users can now add GeoPackage downloads to their Atom feeds.
  • GeoPackage is now available as a transformation target in online transformation projects. Users working with GeoPackage as the target schema in hale»studio transformation projects can now use those projects in online transformation configurations on hale»connect. Transformation projects require a custom data export configuration with GeoPackage selected as the export format. hale»connect first searches for a custom export configuration named hale-connect and applies it if found. Next, hale»connect searches for a custom export configuration named default. Otherwise, hale»connect uses the default configuration, which exports to a GML FeatureCollection (see the sketch after this list).
  • hale»connect now supports 3D data publishing. Users can publish data with three coordinates and create OGC WFS services.
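As an aside, the export configuration lookup described in the transformation-target note above can be summarised in a few lines. This is only an illustrative sketch of the documented order; the data structures are invented and do not reflect hale»connect internals.

```python
def select_export_config(custom_configs: dict) -> dict:
    """Illustrate the documented lookup order for export configurations."""
    if "hale-connect" in custom_configs:       # 1. configuration named "hale-connect"
        return custom_configs["hale-connect"]
    if "default" in custom_configs:            # 2. configuration named "default"
        return custom_configs["default"]
    # 3. built-in fallback: export to a GML FeatureCollection
    return {"format": "GML FeatureCollection"}

# Example: a project that defines a GeoPackage export configuration.
print(select_export_config({"hale-connect": {"format": "GeoPackage"}}))
```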

Fixes

  • Implemented a fix so that deegree raises exceptions rather than returning images when an error happens during rendering.
  • Overviews are now created as intended for raster files used for publishing.

Through the GIS - A Marketer’s Adventures in Geospatial Data

The advantage of being a marketer is that you do not need to have all the answers.

This comes in particularly handy when you have approximately none of them.

Having left behind previous safe havens in literature, the arts, the gaming industry, and B2B SaaS, I decided that it was high time to jump head-first into another adventure. For preference, one that would bring some good into the world.

That’s how I, after more than a little cajoling on my part, wound up in the world of data standards and GIS at wetransform. Before making a formal foray into the depths of these topics, I set about to learn the basics.

I am told that “GIS” is a Geographic Information System, effectively a super layered digital map that holds vast amounts of information regarding what’s going on in a particular slice of the world, provided the data is right. There’s a subreddit. There are memes. I embrace the hatred of unprojected CAD files.

The standardization business is also, at its core, not that hard to understand.

Loads of people, companies, and other organisations collect geospatial data. Naturally, they all structure it in their own way. This structure usually works for their own application but, to the world at large, is a mess.

The incredibly smart people I work with have created and maintain the tools required to make sense of said mess, so other really smart people (and even some robots) can use it to make decisions. The more clean, well-structured data there is to look at, the better the odds that those decisions are the ones that’ll steer our planet in the right direction. Save the data standard, save the world.

I found myself nodding sagely at terms like “INSPIRE”, “XPlanung”, and “CityGML”. Data standards, vital to interoperability. Cool. Apparently, it’s pretty difficult to become fully compliant with those standards without our help. Less cool. However, we’re saving well over a thousand organisations an absolute boatload of time, cutting their processes down by about 80%! Important people now have more time to spend on other things. Super cool.

Some of these data standards even enable non-GIS users to work with the data. The interoperability that data standards provide can streamline processes for everyone involved, from government to business to citizens, which can save a lot of time, effort, and money. To the tune of 500 billion actual Euros, according to the European Commission’s Joint Research Centre. Good for them!

So how do we help our clients achieve that almighty interoperability? Time for me to learn more about our tools. You know, the ones I am supposed to help get to the market.

hale»studio is an open-source tool that performs data harmonisation for geospatial data. Take your source data (I am told Shapefiles are common), select a schema (the structure it needs to go into, depending on the data standard you’re aiming for), assign what goes where, run it through. Hey presto, interoperability!

hale»connect complements hale»studio by handling metadata and publishing. Compliance with INSPIRE, for example, becomes a relative walk in the park! There are plenty of options to add collaboration and workflows. Heck, after a quick intro and some setup by a more experienced colleague, I found myself using it pretty effortlessly even though I’d never really touched geospatial data before.

I ask whether I can have one of my colleagues record the processes with and without our tools for marketing purposes. I am told that while the half an hour for a hale»connect/studio demo is fine, I am not allowed to give one of our finest engineers twelve hours of solid work. I just nod.

So, our tools are good. Really good. The best, in fact. I know this because I asked.

Just as I’m starting to feel confident in this new field and all its terminology, with a pretty decent handle on our tools for someone who is not and will likely never be an end-user, something else comes up. Data spaces. That sounds cool. It has “space” in it, so it must be cool. I hop into my imaginary spacecraft and find out that data spaces are, in fact, very cool.

You see, a lot of the data currently in use is so-called “open data”. Does what it says on the tin. It’s data, it’s there, you and anyone else can use it for whatever you please. However, there is far more data out there, hidden away behind multiple layers of security because whatever it contains requires protecting. The exact location of endangered species, details regarding the population’s personal health, that type of thing. The type of thing you do not want just anyone to have access to. Which is exactly why it’s not open.

There is a lot more of that type of data than there is open data… and it could be extremely valuable.

“Data gaps” is the term that keeps cropping up as the issue to be solved. A lack of required data, not because it does not exist, but because there is no access to it at all. Not even a path.

So how do you solve an issue like that? Where data that is, for good reason, not being shared at all could have a huge impact on how vital, world-saving decisions are made?

You create the VIP area known as a “data space”, a members-only club with different levels of access and robust governance policies to ensure that such precious data never, ever falls into the wrong hands. You can rest assured that, for a change, you’ll have complete control over what’s happening with your data.

You can train AI inside the data space. Let it learn from all the data in there, then leave all that data behind on its way out to do some good in the real world. Inside the data space, every process is certified and every participant is vetted. A door policy to make Berghain blush.

You can find a nicely formatted slide on the process below.

Data spaces are already included in the European Data Strategy. Sectors such as Agriculture, Environment, Energy, Finance, Healthcare, Manufacturing, Mobility, and Public Authorities are all slated to use them.

While data spaces are still very new, I feel I am in the right place with the right team to learn more about them and how they will help solve very real problems. Not only do we have the tools to automatically make all data inside a data space fully interoperable and set up the governance; with the help of the International Data Spaces Association, wetransform also stands at the cradle of the Environmental Data Space. We are already building a community dedicated to identifying high-value use cases, defining architectures and governance models, and testing out solutions in concrete projects.

If you want to learn more about data spaces, check out this recent article, or head on over to the Environmental Data Spaces Community landing page to get involved.

Like I said before, I do not need to have all the answers. I do not know how to do your job. What I do know is that I work with incredibly smart, dedicated people who have built (and are continuing to build and maintain) solutions for the harmonisation and publication of geospatial data that, by every metric I have seen, can save you a great deal of time, effort, and money. I encourage you to have a look, perhaps even a call, to see how they can help you.

As for me, I intend to keep on learning. There is certainly enough to do.

The German version of this article can be accessed here.

Society is facing major challenges such as climate change and the loss of ecosystems. To meet these challenges, we will have to find well-optimised, sustainable solutions. In this process, data is invaluable.

Thanks to Copernicus, INSPIRE, and the Open Data Directive, more and more geodata has become publicly available. However, this data is merely the tip of the proverbial iceberg. Large swathes of relevant data remain hidden from view due to security concerns and legal obligations such as the GDPR. The new European Data Strategy aims to make this pool of data more accessible by utilising a concept that has already shown itself to be successful in the automotive industry: data spaces. This article explains the differences between the existing geodata infrastructure (GDI) and data spaces, which issues can be resolved through the use of data spaces, and what needs to be kept in mind when implementing them.

The automotive industry is adept at finding efficient solutions to incredibly complex issues. Their products need to conform to a myriad of both legal and scientific standards at every stage of their elaborate research, development, and production chains. In an effort to make these processes more cost-effective and secure, car manufacturers and their suppliers exchange data in a limited fashion.

Originally, companies only had a limited and often flawed insight into their supply chain. To remedy the issues that kept cropping up due to this lack of transparency, the Catena-X Data Space was created. This data space allowed all participating organisations to exchange their data inside a single, secure platform. Who is allowed access to which data, and for what purpose, is decided by the organisations involved.

Bolstered by the success of this approach, 2021’s European Data Strategy builds upon the concept of data spaces to unlock the hitherto hidden potential of closed data in other sectors. In the new model, data spaces are set to support nine strategic areas: Agriculture, Environment, Energy, Finance, Healthcare, Manufacturing, Mobility, Public Administration, and Public Authorities. A data space to support the implementation of the Green Deal is already in the works. Every data space will contain a combination of public and proprietary data from both companies and governmental organisations.

What is a data space?

Governance is central to the data space concept: a combined set of rules and standards, together with their technical implementation, that defines which roles exist within the data space and the level of access to data that each of these roles provides. For example, data providers can allow their data to be used within a training pool for AI models, but severely limit the export of that data outside of the data space. Common technical standards will have to be agreed upon as well, particularly data models such as INSPIRE, XPlanung, or 3A/NAS for the geospatial and environmental sectors.
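As a toy illustration of what such role-based rules might look like, consider the sketch below. The policy structure, role names, and permissions are all invented; real data spaces express these rules through their connectors and policy frameworks.

```python
from dataclasses import dataclass, field

@dataclass
class DataPolicy:
    """Invented policy record: what a provider allows within the data space."""
    provider: str
    allowed_uses: set = field(default_factory=set)  # e.g. {"train-ai-model"}
    export_allowed: bool = False                    # may raw data leave the data space?

def is_permitted(policy: DataPolicy, requested_use: str) -> bool:
    if requested_use == "export-raw-data":
        return policy.export_allowed
    return requested_use in policy.allowed_uses

forest_policy = DataPolicy(provider="State Forest A", allowed_uses={"train-ai-model"})
print(is_permitted(forest_policy, "train-ai-model"))   # True: usable inside the training pool
print(is_permitted(forest_policy, "export-raw-data"))  # False: the raw data stays inside
```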

Just as in a GDI, source data sets will differ. Every organisation can create, house, and utilise their data in whatever manner they desire, be it on premises or in the cloud. Controlled access to that data can be securely managed through an adapter such as the Eclipse Dataspace Connector.

All data sets within a data space are interoperable. That does not mean that all data needs to conform to the same format or schema, but rather that they can automatically be integrated and harmonised as required. For this, matching and mapping technology is used, such as semantic annotations (“This is a parcel.”). ETL tools like hale»studio and hale»connect can use this metadata to automatically prepare data for processors in different parts of the data space.
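Here is a rough sketch of how such annotations could drive an automatic mapping: each source field is tagged with the concept it represents, and target fields that expect the same concept are matched to it. All field names and the annotation vocabulary are made up for illustration; hale»studio’s actual alignment model is far richer than this.

```python
# Source fields with invented semantic annotations ("what concept is this?").
source_fields = {
    "FLST_NR":   {"concept": "parcel-identifier"},
    "NUTZUNG":   {"concept": "land-use"},
    "geometrie": {"concept": "geometry"},
}

# Target schema fields and the concept each one expects (also invented).
target_fields = {
    "CadastralParcel.inspireId": "parcel-identifier",
    "CadastralParcel.landUse":   "land-use",
    "CadastralParcel.geometry":  "geometry",
}

# Matching by concept yields the field mapping automatically.
mapping = {src: tgt
           for tgt, concept in target_fields.items()
           for src, meta in source_fields.items()
           if meta["concept"] == concept}

print(mapping)
```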

Such processing services are themselves part of the data space. How these services are allowed to access and use the data is established within the communal rules, for example whether or not the data may be temporarily cached. Trust plays an important part in this. Starting in 2022, processing services are able to obtain certifications. Once a service is certified, all participants in the data space can be certain that it will only do exactly what it claims to do.

Which issues do data spaces solve?

The creation of a data space only makes sense when there is a concrete use case in which vital data gaps can be closed through the use of previously inaccessible data. These data gaps need to be defined and thoroughly documented.

Such a data gap also exists in scenarios where data is available, but not in sufficient quantity to train a useful AI model. Within the security of the data space, a much greater amount of training data can be made available. Since only the final AI model is exported out of the data space, the confidentiality of the training data remains intact.

There is another problem data spaces solve. It is common practice for modern platforms to siphon off and sell large amounts of data without any input from the subjects of said data, be they companies or private citizens. Within a data space, rules can be established not only to secure data sovereignty, but also to allow a more balanced division of the value generated through that data, for example through a “pay as you go” model.

In order to provide this data sovereignty, the data space has to be built upon hardware, software, and operating systems that have been designed and secured to allow for it. Therefore, the data spaces’ infrastructure is being created in collaboration with GAIA-X, Europe’s distributed cloud platform.

What does this mean in the real world?

The AI pilot project FutureForest.ai is a great way to illustrate the usefulness of data spaces. In this project, wetransform collaborates with TU Munich and TU Berlin, as well as several German state forests and forest research institutes, to create a data space for forestry data. All these organisations contribute access to their data within the data space, so that better decisions can be made about climate-adapted forest conversion. This combines both public data, such as elevation models and land cover maps, and private data, such as sensor data and detailed information from location mapping. The forest owners contribute their data and in return are able to leverage better decision-making models.

This last decade has allowed us to make great strides in terms of spatial data accessibility, chiefly through open data initiatives. Unfortunately, a lack of attention paid to organisational frameworks and data usage conditions often still hampers progress. Through the use of data spaces, such as the one for forestry, this will change for the better.

More than Projects – the Environmental Data Spaces Community

It’s still early days in the Geo- and GIS-community when it comes to the implementation and usage of data spaces. Many projects are being launched, both nationally and internationally. In order to create a network between all the different parties currently involved in these projects and provide more developmental continuity, wetransform has established the Environmental Data Spaces Community. Aided by several partners and the framework laid out by the International Data Spaces Association, which sets the standards for data spaces, wetransform supports the creation of diverse data ecosystems with the goal of making environmental data accessible and usable inside a secure data space that protects data sovereignty.

More information about the Environmental Data Spaces Community, and how to join it, can be found here.

This article originally appeared in German in gis.business 2-2022, 25-27.
