Building a geospatial lakehouse with open source and Databricks

Most data about processes that can be measured in the real world has a geospatial aspect to it. Organizations that manage assets over a wide area, or whose business processes depend on many layers of geographic attributes that need to be mapped, face more demanding geospatial analytics requirements once they start using this data to answer strategic questions or to optimize their operations. These geospatially focused organizations might ask questions like these of their data:

How many of my assets fall within a given geographic boundary?

How long does it take my customers to get to the site by foot or by car?

What is the amount of footfall I should expect per unit area?

All of these are valuable geospatial queries. They require a number of data entities to be integrated in a common storage layer, and they require geospatial joins, such as point-in-polygon operations and geospatial indexing, to scale to the inputs involved. This article discusses ways to scale geospatial analytics using Databricks features and open-source tools that take advantage of Spark, the Delta table format and Unity Catalog [1], with a focus on batch analytics for vector geospatial data.

Overview

The diagram below summarizes an open-source approach to building a geospatial lakehouse on Databricks. Through various ingestion methods (often public APIs), geospatial datasets are landed in cloud storage in a range of formats; with Databricks this can be a volume within a Unity Catalog catalog and schema. Geospatial data formats include vector formats (GeoJSON, .csv and Shapefiles .shp), which represent latitude/longitude points and other geometries, and raster formats (GeoTIFF, HDF5) for imaging data. Using GeoPandas [2], Spark-based geospatial tools such as Mosaic [3], or the H3 Databricks SQL functions [4], we can prepare vector files in memory and save them to a unified bronze layer in Delta format, using Well-Known Text (WKT) as a string representation of any points or geometries.

Functional overview of a geospatial analytics lakehouse built using Unity Catalog and open source on Databricks. Image by the author.
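As a rough illustration of the WKT convention mentioned above, the sketch below builds WKT strings from raw coordinates in plain Python. The helper names (`point_to_wkt`, `polygon_to_wkt`), the row layout and the sample coordinates are hypothetical; in practice a library such as GeoPandas would normally handle this conversion.

```python
def point_to_wkt(lon: float, lat: float) -> str:
    """Return a WKT POINT; WKT lists x (longitude) before y (latitude)."""
    return f"POINT ({lon} {lat})"


def polygon_to_wkt(ring: list[tuple[float, float]]) -> str:
    """Return a WKT POLYGON from one exterior ring of (lon, lat) pairs."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # WKT rings must repeat the first vertex
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({coords}))"


# Hypothetical bronze-layer rows: an id plus the geometry stored as a WKT string
rows = [
    {"id": 1, "wkt": point_to_wkt(-0.1276, 51.5072)},
    {"id": 2, "wkt": polygon_to_wkt([(0, 0), (1, 0), (1, 1), (0, 1)])},
]
```

Storing geometries as plain strings like this keeps the bronze table readable by any engine, at the cost of parsing the text again whenever a geometry operation is needed.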

While the landing-to-bronze step represents an audit log of ingested data, bronze to silver is where data is prepared and where the geospatial joins common to all the use cases that build on it are performed. The finished silver layer should represent a single geospatial view and can be integrated with other non-geospatial datasets as part of the enterprise data model; it also offers the opportunity to consolidate many bronze tables into core geospatial datasets that may have many attributes and geometries, at the base level of grain required for later aggregations. The gold layer is then the geospatial presentation layer, where the output of geospatial analytics such as travel time or density calculations can be stored. For use in dashboarding tools such as Power BI, the outputs may be materialized as star schemas, while cloud GIS tools such as ESRI Online will prefer GeoJSON files for specific mapping applications.

Geospatial data preparation

In addition to the usual data quality challenges faced when integrating many individual data sources in a lakehouse architecture (missing data, changing recording practices, etc.), geospatial data has its own data quality and preparation challenges. To make vector geospatial datasets interoperable and easy to visualize, it is best to choose a single coordinate system such as WGS 84 (the widely used GPS standard). In the UK, many public geospatial datasets use other coordinate systems such as OSGB 36, which is optimized for mapping geographic features in the UK with more precision (this format is usually written as eastings and northings rather than latitude and longitude). These datasets need to be transformed to WGS 84 to avoid inaccuracies in downstream mapping, as outlined in the figure below.

Overview of geospatial coordinate systems a) and overlay of WGS 84 and OSGB 36 for the UK b). Images adapted from [5] with permission from the author. Copyright (c) Ordnance Survey 2018.

Many geospatial libraries, such as GeoPandas, Mosaic and others, have built-in functions to handle this conversion. Here is an example from the Mosaic documentation:

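# Assumes a Databricks notebook (`spark`, `dbutils` predefined) with Mosaic installed
from pyspark.sql.functions import lit
from mosaic import enable_mosaic, st_astext, st_geomfromwkt, st_setsrid, st_transform
enable_mosaic(spark, dbutils)

df = (
  spark.createDataFrame([{'wkt': 'MULTIPOINT ((10 40), (40 30), (20 20), (30 10))'}])
  .withColumn('geom', st_setsrid(st_geomfromwkt('wkt'), lit(4326)))
)
df.select(st_astext(st_transform('geom', lit(3857)))).show(1, False)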
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MULTIPOINT ((1113194.9079327357 4865942.279503176), (4452779.631730943 3503549.843504374), (2226389.8158654715 2273030.926987689), (3339584.723798207 1118889.9748579597))|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Converting a multi-point geometry from WGS 84 to the Web Mercator projection.
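For intuition about what such a transform does, the sketch below applies the standard spherical Web Mercator formula to the first point of the example above in plain Python. The function name is hypothetical and this is only an illustration of the arithmetic, not how Mosaic implements `st_transform`; other coordinate systems such as OSGB 36 need a full datum transformation that this sketch does not cover.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by Web Mercator


def wgs84_to_web_mercator(lon: float, lat: float) -> tuple[float, float]:
    """Project lon/lat in degrees (EPSG:4326) to metres (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# First point of the MULTIPOINT above: longitude 10, latitude 40
x, y = wgs84_to_web_mercator(10, 40)
```

The result should agree with the first coordinate pair in the output shown above to well within a millimetre.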

Another data quality problem unique to vector geospatial data is the concept of invalid geometries, outlined in the figure below. These invalid geometries will break downstream GeoJSON files or analyses, so it is best to fix them or remove them where necessary. Many geospatial libraries provide functions to find or attempt to fix invalid geometries.

Examples of types of invalid geometries. Image taken from [6] with permission from the author. Copyright (c) 2024 Christoph Rieke.
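To give a feel for what a validity check involves, here is a minimal plain-Python sketch that flags three common problems in a polygon ring: too few points, an unclosed ring and a self-intersection. The function names are hypothetical and the check is deliberately incomplete (it ignores holes, duplicate points and edges that merely touch); a geospatial library's own validity functions are the better choice in practice.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0 or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _segments_cross(p1, p2, q1, q2):
    """True if the two segments properly cross (shared endpoints are ignored)."""
    return (
        _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
        and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
    )


def ring_problems(ring):
    """Return a list of problems found in a ring of (x, y) tuples."""
    problems = []
    if len(ring) < 4:
        problems.append("too few points")
    if ring and ring[0] != ring[-1]:
        problems.append("ring not closed")
    edges = list(zip(ring, ring[1:]))
    for i in range(len(edges)):
        for j in range(i + 2, len(edges)):  # skip adjacent edges
            if _segments_cross(*edges[i], *edges[j]):
                problems.append("self-intersection")
                return problems
    return problems


square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]
```

Here `ring_problems(square)` returns an empty list, while the bowtie ring is flagged because two of its edges cross.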

These data quality and preparation steps should be applied early in the lakehouse layers; in the past I have done them when building the silver layer, along with any reusable geospatial joins and other transformations.

Scaling geospatial joins and analytics

The geospatial part of the silver (enterprise) layer should ideally represent a single geospatial view that feeds all later aggregations, analytics, ML modeling and AI. In addition to data quality checks and corrections, it is sometimes useful to consolidate multiple geospatial datasets through aggregations or unions, both to simplify the data model and later queries and to avoid repeating expensive geospatial joins. Geospatial joins are often computationally expensive because of the large number of bits needed to represent sometimes complex multi-polygon geometries and the need for many pairwise comparisons.
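To make that cost concrete, the sketch below shows a plain-Python point-in-polygon test using ray casting, one common textbook approach. The function name is hypothetical and this is not necessarily what any of the libraries mentioned here do internally. The work for one test grows with the number of polygon vertices, and a naive join repeats it for every point and polygon pair.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many polygon edges a ray heading east crosses."""
    inside = False
    # Pair each vertex with the next one, wrapping around to close the ring
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_result = point_in_polygon(2, 2, square)
outside_result = point_in_polygon(5, 2, square)
```

Points that fall exactly on an edge are a known edge case for this method and are not handled specially here.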

A few techniques exist to make these joins more efficient. For example, you can simplify complex geometries, which effectively reduces the number of lat/lon pairs needed to represent them. Different methods are available for doing this, aimed at different desired outcomes (e.g., preserving area, or removing redundant points), and these are available in the libraries:

from mosaic import st_simplify  # assumes Mosaic is installed and enabled
df = spark.createDataFrame([{'wkt': 'LINESTRING (0 1, 1 2, 2 1, 3 0)'}])
df.select(st_simplify('wkt', 1.0)).show()
+----------------------------+
| st_simplify(wkt, 1.0)      |
+----------------------------+
| LINESTRING (0 1, 1 2, 3 0) |
+----------------------------+
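One well-known simplification method is the Douglas-Peucker algorithm, sketched below in plain Python. Whether a given library's simplify function uses exactly this algorithm is an assumption, and the function names here are hypothetical, but on the linestring above with a tolerance of 1.0 it drops the same vertex.

```python
import math


def _perp_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / math.hypot(dx, dy)


def douglas_peucker(points, tolerance):
    """Drop vertices that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    distances = [_perp_distance(p, points[0], points[-1]) for p in points[1:-1]]
    max_dist = max(distances)
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    split = distances.index(max_dist) + 1  # index of the farthest vertex
    left = douglas_peucker(points[: split + 1], tolerance)
    right = douglas_peucker(points[split:], tolerance)
    return left[:-1] + right  # avoid duplicating the shared vertex


simplified = douglas_peucker([(0, 1), (1, 2), (2, 1), (3, 0)], 1.0)
```

The vertex (2, 1) lies exactly on the line between (1, 2) and (3, 0), so it carries no extra shape information and is removed, while (1, 2) is kept because it sits more than the tolerance away from the line joining the endpoints.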

Another way to scale geospatial queries is to use a geospatial indexing system, as outlined in the figure below. By aggregating point or polygon geometry data to a geospatial indexing system such as H3, an approximation of the same information can be represented by a set of short cell identifiers, each corresponding to a hexagon (or pentagon) area at a given resolution, which can be rolled up or down in a hierarchy.

Motivation for geospatial indexing (compression) systems [7] and visualization of the H3 index from Uber [8]. Images adapted with permission from the authors. Copyright (c) CARTO 2023. Copyright (c) Uber 2018.
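The idea of indexing and rolling up can be sketched without any library by using a simple square grid as a stand-in for H3. This is not H3 itself: the cells are squares in degrees rather than hexagons, each parent has four children, and the function names and sample points are hypothetical. It only illustrates how points collapse to cell identifiers that can be counted and aggregated up a hierarchy.

```python
import math
from collections import Counter


def cell_id(lon: float, lat: float, resolution: int) -> tuple[int, int, int]:
    """Index a point into a square grid with 2**resolution cells per degree."""
    scale = 2 ** resolution
    return (resolution, math.floor(lon * scale), math.floor(lat * scale))


def parent(cell: tuple[int, int, int]) -> tuple[int, int, int]:
    """Roll a cell up one level; each parent contains four child cells."""
    resolution, ix, iy = cell
    return (resolution - 1, ix // 2, iy // 2)


# Hypothetical event locations as (lon, lat) pairs
events = [(0.10, 0.10), (0.20, 0.20), (0.30, 0.10), (0.60, 0.60)]

# Counts per cell at the fine resolution, then rolled up to the coarser one
fine_counts = Counter(cell_id(lon, lat, 2) for lon, lat in events)
coarse_counts = Counter(parent(cell_id(lon, lat, 2)) for lon, lat in events)
```

Once every record carries a cell identifier, density-style questions become ordinary group-by aggregations rather than geometry operations.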

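In Databricks the H3 index system is also designed for use with the Spark SQL engine, so you can write queries like the following, which approximates a point-in-polygon join by converting both the points and the polygons to H3 cells at the same resolution (7 in this example) and then joining on the cell index: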

WITH locations_h3 AS (
    SELECT
        id,
        lat,
        lon,
        h3_pointash3(
            CONCAT('POINT(', lon, ' ', lat, ')'),
            7
        ) AS h3_index
    FROM locations
),
regions_h3 AS (
    SELECT
        name,
        explode(
            h3_polyfillash3(
                wkt,
                7
            )
        ) AS h3_index
    FROM regions
)
SELECT
    l.id AS point_id,
    r.name AS region_name,
    l.lat,
    l.lon,
    r.h3_index,
    h3_boundaryaswkt(r.h3_index) AS h3_polygon_wkt  
FROM locations_h3 l
JOIN regions_h3 r
  ON l.h3_index = r.h3_index;
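The same join pattern can be sketched in plain Python to show why it scales: once both sides are reduced to cell identifiers, the geometry test becomes a dictionary lookup on equal keys. The sketch uses a square grid and bounding-box regions as stand-ins for H3 cells and WKT polygons, and every name and value in it is hypothetical.

```python
import math
from collections import defaultdict

CELL = 0.5  # hypothetical cell size in degrees (stand-in for an H3 resolution)


def point_cell(lon, lat):
    """Cell containing a point (the role h3_pointash3 plays in the SQL)."""
    return (math.floor(lon / CELL), math.floor(lat / CELL))


def fill_cells(min_lon, min_lat, max_lon, max_lat):
    """Cells whose centre lies inside a bounding box (the polyfill role)."""
    cells = []
    for ix in range(math.floor(min_lon / CELL), math.ceil(max_lon / CELL)):
        for iy in range(math.floor(min_lat / CELL), math.ceil(max_lat / CELL)):
            cx, cy = (ix + 0.5) * CELL, (iy + 0.5) * CELL
            if min_lon <= cx <= max_lon and min_lat <= cy <= max_lat:
                cells.append((ix, iy))
    return cells


# Hypothetical inputs: point ids with coordinates, regions as bounding boxes
locations = {"a": (0.2, 0.2), "b": (1.6, 0.3), "c": (5.0, 5.0)}
regions = {"west": (0.0, 0.0, 1.0, 1.0), "east": (1.0, 0.0, 2.0, 1.0)}

# Build a cell -> region lookup, then join on cell equality
cell_to_regions = defaultdict(list)
for name, bbox in regions.items():
    for cell in fill_cells(*bbox):
        cell_to_regions[cell].append(name)

matches = sorted(
    (point_id, name)
    for point_id, (lon, lat) in locations.items()
    for name in cell_to_regions.get(point_cell(lon, lat), [])
)
```

The trade-off is the same as in the SQL version: a point near a region boundary can be assigned according to its cell rather than its exact position, so the result is an approximation whose accuracy depends on the cell size.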

GeoPandas and Mosaic will also let you perform geospatial joins without any approximation if needed, but using H3 is often a sufficiently accurate approximation for joins and for analytics such as density calculations. With a cloud analytics platform you can also use APIs to bring in live traffic data and travel time calculations from services such as openrouteservice [9], or enrich geospatial data with additional features (e.g., transit or commercial locations) using tools such as the Overpass API for OpenStreetMap [10].

Geospatial presentation layers

Now that the geospatial queries and aggregations have been done and the analytics are ready to visualize, the presentation layer of the geospatial lakehouse can be organized according to the tools that will consume the maps or data-driven analytics. The figure below shows two common approaches.

Comparison of a GeoJSON FeatureCollection a) vs a dimensionally modeled star schema b) as data structures for geospatial presentation layer outputs. Image by the author.

When serving a cloud geographic information system (GIS) such as ESRI Online or another web application with mapping tools, GeoJSON files stored in a gold/presentation layer volume can contain all the data the required map layer needs. Using the GeoJSON FeatureCollection type you can create a nested JSON containing multiple geometries and their associated attributes ("features"), which may be points, lines or polygons. If the downstream dashboarding tool is Power BI, a star schema may be preferable, where the geometries and their attributes are modeled as facts and dimensions to make use of its cross-filtering support, with the outputs stored as Delta tables in the presentation layer.
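As a small illustration of the FeatureCollection structure described above, the sketch below assembles one from tabular rows using only the standard library. The row fields, names and coordinates are hypothetical; in a real pipeline the rows would come from a gold table and the text would be written to a volume.

```python
import json

# Hypothetical gold-layer rows: a point location plus attributes per record
rows = [
    {"name": "Site A", "visits": 120, "lon": -1.5, "lat": 53.8},
    {"name": "Site B", "visits": 45, "lon": -2.2, "lat": 53.5},
]

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
            "properties": {"name": r["name"], "visits": r["visits"]},
        }
        for r in rows
    ],
}

geojson_text = json.dumps(feature_collection, indent=2)
```

Each feature keeps its geometry and its attributes together, which is convenient for map tools but means the attributes are repeated per feature rather than normalized as they would be in a star schema.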

Platform design and integration

Geospatial data will usually be just one part of an organization's wider enterprise data model and portfolio of analytics and ML/AI use cases. These (ideally) call for a cloud data platform with a series of upstream and downstream integrations to deploy and orchestrate the analytics and to make sure they actually prove valuable to the organization. The figure below shows a high-level architecture of the kind of Azure data platform I have worked with in the past for geospatial data.

High-level architecture of a geospatial lakehouse in Azure. Image by the author.

The data is landed using various ETL tools (in some cases Databricks itself is sufficient). Within the workspace(s), a medallion pattern of raw (bronze), enterprise (silver) and presentation (gold) layers is maintained, using the Unity Catalog hierarchy of catalogs, schemas and tables or volumes to separate the layers (especially their permissions) where needed. When the presentation outputs are ready to share, there are many options for data sharing, app building and GIS integration.

For example, with ESRI cloud, the ADLS Gen2 storage account connector within ESRI allows data written to external Unity Catalog volumes (i.e., GeoJSON files) to be pulled into the ESRI platform for integration into maps and dashboards. Some organizations may prefer to have geospatial outputs written to downstream systems such as CRMs or other geospatial databases. Curated geospatial data and its aggregates are also often used as input features in ML models, and this works seamlessly with geospatial Delta tables. Databricks offers various AI analytics features built into the platform (e.g., AI/BI Genie [11] and Agent Bricks [12]), which provide the ability to query data in Unity Catalog using plain English. The likely long-term vision is that geospatial data will be available to AI in the same way as any other tabular data, with maps as one of the forms of output.

In closing

At the end of the day, it's all about making cool maps that are useful for decision making. The figure below shows a few outputs of geospatial analytics I have produced over the past few years. Geospatial analysis comes down to knowing things like where people, events or assets are, how long it takes to get from one place to another, and how some attribute of interest is distributed (it could be places of residence, or something at risk). These are all important things to know for strategic planning (e.g., where do I put a fire station?), knowing your customer base (e.g.

Examples of geospatial analytics. a) Journey time analysis b) Hotspot finding with H3 c) Hotspot clustering with ML. Image by the author.

Thanks for reading and if you're interested in chatting or learning more, please get in touch or check out some of the links below.

References

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]
