Five Essential Lessons for Google Earth Engine Beginners | by Daniel Pazmiño Vernaza | January, 2025

Hands-On Insights from a Python API user

About Data Science
Land cover map of the Paute basin in Ecuador for the year 2020. Image created using the Google Earth Engine Python API and Geemap. Data source: Friedl, M., Sulla-Menashe, D. (2022); Lehner, B., Grill G. (2013) and Lehner, B., Verdin, K., Jarvis, A. (2008).

As a climate scientist, I consider Google Earth Engine (GEE) a powerful tool in my toolkit. No more downloading heavy satellite images to my computer.

GEE's primary API is JavaScript, although Python users can also access a powerful API to perform similar tasks. Unfortunately, there are fewer materials for learning GEE with Python.

However, I love Python. Since I learned that GEE has a Python API, I imagined a world of possibilities combining GEE's cloud computing capabilities with Python frameworks.

These five lessons come from my latest project, which involved analyzing the water balance and drought in a watershed in Ecuador. However, the tips, code snippets and examples can apply to any project.

The article presents each lesson following the sequence of any data analysis project: data preparation (and planning), analysis, and visualization.

It's also worth mentioning that I offer some general advice that applies regardless of the language you use.

This article is for GEE beginners and assumes an understanding of Python and some geospatial concepts.

If you know Python but are new to GEE (like I was some time ago), you should know that GEE has built-in functions for processing satellite images. We will not go into the details of these functions here; you should check the official documentation.

However, my advice is to check first whether GEE can do the analysis you want to do. When I first started using GEE, I used it as a data catalog, relying only on its basic functions. Then I wrote Python code for most of the analysis. While this approach can work, it often leads to serious challenges. I will discuss these challenges in later lessons.

Don't limit yourself to studying only the basic functions of GEE. If you know Python (or coding in general), the learning curve for these functions is not too steep. Try to use them as much as possible – it's worth it in terms of efficiency.

A final note: GEE's built-in functions even cover machine learning. They are easy to use and can help you solve many problems. Only when you can't solve your problem with these functions should you consider writing Python code from scratch.

As an example for this lesson, consider the implementation of a clustering algorithm.

Example code with GEE functions

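# Assumes `import ee` and ee.Initialize() have already run, and that
# clustering_image (ee.Image) and galapagos_aoi (ee.Geometry) are defined.
# Sample the image to create input for clustering
sample_points = clustering_image.sample(
    region=galapagos_aoi,
    scale=30,          # Scale in meters
    numPixels=5000,    # Number of points to sample
    geometries=False,  # Don't include geometry to save memory
)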

# Apply k-means clustering (unsupervised)
clusterer = ee.Clusterer.wekaKMeans(5).train(sample_points)

# Cluster the image
result = clustering_image.cluster(clusterer)

Example code in Python

import numpy as np
from osgeo import gdal, gdal_array

# Tell GDAL to throw Python exceptions and register all drivers
gdal.UseExceptions()
gdal.AllRegister()

# Open the .tiff file
img_ds = gdal.Open('Sentinel-2_L2A_Galapagos.tiff', gdal.GA_ReadOnly)
if img_ds is None:
    raise FileNotFoundError("The specified file could not be opened.")

# Prepare an empty array to store the image data for all bands
img = np.zeros(
    (img_ds.RasterYSize, img_ds.RasterXSize, img_ds.RasterCount),
    dtype=gdal_array.GDALTypeCodeToNumericTypeCode(img_ds.GetRasterBand(1).DataType),
)

# Read each band into the corresponding slice of the array
for b in range(img_ds.RasterCount):
    img[:, :, b] = img_ds.GetRasterBand(b + 1).ReadAsArray()

print("Shape of the image with all bands:", img.shape) # (height, width, num_bands)

# Reshape for processing
new_shape = (img.shape[0] * img.shape[1], img.shape[2]) # (num_pixels, num_bands)
X = img.reshape(new_shape)

print("Shape of reshaped data for all bands:", X.shape) # (num_pixels, num_bands)

The first block of code is not only shorter, but will handle large satellite datasets more efficiently because GEE operations are designed to scale in the cloud.
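Note that the local script stops once the pixels are reshaped, so the clustering step itself would still have to be written or imported. The following is a minimal sketch of that missing step in plain Python, not the code from my project. The function name `kmeans`, the toy pixel values and the evenly spaced starting centroids are all assumptions made for illustration; in practice you would more likely reach for a library such as scikit-learn.

```python
def kmeans(pixels, k, iterations=20):
    """Cluster `pixels` (one tuple of band values per pixel) into k groups."""
    # Deterministic start: take k evenly spaced pixels as initial centroids.
    # This is a simplification; libraries usually use smarter seeding.
    centroids = [pixels[i * len(pixels) // k] for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: attach each pixel to its nearest centroid
        for i, p in enumerate(pixels):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(band) / len(members) for band in zip(*members))
    return labels, centroids


# Toy "image": six pixels with two bands each, forming two obvious groups
toy_pixels = [(0.10, 0.20), (0.12, 0.18), (0.11, 0.22),
              (0.80, 0.90), (0.82, 0.88), (0.79, 0.91)]
labels, centroids = kmeans(toy_pixels, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even this toy version needs more lines than the GEE call, and it still runs on your own machine rather than in the cloud.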

Although GEE functions are powerful, understanding the limitations of cloud processing is important when scaling your project.

Access to free cloud computing resources for processing satellite images is a blessing. However, it is not surprising that GEE imposes limits to ensure a fair distribution of resources. If you plan to use it for a large non-commercial project (e.g., research on deforestation in the Amazon region) and intend to stay within the free-tier limits, you should plan accordingly. My general guidelines are:

  • Limit the size of your regions: divide them and work in batches. I didn't have to do this in my project because I was working with one small watershed. However, if your project involves large geographic areas, this may be a reasonable first step.
  • Optimize your scripts by prioritizing GEE functions (see Lesson 1).
  • Choose datasets that let you save computing power. For example, in my last project, I used the Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) data. The original dataset has a daily temporal resolution. However, it offers another version called “PENTAD”, which provides data every five days. Each value is the total rainfall over those five days. Using this dataset allowed me to save computing power by processing the aggregated version without sacrificing the quality of my results.
  • Check the description of your dataset, as it may reveal scale factors that can save computing power. For example, in my water balance project, I used Moderate Resolution Imaging Spectroradiometer (MODIS) data, specifically the MOD16 dataset, which is a readily available evapotranspiration (ET) product. According to the documentation, the stored values need to be multiplied by a scale factor of 0.1. Scale factors help reduce storage requirements because the data can be kept in a more compact data type.
  • If worst comes to worst, be prepared to compromise. Reduce the resolution of the analysis if your research standards allow it. For example, the GEE function “reduceRegion” summarizes the values in a region (sum, mean, etc.). It has a parameter called “scale” that sets the scale of the analysis. If your satellite data has a resolution of 10 m and GEE cannot process your analysis, you can set the scale parameter to a coarser resolution (e.g., 50 m).
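To make the first guideline concrete, here is a minimal sketch of how a large bounding box could be divided into smaller tiles to process in batches. It is plain Python and does not call GEE; the function name `split_bbox` and the coordinates are hypothetical and are not taken from my project.

```python
def split_bbox(west, south, east, north, n_cols, n_rows):
    """Split a lon/lat bounding box into n_cols x n_rows equal tiles."""
    width = (east - west) / n_cols
    height = (north - south) / n_rows
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Each tile is stored as (west, south, east, north)
            tiles.append((
                west + col * width,
                south + row * height,
                west + (col + 1) * width,
                south + (row + 1) * height,
            ))
    return tiles


# Hypothetical bounding box in degrees (not the article's study area)
tiles = split_bbox(-80.0, -4.0, -78.0, -2.0, n_cols=2, n_rows=2)
for tile in tiles:
    print(tile)
```

Each tuple could then be turned into a GEE geometry (for example with `ee.Geometry.Rectangle`) and processed on its own, so that no single request has to cover the whole region.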

As an example from my water and drought balance project, consider the following code:

# Reduce the collection to a single image (mean MSI over the time period)
MSI_mean = MSI_collection.select('MSI').mean().clip(pauteBasin)

# Use reduceRegion to calculate the min and max
stats = MSI_mean.reduceRegion(
    reducer=ee.Reducer.minMax(),  # Reducer to get min and max
    geometry=pauteBasin,          # Specify the ROI
    scale=500,                    # Scale in meters
    maxPixels=1e9,                # Maximum number of pixels to process
)

# Get the results as a dictionary
min_max = stats.getInfo()

# Print the min and max values
print('Min and Max values:', min_max)

In my project, I used Sentinel-2 satellite images to calculate the soil moisture index (MSI). Then I used the GEE function “reduceRegion”, which calculates a summary of the values in a region (mean, sum, etc.).

In my case, I needed to find the maximum and minimum MSI values to check that my results were reasonable. The following map shows how the MSI values are distributed across my study area.

Average monthly soil moisture index for the Paute basin (Ecuador) for the period 2010–2020. Image created using the Google Earth Engine Python API and Geemap. Data source: European Space Agency (2025); Lehner, B., Grill G. (2013) and Lehner, B., Verdin, K., Jarvis, A. (2008).

The original image has a resolution of 10 m, and GEE struggled to process the data. So I used the scale parameter to reduce the resolution to 500 m. After changing this parameter, GEE was able to process the data.
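A rough back-of-the-envelope calculation shows why a coarser scale helps: the number of pixels to process falls with the square of the pixel size. The sketch below is plain Python, and the basin area is a hypothetical round number used only for illustration, not the real area of my study region.

```python
def estimated_pixels(area_km2, scale_m):
    """Approximate number of pixels covering an area at a given pixel size."""
    area_m2 = area_km2 * 1_000_000
    return area_m2 / (scale_m ** 2)


# Hypothetical basin area, used only for illustration
area_km2 = 5000
for scale in (10, 50, 500):
    print(f"scale={scale} m -> about {estimated_pixels(area_km2, scale):,.0f} pixels")
```

With these assumed numbers, going from 10 m to 500 m cuts the pixel count by a factor of 2,500, which is the kind of reduction that can turn a failing request into one GEE can handle.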

I am passionate about data quality. As a result, I use data but rarely trust it without verification. I like to invest time to make sure the data is ready for analysis. However, don't let image preprocessing get in the way of your progress.

My tendency to invest more time in image processing comes from having studied remote sensing and image processing “the old way”. By this, I mean using software that helps apply atmospheric and geometric corrections to images.

Today, the science agencies that support satellite missions can deliver images with a high level of preprocessing. In fact, one of GEE's best features is its catalog, which makes it easy to find analysis-ready products.

Preprocessing is the most time-consuming task in any data science project. Therefore, it should be properly planned and controlled.

The best approach is to establish data quality standards before starting a project. Based on your standards, allow enough time to find the best product (which GEE makes easy) and apply only the necessary corrections (e.g., cloud masking).

If you love programming in Python (like me), you might find yourself writing everything from scratch.

As a PhD student (when I was just starting to code), I wrote a script to run a t-test over a study area. Later, I found a Python library that does the same job. When I compared the results of my script with those from the library, the results matched. However, using the library from the start would have saved me time.

I am sharing this lesson to help you avoid these silly mistakes with GEE. I will mention two examples from my water balance project.

Example 1

To calculate the water balance in my basin, I needed ET data. ET is not an observed variable (like precipitation); it must be calculated.

Calculating ET is not trivial. You can look up equations in textbooks and implement them in Python. However, some researchers have published papers on this calculation and shared their results with the community.

This is where GEE comes in. The GEE catalog provides not only observed data (as I originally thought) but also many derived products and modeled datasets (e.g., reanalysis data, land cover, vegetation indices, etc.). Guess what? I found a ready-to-use ET dataset in the GEE catalog. Life saver!
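To show how a ready-made ET product slots into a water balance, here is a minimal sketch in plain Python. It relies on two assumptions that are not in the article: the balance is simplified to precipitation minus ET, and both series have already been aggregated to the same periods. The numbers are invented for illustration; only the 0.1 scale factor comes from the MOD16 description mentioned earlier.

```python
ET_SCALE_FACTOR = 0.1  # scale factor from the MOD16 dataset description


def water_balance(precip_mm, et_raw):
    """Return precipitation minus rescaled ET for each period (simplified)."""
    return [p - raw * ET_SCALE_FACTOR for p, raw in zip(precip_mm, et_raw)]


# Hypothetical values for three periods (not real data)
precip_mm = [30.0, 12.5, 0.0]  # precipitation totals
et_raw = [150, 200, 100]       # stored integer ET values, before scaling

balance = water_balance(precip_mm, et_raw)
print([round(b, 1) for b in balance])  # [15.0, -7.5, -10.0]
```

Positive values indicate a surplus and negative values a deficit under this simplified definition. A real study would also need to account for terms such as runoff and storage.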

Example 2

I also consider myself a Geographic Information System (GIS) expert. Over the years, I have collected a large amount of GIS data for my work, such as watershed boundaries in shapefile format.

For my water balance project, my idea was to import my watershed's boundary shapefile into my GEE project. From there, I converted the file into a GeoPandas object and continued my analysis.

This time, I wasn't as lucky as in Example 1. I lost precious time trying to work with a GeoPandas object that I couldn't integrate properly with GEE. In the end, this approach did not make sense: GEE's catalog already includes an easy-to-handle watershed boundaries product.

Therefore, the key takeaway is to keep your workflow within GEE whenever possible.

As mentioned at the beginning of this article, combining GEE with Python libraries can be incredibly powerful.

However, even for simple analyses and plots, the integration is not straightforward.

This is where Geemap comes in. Geemap is a Python package designed for interactive geospatial analysis and visualization with GEE.

I also found that it can help create static plots in Python. I made the plots for my water balance and drought project using GEE and Geemap. The images included in this article were created with these tools.

GEE is a powerful tool. However, as a beginner, pitfalls are inevitable. This article provides tips and tricks to help you get started on the right foot with the GEE Python API.

European Space Agency (2025). Harmonized Sentinel-2 MSI: MultiSpectral Instrument, Level-2A.

Friedl, M., Sulla-Menashe, D. (2022). MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V061 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center. Accessed 2025-01-15 from https://doi.org/10.5067/MODIS/MCD12Q1.061

Lehner, B., Verdin, K., Jarvis, A. (2008). New global hydrography derived from spaceborne elevation data. Eos, Transactions, AGU, 89(10): 93–94.

Lehner, B., Grill G. (2013). Global river hydrography and network routing: baseline data and new approaches to study the world's large river systems. Hydrological Processes, 27(15): 2171–2186. Data is available at www.hydrosheds.org
