Machine Learning

Separate Numbers and Text in One Column Using a Power Query

This is a typical scenario for Self-Service BI with Excel data.

A few days ago, a client asked me the following question:

I have an Excel sheet with numbers and text in one column. I want to import this sheet into Power BI and analyze the numbers in that column.

How can I separate the numbers from the text in that column?

Remember that I also need the text in that column.

I had never been in this situation before, so I started using the method I knew.

I created a Dummy Excel with the same problem, which looks like this:

Figure 1 – Sample Data in Excel (Author's Image)

To build the PoC, I first loaded this data into a SQL Server database to see how I could solve it there.

Troubleshooting using SQL

IT-SQL has two useful functions in these situations:

  • TRY_CONVERT()
    • This one tries to convert the value to the target data type. If it fails, it returns NULL.
  • ISNUMERIC()
    • Tests whether the value is a numeric value. If so, it returns 1. Otherwise, 0.

Based on this information, I wrote a query to split the values ​​into two columns. One with numbers and one with text:

SELECT [Values]

            ,TRY_CONVERT(decimal(18, 5), [Values])         AS    [Number]

            ,IIF(ISNUMERIC([Values]) = 0, [Values], NULL) AS    [Text]

  FROM [dbo].[MixedValues];

The result is the following table:

Figure 2 – T-SQL query to extract numbers from text (Author's Image)

If you look closely, you can see that line 17 is recognized as text.

This is because the number contains nulls.

I will return to this later.

Switching to power query – Trying IsNaN()

Now, I loaded Excel into Power Query.

I defined the column as text and started working on this challenge.

The first attempt uses the Number.IsNaN() function.

This function returns true if the value is NaN. “NaN” is a null replacement, for example, due to division by 0.

I tried this to determine if a text is equal to a NaN.

This is the M-Code for the calculated column:

if Number.IsNaN([Value]) = true
then [Value]
else null

The result surprised me:

Figure 3 – Result with Number.IsNaN(). Why does this work not see the numbers? (Author's Image)

Strangely, the result is that it cannot convert a number to a number.

I think this happens because the data type of the column is text.

Then I tried converting the column to a number and using the IsNaN() function on the result:

if Number.IsNaN(Number.From([Value])) = false
then Number.From([Value])
else null

Now, numbers are converted to numbers, but text values ​​cause an error:

Figure 4 – Using IsNaN() on converted values ​​returns numbers, but a text error (Author's Figure)

Now logic works with numbers.

But conversion failed on lines containing text. This results in lines with errors.

Trying Value.Is() in a power query

Let's try another function: Value.Is()

This function checks if the value matches the data type.

This should be equivalent to the ISNUMERIC() function shown above:

if Value.Is([Value], Number.Type) = true
then Number.From([Value])
else null

Unfortunately, this function failed to return the expected result:

Figure 5 – Result when trying the Value.Is() function (Author's Figure)

When I tried the same method as above, by converting the value to a number first, I got the same result as before:

Figure 6 – Errors when trying to convert values ​​to number first (Author's Figure)

So, I suspect that the Value.Is() function expects a numeric data type, but this doesn't make sense to me.

At this time, I did not have time for deep research, as I was running out of time.

It was time to change direction.

To change the mind

Now I checked the error handling method in Power Query.

My idea was: What if I could find a conversion error and use this information?

I found this page with useful information: Errors – PowerQuery M | Microsoft Read

In this case, I found this quote:

try Number.From([Value]))

After adding the calculated column with this expression, I got this result:

Figure 7 – Experimental result (Author's Image)

I was hopeful, as I found no fault.

Next, was to expand the Records:

Figure 8 – Extend the Value from the Recorder output of the test call (Author's Figure)

I didn't need the Error columns—only the Value column.

This is the result after expansion:

Figure 9 – Result after expanding records (Author's Image)

Note that I renamed the columns directly in the ExpandRecordColumn() function.
Otherwise, I would have found a named column [Value.1].

This result was the first one where I did not find any errors.

Now, I added a calculated column to check if the new column is empty. If so, then the Actual Value column contains the text:

if [Numeric Value] = null then [Value] else null

Here is the result:

Figure 10 – Result with successful separation of numeric values ​​and text in one column (Author's Image)

After setting the correct data types and removing the Original column, I got this table:

Figure 11 – Result after cleaning (Author's Image)

Manage the number with spaces

But we still have line 17, which had an empty number.

How did I do this?

The most straightforward way was to remove any Nulls from the Value column:

Figure 12 – Insert a Replace Value call to remove spaces from the data (Author's Figure)

But I should have added this step before starting the steps to separate the two types of values:

Figure 13 – Add a Replace Value step in the right place (Author's Figure)

After adding this step, line 17 is recognized as a number and stored correctly.

Here is the data after loading it into Power BI:

Figure 14 – Clean data after uploading to Power BI (Author's Figure)

But this only works if the text values ​​are single words. It didn't work when multiple sentences or words were stored there.

The conclusion

This was a fascinating tour of how Power Query, or M-language, works with data types.

I am still not sure about the causes of the errors.

But I learned how to handle errors, or use the try call and handle the output.

This was very helpful.

However, as you can see from the actual value in line 17, the quality of the data is important.

I have another client where users from different countries are working on the same Excel file in their own number formats.

This is a nightmare because Excel is very tolerant of data types. It accepts everything, even when the column is formatted as a number.

In that case, I have to force users to use Excel's formatting options to make sure the numbers appear as they are.

Apart from this, I have no way to import this data into Power BI without a lot of effort to clean the numbers.

And rest assured that users always find a way to mess up numbers in Excel.

References

The data is generated from random numbers and names.

Here is a reference to the M language: Power Query M formula language reference – PowerQuery M | Microsoft Read

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button