Why You Should Defragment Your Fact Tables If You’re Using Incremental Refresh In Power BI Premium

A lot of people are building large datasets in Power BI Premium nowadays, and if you’re building a large dataset you’re very likely to be using Power BI’s incremental refresh functionality with your fact tables or manually creating and refreshing partitions in them. If so, you should occasionally do a special kind of refresh to shrink the dictionaries on your fact tables.

To illustrate why you need to do this I created a Power BI dataset with one table containing one column, where each value was a GUID (ie a unique text value). Here’s the M I used to generate it:

let
    // Generate a list of ten million integers
    Source = {1..10000000},
    // Build a table with a single text column, MyGUID, containing one new GUID per row
    Output = #table(type table [MyGUID=text], 
        List.Transform(
            Source, 
            each {Text.NewGuid()})
    )
in
    Output

After publishing the dataset I copied the original partition, so there were two partitions in the table, each containing ten million rows. When you use incremental refresh on a table Power BI automatically generates and manages partitions for you in the background – and there isn't much difference between letting Power BI do this and creating partitions manually, as I'm doing here.
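
If you want to create or copy partitions manually you can use Tabular Editor, SQL Server Management Studio or a TMSL script run against the XMLA Endpoint. As a rough sketch, a TMSL create command that adds a second partition with the same M expression as the original might look something like the following – the partition name "MyTable-Copy" is just a placeholder, and the M expression has been collapsed onto one line:

{
  "create": {
    "parentObject": {
      "database": "TableWithManyDistinctValues",
      "table": "MyTable"
    },
    "partition": {
      "name": "MyTable-Copy",
      "source": {
        "type": "m",
        "expression": "let Source = {1..10000000}, Output = #table(type table [MyGUID=text], List.Transform(Source, each {Text.NewGuid()})) in Output"
      }
    }
  }
}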

Here’s what DAX Studio’s Model Metrics functionality showed when I first refreshed the dataset:

The Dictionary column is the column to focus on in this table. Dictionaries are the data structures created by Power BI's storage engine, the Vertipaq engine, to hold all the distinct values in a column – the more distinct values there are, the larger the dictionary. As Marco points out here, the DMVs that DAX Studio uses to measure dictionary size don't always return consistent or accurate results, but the values they return are good enough for the purposes of this test.

I then did several Full refreshes on just one of the two partitions. Here’s what DAX Studio showed at this point:

As you can see, the size of the dictionary for the MyGUID column (and therefore the overall size of the dataset) has grown a lot – by about 1GB – even though the cardinality of the column has stayed the same. This is because dictionaries are held at the table level, not at the partition level, so by refreshing just one partition I was adding 10 million new text values to the dictionary without deleting any of the old values that had been added to the dictionary by previous refreshes and which were no longer present in the table.
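
Incidentally, if you want to reproduce this, a full refresh of a single partition can also be run as a TMSL command against the XMLA Endpoint; it looks something like the following sketch, where the partition name is again just a placeholder:

{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "TableWithManyDistinctValues",
        "table": "MyTable",
        "partition": "MyTable-Copy"
      }
    ]
  }
}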

Finally, I ran the following Refresh command against the workspace’s XMLA Endpoint from SQL Server Management Studio:

{
  "refresh": {
    "type": "defragment",
    "objects": [
      {
        "database": "TableWithManyDistinctValues",
        "table": "MyTable"
      }
    ]
  }
}

[Note that XMLA Endpoints are only available if you’re using Premium or PPU]

This refresh is of type defragment and it is documented here. This is the explanation of what it does from the docs:

Defragment the data in the specified table. As data is added to or removed from a table, the dictionaries of each column can become polluted with values that no longer exist in the actual column values. The defragment option will clean up the values in the dictionaries that are no longer used.

After that you can see that the dictionary for the MyGUID column, and therefore the size of the dataset, was back to around its original size:

This is an extreme example of course, but I can imagine something nearly this bad happening in the real world if you use incremental refresh on a table with a text column containing addresses or user comments and you don't keep much history. This isn't new behaviour or functionality: the same thing happens in Analysis Services Tabular models and is discussed in books like Marco and Alberto's "Tabular Modeling in Microsoft SQL Server Analysis Services", but not many people are aware that it also applies to Power BI datasets. This type of refresh can be quite slow, but it is likely to be faster than the alternative – running a full refresh on the whole dataset, which is your only option if you're not using Premium.
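
For comparison, here's a sketch of what a full refresh of the entire dataset looks like when run against the XMLA Endpoint – unlike the defragment refresh it reloads all the data from the source, which is why it's usually the slower option:

{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "TableWithManyDistinctValues"
      }
    ]
  }
}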

So why should you run a refresh of type defragment on your dataset? There are two reasons:

  1. If you’re getting close to the size limits imposed by the Premium capacity SKU you’re using, and you want to make sure that you don’t hit them.
  2. Large datasets generally result in slower query performance, so reducing the size of your dataset may result in faster queries.

How often should you do this? In most cases not very often; it depends on how much your dataset shrinks when you defragment it. Once a year may be sufficient, and I doubt anyone would need to do it more than once a month. One way to automate this would be to use Power Automate to call the Enhanced Refresh API, as I showed in this series of posts.
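
For example, the body of the request you send to the Enhanced Refresh API might look something like this sketch, assuming you POST it to the dataset's /refreshes endpoint and that the Defragment type is supported for your scenario – check the documentation for the Datasets – Refresh Dataset In Group REST API for the exact details:

{
  "type": "Defragment",
  "objects": [
    {
      "table": "MyTable"
    }
  ]
}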

3 thoughts on "Why You Should Defragment Your Fact Tables If You're Using Incremental Refresh In Power BI Premium"

  1. Thanks for the post, Chris.
    I am definitely going to try this defragment technique and check my data model size.

    This is one of the constant problems I have been facing in many of the models in my org.
    I just don't bring these high-cardinality fields into the model, but end users ask for them again and again. For example, a user clicks and filters on a dashboard, finds some observations, and then wants to see the detailed info behind them, such as a Transaction ID or other high-cardinality fields.
    I tried a DirectQuery + aggregations kind of model, but there are many calculations with iterator functions that are not supported in this setup.

    Do you have any ideas on how I can move forward with this for my users?

    Thanks
