Custom Queries For “Detect Data Changes” In Power BI Incremental Refresh

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

First of all I created a simple dataset with incremental refresh enabled. The source was a SQL Server table with two columns: Date (actually a datetime column) and Sales.
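If you want to recreate it, the source table can be as simple as the following sketch (the Sales column's type is my assumption; the important detail is that Date is a datetime, so the RangeStart and RangeEnd parameters can filter it):

CREATE TABLE dbo.Sales (
  [Date] DATETIME NOT NULL, -- datetime rather than date, so it can be filtered by RangeStart/RangeEnd
  Sales INT NOT NULL
);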

I then configured incremental refresh as follows:

In the background this created six yearly partitions:

Nothing interesting here so far, but the real challenge lies ahead: how exactly do you use custom queries with “detect data changes”?

I created a new table in my SQL Server database called DetectDataChangesTable, with one row for every partition in the dataset (even though the incremental refresh configuration above means only the 2021 and 2022 partitions will ever be refreshed) and the values the RangeStart and RangeEnd M parameters would be set to when each partition is refreshed:
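Here's a minimal T-SQL sketch of that table, assuming the six yearly partitions run from 2017 to 2022 (the exact year range will depend on when you configure the policy):

CREATE TABLE dbo.DetectDataChangesTable (
  RangeStart DATETIME NOT NULL, -- the value the RangeStart M parameter takes for this partition
  RangeEnd   DATETIME NOT NULL, -- the value the RangeEnd M parameter takes for this partition
  Output     INT NOT NULL       -- change this value to signal that the partition needs refreshing
);

INSERT INTO dbo.DetectDataChangesTable (RangeStart, RangeEnd, Output)
VALUES
('2017-01-01', '2018-01-01', 1),
('2018-01-01', '2019-01-01', 1),
('2019-01-01', '2020-01-01', 1),
('2020-01-01', '2021-01-01', 1),
('2021-01-01', '2022-01-01', 1),
('2022-01-01', '2023-01-01', 1);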

I then created an M query in my dataset called DetectDataChangesQuery that connected to this table, filtered the RangeStart column by the current value of the RangeStart M parameter and the RangeEnd column by the current value of the RangeEnd M parameter, and then returned just the Output column:

let
  // Connect to the SQL Server instance and database
  Source = Sql.Databases("ThisIsMySQLServerName"),
  IncrementalRefreshDemo = Source{[Name = "IncrementalRefreshDemo"]}[Data],
  dbo_DetectDataChangesTable = IncrementalRefreshDemo{[Schema = "dbo", Item = "DetectDataChangesTable"]}[Data],
  // Keep the row for the partition currently being refreshed, by matching
  // the current values of the RangeStart and RangeEnd M parameters
  FilterByParams = Table.SelectRows(
    dbo_DetectDataChangesTable,
    each [RangeStart] = RangeStart and [RangeEnd] = RangeEnd
  ),
  // Return a one-column table containing just the Output column
  #"Removed Other Columns" = Table.SelectColumns(FilterByParams, {"Output"})
in
  #"Removed Other Columns"

Here’s the output of the query in the Power Query Editor with the RangeStart M parameter set to 1/1/2021 and the RangeEnd M parameter set to 1/1/2022:

The important thing to point out here is that while the documentation says the query must return a scalar value, in fact the query needs to return a table with one column and one row containing a single scalar value.
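To picture the required shape: the value the polling query returns should be equivalent to what this hand-built M expression produces, a one-column, one-row table wrapping the scalar:

#table(type table [Output = Int64.Type], {{1}})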

After publishing the dataset once again, the next thing to do was to set the pollingExpression property described in the documentation. I did this by connecting to the dataset via the XMLA Endpoint using Tabular Editor 3, clicking on the Sales table and looking in the Refresh Policy section of the Properties pane. I set the property to the name of the query I had just created, DetectDataChangesQuery:
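For reference, here's roughly what the Sales table's refresh policy then looks like in TMSL. Treat this as a sketch: the rolling window and incremental period values are my assumptions, chosen to match the six yearly partitions shown above, and the policy's sourceExpression property (the M source query for the partitions) is not shown:

"refreshPolicy": {
  "policyType": "basic",
  "rollingWindowGranularity": "year",
  "rollingWindowPeriods": 5,
  "incrementalGranularity": "year",
  "incrementalPeriods": 1,
  "pollingExpression": "DetectDataChangesQuery"
}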

I then forced a full refresh of the Sales table, including all partitions, by running a TMSL script in SQL Server Management Studio and setting the applyRefreshPolicy parameter to false, as documented here. Here’s the TMSL script:

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": false,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

Scripting the entire table out to TMSL, I could then see that the refreshBookmark property on the two partitions that can be refreshed in an incremental refresh (2021 and 2022) was set to 1, the value returned for those partitions in the Output column of the DetectDataChangesQuery query:
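Here's a sketch of roughly how the 2021 partition appeared in the scripted TMSL; the exact property layout may differ, but the key property is refreshBookmark:

{
  "name": "2021",
  "mode": "import",
  "source": {
    "type": "policyRange",
    "start": "2021-01-01T00:00:00",
    "end": "2022-01-01T00:00:00",
    "granularity": "year"
  },
  "refreshBookmark": "1"
}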

The refreshBookmark property is important because it stores the value that Power BI compares with the output of the DetectDataChangesQuery query on subsequent dataset refreshes to determine whether the partition needs to be refreshed. So, in this case, the value of refreshBookmark is 1 for the 2021 partition, and if in a future refresh DetectDataChangesQuery returns a different value for this partition then Power BI knows it needs to be refreshed.

I then went back to the DetectDataChangesTable table in SQL and set the Output column to be 2 for the row relating to the 2021 partition:
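If you're following along, that's a one-statement change (using the table structure sketched earlier):

UPDATE dbo.DetectDataChangesTable
SET Output = 2
WHERE RangeStart = '2021-01-01'
  AND RangeEnd = '2022-01-01';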

Next, I went back to SQL Server Management Studio and refreshed the table using a TMSL script with applyRefreshPolicy set to true (which is the default, and what happens when you refresh the dataset through the Power BI portal):

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": true,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

In the Messages pane of the query window I saw that Power BI had detected that the value returned by DetectDataChangesQuery for the 2021 partition had changed, and that the partition therefore needed to be refreshed:

Lower down in the Messages pane the output confirmed that only the 2021 partition was being refreshed:

In Profiler I saw three SQL queries. The first two queried the DetectDataChangesTable table for the two partitions that might be refreshed, checking whether the value returned in the Output column had changed:

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2023-01-01 00:00:00') 
and [_].[RangeEnd] is not null)

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2021-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeEnd] is not null)

The third was to get the data for the 2021 partition, which was the only partition that needed to be refreshed:

select [_].[Date],
    [_].[Sales]
from [dbo].[Sales] as [_]
where [_].[Date] >= convert(datetime2, '2021-01-01 00:00:00') 
and [_].[Date] < convert(datetime2, '2022-01-01 00:00:00')

Finally, scripting the Sales table again to TMSL after the refresh had completed showed that the refreshBookmark property had changed to 2 for the 2021 partition:

And that’s it. I really like this feature, but I’ve never seen anyone use it in the real world, which is a shame. Maybe this blog post will inspire someone out there to try it in production?

[UPDATE] An extra point to add is that if you use this functionality, the data source for the partitions must be the same as the data source used for the polling queries, otherwise you’ll get an error.

20 thoughts on “Custom Queries For “Detect Data Changes” In Power BI Incremental Refresh”

  1. This is actually quite powerful and will prevent so much unnecessary partition processing when designed properly. Thanks Chris for the article, simple and useful to many for sure.

  2. Interesting…
    I have a question. I don’t have Premium in place, just Pro.
    Does the incremental setup always create one partition per year?
    And does one detected change to one of these partitions reprocess that partition?

    My dataset can change in the past; it’s a rare case, but it could occur, because a user may cancel an old transaction and replace it with a new one.
    I set up the incremental process to only reprocess the last few days, but I’m curious to know what the system will do with the archive partitions.
    Is there a way to monitor what the system processes? (I’m connected to Dataverse, so I can’t track the SQL queries.)

    PS: sorry, it’s more than one question after all 😉

    1. No, incremental refresh can create partitions at other granularities, not just year – you’ll see day, month and I think quarter partitions too depending on how you configure it. One change will trigger a refresh of one partition, yes. You can always monitor what is being refreshed either by running a Profiler trace or by running the refresh from a TMSL script in SQL Server Management Studio, as shown in my post.

      1. Thanks Chris.
        But I can’t use Profiler; Dataverse is my source for now.
        I’ll do some tests using Synapse, in which case I’ll be able to track the queries.

      2. Unfortunately I’m using Dataverse and not able to use Profiler,
        but I’ll do some tests in my new Synapse environment.

        For now I’m not sure how I should track my changes.
        In my case, when the past changes, it means that an old row is “deleted” (the record is made inactive in CRM and excluded from my Power Query / Power BI dataset) and a new one takes its place.

        I think I have to make some important changes to my Power BI model; for now it’s not simple against Dataverse, so I’m conducting tests with Synapse for that.

        I’m not sure what the best solution is for now; I’ll test Power BI Premium too, with the new datamarts concept,
        but I’m not sure about the near real time side there
        (i.e. syncing Dataverse in near real time to the datamart).
        But based on this blog post maybe I’ll be able to set up a custom table to trigger the refresh.

        My challenge is detecting the deleted rows in incremental and refresh processes…

        Well… I like Power BI challenges, and your blog is an incredible source of crazy info (I’ve been following you since your beginnings in the OLAP world and applying so many of your recommendations in all my projects…)

  3. I think one of the reasons it was not implemented widely was that enabling the “Detect Data Changes” setting would double the amount of storage required by the dataset, as the service would (allegedly) create a shadow copy of all partitions.

  4. Hi Chris, how should we calculate the Output column in the “DetectDataChangesTable”? Can you elaborate on this please? Can I use it to return the last datetime the data was updated in the data source?
    Thank you.

  5. Hi Chris,
    How should we calculate the “Output” column in the “DetectDataChangesTable”? Can we use it to return the last datetime the data was updated/modified in the data warehouse?
    Can you please elaborate a little bit more on how to build this table?
    Thank you!

  6. Hi Chris,
    I’d like to echo Pablo G’s request for more detail regarding how the Output column in the DetectDataChangesTable should be calculated.
    Thanks in advance!

    1. The Output column can contain any scalar value, so using a number is the easiest option. The important thing is that it changes when the data in the partition needs to be refreshed.
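      For example, here's a minimal T-SQL sketch of one way to populate Output for the tables used in this post, using an aggregate checksum over each partition's rows; a MAX over a last-modified audit column is another common choice:

      UPDATE d
      SET d.Output = (
        SELECT ISNULL(CHECKSUM_AGG(CHECKSUM(s.[Date], s.Sales)), 0) -- changes (almost always) when the partition's rows change
        FROM dbo.Sales AS s
        WHERE s.[Date] >= d.RangeStart
          AND s.[Date] < d.RangeEnd
      )
      FROM dbo.DetectDataChangesTable AS d;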

  7. For individuals with only a Pro license, you have the option to enhance your fact table by incorporating two extra columns: Year and Month.
    Subsequently, you can establish another Query (FactChangeTime) to compute the maximum ChangedTime for each year and month.
    By linking the Fact table with FactChangeTime using the Year/Month columns, you’ll create a cost-effective solution (for 10 years of data, there will only be 120 distinct values).
    This can then be utilized for the Detect Data Change setting, streamlining your process.

  8. Utilizing a traditional Sales table (Order Date, Value, ChangedDate), we can create a Query named DetectDataChangesQuery as follows:

    let
    Source = SalesAllColumns,
    FilteredRows = Table.SelectRows(Source, each [Order Date] >= Date.From(RangeStart) and [Order Date] < Date.From(RangeEnd))
    in
    FilteredRows

    In this scenario, we can simply remove the ChangedDate column for a more streamlined approach.

    1. DetectDataChangesQuery:
      let
      Source = SalesAllColumns,
      FilteredRows = Table.SelectRows(Source, each [Order Date] >= Date.From(RangeStart) and [Order Date] < Date.From(RangeEnd)),
      GroupedRows = Table.Group(FilteredRows, {}, {{"LastChangedDate", each List.Max([ChangedDate]), type nullable date}})
      in
      GroupedRows

      1. Sales table
        let
        Source = SalesAllColumns,
        RemovedColumns = Table.RemoveColumns(Source, {"ChangedDate"}),
        FilteredRows = Table.SelectRows(RemovedColumns, each [Order Date] >= Date.From(RangeStart) and [Order Date] < Date.From(RangeEnd))
        in
        FilteredRows

  9. Hi Chris,
    I followed your steps to get a scalar value for data change detection. I created a table with RangeStart, RangeEnd and Output fields for each partition. However, my source is not the same as my fact table, and I am not able to wrap my head around the relation between the fact table and dbo_DetectDataChangesTable.

    IncrementalRefreshDemo = Source
    {[Name = "IncrementalRefreshDemo"]}
    [Data],
    dbo_DetectDataChangesTable = IncrementalRefreshDemo
    {
    [
    Schema = "dbo",
    Item = "DetectDataChangesTable"
    ]
    } [Data],

    Is it possible to achieve the same using an Excel table for DetectDataChangesTable? If so, how do I relate it to my fact table using the M language?

    1. I don’t understand your question, but as I say in the post, the data source for the DetectDataChanges table needs to be the same as the data source for your fact table, otherwise it won’t work.
