On-Demand Loading Of Direct Lake Power BI Datasets In Fabric

For any Power BI person, Direct Lake mode is the killer feature of Fabric. Import mode report performance (or near enough) directly on data from the lake, with none of the waiting around for data to refresh! It seems too good to be true. How can it be possible?

The full answer, going into detail about how data from Delta tables is transcoded to Power BI’s in-memory format, is too long for one blog post. But it is possible in part through something that existed before Fabric but didn’t get much attention: on-demand loading. There’s a blog post from December 2021 about how it works in Power BI Premium here:

https://powerbi.microsoft.com/en-us/blog/announcing-on-demand-loading-capabilities-for-large-models-in-power-bi/

TL;DR: instead of loading all the data from the tables in your Lakehouse into memory when a query is run, on-demand loading means only the data that is needed for that query is loaded, which naturally makes everything a lot faster. What’s more, you can see what gets loaded into memory using DMVs. Here’s a simple example…

Let’s say you have a Lakehouse with a table in it called Sales that contains three columns: Country, Product and Sales.

I populated this table with a Dataflow Gen2 that uses the following M expression:

let
  Source = #table(
    type table [Country = text, Product = text, Sales = number], 
    {
    {"UK", "Apples", 1}, {"France", "Apples", 2}, 
    {"UK", "Oranges", 5}, {"Germany", "Pears", 10}
    }
  )
in
  Source

Let’s also say you have a custom dataset called Sales Custom Dataset built on this Lakehouse and containing this table (you can’t use the default dataset, at least not yet, for what comes next).

At this point you can open up DAX Studio, connect to your workspace’s XMLA Endpoint, and query this dataset. The DMV mentioned in the blog post above, DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, tells you (amongst other things) whether a column segment in a table can be paged in and out, whether it is currently resident in memory, its “temperature” (which represents how frequently the column segment is accessed) and when it was last accessed. A column segment is a structure that holds compressed data for a column in the Vertipaq engine inside Power BI.

Running the following query:

Select
COLUMN_ID, SEGMENT_NUMBER, ISPAGEABLE, 
ISRESIDENT, TEMPERATURE, LAST_ACCESSED
from SYSTEMRESTRICTSCHEMA 
($System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, 
[DATABASE_NAME] = 'Sales Custom Dataset')

Running it immediately after creating the dataset or refreshing it (again, what refresh means for a Direct Lake dataset is something for another post, but it has the side effect of paging everything out of memory) will give the following result:

Note that for this simple dataset there is only one row returned per column. That will not always be the case: datasets with more data, or data loaded into the lake in a different way, will return more rows because there are more segments, and each row represents a column segment, not a column. Also note that, apart from the RowNumber column (which is hidden and always present), every column segment is pageable, is not resident in memory, and has no value for Temperature or Last Accessed.

Now let’s say that you build a report on this custom dataset with a single table visual using just the Country and Sales columns, like so:
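Behind the scenes a table visual like this generates a DAX query, and it’s that query which causes the relevant column segments to be paged in; you can get much the same effect by running a similar query yourself in DAX Studio against the XMLA Endpoint. Here’s a rough sketch of that kind of query (not necessarily the exact query the visual sends, and the “Sales Amount” name is just for illustration):

EVALUATE
SUMMARIZECOLUMNS (
    // group by Country and sum Sales: only these two columns are referenced
    'Sales'[Country],
    "Sales Amount", SUM ( 'Sales'[Sales] )
)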

Rerunning the DMV query above now returns the following:

You can now see that the Country and Sales columns are resident in memory and have a fairly “hot” temperature, and the date and time they were last accessed is now populated.

Doing nothing else apart from waiting about five minutes and then rerunning the same DMV query returns the following:

You can see that Country and Sales are still in memory but their temperature has reduced.

Adding the Product column to the table visual like so:

…and then rerunning the DMV query now gives the following results:

As you might expect, the only column segment for Product has now been paged into memory but it still has a lower temperature than the column segments for Country and Sales.
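For completeness, the same kind of sketch for the updated visual just adds Product as a group-by column, and it’s that reference that pages its segment into memory:

EVALUATE
SUMMARIZECOLUMNS (
    'Sales'[Country],
    // referencing Product is what causes its segment to be paged in
    'Sales'[Product],
    "Sales Amount", SUM ( 'Sales'[Sales] )
)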

As well as looking at column segments, it’s also possible to do the same thing for column dictionaries by running the following DMV query:

Select
COLUMN_ID, DICTIONARY_SIZE, DICTIONARY_ISPAGEABLE,
DICTIONARY_ISRESIDENT, DICTIONARY_TEMPERATURE, DICTIONARY_LAST_ACCESSED
from SYSTEMRESTRICTSCHEMA 
($System.DISCOVER_STORAGE_TABLE_COLUMNS, 
[DATABASE_NAME] = 'Sales Custom Dataset')

Here’s an example of the output:

Finally, after leaving the dataset for two days and not running any other queries or opening any reports, rerunning both DMVs shows that everything has been paged out of memory again:

OK, so this is all very interesting for nerds like me who like to see how things work behind the scenes. As I said, though, it’s something that has been happening for large datasets in Import mode for a while now; if you’re thinking it would be cool to build a report using these DMVs and use the Temperature column to see which columns are the most frequently used, Gilbert Quevauvilliers has already done that here.

There is one big difference between Import mode and Direct Lake mode to point out, though. When you do a full refresh on an Import mode dataset the whole dataset has to fit in memory, so you are at the mercy of the limits imposed on the amount of memory a dataset can use in either Shared capacity or the Premium capacity you are using. Those same limits exist for Direct Lake datasets, but since refresh for a Direct Lake dataset is not the same thing as refresh for an Import dataset, at no point during a refresh does the whole dataset need to be in memory. This means the memory limits only apply to how much of the dataset can be paged into memory at any one time, so in some cases you’ll be able to work with much larger datasets than are possible in Import mode, so long as your queries only need to read a small part of the data at a time. I say “in some cases” because it’s complicated: there are various other, as yet undocumented, rules about whether a Direct Lake dataset can be queried as such or whether queries fall back to DirectQuery mode, and some of those rules do relate to the data volumes involved. The point is that, as Power BI developers, the way we think about dataset memory usage will have to change with Direct Lake mode.

6 thoughts on “On-Demand Loading Of Direct Lake Power BI Datasets In Fabric”

  1. Hello Chris,

    I appreciate the article you shared. However, I find myself puzzled by a particular assertion you make, which is made in several articles on Fabric, where you mention that “there is no need to wait for data to refresh.” Doesn’t the dataset’s data need to be refreshed at some point? Surely a process for refreshing this data needs to be created and maintained somewhere, right?

    Moreover, how does this differ, in terms of functionality, from the direct connections to datasets that we’ve been using for some time now? I’d appreciate it if you could shed some light on these aspects. Thank you!

    1. There is still the overhead of loading the data into the lake and converting it to Delta tables (which can be slow), but that’s it – there’s no further processing needed for Power BI.

      DirectQuery on datasets is something completely different. Direct Lake is a new storage mode in Power BI, a third option alongside Import mode and DirectQuery.

      1. A quick FYI for anyone who may not have heard …. 😉 …
        Microsoft said they will drop the name “DirectQuery for Power BI Datasets and Analysis Services.”

        For the updated name, please see the recent announcement of the GA here:
        https://powerbi.microsoft.com/en-us/blog/announcing-general-availability-for-composite-models-on-power-bi-datasets-and-analysis-services-models/

        I have to admit that “composite models” rolls off the tongue a little easier.
        And I’m guessing that “Direct Lake” had the potential of getting a codename that was equally unwieldy (“DirectQuery for ColumnStore Files in ADLS”). Thankfully it didn’t come to that!

  2. This is pretty exciting for those of us who want to make our dollars stretch a bit further. It always seemed to me that Microsoft was a bit cavalier about memory conservation in Power BI. (Why not load massive volumes of tabular data into RAM, if you can simply make that the customer’s problem, and charge them more money while you are at it?)

    But Direct Lake sounds like it will give us a way to put some of the “overflow” of our tabular models on disk, and give us a way to conserve memory. Maybe we can hold off a bit longer on upgrading to our next premium tier (P2, P3, etc).

    This technology is exciting in other ways too. One thing that comes to mind is that some day this might help work around some of the limitations of existing composite models. Composite models don’t scale very well with larger dimensions. But maybe there will be such a thing as “composite on direct lake” that will overcome that (i.e. two different models will be allowed to point at the same underlying direct lake data, as co-equals, without importing duplicate data into both).

    One concern I have is the amount of time it takes for the PBI team to move stuff from “preview” to GA. Assuming they take a timeframe comparable to composite models, this “Direct Lake” stuff might not be GA until the beginning of 2026. I really wish Microsoft would qualify their “preview” label, and at least give us a tentative *year* for GA. That would allow customers to determine whether we want to start exploring the technology right away, or just wait and let it simmer for another year or two.
