Developing Large Power BI Datasets – Part 3 – Detail Tables

Power BI is architected to consume data in a dimensional model, with narrow fact tables and related dimensions. Introducing a big, wide table into a tabular model is extremely inefficient: it takes up storage and memory, hurts performance, and complicates measure coding. Flattening records into one big, flat table is one of the worst things you can do in Power BI, and it is a common mistake made by novice Power BI users.

This is a conversation I’ve had with many customers. We want our cake, and we want to eat it too. We want all the analytic capabilities, interactivity and high performance, but we also want the ability to drill down into a lot of detail. What if we have a legitimate need to report on transaction details and/or a large table with many columns? It is well known that the ideal shape is a star schema, but what if we also need to shape data for detail reporting? The answer is that you can have it both ways – just not in one table.

For a well-oiled Power BI solution, the following example would be an ideal data model shape:

…but let’s say that, for whatever reason, we have a set of requirements that can’t be met with a dimensional model like this. Our skinny fact table doesn’t include all the details, and adding them would be very costly. Our requirements are to:

  • Drill-down in context, from a summary total to more granular details
  • View transactions and detail records
  • See recent data changes in real-time

Based on my experimentation, here is an example of the difference in storage efficiency between a star schema and a flat table:

An Import model contains a fact table with 21 million rows. That table has only keys and a few numeric columns. The fact table is related to seven different dimension tables containing unique values and several text-type columns. This PBIX file, when fully populated, is about 35 megabytes in size. Importing the same data using a TSQL query that joins the fact to the same seven dimension tables – repeating the attribute values in one big, flat table in the Power BI model – results in a PBIX file that is about 350 megabytes. So, the same data stored in flat form takes up ten times as much space! Data refresh for the latter model is much slower as well.

DirectQuery and Composite Models

Life is a series of trade-offs and compromises. Adding a DirectQuery (DQ) table to a model lets you reference an external table from a data source without actually importing the records. Queries against a DirectQuery table are processed at the data source when the user interacts with the report, rather than being answered from an imported, cached copy of the data. When a DirectQuery table is added to a data model alongside imported tables, the result is called a composite model. Import tables and DirectQuery tables can even have relationships between them. There are many advantages to this design approach, but the significant trade-off is that DirectQuery typically doesn’t perform as well as Import mode tables.

Here is an actual use case:

Our details table comes from an Azure SQL view that contains about 60 columns, many of which are repeating long text values. The table has about 12 million rows, and the TSQL script in the underlying view combines several tables using left outer joins. Queries are not fast and can take 30 seconds to a couple of minutes to return results, depending on the filtering predicates. After adding the table to the model using DirectQuery and creating relationships to the common dimension tables, I created a drillthrough report page with parameter fields. There are four visuals on the page that group aggregate values using different fields. When a user drills through to the page, the filters are passed as parameters to the generated WHERE clause. In one typical test case, the page takes about a minute to render.

The DirectQuery table doesn’t add storage overhead to the model, but users aren’t very happy waiting a minute or more to see results.

Using Aggregations

To design aggregations, you create a table that contains summarized and aggregated values at the levels anticipated for common report groupings. The aggregation design creates a mapping between the detail and summary tables that is used whenever a report user navigates to a level the summary table supports.

I add a new table to the model called Sales Agg that contains only five columns: three keys and two summarized numeric fields. In the aggregation design, each Sales Agg column is mapped to a corresponding column in the detail table.

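The mapping might look something like this (the column names are illustrative – the three key columns are mapped as GroupBy and the two numeric columns as Sum against the DirectQuery Inventory Details table):

  Sales Agg column   Summarization   Detail table        Detail column
  Order Date Key     GroupBy         Inventory Details   Order Date Key
  Store Key          GroupBy         Inventory Details   Store Key
  Product Key        GroupBy         Inventory Details   Product Key
  Quantity           Sum             Inventory Details   Quantity
  Sales Amount       Sum             Inventory Details   Sales Amount
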
The results are clear. Because the three column charts on the drillthrough report page produce queries supported by the aggregations, those three visuals render in about a second. The table visual isn’t supported by aggregations so it still takes several seconds to render – although less time than before because there was no competition for resources after the first three queries had completed. This at least gives the users summary results to read as they are waiting for the details table.

Troubleshooting and Optimizing Aggregations

The first time I tried to use the aggregations in the chart that grouped by the Region Country Name field, performance didn’t improve, so I knew that something was wrong. To troubleshoot, I captured the generated DAX query using Performance Analyzer, pasted it into DAX Studio and ran a trace using Server Timings. The Server Timings trace revealed that the attempt to use aggregations failed because I had accidentally used the Region Country Name field from the Inventory Details table rather than the Store dimension table referenced in the aggregation design. This was my goof, but it was easy to find and correct using the trace.
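
The difference is easy to see in the query itself. Below is a simplified sketch of the kind of DAX query Performance Analyzer captures; the Store and Inventory Details table names come from this example, but the [Total Quantity] measure name is a placeholder:

    // Simplified version of the captured query. [Total Quantity] is a placeholder measure name.
    // Grouping by the column from the Store dimension lets the storage engine answer
    // the query from the Sales Agg table:
    EVALUATE
    SUMMARIZECOLUMNS (
        'Store'[Region Country Name],
        "Quantity", [Total Quantity]
    )

    // Grouping by the same-named column on the DirectQuery Inventory Details table
    // isn't part of the aggregation design, so the query misses the aggregation
    // and is sent to the source:
    // SUMMARIZECOLUMNS ( 'Inventory Details'[Region Country Name], "Quantity", [Total Quantity] )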

Adding a detail table to a model can get messy, especially if the table or view design changes over time. To keep datasets streamlined and easier to maintain, it may be best to leave these complicated tables out of the Power BI model altogether and keep only the dimensional star schema.

Drillthrough to a Paginated Report

Paginated reports can report on data in a Power BI dataset, but sometimes they are better used for the exceptional reports that are best served by conventional SQL queries. In this example, I have created an Inventory Details report using a TSQL query that delivers the flat, wide, detailed results (this is the wide table we were trying to avoid in the Power BI data model). The report accepts a few parameters used to filter the query.

Once the report is deployed to the Power BI service, I can execute the Inventory Details report by passing parameter values in the URL.

In my Power BI report, I can build the URL string with a measure that forms a link to the target paginated report. My Inventory Report Link measure is written in DAX to concatenate the report URL with the selected filter values.

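A minimal sketch of such a measure might look like the following; the workspace and report IDs are placeholders, and the table, column and parameter names (CategoryName, Year, Month) are assumptions for this example:

    Inventory Report Link =
    // Base URL of the target paginated report; the workspace and report IDs are placeholders
    VAR BaseUrl = "https://app.powerbi.com/groups/{workspaceId}/rdlreports/{reportId}"
    // Pick up the values in the current filter context; table and column names are assumed
    VAR SelCategory = SELECTEDVALUE ( 'Product'[Category] )
    VAR SelYear     = SELECTEDVALUE ( 'Date'[Year] )
    VAR SelMonth    = SELECTEDVALUE ( 'Date'[Month] )
    RETURN
        // Paginated report parameters are passed on the URL with the rp: prefix
        BaseUrl
            & "?rp:CategoryName=" & SelCategory
            & "&rp:Year=" & SelYear
            & "&rp:Month=" & SelMonth
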
The measure’s data category is set to Web URL so that it can be added to any visual that supports link navigation – in this case, the matrix visual.

The link generated by the measure includes three parameter values corresponding to the selected Category, Year and Month.
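
With the parameter names assumed above, the generated link takes roughly this form (the IDs and filter values shown are illustrative): https://app.powerbi.com/groups/{workspaceId}/rdlreports/{reportId}?rp:CategoryName=Bikes&rp:Year=2022&rp:Month=6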

A significant advantage is that we are able to use the right reporting tool to serve two different purposes. Power BI is great for analytical reporting based on a semantic data model, and paginated reports are a great tool for detail reporting directly from a transactional database. There is no need to clutter and complicate the data model when I can just use the best tool for each job, and then provide navigation features to make them work together.

Paul Turley

Microsoft Data Platform MVP, Principal Consultant for 3Cloud Solutions, specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.

4 thoughts on “Developing Large Power BI Datasets – Part 3 – Detail Tables”

  1. While this may be true in the majority of situations – I recently did some testing on import load performance of a star schema vs one-big-table “OBT” (or multiple OBTs in our case) and found the OBT solution to import much faster. I think this is because our data model is a constellation of star schemas (fact order, fact order line, fact order line item, fact product, plus all the dims for each star – around 60+ tables). Some of these tables are over 100m rows. We only want to see orders in the past couple of years, and consequently only products associated with those orders, and finally, only dim values that relate to those orders and products in the past couple of years.

    Because of this, each fact and (significantly large) dim table needs to first look up and find the orders sold in the last two years, then use these order IDs as a filter/inner join on each fact and dim to ensure we’re only loading the relevant rows we need. If we do this for all the fact tables and the larger dimension tables, it means we are querying the fact order table over and over again. Doing the same thing using a few wide tables (one table each for orders, order lines, order line items and products) and doing all the filtering in 4 SQL queries resulted in a much faster import time.

    I know that the star schema version will be much smaller in model size, and probably more performant due to the high cardinality of some of the columns, but right now we’re more worried about refresh/load speed, as our report refreshes kick off at about 7am after the nightly ETL has run – so we need these reports to be ready by 9am for the business to consume. I may be missing something obvious here, so I would appreciate your thoughts on this dilemma.

    Note: We can’t utilise incremental refresh because a small portion of our historical data updates very far back in time (2+ years). It’s only a few rows per partition, but enough that when we trigger an incremental refresh, every single partition has to reload – essentially meaning IR = full refresh – hence the concern for refresh/load efficiency.
