Tuesday, April 4, 2017

Moving to SEI

In Motion!

It's a happy announcement:
I joined SEI's Boston Group this week to become part of this strong local consulting firm and its fabulous culture that operates in a variety of industries and areas of specialization with IT.

It wasn't easy to leave my friends and respected colleagues at Pragmatic Works, and I look forward to working with them again in the future.

Here on my blog I will continue to share what I learn. I fully expect to be learning at least as fast this year than I did last, so hold on here we go!

If you are curious about what I like about SEI, you should see the new website they've rolled out just in time for my arrival! sei.com

If you are looking for any IT assistance, please check out our capabilities.

Friday, March 24, 2017

Power BI Gotchas

Answers to some common questions and frustrations for new users of Power BI.


I'd like to help save you some time banging your head and searching the internet by pointing out some of the issues I've run into when working with clients over the last year.

Data loading and preparation:

There are many articles and blogs about the technical aspects of loading data into Power BI so I won't cover that here in great detail. However, I would like to call out some basic data transformations and cleansing activities that can make a big difference to the experience of report users.
Once you have loaded a table check the columns:
1.       Rename your data columns to use business terminology in the field names.
This isn't a database, this is a reporting data model and your business users need to be able to understand the names of the columns.
2.       Add units to your column names.
Again, the objective here is to make the reports readable and easy to interpret. You don't want your uses to wonder if you were summing minutes or seconds in the reporting graphic.
3.       Remove unnecessary columns:
Will the column be used for reporting? Really think about this, will the users of your reports NEED to see this column, slice or filter by it? If not, it doesn't belong in your reporting data model.
4.       Remove bad data, or unnecessary data records:
Will the RECORDS in your table all be reported on? Again, every record might not have valid data. Check for this and remove records that are not going to provide good insight in your reporting.
5.       Remove unused records, even from date tables!
Are the records part of the data set you need? Sometimes we load records that go all the way back years before our reporting range. Unless those records contribute to the value of information shown in the report, remove them. This is ESPECIALLY true of the Date or Calendar tables - think about those extra records. This is much more data than you need to be filtering through for your reports and interactive visualizations.
6.       Use data types wisely:
Change data types to be accurate for your business definition of the column. Sometimes colums will come in as whole numbers to Power BI because the sample used to define the column "sees" only numbers in the first 1,000 rows for the preview. If this is a text based column, your data load will error when you load the rest of the data.
7.       Use Trim if you think there might be leading or training spaces in your data.

Once you have loaded the data there is still preparation to complete. A common question often comes up right about now though, so let's talk about the 

Confusion between Calculations and Measures:

 These are similar: that the engine will calculate them for you, and both can be aggregated in your reports and visualizations. The key difference is that calculations are values that are stored in each row of your table, and measures are not calculated until used in the report context.

1.       Creating Calculations and Measures
Format your facts as part of your data model setup:
Often you will be told to use custom calculations instead of the default summarization on your columns.
1.       You will then know, and be able to define in a data dictionary, exactly how the calculation was built.
2.       However, if you do use default summarization, there will be a benefit when viewing reports: you will be able to use the "View records" command to see what records were included in the aggregate. This does not work for measures.
2.       Sometimes you need to create a new column as part of each row:
For example, you might want to calculate Amount as a Quantity * Price - that is a calculated column
1.       The "context" of the calculation is the row - always
2.       Created on data load and stored just like other columns loaded by Power Query
3.       Sometimes you want to show calculation results that change as we interact with our reports:
You might need to know the maximum Amount which is max(Quantity * Price), but when reviewed across all rows for orders placed within the current Year it will give a different answer than when viewed for all orders within a given City.
1.       In this case the "context" is all of the rows that are being evaluated, and which rows will be determined by "where" the measure is being used
                                         i.            Used as a value in a chart by dates, it might be the max Amount per Year, then you might drill down to max by Quarter and max by Month
                                       ii.            Used as a value in a Geographic map, it might be the max Amount in the Region, then drill down to a given State or Province, City, etc…

Some other important and sometimes confusing features:

  1.       Go through and format your facts as part of your data model setup
                       i.            Set default summarization.
                     ii.            Set the format to use for showing the fields
It is easier to do all of these steps for all of your columns at one time so that your reports look great from the start and you will know that the standard has been implemented. It is still likely that you will need to return to update or fix some which you can do a t any time.

2.       Create Groups:
You can categorize your records automatically by creating groups that become a new column in your table. "Other" as a category allows for new data values added in your incoming data. Be sure to check overtime that you don't need to add a new category though.

3.       Create Hierarchies:
                       i.            Creating a hierarchy is something people do accidentally all of the time, then they worry about damaging their data model by deleting them. Don't worry if you get an accidental hierarchy with an unintended drag/drop of a column in the Fields list, just go ahead and delete that hierarchy - your fields will stay as they were.
                     ii.            To create an intentional hierarchy drag a lower level field over to the highest level in the hierarchy and drop it there. There will be a yellow dotted line shown where it will drop. Each other column added to the "pile" will be added at the bottom.
                   iii.            If necessary rearrange them by dragging them up or down, or delete one using the "x" to the far right.
                   iv.            Adding a Hierarchy to the axis of a visualization will enable drop down.
                     v.            In the axis of a graph - you can remove levels of hierarchy that are not desired for that graph without any effect on the hierarchy.

4.       Even if you set the format of your column, you may need to set the visual scale on each individual visualization settings: None, Thousands. Pay attention to this, it makes a big difference to readability

5.       You can now "See Data" AND "See Records"
                       i.            See Data will show the values that are in the visualization - as a table.
                     ii.            See Records will show the table records that comprise the aggregations that are shown in the graph. Every record in it's entirety.
                   iii.            As mentioned above, "See Records" only works with data elements (including calculated columns?) not with measures that can be summarized
                   iv.            Trick - if you want a count as your visual value, but your data doesn't have a uniquely identifying column to "count", create a column that can be summarized for the counting (like "Trips") and default it to the value of 1. You can then sum or count the values in this column to know how many unique "trips".

6.       If you create a visualization (a bar chart) and the values are the same in every data point you have probably forgotten to create your relationships. You can draw them on the Relationships tab, or us the Manage Relationships dialog box from the Home ribbon bar.

7.       How to review your relationships:
If you have many tables and many relationships and want to review them all quickly, the dialog box from Manage Relationships may be the better choice, as it is quickly scanned for errors in column names, etc.…

8.       Single and "Both" direction filtering make a big difference in some reports:
                       i.            Single direction filtering enables the value from one slicer or visual to affect the values on the related table. However, filtering the second table by another value will not affect the values shown from the slicer.
                     ii.            The best way to visualize this is to see 2 slicers from two dimensional tables and when the dimensions are both single direction, the two slicers will not affect each other.
                   iii.            If the 2 dimensions are "Both" direction filtering, then selecting a value from one will affect the values shown from the other.

9.       Drill down in charts
                       i.            There are 3 drill down buttons on a chart that contains hierarchies:
                     ii.            The single down arrow button is a toggle to turn drill down on, when on selecting a value in the chart will "drill down" to the next level.
                   iii.            The double arrow will drill down to the next level, aggregating at that level ONLY:
                                         i.            Ex: if you go from showing 3 years of data down to showing months, the values for each month will be the aggregate of all of the month's values for every year, all together
                   iv.            The forked arrow aggregates at the combination of the level above and the new level:
                                         i.            Ex: if you go from showing 3 years down to months, you will see the values for each month of every year.

10.   A Date table enables time calculations if certain criteria are met:
                       i.            A date column must have a date data type
                     ii.            The date column must show a continuous value for every date in the time period (ex: 1/1/2017, 1/2/2017, …)
                   iii.            For any time intelligence calculations to work:
                                         i.            Joins with the date table must be made on the date column - NOT on a surrogate key.
                                       ii.            The values in the time calculation must also be continuous. If you are aggregating ParallelPeriod values, you must be doing it across a dataset that has a value for every date in the time period.

11.   Why isn't my calculation working?
                       i.            Check your calc - of course, but if you don't find anything
                     ii.            Save some time by reviewing your:
                                         i.            Page and Report filters
                                       ii.            Visualization filters
                                     iii.            Whether you have Relationship filtering in Single or Both directions

These are some of the "gotchas" that can waste your time when you create your reports and can cause your users confusion when they use your reports.

Are there others that you've spent time resolving that you'd like to share?



Monday, December 12, 2016

Reading SAS files from Power BI using R Script

Recently I have been working to read SAS files (sas7bdat file format) in Power BI. SAS technical support and the Power BI community team have both worked with me to attempt to use the SAS Local Provider for OLE DB which is the only provider that does not require a SAS server.

We were able to use the SAS Local Data Provider 9.44 in Excel by using the Data Connection Wizard to access Data Link Properties and to read the sas7bdat data into Excel as a data source (see details on this at the bottom of this post).

However, following similar steps in Power BI lead to an error message because "the SAS Local Provider for OLE DB does not 'like' the GetOleDbSchemaTable method. (quote from SAS Technical support):
(Text of error: "The 'SAS Local Data Provider 9.44' provider is not registered on the local machine.")
Error Message
Text of error: "The 'SAS Local Data Provider 9.44' provider is not registered on the local machine."
Now, this error seems actionable until you remember that we just used the provider to connect from Excel. This is due to the use of the GetOleDbSchemaTable method and we could not find a way to work around this problem in Power BI.

I have submitted this as a support ticket to Microsof and I look forward to their response on how to get a supported connection to SAS files.

The next step is to find another way:
Loading Data into the Dashboard file using R Script:

The best method I have found so far to read SAS files into Power BI is using R script:
https://businessintelligist.com/2016/09/09/reading-sas-sas7bdat-files-in-power-bi/

1) Locate your SAS Data set files (you can download a sample set in ZIP format from: http://www.principlesofeconometrics.com/sas.htm)

2) Install R locally (I have been using R 3.3.2 from CRAN (https://cran.r-project.org/bin/windows/base/)

3) Install Power BI Desktop locally:
Download Power BI Desktop from http://app.powerbi.com

4) Open Power BI, and first, set your version of R in Options and Settings:
a. Click the “x” in the upper right corner of the splash dialog box to close it:
Close the Power BI Desktop dialog box
b.       Select File -> Options and Settings ->Options -> R Scripting
c.       You should be able to select your installed R version from the Detected R home directories list and Detected R IDEs
5) Now, setup the data source for your Power BI solution:
a.       Drop down Get Data and select More… -> Other –> R Script, then the Connect button
b.       The Execute R Script dialog box provides an input location for the following script which will set the working directory, access the library and read the file into Power BI:



c.       Click OK to see the content of the SAS file, select the table(s) to load  (multiple table files haven’t been tested by me using R Script)
d.       Select to Load the data directly, or Edit to use the Query Editor to filter, clean, reformat or otherwise transform the data for this and all future data loads.

6)      Click Close & Apply when the data set is complete and you are ready to use it in Power BI
 
Graph using Airline data from SAS file











Using a faster library:
I have found some information that haven is a faster library than sas7bdat, the sample files for SAS that I have access to are not sufficient to test this, but the haven library works very well:

Using haven the script changes to reference the haven library (https://cran.r-project.org/package=haven) and to use the read_sas function to read the SAS file from that library as you can see in this sample to read another data file, USA:
I am very interested to know what the results are using production size and complex datasets, and am willing to help in anyway that I can if you have questions about using this method.

Publishing and Refresh of data:
The Power BI solution can be published to http://app.Powerbi.com using the Publish button. The R Script has imported the dataset, so that will be published to the Power BI Service along with the reports.

R Script as a data source does not yet support refresh. There is a setting for refreshing data from Files, but that does not work for R Script as a source, it is for CSV, etc…


Rachael

Tuesday, October 11, 2016

Azure Cloud Based Analytics

The elements of analytics are the same whether on premises or in the cloud. Cloud based tools for analytics can provide a platform that is far more extensible, economical and powerful than traditional on-premise. This session will review some cloud analytics use cases and how the tools from Microsoft combine to create business insight and value.
Azure Cloud Tools
Microsoft Intelligence (edited, previously Cortana Intelligence Suite) is a powerful combination of cloud based tools that manage data, provide analysis capability and store or visualize results. Together these tools represent the six elements of data analytics.


The talk on Data Based analytics includes a demonstration of Data Lake, PolyBase, Azure SQL Data Warehouse, Power BI, and Azure Machine Learning. The links to Power BI and Azure ML will take you to the apps, each has a free account you can sign up and use for testing. I encourage you to sign up for the free offer for $200 in Azure Credits (United States based offer, at the time of this posting Oct. 11, 2016).

Analytics projects
According to Gartner analytics projects include 6 elements and there are common challenges for each of these elements that must be addressed in any analytics project:

  1. Data sources - volume alone and the need to move large data sets
  2. Data models - complexity of setup as they simulate business processes
  3. Processing applications - data cleansing, remediation of bad data
  4. Computing power - analysis requires considerable power
  5. Analytics models - design of experiments and management of the models, model complexity
  6. Sharing and storage of results - informative visualizations and results embedded in applications

Tools provided by cloud vendors provide assistance and in some cases, with far superior solutions to these issues.
The Cloud analytics tools are designed to address the challenges we’ve been discussing:
 1) Scalability: Agility/ Flexibility
     -large data sources –power them off when not in use
     - heavy computing capability – pay only for what you use
    - Stand up a new environment – pilot a new capability
   - integrate data sources with multiple visualization and analytics environment
2) Economy –
    - less expensive to stand up than hardware and software,
    - fewer skills necessary in house for implementation, configuration, integration, use,
     - management and maintenance of systems
3) Security
   - Azure is designed for security from the ground up
   - Microsoft spends $1B a year on security research – more than most security firms
   - identity based security throughout the stack

   - protection for disaster recovery, regulatory specific platforms to address security needs
4) Capability
  - Integration with desktop tools for embedding visualizations, integrating predictive analytics
  - Consumer technology style interfaces enable faster learning and require fewer skills than script based on premises tools
  - Collaboration enabled by ties across Office 365 tools and Mobile capabilities
  - Powerful servers can simply process far more data and analysis than most local servers

How can my business benefit from analytics in the cloud?
Data, provided and used as a valuable asset of the firm provides leverage employees apply to problem solving activities. Fact based trend prediction leads to insights that provide business value.
Businesses taking advantage of the cloud to benefit from increased cloud computing power, the ability to handle large data sets and for the mobility provided by access to the analytics platform from any location.



Other articles that may help to answer this question are:

Power BI is a great starting point for moving analytics to the cloud
Power BI brings many customers their first exposure to cloud analytics. Companies find multiple benefits of collaboration between users, mobile access, and the experience of adopting a new way to use data. Workspaces provide secure areas to work together on Power BI solutions and are linked to Exchange groups and OneDrive folders where other collateral can be shared.

Storage in the cloud 
Using Azure cloud based storage is relatively inexpensive, remotely accessible from anywhere, secure and in coming data is free (data moving out of your region is charged). Large data sets can be managed in Azure Blob Storage, Azure Data Lake, Azure SQL DW and Azure SQL Database.

Predictive Analytics with Azure Machine Learning
Azure ML provides a friendly tool for building machine learning experiments, cleaning and visualizing data. The environment provides familiar analytic tools for R and Python scripting, built in models for statistical analysis and a process for encapsulating trained models in web services as predictive experiments. Integration with Excel and other applications is simplified by the web service interface.

Questions:
Q1) In the PolyBase example does modeling the data by casting columns hinder performance of the data load in the "Create Table As" (CTAS) statement?
(Original posting left this question unanswered, edited to add the answer below)
In creating the External Table the columns must be of the same data type.

However, when querying the external table to use CTAS to populate a SQL DW table I found that using cast to change the datatype of the column was faster to complete the new table creation. This was true for two cases, one where I converted in the input column varchar2(20) to an Int and another where I converted the input varchar(100) to a Date data type. I suspect this is due to the increased speed of input for Int and Date data types over varchars, but it indicates that the original query time isn't different.



Q2) Is it simple to switch from cold to hot in Azure Blob Storage?

The Azure hot storage tier is optimized for storing data that is accessed frequently. The Azure cool storage tier is optimized for storing data that is infrequently accessed and long-lived. - Microsoft documentation 
In regions where these settings have been enabled (not East yet), they can be changed in the Configuration blade of the Storage Account.

Q3) When to use Data Lake in comparison to Blob Storage.

In the demo for this talk I read the NYC 2013 Taxi data from the 12 - 2+GB CSV files into Azure SQL DW using CTAS. At this time, Data Lake is still in preview, and we cannot yet use CTAS with Azure Data Lake. This forced me to use Azure Blob Storage.
Additoinal differences are that Data Lake is optimized for parallel analytics workloads and for high throughput and IOPS. Where Blob Storage is not optimized for analytics but is designed to hold any type of file. Data Lake will hold unlimited data and Blob Storage is limited to 500TB. There are cost differences too, I recommend the Azure Pricing Calculator.

I recommend the Microsoft article Comparing Azure Data Lake Store and Azure Blob Storage for details.

Wednesday, September 28, 2016

Performance Tips and Techniques for Power BI and Power Query (update)

Power BI Performance Tips and Techniques update:

Back in March of this year I created an article and a webinar for the Pragmatic Works Blog.

The original concepts still hold, Power BI and Excel Power Query are very powerful tools for building data models and sharing data discovery and visualizations. As you move from initial prototyping to a more mature model you may find that performance can be bogged down if you don't manage the structure of your reporting information. This talk provides Tips and Techniques to manage this information (link to updated slides).







Since the original webinar I have presented this topic at a few user groups and SQL Saturdays and was asked some very good questions.

I'd like to answer these questions here, and to update the slide deck as well.


1) In Providence at the BI User Group I was asked: What happens in Kasper de Jonge's Memory Usage Tool macro when there is a join between two queries loaded into the model?




If the join is done in the data load, the data model contains only one table, as in the example discussed in the original presentation.

If there is a relationship created between two tables loaded in to memory, the joined tables are both shown in the pivot tables that result from the macro, and how much memory taken by each column is shown as a datatype and total for the column.

2) In NYC at the Power BI User Group a member asked: Does the HierarchySlicer cause performance problems?
To test this, I used a data table with nearly 14M rows linked to a Date table. Power BI connected via direct query to SQL Server 2016. 



The Hierarchy Slicer (HierarchySlicer.0.7.9.pbiviz) at Year / Month on the Date table compared to a "flat" YearMonthName non hierarchical slicer. In this case, each new slicer selection took about 2 seconds to repopulate the summary table to the left. The two took the same amount of time, I did not see a difference in performance.

3) At SQL Saturday in Dallas this question came up: Is the SSAS formula engine still single threaded in Power BI on the portal as it is on the desktop? The SSAS engine is the same in Power BI on app.powerBI.com as it is on your local drive. There would not be an improvement on calculations caused by multi-threading, although the local machine may or may not have more resources to dedicate to the formula engine thread.

Thank you all for the great questions, please comment on others questions or if you have more information!

Friday, July 8, 2016

World Class Data Organizations and SQL Server 2016


World Class Data Organizations (WCDO) intentionally invest in and build their data cultures to create business value. These businesses integrate data into daily activities beyond report or dashboard reviews. Employees know what data elements represent, how to access them, and how to use them to create new insights.



WCDO and SQL Server 2016 (pdf) presents the primary characteristics and activities of WCDO's and expands on some of the features of SQL Server 2016 designed to enable those activities.

This is the slide deck presented on June 30, 2016, and the beginning of a series of presentations I will deliver on this topic.

Saturday, June 11, 2016

Power BI Best Practices - Tactics

I've been asked to present Power BI best practices to a client on Monday. This two post series is part reference to great articles and partly my own insights on the topic, some gained by painful experience. Part one addresses BI project strategy, and this is part two on tactics:

#1 Dashboards are "a single place to monitor the current state of the data".
Microsoft support has tips for great dashboards and points out some design techniques that are important to keep in mind:
#2 Design  for the user's reporting needs.
Like Delora Bradish's exhortation to always consider reporting when designing your data warehouse, your data model should provide friendly table and column names that reflect familiar business terms well understood by the system's users.

Include only necessary data and hide any information that might be used for creating the data model or calculations but not shown (like key values) in reports.

#3 Design your data model to support self-service.
From Marco Russo we are guided to take advantage of data views to stabilize our data loading process:
  • Use views for your data sources to avoid dependence on table structures
  • Create a view schema in the database that consists only of the views for each table to be imported
  • Each view should only contain the columns necessary for reporting from the data model
  • Use meaningful names that will make sense to your users
Double check relationships in your data model to ensure that self service reporting will return valid data.

Carefully specify check data types and how or whether fields should be summarized.

Separate unrelated data into multiple tables to clarify reporting.

Use star schema when possible, as snowflakes can limit scalability. Use snowflakes when lookup tables can provide usability improvements without performance implications.

#4 Consider security and data refresh early in the design of your project.
Work with appropriate administrators, whether in Security or your DBA's, to build gateways for data refresh or direct query access from on premises data sources to the Power BI portal.

Schedule refreshes to correspond with your data frequency, there is no reason to update data eight times a day if it is loaded weekly.

#5 Take advantage of Power BI Content Packs internally and to access services.
Use organizational Content Packs to provide read-only pre-built solutions to your organization or other groups. Members of the group can copy these and personalize them as needed.

Use Content Packs for services published for many common business services. Use this link to see available content packs from the Power BI portal.

#6 Configure your solution to maximize Q&A Natural Language capabilities.
Chuck Schaeffer's CRM Search article provides considerable information about designing for Q&A:

  • "Datasets not represented with tiles are not considered, anticipate the types of question that may be asked and design the dashboard with tiles from datasets which may respond to sure questions".
  • Questions will join entities to provide answers (example sales and fiscal calendar to answer a question about "Last year's sales revenue").
  • Be sure to fix data types so that Q&A can correctly represent the answers.
  • Be sure to mark those that should not be aggregated with do not summarize.
  • Add data categories wherever possible.
  • Put your topics in their own datasets, break out each topic from a given entity.
  • Add synonyms to improve Natural Language capabilities of PBI on your data set.

#7 Be wary of performance issues that might affect your data model's performance.
These are a few tips from my blog entry on Pragmatic Work's site:
Tall, narrow tables are faster than wide tables with many columns, especially if the columns are highly unique (columns that are highly unique are said to exhibit "high cardinality").

Use integer keys for primary/foreign key relationships.

Reduce uniqueness in columns by splitting date and time values, and by reducing the "granularity" of time to hours if that is required for reporting.

Include only relevant data in the data model. One powerful example is not to include unnecessary dates in your Date

Please comment on my choices, or add your own best practices to this growing list.

The first post in this short series addresses strategic concerns for developing BI projects.