Clearing big chunks of cube data
- Jan. 22, 2019
I was deleting big chunks of data from TM1 cubes just the other day. In fact, based on the dimensions Year/Scenario/Version, I wanted to clean out data that had become irrelevant now. In any non-trivial TM1 model, this isn't an easy job. Planning Analytics does not have an automatic data clear option. A number of the complications that can arise, include:
- indicating which data can be cleared in a cube, can lead to a rather big number of combinations. Therefore, being able to indicate with wildcards which data can be cleared, is a must-have
- this clean up exercise can be executed by cube, or for all cubes, or for selected cubes. At a bare minimum, the chosen dimensions (Year/Scenario/Version) need to be part of the cube, if we want to delete data in the cube
- linked to the previous item, we can have different dimensions for Year (Time), or Scenario or Version, or a concatenated Scenario_Version dimension - still we will want to clear data in the cubes
- it takes time to execute the clear processes, and subsequently, to bounce the server in order to free up the memory again
- I want to approach this exercise in a generic fashion, such that we can reuse the concepts and coding in either the same TM1 model (in the future), either in different TM1 models
One of the newer pieces of functionality I wrote, was a check if, for a given cube, (up to) 5 dimensions exist in the cube. We don't need to supply all 5, usually 3 suffice (Year/Scenario/Version). But for the dimensions we do supply, all of them must exist. An alias can be provided, like a localized dimension name. The output, through a helper TI process and some global variables, includes a boolean result (do all dimensions exist, yes or no?) and also the dimension names that were not found in the cube.
So, my successful approach is as follows:
- create an Excel file (to be saved as a text file). The 11 columns are:
- column A: name of dimension 1 to select on, like Year
- column B: name of its elements to select on, like 2011 + 2012 + 2013 + 2014 + 2015 + 2016 + 2017
- column C: name of dimension 2 to select on, like Scenario
- column D: name of its elements to select on, like RF_?? (Rolling Forecast scenarios)
- column E: name of dimension 3 to select on, like Version
- column F: name of its elements to select on, like WRK*
- column G: name of dimension 4 to select on, like ...
- column H: name of its elements to select on, like ...
- column I: name of dimension 5 to select on, like ...
- column J: name of its elements to select on, like ...
- column K: the TEXTJOIN function in Excel: read this article. The output is a concatenated filter string, taking into account all non-empty columns A-J.
(*) Feel free to add a column to indicate the cube name, probably with wildcard characters (* and ?), because otherwise you will end up with many rows in the Excel / text file. Here I left out that column, and solved this in the mother TI process. (**) In a big TM1 model, this TI process is executed many, many times, so the TM1 server message log will be cluttered up. Remove the excess log entries later on if you want to. An empty cell in columns B, D, F, H, J would mean "all lowest-level dimension elements".
Optionally, add an extra column to indicate Yes/No, should this "record" be executed on the cubes, or skipped. Adapt the coding accordingly (the Itemskip and ItemReject functions can be used).
Watch and relax
After the hard work is done, sit back and relax ! You can now execute the mother TI process and wait for the 'Process completed successfully' message. Or, schedule this data clear process in a (nightly) chore. In passing, I just mention that dimension reordering could be advantageous too. Do a few experiments and compare the outcomes.
After manually clearing out data, if it has not been done before, limit rules and feeders to avoid calculating and feeding unnecessary steps. This can be a time-consuming activity to go through the main cubes of the model. However, it can lead to additional benefits - in terms of memory, calculation times, model reboot times, file size on disk, the time of backup procedures, etc. Oh yes, while clearing data, don't forget to turn off the transaction logging temporarily ;-)
You might want to speed up this clearing process a whole lot more! Take advantage of the cores that your server has, and execute clear processes with the correct parameters but in a PARALLEL fashion - as opposed to a sequential fashion. You can do this with the TM1runti.exe command. That little application sits in the bin64 installation directory folder. See: here for information and here for a practical example. Cubewise has a free download in this area too: Hustle. Make sure that you do not induce locking, this mechanism of parallel data loads only works for data (copy processes, clearing data, load processes without updating dimensions, currency conversion processes, data archiving processes, etc.) Notice that all your traditional views and subsets should have a unique name.
UPDATE April 2019: For those of you who are on TM1 server version 11.4 (PAL 2.0.6) or more recent: check out the RunProcess function instead of ExecuteProcess. It allows you to run the called processes in parallel - with the aforementioned caveats of avoiding lock-inducing commands. The cube slices that we clear are not overlapping so that is good. If you use really (!) temporary subsets and views with unique names, you avoid metadata actions and turn off logs in a clever way through the control cube, I assure you that you will not believe your own eyes ! :-)
What I found equally useful, is to inspect how many cells we have in each of the (bigger) cubes. Both input cells and calculated / fed cells are useful. You can run the Performance monitor in Planning Analytics to count those cells. What I did additionally, is setting up temporary cubes with the same dimensions as some of the big cubes. A TI process would write a 1 in the temporary cube for every input cell of the original cube. Like that, you can count by Year/Scenario/Version, and other detail, how many input cells. It turned out that certain currently calculated versions were affected by high-level input of data, spreading data over a cell that was previously 0... You can easily tally up to a few billion cells if you don't pay attention to this. Now, you need to determine this in the first place, but curing the situation is just as important - it's far from easy! Clearing data there might make your boss or customer really angry ;-)
All in all, my experience is that, if the different parts of the exercise are executed thoroughly, the benefits can be substantial. Do factor in also a period of validation by the business. The bigger the model, the more time should be foreseen.
See you later for more TM1 articles! There is already a follow-up article here.