Clearing big chunks of cube data
- Jan. 22, 2019
Introduction
I was deleting big chunks of data from TM1 cubes just the other day. Based on the dimensions Year/Scenario/Version (not month), I wanted to clean out data that had become irrelevant. In any non-trivial TM1 model this isn't an easy job, and Planning Analytics does not have an automatic data clear option. A number of complications can arise:
- indicating which data can be cleared in a cube can lead to a rather large number of combinations. Therefore, being able to use wildcards to indicate which data can be cleared is a must-have
- this clean-up exercise can be executed for one cube, for all cubes, or for a selection of cubes. At a bare minimum, the chosen dimensions (Year/Scenario/Version) must all be part of a cube before we delete data in it; otherwise we risk losing data we meant to keep
- linked to the previous item, a model can contain several different dimensions for Year (Time), Scenario or Version, or even a concatenated Scenario_Version dimension - and we will still want to clear data in those cubes
- it takes time to execute the clear processes and, subsequently, to bounce the server in order to free up the memory again
- I want to approach this exercise in a generic fashion, so that we can reuse the concepts and coding either in the same TM1 model (in the future) or in different TM1 models. Every non-trivial TM1 model will need this kind of clean-up every now and then
One of the newer pieces of functionality I wrote was a check whether, for a given cube, (up to) 5 supplied dimensions exist in that cube. We don't need to supply all 5; usually 3 suffice (Year/Scenario/Version). But the dimensions we do supply must all exist. An alias can be provided, such as a localized dimension name. The output, through a helper TI process and some global variables, includes a boolean result (do all dimensions exist, yes or no?) as well as the dimension names that were not found in the cube. A sketch of such a helper process follows below.
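To make this concrete, here is a minimal sketch of what the Prolog of such a helper process could look like. The process name 'Check Dims In Cube', the parameter names (pCube, pDim1 ... pDim5) and the global variable names are my own assumptions, and alias handling is left out for brevity; the code simply walks the cube's dimensions with TABDIM and reports back through two global variables.

# Prolog of a hypothetical helper process 'Check Dims In Cube'
# Parameters: pCube, pDim1, pDim2, pDim3, pDim4, pDim5 (strings, empty when not used)
# Results for the caller: nAllDimsExist (1/0) and sMissingDims

NumericGlobalVariable('nAllDimsExist');
StringGlobalVariable('sMissingDims');
nAllDimsExist = 1;
sMissingDims = '';

d = 1;
WHILE(d <= 5);
  # pick the d-th requested dimension name
  IF(d = 1); sDim = pDim1;
  ELSEIF(d = 2); sDim = pDim2;
  ELSEIF(d = 3); sDim = pDim3;
  ELSEIF(d = 4); sDim = pDim4;
  ELSE; sDim = pDim5;
  ENDIF;

  IF(LONG(sDim) > 0);
    # walk the cube's dimensions with TABDIM until a match is found
    nFound = 0;
    i = 1;
    sCubeDim = TABDIM(pCube, i);
    WHILE(LONG(sCubeDim) > 0);
      IF(sCubeDim @= sDim);
        nFound = 1;
      ENDIF;
      i = i + 1;
      sCubeDim = TABDIM(pCube, i);
    END;
    IF(nFound = 0);
      nAllDimsExist = 0;
      sMissingDims = sMissingDims | sDim | ' ';
    ENDIF;
  ENDIF;

  d = d + 1;
END;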
So, my successful approach is as follows:
- create an Excel file (to be saved as a text file). The 11 columns are:
- column A: name of dimension 1 to select on, like Year
- column B: name of its elements to select on, like 2012 + 2013 + 2014 + 2015 + 2016 + 2017 + 2018
- column C: name of dimension 2 to select on, like Scenario
- column D: name of its elements to select on, like RF_?? (Rolling Forecast scenarios)
- column E: name of dimension 3 to select on, like Version
- column F: name of its elements to select on, like WRK*
- column G: name of dimension 4 to select on, like ...
- column H: name of its elements to select on, like ...
- column I: name of dimension 5 to select on, like ...
- column J: name of its elements to select on, like ...
- column K: the TEXTJOIN function in Excel (read this article). The output is a concatenated filter string, taking into account all non-empty columns A-J.
- In the mother TI process, take the text file as the data source.
- In the Advanced > Data tab, loop over all TM1 cubes in the model (*), including control cubes
- For every cube, look at the dimension selections. If the supplied dimensions ALL exist in the cube, continue; if not, proceed with the next cube. The check is done with the extra TI process I wrote about above (**).
- inspect the element selections. If the syntax is valid, supply the concatenated string from column K to a third TI process (see the sketch after this list)
- that last helper TI process could be a Bedrock data clear process, which saves you the time of writing and testing a generic data clear process yourself. The concatenated string in column K should then match the filter syntax expected by the called Bedrock process.
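As a rough illustration, the Data tab of the mother process could look something like the sketch below. The variable names (vDim1 ... vDim5 for columns A/C/E/G/I, vFilter for column K) and the process names 'Check Dims In Cube' and 'Cube Data Clear Generic' are assumptions of mine; the latter stands in for whichever (Bedrock-style) clear process you end up calling.

# Data tab of the mother process - one pass per record of the text file.
# Declare the globals here as well, so the values written by the helper are visible.
NumericGlobalVariable('nAllDimsExist');
StringGlobalVariable('sMissingDims');

nCubes = DIMSIZ('}Cubes');
c = 1;
WHILE(c <= nCubes);
  sCube = DIMNM('}Cubes', c);

  # do all supplied dimensions exist in this cube?
  ExecuteProcess('Check Dims In Cube', 'pCube', sCube, 'pDim1', vDim1, 'pDim2', vDim2, 'pDim3', vDim3, 'pDim4', vDim4, 'pDim5', vDim5);

  IF(nAllDimsExist = 1);
    # hand the concatenated filter string (column K) to the clear process
    ExecuteProcess('Cube Data Clear Generic', 'pCube', sCube, 'pFilter', vFilter);
  ENDIF;

  c = c + 1;
END;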
(*) Feel free to add a column to indicate the cube name, probably with wildcard characters (* and ?), because otherwise you may end up with many rows in the Excel / text file. Here I left out that column and solved this in the mother TI process.
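If you do add such a cube name column (read into a variable I will call vCubePattern), one way to resolve the wildcard is an MDX subset on the }Cubes dimension, as in this sketch; the subset and variable names are hypothetical.

sSub = 'tmp_clear_cubes';
IF(SubsetExists('}Cubes', sSub) = 1);
  SubsetDestroy('}Cubes', sSub);
ENDIF;

# TM1FilterByPattern applies the * and ? wildcards from the column to the cube names
SubsetCreateByMDX(sSub, '{TM1FILTERBYPATTERN( {TM1SUBSETALL( [}Cubes] )}, "' | vCubePattern | '")}');

n = SubsetGetSize('}Cubes', sSub);
i = 1;
WHILE(i <= n);
  sCube = SubsetGetElementName('}Cubes', sSub, i);
  # ... same per-cube checks and clear call as in the sketch above ...
  i = i + 1;
END;

SubsetDestroy('}Cubes', sSub);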
(**) In a big TM1 model, this TI process is executed many, many times, so the TM1 server message log will be cluttered up. Remove the excess log entries later on if you want to. You could also opt to run the helper process on all (desired) cubes before starting the clear operation. In that case you will want to keep the result (clear the cube or not?) in an attribute against the }Cubes dimension. As a last option, fill such an attribute manually beforehand to indicate which cubes should be affected or left untouched. That way, you can avoid the effort of writing a generic process like I did. Mind you, in big TM1 models, populating the attribute manually can already be an exercise on its own, but it gives you more control and flexibility. A small sketch of the attribute approach follows below.
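A minimal sketch of that attribute approach, assuming an attribute I will call 'ClearEligible' (any name will do):

# One-off: create the flag attribute on the }Cubes dimension
AttrInsert('}Cubes', '', 'ClearEligible', 'S');

# In the helper process (or manually), remember the outcome per cube
IF(nAllDimsExist = 1);
  AttrPutS('Yes', '}Cubes', pCube, 'ClearEligible');
ELSE;
  AttrPutS('No', '}Cubes', pCube, 'ClearEligible');
ENDIF;

# In the mother process, read the flag instead of re-running the check
sFlag = ATTRS('}Cubes', sCube, 'ClearEligible');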
An empty cell in columns B, D, F, H, J would mean "all lowest-level dimension elements".
Optionally, add an extra column with a Yes/No flag to indicate whether this "record" should be executed against the cubes or skipped. Adapt the coding accordingly (the ItemSkip and ItemReject functions can be used, as in the snippet below).
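For instance, assuming the extra flag column is read into a variable I call vExecute, a small guard at the top of the Data tab would do:

# Skip this record of the text file unless it is explicitly flagged with Yes
IF(UPPER(TRIM(vExecute)) @<> 'YES');
  ItemSkip;
  # or, to log the skipped record in the error file instead: ItemReject('Skipped by flag');
ENDIF;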
Watch and relax
After the hard work is done, sit back and relax! You can now execute the mother TI process and wait for the 'Process completed successfully' message. Or, schedule this data clear process in a (nightly) chore. In passing, I will just mention that dimension reordering could be advantageous too. Do a few experiments and compare the outcomes.
Next steps
After manually clearing out data, and if it has not been done before, limit rules and feeders so that unnecessary cells are no longer calculated or fed. Going through the main cubes of the model this way can be a time-consuming activity. However, it can lead to additional benefits - in terms of memory, calculation times, model reboot times, file size on disk, the duration of backup procedures, etc. Oh yes, while clearing data, don't forget to turn off transaction logging temporarily ;-)
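Switching transaction logging off and back on can be done in the Prolog and Epilog of the clear process. A minimal sketch, assuming the target cube arrives in a parameter pCube:

# Prolog: remember the current setting and switch logging off.
# 'LOGGING' is the measure in the }CubeProperties control cube; a blank value means
# logging is on by default.
sLogging = CellGetS('}CubeProperties', pCube, 'LOGGING');
CubeSetLogChanges(pCube, 0);

# Epilog: restore the original setting once the clear is done.
IF(sLogging @<> 'NO');
  CubeSetLogChanges(pCube, 1);
ENDIF;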
You might want to speed up this clearing process a whole lot more! Take advantage of the cores that your server has, and execute the clear processes with the correct parameters in a PARALLEL fashion - as opposed to a sequential fashion. You can do this with the TM1RunTI.exe command. That little application sits in the bin64 installation directory. See: here for information and here for a practical example. Cubewise has free downloads in this area too: Hustle and RushTI. Make sure that you do not induce locking; this mechanism of parallel execution only works for data operations (copy processes, clearing data, load processes without updating dimensions, currency conversion processes, data archiving processes, etc.). Note that all your traditional views and subsets should have unique names.
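One way to launch TM1RunTI from TI itself is ExecuteCommand with the wait flag set to 0, so that several clears can run side by side. The path, server name, credentials, process name and parameters below are placeholders, and the flags are written down from memory; double-check them against the IBM documentation for your version.

sExe = '"C:\Program Files\ibm\cognos\tm1_64\bin64\tm1runti.exe"';
sCmd = sExe | ' -process Cube.Data.Clear -adminhost localhost -server Planning';
sCmd = sCmd | ' -user admin -pwd apple';
sCmd = sCmd | ' pCube=' | sCube | ' pFilter="' | vFilter | '"';
ExecuteCommand(sCmd, 0);   # 0 = do not wait for the command to finish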
UPDATE April 2019: For those of you who are on TM1 server version 11.4 (PAL 2.0.6) or more recent: check out the RunProcess function instead of ExecuteProcess. It allows you to run the called processes in parallel - with the aforementioned caveat of avoiding lock-inducing commands. The cube slices that we clear do not overlap, so that is good. If you use really (!) temporary subsets and views with unique names, avoid metadata actions, and turn off logging in a clever way through the control cube, I assure you that you will not believe your own eyes! :-)
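In the mother process sketched earlier, the switch then essentially comes down to replacing the ExecuteProcess call to the clear process with RunProcess, for example like this (the pSuffix parameter is my own assumption, passed on so the child process can build unique temporary view and subset names):

IF(nAllDimsExist = 1);
  # a cheap way to get a reasonably unique suffix per call
  sSuffix = sCube | '_' | NumberToString(INT(RAND() * 1000000));
  RunProcess('Cube Data Clear Generic', 'pCube', sCube, 'pFilter', vFilter, 'pSuffix', sSuffix);
ENDIF;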
What I found equally useful is to inspect how many cells we have in each of the (bigger) cubes. Both input cells and calculated/fed cells are relevant. You can run the Performance Monitor in Planning Analytics to count those cells. Additionally, I set up temporary cubes with the same dimensions as some of the biggest cubes. A TI process would write a 1 in the temporary cube for every input cell of the original cube. That way, you can count, by Year/Scenario/Version and other detail, how many input cells the cube contains. It turned out that certain CURRENTLY calculated versions were affected by high-level input of data, spreading values over cells that were previously 0... In total, you can easily end up with a few billion populated cells if you don't pay attention to this. Now, you need to detect this in the first place, but curing the situation is just as important - and far from easy, as I showed you! Beware: clearing data in the wrong slices might make your boss or customer really angry ;-) Protect yourself and take frequent backups of the entire data directory.
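The counting process itself can stay very small. A sketch with made-up cube, dimension and variable names: the data source is a view on a big 'Sales' cube with 'Skip consolidated values', 'Skip rule calculated values' and 'Skip zero/blank values' ticked, so only input cells reach the Data tab, and 'zCount Sales' is a temporary cube with the same dimensions. Consolidated values in the counting cube then give you the cell counts per Year/Scenario/Version slice.

# Data tab: flag every input cell of the original cube with a 1 in the mirror cube
CellPutN(1, 'zCount Sales', vYear, vScenario, vVersion, vCostCentre, vAccount, vMonth, vMeasure);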
All in all, my experience is that, if the different parts of the exercise are executed thoroughly, the benefits can be substantial. Do also factor in a period of validation by the business. The bigger the model, the more time should be foreseen.
See you later for more TM1 articles! There is already a follow-up article here.