How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

I don't trust statistics and personally believe that at least 74% of them are wrong... but I bet nearly 100% of people with any interest in fantasy (or indeed any) TV shows are watching the 7th series of Game of Thrones (GoT) by HBO.
If you are one of them, join me in the analysis of the latest tweets on the subject. Please be aware that, if you're not up to date with the latest episode, this article may reveal some spoilers. My suggestion is to go and watch the episodes first, then come back here for the analysis!

If you aren't part of the above group then ¯\_(ツ)_/¯. Still, this post contains a lot of detail on how to analyse any stream of tweets with Tableau and BigQuery, together with Kafka source and sink configurations. I'll leave it to you to find another topic on which to put all this into practice.

Overall Setup

As described in my previous post on analysing Wimbledon tweets, I used Kafka for the tweet extraction phase. In this case, however, instead of querying the data directly in Kafka with Presto, I'm landing the data in a Google BigQuery table. This last step is optional, since in the last blog I queried Kafka directly, but in my opinion it represents the perfect use case for both technologies: Kafka for streaming and BigQuery for storing and querying data.
The endpoint is Tableau, which has a native connector to BigQuery. The following image represents the complete flow.

[Image: the complete flow: Twitter -> Kafka -> BigQuery -> Tableau]

One thing to notice: at this point in time I'm using an on-premises installation of Kafka which I kept from my previous blog. However, since source and target are natively cloud applications, I could easily move Kafka to the cloud as well, using for example the recently announced Confluent Kafka as a Service.

Now let's add some details about the overall setup.

Kafka

For the purposes of this blog post I've switched from the original Apache Kafka distribution to the Confluent open source one. I chose the Confluent distribution because it includes Kafka Connect, which is

A framework for scalably and reliably streaming data between Apache Kafka and other data systems

Using this framework anybody can write a connector to push data from any system to Kafka (a Source Connector) or pull data from it (a Sink Connector). There is a list of available connectors developed and maintained either by Confluent or by the community. Moreover, Kafka Connect provides the benefit of parsing the message body and storing it in Avro format, which makes it easier to access and faster to retrieve.

Kafka Source for Twitter

In order to source from Twitter I used this connector. The setup is pretty easy: copy the source folder named kafka-connect-twitter-master under $CONFLUENT_HOME/share/java and modify the file TwitterSourceConnector.properties, located under the config subfolder, to include the connection details and the topics.

The configuration file in my case looked like the following:

name=connector1  
tasks.max=1  
connector.class=com.github.jcustenborder.kafka.connect.twitter.TwitterSourceConnector

# Set these required values
twitter.oauth.accessTokenSecret=<TWITTER_TOKEN_SECRET>  
process.deletes=false  
filter.keywords=#got,gameofthrones,stark,lannister,targaryen  
kafka.status.topic=rm.got  
kafka.delete.topic=rm.got  
twitter.oauth.consumerSecret=<TWITTER_CONSUMER_SECRET>  
twitter.oauth.accessToken=<TWITTER_ACCESS_TOKEN>  
twitter.oauth.consumerKey=<TWITTER_CONSUMER_KEY>  

A few things to notice:

  • process.deletes=false: I'll not process deletion notices from the stream
  • kafka.status.topic=rm.got: I'll write the tweets to a topic named rm.got
  • filter.keywords=#got,gameofthrones,stark,lannister,targaryen: I'll take all the tweets that include at least one of these keywords. The list could be expanded; this was just a test case.

All the setup work is done! The next step is to start the Kafka Connect execution via the following call from $CONFLUENT_HOME/share/java/kafka-connect-twitter:

$CONFLUENT_HOME/bin/connect-standalone config/connect-avro-docker.properties config/TwitterSourceConnector.properties

I can see the flow of messages in Kafka using the avro-console-consumer command

./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 --property schema.registry.url=http://localhost:8081 --property print.key=true --topic rm.got --from-beginning

[GIF: tweets streaming through the topic, shown via the Avro console consumer]

You can see (though maybe it's a little difficult from the GIF) that the message body was transformed from JSON to Avro format; the following is an example:

{"CreatedAt":{"long":1502444851000},
"Id":{"long":895944759549640704},
"Text":{"string":"RT @haranatom: Daenerys Targaryen\uD83D\uDE0D https://t.co/EGQRvLEPIM"},
[...]
,"WithheldInCountries":[]}

Kafka Sink to BigQuery

Once the data is in Kafka, the next step is to push it to the selected datastore: BigQuery. I can rely on Kafka Connect for this task too, with the related code written and supported by the community and available on GitHub.

All I had to do was download the code and change the file kcbq-connector/quickstart/properties/connector.properties:

...
topics=rm.got  
...
autoCreateTables=true  
autoUpdateSchemas=true  
...
# The name of the BigQuery project to write to
project=<NAME_OF_THE_BIGQUERY_PROJECT>  
# The name of the BigQuery dataset to write to (leave the '.*=' at the beginning, enter your
# dataset after it)
datasets=.*=<NAME_OF_THE_BIGQUERY_DATASET>  
# The location of a BigQuery service account JSON key file
keyfile=/home/oracle/Big-Query-Key.json  

The changes included:

  • the topic name to source from Kafka
  • the project, dataset and Keyfile, which are the connection parameters to BigQuery. Note that the Keyfile is generated automatically when creating the BigQuery service account.

After verifying the settings, as per the Kafka Connect instructions, I had to create the tarball of the connector and extract its contents:

cd /path/to/kafka-connect-bigquery/  
./gradlew clean confluentTarBall
mkdir bin/jar/ && tar -C bin/jar/ -xf bin/tar/kcbq-connector-*-confluent-dist.tar  

The last step is to launch the connector by moving into the kcbq-connector/quickstart/ subfolder and executing

./connector.sh

Note that you may need to set CONFLUENT_DIR if the Confluent installation home is not in a sibling directory:

export CONFLUENT_DIR=/path/to/confluent  

When everything starts up without any errors, a table named rm_got (the name is generated automatically) appears in the BigQuery dataset I defined previously and starts being populated.

A side note: I encountered a Java heap space error during the run of the BigQuery sink. This was resolved by increasing the heap space setting of the connector via the following call:

export KAFKA_HEAP_OPTS="-Xms512m -Xmx1g"  

BigQuery

BigQuery, based on Google's Dremel paper, is Google's proposition for an enterprise cloud data warehouse, combining speed and scalability with separate pricing for storage and compute. While the cost of storage is common knowledge in the IT world, the compute cost is a fairly new concept, and it means the cost of the same query can vary depending on how the data is organized. In Oracle terms, we are used to associating a query's cost with the one defined in the explain plan. In BigQuery that concept is extended from the "performance cost" to the "financial cost" of a query: the more data a single query has to scan, the higher its cost. This makes the work of optimizing data structures visible not only performance-wise but also on the financial side.
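
To make the financial cost concrete, below is a minimal sketch in BigQuery's Legacy SQL (the dialect used later in this post) against the rm_got table created by the Kafka sink. Both queries touch the same rows, but since BigQuery stores data by column and bills by bytes scanned, the second one costs a fraction of the first:

-- Scans (and bills) every column of the table
SELECT * FROM [Dataset.rm_got]

-- Scans only the Text column: same rows, far fewer bytes billed
SELECT Text FROM [Dataset.rm_got]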

For the purposes of this blog post, I had almost zero settings to configure other than creating a Google Cloud Platform account, a BigQuery project and a dataset.

During the project creation phase, a Keyfile is generated and stored locally on the computer. This file contains all the credentials needed to connect to BigQuery from any external application; my suggestion is to store it in a secure place.

{
  "type": "service_account",
  "project_id": "<PROJECT_ID>",
  "private_key_id": "<PROJECT_KEY_ID>",
  "private_key": "<PRIVATE_KEY>",
  "client_email": "<E-MAIL>",
  "client_id": "<ID>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "<URL>"
}

This file is used in the Kafka sink as we saw above.

Tableau

Once the data has landed in BigQuery, it's time to analyse it with Tableau!
The connection is really simple: from the Tableau home I just need to select Connect -> To a Server -> Google BigQuery, fill in the connection details and select the project and datasource.

An important setting is the Use Legacy SQL checkbox in the datasource definition. Without it checked, I wasn't able to properly query the BigQuery datasource. This is due to the fact that "Standard SQL" doesn't support nested columns while Legacy SQL (also known as BigQuery SQL) does; for more info check the related Tableau documentation.
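
As a quick illustration, nested fields can be addressed with dot notation in Legacy SQL; a minimal sketch against the table created earlier, using Place.Country, one of the nested fields explored below:

SELECT Place.Country, COUNT(*) AS tweets
FROM [Dataset.rm_got]
GROUP BY Place.Country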

Analysing the data

Now the fun part starts: analysing the data! The integration between Tableau and BigQuery automatically exposes all the columns of the selected tables together with the correctly mapped datatypes, so I can immediately start playing with the dataset without having to worry about datatype conversions or date formats. I can simply include in the analysis the CreatedAt date and the Number of Records measure (named # of Tweets) and display the number of tweets over time.

Now I want to analyse where the tweets are coming from. I can use either the Place.Country field or the Geolocation.Latitude and Geolocation.Longitude fields in the tweet detail. Latitude and Longitude are more detailed, while Country is rolled up at country level, but both solutions share the same problem: they are only available for tweets with geolocation activated.

After adding Place.Country and # of Tweets to the canvas, I can select the map visualization. Two columns, Latitude (generated) and Longitude (generated), are created on the fly to map the country locations, and the selected visualization is shown.

However, as mentioned before, this map shows only a subset of the tweets, since the majority of them (almost 99%) have no location.

The User.Location and User.TimeZone fields suffer from a different problem: either they are null, or their possible values don't come from a predefined list but are left to the creativity of the account owner, who can type any string. As you can see, it seems we have some tweets coming directly from Winterfell, Westeros, and interestingly enough... Hogwarts!

Checking the most engaged accounts based on the User.Name field clearly shows that Daenerys and Jon Snow take the time to tweet between fighting Cersei and the White Walkers.

The User.Lang field can be used to identify the language of the user. However, when analysing the raw data, you notice that there are splits for regional language settings (note en vs en-gb). We can solve the problem by creating a new field, User.Lang.Clean, taking only the first part of the string, with a formula like:

IF  FIND([User.Lang],'-') =0  
    THEN [User.Lang] 
    ELSE 
        LEFT([User.Lang],FIND([User.Lang],'-')-1)
END  

With the interesting result of Italian being the 4th most used language, overtaking Portuguese, and showing the high interest in the show in my home country.

Character and House Analysis

Still with me? So far we've done some pretty basic analysis on top of pre-built fields or with little transformation... now it's time to dig into the tweet's Text field and check what people are talking about!

The first thing I wanted to do was check the mentions of the characters and related houses. The more a house is mentioned, the more relevant it should be, correct?
The first text analysis I wanted to perform was the Stark vs Targaryen mention war: showing how many tweets mention both, only one, or neither of the two main houses. I achieved it with the IF statement below:

IF contains(upper([Text]), 'STARK') AND contains(upper([Text]),'TARGARYEN')  
 THEN 'Both' 
 ELSEIF contains(upper([Text]), 'STARK') 
  THEN 'Stark' 
 ELSEIF contains(upper([Text]), 'TARGARYEN') 
  THEN 'Targaryen' 
 ELSE 'None' 
END

With the results favouring House Stark:

I can do the same at the single-character level, counting the mentions in separate columns, as for Jon Snow:

IIF(contains(upper([Text]), 'JON')  
OR contains(upper([Text]),'SNOW'), 1,0)  

Note the OR condition: I want to count as mentions both the word JON and the word SNOW, since each uniquely refers to the same character. Similarly I can create a column counting the mentions of Arya Stark with the following formula:

IIF(contains(upper([Text]), 'ARYA'), 1,0)  

Note that in this case I'm filtering only on the first name (ARYA), since Stark can refer to multiple characters (Sansa, Bran...). I created several columns like the two above for various characters and displayed them in a histogram ordered by # of Mentions in descending order.

As expected after looking at the Houses results above, Jon Snow leads the user mentions by a big margin over the others, with Daenerys in second place.

The methods mentioned above, however, have some big limitations:

  • I need to create a different column for every character/house I want to analyse
  • The formula complexity increases if I want to analyse more houses/characters at the same time

My goal would be to have an Excel file where I set the search Key (like JON and SNOW) together with the related character and house, and mash this data up with the BigQuery table.

The joining key would be something like:

CONTAINS([BigQuery].[Text], [Excel].[Key]) >0  

Unfortunately Tableau allows only the = operator in join conditions during data blending, making the above syntax impossible to implement. I now have three options:

  • Give up: never, while there is still hope!
  • Move the Excel data into a BigQuery table and resolve the problem there by writing a view on top of the data: this works, but it increases the complexity on the BigQuery side, plus most Tableau users will not have write access to the related datasources.
  • Find an alternative way of joining the data: if the CONTAINS join is not possible during the data-blending phase, I may be able to use it a little bit later...

Warning: the method described below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.

Without the option of using CONTAINS, I had to create a cartesian join during the data-blending phase. With a cartesian join, every row in the BigQuery table is repeated for every row in the Excel table. I managed to create the cartesian join by simply putting a 1=1 condition in the data-blending section.

I can then apply a filter on the resulting dataset to keep only the BigQuery rows mentioning one (or more) Key from the Excel file, with the following formula:

IIF(CONTAINS(UPPER([Text]),[Key]),[Id],NULL)  

This formula keeps the tweet Id when the Excel [Key] field is contained in the UPPER([Text]) coming from Twitter. Since there are multiple Keys assigned to the same character or house (see Jon Snow with both keywords JON and SNOW), the aggregation for this column is a distinct count, which in Tableau is achieved with the COUNTD formula.
I can now simply drag in the Name from the Excel file and the # of Mentions column with the above formula, aggregated as count distinct.
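
For readers who think in SQL, the blend-plus-filter combination is logically equivalent to a cross join followed by a containment predicate. A hedged sketch in BigQuery Standard SQL terms (tweets and keys are hypothetical stand-ins for the BigQuery table and the Excel sheet; this is not what Tableau actually executes):

-- Every tweet paired with every key, then filtered down to actual matches
SELECT k.Name, COUNT(DISTINCT t.Id) AS mentions
FROM tweets t
CROSS JOIN keys k
WHERE STRPOS(UPPER(t.Text), k.Key) > 0
GROUP BY k.Name

The COUNT(DISTINCT t.Id) mirrors the COUNTD aggregation: a tweet matching two keys of the same character is still counted once.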

The beauty of this solution is that if I now need the same graph by house, I don't need to create columns with new formulas; I simply remove the Name field and replace it with Family, coming from the Excel file.

Also, if I forgot a character or family, I simply need to add the relevant rows to the Excel lookup file and reload it; nothing changes in the formulas.

Sentiment Analysis

Another goal I had in mind when analysing the GoT data was sentiment analysis of the tweets and the average sentiment associated with each character or house. Doing sentiment analysis in Tableau is not too hard, since we can reuse existing packages coming from R.

For the Tableau-R integration to work, I had to install and run the Rserve package on a workstation where R was already installed, and set up the connection in Tableau. More details on this configuration can be found in the Tableau documentation.

Once Tableau is configured to call R functions, it's time to analyse the sentiment. I used the Syuzhet package (previously downloaded) for this purpose. The sentiment calculation is done by the following formula:

SCRIPT_INT(  
"library(syuzhet); 
r<-(get_sentiment(.arg1,method = 'nrc'))",  
ATTR([Text]))  

Where

  • SCRIPT_INT: the method returns an integer score for each tweet, with positive sentiments having positive scores and negative sentiments negative scores
  • get_sentiment(.arg1,method = 'nrc'): the function used
  • ATTR([Text]): the input parameter of the function, which is the tweet text

At this point I can see the score associated with every tweet, and since that R package uses dictionaries, I limited my research to tweets in English (filtering the User.Lang.Clean column mentioned above to en).

The next step is to average the sentiment by character. It seems an easy step, but the devil is in the details! Tableau treats the output of the SCRIPT_INT call to R as an aggregated metric, thus giving no visual option to re-aggregate it! Plus, the tweet Text field must be present in the layout for the sentiment to be calculated, otherwise the metric results in NULL.

Fortunately there are functions, specifically window functions like WINDOW_AVG, allowing a post-aggregation based on a formula defining the start and end of the window. The other cool fact is that window functions work per partition of the data, and the start and end of the window can be defined using the FIRST() and LAST() functions.

We can now create an aggregated version of our Sentiment column with the following formula

WINDOW_AVG(FLOAT([Sentiment]), FIRST(), LAST())  

This column will be repeated with the same value for all rows within the same "partition", in this case the character Name.

Be aware that this solution doesn't re-aggregate the data: we'll still see the data by single tweet Text and character Name. However, the metric is calculated as a total per character, so graphs can be displayed.

I wanted to show a scatter plot based on the # of Mentions and Sentiment of each character. With the window functions defined above, it's as easy as dragging the fields into the proper place and selecting the scatter plot viz.

The default view is not very informative, since I can't really associate a character with its position in the chart until I hover over the related mark. Fortunately Tableau allows the definition of custom shapes, and I could easily assign character photos to the related names.

If negative mentions for Littlefinger and Cersei were somehow expected, the characters with the most negative sentiment are Sansa Stark, probably due to the mysterious letter found by Arya in Baelish's room, and Ellaria Sand. On the opposite side we strangely see the Night King and, more generally, the White Walkers with a very positive sentiment associated with them. Strange; this needs further investigation.

Deep Dive on White Walkers and Sansa

I can create a view per character with the associated tweets and sentiment scores, and filter it for the White Walkers. It looks like there are great expectations for this character in the next episodes (the battle is coming), which are associated with positive sentiments.

Analysing the number of tweets falling into each sentiment score category makes it clear why Sansa and the White Walkers have such different sentiment averages. Both appear as roughly normal distributions, but the centre of the White Walkers curve is around 1 (positive sentiment) while Sansa's is between -1 and 0 (negative sentiment).

This explanation however doesn't give me enough information, and I want to understand more about the most used words in tweets mentioning the White Walkers or the Night King.

Warning: the method described below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.

There is no easy way to do this directly in Tableau, even using R, since all the functions expect the output to be 1-1 with the input, like sentiment score and text.
For this purpose I created a view on top of the BigQuery table directly in Tableau using the New Custom SQL option. The SQL used is the following:

SELECT  ID, REPLACE(REPLACE(SPLIT(UPPER(TEXT),' '),'#',''),'@','')  word FROM [Dataset.rm_got]  

The

Fun with Oracle Analytics Cloud (OAC): Creating Essbase Cubes - 16-Aug-2017 05:20 - Performance Architects

Author: Andrew Tauro, Performance Architects

Just like there are multiple ways to skin a cat, there’s more than one way to create an Essbase cube in Oracle Analytics Cloud (OAC). While the best way to migrate on-premise Essbase cubes to OAC is to use the standalone “EssbaseLCMUtility” tool, to create cubes from scratch there are three approaches I have used so far: the Web UI; building an Application Workbook by hand (or from a template); or the Cube Designer. The latter two are the focus of this blog.

The Application Workbook is essentially a Microsoft Excel workbook that contains a predefined set of tabs, with the contents arranged in a predetermined manner. What that means is the workbook has a bunch of tabs like this:

Each of these tabs serves a particular purpose, but from what I can tell only the first two are a “must” when creating the application:

The “Essbase.Cube” worksheet defines the application and database names, which are required information when creating a cube. In addition, this sheet is used to define the cube dimensions:

“Cube.Settings” and “Cube.Generations” define properties of the Essbase database. The former defines some crucial cube information, such as whether it is going to be a block storage option (BSO) or aggregate storage option (ASO) cube, and if it will allow for duplicate member names.

The remaining tabs populate the dimensions (“Dim” tabs), data (“Data” tabs) and/or define calculation scripts (“Calc” tabs) for the cube. If you are familiar with building Essbase dimensions or data files and/or writing calc scripts, these will look very familiar.

For those of you who are not familiar with these items, there is the option of using the Cube Designer.

This is an add-in for Microsoft Excel that you can download via Smart View from your OAC instance.

The “Cube Designer” menu item provides tabbed screens for creating the application workbook. Walking through the tabs lets you set up the application workbook, and the “To Sheet” and “From Sheet” options facilitate reading from, and pushing to, the active workbook:

Once complete, the cube can be created via the web user interface as an import.

This has greatly reduced the complexity of creating Essbase cubes, and is just one of the ways that OAC is redefining the way we perform analytics using Essbase.

As we explore the capabilities of OAC, we will continue to share our thoughts with you, so stay tuned. While you take this journey with us, if you have any questions on this, feel free to send us a note at communications@performancearchitects.com and we will be in touch.

#datavault 2.0, Hashes, one more time - 15-Aug-2017 12:52 - Dan Linstedt
A fact based look at Hashing and Sequences and Collision Strategies in Data Vault 2.0 Standards.
#datavault 2.0 and Hash Keys, Again? - 14-Aug-2017 15:07 - Dan Linstedt
Yet another dive in to Hash Keys and Data Vault 2.0
FDMEE Custom Reports: Query Definition - 11-Aug-2017 15:26 - Wayne Van Sluys
When creating a new custom report, one first needs data, and the Query Definition is the starting point in the process. As I wrote in the first post of this series, the report engine for FDMEE reports is Oracle BI Publisher. The query engine within BI Publisher generates an XML file, and that XML file is joined up with the Layout Template and Translation Template to produce the report output.



As with most of the ERP/EPM systems on the market, the database supporting the application can be quite intimidating and the vast number of tables confusing. Fortunately, most custom reports are derivatives of existing reports, so using the existing Query Definitions as a starting point can be beneficial in learning the tables used and their purpose.

For this exercise I am interested in a report to list Locations and the Data Load Rules associated with that Location. 

In the Query Definition section there is a query that looks similar to what I need, so I will copy the SQL from the three text boxes. The three different boxes allow for the use of inline parameters from prompts in the WHERE clause.


Since my FDMEE repository is on an Oracle Database, I will paste the SQL into SQL Developer. Using the [Ctrl][F7] keys I can view the SQL in a nicely formatted layout.


Since I like to work in a graphical view of my queries, I will switch to the Query Builder view. This allows me to see the tables and joins more easily.



The following image shows some of the more frequently used tables within the FDMEE database/schema and what information they contain.


Now that I have a better understanding of the tables, I can define the report I want to produce: Data Load Rules per Location, and since the Data Load Rules are tied to a Category, I would like to know that as well.

Since my report only needs three columns, I will create a new query in SQL Developer, drag the AIF_BALANCE_RULES, TPOVPARTITION and TPOVCATEGORY tables into the Query Builder, and join them as shown below. Then I will select the columns I am interested in from the tables, and finally I will set the sort order. One tip with reports: it is faster to sort the data in the query than within the report layout.
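
For reference, the query I ended up with looks roughly like the sketch below. The table names come from the Query Builder step above; the join columns (PARTITIONKEY and CATKEY) are my assumption about the FDMEE schema, so verify them against your repository before reusing this:

SELECT p.PARTNAME  AS location,
       c.CATNAME   AS category,
       r.RULE_NAME AS data_load_rule
FROM AIF_BALANCE_RULES r
JOIN TPOVPARTITION p ON p.PARTITIONKEY = r.PARTITIONKEY
JOIN TPOVCATEGORY  c ON c.CATKEY = r.CATKEY
ORDER BY p.PARTNAME, c.CATNAME, r.RULE_NAME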



I will need to switch back to the Worksheet view in order to collect the SQL for my query.


In FDMEE I will create a new query definition named Location Data Load Rules and paste the SQL into the two text areas: the SELECT statement up to the ORDER BY goes into the Select Clause, and the ORDER BY statement into the Group By/Order By Clause.



Save the definition and then click Validate Query to make sure that FDMEE does not have any issues with the query.




In order to work with the BI Publisher MS Word Template Builder, I will need a sample data file. The Query Definition interface has a Generate XML button; clicking it creates the sample XML data file. Typically it will return 25 rows of data.



When I open the file I can see the data structure of the query. In the case of my simple query, each row of data contains the Location name, the Category name and the Data Load Rule name.



Now that I have a sample data set, I can create the layout for my report. But that is the topic for another post.

Author: Mike McLean, Performance Architects

One of the requirements during a recent Performance Architects Oracle Planning and Budgeting Cloud Service (PBCS) implementation project was to create a solution for the client's projection process. During discussions with the Budget Office, we learned that requirements included:

  • Projections occur three times a year:
    • After Q1 actuals are complete
    • After Q2 actuals are complete
    • After Q3 actuals are complete
  • Historical actuals and budget data must be used to seed the projection scenario
  • Revenues need to be seeded using one methodology, while expenses are seeded using another methodology
  • After the projection scenario is seeded, additional adjustments may need to be made by department, fund, program, etc.

The calculation to seed the “Projection” scenario contains several components:

  1. Q1 revenue and expense actuals were copied to “Q1 Projection”.
  2. Using prior year actuals, “% of YearTotal” was calculated for all revenue accounts for each month. See the example below for October (500 / 9,750 = 5.1%).
  3. The monthly “% of YearTotal” is then multiplied by the “Budget YearTotal” value to calculate each month’s revenue. See the example below for October (5.1% x 8,100 = 415) and the worked calculation after this list.
  4. Expense accounts are calculated by taking the average of “Q1 Actuals” and loading that value into all of the out months.
  5. After the business rule is launched, the results are displayed.
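
As a worked version of the October revenue example from steps 2 and 3:

\[
\text{Oct \% of YearTotal} = \frac{500}{9{,}750} \approx 5.1\%
\qquad
\text{Oct projection} = 5.1\% \times 8{,}100 \approx 415
\]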

After the “Q1 Projection” is calculated, the Budget Office wanted the option to make additional adjustments.  This was accomplished by leveraging the hierarchy in the “Version” dimension.  We created a sibling of “Projection Calculated” and named the member “Projection Adjustments.”  The parent of those two members is “Projection Total”.

Need help with your PBCS implementation project? Contact us at sales@performancearchitects.com and we can help you out!

Editor’s Note

This year, Red Pill Analytics was the Analytics Sponsor at ODTUG Kscope17. Our company motto is #challengeeverything – so we knew we wanted to do something different and unexpected while at the conference.
What we eventually landed on was creating Analytics Stations using IoT technologies to show how an old school object, like a rotary phone, can be repurposed and turned into an interactive device.
Part 1 focuses on hardware.
Part 2 focuses on software.
Part 3 focuses on Analytics & JavaScript Visualization
Kscope17 also used beacon technology to analyze conference attendee activities. Red Pill Analytics pulled that information through a REST API and told the story of Kscope17 using Oracle Data Visualization.


All of the relevant code for the project that is discussed below is open-sourced via the MIT license and can be found at: https://github.com/RedPillAnalytics/matrix-agent-phone

In the first post we built the hardware for our IoT phone, and in the second post we configured the Raspberry Pi to act as a kiosk and run an Electron framework desktop application, with the idea that Electron would be the primary codebase for our project: to load, transform, and display our data, as well as to handle hardware interaction with the IoT phone.

The first thing to know about Electron is that it is essentially a website running without browser chrome but with native OS hardware and filesystem access. That makes it a perfect candidate for rapid development with easy styling and powerful feature sets, because it allows the use of D3.js (a powerful JavaScript visualization framework, https://d3js.org/) and other JavaScript visualization frameworks to design amazing displays of data, whilst also leveraging a large base of libraries for sourcing and transforming data.

In our visualization we used D3.js and Chart.js as well as some custom HTML, CSS and JS for the circular tiles.

Our data was gathered via the AWS JavaScript SDK (https://aws.amazon.com/sdk-for-node-js/), specifically via Amazon Athena (https://aws.amazon.com/athena/), which allows you to make SQL calls against an Amazon S3 JSON-based data lake.
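
To give a flavour of what those calls look like, an Athena query is plain SQL over the files in S3. A hypothetical sketch (the table and column names are invented for illustration and are not the actual Kscope17 schema):

-- Top ten most active badges for one day of the conference (hypothetical schema)
SELECT badge_id, COUNT(*) AS beacon_pings
FROM kscope17.beacon_events
WHERE ping_date = DATE '2017-06-28'
GROUP BY badge_id
ORDER BY beacon_pings DESC
LIMIT 10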

Auto enabling APPROX_* function in the Oracle Database - 07-Aug-2017 11:46 - Brendan Tierney

With the releases of 12.1 and 12.2 of Oracle Database we have seen some new functions that perform approximate calculations. These include:

  • APPROX_COUNT_DISTINCT
  • APPROX_COUNT_DISTINCT_DETAIL
  • APPROX_COUNT_DISTINCT_AGG
  • APPROX_MEDIAN
  • APPROX_PERCENTILE
  • APPROX_PERCENTILE_DETAIL
  • APPROX_PERCENTILE_AGG

These functions can be used when an approximate answer is acceptable instead of the exact one. There are many scenarios for this, and particularly as we move into the big data world, the ability to process our data quickly is often more important than exact numbers. For example, is there really a difference between 40% of our customers being of type X versus 41%? The real answer is 'It depends!', but for a lot of analytical and advanced analytical methods this difference doesn't really matter.

There are various reports of performance improvements of anything from 6x to 50x in the response times of queries using these functions instead of the more traditional ones.
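
Used explicitly, the new functions are drop-in replacements for their exact counterparts; a minimal sketch on the same test table used in the examples below:

-- Exact distinct count: aggregates every distinct value
SELECT COUNT(DISTINCT cust_id) FROM test_inmemory;

-- Approximate distinct count: HyperLogLog based, typically within a few percent
SELECT APPROX_COUNT_DISTINCT(cust_id) FROM test_inmemory;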

Suppose you are a BI or big data analyst and you have built lots of code and queries using the more traditional functions, but you now want to use the newer functions. Does this mean you have to go and modify all the code you have written over the years? You can imagine getting approval to do this!

The simple answer to this question is 'No'. You don't have to change any code, but with some parameter changes for the database or your session you can tell the database to automatically switch from the traditional functions (count, etc.) to the newer, more optimised and significantly faster APPROX_* functions.

So how can you do this magic?

First let us see what the current settings are:


SELECT name, value
FROM v$ses_optimizer_env
WHERE sid = sys_context('USERENV','SID')
AND name like '%approx%';

Now let us run a query to test what happens using the default settings (on a table I have with 10,500 records).


set autotrace on

select count(distinct cust_id) from test_inmemory;

COUNT(DISTINCTCUST_ID)
----------------------
1500


Execution Plan
----------------------------------------------------------
Plan hash value: 2131129625

--------------------------------------------------------------------------------------
| Id  | Operation             | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |               |     1 |    13 |    70   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE       |               |     1 |    13 |            |          |
|   2 |   VIEW                | VW_DAG_0      |  1500 | 19500 |    70   (2)| 00:00:01 |
|   3 |    HASH GROUP BY      |               |  1500 |  7500 |    70   (2)| 00:00:01 |
|   4 |     TABLE ACCESS FULL | TEST_INMEMORY | 10500 | 52500 |    69   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Let us now enable the automatic usage of the APPROX_* functions.


alter session set approx_for_aggregation = TRUE;

SQL> select count(distinct cust_id) from test_inmemory;

COUNT(DISTINCTCUST_ID)
----------------------
1495


Execution Plan
----------------------------------------------------------
Plan hash value: 1029766195

---------------------------------------------------------------------------------------
| Id  | Operation              | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |               |     1 |     5 |    69   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE APPROX |               |     1 |     5 |            |          |
|   2 |   TABLE ACCESS FULL    | TEST_INMEMORY | 10500 | 52500 |    69   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

We can see above that the APPROX_* equivalent function was used, and it did slightly less work. But we only used this on a very small table.

The full list of session level settings is:

alter session set approx_for_aggregation = TRUE;
alter session set approx_for_aggregation = FALSE;

alter session set approx_for_count_distinct = TRUE;
alter session set approx_for_count_distinct = FALSE;

alter session set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC';
alter session set approx_for_percentile = PERCENTILE_DISC;
alter session set approx_for_percentile = NONE;

Or at a system wide level:


alter system set approx_for_aggregation = TRUE;
alter system set approx_for_aggregation = FALSE;

alter system set approx_for_count_distinct = TRUE;
alter system set approx_for_count_distinct = FALSE;

alter system set approx_for_percentile = 'PERCENTILE_CONT DETERMINISTIC';
alter system set approx_for_percentile = PERCENTILE_DISC;
alter system set approx_for_percentile = NONE;

And to reset back to the default settings:


alter system reset approx_for_aggregation;
alter system reset approx_for_count_distinct;
alter system reset approx_for_percentile;
#datavault Meetups! Australia / New Zealand! - 07-Aug-2017 09:25 - Dan Linstedt
Data Vault 1.0 and 2.0 Meetups in Australia and New Zealand in August 2017. I will be there in person! Don’t miss this chance to talk with me!
Six Plus One Types of Interviewers - 04-Aug-2017 13:29 - Abhinav Agarwal

Remember Chuck Noland? The character in the movie Castaway, who has to use the blade of an ice-skate to extract his abscessed tooth, without anesthesia? The scene is painful to watch, yet you can't look away.

Interviews have this habit of turning up a Chuck Noland - in the interviewee or the interviewer. You willingly agree to subject yourself to the wanton abuse by random strangers who you may have to end up working for or with. Apart from the talented few whom companies are more eager to hire than they are to get hired, most are in less enviable positions.

What about interviewers? Not all are cut from the same cloth. But there are at least six types that I think we have all met in our lives, and a seventh one.

1. The Interview As an End In Itself - Hyper-excited newbie

You know this guy. You have been this person, most likely. You have a team now. You expect your team to grow. You have to build a team. You believe that you, and you alone, know what it takes to hire the absolutely best person for the opening you have.
You sit down and explain to the harried hiring HR person what the role is, what qualifications you are looking for, why the job is special, why just ordinary programming skills in ordinary programming languages will simply not cut it, why you as the hiring manager are special, and how you will, with the new hire, change the product, the company, and eventually the whole wide world. The HR executive therefore needs to spend every waking minute of her time in the pursuance of this nobler than noble objective. You badger your hiring rep incessantly, by phone, by IM, by email, in person, several times a day, asking for better resumes if you are getting many, and more if you aren't getting enough.
You read every single resume you get, several times over. You redline the points you don't like. You redline the points you like. You make notes on the resumes. You still talk to every single candidate. You continue interviewing, never selecting, till the economic climate changes and the vacancy is no longer available.
Yes, we all know this person.

2. Knows what he is looking for and knows when he finds it

This person is a somewhat rare commodity. This person does not suffer from buyer's remorse, knows that there is no such thing as a perfect candidate, and that the best he can hope to get is a person who comes off as reasonably intelligent, hard-working, ethical, and is going to be a team player.

This person will however also suffer from blind spots. Specifically, two kinds of blind spots. The first is that he will look for and evaluate a person only on those criteria that he can assess best. The second is that he is more likely to hire candidates who are similar to other successful employees in his team, and will probably become less likely to take chances on a different type of candidate. On the other hand, this manager also knows that conceptual skills are more important to test than specific knowledge of some arcane syntax in a geeky programming language - if you are talking of the world of software, for instance.
This person is a rare commodity.

3. Hire for Empire

Like our previous type of hiring manager, this hiring manager is also very clear-headed.  But, here the interviewer is hiring to add headcount to his team. Grow the empire. More people equates to more perceived power. This person understands three things, and understands them perfectly.
First, that if he is slow in hiring, then a hiring freeze may come in, and the headcount may no longer stay open.
Second, he (or she) is also unable and equally unwilling to evaluate a candidate, so just about anyone will do.
Third, and most importantly, this manager knows that every additional person reporting to him on the organization chart elevates him in importance vis-a-vis his peers, and therefore hiring is a goal noble enough to be pursued in its own right.
It's a win-win situation for everyone - except the customers, the company, and the team.

4. I have other work to do. What am I doing here? What is he doing here?

This person has little skin in the game. He has no dog in the fight. Pick your metaphor. He is there to take the interview because of someone's absence, or because in the charade of the interview "process" that exists at many companies, there exists a need to do this interview. The interviewer agrees because it is a tax that needs to be paid. You don't want to be labeled a non-team-player. Who knows when this Scarlet Letter may come to haunt you. So our interviewer sets aside half an hour or more, preferably less, of his time, and comes back wondering where thirty minutes of his life just went. That question remains unanswered.

5. Know-it-all and desperate to show it

This person perceives himself as an overachiever. This is the sort of person who will tell you with casual nonchalance that he had predicted the rise of Google in 1999 - just so you know that he had heard of Google in 1999. This person knows he knows everything there is to know, that it is his beholden duty to make you know it too, and that it is your beholden duty to acknowledge the crushing sacerdotal burden he carries. This is the person who will begin the interview with a smirk, sustain a wry smile, transform into a frown, and end with an exaggerated sense of self-importance.
Do not get fooled.
This person is as desperate, if not more, to interview you as you are to do well on the interview. He will in all likelihood end up talking more than the interviewee.
In every group in every department of every company there exists at least one such person. The successful companies have no more than one.

6. The rubber-stamp

The boss has decided the person who needs to be hired. The charade needs to be completed. The requisite number of people have to interview the candidate so that HR can dot the "I"s and cross the "T"s. Our interviewer here has to speak with this person. With an air of deference. He will ask all the right questions, but the answers do not matter. You sign off with a heartfelt, "Great talking to you. Thanks a ton for your time. Take care, and we really look forward to working with/for you." No, don't belittle this rubber-stamp. He could be you.

These are not mutually exclusive sets. There are overlaps that exist, sometimes in combinations that would warm Stephen King's heart.

Oh, what about the seventh type of interviewer? He is the Interviewer as Saboteur.  I will talk about him in a separate post.

This post appeared on LinkedIn on July 31st, 2017.
This is an edited version of a post I wrote on April 23rd, 2013.

© 2017, Abhinav Agarwal. All rights reserved.