On 15 December 2015, the European Parliament, the Council and the Commission reached agreement on the new data protection rules, establishing a modern and harmonised data protection framework across the EU. Then, on 14 April 2016, the Regulation and the Directive were adopted by the European Parliament.
The EU GDPR comes into effect on 25 May 2018.
Are you ready?
The EU GDPR will affect every country around the world. As long as you capture and use/analyse data captured within the EU or about citizens of the EU, you have to comply with the EU GDPR.
Over the past few months we have seen an increase in the number of blog posts, articles, presentations, conferences, seminars, etc. being produced on how the EU GDPR will affect you. Basically, if your company has not been working on implementing processes and procedures and ensuring it complies with the regulations, then you are a bit behind and a lot of work is ahead of you.
As I said, there has been a lot published and talked about regarding the EU GDPR. Most of this covers the core aspects of the regulations on protecting and securing your data. But very little, if anything, is being discussed regarding the use of machine learning and customer profiling.
Do you use machine learning to profile, analyse and predict customers? Then the EU GDPR affects you.
Article 22 of the EU GDPR outlines some basic provisions regarding machine learning, and so, additionally, do Articles 13, 14, 18, 19 and 21.
Over the coming weeks I will publish the following blog posts. Each of these addresses a separate issue within the EU GDPR relating to the use of machine learning.
  • Part 2 - Do I have permissions to use the data for data profiling?
  • Part 3 - Ensuring there is no Discrimination in the Data and machine learning models.
  • Part 4 - (Article 22: Profiling) Why me? and how Oracle 12c saves the day
Exploring the Rittman Mead Insights Lab - 27-Jun-2017 09:58 - Rittman Mead Consulting

What is our Insights Lab?

The Insights Lab offers on-demand access to an experienced data science team, using a mature methodology to deliver one-off analyses and production-ready predictive models.

Our Data Science team includes physicists, mathematicians, industry veterans and data engineers ready to help you take analytics to the next level while providing expert guidance in the process.

Why use it?

Data is cheaper to collect and easier to store than ever before. But collecting the data is not synonymous with getting value from it. Businesses need to do more with the same budget and are starting to look into machine learning to achieve this.

These processes can take off some of the workload, freeing up people's time to work on more demanding tasks. However, many businesses don't know how to get started down this route, or even if they have the data necessary for a predictive model.

R

Our data science team primarily works using the R programming language. R is an open-source language supported by a large community.

The functionality of R is extended by many community written packages which implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, statistical tests, time-series analysis, classification, clustering as well as packages for data access, cleaning, tidying, analysing and building reports.

All of these packages can be found on the Comprehensive R Archive Network (CRAN), making it easy to get access to new techniques or functionalities without needing to develop them yourself (all the community written packages work together).
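
As a small, hedged illustration of how little is involved in picking up a community package (the package chosen here, forecast, is just an example, not a statement of what we use on any given project):

# Install a community-written package from CRAN (one-off) and load it.
install.packages("forecast")
library(forecast)

# Use functionality contributed by the community: fit a time-series model
# to a built-in dataset and produce a 12-month forecast.
fit <- auto.arima(AirPassengers)
plot(forecast(fit, h = 12))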

R is not only free and extendable, it also works well with other technologies, making it an ideal choice for businesses who want to start looking into advanced analytics. Python is an obvious alternative, and several of our data scientists prefer it. We're happy to use whatever our client's teams are most familiar with.

Experienced programmers will find R syntax easy enough to pick up and will soon be able to implement some form of machine learning. However, for a detailed introduction to R and a closer look at implementing some of the concepts mentioned below we do offer a training course in R.

Our Methodology

Define

Define a Question

Analytics, for all intents and purposes, is a scientific discipline and as such requires a hypothesis to test. That means having a specific question to answer using the data.

Starting this process without a question can lead to biases in the produced result. This is called data dredging - testing huge numbers of hypotheses about a single data set until the desired outcome is found. Many other forms of bias can be introduced accidentally; the most commonly occurring will be outlined in a future blog post.
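
To make the data-dredging risk concrete, here is a small illustrative simulation (not from the original post): when enough hypotheses are tested against pure noise, some will look significant purely by chance.

# Data dredging illustrated: test 100 hypotheses on data that is pure noise.
set.seed(42)
p_values <- replicate(100, {
  x <- rnorm(30)            # random "predictor"
  y <- rnorm(30)            # random "outcome", unrelated to x
  cor.test(x, y)$p.value    # test for a correlation that does not exist
})

# Roughly 5 of the 100 tests come out "significant" at the 5% level
# by chance alone, so a desired outcome can almost always be "found".
sum(p_values < 0.05)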

Once a question is defined, it is also important to understand which aspects of the question you are most interested in. Associated with this is the level of uncertainty or error that can be tolerated if the result is to be applied in a business context.

Questions can be grouped into a number of types. Some examples will be outlined in a future blog post.

Define a dataset

The data you expect to be relevant to your question needs to be collated. Supplementary data may also be needed, added from different databases or gathered by web scraping.

This data set then needs to be cleaned and tidied. This involves merging and reshaping the data as well as possibly summarising some variables. For example, removing spaces and non-printing characters from text and converting data types.

The data may be in a raw format, there may be errors in the data collection, or corrupt or missing values that need to be managed. These records can either be removed completely or replaced with reasonable default values, determined by which makes the most sense in this specific situation. If records are removed you need to ensure that no selection biases are being introduced.

All the data should be relevant to the question at hand, anything that isn't can be removed. There may also be external drivers for altering the data, such as privacy issues that require data to be anonymised.

Natural language processing could be implemented for text fields. This takes bodies of text in human readable format such as emails, documents and web page content and processes it into a form that is easier to analyse.

Any changes to the dataset need to be recorded and justified.
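
As a hedged sketch of the kind of cleaning and tidying steps described above (the column names, values and rules are invented purely for illustration):

library(dplyr)
library(stringr)

# Invented raw data: messy text, inconsistent categories and missing values.
raw <- data.frame(
  customer = c(" Alice\t", "Bob", "Carol ", NA),
  region   = c("north", "North", "NORTH", "south"),
  spend    = c("100", "250", NA, "80"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  # remove spaces and non-printing characters from text fields
  mutate(customer = str_squish(str_replace_all(customer, "[[:cntrl:]]", ""))) %>%
  # standardise inconsistent categories
  mutate(region = tolower(region)) %>%
  # convert data types
  mutate(spend = as.numeric(spend)) %>%
  # replace missing values with a reasonable default (here the median)
  mutate(spend = ifelse(is.na(spend), median(spend, na.rm = TRUE), spend)) %>%
  # remove records with no identifier - and check afterwards that this
  # removal does not introduce a selection bias
  filter(!is.na(customer))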

Model

Exploratory Analysis

Exploratory data analysis involves summarising the data, investigating the structure, detecting outliers / anomalies as well as identifying patterns and trends. It can be considered as an early part of the model production process or as a preparatory step immediately prior. Exploratory analysis is driven by the data scientist, enabling them to fully understand the data set and make educated decisions; for example the best statistical methods to employ when developing a model.

The relationships between different variables can be understood and correlations found. As the data is explored, different hypotheses could be found that may define future projects.

Visualisations are a fundamental aspect of exploring the relationships in large datasets, allowing the identification of structure in the underlying dataset.

This is also a good time to look at the distribution of your dataset with respect to what you want to predict. This often provides an indication of the types of models or sampling techniques that will work well and lead to accurate predictions.

Variables with very few instances (or those with small variance) may not be beneficial, and in some cases could even be detrimental, increasing computation time and noise. Worse still, if these instances represent an outlier, significant (and unwarranted) value may be placed on these leading to bias and skewed results.
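
A minimal sketch of what this exploratory step can look like in R, run here against a built-in dataset purely for illustration:

# Exploratory analysis on a built-in dataset (mtcars), purely illustrative.
data(mtcars)

summary(mtcars)               # summarise each variable
cor(mtcars$wt, mtcars$mpg)    # quantify a suspected relationship
boxplot(mtcars$mpg)           # quick check for outliers

# Visualise the relationship and the distribution of the variable to predict.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "Miles per gallon",
     main = "Exploring a candidate predictor")
hist(mtcars$mpg, main = "Distribution of the variable to predict")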

Statistical Modelling/Prediction

The data set is split into two sub-groups, "Training" and "Test". The training set is used only in developing or "training" a model, ensuring that the data it is tested on (the test set) is unseen. This means the model is tested in a more realistic context and helps to determine whether the model has overfitted to the training set, i.e. whether it is fitting random noise in addition to any meaningful features.

Taking what was learned from the exploratory analysis phase, an initial model can be developed based on an appropriate application of statistical methods and modelling tools. There are many different types of model that can be applied to the data; the best tends to depend on the complexity of your data and any relationships that were found in the exploratory analysis phase. During training, the models are evaluated in accordance with an appropriate metric, the improvement of which is the "goal" of the development process. The predictions produced from the trained models when run on the test set determine the accuracy of the model (i.e. how closely its predictions align with the unseen real data).

A particular type of modelling method, "machine learning", can streamline and improve upon this somewhat laborious process by defining models in such a way that they are able to self-optimise, "learning" from past iterations to develop a superior version. Broadly, there are two types: supervised and unsupervised. A supervised machine learning model is given some direction from the data scientist as to the types of methods that it should use and what it is expecting. Unsupervised machine learning, on the other hand, as the name suggests, involves giving the model less information to start with and letting it decide for itself what to value and how to approach the problem. This can help to remove bias and reduce the number of assumptions made, but will be more computationally intensive as the model has a broader scope to investigate. Usually supervised machine learning is employed where the problem and data set are reasonably well understood, and unsupervised machine learning where this is not the case.

Complex predictive modelling algorithms perform feature importance and selection internally while constructing models. These models can also report on the variable importance determined during the model preparation process.
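
A simplified sketch of the train/test workflow described above; the built-in dataset, the 70/30 split and the logistic regression are illustrative choices, not a description of how any particular Insights Lab model is built:

# Illustrative train/test workflow.
set.seed(123)
data(mtcars)
mtcars$am <- factor(mtcars$am)    # target variable: automatic vs manual

# Split into training and test sets (70/30) so the test data stays unseen.
train_idx <- sample(seq_len(nrow(mtcars)), size = 0.7 * nrow(mtcars))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Train a simple supervised model on the training set only.
model <- glm(am ~ wt + hp, data = train, family = binomial)

# Evaluate on the unseen test set with an appropriate metric (here accuracy).
probs <- predict(model, newdata = test, type = "response")
preds <- factor(ifelse(probs > 0.5, "1", "0"), levels = levels(test$am))
mean(preds == test$am)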

Peer Review

This is an important part of any scientific process, and effectively utilises our broad expertise in modelling at Rittman Mead. This enables us to be sure no biases were introduced that could lead to a misleading prediction, and that the accuracy of the models is what could be expected if the model were run on new unseen data. Additional expert views can also lead to alternative potential avenues of investigation being identified as part of an expanded or subsequent study.

Deploy

Report

For a scientific investigation to be credible the results must be reproducible. The reports we produce are written in R markdown and contain all the code required to reproduce the results presented. This also means it can be re-run with new data as long as it is of the same format. A clear and concise description of the investigation from start to finish will be provided to ensure that justification and context is given for all decisions and actions.
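
For illustration only, re-running such a report against new data of the same format can be as simple as rendering a parameterised R Markdown file (the file and parameter names below are hypothetical):

# Re-run a parameterised R Markdown report with a new input file.
# "report.Rmd" must declare a 'data_file' parameter in its YAML header.
library(rmarkdown)
render("report.Rmd", params = list(data_file = "new_orders.csv"))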

Delivery

If the result is of the required accuracy we will deploy a model API enabling customers to start utilising it immediately.
There is always a risk, however, that the data does not contain the required variables to create predictions with sufficient confidence for use. In these cases, and after the exploratory analysis phase, there may be other questions that would be beneficial to investigate. This is also a useful result, enabling us to suggest additional data to collect that may allow a more accurate result should the process be repeated later.
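
The post doesn't name a specific deployment technology; as one hedged example of what "deploying a model API" can look like in R, the plumber package can expose a saved model over HTTP (file names, endpoint and features below are hypothetical):

# predict_api.R - a minimal plumber API around a previously trained model.
library(plumber)

model <- readRDS("trained_model.rds")   # hypothetical saved model object

#* Return a prediction for the supplied feature values
#* @param wt the weight feature
#* @param hp the horsepower feature
#* @get /predict
function(wt, hp) {
  newdata <- data.frame(wt = as.numeric(wt), hp = as.numeric(hp))
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

# Launch the API with: plumber::plumb("predict_api.R")$run(port = 8000)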

Support

Following delivery we are able to provide a number of support services to ensure that maximum value is extracted from the model on an on-going basis. These include:
- Monitoring performance and accuracy against the observed, actual values over a period of time. Should discrepancies between these values arise, they can be used to identify the need for alterations to the model (see the sketch after this list).
- Exploring specific exceptions to the model. There may be cases in which the model consistently performs poorly. Instances like these may not have existed in the training set and the model could be re-trained accordingly. If they were in the training set these could be weighted differently to ensure a better accuracy, or could be represented by a separate model.
- Updates to the model to reflect discrepancies identified through monitoring, changes of circumstance, or the availability of new data.
- Many problems are time dependent and so model performance is expected to degrade, requiring retraining on more up to date data.
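
As a sketch of the monitoring service mentioned in the first bullet above (the column names and figures are invented for illustration):

library(dplyr)

# Hypothetical log of model predictions joined to the observed outcomes.
scored <- data.frame(
  month     = rep(c("2017-01", "2017-02", "2017-03"), each = 100),
  predicted = runif(300, 300, 450),
  actual    = runif(300, 300, 450)
)

# Track accuracy (here mean absolute error) per month; a sustained increase
# suggests the model needs alteration or retraining on more recent data.
scored %>%
  group_by(month) %>%
  summarise(mae = mean(abs(predicted - actual)))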

Summary

In conclusion, our Insights Lab has a clearly defined and proven process for data science projects that can be adapted to fit a range of problems.

Contact us to learn how Insights Lab can help your organization get the most from its data, and schedule your consultation today.
Contact us at info@rittmanmead.com

Data Redaction New Features in Oracle 12c Release 2 - 27-Jun-2017 01:08 - Gavin Soorma
Using Tableau to Show Variance and Uncertainty - 26-Jun-2017 10:00 - Rittman Mead Consulting

Recently, I watched an amazing keynote presentation from Amanda Cox at OpenVis. Toward the beginning of the presentation, Amanda explained that people tend to feel and interpret things differently. She went on to say that, “There’s this gap between what you say or what you think you’re saying, and what people hear.”

While I found her entire presentation extremely interesting, that statement in particular really made me think. When I view a visualization or report, am I truly understanding what the results are telling me? Personally, when I’m presented a chart or graph I tend to take what I’m seeing as absolute fact, but often there’s a bit of nuance there. When we have a fair amount of variance or uncertainty in our data, what are some effective ways to communicate that to our intended audience?

In this blog I'll demonstrate some examples of how to show uncertainty and variance in Tableau. All of the following visualizations are made using Tableau Public so while I won’t go into all the nitty-gritty detail here, follow this link to download the workbook and reverse engineer the visualizations yourself if you'd like.

First things first, I need some data to explore. If you've ever taken our training you might recall the Gourmet Coffee & Bakery Company (GCBC) data that we use for our courses. Since I’m more interested in demonstrating what we can do with the visualizations and less interested in the actual data itself, this sample dataset will be more than suitable for my needs. I'll begin by pulling the relevant data into Tableau using Unify.

If you haven't already heard about Unify, it allows Tableau to seamlessly connect to OBIEE so that you can take advantage of the subject areas created there. Now that I have some data, let’s look at our average order history by month. To keep things simple, I’ve filtered so that we’re only viewing data for Times Square.

Average Orders for 2015-2016

On this simple visualization we can already draw some insights. We can see that the data is cyclical with a peak early in the year around February and another in August. We can also visually see the minimum number of orders in a month appears to be about 360 orders while the maximum is just under 400 orders.

When someone asks to see “average orders by month”, this is generally what people expect to see and depending upon the intended audience a chart like this might be completely acceptable. However, when we display aggregated data we no longer have any visibility into the variance of the underlying data.

Daily Orders

If we display the orders at the day level instead of month we can still see the cyclical nature of the data but we also can see additional detail and you’ll notice there’s quite a bit more “noise” to the data. We had a particularly poor day in mid-May of 2014 with under 350 orders. We’ve also had a considerable number of good days during the summer months when we cleared 415 orders.

Moving Average

Depending upon your audience and the dataset, some of these charts might include too much information and be too busy. If the viewer can’t make sense of what you’re putting in front of them there’s no way they’ll be able to discern any meaningful insights from the underlying dataset. Visualizations must be easy to read. One way to provide information about the volatility of the data but with less detail would be to use confidence bands, similar to how one might view stock data. In this example I’ve calculated and displayed a moving average, as well as upper and lower confidence bands using the 3rd standard deviation. Confidence bands show how much uncertainty there is in your data. When the bands are close you can be more confident in your results and expectations.
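
Tableau handles this with table calculations, but independent of the tool the arithmetic is straightforward; a hedged sketch in R, with an invented series and an illustrative 30-day window:

library(zoo)   # rolling-window helpers

# Invented daily order counts, purely for illustration.
set.seed(1)
orders <- 380 + round(10 * sin(seq_len(365) / 58) + rnorm(365, sd = 8))

window  <- 30
mov_avg <- rollmean(orders, k = window, fill = NA, align = "right")
mov_sd  <- rollapply(orders, width = window, FUN = sd, fill = NA, align = "right")

# Upper and lower bands at the 3rd standard deviation around the moving average.
upper <- mov_avg + 3 * mov_sd
lower <- mov_avg - 3 * mov_sd

plot(orders, type = "l", col = "grey", ylab = "Daily orders")
lines(mov_avg, lwd = 2)
lines(upper, lty = 2)
lines(lower, lty = 2)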

Orders by Month Orders by Day

An additional option is the use of a scatterplot. The awesome thing about a scatterplot is that not only does it allow you to see the variance of your data, but if you play with the size of your shapes and tweak the transparency just right, you also get a sense of the density of your dataset, because you can visualize where those points lie in relation to each other.

Boxplot

The final example I have for you is to show the distribution of your data using a boxplot. If you're not familiar with boxplots, the line in the middle of the box is the median. The bottom and top of the box, known as the bottom and top hinge, give you the 25th and 75th percentiles respectively, and the whiskers outside the box show the minimum and maximum values excluding any outliers. Outliers are shown as dots.
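
For reference, the same five numbers a boxplot draws can be computed directly; a quick sketch on invented data:

# The components of a boxplot, computed on illustrative data.
set.seed(7)
daily_orders <- c(rnorm(200, mean = 380, sd = 15), 330, 440)  # plus two extremes

stats <- boxplot.stats(daily_orders)
stats$stats   # lower whisker, lower hinge (25th pct), median, upper hinge (75th pct), upper whisker
stats$out     # points flagged as outliers (drawn as dots)

boxplot(daily_orders, horizontal = TRUE, main = "Distribution of daily orders")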

I want to take a brief moment to touch on a fairly controversial subject of whether or not to include a zero value in your axes. When you have a non-zero baseline it distorts your data and differences are exaggerated. This can be misleading and might lead your audience into drawing inaccurate conclusions.

For example, a quick Google search revealed this image on Accuweather showing the count of tornados in the U.S. for 2013-2016. At first glance it appears as though there were almost 3 times more tornados in 2015 than in 2013 and 2014, but that would be incorrect.

On the flipside, there are cases where slight fluctuations in the data are extremely important but are too small to be noticed when the axis extends to zero. Philip Bump did an excellent job demonstrating this in his "Why this National Review global temperature graph is so misleading" article in The Washington Post.

Philip begins his article with this chart tweeted by the National Review which appears to prove that global temperatures haven’t changed in the last 100 years. As he goes on to explain, this chart is misleading because of the scale used. The y-axis stretches from -10 to 110 degrees making it impossible to see a 2 degree increase over the last 50 years or so.

The general rule of thumb is that you should always start from zero. In fact, when you create a visualization in Tableau, it includes a zero by default. Usually I agree with this rule, and the vast majority of the time I do include a zero, but I don't believe there can be a hard and fast rule as there will always be an exception. Bar charts are used to communicate absolute values, so the size of each bar needs to be proportional to the overall value. I agree that bar charts should extend to zero because if they don't we distort what the data is telling us. With line charts and scatterplots we tend to look at the positioning of the data points relative to each other. Since we're not as interested in the value of the data, I don't feel the decision to include a zero or not is as cut and dried.

The issue boils down to what it is you’re trying to communicate with your chart. In this particular case, I’m trying to highlight the uncertainty so the chart needs to draw attention to the range of that uncertainty. For this reason, I have not extended the axes in the above examples to zero. You are free to disagree with me on this, but as long as you’re not intentionally misleading your audience I feel that in instances such as these this rule can be relaxed.

These are only a few examples of the many ways to show uncertainty and variance within your data. Displaying the volatility of the data and giving viewers a level of confidence in the results is immensely powerful. Remember that while we can come up with the most amazing visualizations, if the results are misleading or misinterpreted and users draw inaccurate conclusions, what’s the point?

Graph databases: who hasn’t heard these words by now?
The newest shiny tool for data scientists, the latest addition to the analytical toolbox after big data solutions a few years ago.

So far it seems to still be a niche solution, too new to be widely adopted. Which also means it’s the perfect time to get on this train and not miss it.
I jumped on that train “seriously” in the last few weeks, and here, and in some future posts, you will find my findings. This is just a quick intro to what will make my work with graphs possible: the graph “brain”.

Why graph databases?

You can, for sure, do the same kind of analysis a graph database allows you to perform using your relational database. But you will have to write tons of code and it will probably perform quite badly.

Relational databases are excellent at their job, and nobody is saying relational is dead. But you can’t expect them to remain excellent for analysis or activities where other technologies or models fit better. It’s the same story with cubes: sure, you can store data in your database and perform analysis on it, but for some activities you will never outperform, or even get close to, a good Essbase cube.

The same goes for graph database engines: they are optimised for performing analysis on graphs and managing data modelled as graphs, composed of nodes (aka vertices) and edges, each of which can have properties and labels.

A sample property graph

There are multiple engines available on the market to store and manipulate graphs. I admit I only knew Neo4j, and I didn’t search any further as Neo4j can easily be used to get started and get your hands dirty playing with graphs.

docker run -d -p 7474:7474 -p 7687:7687 -P --name neo4j neo4j:latest

Open a browser and connect to http://<docker host>:7474 and there you are!
Follow instructions on screen and enjoy graphs. Guided tutorials and a web visualization of your graph make it easy to get started with.

Oracle graph solutions

Oracle joined the party with their property graph solution: PGX, the acronym of Parallel Graph AnalytiX.
Actually, PGX can be a standalone graph tool, but in the current situation it is more the brain behind the graph implementations Oracle uses in various tools. It’s the “tool” performing operations on graphs in-memory, but it doesn’t provide storage directly. Storage (read and write) is provided by external solutions.
The description of what it does sounds great, and the documentation is nicely written with lots of examples.

What is PGX?
PGX is a toolkit for graph analysis – both running algorithms such as PageRank against graphs, and performing SQL-like pattern-matching against graphs, using the results of algorithmic analysis. Algorithms are parallelized for extreme performance. The PGX toolkit includes both a single-node in-memory engine, and a distributed engine for extremely large graphs. Graphs can be loaded from a variety of sources including flat files, SQL and NoSQL databases and Apache Spark and Hadoop; incremental updates are supported.
(http://www.oracle.com/technetwork/oracle-labs/parallel-graph-analytix/overview/index.html)

PGX overview

As you can see from the generic structure of PGX, it’s a client-server kind of solution where the client interacts with the PGX engine (the server), which will then, if required, interact with different kinds of storage to load or store graphs.
You can also build an ephemeral graph on the fly from the client, use it for the required analysis in memory, and never store it anywhere.

PGX clients

Multiple clients already exist: the PGX shell, Java, JavaScript, Python and the Zeppelin notebook.
This list will grow in the future as the exposed API and REST interface can easily be used by other languages or tools.

The PGX shell is probably the most complete and native one, followed by the Java API. The Python module (I still haven’t found a link to download and install it directly, but you can find it in the PGX embedded in Database 12cR2) seems to use the Java API, so all the same functionality can be exposed, even if its current state is perhaps more limited.
The PGX Zeppelin interpreter is still a work in progress: if PGX isn’t local and you are connecting to a remote instance, some functions no longer work (there isn’t full support for the shell functions when connecting remotely via Zeppelin).

PGX data sources

The list of supported sources so far seems to be: flat files (filesystem), SQL (database), NoSQL, Spark and HDFS.
Here the issues start, as apparently not all of these sources are available in all the PGX distributions.

So far I have used flat files and SQL loading from the database: overall both worked fine. I’m probably not going to look into NoSQL, Spark and HDFS support as I don’t have these tools on my Docker images.

Multiple versions and distributions

While in theory and on paper PGX is all nice and cool, there are some issues…
So far there seem to be multiple different versions, both in terms of version number and functionality.

In Oracle Database 12c Release 2 (12.2.0.1.0) you have PGX version 2.1.0 with support to source (load) graphs from the database or filesystem (based on the list of JAR files I saw).
In Oracle Big Data Lite Virtual Machine 4.8 you will find PGX version 2.4.0 with support to source graphs from the filesystem, NoSQL and HDFS.
If you download PGX from the OTN website you get version 2.4.1 with support to source graphs from filesystem apparently only.

If you apply “Patch 25640325: MISSING PGQL FUNCTION IN ORACLE DATABASE RELEASE 12.2.0.1” to your 12c Release 2 database you will end up with PGX 2.4.0 and the same sources for graphs: database and filesystem. In addition to a newer version, the patch brings support for PGQL.

To make it short: 3 versions, 3 different sets of data sources = not easy to really test all the features of PGX. Double-check the documentation for the notes at the top of pages listing which release supports which functionality (mainly related to graph loading).

The version provided with Big Data Lite virtual machine must be the one named “Oracle Big Data Spatial and Graph” package, while the one delivered with database 12cR2 must be the one named “Oracle Spatial and Graph” package.

Apparently, reading posts on the OTN forum, I’m not the only one dreaming of a PGX 2.5.0 that merges the current versions and provides support for all the sources, making it easier to test and compare options.
I can understand and agree that licensing will be different and can justify support for different or limited sources, but the software should be developed as a single solution to guarantee compatibility and flexibility.

How to use it?

PGX can be used in multiple ways as you can see from the following picture I took from the doc.

Usage of PGX

The simplest and quickest way is to use the PGX shell you get with PGX (./bin/pgx). If you take the OTN version, all it takes is to unzip the file, meet the requirements (mainly Java, in the end) and you are ready to start with the shell.

How to exit the PGX shell?

It took me some time to find out how to exit the PGX shell; all the classical attempts, “exit”, “quit”, “stop”, “please let me out”, didn’t work…
I finally found that

System.exit(0)
  works fine for that.

In my case I decided to use Apache Zeppelin, as there is a PGX interpreter provided by Oracle, and Zeppelin also supports Python (via the pyopg module), SQL (via a JDBC interpreter) and a few other things. This makes Zeppelin a good way to document and test commands, because you can have nice documentation in markdown right next to code you execute on the fly.

An extra argument justifying the usage of Zeppelin is the Oracle Lab Data Studio application, which will arrive at some point in the coming months (as always with Oracle: not guaranteed) and which will support importing Zeppelin notebooks. So, nothing will be lost…
It must be noted that, out of the box, there is no visualization plugin available in Zeppelin so far. Oracle Lab Data Studio will provide that out of the box.

You can of course get something similar by using Jupyter, another notebook application (Python will work fine, but I haven’t looked at porting the PGX interpreter from Zeppelin to Jupyter for now).

So far, I have a setup with Zeppelin, Oracle Database 12cR2 and a PGX server running in three Docker containers and communicating with each other. It’s the closest I have got to what a standard “enterprise” setup would look like.
I’m still finalizing the setup and will write about it in a future post, maybe once the OTN PGX release also supports sourcing from the database. The Docker images will also be provided, as they are extensions of existing images but pre-configured to work together (SSL certificates etc.).

Where to start? Documentation?

The documentation is really nicely done. Multiple tutorials making it simple to follow. Some examples and use cases. All the details about the API and how things work.

It’s definitely the best place to get started with PGX: https://docs.oracle.com/cd/E56133_01/latest/index.html

Stay tuned for more content about PGX and properties graphs in general, I’m going to work on this topic for quite some time…

The post PGX – Parallel Graph AnalytiX : the Oracle graph analysis brain appeared first on Gianni's world: things crossing my mind.

Kscope17 Conference Analytics Part 2 - 25-Jun-2017 16:49 - Red Pill Analytics

Editors Note:

This year, Red Pill Analytics is the Analytics Sponsor at ODTUG Kscope17. Our company motto is #challengeeverything – so we knew we wanted to do something different and unexpected while at the conference.
What we eventually landed on was creating Analytics Stations using IoT technologies to show how an old school object, like a rotary phone, can be repurposed and turned into an interactive device.
Part 1 focuses on hardware.
Part 2 focuses on software.
Kscope17 also used beacon technology to analyze conference attendee activities. Red Pill Analytics pulled that information through a REST API and told the story of Kscope17 using Oracle Data Visualization. This will be explained in Part 3, coming soon!

 


 

Because the project uses a Raspberry Pi Model 3B running Raspbian (a distribution of Linux designed for the Raspberry Pi), all of our software runs on Linux and on an ARM processor. The project primarily uses a framework called Electron (https://electron.atom.io/) for our logic and display code and our hardware interaction code.

The first step to setting up the Raspberry Pi is to burn Raspbian (found on Raspberry Pi official website, https://www.raspberrypi.org/downloads/) to a micro SD card which will be inserted into the Pi and act as the storage device and operating medium for the embedded device. I chose to burn the latest Raspbian image using a tool called Etcher (https://etcher.io/). The next step was to insert the micro SD card into the Pi, connect a screen via HDMI cable, connect a USB keyboard and mouse for initial setup, and connect a sufficiently specced power source.

Once the Pi booted up, the first things I did were to connect to my local Wi-Fi network, disable underscan so that the monitor output consumed all of the available screen space, enable SSH, set the locale information (keyboard/timezone) to United States, and change the system password for security.

Locale settings

Enabling SSH for SFTP

Disable underscan

Next was to transfer the Electron application files from my development machine to the Pi via FileZilla and SFTP (note that in order to do this, SSH must first be enabled via raspi-config), install the node/electron dependencies and test the application. I installed node with curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash - followed by sudo apt install nodejs.

After verifying the application worked as expected, I set up the node/electron app to run at Pi startup so it could behave like an appliance. To do this I added a single line, @sh /home/pi/Desktop/phone.sh, to the file /home/pi/.config/lxsession/LXDE-pi/autostart and then created the following script file on the desktop of my Pi.

Kscope17 Conference Analytics Part 1 - 23-Jun-2017 13:01 - Red Pill Analytics

Editors Note:

This year, Red Pill Analytics is the Analytics Sponsor at ODTUG Kscope17. Our company motto is #challengeeverything – so we knew we wanted to do something different and unexpected while at the conference.
What we eventually landed on was creating Analytics Stations using IoT technologies to show how an old school object, like a rotary phone, can be repurposed and turned into an interactive device.
Part 1 focuses on hardware.
Part 2 focuses on software.
Kscope17 also used beacon technology to analyze conference attendee activities. Red Pill Analytics pulled that information through a REST API and told the story of Kscope17 using Oracle Data Visualization. This will be explained in Part 3, coming soon!


Step 1 was to take apart the Cortelco touch tone phone (available on Amazon for ~$20 at time of writing) to see what the internals looked like and figure out how we could tap into the number pad and telephone hook switch. Taking the phone apart was as simple as undoing the three Phillips head screws holding the plastic molded top in place, revealing several phone jacks/wires, a bell, the number pad, a hook switch, and a single small circuit board acting as the phone’s intelligence. The next step was to simply unscrew the circuit boards and remove unneeded components such as the bell, main circuit board, and unused telephone jacks… leaving the desired components, namely: the number pad, the side phone jack (that runs to the headset), and the vertical mount circuit board that contains the hook switch. See the images below:
Telephone internals

After removing the phone internals

Unscrewing phone base from plastic molded top


Next is preparing the Raspberry Pi 3, which will be acting as the brain behind this embedded device (there is a software component to this, and if that interests you, please visit Part 2 where we discuss setting up Raspbian Linux, interfacing with hardware from high-level software, and configuring the high-level display and data code). For the sake of this hardware write-up we will assume the Raspberry Pi software is already configured and will only focus on wiring and mounting the components. Chief among these mounting considerations is keeping the Raspberry Pi off the metal base of the phone, as the exposed solder on the bottom of the Pi would otherwise short out the device.


3D printing Pi enclosure

The solution to this was to simply 3D print a plastic case for the Pi so that it could still be mounted firmly inside the phone easily and be fully accessible for wiring etc. See pictures below of the print in progress and how it looks once finished and mounted inside of the phone.

 

 Enclosure from side

Pi sitting in enclosure

Enclosure placed inside phone base

With the Pi securely mounted in place in its plastic enclosure, the next step was to begin wiring the number pad, headset audio jack, and hook switch to dupont wires so they could easily be connected to the Pi’s 40-pin GPIO header. I simply stripped the wires, twisted them together, soldered them as needed, and applied a liquid electrical tape to each joint to prevent electrical shorts between wires. Lastly, I plugged the female end of the dupont cables into the respective male header pins on the Pi. See below for pictures of the soldering process.

After twisting and soldering wires together

After applying white liquid electrical tape

 Dupont wires connected to Pi

With all of the wiring in place, and components mounted the last thing to do was to attach a micro USB cable and an HDMI cable and put the black phone top back on and plug in the handset phone line. You can see a completed picture below!

ODTUG Announces the 2nd Annual GeekAThon - 23-Jun-2017 11:08 - ODTUG
SAVE THE DATE: ODTUG announces its 2nd annual GeekAThon! Get your *GEEK ON* and dazzle the community with your brilliant skills!
OBIEE 12c Catalog Validation: Command Line - 23-Jun-2017 09:49 - Rittman Mead Consulting
OBIEE 12c Catalog Validation: Command Line

I wrote a blog post a while ago describing the catalog validation: an automated process performing a consistency check of the catalog and reporting or deleting the inconsistent artifacts.
In the post I stated that catalog validation should be implemented regularly as part of the cleanup routines, and that it provides precious additional information during the pre- and post-upgrade phases.

However some time later I noted Oracle's support Doc ID 2199938.1 stating that the startup procedure I detailed in the previous blog post is not supported in any OBI release since 12.2.1.1.0. You can imagine my reaction...

OBIEE 12c Catalog Validation: Command Line

The question then became: how do we run the catalog validation if the known procedure is unsupported? The answer is Catalog Manager and the related command line call runcat.sh which, in server installations (like the SampleApp v607p), can be found under $DOMAIN_HOME/bitools/bin.

How Does it Work?

As with most command line tools, when you don't have a clue how it works, the best approach is to run it with the -help option, which provides the list of parameters to pass.

Catalog Manager understands commands in the following areas:

Development To Production  
createFolder        Creates folder in the catalog  
delete            Deletes the given path from the catalog  
maintenanceMode        Puts the catalog into or out of Maintenance Mode (aka ReadOnly)  
...

Multi-Tenancy  
provisionTenant        Provisions tenants into a web catalog  
...

Patch Management  
tag            Tags all XML documents in a catalog with a unique id and common version string  
diff            Compares two catalogs  
inject            Injects a single item to a diff file  
...

Subject Area Management  
clearQueryCache        Clears the query cache  

Unfortunately none of the options in the list seems to be relevant for catalog validation, but with a close look at the recently updated Doc ID 2199938.1 I could find the parameter to pass: validate.
The full command then looks like

./runcat.sh -cmd validate

In my previous blog I mentioned different types of validation. What type of validation is the default command going to implement? How can I change the behaviour? Again the -help option provides the list of instructions.

# Command : -cmd validate -help 

validate        Validates the catalog

Description  
Validates the catalog

For more information, please see the Oracle Business Intelligence Suite  
Enterprise Edition's Presentation Services Administration Guide.

Syntax  
runcat.cmd/runcat.sh -cmd validate  
    [ -items (None | Report | Clean) [ -links (None | Report | Clean) ] [-folder <path{:path}>] [-folderFromFile <path of inclusion list file>] ] 
    [ -accounts (None | Report | Clean) [ -homes (None | Report | Clean) ] ] 
    -offline <path of catalog> 

Basic Arguments  
None

Optional Arguments  
-items (None | Report | Clean)        Default is 'Report' 
-links (None | Report | Clean)        Default is 'Clean'. Also, '-items' cannot be 'None'. 
-accounts (None | Report | Clean)        Default is 'Clean' 
-homes (None | Report | Clean)        Default is 'Report'. Also, '-accounts' cannot be 'None'. 
-folder <path{:path}>            Which folders in the catalog to validate
-folderFromFile <path of inclusion list file>            File containing folders in the catalog to validate

Common Arguments  
-offline <path of catalog>

-folderFromFile <folder from file>        ----- Sample Folder From File ------
                        /shared/groups/misc
                        /shared/groups/_filters
                        ------------------------------------

Example  
runcat.cmd/runcat.sh -cmd validate -offline c:\oraclebi\data\web\catalog\paint  

A few bits to notice:

  • -offline: the catalog validation needs to happen offline, either with services down or on a copy of the live catalog. Running catalog validation on an online catalog is dangerous, especially with "Clean" options, since it could delete content that is still in use.
  • -folder: the catalog validation can be run only for a subset of the catalog
  • None | Report | Clean: each validation can be skipped (None), logged (Report) or solved via removal of the inconsistent object (Clean)
  • Also, '-accounts' cannot be 'None'.: some validations are a prerequisite for others to happen
  • Default is 'Clean': some validations have "Clean" as the default value, meaning they will solve the issue by removing the inconsistent object; this may be inappropriate in some cases.

As written before, the initial catalog validation should be done with all options set on Report since this will give a log file of all inconsistencies without deleting pieces of the catalog that could still be valuable. In order to do so the command to execute is:

./runcat.sh -cmd validate -items Report -links Report -accounts Report -homes Report -offline <path_to_catalog> > cat_validation.log

runcat.sh output is displayed directly in the console, so I'm redirecting it to a file called cat_validation.log for further analysis.

If, after the initial run with all options set to Report, you want the catalog validation utility to "fix" the inconsistent objects, just change the desired options to Clean. Please make sure to take a backup of the catalog beforehand, since the automatic fix is done by removing the related objects. Moreover, ensure that catalog validation is working on an offline catalog. The command itself can work on top of an online catalog, but it is never a good idea to check a catalog that could potentially be changed while the tool is running.

The output

Let's see a few examples of how Catalog Validation spots inconsistent objects. For the purpose of this test I'll work with Oracle's Sampleapp.

Abandoned and inaccessible homes

Running the validation against the Sampleapp catalog provides some "interesting" results: some homes are declared "abandoned". This could be due to the related user no longer existing in the WebLogic console, but that's not the case here.

E10    saw.security.validate.homes Abandoned home /users/weblogic  

Looking deeper in the logs we can see that the same user folders are flagged as

User facing object '/users/weblogic' has no user permissions and is inaccessible  

Logging in with the user weblogic doesn't allow me to check the "My Folders" in the catalog. When switching to "Admin View" and trying to open "My Folder" I get the following error

OBIEE 12c Catalog Validation: Command Line

As written in the logs, it looks like the user folder has permission problems. How can we solve this? One option is to use the runcat.sh command again with the forgetAccounts option to remove the inconsistent homes. However, this solution deletes all the content related to the user that was stored under "My Folders".

In order to keep the content we need to overwrite the folder's permission with an administrator account. Unfortunately, when right-clicking on the folder, the "Permission" option is not available.

OBIEE 12c Catalog Validation: Command Line

As a workaround I found that clicking on Properties and then on Set Ownership of this item and all subitems allows you to grant full access to the administrator, who is then able to restore the relevant access privileges to the proper user.

OBIEE 12c Catalog Validation: Command Line

Once the workaround is implemented the user is able to check his "My Folders" content; however, the errors are still present in catalog validation. The solution is to store the relevant artifacts in another part of the catalog, run runcat.sh with the forgetAccounts option and then reimport the objects if needed.

Inconsistent Objects

The main two reasons generating inconsistent objects are:

  • Invalid XML: The object (analysis or dashboard) XML code is not valid. This can be caused by errors during the write to disk or problems during migrations.
  • Broken Links: analysis contained in a dashboard or linked from other analysis have been renamed or deleted.

Let's see how catalog validation shows the errors.

Invalid XML

To test this case I created a simple analysis with two columns and then went to the Advanced tab and deliberately removed an > to make the XML invalid.

OBIEE 12c Catalog Validation: Command Line

When trying to apply the change I got the following error, which prevented me from saving.

OBIEE 12c Catalog Validation: Command Line

Since I really wanted to ruin my analysis I went directly to the file system under $BI_HOME/bidata/service_instances/ssi/metadata/content/catalog/root/shared/$REQUEST_PATH and changed the XML directly there.

After that I ran the catalog validation with only the items flag set to Report and the rest set to None, since I'm looking only at invalid XMLs.
The result as expected is:

Message: Unterminated start tag, 'saw:column', Entity publicId: /app/oracle/biee/user_projects/domains/bi/bidata/service_instances/ssi/metadata/content/catalog/root/shared/rm+demo/notworkinanalysis, Entity systemId: , Line number: 9, Column number: 13  

This tells me that my analysis notworkinganalysis is invalid, with an unterminated start tag: exactly the error I was expecting. Now I have two choices: either fix the analysis XML manually or rerun the catalog validation with the Clean option, which will delete the analysis since it's invalid. As said before, there is no automated fix.

I wanted to try a further example: instead of removing the >, I removed a quotation mark " to make the analysis invalid.

OBIEE 12c Catalog Validation: Command Line

After clicking Apply, OBIEE already tells me that there is something wrong in the analysis. But since it allows me to save, and since I was feeling masochistic, I saved the analysis.

OBIEE 12c Catalog Validation: Command Line

But... when running the catalog validation as before I end up seeing 0 errors related to my notworkinganalysis.

OBIEE 12c Catalog Validation: Command Line

The answer to the Jackie Chan question is that I got 0 errors because in this second case the XML is still valid: removing a " doesn't make the XML syntax invalid! In order to find and solve that error we would need to use Oracle's Baseline Validation Tool.

Broken Links

To test the broken links case I created the following scenario:

  • Analysis SourceAnalysis which has navigation action to TargetAnalysis

OBIEE 12c Catalog Validation: Command Line

  • Dashboard TestDashboard which contains the TargetAnalysis object.

In order to break things I then deleted the TargetAnalysis.

OBIEE 12c Catalog Validation: Command Line

Running catalog validation with the links option set to Report gives, as expected, the line

N1    saw.catalog.impl.scour.validateDeadLink Referenced path /shared/RM Demo/TargetAnalysis in file /shared/RM Demo/_portal/TestDashboard/page 1 is inaccessible.  

But I don't get anything on the SourceAnalysis object, for which the navigation is failing.

OBIEE 12c Catalog Validation: Command Line

But if instead of an action link I use TargetAnalysis to filter the results of SourceAnalysis

OBIEE 12c Catalog Validation: Command Line

And then delete TargetAnalysis, I get the expected error:

N1    saw.catalog.impl.scour.validateDeadLink Referenced path /shared/RM Demo/TargetAnalysis in file /shared/RM Demo/SourceAnalysis is inaccessible

Summarizing: the broken link validation reports missing objects that are included in the main definition of other objects (as filters or as parts of dashboards), but doesn't seem to report when the missing object is only linked via an action.

Conclusion

My experiments show that catalog validation finds some errors, like invalid homes, invalid XML files and broken links, which users would otherwise hit at run time, and that won't make them happy. But there are still some errors which it doesn't log, like analyses with wrong column syntax. Luckily, in most cases other tools like the Baseline Validation Tool can spot them easily, so use everything you have, use it as frequently as possible, and if you want more details about how it works and how it can be included in the automatic checks for code promotions, don't hesitate to contact us!

Installing Scala and Apache Spark on a Mac - 22-Jun-2017 15:17 - Brendan Tierney

The following outlines the steps I've followed to get Scala and Apache Spark installed on my Mac. This allows me to play with Apache Spark on my laptop (single node) before deploying my code to a multi-node cluster.

1. Install Homebrew

Homebrew seems to be the standard for installing anything on a Mac. To install Homebrew run

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

When prompted enter your system/OS password to allow the install to proceed.


2. Install xcode-select (if needed)

You may have xcode-select already installed. This tool allows you to install the languages using command line.


xcode-select --install

If it already installed then nothing will happen and you will get the following message.


xcode-select: error: command line tools are already installed, use "Software Update" to install updates

3. Install Scala

[If you haven't installed Java then you need to also do this.]

Use Homebrew to install scala.


brew install scala

4. Install Apache Spark

Now to install Apache Spark.


brew install apache-spark

5. Start Spark

Now you can start the Apache Spark shell.


spark-shell

6. Hello-World and Reading a file

The traditional Hello-World example.


scala> val helloWorld = "Hello-World"
helloWorld: String = Hello-World

or


scala> println("Hello World")
Hello World
>

What is my current working directory.



scala> val whereami = System.getProperty("user.dir")
whereami: String = /Users/brendan.tierney

Read and process a file.


scala> val lines = sc.textFile("docker_ora_db.txt")
lines: org.apache.spark.rdd.RDD[String] = docker_ora_db.txt MapPartitionsRDD[3] at textFile at <console>:24

scala> lines.count()
res6: Long = 36

scala> lines.foreach(println)
####################################################################
## Specify the basic DB parameters
## Copyright(c) Oracle Corporation 1998,2016. All rights reserved.##
## ##
##------------------------------------------------------------------
## Docker OL7 db12c dat file ##

## ##
## db sid (name)
####################################################################
## default : ORCL

## cannot be longer than 8 characters
##------------------------------------------------------------------

...

There will be a lot more on how to use Spark and how to use Spark with Oracle (all their big data stuff) over the coming months.


[I've been busy for the past few months working on this stuff, EU GDPR issues relating to machine learning, and other things. I'll be sharing some of what I've been working on and learning in blog posts over the coming weeks.]
