Should I Learn the R Programming Language?

Business intelligence expert Johnathan Mills, a Senior BI Consultant operating out of Vancouver, Canada, thinks that ‘every programming language out there has its strengths and weaknesses. R’s strength is data; it is a language built to analyse and manipulate data. You will see this used by true, hardcore data scientists, but it does not currently have much traction outside of that area.’ (Selig, 2015)

He is probably not R’s number one fan, but as big data becomes more and more important to modern business, I contend that R is worth learning for that very reason: yes, absolutely, you should learn R. R is a popular language amongst data scientists and business analysts, and there is growing demand for these skills in the workplace.

Below I briefly outline ten major advantages of becoming fluent in the R programming language. Probably the biggest of them all is that R is open source and free to all.

One. Worldwide, millions of statisticians and data scientists use R to solve statistical and computational problems in fields as diverse as computational biology and data-driven marketing. (Eglen, 2009)

Scientists in the field of Computational Biology use R to produce output like this.

Two. The power of the basic R download is impressive, but there are also some 4,800 extra packages in repositories covering data mining, bioinformatics, econometrics and spatial analysis – all for free. You are almost guaranteed to be able to download a ready-made solution for yourself or your employer.
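Downloading one of those contributed packages takes a single line of R. As a sketch (ggplot2 is a real CRAN package, used again in the Try R assessment below):

```r
# Fetch and install a contributed package from CRAN
# (needs an internet connection; run once per machine)
install.packages("ggplot2")

# Load the package into the current session before using it
library(ggplot2)
```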

Three. R produces excellent graphs that are arguably publication-ready as is. These include bar charts, scatter plots, mapping features – the works.
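As a minimal illustration of those base-graphics capabilities (the figures here are made up for the example):

```r
# Bar chart from a named numeric vector, using only base R graphics
units_sold <- c(Apples = 10, Pears = 4, Plums = 7)
barplot(units_sold, main = "Units sold by fruit", ylab = "Units")

# Scatter plot of one numeric vector against another
x <- 1:20
y <- x + rnorm(20)   # add some noise around the trend
plot(x, y, main = "Scatter plot", xlab = "x", ylab = "y")
```

No extra packages are needed for either chart; both functions ship with the base download.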

Four. If big data is your area – and it is a growing one – R specialises in handling it and is often faster than comparable tools.

Five. There is a thriving worldwide R community you can reach out to for answers. New open source solutions are added every week.

Six. R is completely free and you can install it on as many machines as you like.

Seven. R is cross-platform – whether you’re using Mac, Windows or Ubuntu, it will work.

Eight. R supports reproducible research. Bang-up-to-date data and analysis can be shared easily, because the code that pulls the data, analyses it and presents it has already been written and is at hand when needed. (Meyer, 2015)
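A toy sketch of that idea: the whole pipeline, from data to chart, lives in one script, so re-running it reproduces the analysis exactly. The sales figures below are simulated stand-ins for a real data pull.

```r
# Reproducible analysis in one script: pull (simulated here), analyse, present
set.seed(1)   # fix the randomness so every run reproduces the same result
sales <- data.frame(month   = 1:12,
                    revenue = 1000 + cumsum(rnorm(12, mean = 20, sd = 50)))

print(summary(sales$revenue))                 # analyse

plot(sales$month, sales$revenue, type = "l",  # present
     xlab = "Month", ylab = "Revenue")
```

In practice the first step would be a `read.csv()` or database query against live data; everything downstream then refreshes automatically.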

Nine. R is open source and supports extensions. If you get really good at R, you can write your own extension to solve a specific problem.

Ten. It interfaces well with other languages. (Shankhdhar, 2015)

Briefly, I will play devil’s advocate. Starting with point ten, I could say that a general-purpose language like Python interfaces with other languages better and is easier to learn than R. Some people say that Python is the future for big data; it is certainly popular amongst high-tech startups. (Asay, 2015)


However, the figures tell a different story. A couple of years ago a large survey of over 700 data professionals asked: what programming/statistics languages have you recently been using for analytics/data mining/data science work? (Piatetsky, 2015)

61% used R
39% used Python
37% used SQL. On average, respondents used 2.3 languages.

This picture is likely to hold: R is expected to remain in use for years to come, and I predict that its visualisation features, in particular, will continue to improve.


References

Asay, Matt. ‘Python Displacing R As The Programming Language For Data Science’. Readwrite.com. N.p., 2015. Web. 23 Apr. 2015.

Eglen, Stephen J. ‘A Quick Guide To Teaching R Programming To Computational Biology Students’. PLoS Computational Biology 5.8 (2009): e1000482. Web. 23 Apr. 2015.

Meyer, Justin. ‘R Programming Help, How To’s, And Examples | Rprogramming.Net’. RProgramming.net. N.p., 2015. Web. 23 Apr. 2015.

Piatetsky, Gregory. ‘Top Languages For Analytics, Data Mining, Data Science’. Kdnuggets.com. N.p., 2015. Web. 23 Apr. 2015.

Revolution Analytics. ‘What Is R?’. N.p., 2015. Web. 23 Apr. 2015.

Selig, Abe. ‘Buzzword Breakdown 2.0: 5 Baffling BI Terms Explained’. Plottingsuccess.com. N.p., 2015. Web. 23 Apr. 2015.

Shankhdhar, Gaurav. ‘Why Learn R | Reasons To Learn R Programming | Edureka’. Edureka Blog. N.p., 2015. Web. 23 Apr. 2015.

Wager, Tor D. et al. ‘A Bayesian Model Of Category-Specific Emotional Brain Responses’. PLOS Computational Biology 11.4 (2015): e1004066. Web. 23 Apr. 2015.

What are Management Information Systems and Are They Relevant?

Management information systems (MIS) involve mainly software, but also the related business processes and resources that together are used to extract information from lower-level functional or tactical systems within a business or organisation. Modern MIS can be quite powerful and sophisticated, producing easy-to-read output in real time or close to it. Ultimately the information is used to make more effective business decisions and further the organisation’s goals. (Banks, 2015)


MIS are important, and a brief timeline illustrates that they have grown in importance over the years. In the past, less information was available and perhaps more business decisions were made on instinct.

In their classic textbook, Management Information Systems, published by Pearson, the authors Laudon and Laudon outlined five eras in the history of management information systems’ increasing relevance.

The first era, as they called it, was that of the mainframe computer, when IBM was the only gunslinger in town. The cost was high, and because processing took so long, the value of the information was relatively low.

The second era was that of the personal computer. Computing became more widespread, and it was no longer only the very biggest corporations that could afford to implement an MIS. This technology was based on simple microprocessors.

Era number three, client/server, arrived when computers began sharing data across a network – an improvement on isolated machines. The fourth era developed this further: dubbed the enterprise era, it was when all computers became networked more efficiently and every machine was put to work. The fifth and current era, cloud computing, is the name given to the collection of technologies that make computing over the internet a viable and much more powerful way of running an MIS.


An MIS does not necessarily have to tie in every single business information system in the organisation. For example, a CRM (Customer Relationship Management) system only processes data relevant to looking after the end customer. (Banks, 2015)

These systems are more important than ever before, given the fast pace of change in the business world and in customer expectations generally.

According to an article in the International Journal of Reviews in Computing, management information systems provide ‘information for the managerial activities in an organisation. The MIS is basically concerned with processing data into information and is then communicated to the various departments in an organisation for appropriate decision making.’ (Reddy et al., 2015)

Another good definition of MIS, relevant to this discussion of its importance, is the one Njoku gives in the International Journal of Knowledge & Research in Management & E-Commerce. It states that ‘Managers at all levels in organizations must constantly work with relevant, timely, strategic, accurate, structured, cost-effective information in order to execute planning, control, decision making and problem solving efficiently and effectively. Effective management information systems (MISs) provide this information.’

One key consideration regarding the importance of MIS, however, is the communication of the information throughout the organisation. The Plotting Success website quoted Wayne Applebaum, VP of analytics and data science at Avalon Consulting, LLC: ‘I recently heard a presentation that likened delivering an analytic that couldn’t be easily understood to a grocery store selling ketchup without a bottle. Packaging is very important.’ In other words, the MIS needs to output easily understood information for management. In fact, the ability to create clear reports and dashboards is a key responsibility sought in new business intelligence analyst/specialist hires.

Applebaum also said that ‘people skills are becoming more crucial’. This is key to our discussion of the importance of management information systems. The people in charge of the MIS need enough business awareness to ask the right questions of the system, and the skills required to sell that information up the chain of command and ultimately to the entire organisation.

A good system, coupled with people who have the requisite skills, makes management information systems truly important to today’s business world.

With the increased use of robotics (IFR, 2015) and the proliferation of automated data-gathering sources, the trend for the foreseeable future is one of increasing importance for management information systems, as the volume, variety and velocity of the data that organisations must process to compete in their industry steadily increase.



References

Banks, Linda. ‘Importance Of The Management Information System’. Small Business – Chron.com. N.p., 2015. Web. 23 Apr. 2015.

IFR.org. ‘IFR Press Release – IFR International Federation Of Robotics’. N.p., 2015. Web. 23 Apr. 2015.

Prince, Chris, and Udochukwu Njoku. ‘Establishing And Managing Management Information Systems In Developing Countries’. International Journal of Knowledge and Research in Management and E-Commerce 3.4 (2015): Page 1. Print.

Reddy, G. Satyanarayana et al. ‘Management Information System To Help Managers For Providing Decision Making In An Organization’. International Journal of Reviews in Computing (2015): 1. Print.

Selig, Abe. ‘Survey: What Employers Are Looking For In A Business Intelligence Analyst’. Plottingsuccess.com. N.p., 2015. Web. 23 Apr. 2015.

Business Intelligence – What is it and Has it Come of Age?

First of all, what is business intelligence? Business intelligence (BI) is all about transforming the raw data obtained from all sources into meaningful, useful information that managers can use to make strategic and operational decisions. Nowadays BI has the capacity to process large amounts of unstructured data, as well as traditional numeric and simple text data.

The insight into the data covers historical analysis, snapshots of current levels and predictions of the future using statistical models. Functions that come under the BI banner include reporting, online analytical processing, analytics, data mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. (Selig, 2015; Laudon and Laudon, 2000)

As can be imagined, a lot of very useful information can potentially be gleaned through these disciplines. The better the business intelligence execution, the better the information, and ultimately the better the decisions made for the business. (Rivera, 2015)

With every passing year, ambitions for BI increase in parallel with the volume of data captured.

It could be argued that BI has come of age because nowadays the talk is of real-time BI. Today, decisions have to be made more quickly than ever before, and organisations that implement real-time BI can gain a competitive edge. Modern enterprise software automates many tasks, such as data collation, freeing up time for knowledge work. (Rivera, 2015)

BI came of age in the airline industry when Continental Airlines in the US used it to transform its market position from last place to a dominant role. The airline ploughed approximately $30M into a new real-time BI setup involving hardware, software, and staff recruitment and training. (Rivera, 2015)


Within six years, Continental Airlines’ BI infrastructure had achieved a 1,000% ROI, delivering over $500M in cost savings and increased revenues. This was realised through real-time customer-service engagement, improved baggage and complaint handling, cleverer pricing, and more precise booking and arrival information.

Another BI coming-of-age story is that of AstraZeneca, the global research-based biopharmaceutical company, which had 250 of its employees using state-of-the-art BI software to report on and analyse sales, market share, product performance and cost/profit data. The information harvested enabled sharper business decisions, positively affecting the bottom line. (Saylor, 2007)


Yet another successful implementation that shows how BI can really shine is Comdata’s FleetAdvance, which enables commercial transportation fleets to make smarter fuelling decisions by helping them choose the best routes and better manage fuel costs through smarter purchasing. Like the systems above, it processes real-time transaction data and outputs easy-to-understand scorecards on user-friendly dashboard displays. (Spaulding, 2014)


With the recent proliferation of online sales channels, the retail management company Brightpearl handles real-time data spanning all channel options for a client company, ensuring stock is never double-sold. Management using the software can get an instant snapshot of inventory and cash-flow levels, facilitating a more efficient business overall. (Rivera, 2015)


The above are stellar examples of effective BI implementation, but has it come of age? I think these case studies show that it has. However, can we say that BI has come of age universally in modern business settings? Perhaps not.

A recent survey of BI use in the finance and energy industries showed a surprisingly low level of proactive BI use among the companies surveyed. Only 20% of companies in these industries showed an interest in BI specialities like scorecards, dashboards and real-time analytics techniques. For the majority of respondents the analysis was reactive and historical, relying on spreadsheets, manual data manipulation and periodic reports. (Groenfeldt, 2014)

A SunGard spokesman stated that ‘many companies don’t want to replace different systems with a single system because each of those is best in class and does what it is supposed to do.’ Also, when an organisation’s business spans geographic regions, there is the issue of uniformity in naming conventions and data models. (Groenfeldt, 2014)

The majority of respondents used middleware and analytical tools like Tableau, Business Objects, MicroStrategy and Cognos rather than a single unified real-time system. For these companies, it could be said that BI has not yet come of age. (Groenfeldt, 2014)

In the future it is likely that more and more companies will come on board with real-time unified BI systems flexible enough to cope with regional and international differences in business practices and underlying models. And these enterprise solutions will more than likely become cheaper to buy, as Moore’s law is still very much valid.


References

Groenfeldt, Tom. ‘Business Intelligence (BI) Isn’t. Very Intelligent. Yet.’. Forbes. N.p., 2014. Web. 23 Apr. 2015.

Laudon, Kenneth C, and Jane Price Laudon. Management Information Systems. Upper Saddle River, NJ: Prentice Hall, 2000. Print.

Rivera, Maricel. ‘Real-Time Business Intelligence In The Real World — TDWI’. Tdwi.org. N.p., 2015. Web. 23 Apr. 2015.

Saylor, Michael J. Customer Success With Microstrategy Business Intelligence. 1st ed. Microstrategy Incorporated, 2007. Web. 23 Apr. 2015.

Selig, Abe. ‘Buzzword Breakdown 2.0: 5 Baffling BI Terms Explained’. Plottingsuccess.com. N.p., 2015. Web. 23 Apr. 2015.

Spaulding, Jennifer. ‘Comdata And Credera™ Business Intelligence Collaboration Receives TDWI 2014 Best Practices Award – Blog.Credera.Com’. blog.credera.com. N.p., 2014. Web. 23 Apr. 2015.

The Three Vs of Big Data and Are They Useful in Data Management?

There are many definitions flying around these days in the world of big data. ‘Data management’ is a bit of an 80s concept that the DAMA Data Management Body of Knowledge (DAMA-DMBOK) defines as follows:

‘Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets.’

I contend that the three Vs are useful in understanding the world of data management and big data. Svetlana Sicular, a research director at Gartner focusing on big data, data governance and enterprise information management, disclosed in a Forbes article that she regularly uses the three Vs with her clients to explain data analytics, ‘not just to set a common ground, but to point out where big data challenges and opportunities are.’ (Sicular, 2013)

The three Vs are famous because they are integrally wrapped up with a famous definition of big data.


But first, what are the three Vs? In short: Volume, Velocity and Variety. More on these later. First, the famous definition of big data from the research firm Gartner:

‘Big data’ is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Note that ‘Big Data’ is in inverted commas. This is because Gartner reckons that big data will eventually simply be the norm and taken for granted. (Sicular, 2013)

Hence the usefulness of the three Vs, first introduced by Gartner analyst Doug Laney back in 2001 (WhatIs.com, 2015) in a Meta Group research publication entitled ‘3-D Data Management: Controlling Data Volume, Velocity and Variety’.

Big data is a somewhat mysterious term, rather like cloud computing (which really involves several different technologies). The data that comprises big data can come from sources as diverse as unstructured tweets and updates from Twitter and Facebook, web server logs, traffic-flow sensors, satellite imagery, broadcast audio streams, banking transactions, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data and clicks on shopping websites. (Dumbill, 2012)


Enter the first V, Volume, and its usefulness in shaping our understanding of data management. The volume of data nowadays is massive. An alternative definition of big data that I like is the one Mike Loukides gave in his O’Reilly report ‘What Is Data Science?’: ‘Big Data is when the size of the data itself becomes part of the problem’.

The second V, Velocity, frames the whole phenomenon of today’s data explosion and is more relevant to data management than ever before. As Pinal Dave points out on his Journey to SQL Authority blog, there was a time when we believed that yesterday’s data was recent. The print editions of Irish newspapers such as the Irish Times and the Irish Independent still operate in that paradigm (although probably more for show these days); their highly popular websites are where we really get our news. Twitter is more immediate still. This is the Velocity of today’s data: a two-minute-old tweet can already be old hat in a news timeline.

The third V, Variety, is useful in data management in that it helps a manager choose between massively parallel processing architectures known as data warehouses (e.g. Greenplum) and Apache Hadoop-based systems. Hadoop-based systems do not restrict the structure of the data they receive; data warehouses, on the other hand, are better suited to gradually changing incoming data. Facebook uses Hadoop to process its data. (Dumbill, 2012)

The Variety V also highlights data that organisations hold but do not necessarily see as useful, or potentially useful, at present. Gartner calls this dark data: for example, elevator logs that could be used to predict vacated real estate. Savvy organisations are shedding light on this dark data and extracting useful information from text (e.g. patterns of success and failure in work projects buried in staff emails), location data and log files. (Sicular, 2013)


The three Vs of big data are more useful than ever. Yes, some data experts have proposed more Vs – Volatility, Validity, Viability, Value, Veracity and Visualisation – but the original three help business people and data scientists get a good handle on the brave new world of data. The three Vs will continue to form the basis of every data scientist’s grasp of the world’s data firehose, because they fundamentally and logically make sense.


References

Dave, Pinal. ‘Big Data – What Is Big Data – 3 Vs Of Big Data – Volume, Velocity And Variety – Day 2 Of 21’. Journey to SQL Authority with Pinal Dave. N.p., 2013. Web. 23 Apr. 2015.

Dumbill, Edd. ‘What Is Big Data? – O’reilly Radar’. Radar.oreilly.com. N.p., 2015. Web. 23 Apr. 2015.

Loukides, Mike. ‘What Is Data Science? – O’reilly Radar’. Radar.oreilly.com. N.p., 2015. Web. 23 Apr. 2015.

Maronde, Lennard. ‘Big Data In 2014 – A Necessary Update’. Datashaka.com. N.p., 2015. Web. 23 Apr. 2015.

Sicular, Svetlana. ‘Gartner’s Big Data Definition Consists Of Three Parts, Not To Be Confused With Three “V”S’. Forbes. N.p., 2013. Web. 23 Apr. 2015.

WhatIs.com. ‘What Is 3Vs (Volume, Variety And Velocity)? – Definition From WhatIs.com’. N.p., 2015. Web. 23 Apr. 2015.

 

 

Assessment 2 – Try R

I downloaded the latest version of R for Windows from http://www.r-project.org/ and then also downloaded the ggplot2 graphing package. There are many libraries like this that supplement the default libraries, and they are what make R such a powerful statistical processing language.
 
For simplicity I did not use ggplot2 in the end. I downloaded it from the Austrian site, which hosts the primary CRAN (Comprehensive R Archive Network) server.
 
Rather than make up random data for the purposes of this assignment, I harvested some real-life historical data from Dublin City Council’s public open-data site, http://dublinked.com/. The site’s featured Dataset of the Week was the Litter Wardens’ Inspections data, collected by the wardens using smartphones running the OpenDataKit.
 
The dataset was large, so I copied a representative sample of 30 rows and reduced the number of columns to three, leaving out the geodata and other fields. The column headings I chose were: Location Type, Complaint Source and Type of Litter. The original litter-warden file from the dublinked.com site was a CSV, and it opened in Excel on my computer by default. I saved the new, smaller file again as a CSV since, along with text files, these are readily handled by R.
 
The next step was to import the .csv file into R. I used the file.choose function to help locate the file, and the code I used was as follows: data <- read.csv(file.choose(), header = TRUE)
 
To check that my new data frame, which I chose to call data, was imported correctly, I simply typed data and hit return. My thirty rows of data appeared correctly. I then called the plot function to display the data frame graphically, by typing plot(data).
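The same two-step import-and-plot workflow can be sketched with a small stand-in data frame (the column names mirror the litter-warden file, but the values below are invented for illustration):

```r
# Stand-in for the litter-warden CSV: three categorical (factor) columns
litter <- data.frame(
  LocationType    = factor(c("Street", "Park", "Street", "Park")),
  ComplaintSource = factor(c("Public", "Warden", "Public", "Warden")),
  LitterType      = factor(c("Regular", "Dog", "Regular", "Regular"))
)

# plot() on a data frame with more than two columns draws the
# pairwise plot matrix described above
plot(litter)
```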
 
Please see a screen shot of the graphical display that the R app outputted below.

Plot of Litter Warden Data - 30 Rows

Please see a screen shot of the dataset .csv file that was used by R below:

Data Set Used

As no numerical data was used, the graph is somewhat hard to interpret. The three unique instances of customer complaints are represented by the dots on the right-hand side of the matrix. The bottom-left box in the matrix plots all occurrences of regular litter as opposed to unspecified waste, which for the purposes of this assignment I named ‘Dog’, as in dog waste from household pets.
 
Other possible ideas/concepts that could be represented with R graphics are discussed below.
 
A histogram could be created with the function hist(x), where x is a numeric vector of the values to be plotted (a bar chart of categorical counts uses barplot instead). A variation on this is to use the option freq = FALSE to plot densities instead of frequencies.
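For instance, with a simulated numeric vector:

```r
# Histogram of 100 simulated values; freq = FALSE puts densities
# on the y-axis instead of raw counts
set.seed(42)
x <- rnorm(100)
hist(x, freq = FALSE, main = "Density-scaled histogram", xlab = "x")
```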
 
Another approach involves representing data with kernel density plots, which draw a smooth curve over a distribution. This is achieved in R by using the function plot(density(x)), where x is a numeric vector.
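A short sketch, again with simulated data:

```r
# Kernel density estimate of a numeric vector, drawn as a smooth curve
set.seed(7)
x <- rnorm(200)
plot(density(x), main = "Kernel density plot")
```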
 
One of the aforementioned add-on packages, sm, allows a data scientist using R to superimpose the kernel densities of two or more groups. It might, for example, be possible to compare Dublin City Council’s litter-warden data with comparable data from another major city.
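A hedged sketch of that comparison using the sm package (sm must first be installed from CRAN; the two ‘cities’ below are simulated groups, not real litter data):

```r
library(sm)  # CRAN package; install.packages("sm") if missing

# Two simulated groups standing in for two cities' figures
set.seed(3)
values <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
city   <- factor(rep(c("Dublin", "OtherCity"), each = 50))

# Superimpose the two kernel density estimates on one plot
sm.density.compare(values, city, xlab = "Value")
legend("topright", legend = levels(city), fill = 2:3)
```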

Assessment 1 – Fusion Tables

The fusion table discussed below can be found online at this link.

After setting up Google Fusion on my Google Drive account, from the Chrome Store, I downloaded the following two files locally to my computer:

  1. The map_lead file, which I got from the independent.ie website (a KML data file)
  2. The 2011 Population Statistics from the Central Statistics Office site, that I pasted into an Excel file

I then navigated to chrome://apps on the Chrome browser, and opened up Google Fusion Tables.

Then I selected the Population Excel file and clicked ‘Next’. I confirmed that column names are in row 1 and clicked ‘Next’. I then accepted the automatically populated field names and details and clicked ‘Finish’.

I then clicked on ‘File’, ‘New Table’, and this time chose the ‘map_lead’ file and went through the same steps. I clicked back into the Population table and selected ‘File’ and then ‘Merge’. I then selected the ‘map_lead’ table and clicked ‘Next’. In the ‘This Table’ column I accepted the ‘Province/County’ entry, but under the ‘map_lead’ column I changed it to ‘Name’, and then clicked ‘Next’.

I deselected the ‘Males’ and ‘Females’ options to reduce clutter and clicked ‘Merge’ and clicked the link to view the table.

I renamed the new fusion table to ‘Assessment 1 Fusion Tables’. Under the ‘map of geometry’ tab I clicked ‘heatmap’ and changed the location setting from ‘geometry’ to ‘Province/County’. I then adjusted the ‘Radius’ and ‘Opacity’ levels for greater visibility, and clicked ‘Done’.

Under ‘File’ and ‘Share’ I obtained the public link for the fusion table (I changed it to public first). At this stage I took the screengrab below.

Heat map of entire fusion table

To create a selection of counties based on population, I clicked ‘Filter’, then ‘Total Persons’, arbitrarily chose the values between 20,000 and 120,000, and clicked the ‘Find’ button. I took a screen grab of this map.

Random distribution of counties based on population density

Heatmaps display colours on the map to represent the density of points from a table. The table must, however, have a location column containing individual points, e.g. from a KML file.

Other views of the heatmap can be produced using different settings under the ‘Filter’ option, such as ‘Total Persons’ or ‘Province/County’. By way of example, I generated a heatmap for Galway city by itself (please see the screengrab below).

Galway city heatmap

The amount of information that can be gleaned from a heatmap depends on the amount of data in the source table. Heatmaps can depict all kinds of numerical data, such as counties with mountains taller than 8,000 feet, or counties with more than 2% bogland.

I also included an intensity map with bucket levels under the ‘Fill’ option. Intensity maps are another way of displaying the information in tables; click the last tab, entitled ‘Intensity Map’, on the fusion table link here.