O’Reilly Strata Conference: Making Data Work

“There is no doubt that Strata Conference is a great magnet for top big data talent and that it has served its audience well – hungry for inspiring content on the latest and greatest big data and data science developments.” – Strata Attendee

The future belongs to those who understand how to collect and use their data successfully. And that future happens at Strata.

The best minds in data will gather in London this November for the O’Reilly Strata Conference – to learn, connect, and explore the complex issues and exciting opportunities brought to business by big data, data science, and pervasive computing.

Prior to the event, O’Reilly is offering FREE Data Reports – the latest insights in data science and big data from our editors, authors and speakers.

  • Analyzing the Analyzers
  • Big Data Now
  • Disruptive Possibilities
  • Planning for Big Data
  • Real-Time Big Data Analytics

Get any, or all, of these reports at oreilly.com/data/free

Available in PDF, DAISY, Mobi and ePub formats.

MSc Course in Data Science launched at Dundee University

“The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing and retail to healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data. ‘Data Scientist’ is now the hottest job title in Silicon Valley.” – Tim O’Reilly

As Tim points out, we are surrounded by data science and its close companion, big data. I have just learned that the University of Dundee now teaches a new MSc course in Data Science. Mark Whitehorn, the founder of the course, was kind enough to answer some questions.

Hi, Mark, I hear you’ve just launched a new MSc course in Data Science at the University of Dundee.  So, first questions first: what’s data science?

Very cool at the moment.  It’s a term we’re hearing more and more these days, and, as with many new terms, different people define it in different ways, and those ways are evolving even as we speak.  In a nutshell, a data scientist is someone with well-developed skills in analysing data, especially in analysing large amounts of data that does not fit readily into traditional tabular database structures.  That kind of data is often called Big Data, another term much in the news.

Is it a full-time or a part-time course?

It can be either. You can take it as a full-time course of study lasting one year, or as a part-time one lasting two. Most of our current students take the part-time option because they are already in employment and the course is designed to accommodate this. Each year the part-timers attend two separate weeks of lectures in Dundee, plus about two extra days for examinations. So the total time they have to spend in Dundee is 12 days per year; the rest of the time is home-based study with assignments, phone tutorials and a project. Full-timers must be based in Dundee and attend four full weeks of lectures and also do the assignments, tutorials etc. Both part-time and full-time students also do a research project.

Why now and why Dundee?

As well as being a Professor at the University, I also work as a consultant, and my commercial experience tells me that businesses are crying out for people who can extract useful information from huge data sets: information covering performance, trends, behaviours, predictions, any scrap of information that can help a business to stay ahead of the curve. With my commercial hat on I constantly see the need for people who are trained in data science, so as an academic it made perfect sense to set up the course!

Furthermore, in the School of Computing at the University of Dundee we have been running a very successful MSc course in Business Intelligence since 2010. It has been very well received by the students, about 75 of them so far, and all of our full-time graduates have found employment in the BI field. The part-timers already had jobs and many of them have moved to better jobs. Launching the Data Science course seemed a natural progression and has already proved popular: we had about three times more applicants than places available for the January 2013 intake.

Finally, the research work we do at Dundee has involved big data and data science since about 2007; in other words, we aren’t just talking about it, we’ve been doing it for at least five years. We do actually know how to do it for real!

Anything else you’d like to say about data science?

I think it is complex to define fully but we have tried to give the flavour of the sorts of skills we expect our students to possess by the time they graduate.

General skills include:

  • excellent analytical capabilities
  • machine learning
  • statistics, maths and data mining
  • algorithm development, code writing
  • data visualisation
  • understanding multi-dimensional database design and implementation

Specific skills include:

Technologies to handle big data

  • Hadoop and related technologies
  • MapReduce and its implementation on differing software platforms
  • NoSQL databases

Knowledge of languages such as

  • SQL, MDX, R
  • Functional and OOP languages such as Erlang and Java

General characteristics include:

  • Interdisciplinary interests
  • Excellent communication skills
  • Insatiable curiosity

Insatiable curiosity – I like that!

Yes, and it’s really an excellent indicator of the sort of person who will excel as a data scientist.  It denotes something rather like the old meaning of the hacker mentality.  Not someone who breaks into systems but a hacker in the sense of someone driven to understand, to explore all the options, to try all the permutations.  Someone who works on a problem and suddenly notices it is three in the morning, someone who has survived for days on flat food.

Duncan Ross, Director of Data Sciences at Teradata, has said that “The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens ‘if I push that button’.  Data Science selects for it.”  I think he’s absolutely right.

Any final thoughts before we wrap it up?

It has been great fun and very rewarding to run the BI course and I am sure the DS course will be equally so, not just for me and my colleagues, but also for the students.  I’ll give the last word to some of those who have already graduated from the BI course:

“Enrolling on the Business Intelligence Master’s Degree at the University of Dundee has been the single most important thing I have done to further my career since starting to work in IT 18 years ago. Not only have I thoroughly enjoyed the course and project work but I have met some great people and established a strong network of friends who work in the industry. Not to mention landing a dream job as a Data Scientist with Teradata at the end of it!” Chris Hillman

“It was hard work at times, but the reward you get as always, is proportionate to what you put into it. And it was fun.  Given the opportunity, I’d definitely do it again – two years of my time well invested, with no regrets and some great adventures on the way.”  Jon Reade

“The style is never dogmatic and always open-minded allowing everyone to input their own ideas in this fluid area of the business and scientific world.”  Gordon Meyer

“I feel it was a great decision to relocate from my home country for this amazing year that provided me sound knowledge and further inspiration.”  István Poprócsi

“I was delighted when I finally completed the course and was awarded an MSc. with distinction – but at the same time felt rather sad that this great experience had come to an end.  I’d learned so much and loved every minute of it.”  Andy Hogg

 

Prof. Mark Whitehorn specializes in the areas of data science, analytics, business intelligence (BI) and Big Data.

On the academic side, Mark holds the Chair of Analytics at the University of Dundee, where he designed and runs two Master’s courses, one in Business Intelligence, the other in Data Science. He also works with the prestigious Lamond labs, applying data science to proteomics.

In addition, Mark works as a consultant to national and international companies, designing analytical systems. He is also a well-recognized commentator on the computer world, publishing articles, white papers and 11 books on database and BI technology.

For relaxation he collects and restores old cars, which keeps him out of too much trouble. He only wears a tie under duress and unashamedly belongs to the beard-and-sandals school of computing.  (And he doesn’t take life as seriously as this photo suggests!)

Fun at Strata and Velocity – London 2012

Last week I attended two conferences organised by O’Reilly: Strata (themed around Big Data) and Velocity (performance and administration of web applications).

Recently I have been exploring various NoSQL databases, so when I heard the Strata conference was coming to London for the first time, I decided to attend – after all, many of the NoSQL products are very closely associated with the world of Big Data.  Seduced by the discount for booking both, on a whim I decided to attend the Velocity conference as well.  Two days for each, so that would be 4 days of presentations.  I made sure I had plenty of sleep in advance…

Strata came first.  My goal here was to get a wider understanding of the whole Big Data scene – hot technologies, interesting problems that the community faces, and so on.

My first impression was that this is a field still being explored – even the problems are not yet well-defined.  Several of the speakers offered their own definitions of “Big Data”.  I think it was George Dyson who suggested that the Big Data era began when it became cheaper to keep all your data than to spend human effort to delete it.  A more subjective definition was that you know you have Big Data when you have to start thinking about the size of it – which suggests that the threshold will rise as the state of the art moves forward.

I’ve not seen Hadoop in real deployments, so it was interesting to hear the war stories about that, but there were plenty more technologies under discussion. I heard about RDF, Clojure, Cascalog, techniques for visualizing and exploring data, and much more.

Funnily though, one of the talks that had the most impact on me was not a “deep techie” thing at all: in the last session on Tuesday, Felienne Hermans of Delft University spoke about PhD research she’d done into corporate use of spreadsheets.  We all are vaguely aware that Excel gets (ab)used for all sorts of things – largely because it is a quasi-programming environment that is used by non-programmers – but do we really know the true extent?  A spreadsheet can combine data, logic and presentation with a complete failure of “separation of concerns”.  Felienne had worked with an investment bank where the management initially estimated there might be 10 thousand spreadsheets; the correct figure was more like 3 million.  A timely reminder that while we worry about the challenge of slightly rough data in our databases, there’s a whole lot of business-critical stuff out there in users’ home directories…

Velocity followed on Wednesday and Thursday, and here my objective was to catch up with a field where I was a bit stale – my real web experience dates from 5 years ago, and of course things have moved on.  There was a lot of talk about DevOps, but this isn’t so new to me; instead I tried to cast my net wide, and went to talks about queueing, monitoring, stories of real-life experience, and various new technologies.

The Velocity conference seemed slightly more “corporate” than Strata, perhaps because it seemed mostly to be about better ways of tackling well-known problems, rather than working out what the heck the problem actually is. Strata was asking, “What do I do with all this data? Is there a business model hidden in there, or knowledge that I can extract? How do I do any of that?” In contrast, Velocity mostly concentrated on more specific questions for a more mature field: “How can I monitor the performance of my app around the globe? What metrics should I track? Can I use DevOps-style agility to improve stability and deploy releases more quickly?”

At both conferences there was a good selection of exhibitors, particularly at Velocity, where the more mature problem space means there are more players with competing offerings to sell. As a fan of open source, I find it encouraging how many of the free products now have companies to back them and sell extra support (and conversely, how many companies choose to open their core products). Most of the stands were definitely geared to the technical nature of the conferences and were able to deal with proper in-depth questioning.

The least satisfactory aspect of the whole thing was the hotel conference rooms. All of them had the same narrow chairs bolted together in rows. I’m certainly no “big guy” but I was at least an inch or two wider than the chairs, so in a well-attended talk everyone ended up very tightly wedged, or taking it in turns to lean forward. In most of the rooms the projection screens were low and you couldn’t see the bottom half of slides from the back. A plus for the hotel was the good-quality food, though this may not have helped with the narrow seating!

As you’d expect, this event has sparked a whole lot of questions and further research to do. I’ll certainly be looking into a bunch of new technologies – Hadoop: The Definitive Guide is first up on my reading list – but it seems that statistics is going to become a surprisingly in-demand skill as businesses try to extract the patterns from their data. Statistics in a Nutshell next, perhaps…

Overall, my first experience of these conferences was very positive – in both cases, it was a great way to get a survey of the scene and drill down to a few more in-depth areas too.  Of course I can’t speak for anyone who had more specific objectives, but it seemed that the corridor conversations around the formal talks offered plenty of opportunities to make contacts, and get into more detailed discussions.  I suspect I’ll return in the future, hopefully with some stories of my own to tell!

Gordon Banner is a sysadmin and infrastructure consultant who is interested in almost anything technological, but when forced to specialise will concentrate on supporting developers and maintaining applications at enterprise scale.

What is Machine Learning?

When I introduce myself to people as one of the “machine learning” guys at Rangespan, most often people will follow up with “What is Machine Learning?”. It’s a good question so let me explain.

Machine Learning is about how we can make computers learn like humans do. Let’s take the example of learning language. Take a one-year-old toddler. She might hear a new word, “bird”, when people point to an object in the sky, and decide to start using it herself whenever she sees something in the sky (perhaps mistakenly, when it is an airplane). What that toddler just did is remarkable: she didn’t just memorize or associate the word “bird” with the occasions on which she previously saw things in the sky. Rather, she discovered a pattern in how the word “bird” is used, and decided to use it in a similar but slightly different context.

This idea of using pattern recognition to make generalisations lies at the heart of machine learning. A classic example studied extensively in the academic community is handwritten digit recognition. All handwritten digits differ from each other in big or small ways. What we’d like to do is discover broad patterns that let us recognise handwritten digits automatically: a six looks a bit like a spiral, a four has no round bits, and so on. We could try to have programmers encode these patterns in algorithms, but a much more successful approach, pioneered by nature and rediscovered by the machine learning community, is to give a computer a few examples of handwritten digits together with their labels and let it figure out which patterns are useful for distinguishing different digits. The job of “machine learning people” is to invent new algorithms that can learn patterns from existing data in order to generalize.
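To make the idea of learning from labelled examples concrete, here is a minimal sketch in Python. It is not a real digit recogniser: the tiny 3x5 bitmaps, the function names and the choice of a nearest-neighbour rule are all assumptions made for illustration. It only shows how a program can label a digit it has never seen by comparing it with labelled examples, rather than by following hand-written rules.

```python
# Toy illustration only: 3x5 bitmaps stand in for scanned handwritten digits.
# All names and data below are invented for this sketch.

def to_pixels(rows):
    """Flatten rows of '#'/'.' characters into a tuple of 0/1 pixels."""
    return tuple(1 if ch == "#" else 0 for row in rows for ch in row)

# The "training data": a few example digits together with their labels.
TRAINING = [
    (to_pixels(["###", "#.#", "#.#", "#.#", "###"]), 0),
    (to_pixels([".#.", "##.", ".#.", ".#.", "###"]), 1),
    (to_pixels(["#.#", "#.#", "###", "..#", "..#"]), 4),
    (to_pixels(["###", "..#", ".#.", ".#.", ".#."]), 7),
]

def distance(a, b):
    """Count the pixels on which two bitmaps disagree (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))

def classify(pixels, examples=TRAINING):
    """Return the label of the labelled example closest to `pixels`."""
    _, label = min(examples, key=lambda ex: distance(ex[0], pixels))
    return label

# A "four" drawn slightly differently from the one in the training data.
sloppy_four = to_pixels(["#.#", "#.#", "###", "..#", ".#."])
print(classify(sloppy_four))  # prints 4
```

Nobody told the program what a four looks like; it generalised from the nearest labelled example. Real systems use far richer models and far more data, but the workflow of examples plus labels in, predictions out, is the same.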

For this reason, machine learning is also a driving force behind the big data movement. When more data is available, most machine learning algorithms can more easily discover patterns and generalizations.

For the past few years I did a lot of machine learning research at Microsoft in Cambridge, but in September last year I decided to join a startup in London. The first thing that struck me was the buzz at the various meetups that are organised in the city. It’s great to join a group of people to talk about a common topic.

When a friend in New York sent me an email late last year about a discussion they had at the New York Machine Learning Meetup, I imagined how much fun it would be to get some people together in London to talk machine learning. When we started contemplating the structure of the meetup, we thought we could make it even more interesting by bringing people at the cutting edge of research (academics) together with people at the cutting edge of practice (industry) in the same room.

So in February 2012 we kicked off the first meetup with Jedidia Francis (PhD student from Oxford) and David Singleton (Google). We organised that meetup in the dev room at Rangespan, and when 25 people showed up, space got a bit tight.

Six months and eight speakers later, our group has grown to more than 300 people. We’ve had speakers talk about self-driving cars, agents that learn to play Civilization by reading the manual, and more. We’ve learned how machine learning systems are used at big organisations like Microsoft as well as at small startups like PeerIndex. We’ve been sponsored by PeerIndex, Forward Internet Group, VisualDNA, Rangespan and O’Reilly.

It’s quite clear that machine learning is hot, smoking hot. It’s exciting to see so many people wanting to learn machine learning and apply it to their application domain. I believe there is no other place in the world with as much high-quality machine learning going on as in and around London. With universities like UCL, Oxford and Cambridge, startups like PeerIndex, Last.FM and Rangespan, established companies like Microsoft and Google, and all the hedge funds, there is an almost unlimited supply of people who know machine learning, as well as an unlimited demand for people who can work in it.

If you want to help out by presenting your work and/or sponsoring the meetup, do get in touch. Otherwise, I look forward to meeting all of you at the London Machine Learning Meetup soon!

Jurgen is a dev manager at the e-commerce startup called Rangespan where he is helping to build the biggest catalog and most competitive marketplace in the world. Before joining Rangespan Jurgen was a research scientist at Microsoft. As part of the Bing Personalization team, Jurgen invented new statistical models for predictive tasks and helped develop Bing’s recommendation engines. Jurgen has a PhD in Machine Learning from the University of Cambridge, an MSc in Computer Science from the University of Wisconsin-Madison, an MSc in Informatics from the Catholic University of Leuven, and was a Fulbright Scholar.

The French Perl Workshop aka Journées Perl

Once again I am at the French Perl Workshop, aka Journées Perl. This year the Workshop is taking place in Strasbourg, the capital and principal city of the Alsace region in eastern France and the official seat of the European Parliament. Strasbourg is a lovely city of approximately a quarter of a million inhabitants. As I am sure you are aware, Strasbourg has been under French and German rule several times during its very turbulent past. It is very difficult to go further east in France than Strasbourg.

What Strasbourg is well known for:

  • The storks – unfortunately they no longer adorn the central buildings, but you can see hundreds of them in the Orangerie.
  • The seat of the European Parliament
  • The Gothic Cathedral with its famous astronomical clock
  • La Petite France – home of the black and white timber-framed buildings
  • Descriptive street names: Rue des Dentelles (Lace Street), Rue des Tonneliers (Barrel Makers Street), Rue des Charpentiers (Woodworkers Street), Rue des Serruriers (Locksmiths Street) etc.

I am told that Strasbourg hosts one of the nicest Christmas markets – something to think about for the future.

What was discussed at the conference

The talks were mainly in French with some English exceptions such as

Nothing breathtaking was announced – all the talks were about well-known subjects such as Perl 5.x, Perl 6, DBIx etc.

What was new and very encouraging

The organizers were a little afraid that a conference outside of Paris would be a disaster in terms of numbers. Not this time: there were a lot of delegates and also a lot of new blood. Students came and showed a lot of interest in Perl, and hopefully will continue to do so.

And now for something not so new!

After dinner, a select few ended up in a lovely park. You guessed it: this was Chartreuse time! The French Perl Workshop tradition, organized by Philippe Bruhat (aka BooK), is to drink Chartreuse late at night in a public area – nobody knows if it is legal, but who cares. Chartreuse is a liqueur made by the monks of the Ordre des Chartreux since the mid-1700s. It is composed of distilled alcohol aged with 130 herbal extracts. The liqueur is named after the monks’ Grande Chartreuse monastery, located in the Alps in the region of Grenoble in France.