Interviews

We have an exciting speaker lineup for Berlin Buzzwords 2014, so we have done some interviews with #bbuzz speakers before the conference begins. Learn a little bit more about them and what they will be presenting this year. Enjoy!

Stephan Ewen

Could you briefly introduce yourself?

I am a Ph.D. student at Technische Universität Berlin, currently finishing my degree. I have worked on the Stratosphere research project and, in the last months, on the Stratosphere open source system. My background is in (parallel) databases, query processing, distributed systems, and optimization.

How did you get started developing software?

When I was 14, my parents would not allow me to buy computer games, so I taught myself to program my own: simple ASCII jump 'n' run games in BASIC. Very simple stuff, but fun.

What do you hope to accomplish by giving this talk? What do you expect?

I want people to get excited about Stratosphere. The Stratosphere system follows a bit of a different paradigm and architecture than the other systems in the Hadoop space, and I hope that more people get to know about it. I hope that some people try it out and give the community some feedback. The best thing would be to get some people so excited that they become active in the open source effort.

What will your talk be about, exactly?

An overview of the Stratosphere system for data analytics. The current state, direction, what makes it unique. Approaches to iterative algorithms, program optimization, and robust program execution.

Have you enjoyed previous Berlin Buzzwords editions?

Yes, the past two years I have been either at Buzzwords, or at one of the attached hackathons.

When did you start contributing to Apache projects?

I am actually just starting, with Stratosphere entering the Apache Incubator. (The name Stratosphere will change, though; it is too close to the name of another Apache project.)

What was the first Apache project you got in touch with?

As a user, probably the webserver, or commons. As someone who reads and debugs the code, Hadoop.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

The "community over code" principle makes the community resilient and long living. Complex projects depend are often not led by the same people start to end, so knowledge and responsibilities need to spread among multiple people. Also, early adoption needs usually people who understand how the software works. Having a healthy communities promotes both.

What do you think are the risks when turning your pet (free software) project into an Apache project?

Community-driven decisions are typically slower than the ones made in hierarchical organizations. Worst case, the community cannot agree and nothing happens, or the project needs to split up. As far as I have seen and heard, this does not happen very often, though. We tried to adopt this way of working before submitting our Apache Incubator proposal, and it worked really well.

Coming from a research background - what was the biggest change when dealing with the open source community?

The part of the open source community that I am interacting with has been a pleasure to work with and actually no challenge at all. I find that the open source work is sometimes even more honest than the academic research world, because one gets better feedback on whether the work is actually solving a relevant problem. A thing that I noticed, though, is that the academic research world is more formal about terms and vocabulary, so everyone speaks roughly the same language. In the open source world it sometimes takes me a while to figure out that someone is using the same term for a totally different concept.

Looking at the great features of Stratosphere: what would you like its future to look like? Will it replace Hadoop? Will installations exist in parallel? What's the biggest change still needed to achieve this goal?

Hadoop is more a stack or ecosystem than an individual technology. I think that the Hadoop ecosystem will further diversify, with technologies like YARN and Mesos making this even easier. Many systems will exist next to each other: MapReduce, Stratosphere, Spark, Tez, ... But I do believe that the original MapReduce engine will lose significance over time, with the major higher-level languages and APIs (Hive, Pig, Cascading, ...) moving to Tez, and newer APIs like Stratosphere and Spark coming about.

 

Christoph Goller

Could you briefly introduce yourself?

My name is Christoph Goller. I am Head of Research at IntraFind. Besides Information Retrieval and Computational Linguistics, I am mainly interested in machine learning. Currently I am very excited about the resurrection of deep learning approaches (at Google labs and elsewhere), since I did my PhD in that field during the 90s.

How did you get started developing software?

I started with a Casio programmable calculator and later used Basic on a Sinclair ZX Spectrum.

What do you hope to accomplish by giving this talk? What do you expect?

I want to show that there is much more than just scalable search. Using NLP / Text Analytics tools like text classification and information extraction can improve search results considerably.

What will your talk be about, exactly?

Together with my colleague Breno Faria I am going to talk about our automatic tagging system and how we use it at Zeit Online, the online version of the German newspaper "Die Zeit".

Have you enjoyed previous Berlin Buzzwords editions?

This is my 4th Berlin Buzzwords conference and I am looking forward to all the search-related stuff.

When did you start contributing to Apache projects?

I started to contribute to Lucene in 2004. At that time there were not many Lucene committers. Studying Lucene code taught me a lot about Information Retrieval, and working on software that is used by so many people is just great. I still love to go into the details of Lucene code.

 

Ralf Herbrich

Could you briefly introduce yourself?

My name is Ralf Herbrich and I am Director of Machine Learning for Europe at Amazon. My passion is probabilistic modelling, statistics and applied research. I have worked for Microsoft Research, Facebook and Amazon and, after 13 years living in Cambridge, UK and Mountain View, US, I am now back home in Berlin.

How did you get started developing software?

When I was 14 years old, I was lucky to get my hands on a Sinclair ZX81 (with the 16KB memory extension module, no less!). Unfortunately, the machine had a broken data port, so I always started from scratch. I loved playing computer games, and it was a good exercise to have to write all the code on paper first. Whenever I wanted to play, I first needed to type in all the code. The next machine was an Amstrad PC1640 - a step change in storage and speed!

What do you hope to accomplish by giving this talk? What do you expect?

I hope the talk will demystify some of the expectations that people have when thinking about the challenges of putting science into products. Also, I hope to excite more people about probabilistic modelling - one of the most powerful analytical techniques to solve real-world problems.

What will your talk be about, exactly?

I am planning to share my experiences of transferring research technologies into products such as TrueSkill (Xbox Live's ranking system), the click-through-rate prediction system that was used in Bing, or the AI for the board game Go – explaining in simple terms both the science behind these technologies and the process it took to get from a mere idea to a mass-market product (feature).

Have you enjoyed previous Berlin Buzzwords editions?

This is my first Buzzwords - I used to live abroad and did not get a chance to participate earlier.

Why did you choose graphical models for TrueSkill?

Graphical models allow you to specify the procedural flow of information that produces the logs of real data – and then give efficient algorithms to invert this flow in order to compute the best (distribution) estimates for all unknown quantities from logged data. The framework fits the problem of ranking perfectly – modelling how a person's skill translates into their actual performance in a match, and how performance (differences) result in actual match outcomes that get logged whenever a party of people plays online.
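
In a simplified two-player form, that generative flow can be sketched as follows (a rough rendering of the idea, not the exact TrueSkill factor graph):

$$s_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \qquad p_i \sim \mathcal{N}(s_i, \beta^2), \qquad y = \operatorname{sign}(p_1 - p_2)$$

Here $s_i$ is a player's latent skill, $p_i$ the performance actually shown in a match, $\beta^2$ the performance variance, and $y$ the logged outcome; inference inverts this flow to update the skill estimates $(\mu_i, \sigma_i)$ from observed outcomes.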

Several machine learning algorithms aren't a particularly good fit for Apache Hadoop. Which ones stand out as being well suited for Hadoop, and which ones are particularly hard to implement in a scalable and efficient way?

Hadoop – or MapReduce in general – is very well suited for algorithms which aggregate on chunks of the data and can combine the aggregations without any additional communication between the aggregations. Thus, any algorithm which decomposes into one aggregation step and one combination step is a particularly good fit. In terms of sparse linear models, Naïve Bayes is one such algorithm. For decision-tree models, a random forest is another good example (each aggregation learns a decision tree from a sampled part of the training set, and the combination step creates a decision forest from the various decision trees). Iterative learning or prediction algorithms are not well suited for Hadoop.
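
As a minimal illustration of that decomposition (a hedged sketch with hypothetical names, not code from any particular framework), here are the two steps for simple label counts, the kind of statistic a Naïve Bayes model is built from:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: per-chunk label counts. Each "aggregate" runs independently on one
// chunk of the data (the map side); "combine" merges partial results without
// any further communication (the reduce side).
public class CountAggregation {

    // Aggregation step: compute partial counts from one chunk of labeled records.
    static Map<String, Long> aggregate(List<String> labelsInChunk) {
        Map<String, Long> partial = new HashMap<>();
        for (String label : labelsInChunk) {
            partial.merge(label, 1L, Long::sum);
        }
        return partial;
    }

    // Combination step: merge partial counts. Addition is associative and
    // commutative, which is exactly what makes the algorithm MapReduce-friendly.
    static Map<String, Long> combine(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<>(a);
        b.forEach((label, count) -> merged.merge(label, count, Long::sum));
        return merged;
    }
}
```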

Machine learning is still some sort of magic and black art today. What do you think is needed to make your average Joe developer understand how to apply machine learning to his problems, which algorithms to choose, and how to pre-process data?

I think part of the reason why machine learning is still a black art today is the focus on algorithms rather than on data and models of data. What is important for making machine learning more widely usable is to provide interfaces which allow developers to focus on the type and transformation of the data that will enter a machine learning algorithm, and to hide all the tuning and choices in a machine learning algorithm by automating parameter tuning. People often have deep insider knowledge of what each column of their data (table) means. Surfacing hard-to-predict examples and correlated data columns would make machine learning more easily usable by non-experts.

 

Andrew Psaltis

Could you briefly introduce yourself?

My name is Andrew Psaltis. Not to be too cliché, but I am a Big Data geek who loves building and hacking on in-the-moment large data systems. Currently I am a Principal Software Engineer at Ensighten, where I am deeply entrenched in building software to help redefine the marketing cloud. Previously, I was the primary architect of Webtrends Streams – the first scalable streaming architecture pushing real-time visitor and event-level digital intelligence into the marketing ecosystem. With this product, Webtrends won the Digital Analytics Association's 2013 New Technology of the Year Award (http://www.digitalanalyticsassociation.org/awards2013).

I am also currently authoring a book for Manning Publications titled The Art of Building In The Moment Big Data Applications.

When I’m not working and writing, I love to spend time with my lovely wife and our two kids, coach U9 boys' lacrosse, take in as much lacrosse as possible, bake (breads and pizza), hike, cycle, camp, and travel.

How did you get started developing software?

It seems like I have always been involved in software to some degree or another. However, it really got started for me when I was in grad school in the early 1990’s, pursuing dual master's degrees in Biomechanics and Exercise Biochemistry. At the time the university could not afford the software to analyze our biomechanics research, so naturally we wrote it ourselves. At that moment I was hooked; I had found my second love. However, it was not until I saw friends who were completing their doctoral studies searching for, and taking, one of the few jobs in a remote location that I thought to myself: “You know, I can continue my studies and be at the mercy of a job, or I can follow this software passion and work anywhere I can speak the language.” As the saying goes, I made a left-hand turn and never looked back.

What do you hope to accomplish by giving this talk?

I hope to accomplish several things:

1) To help inspire others to look at Spark Streaming.

2) To show an example of how to surface the results of Spark Streaming computations.

What do you expect?

I expect:

1) To learn as much from the audience and others as they get from me. I truly believe teaching is a two-way street.

2) To have an amazing experience and soak it all in.

What will your talk be about, exactly?

My talk will be about how to put together a system composed of Kafka, Spark Streaming, and WebSockets. Naturally there will be some other software involved, but this is really about getting data into and out of Spark Streaming. It is very easy to get data into Spark and Spark Streaming. But then what? I plan to lay out an example of how to structure a solution and also provide guidance on how to do it.
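
To make the input side concrete, here is a rough sketch assuming the Spark 1.x Kafka receiver API; the topic name, ZooKeeper address, and the WebSocket push are placeholders:

```java
import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class KafkaToSparkStreaming {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kafka-demo").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

        // One receiver thread on a hypothetical "events" topic.
        Map<String, Integer> topics = Collections.singletonMap("events", 1);
        JavaPairReceiverInputDStream<String, String> stream =
            KafkaUtils.createStream(jssc, "localhost:2181", "demo-group", topics);

        // Getting data *out* is the interesting part: foreachRDD runs once per
        // batch, which is where a push to a WebSocket endpoint would go.
        stream.foreachRDD(rdd -> {
            for (Tuple2<String, String> event : rdd.collect()) {
                System.out.println(event._2()); // placeholder for a WebSocket push
            }
            return null; // Spark 1.x foreachRDD expects a Function<..., Void>
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```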

Have you enjoyed previous Berlin Buzzwords editions?

This is my first Berlin Buzzwords and from looking at the schedule I anticipate it being fantastic. Unfortunately in the past I have only been able to watch the videos on YouTube.

What was the first Apache project you got in touch with?

Lucene. I have used it fairly regularly since late 2001 / early 2002. I love it and the tools that have evolved on top of and around it, such as Solr and Elasticsearch.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

I think it strikes this amazing balance of very hard-working, smart, energetic people who are after a common goal and all work towards the betterment of something. I have tried, as I am sure others have, to replicate this model inside organizations, but the organization always gets in the way of itself.

 

Dawid Weiss

Could you briefly introduce yourself?

My name is Dawid Weiss (yes, not a typo -- it's spelled with a 'w', as there's no 'v' in the Polish alphabet). I'm a Polish native, working and living in Poznań, Poland (very close to Berlin, actually!). I've always been passionate about coding: throughout my professional career I worked in the industry, then pursued my research interests in academia, resulting in a doctoral thesis, and currently I'm a co-owner of a company called Carrot Search, which does things somewhat related to my academic career (information retrieval, text processing and clustering). I'm also involved in a few open source projects and I'm an Apache Lucene PMC member.

What will your talk be about?

My involvement in Lucene is primarily in improving the test framework. The way I was taught to test software is probably similar to what most people get during their education -- unit, regression and integration testing, covering expected and special cases. Perhaps the crucial point is that tests don't change from run to run (they guard against regressions, but don't do anything novel). Lucene changes this paradigm and puts heavy emphasis on tests that explore random combinations of data and components. Every test run is different. This doesn't mean that if something fails there's no way to repeat the failure; it's just that with every new run there's a chance of coming across something that nobody thought of (or couldn't predict). This "randomized testing" approach has been very successful in Solr, Lucene and Elasticsearch, and I think it's a topic worth spreading beyond Lucene.
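
A tiny, self-contained sketch of the idea (plain JUnit with a hand-rolled seed, not Lucene's actual test framework):

```java
import java.util.Random;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class RandomizedExampleTest {

    @Test
    public void reverseTwiceIsIdentity() {
        // A fresh seed per run makes every run explore different inputs;
        // printing it lets anyone replay a failure deterministically.
        long seed = System.nanoTime();
        System.out.println("test seed: " + seed);
        Random random = new Random(seed);

        for (int iter = 0; iter < 1000; iter++) {
            // Generate a random lowercase string of random length.
            int len = random.nextInt(50);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < len; i++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
            String input = sb.toString();

            // The property under test: reversing twice restores the input.
            String roundTripped = new StringBuilder(input).reverse().reverse().toString();
            assertEquals("seed=" + seed, input, roundTripped);
        }
    }
}
```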

What do you hope to accomplish by giving this talk? What do you expect?

I hope to convey the idea that randomized testing is really useful in any project, not just Lucene. I'm not a radical -- it's not a replacement for traditional tests -- it's an additional tool to improve software. Lucene's testing framework also contains a few other interesting components that some folks may find useful in their work (like infrastructure to run tests concurrently, or to detect stray threads and excessive memory use). I will briefly introduce these components.

By now randomised testing probably is one of the most important factors when it comes to stabilising the Lucene code base, armoring it against bugs. Can you tell us a bit about where that idea came from?

To be honest I can't quite remember now -- I think randomization was present in Lucene when I joined the project; I just tried to give it a more structured and independent implementation (so that it could be reused in other projects). The idea itself has deep roots in the research literature; I will mention this during the presentation.

What is the most hilarious bug ever found through randomised testing - either in Lucene or elsewhere?

Perhaps the best class of bugs comes not from the Java codebase, but from the JVM (Java virtual machine) itself. Surprising as it may seem, Lucene tests notoriously hit JVM assertions and errors. Some of these have been fixed, some of them are still open. The problem with JVM bugs is that they rarely reproduce, or involve very complex code scenarios. I think the best-of-the-best bug must be related to the readVLong/readVInt methods, which are like a dozen lines of code in Java but cause headaches both in IBM's J9 and in Oracle's HotSpot implementations. There is one (known) bug that we can reproduce but which nobody has any clue how to fix (including HotSpot developers).

It's one of the drawbacks of randomized testing: you can hit a scenario which causes an exception, but it's hard to explain what is actually happening (and what the desired behavior should be).
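
For context, the decoding logic in question is tiny; a sketch along the lines of Lucene's DataInput.readVInt looks roughly like this:

```java
// Variable-length int decoding: seven payload bits per byte, with the high bit
// set on every byte except the last. A dozen lines of Java like this is the
// kind of hot loop that has tripped up both J9 and HotSpot JIT compilers.
static int readVInt(java.io.DataInput in) throws java.io.IOException {
    byte b = in.readByte();
    int value = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = in.readByte();
        value |= (b & 0x7F) << shift;
    }
    return value;
}
```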

Will you give your presentation standing on your hands?

I like to deliver interesting content in interesting ways, but I'm too old for upside-down public speeches. I promise to think of something to make the presentation not so boring (and I realize people fall asleep when they hear the phrase "software test[zzz...]ing").

Anything you are planning to hack on during Berlin Buzzwords?

We've been thinking about improving the detection of file descriptor leaks (like unclosed sockets and files). So if you have any ideas, feel free to find me!

 

Stefan Schadwinkel

Could you briefly introduce yourself?

Hi, my name is Stefan Schadwinkel. I hold a PhD from several years of neuroscience research, but my main background is computer science and artificial intelligence. I'm a co-founder of DECK36, a consulting agency based in Hamburg, Germany. We offer services in three main pillars: Automation & Operation, Architecture & Engineering, and "of course" Analytics & Data Logistics, which is the main area where I'm involved. In that area, we're using tools like Hadoop, Spark, and Storm as well as several NoSQL stores. Before that, I worked at the IT service provider of PokerStrategy.com, the largest online community for poker players. There, my main focus was analytical solutions to combat different kinds of online fraud.

What do you hope to accomplish by giving this talk? What do you expect?

I'd like to make one particular topic more interesting and much more recognized in the community: finding "duplicate" data. While it does not look too fancy at first glance, it actually extends way beyond questions of "data input errors" or integrating two previously separate databases. And while there is actually a lot of recent research in this area, relatively little shows up prominently in the world of big data. This is especially pronounced as the research is spread across multiple disciplines that use a wide array of terms to describe their fields. On the other hand, with increasing data volume, "schema-less" data storage, and more and more real-time processing, we have a lot of use cases for that particular topic. So, with my talk, I'd like to show that it can be a field that is quite worthwhile to explore.

What will your talk be about, exactly?

The talk will be, of course, about data de-duplication, but with a focus on use cases besides database consolidation. I'll give a brief intro to the field itself, its history and its main algorithmic approaches. Afterwards, I'll look at one particular algorithm that I find quite intriguing. It combines locality-sensitive hashing, an indexing scheme, and a handful of clever tricks in order to reduce the quadratic complexity of pairwise comparison with all other messages down to linear time. It is thus particularly well suited for finding fuzzy clusters of duplicates in real-time streams. After walking through the algorithm, I'll demonstrate an implementation of it that uses Storm and Riak.
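
The talk's exact algorithm aside, the core trick can be sketched in a few lines: hash each message into locality-sensitive buckets and run the expensive pairwise comparison only within a bucket, instead of against all n^2 pairs. A hedged MinHash-with-banding sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of LSH-style candidate generation (MinHash with banding). Messages
// that share a band signature land in the same bucket; only those few
// candidates get a full comparison.
public class LshBuckets {
    static final int NUM_HASHES = 12;  // minhash functions
    static final int BANDS = 4;        // 4 bands of 3 hashes each

    // One minhash signature per message (a message is a list of tokens here).
    static int[] minhash(List<String> tokens) {
        int[] sig = new int[NUM_HASHES];
        for (int h = 0; h < NUM_HASHES; h++) {
            int min = Integer.MAX_VALUE;
            for (String t : tokens) {
                // cheap per-function hash: mix the token hash with the function index
                int v = t.hashCode() * 31 + h * 0x9E3779B9;
                min = Math.min(min, v);
            }
            sig[h] = min;
        }
        return sig;
    }

    // Group messages by band key; each bucket holds near-duplicate candidates.
    static Map<String, List<List<String>>> buckets(List<List<String>> messages) {
        Map<String, List<List<String>>> buckets = new HashMap<>();
        int rows = NUM_HASHES / BANDS;
        for (List<String> msg : messages) {
            int[] sig = minhash(msg);
            for (int band = 0; band < BANDS; band++) {
                StringBuilder key = new StringBuilder(band + ":");
                for (int r = 0; r < rows; r++) {
                    key.append(sig[band * rows + r]).append(',');
                }
                buckets.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(msg);
            }
        }
        return buckets;
    }
}
```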

Have you enjoyed previous Berlin Buzzwords editions?

Yes, I enjoyed the 2012 edition very much. Unfortunately, I couldn't make it last year, so I'm all the more excited about being part of it this year.

Why did you choose Apache Storm for real-time analytics at DECK36?

Most of the time, real-time analytics is not the only processing pathway, and the use cases often have no long-established precedents. Therefore it is key to use an iterative approach in order to validate potential business value as early as possible. To do so, one needs tools that are very flexible, easily extensible, and, especially with new projects, don't create long-term vendor lock-in. Apache Storm fits this bill very nicely, one key feature being the ability to use shell components that can be written in practically any programming language. That allows for an evolutionary adoption that is easy and painless, empowers existing IT resources, and, by allowing access to the whole world running on the JVM, brings in new possibilities from beyond the limits of the current operational toolchain. An example would be enhancing the world of PHP-based web applications with real-time analytics capabilities that seamlessly integrate Java-based machine learning with a vast pool of existing PHP-based business logic. That being said, while we're big fans of Apache Storm, we of course keep our eyes on projects like Apache Samza, Apache Spark Streaming, and Yahoo Samoa as well.
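
A minimal sketch of such a shell component, assuming the Storm 0.9-era multilang API (the PHP script name is hypothetical):

```java
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

// Storm spawns the PHP process and exchanges tuples with it over
// stdin/stdout via the multilang protocol, so existing PHP business
// logic can sit directly inside a JVM-based topology.
public class PhpEnrichBolt extends ShellBolt implements IRichBolt {

    public PhpEnrichBolt() {
        super("php", "enrich.php"); // hypothetical script in the topology's resources/ directory
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```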

How do you see the domain of data science and its future development?

Having a background in both computer science and natural science, I've seen many examples of how seemingly distant domains overlap and how results can only be gained from that area of intersection. For "data science", this makes immediate sense, because domain knowledge does not help you when you lack the skills to practically crunch the data. On the other hand, without domain knowledge, crunching data alone will lead to nothing at best and to seriously dangerous conclusions at worst. This conundrum has existed in scientific research for a long time, and basically "data science" has mainstreamed scientific research into day-to-day business. "Big data" has made that (a) possible and (b) necessary.

And that's also the beauty of it. It makes tech people think about the business (domain knowledge), and it brings the business people to the latest tech developments. Their feedback on how to leverage it can then flow back into the development cycle early on. If both can work together efficiently, prototyping and evolving solutions and applications along the way, it can really benefit all parties involved: as people start to share interests, transparency becomes fundamental, and the mutual aim to constantly evolve the organization with short feedback cycles also empowers each individual. At its core, it brings people together through stories. If you keep telling your stories, you will preserve your ways, and there is value in that. But once you start telling stories backed up by data, things inevitably start to evolve. That, of course, goes beyond data science alone, but data science really moves things in that direction.

Do you see any interesting new big data trends?

One main trend that seems to be gaining momentum at the moment revolves around how big data relates to questions of data privacy and data security. While this is driven by multiple factors, it might not be widely known in the international community that Germany in particular seems to have an especially high interest in that area. I'm looking forward to seeing what emerges out of the current discussion around this topic in the near to medium term.

Next to specific topics like privacy there are, of course, further developments. Similar to how data science works as an area in which multiple fields overlap, many of the current trends seem to be sourced in such areas of intersection between multiple domains. For instance, commercial software and open source are drawing closer to each other. On one hand, open source tools move towards providing simple integration with more traditional software, especially in the areas of business intelligence and visual analytics. On the other hand, commercial vendors adopt more and more from the world of Hadoop into their own product portfolios. This coincides with two more tracks: (1) the tools in the open source big data sphere have matured a lot in the recent past, and best practices for common tasks emerged and evolved concurrently, and (2) people are retreating from the simplified "just throw more data at it" approach, as they have found that for certain use cases approaches like sampling, being probabilistic, and building customized niche algorithms yield far better results. In that regard, I also feel that larger organizations are becoming more open to using probability and data-driven methods, like randomly delivering multiple well-performing options as a baseline or using just the right amount of randomness to make personalized content more engaging.

Taken together, all these developments lead to two areas, the first one being an area where standard problems can be tackled much more easily than before by leveraging the proper tools and practices. This frees up resources and leads to the second area: tackling one's own custom niche with more specialized and sophisticated solutions, opening opportunities for new projects and smaller businesses. These can operate much faster than the big commercial players, but will integrate their solutions with them. Further, this again relates a lot to skills: people will increasingly need skills from different domains in order to thrive in such interdisciplinary areas of intersection. Whether we call them "data science" or something else, the demand for proper training and workshops will rise, both in the area of becoming proficient with specific tools and practices, and in the area of general, more abstract skills like becoming better at adapting, driving one's own evolution through iteration, so to speak, and systematic methods that allow one to "connect the dots" between different domains and integrate them into the current solutions that people work on.

Collaboration is another key to that end, and people facilitating the interaction between different departments will become even more important. Beyond that, the need for collaboration will also drive the emergence of new tools. Collaborative "open data" analytics projects are already on the rise, and it is likely that platforms will reflect that. My guess would be that there will be platforms centered not on the tools and methods, but on actual open data research projects that share their data and analytical status, similar to how GitHub operates with code. As a side effect, this would provide a repository of analytics use cases and success stories. This is something that is much needed at the moment to make the benefits of big data analytics more accessible and tangible to organizations that currently still struggle to understand the potential for their own specific business case.

So, to finish up: one can never be sure what the future brings, but I'm pretty sure it will be interesting.

 

Chris Hostetter

How did you get started developing software?

My memory is a bit fuzzy, but I'm pretty sure my very first bit of software development was around 3rd grade, when I figured out that the Apple ][ floppy disks my school had for educational games like "Oregon Trail" had (resource) files on them containing all of the words that appeared in the game.  By editing those files, my friends and I made the games say things like "Poop" and "You smell", which was the height of cool for a 3rd grader.

I also remember painstakingly typing in pages and pages of BASIC code that I found in BYTE magazines as a child, but in no way did I ever really understand what I was typing -- I was just a very low baud / high error "human modem" in a paper-based network connection.

I distinctly remember the first time I ever wrote (from scratch) a piece of software that did something useful for someone: I had a part-time job my senior year of high school doing data entry and page layout for a mortgage company.  I wrote a tool for my boss, using Excel macros, that would prompt for a ZIP code and would then tell you what City/State that ZIP code was in, according to a big table in a hidden Excel sheet.  It ran in constant time -- not because I was smart and used an efficient data structure, but because I didn't know how to "break" out of a loop, so even if it found the ZIP code in the first row of the table, it still checked every other row.

When did you start contributing to Open Source projects?

My first contributions to Open Source were back in the days when I didn't really understand what Open Source was.  As a college student getting paid to hack on the Perl scripts that were used to manage the dorm networks @ UC Berkeley in the 1990s, I spent a lot of time hanging out on comp.lang.perl.misc (and later perlmonks.org), asking questions about the stuff I didn't understand, and posting answers to the questions that I (was frequently surprised I) could.  As I got more knowledgeable, my posts occasionally crossed the line from "How" to "Why" and eventually "What if..." and "We should...."

Some of those posts (IIRC) lead to new features/APIs in some of the Perl modules I was using.

At the end of the day, that's what 90% of "contributing" to Open Source projects is all about: Participating in the community, asking questions, offering answers, and sharing suggestions.

When did you start contributing to Apache projects? What was the first Apache project you got in touch with?

The first Apache project I participated in was Apache Lucene.  A co-worker of mine @ CNET was working on a proof of concept to replace an existing search system.  His experiments didn't wind up going anywhere useful, but he told me about one cool piece of it that did work -- and how nice it was to build on a third-party API where he had the source code, so he could look under the hood when things didn't work.  Those comments stuck in my head.  Six months later, when I was working on my own proof of concept for building a new faceted search system, I remembered what he said about Lucene and thought it could potentially be a useful foundation for what I was trying to do, so I started looking into it.

My first contribution to Lucene was in the form of a question to the user list about the performance of RangeQuery vs doing the same thing in a Filter.  Trying to be thorough, I included a simple JUnit based micro-benchmark comparing how Lucene currently did things vs. my alternative idea that seemed faster, and asked why RangeQuery was recommended for this.  In response, Erik Hatcher said "You're spot on!" and committed my code and tests.  That was really the first time I realized how amazing Apache and Open Source in general could be compared to some of the commercial vendors I'd dealt with before.

My proof of concept faceted search system worked great -- largely thanks to a lot of great advice I got from other people in the Lucene community -- and was eventually folded in with a different proof of concept Yonik was working on in another group at CNET, which became Apache Solr.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

I think it all boils down to a key aspect of how the ASF is organized: it's a non-profit organization composed of individual people, independent of any specific corporate interests.

A lot of people have gotten burned by open source projects that existed at the will of a single company that controlled the project and its destiny.  A corporate overlord can help push projects forward with clear purpose -- but it can frequently be at the expense of growing a healthy community of developers who might have diverse views about how to improve the project.  If a company goes out of business, or loses interest in projects it is shepherding, a lot of the resources involved (including the infrastructure hosting the code/forums/bugtracker) may dry up.  Project code may still be out there and available under an OSS license, but communities can easily become fractured and have trouble recovering.  At the other end of the spectrum: a project may become too successful, leading corporate interests to change, so that they push for project changes (in licensing, APIs, release processes, etc...) that are motivated by what's best for the company instead of what's best for the project community as a whole.

The legal structure of the ASF makes these types of problems nearly impossible.  That leads to a lot of trust and confidence in Apache projects -- which leads to a diverse community of contributors (who help ensure that the projects evolve and grow), which in turn leads to diversity in the membership of the foundation itself, helping ensure that the ASF and its projects continue to stay strong and independent long after individual members or their companies may have moved on or shifted their goals and priorities.

This is your first Berlin Buzzwords - what do you expect?

- I expect to meet Amelia, and Mrs. Policeman.
- I expect Simon will make a rude hand gesture that distracts me during my talk.
- I expect an argument to break out at a bar, causing someone to open their laptop in anger to prove a point, leading to a brilliant pair-programmed patch by 2 (or more!) drunk developers.
- I expect Uwe to dance with my wife.
- I expect to attend at least one talk where I learn something that really excites me about a project I have not yet even heard of.
- I expect Berlin to be awesome.
- I expect Simon will be distracted by a rude hand gesture I will make during his talk.
- I expect to eat some really great sausages & pretzels.
- I expect Robert to oversleep and miss his flight home.

 

Adrien Grand

Could you briefly introduce yourself?

Hi, my name is Adrien Grand and I come from Caen (France). I like open-source software and search engines, so you won't be surprised that I am a committer on the Apache Lucene project and a software engineer at Elasticsearch. :-)

What will your talk be about, exactly?

I am going to talk about the why and how of aggregations with Elasticsearch. Elasticsearch has had faceted search for a long time, but aggregations take this concept to the next level by adding composability and new metrics that you can compute on your data: unique counts, percentiles, etc. I will give an overview of what aggregations can - or cannot - do, how they work and what it implies in terms of performance.
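
For a flavour of that composability, here is a hypothetical request using the Elasticsearch 1.x Java API (index and field names are made up): a terms aggregation with a nested cardinality metric computes the unique-user count per tag in a single pass.

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;

// Hypothetical example: top tags, each with a unique-user count nested inside.
public class AggregationExample {
    static SearchResponse topTagsWithUniqueUsers(Client client) {
        return client.prepareSearch("posts")
            .setSize(0) // we only want aggregation results, not hits
            .addAggregation(
                AggregationBuilders.terms("top_tags").field("tag")
                    .subAggregation(
                        AggregationBuilders.cardinality("unique_users").field("user_id")))
            .execute().actionGet();
    }
}
```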

Have you enjoyed previous Berlin Buzzwords editions?

Definitely, I like the topics, the human size of this conference and its high-quality talks. Moreover, it's a conference that several Lucene committers and power users attend, so it's a great place to discuss the future of the project!

When did you start contributing to Apache projects?

It was in April/May 2012. I had been using Apache Lucene and Solr in production for a bit more than a year and followed mailing-list and JIRA activity closely. When you use a technology in production daily, you don't need to think about what to improve; production tells you. :-) So I started submitting improvements in the Lucene and Solr JIRAs. A few weeks later I went to Berlin Buzzwords, where I met the people I had been interacting with on JIRA, and they announced that I had been nominated to become a committer! I was super happy; I definitely didn't expect it to happen that quickly!

What is your major goal as a committer to Apache Lucene? Which areas do you think need most attention?

It's hard to tell; Lucene has so many capabilities nowadays!

While most people probably know Lucene for its full-text search capabilities, it is also very good at structured search and, thanks to the columnar storage that was added in 4.0 (doc values), analytics. Lucene is also becoming better and better as a data store: it added stored-fields compression in version 4.1 and end-to-end checksums in version 4.8. Nowadays, Lucene might be a good fit for your application even if you don't have any full-text requirements. For example, there are lots of users of Elasticsearch who use it purely as a key-value store or for analytics, which is very exciting.

If I had to choose one area of Lucene that needs to be improved, I think I would pick highlighting. But I am convinced that there are still lots of things to improve in all areas!

 

Roman Shaposhnik

Could you briefly introduce yourself?

My name is Roman Shaposhnik, and during the day I work for the most exciting startup company around: Pivotal Inc. After dark, I hang around various Apache Software Foundation communities. At Pivotal my official title is Sr. Manager in charge of the Open Source Hadoop platform team. What we are focusing on is advancing the state of the entire Hadoop ecosystem and providing a seamless integration between Hadoop APIs/services and the rest of our ultimate offering, the Pivotal One platform. At the ASF I am serving as the current VP of the Apache Incubator and as one of the founding members of the Apache Bigtop project. I am also one of three authors of an upcoming Manning book, "Giraph in Action": http://www.manning.com/martella/

When did you start contributing to Apache projects?

My first experience with ASF communities was thanks to Yahoo!. I joined the company right after it decided to take a tiny little project called Hadoop and leverage the Apache Software Foundation for fostering a true open source community around it. When I joined in 2010, it was already clear that the decision to trust the ASF with licensing and governance of the project had turned out to be a stroke of genius. I really don't think that Hadoop and its ecosystem projects would dominate the industry as much as they do today had it not been for Apache.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

Two reasons: one boring and one super exciting. The boring one is the Apache License. It just so happens that the Apache License won the hearts and minds of legal departments everywhere. Just like it used to be that 'nobody gets fired for buying IBM', it is now true that 'nobody gets fired for managing parts of their software portfolio under the Apache License'. The second reason has to do with a general trend of open source slowly replacing what standards used to accomplish back in the '80s-2000s. Remember how it used to be? In order for the captains of the industry to agree on a technology roadmap, they had to come together at the level of ISO or IEEE committees, produce a 600+ page document, give it back to their developers and hope for the best. Nobody does that anymore in 2014. It's too slow. Open source software foundations producing a reference implementation of the de-facto standard are a much better way. There's no shortage of good open source software foundations around, but it just so happens that the ASF is one of the best when it comes to guaranteeing a very open, meritocracy-driven governance model around the project (and project == standard). We've just been at it longer than most others. Companies and individual developers trust us not to screw up their community fostering efforts, but instead to take them to the next level.

How did you get started developing software?

My mom got me the MK-61 (http://en.wikipedia.org/wiki/Elektronika_MK-61). It sported a single-row LED display and was meant for calculating the trajectories of ballistic missiles. It was a capable if somewhat boring device, and it was sure to turn me into a meticulous software developer. But then I came across an article in a hobbyist magazine outlining all of the undocumented features it had and how to exploit them for creating adventure games (a single-row LED display, remember?). The article completely blew my mind and turned a wannabe software developer into an aspiring hacker. In an extremely serendipitous turn of events, my first real computer was a Soviet clone of the PDP-11. For me, this little device instilled a big love for cleanly designed instruction sets, so it was no surprise that I later gravitated toward the ultimate hacker space: the UNIX software development culture.

Anything you are planning to hack on during Berlin Buzzwords?

Oh yes! There's a project I'm totally obsessed with lately. It is called OSv (http://osv.io/). OSv is designed from the ground up to execute a single application on top of a hypervisor, resulting in superior performance and effortless management. Especially for Java applications, the kind of performance improvements it is capable of are mind-blowing. And repeat after me: these are performance improvements of an *unmodified* application compared to running it on the very same *bare* host. Let it sink in for a minute. Yes, I am talking about running faster in a virtualized environment compared to running the very same application on the host itself. If at this point you are not itching to hack on OSv, I am not sure we can be friends anymore.

When deploying Apache Hadoop and friends to production - in your experience what is the biggest challenge?

The biggest challenge also happens to be the biggest benefit. Apache Hadoop and its ecosystem projects are still rapidly evolving. What this means is that, at times, enterprise-like features around backward compatibility, rolling upgrades, etc. tend to be prioritized closer to the bottom of the list by the open source development community. There's so much fundamental innovation going on that vendors such as Pivotal, Cloudera and Hortonworks have to step in and help the customer with the 'boring' stuff: deployment and maintenance. On top of the high rate of change, the other fundamental challenge is that Hadoop is a highly sophisticated distributed platform, and we're still not quite sure how to best manage something like that. Most of the IT practices in a datacenter are still built around managing individual hosts. I really hope that we can change the game at Pivotal with our PaaS solution. At the end of the day, it is just silly that we have to deploy Hadoop on 'bare metal hosts' when we can make that experience much easier by deploying it to the PaaS layer. And with OSv, it will even run faster. What's not to like?

 

Uwe Schindler

Could you briefly introduce yourself?

My name is often mentioned together with the term “Generics Policeman”. Yes, I am working on policing the code and build system of Apache Lucene to find all sorts of violations, but I also work on adding new features. As a good policeman I have all open source “guns” for code checking available. Outside of the Lucene infrastructure I am working for my own software company “SD DataSolutions GmbH” and “PANGAEA – Data Publisher for Marine Environmental Sciences”. Of course all this is closely connected to Lucene, Solr, and Elasticsearch.

If you were the god of Java - what would you change?

I would require that on every locale-sensitive operation the correct locale, time zone and character set has to be given explicitly. For example, Java’s lowercasing, which automatically does the transformation using the operating system’s default locale, is a horror for server-side applications. If you ever ran a program written on western computers in Turkey, you know what I am talking about. This newspaper headline shows you the consequences of not taking care of locales: “A Cellphone’s Missing Dot Kills Two People, Puts Three More in Jail.” At the very least, all those locale-sensitive functions in Java should be marked by an annotation, so you get a warning when you use them.
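
The classic demonstration is the Turkish dotless ı: the same call gives different results depending on the default locale, which is why careful server-side code passes an explicit locale (often Locale.ROOT):

```java
import java.util.Locale;

public class LocaleDemo {
    public static void main(String[] args) {
        // With an English locale: "title"
        System.out.println("TITLE".toLowerCase(Locale.ENGLISH));

        // With a Turkish locale the I becomes a dotless ı: "tıtle" --
        // a string comparison against "title" silently fails.
        System.out.println("TITLE".toLowerCase(new Locale("tr", "TR")));

        // Locale-independent, safe for protocol/identifier handling:
        System.out.println("TITLE".toLowerCase(Locale.ROOT));
    }
}
```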

You've found a ton of bugs even in the JVM itself - for a developer interested in reducing the bugs in their software - which strategies would you recommend for testing (independent of programming language)?

Lots of bugs stay hidden because you just do not test for them. In Lucene, we started a quite new approach which is also controversial: randomized testing. Opponents of this approach will tell you that the problem is not reproducible once you have found a bug. To take care of this, randomize your input data and settings, but save the “random seed”. Once you have found a bug, you can turn off the randomness by providing the seed and reproduce the failing build. There will be a talk at this conference by another Lucene committer, Dawid Weiss: “Randomize your tests and it will blow your socks off!”
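
A hand-rolled illustration of the save-the-seed trick (not Lucene's actual test framework; the property name is hypothetical):

```java
import java.util.Random;

// Pick a fresh random seed per run, unless -Dtests.seed=... pins it; a
// failing run can then be replayed exactly with the printed seed.
public class SeededRandomness {
    static Random newTestRandom() {
        String pinned = System.getProperty("tests.seed"); // hypothetical property name
        long seed = (pinned != null) ? Long.parseLong(pinned) : System.nanoTime();
        System.out.println("reproduce with: -Dtests.seed=" + seed);
        return new Random(seed);
    }
}
```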

When did you start contributing to Open Source, specifically to Apache projects?

I had my first contact with open source projects in 2003. At that time we were using the great iPlanet webserver, and it lacked good support for the PHP language. So I started to maintain and improve the NSAPI plugin in the PHP project – I am still a committer there. The first Apache project was Lucene, which I have used since early 2006 in the area of geo data. Lucene lacked fast numeric queries to implement geo search and date ranges, so I implemented my own query type. I donated it to Apache a bit later, in 2008, and was voted in as a committer. I became a PMC member quite soon, and recently I was voted in as the chair of the project. I also contribute to other ASF projects like Apache Tika and Ant.

Anything you are planning to hack on during Berlin Buzzwords? Where should attendees interested in Lucene try to meet you and ask questions?

I am not so good at hacking stuff at conferences – I am not comfortable with “hackathons” having the focus on “hack”; I need my home office, Skype, and espresso machine to do that effectively. I am attending conferences more to meet other people and of course the other Lucene committers! You may find me sitting together with Robert Muir and Simon Willnauer, planning new features or discussing Java bugs. Just join us and ask your question! Of course you are invited to join my talk and ask your questions there!

 

Steve Loughran

Could you briefly introduce yourself?

I'm Steve Loughran. I work at Hortonworks on interesting future bits of Hadoop, including improved cloud deployment, and currently on the "Slider" work to run arbitrary applications in a YARN cluster.

How did you get started developing software?

Oh, that was a long time ago. Playing on home computers back in the "BASIC was all you had" era of the 1980s, then ending up at university doing computer science.

What do you hope to accomplish by giving this talk? What do you expect?

I hope to give people curious about coding YARN applications insight into the concept and what they need to know. If you are prepared to invest the effort, it gives you a way to code applications that run in a Hadoop cluster, with the shared data, and to innovate on problems that nobody else has solved for you.

What will your talk be about, exactly?

Oh, that's a secret.

Actually, it's about how the core of a YARN application lives between the higher-level "Codd/Dijkstra" layer of code that does useful things and the darker "Lamport layer" of distributed computing, which the YARN infrastructure tries to handle for you. I'll talk about what I think you need to know and what you can use to get started, tools like Twill and Spring for YARN, and how you can architect and test your own applications. Though in a single session, I can't go into as much detail as I'd like.

Have you enjoyed previous Berlin Buzzwords editions?

Oh, I've loved them! Great event, fun people. I've also detoured to visit the Technical University of Berlin team who, along with others, were working on Stratosphere. That's now under incubation at the ASF, which shows how we can get great work from universities into open source, and how events like Berlin Buzzwords can help.

If you were the god of Hadoop- what would you change?

In Hadoop? I'd change the configuration model. To what, I don't know; just away from what we have today, which is a bit of a mess and very hard to debug.

When did you start contributing to Apache projects?

Probably about 2000. I was working on big Java projects before any of the IDEs were any good, and Ant turned out to be what was needed to build things. At the same time, JUnit introduced us all to test-driven development, which as a concept and a practice has transformed my life. I've turned out to be good at writing tests because I keep breaking things; it's a use for my clumsiness and inability to write code that works the first time.

What is your major goal as a committer to Apache Hadoop? Which areas do you think need most attention?

I think there are two areas that we should look at, and they aren't at the user level of applications:

1. Operations: Hadoop needs to be easier to run, and while my colleagues are collaborating with others on the Ambari management tools, we need to think about how Hadoop itself can be easier to manage, to diagnose problems, and to help ops teams improve performance. Configuration is one area, but another interesting one is using Hadoop to analyse its own generated data.

2. Cloud hosting. We need to make Hadoop better at running in a world where the response to a machine failure is a new one coming up with a different name, where persistent data is often in object stores, and where you can grow and shrink your Hadoop cluster based on demand. Netflix have done a lot of work here, and share their code, but I think there are changes we could make to core Hadoop to make it easier for people.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

It's always had a focus on people, rather than code, and that means you get to know lots of people across projects. This helps co-ordinate work, as well as solve problems.

What are the risks of going Apache?

You can't be so agile in your development. For something I'm doing at work next week, I'm going to make everyone else stop coding on Slider for 3-4 hours while I do a big renaming of packages and moving around of the code. You can't do that in something like Hadoop, because it's being worked on by too many people, there are too many branches, and all the pending patches, of which there are too many, would break.

What do you hope to accomplish by giving this talk? What do you expect?

I hope to encourage people to open their laptops, turn to their favourite editor, and start writing code that runs on a Hadoop YARN cluster.

What do you think are the risks when turning your pet (free software) project into an Apache project?

The focus on people means that it is no longer "your code", but a team project, where that team includes people whom you don't know, haven't worked with yet, and who have varying skills. You need to nurture them into contributing what they can, while still trying to maintain a coherent roadmap of where things are going.
 
Anything you are planning to hack on during Berlin Buzzwords?

I don't know yet. I think I might play with something outside the core Hadoop stack.

Where should attendees interested in Hadoop try to meet you and ask questions?

I'll be around...

Peter Karich

Could you briefly introduce yourself?

My name is Peter Karich, I’m a problem solver and former physicist from Germany. I like coding on algorithms, especially ones that involve graphs; in 2012 I started hacking on GraphHopper. I also worked for a company where I created the logging and application back end of a speech assistant.

How did you get started developing software?

I started programming at school with a snake clone on a very limited graphing pocket calculator which supported only a BASIC-like language, the CASIO CFX-9850G for those interested. It was relatively popular in my school, so clones appeared in lower classes with different author names ;). Later I moved to the PC (Linux), creating a chess clone and timetabling software in C/C++/Prolog, and later Java.

Have you enjoyed previous Berlin Buzzwords editions?

Yes, in 2012. It was a great experience, especially meeting committers of Apache Lucene and Elasticsearch.

If you were the god of GraphHopper - what would you change?

I am the god of GraphHopper so this question does not apply ;)

Why did you create GraphHopper at all when there are so many existing routing engines?

I found it curious that no existing Java routing engine was very fast and/or applied TDD, which is very important for such complex software. I was looking for a search engine similar to Lucene, but for finding paths in graphs.

What do you hope to accomplish by giving this talk? What do you expect?

I hope to give a good overview of what can be done with GraphHopper and also show how developers can apply the experience I gained to their own Java projects. I expect a good conference, nice people and good chats.

When did you start contributing to Open Source projects?

Probably around 2005.

 

André Bois-Crettez

Could you briefly introduce yourself?

I am André Bois-Crettez, a French software architect at Kelkoo, a shopping comparison website present in 13 countries. I have worked for this company for 10 years now in the city of Grenoble, on different projects, but in recent years it has been mainly about features and performance of our search engines. In the last few months, I started a new project using Hadoop and Spark for computation on big data.

How did you get started developing software?

By typing in BASIC and Logo turtle programs more than 25 years ago. My, time goes by! I remember that with BASIC, a difficulty was translating the English keywords into French, not so easy when you are 7 years old and learning to program! The Logo language I used had French keywords, which was much easier for me, and much more rewarding for generating fractal graphics with recursion. Later on I did some Turbo Pascal, first got accustomed to Unix commands with the Quake in-game console, then got formal training at university in the C, Ada, assembly and Java languages as well as computer science and operating systems principles.

What do you hope to accomplish by giving this talk? What do you expect?

With my colleague Anca Kopetz, we hope to give back to the community by explaining our experience implementing a high-performance search solution in production that uses Apache Solr, deals with high traffic, and uses a good number of features.

Have you enjoyed previous Berlin Buzzwords editions?

Not yet in person, but I did enjoy the slides/videos of past presentations. I look forward to discussing with people there!

Why did you choose Apache Solr for the shopping search engine at Kelkoo?

After Yahoo sold Kelkoo, we had only a few years to replace the Yahoo internal proprietary search engine. After studying several commercial and open-source engines for license prices, performance, features, and ease of customization, we settled on Solr. We will give more details during the talk!

When did you start contributing to Apache projects?

Simply by subscribing to the solr-user mailing list at the end of 2011, to learn from experienced people, and answering questions. Then in 2013, as we implemented our Solr engine, we were involved in a few JIRAs, proposing fixes and tests and explaining our needs. For a company it is sometimes not easy to choose what to keep internal and what to expose, but posting information publicly is beneficial for everybody, both for the project community and for our own production setup.

Many of today's buzzwordly topics have come from the Apache Software Foundation. What do you think makes Apache projects so successful, in particular for communities developing complex software?

I believe the ASF shows a good combination of the agility of pragmatic open source and the rules of incubating projects until a certain level of community maturity is met. As for buzzwordly talks, having Hadoop as the de-facto standard for open source Big Data is probably a reason it attracted other projects around this ecosystem.

Itamar Syn-Hershko

Could you briefly introduce yourself?

I've been working on search technologies and distributed systems for a while now, currently self-employed as a consultant and contractor doing lots of interesting stuff world-wide. I'm an Apache Lucene.NET committer, the author of a book, and a true believer in Open Source software.

How did you get started developing software?

Well, I have no recollection of the time _before_ I started developing software :)

What do you hope to accomplish by giving this talk? What do you expect?

Give people a better understanding of Elasticsearch's extensibility, and show them how various types of plugins can help them solve problems better. And also when it's not a good idea to use a plugin... I do expect the audience to know what Elasticsearch is and to have some hands-on experience with it. I will skip the basics and get straight to the point. We will use some Lucene concepts, but those will be explained as necessary.

What will your talk be about, exactly?

I'll be giving a bird's-eye view of Elasticsearch's plugin system: what is supported, what is not (or rather, not really), and what type of plugin to use when.

When did you start contributing to Apache projects?

I've been watching Lucene closely for many years now; I started doing so when I was the lead of the CLucene project (a port of Lucene to C++). However, that is not an Apache project, so to answer your question: about 2-3 years ago, when I became an Apache Lucene.NET committer.
 
Anything you are planning to hack on during Berlin Buzzwords? Where should attendees interested in Lucene try to meet you and ask questions?

Sure! I have no idea what I'll be hacking on (yet!), but make sure to come find me, I'll be where the coffee and beer are!

 

Robert Muir

Could you briefly introduce yourself?

I'm Robert Muir, I'm an Apache Lucene committer and PMC member. I work for Elasticsearch, which builds on top of Lucene.

What do you hope to accomplish by giving this talk? What do you expect?

Lucene has changed a lot in the past few years. The idea is to give an up-to-date summary of its capabilities. The hope is that this will be useful to folks who aren't very familiar with Lucene. Finally, the sections of the talk correspond to folders in the lucene source tree. Hopefully this will encourage some hacking.

Have you enjoyed previous Berlin Buzzwords editions?

Absolutely. It's my favorite conference: besides the German beer, it's great to see developers in Europe and put some faces to names. There is always a great variety of talks, and tons of things to absorb each time.

Lucene has a large number of committers and contributors. In your experience, is there anything one should not be doing to avoid being pulled in and subsequently losing lots of free time?

Free time? What's that?

I recommend working on what you are excited and passionate about. Besides producing better code, it usually equates to more fun. I think this goes for any open-source project.

What's your preferred brand of beer?

Sierra Nevada. Sometimes I brew my own beer too, with varying results. Also, beer in Germany is acceptable.

 

Radu Gheorghe

Could you briefly introduce yourself?

I'm Radu Gheorghe, I come from Romania and I'm a psychologist turned search consultant. It's a long story :) Now I work mostly with Elasticsearch at Sematext, either through search/logging consulting or on products such as Logsene, our log analytics SaaS. I'm also co-authoring Elasticsearch in Action.

How did you get started developing software?

When I was in the first grade, my grandparents had a Laser 500. It turned out I could write small apps in Basic that would do the math homework for me. Love at first sight. That love lasted until the 12th grade, when I decided to become a psychotherapist. But when I was a student, I got a job in IT, and the flame for my Ex revived. I could make a soap opera out of my career :)

What do you hope to accomplish by giving this talk? What do you expect?

Through our talk, we'll show just how similar Elasticsearch and Solr are in terms of functionality, and also the key differences. In the end, you'll know which of those differences have a technical impact on your projects, and which are just a matter of taste.

What will your talk be about, exactly?

Our talk will be a side-by-side demo of how one would do the main tasks in Elasticsearch and Solr. From indexing and searching, to scaling out and administering clusters. We will explain what needs to be done for either search engine and why.
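For a tiny flavor of what such a side-by-side looks like, here is a sketch in Python of indexing one document into each engine over their HTTP/JSON APIs. This is a generic illustration, not the talk's demo; it assumes both servers run locally on their default ports and that a Solr core named "products" exists.

import json
import urllib.request

doc = {"id": "1", "name": "Berlin Buzzwords mug"}

def send(url, payload, method="POST"):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method=method)
    return urllib.request.urlopen(req).read()

# Elasticsearch: the document API addresses index/type/id directly.
send("http://localhost:9200/products/product/1", doc, method="PUT")

# Solr: documents go to a core's update handler, followed by a commit.
send("http://localhost:8983/solr/products/update?commit=true", [doc])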

Have you enjoyed previous Berlin Buzzwords editions?

Yeeeees! It was so nice to meet people I otherwise knew only virtually. The sessions are nothing short of impressive, and it's a huge honor to be accepted as a speaker again.

 

Rafał Kuć

Could you briefly introduce yourself?

I'm Rafał Kuć, living near the eastern border of Poland (in Białystok). I'm a consultant and a software engineer at Sematext, Inc. Currently I focus mostly on Solr and Elasticsearch, sharing my time between consulting, working on Logsene, and blogging. I also happen to be a book author, with titles about Solr and Elasticsearch. And most of all, I'm a father of two great kids.

How did you get started developing software?

I started as a kid. My parents bought an Atari 65XE with some games. After a while I started to wonder how those games are brought to life, and I started playing around with Basic. Later came the Amiga, where I did some game-related development, but it was nothing one could consider serious - just for fun, with two friends. But those were the beginnings.

What do you hope to accomplish by giving this talk? What do you expect?

I hope the audience will learn that both Solr and Elasticsearch are great products and that each of them can be successfully used in most cases. We will also show the key differences between the two and discuss them a bit. I would really like people to leave knowing the differences and what to expect from both Elasticsearch and Solr: how to use both of them in similar situations, what to expect from each of them, and where each puts its emphasis in terms of functionality.

What will your talk be about, exactly?

A side-by-side demo of Solr and Elasticsearch - however, we aim for it not to be a direct comparison or a "vs" session. We will concentrate on showing how to achieve the same thing in both of those great search engines - from indexing, through search, ending with analytics.

Have you enjoyed previous Berlin Buzzwords editions?

Of course I enjoyed them - a lot, even.

Many of the nowadays buzzwordly talks have come from the Apache Software Foundation. What do you think makes Apache projects so successful in particular for communities developing complex software?

I think there are many reasons that bind together, each bringing value of its own. In most cases, community over code is what pays off in the end. I guess people who know each other, understand each other, and try to accomplish similar goals are finally able to bring out a great product. Of course that takes time, patience, hours of coding, etc. In addition, I think that having a great community, like most ASF projects have, is very encouraging, and this is one of the things that keeps pushing forward even the smallest Apache projects.

 

Alexander Sibiryakov

Could you briefly introduce yourself?

Well, I see myself as an enthusiast for data-related problems. I’m interested in challenges from data mining to high-performance systems. However, data mining is always limited by the data you work with, because the approaches to getting signals out of the data differ significantly. I’m good at full-text search problems in the modern web environment. At the same time, I have solved a lot of design, development, and maintenance problems in modern big data systems at the tremendous scale of a web search giant.

How did you get started developing software?

In 1993-94 I got my first computer: a «Kvorum», a Russian ZX Spectrum clone. 48KB of memory, all storage on tape, and a 3.5MHz Z80 processor with color output on a TV. I was writing relatively small programs (e.g. a «Game of Fifteen» emulator) in Basic dialects and Z80 assembler. That was a fantastic time! My friends and I thought that everything was possible on the Z80.

What do you hope to accomplish by giving this talk? What do you expect?

I hope I’ll make connections: people and companies who are interested in this area. Of course, getting some feedback on what people think about investing time in search quality would also be great.

What will your talk be about, exactly?

It’s about how to achieve search effectiveness for the end user. Current Apache projects, and their modified versions from commercial companies, have made search easy to install and maintain. But I see almost no attention paid to the problems which arise during actual use of the search. On what classes of queries does the system perform worst/best? How do you improve the quality on one class while staying at least at the same level on the others? What data is available in a production system, and how can it be reused effectively? Are the snippets representative enough? All these questions I’m going to address in my talk. The biggest challenge for me is how to fit all that into 40 minutes and make it easy to understand for everybody. We’ll see! So, see you at BB 2014!

A lot of NLP libraries are well tuned for the English language. Can you share more on what challenges arise when analysing Russian text?

IMO, the biggest challenge is the absence of big, properly labeled data sets for the Russian language, which are needed to tune and evaluate NLP algorithms.

 

Eric Evans

Could you briefly introduce yourself?

My name is Eric Evans, I'm a long-time Free Software hacker living in San Antonio, Texas. I've spent most of my career scaling large Internet services, and have a passion for distributed systems. I'm currently Chief Architect at The OpenNMS Group, a services and support company behind the OpenNMS network management platform.

When did you start contributing to Open Source projects?

My earliest contributions date back to somewhere around 1999 or 2000. By 2004 I was a regular contributor to OpenNMS (10 years ago, wow!), and in 2005 I joined the Debian project.

When did you start contributing to Apache projects?

Cassandra was the first Apache project I contributed to. I was working in R&D at Rackspace, and researching scalable storage when Facebook threw Cassandra over the wall in 2008.  We were eager to contribute at the time, but found ourselves hamstrung. Fortunately, a year later Cassandra entered the Apache incubator, and the rest is history.

What will your talk be about, exactly?

I'll be talking about OpenNMS as a use case for storage and analytics of time-series data. I'll briefly explain how we've been doing this, and the challenges that led us to choose Cassandra as a storage replacement. Finally, I'll introduce Newts, a novel open source time-series data store that we've been working on.

What do you hope to accomplish by giving this talk? What do you expect?

I think it's common to assume that the point of presenting a talk is to educate or inform the audience. While that may be one result, I'm not ashamed to admit that I hope to start a dialog that informs and educates *me*. It would also be fantastic to walk away with new users and contributors!

Have you enjoyed previous Berlin Buzzwords editions?

Absolutely! Buzzwords is easily the best conference of its kind. The opportunities you have to learn something new or network with like-minded individuals are hard to overstate. Anyone with an interest in search, scale, and storage should find a way to attend.

 

Katherine Daniels

Could you briefly introduce yourself?

I’m Katherine Daniels and I’m currently heading up operations at GameChanger Media in New York. I’ve been working in operations or system administration roles at startups for a few years, and in a previous life I did systems engineering and R&D at Hewlett-Packard in Colorado. When I’m not working, I can usually be found playing violin, rock climbing, or homebrewing.

How did you get started developing software?

The first programming I ever did was figuring out enough TI-BASIC to get my calculator to do some of my math homework for me, followed by some web development back in the day when the <blink> tag was still cool. After completing my CS degree, my work involved a mix of both software development and hardware testing, though it wasn’t until recently that I discovered the joys of operations and configuration as code.

What do you hope to accomplish by giving this talk? What do you expect?

I’m hoping to provide a fresh perspective on the topic of devops. I’d like to get people interested in new ways that the principles of the devops movement can be used to improve culture and productivity both in and out of engineering departments. I expect there will be questions about how this can be accomplished in organizations of various sizes, but hopefully also enthusiasm about positive changes that can be made.

What will your talk be about, exactly?

I’ll be talking about the concept of ‘devops’ - it’s a term that has become so popular some have said that it’s been overused, even lost its original meaning. I will discuss what it is and what it isn’t, how these principles can be used to spread their benefits to more parts of an organization than just development and operations teams, and what the future of the devops movement might look like.

There's often quite some friction between those developing and those operating a software system. Do you have some hints for those affected on how to improve the situation?

I’ve found that friction between those teams is often due to conflicting goals and priorities, or insufficient communication to allow teams to align those priorities. Try to focus on common ground between the different teams or groups, investigate how responsibilities can be shared, and concentrate on shared goals (like delighting your customers).

Automatically identifying when something goes wrong, in a reliable fashion, is often one of the goals of monitoring and alerting - any insights you can share on what to avoid when trying to make alerts more reliable?

When in doubt, err on the side of under-alerting. It might sound counterintuitive, but alerting too much leads to alert fatigue, reducing overall responsiveness by conditioning engineers to start ignoring alerts. Reducing alert noise is very important - few things are less reliable in this regard than stressed-out human beings trying to quickly figure out which of a long list of alerts is actually important.

In contrast to pure software development, operations still seems to be a black art, often acquired on the job while watching others. For a student interested in learning more about successfully operating systems and keeping them up 24/7 - how would you recommend they get started?

I’d recommend checking out Ops School, which is a great resource that’s actively being developed to teach people about operations. Aside from that, practice! Personal web, media, even Minecraft servers can provide hands-on experience configuring and maintaining a server. Students could also look for work in school computer labs, especially in CS departments.

 

Peter Bourgon

Could you briefly introduce yourself?

I'm Peter Bourgon, a backend software engineer. I've been programming for almost 20 years now, and professionally for over a decade. I've worked in embedded systems, telecom, financial and legal data systems, and now at SoundCloud, where I'm focusing on large-scale distributed software.

How did you get started developing software?

My first computer was an IBM 8086, and I started writing QBasic in elementary school. I was always fascinated by the net culture: local BBSes, FidoNet, TradeWars, moving up to CompuServe, and finally the real internet at 9600 baud in the mid-90s. I look at programming like carpentry or painting: a way to get an idea from my brain into reality. Computers can be this nearly frictionless vector for our innate and virtuous desire for creation. I find a lot of happiness in that.

Quite a bit of SoundCloud's functionality is based on search - can you share some of the more surprising anecdotes you encounter in your daily work?

The most surprising thing to learn about search at SoundCloud was how differently our users treat the search box. A single text input field needed to be a precise, specific artist locator, while at the same time launching expansive genre-based exploration sessions. It's tricky to find the right balance! Another interesting thing is how many user experiences you'd think are best solved with a typical search product are actually better served with more tailored technology. SoundCloud's autocompleter drives a lot more traffic than we'd ever expected, for example.

Not everyone has a cluster of their own - what would be your recommendation to students interested in scalable systems to gain experience?

No company or organization would expect you to know how to operate a specific piece of software, to know what all the config options are, or what the specific error messages mean, on day 1. In fact, I think spending too much time with a single scalable system is actually harmful. Your value as a distributed systems engineer is your ability to think critically and abstractly, about any kind of system. So it's incredibly important to develop a good understanding of the fundamentals: process management and scheduling, network protocols and failure modes, queueing theory, distributed data structures, and so on. You've got your whole career to specialize; as a student, you should be building your foundation.

What do you hope to accomplish by giving this talk? What do you expect?

Mostly I hope to get everyone familiar with the concept of CRDTs, and how they're totally practical tools that more developers should be considering and using. But I also want to be a sort of data point for a different kind of development philosophy. From where I sit, too many software engineers have resigned themselves to being software plumbers: solving problems by picking huge black-box data systems from the shelf, and plumbing them together without really understanding them. I believe, on balance, this type of development is deleterious to the engineers who do it, and ultimately to our profession as a whole. I believe if software engineers study theory (like CRDTs) rather than implementations, they can build software that's simpler, faster, and more elegant than the off-the-shelf solutions, without necessarily taking longer or sacrificing features.
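For readers who haven't run into CRDTs yet, here is a minimal sketch of a G-Counter, one of the simplest CRDTs, in Python. It is an illustration of the general idea only, not code from the talk: each replica increments its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order or duplication of merges.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count observed at that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent: merges can be
        # applied in any order, any number of times.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

# Two replicas counting independently, then converging:
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5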

 

Mikio Braun

Could you briefly introduce yourself?

I did a Ph.D. in machine learning and worked as a PostDoc for the past ten years, but I was always very much interested in the practical side and applications. At some point I wrote a Java library for matrix computations based on the FORTRAN libraries; I got a bit carried away writing all kinds of magic code generators in Ruby, but that sort of got me started with Open Source software. In 2009 we founded TWIMPACT, a startup working on real-time social media analytics, which eventually led to streamdrill, our current startup focused on real-time analytics in general. I'm currently very interested in approaches to real-time beyond scaling, in particular approximate algorithms.

What will your talk be about, exactly?

The talk will be about ways to do real-time user profiling and recommendations with efficient algorithms which continually update the results without going through lots of storage or big compute clusters. I think it is possible, but it's also not trivial. You have to use a combination of algorithmic modifications to the analysis algorithms themselves and approximate data structures to control the resource usage. I'll cover both aspects in my talk.
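As a taste of the kind of approximate data structure mentioned here, below is a minimal count-min sketch in Python. This is a generic illustration, not streamdrill's implementation: it estimates per-item counts within a bounded error using a fixed amount of memory, no matter how many events stream past.

import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash-derived column per row; seeding with the row
        # index simulates independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum
        # across rows is the tightest (over-)estimate.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for track in ["a", "a", "b", "a"]:
    cms.add(track)
print(cms.estimate("a"))  # 3, barring collisions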

What do you hope to accomplish by giving this talk? What do you expect?

Ideally, I hope to make people more aware of alternatives for tackling big data problems beyond just throwing a lot of computing power at them. I also hope to get the basic ideas of that approach across, so people have a starting point for approaching their own problems in that way.

Have you enjoyed previous Berlin Buzzwords editions?

I have, actually; this is the third year I'm giving a talk there, and I have always liked the mixture in the audience very much. I think it strikes a very good balance between being too academic and just being people who demo stuff.

For a great developer without any background in machine learning or statistics: What would be your recommendation to get started?

Well, on the one hand, having a firm understanding of the concepts of linear algebra and probability theory is essential, but it's also important to have practical experience with data and algorithms and to develop a good intuition about what will work on which problems and what won't. One also has to understand concepts like overfitting and how to properly evaluate the performance of an algorithm. It's just too easy to train a classifier on some data and get good results, only to have it perform very badly on new data. So I'd recommend picking a few standard algorithms and some data sets and working through them. There are also lots of good books on how to do data analysis in Python or R, like "Doing Data Science" by O'Neil and Schutt, "Python for Data Analysis" by McKinney, or "Machine Learning for Hackers" by Conway and White.
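To make the overfitting point concrete, here is a small Python sketch using scikit-learn (my choice of library, not the interviewee's): training accuracy can look perfect while accuracy on held-out data tells the real story.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# An unconstrained decision tree can memorize the training set.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # 1.0 -- memorized
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower

This is why one always evaluates on data the model has never seen.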