Predictive Analytics World 2012

At the end of November 2012 top predictive analytics experts, practitioners, authors and business thought leaders met in London at Predictive Analytics World conference. Cameral nature of the conference combined with great variety of experiences brought by over 60 attendees and speakers made a unique opportunity to dive into the topic from Findwise perspective.

Dive into Big Data

In the Opening Keynote, presented by Program Chairman PhD Geert Verstraeten, we could hear about ways to increase the impact of Predictive Analytics. Unsurprisingly a lot of fuzz is about embracing Big Data.  As analysts have more and more data to process, their need for new tools is obvious. But business will cherish Big Data platforms only if it sees value behind it. Thus in my opinion before everything else that has impact on successful Big Data Analytics we should consider improving business-oriented communication. Even the most valuable data has no value if you can’t convince decision makers that it’s worth digging it.

But beeing able to clearly present benefits is not everything. Analysts must strive to create specific indicators and variables that are empirically measurable. Choose the right battles. As Gregory Piatetsky (data mining and predictive analytics expert) said: more data beats better algorithms, but better questions beat more data.

Finally, aim for impact. If you have a call center and want to persuade customers not to resign from your services, then it’s not wise just to call everyone. But it might also not be wise to call everyone you predict to have high risk of leaving. Even if as a result you loose less clients, there might be a large group of customers that will leave only because of the call. Such customers may also be predicted. And as you split high risk of leaving clients into “persuadable” ones and “touchy” ones, you are able to fully leverage your analytics potencial.

Find it exciting

Greatest thing about Predictive Analytics World 2012 was how diverse the presentations were. Many successful business cases from a large variety of domains and a lot of inspiring speeches makes it hard not to get at least a bit excited about Predictive Analytics.

From banking and financial scenarios, through sport training and performance prediction in rugby team (if you like at least one of: baseball, Predictive Analytics or Brad Pitt, I recommend you watch Moneyball movie). Not to mention Case Study about reducing youth unemployment in England. But there are two particular presentations I would like to say a word about.

First of them was a Case Study on Predicting Investor Behavior in First Social Media Sentiment-Based Hedge Fund presented by Alexander Farfuła – Chief Data Scientist at MarketPsy Capital LLC. I find it very interesting because it shows how powerful Big Data can be. By using massive amount of social media data (e.g. Twitter), they managed to predict a lot of global market behavior in certain industries. That is the essence of Big Data – harness large amount of small information chunks that are useless alone, to get useful Big Picture.

Second one was presented by Martine George – Head of Marketing Analytics & Research at BNP Paribas Fortis in Belgium. She had a really great presentation about developing and growing teams of predictive analysts. As the topic is brisk at Findwise and probably in every company interested in analytics and Big Data, I was pleased to learn so much and talk about it later on in person.

Big (Data) Picture

Day after the conference John Elder from Elder Research led an excellent workshop. What was really nice is that we’ve concentrated on the concepts not the equations. It was like a semester in one day – a big picture that can be digested into technical knowledge over time. But most valuable general conclusion was twofold:

  • Leverage – an incremental improvement will matter! When your turnover can be counted in millions of dollars even half percent of saving mean large additional revenue.
  • Low hanging fruit – there is lot to gain what nobody else has tried yet. That includes reaching for new kinds of data (text data, social media data) and daring to make use of it in a new, cool way with tools that weren’t there couple of years ago.

Plateau of Productivity

As a conclusion, I would say that Predictive Analytics has become a mature, one of the most useful disciplines on the market. As in the famous Gartner Hype, Predictive Analytics reached has reached the Plateau of Productivity. Though often ungrateful, requiring lots of resources, money and time, it can offer your company a successful future.

Impressions of GSA 7.0

Google released Google Search Appliance, GSA 7.0, in early October. Magnus Ebbesson and I joined the Google hosted pre sales conference in Zürich where we had some of the new functionality presented and what the future will bring to the platform. Google is really putting an effort into their platform, and it gets stronger for each release. Personally I tend to like hardware and security updates the most but I have to say that some of the new features are impressive and have great potential. I have had the opportunity to try them out for a while now.

In late November we held a breakfast seminar at the office in Gothenburg where we talked about GSA in general with a focus on GSA 7.0 and the new features. My impression is that the translate functionality is very attractive for larger enterprises, while the previews brings a big wow-factor in general. The possibility of configuring ACLs for several domains is great too, many larger enterprises tend to have several domains. The entity extraction is of course interesting and can be very useful; a processing framework would enhance this even further however.

It is also nice to see that Google is improving the hardware. The robustness is a really strong argument for selecting GSA.

It’s impressive to see how many languages the GSA can handle and how quickly it performs the translation. The user will be required to handle basic knowledge of the foreign language since the query is not translated. However it is reasonably common to have a corporate language witch most of the employees handle.

The preview functionality is a very welcome feature. The fact that it can highlight pages within a document is really nice. I have played around to use it through our Jellyfish API with some extent of success. Below are two examples of usage with the preview functionality.

GSA 7.0 Preview

GSA 7 Preview - Details

A few thoughts

At the conference we attended in Zürich, Google mentioned what they are aiming to improve the built in template in the GSA. The standard template is nice, and makes setting up a decent graphical interface possible for almost no cost.

My experience is however that companies want to do the frontend integrated with their own systems. Also, we tend to use search for more purposes than the standard usage. Search driven intranets, where you build intranet sites based on search results, is an example where the search is used in a different manner.

A concept that we have introduced at Findwise is search as a service. It means that the search engine is a stand-alone product that has APIs that makes it easy to send data to it and extract data from it. We have created our own APIs around the GSA to make this possible. An easy way to extract data based on filtering of data is essential.

What I would like to see in the GSA is easier integration with performing search, such as a rest or soap service for easy integration of creating search clients. This would make it easier to integrate functionality, such as security, externally. Basically you tell the client who the current user is and then the client handles the rest. It would also increase maintainability in the sense of new and changing functionality does not require a new implementation for how to parse the xml response.

I would also like to see a bigger focus of documentation of how to use functionality, previews and translation, externally.

Final words

My feeling is that the GSA is getting stronger and I like the new features in GSA 7.0. Google have succeeded to announce that they are continuously aiming to improve their product and I am looking forward for future releases. I hope the GSA will take a step closer to the search as a service concept and the addition of a processing framework would enhance it even further. The future will tell.

Search in SharePoint 2013

There has been a lot of buzz about the upcoming release of Microsoft’s SharePoint 2013, how about the search in SharePoint 2013? The SharePoint Server 2013 Preview has been available for download since July this year, and a few days ago the new SharePoint has reached Release to Manufacturing (RTM) with general availability expected for the first quarter of 2013.

If you currently have an implementation of SharePoint in your company, you are probably wondering what the new SharePoint can add to your business. Microsoft’s catchphrase for the new SharePoint is that “SharePoint 2013 is the new way to work together”. If you look at it from a tech perspective, amongst other features, SharePoint 2013 introduces a cloud app model and marketplace, a redesign of the user experience, an expansion of collaboration tools with social features (such as microblogging and activity feeds), and enhanced search functionality. There are also some features that have been deprecated or removed in the new product, and you can check these on TechNet.

Let’s skip now to the new search experience provided out-of-the-box by SharePoint 2013. The new product revolves around the user more than ever, and that can be seen in search as well. Here are just a few of the new or improved functionalities. A hover panel to the right of a search result allows users to quickly inspect content. For example, it allows users to preview a document and take actions based on document type. Users can find and navigate to past search results from the query suggestions box, and previously clicked results are promoted in the results ranking. The refiners panel now reflects more accurately the entities in your content (deep refiners) and visual refiners are available out-of-the-box. Social recommendations are powered by users’ search patterns, and video and audio have been introduced as new content types. Some of the developers reading this post will also be happy to hear that SharePoint 2013 natively supports PDF files, meaning that you are not required anymore to install a third-party iFilter to be able to index PDF files!

Search Overview in SharePoint 2013

Search Overview in SharePoint 2013, Source: Microsoft SharePoint Team Blog

While the out-of-the-box SharePoint 2013 search experience sounds exciting, you may also be wondering how much customization and extensibility opportunities you have. You can of course search content outside SharePoint and several connectors that allow you to get content from repositories such as file shares, the web, Documentum, Lotus Notes and public Exchange folders are included. Without any code, you can use the query rules to combine user searches with business rules. Also, you can associate result types with custom templates to enrich the user experience. Developers can now extend content processing and enrichment, which previously could have only be achieved using FAST Search for SharePoint. More than that, organizations have the ability to extend the search experience through a RESTful API.

This post does not cover all the functionalities and if you would like to read more about what changes the new SharePoint release brings, you can start by checking the TechNet material and following the SharePoint Team Blog and the Findwise Findability Blog, and then get in touch with us if you are considering implementing SharePoint 2013 in your organization or company.

Findwise will attend the SharePoint Conference 2012 in Las Vegas USA between 12-15 November and this will be a great opportunity to learn more about the upcoming SharePoint. We will report from the conference from a findability and enterprise search perspective. Findwise has years of experience in working with FAST ESP and SharePoint, and is looking forward to discussing how SharePoint 2013 can help you in your future enterprise search implementation.

Analyzing the Voice of Customers with Text Analytics

Understanding what your customer thinks about your company, your products and your service can be done in many different ways. Today companies regularly analyze sales statistics, customer surveys and conduct market analysis. But to get the whole picture of the voice of customer, we need to consider the information that is not captured in a structured way in databases or questionnaires.

I attended the Text Analytics Summit earlier this year in London and was introduced to several real-life implementations of how text analytics tools and techniques are used to analyze text in different ways. There were applications for text analytics within pharmaceutical industry, defense and intelligence as well as other industries, but most common at the conference were the case studies within customer analytics.

For a few years now, the social media space has boomed as platforms of all kinds of human interaction and communication, and analyzing this unstructured information found on Twitter and Facebook can give corporations deeper insight into how their customers experience their products and services. But there’s also plenty of text-based information within an organization, that holds valuable insights about their customers, for instance notes being taken in customer service centers, as well as emails sent from customers. By combining both social media information with the internally available information, a company can get a more detailed understanding of their customers.

In its most basic form, the text analytics tools can analyze how different products are perceived in different customer groups. With sentiment analysis a marketing or product development department can understand if the products are retrieved in a positive, negative or just neutral manner. But the analysis could also be combined with other data, such as marketing campaign data, where traditional structured analysis would be combined with the textual analysis.

At the text analytics conference, several exciting solutions where presented, for example an European telecom company that used voice of customer analysis to listen in on the customer ‘buzz’ about their broadband internet services, and would get early warnings when customers where annoyed with the performance of the service, before customers started phoning the customer service. This analysis had become a part of the Quality of Service work at the company.

With the emergence of social media, and where more and more communication is done digitally, the tools and techniques for text analytics has improved and we now start to see very real business cases outside the universities. This is very promising for the adaptation of text analytics within the commercial industries.

Presentation: The Why and How of Findability

“The Why and How of Findability” presented by Kristian Norling at the ScanJour Kundeseminar in Copenhagen, 6 September 2012. We can make information findable with good metadata. The metadata makes it possible to create browsable, structured and highly findable information. We can make findability (and enterprise search) better by looking at findability in five different dimensions.

Five dimensions of Findability

1. BUSINESS - Build solutions to support your business processes and goals

2. INFORMATION - Prepare information to make it findable

3. USERS - Build usable solutions based on user needs

4. ORGANISATION - Govern and improve your solution over time

5. SEARCH TECHNOLOGY - Build solutions based on state-of-the-art search technology

Using log4j in Tomcat and Solr and How to Make a Customized File Appender

This article shows how to use log4j for both tomcat and solr, besides that, I will also show you the steps to make your own customized log4j appender and use it in tomcat and solr.

Default Tomcat log mechanism

Tomcat by default uses a customized version of java logging api. The configuration is located at ${tomcat_home}/conf/logging.properties. It follows the standard java logging configuration syntax plus some special tweaks(prefix property with a number) for identifying logs of different web apps.

An example is below:

handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

1catalina.org.apache.juli.FileHandler.level = FINE

1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

1catalina.org.apache.juli.FileHandler.prefix = catalina.

2localhost.org.apache.juli.FileHandler.level = FINE

2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

2localhost.org.apache.juli.FileHandler.prefix = localhost.

Default Solr log mechanism

Solr uses slf4j logging, which is kind of wrapper for other logging mechanisms. By default, solr uses log4j syntax and wraps java logging api (which means that it looks like you are using log4j in the code, but it is actually using java logging underneath). It uses tomcat logging.properties as configuration file. If you want to define your own, it can be done by placing a logging.properties under ${tomcat_home}/webapps/solr/WEB-INF/classes/logging.properties

Switching to Log4j

Log4j is a very popular logging framework, which I believe is mostly due to its simplicity in both configuration and usage. It has richer logging features than java logging and it is not difficult to make an extension.

Log4j for tomcat

  1. Rename/remove ${tomcat_home}/conf/logging.properties
  2. Add log4j.properties in ${tomcat_home}/lib
  3. Add log4j-xxx.jar in ${tomcat_home}/lib
  4. Download tomcat-juli-adapters.jar from extras and put it into ${tomcat_home}/lib
  5. Download tomcat-juli.jar from extras and replace the original version in ${tomcat_home}/bin

(extras are the extra jar files for special tomcat installation, it can be found in the bin folder of a tomcat download location, fx. http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.33/bin/extras/)

Log4j for solr

  1. Add log4j.properties in ${tomcat_home}/webapps/solr/WEB-INF/classes/ (create classes folder if not present)
  2. Replace slf4j-jdkxx-xxx.jar with slf4j-log4jxx-xxx.jar in ${tomcat_home}/webapps/solr/WEB-INF/lib (which means switching underneath implementation from java logging to log4j logging)
  3. Add log4jxxx.jar to ${tomcat_home}/webapps/solr/WEB-INF/lib

Make our own log4j file appender

Log4j has 2 types of common fileappender,

DailyRollingFileAppender – rollover at certain time interval

RollingFileAppender – rollover at certain size limit

And I found a nice customized file appender -  CustodianDailyRollingFileAppender online.

I happen to need a file appender which should  rollover at certain time interverl(each day) and backup earlier logs in backup folder and get zipped. Plus removing logs older than certain days. CustodianDailyRollingFileAppender already has the rollover feature, so I decide to start with making a copy of this class,

Parameters

Besides the default parameters in DailyRollingFileAppender, I need 2 more parameters,

Outdir – backup directory

maxDaysToKeep – the number of days to keep the log file

You only need to define these 2 parameters in the new class, and add get/set methods for them (no constructor involved). The rest will be handled by log4j framework.

Logging entry point

When there comes a log event, the subAppend(…) function will be called, inside which a super.subAppend(event); will just do the log writing work. So before that function call, we can add the mechanism for back up and clean up.

Clean up old log

Use a file filter to find all log files start with the filename, delete those older than maxDaysToKeep.

Backup log

Make a separate Thread for zipping the log file and delete original log file afterwards(I found CyclicBarrier very easy to use for this type of wait thread to complete task, and a thread is preferable for avoiding file lock/access ect. problems). Call the thread at the point where current log file needs to be rolled over to backup.

Deploy the customized file appender

Let’s say we make a new jar called log4jxxappender.jar, we can deploy the appender by copying the jar file to ${tomcat_home}/lib and in ${tomcat_home}/webapps/solr/WEB-INF/lib

Example configuration for solr,

log4j.rootLogger=INFO, solrlog

log4j.appender.solrlog=com.findwise.xx.log4j.fileappender.YyRollingFileAppender

log4j.appender.solrlog.File=${catalina.home}/logs/solr.log

log4j.appender.solrlog.Append=true

log4j.appender.solrlog.Encoding=UTF-8

log4j.appender.solrlog.DatePattern='.'yyyy-MM-dd

log4j.appender.solrlog.MaxDaysToKeep=10

log4j.appender.solrlog.Outdir=${catalina.base}/logs/backup

log4j.appender.solrlog.layout=org.apache.log4j.PatternLayout

log4j.appender.solrlog.layout.ConversionPattern = %d [%t] %-5p %c - %m%n

Solr.war

Last thing to remember about solr is to zip the deployment folder ${tomcat_home}/webapps/solr and rename the zip file solr.zip to solr.war. Now you should have a log4j enabled solr.war file with your customized fileappender.

Video: Introducing Hydra – An Open Source Document Processing Framework

Introducing Hydra – An Open Source Document Processing Framework from presented at Lucene Revolution hosted on Vimeo.

Presented by Joel Westberg, Findwise AB
This presentation details the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, outlining that there is a missing link between the open source chain to bridge the gap between source system and the search engine, then will move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands on flexibility, robustness and ease of use. This session will end by discussing some of the possibilities that this new pipeline framework can offer, such as freely seamlessly scaling up the solution during peak loads, metadata enrichment as well as proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.

Semantic Search Engine – What is the Meaning?

The shortest dictionary definition of semantics is: the study of meaning. The more complex explanation of this term would lead to a relationship that maps words, terms and written expressions into common sense and understanding of objects and phenomena in the real world. It is worthy to mention that objects, phenomena and relationships between them are language independent. It means that the same semantic network of concepts can map to multiple languages which is useful in automatic translations or cross-lingual searches.

The approach

In the proposed approach semantics will be modeled as a defined ontology making it possible for the web to “understand” and satisfy the requests and intents of people and machines to use the web content. The ontology is a model that encapsulates knowledge from specific domain and consists of hierarchical structure of classes (taxonomy) that represents concepts of things, phenomena, activities etc. Each concept has a set of attributes that represent the mapping of that particular concept to words and phrases that represents that concepts in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. the linguistic relationships (synonymy, homonymy etc.) or domain specific relationships (medicine, law, military, biological, chemical etc.). Such a defined ontology model will be called a Semantic Map and will be used in the proposed search engine. An exemplar part of an enriched ontology of beverages is shown in the figure below. The ontology is enriched, so that the concepts can be easily identified in text using attributes such as the representation of the concept in the written text.

Semantic Map

The Semantic Map is an ontology that is used for bidirectional mapping of textual representation of concepts into a space of their meaning and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intent that can be matched with indexed set of similar concepts (and their relationships) derived from documents that are returned in a form of result set. Moreover, users will be able to precise and describe their intents using visualized facets of concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach will retrieve additional information regarding the specific user profile from publicly available information available in social portals like Facebook, blog sites etc., as well as in user’s own bookmarks and similar private resources, enabling deeper intent discovery.

Semantic Search Map

Semantic Search Engine

The search engine will be composed of the following components:

  • Connector – This module will be responsible for acquisition of data from external repositories and pass it to the search engine. The purpose of the connector is also to extract text and relevant metadata from files and external systems and pass it to further processing components.
  • Parser – This module will be responsible for text processing including activities like: tokenization (breaking text into lexems – words or phrases), lemmatization (normalization of grammar forms), exclusion of stop-words, paragraph and sentence boundary detector. The result of parsing stage is structured text with additional annotations that is passed to semantic Tagger.
  • Tagger – This module is responsible for adding semantic information for each lexem extracted from the processed text. Technically it refers to addition of identifiers to relevant concepts stored in the Semantic Map for each lexem. Moreover phrases consisting of several words are identified and disambiguation is performed basing on derived contexts. Consider the example illustrated in the figure.
  • Indexer – This module is responsible for taking all the processed information, transformation and storage into the search index. This module will be enriched with methods of semantic indexing using ontology (semantic map) and language tools.
  • Search index – The central storage of processed documents (document repository) structured properly to manage full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
  • Search – This module is responsible for running queries against the search index and retrieval of relevant results. The search algorithms will be enriched to use user intents (complying data privacy) and the prepared Semantic Map to match semantic information stored in the search index.

What do you think? Please let us know by writing a comment.

Findability, a holistic approach to implementing search technology

We are proud to present the first video on our new Vimeo channel. Enjoy!

Findability Dimensions

Successful search project does not only involve technology and having the most skilled developers, it is not enough. To utilise the full potential and receive return on search technology investments there are five main dimensions (or perspectives) that all need to be in focus when developing search solutions, and that require additional competencies to be involved.

This holistic approach to implementing search technology we call Findability by Findwise.