Month: May 2009

EMC World 2009: Enterprise Search Server (ESS)

To me, one of the biggest pieces of news delivered during the conference was the new generation of Documentum full-text indexing called the Enterprise Search Server (ESS). This marks the first official message that EMC Documentum will move away from the OEM version of FAST ESP, which has been in use since Documentum 5.3 (2005). The inclusion of FAST back then meant that Documentum got a solution where metadata from the relational database was merged with text from the content file into an XML file (FTXML) that could be queried using DQL. Before diving into the features of the new technology, I guess everyone wonders about the reason for this decision. The main reasons are said to be:

  • Performance. One FAST full-text node supports up to around 20 million objects in the repository (some customers commented that their experience was closer to 10 M…) and it requires in-memory indices. With Documentum installations containing billions of objects, that means 100+ nodes, and that has been a hard sell in terms of hardware requirements.
  • Virtualisation. Apparently talks with Microsoft/FAST about the requirement to support all Documentum products on VMware made no progress. This has been a customer demand for some time. MS/FAST cites intensive I/O demands as a reason why they were not interested in certifying the full-text index on virtualisation.
  • NAS-support.
  • More flexible High Availability (HA) options. Today FAST can be clustered by adding new nodes, which leads to a requirement of having the same number of nodes again for backup/high availability.
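
The node arithmetic behind the performance point can be sketched in a few lines. The per-node capacities are the figures quoted above; the repository size of two billion objects is an assumed value chosen only for illustration, and the function name is hypothetical.

```python
def nodes_needed(total_objects: int, objects_per_node: int) -> int:
    """Number of index nodes required to hold all objects, rounded up."""
    return (total_objects + objects_per_node - 1) // objects_per_node


# Assumed repository size, for illustration only: 2 billion objects.
repository_size = 2_000_000_000

# Capacities quoted in the session: around 20 million objects per FAST node,
# or closer to 10 million according to some customers.
for per_node in (20_000_000, 10_000_000):
    print(f"{per_node:>11,} objects/node -> {nodes_needed(repository_size, per_node)} nodes")
```

With the optimistic figure this already lands at 100 nodes, and the number doubles if the customer experience of 10 million objects per node is closer to reality.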

From a performance standpoint, I personally think that the current implementation of FAST leads to a slow end-user experience when searching in Documentum. One reason for this is that a search is first sent to FAST, which then delivers a search result set irrespective of my permissions. The whole result set must then be filtered by querying it against the relational database. That takes time. This is also a reason why we have integrated an external search engine based on the more modern FAST ESP 5.x server with Security Access Module, which means that ACLs are indexed and filtering can be done in one step when searching in the external FAST Search Front-end (SFE). More about how that is solved in ESS later on.

From a business perspective EMC outlines these challenges they see a need to satisfy:

  • End users expect Google/Yahoo search paradigms
  • IT managers want low cost, scalability, ease of deployment and easy administration.
  • Requirements for large-scale, distributed deployments with multilingual support.
  • Enterprise requirements such as low-cost HA, backup/restore and SAN/NAS support.

The new ESS is based on the xDB technology coming from the acquisition of the company X-Hive and leverages the open source full-text indexing technology in the Lucene project. The goal for ESS is to leverage the existing open indexing architecture in Documentum. The idea is to create a solution that really scales, but of course with some trade-offs when it comes to space vs query performance.

ESS supports structured and unstructured search by leveraging best-of-breed XML database and XQuery standards. It is designed with enterprise readiness, scalability, ingestion throughput and high quality of search as core features. It also provides the Advanced Data Management functionality (control over where data is placed on disk) necessary for large-scale systems. The intention is to allow EMC to continue to develop and provide new search features and functionality required by their customer base.

It is architected for greater scalability and a smaller footprint than the current full-text search, and it scales both horizontally (more nodes) and vertically (more servers on the same node). It is designed to support tens to hundreds of millions of objects per node.

This allows solutions such as archiving, where there can be a billion or more emails/documents, to achieve scale while preserving high quality of search. The query response time can be throttled up or down based on needs: priority can be shifted between indexing and querying.

The installation procedure is also simplified, and EMC promises that a two-node deployment can be up and running in less than 20 minutes. The solution is also designed to make it easy to add new nodes to an installation.

ESS is much more than a simple replacement of the full-text engine. It will focus on delivering these additional features compared to existing solutions:
– Low cost HA (n+1 Server based)
– Disaster Recovery
– Data Management
– VMware Support
– NAS Support
– New Administration Framework

The new admin features include a new ESS Admin interface which has a look and feel very similar to CenterStage. Since the intention is to support ESS on non-Documentum installations, it is a separate web client. The framework also supports Web Services, Java API and JMX, and it is open for administration using OpenView, Tivoli, MMC etc.

The server consists of:

  • ESS API
  • Indexing Services will have document batching capability, callback support for searchable indication and a Content Processing Pipeline with text extraction and linguistic analysis via CPS.
  • Search Services. This will provide search for metadata, content or both (XQuery based) as well as multiple search options such as batching, spooling, filters, language, analyser etc. It will return results in an XML format and provide term highlighting, summary and relevancy. The thread execution management supports multi-query and parallel query. It also includes low-level security filtering.
  • Content Processing Services is responsible for language detection, text extraction and linguistic analysis. The CPS can be local or remote (co-located with content for improved performance). It will have a pluggable architecture to support various analysers and/or text extractors. It will include out of the box support for Basis RLP and Apache SnowBall analysers. However only one analyser can be configured per ESS. (My question: Can I have different analysers on different nodes?). Content Processing can be extended by plugins.
  • Node and Data Management Services is the primary interface for all data and node management within ESS. It provides ability to control routing of documents and placements of collections and indices on disk. It deals with index management and supports bind, detach, attach, merge, freeze, read-only etc.
  • Analytics includes APIs and a data model for logging, metrics and auditing, ingestion and search analysis, and facet computation services.
  • Admin Services. The example shown was really powerful, where an admin could view all searches made by a user by time and see how long it took to get the first result set. Those with a longer time could be explored by viewing the query to analyse why it took so long.

Below that the xDB can be found, and at the bottom the Lucene indices. The whole solution is 100% Java, and xDB stores XML documents in a persistent DOM format and supports XQuery and XPath. Indices consist of a combination of native B-tree indices + Lucene. xDB supports single- and multi-node architectures and has support for multi-statement transactions and full ACID support. In addition it supports XQFT (see an introduction to it here), which is a proposed standard extension to XQuery that includes:

  • LQL via a full-text extension
  • Logical full-text operator
  • Wildcard option
  • Anyall options
  • Positional filters
  • Score variables
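
To give a feel for how several of these features combine, here is a minimal sketch that composes a query string in the syntax of the W3C XQuery and XPath Full Text specification. It is not taken from the session: the collection name, the `document` element and the helper function are hypothetical, and the syntax of the draft available in 2009, or of whatever ESS ends up supporting, may differ.

```python
def build_xqft_query(collection: str, words: list[str], window: int) -> str:
    """Compose an XQuery Full Text query string (illustration only).

    The words are not escaped, so they must not contain double quotes.
    """
    # Logical full-text operator: every term must match (ftand).
    terms = " ftand ".join(f'"{w}"' for w in words)
    return (
        # Score variable: $s receives the relevance score of each hit.
        f'for $doc score $s in collection("{collection}")//document\n'
        # Wildcard option plus a positional filter (terms within N words).
        f"    [. contains text ({terms}) using wildcards window {window} words]\n"
        f"order by $s descending\n"
        f"return $doc"
    )


# Hypothetical collection name; "index.*" uses the wildcard syntax.
print(build_xqft_query("docs", ["search", "index.*"], 10))
```

The point of the sketch is that full-text conditions, positional filters and scoring live inside an ordinary XQuery expression, so they can be mixed freely with structured conditions on metadata.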

ESS includes native security which means that security is replicated into the search server and security filtering is done on a low level in the xDb database. This means effective searches on large result sets and enables facet computation on entire result sets.
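
The difference between filtering after the search and filtering inside the index can be sketched with a toy in-memory example. Everything here is hypothetical (the documents, the group names, the function names), and it says nothing about how xDB actually implements security; it only illustrates why one pass is cheaper than a lookup per hit and why facets are easy to compute once the filtered result set is at hand.

```python
from collections import Counter

# Hypothetical documents; "acl" is the set of groups allowed to read each one.
DOCS = [
    {"id": 1, "type": "memo", "text": "budget plan", "acl": {"finance"}},
    {"id": 2, "type": "report", "text": "budget review", "acl": {"finance", "staff"}},
    {"id": 3, "type": "memo", "text": "budget draft", "acl": {"board"}},
    {"id": 4, "type": "report", "text": "travel policy", "acl": {"staff"}},
]

db_checks = []  # records every simulated permission lookup


def db_can_read(doc_id, groups):
    """Stand-in for a permission query against the relational database."""
    db_checks.append(doc_id)
    acl = next(d["acl"] for d in DOCS if d["id"] == doc_id)
    return bool(acl & groups)


def search_then_filter(term, groups):
    """Two steps: full-text match first, then one permission lookup per hit."""
    hits = [d for d in DOCS if term in d["text"]]
    return [d for d in hits if db_can_read(d["id"], groups)]


def search_with_native_security(term, groups):
    """One step: the ACL is stored with the indexed document."""
    return [d for d in DOCS if term in d["text"] and d["acl"] & groups]


def facet_counts(hits, field):
    """Count hits per value of a metadata field, e.g. document type."""
    return Counter(d[field] for d in hits)


staff = {"staff"}
print([d["id"] for d in search_then_filter("budget", staff)])
print(len(db_checks), "permission lookups")
print(facet_counts(search_with_native_security("budget", staff), "type"))
```

Both functions return the same documents, but the first one needs a permission lookup for each of the three "budget" hits even though the user may only read one of them.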

Native facet computation is a key feature in ESS which is of course linked to the new search interface in CenterStage which is based on facets in an iTunes-like interface. Facets are of course nothing new but it is good that EMC has finally realised that it is a powerful but still easy way to give users “advanced search”.

ESS leverages a Distributed Content Architecture (for instance using BOCS) by only sending the raw text (DFTXML) over the network instead of the binary file, which can be very much larger in many cases (such as big PowerPoint files). ESS also utilizes the new Content Processing Services (CPS) as well as ACS.

The new solution also makes it possible to do hot backups without first taking the index server down, as is required today. Backup and restore can be done on a sub-index level. The new options for High Availability include:

  • Active/active shared data (the only one available for FAST)
  • Active/passive with clusters
  • N+1 Server based
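
A rough way to see why the N+1 option matters for cost is to compare node counts. This sketch assumes that "N+1" means one spare server shared by all index nodes (the usual meaning of the term, not something confirmed in the session) and that the current approach needs one standby per index node, as described earlier. The function and strategy names are hypothetical.

```python
def total_nodes(index_nodes: int, strategy: str) -> int:
    """Total servers needed for a given number of index nodes."""
    if strategy == "mirrored":  # one standby for every index node
        return 2 * index_nodes
    if strategy == "n_plus_1":  # assumed: one spare shared by all index nodes
        return index_nodes + 1
    raise ValueError(f"unknown strategy: {strategy}")


for n in (2, 10, 100):
    print(n, total_nodes(n, "mirrored"), total_nodes(n, "n_plus_1"))
```

The gap grows linearly with the size of the installation, which is why this matters most for the very large repositories mentioned above.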

Things I would like to see but have not heard about yet:

  • Word frequency analysis (word clouds based on document content)
  • Clustering and categorisation (maybe done by Content Intelligence Services)
  • Synonym management
  • Query-expansion management
  • How document similarity is handled by vector-space search (I guess done by Lucene?)
  • Boosting & Blocking of specific content connected to a query
  • Multiple search-views (different settings for synonyms, boost&blocking etc)
  • Visualisation of entity extraction and other annotations
  • Functionality or at least an API to manually edit entity extraction within the index. Semi-automatic solutions are the best.
  • Freshness management.
  • Speech-to-text integration (maybe from Audio/Video Transformation Services)

Personally I think this is a much needed move to really improve the internal search in Documentum and make much better use of the underlying information infrastructure in Documentum. It will be interesting to see what effect this has on Microsoft/FAST ambitions to support the Documentum connector. Maybe the remaining resources (no OEM to develop) can focus on bringing the connector from an old 5.3 API to a modern 6.5 API. I still see a need for utilising multiple search engines but as ESS gains more advanced features the rationale for an expensive external solution can change. The beta for Content Intelligence Studio will be one important step in outlining the overall enterprise search architecture for big ECM-solutions. In this lies of course tracking what Autonomy brings to market in the near future.

Another thing worth mentioning is that during the past four conferences I have heard quite a few complaints about the stability of the current FAST-based full-text index. It crashes/stops regularly, often without anybody knowing it before users start complaining about strange search results.

A public beta will be released in Q3 2009 and customers are invited to participate. Participants will receive a piece of hardware with the ESS pre-installed and pre-configured, and after a few configuration changes in Content Server it should be up and running.

Customers will have the option of upgrading the existing FAST full-text index or running the new ESS side by side with FAST. EMC will also market ESS for non-Documentum solutions.

Be sure to also read Word of Pie’s notes as well as my previous notes from FAST Forward 09 around the future of FAST ESP.

EMC World 2009: Reflections from the Momentum conference

A very hectic week has passed and EMC World 2009 is over. Just as I did last year, I felt like reflecting a bit on the conference.

First of all, many thanks to EMC for listening to us and improving a lot of things since last year. I have been to EMC World 07 and 08, and on both occasions I felt a little lost as a Documentum customer among all the storage and virtualisation people. Back then I heard people referring with love to past Momentum conferences where the sense of community was there. In November 08 I had the chance to go to Momentum in Prague as a speaker, and it was actually a bit different from EMC World. Suddenly all the focus was on Documentum.

Things well done

So the establishment of a Content Management & Archiving (CMA) Community was just what we all needed. We all got yellow ribbons with the text “Momentum” to attach to our badges, which made us all much more visible to each other. We got all the sessions in the same area, which meant no more running around and the chance to bump into people with those ribbons. Instead of having a very thick catalogue with all sessions merged together into a giant schedule, we got our own CMA Show Guide, which was really easy to use and made life much easier for me. Next to all the sessions we had a beautiful Momentum Lounge which was staffed all day long. You could even meet CMA executives for drinks after sessions on Wednesday and Thursday. It had nice sofas and chairs together with soft red lighting, which made it quite cosy. In the solutions exhibition all CMA booths were gathered in the same area with a separate graphic profile from the rest of the EMC booths. Around the CMA booth you found all the CMA partners co-located. Finally, we had our own CMA Party on Monday evening, which was well attended as far as I saw. In addition to that, we finally seem to have a working online community both for Documentum and XML Technologies.

It was also a great idea to create a Blogger’s lounge where everyone who blogged and twittered could register. Outside the lounge there was a large screen displaying what we all were saying, more or less live. And the Vanilla Latte served there was a life saver! On Tuesday their barista started making mine as soon as I passed the entrance 🙂 What a service! I think EMC actually made social media into a working business tool here. Really something to build on. If you have not done so, search for #emcworld on Twitter to see what it was all about.

I attended one Product Advisory Forum (PAF) around the new Enterprise Search Server (ESS) and that was a great experience. Ed Bueche and Aamir Farooq did a great job of inspiring great discussions between us customers and the engineering team. I attended PAFs in Prague as well and those were also a great part of the conference.

We had access to wireless internet all around the conference area and that is vital for a conference like this. Especially for us who Blog and Tweet!

Things to improve

First of all, EMC is a company whose tagline is “Where Information Lives” and which touts itself as an information infrastructure company. I assume that all means digital information, and if there is something we Documentum people care about, it is information management. So it makes a lot of sense to take notes and search the web on a laptop during sessions. After all, we are IT nerds 🙂 Please get us some rooms with a sufficient number of power outlets!

Why not extend it even further and use your own technology to integrate tweets and blog posts with the conference schedule, so that we can interact more or less live around sessions? It would even make sense, for me at least, to be able to register that I am attending a conference (voluntarily of course) using the online community profile that already exists, which would make it even easier.

There seem to be fewer sessions in general, and in particular I believe the number of developer-oriented ones has become significantly smaller. I am not a coder myself, so I actually think it makes sense to have sessions focused on people writing code and others with different levels of advancement for those of us focusing on architectures, features and business cases. Another thing I noted is that there is no call for papers for EMC World the way there is for Momentum (Europe). I think use cases from customers are an important part of the conference and it would be great to find a way to get them back in.

Please also have a look at what Word of Pie had to say about this year’s conference.

See you next year in Boston!

EMC World 2009: What is new with Digital Asset Management

Media Work Space
Controlled release on June 30th, targeted at internal use at EMC Marketing; General Availability will come later this year. Still licensed with DAM. The new release will support images, presentations, audio and video.

It will introduce a new gridless view which lists all objects as a list with columns for attributes. The gridless view can also show thumbnails at the left end of each line. There will also be a storyboard view much like the one existing in today’s Digital Asset Manager.

MWS will now have support for comments – which can interact with CenterStage comments.

Personalised Dashboard include the following views:

  • QuickFlows
  • Most Popular Assets
  • Recently Viewed Assets
  • Recently Updated Assets

To me that looks like they have started to think in terms of Information Analytics… There is now also a feature to show the cumulative rating among users.

They see a need for customisations, and an SDK or similar will be released during 2009.

The Inbox allows you to open a quickflow, which actually was really nice-looking with attached images as thumbnails below. It looked rather similar to an email message, which is the right way to go I think.

QuickSearch now supports searching on any index data.

Advanced Search has a tab called General and then tabs for Presentation, Video, Audio and Images, which allow for a higher-level restriction of the search.
Search on properties, for instance images with certain pixel dimensions…

There is a new Presentation slide view which looks way more flexible than the current PowerPoint assembly. It actually looks like viewing/reviewing slides can now be done completely without opening the application.

The view below the preview of the slides has tabs for Metadata, Versions, Renditions, Comments, Permissions and Relationships.

Slides can be rated and metadata can be edited just by clicking in the fields.

Video view supports thumbnails but also preview of the video utilising FlipFactory. It looked like the previewer was using Flash.

FileSharing Services, My Documentum and Documentum for Outlook will be merged into a new MyDocumentum product and then moved into the Knowledge Worker group. Documentum Connector for InDesign & Quark Xpress are also part of My Documentum but from the Digital Asset Management side of the house.

Many companies have 3D data which comes from different CAD systems. Therefore they have started to develop CAD integration within Documentum, with support from the Right Hemisphere integration (press release), which supports viewing data from 80 CAD/PLM systems.

The solution allows customers to request and repurpose derivatives.
Flat Iron Solutions has a demonstration in the exhibition area at EMC World 2009.

Content Transformation Services

There are mainly bug fixes and some improvements to the performance of the OEM products they are using, mainly on the image side of the house.

CTS now includes support for Adobe CS3 & CS4.
There is an SDK for CTS which can be used to handle custom encoders… From my point of view the obvious question is whether or not it makes sense to develop support for GIS formats?

The next release of MWS will probably be available in September 2009.

There is technology available in the platform to support annotations on video files, but it is not yet exposed.

The ability to show forms in a Flex environment is something they are working on, and it seems fairly important, especially for those of us who use both TaskSpace and DAM with Forms.

VISION
The main areas which they focus on are:

Web Experience Management
Customer Comms Management (build websites based on preferences)
Customer Intelligence Management
Marketing Process Management
Brand Management include:
– Presentation
– Video
– Image
– Collateral
– 3D Image
– Agency Collaboration

MidYear
– New version of Presentation Assembly

End of Year
– MWS Pro
– Integrated Collaboration and Publishing
– Campaign Management
– Marketing and Web Metrics Tracking
– KPI
– Rapid and Setup of Brand

D7 – 2010
– MWS Field Editing
– SalesForce integration
– Support of Personalised Customer

MWS Pro
– Different Libraries as Tabs

Q1 2010 MWS & DAM Sp3