EMC World 2010: Next-generation Search: Documentum Search Services

Presented by Aamir Farooq

Verity: Largest ingex 1 M Docs

FAST: Largest Index 200 M Docs

Challenging requirements today that all requires tradeoffs. Instead of trying to plugin third party search engines chose to build and integrated search engine for content and case management.

Flexible Scalability being promoted.

Tens to Hundreds of Millions of objects per host

Routing of indexing streams to different collections can be made.

Two instances can be up and running in less than 20 min!

Online backup restore is possible using DSS instead of just offline for FAST

FAST only supported Active/Active HA. In DSS more options:

Active/Passive

Native security. Replicates ACL and Groups to DSS

All fulltext queries leverage native security

Efficient deep facet computation within DSS with security enforcement. Security in facets is vital.

Enables effective searches on large result sets (underpriveleged users not allowed to see most hits in result set)

Without DSS, facets computed over only first 150 results pulled into client apps

100x more with DSS

All metrics for all queries is saved and can be used in analytics. Run reports in the admin UI.

DSS Feature Comparison

DSS supports 150 formats (500 versions)

The only thing lacking now is Thesaurus (coming in v 1.2)

Native 64-bit support for Linux and Windows, Core DSS is 64-bit)

Virtutalisation support on VMWare

Fulltext Roadmap

DSS 1.0 GA compatible with D 6.5 SP2 or later. Integration with CS 1.1 for facets, native security and XQuery)

Documentum FAST is in maintenance mode.

D6.5 SP3, 6.6 and 6.7 will be the last release that support FAST

From 2011 DSS will be the search solution for Documentum.

Index Agent Improvements

Guides you through reindexing or simply processing new indexing events.

Failure thresholds. Configure how many error message you allow.

One Box Search: As you add more terms it is doing OR instead of AND between each terms

Wildcards are not allowed OOTB. It can be changed.

Recommendations for upgrade/migration

  • Commit to Migrate
  • No additional license costs – included in Content Server
  • Identity and Mitigate Risks
  • 6.5 SP2 or later supported
  • No change to DQL – Xquery available.
  • Points out that both xDb and Lucene are very mature projects
  • Plan and analyze your HA and DR requirements

Straight migration. Build indices while FAST is running. Switch from FAST to DSS when indexing is done. Does not require multiple Content Servers.

Formal Benchmarks

  • Over 30 M documents spread over 6 nodes
  • Single node with 17 million documents (over 300 Gb index size)
  • Performance: 6 M Documents in FAST took two weeks. 30 M with DSS also took 2 weeks but with a lot of stops.
  • Around 42% faster for ingest for a single node compared to FAST

The idea is to use xProc to do extra processing of the content as it comes into DSS.

Conclusion

This is a very welcome improvement for one of the few weak points in the Documentum platform. We were selected to be part of the beta program so I would now have loved to tell you how great of an improvement it really is. However, we were forced to focus on other things in our SOA-project first. Hopefully I will come back in a few weeks or so and tell you how great the beta is. We have an external Enterprise Search solution powered by Apache Solr and I often get the question if DSS will make that unnecessary. For the near future I think it will not and that is because the search experience is also about the GUI. We believe in multiple interfaces targeted at different business needs and roles and our own Solr GUI has been configured to meet our needs based from a browse and search perspective. From a Documentum perspective the only client today that will leverage the faceted navigation is Centerstage and that is focused on asynchronous collaboration and is a key component in our thinking as well, but for different purposes. Also even though DSS is based on two mature products (as I experienced at Lucene Eurocon this week) I think the capabilities to tweak and monitor the search experience at least initially will be much better in our external Solr than using the new DSS Admin Tool although it seems like a great improvement form what the FAST solution offers today.

Another interesting development will be how the xDB inside DSS will related to the ”internal” XML Store in terms of integration. Initially they will be two servers but maybe in the future you can start doing things with them together. Especially if next-gen Documentum will replace the RDBMS as Victor Spivak mentioned as a way forward.

At the end having a fast search experience in Documentum from now is so important!

Further reading

Be sure to also read the good summary from Technology Services Group and Blue Fish Development Group about their take on DSS.

Reblog this post [with Zemanta]
Share

At the Apache Lucene Eurocon in Prague

Today I am in Prague to attend the Apache Lucene Eurocon conference hosted by Lucid Imagination, the commercial company behind the Lucene/Solr -project. It seem to almost 200 people here attending the conference. I am looking forward to speak tomorrow of Friday afternoon about our experiences of using search in general and with Solr in particular. The main part of our integration is that we have integrated Solr with a connector that listens to a queue on our Enterprise Service Bus which is a part of Oracle SOA Suite 10g (the Oracle Middleware platform).

Share

EMC World 2010: Chiming in with Word of Pie about the future of Documentum

We have got a written reaction to Mark Lewis’ keynote held at EMC World 2010 in Boston. I both feel and have the passion around Enterprise Content Management and it is great that Laurence Hart spent so much time and effort on talking to people to craft this post. Someone need to say things even if they are not always easy to hear. So I will try to not repeat what he said in this blog post but rather try to provide my perspective which comes from what I have learned about Information and Knowledge Management over the past years. ECM and Documentum is a very critical component to move that IKM vision from the Powerpoint stage into reality. In our case an experimentation platform that allows to put our ideas to improve the ”business” of staff work in a large military HQ into something people can try, learn and be inspired from. Also, this turned out to be a long blog post which calls for an summary on top:

The Executive Summary (or message to EMC IIG) of this blog post:

  • Good name change but make sure You live up to your name.
  • A greater degree of agility is very much needed but do not simplify the platform so much that implementing an ECM-strategy is impossible.
  • Case Management is not the umbrella term, it is just one of many solutions on top of Documentum xCP
  • The whole web has gone Social Media and Rich Media. The Enterprise is next. Develop what You have and stay relevant in the 2010-ies!
  • Be more precise when it comes to the term ”collaboration”. There is a whole spectrum to support here.
  • Be more bold and tell people that Documentum offers an unique architectural approach to informtion management – stop comparing clients.
  • Tell people that enabling Rich Media, Case Management, E 2.0 and (Team) Collaboration on one platform is both important and possible.
  • I am repeating myself here: You want to sell storage, right? Make sure Video Management is really good in Documentum!

The name change

Before I start I just need to reflect on the name change from Content Management and Archiving into Information Intelligence Group (IIG). I agree with Pie…the had to be changed to make it more relevant in 2010 and a focus on information (as in information management which is more than storage ILM) is the right way to go. The intelligence part of it is of course a bit fun because of my own profession but still it implies doing smart things with information and that should include everything from building context with Enterprise 2.0 features to advanced Content and Information Analytics. You have the repository to store all of that – now make sure you continue to invest in analytics engine to generate structure and visualisation toolkit to make use of all the metadata and audit trails. Maybe do something with TIBCO Spotfire.

Documentum xCP – lowering the threshold and creating a more agile platform

Great. Documentum needs to be easier to deploy, configure and monitored. Needed to get know customers on board easier and make existing ones be able to do smarter things with it in less time. However, it is easy to fall into the trap of simplifying things to much here. To me there is nothing simple around implementing Enterprise Content Management (ECM) as a concept and as a method in an organization. One major problem with Sharepoint and other solutions is that they are way to easy to install so people actually are fooled into skipping the THINKING part of implementing ECM and think it is just ”next-next-finish”. All ECM-systems needs to be configured and adapted to fit the business needs of the organisation. Without that they will fail. xCP can offer a way to do that vital configuration (preceeded by THINKING) a lot more easier and also more often. We often stress how it is important to have the technical configuration move as close to any changes in Standard Operating Procedures (SOP) as possible. If Generals want to change the way they work and the software does not support it they will move away from using the software. Agility is the key.

In our vision the datamodel needs to be much more agile. Value lists need to updated often – sometimes based on ad hoc folksonomy tagging. Monitoring of the use of metadata and tags will drive that. Attributes or even object types need to be updated more often. Content need to be ingested quickly while providing structure later on (think XML Store with new schemas here). xCP is therefore a welcome thing but make sure it does not compromise the core of what makes Documentum unique today.

The whole Case Management thing

Probably the thing that most of us reacted against in the Mark Lewis Keynote was the notion that ECM-people in reality just have done Case Management all the time. I recently spend some time reflecting on that in another blog post here called ”Can BPM meet Enterprise 2.0 over Adaptive Case Management?”. There is clearly a continuum here between supporting very formal process flows and very ad-hoc Knowledge Worker-style work. They clearly seem different and while they likely meet over Adaptive Case Management but to me it makes no sense to have that term cover the whole spectrum – even for EMC Marketing 🙂

I immediately saw that Public Sector Investigative work is often used as an example of Case Management. Case Management in especially done by law enforcement agencies is fundamentally different from work done by Intelligence Agencies because in Case-based Police investigations there is usually some legal requirement to NOT share information between cases unless authorised by managers. This is of not the case (!) for all Case Management applications but from a cultural perspective it is important that Case Management-work by the Police is not a line of business that should be used as an example of information sharing. It is even so that the underlying concept actually is at ends with any concept of unified enterprise content management strategy where information should be shared. That is why workgroup-oriented tools such as i2 Analyst’s Workstation have become so popular there.

The point here is that it is important to not disable sharing in the architectural level because again it is what constitutes a good ECM-system that content can be managed in a unified way. Don’t be fooled by requirements for that – use the powerful security model to make it possible. Then Law Enforcement Agencies can use it as well. However, there must be more to ECM than Case Management – as Word of Pie suggests it is just ONE of many solutions on top of the Documentum xCP platform. A platform which is agile enough to quickly build advanced solutions for ECM on top.

Collaboration vs Sharing and E.20

So, Collaboration is used everywhere now but the real meaning with it actually varies a bit. First there are two kind of collaboration modes:

  • Synchronous (real-time)
  • Asynchronous (non-real time – ”leave info and pick up later)

Obviously neither Documentum nor Sharepoint is in real-time part of the business. For that you will need Lotus Sametime, Office Communications Server, Adobe Connect Pro or similar products. However, Google Wave provides a bit of confusion here since it integrates instant messaging and collaborative document editing/writing.

However, I am bit bothered by the casual notion of anything as a collaboration tool like Sharepoint and for that sake eRoom is getting. To further break this down I believe there is a directness factor in collaboration. Team collaboration has a lot of directness where you collaborate along a given task with collegues. That is not the same as many of the Social Media/Enterprise 2.0 features which does not have a clear recipient of the thing you are sharing. And sharing is the key since you basically are providing a piece of information in case anyone wants/needs it. That is fundamentally different from sending an email to project members or uploading the latest revision to the project’s space. Andrew McAffe has written about this concept and uses the concept of a bullseye representing strong and weak ties to illustrate this effect.

My point is that it is important that tools for team collaborations from an information architecture standpoint can become part of the more weaker indirect sharing concept. That is the vehicle to utilze the Enterprise 2.0 effect in a large enterprise. Otherwise we have just created another set of stove-pipes or bubbles of information that is restricted to team members. I am not saying that all information should be this transparent but I will argue that based on a ”responsibility to provide”-concept (see US Intel Community Information Sharing Policy) restricting that sharing of information should be exception – not the norm.

Sure as Word of Pie points out in his article ”CenterStage, the Latest ex-Collaboration Tool from EMC” there are definitely things missing from the current Centerstage release compared to both Sharepoint and EMC’s old tool eRoom. However, as Andrew Goodale points out in the comments I also think it is a bit unfair because both eRoom and at least previous versions of Sharepoint (which many are using) actually lacks all these important social media features that serves to lower the threshold and increase participation by users. They also provide critical new context around the information objects that was not available before in DAM, WebTop or Taskspace. Centerstage also provides a way to consume them in terms of activity streams, RSS-feeds and faceted search. Remember that Centerstage is the only way to surface those facets from Documentum Search Server today.

So, I am also a bit disappointed that things are missing in Centerstage that should be there and I also really want to stress the importance of putting resources into that development. Those features in there are critical for implementing all serious implementations of an ECM-strategy and the power of Documentum is that they all sits in the same repository architecture with a service layer to access them. Maybe partner with Socialcast to provide a best practice implementation to support a more extensive profile page and microblogging. Choose a partner for Instant Messaging in order to connect the real-time part of collaboration into the platform. Again, use your experience from records management and retention policies to make those real-time collaboration activities saved and managed in the repository.

Be bold enough to say you are an Sharepoint alternative – but for the right reasons

I’m not an IT-person, I come into this business with a vision change the way a military HQ handles information so I see Enterprise Content Management more as a concept than a technology platform. However, when I have tried to execute our vision it becomes very clear that there is a difference between technology vendors and I like to think that difference comes from internal culture, experience, and vision of the company. It is the ”why” behind why the platform looks like it does and has the features it has. So as long you are not building everything from scratch for yourself it actually matters a lot which company you chose to deliver the platform to make your ECM vision happen. That means that there IS a difference between Documentum and Sharepoint in the way the platform works and we need to be able to talk about that. However, what I see now is that most people focus on the client side of it and try to embrace it is a popular collaboration tool. Note that I say tool – not platform. All those focuses on the client side of it where the simplified requirement is basically a need for a digital space to share some documents in. However, the differentiator is not whether Centerstage or Sharepoint meets that requirement – both do. The differentiator is whether you have a conceptual vision on how to manage the sum of all information that an organization have and to what degree those concepts can be implemented in technology. That is where the Documentum platform is different from other vendors and why it is different from Sharepoint. Sharepoint is sometimes a little bit to easy to get started with which unfortunately means there is no ECM-strategy behind the implementation and when the organisation have thousands of Sharepoint sites (silos) after a year or so that is when that choice of platform really starts to differ.

This week at EMC World has been a great one as usual and there is no shortage of brilliant technical skills and development of features in the platform. What I guess bothers me and some other passionate ECM/Documentum-people is the message coming out from the executive level at IIG. In the end, that is where the strategic resource decision are made and where the marketing message being constructed. I think now there is a lot more to do on the vision and marketing level than actually needs to be done on the platform itself. The hard part seem to be proud of what the platform is today, realize it’s potential to remain the most capable and advanced on the market and use that to stay relevant in many applications of ECM – not just Case Management.

Rich Media – A lot of content to manage and storage to sell

One of the strong points of Documentum is that it can manage ALL kind of content in a good way and that includes of course rich media assets such as photos, videos and audio files. Don’t look upon this as some kind of specialised market only needed by traditional ”creative” markets. This is something everybody needs now. All companiens (and military units for that sake) have an abundance of digital still and video cameras where a massive amount of content needs to be managed just as all the rest of the content. There is a need for platform technologies that actually ”understands” that content and can extract metadata from it so that this content can be navigated and found easily. It is also important to assist users in repurposing this content so it can be displayed easily without consuming all bandwith and also easily be included in presentations and other documents. This is also very much relevant from a training and learning perspective where screencams and recorded presentations has so much potential. It does not have to be a full Learning Management System but at least an easy way to provide it. Maybe have a look at your dear friend Cisco and their Show and Share application. Oh, it is marketed as a Social Video System – the connections to Centerstage (and not just MediaWorkspace) is a bit too obvious. Make sure you can provide Flickr and Youtube for the Enterprise real soon. People will love it. Again, on one very capable platform.

Media Workspace is a really cool application now. Even if it does not have all the features of DAM yet (either) it is such a sexy interface on Documentum. The new capabilites of handling presentations and video are just great. Be sure to look more at Apple iPhoto and learn how to leverage (and create) metadata to support management of content based on locations, people and events. A piece of cake on top of a Documentum repository. Now it is a bit stuck in the Cabinet/Folder hierarchy as the main browsing interface.

Summary

I agree with Word of Pie that there is a lack of vision – an engaging one that we all can buy into and sell back home to our management. In my project we seem to have such a vision and for us Documentum is a key part of that. I just hoped that EMC IIG would share that to a greater degree. From our responses back home in Sweden and here at EMC World people seem to both want and like it (have a look at my EMC World presentation and see what you think). We can do seriously cool and fun stuff that will make management of content so much more efficient which should be of critical importance for every organisation today. At least in the military one thing is for sure and that is that we won’t get more people. We really have to work smarter and that is what a vision like this will provide a roadmap towards.

So be proud of what you do best EMC IIG and make sure to deliver INTEGRATED solutions on top of that. For those who care that will mean a world of difference in the long run and will gather looks of envy for those who did not get it.

Share

EMC World 2010: My presentation around using Documentum in a SOA-platform

Yesterday on Monday May 10 at 11 am I gave a speech at the Momentum 10 conference here at EMC World 2010 in Boston. The presentation was focused around our experiences of building an experimentation platform for next-generation information and knowledge management (IKM) for a large operational level military HQ. Contemporary conflicts are complex and dynamic in character and requires a new approach to IKM in order to be able to handle all those complexities based on a sound management of our digital information. At the core of our platform is EMC Documentum integrated over an Enterprise Service Bus (ESB) from Oracle. The goal is to maintain access and tracability on the information while removing stove-piped systems.

I have got quite a few positive reactions both from customers and EMC-people after the session which of course is just great. For instance see these notes from the session. All the presentations will be available for download for all participants but that will most likely take some time. So in the meantime you can download my presentation here instead:

Presentation at EMC World 2010 in Boston

Looking forward to comments are reflections. The file is quite big but that is because my presentations is high on screenshots and downsampling them to save file size will make it too hard to see what they are showing. Try zooming in to see details.

Share

Reflections on our project from an Enterprise Search perspective

In a blog post called ”Knowledge management: retrieve, visualize and communicate!” at their Findability blog the company Findwise reflects on the approach to Knowledge Management taken by the Knowledge Support project at JCDEC. It is inspired by the recent article in one of Sweden’s biggest technology magazines called Metro Teknik.

Share

Where the FAST Enterprise Search Platform (ESP) is going now…

I have spent the last week in Las Vegas attending the FAST Forward 09 conference. About a year ago the Norvegian company FAST Search & Transfer was acquired by Microsoft and like me customers all over the world wonder what would happen. Some thought it was great to have a huge company with its R&D resources to take the platform forward while others like me feared a technology transition which would include cancelling support for other operating systems and integration with nothing but Microsoft technology.

It was very clear that the Microsoft Marketing department had a lot to say about the conference and what messages that were to be conveyed. Somewhere behind all that you could still see some of the old FAST mentality but it was really toned down. To me the conference was about convincing existing customers that MS is committed to Enterprise Search and to give Sharepoint customers some idea of what Enterprise Search is all about.

It is clear that the product line is diversifying in a common Microsoft strategy:

Solutions for Internet Business

  • FAST Search for Internet Business
  • FAST Search for Sharepoint Internet sites
  • FAST AdMomentum
  • Solutions for Business Productivity

  • FAST Search for Sharepoint
  • FAST Search for Internal Application
  • FAST Search for Sharepoint won’t be available until Office Wave 14 (incl Sharepoint) will be released so in the meantime there will be a product called FAST ESP for Sharepoint that can be used today and will have a license migration path towards FAST Search for Sharepoint. That product will have product license of aroudn 25 000 USD and then additional Client Access License (CAL) will follow in a standrad MS manner.

    So what does all of this means for us who like to see FAST ESP continue as an enterprise component in a heterogenous environment? Well, MS has commited to 10 years of support for current customers, I guess in a gesture towards those who are worried. Over and over again I heard representatives talking about how important those high-end installations on other operating systems are. The same message appeared when it came to connectors and integration with Enterprise Content Management systems like EMC Documentum. Still, most if not all demos was connected to Sharepoint and/or other MS-specific technologies.

    The technical roadmap means that the past year has been devoted in rewriting their next generation search platform from Java to .Net. The first product that will be released is the Content Integration Studio (CIS) which consist of Visual Studio (I guess earlier in Eclipse) component and a server-side execution engine. This will only be available on Windows since it is deeply connected to the .Net-environment. It looks like a promising product with support for flows instead of linear pipeline to handle the processing of information before it is handed of to the index engine. CIS therefore sits in-front of FAST ESP and a combination of actions in flow and in old pipelines can be executed. Information from CIS is written to the ESP which then creates the index and also processes queries to it.

    What I think we can expect is that new innovation is focused on creating a modular architecture where CIS is the first one. Features in ESP will the be gradually reengineered in a .Net-environment and thus creating a common search platform some years into the future. It will likely mean that we will still see one or two upgrades to the core ESP as we know it today to enable it to function together with the new components. Content Fusion will most likely be the next module that will extend ESP but on a .Net-architecture.

    When it comes to the presentation logic where we today have the FAST Search Front-End (SFE) we will see them either as Web parts for Sharepoint or as AJAX Aerogel from MS. These are currently developed using Javascript but will include Silverlight later on.

    These will initially be offered in both a IIS and a Tomcat flavour and possibly others if there is demand. They will intitially integrated with ESP and Unity and thus opening up for a new approach of developing a search experience on top of them.

    I general I don’t like the Microsoft approach of insisting of owning the whole technology stack by themselves and refusing to invest in other standards-based projects. Instead of developing their own AJAX libraries they could have used ExtJS or even Google Web Toolkit. While it is not open source MS argues that it is a very Permissive licence from MS that has many of the same qualities. A good thing is that MS was comitted to make sure that this framework works on all major browsers including FireFox, Safari and Chrome. It is interoperable with JQuery.

    In summary I think it is kind of a mixed experience. The new features being developed are truly needed to make FAST keep being one of the most advanced search engines available. I think many of the features look really promising and I can’t wait to get my hands on then. On the other hand it is clear that things are going proprietary (FAST ESP had a lot of open source in it), it is being aligned in a Microsoft stack and thus gradually minimizing options. That includes how new technologies are being implemented (MS-ones instead of open source), what operating systems it will run on and how the support for developing presenation logics look like. It means I have to have people how know both Java and .Net, both Flash and Silverlight (possibly JavaFx) and both ExtJS/GWT and MS AJAX/Aerogel.

    We are deeply invested in the EMC Documentum Platform and would of course like to continue use ESP as a way to add advanced capabilities and performance to our architecture. However, I think I will over time get sick and tired on Microsoft sales people trying to convince me to use Sharepoint instead of Documentum. For anybody who know how both platform work it is almost a joke but I will most likely have to keep explaining and explaining. I just hope that we can have decent connector developed for Documentum.

    Too read more you can go to the FAST Forward Blog which has many interviews, look at videos at the Microsoft Press Room and check out the chatter on ffc09 tagged tweets on Twitter. An finally here is what CMS Watch has to say about it.

    Share