xml | Content Perspective

Tag: xml

EMC World 2010: Next-generation Search: Documentum Search Services

Posted on 2010/05/22 by alexandra

Presented by Aamir Farooq

Verity: Largest Index 1 M Docs

FAST: Largest Index 200 M Docs

Today's requirements are challenging and all require tradeoffs. Instead of trying to plug in third-party search engines, EMC chose to build an integrated search engine for content and case management.

Flexible Scalability being promoted.

Tens to Hundreds of Millions of objects per host

Indexing streams can be routed to different collections.

Two instances can be up and running in less than 20 min!

Online backup and restore is possible with DSS, instead of just offline as with FAST.

FAST supported only Active/Active HA. DSS offers more options:

Active/Passive

Native security. Replicates ACL and Groups to DSS

All fulltext queries leverage native security

Efficient deep facet computation within DSS with security enforcement. Security in facets is vital.

Enables effective searches on large result sets (underprivileged users are not allowed to see most hits in the result set)

Without DSS, facets are computed over only the first 150 results pulled into client apps

100x more with DSS
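The difference between the two approaches can be illustrated with a small Python sketch. It is not DSS code: the hit structure, the `readers` field and both function names are invented for illustration. It only shows why counting facets over a truncated client-side window gives a different picture than counting them over every hit the user is allowed to see.

```python
from collections import Counter

# Hypothetical hit list: each hit has a facet value and the set of groups
# allowed to read it (a stand-in for replicated ACLs and groups).
HITS = (
    [{"format": "pdf", "readers": {"staff"}} for _ in range(200)]
    + [{"format": "doc", "readers": {"staff", "guests"}} for _ in range(100)]
)


def visible(hit, user_groups):
    """A hit is visible if the user shares at least one group with its readers."""
    return bool(hit["readers"] & user_groups)


def facets_client_side(hits, user_groups, window=150):
    """Count facets over only the first `window` hits pulled into the client."""
    return Counter(h["format"] for h in hits[:window] if visible(h, user_groups))


def facets_in_engine(hits, user_groups):
    """Count facets over every hit the user is allowed to see."""
    return Counter(h["format"] for h in hits if visible(h, user_groups))


guest = {"guests"}
print(facets_client_side(HITS, guest))  # Counter()
print(facets_in_engine(HITS, guest))    # Counter({'doc': 100})
```

In this toy data the first 150 hits are all invisible to the guest user, so the windowed count is empty even though 100 readable documents exist further down the result set.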

All metrics for all queries are saved and can be used in analytics. Run reports in the admin UI.

DSS Feature Comparison

DSS supports 150 formats (500 versions)

The only thing lacking now is Thesaurus (coming in v 1.2)

Native 64-bit support for Linux and Windows (Core DSS is 64-bit)

Virtualisation support on VMware

Fulltext Roadmap

DSS 1.0 GA is compatible with D 6.5 SP2 or later (integration with CS 1.1 for facets, native security and XQuery).

Documentum FAST is in maintenance mode.

D6.5 SP3, 6.6 and 6.7 will be the last releases that support FAST

From 2011 DSS will be the search solution for Documentum.

Index Agent Improvements

Guides you through reindexing or simply processing new indexing events.

Failure thresholds. Configure how many error messages you allow.

One Box Search: as you add more terms, it applies OR instead of AND between the terms
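To see what OR semantics mean in practice, here is a toy Python sketch. It says nothing about how DSS actually evaluates queries; the tiny in-memory corpus and the `search` function are invented for illustration.

```python
DOCS = {
    1: "documentum search services",
    2: "fast search engine",
    3: "documentum content server",
}


def search(query, docs, mode="OR"):
    """Return the ids of documents matching any (OR) or all (AND) query terms."""
    terms = query.lower().split()
    combine = any if mode == "OR" else all
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if combine(term in text.split() for term in terms)
    )


print(search("documentum search", DOCS, mode="OR"))   # [1, 2, 3]
print(search("documentum search", DOCS, mode="AND"))  # [1]
```

With OR, every extra term can only widen the result set, which is the opposite of what users who are used to AND semantics tend to expect.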

Wildcards are not allowed OOTB. This can be changed.

Recommendations for upgrade/migration

Commit to Migrate
No additional license costs – included in Content Server
Identify and Mitigate Risks
6.5 SP2 or later supported
No change to DQL – XQuery available.
Points out that both xDB and Lucene are very mature projects
Plan and analyze your HA and DR requirements

Straight migration. Build indices while FAST is running. Switch from FAST to DSS when indexing is done. Does not require multiple Content Servers.

Formal Benchmarks

Over 30 M documents spread over 6 nodes
Single node with 17 million documents (over 300 GB index size)
Performance: 6 M Documents in FAST took two weeks. 30 M with DSS also took 2 weeks but with a lot of stops.
Around 42% faster for ingest for a single node compared to FAST

The idea is to use xProc to do extra processing of the content as it comes into DSS.

Conclusion

This is a very welcome improvement for one of the few weak points in the Documentum platform. We were selected to be part of the beta program, so I would have loved to tell you now how great an improvement it really is. However, we were forced to focus on other things in our SOA project first. Hopefully I will come back in a few weeks or so and tell you how great the beta is.

We have an external Enterprise Search solution powered by Apache Solr and I often get the question whether DSS will make that unnecessary. For the near future I think it will not, because the search experience is also about the GUI. We believe in multiple interfaces targeted at different business needs and roles, and our own Solr GUI has been configured to meet our needs from a browse and search perspective. From a Documentum perspective the only client today that will leverage the faceted navigation is CenterStage, which is focused on asynchronous collaboration and is a key component in our thinking as well, but for different purposes. Also, even though DSS is based on two mature products (as I experienced at Lucene Eurocon this week), I think the capabilities to tweak and monitor the search experience will, at least initially, be much better in our external Solr than in the new DSS Admin Tool, although it seems like a great improvement from what the FAST solution offers today.

Another interesting development will be how the xDB inside DSS will relate to the “internal” XML Store in terms of integration. Initially they will be two servers but maybe in the future you can start doing things with them together. Especially if next-gen Documentum will replace the RDBMS, as Victor Spivak mentioned as a way forward.

In the end, having a fast search experience in Documentum from now on is so important!

Further reading

Be sure to also read the good summaries from Technology Services Group and Blue Fish Development Group about their take on DSS.

Interesting thoughts around the Information Continuum

Posted on 2010/05/16 (updated 2010/05/20) by alexandra

In a blog post called “The Information Continuum and the Three Types of Subtly Semi-Structured Information” Mark Kellogg discusses what we really mean by unstructured, semi-structured and structured information. In my project we have constant discussions around this and how to look upon the whole aspect of chunking content down into reusable pieces that in themselves need some structure in order to be just that – reusable. At first we were ecstatic over the metadata capabilities in our Documentum platform because we had made our unstructured content semi-structured, which in itself is a huge improvement. However, it is important to see this as some kind of continuum instead of three fixed positions.

One example is of course the PowerPoint/Keynote/Impress presentation, which actually is not one piece. Mark Kellogg reminded me of the discussions we have had around those slides being bits of content in a composite document structure. It is easy to focus on the more traditional text-based editing that you see in Technical Publications and forget that presentations have that aspect in them already. To be honest, when we first got Documentum Digital Asset Manager (DAM) in 2006 and saw the PowerPoint Assembly tool we became very enthusiastic about content reuse. However, we found that feature a little bit too hard to use and it never really took off. What we see in Documentum MediaWorkSpace now is a very much revamped version of that, which I look forward to playing around with. I guess the whole thing comes back to the semi-structured aspect of those slides, because in order to facilitate reuse they somehow need to get some additional metadata and tags. Otherwise the sheer number of slides available will easily be too much if you can’t filter them down based on how they are categorised and who has created them.

Last year we decided to take another stab at composite document management to be able to construct templates referring to both static and dynamic (queries) pieces of content. We have made ourselves a rather cool dynamic document composition tool on top of our SOA platform with Documentum in it. It is based on DITA: we use XMetaL Author Enterprise as the authoring tool to construct the templates, the service bus resolves the dynamic queries, and Documentum stores the large DITA file and transforms it into a PDF. What we quickly saw was yet another aspect of semi-structured information, since we need a large team to be able to work in parallel to “connect” information into the finished product. Again, there is a need for context in terms of metadata around these pieces of reusable content that will end up in the finished product based on the template. Since we depend on using a lot of information coming in from outside the organisation, we can’t have strict enforcement of the structure of the content. It will arrive in Word, PDF, Text, HTML, PPT etc. So there is a need to transform content into XML, chunk it up into reusable pieces and tag it so we can refer to it in the template or use queries to include content with a particular set of tags.
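A rough sketch of the idea in Python, not the actual tool: a template mixes static references to specific chunks with dynamic queries that pull in every chunk carrying a given set of tags. The `Chunk` class, the chunk ids, the tags and the `resolve` function are all hypothetical, and a plain in-memory list stands in for the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tags: frozenset
    text: str


# Hypothetical chunk store standing in for the content repository.
CHUNKS = [
    Chunk("intro", frozenset({"static"}), "Introduction"),
    Chunk("r1", frozenset({"report", "region-a"}), "Report one"),
    Chunk("r2", frozenset({"report", "region-b"}), "Report two"),
    Chunk("r3", frozenset({"report", "region-a"}), "Report three"),
]

# A template mixes static references (by id) and dynamic queries (by tags).
TEMPLATE = [
    {"ref": "intro"},
    {"query": {"report", "region-a"}},
]


def resolve(template, chunks):
    """Expand a template into an ordered list of chunks."""
    by_id = {c.chunk_id: c for c in chunks}
    resolved = []
    for entry in template:
        if "ref" in entry:
            # Static piece: exactly one named chunk.
            resolved.append(by_id[entry["ref"]])
        else:
            # Dynamic piece: every chunk that carries all requested tags.
            wanted = entry["query"]
            resolved.extend(c for c in chunks if wanted <= c.tags)
    return resolved


print([c.chunk_id for c in resolve(TEMPLATE, CHUNKS)])  # ['intro', 'r1', 'r3']
```

The point of the sketch is that the dynamic part only works if the incoming content has been tagged consistently, which is where the semi-structured aspect comes in.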

This of course brings up the whole problem with the editing/authoring client. The whole concept of a document is being questioned, as it in itself is part of this continuum. Collaborative writing in the same document has been offered by CoWord, TextFlow and the recently open-sourced Google tool Etherpad, and will now be part of the next version of Microsoft Office. Google Wave is a little bit of a disruptive force here since it merges the concepts of instant messaging, asynchronous messaging (email) and collaborative document editing. Based on the Google Wave Federation protocol, it is also being implemented in enterprise applications such as Novell Pulse.

So why not just use a wiki then? Well, the layout tools are nowhere near as rich as what you will find in word processors and presentation software, and since we are dependent on being able to handle real documents in these common formats it becomes a hassle to convert them into wiki format or, even worse, try to attach them to a wiki page. More importantly, a wiki is asynchronous in nature and that is probably not that user friendly compared to live updates. The XML vendors have also gone into this market with tools like XMetaL Reviewer, which leverages the XML infrastructure in a web-based tool that allows users to see changes almost in real time and review them collaboratively.

This leads us to the importance of the format we choose as the baseline for both collaborative writing and the chunk-based reusable content handling that we would like to leverage. Everybody I talk to is pleased with the new Office XML formats but says in the next breath that the format is complex and a bit nasty. So do we choose OpenOffice, DITA or what? What we choose has some real impact on the tool end of our solutions, because you probably get the most out of a tool when it is handling its native format or at least the one it is certified to support. Since it is all XML we can always transform back and forth using XSLT or XProc.

Ok, we have the toolset and some infrastructure in place for that. Now comes my desire to not stove-pipe this information in some closed system only used to store “collaborative content”. Somehow we need to be able to “commit” those “snapshots” of XML content that to some degree constitute a document. Maybe we want to “lock it” down so we know what version of all of that has been sent externally, or just to know what we knew at a specific time. Very important in military business. That means that it must be integrated into our Enterprise Content Management infrastructure, where it in fact can move on the continuum into being more unstructured since it could even be stored as a single binary document file. Somehow we need to be able to keep the traceability so you know what versions of specific chunks were used and who connected them into the “document”. Again, just choosing something like TextFlow or Etherpad will not provide that integration. MS Office will of course be integrated with SharePoint, but I am afraid that implementation will not support all the capabilities in terms of traceability and visualisation that I think you need to make the solution complete. Also, XML content actually likes to live in XML databases such as Mark Logic Server and Documentum XML Store, so that integration is very much needed more or less out of the box in order to make it possible to craft a solution.
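As a thought experiment, the “commit” step could be sketched as a manifest that records which version of each chunk went into the document, who connected them and when. This is only an illustration in Python using the standard library; the field names and the `commit_snapshot` function are assumptions, and nothing here reflects how Documentum or any other product stores versions.

```python
import hashlib
import json


def commit_snapshot(chunks, assembled_by, committed_at):
    """Build a manifest recording exactly which chunk versions went into a document."""
    entries = [
        {
            "chunk_id": c["chunk_id"],
            "version": c["version"],
            # Hash of the chunk content, so later changes can be detected.
            "sha256": hashlib.sha256(c["xml"].encode("utf-8")).hexdigest(),
        }
        for c in chunks
    ]
    manifest = {
        "assembled_by": assembled_by,
        "committed_at": committed_at,
        "chunks": entries,
    }
    # The fingerprint changes if any chunk, version or metadata field changes.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["fingerprint"] = hashlib.sha256(canonical).hexdigest()
    return manifest


chunks = [
    {"chunk_id": "summary", "version": "1.2", "xml": "<p>Summary text</p>"},
    {"chunk_id": "annex-a", "version": "3.0", "xml": "<p>Annex text</p>"},
]
snapshot = commit_snapshot(
    chunks, assembled_by="alexandra", committed_at="2010-05-16T12:00:00Z"
)
print(json.dumps(snapshot, indent=2))
```

A manifest like this could be stored next to the rendered binary file, so the locked-down document keeps a trace back to the chunks it was built from.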

We will definitely look into Documentum XML Technologies more deeply to see if we can design an integrated solution on top of that. It looks promising, especially since an XProc pipeline for DITA is around the corner.

Dave Kellogg on Palantir

Posted on 2010/04/09 by alexandra

I recently began reading the blog written by Dave Kellogg, who is the CEO of Mark Logic, a company devoted to XML-based content management. I think I came to notice them when I discovered what cool technology EMC got when it bought X-hive, which has now become Documentum xDB/XML Store. Mark Logic and X-hive were of course competitors in the XML database market. In a recent blog post he reflects on the Palantir product after attending their Government Conference.

The main scope of his blog post is the different business models for a startup. That is not my expertise and I don’t have any particular opinion on it, although I tend to agree, and it was interesting to read his reflections on how other companies such as Oracle (yet another competitor to Mark Logic and xDB) have approached this.

Instead my thinking is based around his analysis of the product that Palantir offers and how that technology relates to other technology. I think most people (including Kellogg) mainly view Palantir as a visualisation tool because you see all these nice graphs, bars, timelines and maps displaying information. What they tend to forget is that there is a huge difference between a tool that ONLY does visualisation and one that actually lets you modify the data (or rather the contextual data around it, such as metadata and relations) within those perspectives. There are many different tools around Social Network Analysis, for instance. However, many of them assume that you already have databases full of data just waiting to be visualised and explored. Nothing new here. This is also what many people use Business Intelligence toolkits for: accessing data in warehouses that is already there, although the effort of getting it there from transaction-oriented systems (like in retail) is not small in any way. However, the analysts using these visualisation-heavy toolkits access data read-only and only add analysis of data that is already structured.

Here is why Palantir is different. It provides access to raw data such as police reports, military reports, open source data. Most of it in unstructured or semi-structured form. When it comes into the system it is not viewable in all these fancy visualisation windows Palantir has. Instead, the whole system rests on a collaborative process where people perform basic analysis which includes manual annotations of words in reports. This digital marker pen allows users to create database objects or connect to existing ones. Sure this is supported by automatic features such as entity extraction but if you care about data quality you do not dare to put them in automatic mode. After all this is done you can start exploring the annotated data and linkages between objects.
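The “digital marker pen” idea can be sketched in a few lines of Python. This is in no way Palantir's data model or API; the `Entity` and `Workspace` classes and the sample reports are invented purely to show the create-or-link step that a manual annotation performs.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    kind: str
    mentions: list = field(default_factory=list)  # (report_id, start, end) tuples


class Workspace:
    """Holds entities created by analysts marking up raw reports."""

    def __init__(self):
        self.entities = {}

    def annotate(self, report_id, text, phrase, kind):
        """Mark `phrase` in `text`, creating the entity or linking to an existing one."""
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in report {report_id}")
        key = (phrase.lower(), kind)
        # Reuse the existing entity if this name and kind were marked before.
        entity = self.entities.setdefault(key, Entity(phrase, kind))
        entity.mentions.append((report_id, start, start + len(phrase)))
        return entity


ws = Workspace()
a = ws.annotate("r1", "John Doe was seen at the harbour.", "John Doe", "person")
b = ws.annotate("r2", "A car registered to John Doe was found.", "John Doe", "person")
print(a is b, a.mentions)
```

Both annotations end up on the same object, which is what later makes it possible to explore linkages between reports.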

However, I do agree with Dave Kellogg that if people think BI is hard, this is harder. The main reason is that you have to have a method or process to do this kind of work. There are no free lunches – no point in dreaming about full automation here. And people need training and the right mindset to be able to work efficiently. Having played around with TIBCO Spotfire lately, I feel that there is a choice between integrated solutions like Palantir, which have features from many software areas (BI, GIS, ECM, Search etc.), and dedicated toolkits with your own integration. Powerful BI with data mining is best done in BI systems, whereas they will probably never provide the integration between features that vendors like Palantir offer. An open architecture based on SOA can probably make integration easier in many ways.