Momentum 2012 Session: Documentum Architecture with Jeroen

The Documentum stack is huge so this will be an overview. This track is more slide-oriented whereas the innovation track will feature more demos.

The Big Picture

Walks us through the stack based around different repositories

A model of metadata on top of data. On the side of the slide is a stack of supporting components he calls the “I/O Stack”:

  • Multichannel Publishing (DocScience) – two-way
  • Unified Analytics Platform (Big Data with Greenplum) – two-way
  • Capture Services (Captiva) – one-way
  • E-discovery Services (Kazeon) – one-way

Moving away from coding towards configuration is the main goal.

Content Server focused on on-premise deployment (SOAP/REST) – single-tenant.

NGIS is for the cloud, where the focus is on REST – multi-tenant.

A second server architecture for the public cloud. Why? He sees different demands in the public cloud.

The aim is to provide a multi-tenant solution where each added tenant costs as little as possible – preferably just another object.

For the cloud, D2 comes first; xCP will be made available in the cloud later on.

It should be possible to design for the current stack and run it on the public cloud stack later on.

The mechanism for the transfer is XML – formalising the application model.

He talks about a hybrid scenario – the content supply chain – managing the extended enterprise with all suppliers as the use case.

D7 Performance Improvements

  • Talking again about improved context switch performance
  • Cache management has been improved, lowering memory usage – focused mainly on the type cache
  • Improved the number of active sessions that can be handled
  • Improved response time
  • Reduced the number of sessions on the Oracle database

This will affect the number of Content Servers needed as well as database deployment. It drastically reduces the number of sessions needed to support a given number of concurrent users, while barely affecting response times at 0–500 users. Again he shows impressive charts where the scalability of Documentum is much more linear than before.

REST Strategy

  • Big effort of implementing REST across the board
  • Pure REST approach (must support XML, JSON and AtomPub)
  • Do not see it as an API but rather a design paradigm
  • REST as SOA Strategy
  • Discovery of services using resource linking
  • ns.emc.com for describing web concepts
  • Connecting resources across products

The request service layer is an abstraction layer.
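
As a concrete illustration of resource linking, here is a minimal Java sketch of a client that starts from a single entry point and negotiates a representation. The URL and paths are hypothetical placeholders, not documented Documentum endpoints.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestDiscovery {
    public static void main(String[] args) throws Exception {
        // Hypothetical entry point; actual Documentum REST URLs may differ.
        URL home = new URL("http://dctm-host:8080/dctm-rest/services");
        HttpURLConnection conn = (HttpURLConnection) home.openConnection();
        // Content negotiation: ask for JSON; AtomPub XML is the alternative.
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            // A pure REST client never hardcodes resource URLs beyond the
            // entry point: it parses the response and follows link relations
            // (e.g. a "repositories" link) to discover further resources.
            System.out.println(body);
        }
    }
}
```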

A new initiative called Linked Data Platform – EMC co-chairs the working group with IBM.

REST and xCP – REST services provide access to xCP resources.

xCP clients also use platform resources such as Folders, Documents and other Documentum types.

xCP clients consume xCP and platform services.

Project Line of Sight

It is about Hyperic and the Documentum stack.

Immediate focus areas:

  • Enable monitoring of Documentum product families
  • Linux and Windows
  • 6.7 SP2 onwards

Future:

  • Captiva, DocSci, Kazeon and other products
  • Provide data analytics and corrective actions enablement

Linked to the bigger initiative with xMS – a new approach to deploying Documentum.

An automated process to instantiate complex deployments – not only initial deployments but also upgrades and patches.

Each machine has a Hyperic agent that connects to the Hyperic Server, where the monitored data ends up and can be consumed by the Hyperic Portal, where you can:

  • Monitor
  • Discover
  • Alert
  • Correct

Immediate targets:

  • Health and availability metrics
  • Diagnostics and logging info
  • Alerts and notifications

Later:

  • Documentum platform (CS, BOCS, ACS…)
  • xPlore and Webtop
  • Supports 6.7 SP2 and onwards

Reflection: What will become of Reveille’s Documentum products? Is there an overlap?

xPlore and Federated Search Services 2

  • xPlore 1.2 released Q4 2011
  • Customized processing (post-linguistic analysis using UIMA customization)
  • Tokens are indexed based on different languages
  • Since xDB is a schemaless database, you can modify the data structure with any model changes.

FAST has been phased out – there are 1,000 deployments of xPlore now.

Federated Search Services extends the reach of search. FSS does not index – it provides access to other indices.

D2, the power of configuration

The concept of D2 – the power of runtime configuration. The idea is to be able to combine a subset of users with a subset of content and apply a configuration to that intersection. That becomes the context. EMC has seen a lot of common patterns in Webtop customizations. The UI is better if you can trim it down. This makes it possible to have a very targeted user configuration.

xCP

Rapid Application Composition (not configuration, as in D2)

An integrated set of tools provides that. This means there will be implications for all three types of users:

  • New User (D2 and Mobile for instance)
  • New Application Developer (Information, Process, External Systems, Analytics Modeling and the Composition)
  • New Administrator (xMS, etc.)

Defines a semantic model of the application. From that we generate an optimised runtime. Generates domain-specific REST Services.

xCP and the stateless BPM engine are mentioned as a big thing at this conference as well. More about that in the xCP session.

Big Data Analytics

It is possible to integrate with Greenplum in xCP 2.0 without any modifications. It works through JDBC Data Services configured in the xCP Builder that connect with Greenplum UAP.
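
To give a feel for what such a JDBC Data Service does underneath, here is a minimal Java sketch querying Greenplum directly. Greenplum speaks the PostgreSQL wire protocol, so the standard PostgreSQL JDBC driver works; the host, credentials and table are made-up placeholders, and the real xCP integration is configured in xCP Builder rather than hand-coded.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class GreenplumQuery {
    public static void main(String[] args) throws Exception {
        // Greenplum is PostgreSQL-compatible on the wire, so the standard
        // PostgreSQL JDBC driver can be used. Host, database, credentials
        // and table are all hypothetical placeholders.
        String url = "jdbc:postgresql://greenplum-host:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT driver_id, avg(risk_score) AS avg_risk " +
                     "FROM driving_events WHERE event_date > ? " +
                     "GROUP BY driver_id")) {
            stmt.setDate(1, java.sql.Date.valueOf("2012-01-01"));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s -> %.2f%n",
                            rs.getString("driver_id"), rs.getDouble("avg_risk"));
                }
            }
        }
    }
}
```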

The integration with DocSci is via Web Services.

He gave an example of an insurance app where data is fed from car devices monitoring driving behavior. Large amounts of data are crunched into driver reports output by DocSci, and new insurance rates are applied to the driver’s policy.

The process definition can be configured to interrogate the Greenplum engine.

Always Impressed by the Scale of the EMC World Dining Hall

Impressive, no other word for it. A story to tell people back home in Sweden. I recently attended a party where it took 200 people over 45 minutes to get food from a buffet. Here at EMC World the Dining Hall is just amazingly large. I mean, serving breakfast to over 10,000 people in an hour. Does the coffee come in pipelines and the food in endless lines of trucks outside? I don’t know, but it is way cool.

Momentum 2012 Session: The New Documentum D2

Here we go, the first session at Momentum 2012 at EMC World 2012 in Las Vegas. After breakfast in the enormous Dining Hall we gathered in the Momentum area of the conference for the first session around the newly acquired Documentum D2 client. Responsible for the session were Brian Roche and Boris Carbonelli. Peggy Ringhausen is also working with D2 and was present at the session.

The main goals for this project are:

  • Respond quickly
  • Deploy easily
  • Delight Users

The overall idea is to compress the time from inception to roll-out to end users, something Documentum users have not been used to, especially from the WDK era. Sure, WDK was a great (even prize-winning) technology of its time, but we grew accustomed to even small customizations of the user interface taking many days of development and requiring recompiling the application or at least restarting the server. That has all changed now.

New in EMC Documentum D2 4.0

  • Cross-browser Support (all the major ones)
  • Workspaces
  • Themes
  • Widgets
  • Cross repository search
  • Open Simultaneous workspaces
  • View Centerstage Spaces

There is a set of predefined workspaces, but you can also create your own with drag-and-drop widgets from the widget gallery.

Customers thought themes were important, so they were included in this release. It is very easy to change and modify themes without a restart.

The interface is based on Google Web Toolkit (GWT), which of course makes me wonder what the convergence with the ExtJS-based interfaces will look like.

D2FS

D2 does not use UCF for content transfer; instead there is either a D2 Applet or a D2 Branch Office Caching Services (BOCS) plugin that does the job. That of course also means that Java needs to be installed on client machines. D2 is also available in the EMC OnDemand solution.

Just like Chief Architect Jeroen, Brian promised fewer slides and more demos, so after 15 minutes or so it was time to fire up the demo on a MacBook. The user demo was run in Safari whereas the admin and configuration parts ran in a virtual machine inside VMware on that Mac.

One thing that I enjoy with Documentum Centerstage is that it restores sessions, and similarly D2 restores the interface just where you left it. It is a vital approach for lowering the threshold so users can quickly get back to work by picking up where they left off.

The Workspace Gallery includes a set of predefined workspaces which contain different widgets depending on the targeted user category. Things like:

– Consumer

– Contributor (full, preview, vertical)

– Web

Very fast browsing, which is nice. The menu auto-hides, and he could easily group documents by status. You can also go full screen directly from a thumbnail to show slides or full pages. I wonder what the default size of the renditions is set to. They need to be big enough to actually avoid firing up PowerPoint or Keynote.

Layout configuration of widgets seemed very easy – just drag them around to where you want them.

There is a new Centerstage Browser as a widget, which is great I think. I think it is wrong when a certain type of contextual information about an object can only be viewed in a certain client. Different presentation, of course, but not different information in different clients.

They also have a new web widget to show web pages inline in D2. Google Gadgets can be included, which means there is already a catalogue of things to start from. You can also interact between Google Gadgets and widgets in D2.

It is possible to configure which widgets are made available to users, not only which workspaces.

He did a demo where a watermark was applied upon download of a file to the computer. He changed the setting for the watermark, hit save, and showed the new watermark – with no restart. Cool!

D2 Config is configured using a matrix, which to me looked a bit strange at first.

The horizontal rows contain rules whereas the vertical columns define in which context to apply each rule. Behind each rule lie complete configuration options, so while the interface initially looks like you can’t do much, the configuration options are huge.

He showed how easy it was to create a new tab with a new attribute field and a drop-down menu, which can be populated through a Dictionary and DQL.
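
To illustrate the DQL part, here is a hedged DFC sketch of the kind of query that could feed such a drop-down. The type and attribute names are invented for the example; in D2 this is wired up through configuration, not code.

```java
import com.documentum.com.DfClientX;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;

import java.util.ArrayList;
import java.util.List;

public class DropDownValues {
    // Returns distinct values of a (hypothetical) attribute - the kind of
    // list a dictionary/DQL-backed drop-down would be populated with.
    static List<String> departments(IDfSession session) throws Exception {
        IDfQuery query = new DfClientX().getQuery();
        query.setDQL("SELECT DISTINCT department FROM my_doc_type ORDER BY department");
        IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
        List<String> values = new ArrayList<String>();
        try {
            while (coll.next()) {
                values.add(coll.getString("department"));
            }
        } finally {
            coll.close();  // always close collections to free the result set
        }
        return values;
    }
}
```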

No server restart and no compilation required. This applies to visible metadata and watermark styles. Themes can be switched at runtime.

Internal widgets are those that come with the product; external widgets are those you develop yourself or take from the internet, like Google Gadgets.

He demoed searching for a Gadget using the tool’s search, inserted the widget URL from the Gadget, and used it in a workspace.

Workflow information is also available as widgets in D2:

  • Workflow
  • Browser
  • Task lists
  • Task Details
  • Attachment
  • Notes
  • Performers

No workflow map as of now.

Besides watermarks, headers and footers can also be applied dynamically upon download.

D2 4.0 is scheduled for Q3, and after that there will be a release every six months according to the current release schedule.

There is another product called O2 which is used to transfer Microsoft Office document properties to and from the repository.

 

On my way to Las Vegas and EMC World 2012

Managed to get access to the lounge at Chicago O’Hare while waiting for our flight to Las Vegas later. Really looking forward to coming back to Vegas and the Momentum conference. Last time at EMC World was 2010 in Boston, where I was a speaker. I really hope they have kept the Momentum feeling that I think we all appreciated: our own area for the sessions, a lounge and a badge of some kind indicating that we are all about software and not huge hard drives. It will be great to listen to all the sessions around Enterprise Content Management and information management as well as to interact with both EMC staff and customers. Those of us who like to go to conferences like EMC World have one thing in common: we have seen the issues large organizations have with digital content and want to (and even know how to) do something about it. That understanding is sadly not something to expect in our daily lives. I really, really hope that we will see interfaces on top of Documentum that we can show colleagues with pride this year. Modern tools on top of a great platform.

Reflections from Momentum 2011 Berlin

EMC (IIG) the company

  • A real tech company
  • Responsive employees
  • Easy to get access inside the company
  • Willing to share information
  • Sometimes hard to figure out ”who is who” in EMC Information Intelligence Group (IIG)

As a customer it is important how the company feels. My experience is that EMC is a company where you can find tech-savvy people who really like what they are doing. And they are good at it. The general experience is that employees are interested in listening to us and very responsive to our needs. It is easy to quickly get access to both key business people as well as people in engineering. On the other hand that is often required because the product is quite complicated. On the negative side the company is big and that means that things are not always coordinated and it can sometimes be difficult to figure out who is who among all the different product managers, general managers, solutions directors and architects.

EMC IIG seems open and transparent to me. Sure there are disclaimers but they are talking openly about most things and there is no NDA at the conference.

 

Strategy

I feel a big difference this year – maybe because I have been away for over a year due to my year at the National Defense College. The big difference is that EMC Information Intelligence Group finally seems to get it. For real. Away from the idea that Case Management is something different than Enterprise Content Management. A realization that nice-looking, usable user interfaces are a key thing. An understanding that the cloud is a key component of EMC IIG’s future. Communicating that the real power of xCP is configuration instead of coding – and not just the interfaces, but the whole application. Finally working to get decent analytics to make use of the contextual information that already exists around objects in the repository. Somehow it feels like there is a new executive team in place that wants to be a little bit more bold and wants to move IIG in a certain direction.

EMC has made numerous acquisitions since it bought Documentum, but now it feels like they are finding out that they have lots of different pieces of technology within the company that together can form a bigger whole.

Working with EMC-owned VMware to provide not only certification for all Documentum components but also to leverage the power of their virtualization infrastructure, both to ease deployment and to enable efficient use of infrastructure.

Working with VMware-owned Socialcast to include activity streams in Documentum user interfaces.

Working with RSA to enhance the security features of the platform.

Working with Greenplum to power analytics but also to provide a new perspective on handling big data with smarts on top of it – big information.

 

Towards a unified client

  • Client situation is a mess today
  • C6 acquisition was a good move
  • A unified client is coming along
  • Wonderful to see the focus on iOS apps

The user interface of Documentum is frankly a mess nowadays. A result of too many teams working in their own bubbles creating user interfaces based on different customer groups. WDK-based Webtop with its DAM cousin. Taskspace, which is also WDK but gains some power from Forms Builder and xCP technologies. ExtJS-based Centerstage, which looks great but is a bit late and light on features. Feature-rich Media Workspace, which is based on Flex in a world where Adobe Flash is obviously losing traction and HTML5 is taking off. Steve Jobs really made a difference here, it turns out. On top of that, desktop applications for OS X and Windows as well as an Outlook client. It is not that I think there is no need for different clients. There is. Especially from a training perspective, where some companies require almost zero training whereas others can accept more extensive training.

The inclusion of C6 Technologies into Documentum is a welcome move and I heard lots of positive reactions to that. However, the key thing is that EMC IIG is now firmly committed to unifying all clients on one technology stack, which of course will focus a lot on configurability. So in the end it could very well mean that the number of clients will be much bigger, but they will just be different configurations based on very specific user needs. The unified client will most likely be based on C6 and ExtJS technologies, which means that Flex is going away quickly. So are WDK and Taskspace, but over a longer horizon. So think of D2 as a Webtop replacement and X3 as the new Centerstage with lots of widgets, including ones for rich media management. Probably we will see the C6 iPad client replace the existing Documentum client as well. Expect an iPhone client soon, too.

Speaking about iOS: to me it is almost like a new world compared to my first EMC World in 2007. Everybody at EMC was using BlackBerries and Macs were hardly seen. Now the iPad app is out, Peggy talks about how “everybody loves their iPads”, Macs are in booths and on stage, there are several Documentum apps and almost all contest prizes consist of iPads. Macgirlsweden is both happy and astonished at this development :)

 

Policy-based deployment with monitoring

Ok, so Documentum is not easy to deploy. It takes a while, but as Jeroen put it: “You guys want to do complicated stuff!”. I think he is right, and it might sometimes be a good thing since you have to stop and think (not like SharePoint, which is way too easy to install in that sense). You choose Documentum because you have a complicated process to support, large amounts of content and an ECM vision. Still, agility really needs to be improved, and that will also simplify deployment. So improvement is important for several reasons.

The first part of that is the xCelerated Management System, which in essence lets you describe and model your applications and your deployment needs. Tools then translate these policies onto your VMware-powered infrastructure and deploy the whole Documentum platform based on your needs, taking into account the number of users, the type of content, the type of processes and what kind of high-availability demands you have. Finally, all of this is monitored using a combination of open-source Hyperic and the Integrien engine they got through an acquisition. Integrien now seems to have become VMware vCenter Operations. That architecture will in my opinion set EMC Documentum way ahead of its competitors, especially if it can provide some additional agility when the Next-Generation Information Server (NGIS) comes.

 

Analytics and Search

  • xPlore is looking good
  • Thesaurus-support is a good thing
  • QBS is great
  • Custom-pipeline support based on UIMA is great

A dear subject of mine where EMC IIG finally seems to get its act together. They have their own search server called xPlore, which is based on open-source Lucene and their own powerful XML database xDB. A really smart move now that FAST, Autonomy and Endeca have been bought by the other IT giants.

xPlore 1.2 provides some really cool features, both in terms of baseline search capabilities like thesaurus support and more text-analytics-oriented features. The content processing pipeline now supports extensions based on UIMA, which opens up having other entity extraction engines connected into xPlore. Another really cool feature is Query-Based Subscriptions, which really leverage the Documentum repository. Create a search query based on a combination of free text and metadata. Save it and set it up to run at different intervals and notify you of any new content that has been ingested. You can even use it to fire off a workflow in order to have somebody take action. Hopefully we will see some xCP integration in the xPlore 1.3 release, where the search experience and indexing are linked to the characteristics of the xCP Application Model.
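
I have not seen how QBS stores its queries internally, but the shape of such a saved query is easy to sketch in DFC. The type, attributes and search terms below are hypothetical.

```java
import com.documentum.com.DfClientX;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;

public class SubscriptionQuery {
    // Runs the kind of combined metadata + full-text DQL query that a
    // query-based subscription could save and re-run on a schedule.
    // "lastRun" is expected to be a DQL date literal in your client's format.
    static void findNewMatches(IDfSession session, String lastRun) throws Exception {
        IDfQuery query = new DfClientX().getQuery();
        query.setDQL(
            "SELECT r_object_id, object_name FROM dm_document " +
            "WHERE subject = 'contracts' " +
            "AND r_creation_date > DATE('" + lastRun + "') " +
            "AND SEARCH DOCUMENT CONTAINS 'liability clause'");
        IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
        try {
            while (coll.next()) {
                System.out.println(coll.getString("object_name"));
            }
        } finally {
            coll.close();
        }
    }
}
```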

In his Innovation Speech the Chief Architect Jeroen van Rotterdam also showcased a modified Centerstage which used a recommendation engine based on a Hidden Markov model to suggest similar content to users based on similarity in context and similarity in content. A really powerful feature that makes EMC live up to its name: Information Intelligence Group (IIG). Jeroen also mentioned that they are working on video and audio analytics, including speech-to-text which is then indexed into xPlore. That will most likely arrive in the iPad client first.

Another cool thing that is coming for the Content Intelligence Services (CIS) component is automated metadata extraction based on rules, and taxonomy cold-start, which means that you could start generating a taxonomy based on your existing content.

Next-Generation Information Server (NGIS)

It seems that there has been a big investment in the xDB technology and therefore it is a key component in NGIS. No surprise there, since Jeroen is one of the founders of the company that EMC bought. That could also mean that future installations of Documentum will not require a traditional SQL RDBMS, which would not be such a bad thing. One less license and one less skill set to manage. NGIS is being designed with both the cloud and “big information” in mind. The idea is to be able to use different datastores such as Atmos, Greenplum, Isilon etc. together with NGIS. I really like the term “big information”, which is a way to take what we now know as “big data” to the next level where it also covers unstructured data and documents. Since there is a wave of information coming over us now, it seems smart to design this for huge datasets from the beginning. After all, we need to manage it whether we like it or not. As Peter Hinssen put it at the final keynote: “It is not information overload – it is a filter failure”. We CAN handle vast amounts of data if we design the architecture right. Another interesting concept is to bring processing to the data (nodes) instead of what we do today, where we have a central processing node which we pipe all data through. Everybody realises that the first releases of NGIS will not be feature-complete in comparison with Documentum Content Server, but I also wonder what the cloud focus really means for NGIS. I hope it means cloud as a technical concept and not only public cloud, meaning that NGIS would only be available for OnDemand at first. On the other hand, an early access program is now opening up and that will most likely be run on premise. NGIS will be an important aspect of making Documentum retain its position as the leader in ECM technology. In the light of the other innovation going on, it can be a bright future.

Cloud and EMC OnDemand

So now you can run a complete Documentum stack in the cloud. A great thing which I think will broaden the market a bit. Much easier to get up and running, and an ability to focus on core ECM capabilities instead of installing server OS, DBMS and managing storage. A good thing is the ability to have extra power available if needed. Provisioning of a full platform is said to happen in 6–8 hours depending on configuration. Deployment will be in a vCube where all Documentum servers will be managed as images. Each customer gets its own vCube. It will be possible to run a vCube on premise, but that means that EMC still manages the configuration over the internet even though it is running on your hardware. There will be some limitations on the level of customizations that you can do in order to have EMC take responsibility for the vCube. Remember, all server OS and DBMS licenses are included in the vCube. Altogether the cloud initiative is driving big improvements in configuration and deployment, which all aspects of Documentum will gain from.

 

Venue and atmosphere

  • Keep working on the IIG and Documentum community feeling

Another Momentum conference has ended and it is time to reflect on our experiences from this event. This was my second European conference, but I have attended four EMC World conferences. I keep hearing that they are different, and also stories from the old Momentum conferences before EMC acquired Documentum. During my first EMC World events I really felt that the Documentum community was lost among a wave of storage people roaming around. However, the Momentum brand has been strengthened and I believe the difference between the US and the European conference is much smaller now. I think the main difference is the crowd and the atmosphere. The locations in Europe are a bit smaller in scale, but the event sites also physically look different. All in all, EMC IIG did a very good job organizing this event, with no visible friction from my point of view.

 

Practical things

  • More power outlets
  • Dedicated wifi in the keynote area (to allow use of Social media)
  • Set up a blogger’s lounge based on the EMC World concept

In general EMC created a very well organized event, but there is some room for improvement anyway. One thing is the meals area. For some reason the Americans prefer round tables “en masse”, whereas this event was located in the ordinary breakfast restaurant in the hotel. Tables were straight ones with 2–8 seats each. To me that did not invite as many spontaneous lunch encounters as I experience at EMC World. People tend to stay in their small groups and eat in those as well.

Another recurring issue is of course the shortage of power outlets, which I find really strange at an IT conference, and with EMC’s strong push for social media interactions. Even though iPads are much more common now (even at EMC events), I think the conference experience would be more productive with a decent number of outlets and a capable WiFi network. My best experience so far is still a Microsoft conference around FAST Search in Vegas where all 1200 participants had tables with outlets.

There was a social media center, but I felt it was way too small compared to the spacious EMC World blogger’s lounge. There are still quite few people using social media during the conference, and a good lounge would encourage interaction IRL between us. Consider creating badges where your Twitter name and blog address are printed.

 

Social events

  • Make them about networking
  • Make it possible to talk – have areas without very loud music
  • Make sure those with allergies can eat and eat safely.

First of all, I don’t drink alcohol at all, so in that sense I may not be representative of the group at large. Still, since this is a professional conference I do have some opinions based on what utility these social events could have. Of course, it should be a more relaxed time and a possibility to have some fun. However, I do like to see these events as a very good opportunity for networking between all of us at the conference. Locating these events in nightclubs with very loud music is therefore not an ideal setting for networking. I think the EMC World social events in the US are better that way. Spending the night in Universal Studios, for instance, was a very different experience than Ewerke in Berlin. Not just because there are terrific and fun rides there, but also because there were lots of places to sit down, eat good food and talk a lot. I had a great evening there last time, talking a lot about the future of content analytics with EMC staff and customers. So at least provide areas where people can talk to each other. Make the events more of a continuation of the conference day. Make sure that they are in theme – any entertainment should have some connection to ECM. Maybe a stand-up about our community or a show with music with dedicated lyrics about us. Also, it would be great to have more non-alcoholic alternatives than orange juice, Coke and Fanta. Finally, I am allergic to nuts and I had a small incident where I accidentally ate something with nuts in it. Provide good information and possibly alternatives for those of us with allergies.

DISCLAIMER: All opinions here are my own and do not represent any official view of my employer. Information is based on notes and conversations and may contain errors.

Notes from the Momentum 2011 session “Current and Future Architecture of Documentum”

These are notes from the session with Jeroen van Rotterdam, Chief Architect, IIG Services. It may contain errors and all these sessions are subject to change from an EMC perspective.

The focus of the Documentum 6.7 release was improved quality and performance.

Gives an example from a classic HA Configuration consisting of:

LoadBalancer

4 Web Servers

1 DocBroker

2 Content Servers

 

He sometimes gets the question: ”Why is it so hard to deploy DCTM?” He smiled and exclaimed ”You guys want to do complicated stuff”.

 

The current components of the Content Server Repository:

–       Content Files (FS)

–       Metadata (RDBMS)

–       XML Store (xDB)

–       xPlore Full Text (xDB)

External sources

–       Centera

–       Atmos

 

Gives another example of a customer with 20k users

Branch Office Caching Server

–       Predictive caching (push content)

–       Distributed write option (async and sync). Local write and then sync up.

The idea is to monitor users in a similar type of context.

A user usually starts with an activity and will be in that process flow, which therefore becomes his/her context. Content related to that context can then be pushed to servers close to the user.

 

xMS

xMS is yet another acronym which in this case means xCelerated Management System

–       Define requirement – Blueprints

–       Describe them independent of deployment options

–       Automatically deploy blueprint to a target

 

In the Run component there can be:

–       multiple VM clusters running on

–       multiple ESX servers

–       virtual machines, created based on the blueprints and assigned to ESX servers

 

The final component is what they call the Telemetry Project

–       Monitor the runtime using open-source Hyperic

They have created Hyperic adapters for the Documentum products.

Integrated with the Integrien product (which now seems to be VMware vCenter Operations)

Policy also includes upscaling configuration so it is easy to add more power to a configuration.

Automatic remedies like firing up an additional virtual machine

Total amount of metrics

 

Session optimizations

DFC Session Pooling

DFC frees a session back to the pool if it is idle for 5 seconds.

It is expensive to switch context between users (to make sure they don’t see what the other users were doing).
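
For reference, this is roughly what the pooled pattern looks like from application code – a minimal DFC sketch with placeholder credentials and repository name. The point is that release() returns the session to the pool instead of disconnecting, which is why the idle timeout and the context-switch cost matter.

```java
import com.documentum.com.DfClientX;
import com.documentum.com.IDfClientX;
import com.documentum.fc.client.IDfClient;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.client.IDfSessionManager;
import com.documentum.fc.common.IDfLoginInfo;

public class PooledSessions {
    public static void main(String[] args) throws Exception {
        IDfClientX clientX = new DfClientX();
        IDfClient client = clientX.getLocalClient();
        IDfSessionManager sm = client.newSessionManager();

        IDfLoginInfo login = clientX.getLoginInfo();
        login.setUser("user");           // placeholder credentials
        login.setPassword("password");
        sm.setIdentity("myrepo", login); // repository name is a placeholder

        // getSession() hands out a session from the managed pool...
        IDfSession session = sm.getSession("myrepo");
        try {
            System.out.println(session.getServerVersion());
        } finally {
            // ...and release() returns it to the pool instead of
            // disconnecting, so the expensive connection (and the context
            // switch on the server) can be reused by the next caller.
            sm.release(session);
        }
    }
}
```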

 

Platform DFS Services/Platform REST/Application Services

SOAP DFS/REST

DFC

Two types of services

Core Platform and CMIS on top of that

Generate Application Services based on modeling from xCP stack (simple to use REST services will be generated for a specific part of the model)

 

Builder Tools:

–       Application Modeling

–       UI Builder

Semantic Model of the Application

Generate Optimized Runtime
– Indices etc

The value of xCP is not just the UI; the application services and optimized runtime are also of great value. He argues that xCP is sometimes misunderstood in that sense.

 

Dormant State Feature D7

Needed to support cloud deployment

No downtime

Bring the server to a dedicated state for changes (read-only, stopped audit trail, stopped indexing).

Partial availability for users in this state.

The idea is to spread update load on different content servers

Rolling upgrade – continuous operation – apply patches one by one

Snapshot of the vApp is possible because it is in a safe state

 

NGIS – Public Cloud

The goal is a full-blown multi-tenant architecture.

Tremendous investment in xDB over the past years.

He argues that xPlore now beats the search vendors FAST, Autonomy and Endeca, and since all of them have been bought by big players, EMC now has access to solid search technology of its own.

Tenant-level backup in xDB 10

xDB/xPlore

–       XACML Security

–       Tree compression (previous version is stored as a change)

–       Search over history (storing a complex graph that allows you to query all the versions)

–       Distributed Query Execution

 

Big Data becomes Big Information when you Put Smart on top of the data

Bring processing to the data rather than data to the processing

Impossible with the huge amounts of data of tomorrow to bring data to (central) processing nodes.

 

Plain Hadoop will not work in this case… plain MapReduce is optimized for the back-end.

We need real-time MapReduce processing – a lot of research is ongoing right now.

Stream-based (looking at Yahoo).

 

SmartContainers (next year)

Kazeon is integrated into NGIS

Offering a builder to model your metadata to generate the run-time

Early access program is available.

Business Intelligence: Sometimes a problematic term

I often find myself in between the world of military language and the completely different language used in the information technology domain. Naturally it didn’t take long before I understood that term mapping or translation was the only way around it and that I often can act like a bridge in discussions. Understanding that when one side says one thing it needs to be translated or explained to make sense in the other domain.

Being an intelligence officer, the term Business Intelligence is of course extremely problematic. The CIA has a good article that dives into the importance of defining intelligence but also some of the problems. In short, I think the definition used in the Department of Defense (DoD) Dictionary of Military and Associated Terms can illustrate the core components:

The product resulting from the collection, processing, integration, evaluation, analysis, and interpretation of available information concerning foreign nations, hostile or potentially hostile forces or elements, or areas of actual or potential operations. The term is also applied to the activity which results in the product and to the organizations engaged in such activity (p.234).

The important thing is that in order to be intelligence (in my area of work) it both has to have gone through some sort of processing and analysis AND only cover things foreign – that is, information of a certain category.

When I first encountered the term business intelligence at the University of Lund in southern Sweden, it represented activities done in a commercial corporation to analyse the market and competitors. It almost sounded like a way to take the methods and procedures from military intelligence and just apply them in a corporate environment. Still, it was not at all focused on structured data gathering and statistics/data mining.

So when speaking about Business Intelligence (BI) in a military or governmental context, it can often cause some confusion. From an IT perspective it involves a set of technical products doing Extract-Transform-Load and Data Warehousing, as well as the front-end products used by analysts to query and visualise the data. Here comes the first, more philosophical issue when seeing this in the light of the definition of intelligence above. As long as the main output is gathering data and visualising it using Enterprise Reporting or Dashboards directly for the end user, it ends up in a grey area whether or not I would consider that something that is processed. In that use case Business Intelligence sometimes claims to be more (in terms of analytical ambitions) than a person with an intelligence background would expect.

Ok, so just displaying some data is not the same thing as doing in-depth analysis of the data and using statistical and data mining technology to find patterns, correlations and trends. One of the major players in the market, SAS Institute, has seen exactly that and has tried to market what they offer as something more than “just” Business Intelligence by renaming it Business Analytics. The idea is to achieve “proactive, predictive, and fact-based decision-making”, where the important word, I believe, is predictive. That means that Business Analytics claims not just to visualise historic data but also to make predictions about the future.

An article from BeyeNETWORK also highlights the problematic nature of the term business intelligence, because it is so often connected with data warehousing technology and, more importantly, because only part of an organisation’s information is structured data stored in a data warehouse. Coming from the ECM domain I completely agree, but it says something about the problems of thinking both that BI covers all the data we need to do something with and that BI is all we need to support decision-makers. The article also discusses what analysis and analytics really mean. Wikipedia says this about data analysis:

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

The question is then what the difference is between analysis and analytics. The word business is in these terms as well, and that is because a common application of business intelligence is the ability to measure performance throughout an organisation via processes that are being automated and are therefore to a larger degree measurable. The BeyeNETWORK article suggests the following definition of business analytics:

“Business analysis is the process of analyzing trusted data with the goal of highlighting useful information, supporting decision making, suggesting solutions to business problems, and improving business processes. A business intelligence environment helps organizations and business users move from manual to automated business analysis. Important results from business analysis include historical, current and predictive metrics, and indicators of business performance. These results are often called analytics.”

Looking at the suite of products covered under the BI umbrella, that approach downplays the fact that these tools and methods have applications beyond process optimization. In law enforcement, intelligence, pharmaceuticals and other applications there is huge potential to use these technologies not only to understand and optimize internal processes but, more importantly, the world around them that they are trying to understand. Seeing patterns and trends in crime rates over time and geography, using data mining and statistics to improve understanding of a conflict area, or understanding the results of years of scientific experiments. Sure, there are toolsets that are marketed more along the lines of statistics for use in economics and political science, but those applications can really use the capabilities of a BI platform rather than something run on an individual researcher’s notebook.

In this article from Forbes it seems that IBM is also using business analytics instead of business intelligence to move from simpler dashboard visualizations towards predictive analytics. This can of course be related to the IBM acquisition of SPSS, which is focused on that area of work.

From the book written by Davenport and Harris, 2007

However, the notion of neither Business Intelligence nor Business Analytics says anything about what kind of data is actually being displayed or analyzed. From a military intelligence perspective it means that BI/BA tools and methods are just one out of many analytical methods employed on data describing “things foreign”.

In my experience, misunderstandings can come from the other end as well. Consider a military intelligence branch using – here it comes – BI software to analyse incoming reports. From an outsider’s perspective it can of course seem like what makes their activity into (military) intelligence is that they use some form of BI tools and then present graphs, charts and statistical results to the end user. Resulting from that, I have heard over and over again that people believe we should also “conduct intelligence” on, for instance, our own logistics systems to uncover trends, patterns and correlations. That is wrong, because intelligence specialists are skilled both in analytical methods (in this case BI) and in the area or subject they are studying. However, since these tools are called Business Intelligence, the risk of confusion is of course high just because of the word intelligence in there. What such a person means is of course that BI/BA tools seem useful for analysing logistics data as well as data on “things foreign”. A person doing analysis of logistics should of course be a logistics expert rather than an expert in insurgency activities in failed states.

So let’s say that what we currently know as the BI market evolves even more and really claims to be predictive. That is a logical argument on the executive level: the investment must provide something more than just self-serve dashboards. From a military intelligence perspective that becomes problematic, since not all those activities need to be predictive. In fact it can be very dangerous if someone is led to believe that everything can be predicted in contemporary, complex and dynamic conflict environments. The smart intelligence officer rather needs to understand when to use predictive BI/BA and when she or he definitely should not.

So Business Intelligence is a problematic term because:

  • It is a very wide term for both a set of software products and a set of methods
  • It is closely related to data warehousing technology
  • It includes the term intelligence which suggests doing something more than just showing data
  • Military Intelligence only covers “things foreign”.
  • The move towards expecting prediction (by renaming it to Business Analytics) is logical but dangerous in a military domain.
  • BI can still be a term for open-source analysis of competitors in commercial companies.

I am not a native English speaker, but I do argue that we must be careful to use such a strong word as intelligence only when it is really justifiable. Of course it is too late for that, but it is still worth reflecting on.

EMC & Greenplum: Why it can be important for Documentum and ECM

Recently EMC announced it was acquiring the company Greenplum, which many people interpret as EMC putting more emphasis on the software side of the house. Greenplum is a company that focuses on data warehousing technology for the very large datasets called “big data” applications, where the most public examples are Google, Facebook, Twitter and such. An immediate reaction to this move is of course that it is a sign of market consolidation and a desire to play among the largest players like Oracle/Sun, IBM and HP by being able to offer a more complete hardware/software combo stack to customers. Oracle/Sun of course has its Exadata machine as an appliance-based model for data warehousing capability. Chuck Hollis comments on this move by highlighting how it is a logical move that fits nicely both with EMC storage technology and with the virtualisation technology coming out of VMware. To highlight the importance, EMC will create a new Data Computing Product Division out of Greenplum. As a side note I think it is better to keep the old name to keep the “feeling” around the product, just as Documentum is a better name than the previous Content Management & Archiving Division. After an initial glance at Greenplum it seems to be an innovative company that can solve business problems where the established big RDBMS vendors do not seem to be able to scale enough.

With my obvious focus on Enterprise Content Management, I would like to reflect on how I think, or maybe hope, this move will matter to that area of business. In our project we started looking deeper into the data warehousing and business intelligence issues in January this year. Before that, our focus was on implementing a service-oriented architecture with Documentum as a core component. We already knew that in order to meet our challenges around advanced information management there was a need to use AND integrate different kinds of products to solve different business needs. ECM for the unstructured content, Enterprise Search to provide a more advanced search infrastructure, GIS technology to handle maps and all spatial visualisation, and so on. Accepting that there is no silver bullet, but instead trying to use the right tool for the right problem and letting each vendor do what it is best at.

Replicate data but store it differently for different use cases
SOA fanatics have a tendency to want very elegant solutions where everything is a service and all pieces of information are requested as needed. That works fine for more steady, near-realtime solutions where the assumption is that a small piece of information is needed at each moment. However, it breaks down when you get larger sets of data that are needed for longer-term analysis, something which is fairly common for intelligence in support of military operations. If each analyst requests all that data over a SOAP interface it does not scale well, and the specialised tool that each analyst needs is not used to its full potential. The solution to this is to accept that the same data needs to be replicated in the architecture for performance reasons, sometimes as a cache – common for GIS solutions to get responsive maps. However, there is often a need for different storage and information models depending on the use case. A massive audit trail stored in an OLTP system based on a SQL database, like Documentum, will grow big, and accessing it for analysis can be cumbersome. The result is that the whole system can be slowed down just because of one analysis. Instead we quickly understood the need for a more BI-optimized information model to be able to do massive user behaviour analytics with acceptable performance. It is in the end a usability issue, I think. Hence the need for a data warehouse to offload data from an ECM system like Documentum. In fact it applies not only to the audit trail part of the database; analysing the sum of all metadata on the actual content objects also makes for excellent content analytics. The audit trail reveals user interaction and behaviour, while the content analytics part gives a helicopter perspective on what kind of data is stored in the platform. Together the joined information provides quite powerful content/information and social analytics.
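
To make the audit trail point concrete, here is the kind of aggregate DQL that is cheap in a BI-shaped store but can weigh on the OLTP database once dm_audittrail has grown very large – a hedged DFC sketch:

```java
import com.documentum.com.DfClientX;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;

public class AuditTrailStats {
    // Counts audit events per event type. On a repository with hundreds of
    // millions of audit rows this kind of scan is exactly what you want to
    // run against an offloaded warehouse copy instead of the live database.
    static void eventCounts(IDfSession session) throws Exception {
        IDfQuery query = new DfClientX().getQuery();
        query.setDQL("SELECT event_name, COUNT(*) FROM dm_audittrail GROUP BY event_name");
        IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
        try {
            while (coll.next()) {
                // Dump each row generically by attribute name, since the
                // name of the computed count column can vary.
                StringBuilder row = new StringBuilder();
                for (int i = 0; i < coll.getAttrCount(); i++) {
                    String name = coll.getAttr(i).getName();
                    row.append(name).append('=').append(coll.getString(name)).append(' ');
                }
                System.out.println(row);
            }
        } finally {
            coll.close();
        }
    }
}
```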

Add a DW-store to Documentum?
The technology coming from X-Hive has now become both the stand-alone XML database xDB and of course the Documentum XML Store that sits beside the File Store and the relational database manager. That provides a choice to store information as a document/file in the file store, as structured information in the SQL database, or as XML documents in the XML Store. Depending on the use case we have the choice of the optimal storage together with different ways of accessing it. There are some remarkable performance numbers for running XQueries on XML documents in the XML Store, as presented at EMC World 2010. Without knowing whether it makes sense from an architecture perspective, I think it would be interesting to have a Data Warehouse Store as yet another component of the Documentum platform. To some degree it is already in there within the Business Process Management components, where the Business Activity Monitor in reality is a data warehouse for process analytics. Analysis is offloaded from the SQL database and the information is stored in a different way to power the dashboards in Taskspace.
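
To give a feel for the XML Store option, here is a sketch using the standard XQJ API (JSR 225). I am assuming an XQJ-capable driver in front of the XML store; the driver class, document structure and query are placeholders, not the actual xDB API.

```java
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQExpression;
import javax.xml.xquery.XQResultSequence;

public class XmlStoreQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder driver class; substitute whatever XQJ implementation
        // fronts your XML store.
        XQDataSource ds = (XQDataSource)
                Class.forName("com.example.xqj.ExampleXQDataSource").newInstance();
        XQConnection conn = ds.getConnection("user", "password");
        try {
            XQExpression expr = conn.createExpression();
            // Hypothetical document structure: invoice numbers over a limit.
            XQResultSequence result = expr.executeQuery(
                "for $inv in collection('invoices')/invoice " +
                "where xs:decimal($inv/total) > 10000 " +
                "return string($inv/@number)");
            while (result.next()) {
                System.out.println(result.getItemAsString(null));
            }
        } finally {
            conn.close();
        }
    }
}
```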

Other potential pieces in the puzzle for EMC
I realize that Greenplum technology is mainly about scalability and big data applications, but to me it would make sense to also use the technology, just like xDB in Documentum, as a data warehousing store for the platform. A store focused on taking care of the structured data in a coherent platform together with the unstructured data that is already in there. Of course it would need a good front-end to enable using the data in the warehouse for visualisation, statistics and data mining. Rob Karel has an interesting take on that in his blog post. During EMC World 2010 EMC announced a partnership with Informatica around Master Data Management (MDM) and Information Lifecycle Management (ILM), which was also a move towards the structured data area. Rob Karel suggests that Informatica could be the next logical acquisition for EMC, although there seem to be more potential buyers for them. Finally he suggests picking up TIBCO, both to strengthen EMC’s BPM offering and of course to get access to the Spotfire data visualisation, statistics and data mining platform.

We have recently started working with Spotfire to see how we can use their easy-to-use technology to provide visualisations of content and audit trail data in Documentum. So far we are amazed at how powerful yet easy to use it is. In a matter of days we have even been able to create a statistics-server-powered visualisation showing the likelihood of pairs of documents being accessed together. Spotfire can then be used to replace Documentum Reporting Services and the BAM solution in Taskspace. Their server components are made in Java but the GUI is based on .NET, which is somewhat of a limitation but maybe something EMC can live with on the GUI side. The Spotfire Web Player runs fine on Macs with Safari, at least.

An opportunity to create great Social Analytics based on ECM
I hope the newly created Information Intelligence Group (IIG) at EMC sees this opportunity and can convince the management at EMC that there are these synergies, besides going for the expanding big data and cloud computing market that is on the rise. In the booming Enterprise 2.0 market, upcomers like Jive Software have added Social Analytics to their offering. Powering Centerstage with real enterprise-class BI is one way of staying ahead of competitors with much less depth in their platforms from an ECM perspective. Less advanced social analytics solutions based on dashboards will probably satisfy the market for a while, but I agree with James Kobielus that there will be a need for analysts in the loop, and these analysts expect more capable BI tools, just like Spotfire. It resonates well with our conceptual development, which suggests that a serious approach to advanced information management requires some specialists focusing on governing and facilitating the information in the larger enterprise. It is not something I would leave to the IT department; it is part of the business side, but with the focus on information rather than information technology.

iPhone/iPad and mobile access to ECM

Inspired by my recent discovery of a Documentum client for iPhone and iPad by Flatiron Solutions, I wanted to do some research into what is going on when it comes to mobile access to Enterprise Content Management systems using iPhone OS. It turned out that there are a few solutions out there, but first I would like to dwell a little on the rationale for all of this.

First of all, we are of course going more and more mobile. Sales of laptop computers are increasing at the expense of stationary ones. Wireless high-speed internet is no longer just available as AirPort/WiFi but also as 3G/4G connections using phones and dongles for laptops. Nothing new here. Another recent change is Web 2.0 and its work-related counterpart Enterprise 2.0, which is now gaining a lot of traction among companies and organisations. It is all about capitalizing on the Web 2.0 effects but in an Enterprise context. A lower threshold to produce information, and even more to participate with comments and ratings based on relationships to people. All this drives consumption of information even more, as the distance between producers and consumers is shorter than ever before.

Here comes the new smartphone (basically following the introduction of the iPhone), where it actually makes sense to use the phone for a number of different tasks which previously were possible but not very pleasant to do. The bigger form factor of the iPad to me opens even more possibilities where mobile access meets E2.0 based on ECM. Not only does the device make sense to use on the move, but it also has really good support for collaboration and sharing on the move.

It seems the open-source community is doing good here. Alfresco is an open-source ECM-system created by the founders of Documentum and Interwoven and there are actually a few solutions for accessing Alfresco on the iPhone. This slide share presentation outlines one solution:

iPhone Integration with Alfresco – Open Source ECM

Another is Freshdoc for the iPhone, developed by Zia Consulting. The company also seems to have presented a Fresh Docs for FileNet iPad application at the IBM IOD (Information on Demand) Conference in Rome, Italy, May 19–21. It is open source and can be downloaded at Google Code.
Yet another product that provides iPad access is the open-source Saperion ECM. Open Text Social Media also provides an iPhone app for their platform. Another company that seems to have an iPhone app in the works is Nuxeo.
Cara for iPhone is also available from Generiscorp – an application that uses CMIS to connect to repositories with CMIS support, which includes both Documentum and Alfresco.
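
CMIS also makes it easy to see what a portable client looks like in code. Below is a minimal sketch using Apache Chemistry OpenCMIS (my illustration, not necessarily what Cara uses); the service URL, repository id and credentials are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.enums.BindingType;

public class CmisConnect {
    public static void main(String[] args) {
        Map<String, String> params = new HashMap<String, String>();
        params.put(SessionParameter.USER, "user");       // placeholders
        params.put(SessionParameter.PASSWORD, "password");
        // The AtomPub service URL depends on the server; this one is made up.
        params.put(SessionParameter.ATOMPUB_URL, "http://ecm-host/cmis/resources/");
        params.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
        params.put(SessionParameter.REPOSITORY_ID, "myrepo");

        SessionFactory factory = SessionFactoryImpl.newInstance();
        Session session = factory.createSession(params);

        // The same code works against any CMIS-compliant repository,
        // Documentum or Alfresco alike - that is the point of the standard.
        Folder root = session.getRootFolder();
        System.out.println("Connected to root folder: " + root.getName());
    }
}
```
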
In our application, mobile access is of somewhat less importance, but the iPad changes that to some degree. Even if you maybe can’t offer mobile over-the-air access, enabling users to have large-screen multi-touch interfaces like the iPad is of course very interesting. From a Documentum perspective, the only thing we have seen in the mobile area from EMC itself is a BlackBerry client for Centerstage (check p.22 in the PDF) (there is also a BlackBerry client available for IRM). I understand that BlackBerry is popular in the US, but in terms of being visionary, having a nice iPhone OS app is important I think. As I said before, there are many similarities between how information is handled on the iPad and how an ECM system like Documentum handles information. It is all about metadata.

In the light of the fact that Flatiron’s iPhone app iECM so far is not said to be a product for purchase but rather a proof-of-concept, I wonder whether EMC or some partner would be the best way to provide a long-term iPhone OS app for Documentum.

EMC World 2010: Next-generation Search: Documentum Search Services

Presented by Aamir Farooq

Verity: Largest index 1 M Docs

FAST: Largest Index 200 M Docs

Challenging requirements today all require tradeoffs. Instead of trying to plug in third-party search engines, EMC chose to build an integrated search engine for content and case management.

Flexible Scalability being promoted.

Tens to Hundreds of Millions of objects per host

Indexing streams can be routed to different collections.

Two instances can be up and running in less than 20 min!

Online backup/restore is possible using DSS, instead of offline only as with FAST.

FAST only supported Active/Active HA. In DSS more options:

Active/Passive

Native security. Replicates ACLs and groups to DSS.

All fulltext queries leverage native security

Efficient deep facet computation within DSS with security enforcement. Security in facets is vital.

Enables effective searches on large result sets (underprivileged users are not allowed to see most hits in the result set)

Without DSS, facets computed over only first 150 results pulled into client apps

100x more with DSS

All metrics for all queries are saved and can be used in analytics. Run reports in the admin UI.

DSS Feature Comparison

DSS supports 150 formats (500 versions)

The only thing lacking now is thesaurus support (coming in v1.2)

Native 64-bit support for Linux and Windows (core DSS is 64-bit)

Virtualisation support on VMware

Fulltext Roadmap

DSS 1.0 GA is compatible with D6.5 SP2 or later. Integration with CS 1.1 for facets, native security and XQuery.

Documentum FAST is in maintenance mode.

D6.5 SP3, 6.6 and 6.7 will be the last releases that support FAST.

From 2011 DSS will be the search solution for Documentum.

Index Agent Improvements

Guides you through reindexing or simply processing new indexing events.

Failure thresholds. Configure how many error messages you allow.

One Box Search: as you add more terms it does OR instead of AND between the terms.

Wildcards are not allowed OOTB. This can be changed.

Recommendations for upgrade/migration

  • Commit to Migrate
  • No additional license costs – included in Content Server
  • Identify and Mitigate Risks
  • 6.5 SP2 or later supported
  • No change to DQL – XQuery available
  • Points out that both xDB and Lucene are very mature projects
  • Plan and analyze your HA and DR requirements

Straight migration. Build indices while FAST is running. Switch from FAST to DSS when indexing is done. Does not require multiple Content Servers.

Formal Benchmarks

  • Over 30 M documents spread over 6 nodes
  • Single node with 17 million documents (over 300 GB index size)
  • Performance: 6 M documents in FAST took two weeks. 30 M with DSS also took 2 weeks, but with a lot of stops.
  • Around 42% faster ingest for a single node compared to FAST

The idea is to use xProc to do extra processing of the content as it comes into DSS.

Conclusion

This is a very welcome improvement for one of the few weak points in the Documentum platform. We were selected to be part of the beta program, so I would have loved to tell you by now how great an improvement it really is. However, we were forced to focus on other things in our SOA project first. Hopefully I will come back in a few weeks or so and tell you how great the beta is. We have an external Enterprise Search solution powered by Apache Solr, and I often get the question whether DSS will make that unnecessary. For the near future I think it will not, and that is because the search experience is also about the GUI. We believe in multiple interfaces targeted at different business needs and roles, and our own Solr GUI has been configured to meet our needs from a browse-and-search perspective. From a Documentum perspective, the only client today that will leverage the faceted navigation is Centerstage, which is focused on asynchronous collaboration and is a key component in our thinking as well, but for different purposes. Also, even though DSS is based on two mature products (as I experienced at Lucene Eurocon this week), I think the capabilities to tweak and monitor the search experience will, at least initially, be much better in our external Solr than using the new DSS Admin Tool, although it seems like a great improvement from what the FAST solution offers today.

Another interesting development will be how the xDB inside DSS will relate to the “internal” XML Store in terms of integration. Initially they will be two servers, but maybe in the future you can start doing things with them together. Especially if next-gen Documentum will replace the RDBMS, as Victor Spivak mentioned as a way forward.

In the end, having a fast search experience in Documentum from now on is so important!

Further reading

Be sure to also read the good summary from Technology Services Group and Blue Fish Development Group about their take on DSS.
