EMC & Greenplum: Why it can be important for Documentum and ECM

EMC recently announced that it is acquiring Greenplum, which many people interpret as EMC putting more emphasis on the software side of the house. Greenplum focuses on data warehousing technology for the very large datasets of so-called “big data” applications, where the most public examples are Google, Facebook, Twitter and the like. The immediate reaction to this move by EMC is, of course, that it is a sign of market consolidation and of a desire to play among the largest players like Oracle/Sun, IBM and HP by offering customers a more complete hardware/software stack. Oracle/Sun, for its part, has the Exadata machine as an appliance-based route to data warehousing capability. Chuck Hollis comments on the move by highlighting how it is a logical step that fits nicely both with EMC's storage technology and with the virtualisation technology coming out of VMware. To underline its importance, EMC will create a new Data Computing Product Division out of Greenplum. As a side note, I think it would be better to keep the old name to preserve the “feeling” around the product, just as Documentum is a better name than the previous Content Management & Archiving Division. At first glance Greenplum seems to be an innovative company that can solve business problems where the established big RDBMS vendors do not seem able to scale far enough.

With my obvious focus on Enterprise Content Management, I would like to reflect on how I think, or maybe hope, this move will matter to that area of business. In our project we started looking deeper into data warehousing and business intelligence issues in January this year. Before that, our focus was on implementing a service-oriented architecture with Documentum as a core component. We already knew that meeting our challenges around advanced information management would require using AND integrating different kinds of products to solve different business needs: ECM for the unstructured content, Enterprise Search to provide a more advanced search infrastructure, GIS technology to handle maps and all spatial visualisation, and so on. That means accepting that there is no silver bullet, and instead trying to use the right tool for the right problem and letting each vendor do what it is best at.

Replicate data but stored differently for different use cases
SOA fanatics have a tendency to want very elegant solutions where everything is a service and every piece of information is requested as needed. That works fine for steadier, near-realtime solutions where the assumption is that only a small piece of information is needed at each moment. However, it breaks down when larger sets of data are needed for longer-term analysis, something which is fairly common for intelligence in support of military operations. If each analyst requests all that data over a SOAP interface, it does not scale well, and the specialised tool that each analyst needs is not used to its full potential. The solution is to accept that the same data sometimes needs to be replicated in the architecture for performance reasons. Sometimes the replica is simply a cache, which is common in GIS solutions to get responsive maps.

However, there is often also a need for different storage and information models depending on the use case. A massive audit trail stored in an OLTP system based on a SQL database, like Documentum, will grow big, and accessing it for analysis can be cumbersome. The result is that the whole system can be slowed down just because of one analysis. We therefore quickly understood the need for a more BI-optimized information model to be able to do massive user behaviour analytics with acceptable performance. In the end I think it is a usability issue. Hence the need for a data warehouse to offload data from an ECM system like Documentum. In fact this applies not only to the audit trail part of the database: analysing the sum of all metadata on the actual content objects also makes for excellent content analytics. The audit trail reveals user interaction and behaviour, while the content analytics part gives a helicopter perspective on what kind of data is stored in the platform. Joined together, the information provides quite powerful content/information and social analytics.
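As a rough illustration of what offloading the audit trail into a more BI-friendly model could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the column names, the sample event names, the `fact_daily_access` table and the use of an in-memory SQLite database as a stand-in for the warehouse. It is not how Documentum or Greenplum actually do it; the point is only that analysts query a pre-aggregated copy instead of the live OLTP tables.

```python
import sqlite3


def build_access_facts(rows):
    """Load copied audit rows into a separate store and pre-aggregate them.

    rows: iterable of (user_name, event_name, object_id, iso_timestamp).
    All table and column names are hypothetical, chosen for illustration.
    """
    # In-memory SQLite stands in for the data warehouse in this sketch.
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE audit_copy "
        "(user_name TEXT, event_name TEXT, object_id TEXT, ts TEXT)"
    )
    dw.executemany("INSERT INTO audit_copy VALUES (?, ?, ?, ?)", rows)

    # One row per user, event type and day: cheap to scan for dashboards.
    dw.execute(
        """
        CREATE TABLE fact_daily_access AS
        SELECT user_name,
               event_name,
               substr(ts, 1, 10) AS day,
               COUNT(*) AS events
        FROM audit_copy
        GROUP BY user_name, event_name, substr(ts, 1, 10)
        """
    )
    return dw


if __name__ == "__main__":
    sample = [
        ("anna", "getfile", "doc-1", "2010-07-01T09:00:00"),
        ("anna", "getfile", "doc-2", "2010-07-01T10:00:00"),
        ("bo", "checkin", "doc-1", "2010-07-02T08:00:00"),
    ]
    warehouse = build_access_facts(sample)
    query = "SELECT * FROM fact_daily_access ORDER BY user_name, day"
    for row in warehouse.execute(query):
        print(row)
```

In a real setup the copy step would be a scheduled extract from the repository database, and the aggregation level (per day, per user, per object type) would follow from the questions the analysts actually ask.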

Add a DW-store to Documentum?
The technology coming from X-Hive has now become both the stand-alone XML database xDB and, of course, the Documentum XML Store that sits beside the File Store and the relational database manager. That gives us a choice: store information as a document/file in the File Store, as structured information in the SQL database, or as XML documents in the XML Store. Depending on the use case, we can choose the optimal storage together with different ways of accessing it. Some remarkable performance numbers for running XQueries on XML documents in the XML Store were presented at EMC World 2010. Without knowing whether it makes any sense from an architecture perspective, I think it would be interesting to have a Data Warehouse Store as yet another component of the Documentum platform. To some degree it is already in there, within the Business Process Management components, where the Business Activity Monitor is in reality a data warehouse for process analytics. Analysis is offloaded from the SQL database, and the information is stored in a different way to power the dashboards in Taskspace.

Other potential pieces in the puzzle for EMC
I realize that Greenplum technology is mainly about scalability and big data applications, but to me it would make sense to also use the technology, just like xDB in Documentum, as a data warehousing store for the platform: a store focused on taking care of the structured data, in a coherent platform with the unstructured data that is already in there. Of course it would need a good front-end to enable using the data in the warehouse for visualisation, statistics and data mining. Rob Karel has an interesting take on that in his blog post. During EMC World 2010, EMC announced a partnership with Informatica around Master Data Management (MDM) and Information Lifecycle Management (ILM), which was also a move towards the structured data area. Rob Karel suggests that Informatica could be the next logical acquisition for EMC, although there seem to be more potential buyers for them. Finally, he suggests picking up TIBCO, both to strengthen EMC's BPM offering and, of course, to get access to the Spotfire data visualisation, statistics and data mining platform.

We have recently started working with Spotfire to see how we can use their technology to provide visualisations of content and audit trail data in Documentum. So far we are amazed at how powerful it is while still being very easy to use. In a matter of days we have even been able to create a visualisation, powered by the statistics server, showing the likelihood of pairs of documents being accessed together. Spotfire can then be used to replace Documentum Reporting Services and the BAM solution in Taskspace. Their server components are written in Java but the GUI is based on .NET, which is somewhat of a limitation but maybe something EMC can live with on the GUI side. The Spotfire Web Player runs fine on Macs with Safari at least.
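To give a feeling for the kind of calculation that sits behind a visualisation of documents accessed together, here is a small Python sketch. It is not what we built in Spotfire, and the statistic is my assumption for the example: it groups accesses into sessions (however one chooses to define them, for instance per user and day) and scores each document pair with the Jaccard index, that is, the share of sessions touching either document that touched both. The function name and the input format are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def co_access_scores(sessions):
    """Score how often pairs of documents are accessed together.

    sessions: iterable of lists of document ids, one list per session.
    Returns {(doc_a, doc_b): score} with doc_a < doc_b, where score is
    the Jaccard index: sessions with both / sessions with either.
    """
    doc_sessions = Counter()   # number of sessions containing each document
    pair_sessions = Counter()  # number of sessions containing both documents

    for docs in sessions:
        unique = sorted(set(docs))  # count each document once per session
        doc_sessions.update(unique)
        pair_sessions.update(combinations(unique, 2))

    scores = {}
    for (a, b), both in pair_sessions.items():
        either = doc_sessions[a] + doc_sessions[b] - both
        scores[(a, b)] = both / either
    return scores


if __name__ == "__main__":
    example = [["A", "B"], ["A", "B", "C"], ["A"], ["C"]]
    ranked = sorted(co_access_scores(example).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 2))
```

A proper statistics package would add significance testing on top of this, since pairs with only a handful of sessions behind them will produce noisy scores.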

An opportunity to create great Social Analytics based on ECM
I hope the newly created Information Intelligence Group (IIG) at EMC sees this opportunity and can convince EMC's management that these synergies exist, in addition to going for the expanding big data and cloud computing market. In the booming Enterprise 2.0 market, newcomers like Jive Software have added Social Analytics to their offering. Powering Centerstage with real enterprise-class BI is one way of staying ahead of competitors with much less depth in their platform from an ECM perspective. Less advanced social analytics solutions based on dashboards will probably satisfy the market for a while, but I agree with James Kobielus that there will be a need for analysts in the loop, and these analysts expect more capable BI tools such as Spotfire. It resonates well with our conceptual development, which suggests that a serious approach to advanced information management requires some specialists who focus on governing and facilitating the information in the larger enterprise. It is not something I would leave to the IT department; it is part of the business side, but with the focus on information rather than on information technology.

