<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.3.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>DigitalReasoning.com</title>
	<link>http://blog.digitalreasoning.com</link>
	<description>Finding meaning in human language</description>
	<pubDate>Fri, 29 Aug 2008 02:44:44 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.2</generator>
	<language>en</language>
			<item>
		<title>What is the Synthesys Platform?</title>
		<link>http://blog.digitalreasoning.com/2008/08/28/what-is-the-synthesys-platform/</link>
		<comments>http://blog.digitalreasoning.com/2008/08/28/what-is-the-synthesys-platform/#comments</comments>
		<pubDate>Fri, 29 Aug 2008 02:44:44 +0000</pubDate>
		<dc:creator>bill.day</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/08/28/what-is-the-synthesys-platform/</guid>
		<description><![CDATA[You may have noticed SynthesysSDK.com.

This begets the obvious question:  What is the Synthesys Platform?
The short answer:  Digital Reasoning&#8217;s Synthesys Platform provides the first true Software Development Kit (SDK) and server platform for Unstructured Data Analytics (UDA).
The slightly longer answer:  The Synthesys Platform helps you find unexpected, critical knowledge hidden in your [...]]]></description>
			<content:encoded><![CDATA[<p>You may have noticed <a href="http://synthesyssdk.com/">SynthesysSDK.com</a>.</p>
<p><img src="http://synthesyssdk.com/starshine.jpg" alt="Click to visit SynthesysSDK.com" width="512" height="105" /></p>
<p>This begets the obvious question:  What is the Synthesys Platform?</p>
<p>The short answer:  <a href="http://digitalreasoning.com">Digital Reasoning</a>&#8217;s <a href="http://synthesyssdk.com">Synthesys Platform</a> provides the first true Software Development Kit (SDK) and server platform for Unstructured Data Analytics (UDA).</p>
<p>The slightly longer answer:  The Synthesys Platform helps you find unexpected, critical knowledge hidden in your data.  Synthesys takes unstructured text as input, uses entity extraction with strong semantic relationship analysis to operate on the input, and then outputs abstracted knowledge objects.  You can then use these objects (people, places, connections, etc.) to understand and analyze what&#8217;s important.</p>
<p>For an in-depth answer and to speak to us about possibly joining our limited beta, please contact us via <a href="http://synthesyssdk.com/">the form on SynthesysSDK.com</a>.</p>
<p>You can also attend one of our upcoming events or tech talks, for example my &#8220;<a href="http://techfests.com/Tulsa/2008/Speakers/BillDay/default.aspx">Hacking the Meaning in Human Communication</a>&#8221; presentation at the upcoming <a href="http://techfests.com/Tulsa/2008">Tulsa TechFest</a> in early October.</p>
<p>Watch for much more information on the Synthesys Platform on this blog in the weeks to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/08/28/what-is-the-synthesys-platform/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Measurement improves software development</title>
		<link>http://blog.digitalreasoning.com/2008/07/10/measurement-improves-software-development/</link>
		<comments>http://blog.digitalreasoning.com/2008/07/10/measurement-improves-software-development/#comments</comments>
		<pubDate>Thu, 10 Jul 2008 15:00:35 +0000</pubDate>
		<dc:creator>peter.mancini</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[entity extraction]]></category>

		<category><![CDATA[f-measure]]></category>

		<category><![CDATA[f1]]></category>

		<category><![CDATA[GeoLocator]]></category>

		<category><![CDATA[measurement]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/07/10/measurement-improves-software-development/</guid>
		<description><![CDATA[There are two possible outcomes:
if the result confirms the hypothesis, then you&#8217;ve made a measurement.
If the result is contrary to the hypothesis, then you&#8217;ve made a discovery.
Enrico Fermi

A couple of years ago we started a process of programming that was very different than anything I’ve seen in the last 15 years or so that I’ve [...]]]></description>
			<content:encoded><![CDATA[<p align="right"><em>There are two possible outcomes:</em></p>
<p align="right"><em>if the result confirms the hypothesis, then you&#8217;ve made a measurement.</em></p>
<p align="right"><em>If the result is contrary to the hypothesis, then you&#8217;ve made a discovery.</em></p>
<p align="right"><strong>Enrico Fermi</strong></p>
<p align="left"><img border="0" vspace="5" width="512" src="http://ti.arc.nasa.gov/projects/worldwind/images/screenshots/07.jpg" alt="Photo courtesy NASA World Wind" height="384" /></p>
<p align="left">A couple of years ago we started a process of programming that was very different than anything I’ve seen in the last 15 years or so that I’ve been at it. We had a challenge given to us to produce a geographical location service built upon our entity extraction technology. It was an interesting exercise which at the time we had no experience doing. The object of the game is to read in text documents, discover location references, disambiguate them, look them up in a gazetteer and mark them up with the coordinates. This can be done either as an additional final section or, the more difficult case, in-line.</p>
<p align="left">So off we went. Now the very first attempts at measuring this were done by me. I had had a lot of statistics in college but never thought I’d really get to use it. I came up with my own measures which were pretty close to recall and precision. Giving both numbers just didn’t fly with the management at the time. It was confusing. They wanted one number. After a little research I discovered recall, precision and the mysterious F1 (or F-measure), which combines the two into a single number.</p>
<p align="left"> In the case of this task we defined tokens as either relevant or irrelevant. If the token represented a PPL (populated place) then it was relevant. Otherwise it was irrelevant. So if a relevant item was marked up with the correct location it was a true positive. If it was not marked up it was a false negative. If it was marked up with the wrong location, or if an irrelevant item was marked up, it was a false positive. The debates raged on what to do in the case where the system found a location but just did not disambiguate it correctly and over what to do when tokens were improperly co-located (as in what if &#8220;Rio de Janeiro&#8221; came up as &#8220;de Janeiro&#8221; instead). Ultimately we decided to keep it simple. Any error below the level of marking something right or wrong was deemed just a detail.</p>
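<p align="left">As a minimal sketch of the arithmetic (the function name and the counts are made up for illustration and are not taken from our test harness), precision, recall and F1 can be computed from the true positive, false positive and false negative counts like this:</p>
<pre>
```python
def f_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) on a 0-100 scale from token counts."""
    marked = true_pos + false_pos      # everything the system marked up
    relevant = true_pos + false_neg    # everything it should have marked up
    precision = true_pos / marked if marked else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical counts, not our real results.
precision, recall, f1 = f_measure(true_pos=30, false_pos=20, false_neg=70)
print(round(precision), round(recall), round(f1))
```
</pre>
<p align="left">Because F1 is a harmonic mean, it stays close to the weaker of the two numbers, which is why a system cannot score well by being good at only one of them.</p>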
<p>It took a lot of measurements and a lot of debate but we got it to work. This learning process produced a lot of healthy discussion and when we did finally decide on what formulas were best everyone could clearly see how to proceed.</p>
<p>The first day we calculated the f-measure of our geo-coordinate markup service it came up an astoundingly low 37 out of 100. I went over the numbers several times. Management wasn’t happy. What was decided next ended up being a great model for future development. We were put in a conference room with our computers and a white board. We were told not to leave until the f-measure was above 80. The way the development worked we had one person who did work on the trained categories system and another guy who did the application programming. I was doing measurements and creating reference sets. Three of us working towards one task, side by side.</p>
<p>We would discuss potential strategies and would then run them through the test harness. Every strategy would impact recall and precision. Often this would show how these concepts are opposed. As one is increased the other is decreased. What you are looking for is opposition that is not equal such that the f-measure rises. You want the decrease to be smaller than the increase. While it seems obvious most people don’t program that way. They come up with a bunch of ideas, implement them and just accept the measurements they get. In our case each change was tested. Yes it was slow but it separated out the good ideas from the bad ideas. We also, in this way, discovered other weaknesses that were fixed. If we had not been looking at this on a case by case basis we would have missed the subtle clues that helped us iron out the other parts of the system that were contributing to the final result.</p>
<p>I believe that honestly measuring your tools’ accuracy is important not just for sales and customer reassurance but also for the whole development life cycle. Efforts are underway to allow the unsupervised portion of the DRS system to aid in getting the Geo Reasoning system at or above 90 f-measure. Right now 75-80 is state of the art. Every point of f-measure gain beyond 80 is far more difficult to achieve than all the ones prior. However a learning system should be capable of this feat. <em>More on that later</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/07/10/measurement-improves-software-development/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Getting your mojo back with Dojo</title>
		<link>http://blog.digitalreasoning.com/2008/07/09/getting-your-mojo-back-with-dojo/</link>
		<comments>http://blog.digitalreasoning.com/2008/07/09/getting-your-mojo-back-with-dojo/#comments</comments>
		<pubDate>Wed, 09 Jul 2008 21:26:29 +0000</pubDate>
		<dc:creator>jeremy.gossett</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Digital Reasoning]]></category>

		<category><![CDATA[Dojo]]></category>

		<category><![CDATA[Matthew Russell]]></category>

		<category><![CDATA[O'Reilly Media]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/07/09/getting-your-mojo-back-with-dojo/</guid>
		<description><![CDATA[Matthew Russell, Director of Advanced Technology, joined Digital Reasoning in October of 2007 and has been making an impact since day one.  A talented and dedicated programmer, Matthew&#8217;s long hours and creative energy have been focused on improving the Interceptor user interface, architecting web applications and devising innovative ways to embed the company&#8217;s core [...]]]></description>
			<content:encoded><![CDATA[<p><em>Matthew Russell, Director of Advanced Technology, joined Digital Reasoning in October of 2007 and has been making an impact since day one.  A talented and dedicated programmer, Matthew&#8217;s long hours and creative energy have been focused on improving the Interceptor user interface, architecting web applications and devising innovative ways to embed the company&#8217;s core technology platform into commercial products.  It has been a busy year for Matthew.  Despite the relocation to Nashville and devoting countless hours to the work at Digital Reasoning, Matthew managed to complete his first book - &#8220;<a href="http://www.amazon.com/Dojo-Definitive-Guide-Matthew-Russell/dp/0596516487/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1214417200&amp;sr=8-1">Dojo - The Definitive Guide</a>&#8221; - published by <a href="http://press.oreilly.com/pub/pr/2024">O&#8217;Reilly Media</a> and released on June 17th.  We talked to Matthew about <a href="http://blog.digitalreasoning.com/2008/06/15/powering-up-ajax-apps-with-dojo/">his book</a>, the writing experience and plans for the future.</em></p>
<p><a href="http://www.amazon.com/Dojo-Definitive-Guide-Matthew-Russell/dp/0596516487/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1214417200&amp;sr=8-1"><img src="http://oreilly.com/catalog/covers/9780596516482_cat.gif" alt="The Definitive Guide" align="right" width="180" height="236" /></a></p>
<blockquote><p><strong>Q: What is Dojo?</strong><br />
<strong>A:</strong> <a href="http://dojotoolkit.org/">Dojo</a> is a piece of client-side technology - JavaScript based - that creates a great user experience on the web.  Technically speaking, it&#8217;s a toolkit: something you can use to create a great user experience in a web browser.</p>
<p><strong>Q: What makes Dojo superior to other JavaScript toolkits?</strong><br />
<strong>A:</strong>  The overall architecture is very well thought out.  It&#8217;s industrial strength, it&#8217;s battle tested.  Big blue chip companies are using it.  And it has tremendous breadth and depth.  It doesn&#8217;t just solve a little narrow problem, it can solve lots and lots of different kinds of problems, but the solutions aren&#8217;t just cursory&#8230;they are very involved.</p>
<p><strong>Q: How and when were you introduced to Dojo?</strong><br />
<strong>A:</strong> At a previous company, a colleague and I worked on all these applications for the intelligence community and one really common issue with intelligence datasets was that there was generally a lot of data that needed to be displayed in a tabular format.  We started to scope out what other people have done&#8230;other technologies in the JavaScript toolkit realm and Dojo was one of those.  From there, I started to learn all the other things Dojo automates and makes simpler.</p>
<p><strong>Q: What other writing have you done and how did this book come about?</strong><br />
<strong>A:</strong>  I had a great professor while studying Computer Science at the Air Force Academy who was my thesis advisor and he cultivated writing in a way that while writing your thesis you would produce enough materials for white papers and technical papers.  I started writing fairly frequently for <a href="http://oreilly.com/">O&#8217;Reilly</a> on the <a href="http://www.macdevcenter.com/">MacDevCenter</a> site at the time.  So, I had been doing development for Dojo and thought it would be a neat thing to write about.  I sent in a pitch for an article on the topic and it sort of escalated and eventually someone got back to me and said maybe we want to write something bigger&#8230;maybe a book.</p>
<p><strong>Q: How long did it take to write the book and what was that experience like?</strong><br />
<strong>A:</strong>  The actual book writing process took roughly 10 months.  I signed the contract last July and I put the finishing touches on it the first week of June.  I would estimate I spent roughly 1200 hours writing the book.  One thing about writing a book - it&#8217;s not just about knowing the material from a technical standpoint.  There&#8217;s so much overhead.  How do I organize these thoughts?  What information do I put in what chapter?  What&#8217;s the most logical ordering for chapters?  How do I keep the content written in such a way that it engages the reader and doesn&#8217;t become boring, dry, technical material?  I think I stayed true to that O&#8217;Reilly style of keeping it fun and engaging the readers.  The hardest thing about writing the book in my opinion is that it has always been a moonlighting effort for me, it&#8217;s not my daytime job.  So, if you can imagine, way more than 50% of your nights and weekends, for almost a year, being taken up.  After you&#8217;ve been to work, had a long hard day, okay, you come home, eat dinner, bore your family for a while, then sit there for six hours writing till the wee hours of the morning&#8230;that&#8217;s the hardest part.</p>
<p><strong>Q: What are your expectations for the book?</strong><br />
<strong>A:</strong>  I personally always looked at a book as being successful if it goes into a second edition.  It must have been good enough to keep selling beyond that first threshold.  I think they&#8217;re printing between 8,000 to 12,000 copies of my book.  I would be really happy if it goes into a second edition.</p>
<p><strong>Q: As a result of writing &#8220;Dojo - The Definitive Guide&#8221; you&#8217;ve had a few new opportunities to share your expertise on the subject.  Tell us about being invited to speak at OSCON, The Open Source Conference, and the June article in Linux Journal.</strong><br />
<strong>A:</strong>  I was encouraged by my O&#8217;Reilly editor to submit a proposal for a talk, and I would imagine that having O&#8217;Reilly care enough to publish a book on the topic in the first place, probably helped some. Getting in to do the talk wasn&#8217;t a given, but having the book probably helped.  My <a href="http://en.oreilly.com/oscon2008/public/schedule/speaker/6606">OSCON</a> talk is on a component of Dojo called GFX.  It&#8217;s a sub-project of Dojo that allows a developer to do drawing and animation on the screen using one of many backends&#8230;SVG, Microsoft Silverlight, VML and in theory you could plug in any kind of drawing backend into it, you write the code according to this GFX API, pick the backend you want to render it with and it just happens.  You write the code once and point it anywhere.</p>
<p>I submitted a proposal for <a href="http://www.linuxjournal.com">Linux Journal</a> last summer.  I was just perusing their site and noticed they had an issue coming out about web technology.  I thought it might be a good way to get Dojo out there into the mainstream even further than the book.</p>
<p><strong>Q: What have you learned going through this book-writing process?</strong><br />
<strong>A:</strong> I&#8217;ve really come to appreciate just how much work it is.  The next time I see a typo in a book I&#8217;m going to give the author a lot more slack than I used to.</p>
<p>Knowing technical content is one thing.  Being able to communicate is another thing.  Being able to communicate technical content is a third thing.  Then writing a book about it is entirely different.</p></blockquote>
<p>Digital Reasoning is fortunate to employ some of the best and brightest minds in their fields and Matthew Russell is no exception.  You can find his book - &#8220;<a href="http://www.amazon.com/Dojo-Definitive-Guide-Matthew-Russell/dp/0596516487/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1214417200&amp;sr=8-1">Dojo: The Definitive Guide</a>&#8221; - on bookshelves now.  Subscribers to Linux Journal can <a href="http://www.linuxjournal.com/article/9900">click here</a> to read Matthew&#8217;s article &#8220;<a href="http://www.linuxjournal.com/article/9900">Dojo: the JavaScript Toolkit with Industrial-Strength Mojo</a>&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/07/09/getting-your-mojo-back-with-dojo/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Measuring associative networks for quality of analytics</title>
		<link>http://blog.digitalreasoning.com/2008/06/24/measuring-associative-networks-for-quality-of-analytics/</link>
		<comments>http://blog.digitalreasoning.com/2008/06/24/measuring-associative-networks-for-quality-of-analytics/#comments</comments>
		<pubDate>Tue, 24 Jun 2008 18:50:56 +0000</pubDate>
		<dc:creator>peter.mancini</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[associative network]]></category>

		<category><![CDATA[f-measure]]></category>

		<category><![CDATA[QoA]]></category>

		<category><![CDATA[Quality of Analytics]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/06/24/measuring-associative-networks-for-quality-of-analytics/</guid>
		<description><![CDATA[
“I personally think we developed language because of our deep inner need to complain.” – Jane Wagner

&#160;
When it comes to most text analytical tools the only measures given are recall and precision. Some may tie them up nicely and appropriately in the f-measure which is simply the harmonic mean of those two numbers. Usually the [...]]]></description>
			<content:encoded><![CDATA[<blockquote>
<p><em>“I personally think we developed language because of our deep inner need to complain.” – Jane Wagner</em></p>
</blockquote>
<p>&nbsp;</p>
<p>When it comes to most text analytical tools the only measures given are recall and precision. Some may tie them up nicely and appropriately in the f-measure which is simply the harmonic mean of those two numbers. Usually the discussion of quality ends there and you are quickly whisked off into discussions of functionality, user interface and processing speed.</p>
<p>As I wrote before there are many issues with measuring NLP tools. One of those issues is a lack of accredited measures to apply to them. At DRS we have at the root of our analysis something called the associative network. You can read about how these work in theory and examine a few examples of them. Generally there are a lot of problems with them revolving around their explosive need for memory and the time it takes to process them. At DRS we’ve solved a lot of those problems and find that medium sized corpora work just fine on your standard 2GB laptop. Let me briefly explain what an associative network is, as we’ve defined it.</p>
<p>An Associative Network is a set of related elements from a distribution of elements based on shared features to one or more elements selected from that distribution. Essentially, it is supposed to give you ranked elements that are semantically &#8220;closer&#8221; to the element(s) provided for comparison. The effectiveness of Associative Networks generally turns on (a) the selection of features of the elements in the distribution to compare and (b) the features of the element provided for comparison that are relevant in ranking. For instance, if I were to provide &#8220;fly&#8221; as a linguistic element to a set of linguistic elements in a data set, I might want &#8220;flying&#8221;, &#8220;traveling&#8221;, and &#8220;moving&#8221; as my expected association. This, of course, assumes the &#8220;sense&#8221; of &#8220;fly&#8221; is as a predicate and not as an entity (such as an insect). If the latter were the case, I might expect the associations to be &#8220;insect,&#8221; &#8220;bug,&#8221; and &#8220;fruit fly.&#8221;</p>
<p>The key above is to recognize features about the elements as used in the data (&#8220;fly&#8221; as predicate and &#8220;fly&#8221; as entity would have very different features if properly measured) and which features are apt for comparison (the string &#8220;fly&#8221; may be insufficient to specify the appropriate set of features to prioritize because its sense may be ambiguous without the user selecting &#8220;entity&#8221; or &#8220;predicate&#8221; as a qualifier on &#8220;fly&#8221;). The ideal Associative Network solves the traditional Natural Language Processing problems of automatic thesauri creation, clustering of semantic nearest neighbors, and brings us very close to effective, unsupervised sense disambiguation technology. Those are some ways that Digital Reasoning applies its Associative Network technology.</p>
<p>This technology is exciting and is very new in commercial grade applications. It is important to understand the strengths and weaknesses of this tool. If you were evaluating an analytical tool it would be important for you to evaluate the accuracy of such a system and its utility. I was asked months ago to come up with a measure for Associative Networks. Since I am lazy I went and looked high and low for someone else’s measure first! Sadly there wasn’t anything out there. So I started to analyze what was coming out of our tool. Every attempt failed to produce something I would want to show because the scientific side of me rejected the processes I was developing. The problem was subjectivity. Your hard sciences like physics have unambiguous predictions from theory. The strange quark’s charge is always going to be -1/3 of the elementary charge by the Standard Model and as predictions go this one has always been measured this way in experimentation. When we get to softer sciences things start to get a little more ambiguous and subjective. As I stated in a prior post you have to reject subjectivity as much as possible.</p>
<p>So there I was staring at input terms for the associative net and the resulting list of associated terms given as output. What, therefore, defines good associations for “tree” or “Teddy Roosevelt” or “quark”? When I looked at the various ways in which the associative network can be tweaked (there are many variables that control the process) and the fact that different corpora will produce different associations I began to think there was no way to measure this. At least not in a non-subjective way. Subjectively I can look at a list and using my own knowledge say whether the list “looked right” or not. That is hardly a measure. It certainly isn’t scientific.</p>
<p>Throwing all subjectivity out the window I needed to find a scientific method… I had to make predictions and prove them out through experimentation. Then it hit me. It’s not just the associations. It’s the network. There should be a way of looking at two terms and predicting a third in relationship to both. So, taking an analyst’s approach I looked at a document from one of my corpora and found that the USS Nimitz has 5,900 sailors and the reactor has a peak-output of 190 MW. Ok, now we are talking. The intersection of associations between USS Nimitz and Sailors should contain 5,900 and the intersection of USS Nimitz and peak-output should contain 190 MW. It seems so simple and yet it eluded me for 2 months trying to solve this problem. I am currently working on a test of this concept and the write-up of the theory. I am sure I will come across some interesting issues and along the way discover more ways of testing associative nets and other semantically related data organization tools. Making these methods open allows them to be used widely. Making them general in use (this method could be used on a wide variety of systems, including humans) gives them much more universal applicability. I’ll use this place as my initial forum to announce the results of the experiment and methods one can use to replicate the experiment.</p>
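<p>To make the idea concrete, here is a rough sketch of such an intersection test. Everything in it is hypothetical: the dictionary stands in for whatever an associative network returns for each term, the associated terms are invented, and the function names are mine. It is not the real test or the real output of our tool.</p>
<pre>
```python
# Hypothetical output of an associative network: term -> associated terms.
associations = {
    "USS Nimitz": ["aircraft carrier", "5,900", "190 MW", "reactor"],
    "sailors": ["crew", "5,900", "navy"],
    "peak-output": ["reactor", "190 MW"],
}


def shared_associations(network, term_a, term_b):
    """Return the terms associated with both term_a and term_b."""
    return set(network.get(term_a, ())).intersection(network.get(term_b, ()))


def prediction_holds(network, term_a, term_b, expected):
    """True if the predicted third term appears in the intersection."""
    return expected in shared_associations(network, term_a, term_b)


# Predictions taken from a source document, checked against the network.
print(prediction_holds(associations, "USS Nimitz", "sailors", "5,900"))
print(prediction_holds(associations, "USS Nimitz", "peak-output", "190 MW"))
```
</pre>
<p>Each prediction either holds or it does not, so a set of them drawn from source documents can be scored without anyone judging whether a list &#8220;looks right&#8221;.</p>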
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/06/24/measuring-associative-networks-for-quality-of-analytics/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Powering up Ajax apps with Dojo</title>
		<link>http://blog.digitalreasoning.com/2008/06/15/powering-up-ajax-apps-with-dojo/</link>
		<comments>http://blog.digitalreasoning.com/2008/06/15/powering-up-ajax-apps-with-dojo/#comments</comments>
		<pubDate>Sun, 15 Jun 2008 12:49:34 +0000</pubDate>
		<dc:creator>bill.day</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Ajax]]></category>

		<category><![CDATA[book]]></category>

		<category><![CDATA[Dojo]]></category>

		<category><![CDATA[Matthew Russell]]></category>

		<category><![CDATA[O'Reilly]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/06/15/powering-up-ajax-apps-with-dojo/</guid>
		<description><![CDATA[Congrats to our friend and colleague Matthew Russell on the publication of his new O&#8217;Reilly book, &#8220;Dojo:  The Definitive Guide&#8221;.

Watch our blog for more information from Matthew on his new book.
]]></description>
			<content:encoded><![CDATA[<p>Congrats to our friend and colleague <a href="http://www.oreillynet.com/pub/au/2054">Matthew Russell</a> on <a href="http://news.oreilly.com/2008/06/powering-up-ajax-apps-with-doj.html">the publication of his new O&#8217;Reilly book</a>, &#8220;<a href="http://oreilly.com/catalog/9780596516482/index.html">Dojo:  The Definitive Guide</a>&#8221;.</p>
<p><a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&#038;location=http%3A%2F%2Fwww.amazon.com%2FDojo-Definitive-Guide-Matthew-Russell%2Fdp%2F0596516487&#038;tag=billday&#038;linkCode=ur2&#038;camp=1789&#038;creative=9325"><img src="http://ecx.images-amazon.com/images/I/51vzWGdNSKL._SS500_.jpg" alt="Click to read reviews or buy a copy from Amazon" /></a></p>
<p>Watch our blog for <a href="http://blog.digitalreasoning.com/2008/07/09/getting-your-mojo-back-with-dojo/">more information from Matthew on his new book</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/06/15/powering-up-ajax-apps-with-dojo/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The problem of quality assessment of NLP tools</title>
		<link>http://blog.digitalreasoning.com/2008/05/16/the-problem-of-quality-assessment-of-nlp-tools/</link>
		<comments>http://blog.digitalreasoning.com/2008/05/16/the-problem-of-quality-assessment-of-nlp-tools/#comments</comments>
		<pubDate>Fri, 16 May 2008 23:27:10 +0000</pubDate>
		<dc:creator>peter.mancini</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[AN]]></category>

		<category><![CDATA[correctness]]></category>

		<category><![CDATA[QA]]></category>

		<category><![CDATA[quality]]></category>

		<category><![CDATA[utility]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/05/16/the-problem-of-quality-assessment-of-nlp-tools/</guid>
		<description><![CDATA[Hello, my name is Peter Mancini and I&#8217;ve been with Digital Reasoning for over 3 years. Lately my work has been focused on developing quality metrics for our NLP/UDA tools. Below I have a sort of random discussion about the philosophy of quality assessment. It is not a thorough discussion but more a starting point [...]]]></description>
			<content:encoded><![CDATA[<p><font face="Times New Roman"><em>Hello, my name is Peter Mancini and I&#8217;ve been with Digital Reasoning for over 3 years. Lately my work has been focused on developing quality metrics for our NLP/UDA tools. Below I have a sort of random discussion about the philosophy of quality assessment. It is not a thorough discussion but more a starting point for discussion. I am always interested in hearing comments </em></font><font face="Times New Roman"><em>about my work and in this I am particularly interested since it is new ground for me.</em> </font></p>
<p><font face="Times New Roman">The difficulties with measuring the quality of any natural language system are great. The main source of difficulty comes from the subjective nature of language. Language is also living and a speaker or author can get creative with the use of words in a way that breaks semantic rules but still conveys meaning. I believe that the vast majority (&gt;90%) of language departs from the standard language at least once per written page if not more on average. Further, look at spoken language taken in transcripts; people reconstruct sentences half-way through. Speakers will often not even finish sentences but trade sentence fragments back and forth and both speakers will feel they got the meaning of the dialog correctly. A decent NLP system has to deal with this since  most of the input data will typically not be peer reviewed white papers but text collected from many sources. I propose the only useful measure of the quality of NLP tools is utility: how useful is the tool at helping the wielder perform their task. Note the difference between “utility” and “correctness.” Correctness is getting the standard language right, utility is leveraging the underlying meaning of the text for further purposes.</font></p>
<p><font face="Times New Roman"> </font><strong><em><font face="Times New Roman">Utility vs. Correctness</font></em></strong></p>
<p><font face="Times New Roman">Utility is the ability to use the output to perform a task. Correctness is the ability of the tool to register its output with a set of rules. However, as we said, language is living and often the rules are broken. Many times they are broken by accident and other times they are specifically broken to convey a special meaning. Another aspect is looking at how an entity might be used and possibly changing its <em>canon</em> definition with a more useful one.</font></p>
<p><font face="Times New Roman">Take this as an example. Your NLP processor is parsing the following text:</font></p>
<p><font face="Times New Roman"> </font>“The hospital is expected to be opened on 12/02/2008.”<font face="Times New Roman"> </font></p>
<p><font face="Times New Roman">In the above utterance we see at the end a string of numbers and slashes. Most systems would classify 12/02/2008 as one cardinal number. Other systems would classify it as 3 cardinal numbers with separators. Either way is correct. However, when it is displayed as the following:</font></p>
<p><font face="Times New Roman"> </font>“The hospital is expected to be opened on 12 February 2008.” </p>
<p><font face="Times New Roman">…then it becomes Cardinal Number, Proper Noun Singular and Cardinal Number. OK, that is correct, but is it useful to be correct here? How is this presentation substantially different from the prior one, other than cosmetically? In this case it would be more interesting from a programming perspective to treat February as a type of Cardinal Number. This gives the programmer more power to recognize dates no matter how they are presented. You run into problems with the months of April, May, June and August, but only because they are also names of people; there is always a trade-off. The point is that you can get more utility out of the system by identifying dates more easily with one approach than with another. The quality of the system therefore shouldn’t be based upon how correct it is but upon how well the output can be leveraged by software.</font></p>
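<p><font face="Times New Roman">A minimal Python sketch of this idea follows. It is an illustration only, not the tagging used in our tools: the helper name find_dates is hypothetical, numeric dates are assumed to be written day-first (as in the example above), and a month name is simply mapped to its number so that both presentations yield the same result.</font></p>
<pre>
```python
import re

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Treat each month name as a number, so "February" behaves like "02".
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumption: numeric dates are day/month/year, as in the example sentence.
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
WRITTEN = re.compile(r"\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b")


def find_dates(text):
    """Return (day, month, year) tuples for both date presentations."""
    found = []
    for day, month, year in NUMERIC.findall(text):
        found.append((int(day), int(month), int(year)))
    for day, name, year in WRITTEN.findall(text):
        month = MONTHS.get(name.lower())
        if month is not None:  # skip words that are not month names
            found.append((int(day), month, int(year)))
    return found


print(find_dates("The hospital is expected to be opened on 12/02/2008."))
# [(12, 2, 2008)]
print(find_dates("The hospital is expected to be opened on 12 February 2008."))
# [(12, 2, 2008)]
```
</pre>
<p><font face="Times New Roman">Requiring a day and a year around the month name is one way to sidestep the April/May/June/August ambiguity in this sketch, at the cost of missing looser forms such as a bare month and year.</font></p>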
<p><font face="Times New Roman"> </font><strong><em><font face="Times New Roman">Soft vs. Hard Assessment</font></em></strong></p>
<p><font face="Times New Roman">All quality assessments have some measure of subjectivity to them. When you look at entity extraction, for example, even with well defined concepts such as “location” you end up with ambiguities that are resolved more by opinion than objective determination.</font></p>
<p><font face="Times New Roman">Consider these examples:</font></p>
<ol>
<li>“We traveled to Boston.”</li>
<li>“We were there in 1999.”</li>
<li>“He was forced into exile.”</li>
<li>“The accusation was all in his mind.”</li>
</ol>
<p><font face="Times New Roman">In example 1 we have an unambiguous location. In example 2 we have a location in time. Time is the 4<sup>th</sup> dimension but we can’t place the location geographically. In example 3 we have an abstract location. It might be possible to determine the extents of a physical location, but one could be in exile in a very abstract way that does not imply a change of location. In example 4 we have a further example of a highly abstract location. You can take this to further abstractions. “They ran in the Boston Marathon” might be a location to some, as it implies a well established route. In various discussions of this last example some have told me they would prefer the system to designate the location as Boston, Massachusetts. As a longtime resident of the area I can assure you less than a mile of the race is actually within the city limits of Boston. Also, the word Boston here is actually a modifier for Marathon, so if you accept Boston Marathon then you also have to accept other places where Boston is a modifier, such as Boston Creme Pie, Boston Whaler, Boston College (actually located in Chestnut Hill), Boston Legal (only exists on TV), Douglas A-20 Boston (a WWII bomber), Boston Market (a chain of restaurants) and many others.</font></p>
<p><font face="Times New Roman">The key in situations like this is to set your rules and be consistent when scoring against them. One also has to keep the amount of subjective judgment in any analysis to an absolute minimum, which you do by setting up reasonable constraints on the concept. Otherwise you have “open ended criteria” without a pre-determined expected outcome, which means that any analysis done with these parameters will leave you accepting too many false positives as true.</font></p>
<p><font face="Times New Roman">Our biggest challenge today is coming up with a model for qualitative analysis of Association Networks (AN). These are terms that are associated with a given term. They can include synonyms, narrower terms, broader terms, attributes and a host of other things due to semantic interaction. Here we will first be looking to evaluate quality through utility. It would be hard to measure “correctness” for two reasons. The first is: what is the definition of correctness for an association network? The second is that it takes a lot of data to create a good association network, and thus building a reference set that is perfectly understood, such that one can predict all of the output from the input, is prohibitively expensive. This second issue is more a testament to my laziness than to the impossibility of measuring correctness, but there you have it. Looking at utility we still have an issue with subjectivity. The key will be to minimize that subjectivity so that creative and irreproducible results do not plague the analysis.</font></p>
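<p><font face="Times New Roman">To make the effect of fixed rules concrete, here is a small Python sketch that scores extracted “locations” against a reference set. The rule (only geographic places count), the function name precision_recall and the sample extractor output are all assumptions made up for illustration, using the four example sentences above.</font></p>
<pre>
```python
def precision_recall(predicted, gold):
    """Score extracted mentions against a gold set built under fixed rules."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted.intersection(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Assumed rule for this example: only geographic places count as locations.
gold = {"Boston"}
# Hypothetical output of a permissive extractor that also accepts
# locations in time and abstract locations.
predicted = {"Boston", "1999", "exile", "his mind"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.25 recall=1.00
```
</pre>
<p><font face="Times New Roman">Under a different rule set the same output would score differently, which is why the rules have to be fixed before scoring rather than decided case by case.</font></p>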
<p><font face="Times New Roman">My next post will be about measuring the quality of Association Networks - what I&#8217;ve tried and what I think works.</font></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/05/16/the-problem-of-quality-assessment-of-nlp-tools/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Somebody finally said it</title>
		<link>http://blog.digitalreasoning.com/2008/04/30/somebody-finally-said-it/</link>
		<comments>http://blog.digitalreasoning.com/2008/04/30/somebody-finally-said-it/#comments</comments>
		<pubDate>Wed, 30 Apr 2008 17:27:37 +0000</pubDate>
		<dc:creator>tim.estes</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/04/30/somebody-finally-said-it/</guid>
		<description><![CDATA[I recently gave a presentation at the Nashville Technology Council&#8217;s 4th Annual Innovation Conference.
I noted in my presentation that there isn’t any real early stage innovation investing in Middle Tennessee outside of the healthcare space.  In fact, the truth is that most TN venture firms don’t even keep their money in-state:

Now, I’ll predict the reactions to [...]]]></description>
			<content:encoded><![CDATA[<p>I recently gave a presentation at the <a href="http://www.technologycouncil.com/" title="Nashville Technology Council">Nashville Technology Council</a>&#8217;s <a href="http://www.technologycouncil.com/news.php?viewStory=1220" title="NTC Innovation Conference homepage">4th Annual Innovation Conference</a>.</p>
<p><a href="http://blog.digitalreasoning.com/wp-content/uploads/2008/05/innovation_2008.pdf" title="NTC Innovation Conference presentation slides">I noted in my presentation</a> that there isn’t any real early stage innovation investing in Middle Tennessee outside of the healthcare space.  In fact, the truth is that most TN venture firms don’t even keep their money in-state:</p>
<p><img src="http://blog.digitalreasoning.com/wp-content/uploads/2008/04/tninvestmentsbylocation.jpg" alt="Tennessee investments by location" /></p>
<p>Now, I’ll predict the reactions to this from the community:</p>
<ol>
<li>Some will be offended and shoot the messenger</li>
<li>Some will point to all of the new initiatives going on to spur entrepreneurship in the area</li>
<li><a href="http://venturenashville.blogspot.com/2008/03/venture-notes_30.html">Some will agree</a> and try to figure out ways to change it (but not have the resources to do anything about it)</li>
<li>One or two may get involved that can make a difference and invest in new things (maybe)</li>
</ol>
<p><a href="http://blog.digitalreasoning.com/wp-content/uploads/2008/05/innovation_2008.pdf" title="Read my NTC Innovation Conference presentation">Click here to read my full presentation (PDF format)</a>.  Below I&#8217;ve summarized the key takeaways.</p>
<p>First, the presentation wasn’t meant to be critical of any particular group. The funds that are analyzed have an obligation to their investors, not to Tennessee or particular industries. If they make good returns, they are doing their job. Enough said.</p>
<p>While it would be great if their money stayed in Tennessee and was used by Tennessee innovators to create new things, investing in-state is not the job of these VCs.  Their job is to make money.  It is true that funds in major markets (San Francisco, Boston, Seattle, DC, etc.) have a track record of making money through innovation and riskier investing. Some markets (like Nashville) make money by investing in consistent performers – even though that gets a lot less press and generates less excitement. Whatever floats someone’s <a href="http://www.investopedia.com/terms/i/irr.asp">IRR</a>…</p>
<p>Second, the opportunities laid out in my presentation (peer to peer platforms, novel content distribution and recommendation services, and healthcare informatics) are VERY ACTIVE and TIMELY opportunities. It will be a crime if we don&#8217;t have at least one or two startups in Middle Tennessee that go after each of these opportunities.</p>
<p>Unfortunately, that means that somewhere between $20M and $100M has to go to work for them to really have a chance. It doesn’t all have to come from here, just the seed and Series A money ($5-15M of the above amount). Groups such as Sequoia Capital will place growth capital in markets outside Silicon Valley, just not early stage capital.</p>
<p>Therefore, we need emerging Middle Tennessee companies to be funded by angels and early stage groups with critical ties to mainstream companies.  These early investors also need ties to executives with strong track records recruited from ongoing concerns.  If we can get these kinds of early investments, then big funds will have better trust in the business development acumen of Nashville-based startups and follow-on funding will become available to help these emerging companies compete on the national and international stages.</p>
<p>Third, we need strategic involvement and leadership from major companies in the area. We need strategic investment to go into firms such as Nissan, HCA, Ingram, MCA, and AT&amp;T. We need executives willing to team with smart tech people and put their credibility to work to expand and grow the aforementioned emerging markets. Only by bridging experience, relationships, and innovation can we transform the Middle Tennessee economy into something that will be competitive for decades to come. That means that we need major firms to see the upside in spin-out innovation and equity/licensing strategies, not just bottom lines. This should be easier for private companies to experiment with than public ones.</p>
<p>Clearly every person has to do what works for them. If you are a VP at MCA or a deputy CIO at Nissan or Gaylord, you’ve got security and good personal upside. But there may be an opportunity to build something new and great with the right people and idea.</p>
<p>Young people in Nashville trying to do innovative things aren’t getting the backing and support they need. Many are being turned down repeatedly, often without being given a reason. Many of them get frustrated and leave. I heard this directly from numerous people after my speech. These aren’t people with sour grapes. They are bright, idealistic, passionate, and talented. And they are tired of being treated like unproven fodder by the people holding the money. If Middle Tennessee isn&#8217;t going to support them, they will go somewhere where there are a hundred options instead of five.  They&#8217;ll fall in love with one of those options, break out, and become successful.  And Tennessee will miss out on an economic opportunity in the process.</p>
<p>My request to executives and investors is this:  If an up-and-comer presents their idea to you, please take the time to explain your thinking whether you ultimately say yes or no. They need constructive guidance and they need to connect with people that can help them grow into leadership and teach them how to monetize their ideas.</p>
<p>I know that time is valuable, and if you aren’t going to invest, it&#8217;s easy to just toss a group over the side and move to the next one. But after 3-5 tosses, many of these entrepreneurs will move, and then they won&#8217;t be around four to five years later to give you the best deal you’ve ever seen.  Make the extra effort. It will come around to your advantage and the betterment of our whole economy in the long run.</p>
<p>Failing to support new ideas and entrepreneurs is not going to make the people that have money now poorer in the near term.   But by not supporting innovation in Middle Tennessee now, we are sending people with talent elsewhere, to places where tremendous money is being made by backing these intelligent and ambitious people.</p>
<p>We can and should do better.</p>
<p>This is all meant as free advice for a place that I love from <a href="http://www.wnfoundersmuseum.org/foundfamilies.htm">one of the sons of its founders</a> (<a href="http://www.wnfoundersmuseum.org/foundfamilies.htm">I can trace my family roots here back to the Gower family</a>). Nashville can choose to be a hub of critical information and knowledge, entertainment, and culture in America, but it has some work to do.  Early stage innovation investing inside Tennessee would be a good start.  Suggestions on how to &#8220;get there from here&#8221; are very welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/04/30/somebody-finally-said-it/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Sun highlights Digital Reasoning Systems</title>
		<link>http://blog.digitalreasoning.com/2008/04/04/sun-highlights-digital-reasoning-systems/</link>
		<comments>http://blog.digitalreasoning.com/2008/04/04/sun-highlights-digital-reasoning-systems/#comments</comments>
		<pubDate>Fri, 04 Apr 2008 16:47:56 +0000</pubDate>
		<dc:creator>bill.day</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[commercial]]></category>

		<category><![CDATA[customer]]></category>

		<category><![CDATA[government]]></category>

		<category><![CDATA[intelligence]]></category>

		<category><![CDATA[Interceptor]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[software development]]></category>

		<category><![CDATA[Sun]]></category>

		<category><![CDATA[Sun Microsystems]]></category>

		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/04/04/sun-highlights-digital-reasoning-systems/</guid>
		<description><![CDATA[Sun Microsystems has written and published a &#8220;Customer Snapshot&#8221; highlighting how we used Java and various development tools and methodologies to build our Interceptor Suite.
Click here to read the article.
The article highlights some of our pre-existing intelligence and government work.  I also hope to get another Sun snapshot covering our new commercial efforts published [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.sun.com/">Sun Microsystems</a> has written and published a &#8220;<a href="http://www.sun.com/customers/index.xml">Customer Snapshot</a>&#8221; highlighting <a href="http://www.sun.com/customers/software/digital_reasoning.xml">how we used Java and various development tools and methodologies to build our Interceptor Suite</a>.</p>
<p><a href="http://www.sun.com/customers/software/digital_reasoning.xml">Click here to read the article.</a></p>
<p>The article highlights some of our pre-existing intelligence and government work.  I also hope to get another Sun snapshot covering our <a href="http://blog.digitalreasoning.com/2008/01/31/web-30-and-its-discontents/">new</a> <a href="http://blog.digitalreasoning.com/2007/10/22/the-ediscovery-market/">commercial</a> <a href="http://blog.digitalreasoning.com/">efforts</a> published as soon as we are ready to dive into the details.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/04/04/sun-highlights-digital-reasoning-systems/feed/</wfw:commentRss>
		</item>
		<item>
		<title>GeoLocator 2.1 released</title>
		<link>http://blog.digitalreasoning.com/2008/03/27/geolocator-21-released/</link>
		<comments>http://blog.digitalreasoning.com/2008/03/27/geolocator-21-released/#comments</comments>
		<pubDate>Thu, 27 Mar 2008 20:40:52 +0000</pubDate>
		<dc:creator>bill.day</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[alignment]]></category>

		<category><![CDATA[coordinates]]></category>

		<category><![CDATA[extraction]]></category>

		<category><![CDATA[geocoordinates]]></category>

		<category><![CDATA[GeoLocator]]></category>

		<category><![CDATA[location]]></category>

		<category><![CDATA[process]]></category>

		<category><![CDATA[release]]></category>

		<category><![CDATA[text files]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/03/27/geolocator-21-released/</guid>
		<description><![CDATA[We released GeoLocator 2.1 and I wanted to blog a short bit about the release announcement here too, so that readers can link to the press release for details.
The announcement notes:
GeoLocator™ 2.1 can process over 14,000 text files every hour, with each text file averaging around seven kilobytes each. That is the equivalent of reading [...]]]></description>
			<content:encoded><![CDATA[<p>We released <a href="http://www.digitalreasoning.com/GeoLocator">GeoLocator 2.1</a> and I wanted to blog a short bit about the release announcement here too, so that readers can link to <a href="http://www.digitalreasoning.com/geolocator-21-released">the press release</a> for details.</p>
<p>The announcement notes:</p>
<blockquote><p>GeoLocator™ 2.1 can process over 14,000 text files every hour, with each text file averaging around seven kilobytes each. That is the equivalent of reading War and Peace, which is almost 1500 pages long, 33 times in an hour. In fact, if you were to print all of those text files on standard, letter-size paper and set them side-by-side you could cover almost 35 acres.</p></blockquote>
<p>That works out to reading &#8220;<a href="http://en.wikipedia.org/wiki/War_and_Peace">War and Peace</a>&#8221;, including extracting all of the locations in it and aligning them to geocoordinates, in less than 2 minutes.  That&#8217;s fast!</p>
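<p>The arithmetic behind the &#8220;less than 2 minutes&#8221; figure can be checked with a few lines of Python. The three constants are taken from the announcement quoted above; the variable names are just for illustration.</p>
<pre>
```python
FILES_PER_HOUR = 14_000   # from the announcement
AVG_FILE_KB = 7           # from the announcement
COPIES_PER_HOUR = 33      # "War and Peace" equivalents per hour

kb_per_hour = FILES_PER_HOUR * AVG_FILE_KB      # total text processed per hour
kb_per_copy = kb_per_hour / COPIES_PER_HOUR     # implied size of one copy
minutes_per_copy = 60 / COPIES_PER_HOUR         # time spent on one copy

print(f"{kb_per_hour:,} KB per hour")               # 98,000 KB per hour
print(f"{kb_per_copy:,.0f} KB per copy")            # 2,970 KB per copy
print(f"{minutes_per_copy:.1f} minutes per copy")   # 1.8 minutes per copy
```
</pre>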
<p><a href="http://digitalreasoning.com/GeoLocator">Learn more from the GeoLocator page here.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/03/27/geolocator-21-released/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Web 3.0 and its discontents</title>
		<link>http://blog.digitalreasoning.com/2008/01/31/web-30-and-its-discontents/</link>
		<comments>http://blog.digitalreasoning.com/2008/01/31/web-30-and-its-discontents/#comments</comments>
		<pubDate>Thu, 31 Jan 2008 15:59:14 +0000</pubDate>
		<dc:creator>tim.estes</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[Amazon]]></category>

		<category><![CDATA[BitTorrent]]></category>

		<category><![CDATA[combinatorics]]></category>

		<category><![CDATA[distributed]]></category>

		<category><![CDATA[entity extraction]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[graph]]></category>

		<category><![CDATA[grid]]></category>

		<category><![CDATA[hosted]]></category>

		<category><![CDATA[inference]]></category>

		<category><![CDATA[language]]></category>

		<category><![CDATA[lexicon]]></category>

		<category><![CDATA[location]]></category>

		<category><![CDATA[mapping]]></category>

		<category><![CDATA[metadata]]></category>

		<category><![CDATA[Metaweb]]></category>

		<category><![CDATA[model]]></category>

		<category><![CDATA[n-gram]]></category>

		<category><![CDATA[New York Times]]></category>

		<category><![CDATA[ontology]]></category>

		<category><![CDATA[patterns]]></category>

		<category><![CDATA[performance]]></category>

		<category><![CDATA[quantum]]></category>

		<category><![CDATA[Radar Networks]]></category>

		<category><![CDATA[reasoning]]></category>

		<category><![CDATA[resolution]]></category>

		<category><![CDATA[rules]]></category>

		<category><![CDATA[scale]]></category>

		<category><![CDATA[Semantic Web]]></category>

		<category><![CDATA[semantics]]></category>

		<category><![CDATA[silos]]></category>

		<category><![CDATA[social networking]]></category>

		<category><![CDATA[states]]></category>

		<category><![CDATA[Sun]]></category>

		<category><![CDATA[swarming]]></category>

		<category><![CDATA[Technology Review]]></category>

		<category><![CDATA[Web 3.0]]></category>

		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.digitalreasoning.com/2008/01/31/web-30-and-its-discontents/</guid>
		<description><![CDATA[For my first post to our new blog, I thought I would jump into an area that is of great and timely interest: The emerging &#8220;Semantic Web&#8221; and the technologies and solutions proposed to enable it.
There has been a lot of &#8220;Web 3.0&#8221; buzz in the last year.  See for example this MIT &#8220;Technology [...]]]></description>
			<content:encoded><![CDATA[<p>For my first post to <a href="http://blog.digitalreasoning.com">our new blog</a>, I thought I would jump into an area that is of great and timely interest: The emerging &#8220;Semantic Web&#8221; and the technologies and solutions proposed to enable it.</p>
<p>There has been a lot of &#8220;Web 3.0&#8221; buzz in the last year.  See for example <a href="http://www.technologyreview.com/Infotech/18306/">this MIT &#8220;Technology Review&#8221; article</a>, <a href="http://money.cnn.com/magazines/business2/business2_archive/2007/07/01/100117068/index.htm?postversion=2007070305">Business 2.0&#8217;s piece on Radar Networks</a>, <a href="http://www.nytimes.com/2007/03/09/technology/09data.html?ex=1331096400&amp;en=a87d4f61e6052888&amp;ei=5090&amp;partner=rssuserland&amp;emc=rss">the New York Times&#8217; Metaweb article</a>, and <a href="http://www.nytimes.com/2006/11/12/business/12web.html?ex=1320987600&amp;en=a54d6971614edc62&amp;ei=5090&amp;partner=rssuserland&amp;emc=rss">John Markoff&#8217;s original Web 3.0 article</a> from the NY Times in late 2006.  The reaction in the blogosphere has been equally interesting.  There appears to be a mix of believers and advocates on one side and, on the other, both <a href="http://ross.typepad.com/blog/2006/11/there_is_no_web.html">Web 2.0 players who are mad at the hype being stolen</a> and <a href="http://www.crn.com/software/202404824">those who are skeptics</a>.  If I were to put myself in a camp, I&#8217;d have to say I&#8217;m an &#8220;optimistic skeptic&#8221;.</p>
<p>I believe something like <a href="http://technology.timesonline.co.uk/tol/news/tech_and_web/the_web/article2726190.ece">this vision of Web 3.0</a> will play out, but it might take the market six or seven &#8220;attempters&#8221; before we find a Google of Web 3.0.</p>
<p>Whoever eventually gets it right must overcome at least three critical issues to make the Web 3.0 vision reality.  I&#8217;ll lay them out here.</p>
<p><strong>The W3C vision of the Semantic Web is a Dead End</strong></p>
<p><em>Semantics = Metadata + Reasoning</em> was and is a bad idea in the context of bridging human communication and machines.</p>
<p><a href="http://www.sigir.org/forum/2004D/sparck_jones_sigirforum_2004d.pdf">Karen Sparck Jones sums it up nicely here.</a>  As she explains rather eloquently, we really have to look at <em>which</em> Semantic Web we are talking about. Most parties in the space sell the value of the &#8220;high end&#8221; Semantic Web (inferential reasoning from advanced world models using a uniform lexicon with known derivation rules) but really only have technology for the &#8220;low end&#8221; Semantic Web (human tagging/machine entity extraction + crude resolution procedures for mapping/structuring like elements into classes). The truth is that building the sub-domain ontology hooks is just a way to replace business logic with a derivative of XML. The marginal gains in flexibility by this approach are a costly tradeoff for the complexity, bloat, and performance implications of pushing around such an overly expressive and poor representation of knowledge.</p>
<p>Most vendors in the space seem to think they can execute on a simple ontology around some patterns of activity, such as a limited ontology around people, places, and particular electronic modes of communication. This is really just reengineering the integration of Outlook/Exchange with social networking or the development of mapping rules from certain descriptive strings in Wikipedia along a priori detectable paths (such as X is a Location and is tied to this person&#8217;s entry). While there is little doubt that this does enrich the content (a la <a href="http://marklogic.com/">MarkLogic</a>&#8217;s enterprise offerings), it really isn&#8217;t the Semantic Web. This doesn&#8217;t mean that there is little value in that – there is – but it falls short of the promise of the Semantic Web.</p>
<p><strong>Scalability and Complexity</strong></p>
<p>No one has demonstrated deep semantic web infrastructure of any scale.  This doesn&#8217;t mean that such infrastructure is impossible, just that no one has shown it working at Web-scale.</p>
<p>There has been talk of powerful triple- and n-tuple stores and searches over billions of tuples in millisecond time.  Given what <a href="http://www.gigaspaces.com/">GigaSpaces</a> and other <a href="http://en.wikipedia.org/wiki/Tuple_space">tuple space</a> architectures have accomplished, this isn&#8217;t as big an issue as people think.  Of course, most of these numbers come from the funded companies themselves and are not necessarily indicative of real-world environments.</p>
<p>The problem isn&#8217;t the scale, it&#8217;s the ambiguity of the state-space.  Language and the agents that use it are utterly magical in how easily they deal with ambiguity that defies simple traversal of a limited graph.  This is one reason many bright people have speculated that the brain must have some quantum properties in how it makes inferences across such a large number of potential states.</p>
<p>To see how hard the combinatorics of this are, compare it with work in <a href="http://en.wikipedia.org/wiki/N-gram">n-gram</a> models.  People were still receiving PhDs for dissertations on 5-gram models as of a few years ago.  While some may argue that the fixed semantics of a rule-base/ontology don&#8217;t lead to anything in this kind of state space, I&#8217;d challenge them to deal with a large, dynamic lexicon and more than trivial top-level ontological classes.</p>
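<p>As a rough illustration of why the n-gram comparison is daunting, the Python sketch below counts the possible word sequences for each n. The 50,000-word vocabulary is an assumption picked for the example, not a figure from any particular system, and ngram_space is a hypothetical helper name.</p>
<pre>
```python
VOCAB_SIZE = 50_000  # assumed lexicon size, for illustration only


def ngram_space(vocab_size, n):
    """Number of distinct n-grams that can be formed from the vocabulary."""
    return vocab_size ** n


for n in range(1, 6):
    print(f"{n}-gram: {ngram_space(VOCAB_SIZE, n):,} possible sequences")
```
</pre>
<p>Each extra word of context multiplies the space by the size of the vocabulary, so a 5-gram model over this lexicon already has on the order of 10 to the 23rd power possible sequences, almost all of which will never be observed in any training corpus.</p>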
<p>The bottom line is that a schematic representation is likely only going to be able to handle traversal along very rigid paths that are mapped to very specific use cases. In other words, if you try to make business logic that is remotely as dynamic as human semantics in language, you will have a problem representing it correctly.  Worse yet, current representations are inherently unscalable.</p>
<p>There is promising work in this area looking at semantics as a superposition of states with runtime collapse into the appropriate sense (see <a href="http://www.maya.com/">Maya Design</a>).  Most systems that try to do this the &#8220;old fashioned way&#8221; take <a href="http://blogs.zdnet.com/BTL/?p=5541">around a second to process a sentence on modest hardware</a>.  And they are still limited to modest global semantics.</p>
<p>Here&#8217;s the key point:  Without proof of scale, it&#8217;s just a cute demo.</p>
<p>Magic is a lot easier in a controlled environment under unrealistically small data constraints. It&#8217;s just a looser way of <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a>. The US Intelligence Community has already been through this and is on the other side of investing a whole lot more money than <a href="http://en.wikipedia.org/wiki/Sand_Hill_Road">Sand Hill Road</a> in getting this to work.  Upon looking into this a while back, former <a href="http://www.usnews.com/usnews/news/articles/061103/3qahaseltine.htm">NSA Chief Scientist Eric Haseltine</a> was <a href="http://www.gcn.com/print/24_2/34877-1.html?topic=news">summarily unimpressed</a>.</p>
<p>Thankfully, some people are doing a good job of setting expectations honestly for what could be done right now (<a href="http://money.cnn.com/magazines/business2/business2_archive/2007/07/01/100117068/index.htm?postversion=2007070305">see Nova&#8217;s comments in the Business 2.0 article</a> mentioned at the beginning of this post). Those people have a far better prospect of creating value and loyalty in a future user base.</p>
<p><strong>We don&#8217;t need a Web 3.0 version of Google: Just Say NO to Semantic Silos!</strong></p>
<p>The Web 3.0 vision should not be realized through any one Web site.  Instead we should work to realize it using diverse software that sits all over: a distributed hybrid of the current grid architecture of the Amazons, Suns, and Googles and the BitTorrent model of distributed tracking and swarming of intelligence.</p>
<p>The alternative, implementing the current hosted model of data for semantics, would be very dangerous.  And yet that is <a href="http://www.wired.com/techbiz/people/news/2007/04/mag_schmidt_trans?currentPage=1">the espoused goal of the current market leader</a>.</p>
<p>What could be more Orwellian than having to go to an outside server to determine what the accurate sense of the word &#8220;tax&#8221; or &#8220;war&#8221; is?  That prospect should scare you.  Semantic silos have much larger consequences than our current lock-in to a given social network or hosting for a video.  Should such silos develop, it will be tantamount to auctioning off the truth.</p>
<p>The moral answer to this is to build the Semantic Web using a new type of software, not the same old centralized uber-service.  And that is the answer we are pursuing and look forward to discussing with you.</p>
<p>Please <a href="http://www.digitalreasoning.com/Contact">contact us with your thoughts</a> and <a href="http://blog.digitalreasoning.com/feed">subscribe to this blog</a> to join the discussion.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.digitalreasoning.com/2008/01/31/web-30-and-its-discontents/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
