Archive for the 'tagging' Category

Questions for Open Calais?

So I’m interviewing the folks over at Thomson Reuters on Thursday for a piece that should be published at CJR. We’ll be talking about a relatively new service they’re providing freely. That service is called Open Calais, and it does some fancy stuff to plain text.

What fancy stuff? If you send it a news article, Open Calais will give you back the deets—and, way more importantly, it will make them obvious to your computer as well. That’s my description inspired by the Idiot’s guide, anyhow. (Yes, “deets” means “details” to cool kids, so get on board.)

<digression>Basically, the whole point of the semantic web is to make what’s obvious to you also obvious to your computer. For people who have always anthropomorphized their every laptop and piece of software—loved them when they just work, coaxed them when they slow to a crawl, and yelled at them when they grind to a halt—this can be a serious head-scratcher and a boring one at that. I blame Clippy the Microsoft Office Assistant. I also blame super-futuristic sci-fi movies that give us sugar-plum images of computers as pals—bright, sophisticated, and in possession of a knowledge like we epistemologically gifted humans have. Screw Threepio. Finally, I blame that jerk Alan Turing, who fed us the unintuitive half-truth that a computer could be conscious.

So it feels really silly to say so, again, but computers are ones and zeroes, NAND gates and NOR gates. They’re called computers because they do computation. They don’t do meaning as such. (Oh boy do I hope I get flamed in the comments by someone who knows his way around BsIV way better than I do.)</digression>

Open Calais will pick out people, companies, and places—these are called “named entities.” It will also identify facts and events in articles. Because Thomson Reuters is a finance-focused information provider, many of the facts and events it can recognize are about business relationships like subsidiaries and events like bankruptcies, acquisitions, and IPOs. The list goes on and on. Finally, Open Calais will identify very broad categories like politics, business, sports, or entertainment.

Open Calais will also associate these deets with further information on teh interwebs. So just for instance, if the web service identifies a person in your article, it will give you and your finicky, picky, and ultimately dumb computer a nice pointer to this computer-friendly version of wikipedia called dbpedia. Or if Calais identifies a movie, it will offer a pointer to linked data, which, as far as I can tell, is still a pretty vague notion. It promises to deliver more than it has to date, and that’s not a derogation.

But why freely—or essentially so in most cases? If you keep within liberal limits, you owe Thomson Reuters no money in exchange. Correct me if I’m wrong, but all they want, more or less, is that you offer them attribution and use their linked-data pointers (they call them URIs). Ken Ellis, chief scientist at Daylife, which may be best known to journalists through its association with Jeff Jarvis, took a stab at answering the “why free?” question:

Thomson Reuters has a large collection of subscription data services. They eventually want to link to these services. Widespread use of Calais increases the ease with which customers can access these subscription data services, ultimately increasing their ability to extract revenue from them.

That sounds to me like Thomson Reuters is interested in making its standards the standards. And that bargain really does sound reasonable. I guess.

But journalists are a wildly skeptical bunch. They’re skeptical—aloof even, way too cool for school and ideology. Journalists have a pretty acute and chronic deficiency in a little thing called trust. Maybe it’s justified, or maybe it’s not. Maybe it’s mostly justified, or maybe it’s mostly unjustified.

Either way, my gut’s telling me that journalists are going to need a fuller narrative from Thomson Reuters about why they should rely on another news and information company. When I talk to Tom and Krista, that’s what I’ll be largely interested in.

And you? What do you want to know about Open Calais? Leave your questions in the comments, and I’ll be sure to ask them.

Obstreperous Minnesota

Every once in a while—and maybe more often than I’d like to admit—I re-read Clay Shirky. Today, I re-read “Ontology Is Overrated.”

And today, I’m ready to disagree with it around the margins.

On fortune telling. Yes, Shirky’s correct that we will sometimes mis-predict the future, as when we infer that some text about Dresden is also about East Germany and will be forever. But, no, that doesn’t have to be a very strong reason for us not to have some lightweight ontology that infers something about a city and its country. We can just change the ontology when the Berlin Wall falls. It’s much easier than re-shelving books, after all; it’s just rewriting a little OWL.

On mind reading. Yes, Shirky’s correct that we will lose some signal—or increase entropy—when we mistake the degree to which users agree and mistakenly collapse categories. And, yes, it might be generally true about the world that we tend to “underestimate the loss from erasing difference of expression” and “overestimate loss from the lack of a thesaurus.” But it doesn’t have to be that way, and for two reasons.

First, why can’t we just get our estimations tuned? I’d think that the presumption would be that we could at least give a go and, otherwise, that the burden of demonstrating that we just cannot for some really deep reason falls on Shirky.

Second, we don’t actually need to collapse categories; we just need to build web services that recognize synonymy—and don’t shove them down our users’ throats. I take it to be a fact about the world that there are a non-trivial number of people in the world for whom ‘film’ and ‘movies’ and ‘cinema’ are just about perfect synonyms. At the risk of revealing some pretty embarrassing philistinism, I offer that I’m one of them, and I want my web service to let me know that I might care about this thing called ‘cinema’ when I show an interest in ‘film’ or ‘movies.’ I agree with Shirky that we can do this based solely on the fact that “tag overlap is in the system” while “the tag semantics are in the users” only. But why not also put the semantics in the machine? Ultimately, both are amenable to probabilistic logic.
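To make “tag overlap is in the system” a little more concrete, here’s a minimal Python sketch. It assumes a tiny made-up pile of tagged items and scores each pair of tags by how much the sets of items carrying them overlap (Jaccard similarity). The data, the function names, and the 0.5 threshold are all my own illustrative assumptions, not anything Shirky or a real service uses, and heavy overlap only suggests that two tags are related, not that they are strict synonyms.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tagged items: item id -> the tags users applied to it.
ITEMS = {
    "a": {"film", "movies", "cinema"},
    "b": {"film", "movies"},
    "c": {"film", "cinema", "movies"},
    "d": {"basketball", "sports"},
    "e": {"movies", "film", "sports"},
    "f": {"sports"},
}

def items_by_tag(items):
    """Invert the data: tag -> set of item ids carrying that tag."""
    index = defaultdict(set)
    for item_id, tags in items.items():
        for tag in tags:
            index[tag].add(item_id)
    return index

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlapping_tags(items, threshold=0.5):
    """Return (tag, tag, score) for tag pairs whose item sets overlap heavily."""
    index = items_by_tag(items)
    pairs = []
    for t1, t2 in combinations(sorted(index), 2):
        score = jaccard(index[t1], index[t2])
        if score >= threshold:
            pairs.append((t1, t2, round(score, 2)))
    return pairs

print(overlapping_tags(ITEMS))
```

With this toy data, ‘film’ and ‘movies’ overlap completely, and ‘cinema’ overlaps with each of them about half the time, which is the kind of signal a service could use to nudge me toward ‘cinema’ without collapsing the categories.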

Google showed it is the very best at serving us information when we know we care about something fuzzy and obscure—like “obstreperous minnesota.” I don’t think Shirky would dispute this, but it’s important to bear in mind that we also want our web services to serve us really well when we don’t know we care about something (see especially Daniel Tunkelang on HCIR (@dtunkelang)). That something might be fuzzy or specific, obscure or popular, subject to disagreement or perfectly unambiguous.

People and organizations tend to be unambiguous. No one says this fine fellow Clay Shirky (@cshirky) is actually Jay Rosen (@jayrosen_nyu). That would be such a strange statement that many people wouldn’t even understand it in order to declare it false. No one says the National Basketball Association means the National Football League to them. Or if someone were to say that J.P. Morgan is the same company as Morgan Stanley, we could correct him and explain how they’re similar but not identical.

Some facts about people and organizations can be unambiguous some of the time, too. Someone could argue that President Obama’s profession is sports, but we could correct her and explain how it’s actually politics, which maybe sometimes works metaphorically like sports. That doesn’t mean that Obama doesn’t like basketball or that no one will ever talk about him in the context of basketball. There may be more than a few contexts in which many people think it makes little sense to think of him as a politician, like when he’s playing a game of pick-up ball. But I think we can infer pretty well ex ante that it makes lots of sense to think of Obama as a politician when he’s giving a big televised speech, signing legislation, or meeting with foreign leaders. After all, what’s the likelihood that Silvio Berlusconi or Hu Jintao would let himself get schooled on the court? Context isn’t always that dependent.

That’s one small step for Google, one giant leap for text-audio convergence

So you’ve seen the cult classic youtube video “The Machine Is Us/ing Us.”

It’s mostly about the wonders of hypertext—that it is digital and therefore dynamic. You can remix it, link to it, etc.

But form and content can be separated, and XML was designed to improve on HTML for that reason. That way, the data can be exported, free of constraints.

Google’s now embarked on a mission to free the speech data locked up in youtube videos.

There’s no indication that it’ll publish transcripts, which is super too bad, but it’s indexing them and making them searchable. Soon enough every word spoken on youtube will be orders of magnitude more easily located, integrated, and re-integrated, pushed and pulled, aggregated and unbundled.

Consider a few simple innovations born of such information.

Tag clouds, for instance, of what the english-speaking world is saying every day. If you take such a snapshot every day for a year and animate them, then you get a twisting, turning, winding stream of our hopes and fears, charms and gripes.
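A rough sketch of what one day’s snapshot could look like in Python, assuming the transcripts were available as plain strings (which, per the above, Google isn’t promising). The stopword list, the sample sentences, and the font-size range are all hypothetical choices of mine.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "we", "our", "on", "for"}

def tag_cloud(transcripts, top_n=3, min_pt=10, max_pt=40):
    """Count words across transcripts and scale the top ones to font sizes."""
    words = []
    for text in transcripts:
        words += [w for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOPWORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return []
    high, low = counts[0][1], counts[-1][1]
    span = max(high - low, 1)  # avoid dividing by zero when all counts match
    return [(word, min_pt + (n - low) * (max_pt - min_pt) // span)
            for word, n in counts]

# Hypothetical transcripts for a single day.
DAY = [
    "Hope and change, hope for the economy",
    "The economy is our fear; hope is our charm",
]
print(tag_cloud(DAY))
```

Run that once per day, keep the results, and the animation is just a matter of drawing each day’s list in sequence.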

Clusters, for another, of videos with similar topics or sentiments. Memetracking could move conversations away from the email-like reply system in youtube to being something more organic and less narrowly linear.

Advertisements, for a last, of a contextual nature, tailored to fit the video without having to rely on human-added metadata.

Wait, announcements, for a very last, of an automated kind. If you create a persistent search of ‘obama pig,’ grab the rss feed, and push it into twitter, then you’re informing the world when your fave presidential candidate says something funny.

The Great Unbundling: A Reprise

This piece by Nick Carr, the author of the recently popular “Is Google Making Us Stupid?” in the Atlantic, is fantastic.

My summary: A print newspaper or magazine provides an array of content in one bundle. People buy the bundle, and advertisers pay to catch readers’ eyes as they thumb through the pages. But when a publication moves online, the bundle falls apart, and what’s left are just the stories.

This may no longer be revolutionary thought to anyone who knows that google is their new homepage, from which people enter their site laterally through searches. But that doesn’t mean it’s not the new gospel for digital content.

There’s only one problem with Carr’s argument, though. Because it focuses on the economics of production, I don’t think its observation of unbundling goes far enough. Looked at another way—from the economics of consumption and attention—not even stories are left. In actuality, there are just keywords entered into google searches. That’s increasingly how people find content, and in an age of abundance of content, finding it is what matters.

That’s where our under-wraps project comes into play. We formalize the notion of people finding content through simple abstractions of it. Fundamentally, from the user’s perspective, the value proposition lies with the keywords, or the persons of interest, not the piece of content, which is now largely commodified.

That’s why we think it’s a pretty big idea to shift the information architecture of the news away from focusing on documents and headlines and toward focusing on the newsmakers and tags. (What’s a newsmaker? A person, corporation, government body, etc. What’s a tag? A topic, a location, a brand, etc.)

The kicker is that, once content is distilled into a simpler information architecture like ours, we can do much more exciting things with it. We can extract much more interesting information from it, make much more valuable conclusions about it, and ultimately build a much more naturally social platform.

People will no longer have to manage their intake of news. Our web application will filter the flow of information based on their interests and the interests of their friends and trusted experts, allowing them to allocate their scarce attention most efficiently.

It comes down to this: Aggregating documents gets you something like Digg or Google News—great for attracting passive users who want to be spoon fed what’s important. But few users show up at Digg with a predetermined interest, and that predetermined interest is how google monetized search ads over display ads to bring yahoo to its knees. Aggregating documents makes sense in a document-scarce world; aggregating the metadata of those documents makes sense in an attention-scarce world. When it comes to the news, newsmakers and tags comprise the crucially relevant metadata, which can be rendered in a rich, intuitive visualization.

Which isn’t to say that passive users who crave spoon-fed documents aren’t valuable. We can monetize those users too—by aggregating the interests of our active users and reverse-mapping them, so to speak, back onto a massive set of documents in order to find the most popular ones.
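Here’s one hedged way to picture that reverse-mapping, as a small Python sketch. It assumes each active user’s interests and each document’s newsmakers and tags are already known as plain sets; every name and data point below is hypothetical, and scoring a document by summing interest counts is just one simple choice, not a description of our actual system.

```python
from collections import Counter

# Hypothetical active users and the newsmakers/tags they follow.
USER_INTERESTS = {
    "ann": {"obama", "economy"},
    "bob": {"obama", "nba"},
    "cat": {"economy"},
}

# Hypothetical documents and the newsmakers/tags found in them.
DOCUMENTS = {
    "doc1": {"obama", "economy"},
    "doc2": {"nba"},
    "doc3": {"obama"},
}

def popular_documents(user_interests, documents):
    """Aggregate interests across users, then score each document by them."""
    weights = Counter()
    for interests in user_interests.values():
        weights.update(interests)  # how many users care about each tag
    scores = {doc_id: sum(weights[tag] for tag in tags)
              for doc_id, tags in documents.items()}
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(popular_documents(USER_INTERESTS, DOCUMENTS))
```

The ranked list that comes out is exactly the kind of spoon-feedable front page a passive user wants, built without anyone voting on documents directly.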

Whither Tag Clouds?

A few weeks ago, one could do relatively little clicking around the interwebs and notice the tear of pretty tag clouds powered by wordle. Bloggers of all stripes posted a wordle of their blog. Some, like Jeff Jarvis, mused about how the visualizations represent “another way to see hot topics and another path to them.”

For as long as tag clouds have been a feature of the web, they’ve also been an object of futurist optimism, kindling images of Edward Tufte and notions that if someone could just unlock all those dense far-flung pages of information, just present them correctly, illumed, people everywhere would nod and understand. Their eyes would grow bright, and they would smile at the sheer sense it all makes. The headiness of a folksonomy is sweet for an information junkie.

It’s in that vein that ReadWriteWeb mythologizes the tag cloud as “buffalo on the pre-Columbian plains of North America.” A reader willing to cock his head and squint hard enough at the image of tag clouds “roaming the social web” as “huge, thundering herds of keywords of all shades and sizes” realizes that Rob Cottingham would have us believe that tag clouds were graceful and defenseless beasts—and also now on the verge of extinction. He’s more or less correct.

I used to mythologize the tag cloud, but let’s be honest. Tag clouds were never actually useful. You could never drag and drop one word in a tag cloud onto another to get the intersection or union of pages with those two tags. You could never really use a tag cloud to subscribe to RSS feeds of only the posts with a given set of tags.

A tag also never told you whether J.P. Morgan was a person or a bank. A tag cloud on a blog was never dynamic, never interactive. The tag cloud on one person’s blog never talked to the tag cloud on anyone else’s. I could never click on one tag and watch the cloud reform and show me only related tags, all re-sized and -colored to indicate their frequency or importance only in the part of the corpus in which the tag I clicked on is relevant.
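None of this is exotic, for what it’s worth. Below is a minimal Python sketch, over a made-up set of tagged posts, of the two operations I keep wishing for: the intersection or union of posts carrying two tags, and a cloud that re-forms around the tag I clicked. The posts, tags, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical posts: post id -> tags.
POSTS = {
    "p1": {"jpmorgan", "banking", "bailout"},
    "p2": {"jpmorgan", "banking"},
    "p3": {"obama", "bailout"},
    "p4": {"obama", "basketball"},
}

def posts_with(posts, tag):
    """Ids of the posts carrying a given tag."""
    return {pid for pid, tags in posts.items() if tag in tags}

def refocused_cloud(posts, clicked):
    """Recount the other tags using only the posts that carry the clicked tag."""
    counts = Counter()
    for pid in posts_with(posts, clicked):
        counts.update(posts[pid] - {clicked})
    return counts

# "Dragging" one tag onto another: set intersection and set union.
both = posts_with(POSTS, "jpmorgan") & posts_with(POSTS, "bailout")
either = posts_with(POSTS, "jpmorgan") | posts_with(POSTS, "bailout")

print(sorted(both), sorted(either))
print(refocused_cloud(POSTS, "jpmorgan"))
```

The counts that come back from the refocused cloud are what you would use to re-size and re-color the remaining tags.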

But there are also cool-headed thoughts to have here. If tag clouds don’t work, what will? What is the best way to navigate around those groups of relatively many words called articles or posts? In the comments to Jarvis’s post, I asked a set of questions:

How will we know when we meet a visualization of the news that’s actually really useful? Can some visualization of the news lay not just another path to the “hot topic” but a better one? Or will headlines make a successful transition from the analog past of news to its digital future as the standard way we find what we want to read?

I believe the gut-level interest in tag clouds comes in part from the sense that headlines aren’t the best way to navigate around groups of articles much bigger than the number in a newspaper. There’s a real pain point there: scanning headlines doesn’t scale. Abstracting away from them, however, and focusing on topics and newsmakers in order to find what’s best to read or watch just might work.

I think there’s a very substantial market for smarter tag clouds. They might look very different from what we’ve seen, but they will let us see at a glance lots of information and help us get to the best stuff faster. After all, the articles we want to read, the videos we want to watch, and the conversations we want to have around them are what’s actually important.

Twine Beta

I read an awful lot of RSS feeds. Not a record-shattering amount, but enough that it’s hard for me to keep them all organized in Google Reader.

Despite my efforts to keep them in “folders” of different kinds—some organized by topic, others by how frequently I’d like to read them—I lose track of feeds for days or weeks on end sometimes. Then, when I do get a firm grip on all my feeds, I find that I’ve spent several hours of time I could’ve spent actually reading. That maintenance is getting to be a pain.

I’m hopeful that Twine can help me add a permanent smarter layer of organization to all my feeds. That smarter layer could be sensitive to my evolving reading habits. I’m also hoping that Twine can help me group topically similar posts across scattered blogs on the fly.

So early access to the beta would be awesome!


There are more than a few ways to remind yourself to read something or other later.

Browsers have bookmarks. Or you can save something to delicious, perhaps tagged “toread,” like very many people do. You can use this awesome firefox plugin called “Read It Later.”

But I like to do my reading inside Google Reader; others like their reading inside their fave reader.

So what am I to do? My first thought was Yahoo Pipes. It’s a well-known secret that Pipes makes screen-scraping around partial feeds as easy as pie. So I thought I could maybe throw together a mashup of delicious and pipes to get something going.

My idea was to save my to-be-read-later pages to delicious with a common tag—the common “toread” maybe. I could then have pipes fetch from delicious the feed based on that tag. The main urls for each delicious post point to the original webpage, and so, with the loop operator, I could locate the feed associated with each of the urls in the delicious feed. Original urls in hand, I was thinking I could have pipes auto-discover the associated feeds and then re-use those urls to locate the post within the feed corresponding to the page to be read later.
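For the curious, the two middle steps of that idea (auto-discover a page’s feed, then find the matching post inside it) can be sketched outside of Pipes in plain Python with only the standard library. This is not how Pipes does it; it’s a minimal illustration that assumes the page advertises its feed with a standard link tag in its head, that the feed is RSS 2.0, and that the item’s link matches the saved url exactly. The sample HTML, the sample feed, and the example.com urls are all made up, and the fetching from delicious is left out entirely.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        kind = (a.get("type") or "").lower()
        if "alternate" in rel and kind in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])

def discover_feeds(page_html):
    """Return the feed urls a page advertises in its markup."""
    finder = FeedLinkFinder()
    finder.feed(page_html)
    return finder.feeds

def find_item(rss_xml, page_url):
    """Return the title of the RSS 2.0 item whose <link> equals page_url."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        if (item.findtext("link") or "").strip() == page_url:
            return item.findtext("title")
    return None  # the page has no analogue in this feed

# Made-up stand-ins for a saved page and the feed it points to.
PAGE_HTML = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="http://example.com/feed"></head><body>Post</body></html>'
)
RSS_XML = (
    '<rss version="2.0"><channel><title>Example</title>'
    '<item><title>A post to read later</title>'
    '<link>http://example.com/post-1</link></item>'
    '</channel></rss>'
)

print(discover_feeds(PAGE_HTML))
print(find_item(RSS_XML, "http://example.com/post-1"))
```

The exact-match assumption is the fragile part: redirects, tracking parameters, and feeds that only carry the latest handful of posts would all make the lookup come back empty.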

Well, I don’t think it can be done so easily. (Please! Someone prove me wrong!)

Meantime, I’ll just use my handy grease monkey plug-in that lets me “preview” posts inside the google reader wrapper—so that I don’t have to hop from tab to tab like a lost frog.

Meantime, someone should really put together this app. Of course, it would really only work simply with pages that have rss analogues in a feed. But if, through Herculean effort, you found some practicable way to inform me that a given page doesn’t, but you could parse out the junk and serve me only the text, you’d be a real hero. Otherwise, just tell me that the page I’m trying to read later doesn’t have an rss analogue, give me an error message, and I’ll move on…assured in the knowledge that it will soon enough.

Josh Young's Facebook profile
