"Any medium powerful enough to extend man's reach is powerful enough to topple his world. To get the medium's magic to work for one's aims rather than against them is to attain literacy."
-- Alan Kay, "Computer Software", Scientific American, September 1984
New UMW Data up
So I've spent the last several weeks on what is, in essence, a complete rewrite of the scripts that scrape data from UMWBlogs. It's all much more modular now, which I hope will make it easier to expand into new data sets. The first priority is grabbing feeds from a wider variety of sources. After that, it's on to tapping into linked open data sources.
The upshot of the rewrite is a chain of classes, each focused on bringing in data from a particular source. It's all still built on ARC, but now wrapped in classes that each attend to one kind of data. There's a generalized class for ingesting data into the data store (the language of 'ingesting' is borrowed from repository apps like Fedora). But in practice the work happens in the more specific classes. So, for example, there's a class focused on scraping the data out of the content of a post, which gets loaded into a big DOMDocument. It pulls out the <a> nodes and hands them to a class designed to deal with those. Similarly for <img> nodes, <embed> nodes, and the tags associated with the post. I'm using SimplePie as the starting point for parsing the feed, so all that falls into place pretty quickly.
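To make the node-scraping step concrete, here is a minimal sketch of the idea. It is written in Python with only the standard library, not in the PHP of the actual scripts (which use DOMDocument), and the names PostContentScanner and scan_post are made up for illustration. It walks the HTML of a post and collects the link or source of each <a>, <img>, and <embed> node.

```python
from html.parser import HTMLParser


class PostContentScanner(HTMLParser):
    """Hypothetical stand-in for the post-content class: collects the
    href/src values of <a>, <img>, and <embed> nodes."""

    # Which attribute carries the interesting value for each watched tag.
    WATCHED = {"a": "href", "img": "src", "embed": "src"}

    def __init__(self):
        super().__init__()
        self.found = {tag: [] for tag in self.WATCHED}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and also routes self-closing
        # tags such as <img ... /> through this method.
        attr_name = self.WATCHED.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            self.found[tag].append(value)


def scan_post(html):
    """Return a dict mapping each watched tag name to the values found."""
    scanner = PostContentScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.found
```

Each list in the result would then go to the class that knows how to deal with that kind of node.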
Following the hierarchy of the feed, the chain runs through sub-ingestors. So an instance of the class that deals with a SimplePie FeedItem instantiates the classes for <a> nodes, <img> nodes, etc. as needed. Then they all get dumped into the triplestore from the top down.
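The chain of sub-ingestors can be sketched roughly as follows. Again this is Python for illustration only, not the real PHP code: the class names, the ex: predicates, and the dict standing in for a SimplePie item are all assumptions, and a plain list plays the part of the ARC triplestore.

```python
class Ingestor:
    """Hypothetical base class: holds (subject, predicate, object)
    triples for one resource plus any child ingestors it spawned."""

    def __init__(self, subject):
        self.subject = subject
        self.triples = []
        self.children = []

    def add(self, predicate, obj):
        self.triples.append((self.subject, predicate, obj))

    def flush(self, store):
        # Top down: this node's triples go in first, then each child's.
        store.extend(self.triples)
        for child in self.children:
            child.flush(store)


class LinkIngestor(Ingestor):
    def __init__(self, href):
        super().__init__(href)
        self.add("rdf:type", "ex:Link")  # ex: predicates are made up


class ImageIngestor(Ingestor):
    def __init__(self, src):
        super().__init__(src)
        self.add("rdf:type", "ex:Image")


class FeedItemIngestor(Ingestor):
    """Takes a plain dict standing in for a parsed feed item and
    instantiates sub-ingestors only for the nodes that are present."""

    def __init__(self, item):
        super().__init__(item["permalink"])
        self.add("rdf:type", "ex:Post")
        self.add("dc:title", item["title"])
        for href in item.get("links", []):
            self.add("ex:linksTo", href)
            self.children.append(LinkIngestor(href))
        for src in item.get("images", []):
            self.add("ex:hasImage", src)
            self.children.append(ImageIngestor(src))


# Usage: build the chain for one item, then dump it into the store.
store = []
FeedItemIngestor({
    "permalink": "http://example.org/post/1",
    "title": "Hello",
    "links": ["http://example.org/a"],
    "images": ["http://example.org/i.png"],
}).flush(store)
```

A feed-level ingestor would sit one step higher and create a FeedItemIngestor for each item in the same way.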
I'm now closing in on 70,000 triples, with more added every day as I scrape more from the feeds.
Good starting points for exploring it are the lists of Directories, Galleries, and Exhibits at my other, newer, non-technical blog, Semantic UMW.
