Rather than post a summary of each paper or each session, I'm going to summarize the papers I saw under some broad topic categories.
The papers in this post describe research on teaching machines to do labor-intensive tasks that are currently performed by human experts. Most papers have multiple authors, although I've credited by name only the author(s) who presented the paper. Entire papers are available in the conference proceedings; let me know if you'd like to see one.
Topic Model Methods for Automatically Identifying Out-of-Scope Resources
Steven Bethard, et al.
Session 1
The researchers are studying ways to automate scope classification in topic-specific digital libraries. Human scope classification currently results in huge backlogs of resources waiting to be admitted or denied inclusion into a topic-specific digital library. The researchers worked with submissions to DLESE (Digital Library for Earth System Education) to attempt to train a computer to make those scope decisions. They encountered an immediate problem with available data. They were able to acquire 13,000 in-scope records because these items had been admitted into DLESE. Because DLESE does not keep extensive records about items not admitted to the digital library, the researchers only had information about 76 out-of-scope records. As a result, the researchers had to do some juggling of the numbers of records used for both the training data (76 out-of-scope records vs. 3,000 in-scope records) and the evaluation data (527 out-of-scope records vs. 231 in-scope records). Differences like these between training data and evaluation data are notorious for causing machine learning models to puke. Basically, the machine can't learn the correct way to behave because the information it's trained on does not at all resemble the real-world information it's supposed to evaluate. The researchers also pointed out that another problem with this work is the subtlety of topic distinctions. "Diamond formation" is a DLESE-relevant topic, while "diamond cutting" is not. It's very difficult to model such fine differences in an algorithm.
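To make the mismatch concrete, here is a minimal sketch that computes the share of out-of-scope records in each split, using only the counts quoted above. The function name is my own and nothing here comes from the paper's code.

```python
def out_of_scope_share(out_of_scope: int, in_scope: int) -> float:
    """Fraction of the records in a split that are out of scope."""
    return out_of_scope / (out_of_scope + in_scope)


# Counts as quoted in the summary above (not taken from the paper directly).
train_share = out_of_scope_share(out_of_scope=76, in_scope=3000)
eval_share = out_of_scope_share(out_of_scope=527, in_scope=231)

print(f"training:   {train_share:.1%} out of scope")
print(f"evaluation: {eval_share:.1%} out of scope")
```

A model that sees out-of-scope items only a few percent of the time during training has little reason to predict that class as often as the evaluation set requires.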
Automatically Generating High Quality Metadata by Analyzing the Document Code of Common File Types
Lars Fredrik Høimyr Edvardsen, et al.
Session 1
Learning Management Systems (Blackboard, Moodle, Axio, etc.) and even intranets have a lot of potential for resource-sharing and interlinking, but the requirements for automating this work are very high. Successful automatic interlinking requires high-quality metadata that, in turn, demands a lot from submitters in terms of time, knowledge of the metadata schema, etc. Existing algorithms that do this work don't function well without very high-quality metadata. The researchers want to improve Automatic Metadata Generation (AMG) algorithms so that they can take over work that currently relies almost wholly on humans. If one considers these algorithms in terms of the adage "garbage in, garbage out," how does the algorithm know what is or is not garbage? To answer this question, the researchers use a combined approach that looks at a document's content, the coding of that content, and the content's context within the document. For example, it's possible to identify title sections within a document based on the words' position on the page, markup in the document code (not just for HTML documents - Office documents use markup, too), format (bold, italics), font size, and so forth. The researchers report improvements using AMG algorithms based on this combined approach over the performance of existing AMG algorithms. There are some really interesting implications for this work. Documents are frequently submitted to digital libraries and institutional repositories as PDF files. However, PDFs contain very little embedded markup metadata, so it would be very difficult to automatically generate metadata for documents saved in this format. There are also implications for machine search of finding aids created and stored in Word documents.
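As a rough illustration of how position, markup, format, and font size could be combined to pick out a title, here is a small sketch. The `TextRun` record, the weights, and the style names are all assumptions of mine for illustration; this is not the researchers' algorithm.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical record for one line of text pulled from a document."""
    text: str
    font_size: float
    bold: bool
    line_index: int       # 0 is the first line of the document
    style_name: str = ""  # markup style, if the file format records one


def title_score(run: TextRun, body_font_size: float) -> float:
    """Combine the cues into one score; the weights are arbitrary."""
    score = 0.0
    if run.style_name.lower() in {"title", "heading 1"}:
        score += 3.0                               # explicit markup
    if run.font_size > body_font_size:
        score += run.font_size / body_font_size    # larger than body text
    if run.bold:
        score += 1.0                               # emphasis
    score += 1.0 / (1 + run.line_index)            # earlier lines score higher
    return score


def guess_title(runs: list[TextRun], body_font_size: float = 11.0) -> str:
    """Return the text of the highest-scoring run."""
    return max(runs, key=lambda r: title_score(r, body_font_size)).text


runs = [
    TextRun("Quarterly Report", 20.0, True, 0, "Title"),
    TextRun("Prepared by the records office", 11.0, False, 1),
    TextRun("Introduction", 14.0, True, 3, "Heading 2"),
]
print(guess_title(runs))
```

The sketch also shows why a file with little embedded markup is harder to handle: without style or font information, position is the only cue left.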
Automatically Characterizing Resource Quality for Educational Digital Libraries
Steven Bethard, Philipp Wetzer, et al.
Session 8
Winner, best paper
Manual accessioning to a digital library doesn't scale at all, and isn't portable since different collections have different policies - that is, different definitions of quality. The researchers analyzed existing data to attempt to identify and quantify major quality indicators that might be used to train machines in accessioning. The work proceeded in three major segments:
1) Understand expert quality judgments - this process resulted in 25 facets that made up a judgment of "quality."
2) Construct a testbed for computational methods - this process legitimized the facets by showing that experts use these measures of "quality" when assessing materials.
3) Make a machine learning model to train a computer in assessment - this process used a set of 1,000 digital library records that had been annotated by experts for the presence or absence of the quality indicators.
The researchers report that their first models trained on this set of annotated records have improved identification of quality measures by as much as 18% over the baseline measurements. The researchers note that their model is distinct from others because it begins with data, not a specific theory of information retrieval.
Improving Optical Character Recognition through Efficient Multiple System Alignment
William B. Lund, et al.
Session 8
Winner, best student paper
Mass digitization is now very common, but the format of an original document often affects the quality of the digital full-text item. For instance, typewritten pages, mimeographs, and microforms all contain poor-quality text that causes problems for OCR (Optical Character Recognition) engines. The researchers' work is very much about the idea that "if one is good, more must be better." They processed the same document, a typewritten memo written by General Eisenhower during the closing months of WWII, using three different commercially available OCR engines. They then lined up the same text from all three engines to compare their failure points. The memo had also been hand-transcribed by a student, so the researchers could compare the scanned text to transcribed text and know exactly what failed. Computationally aligning the output of three OCR engines (as compared to two - think of the difference between a number squared and the same number cubed) was a very difficult problem that took the researchers down a side path. In order for the work to have real-world implications, the alignment needed to take place in a reasonable amount of time using a reasonable amount of system resources. To cut the time and system resources spent on even one alignment problem, the researchers developed a heuristic for the A* algorithm, which treats alignment as a minimum-cost path problem. The efficient text alignment provided by the heuristic-enhanced algorithm let them combine the engines' output and significantly improve on the accuracy of a single OCR engine.
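To show what "alignment as a minimum-cost path problem" means, here is a minimal sketch of A* aligning just two strings. Each grid cell (i, j) means "i characters of the first string and j of the second have been consumed"; a diagonal step pairs two characters and a straight step inserts a gap. The unit costs and the length-difference heuristic are my own simplifications, not the heuristic from the paper, and the paper's harder case is three sequences.

```python
import heapq


def astar_align(a: str, b: str, gap: str = "-"):
    """Align two strings with A*; returns (cost, aligned_a, aligned_b)."""
    start, goal = (0, 0), (len(a), len(b))

    def h(i: int, j: int) -> int:
        # Lower bound on remaining cost: the leftover lengths can only be
        # evened out with gaps, and each gap costs 1.
        return abs((len(a) - i) - (len(b) - j))

    best = {start: 0}   # cheapest known cost to reach each cell
    parent = {}         # back-pointers for rebuilding the alignment
    frontier = [(h(0, 0), 0, start)]
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            break
        if cost > best[node]:
            continue    # stale queue entry
        i, j = node
        moves = []
        if i < len(a) and j < len(b):   # pair a[i] with b[j]
            moves.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1))
        if i < len(a):                  # a[i] against a gap
            moves.append(((i + 1, j), 1))
        if j < len(b):                  # b[j] against a gap
            moves.append(((i, j + 1), 1))
        for nxt, step in moves:
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost + h(*nxt), new_cost, nxt))

    # Walk the back-pointers from the goal to rebuild both aligned rows.
    row_a, row_b, node = [], [], goal
    while node != start:
        prev = parent[node]
        row_a.append(a[prev[0]] if node[0] > prev[0] else gap)
        row_b.append(b[prev[1]] if node[1] > prev[1] else gap)
        node = prev
    return best[goal], "".join(reversed(row_a)), "".join(reversed(row_b))


# Two made-up OCR readings of the same words.
cost, row_a, row_b = astar_align("General Eisenhower", "Genera1 Eisenhovver")
print(cost)
print(row_a)
print(row_b)
```

With three engines the grid becomes a cube, which is where the squared-versus-cubed comparison comes from and why a good heuristic matters.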
Automatic Quality Assessment of Content Created Collaboratively by Web Communities: A Case Study of Wikipedia
Marcos André Gonçalves, et al.
Session 11
Digital libraries, including Wikipedia, rely on human judgment, which is not scalable. Furthermore, manual reviews of articles are subject to human bias. A possible solution is to design automatic quality assessment tools. In order to do this, the researchers had to determine the aspects of a Wikipedia article that influence perceptions of quality. They found four broad groups of quality features that could then be broken down into individual indicators: Article representation (How long is the article? Is it a Featured Article, or some other category?), Review features (How often has the article been edited? Has it ever reached a stable point?), Network features (What links to the article? What does the article link to?), and Text features (Does the article have distinct sections? Does it have a table of contents?). The researchers report a significant improvement in quality prediction using their combined model over the current state-of-the-art method.
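For a sense of what individual indicators might look like in practice, here is a small sketch that pulls a few simple counts out of raw wiki markup. The feature names and regular expressions are my own guesses at plausible indicators, not the researchers' feature set, and a real system would feed many more of these into a trained model.

```python
import re


def text_features(wikitext: str) -> dict:
    """Count a few surface indicators in a page of wiki markup."""
    # Section headings look like "== History ==" on a line of their own.
    headings = re.findall(r"^={2,}.+?={2,}[ \t]*$", wikitext, flags=re.MULTILINE)
    # Internal links look like "[[Target]]" or "[[Target|label]]".
    links = re.findall(r"\[\[([^\]|#]+)", wikitext)
    return {
        "char_count": len(wikitext),
        "word_count": len(wikitext.split()),
        "section_count": len(headings),
        "outgoing_link_count": len(links),
    }


sample = (
    "Intro text with a [[Diamond]] link.\n"
    "== History ==\n"
    "More text and [[Carbon|carbon]].\n"
    "=== Early use ===\n"
    "Details."
)
print(text_features(sample))
```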