Unsupervised Neural Computing for Information Extraction

Sunday, February 14, 2016
Samuel Goree, Oberlin College, Oberlin, OH
Background: Text analysis and information extraction for document clustering often depend on human supervision, such as defining keywords for each cluster in advance. The manual steps in these traditional methods can be time-consuming and error-prone. Our poster presents the results of investigating alternative information extraction techniques based on unsupervised artificial neural networks.
Methods: We investigated key issues in using AI technologies as alternatives to traditional methods of text analysis and information extraction. We used two unsupervised artificial neural network techniques, Adaptive Resonance Theory and Kohonen Self-Organizing Maps, to cluster documents based on word frequency data, first from a small corpus of article abstracts and then from a larger corpus of letters and town hall meeting records on fracking in New York State. We also investigated spellchecking as a way to improve consistency, but concluded that it would take a more sophisticated system than ours to do more good than harm.
Results: Our research yielded data on the possibilities and limitations of AI technologies for information extraction. We found that Kohonen Self-Organizing Maps may be the more useful technique for someone who is interested in the content of a corpus but unsure about aspects of it. We compared the clusters each technique found, and found that Kohonen maps generate the clusters most similar to our human-generated descriptions of the files. For example, Kohonen maps placed article abstracts about conditional random fields and ontology-based information extraction into clusters labeled "condition, random, fields" and "extraction, evaluation, ontology," respectively. This illustrates how patterns in the text can correlate with what a human would find reasonable.
Conclusions: We posit that the Kohonen model is relatively easy to use for analyzing text data and is effective at identifying potentially interesting patterns in data that a human expert can then explore further. Our poster will give examples of text data, typical output data and summary data that shows similarities and differences in the results from analyses by humans and by machines.
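To make the clustering step concrete, here is a minimal sketch of a Kohonen Self-Organizing Map applied to word frequency vectors. It is an illustration only, not the implementation used for the poster: the four toy documents, the function names (word_frequencies, train_som, label_unit), the one-dimensional map layout, and all parameter values are assumptions chosen for brevity. Each cluster is labeled with the words that carry the largest weights in its map unit, which is one plausible way to arrive at labels like those quoted above.

```python
import math
import random
from collections import Counter

def word_frequencies(docs):
    """Turn each document into a unit-length word-count vector."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set(word for c in counts for word in c))
    vectors = []
    for c in counts:
        vec = [float(c[word]) for word in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_unit(weights, vec):
    """Index of the map unit whose weights are closest to vec."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], vec))

def train_som(vectors, n_units=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D Kohonen map (units arranged on a line)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 toward 0
        lr = lr0 * frac                      # learning rate
        radius = max(radius0 * frac, 1e-3)   # neighbourhood width
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass
            winner = best_unit(weights, vec)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood: units near the winner move more.
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (vec[k] - w[k])
    return weights

def label_unit(weights, vocab, unit, top=3):
    """Label a unit with its highest-weighted words."""
    ranked = sorted(range(len(vocab)),
                    key=lambda k: weights[unit][k], reverse=True)
    return [vocab[k] for k in ranked[:top]]

# Hypothetical toy corpus, not the poster's data.
docs = [
    "conditional random fields label sequences with conditional random fields",
    "random fields model conditional label sequences",
    "ontology based extraction evaluation uses ontology concepts",
    "evaluation of ontology extraction with ontology concepts",
]

vocab, vectors = word_frequencies(docs)
weights = train_som(vectors)
assignments = [best_unit(weights, vec) for vec in vectors]

for doc, unit in zip(docs, assignments):
    print(unit, label_unit(weights, vocab, unit), "<-", doc[:40])
```

The cluster a document lands in depends on the random initial weights, so results on a real corpus should be checked across several seeds and map sizes before drawing conclusions.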