<p>[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. <a href="https://jamesturk.github.io/scrapeghost/" target="_blank">Scrapeghost</a> relies on OpenAI&#8217;s GPT API to parse a web page&#8217;s content, pull out and classify any salient bits, and format the result in a useful way.</p><p>What makes Scrapeghost different is how the extracted data gets organized: when instantiating a scraper, one defines the data one wishes to extract. For example:</p><pre class="brush: python; gutter: false; title: ; notranslate" title="">from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)</pre><p>The kicker is that this format is entirely up to you! The GPT models are <a href="https://hackaday.com/2022/05/18/natural-language-ai-in-your-next-project-its-easier-than-you-think/">very, very good at processing natural language</a>, and <code>scrapeghost</code> uses GPT to process the scraped page, find (using the example above) whatever looks like a name, district, party, photo, and office address, and format it all exactly as requested.</p><p>It&#8217;s an experimental tool and you&#8217;ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There&#8217;s <a href="https://jamesturk.github.io/scrapeghost/tutorial/" target="_blank">a tutorial</a> and even a command-line interface, so check it out.</p>
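<p>For the curious, the general technique is easy to sketch in plain Python. The snippet below is only an illustration of the idea, not Scrapeghost&#8217;s actual code, and the helper names (<code>build_prompt</code>, <code>parse_reply</code>) are made up for the example: the schema you define gets embedded in a prompt sent to the model, and the model&#8217;s reply comes back as JSON that parses straight into a Python dictionary.</p>

```python
import json

def build_prompt(schema, html):
    """Hypothetical helper: ask a language model to return JSON matching
    the given schema. An illustration only, not Scrapeghost's real prompt."""
    return (
        "Extract data from the HTML below. Respond with JSON matching "
        "this schema and nothing else:\n"
        + json.dumps(schema)
        + "\n\nHTML:\n"
        + html
    )

def parse_reply(reply):
    """Parse the model's reply; a real tool would also validate the result
    against the schema and retry on malformed output."""
    return json.loads(reply)

schema = {"name": "string", "party": "string"}
prompt = build_prompt(schema, '<div class="rep">Jane Doe (Independent)</div>')

# A well-behaved model reply round-trips straight through json.loads:
reply = '{"name": "Jane Doe", "party": "Independent"}'
record = parse_reply(reply)
print(record["party"])  # Independent
```

<p>Scrapeghost layers the practical niceties on top of this basic loop: token counting, HTML preprocessing, and validation of the returned data.</p>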