Qgelm

Tired Of Web Scraping? Make The AI Do It

Original article

Backup

<html> <p>[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. <a href="https://jamesturk.github.io/scrapeghost/" target="_blank">Scrapeghost</a> relies on OpenAI&#8217;s GPT API to parse a web page&#8217;s content, pull out and classify any salient bits, and format them in a useful way.</p><p>What makes Scrapeghost different is how data gets organized. When instantiating <code>scrapeghost</code>, one defines the data one wishes to extract. For example:</p><pre class="brush: bash; gutter: false; title: ; notranslate" title="">from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)</pre><p>The kicker is that this format is entirely up to you! The GPT models are <a href="https://hackaday.com/2022/05/18/natural-language-ai-in-your-next-project-its-easier-than-you-think/">very, very good at processing natural language</a>, and

<code>scrapeghost</code> uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address and format it exactly as requested.</p><p>It&#8217;s an experimental tool and you&#8217;ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There&#8217;s <a href="https://jamesturk.github.io/scrapeghost/tutorial/" target="_blank">a tutorial</a> and even a command-line interface, so check it out.</p> </html>
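
Since the output comes from a language model, it can be worth checking that a returned record really has the shape the schema asked for. The following is a minimal sketch of such a check using only the Python standard library. It is not part of scrapeghost: the helper name matches_schema is hypothetical, the sample record is made-up placeholder data, and treating "string" and "url" as checkable types is an assumption made for illustration.

```python
from urllib.parse import urlparse

# Same schema as in the article's example.
SCHEMA = {
    "name": "string",
    "url": "url",
    "district": "string",
    "party": "string",
    "photo_url": "url",
    "offices": [{"name": "string", "address": "string", "phone": "string"}],
}


def matches_schema(value, schema):
    """Return True if `value` has the shape described by `schema`.

    Hypothetical helper for illustration; not part of scrapeghost.
    """
    if isinstance(schema, dict):
        # The record must be a dict with exactly the schema's keys.
        return (
            isinstance(value, dict)
            and value.keys() == schema.keys()
            and all(matches_schema(value[k], schema[k]) for k in schema)
        )
    if isinstance(schema, list):
        # A one-element list is read as "a list of items shaped like this".
        return isinstance(value, list) and all(
            matches_schema(item, schema[0]) for item in value
        )
    if schema == "url":
        # Assumption: a "url" is an absolute http(s) URL.
        if not isinstance(value, str):
            return False
        parts = urlparse(value)
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    if schema == "string":
        return isinstance(value, str)
    return False  # unknown type hint


# Made-up placeholder record, standing in for what a scraper might return.
record = {
    "name": "Jane Doe",
    "url": "https://example.com/members/jane-doe",
    "district": "12",
    "party": "Independent",
    "photo_url": "https://example.com/photos/jane-doe.jpg",
    "offices": [
        {"name": "Capitol Office", "address": "1 Example St", "phone": "555-0100"}
    ],
}

print(matches_schema(record, SCHEMA))
```

A check like this only confirms structure. It says nothing about whether the extracted values are actually correct for the page that was scraped.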
