spaCy · Industrial-Strength Natural Language Processing in Python

💥 Out now: spaCy v3.0

Get things done

spaCy is designed to help you do real work: to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. Get started

Blazing fast

spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. If your application needs to process entire web dumps, spaCy is the library you want to be using. Facts & Figures

Awesome ecosystem

In the five years since its release, spaCy has become an industry standard with a huge ecosystem. Choose from a variety of plugins, integrate with your machine learning stack, and build custom components and workflows. Read more

Try spaCy

```python
# pip install -U spacy
# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
```

Features

- Support for 64+ languages
- 55 trained pipelines for 17 languages
- Multi-task learning with pretrained transformers like BERT
- Pretrained word vectors
- State-of-the-art speed
- Production-ready training system
- Linguistically-motivated tokenization
- Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
- Easily extensible with custom components and attributes (see the component sketch below)
- Support for custom models in PyTorch, TensorFlow and other frameworks
- Built-in visualizers for syntax and NER (see the displaCy sketch below)
- Easy model packaging, deployment and workflow management
- Robust, rigorously evaluated accuracy

New in v3.0: Transformer-based pipelines, new training system, project templates & more

spaCy v3.0 features all-new transformer-based pipelines that bring spaCy's accuracy right up to the current state of the art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. Training is now fully configurable and extensible, and you can define your own custom models using PyTorch, TensorFlow and other frameworks. The new spaCy projects system lets you describe whole end-to-end workflows in a single file, giving you an easy path from prototype to production, and making it easy to clone and adapt best-practice projects for your own use cases. See what's new
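For a quick taste of the transformer-based pipelines, the sketch below loads en_core_web_trf (the pipeline that appears in the benchmarks further down). It assumes the transformer extras and the trained pipeline are installed:

```python
import spacy

# Assumes: pip install "spacy[transformers]"
#          python -m spacy download en_core_web_trf
spacy.prefer_gpu()  # transformer pipelines run on CPU too, but a GPU is much faster

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])
```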
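The custom components mentioned in the feature list are plain functions registered with the pipeline. A minimal sketch, where the component name "entity_counter" and its behavior are illustrative rather than part of spaCy:

```python
import spacy
from spacy.language import Language

@Language.component("entity_counter")  # hypothetical component name
def entity_counter(doc):
    # A pipeline component receives a Doc and must return a Doc
    print(f"{len(doc.ents)} entities in: {doc.text[:40]}")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("entity_counter", last=True)  # run after the built-in NER
doc = nlp("Sebastian Thrun worked on self-driving cars at Google.")
```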
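Likewise, the built-in visualizers are exposed through displaCy. A minimal sketch:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars "
          "at Google in 2007, few people took him seriously.")

# Returns the entity highlighting as an HTML string; style="dep" renders the parse tree
html = displacy.render(doc, style="ent")
# displacy.serve(doc, style="dep") would start a local web server instead
```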
From the makers of spaCy

Prodigy: Radically efficient machine teaching

Prodigy is an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you're working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster. Try it out

Reproducible training for custom pipelines

spaCy v3.0 introduces a comprehensive and extensible system for configuring your training runs. Your configuration file describes every detail of your training run, with no hidden defaults, making it easy to rerun your experiments and track changes. You can use the quickstart widget or the init config command to get started (see the command sketch after this section), or clone a project template for an end-to-end workflow. Get started

The quickstart widget generates a partial starter config like the following:

```ini
# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null

[system]
gpu_allocator = null

[nlp]
lang = "en"
pipeline = []
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = null
```

🪐 Get started: pipelines/tagger_parser_ud

The easiest way to get started is to clone a project template and run it. For example, this template trains a part-of-speech tagger and dependency parser on a Universal Dependencies treebank:

$ python -m spacy project clone pipelines/tagger_parser_ud

End-to-end workflows from prototype to production

spaCy's new project system gives you a smooth path from prototype to production. It lets you keep track of all those data transformation, preprocessing and training steps, so you can make sure your project is always ready to hand over for automation. It features source asset download, command execution, checksum verification, and caching with a variety of backends and integrations. Try it out
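Putting the training section above into commands, a minimal sequence might look like this. The file names and the chosen pipeline components are illustrative:

$ python -m spacy init config config.cfg --lang en --pipeline tagger,parser
$ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy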
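A project wraps such commands in a project.yml file. A minimal sketch, with hypothetical paths and a single training command:

```yaml
title: "Example tagger/parser pipeline"

# Commands exposed via `spacy project run <name>`
commands:
  - name: "train"
    help: "Train the pipeline from the config"
    script:
      - "python -m spacy train configs/config.cfg --output training/"
    deps:
      - "configs/config.cfg"
    outputs:
      - "training/model-best"

# Named command sequences, also run via `spacy project run <workflow>`
workflows:
  all:
    - "train"
```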
Free interactive online course

In this free and interactive online course you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches. It includes 55 exercises featuring videos, slide decks, multiple-choice questions and interactive coding practice in the browser. Start the course

Benchmarks

spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy right up to the current state of the art. You can also use a CPU-optimized pipeline, which is less accurate but much cheaper to run. More results

Pipeline                     Parser   Tagger   NER
en_core_web_trf (spaCy v3)   95.1     97.8     89.8
en_core_web_lg (spaCy v3)    92.0     97.4     85.5
en_core_web_lg (spaCy v2)    91.9     97.2     85.5

Full pipeline accuracy on the OntoNotes 5.0 corpus (reported on the development set).

System                    OntoNotes   CoNLL '03
spaCy RoBERTa (2020)      89.8        91.6
Stanza (StanfordNLP)[1]   88.8        92.1
Flair[2]                  89.7        93.1

Named entity recognition accuracy on the OntoNotes 5.0 and CoNLL-2003 corpora. See NLP-progress for more results. Project template: benchmarks/ner_conll03 (see the commands below).

1. Qi et al. (2020)
2. Akbik et al. (2018)
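To reproduce the NER numbers, the cited project template can be cloned and run. A sketch; the "all" workflow name is an assumption about how the template is organized:

$ python -m spacy project clone benchmarks/ner_conll03
$ cd ner_conll03
$ python -m spacy project assets
$ python -m spacy project run all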