GiNZA Version 4.0: Improving Syntactic Structure Analysis Through Japanese Bunsetsu-Phrase Extraction API Integration
We are now one step closer to more seamless Japanese language analysis. GiNZA version 4.0 was released on August 16, 2020. It is based on spaCy version 2.3, the first release of that open-source library to officially support Japanese. This latest version of GiNZA offers vast improvements in terms of functionality and performance. Let’s explore how its Japanese bunsetsu-phrase extraction APIs make Japanese language analysis easier than ever before.
Introduction to Japanese Syntactic Analysis
Natural language processing (NLP) encompasses a series of techniques used to systematically analyze the words that people use regularly in their day-to-day lives. This series of “parsing processes” is vital for successfully analyzing the nuances of any language.
In a written Japanese sentence, for example, the first step is to separate words. We then group them into bunsetsu-phrases. Lastly, we interpret the dependency relations.
Fig. 1 shows an example. The agglutinative input sentence “妻が買ったかばんを友達に見せたい。” is divided into the appropriate word sequence by the morphological analyzer. The parser then analyzes the bunsetsu-phrases, indicated by the rounded rectangles, and the word dependencies, indicated by the arrows.
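The three steps can be written down as plain data. The following is a hand-annotated sketch for the example sentence; the segmentation and the dependencies are entered by hand for illustration and are assumptions of this sketch, not output produced by GiNZA.

```python
# Hand-annotated illustration of the three analysis steps.
# Nothing here is computed by a parser; all values are typed in by hand.
sentence = "妻が買ったかばんを友達に見せたい。"

# Step 1: separate the sentence into words.
words = ["妻", "が", "買っ", "た", "かばん", "を", "友達", "に", "見せ", "たい", "。"]

# Step 2: group the words into bunsetsu-phrases, given as (start, end) word offsets.
bunsetsu_ranges = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 11)]
bunsetsu = ["".join(words[start:end]) for start, end in bunsetsu_ranges]

# Step 3: record which bunsetsu-phrase each one depends on (None marks the root).
heads = {0: 1, 1: 2, 2: 4, 3: 4, 4: None}

for index, phrase in enumerate(bunsetsu):
    head = heads[index]
    target = "ROOT" if head is None else bunsetsu[head]
    print(f"{phrase} -> {target}")
```

Each printed line pairs a bunsetsu-phrase with the phrase it depends on, which corresponds to the arrows between phrases in a figure of this kind.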
Tokenization is typically performed first, followed by recognition of basic phrases (e.g., noun phrases and verb phrases) and interpretation of dependencies between phrases. Since proper nouns such as names are often expressed as compound words, they must be recognized by grouping their words together and assigning them a classification.
In the case of English sentences, spaces almost always separate words. Even so, compound words such as “New York” must be grouped together, and contractions like “don’t” must be interpreted separately as “do” and “not.”
What You Can Do With GiNZA
GiNZA leverages the multilingual NLP framework spaCy to implement a series of processing flows (pipelines) needed for Japanese language analysis. Since spaCy is primarily designed and developed for Western languages, Japanese morphological analysis is not built-in by default. When you install spaCy’s Japanese model, the morphological analyzer SudachiPy is also installed. GiNZA uses SudachiPy at the top of its pipeline for tokenization (word separation) and part-of-speech assignment.
In this pipeline’s next stage, spaCy’s standard convolutional neural network (CNN) dependency parser determines the dependencies between tokens. It then outputs a directed graph of the relations between tokens, such as subject (nsubj) and object (obj, iobj), with the central token of the sentence as the root. In NLP terms, this directed graph is known as a parse tree.
For the next stage of the pipeline, the CNN-based named entity recognizer identifies named entity spans from the input parse tree and assigns each a classification. So far, GiNZA’s process is almost identical to spaCy’s Japanese model. They diverge at the end of GiNZA’s pipeline, where we add two proprietary components.
The first component, Compound Splitter, can be configured to output two levels of token subdivision. The second component, Bunsetu Recognizer, recognizes phrases and their headword segments from the parse tree with high accuracy by detecting special dependency labels that mark the head tokens of the phrases. We’ll delve more into the API used to obtain the results of this phrase recognition process in the next section.
GiNZA’s Phrase Extraction APIs
GiNZA v4 adds several APIs to its package that use phrases as processing units (available via import ginza). The list of phrase extraction APIs can be found here. We’ll cover the most common ones below.
ginza.bunsetu_spans(doc_or_span)
Returns a list of the bunsetsu-phrase spans contained in the Doc object or sentence span given as the argument.
ginza.bunsetu_phrase_spans(doc_or_span)
Returns a list of the spans covering the head phrase of each bunsetsu-phrase.
The label field of each Span object returned by bunsetu_phrase_spans() holds the classification of the phrase.
ginza.bunsetu(token)
Returns the bunsetsu-phrase that the token given as the argument belongs to. By default, the result is a string consisting of the orth_ fields of the tokens in the bunsetsu-phrase, concatenated with “+”.
To combine the lemma_ fields of the tokens in the bunsetsu-phrase instead, pass lemma_ as the second argument, as in bunsetu(token, lemma_).
To get the list of tokens that make up a bunsetsu-phrase, pass a join_func argument that returns the token sequence unchanged.
ginza.phrase(token)
Returns the head phrase of the bunsetsu-phrase that the token given as the argument belongs to. Note that the arguments described for ginza.bunsetu() apply here as well.
ginza.sub_phrases(token, phrase_func)
Returns the dependent phrases (sub-phrases) of the bunsetsu-phrase that the given token belongs to. You can pass ginza.bunsetu or ginza.phrase as phrase_func.
Design Principles of GiNZA’s Phrase Extraction APIs
To enable concise and intuitive coding, GiNZA’s phrase extraction APIs combine the functional programming concept of currying with the object-oriented programming concept of polymorphism. Assume that the API functions used below, such as bunsetu and lemma_, have been imported from the ginza package.
For bunsetu(), the following calls yield the exact same result:
- bunsetu(token, lemma_)
- bunsetu(lemma_)(token)
The latter is a curried variation of the former: bunsetu(lemma_) generates and returns a function in which only the token argument is still missing. In other words, it is a function that returns a function. The token must then be passed to the returned function to actually obtain the result.
Currying may seem cumbersome, but it lets you pass a partially configured function as an argument to another function. In a call such as sub_phrases(token, bunsetu(lemma_)), for example, the second argument is a curried version of bunsetu() that defines how each sub-phrase is obtained.
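As a minimal sketch of why this style is convenient, the example below passes a curried function to a higher-order function. Phrase, render() and sub_phrases_sketch() are hypothetical stand-ins written for this illustration, and the dependency labels are made up; none of this is GiNZA’s implementation, and the real sub_phrases() takes a token rather than a phrase object.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Phrase:
    """Hypothetical stand-in for a bunsetsu-phrase in a dependency tree."""
    orths: List[str]
    lemmas: List[str]
    relation: str = "ROOT"  # illustrative dependency label to the parent
    children: List["Phrase"] = field(default_factory=list)


def render(field_name: str) -> Callable[[Phrase], str]:
    """Curried renderer: fix the field now, supply the phrase later."""
    return lambda phrase: "+".join(getattr(phrase, field_name))


def sub_phrases_sketch(phrase: Phrase, phrase_func: Callable[[Phrase], str]):
    """Apply phrase_func to every direct dependent of the given phrase."""
    return [(child.relation, phrase_func(child)) for child in phrase.children]


root = Phrase(
    ["ご", "一緒", "し", "ましょう", "。"],
    ["ご", "一緒", "する", "ます", "。"],
    children=[
        Phrase(["銀座", "で"], ["銀座", "で"], relation="obl"),
        Phrase(["ランチ", "を"], ["ランチ", "を"], relation="obj"),
    ],
)

# The caller decides how sub-phrases are rendered by choosing the curried function.
for relation, text in sub_phrases_sketch(root, render("lemmas")):
    print(relation, text)
```

Because render("lemmas") is already a function of one phrase, sub_phrases_sketch() needs no knowledge of which field is being joined.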
The argument variations of bunsetu() shown above may seem strange to those who are used to Python. Recent versions of Python provide type hinting, but this dynamically typed language does not offer advanced typing features such as polymorphism by default. However, the functools module in Python’s standard library contains singledispatch, which dispatches on the type of the first argument only. Some of GiNZA v4’s phrase extraction APIs use singledispatch to achieve (pseudo-) currying.
In GiNZA’s source code, the (pseudo-) currying is implemented in traverse(), the main implementation behind bunsetu(). The base variant of traverse(), annotated with @singledispatch, returns a function that takes a token argument.
In contrast, _traverse(), annotated with @traverse.register(Token), is interpreted as a variation of traverse(). If the first argument is an instance of the Token class when traverse() is called, the call is dispatched to _traverse(). Note that the first-argument type registered in this way must be a class. The base variant annotated with @singledispatch, by contrast, does not require a type specification.
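This dispatch pattern can be sketched with the standard library alone. In the minimal sketch below, Token is a hypothetical stand-in for spaCy’s Token class, and bunsetu_sketch() with its simplified signature is an assumption made for illustration rather than GiNZA’s actual source. The base function plays the role of traverse(), and the registered variant plays the role of _traverse().

```python
from dataclasses import dataclass, field
from functools import singledispatch
from typing import Callable, Iterable, List


@dataclass
class Token:
    """Hypothetical stand-in for spaCy's Token; `phrase` lists the tokens of its bunsetsu."""
    orth_: str
    lemma_: str
    phrase: List["Token"] = field(default_factory=list, repr=False, compare=False)


def orth_(token: Token) -> str:
    return token.orth_


def lemma_(token: Token) -> str:
    return token.lemma_


@singledispatch
def bunsetu_sketch(element_func: Callable[[Token], str],
                   join_func: Callable[[Iterable[str]], object] = "+".join):
    """Base variant: the first argument is not a Token, so return a
    function that still expects the token."""
    return lambda token: join_func(element_func(t) for t in token.phrase)


@bunsetu_sketch.register(Token)
def _bunsetu_sketch_for_token(token: Token, element_func=orth_, join_func="+".join):
    """Token variant: the token is already known, so compute the result now."""
    return bunsetu_sketch(element_func, join_func)(token)


# Build one illustrative phrase; the lemmas are typed in by hand.
tokens = [Token(o, l) for o, l in
          [("ご", "ご"), ("一緒", "一緒"), ("し", "する"), ("ましょう", "ます"), ("。", "。")]]
for t in tokens:
    t.phrase = tokens
head = tokens[1]

print(bunsetu_sketch(head))          # orth_ fields joined with "+"
print(bunsetu_sketch(head, lemma_))  # direct call
print(bunsetu_sketch(lemma_)(head))  # curried call, same result as the direct call
```

Since singledispatch inspects only the class of the first positional argument, a function such as lemma_ falls through to the base variant, while a Token instance is routed to the registered one.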
Evaluating GiNZA’s Phrase Recognition Accuracy
Table 1 below summarizes the results of our evaluation of GiNZA v4’s accuracy in recognizing phrasal headwords. GiNZA uses UD_Japanese-BCCWJ v2.6 as its training data. Since this corpus is not publicly available, we evaluated GiNZA’s accuracy using UD_Japanese-GSD v2.6, a corpus available to anyone. In GiNZA v4’s training data, the dependency labels are extended with the suffix “_bunsetu” so that the head of each phrase can be identified from the dependency labels in the analysis results.
The experimental results show that the overall accuracy of the identification of the phrase-head relationship is approximately 96%. When the morphological analysis results of SudachiPy and the dependency structure analysis results of spaCy are correct, the head of the phrase is almost perfectly identified.
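The “_bunsetu” suffix convention can be illustrated with a small sketch. The token and label pairs below are hypothetical examples typed in by hand, and split_label() is a helper invented for this illustration; the sketch only shows how a suffix on a dependency label can mark a phrase head, not how GiNZA processes its labels internally.

```python
BUNSETU_SUFFIX = "_bunsetu"

# (token text, dependency label) pairs; the labels are hypothetical examples.
analysis = [
    ("妻", "nsubj_bunsetu"), ("が", "case"),
    ("買っ", "acl_bunsetu"), ("た", "aux"),
]


def split_label(label: str):
    """Return (plain dependency label, whether the token is a phrase head)."""
    if label.endswith(BUNSETU_SUFFIX):
        return label[: -len(BUNSETU_SUFFIX)], True
    return label, False


# Tokens whose label carries the suffix are treated as phrase heads.
phrase_heads = [text for text, label in analysis if split_label(label)[1]]
# Stripping the suffix recovers the ordinary dependency labels.
plain_labels = [split_label(label)[0] for _, label in analysis]

print(phrase_heads)
print(plain_labels)
```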
GiNZA’s Advantage in Analysis Accuracy
GiNZA’s architecture shares many common elements with the final release of the spaCy v2.3 Japanese model, including the use of SudachiPy for tokenization. But there are a few key differences worth mentioning.
The spaCy Japanese model uses SudachiPy in Mode A. This separates words in the shortest units to simplify learning. GiNZA utilizes Mode C to handle more specific meanings. GiNZA’s training data, UD_Japanese-BCCWJ, is also several times larger than UD_Japanese-GSD, the training data of the spaCy Japanese model.
The Path to the Official spaCy Japanese Model Release
spaCy has been gaining more users in recent years, especially when it comes to European and American languages. This is due to its ability to perform analysis of multiple languages with just a single library. Contrary to popular belief, spaCy has actually been capable of using Japanese models since its initial version. But the functionality provided by the previous Japanese model was limited to tokenization because there was no openly licensed Japanese treebank (a dataset annotated with syntactic information) that permitted commercial use.
Megagon Labs and the National Institute for Japanese Language and Linguistics released GiNZA under the MIT license for commercial use in the spring of 2019. The response from this release made the need for this technology readily apparent. Using the knowledge gained from building GiNZA, its research team started to work on the issue of integrating the Japanese dependency structure analysis model into spaCy.
In the fall of 2019, the Universal Dependencies Japanese Working Group led the removal of the non-commercial clause from the license of UD_Japanese-GSD. Soon after, the prerequisites for using UD_Japanese-GSD as training data for spaCy’s dependency parser were established. But manually annotated gold labels had to be added to UD_Japanese-GSD to perform named entity extraction, an important feature for industrial applications.
To address this, Megagon Labs participated in the Universal Dependencies Japanese Working Group. Besides the construction of UD v2.6, we worked on the assignment of gold named entity labels. In May 2020, we released UD_Japanese-GSD v2.6-NE. It is our hope that the release of this dataset will contribute to the further development of Japanese NLP.
A Step Towards Better Japanese Language Analysis
We hope you’ve enjoyed this overview of the new functionalities that GiNZA version 4.0 offers. We believe this latest iteration provides a number of features and highly accurate models that are immediately useful for Japanese language analysis.
If you have Python 3.6 or higher installed, you can run `pip install -U ginza` to start using GiNZA 4.0 right now. If you are interested in Japanese parsing with Universal Dependencies or in NLP frameworks, please visit GiNZA’s page here.
Do you have any questions about how GiNZA works? Contact us today!
Written by Hiroshi Matsuda and Megagon Labs