Question: Improving the extraction of human names with nltk


I am trying to extract human names from text.

Does anyone have a method they would recommend?

This is what I tried (code below): I use nltk to find everything tagged as a person and then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results, but I wonder whether there are better ways to solve this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    # Note: NLTK 3 and later use t.label() here instead of t.node.
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)  # parse once, then reuse both parts
    last_first = parsed.last + ', ' + parsed.first
    print(last_first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic is not a human name in the context of this article is the hard (maybe impossible) part.
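The name-assembly part of the function above (join the tokens, skip single-token chunks, drop repeats) can be written more compactly. The following is only a sketch of that step in isolation: collect_full_names is a hypothetical helper, and the input is assumed to be the lists of (word, tag) pairs that subtree.leaves() returns for each PERSON subtree.

```python
def collect_full_names(person_chunks):
    """Join each chunk's tokens into a name, keeping multi-token names once.

    person_chunks: iterable of lists of (word, pos_tag) tuples, assumed to
    be what subtree.leaves() yields for each PERSON subtree.
    """
    names = []
    for leaves in person_chunks:
        words = [word for word, _tag in leaves]
        if len(words) < 2:  # skip lone surnames, as in the original loop
            continue
        name = " ".join(words)
        if name not in names:  # keep first-seen order, drop repeats
            names.append(name)
    return names


# Hand-written sample chunks (illustrative data, not real tagger output)
chunks = [
    [("Richard", "NNP"), ("Branson", "NNP")],
    [("Krugman", "NNP")],
    [("Richard", "NNP"), ("Branson", "NNP")],
    [("Francois", "NNP"), ("R.", "NNP"), ("Velde", "NNP")],
]
print(collect_full_names(chunks))
```

This does not change what gets recognised as a person; it only replaces the manual string concatenation and the name[:-1] trimming.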


32
2017-11-29 17:33


Answers:


I have to agree with the suggestion that "make my code better" is not a good fit for this site, but I can give you a few directions you could dig into.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you have to download some core files. Here is a script that can do all of that for you.

I wrote this script:

import nltk
# A later answer in this thread uses the newer class name, StanfordNERTagger.
from nltk.tag.stanford import NERTagger

# Paths to the downloaded model and jar
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)  # list of (token, label) pairs
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got results that are not so bad:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')

Hope this helps.
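The tagger labels one token at a time, so a full name comes back as several separate pairs. A minimal sketch of one way to merge adjacent PERSON tokens into whole names is shown here. It assumes the complete list of (token, label) pairs for a sentence is passed in, and that non-entity tokens carry some other label (written as "O" in the sample data); group_person_tokens is a hypothetical helper, not part of NLTK.

```python
from itertools import groupby


def group_person_tokens(tagged_tokens):
    """Merge runs of adjacent PERSON-tagged tokens into full-name strings."""
    names = []
    # groupby yields one run per stretch of consecutive equal labels
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "PERSON":
            names.append(" ".join(token for token, _ in run))
    return names


# Hand-written sample in the same (token, label) shape as the output above
tagged = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"), ("economist", "O"),
    ("Richard", "PERSON"), ("Branson", "PERSON"),
    ("announced", "O"), ("Bitcoin", "PERSON"),
]
print(group_person_tokens(tagged))
```

Two different people mentioned directly next to each other would be merged into one string by this approach, and mislabelled tokens such as "Bitcoin" still come through, so further filtering is needed.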


14
2018-06-09 11:13



You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Either get the data locally and query it (it is in RDF), or use the Google API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographic locations, etc. (which would be caught by your snippet) can then be discarded based on the Freebase data.
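As a rough illustration of the discard step only, the sketch below filters candidate names against a set of known non-person entities. The set here is hand-written, hypothetical stand-in data; in practice it would be built from whatever knowledge-base dump or lookup service you use, and drop_known_non_persons is an illustrative helper name.

```python
def drop_known_non_persons(candidates, non_person_entities):
    """Return the candidates that do not match a known non-person entity."""
    # Compare case-insensitively so "virgin galactic" also matches
    blocked = {name.casefold() for name in non_person_entities}
    return [name for name in candidates if name.casefold() not in blocked]


# Hypothetical stand-in for data pulled from a knowledge base
KNOWN_ORGANIZATIONS = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}

candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic", "Paul Krugman"]
print(drop_known_non_persons(candidates, KNOWN_ORGANIZATIONS))
```

Exact string matching is the simplest possible lookup; real entity resolution would also have to deal with aliases and ambiguous names.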


5
2017-12-08 23:57



For anyone else, I found this article helpful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             # NLTK 3 and later: chunk.label() replaces the older chunk.node
...             if hasattr(chunk, 'label'):
...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
...

5
2018-02-25 20:27



spaCy can be a good alternative for extracting names from a text.

https://spacy.io/usage/training#ner
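A minimal sketch of what that could look like with a pretrained pipeline follows. The model name en_core_web_sm is an assumption (any English pipeline with an NER component should work), and both function names are illustrative. The spaCy import is done lazily so the filtering helper can be read and tested on its own.

```python
def person_names(doc):
    """Return the text of every entity labelled PERSON, without duplicates."""
    seen = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen


def extract_with_spacy(text, model="en_core_web_sm"):
    """Run a spaCy pipeline over text; needs spaCy and the model installed."""
    import spacy  # imported lazily so person_names() is usable without spaCy

    nlp = spacy.load(model)  # assumed model name; swap in the one you have
    return person_names(nlp(text))


# Example call (not run here; the result depends on the model and its version):
# extract_with_spacy("In November 2013 Richard Branson announced that ...")
```

Unlike the token-level Stanford output, spaCy returns multi-token entity spans directly, so no regrouping of name parts is needed.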


2
2017-12-06 15:39



The answer from @trojane did not quite work for me, but it helped a lot in getting to this one.

Prerequisites

Create a folder stanford-ner and download the following two files into it (the ones the script below loads):

english.all.3class.distsim.crf.ser.gz
stanford-ner.jar

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Result

(u'Bitcoin', u'LOCATION')       # wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong

1
2017-07-12 15:49



I actually wanted to extract only person names, so I thought of checking every name that comes out as output against WordNet (a large lexical database of English). More information about WordNet can be found here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet

person_list = []
# person_names refers to the same list object as person_list (not a copy).
person_names = person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

get_human_names(text)  # fills the module-level person_list; returns nothing
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; that exception is due to the last name "Summers".
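One thing to watch in a filter like this: removing items from a list while iterating over that same list can make Python skip entries. A minimal sketch that avoids this by building a new list is shown here. The word check is passed in as a function so the logic can be tried without WordNet installed; filter_by_dictionary, wordnet_lookup and the demo vocabulary are all illustrative names and data, and results with the real WordNet lookup may differ from the output posted above.

```python
def filter_by_dictionary(names, is_dictionary_word):
    """Keep names in which no token is recognised as an ordinary word."""
    kept = []
    for name in names:
        # Build a new list instead of removing from the one being iterated
        if not any(is_dictionary_word(token) for token in name.split()):
            kept.append(name)
    return kept


def wordnet_lookup(token):
    """True if WordNet has any synset for token (needs the NLTK wordnet corpus)."""
    from nltk.corpus import wordnet  # lazy import; not needed for the demo below

    return bool(wordnet.synsets(token))


# Demo with a tiny stand-in vocabulary instead of WordNet (hypothetical data)
demo_vocabulary = {"virgin", "economist", "summers"}
demo_names = ["Richard Branson", "Virgin Galactic",
              "Economist Paul Krugman", "Larry Summers"]
print(filter_by_dictionary(demo_names, lambda t: t.lower() in demo_vocabulary))
```

To use WordNet itself, pass wordnet_lookup as the second argument. A stricter rule like this also drops real people whose names happen to be dictionary words, which is the "Summers" problem described above.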


1
2018-03-26 20:37



This worked pretty well for me. I only had to change one line to get it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

There are imperfections in the output (for example, it identifies "Money Laundering" as a person), but with my data, a name database may not be dependable.
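If the same script has to run against both older and newer NLTK installs, a small compatibility helper is one option. This is only a sketch; tree_label is a hypothetical name, not an NLTK function.

```python
def tree_label(tree):
    """Return a chunk tree's label under either the old or the new NLTK API."""
    label = getattr(tree, "label", None)
    if callable(label):  # newer NLTK: label is a method, Tree.label()
        return label()
    return tree.node     # older NLTK: the label is stored in Tree.node
```

The filter in the loop would then read: filter=lambda t: tree_label(t) == 'PERSON'.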


0
2017-07-27 13:11