Some entity recognition models for the 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements:
- python 3.6
- keras 2.2.4 (tensorflow backend)
- keras-contrib 2.0.8 for CRF inference.
- gensim for training word2vec.
- bilm-tf for ELMo.

Each model is built from three components (an example assembly is sketched after the list):

- Word representation
  - Static Word Embedding: word2vec, GloVe
  - Contextualized Word Representation: ELMo (scripts with the `_elmo` suffix); see the ELMo instructions below
- Encoder
  - BiLSTM
  - DGCNN
- Tag decoding
  - sequence labeling (`sequence_labeling.py`)
    - CRF
    - softmax
  - predicting the start/end index of entities (scripts with the `_pointer` suffix)
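To make the combination concrete, here is a minimal sketch (not the repo's actual code) of the static embedding + BiLSTM + CRF variant, using the keras / keras-contrib stack listed in the requirements. The sizes and the random `emb_matrix` are placeholders standing in for the real vocabulary, tag set, and pre-trained word2vec/GloVe matrix:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF

# Hypothetical sizes; the real values come from the vocab/tag files and config.py.
VOCAB_SIZE, EMB_DIM, NUM_TAGS = 20000, 300, 18
emb_matrix = np.random.randn(VOCAB_SIZE, EMB_DIM)  # stand-in for the trained word2vec/GloVe matrix

model = Sequential()
# component 1: static word embedding lookup (frozen pre-trained vectors)
model.add(Embedding(VOCAB_SIZE, EMB_DIM, weights=[emb_matrix],
                    mask_zero=True, trainable=False))
# component 2: BiLSTM encoder over the token sequence
model.add(Bidirectional(LSTM(256, return_sequences=True)))
# component 3: CRF layer for sequence labeling
crf = CRF(NUM_TAGS, sparse_target=True)
model.add(crf)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```

Swapping the CRF layer for a `TimeDistributed(Dense(NUM_TAGS, activation='softmax'))` with a cross-entropy loss gives the softmax decoder, and replacing the BiLSTM with a stack of (gated) dilated convolutions gives the DGCNN encoder.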
Combining the three components above gives 12 possible models in all. However, this repo only implements the following 6:
- Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): `sequence_labeling.py`
- (Static Word Embedding, ELMo) × BiLSTM × pointer: `bilstm_pointer.py` and `bilstm_pointer_elmo.py`
Other models can be implemented with a few additions or modifications to the code. The pointer-style output, which differs most from plain sequence labeling, is sketched below.
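The pointer decoder predicts entity boundaries directly instead of per-token tags. A common formulation, assumed here purely for illustration (it is not necessarily identical to what `bilstm_pointer.py` does), gives every token two per-type probabilities: that it starts an entity and that it ends one:

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, TimeDistributed

# Hypothetical sizes, as in the sketch above.
VOCAB_SIZE, EMB_DIM, NUM_ENTITY_TYPES = 20000, 300, 4
emb_matrix = np.random.randn(VOCAB_SIZE, EMB_DIM)

tokens = Input(shape=(None,), dtype='int32')
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[emb_matrix], trainable=False)(tokens)
x = Bidirectional(LSTM(256, return_sequences=True))(x)
# Two per-token heads: probability that the token starts / ends an entity of each type.
start_probs = TimeDistributed(Dense(NUM_ENTITY_TYPES, activation='sigmoid'), name='start')(x)
end_probs = TimeDistributed(Dense(NUM_ENTITY_TYPES, activation='sigmoid'), name='end')(x)

model = Model(tokens, [start_probs, end_probs])
model.compile(optimizer='adam', loss='binary_crossentropy')
```

At prediction time, start and end positions whose probabilities exceed a threshold are typically paired (for each type, each start with the nearest following end) to recover entity spans.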

Usage:

1. Prepare data:
   - download the official competition data to the `data` folder
   - get sequence tagging train/dev/test data: `bin/trans_data.py`
   - prepare `vocab` and `tag`:
     - `vocab`: word vocabulary, one word per line, in `word word_count` format (see the sketch after this list)
     - `tag`: BIOES NER tag list, one tag per line (`O` in the first line)
   - follow step 2 or 3 below:
     - step 2 is for models using static word embeddings
     - step 3 is for the model using ELMo
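A minimal sketch of producing the `vocab` file in the required `word word_count` format. It assumes `bin/trans_data.py` leaves a whitespace-tokenized corpus; the paths are placeholders:

```python
from collections import Counter

# Hypothetical paths; adjust to wherever bin/trans_data.py writes the tokenized corpus.
corpus_path, vocab_path = 'data/train.txt', 'data/vocab'

counter = Counter()
with open(corpus_path, encoding='utf-8') as f:
    for line in f:
        counter.update(line.split())  # assumes whitespace-separated tokens per line

# One word per line, "word word_count" format, most frequent first.
with open(vocab_path, 'w', encoding='utf-8') as f:
    for word, count in counter.most_common():
        f.write('{} {}\n'.format(word, count))
```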
2. Run a model with static word embeddings, taking `word2vec` as an example (a training sketch follows this list):
   - train word2vec: `bin/train_w2v.py`
   - modify `config.py`
   - run `python sequence_labeling.py [bilstm/dgcnn] [softmax/crf]` or `python bilstm_pointer.py` (remember to modify `config.model_name` before a new run, or the old model will be overwritten)
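For reference, a gensim word2vec training sketch roughly in the spirit of `bin/train_w2v.py` (the actual script may differ). Paths and hyperparameters are placeholders, and the gensim 3.x API of 2019 is assumed, where the dimension argument is `size`:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical paths/parameters; bin/train_w2v.py and config.py hold the real ones.
sentences = LineSentence('data/train.txt')      # one whitespace-tokenized sentence per line
w2v = Word2Vec(sentences, size=300, window=5,   # `size` in gensim 3.x (renamed vector_size in 4.x)
               min_count=1, sg=1, workers=4, iter=10)
w2v.wv.save_word2vec_format('data/w2v_300d.txt', binary=False)
```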
3. Or run the model with ELMo embeddings: dump the contextualized sentence representation of each `train/dev/test` sentence to a file first, then load them at train/dev/test time rather than running ELMo on the fly (a loading sketch follows this list):
   - follow the instructions described here to get contextualized sentence representations for the `train_full/dev/test` data from the pre-trained ELMo weights
   - modify `config.py`
   - run `python bilstm_pointer_elmo.py`
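A hedged sketch of loading the pre-dumped representations. It assumes the dump file follows bilm-tf's convention of one HDF5 dataset per sentence, keyed by the sentence's line index, with shape `(n_layers, n_tokens, dim)`; the file name and the layer averaging are illustrative choices, not necessarily what `bilstm_pointer_elmo.py` does:

```python
import h5py

# Hypothetical file name; one HDF5 dataset per sentence, keyed by its 0-based line index.
elmo_file = 'data/train_elmo.hdf5'

sentence_reprs = []
with h5py.File(elmo_file, 'r') as f:
    for i in range(len(f.keys())):
        layers = f[str(i)][...]                     # shape: (n_layers, n_tokens, dim)
        sentence_reprs.append(layers.mean(axis=0))  # collapse the layers, e.g. by averaging

# Each entry is now an (n_tokens, dim) matrix that can be fed to the BiLSTM
# in place of a trainable embedding lookup.
```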
- To train ELMo on the competition corpus, just follow the official bilm-tf instructions described here.
- Some notes:
  - to train a token-level language model, modify `bin/train_elmo.py`: change `vocab = load_vocab(args.vocab_file, 50)` to `vocab = load_vocab(args.vocab_file, None)`, modify `n_train_tokens`, remove `char_cnn` from `options`, and modify `lstm.dim`/`lstm.projection_dim` as you wish (see the sketch below)
  - my settings: `n_gpus=2`, `n_train_tokens=94114921`, `lstm['dim']=2048`, `projection_dim=256`, `n_epochs=10`. Training took about 17 hours on 2 GTX 1080 Ti GPUs.
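Put together, the edited portion of bilm-tf's `bin/train_elmo.py` would look roughly like the following. This is a sketch based on the stock script, not a standalone program: the unchanged keys and the batch size are illustrative, while the dims, epochs, and token count are the values quoted above:

```python
# Sketch of the edits inside bilm-tf's bin/train_elmo.py (args/load_vocab come from that script).
n_gpus = 2
vocab = load_vocab(args.vocab_file, None)  # None instead of 50: token-level model, no char inputs

options = {
    'bidirectional': True,
    # 'char_cnn': {...},         # removed entirely for a token-level language model
    'dropout': 0.1,
    'lstm': {
        'cell_clip': 3,
        'dim': 2048,             # lstm['dim'], reduced from the default 4096
        'n_layers': 2,
        'proj_clip': 3,
        'projection_dim': 256,   # reduced from the default 512
        'use_skip_connections': True,
    },
    'all_clip_norm_val': 10.0,
    'n_epochs': 10,
    'n_train_tokens': 94114921,  # total number of tokens in the training corpus
    'batch_size': 128,           # illustrative value
    'n_tokens_vocab': vocab.size,
    'unroll_steps': 20,
    'n_negative_samples_batch': 8192,
}
```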
- After finishing the last step of the instructions, you can refer to the script `dump_token_level_bilm_embeddings.py` to dump the dynamic sentence representations of your own dataset.

References:

- Blog: DGCNN, a CNN-based reading-comprehension-style QA model (《基于CNN的阅读理解式问答模型：DGCNN》)
- Blog: a lightweight information extraction model based on DGCNN and probabilistic graphs (《基于DGCNN和概率图的轻量级信息抽取模型》)
- Named entity recognition tutorial: Named entity recognition series
- Some code references:
  - Sequence evaluation tools: seqeval
  - Neural sequence labeling toolkit: NCRF++
  - Contextualized word representation: ELMo