Some entity recognition models for the 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements:
- python 3.6
- keras 2.2.4 (tensorflow backend)
- keras-contrib 2.0.8 for CRF inference.
- gensim for training word2vec.
- bilm-tf for ELMo.

Each model is built from three components (an example assembly is sketched after the list):

- Word representation
  - Static Word Embedding: word2vec, GloVe
  - Contextualized Word Representation: ELMo (scripts with the `_elmo` suffix); see the ELMo instructions below
- Encoder
  - BiLSTM
  - DGCNN
- Tag decoding
  - sequence labeling (`sequence_labeling.py`)
    - CRF
    - softmax
  - predicting the start/end index of entities (scripts with the `_pointer` suffix)
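To make the combination concrete, here is a minimal sketch (not the repo's actual code) of the static embedding + BiLSTM + CRF variant, using the keras / keras-contrib stack listed in the requirements. The sizes and the random `emb_matrix` are placeholders standing in for the real vocabulary, tag set, and pre-trained word2vec/GloVe matrix:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF

# Hypothetical sizes; the real values come from the vocab/tag files and config.py.
VOCAB_SIZE, EMB_DIM, NUM_TAGS = 20000, 300, 18
emb_matrix = np.random.randn(VOCAB_SIZE, EMB_DIM)  # stand-in for the trained word2vec/GloVe matrix

model = Sequential()
# component 1: static word embedding lookup (frozen pre-trained vectors)
model.add(Embedding(VOCAB_SIZE, EMB_DIM, weights=[emb_matrix],
                    mask_zero=True, trainable=False))
# component 2: BiLSTM encoder over the token sequence
model.add(Bidirectional(LSTM(256, return_sequences=True)))
# component 3: CRF layer for sequence labeling
crf = CRF(NUM_TAGS, sparse_target=True)
model.add(crf)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```

Swapping the CRF layer for a `TimeDistributed(Dense(NUM_TAGS, activation='softmax'))` with a cross-entropy loss gives the softmax decoder, and replacing the BiLSTM with a stack of (gated) dilated convolutions gives the DGCNN encoder.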
Combining the three components above gives 12 possible models in all. However, this repo only implements the following 6:
- Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): `sequence_labeling.py`
- (Static Word Embedding, ELMo) × BiLSTM × pointer: `bilstm_pointer.py` and `bilstm_pointer_elmo.py`
Other models can be implemented with a few additions or modifications to the code. The pointer-style output, which differs most from plain sequence labeling, is sketched below.
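The pointer decoder predicts entity boundaries directly instead of per-token tags. A common formulation, assumed here purely for illustration (it is not necessarily identical to what `bilstm_pointer.py` does), gives every token two per-type probabilities: that it starts an entity and that it ends one:

```python
import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, TimeDistributed

# Hypothetical sizes, as in the sketch above.
VOCAB_SIZE, EMB_DIM, NUM_ENTITY_TYPES = 20000, 300, 4
emb_matrix = np.random.randn(VOCAB_SIZE, EMB_DIM)

tokens = Input(shape=(None,), dtype='int32')
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[emb_matrix], trainable=False)(tokens)
x = Bidirectional(LSTM(256, return_sequences=True))(x)
# Two per-token heads: probability that the token starts / ends an entity of each type.
start_probs = TimeDistributed(Dense(NUM_ENTITY_TYPES, activation='sigmoid'), name='start')(x)
end_probs = TimeDistributed(Dense(NUM_ENTITY_TYPES, activation='sigmoid'), name='end')(x)

model = Model(tokens, [start_probs, end_probs])
model.compile(optimizer='adam', loss='binary_crossentropy')
```

At prediction time, start and end positions whose probabilities exceed a threshold are typically paired (for each type, each start with the nearest following end) to recover entity spans.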

Usage:

1. Prepare data:
   - download the official competition data to the `data` folder
   - get sequence tagging train/dev/test data: `bin/trans_data.py`
   - prepare `vocab` and `tag`:
     - `vocab`: word vocabulary, one word per line, in `word word_count` format (see the sketch after this list)
     - `tag`: BIOES NER tag list, one tag per line (`O` in the first line)
   - follow step 2 or 3 below:
     - step 2 is for models using static word embeddings
     - step 3 is for the model using ELMo
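A minimal sketch of producing the `vocab` file in the required `word word_count` format. It assumes `bin/trans_data.py` leaves a whitespace-tokenized corpus; the paths are placeholders:

```python
from collections import Counter

# Hypothetical paths; adjust to wherever bin/trans_data.py writes the tokenized corpus.
corpus_path, vocab_path = 'data/train.txt', 'data/vocab'

counter = Counter()
with open(corpus_path, encoding='utf-8') as f:
    for line in f:
        counter.update(line.split())  # assumes whitespace-separated tokens per line

# One word per line, "word word_count" format, most frequent first.
with open(vocab_path, 'w', encoding='utf-8') as f:
    for word, count in counter.most_common():
        f.write('{} {}\n'.format(word, count))
```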
2. Run a model with static word embeddings, taking `word2vec` as an example (a training sketch follows this list):
   - train word2vec: `bin/train_w2v.py`
   - modify `config.py`
   - run `python sequence_labeling.py [bilstm/dgcnn] [softmax/crf]` or `python bilstm_pointer.py` (remember to modify `config.model_name` before a new run, or the old model will be overwritten)
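For reference, a gensim word2vec training sketch roughly in the spirit of `bin/train_w2v.py` (the actual script may differ). Paths and hyperparameters are placeholders, and the gensim 3.x API of 2019 is assumed, where the dimension argument is `size`:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical paths/parameters; bin/train_w2v.py and config.py hold the real ones.
sentences = LineSentence('data/train.txt')      # one whitespace-tokenized sentence per line
w2v = Word2Vec(sentences, size=300, window=5,   # `size` in gensim 3.x (renamed vector_size in 4.x)
               min_count=1, sg=1, workers=4, iter=10)
w2v.wv.save_word2vec_format('data/w2v_300d.txt', binary=False)
```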
3. Or run the model with ELMo embeddings: dump the contextualized sentence representation of each `train/dev/test` sentence to a file first, then load them at train/dev/test time rather than running ELMo on the fly (a loading sketch follows this list):
   - follow the instructions described here to get contextualized sentence representations for the `train_full/dev/test` data from the pre-trained ELMo weights
   - modify `config.py`
   - run `python bilstm_pointer_elmo.py`
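A hedged sketch of loading the pre-dumped representations. It assumes the dump file follows bilm-tf's convention of one HDF5 dataset per sentence, keyed by the sentence's line index, with shape `(n_layers, n_tokens, dim)`; the file name and the layer averaging are illustrative choices, not necessarily what `bilstm_pointer_elmo.py` does:

```python
import h5py

# Hypothetical file name; one HDF5 dataset per sentence, keyed by its 0-based line index.
elmo_file = 'data/train_elmo.hdf5'

sentence_reprs = []
with h5py.File(elmo_file, 'r') as f:
    for i in range(len(f.keys())):
        layers = f[str(i)][...]                     # shape: (n_layers, n_tokens, dim)
        sentence_reprs.append(layers.mean(axis=0))  # collapse the layers, e.g. by averaging

# Each entry is now an (n_tokens, dim) matrix that can be fed to the BiLSTM
# in place of a trainable embedding lookup.
```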
- To train ELMo on the competition corpus, just follow the official bilm-tf instructions described here.
- Some notes:
  - to train a token-level language model, modify `bin/train_elmo.py`: change `vocab = load_vocab(args.vocab_file, 50)` to `vocab = load_vocab(args.vocab_file, None)`, modify `n_train_tokens`, remove `char_cnn` from `options`, and modify `lstm.dim`/`lstm.projection_dim` as you wish (see the sketch below)
  - my settings: `n_gpus=2`, `n_train_tokens=94114921`, `lstm['dim']=2048`, `projection_dim=256`, `n_epochs=10`. Training took about 17 hours on 2 GTX 1080 Ti GPUs.
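Put together, the edited portion of bilm-tf's `bin/train_elmo.py` would look roughly like the following. This is a sketch based on the stock script, not a standalone program: the unchanged keys and the batch size are illustrative, while the dims, epochs, and token count are the values quoted above:

```python
# Sketch of the edits inside bilm-tf's bin/train_elmo.py (args/load_vocab come from that script).
n_gpus = 2
vocab = load_vocab(args.vocab_file, None)  # None instead of 50: token-level model, no char inputs

options = {
    'bidirectional': True,
    # 'char_cnn': {...},         # removed entirely for a token-level language model
    'dropout': 0.1,
    'lstm': {
        'cell_clip': 3,
        'dim': 2048,             # lstm['dim'], reduced from the default 4096
        'n_layers': 2,
        'proj_clip': 3,
        'projection_dim': 256,   # reduced from the default 512
        'use_skip_connections': True,
    },
    'all_clip_norm_val': 10.0,
    'n_epochs': 10,
    'n_train_tokens': 94114921,  # total number of tokens in the training corpus
    'batch_size': 128,           # illustrative value
    'n_tokens_vocab': vocab.size,
    'unroll_steps': 20,
    'n_negative_samples_batch': 8192,
}
```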
- After finishing the last step of the instructions, you can refer to the script `dump_token_level_bilm_embeddings.py` to dump the dynamic sentence representations of your own dataset.

References:

- Blog: DGCNN, a CNN-based reading-comprehension-style QA model (《基于CNN的阅读理解式问答模型：DGCNN》)
- Blog: a lightweight information extraction model based on DGCNN and probabilistic graphs (《基于DGCNN和概率图的轻量级信息抽取模型》)
- Named entity recognition tutorial: Named entity recognition series
- Some code references:
  - Sequence evaluation tools: seqeval
  - Neural sequence labeling toolkit: NCRF++
  - Contextualized word representation: ELMo