http://www.eurecom.fr/en/publication/6231/download/data-publi-6231.pdf
—————————————————————————————————————————
contributions:
1,a compact graph-based representation that can express a rich set of relationships inherent in the relational world.
2,a method to derive sentences from such a graph that effectively "describe" the similarity across elements.
3,optimizations that improve the quality of the learned embeddings and the performance of integration tasks.
4,a diverse collection of evaluation criteria.
steps:
1,define 3 node types: rows (RID), attributes (CID), and tokens (TID); edges: connect nodes that co-occur in the data.
2,generate paths over the graph via random walks.
3,From Walks to Sentences: each walk becomes a training sentence of node ids.
4,Embedding Construction: we piggyback on the plethora of effective embedding algorithms such as word2vec, GloVe, fastText, and so on.
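The four steps above can be sketched as follows. This is a minimal toy illustration (the table, node names, and helpers are my own, not the paper's code); the resulting sentences would then be fed to an off-the-shelf embedding trainer such as gensim's word2vec.

```python
import random

# Toy relation: two rows, each with attributes "name" and "city".
rows = [
    {"name": "alice", "city": "paris"},
    {"name": "bob", "city": "paris"},
]

# Step 1: build the tripartite graph with row nodes (R#),
# attribute nodes (C_#), and token nodes; an edge links
# nodes that co-occur in a cell.
graph = {}

def add_edge(a, b):
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

for i, row in enumerate(rows):
    rid = f"R{i}"
    for attr, token in row.items():
        add_edge(rid, token)          # row <-> token
        add_edge(f"C_{attr}", token)  # attribute <-> token

# Step 2: random walks over the graph.
def random_walk(start, length, rng):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(sorted(graph[walk[-1]])))
    return walk

# Step 3: each walk's node-id sequence is one "sentence".
rng = random.Random(0)
sentences = [random_walk(n, 5, rng) for n in graph for _ in range(2)]

# Step 4 (not run here): train any word-embedding model, e.g.
# gensim.models.Word2Vec(sentences, vector_size=32, min_count=1)
```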
Optimization: Improving Local Embeddings
1,Handling Imbalanced Relations
2,Handling Missing and Noisy Data
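One common way to handle missing values in this setting is to replace each null cell with a unique placeholder token, so that distinct nulls do not spuriously co-occur in the graph. A hedged sketch of that idea (my own illustration; the paper's exact mechanism may differ):

```python
from itertools import count

# Global counter so every missing cell gets a distinct token.
_null_ids = count()

def fill_missing(row):
    """Replace None cells with unique placeholder tokens."""
    return {attr: val if val is not None
            else f"__null_{next(_null_ids)}__"
            for attr, val in row.items()}

rows = [
    {"name": "alice", "city": None},
    {"name": None, "city": "paris"},
]
cleaned = [fill_missing(r) for r in rows]
# The two nulls become different tokens, so they do not
# create a false edge between the two rows.
```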
Criteria: once the embeddings for all nodes are ready, they can be applied to downstream tasks.
1,Schema Matching (SM)
- run a KNN search directly on the CID embeddings.
2,Entity Resolution (ER)
- run a KNN search directly on the RID embeddings.
3,Token Matching (TM)
- run a KNN search directly on the TID embeddings; the top-1 neighbor is the conceptual synonym.
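All three criteria reduce to the same KNN query over node embeddings, just on different node types (CID/RID/TID). A toy sketch with invented vectors (the ids and values are hypothetical, for illustration only):

```python
import math

# Hypothetical pre-trained embeddings keyed by node id.
embeddings = {
    "C_city": [0.9, 0.1, 0.0],
    "C_town": [0.8, 0.2, 0.0],
    "C_name": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn(query, k=1):
    """Top-k nearest node ids to `query`, excluding itself."""
    q = embeddings[query]
    ranked = sorted((nid for nid in embeddings if nid != query),
                    key=lambda nid: cosine(embeddings[nid], q),
                    reverse=True)
    return ranked[:k]

# Schema matching: nearest CID to "C_city" is "C_town".
```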
Experiment
1,Evaluating Embeddings Quality
- MatchAttribute (MA), MatchRow (MR), MatchConcept (MC)
2,Data Integration Tasks
- Schema Matching(SM)
- Entity Resolution
S/F/O: different strategies for handling multi-word values.
DeepERpl: the local embeddings fed into the DeepER network.
ER: results vary with the choice of top-n.
- Token Matching