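Qunar (去哪儿网) written-test question: big data processing

Bounty: 50 points (园豆) [Resolved] Resolved on 2014-09-26 09:50

The question goes roughly like this:

Take Jin Yong's novel about Yang Guo and Xiaolongnü, saved as the text file "a.txt". Read the text, treat every three consecutive characters as one unit, and pick out the 10 three-character strings that occur most often.
For example, in 日月当空武曌很牛逼 the three-character strings include 日月当, 月当空 and 当空武.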

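For example: 小龙女   654
             梅超风   540
             ...      ...

and so on.

Please help.

Tags: 去哪儿 (Qunar), character processing, big data processing

Additional details:

Asking the experts again: suppose 10 million URLs (website addresses such as http://bbs.csdn.net/topics/390894007) are stored in one txt file; find the ten that occur most often.
A question from the search-engine world: find the 100 most frequent search terms (strings) over the past ten days.

These are all closely related problems. Experts, please help. Amen.

学海没有鱼 | Beginner, level 1 | Points: 5
Asked: 2014-09-24 20:14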


Best answer

In principle the implementation is simple:

        // Requires: using System.Collections.Generic; using System.IO; using System.Linq;
        IDictionary<string, int> Calc(Stream stream)
        {
            IDictionary<string, int> result = new Dictionary<string, int>();
            StreamReader reader = new StreamReader(stream);
            int count;
            char[] buffer = new char[4096];
            int index = 0; // number of characters carried over at the start of the buffer
            while ((count = reader.Read(buffer, index, buffer.Length - index)) > 0)
            {
                if (count + index < 3)
                {
                    // Fewer than three characters available: keep them and read more.
                    index += count;
                    continue;
                }
                // index + count characters are valid, so index + count - 2 trigrams start in this buffer.
                count += index - 2;
                index = 2;
                for (int i = 0; i < count; i++)
                {
                    var s = new string(buffer, i, 3);
                    int n;
                    result.TryGetValue(s, out n); // n stays 0 when the trigram is new
                    result[s] = n + 1;
                }
                // Carry the last two characters over so trigrams spanning two reads are counted.
                buffer[0] = buffer[count];
                buffer[1] = buffer[count + 1];
            }
            return result.OrderByDescending(x => x.Value).Take(10).ToDictionary(x => x.Key, x => x.Value);
        }
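
The last line of the answer sorts every distinct trigram just to keep ten of them. As a hedged alternative, the sketch below keeps only the current best k entries in a min-heap. It assumes .NET 6 or later for `PriorityQueue<TElement, TPriority>`; the class and method names (`TopK`, `Select`) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class TopK
{
    // Returns the k entries with the largest counts, highest count first.
    public static List<KeyValuePair<string, int>> Select(IDictionary<string, int> counts, int k)
    {
        var result = new List<KeyValuePair<string, int>>();
        if (k <= 0) return result;

        // Min-heap keyed by count: the root is always the weakest current candidate.
        var heap = new PriorityQueue<KeyValuePair<string, int>, int>();
        foreach (var pair in counts)
        {
            if (heap.Count < k)
            {
                heap.Enqueue(pair, pair.Value);
            }
            else if (heap.TryPeek(out _, out int smallest) && pair.Value > smallest)
            {
                // Insert the new entry and drop the weakest candidate in one step.
                heap.EnqueueDequeue(pair, pair.Value);
            }
        }

        // Dequeue yields ascending counts, so reverse to get the largest first.
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse();
        return result;
    }
}
```

With n distinct strings this does O(n log k) work instead of the O(n log n) of a full sort, and the heap never holds more than k entries. The order among entries with equal counts is unspecified.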

Points awarded: 50

519740105 | Hero, level 5 | Points: 5810 | 2014-09-25 10:33

I think this approach could run out of memory.

That said, I haven't come up with a good way to discard data early either.

幻天芒 | Points: 37207 (Master, level 7) | 2014-09-25 14:56

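@幻天芒: This is the simplest implementation.

If memory really does run out, you would have to consider a database-based approach.

519740105 | Points: 5810 (Hero, level 5) | 2014-09-25 14:57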

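@519740105: I think there should be some strategy for discarding data early. My idea is to partition: you only want the top 10 strings, right? So split the input into 1000 groups, take the top 1000 from each group, and then merge the results.

幻天芒 | Points: 37207 (Master, level 7) | 2014-09-25 15:07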

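@幻天芒: Yes, that is one way to do it.

You could also read the text line by line with ReadLine and recognize punctuation at the same time (punctuation presumably should not count under the problem's requirements).

A concrete optimization takes real thought and some algorithm design.

With your scheme of 1000 partitions and the top 1000 from each, it is still possible to miss a string.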
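
One common way to keep a partitioned approach exact is to choose the partition from a hash of the string itself, so every occurrence of a given string lands in the same partition and its per-partition count is its full count. The sketch below illustrates that under stated assumptions: it partitions in memory for brevity (a real run over a large file would write each partition to its own temporary file), and the names `PartitionedTopK`, `StableHash` and `TopK` are hypothetical. This is a different scheme from splitting the input by position, which is what can miss strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedTopK
{
    // FNV-1a over UTF-16 code units: a simple hash that gives the same value on every run.
    static uint StableHash(string s)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (char c in s) { h ^= c; h *= 16777619; }
            return h;
        }
    }

    public static List<KeyValuePair<string, int>> TopK(IEnumerable<string> items, int partitions, int k)
    {
        if (partitions <= 0) throw new ArgumentOutOfRangeException(nameof(partitions));

        // Step 1: route each item to the partition chosen by its hash.
        var buckets = new List<string>[partitions];
        for (int i = 0; i < partitions; i++) buckets[i] = new List<string>();
        foreach (var item in items)
        {
            buckets[(int)(StableHash(item) % (uint)partitions)].Add(item);
        }

        // Step 2: count each partition on its own and keep its k best entries.
        var candidates = new List<KeyValuePair<string, int>>();
        foreach (var bucket in buckets)
        {
            var counts = new Dictionary<string, int>();
            foreach (var item in bucket)
            {
                counts.TryGetValue(item, out int n);
                counts[item] = n + 1;
            }
            candidates.AddRange(counts.OrderByDescending(p => p.Value).Take(k));
        }

        // Step 3: the global top k is among the per-partition winners.
        return candidates.OrderByDescending(p => p.Value).Take(k).ToList();
    }
}
```

Because a string never straddles two partitions, keeping only k entries per partition is enough; only one partition's dictionary needs to be in memory at a time.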

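Another approach I thought of uses a linked list. First count how often each single character occurs (the number of distinct Chinese characters in use is quite limited; assuming about 3000 common characters, this cuts memory use considerably). Then set aside the characters used very rarely (or the lower half), treat the remaining characters as candidates for the first character of a three-character string, go on to count the second character, and so on until the final result is reached.

That way memory use stays modest, and the computational complexity is only O(n), so it is bounded.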
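
The character-by-character idea above resembles a trie of depth three. The following is a minimal sketch of such a trie, not the poster's linked-list design: it swaps in `Dictionary<char, TrieNode>` children so each step is a hash lookup, and it leaves out the pruning of rare characters. All names (`TrieNode`, `TrigramTrie`, `Build`, `CountOf`) are illustrative assumptions.

```csharp
using System.Collections.Generic;

public sealed class TrieNode
{
    public int Count; // occurrences of the trigram ending at this node (depth 3 only)
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();

    // Returns the child for c, creating it on first use.
    public TrieNode Child(char c)
    {
        if (!Children.TryGetValue(c, out var node))
        {
            node = new TrieNode();
            Children[c] = node;
        }
        return node;
    }
}

public static class TrigramTrie
{
    // Walks every window of three characters down the trie and counts it at depth 3.
    public static TrieNode Build(string text)
    {
        var root = new TrieNode();
        for (int i = 0; i + 3 <= text.Length; i++)
        {
            var node = root;
            for (int d = 0; d < 3; d++) node = node.Child(text[i + d]);
            node.Count++;
        }
        return root;
    }

    // Looks up how often a trigram was seen; 0 if it never occurred.
    public static int CountOf(TrieNode root, string trigram)
    {
        var node = root;
        foreach (char c in trigram)
        {
            if (!node.Children.TryGetValue(c, out node)) return 0;
        }
        return node.Count;
    }
}
```

Trigrams that share a prefix share nodes, which is where any memory saving would come from; whether it beats a flat `Dictionary<string, int>` in practice would need measuring.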

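519740105 | Points: 5810 (Hero, level 5) | 2014-09-25 15:19

@519740105: Same problem: strings can still be missed. As for the linked list, I don't think it reduces memory use, since List is also a linked list underneath. ReadLine is necessary, but it does not solve the problem of storing the counts afterwards. Database theory should be helpful here.

幻天芒 | Points: 37207 (Master, level 7) | 2014-09-25 20:34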

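@幻天芒: I ran an experiment on Jin Yong's The Deer and the Cauldron (鹿鼎记). The dictionary ended up with about 40K entries, an estimated 4 MB or so in total, which is acceptable. Before counting I stripped punctuation, half-width symbols and the like, and the counting took less than 1 second.

I also counted the distinct Chinese characters used in the book: a little over 4000. Counting how often each character occurs took 1 second with a dictionary and two minutes with a List.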
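
The gap between the two timings is consistent with how the containers look things up: a `Dictionary` finds a key by hashing, while finding an entry in a `List<T>` (which is backed by an array) means scanning it element by element. The sketch below shows both ways of counting characters side by side; it is an illustration with made-up names (`CharCounting`, `WithDictionary`, `WithList`), not the code that produced the numbers above.

```csharp
using System.Collections.Generic;

public static class CharCounting
{
    // One hash lookup per character, independent of how many distinct characters exist.
    public static Dictionary<char, int> WithDictionary(string text)
    {
        var counts = new Dictionary<char, int>();
        foreach (char c in text)
        {
            counts.TryGetValue(c, out int n);
            counts[c] = n + 1;
        }
        return counts;
    }

    // One linear scan per character: the cost grows with the number of distinct characters.
    public static List<KeyValuePair<char, int>> WithList(string text)
    {
        var counts = new List<KeyValuePair<char, int>>();
        foreach (char c in text)
        {
            int at = counts.FindIndex(p => p.Key == c);
            if (at < 0)
            {
                counts.Add(new KeyValuePair<char, int>(c, 1));
            }
            else
            {
                counts[at] = new KeyValuePair<char, int>(c, counts[at].Value + 1);
            }
        }
        return counts;
    }
}
```

Both methods return the same counts; only the lookup cost differs.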

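I wrote a simple Node (not really a full linked list) on top of List. Just counting character usage took nearly three minutes, and counting the three-character strings took even longer.

519740105 | Points: 5810 (Hero, level 5) | 2014-09-25 21:19

@519740105: List is not up to the job; the conventional code is the snippet you posted. As long as it does not overflow, it is fine. Thanks for the effort, well done.

幻天芒 | Points: 37207 (Master, level 7) | 2014-09-25 21:22

@幻天芒: Come to Hefei.

519740105 | Points: 5810 (Hero, level 5) | 2014-09-26 08:32

@519740105: Chengdu is probably easier to get by in. I'm afraid I wouldn't make it there, haha.

幻天芒 | Points: 37207 (Master, level 7) | 2014-09-26 09:04

Still not quite there. Big data processing has to take space and time optimization into account; a brute-force method is something even a person outside this field could write.

学海没有鱼 | Points: 5 (Beginner, level 1) | 2014-09-26 09:50

Which header files does this use, and on which platform was it run?

学海没有鱼 | Points: 5 (Beginner, level 1) | 2014-09-26 10:12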

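@邹而语: A hallmark of how Chinese programmers write code is a fondness for small tricks and flourishes. They like to start from algorithms and data structures and make a program elegant: little code (little storage), little running time (business needs met quickly), and a small memory footprint (not much memory needed, perhaps very little). And Chinese programmers take pride in this.

This coding style (which, frankly, I still have myself) was beyond reproach in the 1980s, or even the early 1990s.

Lu Xun said: wasting other people's time is no different from robbing and murdering them.

When programming, keep one idea in mind (it has long circulated in the industry): the customer's computer belongs to the customer, so the customer's storage, memory and CPU time belong to the customer too. We must not waste these resources casually, because we have neither the authority nor the right to.

Chinese programmers' coding habits have always followed this idea.

Indian programmers' style is the exact opposite (I read an article about this around 2000, I think; it is well worth learning from): computers are well equipped these days, and plenty of memory and CPU time simply sits idle, so why not just use it? That greatly reduces development cost (time). Rather than spending development resources on data structures and algorithms, put them toward solving the most valuable problems.

As a result, India's software industry is highly developed. And China's? Each approach has its strengths and weaknesses, but on the whole we still fall somewhat short, especially in ordinary applications.

The question you raised reflects these issues as well.

The simplest solution costs very little to implement. It may look as though it could run out of memory, as @幻天芒 said, but in practice that is very unlikely. Is it worthwhile to spend all sorts of resources on writing a cleverer algorithm? There is no clear-cut answer. And for your question, all the time I have spent investigating so far has been wasted.

Below is some test code I wrote. Using Dictionary directly took only a little over 100 milliseconds, whereas counting with a Node built on List (admittedly, maybe I should not have used List) took several hundred seconds (depending on how many characters you choose to keep). Switching to SortedSet did not help; the time actually went up (SortedList might do better, but I did not try it). As for memory, I do not think the Dictionary version necessarily uses more, and on top of that the Node version reads the file twice.

I think the key problem is that I used List and Linq, which dragged performance down badly. I also only defined a simple Node rather than designing a real linked list. With a well-designed linked list and algorithm, performance (space and time) would presumably improve, possibly even beyond what the plain dictionary achieves.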

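    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Windows.Forms;

    static class Program
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>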
        [STAThread]
        static void Main()
        {
            FileInfo file = new FileInfo("ldj.txt");
            IDictionary<string, int> result;
            IDictionary<char, int> result2;
            SortedSet<Node> result3;
            SortedSet<Node> result4;
            Stopwatch stop1 = new Stopwatch();
            Stopwatch stop2 = new Stopwatch();
            Stopwatch stop3 = new Stopwatch();
            Stopwatch stop4 = new Stopwatch();
            stop1.Start();
            using (var stream = file.OpenRead())
            {
                result = new Test().Calc(stream);
            }
            stop1.Stop();
            stop2.Start();
            using (var stream = file.OpenRead())
            {
                result2 = new Test().Calc2(stream);
            }
            stop2.Stop();
            stop3.Start();
            using (var stream = file.OpenRead())
            {
                result3 = new Test().Calc3(stream);
            }
            stop3.Stop();
            stop4.Start();
            using (var stream = file.OpenRead())
            {
                result4 = new Test().Calc3(stream, 6);
            }
            stop4.Stop();
            file = new FileInfo("ldj-1.txt");
            using (var writer = file.CreateText())
            {
                foreach (var itm in result.OrderByDescending(x => x.Value))
                {
                    writer.WriteLine(string.Format("{0}:\t{1}", itm.Key, itm.Value));
                }
                writer.WriteLine(string.Format("Total time (ms): {0}", stop1.Elapsed.TotalMilliseconds));
            }
            file = new FileInfo("ldj-2.txt");
            using (var writer = file.CreateText())
            {
                foreach (var itm in result2.OrderByDescending(x => x.Value))
                {
                    writer.WriteLine(string.Format("{0}:\t{1}", itm.Key, itm.Value));
                }
                writer.WriteLine(string.Format("Total time (ms): {0}", stop2.Elapsed.TotalMilliseconds));
            }
            file = new FileInfo("ldj-5.txt");
            using (var writer = file.CreateText())
            {
                foreach (var itm in result3.ToList()/*.Where(x => x.Deep == 3)*/.OrderByDescending(x => x.Count))
                {
                    writer.WriteLine(string.Format("{0}:\t{1}", itm.String, itm.Count));
                }
                writer.WriteLine(string.Format("Total time (ms): {0}", stop3.Elapsed.TotalMilliseconds));
            }
            file = new FileInfo("ldj-6.txt");
            using (var writer = file.CreateText())
            {
                foreach (var itm in result4/*.Where(x => x.Deep == 3)*/.OrderByDescending(x => x.Count))
                {
                    writer.WriteLine(string.Format("{0}{2}:\t{1}", itm.String, itm.Count, itm.Deep));
                }
                writer.WriteLine(string.Format("Total time (ms): {0}", stop4.Elapsed.TotalMilliseconds));
            }
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
            Application.Run(new Form1());
        }

        public class Node : IComparable<Node>
        {
            public Node()
            {
                Deep = 1;
                Count = 1;
                Children = new SortedSet<Node>();
            }
            public Char Char { get; set; }

            public string String { get; set; }

            public int Count { get; set; }

            public int Deep { get; set; }

            public SortedSet<Node> Children { get; set; }

            public override bool Equals(object obj)
            {
                if (obj is char)
                {
                    return this.Char == (char)obj;
                }
                return false;
            }

            public override int GetHashCode()
            {
                return this.Char.GetHashCode();
            }

            public SortedSet<Node> ToList()
            {
                if (this.Children == null || this.Children.Where(x => x != null).Count() == 0)
                {
                    return new SortedSet<Node> { new Node { String = Char.ToString(), Count = Count, Deep = Deep } };
                }
                else
                {
                    SortedSet<Node> list = new SortedSet<Node>();
                    foreach (var node in Children)
                    {
                        foreach(var n in node.ToList().Where(x => x != null).Select(x => new Node { String = this.Char.ToString() + x.String, Count = x.Count, Deep = x.Deep }))
                        {
                            list.Add(n);
                        }
                    }
                    return list;
                }
            }


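            int IComparable<Node>.CompareTo(Node other)
            {
                // Order by Char first, then by String: the flattened nodes built in ToList() all
                // share the default Char, and SortedSet would otherwise treat them as duplicates.
                int byChar = this.Char.CompareTo(other.Char);
                return byChar != 0 ? byChar : string.CompareOrdinal(this.String, other.String);
            }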
        }

        public class Test
        {
            private readonly IList<char> s_symbol = new List<char> { 
                '。',
                '，',
                '、',
                '；',
                '：',
                '？',
                '！',
                '“',
                '”',
                '‘',
                '’',
                '╗',
                '╚',
                '┐',
                '└',
                '（',
                '）',
                '…',
                '—',
                '《',
                '》',
                '〈',
                '〉',
                '·',
                '',
                '∶',
                '－',
            };

            public IDictionary<char, int> Calc2(Stream stream)
            {
                IDictionary<char, int> result = new Dictionary<char, int>();
                StreamReader reader = new StreamReader(stream);
                int count;
                char[] buffer = new char[4096];
                int index = 0;
                while ((count = reader.Read(buffer, index, buffer.Length)) > 0)
                {
                    for (int i = 0; i < count; i++)
                    {
                        if (s_symbol.Contains(buffer[i]))
                        {
                            continue;
                        }
                        if (buffer[i] < 255)
                        {
                            continue;
                        }
                        if (result.ContainsKey(buffer[i]))
                        {
                            result[buffer[i]] += 1;
                        }
                        else
                        {
                            result.Add(buffer[i], 1);
                        }
                    }
                }
                return result;//.OrderByDescending(x => x.Value).Take(10).ToDictionary(x => x.Key, x => x.Value);
            }
            public SortedSet<Node> Calc3(Stream stream, int parts = 2)
            {
                SortedSet<Node> result = new SortedSet<Node>();
                StreamReader reader = new StreamReader(stream);
                int count;
                char[] buffer = new char[4096];
                int index = 0;
                while ((count = reader.Read(buffer, index, buffer.Length)) > 0)
                {
                    for (int i = 0; i < count; i++)
                    {
                        if (s_symbol.Contains(buffer[i]))
                        {
                            continue;
                        }
                        if (buffer[i] < 255)
                        {
                            continue;
                        }
                        var node = result.Where(x => x.Char == buffer[i]).SingleOrDefault();
                        if (node == null)
                        {
                            result.Add(new Node { Char = buffer[i] });
                        }
                        else
                        {
                            node.Count++;
                        }
                    }
                }
                index = 0;
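                stream.Seek(0, SeekOrigin.Begin);
                reader.DiscardBufferedData(); // reset the reader's internal state so the second pass starts cleanly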
                var tmp = result.OrderByDescending(x => x.Count).Take(Math.Min(result.Count, Math.Max(result.Count / parts, 500))).ToList();
                result.Clear();
                foreach(var n in tmp)
                {
                    result.Add(n);
                }
                while ((count = reader.Read(buffer, index, buffer.Length - index)) > 0)
                {
                    if (count + index < 3)
                    {
                        index += count;
                        continue;
                    }
                    else
                    {
                        switch (index)
                        {
                            case 0:
                                count -= 2;
                                break;
                            case 1:
                                count -= 1;
                                break;
                        }
                        index = 2;
                    }
                    for (int i = 0; i < count; i++)
                    {
                        if (s_symbol.Contains(buffer[i]))
                        {
                            continue;
                        }
                        if (buffer[i] < 255)
                        {
                            continue;
                        }
                        var node1 = result.Where(x => x.Char == buffer[i]).SingleOrDefault();
                        if (node1 == null)
                        {
                            continue;
                        }
                        // count is the number of trigram start positions; valid data extends two characters further.
                        var nextIndex1 = LocNextChar(buffer, i, count + 2);
                        if (nextIndex1 == -1)
                        {
                            break;
                        }
                        if (!result.Any(x => x.Char == buffer[nextIndex1]))
                        {
                            i = nextIndex1;
                            continue;
                        }
                        var nextIndex2 = LocNextChar(buffer, nextIndex1, count + 2);
                        if (nextIndex2 == -1)
                        {
                            break;
                        }
                        if (!result.Any(x => x.Char == buffer[nextIndex2]))
                        {
                            i = nextIndex2;
                            continue;
                        }
                        var node2 = node1.Children.Where(x => x.Char == buffer[nextIndex1]).SingleOrDefault();
                        if (node2 == null)
                        {
                            node1.Children.Add(node2 = new Node { Char = buffer[nextIndex1], Deep = 2 });
                        }
                        else
                        {
                            node2.Count++;
                        }
                        var node3 = node2.Children.Where(x => x.Char == buffer[nextIndex2]).SingleOrDefault();
                        if (node3 == null)
                        {
                            node2.Children.Add(node3 = new Node { Char = buffer[nextIndex2], Deep = 3 });
                        }
                        else
                        {
                            node3.Count++;
                        }
                    }
                    string s = "";
                    for (int i = count + 1; i > 0; i--)
                    {
                        if (s_symbol.Contains(buffer[i]))
                        {
                            break;
                        }
                        if (buffer[i] < 255)
                        {
                            continue;
                        }
                        s = buffer[i].ToString() + s;
                        if (s.Length == 2)
                        {
                            break;
                        }
                    }
                    index = s.Length;
                    if (index > 0)
                    {
                        buffer[0] = s[0];
                    }
                    if (index > 1)
                    {
                        buffer[1] = s[1];
                    }
                }
                SortedSet<Node> list = new SortedSet<Node>();
                foreach (var node in result)
                {
                    foreach(var n in node.ToList())
                    {
                        list.Add(n);
                    }
                }
                return list;//.OrderByDescending(x => x.Value).Take(10).ToDictionary(x => x.Key, x => x.Value);
            }

            public IDictionary<string, int> Calc(Stream stream)
            {
                IDictionary<string, int> result = new Dictionary<string, int>();
                StreamReader reader = new StreamReader(stream);
                int count;
                char[] buffer = new char[4096];
                int index = 0;
                while ((count = reader.Read(buffer, index, buffer.Length - index)) > 0)
                {
                    if (count + index < 3)
                    {
                        index += count;
                        continue;
                    }
                    else
                    {
                        switch (index)
                        {
                            case 0:
                                count -= 2;
                                break;
                            case 1:
                                count -= 1;
                                break;
                        }
                        index = 2;
                    }
                    string s = null;
                    for (int i = 0; i < count; i++)
                    {
                        if (s_symbol.Contains(buffer[i]) || s_symbol.Contains(buffer[i + 1]) || s_symbol.Contains(buffer[i + 2]))
                        {
                            continue;
                        }
                        if(buffer[i] < 255)
                        {
                            continue;
                        }
                        s = "";
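                        // Valid data extends to count + 2. A failed lookup only rules out this start
                        // position (symbol hit or buffer end), so move on instead of abandoning the buffer.
                        if(!GetChar(buffer, i, count + 2, 3, ref s))
                        {
                            continue;
                        }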
                        if (result.ContainsKey(s))
                        {
                            result[s] += 1;
                        }
                        else
                        {
                            result.Add(s, 1);
                        }
                    }
                    s = "";
                    for (int i = count + 1; i > 0; i-- )
                    {
                        if (s_symbol.Contains(buffer[i]))
                        {
                            break;
                        }
                        if(buffer[i] < 255)
                        {
                            continue;
                        }
                        s = buffer[i].ToString() + s;
                        if(s.Length == 2)
                        {
                            break;
                        }
                    }
                    index = s.Length;
                    if(index > 0)
                    {
                        buffer[0] = s[0];
                    }
                    if(index > 1)
                    {
                        buffer[1] = s[1];
                    }
                }
                return result;//.OrderByDescending(x => x.Value).Take(10).ToDictionary(x => x.Key, x => x.Value);
            }

            private bool GetChar(char[] buffer, int index, int length, int count, ref string value)
            {
                int i = index;
                while(i < length && buffer[i] < 255) // check the bound before indexing
                {
                    i++;
                }
                if(i == length)
                {
                    return false;
                }
                if(s_symbol.Contains(buffer[i]))
                {
                    return false;
                }
                if(count == 1)
                {
                    value = buffer[i].ToString();
                    return true;
                }
                if (!GetChar(buffer, i + 1, length, count - 1, ref value))
                {
                    return false;
                }
                value = buffer[i].ToString() + value;
                return true;
            }

            private int LocNextChar(char[] buffer, int index, int length)
            {
                int i = index + 1;
                while (i < length && buffer[i] < 255) // check the bound before indexing
                {
                    i++;
                }
                if (i == length)
                {
                    return -1;
                }
                return i;
            }
        }
    }


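519740105 | Points: 5810 (Hero, level 5) | 2014-09-26 10:23

@邹而语: I forgot, you are probably using C/C++; my code is C#.

The idea is the same: implement it with a dictionary (C++ has such a class too; which header exactly? It has been a long time since I touched C++). Overall, that gives the best value.

519740105 | Points: 5810 (Hero, level 5) | 2014-09-26 10:27

Other answers (1)

This is a good question, though not a very hard one. It mainly tests the use of substring, list, dictionary collections, bubble sort and so on.

唯我独萌 | Points: 537 (Small fry, level 3) | 2014-09-25 08:34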
